ML Infra Engineer - Supercomputing
Physical IntelligenceSan FranciscoPosted 3 April 2026
Job Description
ML Infra Engineer - Supercomputing
Physical Intelligence builds general-purpose AI for the physical world. Training our models requires orchestrating thousands of accelerators across a heterogeneous fleet of GPU and TPU clusters — spanning different hardware generations, cloud providers, and cluster topologies.
Today, researchers often need to know which cluster to target, what resources are available, and how to configure their jobs accordingly. That doesn't scale. We need a scheduling and compute layer that makes the right placement decision automatically — routing jobs to the best cluster based on availability, hardware fit, cost, and priority — so researchers can focus entirely on the science.
This role owns that problem end-to-end: the scheduling systems, the placement logic, the cluster management layer, and the operational tooling that keeps it all running.
This is not cloud DevOps. It's not about standing up clusters and walking away. It's a systems role for people who care about intelligent resource allocation, utilization, fault tolerance, and making large-scale distributed training seamless.
THE TEAM
The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. You will work closely with ML Infra (training systems), data platform, and research teams to ensure compute scheduling is never the bottleneck.
IN THIS ROLE YOU WILL
- Own Intelligent Job Scheduling and Placement: Design and build multi-tenant scheduling systems that automatically place training jobs on the best available cluster based on hardware requirements, topology, availability, cost, and priority. Support fair resource sharing across teams and projects with quota management, priority tiers, and preemption policies. Abstract away cluster differences so researchers submit jobs without needing to know where they will land.
- Scale Multi-cluster Orchestration: Build the control plane that manages the job lifecycle across diverse clusters (mixed GPU/TPU, multi-generation hardware, on-prem/cloud) and enables seamless job migration, failover, and re-scheduling.
- Optimize Accelerator Utilization and Efficiency: Monitor and optimize GPU/TPU utilization across the entire fleet. Implement priority, preemption, queueing, and fairness policies that balance research velocity with cost efficiency.
- Ensure Scaling and Stability: Implement fault detection, automatic recovery, and resilience for long-running multi-node training jobs. Manage health checking, node management, and scaling to thousands of accelerators.
- Support Inference and Robot Deployment: Extend scheduling and orchestration to inference workloads, including deploying models to edge devices on physical robots.
- Enhance Observability and Developer Experience: Build the dashboards, alerting, SLOs, and debugging tools necessary for researchers to understand job status and for the team to ensure high scheduling quality and cluster reliability.
WHAT WE HOPE YOU’LL BRING
We’re intentionally flexible on exact background, but strong candidates usually have:
- Strong software engineering fundamentals
- Experience building or operating job scheduling / resource management systems at scale
- Experience with large-scale compute clusters (GPU and/or TPU)
- Familiarity with schedulers and orchestration systems (SLURM, Kubernetes, GKE, K3S, or internal equivalents)
- Comfort reasoning about resource allocation, bin-packing, priority scheduling, and multi-tenancy
- Understanding of how ML training workloads behave — long-running, multi-node, sensitive to stragglers, topology-dependent
- A bias toward owning systems end-to-end, from design to operation
- Enjoy working closely with researchers and unblocking fast-moving projects
BONUS POINTS IF YOU HAVE
- Experience building multi-cluster or federated scheduling systems
- Experience with TPU infrastructure (GCP TPU slices, Multislice, GKE ... (truncated, view full listing at source)
Apply Now
Direct link to company career page
AI Resume Fit Check
See exactly which skills you match and which are missing before you apply. Free, instant, no spam.
Check my resume fitFree · No credit card