ML Infra Engineer - Platform
Rhoda AI · Palo Alto · Posted 26 March 2026
Job Description
At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.
We train large models on an NVIDIA B200 GPU cluster. The cluster runs at high capacity, and our next step is to make it predictable, reliable, and measurably efficient, so researchers spend less time babysitting jobs and more time advancing model capability and real-world robot behavior. We're looking for a Cluster Reliability / SRE owner to help us build the operational foundation.
What You'll Do
Own fleet health and node reliability
- Build and operate node health checks (GPU, CPU, memory, NIC, storage) and automated health scoring
- Detect and mitigate stragglers (e.g., thermal/power throttling, ECC issues, network degradation)
- Implement automatic quarantine/drain policies and safe reintegration workflows
- Drive uptime improvements through preventative maintenance, root-cause analysis, and runbooks
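The health-scoring and quarantine duties above can be sketched roughly as follows. This is a minimal illustration, assuming node signals are already collected elsewhere (e.g. via DCGM or nvidia-smi exporters); the field names, weights, and threshold are illustrative, not Rhoda AI's actual schema.

```python
# Illustrative node health scoring; signal names and weights are assumptions.
QUARANTINE_THRESHOLD = 0.5

def health_score(signals: dict) -> float:
    """Return a 0.0-1.0 health score; lower means less healthy."""
    score = 1.0
    if signals.get("ecc_uncorrected_errors", 0) > 0:
        score -= 0.6  # uncorrected ECC errors: near-certain quarantine
    if signals.get("thermal_throttle", False):
        score -= 0.3  # throttling on one node slows the whole distributed step
    if signals.get("nic_rx_errors", 0) > 100:
        score -= 0.2  # network degradation shows up as collective-op stalls
    return max(score, 0.0)

def should_quarantine(signals: dict) -> bool:
    """Drain/quarantine decision from the score; reintegration would
    require the node to pass checks again over a probation window."""
    return health_score(signals) < QUARANTINE_THRESHOLD

healthy = {"ecc_uncorrected_errors": 0, "thermal_throttle": False, "nic_rx_errors": 0}
bad = {"ecc_uncorrected_errors": 2, "thermal_throttle": True, "nic_rx_errors": 0}
```

A production version would feed real telemetry into the score and hand the quarantine decision to the scheduler's drain mechanism rather than acting directly.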
Build observability and fast diagnosis
Establish "source of truth" telemetry for the cluster and for jobs:
- GPU health and performance signals (clocks, throttling, error rates)
- Network and storage performance indicators (latency, throughput, tail behavior)
- Job-level health (retries, hangs, step-time anomalies)
Create dashboards and alerts that answer:
- "Why did this job slow down or hang?"
- "Where did our GPU-hours go this week?"
- "Which nodes/racks are degrading performance?"
Standardize logging/metrics patterns for training jobs to make triage consistent and fast.
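One concrete form of the job-level health signal above is step-time anomaly detection, which helps answer "why did this job slow down?". A minimal sketch, assuming per-step durations are already logged; the window size and factor are illustrative defaults, not tuned values.

```python
from statistics import median

def step_time_anomalies(step_times, window=20, factor=1.5):
    """Flag step indices whose duration exceeds `factor` x the rolling
    median of the previous `window` steps (straggler/stall candidates)."""
    flagged = []
    for i, t in enumerate(step_times):
        history = step_times[max(0, i - window):i]
        if len(history) >= 5 and t > factor * median(history):
            flagged.append(i)
    return flagged

times = [1.0] * 30
times[25] = 3.0  # one straggler-induced slow step
```

In practice this check would run as an alert rule over the standardized metrics stream, so triage starts from the same signal for every job.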
Improve scheduling, placement, and utilization
- Reduce wasted capacity caused by fragmentation and poor placement
- Implement or tune policies for:
  - Topology-aware placement and constraints for large distributed runs
  - Backfilling and queue discipline
  - Safe preemption/requeue behaviors
- Partner with researchers to ensure scheduling supports:
  - Long-running pretraining
  - Evaluation runs and ablations
  - Fast iteration loops
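The backfilling idea mentioned above can be sketched in a few lines. This is a deliberately simplified illustration (job names and GPU counts are invented): small jobs behind a blocked head-of-queue job use idle GPUs, while a real backfill scheduler (e.g. Slurm's) would also reserve future capacity so the head job is never delayed.

```python
def backfill_schedule(queue, free_gpus):
    """Greedy backfill sketch: walk the FIFO queue and start any job
    that fits in the currently idle GPUs; jobs too large to fit wait.
    Omits the reservation logic a production scheduler needs."""
    scheduled, remaining = [], free_gpus
    for job_id, need in queue:
        if need <= remaining:
            scheduled.append(job_id)
            remaining -= need
    return scheduled, remaining

# Hypothetical queue: a 64-GPU pretraining run blocked behind only 16 free GPUs.
queue = [("pretrain", 64), ("ablation", 8), ("eval", 4)]
```

The trade-off this encodes is exactly the one in the bullets: fast iteration loops (small ablations and evals) keep GPUs busy without starving long-running pretraining.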
Automation and operational excellence
Eliminate manual toil with automation:
- Safe auto-retry / auto-resume patterns
- Hang detection and automated triage signals
- Templates/guardrails for job submission
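The hang-detection and safe auto-retry patterns above reduce to two small primitives. A minimal sketch, with illustrative names and an assumed heartbeat mechanism (jobs emit a timestamp on step progress); the 600-second timeout and retry budget are placeholder defaults.

```python
def is_hung(last_heartbeat_ts: float, now: float, timeout_s: float = 600.0) -> bool:
    """Presume a job hung if no step-progress heartbeat has arrived
    within timeout_s (e.g. an NCCL collective stuck on one rank)."""
    return (now - last_heartbeat_ts) > timeout_s

def run_with_retries(run_once, max_retries: int = 3):
    """Bounded auto-retry: rerun a job on transient failure, returning
    the attempt number that succeeded, or None if the budget ran out.
    A bounded budget keeps a persistently broken job from looping forever."""
    for attempt in range(max_retries + 1):
        if run_once(attempt):
            return attempt
    return None

# Simulated flaky job: fails twice (transient error), then succeeds.
def flaky(attempt: int) -> bool:
    return attempt >= 2
```

In a real system the retry path would also resume from the last checkpoint and record each attempt for triage, so repeat incidents become measurable.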
Own incident response practices:
- Clear escalation paths
- Postmortems with action items
- Measurable reductions in repeat incidents
Participate in an on-call rotation (with a strong automation-first culture).
Partner closely with research and performance engineering
Work tightly with researchers and ML systems/perf engineers to identify system-level causes of training inefficiency (I/O stalls, stragglers, NCCL hangs, etc.). Provide a stable, observable platform that enables deep performance optimization work to be effective and repeatable.
What We're Looking For
- Strong experience operating production systems with an SRE/reliability mindset (automation-first, measurable outcomes, incident discipline)
- Experience with large-scale compute environments (GPU clusters, HPC, distributed compute, or cloud ML platforms)
- Solid Linux fundamentals and comfort debugging across layers: kernel/driver, networking, storage, runtime, and application
- Experience building observability systems: metrics, logs, traces, alerting, dashboards, and meaningful SLOs
- Ability to diagnose ambiguous issues and drive them to resolution with clear hypotheses an ... (truncated, view full listing at source)
Apply Now