ML Infra Engineer - Platform
Rhoda AI · Palo Alto · Posted 26 March 2026
Job Description
At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.
We train large models on an NVIDIA B200 GPU cluster. The cluster runs at high capacity, and our next step is to make it predictable, reliable, and measurably efficient, so researchers spend less time babysitting jobs and more time advancing model capability and real-world robot behavior. We're looking for a Cluster Reliability / SRE owner to help us build the operational foundation.
What You'll Do
Own fleet health and node reliability
- Build and operate node health checks (GPU, CPU, memory, NIC, storage) and automated health scoring
- Detect and mitigate stragglers (e.g., thermal/power throttling, ECC issues, network degradation)
- Implement automatic quarantine/drain policies and safe reintegration workflows
- Drive uptime improvements through preventative maintenance, root-cause analysis, and runbooks
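The health-scoring and quarantine duties above can be sketched roughly as follows. This is a minimal illustration, assuming node signals are already collected elsewhere (e.g. via DCGM or nvidia-smi exporters); the field names, weights, and threshold are illustrative, not Rhoda AI's actual schema.

```python
# Illustrative node health scoring; signal names and weights are assumptions.
QUARANTINE_THRESHOLD = 0.5

def health_score(signals: dict) -> float:
    """Return a 0.0-1.0 health score; lower means less healthy."""
    score = 1.0
    if signals.get("ecc_uncorrected_errors", 0) > 0:
        score -= 0.6  # uncorrected ECC errors: near-certain quarantine
    if signals.get("thermal_throttle", False):
        score -= 0.3  # throttling on one node slows the whole distributed step
    if signals.get("nic_rx_errors", 0) > 100:
        score -= 0.2  # network degradation shows up as collective-op stalls
    return max(score, 0.0)

def should_quarantine(signals: dict) -> bool:
    """Drain/quarantine decision from the score; reintegration would
    require the node to pass checks again over a probation window."""
    return health_score(signals) < QUARANTINE_THRESHOLD

healthy = {"ecc_uncorrected_errors": 0, "thermal_throttle": False, "nic_rx_errors": 0}
bad = {"ecc_uncorrected_errors": 2, "thermal_throttle": True, "nic_rx_errors": 0}
```

A production version would feed real telemetry into the score and hand the quarantine decision to the scheduler's drain mechanism rather than acting directly.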
Build observability and fast diagnosis
Establish "source of truth" telemetry for the cluster and for jobs:
- GPU health and performance signals (clocks, throttling, error rates)
- Network and storage performance indicators (latency, throughput, tail behavior)
- Job-level health (retries, hangs, step-time anomalies)
Create dashboards and alerts that answer:
- "Why did this job slow down or hang?"
- "Where did our GPU-hours go this week?"
- "Which nodes/racks are degrading performance?"
Standardize logging/metrics patterns for training jobs to make triage consistent and fast.
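One concrete form of the job-level health signal above is step-time anomaly detection, which helps answer "why did this job slow down?". A minimal sketch, assuming per-step durations are already logged; the window size and factor are illustrative defaults, not tuned values.

```python
from statistics import median

def step_time_anomalies(step_times, window=20, factor=1.5):
    """Flag step indices whose duration exceeds `factor` x the rolling
    median of the previous `window` steps (straggler/stall candidates)."""
    flagged = []
    for i, t in enumerate(step_times):
        history = step_times[max(0, i - window):i]
        if len(history) >= 5 and t > factor * median(history):
            flagged.append(i)
    return flagged

times = [1.0] * 30
times[25] = 3.0  # one straggler-induced slow step
```

In practice this check would run as an alert rule over the standardized metrics stream, so triage starts from the same signal for every job.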
Improve scheduling, placement, and utilization
- Reduce wasted capacity caused by fragmentation and poor placement
- Implement or tune policies for:
  - Topology-aware placement and constraints for large distributed runs
  - Backfilling and queue discipline
  - Safe preemption/requeue behaviors
- Partner with researchers to ensure scheduling supports:
  - Long-running pretraining
  - Evaluation runs and ablations
  - Fast iteration loops
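The backfilling idea mentioned above can be sketched in a few lines. This is a deliberately simplified illustration (job names and GPU counts are invented): small jobs behind a blocked head-of-queue job use idle GPUs, while a real backfill scheduler (e.g. Slurm's) would also reserve future capacity so the head job is never delayed.

```python
def backfill_schedule(queue, free_gpus):
    """Greedy backfill sketch: walk the FIFO queue and start any job
    that fits in the currently idle GPUs; jobs too large to fit wait.
    Omits the reservation logic a production scheduler needs."""
    scheduled, remaining = [], free_gpus
    for job_id, need in queue:
        if need <= remaining:
            scheduled.append(job_id)
            remaining -= need
    return scheduled, remaining

# Hypothetical queue: a 64-GPU pretraining run blocked behind only 16 free GPUs.
queue = [("pretrain", 64), ("ablation", 8), ("eval", 4)]
```

The trade-off this encodes is exactly the one in the bullets: fast iteration loops (small ablations and evals) keep GPUs busy without starving long-running pretraining.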
Automation and operational excellence
Eliminate manual toil with automation:
- Safe auto-retry / auto-resume patterns
- Hang detection and automated triage signals
- Templates/guardrails for job submission
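The hang-detection and safe auto-retry patterns above reduce to two small primitives. A minimal sketch, with illustrative names and an assumed heartbeat mechanism (jobs emit a timestamp on step progress); the 600-second timeout and retry budget are placeholder defaults.

```python
def is_hung(last_heartbeat_ts: float, now: float, timeout_s: float = 600.0) -> bool:
    """Presume a job hung if no step-progress heartbeat has arrived
    within timeout_s (e.g. an NCCL collective stuck on one rank)."""
    return (now - last_heartbeat_ts) > timeout_s

def run_with_retries(run_once, max_retries: int = 3):
    """Bounded auto-retry: rerun a job on transient failure, returning
    the attempt number that succeeded, or None if the budget ran out.
    A bounded budget keeps a persistently broken job from looping forever."""
    for attempt in range(max_retries + 1):
        if run_once(attempt):
            return attempt
    return None

# Simulated flaky job: fails twice (transient error), then succeeds.
def flaky(attempt: int) -> bool:
    return attempt >= 2
```

In a real system the retry path would also resume from the last checkpoint and record each attempt for triage, so repeat incidents become measurable.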
Own incident response practices:
- Clear escalation paths
- Postmortems with action items
- Measurable reductions in repeat incidents
Participate in an on-call rotation (with a strong automation-first culture).
Partner closely with research and performance engineering
Work tightly with researchers and ML systems/perf engineers to identify system-level causes of training inefficiency (I/O stalls, stragglers, NCCL hangs, etc.). Provide a stable, observable platform that enables deep performance optimization work to be effective and repeatable.
What We're Looking For
- Strong experience operating production systems with an SRE/reliability mindset (automation-first, measurable outcomes, incident discipline)
- Experience with large-scale compute environments (GPU clusters, HPC, distributed compute, or cloud ML platforms)
- Solid Linux fundamentals and comfort debugging across layers: kernel/driver, networking, storage, runtime, and application
- Experience building observability systems: metrics, logs, traces, alerting, dashboards, and meaningful SLOs
- Ability to diagnose ambiguous issues and drive them to resolution with clear hypotheses an ... (truncated, view full listing at source)
Apply Now