Staff Software Engineer - AI Clusters Production Engineering & SRE
Biohub · Redwood City, CA (Hybrid) · $241k – $331k · Posted 3 April 2026
Job Description
Biohub is a 501(c)(3) biomedical research organization building the first large-scale scientific initiative combining frontier AI with frontier biology to solve disease. We build the technology to help scientists around the world use AI-powered biology to study how cells operate, organize, and work as parts of larger systems, so we can understand why disease happens and how to correct it. With our compute capacity, AI research and engineering, and state-of-the-art technology for measuring, imaging, and programming biology, we are enabling scientists worldwide to advance our understanding of human health.
The Team
The AI Cluster Production Engineering team is part of the AI Compute Platform organization at Biohub, a non-profit research lab committed to open science and open-source AI. We own the design, operation, and reliability of large-scale multi-GPU clusters that power frontier AI biology research: protein language models, genomic foundation models, and scientific reasoning systems built to be shared, not monetized. Our clusters run Slurm on Kubernetes and support everything from day-to-day researcher workflows to multi-node hero training runs at thousands of GPUs. The team works at the intersection of AI tooling, distributed systems, HPC, and frontier AI, debugging deep infrastructure problems and building systems critical to the entire organization.
The Opportunity
CZ Biohub's mission is to cure or prevent all human disease. Achieving that requires training frontier-scale AI biology models, and that demands reliable, high-performance compute infrastructure. This is production engineering work at a frontier AI lab, with the twist that the mission is biology and the science is open. You'll keep GPU clusters running at high utilization, debug the toughest distributed systems failures, and build the operational foundations for scaling to multi-thousand GPU hero runs. The technical problems are genuinely hard (e.g., multi-node distributed training, InfiniBand fabrics, large-scale storage, Slurm at scale) inside an organization where the work is aimed at helping people, not optimizing ad revenue.
What You'll Do
Own reliability, observability, and incident response for multi-site GPU clusters running Slurm on Kubernetes. Build the systems, automation, and processes that keep clusters healthy, and that enable fast, efficient recovery when things break.
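This kind of cluster-health automation often reduces to a policy that decides when a node should leave the scheduling pool. The sketch below is illustrative only: the thresholds, health-record fields, and node name are invented for the example, though the `scontrol update ... State=DRAIN` form it emits is standard Slurm.

```python
import shlex

# Illustrative health-check policy: thresholds and field names are
# assumptions for this sketch, not actual production values.
ECC_ERROR_LIMIT = 0     # any uncorrectable ECC error drains the node
MIN_HEALTHY_GPUS = 8    # expect a full 8-GPU node

def should_drain(health: dict) -> bool:
    """Decide whether a GPU node should be drained from the Slurm pool."""
    if health["uncorrectable_ecc_errors"] > ECC_ERROR_LIMIT:
        return True
    if health["visible_gpus"] < MIN_HEALTHY_GPUS:
        return True
    if not health["nvlink_ok"]:
        return True
    return False

def drain_command(node: str, reason: str) -> str:
    """Build the standard scontrol command that would drain the node."""
    return f"scontrol update NodeName={node} State=DRAIN Reason={shlex.quote(reason)}"

# Example: a node reporting a missing GPU gets drained.
health = {"uncorrectable_ecc_errors": 0, "visible_gpus": 7, "nvlink_ok": True}
if should_drain(health):
    print(drain_command("gpu-a013", "missing GPU"))
```

Keeping the policy a pure function (separate from the command execution) makes it easy to unit-test and to audit why a node was pulled from service.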
Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers. Build the tooling and operational patterns that make these failures easier to detect, diagnose, and prevent.
Design and execute GPU cluster scaling plans, systematically validating storage, networking, interconnect, and scheduler behavior as clusters grow to support larger training runs.
Build automation and tooling to manage cluster operations at scale: capacity planning, GPU utilization monitoring, workload manager policy management, and pod lifecycle automation.
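Utilization monitoring at its simplest polls each node's GPUs and flags idle allocations. The sketch below parses the CSV output of a real `nvidia-smi` query form (`--query-gpu=index,utilization.gpu --format=csv,noheader,nounits`); the 10% idle threshold is an illustrative assumption.

```python
# nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# produces lines like "0, 97". The idle threshold is an illustrative choice.
IDLE_THRESHOLD_PCT = 10

def parse_utilization(csv_text: str) -> dict[int, int]:
    """Map GPU index -> utilization percent from nvidia-smi CSV output."""
    util = {}
    for line in csv_text.strip().splitlines():
        idx, pct = (field.strip() for field in line.split(","))
        util[int(idx)] = int(pct)
    return util

def idle_gpus(util: dict[int, int]) -> list[int]:
    """GPUs whose utilization suggests a wasted allocation."""
    return [idx for idx, pct in util.items() if pct < IDLE_THRESHOLD_PCT]

sample = "0, 97\n1, 3\n2, 88\n3, 0\n"
print(idle_gpus(parse_utilization(sample)))  # -> [1, 3]
```

In practice a fleet-scale version would use DCGM or a metrics pipeline rather than shelling out per node, but the detection logic stays this simple.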
Drive configuration-as-code practices, ensuring cluster state is reproducible and auditable, and managed through version-controlled pipelines.
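Configuration-as-code in this sense means a version-controlled desired state plus a check that flags drift on the live cluster. A minimal sketch of that drift check follows; the keys happen to be real `slurm.conf` partition options, but the record shape and values are invented for illustration.

```python
# Compare a version-controlled desired state against observed cluster
# state and report drift. The settings shown are illustrative examples.
def config_drift(desired: dict, observed: dict) -> dict:
    """Return {key: (desired, observed)} for every setting that differs."""
    drift = {}
    for key, want in desired.items():
        have = observed.get(key)
        if have != want:
            drift[key] = (want, have)
    return drift

desired = {"MaxTime": "7-00:00:00", "OverSubscribe": "NO", "PriorityTier": 10}
observed = {"MaxTime": "7-00:00:00", "OverSubscribe": "YES", "PriorityTier": 10}
print(config_drift(desired, observed))  # -> {'OverSubscribe': ('NO', 'YES')}
```

Running a check like this in CI against rendered cluster state is what makes the configuration auditable: any drift shows up as a failing pipeline, not a surprise during an incident.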
Collaborate directly with AI researchers and hero run leads to understand training workload patterns and design infrastructure that meets frontier-scale requirements.
Own the vendor relationship on technical issues — escalating SEV1s, coordinating across multiple partners and network backbone teams, holding them accountable to root/proximate cause analysis and SLAs.
Contribute to capacity planning: projecting GPU demand, managing cluster expansion across GPU generations, and coordinating multi-cluster strategy.
Improve operational resilience, reducing mean time to detect and resolve incidents, reducing toil through automation, and developing runbooks that scale the team's operational knowledge beyond any individual.
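Mean time to detect and mean time to resolve are simple averages over incident records, which makes them easy to track programmatically. The sketch below assumes a hypothetical record schema (`started`/`detected`/`resolved` timestamps) purely for illustration.

```python
from datetime import datetime, timedelta

# Compute mean time to detect (MTTD) and resolve (MTTR) from incident
# records. The record fields are an illustrative schema, not a real one.
def mean_delta(incidents: list, start: str, end: str) -> timedelta:
    """Average elapsed time between two timestamp fields across incidents."""
    deltas = [inc[end] - inc[start] for inc in incidents]
    return sum(deltas, timedelta()) / len(deltas)

incidents = [
    {"started": datetime(2026, 4, 1, 9, 0),
     "detected": datetime(2026, 4, 1, 9, 12),
     "resolved": datetime(2026, 4, 1, 10, 30)},
    {"started": datetime(2026, 4, 2, 2, 0),
     "detected": datetime(2026, 4, 2, 2, 4),
     "resolved": datetime(2026, 4, 2, 2, 40)},
]
print("MTTD:", mean_delta(incidents, "started", "detected"))   # -> MTTD: 0:08:00
print("MTTR:", mean_delta(incidents, "detected", "resolved"))  # -> MTTR: 0:57:00
```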
What You'll Bring
8+ years of AI/ML infrastructure engineering experience, with deep expertise in at l ... (truncated, view full listing at source)
Apply Now