HPC Engineer

Costa Mesa, California, United States

Posted 3 April 2026

Tech Stack

Node, Scala, Perl, AWS, Azure, Ansible, TensorFlow, PyTorch, AI, Computer Vision, Linux

Job Description

Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built, and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a real-time, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.

ABOUT THE ROLE

Anduril is seeking a High Performance Computing (HPC) System Engineer to directly support our most sensitive programs. You will be part of the team building and maintaining large-scale HPC infrastructure. You will have the opportunity to work with and learn from some of the world’s best engineers and cybersecurity professionals as you help implement cutting-edge systems. You will directly support systems deployed across the globe for national security missions.

WHAT YOU'LL DO

- Work in a fast-paced, customer-focused environment supporting high-profile operational and research requirements.
- Architect and deploy advanced GPU infrastructure, leading the design, deployment, and lifecycle management of cutting-edge NVIDIA hardware including H100, H200, and B200/B300 systems.
- Rack, stack, cable, and configure physical servers and multi-node GPU systems from end to end.
- Configure HPC and AI environments, including job schedulers (e.g., Slurm), multi-user login environments, and cluster management software (e.g., Warewulf, NVIDIA Base Command, RunAI).
- Implement and fine-tune high-speed interconnects (e.g., NVLink, NVSwitch, InfiniBand/NDR) crucial for large-scale distributed training.
- Configure and manage large-scale, high-performance storage platforms in the multi-petabyte range, optimized for AI/ML data access patterns.
- Install, configure, and maintain the application stack on HPC clusters, including traditional simulation software (StarCCM+, Ansys, Matlab) and the core AI/ML software stack (NVIDIA drivers, CUDA, PyTorch, TensorFlow).
- Implement and manage GPU virtualization and sharing technologies, such as Multi-Instance GPU (MIG), to maximize resource utilization across diverse workloads.
- Troubleshoot complex, system-wide issues related to application performance, user access, compute nodes, storage, and job queueing services.
- Use NVIDIA Data Center GPU Manager (DCGM) and additional tools to proactively monitor GPU health and performance, diagnosing and resolving training bottlenecks in collaboration with ML engineers.
- Ensure the security and integrity of the server and cluster infrastructure through regular audits, patching, and proactive security measures.
- Collaborate closely with engineering and AI/ML research stakeholders to gather requirements and architect robust, scalable solutions.
- Manage the hardware lifecycle, from quoting and procuring hardware from vendors to creating and executing deployment schedules.
- Provide technical guidance, mentoring, and architectural leadership to other team members.

REQUIRED QUALIFICATIONS

- 7+ years of experience in designing, developing, and implementing large-scale enterprise compute systems and solutions
- Strong knowledge of and experience with High Performance Computing concepts, including cluster architecture, file systems, and high-speed InfiniBand/Ethernet interconnects
- Proven expertise in one or more of the following: Red Hat Enterprise Linux, Ubuntu, HPC, GPU, Azure or AWS cloud services
- Strong understanding of and experience with systems automation tools (Ansible, Salt, Puppet)
- Experience in HPC technologies such as parallel/distributed file systems (e.g., Lustre, GPFS, Pure, VAST)
- Working knowledge o ... (truncated, view full listing at source)
