HPC Engineer
Anduril Industries · Costa Mesa, California, United States · Posted 3 April 2026
Job Description
Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.
ABOUT THE ROLE
Anduril is seeking a High Performance Computing (HPC) System Engineer to directly support our most sensitive programs. You will be part of the team building and maintaining large-scale HPC infrastructure. You will have the opportunity to work with and learn from some of the world’s best engineers and cybersecurity professionals as you help implement cutting-edge systems. You will directly support systems deployed across the globe for national security missions.
WHAT YOU'LL DO
Work in a fast-paced, customer-focused environment supporting high-profile operational and research requirements.
Architect and deploy advanced GPU infrastructure, leading the design, deployment, and lifecycle management of cutting-edge NVIDIA hardware including H100, H200, and B200/B300 systems.
Rack, stack, cable, and configure physical servers and multi-node GPU systems from end to end.
Configure HPC and AI environments, including job schedulers (e.g., Slurm), multi-user login environments, and cluster management software (e.g., Warewulf, NVIDIA Base Command, RunAI).
Implement and fine-tune high-speed interconnects (e.g., NVLink, NVSwitch, InfiniBand/NDR) crucial for large-scale distributed training.
Configure and manage large-scale, high-performance storage platforms in the multiple petabytes range, optimized for AI/ML data access patterns.
Install, configure, and maintain the application stack on HPC clusters, including traditional simulation software (STAR-CCM+, Ansys, MATLAB) and the core AI/ML software stack (NVIDIA drivers, CUDA, PyTorch, TensorFlow).
Implement and manage GPU virtualization and sharing technologies, such as Multi-Instance GPU (MIG), to maximize resource utilization across diverse workloads.
Troubleshoot complex, system-wide issues related to application performance, user access, compute nodes, storage, and job queueing services.
Utilize NVIDIA Data Center GPU Manager (DCGM) and additional tools to proactively monitor GPU health and performance, diagnosing and resolving training bottlenecks in collaboration with ML engineers.
Ensure the security and integrity of the server and cluster infrastructure through regular audits, patching, and proactive security measures.
Collaborate closely with engineering and AI/ML research stakeholders to gather requirements and architect robust, scalable solutions.
Manage the hardware lifecycle, from quoting and procuring hardware from vendors to creating and executing deployment schedules.
Provide technical guidance, mentoring, and architectural leadership to other team members.
REQUIRED QUALIFICATIONS
7+ years of experience in designing, developing, and implementing large-scale enterprise compute systems and solutions
Strong knowledge of and experience with High Performance Computing concepts, including cluster architecture, file systems, and high-speed InfiniBand/Ethernet interconnects
Proven expertise in one or more of the following: Red Hat Enterprise Linux, Ubuntu, HPC, GPU, Azure or AWS cloud services
Strong understanding and experience with systems automation tools (Ansible, Salt, Puppet)
Experience in HPC technologies such as parallel/distributed file systems (e.g., Lustre, GPFS, Pure, VAST)
Working knowledge o ... (truncated, view full listing at source)