Job Description
<div class="content-intro"><p>Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built, and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a real-time, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.</p></div><h3>ABOUT THE TEAM</h3>
<p>The Production Engineering team is a newly formed organization within Anduril's Software Platform, dedicated to ensuring the reliability, performance, and scalability of mission-critical systems that directly support our warfighters in the field. We solve complex reliability challenges at massive scale, ensuring that critical components of Lattice, Anduril's autonomous command and control platform, operate flawlessly in the most demanding operational environments.<br><br>This is a foundational role, and you will be among the first hires building this team from the ground up. You'll have the unique opportunity to shape the technical direction, establish best practices, and define what production engineering excellence means at Anduril. Our team operates at the intersection of software engineering and systems reliability, building the infrastructure, tooling, and processes that keep our systems operational 24/7/365.</p>
<h3>ABOUT THE ROLE</h3>
<p>We are seeking an experienced Senior Site Reliability Engineer who is passionate about building resilient, highly available systems that scale to meet the demands of the core systems powering Lattice. You will work closely with platform engineering teams, product developers, and field operations to proactively identify reliability risks, implement defensive strategies, and continuously improve the operational excellence of our software platform. If you thrive on solving hard problems at scale and want your work to have a direct impact on national security, this is the role for you.</p>
<h4>WHAT YOU’LL DO</h4>
<ul>
<li>Design and implement comprehensive monitoring, observability, and alerting systems to ensure early detection of reliability issues across the Lattice platform</li>
<li>Drive incident response and conduct blameless postmortems to identify systemic improvements and prevent recurrence of production issues</li>
<li>Build and maintain infrastructure automation using tools like Terraform, Kubernetes operators, and custom tooling to manage large-scale distributed systems</li>
<li>Establish and track Service Level Objectives (SLOs) and error budgets to balance feature velocity with system reliability</li>
<li>Partner with software engineering teams to improve system architecture for reliability, implementing patterns and practices such as circuit breakers, graceful degradation, and chaos engineering</li>
<li>Develop capacity planning models and performance testing frameworks to ensure systems can handle growth and peak operational demands</li>
<li>Create runbooks, documentation, and training materials to enable teams to operate production systems effectively</li>
<li>Lead cross-functional efforts to improve deployment safety through progressive rollouts, automated testing, and rollback capabilities</li>
<li>Implement security best practices and compliance controls for production environments handling sensitive defense data</li>
<li>Build tooling and automation to reduce toil and improve operational efficiency for the engineering organization</li>
<li>Participate in on-call rotations and serve as an escalation point for critical production incidents</li>
</ul>
<h4>REQUIRED QUALIFICATIONS</h4>
<ul>
<li>7+ years of engineering ex ... (truncated, view full listing at source)