AI Infrastructure Operations Engineer

Cerebras Systems
Sunnyvale, CA or Toronto, Canada
Posted 1 March 2026

Job Description

<div class="content-intro">
<p>Cerebras Systems builds the world's largest AI chip, 56 times larger than the largest GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications without the hassle of managing hundreds of GPUs or TPUs.</p>
<p>Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. <a href="https://openai.com/index/cerebras-partnership/">OpenAI recently announced a multi-year partnership with Cerebras</a> to deploy 750 megawatts of compute, transforming key workloads with ultra-high-speed inference.</p>
<p>Thanks to its groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference solution in the world, more than 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.</p>
</div>
<p><strong>About The Role</strong></p>
<p>We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role, you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.</p>
<p>You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. The role requires a deep understanding of Linux-based systems and containerization technologies, along with experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a proactive, dependable problem-solver with expertise in large-scale compute infrastructure and an advocate for customer success.</p>
<h4><strong>Responsibilities</strong></h4>
<ul>
<li>Manage and operate multiple advanced AI compute infrastructure clusters.</li>
<li>Monitor and oversee cluster health, proactively identifying and resolving potential issues.</li>
<li>Maximize compute capacity through optimization and efficient resource allocation.</li>
<li>Deploy, configure, and debug container-based services using Docker.</li>
<li>Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.</li>
<li>Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.</li>
<li>Contribute to the development and improvement of our monitoring and support processes.</li>
<li>Stay up to date with the latest advancements in AI compute infrastructure and related technologies.</li>
</ul>
<h4>Skills And Requ ... (truncated, view full listing at source)