AI Infrastructure Operations Engineer

Cerebras Systems
Sunnyvale, CA or Toronto, Canada
Posted 1 March 2026

Job Description

<div class="content-intro"><p>Cerebras Systems builds the world's largest AI chip, 56 times larger than the largest GPU. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications without the hassle of managing hundreds of GPUs or TPUs.</p> <p>Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. <a href="https://openai.com/index/cerebras-partnership/">OpenAI recently announced a multi-year partnership with Cerebras</a> to deploy 750 megawatts of capacity, transforming key workloads with ultra-high-speed inference.</p> <p>Thanks to this groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference solution in the world, more than 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.</p></div><h4>About The Role</h4> <p>The AI Infrastructure Operations Engineer (SiteOps) is an entry-level individual contributor role focused on the deployment, bring-up, monitoring, and first-line troubleshooting of Cerebras AI infrastructure in data center environments.
The role supports CS systems, cluster server hardware, cluster networking hardware, and hardware telemetry and monitoring tools.</p> <p>In this role, you will support the reliable operation and scale-out of Cerebras AI clusters by executing defined hardware bring-up and validation procedures, monitoring telemetry, performing first-line troubleshooting, and escalating issues through established workflows.</p> <h4>Responsibilities</h4> <ul> <li>Assist with deployment and bring-up of CS-X systems, cluster servers, and networking hardware.</li> <li>Execute power-on sequencing, readiness checks, and validation tests.</li> <li>Monitor hardware telemetry, alerts, and dashboards.</li> <li>Perform first-line troubleshooting and structured escalation.</li> <li>Collect logs, telemetry, and observations during incidents.</li> </ul> <p><strong>Incident Support &amp; Tooling</strong></p> <ul> <li>Participate in incident response under senior engineer guidance.</li> <li>Use existing monitoring, telemetry, and incident-tracking tools.</li> <li>Provide feedback on tooling and process gaps.</li> </ul> <p><strong>Learning &amp; Development</strong></p> <ul> <li>Build working knowledge of Cerebras system architecture.</li> <li>Learn cluster hardware and networking fundamentals.</li> <li>Shadow senior engineers during complex debugging.</li> <li>Progress toward independent ownership of defined workflows.</li> </ul> <p><strong>Explicit Non-Responsibilities</strong></p> <ul> <li>No people management.</li> <li>No final escalation authority.</li> <li>No ownership of cluster architecture, hardware design, or tooling architecture.</li> </ul> <h4>Required Qualifications</h4> <ul> <li>Bachelor's degree in a relevant engineering field, or equivalent experience.</li> <li>0–3 years of experience in hardware operations, systems engineering, or data center environments.</li> <li>Basic familiarity with server hardware, networking fundamentals, and Linux systems.</li> </ul> <h4>Preferred Qualifications</h4> <ul> <li>Internship or early-career experience in data center or hardware lab environments.</li> <li>Exposure to monitoring or telemetry systems.</li> <li>Comfort working in data centers.</li> </ul> <p><strong>What Success Looks Like</strong></p> <p>Consistent and correct execution of hardware bring-up procedures, early identification and escalation of issues, improving documentation quality, and clear progression toward more independent operation ... (truncated, view full listing at source)