Senior Software Development Engineer in Test (SDET) - AI Cluster

Cerebras Systems
Sunnyvale, CA; Toronto, Ontario, Canada
Posted 1 March 2026

Job Description

<div class="content-intro"><p><span data-contrast="none">Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. </span><span data-ccp-props="{"134233117":false,"134233118":false,"201341983":0,"335559685":0,"335559737":240,"335559738":240,"335559739":240,"335559740":279}"> </span></p> <p>Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. <a href="https://openai.com/index/cerebras-partnership/">OpenAI recently announced a multi-year partnership with Cerebras</a>, to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference. </p> <p>Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.</p></div><p><span data-contrast="none">In the AI infrastructure organization, key focus areas include simplifying large hardware deployments through push-button operation, a single pane of glass for observability and monitoring, and software capabilities for built-in resiliency. 
For this Senior Software Development Engineer in Test role, we are looking for a candidate who can make a big impact on how we test and validate thousands of nodes in large deployments to ensure the cluster is 99.999% reliable.</span> <br> <br><strong><span data-contrast="none">Responsibilities</span></strong><br><span data-ccp-props="{}"> </span></p> <ul> <li data-leveltext="-" data-font="Aptos" data-listid="1" data-list-defn-props="{"335552541":1,"335559685":720,"335559991":360,"469769226":"Aptos","469769242":[8226],"469777803":"left","469777804":"-","469777815":"hybridMultilevel"}" data-aria-posinset="1" data-aria-level="1"><span data-contrast="none">You will be hired to innovate and execute tests on cutting-edge AI infrastructure. Be a thinker who defines optimized test strategies and methodologies.</span></li> <li data-leveltext="-" data-font="Aptos" data-listid="1" data-list-defn-props="{"335552541":1,"335559685":720,"335559991":360,"469769226":"Aptos","469769242":[8226],"469777803":"left","469777804":"-","469777815":"hybridMultilevel"}" data-aria-posinset="1" data-aria-level="1">Cerebras is growing and innovating at a rapid pace, and so are the ML community and AI models. Be a quick learner, adapt to new technologies, and bring your expertise. We are looking to hire a team with a diverse skill set.</li> <li data-leveltext="-" data-font="Aptos" data-listid="1" data-list-defn-props="{"335552541":1,"335559685":720,"335559991":360,"469769226":"Aptos","469769242":[8226],"469777803":"left","469777804":"-","469777815":"hybridMultilevel"}" data-aria-posinset="1" data-aria-level="1">Deep understanding of how large-scale distributed ML training and inference work. 
Build a strong understanding of how to break these large distributed-systems challenges into smaller components that can be unit tested.</li> <li data-leveltext="-" data-font="Aptos" data-listid="1" data-list-defn-props="{"335552541":1,"335559685":720,"335559991":360,"469769226":"Aptos","469769242":[8226],"469777803":"left","469777804":"-","469777815":"hybridMultilevel"}" data-aria-posinset="1" data-aria-level="1">Automation-first approach: in large-scale deployments, automation drives efficiency and scalability. Aim for 100% automated tests that cover all cluster features in areas of high availability, failure scenarios, ... (truncated, view full listing at source)