Founding Reliability Engineer

Sieve
San Francisco
Posted 21 February 2026

Job Description

About Us

Sieve is the only AI research lab exclusively focused on video data. We combine exabyte-scale video infrastructure, novel video understanding techniques, and dozens of data sources to develop datasets that push the frontier of video modeling. Video makes up 80% of internet traffic and has become the enabling digital medium powering creativity, communication, gaming, AR/VR, and robotics. Sieve exists to solve the biggest bottleneck in the growth of these applications: high-quality training data.

We’ve partnered with top AI labs and did $XXM last quarter alone, as a team of just 12 people. We also raised our Series A earlier this year from Tier 1 firms such as Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role

We process petabytes of video across thousands of nodes and multiple cloud environments. As we scale, reliability, observability, and security become existential.

We’re hiring our first engineer fully dedicated to the infrastructure foundation of Sieve. This is a high-ownership role for someone who thinks deeply about:

- throughput and system stability
- monitoring and incident response
- security and least-privilege design
- reducing operational burden for the entire engineering team

You’ll work directly with our CTO and our founding engineers to build the core tooling that powers all of engineering.

This role is for someone who spends their time thinking deeply about reliability, throughput, observability, and security. You’re the kind of engineer who is always anticipating failure modes, eliminating operational risk, and designing systems that don’t break. If something goes down, you take it personally, and you thrive in that level of responsibility.

What You’ll Do

- Work with engineering to design and validate the infrastructure powering PB-scale workloads
- Build and maintain Terraform-managed multi-cloud deployments
- Improve cloud and data security (SSO, IAM, least privilege, auditability)
- Own incident response and harden systems against failure
- Develop CI/CD systems that minimize user error and maximize safety
- Build monitoring + alerting platforms (Prometheus, OpenTelemetry, VictoriaMetrics)
- Wrap internal reliability tooling with simple UIs for engineers

Requirements

- 3+ years building internal infrastructure at scale
- Experience on-call for Sev 0 / Sev 1 production incidents (L3 preferred)
- Strong cloud experience (GCP, AWS, Oracle, Cloudflare, etc.)
- Deep Infrastructure-as-Code experience (Terraform preferred)
- Familiarity with Argo, Helm, Kustomize, or similar deployment tools
- Experience operating observability systems (Prometheus, OTel, VictoriaMetrics)
- Backend fundamentals in Python, Go, Rust, or C++
- Strong networking + security intuition, including SSO implementation
- High ownership mindset over critical systems

Bonus

- Experience building lightweight internal tooling (APIs, dashboards, Svelte)
- Familiarity with object storage systems (“buckets”)
- Active GitHub or portfolio projects

Location

In-person at our SF HQ.