Staff DevOps Engineer/SRE

FlexAI
Bangalore, India
Posted 6 April 2026

Job Description

About FlexAI

Build and Deploy AI the right way, anywhere. The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.

Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product; we're shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is looking for a Staff DevOps / SRE Engineer to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments. You'll work closely with developers to ensure our platform is reliable, performant, and scalable without slowing down product velocity.

What You'll Do

Own Reliability & Architecture:
- Design and evolve the infrastructure backbone for our AI and PaaS platform
- Build highly available, fault-tolerant, and scalable systems
- Define and drive SRE practices (SLIs, SLOs, error budgets)

Build Infrastructure at Scale:
- Lead Infrastructure as Code using Pulumi
- Own and scale Kubernetes clusters and containerized workloads
- Standardize and automate infrastructure for global deployments

CI/CD & Automation:
- Design and scale CI/CD pipelines for fast, reliable releases
- Build self-healing systems and automated remediation workflows
- Drive GitOps and platform engineering practices

Observability & Performance:
- Implement end-to-end observability using VictoriaMetrics and Grafana (metrics, logs, traces)
- Identify and resolve performance bottlenecks (latency, throughput, cost)
- Lead incident response, root cause analysis, and postmortems

Leadership & Collaboration:
- Partner with backend, AI, runtime, and security teams
- Guide infrastructure decisions and scaling strategy
- Mentor engineers and raise the bar on reliability and engineering standards

Security & Resilience:
- Embed security into infrastructure and deployment workflows
- Design for resilience (disaster recovery, chaos testing, capacity planning)

What You'll Need to Be Successful

- 8+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Proven experience operating large-scale, distributed systems in production
- Deep expertise in:
  - Kubernetes & container orchestration
  - Pulumi (or similar IaC tools)
  - Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
  - Observability stacks (Prometheus, Grafana, OpenTelemetry)
- Strong experience with CI/CD, automation, and release engineering
- Proficiency in Python, Go, or Bash
- Strong systems thinking and debugging skills in high-scale environments
- Experience defining and operating with SLOs / SLAs
- Experience in startup environments
- Comfortable leveraging AI coding tools and agents to move faster

Nice to Have

- Experience with AI/ML infrastructure or GPU workloads
- Familiarity with distributed or high-performance compute systems
- Exposure to platform engineering / internal developer platforms
- Experience scaling systems from Beta to production

Why FlexAI

- Work on cutting-edge AI infrastructure
- Build systems that power developers and enterprises
- High ownership, fast execution, real impact
- Collaborative, high-caliber team