Staff DevOps Engineer/SRE
FlexAI
Bangalore, India
Posted 6 April 2026
Job Description
About FlexAI
Build and Deploy AI the right way, anywhere.
The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel and Zoox, FlexAI is not just building a product; we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
Role Overview
FlexAI is looking for a Staff DevOps Engineer/SRE to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments.
You’ll work closely with developers to ensure our platform is reliable, performant, and scalable — without slowing down product velocity.
What You’ll Do
Own Reliability & Architecture:
Design and evolve the infrastructure backbone for our AI and PaaS platform
Build highly available, fault-tolerant, and scalable systems
Define and drive SRE practices (SLIs, SLOs, error budgets)
Build Infrastructure at Scale:
Lead Infrastructure as Code using Pulumi
Own and scale Kubernetes clusters and containerized workloads
Standardize and automate infrastructure for global deployments
CI/CD & Automation:
Design and scale CI/CD pipelines for fast, reliable releases
Build self-healing systems and automated remediation workflows
Drive GitOps and platform engineering practices
Observability & Performance:
Implement end-to-end observability using VictoriaMetrics and Grafana (metrics, logs, traces)
Identify and resolve performance bottlenecks (latency, throughput, cost)
Lead incident response, root cause analysis, and postmortems
Leadership & Collaboration:
Partner with backend, AI, runtime, and security teams
Guide infrastructure decisions and scaling strategy
Mentor engineers and raise the bar on reliability and engineering standards
Security & Resilience:
Embed security into infrastructure and deployment workflows
Design for resilience (disaster recovery, chaos testing, capacity planning)
What You'll Need to Be Successful
8+ years of experience in DevOps, SRE, or Infrastructure Engineering
Proven experience operating large-scale, distributed systems in production
Deep expertise in:
Kubernetes & container orchestration
Pulumi (or similar IaC tools)
Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
Observability stacks (Prometheus, Grafana, OpenTelemetry)
Strong experience with CI/CD, automation, and release engineering
Proficiency in Python, Go, or Bash
Strong systems thinking and debugging skills in high-scale environments
Experience defining and operating with SLOs / SLAs
Experience in startup environments
Comfortable leveraging AI coding tools and agents to move faster
Nice to Have
Experience with AI/ML infrastructure or GPU workloads
Familiarity with distributed or high-performance compute systems
Exposure to platform engineering / internal developer platforms
Experience scaling systems from beta to production
Why FlexAI
Work on cutting-edge AI infrastructure
Build systems that power developers and enterprises
High ownership, fast execution, real impact
Collaborative, high-caliber team