Senior DevOps Engineer/SRE

Bangalore, India · Posted 6 April 2026

Tech Stack

Python, Go, Scala, AWS, Azure, GCP, Kubernetes, Terraform, CI/CD, Pulumi, OpenTelemetry, AI Agents

Job Description

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.

Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product; we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is looking for a Senior DevOps Engineer/SRE to build and operate the infrastructure powering our AI and PaaS platform. You’ll work closely with developers to ensure our systems are reliable, performant, and scalable, while enabling fast product iteration. This role is hands-on and execution-focused, with opportunities to contribute to system design and reliability practices as we scale.

What You’ll Do

Build & Operate Infrastructure:
- Build and maintain infrastructure for our AI and PaaS platform
- Deploy and operate Kubernetes clusters and containerized services
- Implement Infrastructure as Code using Pulumi (or similar tools)

Reliability & SRE Practices:
- Help define and implement SLIs, SLOs, and error budgets
- Improve system reliability, availability, and performance
- Participate in on-call rotations, incident response, and postmortems

CI/CD & Automation:
- Build and improve CI/CD pipelines for reliable and fast releases
- Automate operational workflows and reduce manual toil
- Contribute to GitOps and platform engineering practices

Observability & Performance:
- Implement and maintain observability (metrics, logs, traces) using VictoriaMetrics and Grafana
- Monitor systems and troubleshoot performance issues (latency, throughput, cost)

Collaboration:
- Work closely with developers, platform, and AI teams to support production systems
- Help debug issues across infrastructure and application layers
- Contribute to improving engineering productivity and developer experience

What You’ll Need to Be Successful

- 4+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Experience operating production systems at scale
- Hands-on experience with:
  - Kubernetes & containers
  - Infrastructure as Code (Pulumi, Terraform, etc.)
  - Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
  - Observability tools (Prometheus, Grafana, OpenTelemetry)
- Experience with CI/CD systems and automation
- Proficiency in Python, Go, or Bash
- Strong debugging and problem-solving skills
- Familiarity with SLOs and reliability practices
- Experience working in startup or fast-paced environments
- Comfortable leveraging AI coding tools and agents

Nice to Have

- Experience with AI/ML infrastructure or GPU workloads
- Familiarity with distributed systems or compute platforms
- Exposure to platform engineering concepts
- Experience supporting systems from Beta to production

Why FlexAI

- Work on cutting-edge AI infrastructure
- Build systems that power developers and enterprises
- High ownership, fast execution, real impact
- Collaborative, high-caliber team

