Senior DevOps Engineer/SRE
FlexAI
Bangalore, India
Posted 6 April 2026
Job Description
About FlexAI
Build and Deploy AI the right way, anywhere.
The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It combines 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product – we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
Role Overview
FlexAI is looking for a Senior DevOps / SRE Engineer to build and operate the infrastructure powering our AI and PaaS platform.
You’ll work closely with developers to ensure our systems are reliable, performant, and scalable while enabling fast product iteration. This role is hands-on and execution-focused, with opportunities to contribute to system design and reliability practices as we scale.
What You’ll Do
Build & Operate Infrastructure:
Build and maintain infrastructure for our AI and PaaS platform
Deploy and operate Kubernetes clusters and containerized services
Implement Infrastructure as Code using Pulumi (or similar tools)
Reliability & SRE Practices:
Help define and implement SLIs, SLOs, and error budgets
Improve system reliability, availability, and performance
Participate in on-call rotations, incident response, and postmortems
CI/CD & Automation:
Build and improve CI/CD pipelines for reliable and fast releases
Automate operational workflows and reduce manual toil
Contribute to GitOps and platform engineering practices
Observability & Performance:
Implement and maintain observability (metrics, logs, traces) using VictoriaMetrics and Grafana
Monitor systems and troubleshoot performance issues (latency, throughput, cost)
Collaboration:
Work closely with developers, platform, and AI teams to support production systems
Help debug issues across infrastructure and application layers
Contribute to improving engineering productivity and developer experience
What You’ll Need to Be Successful
4+ years of experience in DevOps, SRE, or Infrastructure Engineering
Experience operating production systems at scale
Hands-on experience with:
Kubernetes & containers
Infrastructure as Code (Pulumi, Terraform, etc.)
Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
Observability tools (Prometheus, Grafana, OpenTelemetry)
Experience with CI/CD systems and automation
Proficiency in Python, Go, or Bash
Strong debugging and problem-solving skills
Familiarity with SLOs and reliability practices
Experience working in startup or fast-paced environments
Comfortable leveraging AI coding tools and agents
Nice to Have
Experience with AI/ML infrastructure or GPU workloads
Familiarity with distributed systems or compute platforms
Exposure to platform engineering concepts
Experience supporting systems from Beta to production
Why FlexAI
Work on cutting-edge AI infrastructure
Build systems that power developers and enterprises
High ownership, fast execution, real impact
Collaborative, high-caliber team