Staff Linux & DevOps SSE

Bangalore, India · Posted 6 April 2026

Tech Stack

Node, Python, Go, Scala, Docker, Ansible, GitHub Actions, eBPF, PyTorch, AI, Linux

Job Description

About FlexAI

Build and Deploy AI the right way, anywhere. The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.

Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel and Zoox, FlexAI is not just building a product; we're shaping the future of AI. Our teams are strategically distributed across Paris, Silicon Valley, and Bangalore, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is seeking a Staff Linux & Systems Engineer to architect, build, and operate large-scale bare-metal AI/HPC GPU clusters. This role extends beyond hands-on systems engineering into technical leadership, platform architecture, and fleet-scale infrastructure ownership.

You will lead platform bring-up across the full stack (UEFI/BIOS → bootloaders → OS → kernel/device enablement), drive low-level networking performance (RoCEv2/InfiniBand), ensure GPU/accelerator stack readiness, and establish repeatable automation frameworks for provisioning, compliance, and reliability at scale.

This role is suited for engineers who are deeply comfortable operating across firmware, kernel, PCIe, and distributed AI infrastructure, and who can translate low-level expertise into scalable platform systems and engineering standards.
What You'll Do

Platform Architecture & Fleet Ownership:
Architect and lead end-to-end bring-up of AI/HPC server platforms from firmware to production cluster deployment
Define standards for UEFI/BIOS configuration, SecureBoot, TPM/MeasuredBoot, GRUB, PXE/iPXE provisioning workflows
Establish scalable patterns for fleet provisioning, configuration management, and lifecycle operations across GPU clusters
Own technical roadmap for bare-metal AI infrastructure and systems reliability at scale

Platform & Boot Enablement:
Lead server bring-up including UEFI/BIOS configuration, bootloader flows, and secure boot pipelines
Architect automated BMC/IPMI/Redfish workflows for out-of-band provisioning and fleet management
Standardize platform initialization processes across heterogeneous hardware environments
Diagnose and resolve complex boot, firmware, and hardware initialization issues

OS & Kernel Engineering:
Architect, build, and harden custom Linux (Ubuntu) images optimized for AI and HPC workloads
Lead kernel tuning for performance-sensitive workloads (NUMA, IRQ affinity, cgroups, namespaces)
Diagnose and resolve kernel and user-space performance issues using perf, ftrace, eBPF, and bpftrace
Drive system-level optimizations for latency, throughput, and resource utilization across clusters

PCIe, Driver & Device Enablement:
Lead validation of PCIe topologies and advanced features (ACS, ARI, ATS, SR-IOV, IOMMU/VFIO)
Own GPU/NIC driver bring-up, firmware validation, and device performance optimization
Root-cause complex regressions across kernel, drivers, firmware, and userspace layers
Partner with hardware vendors to resolve low-level device and platform issues

Provisioning & Automation at Scale:
Architect idempotent Ansible-based provisioning frameworks and automation pipelines
Build scalable golden images and repeatable provisioning workflows for large GPU fleets
Develop Python/Pytest validation harnesses for pre- and post-provisioning checks
Implement drift detection, remediation, and compliance automation across infrastructure

GPU / Accelerator & HPC Stack Readiness:
Lead enablement of NVIDIA CUDA, NCCL, GPUDirect RDMA and AMD RO ... (truncated, view full listing at source)

