Senior Cloud Infrastructure Engineer

Gatik
Mountain View, CA
$180k – $240k
Posted 27 January 2026

Job Description

Who we are

Gatik, the leader in autonomous middle-mile logistics, is revolutionizing the B2B supply chain with its autonomous transportation-as-a-service (ATaaS) solution and prioritizing safe, consistent deliveries while streamlining freight movement by reducing congestion. The company focuses on short-haul, B2B logistics for Fortune 500 retailers and in 2021 launched the world’s first fully driverless commercial transportation service with Walmart. Gatik's Class 3-7 autonomous trucks are commercially deployed across major markets, including Texas, Arkansas, and Ontario, Canada, driving innovation in freight transportation. The company's proprietary Level 4 autonomous technology, Gatik Carrier™, is custom-built to transport freight safely and efficiently between pick-up and drop-off locations on the middle mile. With robust capabilities in both highway and urban environments, Gatik Carrier™ serves as an all-encompassing solution that integrates advanced software and hardware powering the fleet, facilitating effortless integration into customers' logistics operations.

About the role

We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective.

This role is onsite 5 days a week at our Mountain View, CA office!

What you'll do

Cloud-Native Orchestration & Kubernetes

Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (NVIDIA GPU Operator) to ensure maximum hardware utilization.
Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
Self-Healing Infrastructure: Deploy autonomous AI agents (LangGraph, CrewAI) to monitor cluster health and enable automated triage of hardware failures and NCCL timeouts.

Data Engineering & CI/CD Pipelines

Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
GitOps: Implement robust GitOps workflows using ArgoCD and GitLab CI/CD to automate the deployment of both infrastructure and model artifacts.
Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.

Model Management & Lifecycle (MLOps)

Experiment & Model Tracking: Design and maintain MLflow and feature store integrations to provide a robust system of record for every model iteration.
Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.

Distributed Training & ML Systems Support

Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU ...
(truncated, view full listing at source)