Lead Site Reliability Engineer, DevOps

Pune | Posted 22 March 2026

Tech Stack

Python, Go, Scala, AWS, Azure, GCP, Kubernetes, Terraform, CI/CD, Helm, Linux

Job Description

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Job Title

Senior Site Reliability Engineer (SRE) – Observability & DevOps

Role Summary

We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems. This role requires both deep technical expertise and a production ownership mindset.

Primary Responsibilities

Observability & Monitoring

- Design, implement, and maintain end-to-end observability using:
  - Prometheus for metrics collection
  - Alertmanager for alert routing, deduplication, and escalation
  - Grafana for visualization and dashboards
  - AppDynamics for APM, transaction tracing, and application health
- Build actionable dashboards for:
  - SLIs, SLOs, and error budgets
  - Application, infrastructure, and platform health
- Reduce alert fatigue by implementing signal-based alerting and proper severity models

Data & Metrics Platform

- Manage and optimize ClickHouse for:
  - High-volume metrics, logs, or traces
  - Long-term retention and fast analytical queries
- Work on schema design, performance tuning, and cost optimization

Reliability & Operations

- Define and measure SRE best practices (SLIs, SLOs, SLAs)
- Participate in incident response, postmortems, and root cause analysis
- Drive reliability improvements through automation and capacity planning

Automation & Engineering

- Develop tooling and automation using at least one scripting/programming language
- Automate monitoring onboarding, alert generation, and dashboard creation
- Improve operational efficiencies across DevOps tooling

Required Technical Skills (Must-Have)

Core Skills

- Strong Linux fundamentals: troubleshooting, performance tuning, networking, system internals
- Scripting / Programming (any one or more): Python (preferred), Bash, Go, or similar

Observability Tools (Hands-on)

- Prometheus
- Alertmanager
- Grafana
- AppDynamics

Data Platform

- Hands-on experience with ClickHouse

Monitoring & Alerting Concepts

- Metrics vs logs vs traces
- Golden signals (latency, traffic, errors, saturation)
- Alert thresholds, routing policies, escalation strategies

Preferred / Nice-to-Have Skills

- Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
- Infrastructure as Code (Terraform, Helm)
- CI/CD observability
- Cloud platforms (AWS / Azure / GCP)
- Experience managing observability at scale (100+ services / platforms)

Senior-Level Expectations

- Ability to architect observability solutions, not just operate them
- Strong production troubleshooting and incident ownership
- Mentoring junior engineers
- Influence DevOps and SRE best practices across teams
- Communicate clearly with developers and leadership

Experience & Qualification

- 5-7 years of experience in SRE / DevOps / Production Engineering
- Experience operating high-availability, large-scale systems
- Proven background in observability-driven reliability improvements

Join our talent community to receive the latest Qualys news and content, and be first in line for new job opportunities.

Qualys, Inc. (NASDAQ: QLYS) is a pioneer and leading provider of disruptive cloud-based security, compliance and IT solutions with more than 10,000 subscription customers worldwide, including a majority of the Forbes Global 100 and Fortune 100. Qualys helps organizations streamline and automate their security and compliance solutions onto a single platform for greater agility, better business outcomes, and substantial cost savings.
