Senior Manager of Engineering, Production Infrastructure

Klaviyo
Boston, MA
Posted 24 February 2026

Job Description

<div class="content-intro"><p><em>At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying. Want to learn more about life at Klaviyo? Visit <a class="_ymio1r31 _ypr0glyw _zcxs1o36 _mizu194a _1ah3dkaa _ra3xnqa1 _128mdkaa _1cvmnqa1 _4davt94y _4bfu18uv _1hms8stv _ajmmnqa1 _vchhusvi _kqswh2mm _ect4ttxp _syaz13af _1a3b18uv _4fpr8stv _5goinqa1 _f8pj13af _9oik18uv _1bnxglyw _jf4cnqa1 _30l313af _1nrm18uv _c2waglyw _1iohnqa1 _9h8h12zz _10531ra0 _1ien1ra0 _n0fx1ra0 _1vhv17z1" href="http://klaviyo.com/careers" data-renderer-mark="true">klaviyo.com/careers</a> to see how we empower creators to own their own destiny.</em></p></div><p>Klaviyo powers growth for thousands of businesses, and our R&amp;D teams build on shared platform primitives. As the Senior Manager, Production Infrastructure, you’ll lead the teams behind our paved roads—compute runtimes, service networking/ingress, and observability—so product engineers can move fast on a stable, cost‑disciplined foundation. You’ll publish opinionated defaults (“golden paths”), install SLO discipline, and make reliability and developer experience measurable across the company. 
</p> <p>This is a hands‑on leadership role: you’ll stay close to architecture and operations, review designs and PRs, jump into incidents when needed, and prototype reference solutions that set the standard.</p> <h3><strong>How You’ll Make a Difference</strong></h3> <ul> <li>Own and evolve platform primitives in scope (compute runtimes, service networking/ingress, observability) with clear APIs, SLOs, runbooks, and support tiers.</li> <li><strong>Lead by example technically: </strong>drive design reviews, review PRs, and author reference implementations, starter repos, and Terraform/Helm modules that demonstrate the golden path.</li> <li>Deliver golden paths and self‑service scaffolding; reduce time‑to‑first‑service and lead time for changes.</li> <li>Raise the bar on reliability: incident response (blameless), alert hygiene, capacity planning, and on‑call health.</li> <li><strong>Be production‑close: </strong>participate in critical incident response and postmortems; trace issues across Kubernetes, service mesh, and data paths; convert learnings into durable fixes, guardrails, and policy‑as‑code.</li> <li><strong>Standardize observability end‑to‑end: </strong>expand OpenTelemetry adoption, define log/trace schemas, and make SLOs and error budgets first‑class in dashboards and alerts.</li> <li>Evolve our Kubernetes and networking layers: plan cluster upgrades, right‑size node/Pod configs, harden ingress/gateway policies, and advance mTLS/service identity and traffic shaping.</li> <li><strong>Advance CI/CD and GitOps: </strong>ensure fast, safe deploys with progressive delivery, automatic rollbacks, and pre‑prod environments that mirror prod; enforce guardrails via policy‑as‑code.</li> <li>Stand up a concise scorecard (SLO coverage, incident frequency/severity, lead time, MTTR, developer platform NPS, cost‑to‑serve) and drive consistent trend improvements.</li> <li>Partner with Security, Data Platform, and Product to clarify ownership boundaries and enable safe, 
fast delivery.</li> <li>Improve cost‑to‑serve via quotas, right‑sizing, and showback in partnership with Finance.</li> <li>Transform workflows by putting AI at the center, building smarter systems and ways of working from the ground up; pilot AI‑assisted runbooks and incident summarization to shorten resolution time.</li> </ul> <h3><strong>Who You Are</strong></h3> <ul> <li>7–10+ years in infra/SRE/platform with 3–5+ years leading teams (including managers or staff/lead ICs).</li> <li>Demonstrated SRE practices (SLI/SLO design, ... (truncated, view full listing at source)