Senior Site Reliability Engineer
Triumph · San Francisco HQ · Posted 2 April 2026
Job Description
ABOUT US
Triumph (https://triumpharcade.com) makes mobile gaming more thrilling by letting players wager -- and win -- real money, play in massively multiplayer games, and compete in social tournaments. We've built the top app in our App Store category (https://apps.apple.com/us/app/triumph-play-for-cash/id1608987929) and sustained exponential month-over-month growth in revenue and active players. We're hyper-scaling our team and continuing to innovate, launching new products like Rips, our collectibles app (https://apps.apple.com/us/app/rips-by-triumph/id6751921248), which has found huge success and continues to expand. And we're just getting started.
Triumph is backed by some of the top consumer VCs including Goodwater Capital, General Catalyst, and DraftKings Drive Fund.
THE ROLE
As a Senior Site Reliability Engineer, you'll own the reliability and scalability of the backend systems powering both Triumph Arcade and Rips through rapid, sustained growth - ensuring that real-time matchmaking, payment processing, and fraud detection operate with the integrity and performance our players depend on.
WHAT YOU'LL DO
- Define and own SLOs, error budgets, and incident response practices across Triumph's production systems
- Own our infrastructure-as-code foundations using Terraform - provisioning, modularity, and keeping our environments consistent and auditable
- Design and enforce IAM policies across cloud environments, ensuring least-privilege access controls that meet the compliance bar of a real-money platform
- Build and improve observability infrastructure - distributed tracing, alerting, and dashboards - so engineers have the signal they need to move fast without breaking things
- Own the reliability of our real-time matchmaking and payment infrastructure, including failure recovery, idempotency, and transaction integrity
- Own platform security end-to-end - enforcing least-privilege IAM policies, hardening firewall configurations, and building the threat detection and alerting systems that keep a real-money platform ahead of bad actors
- Lead postmortems and build the operational runbooks and tooling that prevent repeat incidents
- Proactively load test and stress test our systems to find breaking points before our players do - then own the improvements that raise the ceiling ahead of launches, tournaments, and viral moments
QUALIFICATIONS
- 5+ years of SRE or infrastructure engineering experience, with production ownership of high-traffic, real-time consumer systems
- Deep experience with observability tooling (Datadog, Grafana, OpenTelemetry, or equivalent) and a strong intuition for what to measure and why
- Hands-on experience with cloud infrastructure at scale (AWS, GCP, or Azure) - you've designed for failure, not just uptime
- Familiarity with real-time backend patterns: WebSockets, event streaming, pub/sub, queue-based architectures
- Bonus: experience in real-money gaming, fintech, or any domain where transaction integrity and fraud surface area are first-class concerns
- You write code - Python, Go, or similar - and treat infrastructure as a software problem
WHY TRIUMPH?
- High growth. Build a high-scale consumer platform that touches gaming, finance, and social with the autonomy to set our web direction.
- High agency. Small, high-impact engineering team that is growing rapidly with significant opportunity for leadership and growth.
- High energy. Passionate team that is proud of its work and velocity (16x year-over-year growth).
- Competitive salary and benefits. $400/mo lunch credit, healthcare, vision, dental, 401k, etc.
Our team gathers 5 days a week at Triumph’s headquarters at Levi’s Plaza in San Francisco.