Senior/Staff Infrastructure Engineer

San Francisco · Posted 7 April 2026

Tech Stack

Node Python Rails AWS Kubernetes CI/CD AI

Job Description

ABOUT HEDRA

Hedra is an AI platform that bridges the gap between market intelligence and content generation. By analyzing your existing assets, Hedra ensures every new creation is deeply aligned with your audience's preferences, market trends, and your brand's core identity. Backed by $45 million from premier investors like A16Z, Index, and Abstract Ventures, we are building the world's most advanced models for unified media understanding. Today, Hedra powers the creative workflows of 10 million users and 20% of the Fortune 500. Join our team of world-class researchers and engineers as we define the next generation of creative superintelligence.

SUMMARY

As a Senior/Staff Infrastructure Engineer, you will own the reliability, availability, and operability of our core Python web services running at scale on AWS. You will design, maintain, and improve the production infrastructure that keeps Hedra online: Kubernetes for orchestration, AWS as the core cloud platform, and Postgres on RDS as a key managed data service.

Your work will focus on building a highly available runtime environment for our services, so that we can ship quickly while staying resilient through incidents, traffic spikes, and growth. You will design robust deployment patterns on Kubernetes, optimize our use of AWS (networking, load balancing, scaling, resilience), and put in place the observability and alerting we're currently missing, from system-level metrics to product health signals. You'll also partner with product engineers to make Python a great place to build by smoothing out CI/CD, runtime configuration, and production debugging.

This is a hands-on infrastructure role, not a product feature role. You will work closely with engineering leadership and product teams, but your primary mandate is to keep our services healthy, observable, and ready to scale. We're looking for a full-time hire in our San Francisco office.
EXPERIENCE

We're looking for candidates who have:
- 4+ years in infrastructure / SRE / platform / backend operations roles at technology companies
- 3+ years running a critical Python web application in production on AWS
- Strong experience operating services on Kubernetes, including:
  - Designing deployment strategies (rolling, blue/green, canary)
  - Autoscaling, resource limits/requests, and capacity planning
  - Debugging pod/node issues and cluster-level problems
- Solid experience with AWS for high availability, such as:
  - Multi-AZ architectures, load balancers, security groups, and IAM basics
  - Using managed services (RDS, S3, queues, caches, etc.) effectively
  - Understanding maintenance windows, failure modes, and regional/AZ considerations
- Experience improving observability for production systems:
  - Implementing or refining system metrics (CPU, memory, disk, network, pod/node health)
  - Adding application and product health metrics (latency, error rates, key business KPIs)
  - Standing up useful dashboards, traces, structured logging, and actionable alerts
- Comfort working with Python services at scale:
  - CI/CD pipelines, dependency management, and runtime configuration
  - Performance tuning, concurrency models, and production debugging
- Practical experience with Postgres on RDS:
  - Running it reliably in production (backups, restores, monitoring, failover)
  - Coordinating version upgrades and schema changes with minimal disruption
- A developer experience mindset:
  - Making it easier and safer for engineers to deploy and operate services
  - Improving tooling, scripts, and workflows around our infrastructure and observability
- A pragmatic approach to reliability and incident response:
  - Participating in or leading on-call rotations and incidents
  - Running postmortems, designing runbooks, and putting guardrails around risky operations
- Strong communication skills and the ability to collaborate with product engineers and other stakeholders on trade ...


