Senior/Staff Infrastructure Engineer
Hedra · San Francisco · Posted 7 April 2026
Job Description
ABOUT HEDRA:
Hedra is an AI that bridges the gap between market intelligence and content generation. By analyzing your existing assets, Hedra ensures every new creation is deeply aligned with your audience preferences, market trends, and your brand's core identity.
Backed by $45 million from premier investors like A16Z, Index, and Abstract Ventures, we are building the world’s most advanced models for unified media understanding. Today, Hedra powers the creative workflows of 10 million users and 20% of the Fortune 500. Join our team of world-class researchers and engineers as we define the next generation of creative superintelligence.
SUMMARY
As a Senior/Staff Infrastructure Engineer, you will own the reliability, availability, and operability of our core Python web services running at scale on AWS.
You will be responsible for designing, maintaining, and improving the production infrastructure that keeps Hedra online: Kubernetes for orchestration, AWS as the core cloud platform, and Postgres on RDS as a key managed data service. Your work will focus on building a highly available runtime environment for our services, ensuring we can ship quickly while staying resilient through incidents, traffic spikes, and growth.
You will design robust deployment patterns on Kubernetes, optimize our use of AWS (networking, load balancing, scaling, resilience), and put in place the observability and alerting we’re currently missing — from system-level metrics to product health signals. You’ll also partner with product engineers to make Python a great place to build: smoothing out CI/CD, runtime configuration, and production debugging.
This is a hands-on infrastructure role, not a product feature role. You will work closely with engineering leadership and product teams, but your primary mandate is to keep our services healthy, observable, and ready to scale. We're looking for a full-time hire in our San Francisco office.
EXPERIENCE
We’re looking for candidates who have:
- 4+ years in infrastructure / SRE / platform / backend operations roles at technology companies
- 3+ years running a critical Python web application in production on AWS
- Strong experience operating services on Kubernetes, including:
  - Designing deployment strategies (rolling, blue/green, canary)
  - Autoscaling, resource limits/requests, capacity planning
  - Debugging pod/node issues and cluster-level problems
- Solid experience with AWS for high availability, such as:
  - Multi-AZ architectures, load balancers, security groups, IAM basics
  - Using managed services (RDS, S3, queues, caches, etc.) effectively
  - Understanding maintenance windows, failure modes, and regional/AZ considerations
- Experience improving observability for production systems:
  - Implementing or refining system metrics (CPU, memory, disk, network, pod/node health)
  - Adding application and product health metrics (latency, error rates, key business KPIs)
  - Standing up useful dashboards, traces, structured logging, and actionable alerts
- Comfort working with Python services at scale:
  - CI/CD pipelines, dependency management, runtime configuration
  - Performance tuning, concurrency models, and production debugging
- Practical experience with Postgres on RDS:
  - Running it reliably in production (backups, restores, monitoring, failover)
  - Coordinating version upgrades and schema changes with minimal disruption
- A developer experience mindset:
  - Making it easier and safer for engineers to deploy and operate services
  - Improving tooling, scripts, and workflows around our infrastructure and observability
- A pragmatic approach to reliability and incident response:
  - Participating in or leading on-call rotations and incidents
  - Running postmortems, designing runbooks, and putting guardrails around risky operations
- Strong communication skills and the ability to collaborate with product engineers and other stakeholders on trade ... (truncated, view full listing at source)