Observability Engineer

TensorWave
Las Vegas, Nevada
Posted 24 February 2026

Job Description

Our mission at TensorWave Cloud is to build seamless, secure, reliable, and resilient AI infrastructure at scale, eliminating barriers and challenging the status quo to empower builders and support AI innovation.

About the role

We are looking for an Observability Engineer who is deeply obsessed with Grafana, Prometheus, and modern observability practices. This role exists to ensure our systems are measurable, understandable, and debuggable at all times.

You will own the observability stack end-to-end (from instrumentation standards to dashboards, alerts, and signal quality) and work closely with infrastructure, platform, and application teams to make sure nothing important fails silently.

If you think about metrics before features, believe bad alerts are worse than no alerts, and treat Grafana dashboards as first-class products, this role is for you.

Responsibilities

- Own and evolve our observability and monitoring platform, with Grafana and Prometheus at its core
- Design, build, and maintain high-quality metrics pipelines using Prometheus and related tooling
- Create clear, actionable Grafana dashboards that tell a story, not just charts
- Define and maintain alerts that are meaningful, actionable, and low-noise
- Establish and enforce observability standards across services (metrics, logs, traces)
- Partner with engineering teams to instrument applications correctly
- Lead improvements to alerting strategies, SLOs, and SLIs
- Support incident response by helping teams quickly understand what broke and why
- Continuously evaluate and improve signal quality, cardinality, and cost
- Identify observability gaps and eliminate blind spots before they become outages

You Are Obsessed With:

- Grafana dashboards that instantly explain system health
- Prometheus metrics that are intentionally designed, not accidental
- Alerts that wake people up only when action is required
- Monitoring that scales with system complexity
- Observability as a product, not an afterthought

Required Experience

- Strong hands-on experience with Grafana and Prometheus
- Deep understanding of metrics-based observability
- Experience designing monitoring and alerting systems at scale
- Strong knowledge of alerting best practices (burn rates, SLO-based alerts, noise reduction)
- Experience working with distributed systems and cloud or Kubernetes environments
- Ability to reason about system behavior using telemetry
- Comfortable working across teams to improve instrumentation and visibility

Preferred Experience

- Experience with OpenTelemetry
- Familiarity with logs and traces (Loki, Tempo, Jaeger, etc.)
- Kubernetes observability experience
- Experience operating observability systems in high-scale or production-critical environments
- Infrastructure-as-Code experience (Terraform, Helm, etc.)

What We Bring

- Mission-driven company
- Competitive Salary
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health

We’re looking for resilient, adaptable people to join our team, people who believe in the mission and think at massive scale. The solutions that worked on a handful of devices will not work at Exascale. Be prepared to be pushed daily, to learn a lot, and literally build the future.

TensorWave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, national origin, or veteran status.