Job Description
CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com .
What you'll do:
CoreWeave is seeking a highly skilled and motivated HPC Performance Engineer to join our HAVOCK Team, reporting into the Manager of Systems Engineering. In this role, you will play a crucial part in the design, development, and optimization of our bare-metal systems from POST through joining a Kubernetes cluster. The team’s primary responsibilities include maintaining a custom Linux kernel, various OS images (Ubuntu-based), the virtualization stack (kubevirt/qemu/vfio), and the container/pod runtime stack (containerd/nydus/kubelet). You will collaborate closely with cross-functional teams, up stack engineering teams, and stakeholders to ensure our low-level software stack is performant in the context of hardware updates; and providing data, metrics, dashboards, and analysis to substantiate performance assertions.
Kernel
H
ardware -
A
cceleration -
V
irtualization -
O
perating Systems -
C
ontainerization -
K
ubelet
Our Team’s Stack:
Python, Go, bash/sh, C
Prometheus, Victoria Metrics, Grafana
Linux Kernel (custom build), Ubuntu
Intel/AMD/ARM CPUs, Nvidia GPUs, DPUs, Infiniband and Ethernet NICs
Docker, kubernetes (k8s), KubeVirt, containerd, kubelet
About the role:
Develop and maintain tools for establishing systems performance baselines
Develop and maintain performance regression analysis testing automation
Design and maintain performance regression test pipelines for HPC workloads
Debug and Tune fabric-level performance to ensure low-latency high throughput configurations
Development of telemetry for performance analysis across distributed clusters of servers
Triage and fix performance issues in Linux
Collect data, produce metrics and visualizations that communicate performance information compared to benchmarks; this data should lead to appropriate business decisions and toward greater automation that improves customer experience in relation to performance
Define Linux and OS requirements, specifications, and system architecture in relation to systems performance, in collaboration with cross-functional teams. Along with these responsibilities there will also be cross team collaboration to triage and resolve bottlenecks
Who you are:
5+ years of professional experience in Systems/HPC Performance Engineering, Benchmarking, and/or Validation.
Strong experience with MPI workloads and distributed system performance analysis
Familiarity with RoCE, InfiniBand, and GPUDirect/Data Direct I/O, NUMA, etc in HPC workloads
Hands-on use of public HPC benchmarks (HPCC, HPL, OSU, MLPerf-HPC, STREAM, IO500)
Extensive, deep experience in Linux internals
Fluency with a programming language geared toward automation (Python preferred, but others possible)
Experience writing robust, testable code
Experience diagnosing and fixing systems performance issues
Experiencing with implementing automation testing
Ability to effectively prioritize and communicate proposed features and fixes in a remote-employee environment
Strong passion for automation, with a commitment to automating processes comprehensively
Excellent documentation skills and attention to detail
Strong analytical and problem-solving abilities
Preferred:
Familiarity with QA/QE best practices
Familiarity with Golang
Opinions about software version control and team collaboration
Experience working in Cloud environments
Experience as a software engineer writing large-scale applications
Experience in op ... (truncated, view full listing at source)