Deep Learning Kernel Software Performance Architect

NVIDIA
2 Locations
Posted 8 March 2026

Job Description

NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.

What you'll be doing:

Performance analysis and debugging:
Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.
Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.
Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.

Automation and regression infrastructure (Python-heavy):
Develop and maintain Python-based automation for performance testing and analysis, using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.
Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.
Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.

Cross-team collaboration and operating model:
Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.
Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.
Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently.
Follow general software engineering best practices, including support for regression testing and CI/CD flows.

What we need to see:

Master's or PhD degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field.
Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging).
Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).
Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.
Comfortable working across teams and driving issues to decision/closure with clear communication.
Demonstrated strong C programming and software design skills, including debugging, performance analysis, and test design.
Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g., with OpenMP or pthreads).
Solid understanding of computer architecture and some experience with assembly programming.
Ability to identify bottlenecks, optimize resource utilization, and improve throughput.

Ways to stand out from the crowd:

Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts).
Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics.
GPU programming/perf experience (CUDA or equivalent parallel programming).
Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks).
Familiarity with simulators/analytical modeling or performance characterization methodology.