Staff Machine Learning Training Framework Engineer, GenAI
Adobe, San Jose. Posted 1 March 2026
Job Description
The Opportunity
Adobe Applied Science & Machine Learning (ASML) is seeking a Staff Machine Learning Training Framework Engineer to play a critical role in building and scaling the core training systems behind Adobe’s generative AI foundation models. In this role, you will serve as a senior technical owner for key components of our training framework, translating research needs into reliable, scalable, and high‑performance training infrastructure.
Rather than focusing on a single model, your work will enable multiple multimodal and video foundation models by strengthening the shared systems used to train them. You will operate at the intersection of applied research and large‑scale systems execution, ensuring that training workflows are robust, reproducible, and performant across large GPU clusters. This role is ideal for a senior engineer who thrives on deep technical ownership, complex execution, and close collaboration with research teams.

Job Responsibilities
Training Framework Ownership: Own the design and implementation of major components of the training framework, including abstractions for model configuration, optimizer and scheduler integration, checkpointing, and experiment management.
Large‑Scale Training Execution: Implement and support distributed training strategies such as PyTorch FSDP, Tensor Parallelism, and Pipeline Parallelism, ensuring correctness, stability, and scalability across multi‑node GPU environments.
Reliability & Fault Tolerance: Improve the resilience of long‑running training jobs by strengthening restartability, state management, and failure handling mechanisms.
Performance‑Aware Framework Design: Identify framework‑level inefficiencies and reduce overhead related to memory usage, communication, or execution orchestration in large training runs.
Research Enablement: Partner directly with applied researchers to support new model architectures and training requirements, ensuring the framework adapts quickly to evolving research needs.
Training Pipeline Integration: Collaborate with infrastructure and platform teams to integrate the training framework with scheduling, storage, monitoring, and logging systems used in production‑scale environments.

What You’ll Need to Succeed
Education: Master’s or PhD degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience.
Strong Systems Engineering Skills: Proficiency in Python and C++, with experience contributing to large, shared codebases that support multiple users or teams.
Proven ML Training Experience: Hands‑on experience training models using PyTorch (or JAX), including multi‑GPU and multi‑node distributed training setups.
Distributed Systems Understanding: Solid understanding of synchronization, state management, fault tolerance, and performance tradeoffs in distributed systems.
Senior‑Level Execution: Demonstrated ability to independently own complex technical problems, drive solutions to completion, and deliver high‑quality systems relied upon by others.

Preferred Experience
Experience supporting large‑scale foundation model training or long‑running multi‑node training jobs.
Familiarity with ML training infrastructure such as DeepSpeed, Accelerate, or internal training platforms.
Experience working closely with applied research teams on rapidly evolving model requirements.
Exposure to profiling, debugging, and optimizing training performance at scale.

About Adobe
Adobe empowers everyone to create through innovative platforms and tools that unleash creativity, productivity and personalized customer experiences. Adobe’s industry-leading offerings including Adobe Acrobat Studio, Adobe Express, Adobe Firefly, Creative Cloud, Adobe Experience Platform, Adobe Experience Manager, and GenStudio enable people and businesses to turn ideas into impact, powered by AI and driven by human ingenuity. Our 30,000+ employees worldwide are creating the future and raising the bar as we drive the next decade of growth.