Senior Machine Learning Engineer, ML Training Platform
Reddit · Remote - United States · Posted 11 February 2026
Job Description
<div class="content-intro"><div class="c-message_kit__blocks c-message_kit__blocks--rich_text">
<div class="c-message__message_blocks c-message__message_blocks--rich_text" data-qa="message-text">
<div class="p-block_kit_renderer" data-qa="block-kit-renderer">
<div class="p-block_kit_renderer__block_wrapper p-block_kit_renderer__block_wrapper--first">
<div class="p-rich_text_block">
<div class="p-rich_text_section">Reddit is a community of communities. It’s built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 116 million daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit <a class="c-link" href="http://www.redditinc.com/" target="_blank" data-stringify-link="http://redditinc.com" data-sk="tooltip_parent">www.redditinc.com</a>.</div>
</div>
</div>
</div>
</div>
</div></div><p><strong>Who We Are:</strong><br>The Machine Learning Platform team at Reddit is a high-impact team that owns the infrastructure powering recommendations, content discovery, and user and content quantification, and directly impacts teams such as Growth, Ads, Feeds, and Core Machine Learning.</p>
<p><strong>What You’ll Do:</strong><br>As a Senior Software Engineer, Machine Learning Platform (Training Platform), you will be instrumental in architecting, implementing, and maintaining the foundational Machine Learning (ML) infrastructure that powers Feeds Ranking, Content Understanding, Recommendations, and much more, in service of Reddit’s mission of bringing community and belonging to everyone in the world. You will deliver a self-service ML platform that enables the continuous iteration and improvement of systems that use ML techniques including Deep Learning, Natural Language Processing, Recommendation Systems, Representation Learning, and Computer Vision.</p>
<ul>
<li>Lead the building, testing, and maintenance of ML training infrastructure at Reddit.</li>
<li>Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows.</li>
<li>Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows.</li>
<li>Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance.</li>
<li>GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.</li>
<li>Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the "Idea-to-Prototype" loop, and standardize software environments (Docker images, Python dependency management).</li>
</ul>
<p><strong>Who You Might Be:</strong></p>
<ul>
<li>5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems.</li>
<li><strong>Deep Kubernetes Expertise:</strong> You know K8s beyond just "deploying pods." You understand CRDs, Controllers and the Operator pattern.</li>
<li><strong>Jupyter Ecosystem Knowledge:</stron ... (truncated, view full listing at source)