Senior Backend Engineer, Data Modeling and Ingestion Platform

Udio
New York · $180k – $220k · Posted 5 March 2026

Job Description

<h2><strong>About the Role</strong></h2> <p>We are looking for a Senior Backend Engineer to lead the unification of <strong>large, rich, and heterogeneous datasets</strong> sourced from a wide range of external providers. These datasets power our generative audio models. </p> <p>Your work will create the foundational dataset that powers our research by building robust, scalable systems for <strong>linking, deduplicating, reconciling, and enriching</strong> data at massive scale. This role centers on <strong>high-impact bulk ingestion and advanced data linkage</strong>. You will design the logic, algorithms, and strategies that transform many independent datasets into a unified, high-quality canonical asset used throughout the company.</p> <p>You will collaborate closely with ML researchers and product teams, working with tools such as <strong>BigQuery, Dataflow/Beam, TFRecords</strong>, and, where beneficial, distributed systems frameworks like <strong>Ray</strong>. Familiarity with ML workflows using <strong>JAX</strong> or <strong>multihost training</strong> is a plus, as the datasets you produce will directly support that ecosystem.</p> <h2>What You'll Do</h2> <ul> <li>Build high-throughput <strong>bulk ingestion workflows</strong> to integrate datasets from multiple external providers. </li> <li>Design and implement scalable <strong>entity-resolution</strong> solutions, including record linking, deduplication, clustering, and conflict arbitration. </li> <li>Create and refine <strong>matching logic, decision rules, and similarity functions</strong> to align datasets with high accuracy and strong coverage. </li> <li>Define and track <strong>data quality indicators</strong> such as overlap metrics, match precision/recall, duplicate rates, and completeness. </li> <li>Prepare training-ready datasets in formats such as <strong>TFRecords</strong>, and structure data to meet ML research requirements. 
</li> <li>Develop processing components using <strong>Dataflow (Beam)</strong> and manage large analytical workloads in <strong>BigQuery</strong>. </li> <li>Leverage frameworks like <strong>Ray</strong> to accelerate large-scale experiments, feature extraction, and research-oriented data preparation. </li> <li>Collaborate with ML researchers to anticipate downstream requirements and evolve linkage strategies as new sources and use cases emerge. </li> </ul> <h2>What We're Looking For</h2> <ul> <li>Experience working with <strong>large, heterogeneous datasets</strong> from multiple providers or domains. </li> <li>Strong background in <strong>entity resolution</strong>, deduplication, data unification, or related large-scale data integration techniques. </li> <li>Proficiency in <strong>Python</strong>, with an emphasis on efficient, scalable data processing. </li> <li>Experience with <strong>BigQuery, Google Dataflow/Apache Beam</strong>, or similar batch-processing frameworks. </li> <li>Familiarity with <strong>data validation, normalization, and reconciliation</strong>, and with building consistent views across diverse data sources. </li> <li>Ability to craft well-structured <strong>matching and decision strategies</strong> that balance accuracy, completeness, and computational efficiency. </li> <li>Comfortable iterating quickly on pragmatic solutions, balancing correctness with time-to-delivery. </li> <li>Clear communication skills and the ability to collaborate closely with ML and research teams. </li> </ul> <h2>Nice to Have</h2> <ul> <li>Experience architecting <strong>Google Cloud Platform</strong> systems at scale. </li> <li>Experience with distributed compute frameworks such as <strong>Ray</strong>, <strong>Spark</strong>, or <strong>Flink</strong>. </li> <li>Understanding of <strong>JAX-based ML pipelines</strong>, <strong>multihost training setups</strong>, or large-scale data preparation for accelerator-backed workflows. 
</li> <li>Familiarity with <strong>TFRecords</strong> or other high-volume training data format ... </li> </ul> <p><em>(Truncated; view the full listing at the source.)</em></p>
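As an illustrative sketch of the entity-resolution work the role describes (not code from the posting), here is a minimal record-linkage pass in Python: block records on a cheap key to limit pairwise comparisons, score candidate pairs with token-set Jaccard similarity, and cluster matches with union-find. All names (`resolve_entities`, `jaccard`, the `threshold` value) are hypothetical choices for the sketch.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def resolve_entities(records, key="title", threshold=0.8):
    """Cluster records whose `key` fields are near-duplicates.

    Blocking: only records sharing a first token are compared,
    which keeps the number of pairwise comparisons tractable.
    Clustering: union-find merges any pair scoring >= `threshold`.
    Returns a list of clusters, each a list of record indices.
    """
    parent = list(range(len(records)))

    def find(i):
        # Path-halving find for the union-find forest.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Block on the first token of the key field.
    blocks = {}
    for idx, rec in enumerate(records):
        tokens = rec[key].lower().split()
        blocks.setdefault(tokens[0] if tokens else "", []).append(idx)

    # Score candidate pairs within each block; merge matches.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if jaccard(records[i][key], records[j][key]) >= threshold:
                union(i, j)

    # Group record indices by their cluster root.
    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())
```

In production this kind of logic would typically run as a Beam/Dataflow pipeline with more robust blocking keys and similarity functions, but the shape of the problem (blocking, pairwise scoring, transitive clustering) is the same.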