Data Infrastructure Engineer

Berkeley, CA · $135k – $178k · Posted 2 April 2026

Tech Stack

Python, AWS, Docker, PostgreSQL, Snowflake, BigQuery, Machine Learning, AI, LLM, Linux, Jira, Confluence, Agile

Job Description

About Glyphic

At Glyphic Biotechnologies, we plan to create the protein revolution for which scientists and researchers have been waiting. We are developing a massively parallel, single-molecule proteome sequencing platform that will transform life science discovery and usher in a new era of insights into human biology and disease. To date, we have raised >$80M from venture partners and non-dilutive grant funding to achieve our vision of next-generation proteome sequencing.

What we are looking for in you

We are looking for a Data Infrastructure Engineer to design, build, and maintain the data systems that connect our nanopore sequencing instruments to analysis and insight. Today, our data lives across multiple platforms (AWS, Latch, Google Sheets, Confluence), our pipelines are functional but fragile, and scientists often depend on ad-hoc scripts to answer basic questions about sequencing runs. You will change that.

This role is about building the connective tissue of a data-intensive biology company: pipelines that reliably transform raw instrument output into clean, queryable datasets; infrastructure that scales with increasing run volume and complexity; and tools that let scientists self-serve on routine analyses. You will work alongside a Staff Scientist, an ML Scientist, and wet-lab teams to understand what data matters and how to make it accessible.

This is a hybrid role, with an expectation that you spend up to roughly 20% of your time (on average) on-site with the team in Berkeley, CA, to build a fuller understanding of Glyphic’s technology and stay calibrated with the on-site research team. The role will also require some flexibility for additional on-site collaboration as projects demand.

What you'll do

Data Pipelines & Automation

Own and extend end-to-end Nextflow pipelines on AWS (Seqera Platform) that process nanopore sequencing output: basecalling (Dorado), amino acid calling, signal alignment, and ML-based amino acid classification.
Build metadata-driven pipeline orchestration: standardized sample sheets, automated run naming, and integration with Jira and Confluence for experiment tracking.
Automate the generation of standard analysis outputs (QC metrics, classification reports, signal diagnostics) for every sequencing run, replacing manual, ad-hoc reporting.
Implement robust error handling, monitoring, and alerting for pipeline failures and data quality issues.

Data Modeling & Storage

Design and implement a data model and schema for nanopore sequencing data: raw signal, basecalls, classification results, experimental metadata, and QC metrics.
Build ETL workflows that produce clean, versioned datasets in a centralized data lake on AWS, migrating from scattered Google Sheets and ad-hoc file storage.
Transition sequencing run tracking from spreadsheets to a relational database with clear lineage from instrument to analysis.
Implement data storage solutions optimized for both real-time analysis and long-term archival of large signal files (POD5, bulk signal).

Visualization & Self-Serve Analytics

Deploy and maintain data visualization tools (dashboards, interactive browsers) that allow scientists to independently explore sequencing metrics: yields, classification accuracy, plate-level comparisons, and signal quality trends.
Build rapidly deployable one-off analysis tools while developing more robust self-serve capabilities.
Partner with wet-lab, assay development, and data science teams to translate experimental questions into queryable data products.
Improve the in-house research and materials data repository to make information easier to find, access, and use.

AI-Augmented Development

Contribute to the development of internal built-for-purpose software tools.
Leverage AI coding tools (Claude Code, Copilot, etc.) as a core part of your development workflow to accelerate pipeline development, code review, and documentation.
Build with AI-first patterns: automate boilerplate, use LLMs for data exploration an ... (truncated, view full listing at source)

