AI Data Engineer - Scientific Data Platforms (Remote)
Astrix Technology
- United States
About
Science & Research
South San Francisco, CA, US
Added - 15/06/2026
Pay Rate Low: 35 | Pay Rate High: 40
Our client is a leading global biotechnology and pharmaceutical organization driven by a mission to innovate, continuously advance science, and ensure everyone has access to the healthcare they need.
Title:
AI Data Engineer - Scientific Data Platforms
Location:
Remote (must work PST hours)
Pay rate:
$35-38/hr (depending on experience level)
Schedule:
Full-time (40 hours/week)
Duration:
1-year contract (plus benefits)
Position Overview
This role addresses a critical need in scaling our AI models for drug discovery: building largely automated, scalable, agent-driven data ingestion and curation pipelines for genomics data. The work includes inferring metadata, constructing performant query architectures, and transforming high-dimensional datasets (e.g., single-cell omics, clinical trials) into AI-ready training formats.
Key Responsibilities
Build an agentic data ingestion pipeline and move beyond bespoke steps toward agents that teams can reliably use as a shared, deployed service.
Triage and prioritize incoming requests to ingest specific datasets. Clean and organize data, building the first-pass cleaning and organization steps into the agentic flow.
Validate cross-modal linkage. Add automated checks that catch when ingested data does not connect correctly and flag low-quality or mismatched records.
Version every dataset, retaining prior versions and keeping them addressable. Preserve raw data and provenance, ensuring agent workflows log validation and transformation steps so lineage is fully traceable.
Partner with AI, software engineering, and computational biology groups to co-define data standards and conventions.
Qualifications & Requirements
Demonstrated experience building multi-agent workflows or LLM workflows using tools/frameworks such as LangGraph or LlamaIndex, including tool/function calling and asynchronous task execution.
Strong Python skills for data manipulation, working with APIs and databases, and handling heterogeneous data formats.
Familiarity with dataset versioning approaches (e.g., DVC, lakeFS, or equivalent).
Comfortable with, or strongly willing to learn, common omics data formats like AnnData, H5AD, and TileDB.
No deep bioinformatics expertise required; just a basic conceptual understanding of different modalities (e.g., RNA-seq vs. scRNA-seq vs. WES; genomics vs. transcriptomics vs. proteomics vs. metabolomics).
Comfortable writing unit and functional tests to ensure data processing workflows are reliable and reproducible.
Degree in a technical field or equivalent practical experience.
Must be authorized to work in the United States without sponsorship.
Nice to Have
Experience deploying agent workflows as a shared service (e.g., FastAPI or MCP endpoints).
Exposure to cloud platforms (AWS, GCP) and containerization (Docker).
Familiarity with scientific workflow managers such as Nextflow or Snakemake.
We are an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, or any other characteristic protected by law.
Languages
- English