About
Role Overview
We are seeking a Staff Machine Learning Engineer (Infra) to design, build, and operate the model training and deployment infrastructure that powers our Demand-Side Platform (DSP). This role focuses on building scalable, flexible, and reliable systems for training models on billions of records across bidding, ranking, pacing, and fraud use cases.
You will work at the intersection of machine learning, data platforms, and infrastructure, with a strong focus on automation, reproducibility, and reliability. The ideal candidate has experience building production-grade ML training systems and is motivated by improving the velocity and reliability of model development.
What will you do?
- Own and evolve shared ML infrastructure for training, deployment, and lifecycle management; deliver measurable gains in reliability, cost, and developer velocity.
- Lead cross-pod initiatives end-to-end (design → build → production), reducing organizational bottlenecks and aligning stakeholders on goals and success metrics.
- Build scalable training and orchestration systems (Prefect-first) for billion-scale datasets with strong failure recovery and backfill support.
- Build and operate high-throughput, low-latency serving/inference systems for DSP models (bidding, ranking, pacing, fraud), including safe rollouts and performance guardrails.
- Establish ML observability across the lifecycle: data quality, training stability, drift/anomaly detection, and regression monitoring with actionable alerting and runbooks.
- Standardize reproducibility and governance: versioning, lineage/traceability, and experiment tracking (MLflow), with clear production-readiness criteria.
- Drive operational excellence for owned components: on-call ownership, incident response, postmortems, and reliability improvements.
- Build foundations for feature management (feature pipelines/feature store) and offline/online consistency guarantees.

What are we looking for?
- 6+ years building and operating production ML systems, including training pipelines and online inference.
- Strong Python and Spark skills for large-scale processing (on-prem/YARN environments preferred).
- Proven experience with workflow orchestration for ML (Prefect or similar) and production-grade automation.
- Experience designing and operating serving systems in high-throughput, low-latency environments (REST/gRPC, canary/rollback strategies).
- Strong DevOps/MLOps practices: CI/CD, automated testing, infrastructure as code, and reliability engineering.
- Strong understanding of experimentation and reproducibility: dataset/model versioning, lineage, and traceability; MLflow familiarity preferred.
- Solid grounding in core ML methods to evaluate and diagnose model and data issues.
- Strong communication skills across ML and engineering stakeholders.

Nice to have
- Familiarity with systems programming languages such as C++ and Rust.
- Strong grasp of probability, statistics, and data analysis principles.
- Ad-tech familiarity: auction dynamics, pacing, fraud signals, creative personalization.
Language skills
- English
Notice to users
This job offer comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.