About
Role Overview
We are seeking a Staff Machine Learning Engineer (Infra) to design, build, and operate the model training and deployment infrastructure that powers our Demand-Side Platform (DSP). This role focuses on building scalable, flexible, and reliable systems for training models on billions of records across bidding, ranking, pacing, and fraud use cases.
You will work at the intersection of machine learning, data platforms, and infrastructure, with a strong focus on automation, reproducibility, and reliability. The ideal candidate has experience building production-grade ML training systems and is motivated by improving the velocity and reliability of model development.
What will you do?
- Own and evolve shared ML infrastructure for training, deployment, and lifecycle management; deliver measurable gains in reliability, cost, and developer velocity.
- Lead cross-pod initiatives end-to-end (design → build → production), reducing organizational bottlenecks and aligning stakeholders on goals and success metrics.
- Build scalable training and orchestration systems (Prefect-first) for billion-scale datasets with strong failure recovery and backfill support.
- Build and operate high-throughput, low-latency serving/inference systems for DSP models (bidding, ranking, pacing, fraud), including safe rollouts and performance guardrails.
- Establish ML observability across the lifecycle: data quality, training stability, drift/anomaly detection, and regression monitoring with actionable alerting and runbooks.
- Standardize reproducibility and governance: versioning, lineage/traceability, and experiment tracking (MLflow), with clear production-readiness criteria.
- Drive operational excellence for owned components: on-call ownership, incident response, postmortems, and reliability improvements.
- Build foundations for feature management (feature pipelines/feature store) and offline/online consistency guarantees.

What are we looking for?
- 6+ years building and operating production ML systems, including training pipelines and online inference.
- Strong Python and Spark skills for large-scale processing (on-prem/YARN environments preferred).
- Proven experience with workflow orchestration for ML (Prefect or similar) and production-grade automation.
- Experience designing and operating serving systems in high-throughput, low-latency environments (REST/gRPC, canary/rollback strategies).
- Strong DevOps/MLOps practices: CI/CD, automated testing, infrastructure as code, and reliability engineering.
- Strong understanding of experimentation and reproducibility: dataset/model versioning, lineage, and traceability; MLflow familiarity preferred.
- Solid grounding in core ML methods to evaluate and diagnose model and data issues.
- Strong communication skills across ML and engineering stakeholders.

Nice to have
- Familiarity with systems programming languages such as C++ and Rust.
- Strong grasp of probability, statistics, and data analysis principles.
- Ad-tech familiarity: auction dynamics, pacing, fraud signals, creative personalization.
Language skills
- English
Notice to users
This job offer comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.