Machine Learning - Infrastructure

Causal Labs

United States

Apply Now

About

Infrastructure Engineer
Our mission is general causal intelligence: AI that is capable of predicting the future and identifying the optimal actions to change that future. To achieve this breakthrough, we are building a Large Physics foundation Model (LPM) because domains governed by physics have inherent cause and effect relationships, unlike visual or textual data.

Causal Labs is a team of researchers and engineers from self-driving, drug discovery, and robotics (including Google DeepMind, Cruise, Waymo, Meta, Nabla Bio, and Apple) who believe general causal intelligence will be the most important technical breakthrough for civilization.

We look for infrastructure engineers who are excited to tackle unsolved problems. Our training and inference challenges demand deep expertise in setting up distributed training clusters and optimizing performance for large models. If you have experience building large-scale ML infrastructure in related fields such as language and vision models, robotics, or biology, join us on this mission.

Responsibilities

- Design, deploy, and maintain large distributed ML training and inference clusters
- Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
- Research and test various training approaches, including parallelization techniques and numerical precision trade-offs across different model scales
- Analyze, profile, and debug low-level GPU operations to optimize performance
- Stay up-to-date on research to bring new ideas to work

What We're Looking For

- Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
- Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
- Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
- Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
- Background working on distributed task management systems and scalable model serving & deployment architectures
- Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don't have to meet every single requirement above.

Language skills

English

Notice to users

This listing comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.
