Staff Machine Learning Infrastructure Engineer

Metric Geo

United States

About

About the role

We build general-purpose robots powered by a proprietary embodied AI foundation model that generalises and self-improves across varied environments with commercial-grade performance. Our robots are already deployed across multiple industries, and our frontier model leads the industry in generalisation and performance.

We're looking for a Staff ML Infrastructure Engineer to serve as the architect of our training engine: the person who bridges raw hardware and cutting-edge research so that our ML team can iterate at speed without friction. Your goal: maximise intelligence-per-watt by optimising every millisecond of the training and inference pipeline.

What you'll do

- Architect and own the infrastructure for large-scale GPU clusters, implementing sharding, activation checkpointing, and memory optimisation (ZeRO, FSDP) to enable training of massive multimodal models
- Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritises fast iteration, automated retries, and seamless failure recovery
- Design high-throughput data pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, and 3D signals), ensuring dataloaders never starve the GPUs
- Build low-latency inference pipelines for real-time robot control, applying quantisation, distillation, and model compilation (TensorRT, Triton) to move models from lab to physical deployment
- Profile GPU utilisation, I/O bottlenecks, and memory fragmentation to squeeze maximum performance from an expanding compute fleet

What we're looking for

- 7+ years of engineering experience with a track record of leading technical projects in high-performance computing or ML infrastructure
- Deep experience with PyTorch and distributed training frameworks such as DeepSpeed and Accelerate, including mixed precision and gradient accumulation
- Hands-on experience managing cloud GPU environments (GCP or AWS) and container orchestration with Kubernetes
- Strong understanding of distributed systems fundamentals, including race conditions, memory management, and NCCL/inter-node communication
- An ownership mindset: you design, build, and operate systems end-to-end rather than simply deploying code

Nice to have

- Experience with robotics data formats such as MCAP or Protobuf, or with multimodal models (VLAs)
- Deep ML systems work, including custom kernels (Triton), compilers, or runtime optimisation
- Experience as a founding or early-stage infrastructure hire

What we offer

- Competitive base salary of $220,000 – $320,000 + equity
- The opportunity to work at the frontier of embodied AI and physical robotics
- A fast-moving, research-driven environment where your infrastructure work directly shapes what the robots can do
- Backing from top investors including CRV and First Round, with over $100M raised

Languages

English

