Machine Learning Infrastructure Engineer
Institute of Foundation Models
Sunnyvale
About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next g
Machine Learning Infrastructure Engineer
- Sunnyvale, California, United States
About
As part of our team, you’ll have the opportunity to work on the core of cutting‑edge foundation model training, alongside world‑class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.
The Role
We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side‑by‑side with world‑class researchers and engineers to:
Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
Implement distributed optimizers from mathematical specs
Build robust config + launch systems across multi‑node, multi‑GPU clusters
Own experiment tracking, metrics logging, and job monitoring for external visibility
Improve training system reliability, maintainability, and performance
While much of the work will support large‑scale pre‑training, pre‑training experience is not required. Strong infrastructure and systems experience is what we value most.
Key Responsibilities
Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.
Launch Config & Debugging – Create and debug multi‑node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
Infra Engineering – Write production‑quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.
Qualifications
Must-Haves:
5+ years of experience in ML systems, infra, or distributed training
Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
Strong software engineering fundamentals (Python, systems design, testing)
Proven multi‑node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
Ability to implement algorithms across GPUs/nodes based on mathematical specs
Experience working on an ML platform, infrastructure, or distributed inference optimization team
Experience with large‑scale machine learning workloads (strong ML fundamentals)
Nice-to-Haves:
Exposure to mixed‑precision training (e.g., bf16, fp8) with accuracy validation
Familiarity with performance profiling, kernel fusion, or memory optimization
Open‑source contributions or published research (MLSys, ICML, NeurIPS)
CUDA or Triton kernel experience
Experience with large‑scale pre‑training
Experience building custom training pipelines at scale and modifying them for custom needs
Deep familiarity with training infrastructure and performance tuning
$150,000 - $450,000 a year
Benefits
Comprehensive medical, dental, and vision
401(k) program
Generous PTO, sick leave, and holidays
Paid parental leave and family‑friendly benefits
On‑site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station
Language skills
- English
This job posting comes from a partner platform of TieTalent.