Machine Learning Engineer - Distributed ML Systems
Pluralis Research • United States
About
We’re looking for Senior/Staff engineers with 5+ years of experience in distributed systems and large‑scale ML training. You’ll be implementing a novel substrate for distributed training of ML models over consumer‑grade internet connections.
Responsibilities
Distributed Training Architecture & Optimization
Design and implement large‑scale distributed training systems optimized for heterogeneous hardware operating under low‑bandwidth, high‑latency conditions.
Develop and optimise parallel training strategies (data, tensor, and pipeline parallelism) with custom sharding techniques that minimise communication overhead.
Optimise GPU utilisation, memory efficiency, and compute performance across distributed nodes.
Implement robust checkpointing, state synchronisation, and recovery mechanisms for long‑running, fault‑prone training jobs.
Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
Decentralised Networking & Resilience
Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
Design and optimise peer‑to‑peer topologies for decentralised coordination across non‑co‑located nodes.
Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
Profile and optimise communication patterns to reduce latency and bandwidth overhead in multi‑participant environments.
What You’ll Bring
Strong experience building and operating distributed systems in production.
Hands‑on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).
Deep understanding of parallel training strategies (data, tensor, and pipeline parallelism).
Expert‑level Python with production experience (concurrency, error handling, retry logic, clean architecture).
Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.
Experience optimising GPU workloads, memory management, and large‑scale compute efficiency.
What we offer
Equity‑heavy compensation with meaningful ownership in a mission‑driven company
Competitive base salary for senior engineering roles in Australia
Visa sponsorship available for exceptional candidates
Remote‑first with optional access to our Melbourne hub
World‑class team: teammates were previously at Google, Amazon, Microsoft, and leading startups
Backed by Union Square Ventures and other tier‑1 investors, we’re a world‑class, deeply technical team of ML researchers and engineers. Pluralis is unapologetically ideological. We believe the world will be a better place if we succeed in what we are attempting, and we see Protocol Learning as the only plausible way to prevent a handful of massive corporations from monopolising model development, access and release, and from achieving massive economic capture. If this resonates, please apply.
Language skills
- English