About
You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.
What You'll Do
Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
Observability: Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
Process Design: Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.
What We're Looking For
Systems Intuition: You love optimizing performance and digging into systems to understand how every layer interacts—from the training loop to the hardware.
Distributed Training Experience: Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
Coding Proficiency: Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
ML Framework Depth: Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
Infrastructure Knowledge: Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
Rigor: A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.
Strong Candidates May Have
Low-Level Mastery: Experience with Linux kernel tuning and eBPF, and an understanding of systems design tradeoffs at the hardware level.
Specialized AI Infra: Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
Security & Privacy: Expertise in security best practices for high-scale infrastructure.
Observability: Familiarity with monitoring tools like Prometheus and Grafana.
Contract duration of more than 6 months, with 40 hours per week.
Mandatory skills: Linux, PyTorch, C++, AI infrastructure, TensorFlow, Rust, CUDA, NCCL, eBPF, NVIDIA AI Platform, Python
Language skills
- English