Solutions Engineer for Andromeda Cluster

FreelanceJobs

Canada

About

We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers.
You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.
What You'll Do
Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
Observability: Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
Process Design: Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.
What We're Looking For
Systems Intuition: You love optimizing performance and digging into systems to understand how every layer interacts—from the training loop to the hardware.
Distributed Training Experience: Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
Coding Proficiency: Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
ML Framework Depth: Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
Infrastructure Knowledge: Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
Rigor: A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.
Strong Candidates May Have
Low-Level Mastery: Experience with Linux kernel tuning and eBPF, and an understanding of system design tradeoffs at the hardware level.
Specialized AI Infra: Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
Security & Privacy: Expertise in security best practices for high-scale infrastructure.
Observability: Familiarity with monitoring tools like Prometheus and Grafana.
Contract duration of more than 6 months, at 40 hours per week.
Mandatory skills: Linux, PyTorch, C++, AI infrastructure, TensorFlow, Rust, CUDA, NCCL, eBPF, NVIDIA AI Platform, Python

Language skills

English

Notice to users

This listing comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.
