This job posting is no longer available
HPC Systems Administrator
Empire AI
- Buffalo, New York, United States
About
The HPC Systems Administrator will administer, optimize, and support the high-performance computing platforms that power Empire AI's AI/ML workloads, scientific research, and large-scale simulation across its statewide consortium. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for the day-to-day cluster operations, job scheduling, GPU resource management, and systems reliability of Empire AI's distributed HPC infrastructure. This role ensures that Empire AI's shared computing environments remain available, performant, and accessible to researchers across partner institutions. The HPC Systems Administrator works at the intersection of systems administration, AI/ML infrastructure support, and research computing, bridging the gap between complex user workloads and the underlying HPC platform.
Duties and Responsibilities
- Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
- Administer and optimize the Slurm workload manager, including partition design, QOS policies, fair-share accounting, and cross-institutional workload orchestration models
- Manage NVIDIA GPU resources (H100/H200/GB200), including driver, CUDA, firmware, and NCCL lifecycle management for AI training and inference workloads
- Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
- Support containerized and virtualized research environments using Apptainer/Singularity, Pyxis, and Enroot
- Troubleshoot performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and AI workloads
- Administer parallel file systems such as Lustre and VAST and integrate them with cluster storage workflows
- Establish incident alerting and escalation procedures for HPC clusters and infrastructure
- Manage detailed monitoring dashboards (Prometheus, Grafana) to track critical metrics: network throughput, GPU utilization, cluster health, and job telemetry
AI/ML Infrastructure Support
- Architect and support systems for AI training and inference pipelines, including large language models (LLMs) and multimodal AI workloads
- Tune and benchmark systems for GPU-intensive AI/ML frameworks, including PyTorch and TensorFlow
- Work with research faculty to translate scientific goals into technical configurations and workload requirements
- Evaluate emerging HPC hardware and software solutions and propose procurement recommendations aligned with AI/ML workload demands
Security & Compliance
- Enforce security baselines, access control policies, and network segmentation across HPC environments
- Integrate robust monitoring, alerting, access control, and disaster recovery planning into cluster operations
- Partner with the Security & Compliance specialist to ensure security is integrated into system design and workload orchestration
- Consult with research teams across consortium institutions to assess computational needs and advise on workflow optimization
- Translate user feedback and researcher requirements into system-level improvements and configuration optimizations
- Maintain clear system documentation, configuration guides, runbooks, and architecture diagrams
Minimum Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related technical field
- 5+ years of hands-on experience administering Linux-based HPC clusters in production environments, supporting research or scientific computing projects
- Expertise with job schedulers (e.g., Slurm) and GPU computing
- Familiarity with AI/ML frameworks, container environments (Apptainer/Singularity, Pyxis, Docker), and distributed storage systems
- Working knowledge of InfiniBand networking (subnet management, UFM, opensm) and/or RoCEv2/Ethernet HPC fabrics
- Proficiency in Bash and Python scripting for automation and systems administration
- Experience with monitoring stacks: Prometheus, Grafana, or equivalent
- Demonstrated success collaborating with researchers or supporting scientific computing projects
Preferred Qualifications
- Experience with NVIDIA Base Command Manager (BCM), NVIDIA UFM, or DGX SuperPOD infrastructure
- Familiarity with workload patterns and infrastructure needs for training, tuning, and deploying large-scale AI/ML models
- Proficiency in infrastructure automation and system configuration tools: Ansible, Git
- Experience supporting or collaborating within academic or industry research environments focused on artificial intelligence, machine learning, or large-scale data science
Language skills
- English