About
My client is building next-generation AI and high-performance computing platforms that power advanced machine learning, data science, and large-scale compute workloads. We operate high-density GPU clusters and are looking for a Senior Linux Engineer to lead the design, optimization, and reliability of our GPU-based infrastructure.
The following information aims to provide potential candidates with a better understanding of the requirements for this role.
As a Senior Linux Engineer specializing in GPU systems, you will architect, deploy, and operate high-performance Linux environments optimized for GPU workloads including AI/ML training, inference, simulation, and data processing. You will work closely with ML engineers, platform teams, and DevOps to ensure performance, scalability, and reliability across our compute infrastructure. This is a hands‑on technical leadership role requiring deep Linux expertise and strong experience managing GPU-based systems at scale.
Key Responsibilities:
- Architect and maintain enterprise‑grade Linux systems (RHEL, Rocky, Ubuntu, or equivalent)
- Tune the kernel and optimize performance for HPC and GPU workloads
- Develop automation for provisioning and lifecycle management
- Troubleshoot complex OS‑level, hardware, and performance issues
- Deploy and manage NVIDIA GPU infrastructure (A100, H100, or equivalent)
- Install, configure, and maintain NVIDIA drivers, CUDA, NCCL, and related libraries
- Optimize multi‑GPU and multi‑node performance
- Monitor GPU utilization, thermals, and power efficiency
- Diagnose PCIe, NVLink, NUMA, and memory bottlenecks
- Manage large‑scale compute clusters (on‑prem or cloud)
- Integrate GPUs into Kubernetes environments (GPU operator, device plugins)
- Build infrastructure using Terraform, Ansible, or similar
- Develop CI/CD workflows for system configuration
- Automate GPU fleet provisioning and configuration management
- Establish SLOs and capacity planning models
- Lead incident response for infrastructure outages
- Conduct root cause analysis and implement preventive measures
- Harden Linux systems using security best practices
- Implement access controls, patch management, and vulnerability remediation
- Support SOC2 / ISO27001 / FedRAMP initiatives (if applicable)
Requirements:
- 7+ years of Linux systems engineering experience
- 3+ years managing GPU infrastructure in production environments
- Deep knowledge of:
  - Linux internals (kernel, memory management, networking stack)
  - NVIDIA driver stack, CUDA, and GPU troubleshooting
  - High‑performance storage (NVMe, parallel file systems)
  - Networking (10/25/40/100GbE, InfiniBand preferred)
- Experience with:
  - Kubernetes with GPU workloads
  - Infrastructure as Code (Terraform, Ansible)
  - Python or Bash scripting
- Strong debugging and performance analysis skills
- Experience operating in large‑scale production environments
Language skills
- English
This job posting comes from a TieTalent partner platform.