This job offer is no longer available

Senior Linux Engineer

jobtraffic
  • IE
    Ireland

About

My client is building next-generation AI and high-performance computing platforms that power advanced machine learning, data science, and large-scale compute workloads. We operate high-density GPU clusters and are looking for a Senior Linux Engineer to lead the design, optimization, and reliability of our GPU-based infrastructure.


As a Senior Linux Engineer specializing in GPU systems, you will architect, deploy, and operate high-performance Linux environments optimized for GPU workloads including AI/ML training, inference, simulation, and data processing. You will work closely with ML engineers, platform teams, and DevOps to ensure performance, scalability, and reliability across our compute infrastructure. This is a hands‑on technical leadership role requiring deep Linux expertise and strong experience managing GPU-based systems at scale.

Key Responsibilities:
  • Architect and maintain enterprise‑grade Linux systems (RHEL, Rocky, Ubuntu, or equivalent)
  • Tune the kernel and optimize performance for HPC and GPU workloads
  • Develop automation for provisioning and lifecycle management
  • Troubleshoot complex OS‑level, hardware, and performance issues
GPU Infrastructure & Performance
  • Deploy and manage NVIDIA GPU infrastructure (A100, H100, or equivalent)
  • Install, configure, and maintain NVIDIA drivers, CUDA, NCCL, and related libraries
  • Optimize multi‑GPU and multi‑node performance
  • Monitor GPU utilization, thermals, and power efficiency
  • Diagnose PCIe, NVLink, NUMA, and memory bottlenecks
  • Manage large‑scale compute clusters (on‑prem or cloud)
  • Integrate GPUs into Kubernetes environments (GPU operator, device plugins)
Automation & Infrastructure as Code
  • Build infrastructure using Terraform, Ansible, or similar
  • Develop CI/CD workflows for system configuration
  • Automate GPU fleet provisioning and configuration management
Reliability & Observability
  • Establish SLOs and capacity planning models
  • Lead incident response for infrastructure outages
  • Conduct root cause analysis and implement preventive measures
Security & Compliance
  • Harden Linux systems using security best practices
  • Implement access controls, patch management, and vulnerability remediation
  • Support SOC 2 / ISO 27001 / FedRAMP initiatives (if applicable)
Required Qualifications
  • 7+ years of Linux systems engineering experience
  • 3+ years managing GPU infrastructure in production environments
  • Deep knowledge of:
      • Linux internals (kernel, memory management, networking stack)
      • NVIDIA driver stack, CUDA, and GPU troubleshooting
      • High‑performance storage (NVMe, parallel file systems)
      • Networking (10/25/40/100GbE, InfiniBand preferred)
  • Experience with:
      • Kubernetes with GPU workloads
      • Infrastructure as Code (Terraform, Ansible)
      • Python or Bash scripting
  • Strong debugging and performance analysis skills
  • Experience operating in large‑scale production environments

Languages

  • English