Senior Linux Engineer

jobtraffic

Ireland

Apply Now

About

My client is building next-generation AI and high-performance computing platforms that power advanced machine learning, data science, and large-scale compute workloads. The company operates high-density GPU clusters and is looking for a Senior Linux Engineer to lead the design, optimization, and reliability of its GPU-based infrastructure.

As a Senior Linux Engineer specializing in GPU systems, you will architect, deploy, and operate high-performance Linux environments optimized for GPU workloads including AI/ML training, inference, simulation, and data processing. You will work closely with ML engineers, platform teams, and DevOps to ensure performance, scalability, and reliability across the compute infrastructure. This is a hands‑on technical leadership role requiring deep Linux expertise and strong experience managing GPU-based systems at scale.

Key Responsibilities:

Architect and maintain enterprise‑grade Linux systems (RHEL, Rocky, Ubuntu, or equivalent)
Tune the kernel and optimize performance for HPC and GPU workloads
Develop automation for provisioning and lifecycle management
Troubleshoot complex OS‑level, hardware, and performance issues

GPU Infrastructure & Performance

Deploy and manage NVIDIA GPU infrastructure (A100, H100, or equivalent)
Install, configure, and maintain NVIDIA drivers, CUDA, NCCL, and related libraries
Optimize multi‑GPU and multi‑node performance
Monitor GPU utilization, thermals, and power efficiency
Diagnose PCIe, NVLink, NUMA, and memory bottlenecks
Manage large‑scale compute clusters (on‑prem or cloud)
Integrate GPUs into Kubernetes environments (GPU operator, device plugins)

Automation & Infrastructure as Code

Build infrastructure using Terraform, Ansible, or similar
Develop CI/CD workflows for system configuration
Automate GPU fleet provisioning and configuration management

Reliability & Observability

Establish SLOs and capacity planning models
Lead incident response for infrastructure outages
Conduct root cause analysis and implement preventive measures

Security & Compliance

Harden Linux systems using security best practices
Implement access controls, patch management, and vulnerability remediation
Support SOC 2 / ISO 27001 / FedRAMP initiatives (if applicable)

Required Qualifications

7+ years of Linux systems engineering experience
3+ years managing GPU infrastructure in production environments
Deep knowledge of:
  Linux internals (kernel, memory management, networking stack)
  NVIDIA driver stack, CUDA, and GPU troubleshooting
  High‑performance storage (NVMe, parallel file systems)
  Networking (10/25/40/100GbE, InfiniBand preferred)
Experience with:
  Kubernetes with GPU workloads
  Infrastructure as Code (Terraform, Ansible)
  Python or Bash scripting
Strong debugging and performance analysis skills
Experience operating in large‑scale production environments

Language skills

English

Note for users

This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.
