[Remote] AI Support Operations Engineer • Nerdleveltech • Irvine, California, United States
This job offer is no longer available
About
Responsibilities
Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments
Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained
Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes
Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams
Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard"
Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales
Skills
Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and Supermicro platforms, including IPMI, BMC, and iDRAC workflows, and familiarity with Redfish-based management
Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management
Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level
Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures
Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments
Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures
High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data
Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure
Benefits
Bonus
RSUs
Languages
- English