About
Hardware Systems Engineer supporting a large-scale bare-metal server environment (~17,000 servers) with a heavy emphasis on CPU and GPU compute availability. This role is centered on reliability, automation, and operational excellence: digging into systems and pipelines when things break and improving them so they break less often. (Not hands-on in the data center.)

What you’ll be doing
- Administer and support large-scale bare-metal server infrastructure, primarily HPE and Dell platforms
- Perform server break/fix troubleshooting, including hardware faults, firmware/BIOS/BMC issues, POST failures, degraded components, and system instability
- Manage server lifecycle operations: onboarding, provisioning, firmware updates, BIOS/BMC configuration, and hardware refresh kits
- Own incident response and break/fix workflows while maintaining 98%+ compute availability SLAs
- Work cross-functionally with Data Center and Networking teams during hardware incidents, including ticket creation, repair coordination, and log collection
- Interface directly with HPE and Dell vendors: gathering diagnostics, sending logs, driving RMAs, and tracking issues through resolution
- Support and troubleshoot CI/CD and automation pipelines used for server provisioning, configuration, and lifecycle management
- Dig into automation code and workflows (Ansible, scripts, pipelines) when jobs fail to understand root cause and unblock deployments
- Identify recurring operational issues and contribute to process improvements, runbooks, and reliability enhancements
- Help manage and reduce the operations backlog, prioritizing fixes, cleanup, and automation improvements

Must Have:
- Hands-on experience supporting HPE and Dell servers in production, including break/fix and hardware incident troubleshooting
- Experience with HPE iLO, Dell iDRAC, and related BMC environments
- Strong understanding of server hardware components (CPU, GPU, memory, disks, NICs, power) and common failure modes
- Experience troubleshooting automation and CI/CD pipelines that manage infrastructure (not just running them, but fixing them when they fail)
- Operational mindset with experience owning incidents, SLAs, backlog items, and process improvements
- Automation experience with Ansible, Bash, Jenkins, or similar tooling
- Exposure to GPU-dense, HPC, or high-performance compute environments
- Experience improving runbooks, reducing toil, and scaling operations through automation

Languages
- English
Notice for Users
This job comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.