DevOps Engineer, AI Infrastructure
SB Telecom America Corp. • Sunnyvale, California, United States
This job posting is no longer available
About
SoftBank is making significant investments in infrastructure for AI. Through its wholly owned US subsidiary, SoftBank Corp. has established the Infrinia team in Silicon Valley, focused on infrastructure software for AI and AI foundations for mobile networks. Our goals are to challenge the norms and create products that make use of our state-of-the-art infrastructure (such as Nvidia GB200, MGX, and DGX Grace & Hopper platforms) and cloud-native software. These products are geared towards centralized AI data centers as well as distributed AI Radio Access Network (AI RAN) data centers. We are looking for experienced practitioners who are inspired to bring innovation and build transformative products.

Minimum Qualifications:
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field.
- 5+ years in data center operations, site reliability engineering (SRE), or DevOps.
- Experience with Linux system administration, networking, and hardware troubleshooting.
- Experience in automating data center operations using tools like Ansible, Terraform, or Python.

Preferred Qualifications:
- Master's degree or relevant certification in Cloud/DevOps.
- Deep experience with Kubernetes clusters and container orchestration on bare metal.
- Experience with GPU systems (Nvidia DGX/HGX), high-performance computing (HPC) clusters, and fabric management (Ethernet).
- Expertise in building scalable monitoring and alerting systems (Prometheus, Grafana, ELK stack).
- Experience with "Day 0, 1, and 2" automation for large-scale infrastructure.

Role:
Be a key member of the infrastructure team responsible for the operational excellence and automation of GPU systems supporting AI workloads (training, fine-tuning, and serving). Own and develop the deployment pipelines and operational playbooks for new AI infrastructure, with a focus on Kubernetes and GPU systems. Drive innovation in data center automation to maximize uptime and resource utilization. As a Data Center DevOps Engineer responsible for major operational tasks, work with Staff Engineers, product management, and program management to drive execution towards commercialization.

Responsibilities:
- Manage pre-deployment operations, including staging rack health (monitoring, troubleshooting, triaging, etc.); identify gaps and build automation to improve operations.
- Manage post-deployment health (monitoring, troubleshooting, triaging, etc.); identify gaps and build automation to improve operations.
- Interface with Data Center Operations and bridge their requirements with the Software Engineering Team.
- Contribute to Product Definition (PRD) and program execution (sprint) planning from an operations perspective.
- Attract and help build engineering talent.
- Role model and foster a culture of humility and innovation for product delivery.

Salary:
The base salary for this position ranges from $150,000 to $250,000, with an additional biannual bonus and benefits.
Language skills
- English