Software Engineer, Hardware Health
- San Francisco, California, United States
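About the Team
The Hardware Health and Observability team owns the end-to-end health lifecycle of the company’s global compute fleet.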
Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.
We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OAI’s production and research workloads.
About the Role
On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps the company’s largest compute clusters healthy and operational at scale. Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams. Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.
Responsibilities
Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
Build and evolve health checks that detect, remediate, and verify failures at scale.
Ensure critical health checks execute with minimal latency to maximize workload uptime.
Investigate hardware failures and system-level issues across large-scale compute environments.
Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes.
Build automation and tooling that enables global cluster management with minimal manual intervention.
Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems.
Qualifications
7+ years of industry experience in software or infrastructure engineering.
Strong proficiency with Python and shell scripting.
Experience building large-scale distributed systems or infrastructure platforms.
Comfort digging into noisy operational data using SQL, PromQL, or similar tooling.
Experience building reproducible analyses and operational tooling.
Strong systems debugging and operational instincts with an ownership mindset.
Bonus Experience
Experience with low-level hardware systems and Linux tooling (e.g. PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging).
Experience operating or debugging large-scale GPU or accelerator clusters.
Expertise in network operations, observability, or systems telemetry.
Experience with automated remediation systems or fleet lifecycle management.
Experience improving reliability, utilization, or workload uptime in distributed compute environments.
Equal Opportunity Statement
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other applicable legally protected characteristic.
Language skills
- English