Software Engineer, Hardware Health
- San Francisco, California, United States
About
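The Hardware Health and Observability team owns the end-to-end health lifecycle of the company’s global compute fleet. Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.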
We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OAI’s production and research workloads.
About the Role
On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps the company’s largest compute clusters healthy and operational at scale. Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams. Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.
Responsibilities
Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
Build and evolve health checks that detect, remediate, and verify failures at scale.
Ensure critical health checks execute with minimal latency to maximize workload uptime.
Investigate hardware failures and system-level issues across large-scale compute environments.
Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes.
Build automation and tooling that enables global cluster management with minimal manual intervention.
Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems.
Qualifications
7+ years of industry experience in software or infrastructure engineering.
Strong proficiency with Python and shell scripting.
Experience building large-scale distributed systems or infrastructure platforms.
Comfort digging into noisy operational data using SQL, PromQL, or similar tooling.
Experience building reproducible analyses and operational tooling.
Strong systems debugging and operational instincts with an ownership mindset.
Bonus Experience
Experience with low-level hardware systems and Linux tooling (e.g. PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging).
Experience operating or debugging large-scale GPU or accelerator clusters.
Expertise in network operations, observability, or systems telemetry.
Experience with automated remediation systems or fleet lifecycle management.
Experience improving reliability, utilization, or workload uptime in distributed compute environments.
Equal Opportunity Statement
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.
Language skills
- English
This listing comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.