Software Engineer, Hardware Health
- San Francisco, California, United States
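About the Team
The Hardware Health and Observability team owns the end-to-end health lifecycle of the company’s global compute fleet. Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.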
We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OAI’s production and research workloads.
About the Role
On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps the company’s largest compute clusters healthy and operational at scale. Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams. Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.
Responsibilities
Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
Build and evolve health checks that detect, remediate, and verify failures at scale.
Ensure critical health checks execute with minimal latency to maximize workload uptime.
Investigate hardware failures and system-level issues across large-scale compute environments.
Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes.
Build automation and tooling that enables global cluster management with minimal manual intervention.
Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems.
Qualifications
7+ years of industry experience in software or infrastructure engineering.
Strong proficiency with Python and shell scripting.
Experience building large-scale distributed systems or infrastructure platforms.
Comfort digging into noisy operational data using SQL, PromQL, or similar tooling.
Experience building reproducible analyses and operational tooling.
Strong systems debugging and operational instincts with an ownership mindset.
Bonus Experience
Experience with low-level hardware systems and Linux tooling (e.g. PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging).
Experience operating or debugging large-scale GPU or accelerator clusters.
Expertise in network operations, observability, or systems telemetry.
Experience with automated remediation systems or fleet lifecycle management.
Experience improving reliability, utilization, or workload uptime in distributed compute environments.
Equal Opportunity Statement
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.
Language Skills
- English
This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.