Principal Software Quality Engineer - GPU & Machine Learning
- United States
About
At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.
Together, we advance your career.
About the Role
We are seeking a Principal Software Quality Engineer to serve as the senior technical leader for ROCm software validation across compute workloads and server-class systems. In this individual-contributor leadership role, you will define how AMD proves ROCm is ready to ship - from unit and component testing, through full-stack workload validation, to multi-node system-level qualification on AMD Instinct™ GPU platforms. You will set the technical direction for validation strategy, build and evolve the test infrastructure that gates every ROCm release, and personally drive the hardest debugging, characterization, and qualification problems. Your work directly determines the quality bar experienced by hyperscalers, OEMs, sovereign-AI customers, and the open-source community running ROCm in production.
What You Will Do
- Own the end-to-end validation architecture for ROCm - unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers - across multiple GPU generations and server platforms.
- Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
- Lead system-level testing for server nodes - multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.
- Drive compute workload validation and characterization - LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks - establishing reproducible methodology, baselines, and regression tracking.
- Architect the test infrastructure - distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.
- Champion modern, agile quality engineering - shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk.
- Set the bar for GitHub-based quality workflows - PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/* repositories and partner upstream projects.
- Lead complex escalation debug - partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.
- Influence the roadmap - work with product management, silicon, platform, and software architecture to ensure validation readiness for next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival.
- Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads; raise the technical bar through design review, code review, and written guidance.
- Represent ROCm validation externally - strategic customer engagements, OEM qualification programs, and open-source community quality initiatives.
Minimum Qualifications
- Strong software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.
- BS/MS/PhD in Computer Science, Computer Engineering, or related discipline (or equivalent demonstrated experience).
- Expert-level Python for test automation and infrastructure; strong C++ for debugging and extending production code paths under test.
- Deep, demonstrable validation experience in at least two of the following domains:
  - GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL)
  - Deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)
  - HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric)
  - Linux kernel, GPU drivers, or accelerator firmware
  - Distributed systems and large-scale cluster software
- System-level validation for server-class compute nodes - multi-GPU, multi-node, fabric-attached environments - including stress/stability, soak, fault-injection, and RAS testing.
- Proven, hands-on experience working efficiently in an agentic AI engineering environment - daily, production use of LLM-based coding agents (e.g., Cursor, Claude Code, Copilot Workspace, Codex-class agents) and orchestration frameworks for real engineering work, with demonstrable productivity, quality, or coverage gains attributable to those workflows. Comfort designing prompts, tool/MCP integrations, evaluation harnesses, and guardrails for autonomous and semi-autonomous agents.
- Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
- Mastery of GitHub at scale for quality engineering - PR gating, GitHub Actions, self-hosted runners, required status checks, release tagging, and open-source contribution and triage norms.
- Strong command of modern, agile software development practices - trunk-based development, CI/CD, shift-left testing, observability, feature flags, and incremental delivery - applied specifically to validation organizations.
- Excellent written and verbal communication - able to author crisp test plans, qualification reports, RFCs, and post-mortems, and to influence development teams without authority.
Preferred Qualifications
- Direct contributions to validation, CI, or test infrastructure for ROCm, PyTorch, LLVM, Triton, vLLM, or comparable upstream open-source projects.
- Demonstrated leadership in agentic-AI adoption - built or rolled out agent-based workflows across an engineering team (e.g., autonomous test generation, AI-driven log/triage pipelines, multi-agent debug systems, MCP server design, retrieval-augmented engineering knowledge bases) with measurable outcomes.
- Experience operating or validating large GPU clusters (256+ GPUs) - fabric bring-up, cluster health monitoring, and fleet-level diagnostics.
- Familiarity with Training/Inference/HPC industry-standard benchmark methodologies and submissions.
- Background in performance validation: roofline analysis, profiler tooling (rocprof, Omniperf, Nsight-class), regression detection.
- Experience with fault injection, RAS, telemetry, and long-haul stability programs for accelerator platforms.
- Familiarity with hardware lab automation: BMC/IPMI/Redfish, PDU control, serial-console capture, automated re-imaging, and topology-aware test scheduling.
- Prior experience standing up validation for pre-silicon / emulation / first-silicon bring-up of accelerators.
Why This Role
ROCm powers AI and HPC workloads on AMD Instinct GPUs at the largest scale in the industry. The quality of every ROCm release is felt across millions of GPUs in production - and the validation organization is what stands between "code complete" and "customer ready." As Principal MTS for ROCm Validation, you will define that bar, build the systems that enforce it, and personally lead the toughest qualification problems on AMD's most strategic platforms.
#Hybrid
AMD is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here.
This posting is for an existing vacancy.
Language skills
- English
This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.