About
- Drive strategic partnership and alignment with Product teams to understand roadmap intent, co‑define critical metrics, and ensure unified direction across technical, sales, and leadership organizations.
- Influence without authority across Product, Engineering, Sales, Operations, and CSP customers, driving clarity and alignment and clearing blockers to scale‑up.
- Analyze deployment and performance data, identifying product health trends, system bottlenecks, and operational risks.
- Solve challenging technical problems involving GPUs, networking, drivers, containers, firmware, and distributed system interactions.
- Deliver streamlined executive‑level communication on status, risks, progress, and required decisions.
- Collaborate with Product and Engineering, enabling future improvements in platform design, validation, and operational workflows.
What We Need to See
- BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Physics, or similar, or equivalent experience.
- 4+ years of experience in Solutions Architecture, Infrastructure Engineering, or similar technical roles.
- Hands‑on experience with bring‑up and validation of large‑scale NVIDIA GPU platforms, including multi‑GPU and multi‑node architectures.
- Understanding of high‑performance networking technologies (e.g., RDMA, congestion control, high‑bandwidth interconnects) and their role in distributed AI workloads.
- Familiarity with NVIDIA system software stacks: CUDA, NCCL, NVSwitch/NVLink, driver behavior, and performance tuning.
- Proficiency with Linux systems tools for identifying issues and evaluating system performance, such as dmesg, journalctl, lspci, numactl, ethtool, iostat, perf, nvidia-smi, top/htop, ipmitool, container‑level tooling, and related utilities.
- Understanding of server hardware architecture, including PCIe topologies, system firmware, NUMA, BIOS/UEFI configuration, power/thermal envelopes, and memory/subsystem behavior.
- Understanding of BMC/IPMI/Redfish for remote management, hardware health monitoring, and out‑of‑band debugging during early‑stage bring‑up.
- Strong Linux fundamentals across drivers, kernel subsystems, cgroups, containers, and node‑level performance analysis.
- Ability to identify performance bottlenecks at the cluster, node, accelerator, network, or application layer.
Ways to Stand Out from the Crowd
- Outstanding interpersonal skills and the ability to build clarity and direction across diverse, fast‑paced technical teams.
- Knowledge of compute and networking infrastructure (e.g., instance types, networking primitives, high‑performance communication paths) at hyperscalers or cloud service providers.
- Demonstrated leadership resolving multi‑team infrastructure challenges across engineering, product, and customer groups.
- A consistent record of taking GPU or infrastructure products from pilot to high‑volume deployment in large data center environments.
- Familiarity with modern deep learning, LLM architectures, and distributed training/inference challenges at scale.
Your base salary will be determined by your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.
You will also be eligible for equity and benefits. Applications for this job will be accepted at least until February 9, 2026.
This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Languages
- English
Notice for Users
This job comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.