
Software Engineering Manager, LLM Training

LinkedIn
  • US
    United States

About

Job Description
This role will be based in Mountain View, CA. The work location is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team.
As a Software Engineering Manager of the Post-Training Infra team, you will architect the high-throughput systems required for Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Multi-Teacher Distillation, Reinforcement Learning from Human Feedback (RLHF), Agentic Performance Optimization, and Agentic Research at scale. You won’t just be "running scripts"; you’ll be optimizing the engine that makes rapid model alignment possible.
Responsibilities
Distributed Training Enablement: Enable and support sophisticated parallelism strategies, including data, tensor, pipeline, context, and expert parallelism, for models exceeding 100B parameters. Provide optimized configurations, reference examples, and platform‑level integration so that customer teams can effectively leverage these techniques.
Post‑Training Expertise: Maintain deep expertise across the post‑training landscape, including Multi‑Teacher Distillation, RL‑based alignment and optimization (RLHF, GRPO), Pruning, Quantization, and Speculative Decoding. Build and maintain reusable platform components that enable customer teams to efficiently leverage these techniques in their workflows.
Performance Engineering: Deep‑dive into strategic customer workloads and drive workload‑specific and platform‑level optimizations, including Liger Kernels, FlashAttention, low‑precision training, high‑performance data I/O, and inter‑node latency reduction.
Multi‑Modal Strategy: Drive the post‑training strategy for video and audio models.
Framework & Ecosystem Mastery: Act as a bridge to the OSS community. Contribute to and troubleshoot the "Post‑Training Stack," including Liger, PyTorch, Hugging Face (Accelerate/Transformers), Megatron, Ray, VERL, SGLang, and vLLM.
Observability & Profiling: Develop advanced telemetry for large‑scale training runs. Use profiling tools to debug hardware‑level stalls (NCCL timeouts, memory fragmentation) and provide internal teams with actionable insights into training stability.
Containerized Lifecycle Management: Lead the development of the "Golden Image" environment. Maintain and distribute optimized, containerized base images with compatible, validated builds of PyTorch, CUDA, and the broader training stack to ensure seamless training on our clusters.
Responsible AI & Compliance Partnership: Serve as the bridge between the training platform and Responsible AI teams, collaborating on data compliance, model evaluation, and safety processes. Ensure the platform provides the tooling and integration points needed for RAI teams to effectively apply their frameworks throughout the training lifecycle.
Agentic Strategy: Lead development of Agents for autonomous model research and performance optimization.
Lead, coach, and manage the core team of engineers working on building the infrastructure.
Participate with senior management in developing a long‑term technology roadmap for the team and company.
Have the ability to dive deep into technical discussions to challenge the status quo, and steer the team in the right direction to push the envelope.
Communicate and collaborate effectively with stakeholders across engineering and business leadership.
Help the team realize their potential by setting clear expectations, openly evaluating performance, upholding accountability, and providing challenges to stretch their skills.
Drive a culture of operational excellence. Lead the team in defining performance goals and metrics, and in building the infrastructure and tooling necessary to maintain a high quality bar and detect issues in real time.
Create an inclusive work environment that fosters autonomy, transparency, innovation, and learning, while holding a high bar for quality.
Qualifications
Basic Qualifications
BA/BS Degree in Computer Science or related technical discipline, or equivalent practical experience.
1+ year(s) of management experience or 1+ year(s) of staff level engineering experience with management training.
5+ years of industry experience in software design, development, and large‑scale software engineering.
At least one year of experience with LLM post‑training and/or inference.
Hands‑on experience developing distributed systems.
Preferred Qualifications
MS or PhD in Computer Science or related technical discipline.
2+ years of hands‑on software engineering/technical management and people management experience.
7+ years of industry experience in software design, development, and algorithm‑related solutions.
Experience in architecting, building, and running large‑scale distributed systems.
Experience with industry, open‑source, and/or academic research papers published in the space.
Suggested Skills
Distributed systems
LLM Training
AI infrastructure
Benefits
We strongly believe in the well‑being of our employees and their families. LinkedIn offers generous health and wellness programs and time off for employees of all levels. LinkedIn is committed to fair and equitable compensation practices. The pay range for this role is $170,000 – $277,000. Actual compensation packages are based on a variety of factors unique to each candidate.
The total compensation package may include an annual performance bonus, stock, benefits, and/or other incentive compensation plans.
Equal Opportunity Statement
We seek candidates with a wide range of perspectives and backgrounds and we are proud to be an equal opportunity employer. LinkedIn considers qualified applicants without regard to race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other legally protected class.
LinkedIn is committed to offering an inclusive and accessible experience for all job seekers, including individuals with disabilities. Our goal is to foster an inclusive and accessible workplace where everyone can succeed. If you need a reasonable accommodation to search for a job opening, apply for a position, or participate in the interview process, connect with us at accommodations@linkedin.com and describe the specific accommodation requested for a disability‑related limitation.
San Francisco Fair Chance Ordinance
Pursuant to the San Francisco Fair Chance Ordinance, LinkedIn will consider for employment qualified applicants with arrest and conviction records.
Pay Transparency Policy Statement
As a federal contractor, LinkedIn follows the Pay Transparency and non‑discrimination provisions described at https://lnkd.in/paytransparency.
Global Data Privacy Notice for Job Candidates
Please follow this link to access the document that provides transparency around how LinkedIn handles personal data of employees and job applicants: https://legal.linkedin.com/candidate-portal.

Language skills

  • English