Principal ML Engineer, Machine Learning Platform – Systems Architecture
remoterocketship
- United States
About
- Lead architecture and delivery for major ML platform capabilities across training, evaluation, deployment, and observability
- Design scalable systems for distributed training, data processing, feature and model lifecycle management, and production inference
- Own platform-level technical outcomes from design through deployment, operations, and continuous improvement
- Drive the design and scaling of data pipelines for large-scale structured and semi-structured technical datasets
- Lead architecture for distributed data processing and orchestration systems such as Ray, Airflow, Spark, or similar platforms
- Establish strong practices for data lineage, provenance, governance, and responsible data usage in ML systems
- Guide the design of model deployment, inference services, monitoring, and observability for production ML workloads
- Contribute to the development of ML-ready representations for geometry, graph, hierarchical, or multimodal data
- Clarify ambiguous problem spaces, define solution approaches, and lead execution across multiple engineers and teams
- Establish and improve engineering standards, operational practices, and architectural patterns for ML systems
- Lead incident response for critical platform issues and drive lasting improvements across system health and supportability
- Mentor engineers and act as a force multiplier through design leadership, coaching, and technical reviews
- Communicate technical strategy, tradeoffs, and execution plans clearly to technical and non-technical stakeholders
Requirements:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent industry experience
- Typically 6 to 8 years of industry experience in software engineering, ML infrastructure, distributed systems, or platform engineering, including experience leading design and delivery of complex technical systems
- Deep experience in software architecture, distributed systems, large-scale data platforms, or ML infrastructure
- Strong proficiency in Python and strong command of production software engineering practices
- Experience leading complex technical initiatives that span multiple engineers or cross-functional teams
- Strong experience with large-scale data pipelines, distributed data processing, and cloud-native platform architectures
- Experience with model deployment, inference systems, and production observability
- Demonstrated ability to make architecture decisions that balance performance, scalability, reliability, and cost
- Strong communication and stakeholder management skills
Benefits:
- Health and financial benefits
- Time away and everyday wellness
Languages
- English