Foundation Model DevOps Engineer
Institute of Foundation Models, United States
About
Job Description
About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy. As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development.
You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.
The Role
We are seeking a Foundation Model DevOps Engineer focused on Operational Stability to serve as the backbone of our AI research infrastructure.
You will be designing the friction-free environment that allows our models to be built. Your mandate is to build the tooling, release pipelines, and storage policies that remove drag on our research team. You will own the "foundational layer", ensuring that our researchers have immediate, secure, and reliable access to the tools, data, and compute they need.
Key Responsibilities
Model Release Engineering
· High-Fidelity Release Management: You own the standard of our public presence. You ensure that every release (weights, code, training logs, data) is reproducible, meticulously documented, and packaged with the polish of a top-tier open-source product.
· CI/CD for Research: Design and implement pipelines that automate the testing and packaging of complex model releases, moving us away from manual handovers to automated verification.
· Repo Administration: Administer the organization’s GitHub Enterprise account, ensuring branch protection and clean versioning practices are enforced across the lab.
Resource Management & Infrastructure Efficiency
· Compute Governance: Manage the efficiency of our large-scale GPU resources. You track utilization to identify idle nodes, "zombie jobs," or inefficient scheduling, ensuring we extract maximum value from our compute clusters.
· Storage Strategy & Hygiene: Manage the lifecycle of petabyte-scale datasets and checkpoint storage. You implement intelligent aging policies to solve the "disk full" bottleneck without risking critical data loss.
· Quota & Access Logic: Proactively manage storage and compute quotas across research teams to prevent resource contention before it blocks a training run.
Research Tooling & Orchestration
· Experiment Management Systems: Build and maintain the internal CLI tools and dashboards that allow researchers to launch, track, and organize jobs across thousands of GPUs.
· Resource Telemetry: Set up real-time monitoring for interconnect throughput, GPU memory, and file system latency to catch performance degradation instantly.
· Job Orchestration: Work closely with infrastructure teams to optimize how we run synthetic data pipelines and large-scale evaluations, ensuring our tooling scales with our compute.
Research Environment Provisioning
· Automated Workspace Setup: Build the scripts and tooling that instantly provision compute environments, permissions, and storage namespaces for researchers (automating away the manual work).
· Cluster Access Architecture: Streamline SSH and node access protocols to ensure friction-free entry to our massive-scale compute clusters while maintaining security boundaries.
Academic Qualifications
A bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
Professional Experience - Minimum (The Bar)
· 3+ years of experience in DevOps, Release Engineering, or MLE, specifically within AI/ML or HPC environments.
· Foundation Model Fluency: You understand the lifecycle of training large models (LLMs or Diffusion). You know what a checkpoint is, you understand the difference between pre-training and inference, and you are familiar with the artifacts required for a model release.
· Linux/Unix Fluency: You live in the command line. You have deep expertise in bash scripting, file system permissions, and SSH configuration.
· Version Control Admin: Expert-level administration of GitHub Enterprise (managing teams, API limits, and repository security).
· Scripting & Automation: Proficiency in Python or Bash to automate repetitive administrative tasks.
Professional Experience - Preferred (The Fit)
· "Gold Standard" Open Source: Experience contributing to or managing high-profile open-source releases (Hugging Face libraries, model families, datasets).
· HPC Schedulers: Deep understanding of Slurm job scheduling and troubleshooting.
· Cloud Storage: Familiarity with cloud storage buckets (S3/GCP) and efficient data transfer tools.
Benefits Include
*Comprehensive medical, dental, and vision benefits
*Bonus
*401K Plan
*Generous paid time off, sick leave and holidays
*Paid Parental Leave
*Employee Assistance Program
*Life insurance and disability
Visa Sponsorship
This position is eligible for visa sponsorship.
Languages
- English