MLOps Engineer • Forschungszentrum Jülich GmbH • Jülich, North Rhine-Westphalia, Germany

About

Shaping change: this is what drives us at Forschungszentrum Jülich. As a member of the Helmholtz Association with some 7,600 employees, we conduct interdisciplinary research into a digitalized society, a climate-friendly energy system, and a sustainable economy. We focus on the natural, life, and engineering sciences in the fields of information, energy, and bioeconomy. We combine this with expertise in high‑performance computing and artificial intelligence using unique scientific infrastructures.

Are you excited about bridging cutting-edge AI and robust, secure infrastructure operations? Then you've come to the right place! At the Jülich Supercomputing Centre (JSC), we operate some of Europe's most powerful computing systems and provide them to science and engineering communities in Germany and across Europe. Within this mission, the JUPITER AI Factory (JAIF) is being established in Jülich as a major pillar of Europe's AI infrastructure, providing simplified access to JUPITER, Europe's first exascale-class supercomputer, for all of its user groups.

We are looking for an MLOps Engineer, a hands-on operator to help bring AI models into production inference on JAIF using the BLABLADOR AI inference suite. We created the BLABLADOR LLM inference server in 2022 to provide the Helmholtz Association with a simple-to-use, private Large Language Model infrastructure. Its usage has grown steadily in this age of agentic AI and multi-modal models, and we are now working on scaling BLABLADOR to serve thousands of simultaneous users.

Join our team as an MLOps Engineer

As an MLOps Engineer, you will work at the intersection of AI enablement and production-grade operations, helping teams run model inference services that are dependable, scalable, and secure. Your key responsibilities include:

Model Onboarding & Deployment: Take AI models from internal and external users into production inference workflows. Package and deploy models as containerized inference endpoints. Define and operate validation gates covering functional checks, performance benchmarks, and safety/security requirements before production rollout.
Platform Operations & Reliability: Run and maintain BLABLADOR as an operational inference suite within JAIF's environment, aligned with the goal of enabling high-performance, secure, privacy-compliant AI applications. Ensure reliability through comprehensive monitoring, logging, alerting, incident response, and post-incident improvements. Handle day-2 operations including upgrades, patching, lifecycle management, and capacity planning.
Scalability, Performance & Efficiency: Tune inference services to optimize latency, throughput, batching, GPU utilization, and multi-tenant fairness.

We are looking for a colleague who can confidently operate modern infrastructure and understands what it takes to run AI inference services in production.

Required/strong fit:

Excellent master's degree in Computer Science, Mathematics or comparable relevant fields
Relevant experience in platform operations, site reliability engineering, DevOps, or infrastructure engineering
Practical expertise with containers and OCI tooling (e.g., Docker/Podman, image builds, registries, runtime troubleshooting)
Experience operating cloud technologies and orchestration (e.g., Kubernetes and its ecosystem)
Very good command of written and spoken English with extensive vocabulary is required (at least B2 level according to the CEFR), ideally supported by a certificate confirming the language level. Knowledge of the German language is not mandatory but certainly appreciated

Nice to have:

Infrastructure-as-code and CI/CD experience (e.g., Terraform/Ansible, GitOps)
Observability stacks (e.g., Prometheus/Grafana, ELK/OpenSearch, Loki, OpenTelemetry)
Familiarity with AI model inference engines and serving stacks (e.g., TensorRT, Triton, vLLM serving, or comparable inference frameworks)
Experience with web service deployment, such as reverse proxies, TLS certificates, load balancing, API endpoints
Experience with GPU scheduling/operations and performance debugging on accelerated systems
Strong Linux skills (process/network debugging, automation, scripting)
Prior work in (or with) research computing, scientific environments, or multi-tenant platforms

How you work:

You enjoy structured operations: measurable SLOs, calm incident response
You communicate clearly across disciplines and hierarchies in an international context

We work on highly topical, socially relevant issues and offer you the opportunity to actively shape change! You can expect a wide range of opportunities:

Working Environment: A highly relevant mission at the intersection of exascale computing and AI, including JAIF's goal of enabling broad access to JUPITER and its planned cloud-based inference capabilities.
Work-Life Balance: Optimal conditions for balancing work and private life, as well as a family-friendly company policy. The option of flexible working (in terms of location) is generally available after consultation and in line with upcoming tasks and (on-site) appointments
Flexibility: Flexible working time models, including options close to full-time, allow you to tailor your working hours to suit your individual needs
Knowledge & Further Training: Your professional development is important to us – we provide targeted, individual support
Health & Well-Being: Your health is important to us. You can look forward to a comprehensive occupational health management program with a wide range of offerings – e.g., a beach volleyball court, running groups, yoga classes, and much more. In addition, our company medical service and an experienced social counseling team are available to assist you on site
Vacation: You will receive 30 days of vacation plus additional days off (e.g. between Christmas and New Year's)
Fair remuneration: Depending on your existing qualifications and the tasks assigned to you, you will be classified in pay grade 13 of the TVöD Bund (Collective Agreement for the Public Service). All information on the TVöD Bund collective agreement can be found on the BMI website. The monthly salaries in euros can be found on page 69 ff. of the PDF download
Perspective: After a 2-year fixed-term contract, our goal is to hire you on a permanent basis. Let's use this time to find out how well we fit together
Additional benefits: Benefit from attractive additional services such as a company pension scheme with employer contribution. In addition to your basic salary, you will receive an annual bonus and capital-forming benefits
Support for international employees: Our International Advisory Service makes it easier for international employees to get started

In addition to exciting tasks and a collegial working environment, we offer you much more: https://go.fzj.de/benefits

We welcome applications from people with diverse backgrounds, e.g. in terms of age, gender, disability, sexual orientation/identity, and social, ethnic and religious origin. A diverse and inclusive working environment with equal opportunities in which everyone can realize their potential is important to us.

The following links provide further information on diversity and equal opportunities: https://go.fzj.de/equality and on specific support options: https://go.fzj.de/womens-job-journey

Languages

English
