Infrastructure Engineer • https://ladybirdltd.com/ • Manchester, England, United Kingdom
This job offer is no longer available
About
This role is for a Senior / Principal Platform Engineer with deep experience in on-premise data centres, bare-metal infrastructure, private cloud platforms, and self-managed Kubernetes.
The platform operates on leased data-centre infrastructure, rather than managed cloud services. There is no reliance on EKS, GKE, AKS, or other fully managed abstractions.
You will design, build, operate, secure, and scale a low-latency, high-availability, real-time AI platform from the ground up — starting with physical infrastructure and extending through to production Kubernetes workloads, integrated end-to-end with GitLab CI/CD.
This role is best suited to engineers who have built and owned infrastructure end-to-end and are comfortable operating close to the underlying systems.
*What You Will Own*
*On-Prem Infrastructure & Platform Engineering (Bare Metal)*
You will:
* Design and build on-premise and private-cloud platforms in a leased data-centre environment
* Architect bare-metal Kubernetes clusters, including self-managed control planes and worker nodes
* Own compute, networking, storage, and OS-level architecture
* Design for low-latency, high-throughput, and high-availability real-time systems
* Operate GPU-enabled infrastructure for inference and training workloads
* Plan explicitly for failure modes, capacity, scaling, and recovery
This is a platform ownership role, not a platform consumption role.
*CI/CD, GitOps & Automation (GitLab-First)*
You will:
* Design, build, and operate GitLab CI/CD pipelines from first principles
* Automate:
  * Infrastructure provisioning
  * Kubernetes deployments
  * Application and model releases
  * Rollback, recovery, and disaster scenarios
* Implement GitOps-style workflows
* Automate upgrades, patching, and day-2 operations
* Reduce operational risk through Infrastructure as Code
*Streaming, Real-Time & Low-Latency Systems*
You will:
* Operate Kafka-based streaming platforms in production
* Support stateful, ordered, low-latency event processing
* Design systems that degrade safely under load
* Optimise for latency, throughput, back-pressure, and reliability
* Understand and design for real-world system behaviour under stress
*MLOps Infrastructure (Platform-Focused)*
You will:
* Operate production ML inference platforms (e.g. KServe, Seldon, Triton, or similar)
* Support retraining pipelines and model lifecycle management
* Enable GPU scheduling, concurrency control, and resource isolation
* Implement safe deployment strategies, including:
  * Canary releases
  * Versioning
  * Rollback paths
* Partner closely with AI engineers while retaining ownership of the serving platform
*Reliability Engineering, Security & Observability*
You will:
* Build and maintain observability stacks using Prometheus, Grafana, and centralised logging
* Define and monitor SLIs, SLOs, latency, availability, and error budgets
* Lead incident response, root-cause analysis, and remediation
* Harden systems for security, resilience, and compliance
* Implement secrets management, access controls, and least-privilege models
*What This Role Is Not*
* Not data science or ML research
* Not frontend or product engineering
* Not managed-cloud DevOps
* Not maintaining an inherited platform
This is a hands-on platform ownership role.
*Required Experience (Non-Negotiable)*
You should be able to confidently say “yes” to most of the following:
* You have personally built and operated on-prem or private-cloud infrastructure
* You have run Kubernetes outside managed cloud services
* You have strong Linux, networking, storage, and systems fundamentals
* You have built GitLab CI/CD pipelines, not just used them
* You understand real-time, low-latency system design
* You are comfortable acting as the final technical owner of a production platform
*Strong Signals (Highly Valued)*
* Bare-metal Kubernetes (no EKS / GKE / AKS)
* GPU scheduling and accelerator workloads at scale
* Kafka and streaming systems in production
* Terraform and GitOps workflows
* Experience in regulated, mission-critical, or high-availability environments
* Background in telecoms, finance, HPC, or real-time systems
*Engineering Culture – Above the Line*
Culture is reflected in how systems are designed, operated, and supported when things go wrong.
Everyone operates *Above the Line*:
* *Accountability* – See it. Own it. Solve it. Do it.
* *Respect* – Communicate clearly and professionally.
* *Action* – Move work forward without waiting.
* *Feedback* – Seek challenge and continuous improvement.
* *Ownership* – Be responsible for outcomes, not tasks.
* *Empathy* – Build systems that behave predictably for real users.
At senior and principal level, this means end-to-end ownership, sound judgement, and technical leadership by example.
*Level:* Principal (Hands-On, Individual Contributor)
*Environment:* On-Prem Data Centre · Bare Metal · Private Cloud · GitLab-First · GPU-Enabled
Job Types: Part-time, Permanent, Fixed term contract, Temp to perm, Freelance, Zero hours contract, Volunteer, Internship
Contract length: 12 months
Pay: £6,349.20-£21,067.20 per year
Ability to commute/relocate:
* Manchester M35 9BD: reliably commute or plan to relocate before starting work (required)
Experience:
* Hands-on platform engineering, including building on-prem infrastructure: 3 years (required)
Work authorisation:
* United Kingdom (required)
Work Location: In person
Languages
- English