Senior Specialist Engineer (SRE) - UKHSA - SEO
Manchester Digital
- Manchester, England, United Kingdom
About
Location & Working Arrangement
Hybrid working model: a minimum of 60% of contractual hours (≈3 days a week, pro rata) at one of UKHSA's core HQs (Birmingham, Leeds, Liverpool, London). The offices are modern and refurbished, with excellent transport links, and offer shared space for collaborating with other government departments, including DHSC.
About The Job
The Digital and Data Directorate provides scientific and research computing services. The Digital Development and Operations unit delivers platforms and technical capabilities to enable public health services within the organisation and with clients and stakeholders.
Key Responsibilities
Remediate infrastructure and operational problems.
Leverage automation and CI/CD to ensure reliable, scalable, and high‑performance services.
Monitor and manage cloud infrastructure services, and observe systems to prioritize the operational and performance improvements needed to meet or exceed SLOs.
Architect, develop, and manage multi‑cloud HPC platforms and on‑premises infrastructure.
Ensure services are highly available, scalable, and resilient.
Manage performance and capacity planning, and support UKHSA's AI requirements.
Incident Response & Troubleshooting
Respond swiftly to production incidents, minimizing downtime and restoring service rapidly.
Perform root cause analysis and post‑mortems, and implement the lessons learned.
Monitoring, Alerting & Observability
Design and implement effective monitoring and alerting systems using tools such as Prometheus and Grafana.
Improve observability to identify issues before they impact users.
Continuously refine practices to reduce alert fatigue.
Automation & Tooling
Develop automation to eliminate manual repetitive tasks and improve efficiency.
Write clean, maintainable, well‑tested code for automation and tooling.
Drive initiatives to reduce operational toil via Infrastructure as Code.
Service Level Objectives & Operational Improvements
Define, track, and improve SLOs, SLIs, and error budgets.
Prioritize improvements that align with business goals and user experience.
SRE Best Practices & Advocacy
Evangelize SRE principles across the organisation.
Integrate reliability practices into the development lifecycle.
Collaboration & Knowledge Sharing
Collaborate with software engineering, DevOps, and infrastructure teams.
Promote a culture of shared responsibility for service reliability.
Documentation & Training
Maintain accurate technical documentation, runbooks, and post‑incident reports.
Provide training and mentorship on best practices and tools.
Essential Criteria
Experience as a Site Reliability Engineer, DevOps Engineer, Operations Engineer or similar.
Programming/scripting skills in Python, PowerShell, Bash.
Understanding of Linux/Unix, Windows, networking, distributed systems.
Experience with observability tools (Prometheus, Grafana, Datadog) and alerting systems.
Infrastructure automation skills (Terraform, Ansible, Helm).
Excellent communication and collaboration skills.
Experience with security best practices.
Strong problem‑solving skills and ability to respond to sudden demands.
Desirable Criteria
Experience with CI/CD pipelines, cloud platforms (AWS, GCP, Azure), and Kubernetes.
Post‑incident review experience.
Driving SRE practice adoption across an organisation.
Delivering training to, or mentoring, junior engineers.
Seniority Level: Mid‑Senior level
Employment Type: Full‑time
Job Function: Engineering and Information Technology
Industries: Technology, Information and Internet
Languages
- English