This job offer is no longer available
About
The Senior Site Reliability / DevOps Engineer will be responsible for building resilient, scalable, and observable platforms through strong automation, infrastructure engineering, and SRE best practices. This role blends SRE, DevOps, and platform engineering with hands‑on programming and production ownership in complex, distributed environments.
Key Responsibilities
- Design, build, and operate high‑reliability production platforms following SRE and DevOps principles.
- Develop automation and tooling using Python and Go to reduce operational toil and improve system reliability.
- Implement and maintain Ansible‑based automation for configuration management and infrastructure operations.
- Design and operate CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI/CD, and Azure DevOps.
- Implement Infrastructure as Code using Terraform and manage Kubernetes configuration using Helm and Kustomize.
- Support and operate containerized and cloud‑native workloads on Docker and Kubernetes.
- Build, operate, and optimize observability platforms (metrics, logs, traces) using Prometheus, Grafana, ELK, Splunk, or similar tools.
- Ensure deep visibility into system health, performance, and availability across distributed environments.
- Troubleshoot and resolve critical production issues, performing root‑cause analysis and driving permanent fixes.
- Partner with infrastructure, platform, and application teams to improve system reliability, scalability, and operability.
Required Skills & Experience
- 8+ years of experience in SRE, DevOps, Platform Engineering, or Production Engineering roles.
- Strong programming expertise in Python (automation, scripting, internal tooling) and hands‑on experience with Ansible for automation and configuration management.
- Strong understanding of Linux internals, networking, and distributed systems.
- Proven experience with CI/CD pipelines and Git‑based workflows.
- Hands‑on experience with Infrastructure as Code (Terraform) and configuration tooling (Helm, Kustomize).
- Solid experience running containerized environments using Docker and Kubernetes.
- Strong background in observability engineering (metrics, logs, traces).
- Experience working with at least one cloud platform: AWS, Azure, or GCP.
- Excellent troubleshooting skills and experience managing high‑severity production incidents.
Good to Have
- Experience applying SRE concepts such as SLIs, SLOs, and error budgets.
- Experience building internal developer platforms or reliability tooling.
Mandatory Skills: Site Reliability Engineering (SRE).
Languages
- English