SRE/Observability Engineer • BlueSky Resource Solutions • Saint Paul, Illinois, United States
About
Job Description
Site Reliability Engineer – Observability
Overview
We are seeking a skilled Site Reliability Engineer III to join our Platform Engineering team, focusing on building and maintaining a comprehensive observability platform. In this role, you will ensure that our microservices, Kubernetes clusters, and cloud infrastructure are consistently reliable, high-performing, and scalable. You will work closely with cross-functional teams to provide deep insights into system health, performance, and availability through metrics, logs, and traces. This is a key role for those passionate about creating robust, proactive monitoring systems to support troubleshooting and optimization.
Responsibilities
Develop and sustain a resilient observability stack using tools such as Prometheus, Grafana, Loki, InfluxDB, Telegraf, OpenTelemetry, and more.
Collaborate with DevOps, engineering, and product teams to understand monitoring requirements and deliver data-driven insights for better decision-making.
Design and implement monitoring solutions across diverse environments, including Kubernetes clusters, microservices, AWS, Azure, on-prem vSphere setups, Windows and Linux hosts, and network equipment from Cisco, Juniper, Arista, and others.
Aggregate and store logs, metrics, and traces from distributed systems to ensure comprehensive, end-to-end visibility.
Develop alerting mechanisms based on KPIs and thresholds to support proactive performance monitoring and application uptime.
Create and maintain dashboards to monitor system health, application performance, and resource utilization.
Build solutions for monitoring key application metrics, including latency, request rates, error rates, and service dependencies.
Support incident response efforts, collaborating with DevOps, SRE, and development teams to troubleshoot and resolve performance issues.
Define and implement automated incident response workflows using observability data.
Participate in post-incident analyses to identify root causes and continuously improve system reliability.
Identify areas to improve observability practices, including better instrumentation, alerting, and reporting strategies.
Document observability setups, best practices, and troubleshooting techniques.
Stay informed on the latest observability technologies and industry trends to enhance the observability ecosystem.
Provide regular reports and dashboards on system health and performance metrics to ensure transparency for stakeholders.
Preferred Qualifications
Bachelor’s degree in Computer Science, Engineering, Information Technology, or related field (or equivalent practical experience).
3–5 years of experience in observability, monitoring, or related areas such as SRE, DevOps, or Platform Engineering.
Proven experience in building, scaling, and managing observability solutions for complex infrastructure environments (Kubernetes, AWS, Azure, on-prem vSphere, and Windows/Linux).
Proficiency with Git version control, including branch management, conflict resolution, and GitHub workflows, along with experience in CI/CD using GitHub Actions.
Familiarity with VMware vSphere, cloud platforms (AWS, GCP, Azure), and containerized environments (Docker and Kubernetes).
Relevant certifications (e.g., VMware Certified Professional - VCP, AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, Certified Kubernetes Administrator) are a plus.
Skills and Abilities
Deep understanding of observability principles, including metrics, logs, and traces.
Strong experience with monitoring tools (Prometheus, Grafana, InfluxDB, Telegraf, etc.) and Kubernetes/containerized workloads.
Knowledge of cloud-native technologies, Infrastructure as Code (IaC), and DevOps practices.
Experience with Application Performance Management (APM) tools.
Proficient in scripting and automation with languages like Python, Bash, or Go.
Skilled in data visualization and reporting, using tools like Grafana and Kibana.
Ability to troubleshoot complex issues using logs, metrics, and traces for effective incident response.
Strong collaboration and communication skills for working with SRE, DevOps, and engineering teams.
Problem-solving mindset with attention to detail in designing observability solutions.
Adaptable to a fast-paced, evolving technical environment.
Eagerness to stay up-to-date with trends in observability, cloud technologies, and distributed systems.
Language skills
- English
Note for users
This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.