Engineering Manager, SRE - Observability

United States

About

Engineering Manager, SRE - Observability at Zendesk summary:

The Engineering Manager, SRE - Observability leads a team to design, build, and maintain enterprise-grade monitoring and incident response systems using tools like Datadog and Grafana. This role focuses on transforming reactive operations into proactive reliability engineering, collaborating across engineering and product teams to enhance system observability at scale. The position requires deep technical expertise, people management skills, and experience with large-scale cloud-native distributed systems.

Job Description
As an Engineering Manager specializing in Observability, you will lead and scale a highly skilled team responsible for architecting, building, and evolving enterprise-grade monitoring, alerting, and incident response systems. Leveraging your deep expertise with observability tools such as Datadog, Grafana, Loki, and others, you will drive our transformation from reactive firefighting to proactive reliability engineering at scale. Your mission is to empower engineering teams by providing the right visibility and tooling to ensure system health, availability, and performance.
You will collaborate closely with Product Management and Technical Leads to define and execute a strategic roadmap that addresses the challenges of monitoring complex, large-scale distributed systems in a cloud-native environment. This role demands a hands-on engineering leader who understands the nuances of telemetry data, visualization, alerting reliability, and cost-efficient observability architectures in enterprise settings.
What You’ll Be Doing

Recruit, mentor, and retain top engineering talent specialized in observability and reliability engineering.
Directly contribute to the design and implementation of observability solutions alongside your team, maintaining a high bar for technical excellence.
Own and evolve the end-to-end observability stack and operational processes, including metrics, traces, logs, dashboards, and alerting.
Partner with SRE, DevOps, and platform teams to integrate and extend observability tooling across diverse services running at large scale.
Lead roadmap planning for observability infrastructure and tooling in partnership with Product and Engineering leadership.
Establish best practices for instrumentation, data collection, alerting thresholds, and incident response workflows to elevate the organization's reliability posture.
Identify gaps and weaknesses in monitoring coverage and performance; proactively drive improvements and automation.
Collaborate cross-functionally with teams across the enterprise to influence observability adoption, standardization, and innovation.
Foster a culture of continuous learning, high team engagement, and technical craftsmanship within your team.
Communicate technical strategy, progress, risks, and impact effectively with stakeholders at all levels.

What You Bring to the Role

Deep hands-on experience with commercial and open-source observability tools, including Datadog, Grafana, Loki, and related telemetry technologies.
Proven track record managing observability or SRE teams within large, complex enterprise environments.
Strong understanding of distributed systems, cloud-native architectures (Kubernetes, AWS), and how observability fits into scalable operations.
Ability to provide technical leadership while actively contributing to engineering solutions and troubleshooting.
Expertise in designing scalable, reliable telemetry pipelines and intelligent alerting to reduce alert noise and incident toil.
Demonstrated skill in building and improving observability platforms that serve multiple engineering teams and business units.
Effective communicator and collaborator, able to bridge engineering, product, and business stakeholders.
Commitment to developing team members through coaching, feedback, and career growth opportunities.
Experience driving cultural change in organizations towards proactive reliability engineering and data-driven decision making.

Required

3+ years of people management experience leading engineering teams.
Deep domain expertise in observability, with hands-on experience in tools like Datadog, Grafana, and Loki.
Significant experience working in or managing engineering teams within large-scale enterprise companies.
Proven ability to hire, mentor, and retain high-performing engineers.
Strong collaboration skills to influence cross-functional teams in large engineering organizations.
Experience with distributed systems and cloud environments (AWS, Kubernetes).

Preferred

Background leading observability-focused teams.
Hands-on experience operating telemetry systems for large-scale Kubernetes and AWS workloads.
Passion for innovation, continuous learning, and championing a growth mindset.
Experience managing geographically distributed teams.

Our Tech Environment

Primarily AWS cloud infrastructure with Kubernetes orchestration.
Codebase spans Ruby, Go, and Python.
Data storage includes AWS Aurora (MySQL), S3, and Kafka streaming.
Observability responsibilities include balancing operational maintenance, tooling innovation, and incident support.

The Poland annualized base salary range for this position is zł297,000.00-zł445,000.00. Please note that while the salary range represents the minimum and maximum base salary rate for this position, the actual compensation offered will be based on job-related capabilities, applicable experience, and other relevant factors. This position may also be eligible for bonus, benefits, or related incentives that will be communicated during the offer stage.
Hybrid: In this role, our hybrid experience is designed at the team level to give you a rich onsite experience packed with connection, collaboration, learning, and celebration, while also giving you flexibility to work remotely for part of the week. This role requires attending our local office for part of the week. The specific in-office schedule is to be determined by the hiring manager.
The intelligent heart of customer experience
Zendesk software was built to bring a sense of calm to the chaotic world of customer service. Today we power billions of conversations with brands you know and love.
Zendesk believes in offering our people a fulfilling and inclusive experience. Our hybrid way of working enables us to purposefully come together in person, at one of our many Zendesk offices around the world, to connect, collaborate, and learn, whilst also giving our people the flexibility to work remotely for part of the week.
As part of our commitment to fairness and transparency, we inform all applicants that artificial intelligence (AI) or automated decision systems may be used to screen or evaluate applications for this position, in accordance with Company guidelines and applicable law.
Zendesk is an equal opportunity employer, and we’re proud of our ongoing efforts to foster global diversity, equity, & inclusion in the workplace. Individuals seeking employment and employees at Zendesk are considered without regard to race, color, religion, national origin, age, sex, gender, gender identity, gender expression, sexual orientation, marital status, medical condition, ancestry, disability, military or veteran status, or any other characteristic protected by applicable law. We are an AA/EEO/Veterans/Disabled employer. If you are based in the United States and would like more information about your EEO rights under the law, please click here .
Zendesk endeavors to make reasonable accommodations for applicants with disabilities and disabled veterans pursuant to applicable federal and state law. If you are an individual with a disability and require a reasonable accommodation to submit this application, complete any pre-employment testing, or otherwise participate in the employee selection process, please send an e-mail to with your specific accommodation request.

Keywords:

Engineering Manager, Site Reliability Engineering, Observability, Datadog, Grafana, Loki, Monitoring, Incident Response, Cloud-native, Kubernetes

Desired skills

AWS
Amazon S3
Go
Kubernetes
MySQL
Python
Ruby
Grafana


Work experience

DevOps
Site Reliability (SRE)

Language skills

English

Note for users

This job posting was published by one of our partners. You can view the original posting here.
