Site Reliability Engineer

Recurring Decimal

Phoenix, Arizona, United States

About

Seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.

Key Responsibilities:

Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
Partner with application and data teams to define SLOs, SLIs, and error budgets.
Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
Participate in on-call rotations, incident response, and postmortem reviews.
Implement runbooks, auto-remediation scripts, and AI copilots for operations.

Required Qualifications:

8+ years of experience as an SRE.
Strong expertise in Azure and GCP cloud platforms (certifications a plus).
Proficient in Splunk (Enterprise + Observability) for monitoring, alerting, and log analytics.
In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
Experience with CI/CD pipelines, GitOps, and secure deployment practices.
Programming/scripting skills in Python, Go, or Bash.
Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.

Preferred Qualifications:

Experience building AI-enabled runbooks or copilots.
Exposure to FinOps or cost-optimization strategies in cloud environments.
Knowledge of distributed tracing and event correlation using OpenTelemetry.
Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.

Language Skills

English
