Site Reliability Engineer

Recurring Decimal

Phoenix, Arizona, United States

About

We are seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.

Key Responsibilities:

Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
Partner with application and data teams to define SLOs, SLIs, and error budgets.
Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
Participate in on-call rotations, incident response, and postmortem reviews.
Implement runbooks, auto-remediation scripts, and AI copilots for operations.

Required Qualifications:

8+ years of experience as an SRE.
Strong expertise in Azure and GCP cloud platforms (certifications a plus).
Proficient in Splunk (Enterprise + Observability) for monitoring, alerting, and log analytics.
In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
Experience with CI/CD pipelines, GitOps, and secure deployment practices.
Programming/scripting skills in Python, Go, or Bash.
Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.

Preferred Qualifications:

Experience building AI-enabled runbooks or copilots.
Exposure to FinOps or cost-optimization strategies in cloud environments.
Knowledge of distributed tracing and event correlation using OpenTelemetry.
Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.

Language Skills

English
