Site Reliability Engineer
Recurring Decimal
- Phoenix, Arizona, United States
This job posting is no longer available.
About
We are seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure, with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.
Key Responsibilities:
- Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
- Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
- Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
- Partner with application and data teams to define SLOs, SLIs, and error budgets.
- Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
- Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
- Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
- Participate in on-call rotations, incident response, and postmortem reviews.
- Implement runbooks, auto-remediation scripts, and AI copilots for operations.
Required Qualifications:
- 8+ years of experience as an SRE.
- Strong expertise in Azure and GCP cloud platforms (certifications a plus).
- Proficient in Splunk (Enterprise + Observability) for monitoring, alerting, and log analytics.
- In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
- Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
- Experience with CI/CD pipelines, GitOps, and secure deployment practices.
- Programming/scripting skills in Python, Go, or Bash.
- Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.
Preferred Qualifications:
- Experience building AI-enabled runbooks or copilots.
- Exposure to FinOps or cost-optimization strategies in cloud environments.
- Knowledge of distributed tracing and event correlation using OpenTelemetry.
- Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.
Language skills
- English