This job posting is no longer available

Site Reliability Engineer

Recurring Decimal
  • Phoenix, Arizona, United States

About

We are seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.

Key Responsibilities:

  • Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
  • Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
  • Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
  • Partner with application and data teams to define SLOs, SLIs, and error budgets.
  • Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
  • Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
  • Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
  • Participate in on-call rotations, incident response, and postmortem reviews.
  • Implement runbooks, auto-remediation scripts, and AI copilots for operations.

Required Qualifications:

  • 8+ years of experience as an SRE.
  • Strong expertise in Azure and GCP cloud platforms (certifications a plus).
  • Proficient in Splunk (Enterprise + Observability) for monitoring, alerting, and log analytics.
  • In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
  • Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
  • Experience with CI/CD pipelines, GitOps, and secure deployment practices.
  • Programming/scripting skills in Python, Go, or Bash.
  • Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.

Preferred Qualifications:

  • Experience building AI-enabled runbooks or copilots.
  • Exposure to FinOps or cost-optimization strategies in cloud environments.
  • Knowledge of distributed tracing and event correlation using OpenTelemetry.
  • Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.

Language skills

  • English