
Staff Software Development Engineer

CVS Health
  • US
    Nashville, Tennessee, United States

About

Job Description:

  • Define and implement enterprise-wide SRE practices, including SLIs, SLOs, error budgets, and reliability governance
  • Drive a culture of reliability, automation, and continuous improvement across engineering teams
  • Establish metrics-driven approaches to measure system health, availability, and performance
  • Lead adoption of AIOps solutions to enable predictive monitoring, anomaly detection, and automated root cause analysis
  • Integrate machine learning models and analytics into monitoring pipelines to proactively detect and prevent incidents
  • Develop intelligent alerting systems to reduce noise and improve signal quality
  • Architect and build scalable observability frameworks covering metrics, logs, traces, and events
  • Define standards for instrumentation, telemetry collection, and distributed tracing
  • Enable real-time insights into system performance across microservices and cloud-native architectures
  • Lead incident response practices, including on-call readiness, root cause analysis (RCA), postmortems, and continuous learning loops
  • Build self-healing systems and automate remediation workflows to reduce Mean Time to Resolution (MTTR)
  • Implement runbooks, playbooks, and automated escalations
  • Develop internal platforms and tools for observability, monitoring, and performance optimization
  • Integrate observability into CI/CD pipelines to enable proactive quality and reliability checks
  • Drive infrastructure automation using Infrastructure as Code (IaC) frameworks and GitOps principles
  • Partner with engineering, platform, and product teams to embed reliability and observability into system design
  • Mentor engineers and lead design reviews focused on scalability, resilience, and operability
  • Influence enterprise architecture decisions and promote best practices across teams

Requirements:

  • 5+ years of experience in software engineering, SRE, or production engineering in large-scale distributed systems
  • Hands-on experience with observability tools such as AppDynamics, Grafana, Prometheus, Datadog, OpenTelemetry, or similar
  • Experience with AIOps or intelligent monitoring platforms, including anomaly detection and event correlation
  • Strong expertise in cloud platforms (AWS, Azure, or GCP) and cloud-native architectures (Kubernetes, containers, microservices)
  • Proficiency in at least one programming language (e.g., Python, Java, Go)
  • Strong understanding of distributed systems, resiliency patterns, and fault tolerance
  • Experience implementing incident management, on-call processes, and root cause analysis
  • Hands-on expertise with Infrastructure as Code (Terraform, ARM, CloudFormation) and CI/CD pipelines
  • Experience using GenAI and automation tools and frameworks such as OpenAI, Copilot, Gemini, Claude, MCP, etc.
  • Proven ability to design scalable, reliable, and observable systems

Benefits:

  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Other resources, based on eligibility

Language skills

  • English
Note for users

This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.