Senior DevOps Engineer, Infrastructure & Reliability

Worth AI

United States

About

Worth AI, a leader in the computer software industry, is looking for a Senior DevOps Engineer to join our Infrastructure team with a singular mission: to make our systems faster, more reliable, and more resilient while making life dramatically easier for engineers shipping software. In this role, you won't just manage infrastructure; you will design and evolve the foundation that every product and engineer depends on.
You will act as a force multiplier by eliminating operational friction, automating repetitive processes, strengthening system reliability, and building scalable infrastructure patterns that allow teams to deploy confidently and recover quickly. You are part architect, part reliability engineer, and part automation evangelist.
Responsibilities
Conduct regular interviews with engineering teams to identify operational pain points in CI/CD, deployments, observability, and cloud environments, then proactively eliminate them
Design and implement scalable Infrastructure-as-Code patterns using tools like Terraform to standardize cloud provisioning and reduce configuration drift
Own and evolve our Kubernetes platform (EKS or self-managed), ensuring workloads are secure, scalable, and resilient by default
Architect and optimize CI/CD pipelines to improve deployment frequency, reduce lead time, and increase confidence in releases
Lead systemic reliability initiatives, including incident response improvements, root cause analysis practices, and postmortem frameworks
Design and enforce secure networking, IAM, and secrets management strategies across environments
Improve observability by refining metrics, logs, and tracing using tools like Datadog, ensuring actionable insight into system health
Optimize cloud cost efficiency through rightsizing, autoscaling strategies, and architectural improvements
Own disaster recovery planning, backup strategies, and multi-region resilience initiatives
Refactor brittle or manually managed infrastructure into automated, testable, and reproducible systems
Introduce new infrastructure tooling or architectural shifts and drive adoption through documentation, workshops, and hands‑on support
Lead by example in incident management, risk mitigation, and operational excellence
Communicate technical trade‑offs clearly across engineering and product stakeholders, balancing speed with safety
Technology Stack
Cloud & Infrastructure: AWS (EKS, RDS, MSK, S3, Lambda, IAM, VPC)
Containerization & Orchestration: Kubernetes, ArgoCD
Infrastructure‑as‑Code: Terraform
CI/CD: GitHub Actions (or equivalent)
Monitoring & Observability: Datadog
Data & Messaging: PostgreSQL, Kafka, Redis
Languages (as needed): Bash, Python, TypeScript
Requirements
8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles
Proven experience designing and operating production Kubernetes environments at scale
Deep hands‑on expertise with AWS infrastructure and cloud networking
Strong experience building and maintaining Terraform modules across large cloud environments
Demonstrated ownership of CI/CD systems and measurable improvement of DORA metrics
Experience leading incident response processes and driving meaningful postmortem outcomes
Strong understanding of distributed systems, event‑driven architectures (Kafka), and database performance (PostgreSQL)
Proven ability to modernize legacy infrastructure and eliminate manual operational toil
Experience navigating high‑ambiguity environments and translating operational friction into prioritized infrastructure roadmaps
Demonstrated ability to build trust across teams while raising the reliability bar
Success Metrics
DORA Metrics Improvement
Drive measurable improvements in Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR)
System Reliability
Maintain or exceed defined SLO/SLA targets with reduced incident frequency and duration
Infrastructure Stability
Reduce production incidents caused by misconfiguration, manual processes, or infrastructure drift
Operational Efficiency
Increase percentage of infrastructure managed through code and automation
Cost Optimization
Improve cloud cost efficiency without sacrificing reliability or performance
Bonus Points (Nice to Have)
Experience operating high‑throughput Kafka clusters (MSK or self‑managed)
Strong background in database performance tuning (PostgreSQL, Redis)
Experience implementing autoscaling strategies for high‑traffic systems
Familiarity with service mesh technologies
Experience building internal developer platforms (IDP)
Background in security best practices (zero‑trust networking, policy‑as‑code)
Experience with multi‑region or globally distributed systems
Proficiency in Python for automation and tooling development
Experience introducing platform‑wide reliability frameworks (SLOs, error budgets, chaos testing)
All remote hires will be required to travel to Orlando, Florida, at least twice per year for town halls and team collaboration, in addition to orientation in Orlando.
Benefits
Health Care Plan (Medical, Dental & Vision)
Retirement Plan (401k, IRA)
Life Insurance
Flexible Vacation
Work From Home
Free Food & Snacks (in office)
Orlando, Florida (Hybrid)
Wellness Resources

Language skills

English
