This job posting is no longer available
Site Reliability Engineer (local to Cincinnati, OH)
- Cincinnati, Ohio, United States
About
Requirements
· years of experience in Cloud SRE, DevOps, Infrastructure, or related engineering roles
· years working with databases, web applications, microservices, event-driven systems, messaging platforms, REST APIs, integrations, and containerized environments
· Strong knowledge of Java, Spring Boot, microservices architecture, Kafka, Cassandra, and SQL Server
· Proficiency in Python and Shell scripting for automation and operational tooling
· year managing observability platforms such as Dynatrace, ELK, PagerDuty, Datadog, Azure Monitor, or Grafana
· Hands-on experience with GitHub Actions for CI/CD automation
· Strong foundation in Linux architecture, security, performance tuning, troubleshooting, and production operations
· Experience working in Agile delivery teams
· Ability to collaborate effectively with multi-location/global teams
· Demonstrated ability to contribute at both a tactical and strategic level
· Familiarity with eCommerce, fulfillment, or retail technology environments
· Strong written, verbal, and presentation skills
Nice to Have
· years of experience designing or supporting high-volume eCommerce applications
· years configuring and managing cloud environments in Azure, AWS, or GCP
· Hands-on experience with Kafka, Cosmos DB, Cassandra, Ansible, Terraform, Docker, Kubernetes (1+ year)
· Experience with Nginx, HAProxy, or Squid
· Experience building CI/CD pipelines with Jenkins, Spinnaker, Azure DevOps, or TeamCity
· Experience implementing and managing RoyalTS or similar cross-platform remote management tools
Responsibilities
· Partner with application engineering, observability, and support teams — along with business operations and third-party partners — to identify, prioritize, and resolve issues impacting customer pickup and delivery operations
· Lead root cause analysis of critical business and production incidents, ensuring corrective actions and preventive measures are implemented
· Manage and facilitate Major Incident calls for the Pickup Fulfillment domain, providing timely and accurate updates to key stakeholders during service restoration
· Collaborate with engineering teams to continuously enhance build environments for improved reliability, speed, and scalability
· Drive automation initiatives to improve system efficiency, deployment accuracy, and operational quality
· Ensure system traceability, observability, and retrievability to support issue diagnosis and performance monitoring
· Build and maintain comprehensive logging, monitoring, and alerting systems to proactively identify bottlenecks and enable performance optimization across cloud, on-prem, and in-store environments
· Develop detailed documentation, playbooks, and design guides to support operational readiness and incident response consistency
· Participate in off-hours on-call rotation and scheduled maintenance windows to ensure uninterrupted system availability
Language Skills
- English