Director, Digital Reliability Engineering
- Miami, Florida, United States
About
Journey with us! Combine your career goals and sense of adventure by joining our exciting team of employees. Royal Caribbean Group is pleased to offer a competitive compensation and benefits package, and excellent career development opportunities, each offering unique ways to explore the world.
The Royal Caribbean Group's Digital Team has an exciting career opportunity for a full-time Director, Digital Reliability Engineering, reporting to the VP of Engineering.
The position is onsite and based in Miami, Florida.
Position Summary:
The Director, Digital Reliability Engineering will lead the global Technology Operations portfolio for Royal Caribbean's Digital organization, ensuring the reliability, availability, and performance of guest-facing pre-cruise platforms across web and mobile.
This leader is responsible for both Site Reliability Engineering (SRE) practices and run-the-business engineering support. Beyond incident response, the Director is accountable for managing and delivering on the resolution of all production issues, executing ongoing maintenance activities, and coordinating technical communications. This role also manages a dedicated engineering development capacity focused on production fixes, ongoing maintenance, and technical debt reduction. This ensures that stability improvements are not only identified but also delivered. This person is expected to walk the talk: able to jump in during incidents, work side by side with engineers, and demonstrate technical depth when guiding solutions.
This is a hands-on role where the leader is expected to actively support teams during critical incidents, work directly with engineers to troubleshoot, and ensure sustained improvements in reliability.
This role also carries executive accountability for critical incidents. The Director must be prepared to provide leadership and direct support during major incidents at any time, ensuring the organization responds with speed, clarity, and effectiveness.
Essential Duties and Responsibilities:
Strategic Leadership
- Define and execute the global SRE strategy for Digital Operations, aligning with business priorities and Royal Caribbean's long-term technology vision
- Build and nurture a culture of reliability, resilience, and continuous improvement across all digital platforms
- Drive initiatives to maintain zero downtime by rapidly addressing issues, conducting root cause analysis, and implementing remediations
- Build strong relationships with product management, engineering, design, and operations stakeholders
- Own and drive operational metrics (e.g., MTTx metrics, incident rates, error budgets, service availability) with visible progress and accountability
Hands-On Operational Engagement
- Lead global site reliability and operations teams across onshore, nearshore, and offshore locations while actively engaging in day-to-day challenges
- Actively participate in major incident response, including log analysis, recovery validation, and executive updates
- Lead problem bridges, collaborating across technical and functional teams for timely issue resolution
- Partner with engineers to diagnose, troubleshoot, and resolve critical issues in real time, demonstrating technical credibility
- Strengthen ITSM processes (Incident, Problem, Change, Major Incident) using tools like ServiceNow, PagerDuty, and JIRA
Run-the-Business
- Lead engineering support for production issue remediation, ensuring timely root-cause analysis, resolution, and prevention of recurring problems
- Lead a dedicated production engineering team responsible for developing and deploying fixes, patches, and enhancements that improve reliability and guest experience
- Ensure development workstreams include not only feature delivery but also operational hardening, technical debt remediation, and defect resolution
- Manage and prioritize ongoing maintenance activities, patches, upgrades, and operational improvements across the digital technology stack
- Establish strong feedback loops with product and engineering teams so that recurring issues and operational pain points are systematically eliminated
Technology & Engineering
- Work directly with teams to ensure the reliability of a hybrid technology stack spanning:
  - Mobile: Native iOS, Android, and cross-platform frameworks
  - Web: React, Angular, and modern web technologies
  - Backend Services: Microservices, APIs, and integration layers
  - Commerce: SAP Hybris platform
  - Cloud Infrastructure: AWS (EC2, ECS, S3, API Gateway), DKP/on-prem clusters, and observability pipelines
- Champion observability and performance practices, leveraging platforms such as Splunk, Dynatrace, Prometheus, Quantum Metric / RUM tools
- Promote automation, chaos engineering, and AI-driven anomaly detection to strengthen system resilience
- Guide teams in Infrastructure as Code and modern operational tooling
- Environment Management: Oversee all environment activities, including new code deployments
Team Development & Leadership by Example
- Recruit, mentor, and develop global SRE talent while modeling hands-on technical engagement
- Encourage engineers to take ownership and proactively solve problems, supported by your direct involvement when needed
- Manage vendor and partner teams with the same "roll-up-your-sleeves" approach as internal teams
- Deliver executive-ready dashboards and insights to communicate the health of digital operations
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or related field
- 15+ years of experience in technology operations, including 8+ years in global leadership roles
- Engineering Management: Experience leading software engineering teams delivering production fixes and technical debt remediation, not only operational monitoring
- Proven track record supporting and stabilizing large-scale digital/commerce platforms with high transaction volumes and direct customer impact
- Experience managing fast-paced 24x7 environments, demonstrating adaptability and confident decision-making
- Strong technical background in cloud platforms (AWS, hybrid/on-prem clusters), container orchestration (Docker, Kubernetes, DKP), and microservices
- Deep understanding of SOA principles and Web Services
- Proficiency in scripting: Bash, Python, JavaScript
- Experience running and scaling commerce platforms (preferably SAP Hybris or equivalent)
- Advanced knowledge of observability, performance engineering, telemetry, automation, and incident management frameworks
- Ability to personally dive into logs, code, and dashboards during critical incidents
- Strong troubleshooting, root-cause analysis, and application design skills
- Demonstrated ability to lead through crisis situations with composure, speed, and clear communication
Knowledge and Skills:
- Technical Depth & Breadth: Mobile, web, backend, and commerce systems at enterprise scale
- Leadership by Example: Hands-on, willing to engage directly with engineers in solving problems
- Strategic Thinking: Ability to drive long-term improvements while ensuring short-term incident readiness
- Maintenance & Communication: Experience managing ongoing maintenance programs and crafting technical communications
- Engineering Collaboration: Skilled at bridging operations and engineering to ensure production issues are treated as high-priority deliverables
- Communication: Executive presence with the ability to brief leadership clearly during outages
- Global Experience: Skilled at leading distributed teams and managing vendor partnerships
- Resiliency Mindset: Comfortable with 24/7 operational accountability, especially during major incidents
Financial Responsibilities:
- Own and manage the Operational Expenditure (OPEX) budget for Digital Operations, ensuring efficient allocation of resources while balancing reliability, scalability, and cost optimization
- Provide transparency into operational spend through regular reporting and executive updates
- Partner with Finance and Procurement to negotiate, track, and optimize vendor contracts and third-party services
- Ensure budget discipline while identifying opportunities for automation and efficiency improvements to reduce operational costs without compromising reliability
Working Conditions:
- Global role requiring flexible availability to lead and engage directly in critical incidents outside of standard business hours
- Domestic and international travel may be required to support operations and vendor partners
We know there's a lot to consider. As you go through the application process, our recruiters will be glad to provide guidance and more relevant details to answer any additional questions. Thank you again for your interest in Royal Caribbean Group. We hope to see you onboard soon!
It is the policy of the Company to ensure equal employment and promotion opportunity to qualified candidates without discrimination or harassment on the basis of race, color, religion, sex, age, national origin, disability, sexual orientation, sexuality, gender identity or expression, marital status, or any other characteristic protected by law. Royal Caribbean Group and each of its subsidiaries prohibit and will not tolerate discrimination or harassment.
Nearest Major Market: Miami
Language skills
- English
This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.