Remote - Site Reliability Developer 3 (USC)
- United States
About
The Oracle Health Data & Analytics Platform team offers a rare opportunity to play a critical role in how Oracle Health products impact and disrupt the healthcare industry by transforming how healthcare and technology intersect.
You will have the opportunity to:
Reach billions of people with our products & services
Create technology that truly impacts the world
Have an immediate impact on developing technology
Unlimited growth potential with inspiring work
Work with the best minds in the industry
Enjoy working in an open, diverse, and productive environment
About The Job
This role supports the core data platforms behind Oracle Health’s Data & Analytics Platform. As a Senior Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams.
You will work on the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm. These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.
What You'll Do
Platform Ownership & Technical Leadership
Own the end-to-end reliability, scalability, and operability of shared data platforms
Define platform standards, architectural direction, and operational guardrails
Influence cross-team technical decisions and long-term platform strategy
Drive long-term platform evolution and influence reliability strategy across the data ecosystem
Architecture & Design
Clearly articulate system behavior, dependencies, and failure modes
Make principled trade-offs between reliability, performance, cost, and complexity
Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively
Operations Engineering
Establish capacity models, scaling strategies, and operational best practices
Design platforms that behave predictably under load, failure, and change
Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery
Distributed Systems Expertise
Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades
Security
Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
Treat security as a first-class architectural concern
Automation
Design and evolve an Ansible- and Terraform-driven automation framework
Treat automation as production software: versioned, reviewed, tested, and improved
Eliminate operational toil by encoding reliability and safety into the platform
Incident Leadership & Prevention
Serve as the ultimate escalation point for complex or ambiguous incidents
Focus on eliminating entire classes of failure, not just resolving individual issues
Representation
Represent SRE and platform engineering in high-visibility and sensitive forums
Communicate clearly with engineering leadership and partner teams
Language skills
- English