Staff Platform Engineer (Remote)

Radimal

New York, New York, United States

Apply Now

About

Radimal is a veterinary radiology and AI diagnostics platform delivering 24/7 imaging insights to hospitals nationwide. We combine board-certified radiologists with advanced AI to support real-time clinical decision-making for patients when it matters most. Our platform spans high-throughput medical imaging, GPU-backed inference, global distribution, and enterprise-grade reliability. As we scale, we’re investing in senior platform ownership to make the system safer, more predictable, and easier for engineers to build on.

Why This Role Exists

Radimal has grown quickly. While the platform is working, operational ownership has been too diffuse. Reliability, on-call clarity, and platform standards need a single senior owner who can reduce noise, establish guardrails, and make the system more predictable as we scale. This role exists to bring focus, ownership, and calm to the platform layer.

The Role

We’re hiring a Staff Platform Engineer to own the technical foundations that enable Radimal’s engineering teams to move quickly and reliably. This is a senior, hands-on role with real accountability for platform architecture, infrastructure, and production systems. You’ll own DevOps, reliability, and on-call systems, with authority to investigate and diagnose issues across the full stack.

You will not be expected to do everything at once. Success comes from establishing ownership, setting priorities, and making the system more predictable over time.

A core part of this role is operational containment and reliability ownership. You’ll reduce operational burden on product and AI teams by owning platform standards, tooling, and reliability so others can focus on building.

You’ll work closely with the CEO and VP of Engineering on platform strategy, architectural tradeoffs, and operational risk, while maintaining clear ownership of production systems. This role is for someone who wants true ownership and influence through execution, not advisory distance.

What You Will Own

- Own the core platform foundations that support all product and AI development
- Build shared infrastructure, libraries, and patterns that make it easier to ship safely
- Establish clear interfaces and ownership boundaries so teams can move independently
- Improve developer experience through better CI/CD, local tooling, and observability
- Raise the overall operational maturity of the engineering organization

Infrastructure and Cloud

- Own and evolve Radimal’s AWS and Terraform footprint
- Lead deployments across ECS, Fargate, EC2, containerized services, and GPU workloads
- Manage and improve workloads running on Render and Modal
- Make architectural decisions for scale, reliability, and cost efficiency
- Reduce operational burden on product and AI engineers by owning reliability and tooling
- Create guardrails that increase safety without slowing development
- Enable engineers to self-serve infrastructure and diagnostics where appropriate

Reliability, On-Call, and Operations

- Own production uptime, SLOs, and operational health
- Design and own on-call coverage and escalation models
- Serve as senior escalation during incidents while building systems that minimize the need for escalation
- Lead incident response and post-incident reviews with clear accountability
- Eliminate ambiguity around who owns production at all times

Over time, success in this role means fewer incidents, fewer escalations, and a platform that largely runs without heroics.

Observability and Performance

- Operate and extend Grafana and Prometheus monitoring stacks
- Improve alerting, diagnostics, and operational visibility
- Build high-availability and fault-tolerant architectures
- Implement caching, CDN, and performance strategies for global scale
- Investigate production issues across infrastructure, backend services, data pipelines, AI inference workflows, and frontend behavior
- Trace request flow end to end across GraphQL APIs, Python services, and React applications
- Read and debug React code as needed to understand client-side behavior and API usage
- Form and test hypotheses during incidents to drive fast, accurate resolution
- Know when to dive deep personally and when to pull in specialists

ML Ops and AI Platform Support

- Understand ML Ops fundamentals including model deployment, versioning, and monitoring
- Support GPU-backed inference workloads and AI service reliability
- Partner with AI engineers to ensure models are observable, debuggable, and production-ready
- Identify and mitigate operational risks related to model performance, latency, and failures

Leadership and Collaboration

- Partner with the CEO and VP of Engineering on platform strategy and architectural tradeoffs
- Provide clear, grounded assessments of platform risk and readiness
- Act as a trusted technical owner during high-impact decisions and incidents
- Align platform reliability with product and business goals

Security and Compliance

- Strengthen infrastructure security and access controls
- Support enterprise security reviews, penetration testing, and SOC 2 readiness
- Improve auditability, monitoring, and operational hygiene

What We’re Looking For

- 7+ years operating production systems at scale
- Strong Python experience for automation and backend tooling
- Deep AWS experience (ECS, Fargate, EC2, ECR, RDS, CloudFront, IAM)
- Strong Terraform, Docker, CI/CD, and infrastructure-as-code expertise
- Hands-on experience with Grafana and Prometheus
- Experience with Postgres and modern backend architectures
- Strong understanding of distributed systems, caching, and performance
- Comfort debugging across GraphQL APIs, Python services, and React frontends
- Working knowledge of ML Ops concepts and AI inference systems
- Clear communicator with a strong ownership mindset

Bonus Experience

- Medical imaging or DICOM workflows
- GPU compute, AI inference, or ML pipeline integration
- Enterprise security reviews or penetration testing
- GraphQL or Hasura-based platforms

Why Join Radimal

- High ownership of a mission-critical clinical platform
- Deep technical challenges with real patient impact
- Clear mandate and authority to shape how engineering works
- Fully remote team with high trust and low bureaucracy
- Opportunity to grow into broader technical leadership over time

Language Skills

English

Notice to Users

This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.
