Senior System Reliability Engineer
Lirio
- Saint Paul, Illinois, United States
About
The Senior System Reliability Engineer (SRE) at Lirio is responsible for the reliability, scalability, and performance of our cloud-native applications and infrastructure. This role leads the design and implementation of automation, monitoring, and incident response processes, and mentors other engineers in SRE best practices. The Senior SRE partners with development teams to ensure robust, secure, and highly available systems, and drives continuous improvement in operational excellence.

This role operates as a senior, hands-on reliability engineer embedded with product and platform teams. The Senior SRE is accountable for defining and enforcing service-level objectives (SLOs), reducing operational toil through automation, and improving system reliability through proactive engineering rather than reactive support. This is not a ticket-driven operations role; the Senior SRE is expected to influence architecture, development practices, and incident readiness across the platform.

Reliability Engineering & Automation (40%)
- Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL).
- Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation).
- Build and optimize CI/CD pipelines for seamless, reliable delivery.
- Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services.
- Identify and reduce operational toil through automation, platform improvements, and architectural changes.
- Analyze and optimize the performance of Lirio systems and services.
- Ensure high availability and scalability of services through proactive engineering, load testing, and capacity planning across multi-tenant and client-specific environments.
- Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness.
- Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows.
- Partner with software engineering teams during design and architecture discussions to identify reliability risks early.

Operational Support & Incident Management (20%)
- Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog).
- Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations.
- Contribute to and maintain incident severity definitions, response procedures, and no-blame postmortem practices.
- Lead incident response, root cause analysis, and postmortems for production issues.
- Triage and resolve issues, ensuring minimal downtime and rapid recovery.
- Support client onboarding and production rollouts by ensuring reliability, observability, and operational readiness standards are met.
- Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices.
- Design processes to share operational knowledge and avoid single points of failure.
- Advise colleagues on architecture and reliability strategies.
- Help establish shared operational ownership across teams to reduce single points of failure and knowledge silos.
- Stay current with industry trends in reliability engineering, cloud operations, and automation.
- Bring innovation to operational practices and system design, evaluating and introducing new tools and technologies as appropriate for Lirio.
- Evaluate new tooling with an emphasis on operational simplicity, security, and long-term maintainability.

Documentation & Process Improvement (5%)
- Define and document operational processes, incident response playbooks, and reliability standards.
- Contribute to operational planning, incident reviews, and reliability documentation.

Qualifications
- 5-7 years of related experience
- Bachelor's degree in a related field
- Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
- Distributed systems debugging and failure analysis
- Load, stress, and fault-injection testing
- CI/CD tools and processes
- Version control (e.g., Git)
- Containers and orchestration (Kubernetes)
- Kafka (messaging/streaming)
- Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
- Databases/SQL
- Observability/monitoring tools (Datadog)

Benefits
- Medical (HSA available)
- Vision
- Short-term and long-term disability (company-paid)
- Life and AD&D (company-paid)
- 401(k) with company match
- 10 paid holidays, quarterly company closure dates, and a holiday-week company closure
- Flexible time off policy
- Work from home
- 6 weeks paid parental leave

Salary range: $130k-$150k
Language skills
- English