Principal Site Reliability Engineer
- California, Maryland, United States
About
Location/Time zone requirements:
Must be based in the San Francisco Bay Area, with weekly visits to the client's headquarters.
About Virtasant
Virtasant is a fast-growing global consultancy transforming how technology services are delivered. We are a diverse team of cloud experts, builders, and operators. Since 2006, we've helped large enterprises thrive in the public cloud — optimizing cost, scaling infrastructure, migrating legacy systems, and building cloud-native products.
We take an AI-first mindset and are big on FinOps, software engineering, product development, and technology operations. Our outcome-driven model helps enterprises solve complexity in the cloud, build efficient systems, and unlock real business value.
About the Role
We are looking for a Principal-Level Site Reliability Engineer (Operations) to provide hands-on, day-to-day operational support for one of our largest global clients. This role is not a leadership or people-management position; it is a senior individual contributor SRE role focused on incident response, system diagnostics, dashboard monitoring, operational maintenance, and ensuring platform reliability.
You will be directly responsible for keeping critical systems healthy, resolving incidents, improving operational workflows, and working with engineering teams to maintain high reliability across large-scale distributed systems.
If you're a senior SRE who enjoys solving problems in the system, not managing teams or driving strategy, this is the right role.
What You Will Do
Operational SRE Responsibilities
- Monitor dashboards, alerts, and system health in real time.
- Respond to incidents quickly and decisively, driving issues to resolution.
- Perform root-cause analysis and contribute to post-incident reviews.
- Troubleshoot complex system and infrastructure issues across distributed environments.
- Maintain and improve runbooks, playbooks, and operational documentation.
- Support and enhance the observability tooling used for metrics, logs, and alerting.
- Work cross-functionally with engineering teams to escalate system-level issues when required.
Systems Reliability & Maintenance
- Run routine operational checks to ensure platform stability.
- Tune alerts, update dashboards, and ensure monitoring accuracy.
- Identify recurring operational issues and recommend improvements.
- Implement small automation and scripting solutions to improve operational workflows.
- Keep services running smoothly through proactive maintenance.
Collaboration & Communication
- Partner with Engineering, SRE, and Product teams to ensure transparent communication during incidents.
- Provide clear, concise updates and documentation for operational work.
- Participate in shift patterns or rotational incident coverage depending on client needs.
What You Bring
Essential Experience
- 5–10+ years in SRE, Production Operations, or Infrastructure Engineering roles.
- Strong hands-on experience troubleshooting distributed systems in production.
- Proficiency in Linux fundamentals, including process management, networking, storage, and diagnostics.
- Solid understanding of cloud-native architectures, containers, and modern infrastructure tooling.
- Experience with:
  - Monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog)
  - Incident management workflows
  - Root-cause analysis / postmortems
  - CI/CD operational processes
Technical Skills
- Strong Linux debugging and performance troubleshooting skills.
- Familiarity with Kubernetes, containers, or cloud-native runtime environments.
- Ability to write or modify scripts (Python, Bash, or similar) for operational automation.
- Hands-on experience with logs, metrics, traces, and alert lifecycle management.
Soft Skills
- Calm, structured decision-making under pressure.
- Excellent communication — clear, concise, and reliable.
- Strong attention to detail and consistency in documentation.
- A proactive, ownership-driven mindset for reliability and operations.
Why Join Virtasant
- Build and lead a new SRE-focused customer success function from day one.
- Work at the intersection of reliability engineering, customer engagement, and cloud transformation.
- Partner with global enterprises on cutting-edge cloud and DevOps programs.
- Join a global, remote-first consultancy with 4,000+ experts across 130 countries.
- Thrive in a culture that values autonomy, agility, and innovation.
Language Skills
- English