Principal Site Reliability Engineer
Virtasant • California, Maryland, United States

About

Location/Time zone requirements:
Must be based in the San Francisco Bay Area, with weekly visits to the client's headquarters.

About Virtasant

Virtasant is a fast-growing global consultancy transforming how technology services are delivered. We are a diverse team of cloud experts, builders, and operators. Since 2006, we've helped large enterprises thrive in the public cloud — optimizing cost, scaling infrastructure, migrating legacy systems, and building cloud-native products.

We take an AI-first mindset and are big on FinOps, software engineering, product development, and technology operations. Our outcome-driven model helps enterprises solve complexity in the cloud, build efficient systems, and unlock real business value.

About the Role

We are looking for a Principal-Level Site Reliability Engineer (Operations) to provide hands-on, day-to-day operational support for one of our largest global clients. This role is not a leadership or people-management position; it is a senior individual contributor SRE role focused on incident response, system diagnostics, dashboard monitoring, operational maintenance, and ensuring platform reliability.

You will be directly responsible for keeping critical systems healthy, resolving incidents, improving operational workflows, and working with engineering teams to maintain high reliability across large-scale distributed systems.

If you're a senior SRE who enjoys solving problems in the system, not managing teams or driving strategy, this is the right role.

What You Will Do

Operational SRE Responsibilities

Monitor dashboards, alerts, and system health in real time.
Respond to incidents quickly and decisively, driving issues to resolution.
Perform root-cause analysis and contribute to post-incident reviews.
Troubleshoot complex system and infrastructure issues across distributed environments.
Maintain and improve runbooks, playbooks, and operational documentation.
Support and enhance the observability tooling used for metrics, logs, and alerting.
Work cross-functionally with engineering teams to escalate system-level issues when required.

Systems Reliability & Maintenance

Run routine operational checks to ensure platform stability.
Tune alerts, update dashboards, and ensure monitoring accuracy.
Identify recurring operational issues and recommend improvements.
Implement small automation and scripting solutions to improve operational workflows.
Keep services running smoothly through proactive maintenance.

Collaboration & Communication

Partner with Engineering, SRE, and Product teams to ensure transparent communication during incidents.
Provide clear, concise updates and documentation for operational work.
Participate in shift patterns or rotational incident coverage depending on client needs.

What You Bring

Essential Experience

5–10+ years in SRE, Production Operations, or Infrastructure Engineering roles.
Strong hands-on experience troubleshooting distributed systems in production.
Proficiency in Linux fundamentals, including process management, networking, storage, and diagnostics.
Solid understanding of cloud-native architectures, containers, and modern infrastructure tooling.
Experience with:
Monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog)
Incident management workflows
Root-cause analysis / postmortems
CI/CD operational processes

Technical Skills

Strong Linux debugging and performance troubleshooting skills.
Familiarity with Kubernetes, containers, or cloud-native runtime environments.
Ability to write or modify scripts (Python, Bash, or similar) for operational automation.
Hands-on experience with logs, metrics, traces, and alert lifecycle management.

Soft Skills

Calm, structured decision-making under pressure.
Excellent communication — clear, concise, and reliable.
Strong attention to detail and consistency in documentation.
A proactive, ownership-driven mindset for reliability and operations.

Why Join Virtasant

Build and lead a new SRE-focused customer success function from day one.
Work at the intersection of reliability engineering, customer engagement, and cloud transformation.
Partner with global enterprises on cutting-edge cloud and DevOps programs.
Join a global, remote-first consultancy with 4,000+ experts across 130 countries.
Thrive in a culture that values autonomy, agility, and innovation.

Language Skills

English
