
Principal Site Reliability Engineer

Virtasant
  • US
    California, Maryland, United States

About

Location/Time zone requirements:
Must be based in the San Francisco Bay Area, with weekly visits to the client's headquarters.

About Virtasant

Virtasant is a fast-growing global consultancy transforming how technology services are delivered. We are a diverse team of cloud experts, builders, and operators. Since 2006, we've helped large enterprises thrive in the public cloud — optimizing cost, scaling infrastructure, migrating legacy systems, and building cloud-native products.

We take an AI-first mindset and are big on FinOps, software engineering, product development, and technology operations. Our outcome-driven model helps enterprises solve complexity in the cloud, build efficient systems, and unlock real business value.

About the Role

We are looking for a Principal-Level Site Reliability Engineer (Operations) to provide hands-on, day-to-day operational support for one of our largest global clients. This is not a leadership or people-management position; it is a senior individual contributor SRE role focused on incident response, system diagnostics, dashboard monitoring, operational maintenance, and ensuring platform reliability.

You will be directly responsible for keeping critical systems healthy, resolving incidents, improving operational workflows, and working with engineering teams to maintain high reliability across large-scale distributed systems.

If you're a senior SRE who enjoys solving problems in the system, not managing teams or driving strategy, this is the right role.

What You Will Do

Operational SRE Responsibilities

  • Monitor dashboards, alerts, and system health in real time.
  • Respond to incidents quickly and decisively, driving issues to resolution.
  • Perform root-cause analysis and contribute to post-incident reviews.
  • Troubleshoot complex system and infrastructure issues across distributed environments.
  • Maintain and improve runbooks, playbooks, and operational documentation.
  • Support and enhance the observability tooling used for metrics, logs, and alerting.
  • Work cross-functionally with engineering teams to escalate system-level issues when required.

Systems Reliability & Maintenance

  • Run routine operational checks to ensure platform stability.
  • Tune alerts, update dashboards, and ensure monitoring accuracy.
  • Identify recurring operational issues and recommend improvements.
  • Implement small automation and scripting solutions to improve operational workflows.
  • Keep services running smoothly through proactive maintenance.

Collaboration & Communication

  • Partner with Engineering, SRE, and Product teams to ensure transparent communication during incidents.
  • Provide clear, concise updates and documentation for operational work.
  • Participate in shift patterns or rotational incident coverage depending on client needs.

What You Bring

Essential Experience

  • 5–10+ years in SRE, Production Operations, or Infrastructure Engineering roles.
  • Strong hands-on experience troubleshooting distributed systems in production.
  • Proficiency in Linux fundamentals, including process management, networking, storage, and diagnostics.
  • Solid understanding of cloud-native architectures, containers, and modern infrastructure tooling.
  • Experience with:
    • Monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog)
    • Incident management workflows
    • Root-cause analysis and postmortems
    • CI/CD operational processes

Technical Skills

  • Strong Linux debugging and performance troubleshooting skills.
  • Familiarity with Kubernetes, containers, or cloud-native runtime environments.
  • Ability to write or modify scripts (Python, Bash, or similar) for operational automation.
  • Hands-on experience with logs, metrics, traces, and alert lifecycle management.

Soft Skills

  • Calm, structured decision-making under pressure.
  • Excellent communication — clear, concise, and reliable.
  • Strong attention to detail and consistency in documentation.
  • A proactive, ownership-driven mindset for reliability and operations.

Why Join Virtasant

  • Build and lead a new SRE-focused customer success function from day one.
  • Work at the intersection of reliability engineering, customer engagement, and cloud transformation.
  • Partner with global enterprises on cutting-edge cloud and DevOps programs.
  • Join a global, remote-first consultancy with 4,000+ experts across 130 countries.
  • Thrive in a culture that values autonomy, agility, and innovation.

Languages

  • English

Notice for Users

This job comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.