Datadog

Software Engineer - Production Practices

  • Niort, Nouvelle-Aquitaine, France

About

Production Practices is a new team in our SRE organization whose mission is to steward production readiness and support engineers as they implement best practices around reliability and operational excellence. We advocate on behalf of engineers to improve the overall developer experience of building resilient services at Datadog, and support the company by identifying, tracking, and mitigating systemic reliability risks.

At Datadog, we place value in our office culture - the relationships and collaboration it builds and the creativity it brings to the table. We operate as a hybrid workplace to ensure our Datadogs can create a work-life harmony that best fits them.

What you’ll do:

  • Steward production readiness for the company by setting the technical direction of reliable and sustainable production practices.
  • Perform readiness reviews for new services when they launch to production, and write software to automate the process.
  • Work with teams across Datadog to help them implement best practices to build resilience and operational excellence. This involves running training workshops for SRE concepts and writing reliability bulletins that solve specific engineering problems.
  • Identify sources of friction for engineers running a service in production and advocate on their behalf by building golden paths and tools to support healthy production practices. Collaborate with other infrastructure teams to improve the developer experience and lower the burden of launching services.
  • Help the organization identify, track, and mitigate emergent risks. Where necessary, participate in cross-functional squads to solve complex reliability problems.
  • Collaborate with the Chaos Engineering team on large-scale game days which involve injecting faults into our stack. Follow up with engineering teams after these events to ensure risks are identified and fixes are adopted.

Who you are:

  • Around 5 years of experience working with distributed systems. A lot of our work involves reviewing unfamiliar services before they launch, so we rely on our people’s strong systems thinking and familiarity with common production patterns.
  • Strong interest in training and helping upskill others. We aim to help build resilience through distributing best practices, so prior experience teaching others or writing developer documentation is a plus.
  • Good coding skills in Go and/or Python. We automate as many manual processes as possible and use code to scale our impact.
  • Empathy, collaboration, and communication skills in English to work remotely with people across teams. Our goal is to improve the developer experience of launching and operating services in production, and to do this we need empathy for others that’s sourced from real-world experience.
  • Willingness to jump into new codebases and unknown systems and quickly ramp up. We’re looking for people who are excited by challenges and who use their determination to cut through ambiguity and help scope it.

Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you’re passionate about technology and want to grow your skills, we encourage you to apply.

Benefits and Growth:

  • New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
  • Continuous professional development, product training, and career pathing
  • Intradepartmental mentor and buddy program for in-house networking
  • An inclusive company culture, with the ability to join our Community Guilds (Datadog employee resource groups)
  • Access to Inclusion Talks, our internal panel discussions
  • Free, global mental health benefits for employees and dependents age 6+
  • Competitive global benefits

Ideal skills

  • Go
  • Python

Professional experience

  • Site Reliability (SRE)

Language skills

  • English