About
Staff Infrastructure Architect:
Responsibilities:
- Cloud Solution Design: Architect reliable, scalable, and secure cloud systems (public, private, hybrid, multi-cloud) that meet current and future requirements.
- Governance & Best Practices: Define cloud governance, cost optimization, and security frameworks; establish and enforce cloud best practices across teams.
- Technology Evaluation: Recommend and select cloud providers and tools suitable for enterprise needs (e.g., AWS, Azure, GCP).
- Facilitate organizational adoption of the cloud and manage migration and modernization initiatives, along with the cultural and process changes they often require.
- Evaluate, recommend, and responsibly integrate AI-enhanced tooling into platform engineering workflows to improve efficiency, delivery quality, and operational intelligence.
- Lead efforts with service owners to establish SLOs, define meaningful SLIs, and ensure systems consistently meet or exceed SLAs.
- Champion and enforce security best practices across infrastructure, services, and operational workflows.
- Reduce operational toil and drive a high degree of automation through Infrastructure-as-Code and GitOps-based delivery patterns.
- Define, document, and advocate for CI/CD standards, governance models, and platform-wide best practices that support modern, scalable, and resilient delivery of cloud-based solutions.
- Provide architectural leadership for large-scale, distributed systems, ensuring that infrastructure is scalable, reliable, and cost-effective.
- Drive the design, implementation, and ongoing optimization of our cloud-based infrastructure, focusing on containers (Kubernetes) and virtual machine environments.
- Lead the design and implementation of cloud networking solutions for production and development environments, ensuring scalability, high availability, and security across multi-cloud and hybrid environments.
- Lead the technical strategy and roadmap for large-scale, cross-functional SRE initiatives that improve reliability, scalability, and developer productivity across enterprise systems.
- Drive automation and observability improvements, using metrics and analytics to enhance performance, reliability, and efficiency.
- Collaborate across Cloud, Platform, Security, and AI/ML teams to implement modern SRE components that ensure high availability and secure operations.
- Analyze and troubleshoot complex systems, championing best practices in system design, incident management, and postmortem analysis.
- Mentor and influence engineers across teams, fostering technical excellence and a culture of reliability engineering.
Requirements:
- 7+ years of experience in engineering, with at least 3 years in a leadership or management role overseeing teams focused on infrastructure, SRE, cloud platforms, and DevOps practices.
- Proven expertise in managing large-scale, distributed infrastructure systems, including containerized environments (primarily Kubernetes), virtual machines, and cloud services (AWS, GCP).
- Expertise in cloud networking, including VPCs, DNS, routing, security groups, VPNs, load balancing, and inter-cloud connectivity, plus experience with Istio and service mesh technologies, including traffic management, observability, mTLS, and security features.
- Proficiency in infrastructure automation and configuration management tools such as Terraform and Ansible.
- Experience with ArgoCD and GitOps principles for automating Kubernetes-based deployments and managing infrastructure as code (IaC).
- 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Cloud Architect roles.
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- Strong proficiency in programming languages such as Python, TypeScript, JavaScript, or Go, with a focus on automation and infrastructure-as-code.
- Experience with infrastructure-as-code tools such as AWS CDK, AWS CloudFormation, Terraform, or Crossplane.
- Solid understanding of OpenTelemetry or other observability implementations at scale.
- Deep expertise in systems architecture, networking, Kubernetes, and public cloud services (AWS, Azure, or GCP).
- Outstanding problem-solving, communication, and teamwork skills, with the ability to influence across technical and interpersonal boundaries.
- Architect, develop, and maintain peering and transit infrastructure, including routing policies, eDNS, cloud connectivity, CGNAT, and load balancer fleets.
- Leverage Python scripting to automate repetitive engineering tasks and proactively mitigate customer impacts.
- Manage and troubleshoot DNS infrastructure to ensure reliable performance.
- Oversee and resolve issues related to cloud VPCs and connected network hardware.
- Diagnose and resolve complex TCP/IP issues to maintain seamless network operations.
- Collaborate with cross-functional teams to enhance infrastructure efficiency and support xAI's AI platforms.
- 10+ years operating or architecting enterprise SaaS systems in cloud environments.
- Awareness of AI-driven engineering and operational capabilities, including opportunities to leverage large language models (LLMs), automation, and intelligent tooling to improve developer productivity, reliability, and incident response.
- Expert-level experience designing, deploying, and scaling container-based microservices platforms (e.g., Kubernetes/GKE, Docker, service mesh, observability, automated deployments).
- Strong foundational knowledge of Linux/Unix systems, networking, and security protocols (HTTPS, TCP/IP, TLS/PKI, DNS, SMTP, SSH, NTP, etc.).
- 7+ years hands-on with public cloud platforms and 3+ years focused on Google Cloud Platform (GCP).
- High proficiency in scripting and automation using Bash and/or Python.
- Experience with modern code workflows, including branching strategies, release processes, and automated quality gates.
- Hands-on experience with Infrastructure as Code (Terraform required; Helm and Ansible strongly preferred).
- 5+ years building and operating observability platforms, including metrics, logging, distributed tracing, and alerting (Prometheus, Grafana, Splunk, New Relic, etc.).
- Experience designing and operating CI/CD pipelines using tools such as Jenkins, GitHub Actions, and/or ArgoCD.
- Experience participating in 24/7 operations, incident response, developing runbooks, and producing high-quality platform documentation.
Language skills
- English
Notice to users
This job posting comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.