Software Engineering Manager - Site Reliability Center

PNC

United States

About

Software Engineering Manager – Site Reliability Engineering (SRE)
At PNC, our people are our greatest differentiator and competitive advantage in the markets we serve. We are all united in delivering the best experience for our customers. We work together each day to foster an inclusive workplace culture where all of our employees feel respected, valued and have an opportunity to contribute to the company's success.

As a Software Engineering Manager for PNC's Site Reliability Engineering Center, you will work within PNC's Information Technology Group, manage the daylight shift, and be located at one of our IT Hubs: Cleveland, Ohio; Birmingham, Alabama; Pittsburgh, Pennsylvania; Dallas, Texas; Denver, Colorado; or Phoenix, Arizona.

The Site Reliability Center (SRC) is focused on establishing a culture of operational excellence by ensuring infrastructure, platforms, and applications adhere to SRC onboarding standards that improve reliability, enable proactive issue resolution, and reduce customer impact. This role supports the vision of building a collaborative technology organization across application, infrastructure, and security teams to deliver a stable, reliable, and secure environment. Key responsibilities include driving customer-centric service improvements, implementing proactive and preventative reliability practices, fostering cross-functional collaboration, enhancing monitoring and observability capabilities, promoting a blameless culture of continuous learning, and reducing operational toil through automation. The ideal candidate will help improve service performance, strengthen operational resiliency, and advance automation and observability initiatives that enhance the overall customer experience.

As a Software Engineering Manager – Site Reliability Engineering (SRE), you will lead a team responsible for ensuring the reliability, scalability, and operational excellence of mission-critical platforms that power PNC's digital experiences.
This role blends technical leadership, hands-on problem solving, and people management, driving both production stability and continuous improvement across complex distributed systems.

You will:

Manage SRE and related teams:
• Lead, coach, and develop a team of SRE engineers.
• Set clear goals, drive accountability, and foster a culture of ownership and excellence.
• Partner with cross-functional stakeholders to align technology and business objectives.
• Support talent development, performance management, and succession planning.
• Encourage innovation, continuous learning, and DevOps/SRE best practices.

Lead incident management & remediation:
• Manage and actively participate in end-to-end incident response for major (P1/P2) incidents.
• Guide real-time triage, diagnostics, and troubleshooting across application, infrastructure, and network layers.
• Ensure rapid execution of remediation actions and service restoration.
• Provide clear, timely communication to stakeholders during incidents.
• Oversee post-incident analysis, reporting, and documentation to drive improvements.

Provide technical leadership in production support:
• Serve as an escalation point for complex production issues.
• Guide troubleshooting across applications, infrastructure (Linux/Windows), databases (Oracle, SQL), middleware, and integrations.
• Ensure efficient log, metric, and system analysis.
• Oversee batch/ETL monitoring and recovery processes.
• Foster strong collaboration across engineering, infrastructure, and vendor teams.

Drive problem management & root cause resolution:
• Lead root cause analysis (RCA) efforts for major and recurring incidents.
• Ensure ownership and resolution of problem records.
• Drive permanent fixes and systemic improvements to eliminate repeat issues.
• Identify trends and patterns to reduce risk and improve stability.
• Partner with engineering teams to resolve code defects and system gaps.
• Promote knowledge sharing via runbooks, knowledge articles, and error catalogs.

Oversee change management & release execution:
• Ensure safe and compliant execution of production changes and releases.
• Validate change readiness, testing, rollback strategies, and risk assessments.
• Represent the team in CAB reviews, providing technical risk evaluation.
• Oversee post-implementation reviews (CPIR) and ensure follow-through.
• Drive improvements in change success rate and reduction in production defects.

Advance monitoring, alerting & observability:
• Lead efforts to build and optimize monitoring, dashboards, and alerting frameworks.
• Champion use of tools such as Dynatrace, BigPanda, Logscale, and enterprise platforms.
• Improve signal-to-noise ratio through alert tuning.
• Enable proactive issue detection before customer impact.
• Strengthen event management and observability practices.

Champion resiliency, stability & availability:
• Lead efforts to ensure high availability of critical systems.
• Oversee disaster recovery, failover, and continuity testing.
• Identify and eliminate single points of failure.
• Drive improvements in MTTR, uptime, and service reliability.

Enable scalability & performance optimization:
• Guide capacity planning and performance tuning strategies.
• Ensure systems scale effectively under peak demand.
• Partner with development teams for performance-driven design improvements.
• Optimize system configurations to improve efficiency and throughput.

Lead a 24x7 production support model:
• Manage team participation in a 24x7 on-call rotation.
• Oversee engagement in incident bridges, war rooms, and escalations.
• Support pod-based operating models aligned to key applications.
• Ensure seamless handoffs and global support continuity.

Drive automation & operational efficiency:
• Identify and prioritize opportunities to reduce manual effort through automation.
• Implement automation across incident remediation, monitoring and alerting, and deployment and validation.
• Promote standardized runbooks and automation frameworks.
• Improve operational metrics and reduce toil.

Ensure governance, risk & compliance:
• Maintain adherence to enterprise policies and regulatory standards.
• Support audits, vulnerability remediation, and risk controls.
• Ensure accurate documentation and operational procedures.
• Champion security, access management, and data governance practices.

Qualifications:
• 5+ years of related experience and 3+ years of management experience.
• Strong experience in Site Reliability Engineering, Production Support, or DevOps.
• Proven ability to lead teams in high-availability, enterprise environments.
• Deep understanding of incident, problem, and change management frameworks.
• Hands-on knowledge of monitoring tools, cloud/infrastructure platforms, and automation.
• Experience improving system reliability, observability, and operational maturity.
• Strong communication skills with the ability to lead during high-pressure situations.
• Experience with OCP (infrastructure: Linux/Windows, OCP) and with MongoDB and Cassandra (databases: Oracle, SQL, MongoDB, Cassandra), along with working knowledge of Elasticsearch, Redis, MQ, and Kafka, is a plus.

Language skills

English
