Accenture
Website: accenture.com
Job details:
Program Manager – Site Reliability Engineering (Cloud Native Platform Team)
Role Summary
The Program Manager will lead day-to-day operations of the Site Reliability Engineering (SRE) team, ensuring alignment with organizational goals for reliability, scalability, and operational excellence. The role requires a strong technical background in SRE practices and proven program management expertise to drive cross-functional initiatives, optimize processes, and deliver measurable business and operational value.
Key Responsibilities
Operational Leadership:
- Drive adoption of SRE best practices such as error budgets, SLIs/SLOs, and automation to reduce toil.
- Ensure compliance with security, privacy, and regulatory standards in all reliability initiatives.
Program Management:
- Define program scope, objectives, and success criteria for reliability initiatives.
- Develop and maintain quarterly roadmaps for SRE projects in collaboration with platform engineering teams.
- Track progress, risks, and dependencies across multiple projects using tools like JIRA and Confluence.
- Facilitate communication between SRE, development, and leadership teams to ensure transparency and alignment.
Performance Measurement:
- Establish and monitor KPIs for reliability and operational efficiency.
- Prepare executive dashboards and reports to translate technical metrics into business impact narratives.
- Lead continuous improvement initiatives based on data-driven insights.
Stakeholder Engagement:
- Act as the primary liaison between SRE and other teams (Product, Engineering, and Delivery-SOC).
- Influence decision-making at all levels through clear communication and structured reporting.
Performance Measurement Parameters
Incident Metrics:
- Mean Time to Detect (MTTD)
- Mean Time to Respond (MTTR, time to begin the response)
- Mean Time to Recovery (MTTR, time to restore service)
- Incident Frequency and Severity
Change Management:
- Change Failure Rate
- Change Success Rate
Reliability Metrics:
- System Uptime / Availability
- Service Level Objective Achievement Percentage
Operational Efficiency:
- Automation Rate
- On-call Burden Reduction
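As a rough illustration of how several of the parameters above are typically derived, the sketch below computes MTTD, Mean Time to Recovery, Change Failure Rate, and availability from incident timestamps. The `Incident` record, its field names, the sample data, and the choice to measure recovery from the start of the incident are assumptions made for this example; they are not part of the role definition or of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, assumed for illustration only.
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between fault start and detection."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average gap between fault start and restoration (assumed definition)."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of changes that caused a failure; success rate is 1 minus this."""
    return failed_changes / total_changes


def availability(window: timedelta, downtime: timedelta) -> float:
    """Fraction of the reporting window during which the service was up."""
    return 1 - downtime / window


# Sample data, invented for the example.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 5),
             datetime(2024, 1, 5, 10, 35)),
    Incident(datetime(2024, 1, 18, 14, 0), datetime(2024, 1, 18, 14, 15),
             datetime(2024, 1, 18, 14, 45)),
]

if __name__ == "__main__":
    print("MTTD:", mean_time_to_detect(incidents))
    print("Mean Time to Recovery:", mean_time_to_recovery(incidents))
    print("Change Failure Rate:", change_failure_rate(3, 60))
    print("Availability:", availability(timedelta(days=30), timedelta(minutes=80)))
```

Teams differ on where each clock starts and stops (for example, measuring recovery from detection rather than from fault start), so the definitions would need to be agreed with stakeholders before the numbers are reported.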
Measurement Matrix for Leadership Presentation
Use a dashboard approach combining:
- Golden signals: Latency, Traffic, Errors, Saturation
- Monthly/quarterly SLO trends
- Incident heatmaps highlighting root causes and resolution times
- Business impact metrics: cost savings, risk reduction, and ROI from reliability improvements
Tools: Datadog
Experience Requirements
Technical Background:
- Prior hands-on experience as a Site Reliability Engineer or in DevOps roles.
- Strong understanding of cloud-native architectures (Kubernetes, microservices, distributed systems).
Program Management Expertise:
- 5+ years in program or technical project management.
- Proven ability to manage cross-functional initiatives in fast-paced environments.
- Familiarity with Agile methodologies and tools (JIRA, Confluence).
Leadership & Communication:
- Experience presenting technical and operational metrics to executive leadership.
- Strong stakeholder management and negotiation skills.
Certifications: PMP, SAFe, or SRE Foundation (SAFe preferred)