SRE Program Manager

Location

Pune City, Maharashtra, India

Job Type

Full-time

About the job

About the role

Accenture

Website: accenture.com
Job details:
Program Manager – Site Reliability Engineering (Cloud Native Platform Team)

Role Summary

The Program Manager will run the day-to-day operations of the Site Reliability Engineering (SRE) team and keep them aligned with organizational goals for reliability, scalability, and operational excellence. The role requires a strong technical background in SRE practices and proven program management expertise to lead cross-functional initiatives, optimize processes, and deliver measurable business and operational value.

Key Responsibilities

  • Operational Leadership
      • Drive adoption of SRE best practices such as error budgets, SLIs/SLOs, and automation to reduce toil.
      • Ensure compliance with security, privacy, and regulatory standards in all reliability initiatives.
  • Program Management
      • Define program scope, objectives, and success criteria for reliability initiatives.
      • Develop and maintain quarterly roadmaps for SRE projects in collaboration with platform engineering teams.
      • Track progress, risks, and dependencies across multiple projects using tools like JIRA and Confluence.
      • Facilitate communication between SRE, development, and leadership teams to ensure transparency and alignment.
  • Performance Measurement
      • Establish and monitor KPIs for reliability and operational efficiency.
      • Prepare executive dashboards and reports to translate technical metrics into business impact narratives.
      • Lead continuous improvement initiatives based on data-driven insights.
  • Stakeholder Engagement
      • Act as the primary liaison between SRE and other teams (Product, Engineering, and Delivery-SOC).
      • Influence decision-making at all levels through clear communication and structured reporting.


Performance Measurement Parameters

Incident Metrics:

  • Mean Time to Detect (MTTD)
  • Mean Time to Respond (MTTR, response)
  • Mean Time to Recovery (MTTR, recovery)
  • Incident Frequency and Severity
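
As an illustration of how the incident metrics above might be computed, here is a minimal Python sketch. The posting does not define the metrics precisely, so the sketch assumes one common convention: detection is measured from the start of the fault, response from detection, and recovery from the start of the fault. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; field names are assumptions for illustration.
    started_at: datetime    # when the fault began
    detected_at: datetime   # when monitoring or a person noticed it
    responded_at: datetime  # when someone began working on it
    resolved_at: datetime   # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Return MTTD, mean time to respond, and mean time to recovery in minutes."""
    return {
        "mttd": mean_minutes([i.detected_at - i.started_at for i in incidents]),
        "mean_time_to_respond": mean_minutes(
            [i.responded_at - i.detected_at for i in incidents]
        ),
        "mean_time_to_recovery": mean_minutes(
            [i.resolved_at - i.started_at for i in incidents]
        ),
    }


# Two made-up incidents, both starting at the same reference time.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=8), t0 + timedelta(minutes=60)),
]
metrics = incident_metrics(sample)
```

Teams differ on where each clock starts and stops, so whichever convention is chosen should be stated on the dashboard next to the numbers.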


Change Management:

  • Change Failure Rate
  • Change Success Rate
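
The two change management metrics are complementary. A small sketch, assuming every change in the reporting window is classified as either failed or successful (the function name and the sample counts are hypothetical):

```python
def change_rates(total_changes, failed_changes):
    """Return (change failure rate, change success rate) as fractions of 1."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= failed_changes <= total_changes:
        raise ValueError("failed_changes must be between 0 and total_changes")
    failure_rate = failed_changes / total_changes
    # Under the either/or assumption, success is simply the remainder.
    return failure_rate, 1 - failure_rate


# Made-up numbers: 200 changes deployed in the window, 8 of them failed.
failure_rate, success_rate = change_rates(total_changes=200, failed_changes=8)
```

What counts as a "failed" change (a rollback, a hotfix, an incident linked to the change) is a policy decision that the sketch leaves open.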


Reliability Metrics:

  • System Uptime / Availability
  • Service Level Objective Achievement Percentage
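
To show how availability, SLO achievement, and the error budget relate, here is a minimal sketch for a request-based SLO. It assumes availability is measured as the share of successful requests in a reporting window; the function name, parameters, and sample figures are hypothetical.

```python
def slo_report(total_requests, failed_requests, slo_target):
    """Summarize availability and error-budget use for one reporting window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    availability = 1 - failed_requests / total_requests
    # The error budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "availability": availability,
        "slo_met": availability >= slo_target,
        "error_budget_consumed": failed_requests / allowed_failures,
    }


# Made-up window: one million requests, 400 failures, 99.9% target.
report = slo_report(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
```

In this made-up window the target tolerates about 1,000 failed requests, so 400 failures use roughly 40% of the budget. Tracking that fraction over time yields the kind of monthly or quarterly SLO trend a leadership dashboard would show.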


Operational Efficiency:

  • Automation Rate
  • On-call Burden Reduction


Measurement Matrix for Leadership Presentation

Use a dashboard approach combining:

  • Golden signals: latency, traffic, errors, and saturation.
  • SLO trends: monthly and quarterly SLO achievement.
  • Incident heatmaps: highlighting root causes and resolution times.
  • Business impact metrics: cost savings, risk reduction, and ROI from reliability improvements.

Tools: Datadog

Experience Requirements

Technical Background:

  • Prior hands-on experience as a Site Reliability Engineer or in DevOps roles.
  • Strong understanding of cloud-native architectures (Kubernetes, microservices, distributed systems).


Program Management Expertise:

  • 5+ years in program or technical project management.
  • Proven ability to manage cross-functional initiatives in fast-paced environments.
  • Familiarity with Agile methodologies and tools (JIRA, Confluence).


Leadership & Communication:

  • Experience presenting technical and operational metrics to executive leadership.
  • Strong stakeholder management and negotiation skills.


Certifications: PMP, SAFe, or SRE Foundation (SAFe preferred).

Skills

Agile
compliance
Confluence
cross-functional
Datadog
DevOps
Jira
Kubernetes
microservices
operational metrics
PMP
project management
SRE
uptime