FundsIndia
Website: fundsindia.com
Job details:
Role: SRE Lead
Experience: 6 – 7 years
Location: Chennai
This role is critical for strengthening our incident management, observability, and overall reliability practices. The candidate will be responsible for leading incident response, improving monitoring systems, and driving SRE best practices across production environments.
Job Description – SRE Lead
Role Overview
We are looking for a Lead Site Reliability Engineer with 6–7 years of experience to drive reliability, observability, and incident management practices. The ideal candidate will have strong expertise in the Grafana stack, production monitoring, and handling critical incidents in high-availability systems.
Key Responsibilities
- Act as the Incident Commander during production outages, ensuring timely resolution and stakeholder communication
- Lead incident response, triage, RCA (Root Cause Analysis), and postmortems
- Build and enhance observability systems using the Grafana stack (Prometheus, Loki, Tempo)
- Define and manage SLIs, SLOs, and SLAs for critical services
- Develop and maintain monitoring, alerting, and dashboards for proactive issue detection
- Collaborate with Dev, Infra, and DB teams to improve system reliability and performance
- Drive automation and runbook creation to reduce manual intervention
- Improve on-call processes and incident management workflows
- Ensure high availability, scalability, and fault tolerance of systems
Required Skills
- 6–7 years of experience in Site Reliability Engineering / Production Support
- Strong hands-on experience with the Grafana stack (Prometheus, Loki, Tempo)
- Solid understanding of monitoring, alerting, and observability principles
- Experience in incident management and handling P1/P2 incidents
- Knowledge of cloud platforms (AWS)
- Experience with Linux systems and troubleshooting
- Familiarity with Kubernetes / containerized environments
- Strong scripting skills (Python / Bash)