Spot Your Leaders & Consulting
Website: spotyourleaders.com
Job details:
Job Description: Site Reliability Engineer (SRE)
(Notice period: immediate or a maximum of 30 days)
- Total experience: 9-15 years
- Experience with or exposure to chaos engineering or resilience testing is required.
- Hands-on experience in Python/Bash
- Hands-on experience in Ansible, GitLab, CI/CD, GitLab Pages, Jenkins, Terraform
- Hands-on experience in Azure
Required Skills & Experience
- Strong experience in Core SRE practices, including reliability engineering, incident management, and automation.
- Proven hands-on experience in Performance Engineering / Performance Testing for large-scale distributed systems.
- Deep understanding and implementation experience with SLI / SLO / Error Budget frameworks.
- Hands-on experience with containerization and orchestration (Docker, Kubernetes).
- Strong background in monitoring, observability, and logging, with tools such as Prometheus, Grafana, Datadog, Splunk, and the ELK Stack.
- Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
- Proficiency in scripting and automation using Python, Bash, Terraform, and Ansible.
- Strong troubleshooting skills across application, infrastructure, and network layers.
- Experience designing and running incident response and post-mortem reviews.
- Ownership mindset with accountability for service reliability and customer outcomes.
- Excellent communication, collaboration, and stakeholder management skills.
Nice to Have (SRE+ Skills)
- Experience with Keptn or similar tools for automated SLO-based quality gates and continuous delivery.
- Programming experience in Java, especially for debugging, performance profiling, or building automation tools.
- Familiarity with chaos engineering practices and tools.
- Experience working in banking, payments, or capital markets domains.
- Knowledge of security best practices and regulatory compliance in enterprise environments.
Responsibilities
What You Will Be Doing
Core SRE & Reliability Engineering
- Design, implement, and operate highly available, resilient, and scalable systems aligned with SRE best practices.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance reliability and delivery velocity.
- Build and maintain service health dashboards to provide real-time visibility into platform stability and customer experience.
- Reduce toil through extensive automation of operational workflows, alerts, and remediation activities.
Monitoring, Observability & Service Health
- Design and maintain end-to-end monitoring and observability solutions covering infrastructure, applications, APIs, and user journeys.
- Implement advanced alerting strategies to reduce noise and improve mean time to detect (MTTD) and mean time to resolution (MTTR).
- Leverage metrics, logs, and traces to drive root cause analysis and proactive incident prevention.
- Enable reliability reporting for stakeholders using SLO compliance and service health metrics.
Performance Engineering & Testing
- Lead performance engineering initiatives, including load testing, stress testing, endurance testing, and capacity validation.
- Identify performance bottlenecks across application, middleware, database, and infrastructure layers.
- Conduct capacity planning and performance tuning to support business growth and peak traffic scenarios.
- Partner with development and QA teams to embed performance testing into CI/CD pipelines.
Incident Management & Operations
- Lead and participate in incident response activities, including triage, mitigation, recovery, and post-incident reviews.
- Drive blameless post-mortems and ensure corrective actions are tracked to completion.
- Participate in on-call rotations, providing 24x7 support for critical production systems.
- Continuously improve operational readiness and resilience.
Automation, CI/CD & Cloud Operations
- Design and manage deployment pipelines, configuration management, and environment consistency across lower and production environments.
- Implement Infrastructure as Code (IaC) practices for repeatable and secure cloud provisioning.
- Collaborate with DevOps teams to improve deployment reliability, rollback mechanisms, and release safety.
- Develop and test disaster recovery plans, backup strategies, and failover mechanisms.
Collaboration & Governance
- Work closely with Development, QA, DevOps, Security, and Product teams to align on reliability and performance goals.
- Ensure platforms meet security, compliance, and regulatory requirements common in financial services.
- Act as a reliability and performance advocate throughout the SDLC.