Bean HR Consulting
Website: beanhr.com
Job details:
Job Title: Site Reliability Engineer (SRE)
Job Overview
We are looking for a skilled Site Reliability Engineer (SRE) to join our team and help build, scale, and maintain highly reliable and resilient cloud-based systems. The ideal candidate will have a strong foundation in cloud infrastructure, automation, observability, and incident management, with a focus on improving system reliability and performance.
Key Responsibilities
- Design, build, and maintain highly available and scalable cloud infrastructure (primarily on Azure).
- Implement and manage Infrastructure as Code (IaC) using tools like Terraform, Helm, or Ansible.
- Develop and optimize CI/CD pipelines with integrated security and quality checks.
- Deploy, manage, and orchestrate containerized applications using Kubernetes and Docker.
- Establish and enhance observability practices including monitoring, logging, tracing, and alerting.
- Collaborate with development teams to define SLIs and SLOs and to implement effective alerting strategies.
- Participate in on-call rotations, respond to production incidents, and perform root cause analysis (RCA).
- Continuously improve system reliability, availability, and performance through automation and best practices.
- Drive a culture of reliability, scalability, and operational excellence.
Required Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 3–8 years of experience as a Site Reliability Engineer or in a similar role.
- Hands-on experience with cloud platforms such as Azure (preferred) or AWS.
- Strong experience with Infrastructure as Code (Terraform preferred).
- Proficiency in scripting languages such as Python, Bash, or PowerShell.
- Experience with CI/CD tools and automation pipelines.
- Solid understanding of containerization and orchestration (Docker, Kubernetes).
- Experience in incident management, on-call support, and root cause analysis.
Preferred Qualifications
- Experience with observability tools such as Grafana, Prometheus, and the ELK Stack.
- Familiarity with on-call and incident management tools like PagerDuty or Zenduty.
- Experience defining and managing SLIs, SLOs, and SLAs.
- Knowledge of security best practices in CI/CD and cloud environments.
Key Skills
- System Reliability & High Availability
- Observability & Monitoring
- Incident Response & Troubleshooting
- Automation & Infrastructure as Code
- Cloud & Container Technologies
- Collaboration & Communication
Work Model
Click Apply to learn more.