Key Responsibilities:
Lead and mentor a team of SREs, fostering a culture of operational excellence and continuous improvement.
Develop and implement SRE best practices, including monitoring, alerting, and incident response.
Design and implement scalable, highly available, and resilient architectures.
Collaborate with engineering teams to improve system performance and reliability and to plan capacity.
Drive automation efforts to reduce manual work and increase efficiency.
Establish and enforce SLAs, SLOs, and error budgets to balance reliability with development velocity.
Lead incident management, root cause analysis, and post-mortem processes.
Work with security teams to ensure that infrastructure and operations meet compliance requirements and follow security best practices.
Evaluate and implement new tools, technologies, and methodologies to enhance reliability and efficiency.
Qualifications & Experience:
8+ years of experience in software engineering, DevOps, or site reliability engineering.
3+ years of experience in a leadership or managerial role.
Strong expertise in cloud platforms such as AWS, GCP, or Azure.
Hands-on experience with infrastructure as code (IaC) tools like Terraform, CloudFormation, or Ansible.
Proficiency in programming/scripting languages such as Python, Go, or Bash.
Experience with containers and container orchestration (e.g., Docker, Kubernetes).
Deep understanding of monitoring, logging, and observability tools (e.g., Prometheus, Grafana, ELK, Datadog).
Expertise in CI/CD pipelines, automation, and deployment strategies.
Strong problem-solving skills with a data-driven and analytical approach.
Excellent communication and leadership abilities.
Preferred Qualifications:
Experience with large-scale distributed systems and microservices architectures.
Strong understanding of networking, security, and performance optimization.
Knowledge of database reliability, including SQL and NoSQL databases.
Prior experience in an organization with high-traffic, mission-critical applications.