About the role
As an SRE at Andela, you will build and maintain reliable, scalable, and performant infrastructure to support our growing engineering teams and product offerings. You will collaborate with software engineers to design, implement, and operate robust and secure systems that enable rapid development and deployment of new features. You will also handle monitoring, troubleshooting, and incident response to keep our production environments highly available and reliable.
Responsibilities:
- Design, build, and maintain highly available, scalable, and secure infrastructure on cloud platforms using tools like Terraform and Kubernetes
- Implement observability solutions for monitoring and alerting on key metrics and incidents
- Automate infrastructure provisioning, configuration, and deployment processes
- Continuously improve system performance, reliability, and cost-effectiveness
- Collaborate with software engineers to define and implement best practices for infrastructure and software development
- Participate in on-call rotations and incident response to ensure high availability of production systems
Requirements:
- 3+ years of experience in Site Reliability Engineering or DevOps engineering
- Strong understanding of cloud infrastructure (AWS, GCP, or Azure)
- Proficiency in infrastructure-as-code tools like Terraform, CloudFormation, or Ansible
- Hands-on experience with container technologies like Docker and Kubernetes
- Familiarity with monitoring and observability tools like Prometheus, Grafana, and the ELK stack
- Ability to write clean, maintainable, and efficient code in languages like Python, Go, or Bash
- Strong problem-solving and troubleshooting skills
- Excellent communication and collaboration skills