CareerXperts Consulting
Website: careerxperts.com
Job details:
The Site Reliability Engineer is responsible for ensuring the reliability, scalability, and performance of production systems and infrastructure. The role focuses on building resilient platforms, automating operational processes, and improving system stability across high-availability environments.
This role sits at the intersection of software engineering and infrastructure operations. It requires strong systems thinking, automation skills, and a proactive approach to reliability engineering.
Role Focus Areas
- Infrastructure reliability and system scalability
- Automation of operational workflows and incident management
- Monitoring, observability, and production stability
Key Responsibilities
- Design, maintain, and improve highly available production systems and infrastructure
- Build automation tools and workflows to reduce operational overhead
- Monitor system health, performance, and reliability across environments
- Troubleshoot infrastructure and application-related production issues
- Improve system observability through logging, metrics, and monitoring tools
- Collaborate with engineering teams to optimize deployment and release processes
- Support incident response, root cause analysis, and system recovery efforts
- Manage infrastructure scalability, performance optimization, and uptime initiatives
- Maintain documentation for infrastructure, operational procedures, and recovery processes
Expected Outcomes
- Reliable and scalable infrastructure with high system uptime
- Faster incident detection and resolution
- Reduced operational bottlenecks through automation
- Improved observability and production system performance
Core Competencies
- Strong understanding of Linux systems, networking, and distributed systems
- Experience with cloud platforms such as AWS, Azure, or Google Cloud
- Proficiency in scripting or programming languages such as Python, Go, or Bash
- Familiarity with Kubernetes, Docker, and container orchestration
- Experience with CI/CD pipelines and infrastructure automation tools
- Understanding of monitoring and observability tools such as Prometheus, Grafana, or Datadog
Experience & Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 4–8 years of experience in SRE, DevOps, platform engineering, or infrastructure roles
Preferred Background
- Experience supporting large-scale or high-traffic production systems
- Familiarity with incident management and reliability engineering practices
- Exposure to infrastructure-as-code and automation frameworks
- Understanding of security, scalability, and performance optimization principles