Site Reliability Engineer

CareerXperts Consulting

full-time

Required skills

Python
AWS
automation tools
Azure
Bash
Datadog
DevOps
Docker
Google Cloud
incident response
infrastructure-as-code
Kubernetes
Linux
Root Cause Analysis
SRE
uptime

About the role

Website: careerxperts.com
Job details:

The Site Reliability Engineer is responsible for ensuring the reliability, scalability, and performance of production systems and infrastructure. The role focuses on building resilient platforms, automating operational processes, and improving system stability across high-availability environments.

This role sits at the intersection of software engineering and infrastructure operations and requires strong systems thinking, automation skills, and a proactive approach to reliability engineering.

Role Focus Areas

Infrastructure reliability and system scalability
Automation of operational workflows and incident management
Monitoring, observability, and production stability

Key Responsibilities

Design, maintain, and improve highly available production systems and infrastructure
Build automation tools and workflows to reduce operational overhead
Monitor system health, performance, and reliability across environments
Troubleshoot infrastructure and application-related production issues
Improve system observability through logging, metrics, and monitoring tools
Collaborate with engineering teams to optimize deployment and release processes
Support incident response, root cause analysis, and system recovery efforts
Manage infrastructure scalability, performance optimization, and uptime initiatives
Maintain documentation for infrastructure, operational procedures, and recovery processes

Expected Outcomes

Reliable and scalable infrastructure with high system uptime
Faster incident detection and resolution
Reduced operational bottlenecks through automation
Improved observability and production system performance

Core Competencies

Strong understanding of Linux systems, networking, and distributed systems
Experience with cloud platforms such as AWS, Azure, or Google Cloud
Proficiency in scripting or programming languages such as Python, Go, or Bash
Familiarity with Kubernetes, Docker, and container orchestration
Experience with CI/CD pipelines and infrastructure automation tools
Understanding of monitoring and observability tools such as Prometheus, Grafana, or Datadog

Experience & Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field
4–8 years of experience in SRE, DevOps, platform engineering, or infrastructure roles

Preferred Background

Experience supporting large-scale or high-traffic production systems
Familiarity with incident management and reliability engineering practices
Exposure to infrastructure-as-code and automation frameworks
Understanding of security, scalability, and performance optimization principles

