UST
Website: ust.com
Job details:
Role Description
Site Reliability Engineer (SRE)
Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering to build and operate large-scale, distributed, and fault-tolerant systems. The focus is on ensuring that services consistently meet reliability and performance expectations through strong engineering practices.
SRE applies an engineering approach to operational challenges by building scalable solutions, automating processes, and improving system resilience. Key practices include reducing manual operational overhead, conducting blameless postmortems, and proactively identifying and preventing potential outages.
The environment encourages collaboration, innovation, and continuous learning, with a focus on problem-solving, ownership, and open communication.
Key Responsibilities
- Ensure high availability and uptime of systems across cloud-native (AWS, GCP) and hybrid environments
- Design and implement Infrastructure as Code (IaC) using tools such as Terraform, cloud CLIs, and SDKs
- Build and maintain CI/CD pipelines for application and infrastructure deployment (e.g., Jenkins, cloud-native tools)
- Develop automation frameworks to streamline service requests and production deployments
- Create and maintain detailed runbooks for incident detection, remediation, and recovery
- Troubleshoot complex distributed systems and perform root cause analysis
- Participate in on-call rotations for critical incidents and continuously reduce MTTR
- Lead blameless postmortems and drive corrective and preventive actions
Required Experience & Skills
- Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience
- 7–10 years of experience in software engineering, system administration, or related domains
- 4+ years of experience working with public cloud platforms (GCP/AWS)
Technical Expertise
- Hands-on experience with GCP services (GCE, GKE, storage, networking)
- Experience in provisioning and managing infrastructure using Terraform
- Strong experience in rebuilding and managing VM instances using automated pipelines
- Experience configuring monitoring tools such as Stackdriver (Cloud Monitoring), Datadog, or AppDynamics
- Proficiency in scripting languages such as Shell or Python
- Experience in setting up and managing IAM policies, roles, and access controls in GCP/AWS
- Solid experience with CI/CD pipelines using Git and Jenkins
- Experience with container orchestration (GKE) and Helm charts
- Strong understanding of blue/green deployment strategies
- Exposure to PagerDuty or similar incident management tools
Skills
site reliability engineering, terraform, cloud cli, cicd