SRE Lead I - DevOps Engineering

UST

full-time

Required skills

Python
AWS
CI/CD
CLI
Datadog
GCP
Git
Helm
Jenkins
Root Cause Analysis
SRE
Terraform
uptime

About the role

Website: ust.com
Job details:
Role Description

Site Reliability Engineer (SRE)

Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering to build and operate large-scale, distributed, and fault-tolerant systems. The focus is on ensuring that services consistently meet reliability and performance expectations through strong engineering practices.

SRE applies an engineering approach to operational challenges by building scalable solutions, automating processes, and improving system resilience. Key practices include reducing manual operational overhead, conducting blameless postmortems, and proactively identifying and preventing potential outages.

The environment encourages collaboration, innovation, and continuous learning, with a focus on problem-solving, ownership, and open communication.

Key Responsibilities

Ensure high availability and uptime of systems across cloud-native (AWS, GCP) and hybrid environments
Design and implement Infrastructure as Code (IaC) using tools such as Terraform, cloud CLIs, and SDKs
Build and maintain CI/CD pipelines for application and infrastructure deployment (e.g., Jenkins, cloud-native tools)
Develop automation frameworks to streamline service requests and production deployments
Create and maintain detailed runbooks for incident detection, remediation, and recovery
Troubleshoot complex distributed systems and perform root cause analysis
Participate in on-call rotations for critical incidents and continuously improve MTTR
Lead blameless postmortems and drive corrective and preventive actions

Required Experience & Skills

Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience
7–10 years of experience in software engineering, system administration, or related domains
4+ years of experience working with public cloud platforms (GCP/AWS)

Technical Expertise

Hands-on experience with GCP services (GCE, GKE, storage, networking)
Experience in provisioning and managing infrastructure using Terraform
Strong experience in rebuilding and managing VM instances using automated pipelines
Experience configuring monitoring tools such as Google Cloud Monitoring (formerly Stackdriver), Datadog, or AppDynamics
Proficiency in scripting languages such as Shell or Python
Experience in setting up and managing IAM policies, roles, and access controls in GCP/AWS
Solid experience with CI/CD pipelines using Git and Jenkins
Experience with container orchestration (GKE) and Helm charts
Strong understanding of blue/green deployment strategies
Exposure to PagerDuty or similar incident management tools

Skills

Site reliability engineering, Terraform, cloud CLI, CI/CD
