Lead II - Incident Lead

UST

Location: Trivandrum, Kerala, India
Job type: Full-time

Required skills

AWS
Cassandra
CloudWatch
database
DevOps
EC2
K8s
Kubernetes
Linux
microservices
production support
Root Cause Analysis
Shell Scripting
Splunk
SRE

About the role

Website: ust.com
Job details:
Role Description

Incident Lead – Application Production Support (Enterprise Microservices)

Role Overview

We are looking for a highly skilled Incident Lead to manage and drive resolution of production incidents for enterprise microservices-based platforms. This mission-critical role covers real-time incident triage, bridge management, root cause analysis, and coordination across multiple technology and business teams in a 24x7 global environment.

The ideal candidate thrives in high-pressure scenarios, demonstrates strong technical depth, and excels in communication and crisis leadership.

Key Responsibilities

Incident Management & Response

Lead and manage Major Incident (P1/P2) bridges, ensuring fast triage and restoration
Act as the Single Point of Contact (SPOC) during major incidents
Ensure incidents are resolved within SLA timelines with clear communication throughout the lifecycle
Coordinate with engineering, infrastructure, DevOps, and database teams during incidents

Technical Triage & Diagnostics

Perform hands-on troubleshooting for microservices-based applications
Analyze logs using Splunk, identify patterns, and isolate root causes
Monitor application health via Grafana dashboards
Support and debug Unix-based batch jobs, failures, and recoveries
Query and analyze Cassandra DB for data validation and issue diagnosis
Troubleshoot services deployed on AWS and Kubernetes (K8s)

Post-Incident & Problem Management

Lead Root Cause Analysis (RCA) and post-incident reviews
Track and ensure completion of corrective and preventive actions
Identify recurring issues and partner with teams to eliminate systemic problems

Operational Excellence

Contribute to automation and monitoring improvements to reduce MTTR
Help refine incident processes, playbooks, and escalation models
Support continuous improvements in observability and resilience

Required Skills & Experience

6–10 years of experience in Application Production Support or Incident Management
Strong understanding of microservices architecture and distributed systems
Hands-on expertise in:

Splunk (advanced log analysis and querying)
Grafana and monitoring tools
Cassandra DB (strong querying and functional knowledge)
Unix/Linux (batch jobs, shell scripting, troubleshooting)
AWS (EC2, CloudWatch, core services)
Kubernetes (K8s) and containerized environments

Strong experience handling Major Incidents and production bridges
Ability to work in 24x7 rotational shifts, including weekends

Preferred Qualifications

Experience supporting high-throughput, mission-critical enterprise platforms
Familiarity with ITIL Incident & Problem Management
Exposure to DevOps, SRE, and CI/CD toolchains

Key Competencies

Exceptional crisis management and decision-making skills
Strong analytical and troubleshooting capability
Clear, confident communication with technical and non-technical stakeholders
Ownership mindset with focus on service stability and customer impact

Skills

devops, production support, incident triage, grafana, incident management, unix, splunk
