UST
Website: ust.com
Job details:
Job Title
Incident Lead – Application Production Support (Enterprise Microservices)
Role Overview
We are looking for a highly skilled Incident Lead to manage and drive resolution of production incidents for enterprise microservices-based platforms. This role serves a mission-critical function in real-time incident triage, bridge management, root cause analysis, and coordination across multiple technology and business teams in a 24x7 global environment.
The ideal candidate thrives in high-pressure scenarios, demonstrates strong technical depth, and excels in communication and crisis leadership.
Key Responsibilities
Incident Management & Response
- Lead and manage Major Incident (P1/P2) bridges, ensuring fast triage and restoration
- Act as the Single Point of Contact (SPOC) during major incidents
- Ensure incidents are resolved within SLA timelines with clear communication throughout the lifecycle
- Coordinate with engineering, infrastructure, DevOps, and database teams during incidents
Technical Triage & Diagnostics
- Perform hands-on troubleshooting for microservices-based applications
- Analyze logs using Splunk, identify patterns, and isolate root causes
- Monitor application health via Grafana dashboards and s
- Support and debug Unix-based batch jobs, failures, and recoveries
- Query and analyze Cassandra DB for data validation and issue diagnosis
- Troubleshoot services deployed on AWS and Kubernetes (K8s)
Post-Incident & Problem Management
- Lead Root Cause Analysis (RCA) and post-incident reviews
- Track and ensure completion of corrective and preventive actions
- Identify recurring issues and partner with teams to eliminate systemic problems
Operational Excellence
- Contribute to automation and monitoring improvements to reduce MTTR
- Help refine incident processes, playbooks, and escalation models
- Support continuous improvements in observability and resilience
Required Skills & Experience
- 6–10 years of experience in Application Production Support or Incident Management
- Strong understanding of microservices architecture and distributed systems
- Hands-on expertise in:
  - Splunk (advanced log analysis and querying)
  - Grafana and monitoring tools
  - Cassandra DB (strong querying and functional knowledge)
  - Unix/Linux (batch jobs, shell scripting, troubleshooting)
  - AWS (EC2, CloudWatch, core services)
  - Kubernetes (K8s) and containerized environments
- Strong experience handling Major Incidents and production bridges
- Ability to work in 24x7 rotational shifts, including weekends
Preferred Qualifications
- Experience supporting high-throughput, mission-critical enterprise platforms
- Familiarity with ITIL Incident & Problem Management
- Exposure to DevOps, SRE, and CI/CD toolchains
Key Competencies
- Exceptional crisis management and decision-making skills
- Strong analytical and troubleshooting capability
- Clear, confident communication with technical and non-technical stakeholders
- Ownership mindset with focus on service stability and customer impact
Skills
devops, production support, incident triage, grafana, incident management, unix, splunk