About Data Axle
Data Axle Inc. has been an industry leader in data, marketing solutions, sales, and research for over 50 years in the USA. Data Axle now has an established strategic global centre of excellence in Pune. This centre delivers mission-critical data services to its global customers, powered by its proprietary cloud-based technology platform and proprietary business and consumer databases.
Data Axle India is recognized as a Great Place to Work! This prestigious designation is a testament to our collective efforts in fostering an exceptional workplace culture and creating an environment where every team member can thrive.
Role Overview
We are building a reliability-driven operations model and are looking for a Site Reliability Engineer (L2) to support and engineer production resilience across a hybrid infrastructure (AWS + On-Prem).
This role is ideal for a strong L2 operations engineer who wants to move beyond reactive support and contribute to automation, reliability engineering, and production stability at scale.
The primary objective of this role is to reduce MTTR, eliminate repetitive incidents through automation, and improve uptime across business-critical systems.
L2: 4–6 years of experience in IT Operations / Production Support / SRE
Key Responsibilities
Production Reliability & Incident Management
- Manage L2 production incidents across AWS and on-prem environments
- Perform structured Root Cause Analysis (RCA)
- Drive permanent fixes for recurring issues
- Participate in a 24x7 on-call rotation
- Participate in critical incident management
Automation & Engineering
- Develop scripts (Python / PowerShell / Bash) to automate repetitive tasks
- Contribute to self-healing automation initiatives
- Improve monitoring coverage and reduce alert noise
- Maintain and enhance operational runbooks
Infrastructure & Cloud Operations
- Support AWS services (EC2, RDS, VPC, IAM, CloudWatch)
- Manage Linux and Windows servers
- Assist in patching, backup validation, and system health management
- Collaborate on Infrastructure-as-Code initiatives (Terraform / Ansible preferred)
Monitoring & Observability
- Work with monitoring tools (Zabbix / Datadog / AppDynamics / Splunk / Prometheus or similar)
- Tune alerts to reduce false positives
- Create dashboards for service visibility
- Track SLO/SLI metrics for Tier-1 applications
Continuous Improvement
- Identify high-frequency operational issues and propose automation
- Contribute to reliability KPIs (MTTR, uptime, incident recurrence)
- Support capacity planning and performance optimization