Epsilon
Website: epsilon.com
Job details:
About Business Unit:
At the core of all that Epsilon does is a team that sets the foundation of our IT infrastructure. The team drives innovation and efficiency through pioneering technology across Epsilon's platforms and business verticals. From being the first point of contact for infrastructure needs to final deployment, the team provides end-to-end solutions for our client-facing platforms. ETS supports all aspects of revenue-generating platforms for Epsilon and sets the architectural direction for our enterprise deployments. By adopting the newest technologies, such as Cloud, Automation, and Artificial Intelligence, the team is at the forefront of redefining our digital business and capturing new opportunities.
This role acts as the operational nerve centre, ensuring rapid detection, notification, coordination, and restoration of services across Windows, Linux, Kubernetes, and cloud-native platforms, with strong emphasis on monitoring through Grafana and the ELK stack.
Why we are looking for you:
- You will actively monitor enterprise dashboards using Grafana (metrics), ELK / Elastic Stack (logs and events), OpsRamp, SolarWinds, PagerDuty, and ServiceNow queues.
- You will support dashboard reviews, basic enhancements, and operational usage of centralized observability platforms.
- You will partner with monitoring, tooling, and SRE teams to improve alert quality, reduce alert noise, and enhance incident detection and response effectiveness.
- You will detect anomalies, correlate multi-system alerts, and identify patterns indicating systemic or cascading failures.
- You have a strong interest in IT infrastructure monitoring and operations across Windows, Linux, and Kubernetes platforms.
- You communicate clearly during incidents and can coordinate with multiple resolver teams.
- You are comfortable working in a 24x7, shift-based operational environment.
What you will enjoy in this role:
- Working at the centre of enterprise operations as part of a Global Command & Control Centre.
- Exposure to modern observability platforms such as Grafana and ELK.
- Hands-on experience with major incident management and real-time incident command.
- Opportunities to build strong foundational skills in Linux, Windows, and Kubernetes operations.
- A fast-paced environment that builds decision-making, communication, and operational rigour.
Responsibilities
- Monitor enterprise dashboards using Grafana & ELK for logs, metrics, and alerts.
- Validate alerts, reduce noise, and classify incidents by severity (P1–P3).
- Provide L1 operational monitoring and triage for Kubernetes clusters.
- Perform Linux and Windows administration tasks, including patching, health checks, and troubleshooting.
- Open, update, and manage incidents in ServiceNow with accurate diagnostics.
- Act as first responder for major incidents and support incident bridges and escalations.
- Maintain shift logs, handover notes, SOPs, and operational documentation.
Qualifications
- Bachelor’s degree in Engineering, Computer Science, IT, or an equivalent discipline.
- 3-5 years of experience in IT operations, NOC, OCC, or infrastructure support roles.
- Hands-on exposure to:
  - Linux / Unix administration.
  - Windows Server administration and patching.
  - Monitoring tools such as Grafana and ELK.
- Understanding of ITIL-aligned incident management processes.
- Willingness to work in 24x7 shift-based operations.