Site Reliability Engineering Specialist (Bengaluru, IN, 560103)

BT Group

Salary: ₹12.3 LPA
Location: Bengaluru, Karnataka, India
Job type: Full-time

Required skills

Linux
Kubernetes
Dynatrace
Prometheus
Grafana
Elasticsearch
Kafka
CI/CD
GitOps
Python
Bash
Ansible

About BT International

At BT International, our purpose is to keep the world connected. As part of BT, we build on almost 180 years of innovation and expertise to deliver secure connectivity and digital services to some of the world’s leading multinational businesses and organisations. Our customers trust us to safeguard their data, drive their digital transformation and keep their businesses running. With colleagues on the ground across the world and supporting customers wherever they need to operate, BT International offers a truly global experience. Whether it’s about providing cloud connectivity, helping organisations collaborate, or enabling innovation in cybersecurity and digital services, you’ll be part of a team that shapes how businesses succeed in a world that is being transformed by AI. If you have the drive and ambition to make an impact on a global stage, BT International is where it happens.

About the role

As a Site Reliability Engineer (SRE) within the Network Operations team at BT International, you will be responsible for ensuring the reliability, resilience and performance of our global platforms, including Global Fabric. You will collaborate closely with Engineering, Product and Service teams to embed SRE principles such as automation, observability and proactive incident reduction into day-to-day operations. By improving how we monitor, maintain and evolve our services, you will help reduce risk, improve service quality and increase operational efficiency. Through this role, you will support BT International’s strategy by enabling stable, secure and scalable platforms that support business growth, accelerate delivery of new capabilities and protect customer experience.

What you’ll be doing

• Own the operational reliability, performance and resilience of the Global Fabric NaaS platform.
• Support and troubleshoot microservices, APIs and integrations across the NaaS ecosystem.
• Diagnose and resolve production issues across Kubernetes-hosted applications, Linux systems, networking, Kafka, APIs and service integrations.
• Support safe, automated change into production using CI/CD, GitOps and automated testing.
• Improve observability, monitoring and traceability across the platform using Dynatrace, Prometheus, Grafana, Elasticsearch and Kafka.
• Support BT’s move towards end-to-end tracing and service traceability by helping to implement and improve synthetic monitoring, tracing and service flow visibility.
• Participate in major incident resolution, root cause analysis and post-incident improvement activities.
• Manage incidents, problems and changes through ServiceNow and track defects and improvements in Jira.
• Drive automation through Ansible, Python, Bash or similar tooling to reduce manual effort and operational risk.
• Mentor and support L2 engineers by improving troubleshooting practices, runbooks and operational readiness.
• Build strong knowledge of the end-to-end customer journey and ensure operational decisions are aligned to customer impact.

Essential Skills / Experience

• Strong Linux and system administration experience, including server and compute management.
• Experience deploying, supporting and troubleshooting containerised applications in Kubernetes.
• Experience using monitoring tools such as Dynatrace, Prometheus, Grafana, Elasticsearch and Kafka.
• Experience supporting large-scale, high-availability services in an ISP, telecom, NaaS or network-centric environment.
• Experience with CI/CD, GitOps and safe production deployments.
• Experience with scripting and automation using Python, Bash, Ansible or similar.
• Growth mindset: a self-driven attitude towards learning new skills and aiding the development of others.

Desirable Skills / Experience

• In-depth knowledge of network protocols, including BGP, IS-IS and MPLS.
• Understanding of synthetic monitoring, telemetry and end-to-end service visibility.
• Experience of resilience, disaster recovery, chaos engineering or high availability testing.
• Ability to manage incidents through ServiceNow and to track defects and continuous improvements in Jira.

About BT Group

BT Group provides telecommunications, broadband and mobile services across the United Kingdom.
