Site Reliability Engineer

Grab

Location: Mumbai, Maharashtra, India
Job type: Full-time

Required skills

Python
AWS
Azure
Bash
cloud infrastructure
DevOps
Docker
GCP
incident response
Kubernetes
Linux
microservices
Root Cause Analysis
SRE
uptime

About the role

Website: grab.in
Job details:

Site Reliability Engineer (SRE) – Mid Level

Company: Grab A Grub

Industry: Last Mile Logistics

Location: On Site / Remote / Hybrid

About Grab A Grub

Grab A Grub is a fast-growing logistics platform powering high-volume last-mile delivery operations. Our technology platform handles large-scale real-time transactions, distributed services, and mission-critical operational systems that support riders, merchants, and customers across multiple locations.

As our platform continues to scale, we are looking for Site Reliability Engineers who enjoy solving real production problems, investigating complex system behavior, and building resilient infrastructure that keeps our systems running smoothly.

If you enjoy digging into logs, tracing distributed systems, debugging production issues, and finding the real root cause instead of relying only on dashboards, this role is for you.

Role Overview

As a Site Reliability Engineer, you will be responsible for ensuring the stability, performance, and reliability of our production systems. You will work closely with engineering teams to identify reliability risks, investigate incidents, and build proactive monitoring and automation frameworks.

This role requires engineers who can think beyond standard playbooks, deeply analyze system behavior, and proactively improve system resilience.

What You Will Do

• Ensure high system uptime and production reliability across distributed systems.

• Monitor system health using metrics, logs, and traces through modern observability platforms.

• Investigate production incidents and outages, perform detailed root cause analysis (RCA), and drive long-term fixes.

• Debug complex issues by analyzing logs, application behavior, infrastructure metrics, and service interactions.

• Work closely with development teams to identify reliability improvements early in the development lifecycle.

• Build and improve monitoring, alerting, and incident response systems.

• Improve system performance, scalability, and operational efficiency.

• Automate operational tasks and reliability checks through scripts and tooling.

• Contribute to post-incident reviews and reliability improvement initiatives.

• Help establish engineering practices that improve system resilience and operational maturity.

What We Are Looking For

We are looking for engineers who:

• Enjoy deep debugging and investigating real production problems.

• Can analyze systems beyond dashboards and ready-made monitoring tools.

• Think systematically about failure scenarios, reliability risks, and system performance.

• Take ownership of production environments and service reliability.

• Collaborate effectively with developers, infrastructure engineers, and product teams.

Required Skills

• 3–5 years of experience in Site Reliability Engineering, DevOps, or Production Engineering.

• Strong experience with system monitoring and observability tools.

• Hands-on experience debugging production issues in distributed systems.

• Strong understanding of Linux systems and performance troubleshooting.

• Experience handling incident management and production outages.

• Ability to analyze logs, metrics, traces, and infrastructure behavior to determine root causes.

• Good understanding of microservices architectures and distributed systems.

Preferred Skills

• Experience with cloud infrastructure (AWS, Azure, or GCP).

• Familiarity with Docker, Kubernetes, or containerized environments.

• Experience with CI/CD pipelines and deployment automation.

• Scripting experience using Python, Bash, or similar tools.

• Experience working with high-scale production environments.

Experience

• 3–5 years of relevant experience in SRE, DevOps, or production engineering roles.

• Experience supporting large-scale production systems or high-traffic platforms.

Why Join Grab A Grub

• Work on large-scale real-world logistics platforms.

• Solve complex reliability challenges in distributed systems.

• Collaborate with high-performing engineering teams.

• Opportunity to build resilient systems that power critical business operations.

• Flexible remote / hybrid work environment.

Click on Apply to know more.