Grab
Website: grab.in
Job details:
Site Reliability Engineer (SRE) – Mid Level
Company: Grab A Grub
Industry: Last-Mile Logistics
Location: On Site / Remote / Hybrid
About Grab A Grub
Grab A Grub is a fast-growing logistics platform powering high-volume last-mile
delivery operations. Our technology platform handles large-scale real-time
transactions, distributed services, and mission-critical operational systems that
support riders, merchants, and customers across multiple locations.
As our platform continues to scale, we are looking for Site Reliability Engineers
who enjoy solving real production problems, investigating complex system
behavior, and building resilient infrastructure that keeps our systems running
smoothly.
If you enjoy digging into logs, tracing distributed systems, debugging
production issues, and finding the real root cause instead of relying only on
dashboards, this role is for you.
Role Overview
As a Site Reliability Engineer, you will be responsible for ensuring the stability,
performance, and reliability of our production systems. You will work closely
with engineering teams to identify reliability risks, investigate incidents, and
build proactive monitoring and automation frameworks.
This role requires engineers who can think beyond standard playbooks, deeply
analyze system behavior, and proactively improve system resilience.
What You Will Do
• Ensure high system uptime and production reliability across distributed
systems.
• Monitor system health using metrics, logs, and traces through modern
observability platforms.
• Investigate production incidents and outages, perform detailed root
cause analysis (RCA), and drive long-term fixes.
• Debug complex issues by analyzing logs, application behavior,
infrastructure metrics, and service interactions.
• Work closely with development teams to identify reliability
improvements early in the development lifecycle.
• Build and improve monitoring, alerting, and incident response systems.
• Improve system performance, scalability, and operational efficiency.
• Automate operational tasks and reliability checks through scripts and
tooling.
• Contribute to post-incident reviews and reliability improvement
initiatives.
• Help establish engineering practices that improve system resilience and
operational maturity.
What We Are Looking For
We are looking for engineers who:
• Enjoy deep debugging and investigating real production problems.
• Can analyze systems beyond dashboards and ready-made monitoring
tools.
• Think systematically about failure scenarios, reliability risks, and
system performance.
• Take ownership of production environments and service reliability.
• Collaborate effectively with developers, infrastructure engineers, and
product teams.
Required Skills
• 3–5 years of experience in Site Reliability Engineering, DevOps, or
Production Engineering.
• Strong experience with system monitoring and observability tools.
• Hands-on experience debugging production issues in distributed
systems.
• Strong understanding of Linux systems and performance
troubleshooting.
• Experience with incident management and handling production outages.
• Ability to analyze logs, metrics, traces, and infrastructure behavior to
determine root causes.
• Good understanding of microservices architectures and distributed
systems.
Preferred Skills
• Experience with cloud infrastructure (AWS, Azure, or GCP).
• Familiarity with Docker, Kubernetes, or containerized environments.
• Experience with CI/CD pipelines and deployment automation.
• Scripting experience using Python, Bash, or similar tools.
• Experience working with high-scale production environments.
Experience
• 3–5 years of relevant experience in SRE, DevOps, or production
engineering roles.
• Experience supporting large-scale production systems or high-traffic
platforms.
Why Join Grab A Grub
• Work on large-scale real-world logistics platforms.
• Solve complex reliability challenges in distributed systems.
• Collaborate with high-performing engineering teams.
• Opportunity to build resilient systems that power critical business
operations.
• Flexible remote / hybrid work environment.
Click Apply to learn more.