Site Reliability Engineering

Girafe

Location: Hyderabad, Telangana, India
Job type: Full-time

Required skills

Ansible
Bash
caching
CentOS
compliance
DevOps
DNS
Docker
incident response
Java
Jenkins
Kafka
Linux
load balancing
middleware
proxy
reverse proxy
RHEL
Redis
Shell Scripting
TCP
Ubuntu
version control
virtualization
VMware

About the role

Website: thegirafe.com
Job details:
Role Overview

We are looking for a highly skilled site reliability engineer to manage and scale our on-premises payments infrastructure. You will work in a hybrid environment spanning virtual machines and containerized workloads on bare metal, ensuring high availability, security, and performance for mission-critical systems.

Key Responsibilities

Operate and optimize virtualized environments (VMs) and containerized workloads (Docker on bare metal)
Manage and scale middleware systems like:

o Nginx (traffic routing, reverse proxy, load balancing)
o Redis (caching, HA setup)
o Kafka (streaming, partitioning, fault tolerance)

Build and maintain CI/CD pipelines using Jenkins
Manage infrastructure and application configurations using Git-based version control
Ensure high availability, resilience, and performance tuning across systems
Work on Linux system administration (RHEL/CentOS/Ubuntu)
Implement and maintain automation frameworks using:

o Ansible
o Shell scripting

Manage and troubleshoot networking components:

o TCP/IP, DNS, load balancing
o Firewalls, WAF policies
o Akamai

Handle security and compliance requirements
Maintain accurate inventory and asset management systems
Participate in incident response, RCA, and system reliability improvements
Collaborate with application, security, and DevOps teams

Required Skills & Qualifications

Core Infrastructure
Strong hands-on experience with Linux system administration
Experience managing on-prem data center environments
Solid understanding of:

o Virtualization (VMware / KVM or similar)
o Bare metal provisioning

Containers & Middleware
Experience running Docker in production (non-Kubernetes setups preferred)
Strong operational knowledge of:

o Nginx
o Redis
o Kafka
o RDBMS
o Java

Observability, Alerting & Reliability

Design and manage observability platforms:

o Elastic Stack (ELK)
o Grafana / Prometheus stack

Build and maintain:

o Metrics, logs, and tracing pipelines
o Dashboards for system health and business KPIs

Develop intelligent alerting strategies:

o Reduce noise (alert fatigue)
o Improve signal quality

Build correlation mechanisms / alert aggregation systems to:

o Reduce MTTD (Mean Time to Detect)
o Reduce MTTR (Mean Time to Recover)

Drive proactive monitoring and anomaly detection
Lead incident response, debugging, and RCA with data-driven insights

CI/CD & Version Control

Hands-on experience with:

o Git (branching strategies, code reviews, infra-as-code workflows)
o Jenkins (pipeline creation, build automation, deployment orchestration)

Networking & Security

Good understanding of:

o Networking fundamentals (L3/L4 concepts)
o Firewalls and WAF (rule tuning, debugging)

Experience handling secure production environments

Automation

Hands-on experience with:

o Ansible
o Shell scripting (Bash)

Operations

Experience with:

o Monitoring, alerting, and logging systems
