Girafe
Website:
thegirafe.com
Job details:
Role Overview
We are looking for a highly skilled site reliability engineer to manage and scale our
on-premises payments infrastructure. You will work in a hybrid environment spanning
virtual machines and containerized workloads on bare metal, ensuring high
availability, security, and performance for mission-critical systems.
Key Responsibilities
- Operate and optimize virtualized environments (VMs) and containerized workloads (Docker on bare metal)
- Manage and scale middleware systems like:
o Nginx (traffic routing, reverse proxy, load balancing)
o Redis (caching, HA setup)
o Kafka (streaming, partitioning, fault tolerance)
- Build and maintain CI/CD pipelines using Jenkins
- Manage infrastructure and application configurations using Git-based version control
- Ensure high availability, resilience, and performance tuning across systems
- Work on Linux system administration (RHEL/CentOS/Ubuntu)
- Implement and maintain automation frameworks using:
o Ansible
o Shell scripting
- Manage and troubleshoot networking components:
o TCP/IP, DNS, Load balancing
o Firewalls, WAF policies
o Akamai
- Handle security and compliance requirements
- Maintain accurate inventory and asset management systems
- Participate in incident response, RCA, and system reliability improvements
- Collaborate with application, security, and DevOps teams
Required Skills & Qualifications
Core Infrastructure
- Strong hands-on experience with Linux system administration
- Experience managing on-prem data center environments
- Solid understanding of:
o Virtualization (VMware / KVM or similar)
o Bare metal provisioning
Containers & Middleware
- Experience running Docker in production (non-Kubernetes setups preferred)
- Strong operational knowledge of:
o Nginx
o Redis
o Kafka
o RDBMS
o Java
Observability, Alerting & Reliability
- Design and manage observability platforms:
o Elastic Stack (ELK)
o Grafana / Prometheus stack
o Metrics, logs, and tracing pipelines
o Dashboards for system health and business KPIs
- Develop intelligent alerting strategies:
o Reduce noise (alert fatigue)
o Improve signal quality
- Build correlation mechanisms / alert aggregation systems to:
o Reduce MTTD (Mean Time to Detect)
o Reduce MTTR (Mean Time to Recover)
- Drive proactive monitoring and anomaly detection
- Lead incident response, debugging, and RCA with data-driven insights
CI/CD & Version Control
- Hands-on experience with:
o Git (branching strategies, code reviews, infra-as-code workflows)
o Jenkins (pipeline creation, build automation, deployment orchestration)
Networking & Security
o Networking fundamentals (L3/L4 concepts)
o Firewalls and WAF (rule tuning, debugging)
- Experience handling secure production environments
Automation
- Hands-on experience with:
o Ansible
o Shell scripting (bash)
Operations
o Monitoring, alerting, and logging systems