Head of Reliability Engineering

ACCEL HUMAN RESOURCE CONSULTANTS

full-time

Required skills

Python
backend
CI
fintech
incident response
Kafka
Kubernetes
Linux
Root Cause Analysis
SRE
Terraform
Vault

About the role

Website: accel-hrconsulting.com
Job details:

Our client is looking for a Head of Reliability Engineering – Trading Infrastructure to lead the reliability of scalable, fault-tolerant, high-performance trading infrastructure for mission-critical real-time systems.

Key Responsibilities

Lead the reliability engineering function across trading infrastructure and production platforms.
Architect and operate highly available, fault-tolerant distributed systems supporting live trading environments.
Own infrastructure reliability, observability, scalability, deployment safety, and operational excellence across mission-critical systems.
Drive platform engineering initiatives across Kubernetes, CI/CD, infrastructure automation, runtime orchestration, and developer tooling.
Partner closely with trading, quant, and backend engineering teams to optimize latency, throughput, resiliency, and production stability.
Build and standardize monitoring, alerting, tracing, logging, failover testing, disaster recovery, and incident response frameworks.
Lead root cause analysis and resolution for complex production and distributed systems issues.
Strengthen infrastructure security, auditability, secrets management, and operational governance across trading environments.
Improve engineering productivity through automation, internal tooling, and infrastructure self-service capabilities.
Define operational best practices, reliability standards, release governance, and infrastructure lifecycle management processes.
Mentor and help scale the future reliability and platform engineering organization.

Required Experience

7–12 years of experience in Infrastructure Engineering, Reliability Engineering, SRE, Platform Engineering, or Distributed Systems environments.
Strong experience operating mission-critical production systems in high-availability environments.
Deep expertise in Linux systems, networking, and distributed infrastructure architecture.
Strong hands-on experience with Kubernetes and containerized production environments.
Strong programming ability in Go or Python.
Experience with Kafka, Terraform, Vault, Consul, CI/CD pipelines, and infrastructure automation frameworks.
Strong understanding of observability platforms including Prometheus, Alertmanager, logging, and tracing systems.
Proven expertise debugging complex distributed systems and low-latency production environments.
Experience in trading systems, fintech, exchanges, HFT firms, or other real-time infrastructure environments is highly preferred.
Strong ownership mindset with the ability to operate in high-performance engineering environments.
