ACCEL HUMAN RESOURCE CONSULTANTS
Website: accel-hrconsulting.com
Job details:
Our client is looking for a Head of Reliability Engineering – Trading Infrastructure to lead scalable, fault-tolerant, and high-performance trading infrastructure for mission-critical real-time systems.
Key Responsibilities
- Lead the reliability engineering function across trading infrastructure and production platforms.
- Architect and operate highly available, fault-tolerant distributed systems supporting live trading environments.
- Own infrastructure reliability, observability, scalability, deployment safety, and operational excellence across mission-critical systems.
- Drive platform engineering initiatives across Kubernetes, CI/CD, infrastructure automation, runtime orchestration, and developer tooling.
- Partner closely with trading, quant, and backend engineering teams to optimize latency, throughput, resiliency, and production stability.
- Build and standardize monitoring, alerting, tracing, logging, failover testing, disaster recovery, and incident response frameworks.
- Lead root cause analysis and resolution for complex production and distributed systems issues.
- Strengthen infrastructure security, auditability, secrets management, and operational governance across trading environments.
- Improve engineering productivity through automation, internal tooling, and infrastructure self-service capabilities.
- Define operational best practices, reliability standards, release governance, and infrastructure lifecycle management processes.
- Mentor and help scale the future reliability and platform engineering organization.
Required Experience
- 7–12 years of experience in Infrastructure Engineering, Reliability Engineering, SRE, Platform Engineering, or Distributed Systems environments.
- Strong experience operating mission-critical production systems in high-availability environments.
- Deep expertise in Linux systems, networking, and distributed infrastructure architecture.
- Strong hands-on experience with Kubernetes and containerized production environments.
- Strong programming ability in Go or Python.
- Experience with Kafka, Terraform, Vault, Consul, CI/CD pipelines, and infrastructure automation frameworks.
- Strong understanding of observability platforms including Prometheus, Alertmanager, logging, and tracing systems.
- Proven expertise debugging complex distributed systems and low-latency production environments.
- Experience in trading systems, fintech, exchanges, HFT firms, or other real-time infrastructure environments is highly preferred.
- Strong ownership mindset with the ability to operate in high-performance engineering environments.