Senior DevOps Engineer

BitQcode Capital

full-time

Required skills

Python
AWS
Ansible
Bash
BGP
capacity planning
compliance
configuration management
Datadog
DevOps
DNS
Docker
ECS
end-to-end
fintech
GCP
GitHub
GPU
incident response
Jenkins
Kubernetes
Linux
RDP
Rust
SRE
state management
TCP
Terraform
TypeScript
uptime
Vault
PowerShell
YAML

About the role

BitQcode Capital is a quantitative hedge fund running fully automated trading strategies around the clock across global markets, multiple venues, and multiple asset classes. Our infrastructure spans AWS, GCP, Windows trading hosts, Linux compute, and a growing stack of real-time data pipelines and research workloads.

We're hiring a Senior DevOps Engineer to be the technical owner of our production infrastructure. You'll set the standards, design the systems, and be accountable for uptime, security, and cost across the entire stack. This is a small team, so your decisions become the platform.

What you'll own

Production architecture. End-to-end design of trading and research infrastructure across AWS and GCP. You set the patterns; others follow them.
Reliability. SLOs, error budgets, incident response, postmortems. Market-hours downtime is measured against live P&L — you own the number.
Multi-cloud strategy. Workload placement, failover topology, cross-cloud networking, and vendor risk. You make the build-vs-buy and AWS-vs-GCP calls.
Security and compliance. IAM design, secrets management (Vault / cloud-native), network segmentation, audit logging, broker and regulator compliance posture.
CI/CD platform. Pipeline standards for Python, TypeScript, and proprietary execution components. Release safety, rollback strategy, deployment gating.
Infrastructure as Code at scale. Terraform or Pulumi modules, environment promotion, drift detection. Nothing manual.
Disaster recovery. Multi-region failover for critical components, hot-standby execution connectivity, tested DR drills with documented RTO/RPO targets.
Observability platform. Metrics, logs, traces. Custom alerting for connectivity loss, process freezes, latency spikes, queue backpressure.
Cost discipline. Quarterly cost reviews, reserved capacity planning, spot strategy, waste elimination. Concrete savings, not slideware.
Mentorship. You're the senior on infra. Junior engineers and interns learn from how you operate.

What you bring

Required

6+ years of DevOps / SRE / Platform Engineering in production, with at least 2 years in a senior or lead capacity.
Deep production experience across both AWS and GCP. You've made architectural decisions in both, debugged both at 3 AM, and have opinions about both.
Mastery of Linux and Windows server fleets at scale. Windows RDP administration, group policy, scheduled task management, PowerShell automation — all second nature.
Strong programming in Python; comfortable in Bash, PowerShell, and at least one of Go / TypeScript / Rust.
Production-scale Terraform (or Pulumi) — module design, state management, multi-account / multi-project layouts.
Configuration management at scale (Ansible / Chef / Salt).
CI/CD design ownership — GitHub Actions, GitLab CI, Jenkins, or equivalent. You've designed the system, not just edited the YAML.
Containers in production: Docker required; Kubernetes or ECS at meaningful scale required.
Observability ownership: Prometheus + Grafana, Datadog, or equivalent — including custom exporters and SLI/SLO design.
Deep networking: TCP/IP, DNS, BGP basics, VPNs, load balancers, TLS, troubleshooting at the packet level.
Security: IAM design, secrets management, network segmentation, vulnerability management.
DR and HA design — with war stories, not just diagrams.
Incident command experience. You've run a P0, written the postmortem, and shipped the fix.

Strongly preferred

Trading, fintech, or other latency-sensitive infrastructure.
Financial market connectivity protocols (FIX, REST, WebSocket) and counterparty integrations.
GPU instance management for ML/inference workloads.
SOC 2 / ISO 27001 / similar compliance program ownership.
HashiCorp Vault production deployments.
Cost optimization with documented multi-figure savings.

How you operate

You treat manual production changes as a smell, not a tool.
You write runbooks before incidents, not after.
You measure infrastructure in MTTR and cost, not in tickets closed.
You push back when something is fragile, even when it's already in production.
You raise the bar for everyone around you without slowing them down.

Click Apply to learn more.