Platform Reliability Engineer

TWG GLOBAL LIMITED

Salary: $120k - $190k
Experience: 3+ yrs
Location: New York, Jacksonville, or Santa Monica
Job type: Full-time

Required skills

Docker
Kubernetes
Terraform
GitLab/GitHub Actions
Airflow
Prometheus
Grafana
ELK
Datadog
AWS
GCP
Azure
Python
Bash

About the role

At TWG Group Holdings, LLC (“TWG Global”), we drive innovation and business transformation across a range of industries—including financial services, insurance, technology, media, and sports—by leveraging data and AI as core assets. Our AI-first, cloud-native approach delivers real-time intelligence and interactive business applications, empowering informed decision-making for both customers and employees.

We prioritize responsible data and AI practices to ensure ethical standards and regulatory compliance. Our decentralized structure enables each business unit to operate autonomously, supported by a central AI Solutions Group, while strategic partnerships with leading data and AI vendors fuel game-changing efforts in marketing, operations, and product development.

You will collaborate with management to advance our data and analytics transformation, enhance productivity, and enable agile, data-driven decisions. By leveraging relationships with top tech startups and universities, you will help create competitive advantages and drive enterprise innovation.

At TWG Global, your contributions will support our goal of sustained growth and superior returns, as we deliver rare value and impact across our businesses.

We’re a fast-growing AI/ML team delivering high-impact use case solutions to financial institutions, insurers, and other regulated enterprises. Backed by proven leaders in finance and national security, our team is scaling rapidly to serve clients across North America with robust, secure, and production-grade AI solutions.

Role Overview

We are seeking a Platform Reliability Engineer (SRE) to ensure the scalability, stability, and performance of our data platforms and ML infrastructure. You’ll work closely with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and reduce operational overhead.

What you'll do:

Build and maintain infrastructure to support real-time and batch ML workloads
Implement observability tools (logging, monitoring, alerting) for model performance and system uptime
Design and manage CI/CD pipelines for applications
Ensure high availability, disaster recovery, and rollback capabilities for production environments
Manage access controls, secrets, and security policies in collaboration with compliance and IT
Troubleshoot incidents, lead postmortems, and drive root-cause resolution
Work with U.S. and international teams to provide 24/7 coverage across time zones

About TWG GLOBAL LIMITED

Investment holding company scaling AI across finance and sports.
