Director, Site Reliability Engineering

Location

Dallas, TX

Job Type

Full-time

About the job

Overview

About the role

Job Title: Director, Site Reliability Engineering

Reports To: SVP, QA

FLSA Status: Exempt

Department: Technology

JOB SUMMARY:

Responsible for leading the strategy, architecture, and operations of the Site Reliability Engineering (SRE) function at LendingPoint. This includes overseeing infrastructure automation, DevSecOps, CI/CD pipelines, observability, release management, system stability, and incident response. The Director acts as a high-level technical decision-maker, establishing technical standards, guiding architectural decisions, and ensuring the reliability and scalability of systems to support business goals.

ESSENTIAL JOB FUNCTIONS:

· Provide day-to-day leadership to the SRE team, ensuring effective operations, growth, and innovation.

· Manage cloud-native infrastructure, including servers, container clusters, databases, and networks across AWS/GCP/Azure.

· Design and scale CI/CD pipelines and observability tools (Grafana, Prometheus, Dynatrace, FullStory, etc.) for production-grade environments.

· Oversee release planning, coordination, risk mitigation, and change control across engineering and business stakeholders.

· Implement proactive monitoring, alerting, and incident response systems to ensure performance and reliability.

· Lead capacity planning and scaling efforts for high-growth environments and services.

· Drive automation initiatives to optimize operations, reduce manual effort, and improve service quality.

· Manage vendor relationships with cloud providers, data centers, and infrastructure partners to uphold SLAs and resolve issues efficiently.

· Own disaster recovery and business continuity strategies to minimize downtime and ensure data resilience.

· Develop and maintain infrastructure and operational documentation; provide internal training as needed.

· Guide cross-functional release planning across Product, QA, Engineering, and IT Ops to align with business goals.

· Lead retrospectives for major incidents and continuously improve recovery time and system reliability.

· Promote a culture of continuous improvement, learning, and engineering excellence within the team.

MINIMUM QUALIFICATIONS:

· Bachelor's degree in computer science or a related discipline preferred.

· 10+ years of experience in SRE or DevOps roles supporting high-scale systems.

· 5+ years of experience leading SRE/DevOps or release teams.

· Strong expertise in Kubernetes administration, Docker container orchestration, and infrastructure as code (IaC).

· Experience managing production infrastructure on AWS, Azure, or Google Cloud Platform.

· Deep knowledge of monitoring, logging, and alerting tools such as Prometheus, Dynatrace, FullStory, or Nagios.

· Hands-on experience with CI/CD tools (e.g., GitLab CI, Jenkins), IaC (Terraform), and scripting languages (Python, Bash, Go).

· Strong programming background in Java, with experience building and scaling microservices-based platforms.

· Solid understanding of web/API technologies (REST, JSON), observability, and API gateways.

· Experience managing environments across development, QA, staging, and production tiers.

· Proven ability to lead disaster recovery planning, business continuity, and compliance enforcement.

· Certification in relevant areas (e.g., AWS, Azure Administrator, GCP Network Engineer) is a plus.

· Excellent analytical, troubleshooting, and decision-making skills for complex system problems.

· Strong verbal and written communication skills, with the ability to interact at all levels of the organization.

COMPETENCIES:

· Customer Service: Exceptional attitude and a passion for providing outstanding service to internal customers.

· Analytical Skills: Proven capacity to extract and manipulate large datasets in an efficient manner.

· Communications: Exhibits good listening and comprehension. Expresses ideas and thoughts in verbal and written form. Strong presentation skills.

· Compliance & Risk Awareness: Enforces standards and policies to ensure secure, compliant operations.

· Infrastructure Management: Expert in managing cloud infrastructure, scalability, security, and platform efficiency.

· Observability & Incident Response: Establishes comprehensive monitoring and drives high-quality incident handling.

· Problem Solving: Tackles complex systems issues with data-driven strategies and root cause analysis.

· Release & Change Management: Effectively governs the release lifecycle, balancing speed with stability.

· Strategic Communication: Engages cross-functional teams and leadership with clarity, transparency, and influence.

· Team Leadership: Inspires and manages high-performing engineering teams with a focus on trust, agility, and resilience.

SUPERVISORY RESPONSIBILITY

Yes

PHYSICAL DEMANDS

While performing the duties of this job, the employee is regularly required to stand, walk, reach and sit for a minimum of 8 hours with or without reasonable accommodation. The employee is required to use hands to finger, handle, or feel objects and/or tools. The employee is required to talk or hear with or without reasonable accommodation and must sometimes lift and move up to 10 pounds.

WORK ENVIRONMENT

While performing the duties of this job, the employee is frequently exposed to moderate noise in an office setting, such as from computers, printers, and light traffic.

This role is in-office. Remote work may be performed from a pre-approved location, as arranged and scheduled by team management and approved by department leadership.

OTHER DUTIES

Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities that are required of the employee for this job. Duties, responsibilities, and activities may change or be supplemented at any time with or without notice.

Equal Opportunity Employer

This employer is required to notify all applicants of their rights pursuant to federal employment laws.
For further information, please review the Know Your Rights notice from the Department of Labor.

Skills

Python

AWS

API

Azure

Bash

capacity planning

change management

cloud infrastructure

communication skills

compliance

cross-functional

DevOps

Docker

GCP

GitLab

Google Cloud

incident response

infrastructure management

Java

Jenkins

JSON

Kubernetes

Nagios

network engineer

Root Cause Analysis

SRE

Terraform