Technical Lead – Site Reliability Engineering

DevRabbit IT Solutions

full-time

Required skills

Python
AWS
API
CloudFormation
communication skills
compliance
cross-functional
design patterns
DevOps
end-to-end
GitHub
incident response
infrastructure-as-code
Kubernetes
Linux
microservices
network security
PostgreSQL
Root Cause Analysis
SRE
Terraform
SDLC

About the role

Website: devrabbit.com
Job details:

Technical Lead – Site Reliability Engineering (SRE)

Employment Type: Full-Time

Location: Remote

Experience: 7+ Years

🔹 Role Overview

We are seeking a senior Technical Lead for SRE / Platform Engineering to own and evolve the reliability, scalability, and security of large-scale production systems. This role demands deep hands-on expertise in AWS, Linux, Kubernetes, Infrastructure-as-Code, and cloud security, combined with strong technical leadership, architectural ownership, and operational excellence.

The ideal candidate will act as a Directly Responsible Individual (DRI) for critical infrastructure initiatives, drive SLO/SLA-based reliability, lead incident response for complex production issues, and mentor engineers while making high-impact architectural decisions. A strong emphasis is placed on AI‑augmented engineering, using modern AI tools to accelerate delivery, improve documentation, and raise overall engineering effectiveness.

🚀 Key Responsibilities

1. Technical Leadership & Architecture

Own and drive the technical vision and architecture for team-owned infrastructure and platform systems.
Design and operate moderate-to-high complexity distributed systems, balancing reliability, scalability, performance, cost, and security.
Conduct architectural and design reviews, establishing and evolving platform-wide design patterns and standards.
Identify architectural risks and technical debt, and proactively implement long-term improvements.
Define, enforce, and continuously improve security best practices across all owned systems.
Proactively surface architecture, scalability, and security gaps to senior engineering leadership.

2. Reliability & Operational Excellence

Own the reliability posture of services by defining and managing SLOs, SLIs, and SLAs.
Lead incident response for complex, multi-service production incidents, driving root cause analysis and permanent remediation.
Establish best practices and standards for logging, metrics, alerting, tracing, and observability.
Anticipate operational risks and implement preventative measures to protect customer experience.
Participate in and help lead the on-call rotation, ensuring services are production-ready and well-instrumented.

3. Project & Delivery Ownership

Act as the Directly Responsible Individual (DRI) for medium-to-large SRE or platform initiatives spanning multiple months and teams.
Partner closely with Engineering Managers, Product Managers, and stakeholders to shape roadmaps and delivery plans.
Break down complex initiatives into well-scoped, executable milestones with clear ownership and timelines.
Negotiate scope and trade-offs while ensuring alignment with reliability and customer goals.
Identify and mitigate delivery risks related to dependencies, architecture drift, capacity, or staffing well in advance.

4. AI-Augmented Engineering & Innovation

Demonstrate strong fluency in AI-assisted development practices and integrate them into daily SRE and platform workflows.
Use AI tools to accelerate infrastructure design and validation; documentation, runbooks, and architectural decision records; and root cause analysis and incident learning.
Contribute to internal AI prompt libraries, coding workflows, and best-practice guidelines.
Stay up to date with emerging AI-driven engineering patterns and tools.
Coach and mentor teammates on responsible, effective, and secure AI usage across the SDLC.

✅ Required Experience

7+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in production cloud environments.
5+ years of hands-on experience with AWS (compute, networking, storage, security).
5+ years managing and operating Linux-based production systems at scale.
5+ years working with Infrastructure-as-Code tools (Terraform, AWS CDK, CloudFormation) and/or GitOps practices.
3+ years operating and troubleshooting production Kubernetes environments.
3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost.
3+ years of experience in cloud security (IAM, secrets management, network security, compliance).
3+ years of hands-on experience with PostgreSQL in production, including performance tuning, replication, backup, and recovery.
Proven experience leading multi-person, cross-functional technical projects from design through delivery.

🛠️ Technical Skills

Strong programming and automation skills using Python, Go, or similar languages.
Deep understanding of observability systems: metrics, logging, alerting, and distributed tracing.
Experience designing and managing CI/CD pipelines, release automation, and deployment strategies.
Strong grasp of backup, disaster recovery, and data retention strategies in cloud-native systems.
Experience with microservices architectures, service mesh concepts, and API gateway patterns.

🤖 AI Fluency

Hands-on experience with AI-powered coding assistants (e.g., Cursor, Augment, GitHub Copilot).
Ability to apply AI to break down complex infrastructure challenges and accelerate solution design.
Strong judgment to critically evaluate AI-generated outputs and identify risks, inaccuracies, or unsafe suggestions.

🤝 Leadership & Collaboration Skills

Proven ability to lead technical discussions, influence decisions, and drive alignment across teams.
Strong mentoring skills for junior and mid-level engineers.
Ability to work independently with minimal supervision and make final technical decisions as the DRI.
Excellent written and verbal communication skills with both technical and non-technical stakeholders.

🎯 Ideal Candidate Profile

✔ Senior, hands-on SRE / Platform Engineering leader

✔ Strong ownership mindset with end-to-end accountability

✔ Deep cloud, Kubernetes, and automation expertise

✔ Comfortable operating in fast-paced, ambiguous environments

✔ Passionate about reliability, security, and AI-driven productivity

Click Apply to learn more.
