DevRabbit IT Solutions
Website: devrabbit.com
Job details:
Technical Lead – Site Reliability Engineering (SRE)
Employment Type: Full-Time
Location: Remote
Experience: 7+ Years
🔹 Role Overview
We are seeking a senior Technical Lead (SRE / Platform Engineering) to own and evolve the reliability, scalability, and security of large-scale production systems. This role demands deep hands-on expertise in AWS, Linux, Kubernetes, Infrastructure-as-Code, and cloud security, combined with strong technical leadership, architectural ownership, and operational excellence.
The ideal candidate will act as a Directly Responsible Individual (DRI) for critical infrastructure initiatives, drive SLO/SLA-based reliability, lead incident response for complex production issues, and mentor engineers while making high-impact architectural decisions. The role places strong emphasis on AI-augmented engineering: using modern AI tools to accelerate delivery, improve documentation, and raise overall engineering effectiveness.
🚀 Key Responsibilities
1. Technical Leadership & Architecture
- Own and drive the technical vision and architecture for team-owned infrastructure and platform systems.
- Design and operate moderate-to-high complexity distributed systems, balancing reliability, scalability, performance, cost, and security.
- Conduct architectural and design reviews, establishing and evolving platform-wide design patterns and standards.
- Identify architectural risks and technical debt, and proactively implement long-term improvements.
- Define, enforce, and continuously improve security best practices across all owned systems.
- Proactively surface architecture, scalability, and security gaps to senior engineering leadership.
2. Reliability & Operational Excellence
- Own the reliability posture of services by defining and managing SLOs, SLIs, and SLAs.
- Lead incident response for complex, multi-service production incidents, driving root cause analysis and permanent remediation.
- Establish best practices and standards for logging, metrics, alerting, tracing, and observability.
- Anticipate operational risks and implement preventative measures to protect customer experience.
- Participate in and help lead the on-call rotation, ensuring services are production-ready and well-instrumented.
3. Project & Delivery Ownership
- Act as the Directly Responsible Individual (DRI) for medium-to-large SRE or platform initiatives spanning multiple months and teams.
- Partner closely with Engineering Managers, Product Managers, and stakeholders to shape roadmaps and delivery plans.
- Break down complex initiatives into well-scoped, executable milestones with clear ownership and timelines.
- Negotiate scope and trade-offs while ensuring alignment with reliability and customer goals.
- Identify and mitigate delivery risks well in advance, including those related to dependencies, architecture drift, capacity, or staffing.
4. AI-Augmented Engineering & Innovation
- Demonstrate strong fluency in AI-assisted development practices and integrate them into daily SRE and platform workflows.
- Use AI tools to accelerate:
  - Infrastructure design and validation
  - Documentation, runbooks, and architectural decision records
  - Root cause analysis and incident learning
- Contribute to internal AI prompt libraries, coding workflows, and best-practice guidelines.
- Stay up to date with emerging AI-driven engineering patterns and tools.
- Coach and mentor teammates on responsible, effective, and secure AI usage across the SDLC.
✅ Required Experience
- 7+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in production cloud environments.
- 5+ years of hands-on experience with AWS (compute, networking, storage, security).
- 5+ years managing and operating Linux-based production systems at scale.
- 5+ years working with Infrastructure-as-Code tools (Terraform, AWS CDK, CloudFormation) and/or GitOps practices.
- 3+ years operating and troubleshooting production Kubernetes environments.
- 3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost.
- 3+ years of experience in cloud security (IAM, secrets management, network security, compliance).
- 3+ years of hands-on experience with PostgreSQL in production, including performance tuning, replication, backup, and recovery.
- Proven experience leading multi-person, cross-functional technical projects from design through delivery.
🛠️ Technical Skills
- Strong programming and automation skills using Python, Go, or similar languages.
- Deep understanding of observability systems: metrics, logging, alerting, and distributed tracing.
- Experience designing and managing CI/CD pipelines, release automation, and deployment strategies.
- Strong grasp of backup, disaster recovery, and data retention strategies in cloud-native systems.
- Experience with microservices architectures, service mesh concepts, and API gateway patterns.
🤖 AI Fluency
- Hands-on experience with AI-powered coding assistants (e.g., Cursor, Augment, GitHub Copilot).
- Ability to apply AI to break down complex infrastructure challenges and accelerate solution design.
- Strong judgment to critically evaluate AI-generated outputs and identify risks, inaccuracies, or unsafe suggestions.
🤝 Leadership & Collaboration Skills
- Proven ability to lead technical discussions, influence decisions, and drive alignment across teams.
- Strong mentoring skills for junior and mid-level engineers.
- Ability to work independently with minimal supervision and make final technical decisions as the DRI.
- Excellent written and verbal communication skills with both technical and non-technical stakeholders.
🎯 Ideal Candidate Profile
✔ Senior, hands-on SRE / Platform Engineering leader
✔ Strong ownership mindset with end-to-end accountability
✔ Deep cloud, Kubernetes, and automation expertise
✔ Comfortable operating in fast-paced, ambiguous environments
✔ Passionate about reliability, security, and AI-driven productivity
Click Apply to learn more.