techolution
Website: techolution.com
Job details:
Ready to architect the reliability backbone of cloud-native platforms that serve hundreds of millions of users? Join us in making the leap from Lab Grade AI to Real World AI, using your skills in CI/CD, monitoring, and microservices to build the enterprise of tomorrow.
Techolution is searching for a dynamic Senior Site Reliability Engineer (SRE) with deep, hands-on experience in real-world enterprise environments. If you have production expertise in AWS, EKS, and Terraform, a focus on engineering reliable, observable, and scalable cloud systems, and a proven track record of leading incident response and driving platform-wide improvements, we want you.
Designation: Senior Site Reliability Engineer (SRE)
Location: Remote
Employment Type: Full Time
Shift Timings: 6 PM IST to 2:30 AM IST
Please note: we are only considering candidates who are currently serving their notice period or can join immediately. If you are not serving notice, or have a 60-day or 90-day notice period, please refrain from applying.
Key Responsibilities:
- Own production reliability and incident response for cloud-native services on AWS, including SLO/SLI definition, error-budget management, and end-to-end leadership of Sev-1 and Sev-2 events.
- Architect, deploy, and operate containerized workloads on EKS (Kubernetes), ensuring scalable and secure applications with zero-downtime deployments across multiple environments.
- Design and manage infrastructure programmatically using Terraform, driving consistency, drift detection, and policy-as-code across multi-account AWS landing zones.
- Engineer and maintain CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab CI, streamlining software release cycles and improving deployment frequency and safety.
- Build observability across the stack using monitoring and logging tools like New Relic, Prometheus, Grafana, and the ELK Stack, designing alerts that fire on user-impacting symptoms rather than noise.
- Troubleshoot complex production issues across the microservices architecture, performing deep root-cause analysis and driving lasting fixes through post-incident reviews.
Technical Skills:
- Deep production expertise in AWS: Hands-on experience across EC2, VPC, IAM, S3, RDS, CloudFront, Route53, and KMS at multi-account scale. AWS is the foundation of our client engagement, and your fluency here directly determines Day-1 impact.
- Strong experience with EKS and Kubernetes: Production operation of clusters including upgrades, autoscaling, networking, secrets management, and resolving noisy-neighbor and resource-starvation scenarios. This is the orchestration layer for the entire platform.
- Mastery of Terraform: Module design, remote state management, workspaces, drift detection, and CI-integrated plan and apply workflows. You will be the technical custodian of how infrastructure is shipped.
- Hands-on engineering of CI/CD pipelines (Jenkins, GitHub Actions, or GitLab CI): Including build, test, security-gate, signing, and progressive-delivery stages. Release velocity depends on the pipelines you own.
- Elite Troubleshooting Skills: Calm, methodical, hypothesis-driven debugging across the network, OS, runtime, and application layers in real time. This is the single highest-leverage skill in the role.
- Working knowledge of Monitoring and Logging Tools (New Relic, Prometheus, ELK Stack): Production exposure to designing dashboards, alerts, and distributed traces that surface real customer impact.
- Strong grasp of Microservices Architecture: Fluency with distributed-system patterns including service discovery, retries, idempotency, circuit breakers, and async messaging.
- Exposure to AWS CDK and Lambda: Comfort building infrastructure and event-driven systems programmatically to reduce toil and extend the platform.
- Preferred development skills in Java and/or JavaScript/TypeScript: Enough to read service code, ship small fixes, and pair productively with application engineers.
- Active certification: At least one of AWS Solutions Architect (Associate or Professional), AWS Developer Associate, AWS DevOps Engineer Professional, or a Kubernetes certification (CKA or CKAD). This is required by the client engagement and is a signal of continued investment in craft.
Foundational Must Haves:
- Exceptional Collaboration and Communication Skills: You will work directly with senior client stakeholders, write incident reports that read like product docs, and represent Techolution's engineering bar in client forums.
- Demonstrated Ownership: Taking full responsibility for production systems from inception to incident closure, and proactively seeking improvements rather than waiting for tickets. This mindset is critical for driving reliability forward.
- Seeker Mindset: A relentless curiosity about how systems fail and an obsession with making them fail less, paired with eagerness to learn new technologies in the rapidly evolving cloud landscape.
- Genuine Passion for the Work: A deep enthusiasm for engineering craft and problem-solving that translates into high-quality contributions and a positive impact on our team and clients.
- Extremely Ambitious Drive: A strong desire to excel, push boundaries, and contribute significantly to Techolution's innovative goals and client success, including the resilience to operate on a US-aligned shift.
- Unbeatable Work Ethic: A commitment to diligence, reliability, and integrity in all aspects of your work, ensuring consistently high performance and trust within the team and with the client.
- Exceptional Ability to Comprehend: The capacity to quickly understand complex technical architectures, project requirements, and team discussions, enabling effective problem-solving and collaboration.
Negotiable Skills:
- Exposure to advanced Kubernetes ecosystem tools (Helm, KEDA, Karpenter, service mesh): Experience operating these in production to handle complex autoscaling and traffic management scenarios.
- Knowledge of advanced observability practices (distributed tracing, RED/USE metrics, SLO engineering): Designing telemetry that reflects customer experience rather than just infrastructure health.
- Familiarity with Chaos Engineering tools (Gremlin, AWS FIS, Chaos Monkey): Practical experience injecting failure to validate system resilience before incidents do.
- Basic understanding of Database Administration (RDS, Aurora, DynamoDB): Knowledge of fundamental database concepts and operations, useful for managing data persistence layers in production applications.
- Exposure to AI/ML workload reliability (model serving, GPU node groups, inference autoscaling): A strong plus given Techolution's focus on real-world AI in production environments.
Click Apply to learn more.