Site Reliability Engineer

techolution

full-time

Required skills

AWS
Backbone
CI
CloudFront
communication skills
database
DevOps
DynamoDB
EC2
end-to-end
ethics
GitHub
GPU
Helm
incident response
Java
JavaScript
Jenkins
Kubernetes
Lambda
microservices
Node
SRE
state management
Terraform
TypeScript
VPC

About the role

Website: techolution.com
Job details:

Ready to architect the reliability backbone of cloud-native platforms that serve hundreds of millions of users? Join us in making the leap from Lab Grade AI to Real World AI, applying your skills in CI/CD, monitoring, and microservices to build the enterprise of tomorrow.

Techolution is searching for a dynamic Senior Site Reliability Engineer (SRE) with deep, hands-on experience in real-world enterprise environments. If you have production expertise in AWS, EKS, and Terraform, a focus on engineering reliable, observable, and scalable cloud systems, and a proven track record of leading incident response and driving platform-wide improvements, we want you.

Designation: Senior Site Reliability Engineer (SRE)

Location: Remote

Employment Type: Full Time

Shift Timings: 6 PM IST to 2:30 AM IST

Please note: we are only considering candidates who are currently serving their notice period or can join immediately. If you are not serving notice, or have a 60-day or 90-day notice period, please refrain from applying.

Key Responsibilities:

Own production reliability and incident response for cloud-native services on AWS, including SLO/SLI definition, error-budget management, and end-to-end leadership of Sev-1 and Sev-2 events.
Architect, deploy, and operate containerized workloads on EKS (Kubernetes), ensuring scalable, secure, and zero-downtime applications across multiple environments.
Design and manage infrastructure programmatically using Terraform, driving consistency, drift detection, and policy-as-code across multi-account AWS landing zones.
Engineer and maintain CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab CI, streamlining software release cycles and improving deployment frequency and safety.
Build observability across the stack using monitoring and logging tools like New Relic, Prometheus, Grafana, and the ELK Stack, designing alerts that fire on user-impacting symptoms rather than noise.
Troubleshoot complex production issues across the microservices architecture, performing deep root-cause analysis and driving lasting fixes through post-incident reviews.

Technical Skills:

Deep production expertise in AWS: Hands-on experience across EC2, VPC, IAM, S3, RDS, CloudFront, Route53, and KMS at multi-account scale. AWS is the foundation of our client engagement, and your fluency here directly determines Day-1 impact.
Strong experience with EKS and Kubernetes: Production operation of clusters including upgrades, autoscaling, networking, secrets management, and resolving noisy-neighbor and resource-starvation scenarios. This is the orchestration layer for the entire platform.
Mastery of Terraform: Module design, remote state management, workspaces, drift detection, and CI-integrated plan and apply workflows. You will be the technical custodian of how infrastructure is shipped.
Hands-on engineering of CI/CD pipelines (Jenkins, GitHub Actions, or GitLab CI): Including build, test, security-gate, signing, and progressive-delivery stages. Release velocity depends on the pipelines you own.
Elite Troubleshooting Skills: Calm, methodical, hypothesis-driven debugging across the network, OS, runtime, and application layers in real time. This is the single highest-leverage skill in the role.
Working knowledge of Monitoring and Logging Tools (New Relic, Prometheus, ELK Stack): Production exposure to designing dashboards, alerts, and distributed traces that surface real customer impact.
Strong grasp of Microservices Architecture: Fluency with distributed-system patterns including service discovery, retries, idempotency, circuit breakers, and async messaging.
Exposure to AWS CDK and Lambda: Comfort building infrastructure and event-driven systems programmatically to reduce toil and extend the platform.
Preferred development skills in Java and/or JavaScript/TypeScript: Enough to read service code, ship small fixes, and pair productively with application engineers.
Active certification: at least one of AWS Solutions Architect (Associate or Professional), AWS Developer Associate, AWS DevOps Engineer Professional, or a Kubernetes certification (CKA or CKAD). Required by the client engagement and a signal of continued investment in craft.

Foundational Must Haves:

Exceptional Collaboration and Communication Skills: You will work directly with senior client stakeholders, write incident reports that read like product docs, and represent Techolution's engineering bar in client forums.
Demonstrated Ownership: Taking full responsibility for production systems from inception to incident closure, and proactively seeking improvements rather than waiting for tickets. This mindset is critical for driving reliability forward.
Seeker Mindset: A relentless curiosity about how systems fail and an obsession with making them fail less, paired with eagerness to learn new technologies in the rapidly evolving cloud landscape.
Genuine Passion for the Work: A deep enthusiasm for engineering craft and problem-solving, translating into high-quality contributions and a positive impact on our team and clients.
Extremely Ambitious Drive: A strong desire to excel, push boundaries, and contribute significantly to Techolution's innovative goals and client success, including the resilience to operate on a US-aligned shift.
Unwavering Work Ethic: A commitment to diligence, reliability, and integrity in all aspects of your work, ensuring consistent high performance and trust within the team and with the client.
Exceptional Ability to Comprehend: The capacity to quickly understand complex technical architectures, project requirements, and team discussions, enabling effective problem-solving and collaboration.

Negotiable Skills:

Exposure to advanced Kubernetes ecosystem tools (Helm, KEDA, Karpenter, service mesh): Experience operating these in production to handle complex autoscaling and traffic management scenarios.
Knowledge of advanced observability practices (distributed tracing, RED/USE metrics, SLO engineering): Designing telemetry that reflects customer experience rather than just infrastructure health.
Familiarity with Chaos Engineering tools (Gremlin, AWS FIS, Chaos Monkey): Practical experience injecting failure to validate system resilience before incidents do.
Basic understanding of Database Administration (RDS, Aurora, DynamoDB): Knowledge of fundamental database concepts and operations, useful for managing data persistence layers in production applications.
Exposure to AI/ML workload reliability (model serving, GPU node groups, inference autoscaling): A strong plus given Techolution's focus on real-world AI in production environments.

Click on Apply to know more.
