Site Reliability Engineer/Cloud Engineer

Location

Pune, Maharashtra, India

Job Type

full-time

About the job

Overview

About the role

As an SRE, your job entails architecting, implementing and managing heterogeneous and diverse tech stacks spanning multiple datacentres and various cloud providers. You will implement and manage enterprise-level software that provides hosting and domain-related services to millions of customers across the globe. Your role as an SRE is primarily focussed on helping business and development teams grow and roll out new features to the market, with a strong commitment to quality and availability. At the same time, you will be an expert detective, diving into complex escalations involving enterprise-level technical challenges, engineering problems, customer connects and platform growth concerns. This role also involves managing short- and long-term projects under SLA and adhering to deadlines.

Key Responsibilities:

Architect and maintain mission-critical global hybrid infrastructure spanning multiple datacenters and cloud providers, primarily leveraging open source technologies.
Design next-generation scalable systems that are highly available, resilient and capable of handling high-volume, Internet-facing web traffic.
Take responsibility for downtime, and maintain the product SLA, capacity planning, and the overall health and performance of large-scale production systems.
Participate in a weekly 24/7 on-call rotation: solve escalated tickets, resolve outages and debug production issues.
Work closely with various stakeholders such as Engineering, Monitoring and Operations teams, NOC/SOC, customers and business development teams.
Challenge the status quo. Empower development teams by transitioning legacy methodologies, platforms and technologies to DevOps principles, cloud-native technologies and newer ecosystems without much friction.
Adhere strictly to automating and scripting routine tasks, with a low tolerance for manual processes.
Be data- and metrics-driven. Develop tools and platforms for better system observability and insights.
Write design decision documentation, implement overall production best practices with a strong focus on security, and encourage the right DevOps workflows.
Design, develop, and deploy modular cloud-based systems.
Educate teams on the implementation of new cloud technologies and initiatives.
Develop and maintain cloud solutions in accordance with best practices.

Requirements

Excellent knowledge of Linux internals and OS fundamentals such as the scheduler, memory, storage and networking. Has managed production servers running RHEL/CentOS/Ubuntu distributions.
Good understanding of Linux filesystems and of Linux troubleshooting spanning networks and systems. Sound knowledge of the shell / command line, OSI, TCP/IP and networking fundamentals is mandatory.
Exposure to RDBMSs such as MySQL and PostgreSQL.
Exposure to at least one configuration management tool such as Puppet, Ansible or Chef, and an understanding of Git concepts and terminology.
Can code in Python to write scripts and automate routine tasks.
Public cloud and Kubernetes experience.
Proven work experience as a Cloud Engineer or in a similar role.
Working experience in multiple clouds, especially AWS and GCP, with expertise in cost optimization, is an added advantage.
AWS and/or GCP certifications are preferred but not mandatory.

Preferred Skillset:

A generalist who has knowledge of the skills listed above and below. Someone who understands everything from DNS to deployments.
Has managed large-scale web infrastructure in the past, with a deep understanding of L4/L7 load balancing, high availability and DNS. Has worked with HAProxy, Nginx, Heartbeat/Keepalived, Pacemaker etc. Prior experience managing DNS and large-scale email systems is a bonus.
Has prior systems administration and troubleshooting experience, and exposure to high-traffic production environments dealing primarily with web application stacks on Apache / Nginx / Tomcat etc.
Sound knowledge of various RDBMS and NoSQL databases such as MySQL / PostgreSQL, Redis, Cassandra etc. Exposure to database clustering solutions is a plus.
Experience deploying, maintaining, patching and upgrading systems at scale with automation tools such as Rundeck.
Exposure to metrics and logging stacks such as Ganglia, TICK, Grafana/Influx/Graphite, Prometheus, ELK, Fluentd, Splunk, Graylog etc.
Understands the basic principles of virtualisation and containerisation, with working knowledge of Docker and KVM/Libvirt. Exposure to infrastructure orchestration platforms such as Kubernetes, OpenShift, OpenStack and Mesos is a bonus.
Production experience deploying to AWS/GCP and proficiency in IaC toolchains such as Terraform and CloudFormation is a bonus.
Public cloud governance: awareness, maintenance and implementation.
Experience in managing CI/CD pipelines using tools such as Jenkins, Bamboo etc.
Proficient in at least one scripting/programming language such as Python, Ruby, Golang, Perl, PowerShell etc.
Understands the importance of basic system, application and network security. Exposure to benchmarks such as CIS, NIST and OpenSCAP is a bonus.

Skills

Linux

RDBMS

Python

AWS

GCP

Kubernetes

Ansible

Chef

Docker

Terraform