Lead Site Reliability Engineer, DevOps

Location

Pune Division, Maharashtra, India

Job Type

Full-time

About the job

This job is sourced from a job board.

Overview

About the role

Qualys

Website: qualys.com
Job details:
Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Job Title

Senior Site Reliability Engineer (SRE) – Observability & DevOps

Role Summary

We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems.

This role requires both deep technical expertise and a production-ownership mindset.

Primary Responsibilities

Observability & Monitoring

Design, implement, and maintain end-to-end observability using:

Prometheus for metrics collection
Alertmanager for alert routing, deduplication, and escalation
Grafana for visualization and dashboards
AppDynamics for APM, transaction tracing, and application health

Build actionable dashboards for:

SLIs, SLOs, and error budgets
Application, infrastructure, and platform health

Reduce alert fatigue by implementing signal-based alerting and proper severity models

Data & Metrics Platform

Manage and optimize ClickHouse for:

High-volume metrics, logs, or traces
Long-term retention and fast analytical queries

Work on schema design, performance tuning, and cost optimization

Reliability & Operations

Define and measure reliability targets (SLIs, SLOs, SLAs) in line with SRE best practices
Participate in incident response, postmortems, and root cause analysis
Drive reliability improvements through automation and capacity planning

Automation & Engineering

Develop tooling and automation using at least one scripting/programming language
Automate monitoring onboarding, alert generation, and dashboard creation
Improve operational efficiency across DevOps tooling

Required Technical Skills (Must-Have)

Core Skills

Strong Linux fundamentals

Troubleshooting, performance tuning, networking, system internals

Scripting / Programming (any one or more):

Python (preferred), Bash, Go, or similar

Observability Tools (Hands-on):

Prometheus
Alertmanager
Grafana
AppDynamics

Data Platform:

Hands-on experience with ClickHouse

Monitoring & Alerting Concepts

Metrics vs logs vs traces
Golden signals (latency, traffic, errors, saturation)
Alert thresholds, routing policies, escalation strategies

Preferred / Nice-to-Have Skills

Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
Infrastructure as Code (Terraform, Helm)
CI/CD observability
Cloud platforms (AWS / Azure / GCP)
Experience managing observability at scale (100+ services / platforms)

Senior-Level Expectations

Ability to architect observability solutions, not just operate them
Strong production troubleshooting and incident ownership
Mentoring of junior engineers
Influence on DevOps and SRE best practices across teams
Clear communication with developers and leadership

Experience & Qualification

5-7 years of experience in SRE / DevOps / Production Engineering
Experience operating high-availability, large-scale systems
Proven background in observability-driven reliability improvements

Skills

Python

AWS

Azure

Bash

capacity planning

DevOps

GCP

Helm

incident response

Kubernetes

Linux

Root Cause Analysis

SRE

Terraform