Observability Engineer

TGS The Global Skills

full-time

Required skills

AWS
communication skills
data visualization
DevOps
Elasticsearch
Flux
incident response
microservices
Root Cause Analysis
SQL
SRE
Terraform

About the role

Website: theglobalskills.com
Job details:

Job Title: Grafana & Prometheus Specialist / Observability Engineer

Location: Any MP Office

Joining: Immediate Joiners Only (Project starts June 1st)

Experience: 7+ Years

Role Overview:

We are specifically looking for a Grafana & Prometheus expert, not a traditional DevOps Engineer.

This role is focused on Observability Engineering, Reliability Monitoring, and Deep Metrics Intelligence. The ideal candidate should have strong expertise in PromQL, advanced Grafana dashboarding, monitoring architecture, alerting strategies, and transforming raw logs/metrics into actionable operational insights.

Candidates whose experience is primarily in CI/CD, Terraform setup, infrastructure provisioning, or standard cloud DevOps operations are not a fit for this role.

Key Responsibilities:

• Develop, optimize, and troubleshoot complex PromQL queries to extract actionable metrics from Prometheus

• Design advanced Grafana dashboards with dynamic variables, transformations, drill-downs, and multi-source integrations

• Configure Prometheus Service Discovery for auto-scaling and dynamic target discovery

• Build monitoring solutions for large-scale distributed systems and microservices

• Implement proactive alerting strategies using Alertmanager and anomaly detection techniques

• Integrate multiple observability data sources including Prometheus, SQL, Elasticsearch, logs, and cloud metrics

• Create correlated dashboards combining metrics, logs, and traces into a unified operational view

• Support monitoring, incident response, root cause analysis, and reliability engineering initiatives

• Build observability workflows that support self-healing systems and automated triggers

• Collaborate directly with client stakeholders and offshore teams

Must Have Skills:

✅ Expert-level Prometheus & PromQL experience (Mandatory)

✅ Advanced Grafana Dashboard Engineering

✅ Strong understanding of Prometheus architecture, exporters, scraping configs, and recording rules

✅ Experience with Alertmanager, Exporters, and monitoring ecosystems

✅ Service Discovery configuration experience

✅ Multi-source dashboard integration

✅ AWS Cloud knowledge

✅ Large-scale data & log management experience

✅ Strong understanding of Monitoring & Incident Response workflows

✅ Excellent English communication skills

Strongly Preferred:

• Experience with Loki, Tempo, Flux, and Elasticsearch

• Experience creating scalable dashboards for hundreds of microservices

• Ability to design dashboards that tell operational stories through data visualization

• Experience with anomaly detection and self-healing monitoring systems

• Observability-first mindset focused on Reliability & Visibility

Important Note:

This is NOT a generic DevOps role.

We are specifically looking for candidates whose core expertise is:

• Prometheus

• PromQL

• Grafana

• Monitoring Architecture

• Observability Engineering

• Reliability Engineering

Do NOT submit candidates focused mainly on:

❌ CI/CD pipelines only

❌ Terraform-heavy infrastructure roles

❌ General cloud administration

❌ Standard DevOps/SRE profiles without deep PromQL & Grafana expertise

Ideal Resume Indicators:

• Strong PromQL projects

• Advanced Grafana dashboards

• Monitoring automation

• Alerting frameworks

• Service Discovery implementations

• Multi-source observability platforms

• Incident response ownership

• Reliability engineering contributions

Click Apply to learn more.