Website:
csiglobal.co.uk
Job details:
Job Title: Senior Platform Grafana Engineer – Cloud Observability
Location: Pan India
Experience: 10+ Years
Only candidates who can join immediately or within 30 days will be considered. A minimum of 10 years of experience at senior level is required.
NOTE: Applications without the code test will not be considered.
Please apply only if you can share your CV together with the code test described below.
Job Summary
We are seeking a Grafana Engineer to design, build, and operate cloud-native observability solutions across our platforms. You will lead instrumentation, dashboards, alerting, and SLO/SLA reporting using Grafana and its ecosystem, integrating with cloud services and metrics/logs/tracing backends to deliver reliable, actionable insights for platform and product teams.
Key Responsibilities
- Observability Architecture: Define and implement end-to-end observability patterns (metrics, logs, traces, events, SLOs) using Grafana, Prometheus-compatible systems, and cloud-native services.
- Data Sources & Integrations: Configure and manage Grafana data sources (Prometheus, Elastic/OpenSearch, CloudWatch/Azure Monitor/Google Cloud Monitoring (formerly Stackdriver), SQL), enabling cross-system correlation and unified views.
- Dashboarding & Visualization: Build reusable, templated, role-based dashboards (Grafana panels, variables, transformations) that provide meaningful KPIs, health checks, and executive reporting.
- Alerting & Incident Response: Implement alerting (Grafana Alerting, Alertmanager, ServiceNow/Teams) with noise reduction, deduplication, and escalation policies; contribute to on-call runbooks and post-incident reviews.
- SLO/SLI Engineering: Define SLIs and SLOs with error budgets, implement tracking and burn-rate alerts, and partner with service owners to align reliability goals with business outcomes.
- Cloud & Kubernetes: Integrate our self-hosted cloud-native workloads. Instrument cloud workloads (containers, serverless, managed services) and Kubernetes clusters using exporters and agents (node_exporter, cAdvisor, kube-state-metrics, OpenTelemetry).
- Automation & IaC: Manage observability as code using Terraform/CloudFormation, Helm, and GitOps pipelines for repeatable deployments.
- Security & Governance: Own Grafana patching and upgrades. Enforce RBAC, organizations/teams, folder structures, secrets management, and data retention; ensure compliance and cost control for observability platforms.
- Performance & Scale: Tune scraping, retention, and query performance; plan capacity for high-cardinality metrics and multi-tenant environments; drive cost-efficiency. Decommission Nagios within AZ and establish monitoring runbooks and a clear path forward.
- Collaboration & Enablement: Consult with application and platform teams on instrumentation best practices, build shared dashboards and libraries, and deliver training/documentation.
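To illustrate the SLO/SLI Engineering item above, the following minimal Python sketch shows how an error budget and a burn rate relate. The function names, the 99.9% target in the usage note, and the 14.4 threshold (a fast-burn value often used with a 30-day SLO period) are assumptions chosen for illustration, not requirements of this role.

```python
def error_budget(slo_target: float) -> float:
    """Return the fraction of requests allowed to fail (0.001 for a 99.9% SLO)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Return observed error ratio divided by the error budget.

    A value of 1.0 means the budget would be used up exactly at the end
    of the SLO period; higher values exhaust it sooner.
    """
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: alert only if both windows exceed the threshold.

    The threshold default is an illustrative assumption.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold
```

For example, 20 failures out of 1,000 requests against a 99.9% SLO gives a burn rate of roughly 20, which would page only if the longer window is also above the threshold. Requiring both windows is one common way to reduce alert noise.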
Required Qualifications
- Hands-on expertise with Grafana (v11+) including dashboard design, templating, transformations, and Grafana Alerting; experience with Grafana Enterprise features is a plus.
- Experience automating and developing dashboards, such as EOSL and sustainability dashboards.
- Strong multi-cloud experience: AWS (CloudWatch, EKS), Azure (Azure Monitor, AKS), or GCP (Cloud Monitoring, GKE), including integrating these with Grafana.
- Proficiency with metrics/logs/traces backends: Prometheus and/or Mimir/Cortex; Loki for logs; Tempo/Jaeger or OpenTelemetry for tracing; familiarity with Elastic/OpenSearch or Splunk is beneficial.
- Kubernetes fundamentals and production operations experience: exporters, service discovery, Helm/Operators, and cluster monitoring patterns.
- Solid understanding of SRE/observability principles: SLIs/SLOs, error budgets, runbooks, and incident management workflows.
- Infrastructure-as-Code and CI/CD: Terraform (especially Grafana and cloud providers), GitHub/GitLab, and automated pipeline practices.
- Scripting/automation skills in CloudFormation, Terraform, GitHub, Python, Go, or Node.js; ability to build exporters or transform telemetry as needed.
- Knowledge of authentication and authorization (SSO/OIDC, OAuth, LDAP), secrets management (e.g., Key Vault/KMS), and RBAC within Grafana and clouds.
- Strong troubleshooting skills across the stack (application, network, infrastructure) using observability data to drive root-cause analysis.
Success Measures
• Reduction in mean time to detect/resolve incidents (MTTD/MTTR).
• Adoption of standardized dashboards and SLOs across services.
• Improved signal-to-noise ratio in alerts and decreased false positives.
• Measurable cost efficiency and performance improvements in the observability stack.
• Positive feedback from engineering teams on usability and enablement.
Assignment to submit along with your CV
Task to be completed by the candidate:
Code - GitHub repo
IaC - CloudFormation
Service - AWS Managed Grafana, CloudWatch
Account - Everything must be done in a single AWS account
1. Create an AWS Managed Grafana workspace and install basic plugins via code (use CloudFormation custom resources)
a. SSO integration
b. Install plugins
2. Set up one observability use case in AWS CloudWatch
3. Create a custom CloudWatch dashboard for the collected metrics, and create Grafana alerts that send SNS notifications using notification templates
4. After the initial setup, deploy dashboards and alerts automatically via GitHub Actions
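As a rough illustration of the dashboard part of step 3, the Python sketch below generates a CloudWatch dashboard body as JSON, which could then be supplied as the `DashboardBody` of an `AWS::CloudWatch::Dashboard` resource. The function name, the region default, the two-column layout, and the example metrics are hypothetical choices, not part of the assignment; candidates are free to structure their solution differently.

```python
import json


def build_dashboard_body(metrics, region="us-east-1", period=300):
    """Build a CloudWatch dashboard body with one metric widget per entry.

    `metrics` is a list of (namespace, metric_name, dimension_name,
    dimension_value) tuples. Widgets are laid out two per row on the
    24-unit-wide dashboard grid. Region and period are assumed defaults.
    """
    widgets = []
    for index, (namespace, metric_name, dim_name, dim_value) in enumerate(metrics):
        widgets.append({
            "type": "metric",
            "x": (index % 2) * 12,   # left or right half of the grid
            "y": (index // 2) * 6,   # a new row every two widgets
            "width": 12,
            "height": 6,
            "properties": {
                "title": f"{namespace} {metric_name}",
                "metrics": [[namespace, metric_name, dim_name, dim_value]],
                "stat": "Average",
                "period": period,
                "region": region,
            },
        })
    return json.dumps({"widgets": widgets}, indent=2)


# Hypothetical example metrics; the instance ID is a placeholder.
EXAMPLE_METRICS = [
    ("AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"),
    ("AWS/EC2", "NetworkIn", "InstanceId", "i-0123456789abcdef0"),
]

if __name__ == "__main__":
    print(build_dashboard_body(EXAMPLE_METRICS))
```

Keeping the dashboard definition in a generator like this, checked into the GitHub repo, is one way to make the automated deployment in step 4 repeatable.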