iVedha Inc.
Website: ivedha.com
Job details:
Role: ELK - DevOps/SRE Lead
Work Mode & Time: Remote; must be available during EST hours (USA/Canada).
NOTE: This is not a dashboard-focused or entry-level ELK role. We are looking for someone who has designed, scaled, and stabilized large production clusters, has led migrations, and can operate confidently within a structured DevOps and SRE model in a regulated financial services environment.
About iVedha:
iVedha Inc. is a global AI-first digital transformation company with over 25 years of excellence. Powered by the iVedha Fabric - our AI-native operating system, we unify cloud, data, AI, security, and people to deliver measurable, resilient outcomes. Our expertise spans Agentic AI, Generative AI, Cloud Engineering, Cybersecurity, Data Modernization, Application Transformation, and Talent Enablement.
Join our team of forward-thinking innovators shaping the future of intelligent enterprises, where automation, observability, and AI-driven quality assurance redefine delivery velocity.
Role Overview:
We are looking for a senior-level ELK DevOps / SRE Lead to take ownership of an enterprise Elasticsearch platform supporting critical workloads for a major US financial institution.
This role combines deep Elasticsearch engineering expertise with DevOps and Site Reliability Engineering responsibilities. The person in this position will be accountable for cluster architecture, performance optimization, platform stability, automation, and long-term scalability — operating within a highly regulated banking environment.
Key Responsibilities:
Elasticsearch Architecture & Engineering:
- Design, build, and manage distributed, multi-node Elasticsearch clusters in production on Azure AKS and Azure VMs.
- Define cluster sizing strategy, node roles, shard allocation, and scaling models for high-volume banking workloads.
- Design and manage data streams, index lifecycle management (ILM), and data retention policies aligned to compliance requirements.
- Optimize indexing and search performance — including shard strategy, mapping design, query tuning, and Grok-based log parsing pipelines.
- Lead Elasticsearch upgrades, migrations, and re-architecture initiatives with minimal downtime and documented rollback plans.
- Ensure high availability and fault-tolerant configurations across all environments.
Production Reliability & SRE Practices:
- Take ownership of Elasticsearch platform stability in a production banking environment with strict SLA requirements.
- Lead troubleshooting across complex, high-availability clusters, correlating logs, metrics, and traces to isolate failures.
- Perform detailed root cause analysis and implement permanent corrective actions.
- Define and track SLIs, SLOs, and SLAs for the Elasticsearch platform, and build Kibana dashboards for real-time monitoring of SLA compliance.
- Forecast capacity requirements and proactively plan scaling thresholds.
- Develop and maintain operational runbooks and incident response processes. Act as the escalation point during critical production incidents.
DevOps, Cloud & Automation:
- Deploy and manage Elasticsearch in Microsoft Azure, covering Elastic Cloud on Azure, self-managed on AKS, and Azure VM deployments.
- Manage centralized log collection using Elastic Agent and Fleet, designing Agent Policies for large-scale data ingestion.
- Build and maintain Logstash and ingest pipelines for parsing complex, custom log formats using Grok patterns and Painless scripts.
- Implement automation using Python and the official Elasticsearch Python client (elasticsearch-py) for index management, reporting, and platform integrations.
- Integrate the Elastic Stack with the OpenTelemetry (OTEL) framework, configuring the OTEL Collector to receive traces, metrics, and logs and export to Elasticsearch.
- Contribute to CI/CD pipelines for Elasticsearch deployments and configuration management using Terraform and Ansible.
- Integrate Elasticsearch with Azure Monitor, Azure Log Analytics, Dynatrace, LogicMonitor, and PagerDuty.
Security & Compliance:
- Configure and maintain Elasticsearch security, including TLS encryption, RBAC, audit logging, and SAML/SSO integration with Azure Active Directory.
- Ensure platform compliance with SOC 2 and HIPAA requirements, covering audit log retention, PII handling, access controls, and evidence collection for compliance cycles.
- Design and enforce data classification and PII masking policies for log ingestion pipelines.
Technical Leadership:
- Provide architectural guidance and best practices for Elasticsearch cluster design in regulated banking environments.
- Mentor engineers on performance tuning, scaling strategies, compliance, and troubleshooting.
- Drive continuous improvement initiatives across the ELK platform and contribute to long-term reliability and resilience planning.
Required Skills & Experience:
Core Elasticsearch (Must Have):
- 5+ years of hands-on Elasticsearch experience in enterprise production environments
- Cluster sizing, shard allocation, node roles & scaling
- Index Lifecycle Management (ILM) & data streams
- Query performance tuning & search profiling
- Elasticsearch migrations & version upgrades
- Kibana — alerting, dashboards, ML anomaly detection
- Logstash pipelines — Grok, Painless, ingest enrichment
- Elastic Agent & Fleet for centralized agent management
Cloud & Infrastructure (Must Have):
- Microsoft Azure — AKS, Azure VMs, Azure Monitor
- Docker & Kubernetes (AKS specifically)
- Elastic Cloud on Azure deployment & management
- Azure Active Directory (AAD) — SAML/SSO integration
- Terraform & Ansible for infrastructure as code
- CI/CD pipelines for Elasticsearch deployments
Automation & Integration (Must Have):
- Python scripting using elasticsearch-py client
- OpenTelemetry (OTEL) — SDK instrumentation & Collector
- REST API integration for Elasticsearch administration
- Elasticsearch Watcher for automated alerting
- Dynatrace, LogicMonitor, or PagerDuty familiarity
Security & Compliance (Must Have):
- Elasticsearch RBAC & audit logging configuration
- TLS encryption for data in transit & encryption of data at rest
- PII/PHI masking & data classification in pipelines
- SOC 2 or HIPAA compliance awareness
- Elasticsearch security in regulated environments
SRE Practices (Must Have):
- SLI / SLO / SLA definition & tracking
- P1 incident handling & root cause analysis
- MTTR reduction using correlated logs/metrics/traces
- Capacity planning & proactive scaling
- Operational runbook development
Education:
- Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent practical experience)
Preferred Certifications:
- Elastic Certified Engineer
- Elastic Certified Observability Engineer
- Elastic Certified Analyst
- Microsoft Certified: Azure Administrator Associate or Azure DevOps Engineer Expert
Preferred Experience:
- Experience in financial services, banking, or other regulated enterprise environments.
- Exposure to large-scale data ingestion pipelines using Kafka, Filebeat, or Fluentd.
- Experience with Apache Airflow or similar workflow orchestration tools.
- Familiarity with Microsoft Sentinel or other SIEM platforms for security monitoring.
- Experience with Prometheus and Grafana for supplementary metrics observability.