Senior Platform Engineer

EDMO

Location: Pune Division, Maharashtra, India
Job type: Full-time

Required skills

Python
AWS
Ansible
Azure
Backbone
Bash
BGP
BigQuery
caching
capacity planning
CDN
cloud infrastructure
CloudFront
clustering
code review
compliance
cross-functional
data pipeline
data warehouse
DNS
DynamoDB
EC2
ETL
GCP
GitHub
GPU
Helm
infrastructure-as-code
Jenkins
Kafka
Kubernetes
Lambda
load balancing
Node
NoSQL
OSPF
platform services
Ray
Serverless
Spanner
SQL
state management
TCP
Terraform
VPC

About the role

Website: goedmo.com
Job details:
Position Overview

We are seeking a Senior Platform Engineer with 5+ years of hands-on experience designing, building, and operating large-scale cloud infrastructure. This is a high-ownership, high-impact role at the intersection of cloud networking, distributed systems architecture, infrastructure-as-code, and AI/ML inference delivery. You will be the technical cornerstone of our platform team — shaping the foundations that every engineering team builds upon.

📍 Location: Hybrid / On-Site (Pune, India)

🕐 Experience: 5+ Years

📂 Function: Platform Engineering & Cloud Infrastructure

📋 Reports To: VP of Engineering / Chief Technical Officer

🏢 Employment: Full-Time

Key Responsibilities

Cloud Infrastructure & Networking

Architect and manage multi-region cloud environments across GCP, AWS, or Azure with a deep understanding of cloud-native networking primitives
Design and operate VPC topologies: subnets, peering, shared VPCs, Transit Gateways, and private service connect
Configure and maintain firewalls, security groups, WAF policies, and network ACLs to enforce least-privilege perimeter defence
Manage DNS infrastructure (Cloud DNS / Route 53 / Azure DNS) including split-horizon, private zones, and failover routing policies
Deploy and optimise VPN (site-to-site and client VPN), Cloud Interconnect / Direct Connect, and hybrid connectivity solutions
Operate CDN layers (Cloudflare, Cloud CDN, CloudFront) for caching, DDoS mitigation, and edge performance
Implement and tune load balancers (L4/L7: GLB, ALB, NLB, NGINX) for high availability, health checks, and traffic shaping
Deploy and manage IDS/IPS solutions for runtime threat detection across cloud workloads

Distributed Systems Design & Scalability

Lead system design for large-scale, highly available, fault-tolerant distributed platforms from first principles
Design horizontal scaling strategies: auto-scaling groups, serverless burst capacity, and stateless service architectures
Architect resilient multi-zone and multi-region active-active / active-passive topologies
Define and enforce SLOs, SLIs, and error budgets across platform services
Drive capacity planning, traffic modelling, and cost-optimisation exercises across the fleet
Lead cross-functional technical design reviews and own architecture decision records (ADRs)

Infrastructure as Code (IaC)

Own the organisation's Terraform codebase: module design, state management (remote backends), and workspace strategy
Implement Terragrunt for DRY multi-environment infrastructure composition and dependency management
Enforce IaC best practices: code review pipelines, tfsec / Checkov scanning, drift detection, and policy-as-code (Sentinel / OPA)
Build and maintain CI/CD pipelines for infrastructure delivery using GitHub Actions, Cloud Build, or equivalent
Manage secrets, service accounts, and IAM hierarchies as code with zero manual console operations in production

Databases & Data Infrastructure

Design and operate SQL databases (Cloud SQL, RDS, Aurora) with HA replicas, read scaling, automated backups, and failover
Architect and manage NoSQL solutions: MongoDB Atlas / self-managed clusters including sharding, replica sets, and index strategy
Operate BigQuery as a data warehouse platform: partitioning, clustering, slot reservations, and access control
Deploy and manage Kafka clusters (MSK, Confluent Cloud, or self-hosted) as the organisation's core messaging backbone
Design Kafka topic strategies: partitioning, retention policies, consumer group management, and schema registry integration
Implement data pipeline reliability patterns: dead-letter queues, idempotency, and exactly-once delivery guarantees

Inference as a Service on Kubernetes

Design and operate Kubernetes clusters (GKE, EKS, AKS) for serving AI/ML inference workloads at production scale
Build Inference-as-a-Service platforms using KServe, Triton Inference Server, or Ray Serve on Kubernetes
Manage GPU node pools, resource quotas, and taints/tolerations for cost-effective model serving
Implement Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) with custom metrics for inference workloads
Define Helm charts and Kustomize overlays for reproducible, environment-aware ML serving deployments
Ensure low-latency inference SLAs through profiling, batching strategies, and model optimisation collaboration with ML engineers
Integrate service mesh (Istio / Linkerd) for traffic management, mTLS, and observability across inference services

Observability, Security & Governance

Build comprehensive observability stacks: metrics (Prometheus/Grafana or Cloud Monitoring), logging (Loki/ELK/Cloud Logging), and tracing (Jaeger/Tempo/Cloud Trace)
Define and enforce cloud security baselines: CSPM tooling, CIS Benchmarks, and automated compliance scanning
Conduct infrastructure threat modelling and drive remediation of misconfigurations identified through security audits
Establish and enforce tagging/labelling taxonomies for cost allocation, compliance, and operational visibility
Participate in on-call rotations and lead post-incident reviews to drive systemic reliability improvements

Required Skills & Experience

Cloud Platforms (Deep expertise in at least one)

GCP - GKE, Cloud Run, GCE, VPC, Cloud DNS, Cloud NAT, BigQuery, Cloud SQL, Spanner, Cloud Armor, IAP, SCC.

AWS - EKS, EC2, Lambda, Fargate, VPC, TGW, Route 53, CF, RDS, DynamoDB, Redshift, WAF, Shield, GuardDuty.

Azure - AKS, VMSS, Azure Functions, VNET, Azure DNS, Front Door, Cosmos DB, Synapse, ADLS, Defender, Sentinel, Entra ID.

IaC & Automation

Terraform (advanced): modules, remote state, workspaces, and provider development
Terragrunt for multi-environment, multi-account infrastructure composition
Proficiency with at least one CI/CD platform: GitHub Actions, GitLab CI, Cloud Build, or Jenkins
Scripting proficiency in Python, Bash, or Go for automation and custom tooling
Configuration management: Ansible or equivalent for VM fleet management

Kubernetes & Container Ecosystem

Production Kubernetes administration: cluster lifecycle, RBAC, network policies, and storage
Helm chart authoring and Kustomize overlays for multi-environment delivery
Service mesh operation (Istio preferred): traffic management, mTLS, and telemetry
Container security: image scanning (Trivy, Snyk), runtime policies (OPA/Gatekeeper), and admission control
GPU workload scheduling and node pool management for ML inference

Data & Messaging Platforms

BigQuery: schema design, partitioning, query optimisation, and access control
MongoDB: replica sets, sharding strategies, performance tuning, and Atlas operations
Kafka: cluster management, topic design, consumer group rebalancing, and Kafka Streams / ksqlDB
Understanding of data pipeline patterns: CDC, event sourcing, and streaming ETL

Networking Deep Expertise

TCP/IP, BGP, OSPF fundamentals and cloud routing protocol equivalents
Load balancing algorithms: round-robin, least-connections, consistent hashing, and session affinity
CDN configuration: cache-control strategies, origin shielding, and edge compute (Cloudflare Workers / Lambda@Edge)
DNS: TTL management, DNSSEC, GeoDNS, and latency-based routing
IDS/IPS tuning: signature management, anomaly detection, and alert triage

Qualifications & Certifications

A degree in Computer Science, Computer Engineering, or a related discipline is preferred. Strong demonstrable experience and an impressive portfolio of past systems built are equally valued.

Strongly Preferred

☁️ GCP Professional Cloud Architect

☁️ AWS Solutions Architect – Professional

⎈ Certified Kubernetes Administrator (CKA)

⎈ Certified Kubernetes Application Developer (CKAD)

Good to Have

🔐 Certified Kubernetes Security Specialist (CKS)

📊 dbt Certified Analytics Engineer

☁️ Azure Solutions Architect Expert

🏗️ HashiCorp Terraform Associate / Professional

Nice to Have

Experience with FinOps tooling (Infracost, CloudHealth) and cloud cost optimisation at scale
Familiarity with chaos engineering practices (LitmusChaos, Gremlin) and game-day exercises
Contributions to open-source infrastructure projects or published Terraform modules
Experience with platform engineering tooling: Backstage, Port, or internal developer portals
Knowledge of eBPF-based observability and networking tools (Cilium, Pixie, Hubble)
Understanding of AI/ML training infrastructure: distributed training, checkpointing, and GPU cluster networking
