Website:
goedmo.com
Job details:
Position Overview
We are seeking a Senior Platform Engineer with 5+ years of hands-on experience designing, building, and operating large-scale cloud infrastructure. This is a high-ownership, high-impact role at the intersection of cloud networking, distributed systems architecture, infrastructure-as-code, and AI/ML inference delivery. You will be the technical cornerstone of our platform team, shaping the foundations that every engineering team builds upon.
📍 Location: Hybrid / On-Site (Pune, India)
🕐 Experience: 5+ Years
📂 Function: Platform Engineering & Cloud Infrastructure
📋 Reports To: VP of Engineering / Chief Technical Officer
🏢 Employment: Full-Time
Key Responsibilities
Cloud Infrastructure & Networking
- Architect and manage multi-region cloud environments across GCP, AWS, or Azure with a deep understanding of cloud-native networking primitives
- Design and operate VPC topologies: subnets, peering, shared VPCs, Transit Gateways, and Private Service Connect
- Configure and maintain firewalls, security groups, WAF policies, and network ACLs to enforce least-privilege perimeter defence
- Manage DNS infrastructure (Cloud DNS / Route 53 / Azure DNS) including split-horizon, private zones, and failover routing policies
- Deploy and optimise VPN (site-to-site and client VPN), Cloud Interconnect / Direct Connect, and hybrid connectivity solutions
- Operate CDN layers (Cloudflare, Cloud CDN, CloudFront) for caching, DDoS mitigation, and edge performance
- Implement and tune load balancers (L4/L7: GLB, ALB, NLB, NGINX) for high availability, health checks, and traffic shaping
- Deploy and manage IDS/IPS solutions for runtime threat detection across cloud workloads
Distributed Systems Design & Scalability
- Lead system design for large-scale, highly available, fault-tolerant distributed platforms from first principles
- Design horizontal scaling strategies: auto-scaling groups, serverless burst capacity, and stateless service architectures
- Architect resilient multi-zone and multi-region active-active / active-passive topologies
- Define and enforce SLOs, SLIs, and error budgets across platform services
- Drive capacity planning, traffic modelling, and cost-optimisation exercises across the fleet
- Lead cross-functional technical design reviews and own architecture decision records (ADRs)
Infrastructure as Code (IaC)
- Own the organisation's Terraform codebase: module design, state management (remote backends), and workspace strategy
- Implement Terragrunt for DRY multi-environment infrastructure composition and dependency management
- Enforce IaC best practices: code review pipelines, tfsec / Checkov scanning, drift detection, and policy-as-code (Sentinel / OPA)
- Build and maintain CI/CD pipelines for infrastructure delivery using GitHub Actions, Cloud Build, or equivalent
- Manage secrets, service accounts, and IAM hierarchies as code with zero manual console operations in production
Databases & Data Infrastructure
- Design and operate SQL databases (Cloud SQL, RDS, Aurora) with HA replicas, read scaling, automated backups, and failover
- Architect and manage NoSQL solutions: MongoDB Atlas / self-managed clusters including sharding, replica sets, and index strategy
- Operate BigQuery as a data warehouse platform: partitioning, clustering, slot reservations, and access control
- Deploy and manage Kafka clusters (MSK, Confluent Cloud, or self-hosted) as the organisation's core messaging backbone
- Design Kafka topic strategies: partitioning, retention policies, consumer group management, and schema registry integration
- Implement data pipeline reliability patterns: dead-letter queues, idempotency, and exactly-once delivery guarantees
Inference as a Service on Kubernetes
- Design and operate Kubernetes clusters (GKE, EKS, AKS) for serving AI/ML inference workloads at production scale
- Build Inference-as-a-Service platforms using KServe, Triton Inference Server, or Ray Serve on Kubernetes
- Manage GPU node pools, resource quotas, and taints/tolerations for cost-effective model serving
- Implement Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) with custom metrics for inference workloads
- Define Helm charts and Kustomize overlays for reproducible, environment-aware ML serving deployments
- Ensure low-latency inference SLAs through profiling, batching strategies, and model optimisation collaboration with ML engineers
- Integrate service mesh (Istio / Linkerd) for traffic management, mTLS, and observability across inference services
Observability, Security & Governance
- Build comprehensive observability stacks: metrics (Prometheus/Grafana or Cloud Monitoring), logging (Loki/ELK/Cloud Logging), and tracing (Jaeger/Tempo/Cloud Trace)
- Define and enforce cloud security baselines: CSPM tooling, CIS Benchmarks, and automated compliance scanning
- Conduct infrastructure threat modelling and drive remediation of misconfigurations identified through security audits
- Establish and enforce tagging/labelling taxonomies for cost allocation, compliance, and operational visibility
- Participate in on-call rotations and lead post-incident reviews to drive systemic reliability improvements
Required Skills & Experience
Cloud Platforms (Deep expertise in at least one)
GCP - GKE, Cloud Run, GCE, VPC, Cloud DNS, Cloud NAT, BigQuery, Cloud SQL, Spanner, Cloud Armor, IAP, SCC.
AWS - EKS, EC2, Lambda, Fargate, VPC, TGW, Route 53, CF, RDS, DynamoDB, Redshift, WAF, Shield, GuardDuty.
Azure - AKS, VMSS, Azure Functions, VNET, Azure DNS, Front Door, Cosmos DB, Synapse, ADLS, Defender, Sentinel, Entra ID.
IaC & Automation
- Terraform (advanced): modules, remote state, workspaces, and provider development
- Terragrunt for multi-environment, multi-account infrastructure composition
- Proficiency with at least one CI/CD platform: GitHub Actions, GitLab CI, Cloud Build, or Jenkins
- Scripting proficiency in Python, Bash, or Go for automation and custom tooling
- Configuration management: Ansible or equivalent for VM fleet management
Kubernetes & Container Ecosystem
- Production Kubernetes administration: cluster lifecycle, RBAC, network policies, and storage
- Helm chart authoring and Kustomize overlays for multi-environment delivery
- Service mesh operation (Istio preferred): traffic management, mTLS, and telemetry
- Container security: image scanning (Trivy, Snyk), runtime policies (OPA/Gatekeeper), and admission control
- GPU workload scheduling and node pool management for ML inference
Data & Messaging Platforms
- BigQuery: schema design, partitioning, query optimisation, and access control
- MongoDB: replica sets, sharding strategies, performance tuning, and Atlas operations
- Kafka: cluster management, topic design, consumer group rebalancing, and Kafka Streams / ksqlDB
- Understanding of data pipeline patterns: CDC, event sourcing, and streaming ETL
Networking Deep Expertise
- TCP/IP, BGP, OSPF fundamentals and cloud routing protocol equivalents
- Load balancing algorithms: round-robin, least-connections, consistent hashing, and session affinity
- CDN configuration: cache-control strategies, origin shielding, and edge compute (Cloudflare Workers / Lambda@Edge)
- DNS: TTL management, DNSSEC, GeoDNS, and latency-based routing
- IDS/IPS tuning: signature management, anomaly detection, and alert triage
Qualifications & Certifications
- A degree in Computer Science, Computer Engineering, or a related discipline is preferred. Strong demonstrable experience and an impressive portfolio of past systems built are equally valued.
Strongly Preferred
☁️ GCP Professional Cloud Architect
☁️ AWS Solutions Architect – Professional
⎈ Certified Kubernetes Administrator (CKA)
⎈ Certified Kubernetes Application Developer (CKAD)
Good to Have
🔐 Certified Kubernetes Security Specialist (CKS)
📊 dbt Certified Analytics Engineer
☁️ Azure Solutions Architect Expert
🏗️ HashiCorp Terraform Associate / Professional
Nice to Have
- Experience with FinOps tooling (Infracost, CloudHealth) and cloud cost optimisation at scale
- Familiarity with chaos engineering practices (LitmusChaos, Gremlin) and game-day exercises
- Contributions to open-source infrastructure projects or published Terraform modules
- Experience with platform engineering tooling: Backstage, Port, or internal developer portals
- Knowledge of eBPF-based observability and networking tools (Cilium, Pixie, Hubble)
- Understanding of AI/ML training infrastructure: distributed training, checkpointing, and GPU cluster networking