Senior Developer, Platform Engineer

Location: Hyderabad, Telangana, India

Job type: Full-time

About the job


About the role

ICE

Website: ice.com

Job Purpose

Intercontinental Exchange, Inc. (ICE) is seeking an experienced Senior Developer to join our AI Centre of Excellence (AI CoE) team. In this role you will design, build, and operate the container platform and observability infrastructure that underpins our AI/ML data platform, enabling data scientists, ML engineers, and application teams to ship and run workloads reliably at scale. You will bring deep, hands-on expertise in OpenShift and Kubernetes-based platform engineering, and you will play a visible technical role in shaping how we deploy, observe, and operate services across the enterprise.

The successful candidate combines strong engineering fundamentals with a platform-product mindset: you care about developer experience as much as uptime, you instrument everything, and you drive automation as the default. Exceptional collaboration, written communication, and the ability to influence diverse stakeholders are equally important.

Responsibilities

  • Design, deploy, and operate production-grade OpenShift / Kubernetes clusters, ensuring high availability, security hardening, and efficient resource utilisation.
  • Own the Helm chart ecosystem: author, version, test, and maintain charts for platform services and application workloads; enforce chart quality standards across teams.
  • Build and maintain Docker image pipelines — define base-image standards, enforce image scanning policies, and optimise build performance within CI/CD workflows.
  • Implement and evolve the observability stack using Prometheus (metrics collection, alerting rules, recording rules) and Grafana (dashboards, SLO/SLA visualisation, on-call runbooks).
  • Collaborate with development and ML teams to define deployment patterns, resource quotas, and autoscaling strategies that balance cost with performance.
  • Champion infrastructure-as-code and GitOps practices; ensure all platform configuration is version-controlled, peer-reviewed, and auditable.
  • Drive incident management and blameless post-mortems; reduce MTTR by improving runbooks, alerting fidelity, and automation.
  • Mentor engineers across the team on platform best practices, container security, and observability-driven development.
  • Actively engage stakeholders across Engineering, Data Science, Security, and Operations to align platform capabilities with strategic business objectives.
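
To give the alerting responsibilities above a concrete shape, here is a minimal Python sketch that assembles a PromQL expression for a multi-window error-budget burn-rate alert. The metric name, windows, and burn-rate factor are illustrative assumptions, not details from this posting.

```python
# Illustrative sketch: build a PromQL expression for a multi-window
# error-budget burn-rate alert (metric name and thresholds are hypothetical).

def error_ratio(metric: str, window: str) -> str:
    """PromQL for the fraction of 5xx responses over `window`."""
    bad = f'sum(rate({metric}{{code=~"5.."}}[{window}]))'
    total = f'sum(rate({metric}[{window}]))'
    return f"{bad} / {total}"

def burn_rate_alert(metric: str, slo: float,
                    long_w: str = "1h", short_w: str = "5m",
                    factor: float = 14.4) -> str:
    """Fire only when both the long and short windows burn budget fast,
    which keeps alerts actionable (high fidelity, low flap)."""
    threshold = factor * (1.0 - slo)
    return (
        f"{error_ratio(metric, long_w)} > {threshold:.6g}"
        f" and {error_ratio(metric, short_w)} > {threshold:.6g}"
    )

expr = burn_rate_alert("http_requests_total", slo=0.999)
print(expr)
```

Requiring both windows to exceed the threshold keeps pages actionable: the long window confirms sustained budget loss, while the short window lets the alert clear quickly once the issue is fixed.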

Knowledge And Experience

  • OpenShift: Hands-on experience deploying and administering Red Hat OpenShift clusters in production; solid understanding of OpenShift-specific constructs (Security Context Constraints, Routes, OperatorHub, OCP upgrade strategies).
  • Kubernetes: Deep knowledge of Kubernetes architecture and operational concerns, including workload scheduling, RBAC, network policies, storage classes, custom resources (CRDs), and cluster lifecycle management.
  • Helm: Proficiency authoring and maintaining Helm charts; experience with Helm templating best practices, value overrides, sub-charts, and release management in multi-environment pipelines.
  • Docker: Strong command of Docker for building, tagging, and shipping container images; experience writing efficient multi-stage Dockerfiles, managing registries, and integrating image scanning into CI.
  • Prometheus: Practical experience deploying Prometheus (including the Prometheus Operator / kube-prometheus-stack), writing PromQL queries, defining alert rules, and integrating with Alertmanager for on-call workflows.
  • Grafana: Experience building production Grafana dashboards, configuring data sources, managing folders and permissions, and implementing SLO/SLA monitoring panels; familiarity with Grafana as Code (grafonnet or similar).
  • Python and/or Shell scripting: Proficiency in writing automation, tooling, and platform utilities.
  • CI/CD platforms: Experience with CI/CD platforms (e.g., Jenkins, GitLab CI, GitHub Actions, Tekton) and GitOps tooling (e.g., ArgoCD, Flux).
  • AI-Assisted Development: Comfortable using AI coding assistants such as GitHub Copilot, Cursor, or similar tools to accelerate development, reduce repetitive work, and improve code quality. You know how to get the best out of these tools while applying sound engineering judgement to what they produce.
  • Exceptional problem-solving skills and strategic thinking: able to diagnose complex distributed-systems issues under pressure and drive root-cause resolution.
  • Proven technical leadership: ability to set direction, mentor peers, and influence architectural decisions across team boundaries.
  • Platform-product mindset: you treat internal teams as customers and measure success by their productivity and confidence, not just by uptime metrics.
  • Excellent written and verbal communication skills; comfortable articulating technical trade-offs to both engineering peers and non-technical stakeholders.
  • Proactive and self-directed: you identify gaps and drive improvements without waiting to be asked.
  • Collaborative and inclusive: you build consensus, share knowledge freely, and contribute to a positive, high-trust team culture.
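
Several of the requirements above (Docker image standards, image-scanning policy, Python tooling) often come together in small platform utilities. The sketch below is one hedged example; the approved registry name and the policy itself are hypothetical, not taken from this posting.

```python
# Illustrative platform utility (hypothetical policy): flag container image
# references that do not come from an approved internal registry, or that
# use a mutable/missing tag instead of a pinned tag or digest.

APPROVED_REGISTRIES = ("registry.internal.example.com",)  # hypothetical

def violations(image: str) -> list[str]:
    """Return a list of policy problems for one image reference."""
    problems = []
    registry = image.split("/", 1)[0]
    if registry not in APPROVED_REGISTRIES:
        problems.append("unapproved registry")
    name = image.rsplit("/", 1)[-1]
    pinned_by_digest = "@sha256:" in image
    if not pinned_by_digest and (":" not in name or name.endswith(":latest")):
        problems.append("tag not pinned")
    return problems

images = [
    "registry.internal.example.com/ml/serving@sha256:abc123",
    "docker.io/library/python:latest",
]
report = {img: violations(img) for img in images}
print(report)
```

A check like this typically runs as a CI gate alongside an image scanner, so unpinned or off-registry images never reach the cluster.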

Preferred Knowledge And Experience

  • Apache Airflow: Experience orchestrating complex data and ML pipelines with Airflow; ability to author and test DAGs, manage task dependencies, and operate Airflow on Kubernetes.
  • KServe: Familiarity with KServe for deploying and serving ML models on Kubernetes, including model versioning, canary deployments, custom transformers, and inference autoscaling.
  • Knative: Understanding of Knative Serving and Eventing for building serverless, event-driven workloads on Kubernetes; experience configuring revisions, traffic splits, and broker/trigger patterns.
  • Service Mesh: Practical knowledge of service mesh concepts — traffic management, mutual TLS, observability, and circuit breaking — applied in a production microservices environment.
  • Istio: Hands-on experience configuring Istio control-plane components, VirtualServices, DestinationRules, and Gateways; integrating Istio telemetry with Prometheus and Grafana.
  • Familiarity with cloud platforms — AWS, Azure, or GCP — for managing cloud-native Kubernetes services (EKS, AKS, GKE) or hybrid/on-prem deployments.
  • Understanding of data platform technologies (Apache Spark, Kafka, Hadoop) and how they interact with container orchestration layers.
  • Knowledge of container security best practices: image vulnerability scanning, runtime security policies, secrets management (Vault, Sealed Secrets), and supply-chain security.
  • Solid understanding of Git and trunk-based development workflows.
  • Experience working in Agile teams of 5–8 cross-functional engineers.
  • Background in an applied R&D or innovation-lab environment is a plus.
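
As one small illustration of the canary-deployment and traffic-split concepts listed above, here is a hedged Python sketch. The service name, subset labels, and step sizes are hypothetical; the dicts only mirror the shape of an Istio VirtualService route block.

```python
# Illustrative sketch of a stepped canary rollout, as used with KServe
# canary deployments or Istio VirtualService weight splits.
# Service name, subset labels, and step sizes are hypothetical.

def traffic_split(host: str, canary_weight: int) -> list[dict]:
    """Istio-style route block splitting traffic between a stable and a
    canary subset; the two weights always sum to 100."""
    return [
        {"destination": {"host": host, "subset": "stable"},
         "weight": 100 - canary_weight},
        {"destination": {"host": host, "subset": "canary"},
         "weight": canary_weight},
    ]

def canary_schedule(steps=(5, 25, 50, 100)) -> list[list[dict]]:
    """Stepped promotion schedule: each entry is a full route config."""
    return [traffic_split("model-svc", w) for w in steps]

for routes in canary_schedule():
    print(routes)
```

In practice each step would be applied (and rolled back on regression) by the GitOps controller rather than by hand, with the final step shifting 100% of traffic to the new revision.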

Skills

Agile
Apache Spark
AWS
Azure
Data science
Docker
Flux
GCP
GitHub
Hadoop
Helm
Infrastructure-as-code
Jenkins
Kafka
Microservices
Python
Serverless
Shell scripting
Vault