Sr Platform Engineer

InfoVision Inc.

full-time

Required skills

PKI
Apache
API
Bash
Buffer
caching
Chart
compliance
database
Docker
end-to-end
Golang
Helm
Kubernetes
load balancing
microservices
Node
OAuth
PostgreSQL
system integration
TCP
UAT
Vault
VLAN
YAML

About the role

Website: infovision.com
Job details:
Critical Skills To Possess

Kubernetes & Container Orchestration

3+ years of production Kubernetes experience; bare-metal / on-premises experience is mandatory, and cloud-managed Kubernetes experience alone does not qualify

Hands-on Helm chart authoring (not just consumption), ArgoCD or equivalent GitOps tooling, and cert-manager

Deep understanding of Kubernetes control plane HA: etcd Raft quorum, leader election, minimum viable node counts

Experience with MetalLB and Traefik or equivalent bare-metal ingress and load balancing tools
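
The etcd quorum and minimum node count requirement above comes down to simple majority arithmetic. The following Go sketch illustrates it; the function names Quorum and FaultTolerance are made up for this example and are not part of etcd or any library.

```go
package main

import "fmt"

// Quorum returns how many voting members must agree before a Raft
// cluster of n members can commit a write (a strict majority).
// n is assumed to be at least 1.
func Quorum(n int) int {
	return n/2 + 1
}

// FaultTolerance returns how many members can be lost while the
// cluster still holds quorum.
func FaultTolerance(n int) int {
	return n - Quorum(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated_failures=%d\n",
			n, Quorum(n), FaultTolerance(n))
	}
}
```

A 3-member cluster tolerates one failure, and adding a fourth member does not raise that number, which is why odd member counts are the usual choice.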

Messaging & Streaming

Production hands-on experience with Apache Kafka: topic configuration, partition sizing, replication factor, consumer group management, KRaft mode

Solid understanding of consumer lag monitoring and back-pressure patterns under burst load

Strong familiarity with MQTT protocol semantics: persistent sessions, QoS levels, TLS mutual authentication; hands-on experience with EMQX or a comparable broker at tens of thousands of concurrent connections

Data Infrastructure

Production hands-on experience with PostgreSQL HA: Patroni-based automatic failover, PgBouncer connection pooling, streaming replication, PITR with pgBackRest or equivalent

Working knowledge of Valkey / Redis: cluster mode, TTL-based caching, atomic counter operations

Observability

Hands-on deployment and configuration of VictoriaMetrics / Prometheus, Grafana, Loki, Tempo, and Alertmanager

Ability to write PromQL / MetricsQL queries and author production Grafana dashboards from scratch

Experience configuring alert deduplication and grouping in Alertmanager for high-volume event storms

Security & Secrets

Hands-on experience with HashiCorp Vault / OpenBao: PKI secrets engine, Kubernetes auth method, dynamic secrets

Experience with container image CVE scanning (Trivy or equivalent) and a self-hosted container registry (Harbor or equivalent)

Understanding of TLS certificate lifecycle management: internal CA, automated rotation, mTLS between services

CI/CD & GitOps

Experience building CI pipelines with real service dependency testing (testcontainers-go or equivalent)

GitOps workflow discipline: Git as the sole source of cluster state truth, PR-gated deployment approvals

Programming

Go (Golang): able to read, debug, and modify existing microservice code (trace latency issues through service logic, adjust Kafka consumer configuration, update retry semantics); authoring new services from scratch is not required

Bash scripting for automation and operational runbooks
Strong proficiency with YAML / TOML for Kubernetes manifests, Helm values, and service configuration

Networking

Understanding of TCP/TLS at scale: TLS handshake cost, session resumption overhead, connection state memory at 30,000 simultaneous connections

Working knowledge of data centre networking concepts (MLAG, VLAN, MTU, ToR design); able to collaborate effectively with a dedicated networking team

Preferred Qualifications

BS degree in Computer Science or Engineering or equivalent experience

Roles & Responsibilities

Bare-Metal Kubernetes Cluster
Provision and configure a multi-pool bare-metal Kubernetes cluster covering ingress/edge, control plane, and application worker tiers
Deploy and configure MetalLB for bare-metal LoadBalancer IP assignment and Traefik as the ingress controller
Configure a 3-node etcd cluster (Raft quorum) for control plane HA and failover
Deploy and tune a 3-node EMQX MQTT Broker cluster for ~30,000 persistent TLS device sessions (~11–15 GB connection state)
Deploy Kong OSS API Gateway (active-active) with JWT validation, rate limiting backed by Valkey, and path-based routing
Data Infrastructure
Deploy a 5-broker Apache Kafka cluster in KRaft mode (RF=3), configure priority-based topic lanes (real-time transactions, periodic telemetry, scheduled pulls, outbound publish, replay/repush, dead-letter queues), and set 72-hour message retention
Deploy and configure PostgreSQL HA using Patroni (automatic primary failover ≤30 s), PgBouncer (connection pooling: ~1,000 application connections → ~20–40 database connections), and pgBackRest (hourly incremental backup + PITR)
Deploy a 3-node Valkey cluster for portal response caching (15 s TTL), token caching, and distributed rate-limit counters
Microservice Deployment

Deploy and operationalise a suite of Go-based microservices via Helm charts onto the worker tier:

Ingestion Service — validates and envelopes inbound MQTT events into Kafka
Transaction Service — consumes real-time transaction events with composite-key idempotency; P0 SLO ≤3 s end-to-end to downstream publish
Inventory & Telemetry Services — handle high-volume periodic device readings (~9.2 M records/day) and alarm/interlock events
Scheduler Service — executes time-based and interval-based master data pulls against an external system (~600,000+ API calls/day across all sites)
Replay/Repush Service — recovers and re-publishes historical data through an isolated Kafka lane, without impacting real-time traffic
Enterprise Connectors (×2) — non-blocking outbound publish to downstream enterprise systems with exponential backoff retry and dead-letter queue routing
Portal BFF Services (×2) — role-scoped Backend-for-Frontend aggregation layers for operator and dealer portals, with Valkey-backed response caching

Portal Backend API — JWT-validated session routing and lifecycle management

Configure and deploy Keycloak for JWT issuance to portal users and OAuth 2.0 client credentials for machine-to-machine connector authentication
Observability Stack
VictoriaMetrics — configure scrape targets across MQTT, Kafka, PostgreSQL, API gateway, and pods; 30–90-day metric retention
Grafana — build and maintain seven mandatory dashboard sets: MQTT ingress health, Kafka buffer health, downstream publish health, end-to-end SLO compliance, database health, scheduler execution, and replay/repush progress
Loki — aggregate logs from all pods, the MQTT broker, API gateway, and Kafka; support per-site log query correlation
Tempo — distributed tracing across the full transaction path (device event → Kafka → service → database → downstream publish); integrate with Grafana
Alertmanager — define alert rules for MQTT connection drop, Kafka consumer lag, SLO breach, and DLQ activity; configure storm deduplication; route to SMS and email
Security & Secrets Infrastructure
Deploy Harbor (on-premises container registry) with Trivy CVE scanning; enforce policy blocking critical-vulnerability images from production promotion
Deploy OpenBao (open-source Vault fork) for centralised secrets management; configure Kubernetes ServiceAccount authentication so microservice pods retrieve secrets at startup with no credentials in image or chart
Deploy cert-manager with an internal CA; automate provisioning and rotation of device authentication certs (MQTT TLS), portal HTTPS certs, and inter-service mTLS certs

CI/CD Pipeline

Deploy and configure Gitea (self-hosted Git) for all source repositories: Go services, Helm charts, Kubernetes manifests, infrastructure definitions
Configure Woodpecker CI pipelines: go build → go test (using testcontainers-go against real PostgreSQL and Kafka) → Docker image build → Trivy scan → Harbor push
Configure ArgoCD for GitOps continuous delivery: automated rolling deployments triggered by Helm chart updates, automatic rollback on failed health checks
Network Fabric & Day-2 Operations
Coordinate with the networking team on dual-ToR MLAG configuration (dual 25 GbE per node) and validate full-pool path redundancy
Author and validate Day-2 runbooks: MQTT reconnect storm management, Kafka partition rebalance, PostgreSQL primary failover drill, ArgoCD rollback procedure
Lead UAT and go-live: validate end-to-end transaction flow, confirm P0 SLO ≤3 s under peak load, confirm downstream system integration, verify scheduled master data pulls across all sites
