InfoVision Inc.
Website: infovision.com
Job details:
Critical Skills To Possess
Kubernetes & Container Orchestration
- 3+ years of production Kubernetes experience; bare-metal / on-premises
experience is mandatory — cloud-managed Kubernetes experience alone does not
qualify
- Hands-on Helm chart authoring (not just consumption), ArgoCD or equivalent
GitOps tooling, and cert-manager
- Deep understanding of Kubernetes control plane HA: etcd Raft quorum, leader
election, minimum viable node counts
- Experience with MetalLB and Traefik or equivalent bare-metal ingress and load
balancing tools
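
As a rough illustration of the "minimum viable node counts" point above, the following Go sketch works through Raft majority arithmetic. It assumes the standard majority rule (more than half of the voting members must agree); the function names `Quorum` and `FaultTolerance` are illustrative and not taken from etcd.

```go
package main

import "fmt"

// Quorum returns the minimum number of voting members that must agree
// for a Raft cluster of n members to make progress (a strict majority).
func Quorum(n int) int {
	return n/2 + 1
}

// FaultTolerance returns how many members can be lost while the
// remaining members still form a quorum.
func FaultTolerance(n int) int {
	return n - Quorum(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated_failures=%d\n",
			n, Quorum(n), FaultTolerance(n))
	}
}
```

Under this rule, three members is the smallest cluster that survives one failure, and a fourth member raises the quorum without raising the number of failures tolerated.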
Messaging & Streaming
- Production hands-on experience with Apache Kafka: topic configuration, partition
sizing, replication factor, consumer group management, KRaft mode
- Solid understanding of consumer lag monitoring and back-pressure patterns under
burst load
- Strong familiarity with MQTT protocol semantics: persistent sessions, QoS levels,
TLS mutual authentication; hands-on experience with EMQX or a comparable
broker at tens-of-thousands of concurrent connections
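
To make the consumer lag and back-pressure items above concrete, here is a minimal Go sketch. It assumes lag is measured per partition as log end offset minus committed offset, and that back-pressure is applied with simple high/low watermarks. The types `PartitionOffsets` and `Gate` and all threshold values are hypothetical; a real deployment would read offsets from the Kafka admin API or an exporter.

```go
package main

import "fmt"

// PartitionOffsets holds the log end offset and a consumer group's
// committed offset for one partition.
type PartitionOffsets struct {
	LogEnd    int64
	Committed int64
}

// TotalLag sums per-partition lag, treating negative values (possible
// when the two offsets are sampled at different times) as zero.
func TotalLag(parts map[int32]PartitionOffsets) int64 {
	var total int64
	for _, p := range parts {
		lag := p.LogEnd - p.Committed
		if lag < 0 {
			lag = 0
		}
		total += lag
	}
	return total
}

// Gate applies hysteresis: it pauses intake once lag reaches High and
// resumes only after lag has fallen to Low or below.
type Gate struct {
	High, Low int64
	paused    bool
}

// Update records the latest lag and reports whether intake is paused.
func (g *Gate) Update(lag int64) bool {
	if !g.paused && lag >= g.High {
		g.paused = true
	} else if g.paused && lag <= g.Low {
		g.paused = false
	}
	return g.paused
}

func main() {
	parts := map[int32]PartitionOffsets{
		0: {LogEnd: 1200, Committed: 1000},
		1: {LogEnd: 900, Committed: 900},
	}
	gate := &Gate{High: 150, Low: 50} // illustrative thresholds
	lag := TotalLag(parts)
	fmt.Println("lag:", lag, "paused:", gate.Update(lag))
}
```

The gap between the two watermarks keeps the gate from flapping when lag hovers near a single threshold during a burst.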
Data Infrastructure
- Production hands-on experience with PostgreSQL HA: Patroni-based automatic
failover, PgBouncer connection pooling, streaming replication, PITR with
pgBackRest or equivalent
- Working knowledge of Valkey / Redis: cluster mode, TTL-based caching, atomic
counter operations
Observability
- Hands-on deployment and configuration of VictoriaMetrics / Prometheus,
Grafana, Loki, Tempo, and Alertmanager
- Ability to write PromQL / MetricsQL queries and author production Grafana
dashboards from scratch
- Experience configuring alert deduplication and grouping in Alertmanager for
high-volume event storms
Security & Secrets
- Hands-on experience with HashiCorp Vault / OpenBao: PKI secrets engine,
Kubernetes auth method, dynamic secrets
- Experience with container image CVE scanning (Trivy or equivalent) and a
self-hosted container registry (Harbor or equivalent)
- Understanding of TLS certificate lifecycle management: internal CA, automated
rotation, mTLS between services
CI/CD & GitOps
- Experience building CI pipelines with real service dependency testing
(testcontainers-go or equivalent)
- GitOps workflow discipline: Git as the sole source of cluster state truth, PR-gated
deployment approvals
Programming
- Go (Golang): able to read, debug, and modify existing microservice code — trace
latency issues through service logic, adjust Kafka consumer configuration, update
retry semantics; authoring new services from scratch is not required
- Bash scripting for automation and operational runbooks
- Strong proficiency with YAML / TOML for Kubernetes manifests, Helm values, and
service configuration
Networking
- Understanding of TCP/TLS at scale: TLS handshake cost, session resumption
overhead, connection state memory at 30,000 simultaneous connections
- Working knowledge of data centre networking concepts (MLAG, VLAN, MTU, ToR
design); able to collaborate effectively with a dedicated networking team
Preferred Qualifications
- BS degree in Computer Science or Engineering or equivalent experience
Roles & Responsibilities
- Bare-Metal Kubernetes Cluster
- Provision and configure a multi-pool bare-metal Kubernetes cluster covering ingress/edge, control plane, and application worker tiers
- Deploy and configure MetalLB for bare-metal LoadBalancer IP assignment and Traefik as the ingress controller
- Configure a 3-node etcd cluster (Raft quorum) for control plane HA and failover
- Deploy and tune a 3-node EMQX MQTT Broker cluster for ~30,000 persistent TLS device sessions (~11–15 GB connection state)
- Deploy Kong OSS API Gateway (active-active) with JWT validation, rate limiting backed by Valkey, and path-based routing
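
The EMQX sizing figures above imply a per-session memory budget, which the Go sketch below works out. It assumes decimal units (1 GB = 1,000,000 KB) and an even spread of sessions across broker nodes; both are simplifying assumptions, and the helper names are illustrative.

```go
package main

import "fmt"

// PerSessionKB divides a total connection-state budget (in GB) evenly
// across sessions and returns kilobytes per session (decimal units).
func PerSessionKB(totalGB float64, sessions int) float64 {
	return totalGB * 1e6 / float64(sessions)
}

// SessionsPerNode returns the sessions each node carries when the load
// is spread evenly, rounding up.
func SessionsPerNode(sessions, nodes int) int {
	return (sessions + nodes - 1) / nodes
}

func main() {
	const sessions = 30000
	fmt.Printf("per-session state: %.0f to %.0f KB\n",
		PerSessionKB(11, sessions), PerSessionKB(15, sessions))
	fmt.Println("sessions per node, 3 nodes up:", SessionsPerNode(sessions, 3))
	fmt.Println("sessions per node, 1 node down:", SessionsPerNode(sessions, 2))
}
```

On these assumptions each session costs roughly 367 to 500 KB, and each surviving node must absorb 15,000 sessions rather than 10,000 if one of the three brokers is lost.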
- Data Infrastructure
- Deploy a 5-broker Apache Kafka cluster in KRaft mode (RF=3), configure priority-based topic lanes (real-time transactions, periodic telemetry, scheduled pulls, outbound publish, replay/repush, dead-letter queues), and set 72-hour message retention
- Deploy and configure PostgreSQL HA using Patroni (automatic primary failover ≤30 s), PgBouncer (connection pooling: ~1,000 application connections → ~20–40 database connections), and pgBackRest (hourly incremental backup + PITR)
- Deploy a 3-node Valkey cluster for portal response caching (15 s TTL), token caching, and distributed rate-limit counters
- Microservice Deployment
Deploy and operationalise a suite of Go-based microservices via Helm charts onto the worker tier:
- Ingestion Service — validates and envelopes inbound MQTT events into Kafka
- Transaction Service — consumes real-time transaction events with composite-key idempotency; P0 SLO ≤3 s end-to-end to downstream publish
- Inventory & Telemetry Services — handle high-volume periodic device readings (~9.2 M records/day) and alarm/interlock events
- Scheduler Service — executes time-based and interval-based master data pulls against an external system (~600,000+ API calls/day across all sites)
- Replay/Repush Service — recovers and re-publishes historical data through an isolated Kafka lane, without impacting real-time traffic
- Enterprise Connectors (×2) — non-blocking outbound publish to downstream enterprise systems with exponential backoff retry and dead-letter queue routing
- Portal BFF Services (×2) — role-scoped Backend-for-Frontend aggregation layers for operator and dealer portals, with Valkey-backed response caching
- Portal Backend API: JWT-validated session routing and lifecycle management
- Configure and deploy Keycloak for JWT issuance to portal users and OAuth 2.0 client credentials for machine-to-machine connector authentication
- Observability Stack
- VictoriaMetrics — configure scrape targets across MQTT, Kafka, PostgreSQL, API gateway, and pods; 30–90-day metric retention
- Grafana — build and maintain seven mandatory dashboard sets: MQTT ingress health, Kafka buffer health, downstream publish health, end-to-end SLO compliance, database health, scheduler execution, and replay/repush progress
- Loki — aggregate logs from all pods, the MQTT broker, API gateway, and Kafka; support per-site log query correlation
- Tempo — distributed tracing across the full transaction path (device event → Kafka → service → database → downstream publish); integrate with Grafana
- Alertmanager — define alert rules for MQTT connection drop, Kafka consumer lag, SLO breach, and DLQ activity; configure storm deduplication; route to SMS and email
- Security & Secrets Infrastructure
- Deploy Harbor (on-premises container registry) with Trivy CVE scanning; enforce policy blocking critical-vulnerability images from production promotion
- Deploy OpenBao (open-source Vault fork) for centralised secrets management; configure Kubernetes ServiceAccount authentication so microservice pods retrieve secrets at startup with no credentials in image or chart
- Deploy cert-manager with an internal CA; automate provisioning and rotation of device authentication certs (MQTT TLS), portal HTTPS certs, and inter-service mTLS certs
- CI/CD Pipeline
- Deploy and configure Gitea (self-hosted Git) for all source repositories: Go services, Helm charts, Kubernetes manifests, infrastructure definitions
- Configure Woodpecker CI pipelines: go build → go test (using testcontainers-go against real PostgreSQL and Kafka) → Docker image build → Trivy scan → Harbor push
- Configure ArgoCD for GitOps continuous delivery: automated rolling deployments triggered by Helm chart updates, automatic rollback on failed health checks
- Network Fabric & Day-2 Operations
- Coordinate with the networking team on dual-ToR MLAG configuration (dual 25 GbE per node) and validate full-pool path redundancy
- Author and validate Day-2 runbooks: MQTT reconnect storm management, Kafka partition rebalance, PostgreSQL primary failover drill, ArgoCD rollback procedure
- Lead UAT and go-live: validate end-to-end transaction flow, confirm P0 SLO ≤3 s under peak load, confirm downstream system integration, verify scheduled master data pulls across all sites