Website: aivar.tech
Job details:
About Aivar Innovations
Aivar is an AI-first technology partner where cutting-edge technology meets industry expertise to supercharge your projects.
Team: Accelerators
Experience: 5–9 years | 3+ years with EKS/Kubernetes in production
Technical Focus: Foundational hire for the AI Ops stack. Own the entire EKS platform: hardened cluster configurations, Terraform modules, Karpenter GPU-aware autoscaling, multi-tenancy (RBAC, namespace isolation, network policies), multi-region DR, and cost optimization. Build infrastructure that runs Llama 70B at sub-second latency on multi-GPU instances.
Key Responsibilities
- Design hardened EKS clusters — private endpoints, IMDSv2, Pod Security Admission, image scanning, audit logging.
- Build and operate HPC environments and large clusters at UltraCluster scale, suited to AI Ops workloads ranging from SLMs to LLMs.
- Build Terraform modules for a complete Kubogent stack — VPC, EKS, GPU/CPU node groups, IAM, networking, storage.
- Configure Karpenter for GPU-aware autoscaling across instance families (G6e, P4d, P5, Inferentia).
- Implement multi-tenancy — namespace isolation, resource quotas, RBAC, network policies, fair-share scheduling.
- Build multi-region DR with automated failover, cross-region replication, and failover testing.
- Optimise cloud spend — Capacity Blocks, Spot instances, reserved pricing, right-sizing, KubeCost integration.
- Design robust network architecture — VPC CNI, private subnets, security groups, Transit Gateway, private endpoints.
Must-Have Technical Skills
- AWS infrastructure — deep VPC, IAM, networking, multi-account (5+ years).
- Kubernetes/EKS — production clusters, networking (CNI), storage, RBAC (3+ years).
- Terraform expert — large module codebases, remote state, workspaces, CI/CD integration.
- Karpenter or Cluster Autoscaler in production.
- GPU instances on AWS — G-series (L40S), P-series (A100), NVIDIA GPU operator/device plugins.
- Security hardening — Pod Security Admission, OPA/Gatekeeper, image scanning, secrets management.
- Linux systems — performance tuning, storage (EBS, EFS, FSx for Lustre), kernel parameters.
Core Tech Stack
Terraform, AWS (EKS, EC2 GPU, VPC, IAM, EBS/EFS/FSx, ECR), Karpenter, Helm, Kustomize, ArgoCD, NVIDIA GPU Operator/DCGM, Calico, Istio, Prometheus/Grafana/KubeCost, OPA/Gatekeeper, Falco, Trivy.