UST
Website: ust.com
Job details:
Role Description
We are seeking an experienced Azure Platform Engineer with strong expertise in infrastructure platform management, AKS operations, and Infrastructure as Code (IaC). The ideal candidate will lead platform reliability, modernization, and incident management initiatives while mentoring junior engineers and collaborating with global stakeholders.
This role requires deep hands-on technical capability combined with operational ownership and leadership skills.
Key Responsibilities
Infrastructure & Platform Management
- Manage and maintain cloud infrastructure platforms, including:
  - OS and platform patching
  - Service upgrades and lifecycle management
  - Certificate lifecycle management
- Ensure platform stability, security compliance, and operational excellence.
Azure Cloud & AKS Operations
- Design, deploy, and manage Azure cloud environments using Infrastructure as Code (IaC) tools such as Terraform and ARM templates.
- Operate and optimize Azure Kubernetes Service (AKS), including:
  - Cluster upgrades (N-1 strategy)
  - Node pool management and scaling
  - Network policies and security enforcement
  - Azure Firewall integrations
  - Istio service mesh troubleshooting
  - Certificate management within Kubernetes
- Drive automation and continuous improvement of platform operations.
CI/CD & Automation
- Build and maintain CI/CD pipelines for infrastructure and application deployments.
- Manage YAML-based pipelines and agent pool governance (legacy and modern setups).
- Support image updates, scaling strategies, and pipeline optimization.
Observability & Reliability Engineering
- Implement and enhance observability practices using:
  - Dynatrace monitoring
  - Prometheus & Grafana
  - SLO dashboards and performance metrics
- Enable routing, service discovery, and automation for high-availability systems.
- Ensure proactive monitoring and reliability improvements across environments.
Incident Management & Operational Leadership
- Lead high-severity (P1/P2) incident management, including:
  - Triage and impact analysis
  - Break-fix resolution
  - Root Cause Analysis (RCA) documentation
  - Preventive action planning
- Drive operational maturity and continuous service improvement.
Stakeholder Collaboration & Leadership
- Collaborate effectively with customers and stakeholders working in US time zones.
- Provide clear communication during incidents and change activities.
- Lead and mentor junior engineers, fostering technical growth and accountability.
Required Skills & Experience
- Strong experience in Azure Cloud services and AKS operations.
- Hands-on expertise with Terraform, ARM templates, and Infrastructure as Code practices.
- Deep understanding of Kubernetes networking, scaling, and service mesh (Istio).
- Experience managing CI/CD pipelines for both infrastructure and applications.
- Strong knowledge of observability and monitoring tools (Dynatrace, Prometheus, Grafana).
- Proven experience leading high-severity incidents and managing RCAs.
- Excellent communication skills and ability to work across global teams.
- Prior experience leading or mentoring engineering teams.
Skills
Azure DevOps, cluster upgrades, Terraform, Infrastructure as Code, Azure cloud services, AKS operations, node pools, CI/CD pipelines, agent pool governance, YAML pipelines, ARM, Azure Firewall, Dynatrace monitoring, Prometheus