- Incident Response with AI
LLM-assisted incident workflows (AI summaries, timeline drafting, suggested fixes, and postmortems integrated with Slack/Teams).
Runbook automation with AI (building AI-assisted, context-aware runbooks and approval gates for high-risk actions).
Generative AI for coordination & RCA (using LLMs to accelerate investigation and communications; understanding current accuracy limits and human-in-the-loop needs).
SRE principles applied to ML systems (SLOs/SLIs/error budgets for ML services; capacity planning and model freshness).
Production ML observability (data/concept/label drift detection, automated retraining triggers, explainability traces).
Telemetry & visualization for model health (instrumentation with Prometheus/Grafana for drift and degradation).
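To make the drift detection and retraining trigger items above concrete, here is a minimal sketch using the Population Stability Index (PSI) on a single numeric feature. PSI is one common drift score, chosen here only for illustration; the function names, the 10-bin layout, and the 0.2 threshold (a widely quoted rule of thumb) are assumptions, not something this outline prescribes.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to the edge bins
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(expected: Sequence[float], actual: Sequence[float],
                     threshold: float = 0.2) -> bool:
    """Hypothetical trigger: flag the model for retraining when PSI exceeds the threshold."""
    return psi(expected, actual) > threshold
```

In practice the score would more likely be exported as a gauge to Prometheus and alerted on in Grafana than acted on directly; the trigger above only shows the decision itself.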
- AI-enhanced Automation & CI/CD
AI-augmented IaC and pipelines (LLM-generated Terraform/Helm/Ansible, policy enforcement, infrastructure drift detection).
AIOps in delivery (change impact hints, automated triage, and GitOps-based auto-remediation).
AI pair programming ergonomics (using Copilot responsibly; measuring impact on quality and velocity; setting guardrails).
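As one illustration of policy enforcement on LLM-generated Terraform, the sketch below inspects a plan in Terraform's JSON representation (the output of `terraform show -json` on a saved plan file) and flags any resource whose planned actions include a delete. The function names and the "destructive changes need a human" rule are assumptions made for this example; real policies are often written in a dedicated tool such as OPA or Sentinel instead.

```python
from typing import Any, Dict, List


def destructive_changes(plan: Dict[str, Any]) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create, so "delete" covers both cases.
        if "delete" in actions:
            flagged.append(rc.get("address", "<unknown>"))
    return flagged


def requires_human_approval(plan: Dict[str, Any]) -> bool:
    """Hypothetical gate: any destructive change blocks automatic apply."""
    return bool(destructive_changes(plan))
```

A pipeline step could load the plan with `json.load`, call `requires_human_approval`, and pause for sign-off when it returns `True`.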
- AI + Chaos Engineering (Resilience)
Designing AI-guided chaos experiments (intelligent fault selection, anomaly detection during experiments, learning from outcomes).
Reinforcement-learning-driven fault injection (automated scenario generation to expose latent weaknesses and improve recovery times).
Operationalizing lessons from chaos + ML (predictive failure analysis and proactive controls).
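The fault selection idea above can be sketched with an epsilon-greedy bandit, which is a deliberately simplified stand-in for full reinforcement learning. Everything here is hypothetical: the `FaultSelector` class, the fault names, and the notion of a "reward" (for example, 1.0 when an experiment exposes a weakness, 0.0 otherwise).

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaultSelector:
    """Epsilon-greedy choice among candidate faults, learning from past outcomes."""
    faults: List[str]
    epsilon: float = 0.1  # probability of exploring a random fault
    rng: random.Random = field(default_factory=random.Random)
    counts: Dict[str, int] = field(default_factory=dict)
    values: Dict[str, float] = field(default_factory=dict)

    def choose(self) -> str:
        # Explore occasionally; otherwise pick the fault with the best average reward so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.faults)
        return max(self.faults, key=lambda f: self.values.get(f, 0.0))

    def record(self, fault: str, reward: float) -> None:
        # Incremental update of the running mean reward for this fault.
        n = self.counts.get(fault, 0) + 1
        self.counts[fault] = n
        old = self.values.get(fault, 0.0)
        self.values[fault] = old + (reward - old) / n
```

Each experiment would call `choose()`, inject that fault under the usual blast-radius controls, and then call `record()` with the observed outcome.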
- Platform & Tool Literacy (AI-ready)
Hands-on experience with AIOps/observability platforms (event correlation and unified incident views at scale).
Familiarity with AI-enabled incident tooling (e.g., incident.io/Rootly/PagerDuty/Datadog for AI triage and summaries).
- Governance, Safety & Measurement
Human-in-the-loop guardrails (approval policies, rollback safety, and compliance for autonomous actions).
Trustworthy AI practices (explainability, data/model/process trust; aligning metrics with business outcomes).
Outcome measurement for AI adoption (MTTR, alert noise, developer experience/velocity with AI tools).
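Two of the outcome metrics named above can be computed from very little data. The sketch below assumes incidents are available as (detected, resolved) timestamp pairs and that alerts have already been labeled actionable or not; both the input shapes and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def mean_time_to_resolve(incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def alert_noise_ratio(total_alerts: int, actionable_alerts: int) -> float:
    """Fraction of alerts that required no action (0.0 when there were no alerts)."""
    if total_alerts == 0:
        return 0.0
    return 1 - actionable_alerts / total_alerts
```

Comparing these values for periods before and after an AI tool is introduced gives a rough baseline, though other changes made in the same window can easily confound the comparison.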