AI Engineer | AI-Ops Agent Development

XaasIO

Location: Greater Coimbatore Area
Job type: Full-time

Required skills

LangChain
Python
AWS
Ansible
Artificial Intelligence
Azure
CI
communication skills
compliance
data science
DevOps
Docker
Elasticsearch
GCP
Git
GPU
Helm
infrastructure-as-code
Jenkins
Kafka
kernel
Kubeflow
Kubernetes
Linux
machine learning
OpenStack
PostgreSQL
RabbitMQ
Redis
Shell Scripting
SRE
Terraform
uptime
VMware
ServiceNow

About the role

Website: xaasio.com
Job details:

Job Description: AI Engineer | AI-Ops Agent Development
Primary Location: Coimbatore, Tamil Nadu
Work Mode: On-site / Hybrid
Company: XaasIO Systems Private Limited
Role Type: Full-time
Experience: 3 - 8 years preferred

Why This Role Matters

XaasIO is looking for an AI Engineer (AI-Ops Agent Development) to design, build, secure, and integrate intelligent AI agents for infrastructure operations, cloud operations, platform engineering, and day-2 support automation.

The engineer will work on building AI-powered SRE and operations agents for XaasIO platforms, including OpenStack, Kubernetes, CEPH, XaasIO CMP, XaasIO MLT, PostgreSQL, OpenSearch, Grafana, Zabbix, Kafka, Linux, and enterprise open-source platforms.

The role requires hands-on skills in Python, LLM application development, agentic AI frameworks, RAG pipelines, DevOps, DevSecOps, CI/CD pipelines, observability data, logs, metrics, events, APIs, automation workflows, and infrastructure operations.

The objective is to build AI agents that can assist with L1, L2, and eventually L3 operations, including alert triage, root-cause analysis, runbook recommendation, change validation, remediation planning, human-approved automation, compliance validation, and automated operational reporting.

Your Day-to-Day Impact

The candidate will be responsible for:
Designing and developing AI-Ops agents for cloud, infrastructure, platform, and SRE operations.

Building AI agents for platforms such as:
Kubernetes
OpenStack
CEPH
PostgreSQL
MariaDB
Kafka
OpenSearch
Grafana
Zabbix
Linux
XaasIO CMP
XaasIO MLT

Developing agent workflows for:
Alert triage
Log analysis
Metrics analysis
Event correlation
Root-cause analysis
Incident summarization
Runbook recommendation
Remediation planning
Change impact analysis
Post-change validation
Post-incident review support
Compliance validation
Security posture analysis
Automated operational report generation
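As one illustration of the alert triage and event correlation workflows listed above, the sketch below groups alerts that share a resource and arrive close together in time. The `Alert` fields, the 300-second window, and the sample data are assumptions made for this example, not a description of XaasIO's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    timestamp: int  # seconds since epoch
    resource: str   # e.g. a host, pod, or service name
    message: str


def group_alerts(alerts, window_seconds=300):
    """Group alerts on the same resource whose gaps are within the window."""
    groups = []
    open_group = {}  # resource -> the group currently collecting its alerts
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_group.get(alert.resource)
        if group and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            open_group[alert.resource] = group
    return groups


# Sample alerts invented for this example.
alerts = [
    Alert(0, "node-1", "disk latency high"),
    Alert(60, "node-1", "OSD slow requests"),
    Alert(90, "db-1", "replication lag"),
    Alert(1000, "node-1", "disk latency high"),
]
for group in group_alerts(alerts):
    print(group[0].resource, [a.message for a in group])
```

Grouping by resource and time gap is only one simple correlation rule; a production agent would likely also use topology, labels, or trace context.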

Building RAG-based knowledge systems using runbooks, SOPs, architecture documents, platform documentation, logs, tickets, alerts, monitoring data, security scan reports, compliance reports, and incident history.
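The retrieval step of such a RAG system can be sketched in a few lines. This minimal illustration ranks runbook snippets by bag-of-words cosine similarity, which is a stand-in for the embedding model and vector database a real pipeline would use. The runbook texts, keys, and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=1):
    """Return the names of the top_k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda name: cosine(q, Counter(tokenize(documents[name]))),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical runbook snippets, invented for this example.
runbooks = {
    "ceph-osd-down": "Steps to recover a Ceph OSD that is down or out",
    "pg-replication-lag": "Diagnose PostgreSQL replication lag on a standby",
    "k8s-crashloop": "Investigate a pod stuck in CrashLoopBackOff",
}

context = retrieve("pod is in CrashLoopBackOff after deploy", runbooks)
prompt = f"Use runbook {context[0]} to propose next steps."
print(prompt)
```

The retrieved runbook name is placed into the prompt so that the model's answer is grounded in operational documentation rather than in its general knowledge alone.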

Integrating AI agents with observability and operations platforms such as:
Grafana
Prometheus
OpenSearch
Zabbix
Alertmanager
Wazuh
ITSM tools
CI/CD systems
Git repositories
Ansible / AWX
OpenTofu / Terraform

Building safe agent workflows with human-in-the-loop approvals before executing production-impacting actions.

Creating automation playbooks and remediation workflows using Python, Ansible, APIs, shell scripts, and event-driven automation.

Developing agent tools and connectors for:
Kubernetes API
OpenStack APIs
CEPH APIs
Linux system commands
PostgreSQL / MariaDB APIs
Monitoring APIs
Logging APIs
ITSM APIs
CI/CD APIs
DevSecOps tool APIs

Designing guardrails for AI agent actions, including:
Role-based access control
Approval workflows
Audit logging
Dry-run mode
Policy validation
Change window validation
Rollback checks
Secrets protection
Security baseline validation
Safety checks before remediation
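Three of the guardrails above (approval workflows, dry-run mode, and audit logging) can be combined in a small wrapper around agent actions. The sketch below is a minimal illustration under stated assumptions: the `GuardedExecutor` class, the `restart_service` function, and the approver name are all hypothetical, and a real system would add RBAC, policy validation, and rollback checks.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedExecutor:
    """Runs agent-proposed actions only after approval; illustrative only."""
    dry_run: bool = True
    audit_log: list = field(default_factory=list)

    def execute(self, action, run, approved_by=None):
        """Record every request, then run it only if the checks pass."""
        if approved_by is None:
            outcome = "rejected: no human approval"
        elif self.dry_run:
            outcome = "dry-run: action not executed"
        else:
            outcome = f"executed: {run()}"
        self.audit_log.append(
            {"action": action, "approved_by": approved_by, "outcome": outcome}
        )
        return outcome


def restart_service():
    """Hypothetical remediation step standing in for a real API call."""
    return "service restarted"


executor = GuardedExecutor(dry_run=True)
print(executor.execute("restart nginx", restart_service))
print(executor.execute("restart nginx", restart_service, approved_by="sre-oncall"))

executor.dry_run = False
print(executor.execute("restart nginx", restart_service, approved_by="sre-oncall"))
```

Defaulting to dry-run and logging rejected requests as well as executed ones keeps the audit trail complete even when nothing is changed in production.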

Implementing DevSecOps and CI/CD pipeline integrations for automated validation, secure build processes, security scanning, compliance checks, and deployment approvals.

Integrating SAST, DAST, SCA, container image scanning, IaC scanning, secrets scanning, SBOM generation, vulnerability checks, and policy-as-code gates into development and deployment workflows.
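A security gate of this kind usually reduces to a threshold check over scanner findings. The sketch below assumes a simplified, hypothetical finding format (`id`, `severity`, `source`); real tools such as Trivy or Checkov each emit their own report schema, which would need to be normalized first.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def gate(findings, fail_on="HIGH"):
    """Return (passed, blocking) for findings at or above the threshold."""
    threshold = SEVERITY_ORDER.index(fail_on)
    blocking = [
        f for f in findings
        if SEVERITY_ORDER.index(f["severity"]) >= threshold
    ]
    return (not blocking, blocking)


# Hypothetical findings, invented for this example.
findings = [
    {"id": "FINDING-1", "severity": "LOW", "source": "sast"},
    {"id": "FINDING-2", "severity": "CRITICAL", "source": "image-scan"},
    {"id": "FINDING-3", "severity": "MEDIUM", "source": "iac-scan"},
]

passed, blocking = gate(findings)
print("gate passed" if passed else f"gate failed: {[f['id'] for f in blocking]}")
```

In a CI/CD job, the boolean result would typically decide the exit code of the step, so that a failed gate blocks the deployment approval.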

Evaluating and integrating open-source AI agent frameworks, AI platform engineering tools, and AI-Ops reference architectures.

Developing PoCs, demos, technical documentation, architecture diagrams, test cases, and customer-facing presentations.

Troubleshooting agent behavior, hallucination risks, prompt failures, tool-calling errors, data quality issues, model performance issues, security scan failures, pipeline failures, and infrastructure integration problems.

Skills You Bring to the Table

Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, Machine Learning, Data Science, Information Technology, Engineering, Cybersecurity, or equivalent practical experience.

Certifications in AI, data science, Kubernetes, Linux, cloud, DevOps, DevSecOps, cybersecurity, or security compliance will be an added advantage.

The candidate should have hands-on experience in:
Python programming
LLM application development
AI agent development
Prompt engineering
RAG pipeline design
Vector databases
REST API integration
Linux fundamentals
Git-based development workflow
Docker and containerized application deployment
Kubernetes basics
Observability fundamentals: logs, metrics, events, and traces
Automation scripting using Python and Shell
DevOps practices and infrastructure operations workflows
Hands-on exposure to CI/CD pipelines
CI/CD tools such as GitHub Actions, GitLab CI/CD, Jenkins, Argo CD, Tekton, or similar
Building, testing, packaging, and deploying applications through CI/CD workflows
DevSecOps practices and secure software delivery workflows
SAST, DAST, SCA, and container image scanning
IaC scanning and secrets scanning
SBOM generation and vulnerability management
Security scanning tools such as Trivy, Semgrep, SonarQube, Checkov, OWASP ZAP, or similar
Policy-as-Code using OPA, Kyverno, or similar tools
RBAC, IAM, audit logging, and compliance reporting basics
Strong debugging and problem-solving skills

AI, GenAI and Agentic AI Skills

The candidate should have working knowledge of:
Large Language Models
Open-source LLMs
Prompt engineering
Function calling / tool calling
Agentic workflows
Multi-agent patterns
RAG pipelines
Embedding models
Vector search
Semantic search
Reranking
Prompt and response evaluation
Guardrails and safety controls
Human-in-the-loop workflows
AI workflow orchestration

Preferred AI Framework Exposure

Exposure to one or more of the following will be preferred:
LangChain
LangGraph
LlamaIndex
CrewAI
AutoGen
Semantic Kernel
Haystack
DSPy
Hugging Face
vLLM
Ollama
OpenAI-compatible APIs
OpenWebUI
Milvus
Qdrant
Weaviate
ChromaDB

AI-Ops and Observability Skills

The candidate should have exposure to one or more of the following:
Prometheus
Grafana
Alertmanager
OpenSearch / Elasticsearch
Zabbix
Wazuh
Loki
Tempo
Jaeger
OpenTelemetry
Uptime Kuma
Event correlation
Alert noise reduction
SLA and SLO reporting
Root-cause analysis workflows
Runbook automation
Incident management workflows
ITSM integrations such as BMC Helix, ServiceNow, GLPI, or Zammad

Infrastructure Platform Skills

The candidate should have exposure to one or more of the following:
Kubernetes operations
OpenStack operations
CEPH operations
Linux systems administration
PostgreSQL / MariaDB operations
Kafka operations
Redis operations
NGINX / HAProxy operations
Public cloud operations such as AWS, Azure, or GCP
VMware / KVM / Nutanix exposure
Backup and restore workflows
Replication and DR workflows

Automation and Platform Engineering Skills

The candidate should have exposure to:
Ansible
AWX / Ansible Automation Platform
OpenTofu / Terraform
Python automation
Shell scripting
GitOps workflows
Kubernetes operators
Helm charts
Infrastructure-as-Code validation
Policy-as-Code
Event-driven automation
Pub/Sub architecture using Kafka, RabbitMQ, NATS, or similar platforms

DevSecOps and Security Skills

The candidate must have hands-on exposure to:
Secure CI/CD pipeline integration
Security gates in software delivery pipelines
SAST, DAST, and SCA scanning
Container image scanning
Infrastructure-as-Code scanning
Secrets scanning
SBOM generation
Vulnerability management
Security baseline validation
Compliance checks and audit reporting
Policy-as-Code enforcement
Secure tool execution for AI agents

Preferred tools include:
Trivy
OpenSCAP
Semgrep
SonarQube
OWASP ZAP
Checkov
Syft / Grype
Kyverno
OPA / Gatekeeper
Wazuh
GitHub Advanced Security or similar

Example AI-Ops Agent Use Cases

The engineer should be capable of building agents for use cases such as:
Kubernetes AI-Ops Agent
Analyze pod failures
Explain CrashLoopBackOff issues
Detect resource pressure
Recommend scaling actions
Validate cluster health
Generate remediation steps
Validate security posture of workloads

OpenStack AI-Ops Agent
Analyze Nova, Neutron, Cinder, Glance, and Keystone issues
Correlate API errors with service logs
Check hypervisor capacity and VM placement issues
Identify network, floating IP, router, or volume attachment problems
Recommend safe remediation steps
Validate service health before and after changes

CEPH AI-Ops Agent
Analyze OSD, MON, MGR, RGW, and RBD health
Explain PG states and recovery status
Identify disk, latency, or replication issues
Recommend recovery and rebalancing actions
Validate cluster health after remediation

Database AI-Ops Agent
Analyze PostgreSQL or MariaDB performance
Detect slow queries, locks, replication lag, and connection issues
Recommend tuning or remediation steps
Generate database health and risk reports

Observability AI-Ops Agent
Summarize alerts
Group related events
Correlate logs, metrics, and traces
Generate incident summaries
Prepare RCA and post-incident reports
Generate SLA and SLO compliance summaries

Change Validation Agent
Validate pre-change checklist
Compare pre-change and post-change system states
Validate security and compliance gates
Generate change success or failure reports
Recommend rollback actions where required

DevSecOps Agent
Analyze CI/CD pipeline failures
Review security scan reports
Summarize vulnerabilities and severity
Recommend remediation steps
Validate IaC and container security findings
Generate compliance evidence reports

Customer-Facing and Delivery Responsibilities

The candidate should be able to:
Participate in customer-facing technical discussions and AI-Ops solution workshops.

Understand customer operations workflows, monitoring stack, DevOps process, DevSecOps controls, incident management process, escalation process, and existing runbooks.

Convert customer operations use cases into AI agent workflows, automation flows, integration requirements, and security guardrails.

Support PoCs and demos for AI-Ops, AI-SRE, automated RCA, intelligent remediation, and DevSecOps automation.

Document agent capabilities, limitations, guardrails, integrations, test cases, security controls, and operational procedures.

Present technical findings, demo outcomes, risks, and recommendations to internal and customer stakeholders.

Extra Cool If You Know

The following skills will be an added advantage:
Experience building AI agents for infrastructure operations
Experience with CAIPE, CNOE, K8sGPT, Komodor-like workflows, or AI platform engineering tools
Experience with SRE, DevOps, cloud operations, NOC, or SOC operations
Experience with MLflow, Kubeflow, JupyterLab, or Private AI Factory platforms
Experience with model evaluation and prompt evaluation
Experience with GPU inference platforms such as vLLM
Experience with fine-tuning, LoRA, QLoRA, or model quantization
Experience with streaming architectures using Kafka, NATS, RabbitMQ, or Redis Streams
Experience with ITSM integration and ticket lifecycle automation
Active GitHub profile, open-source contributions, AI demos, notebooks, blogs, or technical portfolio

Preferred Technical Stack

Programming: Python, Shell, SQL
AI Frameworks: LangChain, LangGraph, LlamaIndex, CrewAI, AutoGen
LLM Serving: vLLM, Ollama, OpenAI-compatible APIs, Hugging Face
RAG / Vector DB: Milvus, Qdrant, Weaviate, ChromaDB
Observability: Prometheus, Grafana, OpenSearch, Zabbix, Alertmanager, OpenTelemetry
Cloud / Infrastructure: Kubernetes, OpenStack, CEPH, Linux, PostgreSQL, MariaDB, Kafka
Automation: Ansible, AWX, OpenTofu, Terraform, Python automation
CI/CD: GitHub Actions, GitLab CI/CD, Jenkins, Argo CD, Tekton
DevSecOps: Trivy, OpenSCAP, Semgrep, SonarQube, OWASP ZAP, Checkov, Syft, Grype, OPA, Kyverno
ITSM: BMC Helix, ServiceNow, GLPI, Zammad, or similar
Knowledge Sources: Runbooks, SOPs, logs, metrics, events, tickets, KB articles, security scan reports, compliance reports

Required Soft Skills
The candidate should have:
Strong problem-solving and analytical thinking
Strong communication skills
Ability to understand infrastructure operations problems
Ability to explain AI agent behavior clearly
Ability to work with SRE, DevOps, DevSecOps, cloud, security, and customer teams
Strong documentation skills
Curiosity to learn new AI and open-source operations tools
Ownership mindset for delivery, safety, security, quality, and customer success

How You’ll Make an Impact

We are looking for someone who:
Can build AI agents for real infrastructure operations.
Understands both AI engineering and cloud operations workflows.
Has hands-on DevOps, DevSecOps, and CI/CD pipeline exposure (mandatory).
Can convert runbooks and SOPs into intelligent agent workflows.
Can integrate agents with monitoring, logging, ITSM, automation, CI/CD, DevSecOps, and cloud APIs.
Can design safe, secure, human-approved remediation workflows.
Can work with Kubernetes, OpenStack, CEPH, Linux, and enterprise open-source platforms.
Can build RAG pipelines using operational knowledge, documentation, security reports, and historical incidents.
Can contribute to XaasIO Private AI Factory and XaasIO AI-Ops platform capabilities.
Can demonstrate practical engineering work through GitHub, demos, notebooks, or past projects.

Perks, Culture & Growth

At XaasIO Systems Pvt. Ltd., we believe our employees are our greatest asset. We are committed to creating a workplace that fosters innovation, growth, and well-being.

Learning & Growth
Opportunities to work on cutting-edge technologies
Continuous learning through training, certifications, and mentorship
Exposure to real-time projects and global clients

Work Culture
Open, inclusive, and collaborative work environment
Encouragement of new ideas and innovation
Strong focus on teamwork and transparency

Career Development
Clear career progression paths
Performance-driven growth opportunities
Internal mobility across roles and projects

Work-Life Balance
Flexible work environment (WFH / hybrid)
Paid time off and leave benefits
Supportive policies for employee well-being

Rewards & Recognition
Competitive compensation and benefits
Performance-based incentives
Employee recognition programs

Safe & Respectful Workplace
Strong adherence to policies aligned with the Sexual Harassment of Women at Workplace (Prevention, Prohibition and Redressal) Act, 2013
Zero tolerance for harassment or discrimination

Summary

This is an AI engineering role based primarily in Coimbatore for engineers who want to build AI-Ops agents, AI-SRE automation, and DevSecOps-aware intelligent operations capabilities for XaasIO’s Private AI Factory, Cloud Management Platform, Monitoring/Logging/Telemetry platform, and enterprise open-source infrastructure stack.

The role is ideal for candidates who can combine LLMs, RAG, agentic AI, observability, DevOps, DevSecOps, CI/CD, automation, Kubernetes, OpenStack, CEPH, Linux, and platform engineering to build secure intelligent operations agents for enterprise private cloud and sovereign AI infrastructure. Click on Apply to know more.
