XaasIO
Website: xaasio.com
Job details:
- Job Title: AI Engineer | AI-Ops Agent Development
- Primary Location: Coimbatore, Tamil Nadu
- Work Mode: On-site / Hybrid
- Company: XaasIO Systems Private Limited
- Role Type: Full-time
- Experience: 3-8 years preferred
Why This Role Matters
XaasIO is looking for an AI Engineer (AI-Ops Agent Development) to design, build, secure, and integrate intelligent AI agents for infrastructure operations, cloud operations, platform engineering, and day-2 support automation.
The engineer will build AI-powered SRE and operations agents for XaasIO platforms and their underlying stack, including OpenStack, Kubernetes, CEPH, XaasIO CMP, XaasIO MLT, PostgreSQL, OpenSearch, Grafana, Zabbix, Kafka, Linux, and enterprise open-source platforms.
The role requires hands-on skills in Python, LLM application development, agentic AI frameworks, RAG pipelines, DevOps, DevSecOps, CI/CD pipelines, observability data, logs, metrics, events, APIs, automation workflows, and infrastructure operations.
The objective is to build AI agents that can assist with L1, L2, and eventually L3 operations, including alert triage, root-cause analysis, runbook recommendation, change validation, remediation planning, human-approved automation, compliance validation, and automated operational reporting.
Your Day-to-Day Impact
- The candidate will be responsible for:
- Designing and developing AI-Ops agents for cloud, infrastructure, platform, and SRE operations.
- Building AI agents for platforms such as:
- Kubernetes
- OpenStack
- CEPH
- PostgreSQL
- MariaDB
- Kafka
- OpenSearch
- Grafana
- Zabbix
- Linux
- XaasIO CMP
- XaasIO MLT
- Developing agent workflows for:
- Alert triage
- Log analysis
- Metrics analysis
- Event correlation
- Root-cause analysis
- Incident summarization
- Runbook recommendation
- Remediation planning
- Change impact analysis
- Post-change validation
- Post-incident review support
- Compliance validation
- Security posture analysis
- Automated operational report generation
- Building RAG-based knowledge systems using runbooks, SOPs, architecture documents, platform documentation, logs, tickets, alerts, monitoring data, security scan reports, compliance reports, and incident history.
- Integrating AI agents with observability and operations platforms such as:
- Grafana
- Prometheus
- OpenSearch
- Zabbix
- Alertmanager
- Wazuh
- ITSM tools
- CI/CD systems
- Git repositories
- Ansible / AWX
- OpenTofu / Terraform
- Building safe agent workflows with human-in-the-loop approvals before executing production-impacting actions.
- Creating automation playbooks and remediation workflows using Python, Ansible, APIs, shell scripts, and event-driven automation.
- Developing agent tools and connectors for:
- Kubernetes API
- OpenStack APIs
- CEPH APIs
- Linux system commands
- PostgreSQL / MariaDB APIs
- Monitoring APIs
- Logging APIs
- ITSM APIs
- CI/CD APIs
- DevSecOps tool APIs
- Designing guardrails for AI agent actions, including:
- Role-based access control
- Approval workflows
- Audit logging
- Dry-run mode
- Policy validation
- Change window validation
- Rollback checks
- Secrets protection
- Security baseline validation
- Safety checks before remediation
- Implementing DevSecOps and CI/CD pipeline integrations for automated validation, secure build processes, security scanning, compliance checks, and deployment approvals.
- Integrating SAST, DAST, SCA, container image scanning, IaC scanning, secrets scanning, SBOM generation, vulnerability checks, and policy-as-code gates into development and deployment workflows.
- Evaluating and integrating open-source AI agent frameworks, AI platform engineering tools, and AI-Ops reference architectures.
- Developing PoCs, demos, technical documentation, architecture diagrams, test cases, and customer-facing presentations.
- Troubleshooting agent behavior, hallucination risks, prompt failures, tool-calling errors, data quality issues, model performance issues, security scan failures, pipeline failures, and infrastructure integration problems.
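To give a flavor of the guardrail work listed among the responsibilities above, here is a minimal Python sketch that combines dry-run mode, a human approval check, change window validation, and audit logging. It is an illustration only: `ProposedAction`, `GuardrailResult`, `run_with_guardrails`, and the 22:00 to 04:00 change window are hypothetical names and values, not part of any XaasIO product.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Callable, List, Optional


@dataclass
class ProposedAction:
    """A remediation step suggested by an agent (hypothetical structure)."""
    description: str
    execute: Callable[[], str]
    production_impacting: bool = True


@dataclass
class GuardrailResult:
    status: str  # "executed", "dry_run", or "blocked"
    detail: str


# Assumed change window that wraps past midnight (illustrative values only).
WINDOW_START = time(22, 0)
WINDOW_END = time(4, 0)

# In-memory audit trail; a real system would write to durable storage.
audit_log: List[str] = []


def in_change_window(now: datetime) -> bool:
    """Return True if `now` falls between WINDOW_START and WINDOW_END."""
    t = now.time()
    return t >= WINDOW_START or t < WINDOW_END


def run_with_guardrails(
    action: ProposedAction,
    approved_by: Optional[str],
    now: datetime,
    dry_run: bool = True,
) -> GuardrailResult:
    """Apply the checks in order and record every decision in the audit log."""
    if dry_run:
        result = GuardrailResult("dry_run", f"would run: {action.description}")
    elif action.production_impacting and approved_by is None:
        result = GuardrailResult("blocked", "human approval required")
    elif action.production_impacting and not in_change_window(now):
        result = GuardrailResult("blocked", "outside change window")
    else:
        result = GuardrailResult("executed", action.execute())
    audit_log.append(
        f"{now.isoformat()} {result.status} {action.description} approver={approved_by}"
    )
    return result
```

Defaulting `dry_run` to `True` means an agent has to opt in explicitly before anything is executed, which is one common way to keep early prototypes safe.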
Skills You Bring to the Table
- Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, Machine Learning, Data Science, Information Technology, Engineering, Cybersecurity, or equivalent practical experience.
- Certifications in AI, data science, Kubernetes, Linux, cloud, DevOps, DevSecOps, cybersecurity, or security compliance will be an added advantage.
- The candidate should have hands-on experience in:
- Python programming
- LLM application development
- AI agent development
- Prompt engineering
- RAG pipeline design
- Vector databases
- REST API integration
- Linux fundamentals
- Git-based development workflow
- Docker and containerized application deployment
- Kubernetes basics
- Observability fundamentals: logs, metrics, events, and traces
- Automation scripting using Python and Shell
- DevOps practices and infrastructure operations workflows
- Hands-on exposure to CI/CD pipelines
- CI/CD tools such as GitHub Actions, GitLab CI/CD, Jenkins, Argo CD, Tekton, or similar
- Building, testing, packaging, and deploying applications through CI/CD workflows
- DevSecOps practices and secure software delivery workflows
- SAST, DAST, SCA, and container image scanning
- IaC scanning and secrets scanning
- SBOM generation and vulnerability management
- Security scanning tools such as Trivy, Semgrep, SonarQube, Checkov, OWASP ZAP, or similar
- Policy-as-Code using OPA, Kyverno, or similar tools
- RBAC, IAM, audit logging, and compliance reporting basics
- Strong debugging and problem-solving skills
- AI, GenAI and Agentic AI Skills
- The candidate should have working knowledge of:
- Large Language Models
- Open-source LLMs
- Prompt engineering
- Function calling / tool calling
- Agentic workflows
- Multi-agent patterns
- RAG pipelines
- Embedding models
- Vector search
- Semantic search
- Reranking
- Prompt and response evaluation
- Guardrails and safety controls
- Human-in-the-loop workflows
- AI workflow orchestration
- Preferred AI Framework Exposure
- Exposure to one or more of the following will be preferred:
- LangChain
- LangGraph
- LlamaIndex
- CrewAI
- AutoGen
- Semantic Kernel
- Haystack
- DSPy
- Hugging Face
- vLLM
- Ollama
- OpenAI-compatible APIs
- OpenWebUI
- Milvus
- Qdrant
- Weaviate
- ChromaDB
- AI-Ops and Observability Skills
- The candidate should have exposure to one or more of the following:
- Prometheus
- Grafana
- Alertmanager
- OpenSearch / Elasticsearch
- Zabbix
- Wazuh
- Loki
- Tempo
- Jaeger
- OpenTelemetry
- Uptime Kuma
- Event correlation
- Alert noise reduction
- SLA and SLO reporting
- Root-cause analysis workflows
- Runbook automation
- Incident management workflows
- ITSM integrations such as BMC Helix, ServiceNow, GLPI, or Zammad
- Infrastructure Platform Skills
- The candidate should have exposure to one or more of the following:
- Kubernetes operations
- OpenStack operations
- CEPH operations
- Linux systems administration
- PostgreSQL / MariaDB operations
- Kafka operations
- Redis operations
- NGINX / HAProxy operations
- Public cloud operations such as AWS, Azure, or GCP
- VMware / KVM / Nutanix exposure
- Backup and restore workflows
- Replication and DR workflows
- Automation and Platform Engineering Skills
- The candidate should have exposure to:
- Ansible
- AWX / Ansible Automation Platform
- OpenTofu / Terraform
- Python automation
- Shell scripting
- GitOps workflows
- Kubernetes operators
- Helm charts
- Infrastructure-as-Code validation
- Policy-as-Code
- Event-driven automation
- Pub/Sub architecture using Kafka, RabbitMQ, NATS, or similar platforms
- DevSecOps and Security Skills
- Hands-on exposure to the following is mandatory:
- Secure CI/CD pipeline integration
- Security gates in software delivery pipelines
- SAST, DAST, and SCA scanning
- Container image scanning
- Infrastructure-as-Code scanning
- Secrets scanning
- SBOM generation
- Vulnerability management
- Security baseline validation
- Compliance checks and audit reporting
- Policy-as-Code enforcement
- Secure tool execution for AI agents
- Preferred tools include:
- Trivy
- OpenSCAP
- Semgrep
- SonarQube
- OWASP ZAP
- Checkov
- Syft / Grype
- Kyverno
- OPA / Gatekeeper
- Wazuh
- GitHub Advanced Security or similar
- Example AI-Ops Agent Use Cases
- The engineer should be capable of building agents for use cases such as:
- Kubernetes AI-Ops Agent
- Analyze pod failures
- Explain CrashLoopBackOff issues
- Detect resource pressure
- Recommend scaling actions
- Validate cluster health
- Generate remediation steps
- Validate security posture of workloads
- OpenStack AI-Ops Agent
- Analyze Nova, Neutron, Cinder, Glance, and Keystone issues
- Correlate API errors with service logs
- Check hypervisor capacity and VM placement issues
- Identify network, floating IP, router, or volume attachment problems
- Recommend safe remediation steps
- Validate service health before and after changes
- CEPH AI-Ops Agent
- Analyze OSD, MON, MGR, RGW, and RBD health
- Explain PG states and recovery status
- Identify disk, latency, or replication issues
- Recommend recovery and rebalancing actions
- Validate cluster health after remediation
- Database AI-Ops Agent
- Analyze PostgreSQL or MariaDB performance
- Detect slow queries, locks, replication lag, and connection issues
- Recommend tuning or remediation steps
- Generate database health and risk reports
- Observability AI-Ops Agent
- Summarize alerts
- Group related events
- Correlate logs, metrics, and traces
- Generate incident summaries
- Prepare RCA and post-incident reports
- Generate SLA and SLO compliance summaries
- Change Validation Agent
- Validate pre-change checklist
- Compare pre-change and post-change system states
- Validate security and compliance gates
- Generate change success or failure reports
- Recommend rollback actions where required
- DevSecOps Agent
- Analyze CI/CD pipeline failures
- Review security scan reports
- Summarize vulnerabilities and severity
- Recommend remediation steps
- Validate IaC and container security findings
- Generate compliance evidence reports
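As a small illustration of the "Group related events" item under the Observability AI-Ops Agent above, the following Python sketch groups alerts by service and time proximity. The `Alert` structure, the `group_alerts` function, and the 300-second window are assumptions made for this example; a real agent would likely correlate on richer labels and topology.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    """Simplified alert record (hypothetical fields)."""
    name: str
    service: str
    timestamp: float  # seconds since epoch


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts per service; start a new group when the gap exceeds the window."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_group.get(alert.service)
        if current and alert.timestamp - current[-1].timestamp <= window_seconds:
            # Close enough to the previous alert for this service: same incident.
            current.append(alert)
        else:
            # First alert for this service, or the gap is too large: new group.
            current = [alert]
            groups.append(current)
            latest_group[alert.service] = current
    return groups
```

Each resulting group can then be passed to an LLM as a single unit for summarization, which tends to reduce alert noise compared with summarizing alerts one by one.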
- Customer-Facing and Delivery Responsibilities
- The candidate should be able to:
- Participate in customer-facing technical discussions and AI-Ops solution workshops.
- Understand customer operations workflows, monitoring stack, DevOps process, DevSecOps controls, incident management process, escalation process, and existing runbooks.
- Convert customer operations use cases into AI agent workflows, automation flows, integration requirements, and security guardrails.
- Support PoCs and demos for AI-Ops, AI-SRE, automated RCA, intelligent remediation, and DevSecOps automation.
- Document agent capabilities, limitations, guardrails, integrations, test cases, security controls, and operational procedures.
- Present technical findings, demo outcomes, risks, and recommendations to internal and customer stakeholders.
Extra Cool If You Know
- The following skills will be an added advantage:
- Experience building AI agents for infrastructure operations
- Experience with CAIPE, CNOE, K8sGPT, Komodor-like workflows, or AI platform engineering tools
- Experience with SRE, DevOps, cloud operations, NOC, or SOC operations
- Experience with MLflow, Kubeflow, JupyterLab, or Private AI Factory platforms
- Experience with model evaluation and prompt evaluation
- Experience with GPU inference platforms such as vLLM
- Experience with fine-tuning, LoRA, QLoRA, or model quantization
- Experience with streaming architectures using Kafka, NATS, RabbitMQ, or Redis Streams
- Experience with ITSM integration and ticket lifecycle automation
- Active GitHub profile, open-source contributions, AI demos, notebooks, blogs, or technical portfolio
- Preferred Technical Stack
- Programming: Python, Shell, SQL
- AI Frameworks: LangChain, LangGraph, LlamaIndex, CrewAI, AutoGen
- LLM Serving: vLLM, Ollama, OpenAI-compatible APIs, Hugging Face
- RAG / Vector DB: Milvus, Qdrant, Weaviate, ChromaDB
- Observability: Prometheus, Grafana, OpenSearch, Zabbix, Alertmanager, OpenTelemetry
- Cloud / Infrastructure: Kubernetes, OpenStack, CEPH, Linux, PostgreSQL, MariaDB, Kafka
- Automation: Ansible, AWX, OpenTofu, Terraform, Python automation
- CI/CD: GitHub Actions, GitLab CI/CD, Jenkins, Argo CD, Tekton
- DevSecOps: Trivy, OpenSCAP, Semgrep, SonarQube, OWASP ZAP, Checkov, Syft, Grype, OPA, Kyverno
- ITSM: BMC Helix, ServiceNow, GLPI, Zammad, or similar
- Knowledge Sources: Runbooks, SOPs, logs, metrics, events, tickets, KB articles, security scan reports, compliance reports
- Required Soft Skills
- The candidate should have:
- Strong problem-solving and analytical thinking
- Strong communication skills
- Ability to understand infrastructure operations problems
- Ability to explain AI agent behavior clearly
- Ability to work with SRE, DevOps, DevSecOps, cloud, security, and customer teams
- Strong documentation skills
- Curiosity to learn new AI and open-source operations tools
- Ownership mindset for delivery, safety, security, quality, and customer success
How You’ll Make an Impact
- We are looking for someone who:
- Can build AI agents for real infrastructure operations.
- Understands both AI engineering and cloud operations workflows.
- Has hands-on DevOps, DevSecOps, and CI/CD pipeline exposure (mandatory).
- Can convert runbooks and SOPs into intelligent agent workflows.
- Can integrate agents with monitoring, logging, ITSM, automation, CI/CD, DevSecOps, and cloud APIs.
- Can design safe, secure, human-approved remediation workflows.
- Can work with Kubernetes, OpenStack, CEPH, Linux, and enterprise open-source platforms.
- Can build RAG pipelines using operational knowledge, documentation, security reports, and historical incidents.
- Can contribute to XaasIO Private AI Factory and XaasIO AI-Ops platform capabilities.
- Can demonstrate practical engineering work through GitHub, demos, notebooks, or past projects.
Perks, Culture & Growth
At XaasIO Systems Pvt. Ltd., we believe our employees are our greatest asset. We are committed to creating a workplace that fosters innovation, growth, and well-being.
- Learning & Growth
- Opportunities to work on cutting-edge technologies
- Continuous learning through training, certifications, and mentorship
- Exposure to real-time projects and global clients
- Work Culture
- Open, inclusive, and collaborative work environment
- Encouragement of new ideas and innovation
- Strong focus on teamwork and transparency
- Career Development
- Clear career progression paths
- Performance-driven growth opportunities
- Internal mobility across roles and projects
- Work-Life Balance
- Flexible work environment (WFH / hybrid)
- Paid time off and leave benefits
- Supportive policies for employee well-being
- Rewards & Recognition
- Competitive compensation and benefits
- Performance-based incentives
- Employee recognition programs
- Safe & Respectful Workplace
- Strong adherence to policies aligned with the Sexual Harassment of Women at Workplace (Prevention, Prohibition and Redressal) Act, 2013
- Zero tolerance for harassment or discrimination
Summary
This is an AI engineering role based primarily in Coimbatore for engineers who want to build AI-Ops agents, AI-SRE automation, and DevSecOps-aware intelligent operations capabilities for XaasIO’s Private AI Factory, Cloud Management Platform, Monitoring/Logging/Telemetry platform, and enterprise open-source infrastructure stack.
The role is ideal for candidates who can combine LLMs, RAG, agentic AI, observability, DevOps, DevSecOps, CI/CD, automation, Kubernetes, OpenStack, CEPH, Linux, and platform engineering to build secure intelligent operations agents for enterprise private cloud and sovereign AI infrastructure.