DigiHelic Solutions Pvt. Ltd.
Website:
digihelic.com
Job details:
DevSecOps Developer
Experience: 10+ Years
Location: Pan India
Description:
We are seeking experienced Senior SRE & DevSecOps Engineers with 10+ years of hands-on experience to design, implement, and maintain secure, scalable, and highly available infrastructure with a strong focus on Cloud & AI/ML platforms. You will be a key contributor in bridging development, security, and operations, ensuring our Cloud/AI systems are resilient, performant, secure, and production-ready.
Key Responsibilities
Site Reliability Engineering
Design, build, and maintain highly available, scalable, and fault-tolerant distributed systems
Define and track SLIs, SLOs, and SLAs; drive reliability improvements based on error budgets
Lead incident response, conduct blameless post-mortems, and implement preventive measures
Build and improve the observability stack (monitoring, logging, tracing, alerting)
Reduce toil through automation, tooling, and self-healing infrastructure
Perform capacity planning and optimize system performance and cost efficiency
Implement chaos engineering practices to proactively identify system weaknesses
AI Security & Governance
Implement AI/ML security best practices including model access controls and API security
Secure model artifacts, training data, and inference endpoints
Set up prompt injection protection and input/output validation for LLM applications
Implement data privacy controls for AI training pipelines (PII detection, data anonymization)
Ensure AI compliance with regulations (EU AI Act, GDPR for AI, industry-specific requirements)
Monitor for adversarial attacks and implement model robustness testing
Implement AI audit trails and model lineage tracking for governance
Manage secrets and API keys for third-party AI services (OpenAI, Anthropic, etc.)
DevSecOps & Security
Embed security into CI/CD pipelines (SAST, DAST, SCA, container scanning, secrets management)
Design and implement infrastructure security controls and hardening standards
Manage vulnerability assessments, penetration testing coordination, and remediation tracking
Implement and maintain IAM policies, RBAC, and zero-trust architecture principles
Ensure compliance with security frameworks (SOC2, ISO 27001, GDPR, HIPAA, PCI-DSS as applicable)
Conduct security audits and threat modeling for infrastructure and applications
Manage secrets, certificates, and encryption (at rest and in transit)
Infrastructure & Automation
Design and manage cloud infrastructure (AWS/GCP/Azure) using IaC (Terraform, Pulumi, CloudFormation)
Build and maintain container orchestration platforms (Kubernetes, EKS/GKE/AKS)
Develop and maintain CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
Implement GitOps practices for infrastructure and application deployment
Automate operational tasks using Python, Go, Bash, or similar languages
AI/ML Infrastructure & MLOps (preferred)
Design, deploy, and maintain scalable AI inference infrastructure (HIVE)
Build and manage AI/ML pipelines
Implement model serving infrastructure (
Manage clusters and optimize resource allocation for training and inference workloads
Implement model versioning, A/B testing, and canary deployments
Set up feature stores and manage data pipelines
Monitor model performance, drift detection, and automated retraining pipelines
Optimize inference latency, throughput, and cost for production AI services
Manage LLM infrastructure including API gateways, rate limiting, and token management
Deploy and scale vector databases (Pinecone, Milvus, Weaviate, pgvector) for RAG applications
Implement LLMOps practices for prompt versioning, evaluation, and deployment
Leadership & Collaboration
Mentor junior engineers and promote SRE/DevSecOps/MLOps best practices across teams
Collaborate with data science, ML engineering, security, and platform teams
Participate in architecture reviews and provide guidance on reliability, security, and AI infrastructure
Document runbooks, architecture decisions, and operational procedures
Drive cultural change toward shared ownership of reliability and security
Evangelize MLOps and AI platform best practices across the organization
Required Qualifications
Experience
10+ years of experience in SRE, DevOps, Platform Engineering, or related roles
6+ years with cloud platforms (AWS, GCP, or Azure) in production environments
5+ years with container orchestration at scale
4+ years integrating security practices into DevOps workflows
2+ years with AI/ML infrastructure and MLOps in production
Technical Skills
Cloud & Infrastructure
Cloud Platforms: AWS (preferred), GCP, Azure - expertise in core services
AI/ML Cloud Services: SageMaker, Vertex AI, Azure ML, Bedrock (preferred), or similar
IaC: Terraform (preferred), Pulumi, or CloudFormation
Containers & Orchestration: Docker, Kubernetes, Helm, service mesh (Istio/Linkerd)
AI/ML Platform
MLOps Tools: Kubeflow, MLflow, Airflow, DVC, Weights & Biases
Model Serving: Triton Inference Server, TensorFlow Serving, KServe, Seldon Core, BentoML
GPU Management: NVIDIA GPU Operator, CUDA, multi-GPU training orchestration
Vector Databases: Pinecone, Milvus, Weaviate, Qdrant, pgvector
Feature Stores: Feast, Tecton, or similar
LLM Platforms: OpenAI API, Anthropic, HuggingFace, LangChain, LlamaIndex
CI/CD & Observability
CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD, Spinnaker
Observability: Prometheus, Grafana, Datadog, ELK/OpenSearch, Jaeger, PagerDuty
ML Monitoring: Evidently AI, Arize, WhyLabs, or custom drift detection solutions
Security
Security Tools: Vault, Trivy, Snyk, SonarQube, Falco, OPA/Gatekeeper, AWS Security Hub
AI Security: Guardrails, prompt injection protection, model security scanning
Programming
Scripting/Programming: Python (required), Go, Bash
Familiarity with ML frameworks: PyTorch, TensorFlow (operational knowledge)
Knowledge Areas
Distributed systems design and microservices architecture
AI/ML system design and production ML best practices
Security frameworks and compliance standards
Incident management and on-call best practices
Cost optimization and FinOps principles (including GPU cost optimization)
Preferred Qualifications
Experience with multi-cloud or hybrid AI infrastructure
Hands-on experience with LLM fine-tuning and deployment at scale
Experience with real-time ML inference and low-latency systems
Contributions to open-source projects in the SRE/DevSecOps/MLOps space
Certifications: AWS Solutions Architect/Security, CKA/CKS, AWS ML Specialty, GCP ML Engineer
Experience with distributed training (Horovod, DeepSpeed, Ray)
Familiarity with edge AI deployment and model optimization (quantization, pruning)
Experience with responsible AI practices and bias detection/mitigation
Mandatory skills:
Cloud & AI/ML platforms, SAST, DAST, SCA, container scanning, secrets management, SOC2, ISO 27001, GDPR, HIPAA, PCI-DSS as applicable, Design and manage cloud infrastructure (AWS/GCP/Azure) using IaC (Terraform, Pulumi, CloudFormation), Kubernetes, EKS/GKE/AKS, Python
Desired skills:
Cloud & AI/ML platforms, SAST, DAST, SCA, container scanning, secrets management, SOC2, ISO 27001, GDPR, HIPAA, PCI-DSS as applicable, Design and manage cloud infrastructure (AWS/GCP/Azure) using IaC (Terraform, Pulumi, CloudFormation), Kubernetes, EKS/GKE/AKS, Python
Domain (Industry): Banking