Happiest Minds Technologies
Website: happiestminds.com
Job details:
Position: Cybersecurity Engineer - Cloud Disaster Recovery
Job Summary
The Cloud Disaster Recovery Engineer is responsible for designing, implementing, and maintaining Disaster Recovery (DR) solutions to ensure the organization's technology infrastructure and critical systems can withstand and recover from disruptions. This role involves hands-on work with high availability (HA) architectures, disaster recovery strategies, automation, and failover solutions in on-premises, cloud, and hybrid environments.
The ideal candidate will have expertise in IT resilience, infrastructure engineering, and cloud-based recovery solutions, working closely with IT operations, cybersecurity, and business continuity teams to enhance the organization's overall technology resilience posture.
Key Responsibilities
Technology Resilience & Disaster Recovery Engineering
- Design, implement, and maintain highly available (HA) and fault-tolerant architectures across cloud (AWS, Azure, GCP) and on-premises environments.
- Develop and maintain disaster recovery (DR) solutions, ensuring that IT systems meet defined recovery time objectives (RTO) and recovery point objectives (RPO).
- Implement automation and orchestration for disaster recovery and failover processes using Infrastructure as Code (IaC) and scripting tools (Terraform, Ansible, PowerShell, Python).
- Work with IT infrastructure and application teams to integrate resilience best practices into system design, deployment, and operations.
- Perform failure mode analysis (FMA) to identify and address system vulnerabilities.
Disaster Recovery Testing & Validation
- Design and implement fully automated disaster recovery runbooks in Ansible and Python that support one-click or event-triggered failover.
- Develop automated recovery verification (post-failover health checks).
- Develop and execute disaster recovery drills and failover testing, identifying gaps and improvements.
- Automate DR testing, validation, and reporting.
- Conduct regular validation of backup and replication strategies, ensuring data integrity and availability.
- Monitor system failover and recovery performance, optimizing configurations to improve response times.
Incident Response & Crisis Management Support
- Act as a technical lead during disruptions and disaster recovery events, ensuring rapid system recovery.
- Work closely with cybersecurity teams to integrate DR solutions with cyber resilience strategies, ensuring quick restoration from ransomware or cyberattacks.
- Support post-incident analysis and recommend improvements to resilience strategies.
Monitoring, Compliance & Reporting
- Implement and maintain resilience monitoring tools, ensuring continuous tracking of system availability and DR readiness.
- Ensure compliance with industry standards and regulatory requirements (e.g., ISO 27001, NIST, FFIEC, SOC 2).
- Provide technical input for audits and regulatory assessments related to technology resilience.
- Generate reports on resilience testing results, failover performance, and risk mitigation efforts.
Collaboration & Training
- Work closely with IT teams, business continuity professionals, and cloud architects to ensure resilience strategies align with business needs.
- Provide training and technical guidance to IT staff on disaster recovery best practices and system failover configurations.
- Assist in the development of technical documentation and playbooks for disaster recovery and resilience processes.
Qualifications & Experience
Required:
- Bachelor's degree in Computer Science, Information Technology, Cybersecurity, or a related field.
- 5+ years of experience in IT infrastructure, cloud engineering, disaster recovery, or resilience engineering.
- Expertise in disaster recovery planning and high-availability (HA) solution design.
- Hands-on experience with cloud resilience strategies (AWS, Azure, or GCP) and cloud-native DR tools.
- Hands-on experience automating AWS disaster recovery using:
  - Boto3 for orchestration of EC2, RDS, S3, Route 53, IAM, AWS Backup, and AWS Elastic Disaster Recovery (AWS DRS).
  - Cross-region replication and failover strategies.
  - AWS-native DR patterns (pilot light, warm standby, multi-region active/active).
  - Automation of AMI lifecycle, backup validation, and restore testing.
- Advanced automation engineering experience, including:
  - Designing and maintaining enterprise-scale Ansible automation frameworks (roles, collections, dynamic inventories, Ansible Automation Platform/AWX).
  - Developing production-grade Python automation using Boto3.
  - Building event-driven automation workflows for failover and recovery.
  - Implementing idempotent Infrastructure as Code (IaC) patterns.
  - Integrating automation into CI/CD pipelines (e.g., GitHub Actions, Jenkins).
- Experience with backup, replication, and data protection solutions (e.g., Veeam, Commvault, Zerto, Azure Site Recovery).
- Knowledge of networking, storage, virtualization, and hybrid-cloud architectures.
Preferred:
- Industry certifications such as AWS Certified Solutions Architect, Red Hat Certified Engineer, Microsoft Azure Administrator, Certified Business Continuity Professional (CBCP), or Disaster Recovery Certified Specialist (DRCS).
- Experience working in highly regulated industries (finance, healthcare, government).
- Familiarity with cyber resilience frameworks and incident response strategies.
Key Competencies
- Problem-solving mindset, with the ability to diagnose complex technical issues.
- Strong collaboration skills, able to work across IT, cybersecurity, and business teams.
- Excellent communication and documentation skills, translating technical recovery plans into actionable steps.
- Proactive and adaptable, capable of handling crisis situations and evolving technology landscapes.