TECEZE
Website: teceze.com
Job details:
We are seeking a highly skilled Infrastructure Reliability & Operations Engineer with strong private cloud experience and a minimum of 5 years in infrastructure reliability, operations, or site reliability engineering. The ideal candidate will be responsible for designing, implementing, and maintaining fault-tolerant infrastructure while driving automation, observability, and reliability across mission-critical systems.
You will collaborate with DevOps, development, and security teams to ensure seamless deployments, optimize performance, and uphold the highest standards of security and compliance. This role requires a proactive mindset, technical expertise, and a passion for building resilient systems.
Key Responsibilities:
• Design and maintain highly available, scalable, and secure infrastructure
• Lead incident response, root cause analysis, and post-incident reviews
• Develop automation tools and apply Infrastructure as Code (Terraform, Ansible, CloudFormation)
• Build self-healing systems and streamline operational workflows
• Support CI/CD pipelines and containerized platforms (Docker, Kubernetes, OpenShift)
• Implement monitoring, logging, and alerting systems (Prometheus, Grafana, ELK, Datadog)
• Define and track SLIs, SLOs, and SLAs for system reliability
• Collaborate with security teams on vulnerability management and compliance
Required Skills & Qualifications:
• Strong experience in Linux/Unix system administration
• Proficiency in Python, Go, Bash, or similar programming and scripting languages
• Hands-on experience with AWS, Azure, or GCP
• Expertise in containerization & orchestration technologies
• Solid understanding of networking concepts (DNS, TCP/IP, load balancing, firewalls)
• Experience with monitoring, logging, and alerting tools
Click Apply to learn more.