Network Science
Website: networkscience.ai
Job details:
Location: Onsite – Mumbai
Reports to: Head of Infrastructure / CTO
You must be available to join this role immediately
Network Science is a global AI innovation platform powering enterprise AI transformation through a metric-backward approach. With 150+ AI projects delivered and a curated ecosystem of 70+ deep-tech startups, we help enterprises move from AI ambition to measurable business outcomes. As we scale our core technology platform, we are looking for a Cloud Engineer who thrives in complex, high-availability environments and takes pride in keeping systems reliable, secure, and performant.
What You Will Own
Cloud Infrastructure & Operations:
• Own, manage, and optimize AWS cloud infrastructure across development, staging, and production environments.
• Monitor system health, respond to incidents, and drive root cause analysis to prevent recurrence.
• Ensure high availability, fault tolerance, and disaster recovery across all cloud-hosted services.
• Make infrastructure decisions that balance cost, performance, and reliability.
AI & Platform Support:
• Support backend and AI/ML workloads running on AWS — including inference endpoints, data pipelines, and model-serving infrastructure.
• Collaborate with engineers to understand customer requirements and design tailored cloud solutions.
• Build and maintain infrastructure for training and inference clusters, working with large-scale models such as LLMs.
Security & Compliance:
• Implement and enforce AWS security best practices — IAM policies, VPC design, encryption, and access controls.
• Conduct regular audits and vulnerability assessments, and ensure compliance with enterprise security standards.
• Apply the principle of least privilege across all cloud services and environments.
Automation & DevOps:
• Automate infrastructure provisioning and configuration using IaC tools (Terraform, CloudFormation, or CDK).
• Build and maintain CI/CD pipelines to streamline deployment and reduce manual intervention.
• Develop runbooks, alerting rules, and self-healing mechanisms to minimize operational toil.
Collaboration & Ownership:
• Work closely with product, backend, AI/ML, and DevOps teams — no ticket-passing culture.
• Translate infrastructure requirements into clear technical designs and implementation plans.
• Take responsibility for systems in production — build it, ship it, own it.
• Share knowledge with peers, help debug cross-functional issues, and improve team workflows.
• You don't wait for instructions when something is broken — you investigate, communicate, and fix.
What We Expect You to Be Good At
Core Skills (Non-Negotiable):
• 3–4 years of hands-on experience with AWS, including:
  • EC2, ECS/EKS, Lambda, S3, RDS, CloudFront
  • VPC, IAM, Route 53, CloudWatch, AWS Config
  • Cost management, reserved instances, and resource optimization
• Strong understanding of networking fundamentals — DNS, load balancing, firewalls, and CDN.
• Experience with Linux systems administration and shell scripting.
• Proficiency in at least one scripting/automation language (Python, Bash, or similar).
• Comfortable working with Git, code reviews, and collaborative engineering workflows.
Cloud & AI Application Focus:
• Experience supporting AI/ML workloads on AWS (SageMaker, Bedrock, or equivalent).
• Understanding of how AI models are deployed and served at scale — latency, throughput, and fallback strategies.
• Ability to design and maintain infrastructure that supports high-throughput, low-latency AI-powered services.
Engineering Mindset:
• You think in systems, not just tickets.
• You ask, "Will this hold under load?" before you ask, "Is it running?"
• You care about reliability, observability, and maintainability as much as resolution speed.
Must-Have Qualifications:
• 3–4 years of experience working as a Cloud or Infrastructure Engineer with AWS as the primary cloud platform.
• AWS certification preferred (Solutions Architect Associate or above).
• Experience working in fast-paced, high-growth environments.
• High empathy with high performance — you care about quality AND outcomes.
• Deep ownership mindset: you love fixing problems before they are noticed.
• Comfortable collaborating with AI/ML and backend teams, understanding their constraints, and translating them into reliable infrastructure.
Good to Have (Signals of Maturity):
• Experience with multi-cloud environments (GCP or Azure alongside AWS).
• Familiarity with Kubernetes (EKS) and container orchestration at scale.
• Exposure to MLOps concepts — model versioning, A/B testing, canary releases for AI features.
• Experience with logging, monitoring, and alerting stacks (ELK, Prometheus, Grafana, Datadog).
• Experience working on multi-tenant platforms or enterprise SaaS products.
How We Measure Success:
• Cloud infrastructure is reliable, observable, and scales without constant firefighting.
• Incidents decrease over time rather than increase, especially around AI-integrated workloads.
• Deployments are automated, repeatable, and trusted by engineering teams.
• Other teams rely on the infrastructure you own as a stable foundation to build on.
What You Won't Find Here:
• Micromanagement disguised as process
• Endless meetings without decisions
• "Just patch it" thinking that creates a long-term mess
We value clarity, accountability, and strong engineering judgment.
What We Offer:
• Opportunity to work on impactful, real-world AI products and platforms.
• High ownership and autonomy.
• Fast learning and growth environment, working closely with AI/ML experts.
• Competitive compensation based on experience.