Member of Technical Staff - Site Reliability Engineer - Onsite - San Jose CA

Min Experience

3 years

Location

San Jose, CA

Job Type

Full-time

About the job

Overview

Company Overview:

At Nile, we envision an enterprise network that inherently defends against cyber threats, eliminates lateral attack vectors like ransomware, and operates free of complexity. Our goal is to deliver Campus Network-as-a-Service (NaaS) that makes network operations virtually invisible to our customers by pushing the boundaries of autonomy. Imagine a network that continuously monitors, optimizes, and upgrades itself—all without the need for human intervention. Our audacious journey began in 2018 when we brought together a team of industry veterans and visionaries in networking, cybersecurity, cloud software, and AI to disrupt a $100 billion enterprise networking market, starting with the wired and wireless LAN. Today, our Nile Access Service is redefining connectivity as a service for organizations worldwide, from cutting-edge technology companies to leading healthcare and financial institutions, and beyond.

Where do we go from here? Well, that’s where you come in. We are expanding in all areas, bringing in some of the brightest talent to further shape Nile’s future, prepare for growth, and tackle tough tasks to ensure our momentum never slows.

About The Role

We are looking for a hands-on Site Reliability Engineer (SRE) to join our Cloud Operations team. As part of a highly experienced SRE team, you will be responsible for the reliability, scalability, and performance of our Kubernetes-based infrastructure that supports a large-scale, multi-tenant microservices architecture deployed in both AWS and GCP. You will work closely with developers, platform engineers, and other stakeholders to ensure smooth deployments, stable systems, and rapid incident response.

Key Responsibilities

Operate and maintain Kubernetes clusters in production environments, including upgrades, node group management, and day-to-day troubleshooting.
Manage GitOps-based continuous delivery pipelines using ArgoCD and Helm.
Design and implement monitoring, alerting, and observability systems with Grafana, VictoriaMetrics, and OpenTelemetry.
Collaborate on the deployment and maintenance of infrastructure using Terraform (IaC).
Support and enhance service mesh capabilities with Istio (including configuration of traffic policies, mTLS, and observability integration).
Administer and tune distributed systems including Kafka, MySQL, Redis, and Druid.
Develop automation and tools using a modern programming language (e.g., Python or Go) to improve system reliability and deployment efficiency.
Participate in on-call rotations, incident response, and postmortems, contributing to our culture of continuous improvement.

Required Qualifications

3+ years of hands-on experience running Kubernetes in production, including real-world troubleshooting and upgrade processes.
Solid understanding of Kubernetes internals and concepts: nodes, node groups, control plane vs. data plane, service discovery, ingress, network policies, RBAC, and cluster security.
Proficiency with tools and technologies in our stack: ArgoCD, kubectl, Terraform, Helm, Istio, Grafana, VictoriaMetrics, Jenkins.
Working knowledge of core AWS services: EC2, S3, IAM, VPC, EKS.
Operational experience with Kafka, MySQL, Redis, and Druid in production environments.
Proficiency in at least one programming language (Python or Go); scripting alone is not sufficient.
Demonstrated ability to learn and adapt to evolving infrastructure and tooling.

Nice To Have

Experience with GitOps best practices and progressive delivery (e.g., Argo Rollouts).
Familiarity with service reliability concepts such as SLOs, error budgets, and multi-burn-rate alerting.
Exposure to GCP services and hybrid cloud operations.

What We Offer

A high-impact role on a technically strong and collaborative SRE team.
Exposure to cutting-edge tooling across cloud infrastructure, observability, and progressive delivery.
Opportunity to shape the future of our infrastructure and influence best practices.
Support for continuous learning, training, and professional growth.

Skills

kubernetes

terraform

helm

istio

grafana

victoriametrics

jenkins

aws

kafka

mysql

redis

druid

python