Staff Software Engineer (Infrastructure)

PW (PhysicsWallah)

Location: Noida, Uttar Pradesh, India
Job type: Full-time

Required skills

Python
Open Source
AWS
Ansible
Apigee
Azure
caching
capacity planning
CDN
communication skills
compliance
configuration management
DASH
database
FFmpeg
GCP
HLS
incident response
infrastructure-as-code
JS
Kubernetes
load balancing
multi-tenant
MySQL
Node
OAuth
Postgres
SRE
Terraform
VPC
WebRTC

About the role

Website: pw.live
Job details:

Job Title: Staff Software Engineer (Infrastructure)

Location: Noida Sector 62

Job brief:

We are looking for a Staff Engineer to set the technical direction for our infrastructure organization. In this role, you will partner with engineering leadership to define multi-year infrastructure strategy, identify and drive cross-cutting initiatives, and raise the technical bar across multiple teams.

You will operate with significant autonomy on ambiguous, high-impact problems, and your influence will extend well beyond the code you personally write. You will be a force multiplier — making other engineers more effective, shaping architecture decisions across the org, and representing the engineering function in conversations with senior leadership, partner teams, and external stakeholders.

Your goal is to ensure that our infrastructure is a long-term competitive advantage: reliable, secure, scalable, and a delight for developers to build on.

Responsibilities:

Define and own the multi-year technical roadmap for infrastructure, reliability, and developer productivity, in partnership with engineering leadership and product stakeholders.
Drive cross-team and cross-org initiatives that span multiple quarters, breaking down ambiguous problems into executable workstreams for senior engineers.
Set architectural standards for large-scale distributed systems, including service design, data systems, networking, and platform abstractions.
Own the reliability posture of the organization — define SLOs, error budgets, and the operational practices (golden signals, incident response, capacity planning) that teams are expected to follow.
Identify and eliminate entire categories of toil and operational risk, not just individual instances.
Lead the technical strategy for the developer platform (what we build in-house, what we buy, and what we contribute upstream) and ensure the platform scales with the engineering org.
Set the security and compliance bar for infrastructure, partnering with security engineering on threat models, controls, and audit readiness.
Partner with application teams during high-severity incidents — able to dive into production Node.js and Go services, read code you didn't write, and drive debugging to root cause.
Mentor senior and staff-track engineers, conduct architecture reviews, and uplevel the engineering practices of teams you partner with.
Represent the engineering organization externally — technical interviews at the senior level, conference talks, open source contributions, and vendor/partner technical discussions.
Make build-vs-buy, cloud strategy, and major migration decisions, and own the technical narrative for them with executive stakeholders.

Requirements and skills:

10+ years of professional software engineering experience, with a substantial portion spent on infrastructure, platform, or SRE at scale.
Demonstrated track record of leading initiatives that span multiple teams and multiple quarters, with measurable business and engineering impact.
Expert-level proficiency in at least one systems language (Go and Python strongly preferred), and the ability to be productive in others as needed.
Strong ability to debug production applications written in Go and Node.js — comfortable with profilers, tracers, and runtime diagnostics (pprof, delve, goroutine dumps, Node.js inspector, clinic.js, heap snapshots, flame graphs), and able to reason about event loops, garbage collection, concurrency primitives, and memory leaks in both runtimes.
Deep expertise in large-scale distributed systems design — consensus, replication, sharding, caching, eventual consistency, failure modes, and the tradeoffs between them.
Expert knowledge of networking at L3, L4, and L7, including how this manifests in cloud environments (VPC design, service mesh, load balancing, ingress, mTLS).
Hands-on experience with API gateways (Kong, Envoy, AWS API Gateway, Apigee, Tyk, or similar) — including custom plugin development, rate limiting and quota strategies, auth flows (OAuth, JWT, mTLS), traffic shaping, and operating gateways at scale as a critical piece of north-south traffic infrastructure.
Deep expertise across at least one major cloud provider (AWS, GCP, or Azure), and working familiarity with the others. Has owned non-trivial cloud architecture and cost decisions at the org level.
Expert in Kubernetes — including authoring operators, managing cluster upgrades and migrations at scale, and designing multi-tenant platform abstractions on top of Kubernetes for application teams.
Strong experience designing CI/CD systems and developer workflows that scale to hundreds of engineers and thousands of services.
Deep operational expertise with observability stacks (Prometheus, Grafana, Mimir, Loki, ELK, New Relic, or equivalents) — has defined the observability strategy, not just consumed it.
Strong working knowledge of multiple database systems (MongoDB, MySQL, Postgres) including their operational characteristics, failure modes, and when to choose which.
Expert with infrastructure-as-code and configuration management (Terraform, Ansible, or similar), including how to structure these systems for hundreds of contributors.
Strong security instincts — has set security standards for infrastructure, not just followed them.
Excellent technical writing and communication skills — can write design documents that align stakeholders, and can present technical strategy to both engineers and executives.
Demonstrated ability to influence without authority — building consensus across teams that don't report into you.

Nice to have:

Experience with live streaming and video infrastructure at scale — AWS MediaLive, MediaPackage, MediaConvert, IVS, or equivalent services on other clouds (e.g., GCP Live Stream API, Azure Media Services), as well as open source equivalents.
Hands-on experience with the broader live video stack: ingest protocols (RTMP, SRT, WebRTC, RIST), packaging and delivery formats (HLS, DASH, CMAF, low-latency HLS/LL-DASH), transcoding pipelines, DRM, and CDN strategy for live workloads.
Understanding of the operational characteristics of live streaming — origin/edge scaling, multi-region failover for live events, latency budgets, concurrency spikes, and observability for QoE metrics (rebuffering, startup time, bitrate).
Open source contributions, especially to projects in the Kubernetes / CNCF ecosystem, observability tooling, API gateway / service mesh projects (Kong, Envoy, Istio), video/streaming infrastructure (e.g., FFmpeg, GStreamer, Pion, OvenMediaEngine), or major language ecosystems.
Conference talks, published technical writing, or other public technical presence.
Experience scaling an engineering organization through a major growth phase or through a major architectural transition (e.g., monolith decomposition, multi-region, multi-cloud).
