Sr Software Engineer (SRE / Observability)

neurogent.ai

full-time

Required skills

Python
AWS
automation solutions
Azure
Bash
C#
cloud infrastructure
Datadog
DevOps
Docker
end-to-end
GCP
Git
Java
Jenkins
Jira
Kubernetes
Linux
Root Cause Analysis
Splunk
SQL
SRE
Terraform
version control
PowerShell

About the role

We're looking for a software engineer who wants to move beyond just writing features - someone who gets excited about distributed systems, loves digging into a gnarly incident and fixing it at the root, and wants to build the automation and observability tooling that prevents it from ever happening again.

You'll write code. You'll fix defects. You'll be shaping CI/CD pipelines, building self-healing automation, defining SLOs, and working directly with cloud infrastructure. If you've been a solid software engineer and want your next role to be a launchpad into SRE, DevOps, or cloud engineering - this is it.

What you'll do

Engineering & automation

Write, debug, and deploy code fixes in Python, C#, or Java directly in production, rather than just raising tickets.
Build automation that eliminates operational toil: self-service tooling, auto-remediation scripts, and self-healing workflows.
Contribute to and maintain CI/CD pipelines, making deployments faster, safer, and more reliable.
Use AI-assisted development tools to write smarter, faster automation and accelerate your own DevOps workflows.

Observability & reliability

Build and own the observability layer - dashboards, alerts, log queries, and distributed tracing - using tools like Dynatrace, Datadog, or similar platforms.
Reduce alert noise and improve signal quality so the team acts on what matters.
Define SLOs and SLIs for critical services and drive engineering improvements based on error budget burn.
Analyse logs, metrics, and traces to investigate performance and availability issues at the system level.

Incident ownership & continuous improvement

Lead end-to-end incident response: from detection and triage to root cause analysis and a long-term fix.
Run post-incident reviews and translate learnings into architectural or process improvements.
Identify recurring patterns and drive problem management, rather than just fixing symptoms.
Collaborate with product engineering, QA, and infrastructure teams to ship reliability improvements.

Documentation & knowledge

Write runbooks, SOPs, and post-mortems that are actually useful - concise, actionable, and maintained.
Document automation solutions and contribute to a shared engineering knowledge base.

What we're looking for

Must-haves

3+ years of experience in software development, application engineering, or a related technical role.
Strong programming skills in at least one of: Python, C#, or Java - and comfort reading code you didn't write.
Solid SQL skills for data analysis and troubleshooting production issues.
Experience with scripting for automation - Python, Bash, or PowerShell.
Familiarity with Git and standard version control workflows.
Comfortable working in Linux environments and using command-line tools.
Strong analytical mindset: you dig for root causes, not just symptom fixes.
Basic understanding of CI/CD pipelines and deployment processes.

Nice to have

SLOs / SLIs / error budgets
Dynatrace / Datadog / Splunk
AWS / Azure / GCP
Docker / Kubernetes
Terraform or IaC
GitHub Actions / Jenkins
AI dev tools (Claude Code, Cursor)
ITIL basics
Jira / ServiceDesk+

Education

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field - or equivalent hands-on engineering experience.

Work details

Location: Hybrid (Gurgaon)

Hours: 5:30 PM – 2:30 AM IST

Shift alignment: US client hours

Type: Full-Time
