Operations, Anomaly Detection & SRE Intern

CloudNuro.AI

full-time

Required skills

Python
code review
data science
GitHub
Pandas
Rust
Shadow
spectral
SQL
SRE
ticketing
TypeScript

About the role

CloudNuro.AI

Website: cloudnuro.ai
Job details:

Program structure

Track: Site reliability / data science

Reports to: Senior SRE, EOS Operations team (dotted line to ML engineering lead)

Duration: 16–20 weeks, full-time preferred

Primary languages: Python (pandas, scikit-learn, PyOD), SQL, Bash; some Go for tooling

Outcome: A tiered anomaly-detection stack running in production with measured precision/recall and a documented reduction in alert noise

Compensation: stipend per internal scale; conversion to full-time considered for strong performers.

Mentorship: each intern is paired with a senior engineer or researcher who is the technical owner of the area.

How to apply: Send the following:

• Resume / CV (PDF).

• A link to a GitHub profile, portfolio, or representative project.

• The role number(s) you are applying for. You can apply for up to two.

• The application-prompt response for the role you are most interested in (300–500 words).

Applications without the prompt response will be deprioritized; it is the single most useful signal we have.

About the role

A self-optimizing system needs detectors that catch its rare failures fast and quietly handle the routine ones. We run a tiered alerting hierarchy with four levels — sub-millisecond hard caps at the gateway, second-scale spectral and change-point detectors on telemetry streams, five-minute Isolation Forests on aggregated features, and overnight autoencoder + graph-neural-network detectors for slow-burn quality regressions. Each tier has a different latency budget, a different statistical method, and a different action — block, page, ticket, or trend report.

This role sits at the intersection of SRE and applied data science. You will own at least one tier end-to-end: design the detector, train it on historical telemetry, tune it against real false-positive and false-negative costs, deploy it, and then live with it on call. You will work closely with the platform systems intern (your detector probably lives in their gateway or alerting pipeline) and with the stochastic modeling intern (their fitted distributions inform your detector's null hypothesis).

What you will work on

Core deliverables

• Own one of the four tiers (T0 hard caps, T1 spectral / change-point, T2 Isolation Forest, T3 autoencoder / GNN). Build, tune, deploy, document.

• Construct a labeled evaluation set from historical incidents; estimate precision and recall with confidence intervals; tune the operating threshold against business-cost-weighted error.

• Implement a feedback loop — operator-marked false positives and missed detections retrain the model on a documented schedule.

• Build the dashboard that surfaces detector health, alert volumes per tier, time-to-acknowledge, time-to-resolve.

• Write a post-incident review for one anomaly the detector caught, and one it missed.
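
As a rough illustration of the evaluation deliverable above, the sketch below estimates precision and recall with Wilson score intervals and picks the threshold that minimizes a cost-weighted error count. The toy scores, the labels, and the cost_fp / cost_fn values are made up for illustration, and the helper names are hypothetical; a real evaluation would run on the labeled incident set.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def confusion(scores, labels, threshold):
    """Count true positives, false positives, and false negatives at a threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return the observed score that minimizes cost_fp * FP + cost_fn * FN."""
    def cost(t):
        _, fp, fn = confusion(scores, labels, t)
        return cost_fp * fp + cost_fn * fn
    return min(sorted(set(scores)), key=cost)

# Toy data (assumption): anomaly scores and ground-truth incident labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

# Assumed costs: a missed incident is ten times worse than a false page.
t = best_threshold(scores, labels, cost_fp=1.0, cost_fn=10.0)
tp, fp, fn = confusion(scores, labels, t)
precision_ci = wilson_interval(tp, tp + fp)
recall_ci = wilson_interval(tp, tp + fn)
print(f"threshold={t} precision CI={precision_ci} recall CI={recall_ci}")
```

With only a handful of labeled incidents the intervals come out wide, which is the reason to report them alongside the point estimates.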

Stretch goals

• Add a fifth tier or improve an existing one with a new method (graph-based agent loop detection, contrastive anomaly detection on embeddings).

• Build a synthetic-incident harness: parameterized anomalies (retry leak, runaway agent, distribution drift) injected into the simulator to test detector coverage.

• Contribute to the runbook automation: when an anomaly fires, what is the diagnostic playbook the on-call engineer follows?
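
To make the synthetic-incident harness idea concrete, here is a minimal sketch under stated assumptions: the baseline is plain Gaussian noise rather than output from the actual simulator, and the three injector functions are hypothetical simplifications of the anomaly types named above (a linear ramp for a retry leak, a short multiplicative burst for a runaway agent, a level shift for distribution drift).

```python
import random

def baseline(n, mean=100.0, sd=5.0, seed=0):
    """Stand-in for simulator output: n points of Gaussian noise (assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]

def inject_retry_leak(series, start, growth_per_step):
    """Add a slow linear ramp from `start` onward."""
    return [x + max(0, i - start) * growth_per_step for i, x in enumerate(series)]

def inject_runaway_agent(series, start, duration, multiplier):
    """Multiply a short window of points to mimic a burst of traffic."""
    return [x * multiplier if start <= i < start + duration else x
            for i, x in enumerate(series)]

def inject_drift(series, start, shift):
    """Shift the level of the series from `start` onward."""
    return [x + shift if i >= start else x for i, x in enumerate(series)]

def labels_for(n, start, duration=None):
    """Ground-truth labels: 1 inside the injected window, 0 elsewhere."""
    end = n if duration is None else min(n, start + duration)
    return [1 if start <= i < end else 0 for i in range(n)]

base = baseline(200)
burst = inject_runaway_agent(base, start=120, duration=10, multiplier=3.0)
leak = inject_retry_leak(base, start=50, growth_per_step=0.5)
drift = inject_drift(base, start=150, shift=20.0)
burst_labels = labels_for(len(base), start=120, duration=10)
```

Each injector returns a new series and leaves the baseline untouched, so the same baseline can be reused across scenarios and a detector's output can be scored against the matching labels.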

What we are looking for

Required

• Strong Python with pandas and scikit-learn; comfortable shaping multi-million-row time-series data into model features.

• Practical experience with at least one anomaly detection method — Isolation Forest, autoencoder, change-point, robust z-score — applied to real-ish data.

• Have been on call, or run a service that paged you, or otherwise lived with the consequences of false positives and false negatives. Have an opinion about alert fatigue.

• Comfortable with SQL and at least one time-series storage system (Prometheus, InfluxDB, TimescaleDB, ClickHouse).
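
For reference, the simplest method in the list above, the robust z-score, can be sketched in a few lines using the median and the median absolute deviation (MAD). The 0.6745 scaling constant and the 3.5 cutoff follow a common convention and are assumptions here, not values taken from our detectors; the sample data is invented.

```python
import statistics

def robust_z_scores(values):
    """Modified z-scores based on the median and MAD instead of mean and std."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the points equal the median; the score is undefined,
        # so this sketch reports no deviation rather than dividing by zero.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

def flag_outliers(values, threshold=3.5):
    """Indices whose absolute robust z-score exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]

# Invented per-minute error counts with one obvious spike.
errors_per_minute = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
flagged = flag_outliers(errors_per_minute)
```

Because the median and MAD barely move when a single extreme value appears, the spike does not inflate its own baseline the way it would with a mean and standard deviation.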

Nice to have

• Familiarity with the PyOD library, with the Adams & MacKay Bayesian change-point paper, or with the Microsoft spectral-residual algorithm.

• Experience contributing to an open-source observability or anomaly detection tool.

• Have read at least parts of the Google SRE Book and have specific opinions about service-level objectives versus service-level indicators.

• Statistical thinking — you instinctively reach for confidence intervals when someone shows you a percentage.

Success criteria

• By week 4: your tier deployed in shadow mode against historical telemetry, with a documented precision/recall curve.

• By week 10: the tier active in production, paging or ticketing on real signal, with a documented reduction in time-to-detect.

• By the end of the internship: at least one production incident attributed to your detector (caught early), and a quarterly trend report shared with the broader team.

Application prompt for this role

In 300 to 500 words, describe a system you would build to detect that something has changed in a stream of data — not a textbook spec, but a specific real or imagined stream (login rates, electricity demand, error rates in a service you use, anything concrete). Explain what "changed" means precisely for that stream, what method you would use, what threshold you would set, and what you would do when it fires. Honesty about what you do not know is a positive signal.

===================================================

Internship Opportunities — Economic Operating System for AI Infrastructure

We are building an Economic Operating System (EOS) for enterprise AI infrastructure: a closed-loop control plane that meters every token, forecasts demand probabilistically, routes requests under multi-objective constraints, learns from outcomes, and rebalances provider portfolios. The platform composes mathematics from stochastic optimization, queueing theory, control engineering, reinforcement learning, mechanism design, and portfolio theory into a coherent runtime that operates across five timescales, from millisecond routing decisions to quarterly provider negotiations.

We are hiring interns to work alongside the core team on distinct, well-scoped streams of this project. Each role contributes to a specific layer of the system; each intern will own a measurable outcome by the end of the internship; and each role is designed so that the work products (code, simulations, papers, or operational dashboards) are portfolio pieces that demonstrate serious technical depth.

Selection process

· Initial screen — resume + a short written response (300–500 words) to a prompt specific to the role.

· Technical interview — one 60-minute session focused on the role's primary domain (math, code review, or systems design, depending on role).

· Take-home — a small, scoped exercise (4–6 hours) using realistic data from our synthetic generator.

· Final conversation — meet the mentor, discuss scope, align on deliverables and timeline.

Shared expectations across all roles

· Write and review code in Python (primary) or one of Go, Rust, TypeScript, depending on role.

· Work in the open — pull requests with descriptive messages, design docs before non-trivial work, weekly written updates.

· Read papers critically and translate them into running code; bench specific claims against our synthetic and production traces.

· Communicate clearly in writing and in standups; intellectual honesty about what works and what does not.

· Care about correctness, reproducibility, and the user-facing impact of the platform on real bills and real latencies.

A note on what we mean by "intern"

We are looking for people who are still in formal education or within 12 months of graduation. We hire interns who are undergraduate students with strong technical foundations, master's students with deeper specialization, and doctoral students who want a focused industrial experience. The role descriptions are written to scale: an undergraduate intern on Role 1 might focus on the core deliverables and one stretch goal; a doctoral intern might own a stretch goal as their primary track and publish a paper from it. We will calibrate scope to your background during the final conversation.

A final note on culture

This project sits at an unusual intersection (applied probability, distributed systems, reinforcement learning, and economics), and we hire people who are genuinely curious about all of it, not just the part of it that touches their primary specialization. The four roles depend on each other. The simulator is the RL intern's training environment. The detector is the systems intern's last line of defense. The trained policy is the operations intern's hardest-to-debug failure mode. We expect interns to talk to each other early and often, to read each other's design docs, and to push back on each other's assumptions. Pleasant disagreement, in our experience, is the second-most reliable predictor of good outcomes after technical depth itself.
