Reinforcement Learning & Optimization Intern

CloudNuro.AI

full-time

Required skills

Python
code review
GitHub
Rust
TypeScript
PyTorch

About the role

Website: cloudnuro.ai
Job details:

Program structure

Track: Research engineering

Reports to: Staff research engineer, EOS Intelligence Plane team

Duration: 20–24 weeks, full-time preferred

Primary languages: Python (PyTorch or JAX), familiarity with Stable Baselines / CleanRL / TorchRL

Outcome: A trained, sim-validated routing policy that demonstrably improves utility-per-dollar over the production baseline

Compensation: stipend per internal scale; conversion to full-time considered for strong performers.

Mentorship: each intern is paired with a senior engineer or researcher who is the technical owner of the area.

How to apply: Send the following:

• Resume / CV (PDF).

• A link to a GitHub profile, portfolio, or representative project.

• The role number(s) you are applying for. You can apply for up to two.

• The application-prompt response for the role you are most interested in (300–500 words).

Applications without the prompt response will be deprioritized; it is the single most useful signal we have.

About the role

The intelligence plane is where the platform's decisions actually get made. Today it runs a mix of multi-armed bandits and a primal-dual budget pacer. We are moving toward a learned policy: a PPO-trained router that takes the full request context, the agent's reasoning state, the current λ_dual, and the queue depth, and outputs an action that maximizes expected utility per dollar subject to budget and quality constraints. The transition is technically delicate. Reward shaping that does not preserve the optimal policy. Reward hacking that makes the metric go up while the actual system gets worse. A sim-to-real gap that produces policies that look brilliant in the simulator and dangerous in production. This role is about navigating those concerns carefully.

You will train, evaluate, debug, and progressively roll out reinforcement learning policies on the platform. You will work closely with the stochastic modeling intern (their generator is your training environment) and with the platform systems intern (their gateway is what executes your policies).

What you will work on

Core deliverables

• Train a PPO routing policy on the synthetic environment; benchmark against ε-greedy, UCB, Thompson sampling, and the existing primal-dual pacer baseline.

• Design the reward function carefully — define utility, penalize budget overruns and quality floor violations, document the trade-offs in the reward shaping.

• Build a sim-to-real validation suite: domain randomization over distribution parameters, adversarial workload generation, replay of historical traces.

• Propose and execute a shadow-mode deployment: run the learned policy alongside the baseline, log decisions, compare without affecting real traffic.

• Write the technical report: training methodology, reward design, regret curves, ablations, and a frank discussion of what could go wrong in production.
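As a rough illustration of the reward-design deliverable above, here is a minimal sketch of a per-request reward that starts from utility and subtracts penalties for budget overruns and quality-floor violations. Every name, weight, and threshold in it (RewardConfig, routing_reward, the penalty values, the 0.7 floor) is a hypothetical placeholder, not the platform's actual reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Hypothetical weights and thresholds, chosen only for illustration.
    budget_penalty: float = 2.0   # penalty per dollar spent beyond the remaining budget
    quality_floor: float = 0.7    # assumed minimum acceptable quality score in [0, 1]
    quality_penalty: float = 5.0  # penalty per unit of quality below the floor


def routing_reward(utility, cost, budget_remaining, quality, cfg=RewardConfig()):
    """Per-request reward: utility minus penalties for constraint violations."""
    # Dollars spent beyond what the budget still allows (zero if within budget).
    overrun = max(0.0, cost - budget_remaining)
    # Distance below the quality floor (zero if the floor is met).
    shortfall = max(0.0, cfg.quality_floor - quality)
    return utility - cfg.budget_penalty * overrun - cfg.quality_penalty * shortfall


# A request within budget and above the quality floor keeps its full utility.
print(routing_reward(utility=1.0, cost=0.5, budget_remaining=1.0, quality=0.9))
```

Fixed penalty weights are the simplest possible choice and an easy target for reward hacking; one trade-off worth documenting is whether the budget penalty should instead track a learned dual variable such as λ_dual.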

Stretch goals

• Bandits-with-knapsack baseline for the budget-aware setting; compare regret bounds against the PPO policy empirically.

• Offline RL on logged production traces (CQL, IQL, or behavior-regularized variants); compare to on-policy PPO trained from the simulator.

• Contextual LinUCB as an alternative for cases where a full MDP is overkill; characterize regimes where each algorithm wins.

What we are looking for

Required

• Understand RL fundamentals — MDPs, value iteration, policy gradients, importance sampling, the bias-variance trade-off in advantage estimation.

• Have implemented at least one of: REINFORCE, A2C, PPO, DQN — from scratch or by modifying a reference implementation. Have debugged a learning curve that did not move.

• Strong Python; comfortable with PyTorch or JAX, vectorized environments, and the surprising-but-real bugs of RL implementations (epsilon-clipping, advantage normalization, log-prob arithmetic).

• Have read the PPO paper (Schulman et al. 2017) and at least one follow-up critique (e.g., Engstrom et al. 2020 on implementation matters).

Nice to have

• Course or research experience in constrained MDPs, safe RL, or offline RL.

• Familiarity with multi-armed bandit theory beyond the standard textbook chapter — gap-dependent bounds, BAI, BwK.

• Have read Sutton & Barto cover to cover, or at least the parts you needed.

• Healthy skepticism about RL benchmarks: you have opinions about when a learning curve actually shows learning versus a measurement artifact.

Success criteria

• By week 6: baseline RL implementation reproduces published regret numbers on a standard contextual bandit benchmark.

• By week 14: PPO policy trained on the synthetic environment beats the primal-dual baseline on utility-per-dollar with statistical significance.

• By the end of the internship: shadow-mode deployment running on a slice of production traffic; written report and a 30-minute technical talk to the team.
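One way the week-14 criterion ("with statistical significance") could be checked is a paired bootstrap over evaluation episodes, sketched below. It assumes both policies are scored on the same episodes or seeds; the function name, the resample count, and the sample scores are illustrative assumptions, not the team's evaluation protocol.

```python
import random
from statistics import mean


def paired_bootstrap_ci(policy_scores, baseline_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (policy - baseline)."""
    if len(policy_scores) != len(baseline_scores) or not policy_scores:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing by episode removes variance that both policies share.
    diffs = [p - b for p, b in zip(policy_scores, baseline_scores)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Made-up utility-per-dollar scores on six shared evaluation episodes.
ppo = [1.20, 1.30, 1.10, 1.40, 1.25, 1.35]
pacer = [1.00, 1.05, 1.02, 1.10, 1.00, 1.08]
point, lower, upper = paired_bootstrap_ci(ppo, pacer)
print(f"mean improvement {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the lower bound of the interval stays above zero, the improvement is unlikely to be resampling noise; with only a handful of episodes the interval is too coarse to rely on.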

Application prompt for this role

In 300 to 500 words, describe one situation in which a reinforcement learning agent's reward function led to undesirable behavior (yours, a published result, or a textbook example), and explain how you would have detected the problem earlier and what change to the reward, environment, or evaluation methodology you would propose. Be specific; "add more reward" is not an answer.

===================================================

Internship Opportunities — Economic Operating System for AI Infrastructure

We are building an Economic Operating System (EOS) for enterprise AI infrastructure: a closed-loop control plane that meters every token, forecasts demand probabilistically, routes requests under multi-objective constraints, learns from outcomes, and rebalances provider portfolios. The platform composes mathematics from stochastic optimization, queueing theory, control engineering, reinforcement learning, mechanism design, and portfolio theory into a coherent runtime that operates across five timescales from millisecond routing decisions to quarterly provider negotiations.

We are hiring interns to work alongside the core team on distinct, well-scoped streams of this project. Each role contributes to a specific layer of the system; each intern will own a measurable outcome by the end of the internship; and each role is designed so that the work products (code, simulations, papers, or operational dashboards) are portfolio pieces that demonstrate serious technical depth.

Selection process

· Initial screen — resume + a short written response (300–500 words) to a prompt specific to the role.

· Technical interview — one 60-minute session focused on the role's primary domain (math, code review, or systems design, depending on role).

· Take-home — a small, scoped exercise (4–6 hours) using realistic data from our synthetic generator.

· Final conversation — meet the mentor, discuss scope, align on deliverables and timeline.

Shared expectations across all roles

· Write and review code in Python (primary) or one of Go, Rust, TypeScript, depending on role.

· Work in the open — pull requests with descriptive messages, design docs before non-trivial work, weekly written updates.

· Read papers critically and translate them into running code; benchmark specific claims against our synthetic and production traces.

· Communicate clearly in writing and in standups; intellectual honesty about what works and what does not.

· Care about correctness, reproducibility, and the user-facing impact of the platform on real bills and real latencies.

A note on what we mean by "intern"

We are looking for people who are still in formal education or within 12 months of graduation. We hire interns who are undergraduate students with strong technical foundations, master's students with deeper specialization, and doctoral students who want a focused industrial experience. The role descriptions are written to scale: an undergraduate intern on Role 1 might focus on the core deliverables and one stretch goal; a doctoral intern might own a stretch goal as their primary track and publish a paper from it. We will calibrate scope to your background during the final conversation.

A final note on culture

This project sits at an unusual intersection of applied probability, distributed systems, reinforcement learning, and economics, and we hire people who are genuinely curious about all of it, not just the part of it that touches their primary specialization. The four roles depend on each other. The simulator is the RL intern's training environment. The detector is the systems intern's last line of defense. The trained policy is the operations intern's hardest-to-debug failure mode. We expect interns to talk to each other early and often, to read each other's design docs, and to push back on each other's assumptions. Pleasant disagreement, in our experience, is the second-most reliable predictor of good outcomes after technical depth itself.
