Adaptive Problem Curation for Scalable RL Post-Training
A scalable, fully automated framework that trains a neural curator to adaptively select training problems for RL post-training of LLMs — directly optimizing for expected policy improvement via a principled non-stationary bandit formulation with regret guarantees.
Problem selection formulated as a non-stationary stochastic bandit, trained directly on expected policy improvement rather than heuristic difficulty signals.
A neural curator generalizes via function approximation over large, heterogeneous problem banks, with no human annotations or manual dataset structuring required.
Agnostic to the RL algorithm (GRPO, GSPO, and others), combining seamlessly with any actor update rule and problem bank.
ACTOR-CURATOR jointly trains an actor and a neural curator in an online, on-policy loop. The curator learns to select training problems that maximize expected policy improvement, while the actor trains on those curated batches — creating a co-adaptive cycle that accelerates learning and improves final performance.
A candidate batch is drawn from the problem bank via a proposal distribution, keeping the selection pool tractable across large and heterogeneous datasets.
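This sampling step can be sketched as follows. The uniform proposal and the `propose_candidates` name are illustrative assumptions, not the paper's API; any proposal distribution over the bank can be passed in:

```python
import numpy as np

def propose_candidates(bank_size, batch_size, rng, proposal_probs=None):
    """Draw a tractable candidate batch of problem indices from a
    (possibly very large) problem bank.

    proposal_probs: optional proposal distribution over the bank;
    defaults to uniform sampling without replacement.
    """
    return rng.choice(bank_size, size=batch_size, replace=False,
                      p=proposal_probs)

rng = np.random.default_rng(0)
candidates = propose_candidates(bank_size=1_000_000, batch_size=256, rng=rng)
```

Sampling without replacement keeps the candidate pool to a fixed, tractable size regardless of how large the bank grows.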
The neural curator assigns selection probabilities proportional to expected policy-improvement utility for each candidate problem.
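A minimal sketch of turning per-candidate utility scores into selection probabilities, assuming a temperature-scaled softmax over the curator's outputs (the curator network itself and this parameterization are illustrative assumptions):

```python
import numpy as np

def curator_selection_probs(utilities, temperature=1.0):
    """Map per-candidate utility scores (e.g. outputs of a small curator
    network) to selection probabilities via a temperature-scaled softmax."""
    z = np.asarray(utilities, dtype=np.float64) / temperature
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Higher-utility candidates receive proportionally higher probability.
probs = curator_selection_probs([0.1, 2.3, -0.5, 1.1])
```

A lower temperature concentrates selection on the highest-utility problems; a higher one keeps exploration broad.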
The actor rolls out solutions on the curated problem set, receives rewards, and updates via any RL algorithm such as GRPO or GSPO.
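For the GRPO case, the group-relative advantage standardizes each rollout's reward against the other rollouts for the same problem; a minimal sketch (the epsilon and names are illustrative):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used by GRPO: each rollout's reward
    is standardized against the mean and std of the rollouts drawn for
    the same problem, so no learned value baseline is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Two correct and two incorrect rollouts for one problem (0/1 rewards).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

These advantages then weight the policy-gradient update for the corresponding rollouts.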
Observed performance improvements feed a bandit loss derived from online stochastic mirror descent, providing regret guarantees under partial feedback.
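One concrete instance of online stochastic mirror descent under bandit feedback is an exponentiated-gradient (EXP3-style) update with an importance-weighted gain estimate; the sketch below illustrates that mechanism and is not the paper's exact loss:

```python
import numpy as np

def mirror_descent_update(probs, chosen, observed_gain, lr=0.1):
    """One exponentiated-gradient step (mirror descent with the entropic
    mirror map) under partial feedback: only the chosen problem's
    policy-improvement gain is observed, so it is importance-weighted
    by 1 / probs[chosen] to form an unbiased gain estimate."""
    p = np.asarray(probs, dtype=np.float64)
    est = np.zeros_like(p)
    est[chosen] = observed_gain / p[chosen]  # unbiased estimate of gains
    w = p * np.exp(lr * est)                 # multiplicative-weights step
    return w / w.sum()

# Start uniform over 4 problems; problem 2 yields a positive improvement.
p = np.full(4, 0.25)
p = mirror_descent_update(p, chosen=2, observed_gain=0.8)
```

Because only the selected problem's outcome is observed, the importance weight keeps the gain estimate unbiased, which is what underlies the regret guarantee for this family of updates.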
Beats Uniform sampling, SEC, and PCL on most benchmarks, including Countdown (+5.52%), Zebra (+4.50%), ARC-1D (+30.51%), and AIME24 (+28.57%). The principled policy-improvement signal gives the curator an edge that heuristic baselines cannot match.
+58.97% on ARC-Hard and +12.62% on Countdown-Hard over the strongest baseline. The curator thrives where problem difficulty is heterogeneous — it learns to avoid problems that are currently too easy or too hard and focuses training where it matters most.
On Zebra and ARC benchmarks, ACTOR-CURATOR reaches equivalent performance checkpoints up to 80% faster than uniform sampling. By concentrating training on high-utility problems, the curator eliminates wasted iterations on already-solved or intractable problems.
Unlike baselines that use mean-advantage or difficulty proxies (SEC, PCL), the principled bandit objective directly adapts to the current actor, avoiding stale or misleading signals as training dynamics evolve.
The learned curator generalizes to unseen problems without per-problem statistics, scaling to large and evolving problem banks. This makes ACTOR-CURATOR practical in real post-training pipelines where new problems are continually added.
Curriculum selection stabilizes training by steering away from problems that are currently too hard or too easy, reducing variance in actor updates. The curator itself adds only ~9% training-time overhead — a small cost for large performance gains.
ACTOR-CURATOR consistently outperforms uniform sampling and strong learning-based baselines (SEC, PCL) across all challenging reasoning benchmarks, demonstrating improved training stability, efficiency, and final accuracy.
AIME24 (accuracy ↑)
Best baseline: 23.33 → ACTOR-CURATOR: 30.00. Trained with GRPO on the same base model.
+28.57% relative gain over the strongest baseline (PCL/Uniform).
ARC-1D (accuracy ↑)
Best baseline: 27.87 → ACTOR-CURATOR: 36.37. Largest absolute improvement across all benchmarks.
+30.51% relative gain — ARC-Hard shows an even larger +58.97% gain.
Countdown (accuracy ↑)
Best baseline: 58.87 → ACTOR-CURATOR: 62.12 on standard; 51.50 → 58.00 on Countdown-Hard.
+12.62% relative gain on the harder variant, consistent across difficulty levels.