Adaptive Problem Curation for Scalable RL Post-Training
A scalable, fully automated framework that trains a neural curator to adaptively select training problems for RL post-training of LLMs — directly optimizing for expected policy improvement via a principled non-stationary bandit formulation with regret guarantees.
Problem selection formulated as a non-stationary stochastic bandit, trained directly on expected policy improvement rather than heuristic difficulty signals.
A neural curator generalizes via function approximation over large, heterogeneous problem banks, with no human annotations or manual dataset structuring required.
Agnostic to the RL algorithm (GRPO, GSPO, and others), combining seamlessly with any actor update rule and problem bank.
ACTOR-CURATOR jointly trains an actor and a neural curator in an online, on-policy loop. The curator learns to select training problems that maximize expected policy improvement, while the actor trains on those curated batches — creating a co-adaptive cycle that accelerates learning and improves final performance.
A candidate batch is drawn from the problem bank via a proposal distribution, keeping the selection pool tractable across large and heterogeneous datasets.
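This sampling step can be sketched as follows. The uniform proposal and the `propose_candidates` name are illustrative assumptions, not the paper's API; any proposal distribution over the bank can be passed in:

```python
import numpy as np

def propose_candidates(bank_size, batch_size, rng, proposal_probs=None):
    """Draw a tractable candidate batch of problem indices from a
    (possibly very large) problem bank.

    proposal_probs: optional proposal distribution over the bank;
    defaults to uniform sampling without replacement.
    """
    return rng.choice(bank_size, size=batch_size, replace=False,
                      p=proposal_probs)

rng = np.random.default_rng(0)
candidates = propose_candidates(bank_size=1_000_000, batch_size=256, rng=rng)
```

Sampling without replacement keeps the candidate pool to a fixed, tractable size regardless of how large the bank grows.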
The neural curator assigns selection probabilities proportional to expected policy-improvement utility for each candidate problem.
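A minimal sketch of turning per-candidate utility scores into selection probabilities, assuming a temperature-scaled softmax over the curator's outputs (the curator network itself and this parameterization are illustrative assumptions):

```python
import numpy as np

def curator_selection_probs(utilities, temperature=1.0):
    """Map per-candidate utility scores (e.g. outputs of a small curator
    network) to selection probabilities via a temperature-scaled softmax."""
    z = np.asarray(utilities, dtype=np.float64) / temperature
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Higher-utility candidates receive proportionally higher probability.
probs = curator_selection_probs([0.1, 2.3, -0.5, 1.1])
```

A lower temperature concentrates selection on the highest-utility problems; a higher one keeps exploration broad.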
The actor rolls out solutions on the curated problem set, receives rewards, and updates via any RL algorithm such as GRPO or GSPO.
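For the GRPO case, the group-relative advantage standardizes each rollout's reward against the other rollouts for the same problem; a minimal sketch (the epsilon and names are illustrative):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used by GRPO: each rollout's reward
    is standardized against the mean and std of the rollouts drawn for
    the same problem, so no learned value baseline is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Two correct and two incorrect rollouts for one problem (0/1 rewards).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

These advantages then weight the policy-gradient update for the corresponding rollouts.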
Observed performance improvements feed a bandit loss derived from online stochastic mirror descent, providing regret guarantees under partial feedback.
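One concrete instance of online stochastic mirror descent under bandit feedback is an exponentiated-gradient (EXP3-style) update with an importance-weighted gain estimate; the sketch below illustrates that mechanism and is not the paper's exact loss:

```python
import numpy as np

def mirror_descent_update(probs, chosen, observed_gain, lr=0.1):
    """One exponentiated-gradient step (mirror descent with the entropic
    mirror map) under partial feedback: only the chosen problem's
    policy-improvement gain is observed, so it is importance-weighted
    by 1 / probs[chosen] to form an unbiased gain estimate."""
    p = np.asarray(probs, dtype=np.float64)
    est = np.zeros_like(p)
    est[chosen] = observed_gain / p[chosen]  # unbiased estimate of gains
    w = p * np.exp(lr * est)                 # multiplicative-weights step
    return w / w.sum()

# Start uniform over 4 problems; problem 2 yields a positive improvement.
p = np.full(4, 0.25)
p = mirror_descent_update(p, chosen=2, observed_gain=0.8)
```

Because only the selected problem's outcome is observed, the importance weight keeps the gain estimate unbiased, which is what underlies the regret guarantee for this family of updates.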
Beats Uniform sampling, SEC, and PCL on most benchmarks, including Countdown (+5.52%), Zebra (+4.50%), ARC-1D (+30.51%), and AIME24 (+28.57%). The principled policy-improvement signal gives the curator an edge that heuristic baselines cannot match.
+58.97% on ARC-Hard and +12.62% on Countdown-Hard over the strongest baseline. The curator thrives where problem difficulty is heterogeneous — it learns to avoid problems that are currently too easy or too hard and focuses training where it matters most.
On Zebra and ARC benchmarks, ACTOR-CURATOR reaches equivalent performance checkpoints up to 80% faster than uniform sampling. By concentrating training on high-utility problems, the curator eliminates wasted iterations on already-solved or intractable problems.
Unlike baselines that use mean-advantage or difficulty proxies (SEC, PCL), the principled bandit objective directly adapts to the current actor, avoiding stale or misleading signals as training dynamics evolve.
The learned curator generalizes to unseen problems without per-problem statistics, scaling to large and evolving problem banks. This makes ACTOR-CURATOR practical in real post-training pipelines where new problems are continually added.
Curriculum selection stabilizes training by steering away from problems that are currently too hard or too easy, reducing variance in actor updates. The curator itself adds only ~9% training-time overhead — a small cost for large performance gains.
ACTOR-CURATOR consistently outperforms uniform sampling and strong learning-based baselines (SEC, PCL) across all challenging reasoning benchmarks, demonstrating improved training stability, efficiency, and final accuracy.
AIME24 (accuracy ↑)
Best baseline: 23.33 → ACTOR-CURATOR: 30.00. Trained with GRPO on the same base model.
+28.57% relative gain over the strongest baseline (PCL/Uniform).
ARC-1D (accuracy ↑)
Best baseline: 27.87 → ACTOR-CURATOR: 36.37. Largest absolute improvement across all benchmarks.
+30.51% relative gain — ARC-Hard shows an even larger +58.97% gain.
Countdown (accuracy ↑)
Best baseline: 58.87 → ACTOR-CURATOR: 62.12 on standard; 51.50 → 58.00 on Countdown-Hard.
+12.62% relative gain on the harder variant, consistent across difficulty levels.