
ACTOR-CURATOR

Adaptive Problem Curation for Scalable RL Post-Training

A scalable, fully automated framework that trains a neural curator to adaptively select training problems for RL post-training of LLMs — directly optimizing for expected policy improvement via a principled non-stationary bandit formulation with regret guarantees.

+28.6% accuracy gain on AIME2024 over best baseline
+30.5% accuracy gain on ARC-1D over best baseline
Up to 80% training speedup vs. the strongest baseline

Policy-improvement bandits

Problem selection formulated as a non-stationary stochastic bandit, trained directly on expected policy improvement rather than heuristic difficulty signals.

Fully automated

A neural curator performs function approximation over large, heterogeneous problem banks, with no human annotations or manual dataset structuring required.

Plug-and-play

Agnostic to the RL algorithm (GRPO, GSPO, and others); combines with any actor update rule and problem bank.

Zhengyao Gu1*, Jonathan Light2*, Raul Astudillo2, Ziyu Ye1, Langzhou He4, Henry Peng Zou5, Wei Cheng6, Santiago Paternain3, Philip S. Yu1, Yisong Yue2

* Equal contribution

1 University of Illinois Chicago    2 Caltech    3 RPI    4 MBZUAI   
5 University of Chicago    6 NEC Laboratories America

How ACTOR-CURATOR Works

ACTOR-CURATOR jointly trains an actor and a neural curator in an online, on-policy loop. The curator learns to select training problems that maximize expected policy improvement, while the actor trains on those curated batches, creating a co-adaptive cycle that accelerates learning and improves final performance.

Figure: The self-improving curriculum. The curator learns which problems most improve the actor, drawing candidate batches from a problem bank (10K+ problems) spanning easy to hard and passing curated training batches to the actor.

Sample candidate batch

A candidate batch is drawn from the problem bank via a proposal distribution, keeping the selection pool tractable across large and heterogeneous datasets.

Curator scores problems

The neural curator assigns selection probabilities proportional to expected policy-improvement utility for each candidate problem.

Actor trains on the curated batch

The actor rolls out solutions on the curated problem set, receives rewards, and updates via any RL algorithm such as GRPO or GSPO.

Curator updates via OSMD

Observed performance improvements feed a bandit loss derived from online stochastic mirror descent, providing regret guarantees under partial feedback.
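The four stages above can be sketched as a single toy iteration. Here the learned components are replaced by stand-ins: the curator is reduced to a per-problem utility vector, the actor update to a mock improvement signal, and the OSMD step to an exponentiated-gradient update. All names, shapes, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def train_step(utilities, rng, candidate_size=8, batch_size=2, lr=0.5):
    """One toy curator-actor iteration (illustrative sketch only)."""
    n = len(utilities)
    # 1. Sample a candidate batch from the problem bank (uniform proposal).
    cand = rng.choice(n, size=candidate_size, replace=False)
    # 2. Curator scores: selection probabilities via a softmax over utilities.
    logits = utilities[cand]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    pick = rng.choice(candidate_size, size=batch_size, replace=False, p=probs)
    chosen = cand[pick]
    # 3. Actor trains on the curated batch; the observed policy improvement
    #    is mocked here with a noisy per-problem signal.
    improvement = rng.normal(loc=utilities[chosen], scale=1.0)
    # 4. Mirror-descent-style update with importance weighting, so problems
    #    selected with low probability still receive unbiased credit.
    utilities[chosen] += lr * improvement / (probs[pick] * candidate_size)
    return chosen, improvement

rng = np.random.default_rng(0)
utilities = np.zeros(20)            # one utility per problem in a toy bank
chosen, improvement = train_step(utilities, rng)
```

In the real system the utility vector is replaced by a neural curator over problem features, and the improvement signal comes from the actor's measured post-update performance.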

ACTOR-CURATOR teaser: example benchmarks showing curriculum learning gains
The curator adaptively selects problems from a large bank; a bandit reward based on post-update policy improvement trains it in return.
ACTOR-CURATOR method overview: the curator selects training problems, the actor trains on the curated batch, and curator utilities are updated based on observed policy improvement.

Policy-improvement objective

The curator is trained on the expected-improvement signal, making it actor-aware and adaptive to non-stationary training dynamics.

Regret guarantees

The OSMD-based bandit formulation provides theoretical regret bounds under partial feedback on large problem sets.

Online co-adaptation

Actor and curator co-evolve every iteration — as the actor improves, the curator continuously updates its utility estimates to track the shifting training frontier.
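As a point of reference for the regret machinery, the textbook EXP3 update is OSMD with the negative-entropy mirror map under partial (bandit) feedback: only the pulled arm's reward is observed, so it is importance-weighted before the exponentiated-gradient step. This is a generic illustration, not the paper's exact estimator or loss.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def exp3_update(weights, arm, reward, eta):
    """One EXP3 step: importance-weight the observed reward, then
    take an exponentiated-gradient (mirror descent) step."""
    probs = normalize(weights)
    estimate = reward / probs[arm]      # unbiased under partial feedback
    new_weights = list(weights)
    new_weights[arm] *= math.exp(eta * estimate)
    return new_weights

w = exp3_update([1.0, 1.0, 1.0], arm=0, reward=1.0, eta=0.1)
# The pulled, rewarded arm's weight grows; the others are unchanged.
```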

Key Findings

Finding 1: ACTOR-CURATOR consistently outperforms all baselines


Beats Uniform sampling, SEC, and PCL on most benchmarks, including Countdown (+5.52%), Zebra (+4.50%), ARC-1D (+30.51%), and AIME24 (+28.57%). The principled policy-improvement signal gives the curator an edge that heuristic baselines cannot match.

Finding 2: Biggest gains on the hardest benchmarks


+58.97% on ARC-Hard and +12.62% on Countdown-Hard over the strongest baseline. The curator thrives where problem difficulty is heterogeneous — it learns to avoid problems that are currently too easy or too hard and focuses training where it matters most.

Finding 3: Up to 80% training speedup


On Zebra and ARC benchmarks, ACTOR-CURATOR reaches equivalent performance checkpoints up to 80% faster than uniform sampling. By concentrating training on high-utility problems, the curator eliminates wasted iterations on already-solved or intractable problems.

Finding 4: Policy-improvement signal beats heuristics


Unlike baselines that use mean-advantage or difficulty proxies (SEC, PCL), the principled bandit objective directly adapts to the current actor, avoiding stale or misleading signals as training dynamics evolve.

Finding 5: Neural approximation generalizes to unseen problems


The learned curator generalizes to unseen problems without per-problem statistics, scaling to large and evolving problem banks. This makes ACTOR-CURATOR practical in real post-training pipelines where new problems are continually added.
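The generalization claim is the standard one for function approximation: a curator that scores problems from their features, rather than from per-problem statistics, can assign a utility to a problem it has never selected before. A minimal sketch follows, with random untrained weights; the architecture and feature dimension are assumptions for illustration only.

```python
import numpy as np

class CuratorNet:
    """Two-layer scorer mapping problem feature vectors to utilities."""

    def __init__(self, feat_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(feat_dim, hidden)) / np.sqrt(feat_dim)
        self.w2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)

    def score(self, feats):
        # feats: (num_problems, feat_dim) -> (num_problems,) utilities
        h = np.tanh(feats @ self.w1)
        return (h @ self.w2).squeeze(-1)

net = CuratorNet(feat_dim=4)
unseen = np.random.default_rng(1).normal(size=(3, 4))  # never-seen problems
scores = net.score(unseen)  # a utility for each; no per-problem table needed
```

Because scoring depends only on features, newly added problems join the curriculum immediately, without any warm-up statistics.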

Finding 6: Stable training dynamics with minimal overhead


Curriculum selection stabilizes training by steering away from problems that are currently too hard or too easy, reducing variance in actor updates. The curator itself adds only ~9% training-time overhead — a small cost for large performance gains.

Benchmark Results

ACTOR-CURATOR consistently outperforms uniform sampling and strong learning-based baselines (SEC, PCL) across challenging reasoning benchmarks, demonstrating improved training stability, efficiency, and final accuracy.

AIME2024

accuracy ↑

Best baseline: 23.33 → ACTOR-CURATOR: 30.00. Trained with GRPO on the same base model.

+28.57% relative gain over the strongest baseline (PCL/Uniform).

ARC-1D

accuracy ↑

Best baseline: 27.87 → ACTOR-CURATOR: 36.37. Largest absolute improvement across all benchmarks.

+30.51% relative gain — ARC-Hard shows an even larger +58.97% gain.

Countdown

accuracy ↑

Best baseline: 58.87 → ACTOR-CURATOR: 62.12 on standard; 51.50 → 58.00 on Countdown-Hard.

+12.62% relative gain on the harder variant, consistent across difficulty levels.
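For reference, the relative gains quoted throughout are computed as (ours − baseline) / baseline. A quick check against the Countdown numbers above:

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline

print(round(relative_gain(58.87, 62.12), 2))  # 5.52  (Countdown standard)
print(round(relative_gain(51.50, 58.00), 2))  # 12.62 (Countdown-Hard)
```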

ARC-1D Accuracy over Training

ARC-1D accuracy over training steps
ACTOR-CURATOR (AC) reaches higher ARC-1D accuracy faster than all baselines across training.

Training Efficiency on Zebra

Training efficiency on Zebra: steps to reach target accuracy
ACTOR-CURATOR achieves the same Zebra accuracy checkpoints up to 80% faster than uniform sampling.

Citation

If you find ACTOR-CURATOR helpful in your research, please cite the paper below:

@article{gu2025actorcurator,
  title        = {{ACTOR-CURATOR}: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for Scalable {RL} Post-Training},
  author       = {Gu, Zhengyao and Light, Jonathan and Astudillo, Raul and Ye, Ziyu and He, Langzhou and Zou, Henry Peng and Cheng, Wei and Paternain, Santiago and Yu, Philip S. and Yue, Yisong},
  journal      = {arXiv preprint},
  year         = {2025},
  eprint       = {2602.20532},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  url          = {https://arxiv.org/abs/2602.20532}
}