REINFORCE
REINFORCE is a policy gradient algorithm that directly maximizes the expected return of completions sampled from the policy. We optimize the following loss with an off-policy importance sampling correction:
where:
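As a rough illustration of the idea, and not necessarily the exact loss used here, an importance-sampling-corrected REINFORCE objective is often written as the negative mean of `w * A`, where `w = exp(logp_new - logp_old)` is the ratio between the current policy and the policy that generated the samples, and `A` is the advantage. The function name and the per-sample (rather than per-token) averaging below are assumptions made for this sketch; clipping and other normalization choices are left out.

```python
import math


def is_reinforce_loss(logp_new, logp_old, advantages):
    """Hypothetical helper: surrogate loss -mean(w_i * A_i).

    w_i = exp(logp_new_i - logp_old_i) is the importance ratio between the
    current policy and the behavior policy that produced sample i.
    """
    n = len(advantages)
    if n == 0 or not (len(logp_new) == len(logp_old) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        weight = math.exp(lp_new - lp_old)  # importance ratio
        total += weight * adv
    return -total / n
```

When the samples are on-policy (`logp_new == logp_old`), every weight is 1 and the value reduces to the negative mean advantage. In an autograd framework, differentiating this surrogate with respect to the policy parameters gives the importance-weighted policy gradient, since the gradient of `w` is `w` times the gradient of `logp_new`.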
Role-conditioned Advantage Estimation
We recommend Role-conditioned Advantage Estimation (RAE), introduced in [1], which subtracts a rolling baseline from the return when computing advantages and thereby improves training convergence and stability. RAE maintains a separate baseline \(b_{G,p}\) for each game \(G \in \mathcal{G}\) and role \(p \in \{0,1\}\); each baseline estimates the expected return \(\mathbb{E}[R_p(\tau)]\) for that role in that game. The baselines are updated with an exponential moving average (EMA) with decay rate \(\alpha \in [0,1]\), and the advantage is the return minus the baseline:
\[
b_{G,p} \leftarrow \alpha \, b_{G,p} + (1 - \alpha) \, R_p(\tau), \qquad A_{G,p}(\tau) = R_p(\tau) - b_{G,p}
\]
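A minimal sketch of such a baseline tracker, assuming the usual EMA form in which \(\alpha\) weights the old baseline and \(1 - \alpha\) weights the new return, and assuming the advantage is computed against the freshly updated baseline. The class name, the default `alpha`, and the zero initialization are hypothetical choices for illustration.

```python
from collections import defaultdict


class RollingBaseline:
    """Hypothetical per-(game, role) EMA baseline for RAE-style advantages."""

    def __init__(self, alpha=0.95, init=0.0):
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        self.alpha = alpha
        # One baseline per (game, role) key, created lazily at `init`.
        self.baselines = defaultdict(lambda: init)

    def advantage(self, game, role, ret):
        """Update b[game, role] with the return, then return ret - b."""
        key = (game, role)
        updated = self.alpha * self.baselines[key] + (1.0 - self.alpha) * ret
        self.baselines[key] = updated
        return ret - updated


tracker = RollingBaseline(alpha=0.5)
adv_p0 = tracker.advantage("tictactoe", 0, 1.0)   # role 0 wins
adv_p1 = tracker.advantage("tictactoe", 1, -1.0)  # role 1 loses
```

Keeping one baseline per role matters in zero-sum games: the two roles see opposite returns for the same trajectory, so a shared baseline would hover near zero and remove little variance.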
[1] Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., … & Jaques, N. (2025). SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2506.24119.
Hyperparameters
- epochs: int (default: 2)
Number of epochs to train the policy.
- local_batch_size: int (default: 384)
Per-GPU batch size.
- micro_batch_size: int (default: 1)
The micro batch size used during training.
- learning_rate: float (default: 1e-6)
Learning rate for the policy.
- lr_scheduler_type: str (default: "constant")
Learning rate scheduler type.
- lr_warmup_ratio: float (default: 0.01)
Learning rate warmup ratio.
- grad_clip: float (default: 0.2)
Gradient clipping value for the policy.
- kl_coef: float (default: 0.0)
Coefficient for KL divergence penalty against the reference model.
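The defaults listed above can be gathered into a single configuration object. The sketch below is illustrative only: the class name `ReinforceConfig` is hypothetical, and the derived `grad_accum_steps` property assumes that each per-GPU batch is processed as `local_batch_size / micro_batch_size` micro-batches, which the list above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ReinforceConfig:
    """Hypothetical container mirroring the documented defaults."""

    epochs: int = 2
    local_batch_size: int = 384      # per-GPU batch size
    micro_batch_size: int = 1
    learning_rate: float = 1e-6
    lr_scheduler_type: str = "constant"
    lr_warmup_ratio: float = 0.01
    grad_clip: float = 0.2
    kl_coef: float = 0.0             # 0.0 disables the KL penalty term

    def __post_init__(self):
        if self.micro_batch_size <= 0:
            raise ValueError("micro_batch_size must be positive")
        if self.local_batch_size % self.micro_batch_size != 0:
            raise ValueError("local_batch_size must be divisible by micro_batch_size")

    @property
    def grad_accum_steps(self):
        # Assumption: micro-batches are accumulated up to one local batch.
        return self.local_batch_size // self.micro_batch_size


config = ReinforceConfig(micro_batch_size=8)
```

With `micro_batch_size=8`, the sketch yields 48 accumulation steps per local batch; with the default of 1, it yields 384.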