REINFORCE

REINFORCE is a policy gradient algorithm that directly maximizes the expected return of completions sampled from the policy. We optimize the following loss with off-policy importance sampling correction:

\[\mathcal{L}_{\text{REINFORCE}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} A_i \, \log \pi_\theta(a_i \mid s_i)\]

where:

\[\log \pi_\theta(a_i \mid s_i) = \sum_{t=1}^{T_i} \log \pi_\theta(a_{i,t} \mid s_{i,t})\]
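As a rough illustration, the loss can be computed from per-token log-probabilities as sketched below. This is a minimal sketch in plain Python, not the library's implementation: the function names `sequence_log_prob` and `reinforce_loss` are hypothetical, the loss is assumed to take the standard log-probability form, and the importance sampling correction and the KL penalty are left out.

```python
import math


def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities into one log-probability per completion."""
    return sum(token_log_probs)


def reinforce_loss(batch_token_log_probs, advantages):
    """Return -(1/N) * sum_i A_i * log pi(a_i | s_i) for a batch of N completions.

    Assumption: no importance sampling ratio and no KL term are applied here.
    """
    if len(batch_token_log_probs) != len(advantages):
        raise ValueError("need exactly one advantage per completion")
    n = len(advantages)
    total = sum(
        adv * sequence_log_prob(log_probs)
        for log_probs, adv in zip(batch_token_log_probs, advantages)
    )
    return -total / n


# Toy batch: two completions with made-up token probabilities.
token_log_probs = [
    [math.log(0.5), math.log(0.5)],  # completion 1: two tokens
    [math.log(0.25)],                # completion 2: one token
]
advantages = [1.0, -0.5]
loss = reinforce_loss(token_log_probs, advantages)
```

In a real training loop the log-probabilities would be differentiable tensors, so minimizing this value raises the probability of completions with positive advantage and lowers it for those with negative advantage.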

Rolling Advantage Estimation

We recommend Rolling Advantage Estimation (RAE), introduced in [1], which computes advantages against a rolling baseline to improve training convergence and stability. RAE maintains a separate baseline \(b_{G,p}\) for each game \(G \in \mathcal{G}\) and role \(p \in \{0,1\}\), where each baseline estimates the expected return \(\mathbb{E}[R_p(\tau)]\) for that role in that game. The baselines are updated with an exponential moving average (EMA) using a decay rate \(\alpha \in [0,1]\), and the advantage is the return minus the baseline:

\[\begin{aligned} b_{G,p} &\leftarrow \alpha b_{G,p} + (1-\alpha) R_p(\tau) && \text{(update baseline)} \\ A_{G,p}(\tau) &= R_p(\tau) - b_{G,p} && \text{(compute advantage)} \end{aligned}\]
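The two update steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the reference implementation from [1]: the class name `RollingBaseline` is hypothetical, baselines are assumed to start at 0.0, and the advantage is computed against the already-updated baseline, following the order in which the equations are listed. The game name in the usage lines is made up.

```python
from collections import defaultdict


class RollingBaseline:
    """Per-(game, role) EMA baseline, as a hypothetical sketch of RAE."""

    def __init__(self, alpha, initial=0.0):
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        self.alpha = alpha
        # Assumption: every unseen (game, role) pair starts at `initial`.
        self.baselines = defaultdict(lambda: initial)

    def advantage(self, game, role, ret):
        """Update the baseline for (game, role) with return `ret`, then return A = R - b."""
        key = (game, role)
        updated = self.alpha * self.baselines[key] + (1.0 - self.alpha) * ret
        self.baselines[key] = updated
        return ret - updated


rae = RollingBaseline(alpha=0.5)
a1 = rae.advantage("tictactoe", 0, 1.0)   # baseline becomes 0.5
a2 = rae.advantage("tictactoe", 0, 1.0)   # baseline becomes 0.75
a3 = rae.advantage("tictactoe", 1, -1.0)  # role 1 has its own baseline
```

Because each role in each game keeps its own baseline, a role that tends to win or lose more often than the other does not skew the advantages of the other role.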

[1] Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., … & Jaques, N. (2025). SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2506.24119.

Hyperparameters

epochs: int (default: 2): Number of epochs to train the policy.
local_batch_size: int (default: 384): Per-GPU batch size.
micro_batch_size: int (default: 1): The micro batch size used during training.
learning_rate: float (default: 1e-6): Learning rate for the policy.
lr_scheduler_type: str (default: "constant"): Learning rate scheduler type.
lr_warmup_ratio: float (default: 0.01): Learning rate warmup ratio.
grad_clip: float (default: 0.2): Gradient clipping value for the policy.
kl_coef: float (default: 0.0): Coefficient for KL divergence penalty against the reference model.
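For reference, the defaults listed above can be collected in a small dataclass. This is only a sketch: the class name `ReinforceConfig` is hypothetical and is not the library's actual configuration object, and the `micro_batches_per_step` property assumes that `local_batch_size` is split into micro batches of `micro_batch_size`, which the list above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ReinforceConfig:
    """Hypothetical container mirroring the documented defaults."""

    epochs: int = 2
    local_batch_size: int = 384
    micro_batch_size: int = 1
    learning_rate: float = 1e-6
    lr_scheduler_type: str = "constant"
    lr_warmup_ratio: float = 0.01
    grad_clip: float = 0.2
    kl_coef: float = 0.0

    def __post_init__(self):
        # Basic sanity checks; these rules are assumptions, not documented behavior.
        if self.micro_batch_size <= 0 or self.local_batch_size <= 0:
            raise ValueError("batch sizes must be positive")
        if self.local_batch_size % self.micro_batch_size != 0:
            raise ValueError("local_batch_size must be a multiple of micro_batch_size")
        if not 0.0 <= self.lr_warmup_ratio <= 1.0:
            raise ValueError("lr_warmup_ratio must lie in [0, 1]")

    @property
    def micro_batches_per_step(self):
        # Assumption: one optimizer step accumulates over this many micro batches.
        return self.local_batch_size // self.micro_batch_size


config = ReinforceConfig()
override = ReinforceConfig(micro_batch_size=4, kl_coef=0.01)
```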