REINFORCE

REINFORCE is a policy gradient algorithm that directly maximizes the expected return of completions sampled from the policy. We minimize the following loss over a batch of \(N\) completions with advantages \(A_i\), applying an off-policy importance sampling correction:

\[\mathcal{L}_{\text{REINFORCE}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} A_i \, \log \pi_\theta(a_i \mid s_i)\]

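where the log-probability of a completion with \(T_i\) tokens is the sum of its per-token log-probabilities: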

\[\log \pi_\theta(a_i \mid s_i) = \sum_{t=1}^{T_i} \log \pi_\theta(a_{i,t} \mid s_{i,t})\]
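As a rough illustration of how the two formulas above fit together, the following sketch computes the loss value in plain Python from per-token log-probabilities. The function names `sequence_log_prob` and `reinforce_loss` and the toy numbers are assumptions made for this example, not part of the library; a real training loop would compute the same quantity on autograd tensors, and this sketch omits the importance sampling correction.

```python
import math


def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities to get log pi_theta(a_i | s_i)."""
    return sum(token_log_probs)


def reinforce_loss(advantages, token_log_probs_per_sample):
    """Return -(1/N) * sum_i A_i * log pi_theta(a_i | s_i)."""
    if len(advantages) != len(token_log_probs_per_sample):
        raise ValueError("need exactly one advantage per completion")
    n = len(advantages)
    total = 0.0
    for advantage, token_log_probs in zip(advantages, token_log_probs_per_sample):
        total += advantage * sequence_log_prob(token_log_probs)
    return -total / n


# Toy batch (hypothetical values): two completions of different lengths.
advantages = [1.0, -0.5]
token_log_probs = [
    [math.log(0.5), math.log(0.25)],  # two tokens
    [math.log(0.5)],                  # one token
]
loss = reinforce_loss(advantages, token_log_probs)
print(loss)
```

Minimizing this loss raises the log-probability of completions with positive advantage and lowers it for completions with negative advantage.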

Rolling Advantage Estimation

We recommend Rolling Advantage Estimation (RAE), introduced in [1], which subtracts a rolling baseline from the return to compute the advantage, improving training convergence and stability. RAE maintains a separate baseline \(b_{G,p}\) for each game \(G \in \mathcal{G}\) and role \(p \in \{0,1\}\), where each baseline estimates the expected return \(\mathbb{E}[R_p(\tau)]\) for that role in that game. For each trajectory \(\tau\), we update the corresponding baseline using an exponential moving average (EMA) with decay rate \(\alpha \in [0,1]\), then compute the advantage from the updated baseline:

\[\begin{aligned} b_{G,p} &\leftarrow \alpha b_{G,p} + (1-\alpha) R_p(\tau) && \text{(update baseline)} \\ A_{G,p}(\tau) &= R_p(\tau) - b_{G,p} && \text{(compute advantage)} \end{aligned}\]
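The update rule above can be sketched as a small tracker that keeps one baseline per (game, role) pair. The class name `RollingAdvantageEstimator`, the initial baseline of 0.0, the game name, and the decay value used below are all assumptions chosen for illustration; the source does not specify how baselines are initialized.

```python
from collections import defaultdict


class RollingAdvantageEstimator:
    """Keeps an EMA baseline b_{G,p} for every (game, role) pair."""

    def __init__(self, alpha, initial_baseline=0.0):
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        self.alpha = alpha
        # Assumption: unseen (game, role) pairs start at initial_baseline.
        self.baselines = defaultdict(lambda: initial_baseline)

    def update_and_compute(self, game, role, episode_return):
        """Update the baseline first, then return R - b as the advantage."""
        key = (game, role)
        baseline = self.alpha * self.baselines[key] + (1.0 - self.alpha) * episode_return
        self.baselines[key] = baseline
        return episode_return - baseline


# Hypothetical game name and decay rate, for illustration only.
rae = RollingAdvantageEstimator(alpha=0.5)
first = rae.update_and_compute("tic-tac-toe", 0, 1.0)   # baseline becomes 0.5
second = rae.update_and_compute("tic-tac-toe", 0, 1.0)  # baseline becomes 0.75
other = rae.update_and_compute("tic-tac-toe", 1, -1.0)  # role 1 has its own baseline
print(first, second, other)
```

Because the baseline is updated before the advantage is computed, a role that keeps receiving the same return sees its advantage shrink toward zero, while the other role's baseline is unaffected.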

[1] Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., … & Jaques, N. (2025). SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2506.24119.

Hyperparameters

epochs: int (default: 2)

Number of epochs to train the policy.

local_batch_size: int (default: 384)

Per-GPU batch size.

micro_batch_size: int (default: 1)

The micro batch size used during training.

learning_rate: float (default: 1e-6)

Learning rate for the policy.

lr_scheduler_type: str (default: "constant")

Learning rate scheduler type.

lr_warmup_ratio: float (default: 0.01)

Learning rate warmup ratio.

grad_clip: float (default: 0.2)

Gradient clipping value for the policy.

kl_coef: float (default: 0.0)

Coefficient for KL divergence penalty against the reference model.
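For quick reference, the defaults listed above can be collected in a single structure. The dataclass below is only a sketch: the name `ReinforceConfig` is hypothetical, and the library's actual configuration mechanism may differ. Only the field names and default values are taken from the list above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ReinforceConfig:
    """Hypothetical container mirroring the documented defaults."""

    epochs: int = 2                     # number of epochs to train the policy
    local_batch_size: int = 384         # per-GPU batch size
    micro_batch_size: int = 1           # micro batch size used during training
    learning_rate: float = 1e-6         # learning rate for the policy
    lr_scheduler_type: str = "constant" # learning rate scheduler type
    lr_warmup_ratio: float = 0.01       # learning rate warmup ratio
    grad_clip: float = 0.2              # gradient clipping value
    kl_coef: float = 0.0                # KL penalty against the reference model


# Override a single value and keep the remaining defaults.
config = ReinforceConfig(learning_rate=5e-7)
print(asdict(config))
```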