REINFORCE
=========

**REINFORCE** is a policy gradient algorithm that directly maximizes the expected return of completions sampled from the policy.
We optimize the following loss with off-policy importance sampling correction:

.. math::

   \mathcal{L}_{\text{REINFORCE}}(\theta)
   = -\frac{1}{N} \sum_{i=1}^{N}
     A_i \,
     \log \pi_\theta(a_i \mid s_i)

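where :math:`N` is the number of completions in the batch, :math:`A_i` is the advantage of completion :math:`i`, and the log-probability of a completion is the sum of the log-probabilities of its :math:`T_i` tokens: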

.. math::

   \log \pi_\theta(a_i \mid s_i)
   = \sum_{t=1}^{T_i}
     \log \pi_\theta(a_{i,t} \mid s_{i,t})
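
As a rough illustration, the sketch below computes the REINFORCE loss as the negative batch mean of advantage times completion log-probability, where each completion log-probability is the sum of its token log-probabilities. It uses plain Python lists so it runs without dependencies. The function names ``sequence_log_prob`` and ``reinforce_loss`` and the sample values are hypothetical, not part of the library. A real trainer would presumably operate on tensors with automatic differentiation and treat the advantages as constants.

.. code-block:: python

   def sequence_log_prob(token_log_probs):
       """Sum per-token log-probabilities into one completion log-probability."""
       return sum(token_log_probs)


   def reinforce_loss(token_log_probs_batch, advantages):
       """Negative mean of advantage-weighted completion log-probabilities."""
       if len(token_log_probs_batch) != len(advantages):
           raise ValueError("need exactly one advantage per completion")
       n = len(advantages)
       weighted = sum(
           adv * sequence_log_prob(log_probs)
           for log_probs, adv in zip(token_log_probs_batch, advantages)
       )
       return -weighted / n


   # Two completions of different lengths (illustrative values only).
   batch = [[-0.5, -1.0], [-0.25, -0.25, -0.5]]
   advantages = [1.0, -2.0]
   loss = reinforce_loss(batch, advantages)

Minimizing this quantity raises the log-probability of completions with positive advantage and lowers it for completions with negative advantage.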

.. admonition:: Rolling Advantage Estimation
   :class: tip

   We recommend Rolling Advantage Estimation (RAE), introduced in [1], which subtracts a rolling baseline from the return to compute the advantage, improving training convergence and stability. RAE maintains a separate baseline :math:`b_{G,p}` for each game
   :math:`G \in \mathcal{G}` and role :math:`p \in \{0,1\}`; each baseline
   estimates the expected return :math:`\mathbb{E}[R_p(\tau)]` for that role in that game.
   The baselines are updated with an exponential moving average (EMA) with decay
   rate :math:`\alpha \in [0,1]`:

   .. math::

      \begin{align}
      b_{G,p} &\leftarrow \alpha b_{G,p} + (1-\alpha) R_p(\tau) 
         && \text{(update baseline)} \\
      A_{G,p}(\tau) &= R_p(\tau) - b_{G,p} 
         && \text{(compute advantage)}
      \end{align}
   
   [1] Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., ... & Jaques, N. (2025). SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2506.24119.
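
The two RAE steps can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the class name ``RollingAdvantageEstimator``, the default ``alpha=0.95``, the initial baseline of ``0.0``, and the game name ``"TicTacToe"`` are all assumptions made for the example. It follows the order given above: update the baseline for the (game, role) pair first, then subtract the updated baseline from the return.

.. code-block:: python

   from collections import defaultdict


   class RollingAdvantageEstimator:
       """Keeps one EMA baseline per (game, role) pair (hypothetical helper)."""

       def __init__(self, alpha=0.95, initial_baseline=0.0):
           if not 0.0 <= alpha <= 1.0:
               raise ValueError("alpha must lie in [0, 1]")
           self.alpha = alpha
           # Unseen (game, role) pairs start from the assumed initial baseline.
           self.baselines = defaultdict(lambda: initial_baseline)

       def advantage(self, game, role, ret):
           key = (game, role)
           # Update baseline: b <- alpha * b + (1 - alpha) * R
           updated = self.alpha * self.baselines[key] + (1.0 - self.alpha) * ret
           self.baselines[key] = updated
           # Compute advantage: A = R - b
           return ret - updated


   rae = RollingAdvantageEstimator(alpha=0.5)
   first = rae.advantage("TicTacToe", 0, 1.0)    # baseline for role 0 becomes 0.5
   second = rae.advantage("TicTacToe", 0, 1.0)   # baseline for role 0 becomes 0.75
   other = rae.advantage("TicTacToe", 1, -1.0)   # role 1 has its own baseline

Because each role keeps its own baseline, a role that wins consistently sees its advantage shrink toward zero, while the opposing role's baseline tracks its own, separate return.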

Hyperparameters
"""""""""""""""
**epochs: int (default: 2)**
  Number of epochs to train the policy.
**local_batch_size: int (default: 384)**
  Per-GPU batch size.
**micro_batch_size: int (default: 1)**
  The micro batch size used during training.
**learning_rate: float (default: 1e-6)**
  Learning rate for the policy.
**lr_scheduler_type: str (default: "constant")**
  Learning rate scheduler type.
**lr_warmup_ratio: float (default: 0.01)**
  Learning rate warmup ratio.
**grad_clip: float (default: 0.2)**
  Gradient clipping value for the policy.
**kl_coef: float (default: 0.0)**
  Coefficient for KL divergence penalty against the reference model.
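
For quick reference, the defaults listed above can be collected in a single container. The ``ReinforceConfig`` dataclass below is a hypothetical stand-in written for this example; the library's actual configuration class and how it is passed to the trainer may differ. Only the field names and default values are taken from the list above.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class ReinforceConfig:
       """Hypothetical container mirroring the documented defaults."""

       epochs: int = 2
       local_batch_size: int = 384
       micro_batch_size: int = 1
       learning_rate: float = 1e-6
       lr_scheduler_type: str = "constant"
       lr_warmup_ratio: float = 0.01
       grad_clip: float = 0.2
       kl_coef: float = 0.0


   # Override only the values that differ from the defaults
   # (the overrides here are illustrative, not recommendations).
   config = ReinforceConfig(learning_rate=5e-7, kl_coef=0.01)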