Train a reasoning model to play Tic Tac Toe with PPO
In this tutorial, we train a small language model to play Tic Tac Toe using Proximal Policy Optimization (PPO) with Generalized Advantage Estimation (GAE). Training uses self-play with a mirror model sampler, so the model learns by playing against copies of itself.
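For readers new to the two techniques named above, here is a minimal pure-Python sketch of GAE and the PPO clipped surrogate for a single finished episode. It is not the Unstable Baselines implementation. The function names, the `gamma`, `lam`, and `eps` values, and the assumption that the value after the final move is zero are all illustrative choices.

```python
from typing import List


def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    rewards[t] is the reward received after step t and values[t] is the
    critic's estimate for the state at step t. The value after the last
    step is assumed to be 0 because the game has ended.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value of the state following step t
    running = 0.0      # discounted sum of later TD errors
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors (weight gamma * lam).
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for a single sample (to be maximized).

    ratio is new_policy_prob / old_policy_prob for the sampled action.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# A three-move game that ends in a win (reward 1 on the final move),
# scored by an untrained critic that predicts 0 everywhere.
example_advantages = compute_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0],
                                 gamma=1.0, lam=1.0)
```

With `lam=1.0` every move receives credit for the final outcome, while `lam=0.0` reduces each advantage to its one-step TD error. Values in between trade variance against bias.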
Setup
Make sure you have Unstable Baselines installed:
pip install unstable-rl
Quick start
The fastest way to get started is with the examples/run.py script:
python examples/run.py \
    --name "tictactoe-ppo" \
    --algorithm ppo \
    --envs "TicTacToe-v0-train" \
    --model "Qwen/Qwen3-1.7B-Base" \
    --template "qwen3-zs"
This loads the default PPO config, applies the run name, environment, model, and prompt template given on the command line, and starts training.
Custom config in Python
For more control, load and modify the config directly:
from unstable import train, get_algorithm_config

config = get_algorithm_config("ppo")

# Run settings
config["run"] = "tictactoe-ppo"
config["model_name"] = "Qwen/Qwen3-1.7B-Base"
config["training_iterations"] = 500

# Use the Tic Tac Toe environment
config["env_sampler"]["train"] = [
    {
        "id": "TicTacToe-v0-train",
        "num_players": 2,
        "num_actors": 2,
        "prompt_template": "qwen3-zs",
    }
]
config["env_sampler"]["eval"] = [
    {
        "id": "TicTacToe-v0-train",
        "num_players": 2,
        "prompt_template": "qwen3-zs",
        "fixed_opponent": "google/gemini-2.0-flash-lite-001",
    }
]

# Tune hyperparameters for Tic Tac Toe
config["learner"]["learning_rate"] = 1e-5
config["learner"]["local_batch_size"] = 128
config["learner"]["total_training_steps"] = 500

# Start training; returns the path to the resulting checkpoint
checkpoint_path = train(config)
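Because a training run is expensive to start, it can help to sanity-check the environment entries before calling `train`. The helper below is a hypothetical sketch and not part of Unstable Baselines. It only checks for the keys that this tutorial sets on every entry, and the library itself may require more or fewer.

```python
# Keys that every environment entry in this tutorial sets (an assumption,
# not a schema published by the library).
EXPECTED_KEYS = ("id", "num_players", "prompt_template")


def find_env_config_problems(env_sampler: dict) -> list:
    """Return human-readable problems found in an env_sampler mapping."""
    problems = []
    for split in ("train", "eval"):
        entries = env_sampler.get(split, [])
        if not entries:
            problems.append(f"{split}: no environments configured")
        for index, entry in enumerate(entries):
            for key in EXPECTED_KEYS:
                if key not in entry:
                    problems.append(f"{split}[{index}]: missing '{key}'")
    return problems


# Stand-in for config["env_sampler"] with a deliberately incomplete eval entry.
example_env_sampler = {
    "train": [{"id": "TicTacToe-v0-train", "num_players": 2,
               "num_actors": 2, "prompt_template": "qwen3-zs"}],
    "eval": [{"id": "TicTacToe-v0-train", "num_players": 2}],
}
example_problems = find_env_config_problems(example_env_sampler)
```

In a real script, one would pass `config["env_sampler"]` and stop before training if the returned list is not empty.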
Reward shaping
The replay buffer applies reward transformations to the raw game outcomes. For Tic Tac Toe, you might want to penalize invalid moves and give a small reward for correct formatting:
config["replay_buffer"]["reward_transformations"] = {
    "final": {},
    "step": {
        "format_reward": {"reward": 0.1},
        "invalid_move_penalty": {"reward": 0.1, "penalty": -0.1},
    },
    "sampling": {},
}
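To make the effect of the step transformations concrete, the sketch below shows one plausible reading of the settings above: a bonus for a correctly formatted reply, plus a bonus for a valid move or a penalty for an invalid one. The function and its arguments are hypothetical, and the library's actual transformation logic may differ.

```python
def shape_step_reward(base_reward: float, correct_format: bool, valid_move: bool,
                      format_reward: float = 0.1,
                      valid_move_reward: float = 0.1,
                      invalid_move_penalty: float = -0.1) -> float:
    """Combine a raw game reward with formatting and move-validity terms.

    The default magnitudes mirror the config values in this tutorial; how
    the library combines them internally is assumed here, not confirmed.
    """
    shaped = base_reward
    if correct_format:
        shaped += format_reward
    # A valid move earns a small bonus, an invalid one a small penalty.
    shaped += valid_move_reward if valid_move else invalid_move_penalty
    return shaped


# A winning step with a well-formatted, valid move.
winning_step = shape_step_reward(1.0, correct_format=True, valid_move=True)
# A losing step where the reply was malformed and the move was invalid.
losing_step = shape_step_reward(-1.0, correct_format=False, valid_move=False)
```

Keeping the shaping terms small relative to the win/loss signal means the model is nudged toward well-formed, legal moves without being rewarded more for formatting than for winning.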
Evaluation
After training, evaluate the checkpoint against a fixed opponent:
python -m unstable.eval \
    --checkpoint <checkpoint_path> \
    --env "TicTacToe-v0-train" \
    --opponent "google/gemini-2.0-flash-lite-001" \
    --num_runs 128
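With 128 evaluation games, the measured win rate carries noticeable sampling noise. The sketch below computes a Wilson score interval for a win rate. It does not depend on the evaluation script's output format, and the win count used in the example is made up for illustration. Draws are treated as non-wins here; they could also be reported separately.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96):
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    z2 = z * z
    denom = 1.0 + z2 / games
    center = (p + z2 / (2 * games)) / denom
    half_width = z * math.sqrt(p * (1 - p) / games + z2 / (4 * games * games)) / denom
    return center - half_width, center + half_width


# Hypothetical result: 64 wins out of 128 games.
low, high = wilson_interval(wins=64, games=128)
print(f"win rate 0.50, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

If the intervals of two checkpoints overlap heavily, more evaluation games are needed before concluding that one checkpoint is stronger.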