Train a reasoning model to play tic tac toe with PPO
=====================================================

In this tutorial, we train a small language model to play Tic Tac Toe using Proximal Policy Optimization and Generalized Advantage Estimation. We use self-play with a mirror model sampler, so the model learns by playing against copies of itself.

Setup
"""""

Make sure you have Unstable Baselines installed:

.. code-block:: bash

    pip install unstable-rl

Quick start
"""""""""""

The fastest way to get started is with the ``examples/run.py`` script:

.. code-block:: bash

    python examples/run.py \
        --name "tictactoe-ppo" \
        --algorithm ppo \
        --envs "TicTacToe-v0-train" \
        --model "Qwen/Qwen3-1.7B-Base" \
        --template "qwen3-zs"

This loads the default PPO config, overrides the environment and model, and starts training.

Custom config in Python
"""""""""""""""""""""""

For more control, load and modify the config directly:

.. code-block:: python

    from unstable import train, get_algorithm_config

    config = get_algorithm_config("ppo")

    # Run settings
    config["run"] = "tictactoe-ppo"
    config["model_name"] = "Qwen/Qwen3-1.7B-Base"
    config["training_iterations"] = 500

    # Use the Tic Tac Toe environment
    config["env_sampler"]["train"] = [
        {
            "id": "TicTacToe-v0-train",
            "num_players": 2,
            "num_actors": 2,
            "prompt_template": "qwen3-zs",
        }
    ]
    config["env_sampler"]["eval"] = [
        {
            "id": "TicTacToe-v0-train",
            "num_players": 2,
            "prompt_template": "qwen3-zs",
            "fixed_opponent": "google/gemini-2.0-flash-lite-001",
        }
    ]

    # Tune hyperparameters for Tic Tac Toe
    config["learner"]["learning_rate"] = 1e-5
    config["learner"]["local_batch_size"] = 128
    config["learner"]["total_training_steps"] = 500

    checkpoint_path = train(config)

Reward shaping
""""""""""""""

The replay buffer applies reward transformations to the raw game outcomes. For Tic Tac Toe, you might want to penalize invalid moves and give a small reward for correct formatting:

.. code-block:: python

    config["replay_buffer"]["reward_transformations"] = {
        "final": {},
        "step": {
            "format_reward": {"reward": 0.1},
            "invalid_move_penalty": {"reward": 0.1, "penalty": -0.1},
        },
        "sampling": {},
    }

Evaluation
""""""""""

After training, evaluate the checkpoint against a fixed opponent:

.. code-block:: bash

    python -m unstable.eval \
        --checkpoint \
        --env "TicTacToe-v0-train" \
        --opponent "google/gemini-2.0-flash-lite-001" \
        --num_runs 128

Results
"""""""