Configuration
Unstable Baselines follows a modular architecture: every component can be configured, customized, and extended. The default configuration used for REINFORCE is shown below. You can either adjust individual values or pass in a new subclass of a component. All parameters of a component are passed as arguments to its constructor.
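from unstable import train, get_algorithm_config

# BaseModelSampler must also be imported from the library; its import path is not shown here.
class MyModelSampler(BaseModelSampler):
    ...

# Start from the default REINFORCE configuration and override selected values.
config = get_algorithm_config("reinforce")
config['learner']['learning_rate'] = 1e-5
config['learner']['grad_clip'] = 0.2
config['replay_buffer']['max_buffer_size'] = 800

# Swap in the custom component by passing the class itself.
config['model_sampler']['type'] = MyModelSampler

# Launch training; the return value is stored as the checkpoint path.
checkpoint_path = train(config)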
This makes it possible to isolate research on a specific component. For example, to experiment with a new curriculum learning strategy, you can create a new environment sampler and pass it in through the config dictionary. You can then measure the effect of the new strategy on its own, while the rest of the configuration stays unchanged.
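One way to keep such a comparison honest is to derive each experimental config from the baseline and then check that only the intended key differs. The sketch below is a minimal illustration in plain Python. It assumes the config is a nested dictionary, as the example above suggests. The helpers `with_overrides` and `changed_keys`, the `MyEnvSampler` class, and the trimmed `baseline` dictionary are all hypothetical stand-ins, not part of the library.

```python
import copy

def with_overrides(base, overrides):
    """Return a deep copy of `base` with nested `overrides` merged in."""
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = with_overrides(result[key], value)
        else:
            result[key] = value
    return result

def changed_keys(a, b, prefix=""):
    """List the dotted paths whose values differ between two nested dicts."""
    paths = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        if isinstance(a.get(key), dict) and isinstance(b.get(key), dict):
            paths += changed_keys(a[key], b[key], prefix=path + ".")
        elif a.get(key) != b.get(key):
            paths.append(path)
    return paths

# Stand-in for get_algorithm_config("reinforce"); only a few keys are shown.
baseline = {
    "env_sampler": {"type": "random"},
    "learner": {"learning_rate": 1e-6, "grad_clip": 0.2},
}

class MyEnvSampler:  # hypothetical custom curriculum sampler
    pass

# The variant differs from the baseline in exactly one place.
variant = with_overrides(baseline, {"env_sampler": {"type": MyEnvSampler}})
print(changed_keys(baseline, variant))
```

Because `with_overrides` copies the baseline before merging, the baseline dictionary can be reused for further variants without carrying over earlier changes.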
You can find all default configuration files in the unstable/config folder.
unstable.config.reinforce.yaml
run: "default"
project: "Test"
model_name: &model_name "Qwen/Qwen3-1.7B-Base"
collection_workers: 250
evaluation_workers: 128
evaluation_every_iterations: 20
evaluation_runs_per_env: 256
logging_dir: "outputs"
training_iterations: &training_iterations 2000

checkpoint:
  policy:
    uid: "base"
    path: null
    iteration: 0
    wandb_id: null

env_sampler:
  type: "random"
  train:
    - id: "ConnectFour-v0-train"
      num_players: 2
      num_actors: 2
      prompt_template: "qwen3-zs"
  eval:
    - id: "SimpleTak-v0-train"
      num_players: 2
      prompt_template: "qwen3-zs"
      fixed_opponent: "google/gemini-2.5-flash-lite"

action_sampler:
  type: "default"

model_sampler:
  type: "mirror"
  opponent_top_k: 20
  opponent_temperature: 0.1

learner:
  num_gpus: 1
  type: "reinforce"
  total_training_steps: *training_iterations
  model_name: *model_name
  local_batch_size: 384
  micro_batch_size: 1
  learning_rate: 0.000001
  lr_scheduler_type: "constant"
  lr_warmup_ratio: 0.01
  epochs: 2
  grad_clip: 0.2
  max_train_len: null
  max_generation_len: &max_generation_len 4096
  lora_cfg: &lora_cfg
    lora_rank: 32
    lora_alpha: 32
    lora_dropout: 0.0
    target_modules:
      - "q_proj"
      - "k_proj"
      - "v_proj"
      - "o_proj"
      - "gate_proj"
      - "up_proj"
      - "down_proj"
  activation_checkpointing: false
  gradient_checkpointing: false
  use_trainer_cache: false

replay_buffer:
  type: "step_buffer"
  max_buffer_size: 768
  reward_transformations:
    final:
      role_advantage: {}
    step:
      format_reward:
        reward: 0.0
        penalty: -0.1
      invalid_move_penalty:
        reward: 0.0
        penalty: -0.1
    sampling:
      normalize_by_env:
        z_score: true

vllm_config:
  model_name: *model_name
  temperature: 0.7
  top_k: 50
  top_p: 0.95
  max_tokens: *max_generation_len
  max_parallel_seq: 100
  max_loras: 8
  max_model_len: 8192
  lora_config: *lora_cfg
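
The file uses YAML anchors and aliases (for example &model_name and *model_name) so that several sections share one value. After parsing, scalar values such as the model name are separate dictionary entries, so overriding only one of them in Python leaves the others unchanged. The sketch below shows one way to keep them aligned. It assumes the loaded config is a nested dictionary that mirrors the YAML. The `ALIASED` table and the helpers `set_linked` and `out_of_sync` are hypothetical, written for illustration, and the model name "my-org/my-model" is a placeholder.

```python
# Dotted paths that the YAML ties together through anchors and aliases.
ALIASED = {
    "model_name": ["learner.model_name", "vllm_config.model_name"],
    "training_iterations": ["learner.total_training_steps"],
    "learner.max_generation_len": ["vllm_config.max_tokens"],
}

def get_path(config, dotted):
    """Read a value from a nested dict using a dotted path."""
    node = config
    for key in dotted.split("."):
        node = node[key]
    return node

def set_path(config, dotted, value):
    """Write a value into a nested dict using a dotted path."""
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node[key]
    node[leaf] = value

def set_linked(config, source, value):
    """Set `source` and every path that aliases it to the same value."""
    for path in [source, *ALIASED[source]]:
        set_path(config, path, value)

def out_of_sync(config):
    """Return (source, target) pairs whose values no longer match."""
    return [
        (source, target)
        for source, targets in ALIASED.items()
        for target in targets
        if get_path(config, source) != get_path(config, target)
    ]

# Trimmed stand-in for the parsed reinforce.yaml.
config = {
    "model_name": "Qwen/Qwen3-1.7B-Base",
    "training_iterations": 2000,
    "learner": {
        "model_name": "Qwen/Qwen3-1.7B-Base",
        "total_training_steps": 2000,
        "max_generation_len": 4096,
    },
    "vllm_config": {"model_name": "Qwen/Qwen3-1.7B-Base", "max_tokens": 4096},
}

# Overriding a single entry leaves the other copies on the old value.
config["learner"]["model_name"] = "my-org/my-model"  # placeholder name
drift = out_of_sync(config)

# Updating through the helper keeps every linked entry aligned.
set_linked(config, "model_name", "my-org/my-model")
```

Running a check like `out_of_sync` just before calling `train` is a cheap way to catch a learner and a vLLM engine that were accidentally pointed at different models or generation lengths.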