Configuration

Unstable Baselines follows a modular architecture: every component can be configured, customized, and extended independently. The example below starts from the default REINFORCE configuration. You can either change individual values or replace a component with your own subclass of the corresponding module. All parameters in a component's section are passed as arguments to that component's constructor.

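from unstable import train, get_algorithm_config
# BaseModelSampler must be imported as well; its module path is not shown here.

# Custom component: subclass the base class and override what you need.
class MyModelSampler(BaseModelSampler):
    ...

# Start from the default REINFORCE configuration and override selected values.
config = get_algorithm_config("reinforce")
config['learner']['learning_rate'] = 1e-5
config['learner']['grad_clip'] = 0.2
config['replay_buffer']['max_buffer_size'] = 800
# Swap the default model sampler for the custom subclass.
config['model_sampler']['type'] = MyModelSampler
checkpoint_path = train(config)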
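When many values change at once, it can help to collect the overrides in one place and reject misspelled keys early. The following is a minimal sketch of that idea, not part of the library: `apply_overrides` is a hypothetical helper, and `base_config` is a small stand-in for the dictionary returned by `get_algorithm_config("reinforce")`.

```python
from copy import deepcopy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted-path overrides applied.

    Hypothetical helper (not part of Unstable Baselines). Unknown keys
    raise KeyError so that typos do not silently create new entries.
    """
    result = deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            if key not in node or not isinstance(node[key], dict):
                raise KeyError(f"unknown config section {key!r} in {dotted_key!r}")
            node = node[key]
        if leaf not in node:
            raise KeyError(f"unknown config key {dotted_key!r}")
        node[leaf] = value
    return result


# Stand-in for get_algorithm_config("reinforce"); only a few keys from the
# default configuration are reproduced here.
base_config = {
    "learner": {"learning_rate": 1e-6, "grad_clip": 0.2},
    "replay_buffer": {"max_buffer_size": 768},
}

new_config = apply_overrides(base_config, {
    "learner.learning_rate": 1e-5,
    "replay_buffer.max_buffer_size": 800,
})
```

Because the helper works on a deep copy, `base_config` keeps its default values and can be reused for further variants.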

This makes it possible to isolate research on a specific component. For example, to experiment with a new curriculum learning strategy, you can create a new environment sampler and set it in the config dictionary. You can then measure the effect of the new strategy specifically, because the rest of the configuration stays unchanged.
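One way to confirm that an experiment really changes only one component is to diff the experiment config against the baseline before training. The sketch below uses plain dictionaries as stand-ins for the real configuration. `changed_paths` is a hypothetical helper and `"my_curriculum"` is a made-up sampler name used only for illustration.

```python
from copy import deepcopy


def changed_paths(base, variant, prefix=""):
    """List dotted paths whose values differ between two nested dicts."""
    paths = []
    for key in sorted(set(base) | set(variant)):
        path = f"{prefix}{key}"
        if key not in base or key not in variant:
            # Key was added or removed in the variant.
            paths.append(path)
        elif isinstance(base[key], dict) and isinstance(variant[key], dict):
            # Recurse into nested sections.
            paths.extend(changed_paths(base[key], variant[key], prefix=f"{path}."))
        elif base[key] != variant[key]:
            paths.append(path)
    return paths


# Stand-in for the default configuration (a few keys only).
baseline = {
    "env_sampler": {"type": "random"},
    "model_sampler": {"type": "mirror", "opponent_top_k": 20},
    "learner": {"type": "reinforce", "learning_rate": 1e-6},
}

# Experiment: replace only the environment sampler.
experiment = deepcopy(baseline)
experiment["env_sampler"]["type"] = "my_curriculum"  # hypothetical sampler name

diff = changed_paths(baseline, experiment)
```

Here `diff` contains a single entry for the environment sampler, which is the property you want before attributing a change in results to the new strategy.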

You can find all default configuration files in the unstable/config folder.


unstable.config.reinforce.yaml

run: "default"
project: "Test"
model_name: &model_name "Qwen/Qwen3-1.7B-Base"
collection_workers: 250
evaluation_workers: 128
evaluation_every_iterations: 20
evaluation_runs_per_env: 256
logging_dir: "outputs"
training_iterations: &training_iterations 2000

checkpoint:
  policy:
    uid: "base"
    path: null
  iteration: 0
  wandb_id: null

env_sampler:
  type: "random"
  train:
    - id: "ConnectFour-v0-train"
      num_players: 2
      num_actors: 2
      prompt_template: "qwen3-zs"
  eval:
    - id: "SimpleTak-v0-train"
      num_players: 2
      prompt_template: "qwen3-zs"
      fixed_opponent: "google/gemini-2.5-flash-lite"

action_sampler:
  type: "default"

model_sampler:
  type: "mirror"
  opponent_top_k: 20
  opponent_temperature: 0.1

learner:
  num_gpus: 1
  type: "reinforce"
  total_training_steps: *training_iterations
  model_name: *model_name
  local_batch_size: 384
  micro_batch_size: 1
  learning_rate: 0.000001
  lr_scheduler_type: "constant"
  lr_warmup_ratio: 0.01
  epochs: 2
  grad_clip: 0.2
  max_train_len: null
  max_generation_len: &max_generation_len 4096
  lora_cfg: &lora_cfg
    lora_rank: 32
    lora_alpha: 32
    lora_dropout: 0.0
    target_modules:
      - "q_proj"
      - "k_proj"
      - "v_proj"
      - "o_proj"
      - "gate_proj"
      - "up_proj"
      - "down_proj"
  activation_checkpointing: false
  gradient_checkpointing: false
  use_trainer_cache: false

replay_buffer:
  type: "step_buffer"
  max_buffer_size: 768
  reward_transformations:
    final:
      role_advantage: {}
    step:
      format_reward:
        reward: 0.0
        penalty: -0.1
      invalid_move_penalty:
        reward: 0.0
        penalty: -0.1
    sampling:
      normalize_by_env:
        z_score: true

vllm_config:
  model_name: *model_name
  temperature: 0.7
  top_k: 50
  top_p: 0.95
  max_tokens: *max_generation_len
  max_parallel_seq: 100
  max_loras: 8
  max_model_len: 8192
  lora_config: *lora_cfg
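The YAML file uses anchors and aliases (for example `&model_name` and `*model_name`) to keep related values in sync. Once the file is loaded, those aliases are ordinary values, so reassigning `config['learner']['model_name']` in Python does not update `config['vllm_config']['model_name']`. The sketch below is a hypothetical sanity check that could be run after editing a config. The function `check_config` is not part of the library, and apart from the model-name consistency check, the rules it applies are assumptions about sensible settings rather than documented constraints.

```python
def check_config(config):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper. The rules are assumptions, not library guarantees.
    """
    problems = []
    learner = config["learner"]
    vllm = config["vllm_config"]
    replay = config["replay_buffer"]

    # Aliased in the YAML file, but independent once loaded.
    if learner["model_name"] != vllm["model_name"]:
        problems.append("learner and vllm_config use different model names")

    # Assumption: a local batch is split evenly into micro-batches.
    if learner["local_batch_size"] % learner["micro_batch_size"] != 0:
        problems.append("local_batch_size is not a multiple of micro_batch_size")

    # Generated tokens must fit inside the model's context window.
    if vllm["max_tokens"] > vllm["max_model_len"]:
        problems.append("max_tokens exceeds max_model_len")

    # Assumption: the buffer should be able to hold at least one batch.
    if replay["max_buffer_size"] < learner["local_batch_size"]:
        problems.append("max_buffer_size is smaller than local_batch_size")

    return problems


# Values copied from the default configuration above.
default_like = {
    "learner": {
        "model_name": "Qwen/Qwen3-1.7B-Base",
        "local_batch_size": 384,
        "micro_batch_size": 1,
    },
    "replay_buffer": {"max_buffer_size": 768},
    "vllm_config": {
        "model_name": "Qwen/Qwen3-1.7B-Base",
        "max_tokens": 4096,
        "max_model_len": 8192,
    },
}

problems = check_config(default_like)
```

With the default values, none of these checks reports a problem. Changing the model name in only one of the two sections would be flagged by the first check.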