Configuration

Unstable Baselines follows a modular architecture: every component can be configured, customized, and extended. The default configuration used for REINFORCE is listed further down this page. You can either adjust individual values or pass in a new subclass of a module. All parameters of a component are passed as arguments to its constructor.

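from unstable import train, get_algorithm_config
# BaseModelSampler must be imported as well; its module path is not shown here.

class MyModelSampler(BaseModelSampler):
    ...

# Load the default REINFORCE configuration, then override individual entries.
config = get_algorithm_config("reinforce")
config['learner']['learning_rate'] = 1e-5
config['learner']['grad_clip'] = 0.2
config['replay_buffer']['max_buffer_size'] = 800
config['model_sampler']['type'] = MyModelSampler  # swap in the custom sampler class
checkpoint_path = train(config)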

This makes it possible to isolate research to a specific component. For example, to experiment with a new curriculum learning strategy, you can create a new environment sampler and set it as an entry in the config dictionary. You can then measure the effect of the new strategy specifically, while the rest of the configuration stays unchanged.
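To make the "change one component, keep the rest fixed" idea concrete, here is a minimal sketch in plain Python. It does not use the library: `DEFAULTS` is a stand-in that copies a few default values from this page, and `with_overrides`, `diff_configs`, and `MyEnvSampler` are hypothetical names introduced only for illustration. It also assumes that `env_sampler.type` accepts a class in the same way `model_sampler.type` does in the example above.

```python
import copy

# Stand-in for the dictionary returned by get_algorithm_config("reinforce");
# only a few of the default values listed on this page are reproduced here.
DEFAULTS = {
    "learner": {"learning_rate": 1e-6, "grad_clip": 0.2},
    "replay_buffer": {"max_buffer_size": 768},
    "env_sampler": {"type": "random"},
}


def with_overrides(config, overrides):
    """Return a deep copy of `config` with dotted-key overrides applied."""
    new_config = copy.deepcopy(config)  # the defaults are never mutated
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = new_config
        for key in parents:
            node = node[key]  # raises KeyError on a misspelled section
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = value
    return new_config


def diff_configs(base, changed, prefix=""):
    """List (dotted_key, old, new) for every leaf value that differs."""
    changes = []
    for key in sorted(set(base) | set(changed)):
        path = f"{prefix}{key}"
        old, new = base.get(key), changed.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            changes.extend(diff_configs(old, new, prefix=f"{path}."))
        elif old != new:
            changes.append((path, old, new))
    return changes


class MyEnvSampler:  # hypothetical custom curriculum sampler
    pass


experiment = with_overrides(DEFAULTS, {"env_sampler.type": MyEnvSampler})
changes = diff_configs(DEFAULTS, experiment)
print(changes)  # the only entry that differs from the defaults
```

Logging the diff next to each run is one way to confirm that an experiment differs from the baseline only in the component under study.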

You can find all default configuration files in the unstable/config folder.

unstable.config.reinforce.yaml

run: "default"
project: "Test"
model_name: &model_name "Qwen/Qwen3-1.7B-Base"
collection_workers: 250
evaluation_workers: 128
evaluation_every_iterations: 20
evaluation_runs_per_env: 256
logging_dir: "outputs"
training_iterations: &training_iterations 2000

checkpoint:
  policy:
    uid: "base"
    path: null
  iteration: 0
  wandb_id: null

env_sampler:
  type: "random"
  train:
    - id: "ConnectFour-v0-train"
      num_players: 2
      num_actors: 2
      prompt_template: "qwen3-zs"
  eval:
    - id: "SimpleTak-v0-train"
      num_players: 2
      prompt_template: "qwen3-zs"
      fixed_opponent: "google/gemini-2.5-flash-lite"

action_sampler:
  type: "default"

model_sampler:
  type: "mirror"
  opponent_top_k: 20
  opponent_temperature: 0.1

learner:
  num_gpus: 1
  type: "reinforce"
  total_training_steps: *training_iterations
  model_name: *model_name
  local_batch_size: 384
  micro_batch_size: 1
  learning_rate: 0.000001
  lr_scheduler_type: "constant"
  lr_warmup_ratio: 0.01
  epochs: 2
  grad_clip: 0.2
  max_train_len: null
  max_generation_len: &max_generation_len 4096
  lora_cfg: &lora_cfg
    lora_rank: 32
    lora_alpha: 32
    lora_dropout: 0.0
    target_modules:
      - "q_proj"
      - "k_proj"
      - "v_proj"
      - "o_proj"
      - "gate_proj"
      - "up_proj"
      - "down_proj"
  activation_checkpointing: false
  gradient_checkpointing: false
  use_trainer_cache: false

replay_buffer:
  type: "step_buffer"
  max_buffer_size: 768
  reward_transformations:
    final:
      role_advantage: {}
    step:
      format_reward:
        reward: 0.0
        penalty: -0.1
      invalid_move_penalty:
        reward: 0.0
        penalty: -0.1
    sampling:
      normalize_by_env:
        z_score: true

vllm_config:
  model_name: *model_name
  temperature: 0.7
  top_k: 50
  top_p: 0.95
  max_tokens: *max_generation_len
  max_parallel_seq: 100
  max_loras: 8
  max_model_len: 8192
  lora_config: *lora_cfg
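
Several values in this file are shared through YAML anchors (`&model_name`, `&max_generation_len`, `&lora_cfg`) and reused through aliases (`*model_name`, ...), which keeps the learner and the vLLM settings in sync. When values are overridden from Python instead, that link is no longer enforced. The sketch below shows a few consistency checks one might run before training. The relationships it checks are assumptions about how the fields interact, not rules documented by the library, and `check_config` is a hypothetical helper. `DEFAULT_CONFIG` is a stand-in holding only the relevant defaults from the file above.

```python
import copy

# Stand-in holding only the fields needed for the checks, with the default
# values from the YAML file above.
DEFAULT_CONFIG = {
    "learner": {
        "model_name": "Qwen/Qwen3-1.7B-Base",
        "local_batch_size": 384,
        "micro_batch_size": 1,
    },
    "replay_buffer": {"max_buffer_size": 768},
    "vllm_config": {
        "model_name": "Qwen/Qwen3-1.7B-Base",
        "max_tokens": 4096,
        "max_model_len": 8192,
    },
}


def check_config(config):
    """Return a list of warnings; an empty list means no issue was found."""
    learner = config["learner"]
    vllm = config["vllm_config"]
    buffer = config["replay_buffer"]
    problems = []

    # Assumption: a batch is processed as a whole number of micro-batches.
    if learner["local_batch_size"] % learner["micro_batch_size"] != 0:
        problems.append("local_batch_size is not a multiple of micro_batch_size")

    # Generated tokens have to fit into the model's context window.
    if vllm["max_tokens"] > vllm["max_model_len"]:
        problems.append("vllm_config.max_tokens exceeds vllm_config.max_model_len")

    # In the YAML both names come from the same anchor; check it still holds.
    if learner["model_name"] != vllm["model_name"]:
        problems.append("learner and vllm_config use different model names")

    # Assumption: the buffer should be able to hold at least one full batch.
    if buffer["max_buffer_size"] < learner["local_batch_size"]:
        problems.append("replay buffer is smaller than one training batch")

    return problems


default_problems = check_config(DEFAULT_CONFIG)

# Example of an override that breaks one of the assumed relationships.
modified = copy.deepcopy(DEFAULT_CONFIG)
modified["vllm_config"]["max_tokens"] = 16384
modified_problems = check_config(modified)

print(default_problems)
print(modified_problems)
```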