Configuration
=============

Unstable Baselines follows a modular architecture: every component can be configured, customized, and extended.
The default configuration for REINFORCE is listed below. You can either adapt individual values or pass in a new subclass of a module.
All parameters of a component are passed as arguments to its constructor.

.. code-block:: python

    from unstable import train, get_algorithm_config

    # BaseModelSampler must also be imported; its module path is not shown here.
    class MyModelSampler(BaseModelSampler):
        ...

    # Start from the REINFORCE defaults and override selected values.
    config = get_algorithm_config("reinforce")
    config['learner']['learning_rate'] = 1e-5
    config['learner']['grad_clip'] = 0.2
    config['replay_buffer']['max_buffer_size'] = 800
    # Swap in the custom subclass instead of the default "mirror" sampler.
    config['model_sampler']['type'] = MyModelSampler

    checkpoint_path = train(config)

This makes it possible to isolate research on a specific component. For example, to experiment with a new curriculum learning strategy, create a new environment sampler and pass it to the config as a dictionary entry.
You can then measure the effect of the new strategy while the rest of the configuration stays unchanged.

All default configuration files are in the ``unstable/config`` folder.

--------------------------------

**unstable.config.reinforce.yaml**

.. code-block:: yaml
    :linenos:

    run: "default"
    project: "Test"
    model_name: &model_name "Qwen/Qwen3-1.7B-Base"
    collection_workers: 250
    evaluation_workers: 128
    evaluation_every_iterations: 20
    evaluation_runs_per_env: 256
    logging_dir: "outputs"
    training_iterations: &training_iterations 2000

    checkpoint:
      policy:
        uid: "base"
        path: null
        iteration: 0
      wandb_id: null

    env_sampler:
      type: "random"
      train:
        - id: "ConnectFour-v0-train"
          num_players: 2
          num_actors: 2
          prompt_template: "qwen3-zs"
      eval:
        - id: "SimpleTak-v0-train"
          num_players: 2
          prompt_template: "qwen3-zs"
          fixed_opponent: "google/gemini-2.5-flash-lite"

    action_sampler:
      type: "default"

    model_sampler:
      type: "mirror"
      opponent_top_k: 20
      opponent_temperature: 0.1

    learner:
      num_gpus: 1
      type: "reinforce"
      total_training_steps: *training_iterations
      model_name: *model_name
      local_batch_size: 384
      micro_batch_size: 1
      learning_rate: 0.000001
      lr_scheduler_type: "constant"
      lr_warmup_ratio: 0.01
      epochs: 2
      grad_clip: 0.2
      max_train_len: null
      max_generation_len: &max_generation_len 4096
      lora_cfg: &lora_cfg
        lora_rank: 32
        lora_alpha: 32
        lora_dropout: 0.0
        target_modules:
          - "q_proj"
          - "k_proj"
          - "v_proj"
          - "o_proj"
          - "gate_proj"
          - "up_proj"
          - "down_proj"
      activation_checkpointing: false
      gradient_checkpointing: false
      use_trainer_cache: false

    replay_buffer:
      type: "step_buffer"
      max_buffer_size: 768
      reward_transformations:
        final:
          role_advantage: {}
        step:
          format_reward:
            reward: 0.0
            penalty: -0.1
          invalid_move_penalty:
            reward: 0.0
            penalty: -0.1
        sampling:
          normalize_by_env:
            z_score: true

    vllm_config:
      model_name: *model_name
      temperature: 0.7
      top_k: 50
      top_p: 0.95
      max_tokens: *max_generation_len
      max_parallel_seq: 100
      max_loras: 8
      max_model_len: 8192
      lora_config: *lora_cfg
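
The override pattern from the Python example at the top of this page can be wrapped in a small helper when an experiment changes several values at once. The sketch below is not part of Unstable Baselines: ``apply_overrides`` is a hypothetical helper, and it assumes that ``get_algorithm_config`` returns a plain nested dictionary, as the indexing in that example suggests. A hand-written dictionary with a few keys from the YAML file stands in for the real config so the snippet runs on its own.

.. code-block:: python

    from copy import deepcopy
    from typing import Any, Dict


    def apply_overrides(config: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
        """Return a copy of ``config`` with dotted-key overrides applied.

        Hypothetical helper, not an Unstable Baselines API. Unknown keys raise
        KeyError so that a typo does not silently add a new entry.
        """
        updated = deepcopy(config)
        for dotted_key, value in overrides.items():
            *parents, leaf = dotted_key.split(".")
            node = updated
            for key in parents:
                if not isinstance(node.get(key), dict):
                    raise KeyError(f"unknown config section: {dotted_key!r}")
                node = node[key]
            if leaf not in node:
                raise KeyError(f"unknown config key: {dotted_key!r}")
            node[leaf] = value
        return updated


    # Stand-in for the dictionary returned by get_algorithm_config("reinforce");
    # only a few keys from the YAML file are reproduced here.
    base_config = {
        "learner": {"learning_rate": 0.000001, "grad_clip": 0.2},
        "replay_buffer": {"type": "step_buffer", "max_buffer_size": 768},
    }

    experiment_config = apply_overrides(
        base_config,
        {"learner.learning_rate": 1e-5, "replay_buffer.max_buffer_size": 800},
    )

Because the helper works on a deep copy, ``base_config`` keeps the default values, which makes it easy to compare a baseline run against a run that changes only the listed keys.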