Game Scheduler

The game scheduler starts games for both training and evaluation. It balances work across GPU actors and streams results to the replay buffer.

API Reference

class unstable.collection.game_scheduler.GameScheduler(vllm_config, tracker, buffer, model_sampler, env_sampler, action_sampler: str = 'default')
Parameters:
  • vllm_config (Mapping[str, Any]) – Configuration passed to each VLLMActor (see section vLLM Configuration below).

  • tracker (ray.actor.ActorHandle) – Actor that provides logging and Weights & Biases integration.

  • buffer (ray.actor.ActorHandle) – Actor that stores training trajectories and signals when collection should stop (see collect).

  • model_sampler (BaseModelSampler) – Component that provides the models for the next game.

  • env_sampler (BaseEnvSampler) – Component that provides the environment for the next game.

  • action_sampler (str, optional) – Name of the action-sampling strategy to use for evaluation. Defaults to “default”, which samples a single action from the model.

Methods

collect(num_train_workers: int, num_eval_workers: int | None = None)

Schedules training (and optionally evaluation) games until the buffer signals to stop.

Parameters:
  • num_train_workers (int) – Maximum number of concurrent training episodes.

  • num_eval_workers (Optional[int]) – Maximum number of concurrent evaluation episodes. If None, no evaluation episodes are scheduled.
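The scheduling behaviour described for collect can be illustrated with a small stand-alone sketch. It is not the library's implementation: collect_sketch, run_episode, and should_stop are hypothetical names, and a thread pool stands in for the Ray actors. It only shows the idea of keeping at most num_train_workers (and, if given, num_eval_workers) episodes in flight until a stop signal arrives.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def collect_sketch(run_episode, should_stop, num_train_workers, num_eval_workers=None):
    """Keep episodes in flight, within per-kind limits, until should_stop() is True."""
    if num_train_workers < 1:
        raise ValueError("num_train_workers must be at least 1")
    # None means that no evaluation episodes are scheduled.
    limits = {"train": num_train_workers, "eval": num_eval_workers or 0}
    in_flight = {}  # maps each future to its kind ("train" or "eval")
    finished = {"train": 0, "eval": 0}

    with ThreadPoolExecutor(max_workers=sum(limits.values())) as pool:
        while not should_stop():
            # Top up each kind to its concurrency limit.
            for kind, limit in limits.items():
                running = sum(1 for k in in_flight.values() if k == kind)
                for _ in range(limit - running):
                    in_flight[pool.submit(run_episode, kind)] = kind
            # Block until at least one episode finishes, then free its slot.
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any error from the episode
                finished[in_flight.pop(future)] += 1
        # Stop was signalled: let the remaining episodes finish.
        for future in list(in_flight):
            future.result()
            finished[in_flight.pop(future)] += 1
    return finished


# Demo with a fake episode that records how many episodes run at once.
lock = threading.Lock()
state = {"running": 0, "peak": 0, "done": 0}


def fake_episode(kind):
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.01)  # stands in for playing a game
    with lock:
        state["running"] -= 1
        state["done"] += 1


# The lambda plays the role of the buffer's stop signal.
result = collect_sketch(fake_episode, lambda: state["done"] >= 10, num_train_workers=3)
```

In this sketch the stop condition is a plain callable; in the real class the signal comes from the buffer actor, and how it is queried is not shown here.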

vLLM Configuration

model_name

HuggingFace or local model identifier.

temperature

Sampling temperature used during text generation (must be ≥ 0.0).

max_tokens

Maximum number of tokens to generate per sequence.

max_parallel_seq

Maximum number of sequences processed in parallel on a single actor.

max_loras

Maximum number of concurrently loaded LoRA adapters.

max_model_len

Maximum context window length for the underlying model.

lora_config

Optional configuration for LoRA fine-tuning adapters.

  • lora_rank – Rank of the LoRA projection matrices.

  • lora_alpha – Scaling factor for the LoRA updates.

  • lora_dropout – Dropout probability applied to LoRA layers (range 0.0–1.0).

  • target_modules – List of target submodules where LoRA adapters are applied.

Typical target modules include:

  • q_proj

  • k_proj

  • v_proj

  • o_proj

  • gate_proj

  • up_proj

  • down_proj
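As a rough illustration of how these keys fit together, the following sketch builds a vllm_config mapping and checks the two constraints stated above (temperature ≥ 0.0 and lora_dropout within 0.0–1.0). All values are placeholders chosen for the example, the model identifier is not a real model, and check_vllm_config is a hypothetical helper rather than part of the library. The positive-integer check on the size limits is an assumption, not a documented requirement.

```python
from typing import Any, List, Mapping

# Illustrative values only; replace them with settings that suit your model and hardware.
vllm_config = {
    "model_name": "your-org/your-model",  # placeholder identifier, not a real model
    "temperature": 0.7,
    "max_tokens": 512,
    "max_parallel_seq": 64,
    "max_loras": 4,
    "max_model_len": 4096,
    "lora_config": {
        "lora_rank": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.0,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    },
}


def check_vllm_config(cfg: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # Documented constraint: temperature must be >= 0.0.
    if cfg["temperature"] < 0.0:
        problems.append("temperature must be >= 0.0")
    # Assumption: the size limits should be positive integers.
    for key in ("max_tokens", "max_parallel_seq", "max_loras", "max_model_len"):
        if not (isinstance(cfg[key], int) and cfg[key] > 0):
            problems.append(f"{key} should be a positive integer")
    # lora_config is optional; check it only when present.
    lora = cfg.get("lora_config")
    if lora is not None:
        # Documented constraint: lora_dropout lies in the range 0.0-1.0.
        if not 0.0 <= lora["lora_dropout"] <= 1.0:
            problems.append("lora_dropout must be within 0.0-1.0")
        if not lora["target_modules"]:
            problems.append("target_modules should list at least one submodule")
    return problems


problems = check_vllm_config(vllm_config)
```

A mapping like this would be passed as the vllm_config argument of GameScheduler. Running a check of this kind before starting the actors can surface a bad value early.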