Replay Buffers

We provide two replay buffers for multi-agent game collection. StepBuffer stores and samples individual steps across episodes, while EpisodeBuffer stores and samples entire episodes.

Both buffers support three kinds of reward transformations:

  • Final reward transformation: Applied once per trajectory; its result is passed into the step-level computations.

  • Step reward transformation: Applied per step, typically using the final/episode reward.

  • Sampling reward transformation: Applied at sampling time to the batch being returned.
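The order of the three stages can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function names `final_transform`, `step_transform`, and `sampling_transform`, the dictionary-based step records, and the specific transformations (clipping, discounting, mean-centering) are all hypothetical stand-ins for the Compose* objects.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical stand-ins for the three stages; the real Compose* classes
# may use different call signatures.

def final_transform(final_reward: float) -> float:
    """Stage 1, once per trajectory: clip the episode reward to [-1, 1]."""
    return max(-1.0, min(1.0, final_reward))

def step_transform(episode_reward: float, step_index: int, num_steps: int) -> float:
    """Stage 2, per step: discount earlier steps with gamma = 0.5."""
    gamma = 0.5
    return episode_reward * gamma ** (num_steps - 1 - step_index)

def sampling_transform(batch: List[Dict]) -> List[Dict]:
    """Stage 3, at sampling time: center rewards on the batch mean."""
    baseline = mean(step["reward"] for step in batch)
    return [{**step, "reward": step["reward"] - baseline} for step in batch]

# One trajectory with three steps and a raw final reward of 2.0.
episode_reward = final_transform(2.0)
stored = [
    {"index": i, "reward": step_transform(episode_reward, i, 3)}
    for i in range(3)
]
batch = sampling_transform(stored)
```

The first two stages run when a trajectory is added, so their results are what gets stored; the third runs only on the batch that is about to be returned.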

API Reference

class BaseBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = 'random')

Abstract base with a minimal interface common to both buffers.

Parameters:
  • max_buffer_size (int) – Maximum number of steps the buffer can hold in total.

  • tracker (unstable.collection.trackers.BaseTracker) – Tracker for logging.

  • final_reward_transformation (Optional[ComposeFinalRewardTransforms]) – Transformation applied once per trajectory to produce an episode-level reward.

  • step_reward_transformation (Optional[ComposeStepRewardTransforms]) – Per-step transformation producing the step reward.

  • sampling_reward_transformation (Optional[ComposeSamplingRewardTransforms]) – Transformation applied when sampling a batch.

  • buffer_strategy (str) – Strategy used to reduce the buffer when it is full. Currently only "random" downsampling is implemented.

Expected methods

add_player_trajectory(player_traj, env_id)

Add one complete trajectory to the buffer.

get_batch(batch_size)

Sample and remove a batch from the buffer.

stop()

Signal collection to stop.

size() → int

Current number of stored steps.

continue_collection() → bool

Whether the buffer is still accepting data.

clear()

Remove all stored data.
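To make the shape of this interface concrete, here is a sketch of an abstract class with the six methods listed above, plus a toy subclass. `BufferInterface` and `FifoBuffer` are hypothetical names for illustration; they are not the library's BaseBuffer, they omit the constructor arguments, and the toy subclass returns the oldest steps first instead of sampling randomly.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class BufferInterface(ABC):
    """Hypothetical abstract interface mirroring the six methods listed above."""

    @abstractmethod
    def add_player_trajectory(self, player_traj: Any, env_id: str) -> None: ...

    @abstractmethod
    def get_batch(self, batch_size: int) -> List[Any]: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def continue_collection(self) -> bool: ...

    @abstractmethod
    def clear(self) -> None: ...


class FifoBuffer(BufferInterface):
    """Toy subclass: keeps steps in a list and returns the oldest ones first."""

    def __init__(self) -> None:
        self.steps: List[Any] = []
        self.collect = True

    def add_player_trajectory(self, player_traj: Any, env_id: str) -> None:
        # Assumption: the trajectory is treated as a plain iterable of steps.
        self.steps.extend(player_traj)

    def get_batch(self, batch_size: int) -> List[Any]:
        # Return and remove the first batch_size steps (FIFO, not random).
        batch, self.steps = self.steps[:batch_size], self.steps[batch_size:]
        return batch

    def stop(self) -> None:
        self.collect = False

    def size(self) -> int:
        return len(self.steps)

    def continue_collection(self) -> bool:
        return self.collect

    def clear(self) -> None:
        self.steps.clear()
```

Because every method is abstract, instantiating `BufferInterface` directly raises `TypeError`; only subclasses that implement all six methods can be created.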

StepBuffer

class unstable.collection.buffers.StepBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = 'random')

Stores steps from incoming trajectories in a single flat list and samples individual steps.

Parameters:
  • max_buffer_size (int) – Upper bound on number of stored steps. When exceeded, the buffer downsamples.

  • tracker (ray.actor.ActorHandle) – Tracker for logging.

  • final_reward_transformation (Optional[ComposeFinalRewardTransforms]) – Applied once per trajectory to produce a reward signal passed to per-step logic.

  • step_reward_transformation (Optional[ComposeStepRewardTransforms]) – Applied to each step (given the trajectory and index) to yield the stored step reward.

  • sampling_reward_transformation (Optional[ComposeSamplingRewardTransforms]) – Applied to the sampled batch of steps before returning it.

  • buffer_strategy (str) – Downsampling policy. Currently only "random" is supported.

Methods

add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)

Add the trajectory to the buffer. Rewards are computed in two stages:

  1. Once per trajectory, apply the final reward transformation if provided; otherwise use the trajectory's final reward unchanged.

  2. For each step, apply the step reward transformation if provided; otherwise use the final reward as the step reward.

After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples.

Parameters:
  • player_traj (PlayerTrajectory) – The trajectory.

  • env_id (str) – Environment identifier.
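The insertion flow for a step-level buffer can be sketched as below. Everything here is a hypothetical stand-in: `ToyTrajectory` and `ToyStep` are simplified replacements for PlayerTrajectory and Step, the callable signatures for the two transformations are assumed, and the downsampling shown (uniformly dropping steps until the limit is met) is only one plausible reading of the "random" strategy.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for PlayerTrajectory."""
    observations: List[str]
    actions: List[str]
    final_reward: float


@dataclass
class ToyStep:
    """Hypothetical stand-in for Step."""
    env_id: str
    obs: str
    act: str
    reward: float


def add_trajectory(
    steps: List[ToyStep],
    traj: ToyTrajectory,
    env_id: str,
    max_buffer_size: int,
    final_tf: Optional[Callable[[float], float]] = None,
    step_tf: Optional[Callable[[ToyTrajectory, int, float], float]] = None,
) -> None:
    # Final reward: transformed if a transformation is given, else used as-is.
    reward = final_tf(traj.final_reward) if final_tf else traj.final_reward
    for i, (obs, act) in enumerate(zip(traj.observations, traj.actions)):
        # Step reward: transformed if given, else falls back to the final reward.
        step_reward = step_tf(traj, i, reward) if step_tf else reward
        steps.append(ToyStep(env_id, obs, act, step_reward))
    # "random" strategy (assumed): drop uniformly chosen steps until the buffer fits.
    excess = len(steps) - max_buffer_size
    if excess > 0:
        for idx in sorted(random.sample(range(len(steps)), excess), reverse=True):
            del steps[idx]


# Usage: two three-step trajectories into a buffer capped at five steps.
buffer: List[ToyStep] = []
traj = ToyTrajectory(["o0", "o1", "o2"], ["a0", "a1", "a2"], final_reward=1.0)
add_trajectory(buffer, traj, "env-A", max_buffer_size=5)
add_trajectory(buffer, traj, "env-A", max_buffer_size=5)
```

With no transformations supplied, every stored step carries the trajectory's final reward, and the second call triggers downsampling because six steps exceed the cap of five.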

get_batch(batch_size: int) → List[Step]

Sample without replacement a set of steps and remove them from the buffer.

Parameters:
  • batch_size (int) – Number of steps to sample.
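Sampling without replacement and removing the sampled items can be sketched with the standard library alone. The helper name `sample_and_remove` is hypothetical, and how the real buffer behaves when fewer than batch_size steps are stored is not specified here; in this sketch `random.sample` raises `ValueError` in that case.

```python
import random
from typing import List, TypeVar

T = TypeVar("T")


def sample_and_remove(items: List[T], batch_size: int) -> List[T]:
    """Draw batch_size distinct items uniformly at random and delete them from items."""
    # random.sample returns distinct indices; it raises ValueError if
    # batch_size exceeds len(items).
    chosen = random.sample(range(len(items)), batch_size)
    batch = [items[i] for i in chosen]
    # Delete from the highest index down so the remaining indices stay valid.
    for i in sorted(chosen, reverse=True):
        del items[i]
    return batch


stored = list(range(10))
batch = sample_and_remove(stored, 4)
```

Each stored step is returned at most once, since it leaves the buffer as soon as it is sampled.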

stop()

Mark the buffer as closed to further collection. Does not clear data.

size() → int

Current number of stored steps.

continue_collection() → bool

Returns True until stop() is called.

clear()

Remove all stored steps.
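The interplay of stop(), continue_collection(), size(), and clear() can be illustrated with a toy class. `ToyBuffer` is a hypothetical stand-in that implements only these four lifecycle methods as described above; the collection loop and the threshold of eight steps are likewise made up for the example.

```python
class ToyBuffer:
    """Hypothetical buffer exposing only the lifecycle methods described above."""

    def __init__(self) -> None:
        self.steps = []
        self.collect = True

    def stop(self) -> None:
        # Close the buffer to further collection; stored data is kept.
        self.collect = False

    def continue_collection(self) -> bool:
        return self.collect

    def size(self) -> int:
        return len(self.steps)

    def clear(self) -> None:
        self.steps.clear()


buffer = ToyBuffer()
produced = 0
# A collector keeps producing while the buffer accepts data.
while buffer.continue_collection():
    buffer.steps.append(produced)
    produced += 1
    if buffer.size() >= 8:
        buffer.stop()

size_after_stop = buffer.size()    # data survives stop()
buffer.clear()
size_after_clear = buffer.size()   # clear() empties the buffer
```

Keeping stop() and clear() separate lets a consumer drain the remaining data after collection has ended.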

EpisodeBuffer

class unstable.collection.buffers.EpisodeBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = 'random')

Stores entire episodes and samples batches of full episodes.

Parameters:
  • max_buffer_size (int) – Upper bound on number of stored steps. When exceeded, the buffer downsamples.

  • tracker (ray.actor.ActorHandle) – Tracker for logging.

  • final_reward_transformation (Optional[ComposeFinalRewardTransforms]) – Applied once per trajectory to produce a reward signal passed to per-step logic.

  • step_reward_transformation (Optional[ComposeStepRewardTransforms]) – Applied to each step (given the trajectory and index) to yield the stored step reward.

  • sampling_reward_transformation (Optional[ComposeSamplingRewardTransforms]) – Applied to the sampled batch before returning it.

  • buffer_strategy (str) – Downsampling policy. Currently only "random" is supported.

Methods

add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)

Add the trajectory to the buffer. Rewards are computed in two stages:

  1. Once per trajectory, apply the final reward transformation if provided; otherwise use the trajectory's final reward unchanged.

  2. For each step, apply the step reward transformation if provided; otherwise use the final reward as the step reward.

After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples entire episodes.

Parameters:
  • player_traj (PlayerTrajectory) – The source trajectory.

  • env_id (str) – Environment identifier.
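Downsampling at episode granularity, while still measuring capacity in steps, might look like the sketch below. The helper `downsample_episodes` is hypothetical, and the exact policy (removing uniformly chosen episodes one at a time until the step count fits) is an assumption about what "random" means here.

```python
import random
from typing import List


def downsample_episodes(episodes: List[List[str]], max_buffer_size: int) -> None:
    """Remove randomly chosen whole episodes until the total step count fits."""
    total = sum(len(ep) for ep in episodes)
    while total > max_buffer_size and episodes:
        victim = random.randrange(len(episodes))
        # Pop the entire episode; individual episodes are never truncated.
        total -= len(episodes.pop(victim))


# Three episodes with 4, 3, and 5 steps: 12 steps in total.
episodes = [["s"] * 4, ["s"] * 3, ["s"] * 5]
downsample_episodes(episodes, max_buffer_size=8)
```

Because whole episodes are dropped, the buffer can end up below the limit by more than one step, but every remaining episode stays intact.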

get_batch(batch_size: int) → List[List[Step]]

Sample without replacement a set of episodes and remove them from the buffer.

Parameters:
  • batch_size (int) – Number of episodes to sample.
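Note the difference in units: batch_size counts episodes, while size() counts steps. The sketch below illustrates this with hypothetical helpers (`sample_episodes`, `stored_steps`) and tuple-based step records; it is not the library's code.

```python
import random
from typing import List, Tuple

ToyStep = Tuple[str, int]  # hypothetical (episode name, step index) record


def sample_episodes(episodes: List[List[ToyStep]], batch_size: int) -> List[List[ToyStep]]:
    """Draw batch_size distinct episodes at random and remove them from episodes."""
    chosen = random.sample(range(len(episodes)), batch_size)
    batch = [episodes[i] for i in chosen]
    # Delete from the highest index down so the remaining indices stay valid.
    for i in sorted(chosen, reverse=True):
        del episodes[i]
    return batch


def stored_steps(episodes: List[List[ToyStep]]) -> int:
    """Total number of steps across all stored episodes."""
    return sum(len(ep) for ep in episodes)


episodes = [
    [("ep0", t) for t in range(3)],
    [("ep1", t) for t in range(2)],
    [("ep2", t) for t in range(4)],
]
steps_before = stored_steps(episodes)
batch = sample_episodes(episodes, 2)   # two episodes, each a list of steps
steps_after = stored_steps(episodes)
```

The step count drops by the combined length of the sampled episodes, so the change in size() after a call depends on which episodes were drawn.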

stop()

Mark the buffer as closed to further collection. Does not clear data.

size() → int

Current number of stored steps.

continue_collection() → bool

Returns True until stop() is called.

clear()

Remove all stored episodes.