Replay Buffers

We provide two replay buffers for multi-agent game collection: StepBuffer stores and samples individual steps across episodes, while EpisodeBuffer stores and samples entire episodes.

Both buffers support three kinds of reward transformations:

Final reward transformation: Applied once per trajectory; its result is passed into the step-level computations.
Step reward transformation: Applied per step, typically using the final (episode-level) reward.
Sampling reward transformation: Applied at sampling time to the batch being returned.
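
How the three stages compose can be sketched as follows. This is a minimal, self-contained illustration rather than the library's implementation: `final_transform`, `step_transform`, and `sampling_transform` are hypothetical stand-ins for the composed transformation objects, and their signatures and the specific arithmetic (clipping, discounting, mean subtraction) are assumptions chosen only to make each stage visible.

```python
from typing import List

# Hypothetical stand-ins; the real transformations are composed objects
# whose call signatures may differ.
def final_transform(final_reward: float) -> float:
    """Applied once per trajectory; here it clips the outcome to [-1, 1]."""
    return max(-1.0, min(1.0, final_reward))

def step_transform(episode_reward: float, step_index: int, num_steps: int) -> float:
    """Applied per step; here it discounts earlier steps with gamma = 0.5."""
    return episode_reward * 0.5 ** (num_steps - 1 - step_index)

def sampling_transform(rewards: List[float]) -> List[float]:
    """Applied to a sampled batch; here it subtracts the batch mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Stage 1: one episode-level reward per trajectory.
episode_reward = final_transform(3.0)  # clipped to 1.0
# Stage 2: one stored reward per step, derived from the episode-level reward.
step_rewards = [step_transform(episode_reward, i, 3) for i in range(3)]  # [0.25, 0.5, 1.0]
# Stage 3: applied only to the batch being returned, not to stored data.
batch_rewards = sampling_transform(step_rewards)
```

The first two stages run when a trajectory is added, so their results are what the buffer stores; the third runs at sampling time on the returned batch.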

API Reference

class BaseBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = 'random')

Abstract base with a minimal interface common to both buffers.

Parameters:

max_buffer_size (int) – Maximum number of steps the buffer can hold in total.
tracker (unstable.collection.trackers.BaseTracker) – Tracker for logging.
final_reward_transformation (Optional[ComposeFinalRewardTransforms]) – Transformation applied once per trajectory to produce an episode-level reward.
step_reward_transformation (Optional[ComposeStepRewardTransforms]) – Per-step transformation producing the step reward.
sampling_reward_transformation (Optional[ComposeSamplingRewardTransforms]) – Transformation applied when sampling a batch.
buffer_strategy (str) – Strategy for downsampling when the buffer is full. Currently only "random" downsampling is implemented.

Expected methods

add_player_trajectory(player_traj, env_id): Add one complete trajectory to the buffer.

get_batch(batch_size): Sample and remove a batch from the buffer.

stop(): Signal collection to stop.

size() → int: Current number of stored steps.

continue_collection() → bool: Whether the buffer is still accepting data.

clear(): Remove all stored data.
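
The expected methods can be restated as an abstract interface, together with a collection loop that relies only on that interface. This is a sketch under stated assumptions: `BufferInterface` and `collect_until_full` are hypothetical names, the constructor arguments are omitted, and the real BaseBuffer may declare these methods differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List

class BufferInterface(ABC):
    """Hypothetical restatement of the expected methods; not the real BaseBuffer."""

    @abstractmethod
    def add_player_trajectory(self, player_traj: Any, env_id: str) -> None: ...

    @abstractmethod
    def get_batch(self, batch_size: int) -> List[Any]: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def continue_collection(self) -> bool: ...

    @abstractmethod
    def clear(self) -> None: ...


def collect_until_full(
    buffer: BufferInterface,
    trajectories: Iterable[Any],
    env_id: str,
    target_steps: int,
) -> None:
    """Feed trajectories until the buffer is stopped or holds enough steps."""
    for traj in trajectories:
        if not buffer.continue_collection():
            break  # stop() was called, so no further data is added
        buffer.add_player_trajectory(traj, env_id)
        if buffer.size() >= target_steps:
            buffer.stop()  # closes collection; stored data stays in place
```

Because the loop touches only the six expected methods, it would work unchanged with either a step-level or an episode-level buffer.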

StepBuffer

class unstable.collection.buffers.StepBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = 'random')

Stores steps from incoming trajectories in a single flat list and samples individual steps.

Parameters:

max_buffer_size (int) – Upper bound on number of stored steps. When exceeded, the buffer downsamples.
tracker (ray.actor.ActorHandle) – Tracker for logging.
final_reward_transformation (Optional[ComposeFinalRewardTransforms]) – Applied once per trajectory to produce a reward signal passed to per-step logic.
step_reward_transformation (Optional[ComposeStepRewardTransforms]) – Applied to each step (given the trajectory and index) to yield the stored step reward.
sampling_reward_transformation (Optional[ComposeSamplingRewardTransforms]) – Applied to the sampled batch of steps before returning it.
buffer_strategy (str) – Downsampling policy. Currently only "random" is supported.

Methods

add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)

Add the trajectory to the buffer. Rewards are computed in two stages:

Once per trajectory, apply the final reward transformation if provided; otherwise use the trajectory's final reward unchanged.
For each step, apply the step reward transformation if provided; otherwise store the step with the final reward from the first stage.

After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples.
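
The insertion logic described above can be sketched as a standalone function. Everything here is an assumption made for illustration: `SketchStep` is a hypothetical stand-in for the library's Step record, the transformation signatures are simplified, and "random" downsampling is taken to mean dropping uniformly chosen steps.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class SketchStep:
    """Hypothetical stand-in for the library's Step record."""
    obs: str
    act: str
    reward: float
    env_id: str

def add_trajectory(
    steps: List[SketchStep],
    observations: List[str],
    actions: List[str],
    final_reward: float,
    env_id: str,
    max_buffer_size: int,
    final_transform: Optional[Callable[[float], float]] = None,
    step_transform: Optional[Callable[[float, int], float]] = None,
) -> None:
    # Once per trajectory: transform the final reward, or keep it unchanged.
    reward = final_transform(final_reward) if final_transform else final_reward
    for i, (obs, act) in enumerate(zip(observations, actions)):
        # Per step: transform, or fall back to the trajectory-level reward.
        step_reward = step_transform(reward, i) if step_transform else reward
        steps.append(SketchStep(obs, act, step_reward, env_id))
    # Assumed "random" strategy: drop uniformly chosen steps until within capacity.
    excess = len(steps) - max_buffer_size
    if excess > 0:
        drop = set(random.sample(range(len(steps)), excess))
        steps[:] = [s for i, s in enumerate(steps) if i not in drop]

buffer_steps: List[SketchStep] = []
add_trajectory(buffer_steps, ["o0", "o1", "o2"], ["a0", "a1", "a2"],
               final_reward=1.0, env_id="toy-env", max_buffer_size=2)
```

With no transformations supplied, every stored step carries the trajectory's final reward, and the three inserted steps are cut back to the capacity of two.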

Parameters:

player_traj (PlayerTrajectory) – The complete trajectory to add.
env_id (str) – Environment identifier.

get_batch(batch_size: int) → List[Step]

Sample a set of steps without replacement and remove them from the buffer.

Parameters: batch_size (int) – Number of steps to sample.
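
Sampling without replacement followed by removal can be sketched as below. `sample_and_remove` is a hypothetical helper, not a library function, and this page does not say how the real get_batch behaves when fewer than batch_size steps are stored; the sketch simply lets `random.sample` raise ValueError in that case.

```python
import random
from typing import List, TypeVar

T = TypeVar("T")

def sample_and_remove(items: List[T], batch_size: int) -> List[T]:
    """Draw batch_size distinct items and delete them from items in place."""
    # Sampling indices (not values) keeps duplicates in the list distinct.
    chosen = set(random.sample(range(len(items)), batch_size))
    batch = [x for i, x in enumerate(items) if i in chosen]
    items[:] = [x for i, x in enumerate(items) if i not in chosen]
    return batch

stored = list(range(10))
batch = sample_and_remove(stored, 4)
# The batch and the remaining items together still make up the original ten.
```

Removing sampled steps means each stored step is returned at most once, so the buffer shrinks by batch_size on every call.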

stop(): Mark the buffer as closed to further collection. Does not clear data.

size() → int: Current number of stored steps.

continue_collection() → bool: Returns True until stop() is called.

clear(): Remove all stored steps.

EpisodeBuffer

class unstable.collection.buffers.EpisodeBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = 'random')

Stores entire episodes and samples batches of full episodes.

Parameters:

max_buffer_size (int) – Upper bound on number of stored steps. When exceeded, the buffer downsamples.
tracker (ray.actor.ActorHandle) – Tracker for logging.
final_reward_transformation (Optional[ComposeFinalRewardTransforms]) – Applied once per trajectory to produce a reward signal passed to per-step logic.
step_reward_transformation (Optional[ComposeStepRewardTransforms]) – Applied to each step (given the trajectory and index) to yield the stored step reward.
sampling_reward_transformation (Optional[ComposeSamplingRewardTransforms]) – Applied to the sampled batch before returning it.
buffer_strategy (str) – Downsampling policy. Currently only "random" is supported.

Methods

add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)

Add the trajectory to the buffer. Rewards are computed in two stages:

Once per trajectory, apply the final reward transformation if provided; otherwise use the trajectory's final reward unchanged.
For each step, apply the step reward transformation if provided; otherwise store the step with the final reward from the first stage.

After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples entire episodes.
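
Episode-level downsampling differs from the step-level case in that capacity is counted in steps while removal happens in whole episodes. A minimal sketch, assuming episodes are plain lists of steps and that "random" means uniformly choosing which episode to drop (`downsample_episodes` is a hypothetical helper):

```python
import random
from typing import List

def downsample_episodes(episodes: List[List[str]], max_buffer_size: int) -> None:
    """Drop randomly chosen whole episodes until the total step count fits."""
    total_steps = sum(len(ep) for ep in episodes)
    while episodes and total_steps > max_buffer_size:
        removed = episodes.pop(random.randrange(len(episodes)))
        total_steps -= len(removed)

episodes = [["a0", "a1", "a2"], ["b0", "b1", "b2", "b3"], ["c0", "c1", "c2", "c3", "c4"]]
downsample_episodes(episodes, max_buffer_size=8)
# At most 8 steps remain, and every surviving episode is still complete.
```

Because episodes are removed whole, the buffer can end up holding noticeably fewer steps than the maximum after downsampling.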

Parameters:

player_traj (PlayerTrajectory) – The complete trajectory to add.
env_id (str) – Environment identifier.

get_batch(batch_size: int) → List[List[Step]]

Sample a set of episodes without replacement and remove them from the buffer.

Parameters: batch_size (int) – Number of episodes to sample.
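
Note the unit mismatch: batch_size counts episodes, whereas size() counts steps. The sketch below makes that visible using plain lists of strings as hypothetical episodes; `stored_steps` and `sample_episodes` are illustrative helpers, and shuffling then slicing is just one simple way to sample without replacement.

```python
import random
from typing import List

def stored_steps(eps: List[List[str]]) -> int:
    """Step count, analogous to what size() reports."""
    return sum(len(ep) for ep in eps)

def sample_episodes(eps: List[List[str]], batch_size: int) -> List[List[str]]:
    """Remove and return batch_size whole episodes chosen at random."""
    if batch_size > len(eps):
        raise ValueError("not enough episodes stored")
    random.shuffle(eps)      # random order, so the first k are a uniform sample
    batch = eps[:batch_size]
    del eps[:batch_size]     # sampled episodes leave the buffer
    return batch

episodes = [["a0", "a1"], ["b0", "b1", "b2"], ["c0"], ["d0", "d1"]]
steps_before = stored_steps(episodes)   # 8 steps across 4 episodes
batch = sample_episodes(episodes, 2)    # 2 episodes, a variable number of steps
flat_steps = [step for ep in batch for step in ep]
```

A learner that needs a fixed number of steps per update therefore has to account for the variable length of the flattened batch.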

stop(): Mark the buffer as closed to further collection. Does not clear data.

size() → int: Current number of stored steps.

continue_collection() → bool: Returns True until stop() is called.

clear(): Remove all stored episodes.