Replay Buffers
~~~~~~~~~~~~~~
We provide two replay buffers for multi-agent game collection:
``StepBuffer`` stores and samples individual steps across episodes, while ``EpisodeBuffer`` stores and samples entire episodes.
Both buffers support three kinds of reward transformations:
- **Final reward transformation**: Applied once per trajectory and passed into step-level computations.
- **Step reward transformation**: Applied per step, typically using the final/episode reward.
- **Sampling reward transformation**: Applied *at sampling time* to the batch being returned.
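The order in which the three stages run can be illustrated with a small, self-contained sketch. It does not use the library's classes: the functions ``final_transform``, ``step_transform``, ``sampling_transform``, and ``process_trajectory``, the dictionary-based step representation, and the specific arithmetic are all hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, List


def final_transform(final_reward: float) -> float:
    # Hypothetical final-reward transformation: scale the trajectory outcome.
    return final_reward * 2.0


def step_transform(final_reward: float, step_index: int, num_steps: int) -> float:
    # Hypothetical step-reward transformation: weight later steps more heavily.
    return final_reward * (step_index + 1) / num_steps


def sampling_transform(batch: List[Dict]) -> List[Dict]:
    # Hypothetical sampling-time transformation: subtract the batch mean.
    mean = sum(step["reward"] for step in batch) / len(batch)
    return [{**step, "reward": step["reward"] - mean} for step in batch]


def process_trajectory(observations: List[str], final_reward: float) -> List[Dict]:
    # Stage 1 runs once per trajectory; stage 2 runs once per step.
    reward = final_transform(final_reward)
    n = len(observations)
    return [
        {"obs": obs, "reward": step_transform(reward, i, n)}
        for i, obs in enumerate(observations)
    ]


stored = process_trajectory(["a", "b"], final_reward=1.0)  # at insertion time
batch = sampling_transform(stored)                          # at sampling time
```

In this sketch the stored step rewards are fixed at insertion time, while the sampling-time transformation depends on which steps happen to be in the returned batch.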
API Reference
"""""""""""""
.. py:class:: BaseBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = "random")
Abstract base with a minimal interface common to both buffers.
:param max_buffer_size: Maximum number of steps the buffer can hold in total.
:type max_buffer_size: int
:param tracker: Tracker for logging.
:type tracker: unstable.collection.trackers.BaseTracker
:param final_reward_transformation: Transformation applied once per trajectory to produce an episode-level reward.
:type final_reward_transformation: Optional[ComposeFinalRewardTransforms]
:param step_reward_transformation: Per-step transformation producing the step reward.
:type step_reward_transformation: Optional[ComposeStepRewardTransforms]
:param sampling_reward_transformation: Transformation applied when sampling a batch.
:type sampling_reward_transformation: Optional[ComposeSamplingRewardTransforms]
:param buffer_strategy: Strategy used to discard data when the buffer is full. Currently only ``"random"`` downsampling is implemented.
:type buffer_strategy: str
**Expected methods**
.. py:method:: add_player_trajectory(player_traj, env_id)
:noindex:
Add one complete trajectory to the buffer.
.. py:method:: get_batch(batch_size)
:noindex:
Sample and remove a batch from the buffer.
.. py:method:: stop()
:noindex:
Signal collection to stop.
.. py:method:: size() -> int
:noindex:
Current number of stored steps.
.. py:method:: continue_collection() -> bool
:noindex:
Whether the buffer is still accepting data.
.. py:method:: clear()
:noindex:
Remove all stored data.
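As a rough illustration of what the expected methods imply, the sketch below expresses them as a Python abstract base class. ``SketchBuffer``, ``IncompleteBuffer``, and ``can_instantiate`` are hypothetical names, the constructor arguments are omitted, and the real ``BaseBuffer`` may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SketchBuffer(ABC):
    """Hypothetical mirror of the expected methods; not the library class."""

    @abstractmethod
    def add_player_trajectory(self, player_traj: Any, env_id: str) -> None: ...

    @abstractmethod
    def get_batch(self, batch_size: int) -> List[Any]: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def continue_collection(self) -> bool: ...

    @abstractmethod
    def clear(self) -> None: ...


class IncompleteBuffer(SketchBuffer):
    # Implements only one of the six methods, so it is still abstract.
    def size(self) -> int:
        return 0


def can_instantiate(cls: type) -> bool:
    # Python raises TypeError when an abstract class is instantiated.
    try:
        cls()
        return True
    except TypeError:
        return False
```

With this structure, a subclass that omits any of the six methods cannot be instantiated, which catches incomplete buffer implementations early.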
StepBuffer
----------
.. py:class:: StepBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = "random")
:module: unstable.collection.buffers
:noindex:
Stores steps from incoming trajectories in a single flat list and samples individual steps.
:param max_buffer_size: Upper bound on number of stored steps. When exceeded, the buffer downsamples.
:type max_buffer_size: int
:param tracker: Tracker for logging.
:type tracker: ray.actor.ActorHandle
:param final_reward_transformation: Applied once per trajectory to produce a reward signal passed to per-step logic.
:type final_reward_transformation: Optional[ComposeFinalRewardTransforms]
:param step_reward_transformation: Applied to each step (given the trajectory and index) to yield the stored step reward.
:type step_reward_transformation: Optional[ComposeStepRewardTransforms]
:param sampling_reward_transformation: Applied to the *sampled batch of steps* before returning it.
:type sampling_reward_transformation: Optional[ComposeSamplingRewardTransforms]
:param buffer_strategy: Downsampling policy. Currently only ``"random"`` is supported.
:type buffer_strategy: str
**Methods**
.. py:method:: add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)
Add the trajectory to the buffer. Rewards are resolved in two stages:
1. Apply the final reward transformation, if provided, once for the whole trajectory; otherwise the trajectory's final reward is used unchanged.
2. For each step, apply the step reward transformation, if provided; otherwise the step is assigned the final reward.
After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples.
:param player_traj: The trajectory to add.
:type player_traj: PlayerTrajectory
:param env_id: Environment identifier.
:type env_id: str
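The insertion logic described for ``add_player_trajectory`` can be sketched as follows. ``ToyStepBuffer`` is a hypothetical stand-in, not the library's ``StepBuffer``: it takes plain observations and a scalar final reward instead of a ``PlayerTrajectory``, and it assumes that random downsampling means dropping uniformly chosen steps until the size limit is met.

```python
import random
from typing import Callable, Dict, List, Optional


class ToyStepBuffer:
    """Illustrative flat step buffer; names and signatures are assumptions."""

    def __init__(self, max_buffer_size: int, seed: int = 0) -> None:
        self.max_buffer_size = max_buffer_size
        self.steps: List[Dict] = []
        self._rng = random.Random(seed)

    def add_trajectory(
        self,
        observations: List[str],
        final_reward: float,
        final_transform: Optional[Callable[[float], float]] = None,
        step_transform: Optional[Callable[[float, int], float]] = None,
    ) -> None:
        # Stage 1: once per trajectory.
        reward = final_transform(final_reward) if final_transform else final_reward
        # Stage 2: once per step, falling back to the final reward.
        for i, obs in enumerate(observations):
            step_reward = step_transform(reward, i) if step_transform else reward
            self.steps.append({"obs": obs, "reward": step_reward})
        # Assumed "random" strategy: drop uniformly chosen steps when over capacity.
        excess = len(self.steps) - self.max_buffer_size
        if excess > 0:
            drop = set(self._rng.sample(range(len(self.steps)), excess))
            self.steps = [s for i, s in enumerate(self.steps) if i not in drop]

    def size(self) -> int:
        return len(self.steps)
```

Because the buffer is flat, downsampling in this sketch can remove some steps of an episode while keeping others.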
.. py:method:: get_batch(batch_size: int) -> List[Step]
Sample a set of steps without replacement and remove them from the buffer.
:param batch_size: Number of steps to sample.
:type batch_size: int
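Sampling without replacement followed by removal can be sketched with the standard library alone. ``sample_and_remove`` is a hypothetical helper, and clamping ``batch_size`` to the number of available items is an assumption; the library may instead wait or raise when too few steps are stored.

```python
import random
from typing import Any, List, Tuple


def sample_and_remove(
    items: List[Any], batch_size: int, rng: random.Random
) -> Tuple[List[Any], List[Any]]:
    """Return (batch, remaining): the batch is drawn without replacement."""
    # Assumption: never request more items than are stored.
    batch_size = min(batch_size, len(items))
    # Sampling indices rather than values keeps duplicate items distinct.
    chosen = set(rng.sample(range(len(items)), batch_size))
    batch = [x for i, x in enumerate(items) if i in chosen]
    remaining = [x for i, x in enumerate(items) if i not in chosen]
    return batch, remaining


items = list(range(10))
batch, remaining = sample_and_remove(items, 4, random.Random(0))
```

Since sampled items leave the buffer, each stored step is returned at most once in this sketch.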
.. py:method:: stop()
Mark the buffer as closed to further collection. Does not clear data.
.. py:method:: size() -> int
Current number of stored steps.
.. py:method:: continue_collection() -> bool
Returns ``True`` until :meth:`stop` is called.
.. py:method:: clear()
Remove all stored steps.
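The relationship between ``stop``, ``continue_collection``, ``size``, and ``clear`` can be illustrated with a toy collection loop. ``ToyBuffer``, its ``add`` method, and the stopping condition are hypothetical; only the contract (stopping closes collection without discarding data) is taken from the descriptions above.

```python
class ToyBuffer:
    """Hypothetical stand-in showing the stop/continue_collection contract."""

    def __init__(self) -> None:
        self.steps = []
        self.collect = True

    def add(self, step) -> None:
        self.steps.append(step)

    def size(self) -> int:
        return len(self.steps)

    def stop(self) -> None:
        # Closes the buffer to further collection but keeps stored data.
        self.collect = False

    def continue_collection(self) -> bool:
        return self.collect

    def clear(self) -> None:
        self.steps.clear()


buffer = ToyBuffer()
produced = 0
while buffer.continue_collection():
    buffer.add({"obs": produced})
    produced += 1
    if buffer.size() >= 3:  # arbitrary stopping condition for the demo
        buffer.stop()

size_after_stop = buffer.size()    # data survives stop()
buffer.clear()
size_after_clear = buffer.size()   # data is gone only after clear()
```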
EpisodeBuffer
-------------
.. py:class:: EpisodeBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = "random")
:module: unstable.collection.buffers
:noindex:
Stores entire episodes and samples batches of full episodes.
:param max_buffer_size: Upper bound on number of stored steps. When exceeded, the buffer downsamples.
:type max_buffer_size: int
:param tracker: Tracker for logging.
:type tracker: ray.actor.ActorHandle
:param final_reward_transformation: Applied once per trajectory to produce a reward signal passed to per-step logic.
:type final_reward_transformation: Optional[ComposeFinalRewardTransforms]
:param step_reward_transformation: Applied to each step (given the trajectory and index) to yield the stored step reward.
:type step_reward_transformation: Optional[ComposeStepRewardTransforms]
:param sampling_reward_transformation: Applied to the *sampled batch of steps* before returning it.
:type sampling_reward_transformation: Optional[ComposeSamplingRewardTransforms]
:param buffer_strategy: Downsampling policy. Currently only ``"random"`` is supported.
:type buffer_strategy: str
**Methods**
.. py:method:: add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)
Add the trajectory to the buffer. Rewards are resolved in two stages:
1. Apply the final reward transformation, if provided, once for the whole trajectory; otherwise the trajectory's final reward is used unchanged.
2. For each step, apply the step reward transformation, if provided; otherwise the step is assigned the final reward.
After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples entire episodes.
:param player_traj: The source trajectory.
:type player_traj: PlayerTrajectory
:param env_id: Environment identifier.
:type env_id: str
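Episode-level downsampling differs from the step-level case in that whole episodes are discarded while the capacity is still counted in steps. A minimal sketch, assuming uniformly random eviction of episodes until the total step count fits; ``downsample_episodes`` is a hypothetical helper, not a library function.

```python
import random
from typing import Any, List


def downsample_episodes(
    episodes: List[List[Any]], max_steps: int, rng: random.Random
) -> List[List[Any]]:
    """Drop randomly chosen whole episodes until total steps <= max_steps."""
    kept = list(episodes)  # shallow copy so the caller's list is untouched
    total = sum(len(ep) for ep in kept)
    while total > max_steps and kept:
        idx = rng.randrange(len(kept))
        total -= len(kept.pop(idx))
    return kept


episodes = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
kept = downsample_episodes(episodes, max_steps=5, rng=random.Random(0))
```

Every episode that survives in this sketch is intact, so the stored step count can fall well below the limit after a long episode is evicted.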
.. py:method:: get_batch(batch_size: int) -> List[List[Step]]
Sample a set of episodes without replacement and remove them from the buffer.
:param batch_size: Number of episodes to sample.
:type batch_size: int
.. py:method:: stop()
Mark the buffer as closed to further collection. Does not clear data.
.. py:method:: size() -> int
Current number of stored steps.
.. py:method:: continue_collection() -> bool
Returns ``True`` until :meth:`stop` is called.
.. py:method:: clear()
Remove all stored episodes.