Replay Buffers
~~~~~~~~~~~~~~

We provide two replay buffers for multi-agent game collection: ``StepBuffer`` stores and samples individual steps across episodes, while ``EpisodeBuffer`` stores and samples entire episodes.

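
The difference in sampling granularity can be illustrated with plain Python lists. This is a minimal sketch, not the library's implementation: the ``(episode_id, step_index)`` tuples and the helper names ``sample_steps`` and ``sample_episodes`` are hypothetical stand-ins for real step objects and buffer methods.

.. code-block:: python

   import random

   # Two toy episodes; each step is a hypothetical (episode_id, step_index) tuple.
   episodes = [
       [("ep0", 0), ("ep0", 1), ("ep0", 2)],
       [("ep1", 0), ("ep1", 1)],
   ]

   def sample_steps(episodes, batch_size, rng):
       """Step-level sampling: flatten all episodes, then draw individual steps."""
       flat = [step for episode in episodes for step in episode]
       return rng.sample(flat, batch_size)

   def sample_episodes(episodes, batch_size, rng):
       """Episode-level sampling: draw whole episodes, keeping their steps together."""
       return rng.sample(episodes, batch_size)

   rng = random.Random(0)
   step_batch = sample_steps(episodes, 3, rng)        # a list of steps
   episode_batch = sample_episodes(episodes, 1, rng)  # a list of episodes (lists of steps)

A step-level batch may mix steps from different episodes, whereas an episode-level batch always keeps each episode's steps together.
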
Both buffers support three kinds of reward transformations:

- **Final reward transformation**: Applied once per trajectory; its result is passed into the step-level computations.
- **Step reward transformation**: Applied per step, typically using the final (episode-level) reward.
- **Sampling reward transformation**: Applied *at sampling time* to the batch being returned.

API Reference
"""""""""""""

.. py:class:: BaseBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = "random")

   Abstract base class with a minimal interface common to both buffers.

   :param max_buffer_size: Maximum number of steps the buffer can hold in total.
   :type max_buffer_size: int
   :param tracker: Tracker for logging.
   :type tracker: unstable.collection.trackers.BaseTracker
   :param final_reward_transformation: Transformation applied once per trajectory to produce an episode-level reward.
   :type final_reward_transformation: Optional[ComposeFinalRewardTransforms]
   :param step_reward_transformation: Per-step transformation producing the step reward.
   :type step_reward_transformation: Optional[ComposeStepRewardTransforms]
   :param sampling_reward_transformation: Transformation applied when sampling a batch.
   :type sampling_reward_transformation: Optional[ComposeSamplingRewardTransforms]
   :param buffer_strategy: Strategy used to shrink the buffer when it is full. Currently only ``"random"`` downsampling is implemented.
   :type buffer_strategy: str

   **Expected methods**

   .. py:method:: add_player_trajectory(player_traj, env_id)
      :noindex:

      Add one complete trajectory to the buffer.

   .. py:method:: get_batch(batch_size)
      :noindex:

      Sample a batch and remove it from the buffer.

   .. py:method:: stop()
      :noindex:

      Signal collection to stop.

   .. py:method:: size() -> int
      :noindex:

      Current number of stored steps.

   .. py:method:: continue_collection() -> bool
      :noindex:

      Whether the buffer is still accepting data.

   .. py:method:: clear()
      :noindex:

      Remove all stored data.

StepBuffer
----------

.. py:class:: StepBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = "random")
   :module: unstable.collection.buffers
   :noindex:

   Stores steps from incoming trajectories in a single flat list and samples individual steps.

   :param max_buffer_size: Upper bound on the number of stored steps. When exceeded, the buffer downsamples.
   :type max_buffer_size: int
   :param tracker: Tracker for logging.
   :type tracker: ray.actor.ActorHandle
   :param final_reward_transformation: Applied once per trajectory to produce a reward signal passed to per-step logic.
   :type final_reward_transformation: Optional[ComposeFinalRewardTransforms]
   :param step_reward_transformation: Applied to each step (given the trajectory and index) to yield the stored step reward.
   :type step_reward_transformation: Optional[ComposeStepRewardTransforms]
   :param sampling_reward_transformation: Applied to the *sampled batch of steps* before returning it.
   :type sampling_reward_transformation: Optional[ComposeSamplingRewardTransforms]
   :param buffer_strategy: Downsampling policy. Currently only ``"random"`` is supported.
   :type buffer_strategy: str

   **Methods**

   .. py:method:: add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)

      Add the trajectory to the buffer. Rewards are computed in two stages:

      1. Apply the final reward transformation to the trajectory if one is provided; otherwise use the trajectory's final reward.
      2. For each step, apply the step reward transformation if one is provided; otherwise use the final reward.

      After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples.

      :param player_traj: The trajectory.
      :type player_traj: PlayerTrajectory
      :param env_id: Environment identifier.
      :type env_id: str

   .. py:method:: get_batch(batch_size: int) -> List[Step]

      Sample a set of steps without replacement and remove them from the buffer.

      :param batch_size: Number of steps to sample.
      :type batch_size: int

   .. py:method:: stop()

      Mark the buffer as closed to further collection. Does not clear data.

   .. py:method:: size() -> int

      Current number of stored steps.

   .. py:method:: continue_collection() -> bool

      Returns ``True`` until :meth:`stop` is called.

   .. py:method:: clear()

      Remove all stored steps.

EpisodeBuffer
-------------

.. py:class:: EpisodeBuffer(max_buffer_size, tracker, final_reward_transformation, step_reward_transformation, sampling_reward_transformation, buffer_strategy: str = "random")
   :module: unstable.collection.buffers
   :noindex:

   Stores entire episodes and samples batches of full episodes.

   :param max_buffer_size: Upper bound on the number of stored steps. When exceeded, the buffer downsamples.
   :type max_buffer_size: int
   :param tracker: Tracker for logging.
   :type tracker: ray.actor.ActorHandle
   :param final_reward_transformation: Applied once per trajectory to produce a reward signal passed to per-step logic.
   :type final_reward_transformation: Optional[ComposeFinalRewardTransforms]
   :param step_reward_transformation: Applied to each step (given the trajectory and index) to yield the stored step reward.
   :type step_reward_transformation: Optional[ComposeStepRewardTransforms]
   :param sampling_reward_transformation: Applied to the *sampled batch of steps* before returning it.
   :type sampling_reward_transformation: Optional[ComposeSamplingRewardTransforms]
   :param buffer_strategy: Downsampling policy. Currently only ``"random"`` is supported.
   :type buffer_strategy: str

   **Methods**

   .. py:method:: add_player_trajectory(player_traj: PlayerTrajectory, env_id: str)

      Add the trajectory to the buffer. Rewards are computed in two stages:

      1. Apply the final reward transformation to the trajectory if one is provided; otherwise use the trajectory's final reward.
      2. For each step, apply the step reward transformation if one is provided; otherwise use the final reward.

      After insertion, if the number of stored steps exceeds the maximum buffer size, the buffer downsamples by removing entire episodes.

      :param player_traj: The source trajectory.
      :type player_traj: PlayerTrajectory
      :param env_id: Environment identifier.
      :type env_id: str

   .. py:method:: get_batch(batch_size: int) -> List[List[Step]]

      Sample a set of episodes without replacement and remove them from the buffer.

      :param batch_size: Number of episodes to sample.
      :type batch_size: int

   .. py:method:: stop()

      Mark the buffer as closed to further collection. Does not clear data.

   .. py:method:: size() -> int

      Current number of stored steps.

   .. py:method:: continue_collection() -> bool

      Returns ``True`` until :meth:`stop` is called.

   .. py:method:: clear()

      Remove all stored episodes.
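
The step-level lifecycle described in this reference (transform rewards on insertion, downsample randomly when over capacity, sample without replacement and remove) can be sketched with a small stand-alone class. Everything here is an assumption made for illustration: ``ToyStep`` and ``ToyStepBuffer`` are hypothetical names, the transformation callables use simplified signatures, and no tracker or Ray actor is involved. It is not the actual ``StepBuffer`` implementation.

.. code-block:: python

   import random
   from dataclasses import dataclass
   from typing import Callable, List, Optional


   @dataclass
   class ToyStep:
       """Hypothetical stand-in for the library's Step type."""
       obs: str
       reward: float


   class ToyStepBuffer:
       """Illustrative only: mirrors the documented flow, not the real class."""

       def __init__(
           self,
           max_buffer_size: int,
           final_fn: Optional[Callable[[float], float]] = None,
           step_fn: Optional[Callable[[float, int, int], float]] = None,
           sampling_fn: Optional[Callable[[List[ToyStep]], List[ToyStep]]] = None,
           seed: int = 0,
       ):
           self.max_buffer_size = max_buffer_size
           self.final_fn = final_fn        # assumed signature: reward -> reward
           self.step_fn = step_fn          # assumed signature: (reward, index, length) -> reward
           self.sampling_fn = sampling_fn  # assumed signature: batch -> batch
           self.steps: List[ToyStep] = []
           self.collect = True
           self.rng = random.Random(seed)

       def add_player_trajectory(self, observations: List[str], final_reward: float) -> None:
           # Stage 1: final reward transformation, applied once per trajectory.
           reward = self.final_fn(final_reward) if self.final_fn else final_reward
           n = len(observations)
           for i, obs in enumerate(observations):
               # Stage 2: step reward transformation, else reuse the final reward.
               step_reward = self.step_fn(reward, i, n) if self.step_fn else reward
               self.steps.append(ToyStep(obs, step_reward))
           # "random" strategy: keep a uniformly random subset when over capacity.
           if len(self.steps) > self.max_buffer_size:
               self.steps = self.rng.sample(self.steps, self.max_buffer_size)

       def get_batch(self, batch_size: int) -> List[ToyStep]:
           # Sample indices without replacement, then remove those steps.
           chosen = set(self.rng.sample(range(len(self.steps)), batch_size))
           batch = [s for i, s in enumerate(self.steps) if i in chosen]
           self.steps = [s for i, s in enumerate(self.steps) if i not in chosen]
           # Sampling-time transformation acts on the batch being returned.
           return self.sampling_fn(batch) if self.sampling_fn else batch

       def stop(self) -> None:
           self.collect = False

       def size(self) -> int:
           return len(self.steps)

       def continue_collection(self) -> bool:
           return self.collect

       def clear(self) -> None:
           self.steps.clear()


   # Usage: scale the reward linearly over the trajectory (a made-up step transform).
   buffer = ToyStepBuffer(max_buffer_size=4, step_fn=lambda r, i, n: r * (i + 1) / n)
   buffer.add_player_trajectory(["a", "b", "c"], final_reward=1.0)
   buffer.add_player_trajectory(["d", "e", "f"], final_reward=-1.0)
   size_after_add = buffer.size()    # six steps were added, capacity is four
   batch = buffer.get_batch(2)
   size_after_batch = buffer.size()  # sampled steps are removed from the buffer
   buffer.stop()

An episode-level variant would store each trajectory as its own list and sample or drop whole lists instead of individual steps.
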