Transferable Reinforcement Learning via Generalized Occupancy Models

GOM is an unsupervised reinforcement learning method that models the distribution of all possible outcomes, where an outcome is represented as a discounted sum of state-dependent cumulants. The outcome model is paired with a readout policy that produces the action needed to realize a particular outcome. Under the assumption that rewards depend linearly on the cumulants, transfer to downstream tasks reduces to a linear regression followed by a simple optimization over achievable outcomes to pick the best one.
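
As a rough illustration, here is a minimal NumPy sketch of the transfer step, with stand-in functions for the learned cumulants, outcome model, and readout policy (the names and shapes are hypothetical, not the paper's code):

```python
# Minimal sketch of the GOM-style transfer step (illustrative stand-ins only).
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(4, 8))  # stand-in projection for the cumulant network

def cumulants(states):
    # Stand-in for a learned cumulant network phi(s).
    return np.tanh(states @ W_phi)

def sample_outcomes(state, n=256):
    # Stand-in for the outcome model: candidate discounted cumulant sums psi.
    return rng.normal(size=(n, 8))

# 1) Linear regression from cumulants to the new task's rewards: r ≈ phi(s) @ w
states = rng.normal(size=(1000, 4))
rewards = rng.normal(size=1000)              # rewards observed for the new task
Phi = cumulants(states)
w, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)

# 2) Pick the achievable outcome with the highest predicted return, then hand
#    it to the readout policy to produce an action.
psi = sample_outcomes(state=states[0])
best_outcome = psi[np.argmax(psi @ w)]
# action = readout_policy(states[0], best_outcome)   # learned separately
```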

Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning

We show that in the offline RL setting, Q-learning is limited by the Bellman completeness requirement, while return-conditioned supervised learning (RCSL) cannot perform trajectory stitching. We therefore propose a new algorithm, model-based return-conditioned supervised learning (MBRCSL), which augments the dataset with model-based rollouts of the behavior policy (containing potentially optimal trajectories) and then performs RCSL on the augmented dataset. MBRCSL is free from Bellman completeness and able to perform trajectory stitching, recovering optimal behavior from suboptimal datasets.
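
A minimal sketch of the data-augmentation step, with toy stand-ins for the learned dynamics model and the cloned behavior policy (not the paper's implementation):

```python
# Toy sketch of MBRCSL-style dataset augmentation via model-based rollouts.
import numpy as np

rng = np.random.default_rng(0)

def dynamics_model(state, action):
    # Stand-in for a dynamics/reward model learned from the offline dataset.
    next_state = state + 0.1 * action + 0.01 * rng.normal(size=state.shape)
    reward = -np.sum(next_state ** 2)
    return next_state, reward

def behavior_policy(state):
    # Stand-in for a policy cloned from the offline dataset.
    return np.clip(-state + 0.1 * rng.normal(size=state.shape), -1, 1)

def rollout(init_state, horizon=20):
    # Model-based rollout of the behavior policy; such rollouts can stitch
    # together segments that no single dataset trajectory contains.
    traj, state, ret = [], init_state, 0.0
    for _ in range(horizon):
        action = behavior_policy(state)
        next_state, reward = dynamics_model(state, action)
        traj.append((state, action, reward))
        ret += reward
        state = next_state
    return traj, ret

# Augment the offline dataset with model-based rollouts, then run RCSL
# (conditioning actions on return-to-go) on the augmented data.
augmented = [rollout(rng.normal(size=2)) for _ in range(100)]
# rcsl_policy = train_rcsl(offline_trajectories + augmented)
```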

RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability

RePo is a visual model-based reinforcement learning method that learns a minimal, task-relevant representation by optimizing an information bottleneck objective. This makes it resilient to spurious variations in the observations, e.g., random distractors and background changes.
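
A rough sketch of an information-bottleneck objective in this spirit, with hypothetical networks and shapes (keep the latent predictive of task quantities while limiting how much it takes from the observation):

```python
# Sketch of a bottlenecked representation loss in the spirit of RePo
# (hypothetical components; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, latent_dim, action_dim = 64, 16, 4
encoder = nn.Linear(obs_dim, 2 * latent_dim)                  # posterior q(z_t | o_t, ...)
prior = nn.Linear(latent_dim + action_dim, 2 * latent_dim)    # prior p(z_t | z_{t-1}, a_{t-1})
reward_head = nn.Linear(latent_dim, 1)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(-1)

def bottleneck_loss(obs, prev_z, prev_a, reward, beta=1e-3):
    mu_q, logvar_q = encoder(obs).chunk(2, dim=-1)
    mu_p, logvar_p = prior(torch.cat([prev_z, prev_a], dim=-1)).chunk(2, dim=-1)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
    # Keep the latent predictive of reward (and, in full models, of future latents)...
    pred_loss = F.mse_loss(reward_head(z).squeeze(-1), reward)
    # ...while penalizing information drawn from the observation via the prior KL.
    bottleneck = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).mean()
    return pred_loss + beta * bottleneck

loss = bottleneck_loss(torch.randn(8, obs_dim), torch.randn(8, latent_dim),
                       torch.randn(8, action_dim), torch.randn(8))
```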

Self-Supervised Reinforcement Learning that Transfers using Random Features

We propose RaMP, a self-supervised RL method that learns from unlabelled offline data and quickly transfers to arbitrary rewards specified online. The idea is to capture environment dynamics by modeling the Q-values of random reward functions. These random Q-values can then be linearly combined to reconstruct the Q-function for any test-time reward.
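
A minimal NumPy sketch of the test-time transfer step, with fabricated random-reward Q-values standing in for the quantities learned offline:

```python
# Sketch of RaMP-style transfer: Q-values of random reward functions are
# combined linearly to estimate the Q-value of a new reward (stand-ins only).
import numpy as np

rng = np.random.default_rng(0)
n_random, feat_dim = 32, 8

# Random reward functions r_i(s) = phi(s) @ theta_i with random theta_i.
W_phi = rng.normal(size=(4, feat_dim))
theta = rng.normal(size=(feat_dim, n_random))
def phi(states):
    return np.tanh(states @ W_phi)

# Offline phase (stand-in): suppose we already have Q_i(s, a) for each random
# reward at some query (s, a), stacked as columns.
q_random = rng.normal(size=(1, n_random))

# Test time: regress the observed reward onto the random rewards...
states = rng.normal(size=(500, 4))
new_reward = rng.normal(size=500)                # rewards observed online
random_rewards = phi(states) @ theta             # r_i(s) for each random function
w, *_ = np.linalg.lstsq(random_rewards, new_reward, rcond=None)

# ...then the same weights combine the random Q-values, since Q is linear in r.
q_new = q_random @ w
```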

Model-Based Reinforcement Learning via Latent-Space Collocation

Our planner, LatCo, solves multi-stage, long-horizon tasks that are much harder than those considered in prior work. By optimizing a sequence of future latent states instead of optimizing actions directly, it quickly discovers high-reward regions and produces effective plans.
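
A toy sketch of collocation in a latent space, using hand-written dynamics and reward functions in place of the learned latent models:

```python
# Toy latent-space collocation: optimize a sequence of latent states to
# maximize reward while penalizing dynamics violations (illustrative only).
import torch

latent_dim, horizon = 4, 10

def dynamics(z):                       # stand-in latent dynamics z_{t+1} = f(z_t)
    return 0.9 * z + 0.1 * torch.tanh(z)

def reward(z):                         # stand-in reward: reach the point (1, ..., 1)
    return -((z - 1.0) ** 2).sum(-1)

z_plan = torch.zeros(horizon, latent_dim, requires_grad=True)
opt = torch.optim.Adam([z_plan], lr=0.1)

for step in range(200):
    opt.zero_grad()
    # Dynamics consistency: consecutive latents should follow the model.
    violation = ((z_plan[1:] - dynamics(z_plan[:-1])) ** 2).sum()
    # Collocation objective: maximize reward along the plan, with the dynamics
    # constraint as a soft penalty (a full method would adapt the penalty weight).
    loss = -reward(z_plan).sum() + 10.0 * violation
    loss.backward()
    opt.step()

# Actions realizing the plan can then be recovered, e.g., with inverse dynamics.
```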
