GOM is an unsupervised reinforcement learning method that models the distribution of all possible outcomes, represented as discounted sums of state-dependent cumulants. The outcome model is paired with a readout policy that produces an action to realize a particular outcome. Assuming rewards depend linearly on the cumulants, transfer to downstream tasks reduces to a linear regression followed by a simple optimization problem for the best achievable outcome.
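The transfer step can be sketched as follows (a minimal numpy illustration; the function name, shapes, and interface here are assumptions for exposition, not GOM's actual implementation):

```python
import numpy as np

def gom_transfer(phi, r, psi):
    """Hypothetical sketch of GOM's downstream transfer (illustrative only).

    phi: (M, d) cumulant features for reward-labeled states
    r:   (M,)   observed rewards, assumed linear in the cumulants
    psi: (N, d) candidate outcomes (discounted cumulant sums)
    """
    # Step 1: linear regression recovers reward weights w with r ~= phi @ w.
    w, *_ = np.linalg.lstsq(phi, r, rcond=None)
    # Step 2: the best outcome maximizes its estimated return psi @ w;
    # the readout policy would then act to realize this outcome.
    return psi[np.argmax(psi @ w)]
```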
We show that in the offline RL setting, Q-learning is limited by the requirement of Bellman completeness, and return-conditioned supervised learning (RCSL) cannot perform trajectory stitching. We therefore propose a new algorithm, model-based return-conditioned supervised learning (MBRCSL), which augments the dataset with model-based rollouts of the behavior policy -- containing potentially optimal trajectories -- and then performs RCSL on the augmented dataset. MBRCSL is free from the Bellman-completeness requirement and able to perform trajectory stitching, recovering optimal behavior from suboptimal datasets.
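The two-stage recipe can be sketched as a skeleton (illustrative only; `dynamics_model`, `behavior_policy`, and `rcsl_train` stand in for learned components and are not the paper's actual interfaces):

```python
def mbrcsl(dataset, dynamics_model, behavior_policy, rcsl_train,
           n_rollouts, horizon):
    """Hypothetical MBRCSL skeleton: augment the dataset, then run RCSL."""
    augmented = list(dataset)
    # Stage 1: roll out the behavior policy inside the learned dynamics model;
    # stitched rollouts may contain trajectories better than any in the data.
    for _ in range(n_rollouts):
        s = dynamics_model.sample_initial_state()
        traj = []
        for _ in range(horizon):
            a = behavior_policy(s)
            s_next, r = dynamics_model.step(s, a)
            traj.append((s, a, r))
            s = s_next
        augmented.append(traj)
    # Stage 2: return-conditioned supervised learning on the augmented dataset,
    # i.e., regress actions on (state, return-to-go) pairs.
    return rcsl_train(augmented)
```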
RePo is a visual model-based reinforcement learning method that learns a minimally task-relevant representation by optimizing an information bottleneck objective. This makes it resilient to spurious variations in the observations, e.g. random distractors and background changes.
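As a rough analogy (not RePo's exact objective), a generic variational information-bottleneck loss keeps the latent predictive of the task signal while penalizing how much observation detail it retains:

```python
import numpy as np

def vib_loss(pred_err, mu, logvar, beta):
    """Generic variational information-bottleneck loss (illustrative only).

    pred_err: per-sample task-prediction error (e.g., reward prediction loss)
    mu, logvar: parameters of a diagonal-Gaussian encoder q(z|o)
    beta: trade-off between task relevance and compression
    """
    # KL(q(z|o) || N(0, I)) in closed form for diagonal Gaussians; this
    # compression term discourages z from carrying spurious observation detail.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return np.mean(pred_err + beta * kl)
```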
We propose RaMP, a self-supervised RL method that learns from unlabelled offline data and quickly transfers to arbitrary rewards specified online. The idea is to capture environment dynamics by modeling the Q-values of random reward functions. These Q-values can then be linearly combined to reconstruct the Q-value corresponding to any test-time reward.
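The linearity argument behind this can be sketched as follows (shapes and names are assumptions for exposition; the real method learns these quantities with function approximators): for a fixed policy the Q-function is linear in the reward, so coefficients fit on rewards transfer directly to Q-values.

```python
import numpy as np

def ramp_transfer(r_rand, q_rand, r_new):
    """Hypothetical sketch of RaMP's transfer step (illustrative only).

    r_rand: (S, K) values of K random reward functions at S states
    q_rand: (S, K) corresponding Q-values, learned offline
    r_new:  (S,)   a new reward specified at test time
    """
    # Fit r_new as a linear combination of the random rewards...
    alpha, *_ = np.linalg.lstsq(r_rand, r_new, rcond=None)
    # ...and reuse the same coefficients on the Q-values: the Bellman backup
    # is linear in the reward, so Q(sum_k a_k r_k) = sum_k a_k Q(r_k).
    return q_rand @ alpha
```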
Our planner, LatCo, solves multi-stage, long-horizon tasks substantially harder than those considered in prior work. By optimizing a sequence of future latent states instead of optimizing actions directly, it quickly discovers high-reward regions and constructs effective plans.
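The collocation idea can be illustrated on a linear toy model (everything below is a simplification with made-up dynamics and reward; the actual method plans in a learned latent space with a constrained optimizer): optimize the latent trajectory itself, trading reward against violations of the dynamics z_{t+1} = A z_t.

```python
import numpy as np

def latco_plan(A, g, z0, T, steps=2000, lr=0.05, lam=1.0):
    """Toy latent collocation: dynamics z' = A z, reward -||z - g||^2.

    Gradient descent on the latent states themselves, not on actions.
    """
    z = np.tile(z0, (T, 1)).astype(float)
    for _ in range(steps):
        prev = np.vstack([z0, z[:-1]])       # z_{t-1} for each step t
        dyn_res = z - prev @ A.T             # dynamics violation z_t - A z_{t-1}
        grad = 2 * (z - g) + 2 * lam * dyn_res
        grad[:-1] -= 2 * lam * (dyn_res[1:] @ A)  # coupling from the z_{t+1} term
        z -= lr * grad
    return z
```

Because the states are free variables, the optimizer can place later waypoints near the goal immediately and only then make the trajectory dynamically consistent, which is what lets collocation find high-reward regions quickly.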