DiSPOs learn a distribution over all possible outcomes in the dataset, along with a readout policy that acts to realize a particular outcome. This enables zero-shot transfer to downstream rewards: a single linear regression maps outcomes to rewards, and planning then selects the optimal realizable outcome, with no additional training.
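A minimal sketch of the zero-shot transfer step, assuming outcome features for reward-labeled states and candidate outcomes from a learned outcome distribution are already available; the function names and toy data below are illustrative, not the DiSPOs implementation.

```python
import numpy as np

def infer_reward_weights(outcome_features, rewards):
    """Linear regression: rewards ~ outcome_features @ w (least squares)."""
    w, *_ = np.linalg.lstsq(outcome_features, rewards, rcond=None)
    return w

def plan_best_outcome(candidate_outcomes, w):
    """Pick the realizable outcome with the highest predicted return."""
    scores = candidate_outcomes @ w
    return candidate_outcomes[np.argmax(scores)]

# Toy usage with random data standing in for the learned quantities.
rng = np.random.default_rng(0)
phi = rng.normal(size=(256, 16))                   # outcome features of labeled states
r = phi @ rng.normal(size=16) + 0.01 * rng.normal(size=256)  # observed rewards
w = infer_reward_weights(phi, r)

candidates = rng.normal(size=(128, 16))            # outcomes sampled from the model
target = plan_best_outcome(candidates, w)          # the readout policy would realize this
```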
ASID learns an exploration policy in simulation and uses it to collect real-world exploration trajectories that reveal maximal information about unknown system parameters. These trajectories are then used for system identification to align the simulator with the real world.
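A toy sketch of the system-identification step only, using a stand-in one-parameter simulator and a simple grid search; ASID's actual exploration objective and estimator are more involved.

```python
import numpy as np

def simulate(theta, actions, s0):
    """Stand-in simulator: damped linear dynamics with an unknown gain theta."""
    states = [s0]
    for a in actions:
        states.append(0.9 * states[-1] + theta * a)
    return np.array(states[1:])

def identify(real_states, actions, s0, candidates):
    """Grid search: pick the parameter whose rollout best matches the real data."""
    errors = [np.mean((simulate(th, actions, s0) - real_states) ** 2)
              for th in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
actions = rng.uniform(-1, 1, size=50)            # exploration actions
real = simulate(0.37, actions, s0=0.0)           # "real world" rollout
real += 0.01 * rng.normal(size=real.shape)       # sensor noise
theta_hat = identify(real, actions, 0.0, np.linspace(0.0, 1.0, 101))
print(theta_hat)  # close to 0.37 when the exploration data is informative
```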
In the offline RL setting, Q-learning is limited by Bellman completeness, while return-conditioned supervised learning (RCSL) cannot perform trajectory stitching. MBRCSL augments the offline dataset with model-based rollouts of the behavior policy -- covering potentially optimal trajectories -- and then performs RCSL on the augmented dataset. The result is free from Bellman-completeness requirements yet able to perform trajectory stitching.
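A sketch of the two-stage pipeline with toy stand-ins for the learned dynamics model, behavior policy, and RCSL learner; all names are illustrative.

```python
import numpy as np

def rollout(model, policy, s0, horizon):
    """Roll the behavior policy through the learned dynamics model."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r = model(s, a)
        traj.append((s, a, r))
        s = s_next
    return traj

def augment(dataset, model, policy, starts, horizon):
    """Add model-based rollouts, tagged with return-to-go, to the dataset."""
    for s0 in starts:
        traj = rollout(model, policy, s0, horizon)
        rewards = [r for (_, _, r) in traj]
        rtg = np.cumsum(rewards[::-1])[::-1]             # return-to-go per step
        dataset.extend((s, a, g) for (s, a, _), g in zip(traj, rtg))
    return dataset

# Toy stand-ins: 1-D state, noisy behavior policy, simple reward.
rng = np.random.default_rng(0)
model = lambda s, a: (s + a, -abs(s + a))                # next state, reward
policy = lambda s: float(np.clip(-s + 0.3 * rng.normal(), -1, 1))

data = augment([], model, policy, starts=rng.normal(size=32), horizon=10)
# RCSL step (not shown): supervised learning of a conditional policy
# pi(a | s, return-to-go) on `data`, then conditioning on a high target return.
```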
RePo is a visual model-based reinforcement learning method that learns a minimal, task-relevant representation by optimizing an information bottleneck objective. This makes it resilient to spurious variations in the observations, e.g., random distractors and background changes.
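A rough sketch of an information-bottleneck objective of this flavor, using a placeholder Gaussian encoder and linear heads; the actual RePo objective and its constraint handling differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, z_dim, batch = 32, 4, 8, 16
enc = nn.Linear(obs_dim, 2 * z_dim)        # placeholder encoder: (mu, log_std)
reward_head = nn.Linear(z_dim, 1)          # predicts reward from the latent
dyn = nn.Linear(z_dim + act_dim, z_dim)    # predicts the next latent

def bottleneck_loss(obs, action, reward, next_obs, beta=1.0):
    mu, log_std = enc(obs).chunk(2, dim=-1)            # posterior q(z | o)
    z = mu + log_std.exp() * torch.randn_like(mu)      # reparameterized sample
    mu_next, _ = enc(next_obs).chunk(2, dim=-1)

    # Task-relevant terms: the latent must predict reward and its own dynamics.
    task = F.mse_loss(reward_head(z).squeeze(-1), reward)
    task = task + F.mse_loss(dyn(torch.cat([z, action], dim=-1)), mu_next.detach())

    # Rate term: KL(q(z|o) || N(0, I)); penalizing it discourages encoding
    # task-irrelevant observation detail such as background distractors.
    rate = 0.5 * (mu ** 2 + (2 * log_std).exp() - 2 * log_std - 1).sum(-1).mean()
    return task + beta * rate

loss = bottleneck_loss(torch.randn(batch, obs_dim), torch.randn(batch, act_dim),
                       torch.randn(batch), torch.randn(batch, obs_dim))
loss.backward()
```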
RaMP is a self-supervised RL method that learns from unlabeled offline data and quickly transfers to arbitrary rewards specified online. The idea is to capture environment dynamics by modeling the Q-values of random reward functions; these can then be linearly combined to reconstruct the Q-function for any test-time reward.
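A tabular sketch of the recombination idea: for a fixed policy, Q-values are linear in the reward, so Q-values precomputed for random rewards can be reweighted to match a new reward. The toy MDP and exact policy evaluation below stand in for the learned components.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K, gamma = 20, 4, 96, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))           # transition probabilities
pi = rng.dirichlet(np.ones(A), size=S)               # fixed behavior policy
P_pi = np.einsum("sa,sap->sp", pi, P)                # state transitions under pi

def q_values(r_sa):
    """Exact policy evaluation: V = (I - gamma P_pi)^-1 r_pi, Q = r + gamma P V."""
    r_pi = (pi * r_sa).sum(-1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r_sa + gamma * P @ v

# Offline phase: store Q-values for K random reward functions.
random_rewards = rng.normal(size=(K, S, A))
psi = np.stack([q_values(r) for r in random_rewards], axis=-1)   # (S, A, K)

# Online phase: regress the new reward onto the random rewards, then
# recombine the stored Q-values with the same weights.
r_test = rng.normal(size=(S, A))
w, *_ = np.linalg.lstsq(random_rewards.reshape(K, -1).T, r_test.ravel(), rcond=None)
q_est = psi @ w
print(np.abs(q_est - q_values(r_test)).max())   # near zero: the random rewards span r_test
```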
Our planner, LatCo, solves multi-stage, long-horizon tasks substantially harder than those considered in prior work. By optimizing a sequence of future latent states instead of directly optimizing actions, it quickly discovers high-reward regions and produces effective plans.
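A toy sketch of collocation-style planning in latent space: the planner optimizes the future latent states (and actions) directly, enforcing the latent dynamics as a soft penalty. The quadratic dynamics and reward models below are placeholders, not learned components.

```python
import torch

torch.manual_seed(0)
z_dim, a_dim, T = 4, 2, 20
A = 0.9 * torch.eye(z_dim)
B = 0.3 * torch.randn(z_dim, a_dim)
goal = torch.ones(z_dim)

dynamics = lambda z, a: z @ A.T + a @ B.T          # toy latent dynamics model
reward = lambda z: -((z - goal) ** 2).sum(-1)      # toy latent reward model

zs = torch.zeros(T, z_dim, requires_grad=True)     # decision variables: latent states
acts = torch.zeros(T, a_dim, requires_grad=True)   # decision variables: actions
opt = torch.optim.Adam([zs, acts], lr=0.05)

z0 = torch.zeros(z_dim)
lam = 10.0                                         # dynamics penalty weight
for step in range(500):
    prev = torch.cat([z0.unsqueeze(0), zs[:-1]], dim=0)
    violation = ((zs - dynamics(prev, acts)) ** 2).sum()
    loss = -reward(zs).sum() + lam * violation
    opt.zero_grad()
    loss.backward()
    opt.step()

# After optimization, zs approximates a dynamically consistent, high-reward
# latent trajectory; the first action in `acts` would be executed.
```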