Distributional Successor Features Enable Zero-Shot Policy Optimization

DiSPOs learn a distribution over all possible outcomes in the dataset, along with a readout policy that acts to realize a particular outcome. This enables zero-shot transfer to downstream rewards: a simple linear regression recovers the reward weights, and planning selects the optimal realizable outcome, with no additional training.
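
A minimal sketch of the zero-shot transfer step, assuming outcome (successor) features and samples from the learned outcome distribution are already available; all array names below are illustrative stand-ins, not the paper's code:

```python
import numpy as np

# Toy shapes; in practice phi and the outcome samples come from the learned DiSPO model.
rng = np.random.default_rng(0)
d, n_labeled, n_outcomes = 16, 500, 200

phi = rng.normal(size=(n_labeled, d))         # cumulant features of reward-labeled states
rewards = phi @ rng.normal(size=d)            # rewards observed for those states
outcomes = rng.normal(size=(n_outcomes, d))   # samples from the learned distribution of outcomes

# Step 1: recover reward weights w by linear regression, assuming r(s) ~= phi(s) @ w.
w, *_ = np.linalg.lstsq(phi, rewards, rcond=None)

# Step 2: plan by picking the realizable outcome with the highest predicted return.
best_outcome = outcomes[np.argmax(outcomes @ w)]

# Step 3 (not shown): condition the readout policy on best_outcome to act it out.
```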

Active Exploration for System Identification in Robotic Manipulation

ASID learns an exploration policy in simulation and uses it to collect real-world exploration trajectories that reveal maximal information about unknown system parameters. These trajectories are then used for system identification to align the simulator with the real world.
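
As a rough illustration of the system-identification step, here is a toy sketch with a one-parameter simulator; the dynamics, parameter, and grid search are placeholders for the real simulator and a proper gradient-based identification procedure:

```python
import numpy as np

def sim(theta, actions, x0=0.0):
    """Toy 1-D simulator standing in for the full physics simulator."""
    xs, x = [], x0
    for a in actions:
        x = x + theta * a
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(0)
actions = rng.uniform(-1, 1, size=50)                        # actions from the exploration policy
real_traj = sim(0.7, actions) + 0.01 * rng.normal(size=50)   # noisy "real-world" rollout

# Align the simulator by choosing the parameter that best reproduces the real trajectory.
thetas = np.linspace(0.0, 2.0, 201)
losses = [np.mean((sim(t, actions) - real_traj) ** 2) for t in thetas]
theta_hat = thetas[int(np.argmin(losses))]                   # identified parameter
```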

Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning

In the offline RL setting, Q-learning requires Bellman completeness, and return-conditioned supervised learning (RCSL) cannot perform trajectory stitching. MBRCSL augments the offline dataset with model-based rollouts of the behavior policy -- covering potentially optimal trajectories -- and then performs RCSL on the augmented dataset, making it free from Bellman completeness while still able to stitch trajectories.
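
For concreteness, here is a small sketch of the RCSL labeling applied to the augmented dataset: every transition is tagged with its return-to-go, and the policy is trained to predict actions conditioned on state and return-to-go. The function below is illustrative, not the paper's implementation:

```python
import numpy as np

def return_to_go(rewards, discount=1.0):
    """Label each timestep with the (discounted) sum of future rewards."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        rtg[t] = running
    return rtg

# Works the same for dataset trajectories and for model-based rollouts of the behavior policy.
print(return_to_go(np.array([0.0, 0.0, 1.0, 0.0, 2.0])))  # [3. 3. 3. 2. 2.]
```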

RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability

RePo is a visual model-based reinforcement learning method that learns a minimal, task-relevant representation by optimizing an information bottleneck objective. This makes it resilient to spurious variations in the observations, e.g., random distractors and background changes.
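
A minimal sketch of an objective in this spirit, assuming a Gaussian posterior q(z | o) and a Gaussian dynamics prior p(z | z_prev, a); the function names are illustrative, and beta is shown as a fixed coefficient for simplicity:

```python
import numpy as np

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ), summed over dimensions."""
    return np.sum(np.log(std_p / std_q)
                  + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2) - 0.5)

def bottleneck_loss(pred_reward, true_reward, mu_q, std_q, mu_p, std_p, beta=1.0):
    reward_loss = (pred_reward - true_reward) ** 2        # keep the latent task-relevant
    information = gaussian_kl(mu_q, std_q, mu_p, std_p)   # penalize info the prior cannot predict
    return reward_loss + beta * information
```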

Self-Supervised Reinforcement Learning that Transfers using Random Features

RaMP is a self-supervised RL method that learns from unlabelled offline data and quickly transfers to arbitrary rewards specified online. The idea is to capture environment dynamics by modeling the Q-values of random functions; these can then be linearly combined to reconstruct the Q-value of any test-time reward.
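
A small sketch of the test-time recombination, assuming random features phi and their Q-values under the data-collection policy were learned offline; the arrays below are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 64

phi = rng.normal(size=(n_samples, n_features))        # random features evaluated at sampled (s, a)
q_random = rng.normal(size=(n_samples, n_features))   # stand-in for the learned Q-value of each random feature

test_rewards = phi @ rng.normal(size=n_features)      # rewards observed online at the same (s, a)

# Regress the new reward onto the random features to get combination weights...
w, *_ = np.linalg.lstsq(phi, test_rewards, rcond=None)

# ...then, by linearity of policy evaluation in the reward, the same weights
# combine the Q heads into the Q-function for the new reward.
q_test = q_random @ w
```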

Model-Based Reinforcement Learning via Latent-Space Collocation

Our planner, LatCo, solves multi-stage, long-horizon tasks that are much harder than those considered in prior work. By optimizing a sequence of future latent states instead of optimizing actions directly, it quickly discovers high-reward regions and produces effective plans.
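
A toy numpy sketch of collocation in a latent space: the plan is a sequence of latent states optimized directly, with a penalty for violating a (here, linear) dynamics model. The real planner uses a learned latent dynamics model and treats dynamics consistency as a constraint rather than a fixed penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, lr = 4, 10, 5.0, 0.01
A = 0.9 * np.eye(d)                  # stand-in for a learned latent dynamics model
goal = np.ones(d)                    # reward = -||z_T - goal||^2
z = rng.normal(size=(T, d))          # the plan: a sequence of future latent states

for _ in range(2000):
    grad = np.zeros_like(z)
    resid = z[1:] - z[:-1] @ A.T     # dynamics violations z_{t+1} - A z_t
    grad[1:] += 2 * lam * resid      # gradient of the dynamics penalty
    grad[:-1] -= 2 * lam * resid @ A
    grad[-1] += 2 * (z[-1] - goal)   # gradient of the terminal reward term
    z -= lr * grad                   # gradient step on the latent plan

# Actions are then recovered from consecutive latent states (e.g., with an inverse dynamics model).
```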
