VAMOS
In this paper, we start from a simple observation: a robot can often reason about where it should go without fully understanding whether its body can actually get there. That mismatch becomes especially important when trying to build navigation systems that work across different embodiments. A legged robot may be able to climb stairs or step over roots, while a wheeled platform in the same scene needs to avoid them entirely. With VAMOS, our goal was to separate those two problems instead of forcing a single model to learn both semantic planning and embodiment-specific feasibility at once.
To do this, we designed VAMOS as a hierarchical system. The high-level vision-language planner is trained on diverse real-world navigation data and proposes candidate paths directly in image space, which lets it learn broad semantic navigation behavior from heterogeneous datasets. We pair that planner with a lightweight affordance model trained separately for each embodiment in simulation, where it can cheaply and safely learn what a particular robot can traverse. At deployment, the planner generates routes that make sense semantically, and the affordance model re-ranks them by what is physically feasible for the robot in question.
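To make the division of labor concrete, here is a minimal sketch of that propose-and-rerank loop. The class names, method signatures, and placeholder logic (HighLevelPlanner, AffordanceModel, select_path, the straight-line candidates, the brightness-based score) are illustrative assumptions rather than the actual interfaces from the paper; the point is only the structure, with the planner producing image-space candidates and the embodiment-specific model re-ranking them.

```python
# Sketch of the propose-and-rerank structure. All names and the placeholder
# proposal/scoring logic are illustrative assumptions, not the paper's API.

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class CandidatePath:
    waypoints: np.ndarray  # (N, 2) pixel coordinates (u, v) in the current image


class HighLevelPlanner:
    """Embodiment-agnostic planner; in VAMOS this role is played by the vision-language model."""

    def propose_paths(self, image: np.ndarray, goal: str, k: int = 8) -> List[CandidatePath]:
        # Placeholder: fan out k straight-line paths from the bottom-center of
        # the image toward evenly spaced points near the horizon. A real planner
        # would condition on the image content and the language goal.
        h, w = image.shape[:2]
        start = np.array([w / 2, h - 1])
        ends = np.stack([np.linspace(0, w - 1, k), np.full(k, h / 3)], axis=1)
        return [CandidatePath(np.linspace(start, end, num=16)) for end in ends]


class AffordanceModel:
    """Per-embodiment feasibility model, trained in simulation for one specific robot."""

    def score_path(self, image: np.ndarray, path: CandidatePath) -> float:
        # Placeholder traversability score: mean image brightness under the path.
        # The real model would predict feasibility for the deployed embodiment.
        cols = np.clip(path.waypoints[:, 0].astype(int), 0, image.shape[1] - 1)
        rows = np.clip(path.waypoints[:, 1].astype(int), 0, image.shape[0] - 1)
        return float(image[rows, cols].mean())


def select_path(planner: HighLevelPlanner,
                affordance: AffordanceModel,
                image: np.ndarray,
                goal: str) -> CandidatePath:
    """Semantic proposal first, embodiment-specific re-ranking second."""
    candidates = planner.propose_paths(image, goal)
    scores = [affordance.score_path(image, p) for p in candidates]
    return candidates[int(np.argmax(scores))]


# Usage: grayscale image in, best candidate path out.
image = np.random.rand(240, 320)
best = select_path(HighLevelPlanner(), AffordanceModel(), image, "go to the door")
```

In this framing, moving to a new robot only means swapping in a different affordance model; the high-level planner and the selection loop stay untouched.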
One of the main motivations for this design was the data-mixing problem in robotics. Large navigation datasets are useful, but once they combine demonstrations from robots with very different capabilities, it becomes difficult to train a single monolithic policy that transfers cleanly. In VAMOS, we keep the broadly reusable parts of navigation, such as scene understanding, obstacle avoidance, and long-horizon planning, in the shared high-level model, while handling embodiment-specific constraints with a smaller specialist model. This gives us a cleaner path toward cross-embodiment transfer than directly predicting low-level actions from raw observations.
Across our experiments, we found that this structure works well in practice. VAMOS outperforms both classical modular baselines and prior end-to-end or VLA-style navigation approaches in challenging real-world settings, including outdoor terrain and cross-embodiment deployment across wheeled and legged robots. The system is also naturally steerable through language, which lets users express navigation preferences at test time without redesigning the controller. More broadly, this paper reflects a direction I find compelling for robotics: using foundation-model-style generalization where it helps most, while preserving the task structure needed to respect the underlying physics of the robot.