FAST
FAST asks a very practical question for vision-language-action models: if the policy is autoregressive and has to predict discrete tokens, what is the right way to turn a continuous robot action trajectory into those tokens? A lot of recent VLA work uses fairly simple binning schemes, where each action dimension at each timestep gets discretized independently. That is easy to implement, but it throws away structure in the action sequence and becomes especially painful for fast, dexterous control.
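To make the baseline concrete, here is a minimal sketch of the kind of independent binning scheme described above. The function name and bin count are illustrative, not from the paper: each action dimension at each timestep is mapped to one of N uniform bins on its own, so a chunk of T timesteps and D dimensions becomes T × D tokens with no shared structure.

```python
import numpy as np

# Hypothetical sketch of the naive binning baseline: every action dimension
# at every timestep is discretized independently into one of `num_bins`
# uniform bins over a fixed range.
def bin_tokenize(actions, num_bins=256, low=-1.0, high=1.0):
    """actions: (T, D) array of continuous actions, assumed normalized to [low, high]."""
    clipped = np.clip(actions, low, high)
    bins = np.floor((clipped - low) / (high - low) * num_bins).astype(int)
    bins = np.minimum(bins, num_bins - 1)  # fold the boundary value `high` into the top bin
    return bins.flatten()  # T * D tokens, one per (timestep, dimension) pair

# A 50-step chunk for a 7-DoF arm already costs 350 tokens.
actions = np.stack([np.linspace(-0.5, 0.5, 50)] * 7, axis=1)
tokens = bin_tokenize(actions)
print(tokens.shape)  # (350,)
```

The token count grows linearly with control frequency, which is exactly why this scheme becomes painful for high-rate control: doubling the control rate doubles the sequence length the autoregressive model must predict.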
The paper’s main idea is to tokenize actions in frequency space instead of directly in the time domain. FAST, short for Frequency-space Action Sequence Tokenization, applies a discrete cosine transform to short action sequences, quantizes the resulting coefficients, and then compresses the quantized sequence with byte-pair encoding into tokens that are easier for an autoregressive model to predict. Intuitively, this gives the model a more compact description of how an action chunk evolves over time, rather than forcing it to model every tiny timestep-level fluctuation separately.
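The core transform-and-quantize step can be sketched in a few lines. This is a simplified illustration of the idea, not the paper's exact pipeline (the scale factor and the absence of the BPE stage are my simplifications): take the DCT of each action dimension over a chunk, round the coefficients to integers, and invert to see how little is lost.

```python
import numpy as np
from scipy.fft import dct, idct

# Simplified FAST-style round trip: DCT per action dimension, coarse integer
# quantization of the coefficients, then inverse DCT. The actual tokenizer
# additionally compresses the integer sequence with byte-pair encoding.
def dct_tokenize(actions, scale=10.0):
    """actions: (T, D) chunk; returns rounded DCT coefficients as integers."""
    coeffs = dct(actions, axis=0, norm="ortho")   # frequency-space description
    return np.round(coeffs * scale).astype(int)   # coarse quantization

def dct_detokenize(tokens, scale=10.0):
    return idct(tokens / scale, axis=0, norm="ortho")

# A smooth 7-DoF trajectory survives quantization with small error.
t = np.linspace(0, 1, 50)
actions = np.stack([np.sin(2 * np.pi * t + p) for p in np.linspace(0, 1, 7)], axis=1)
recon = dct_detokenize(dct_tokenize(actions))
print(np.abs(recon - actions).max())  # small quantization error
```

Because the DCT is orthonormal, rounding error in coefficient space translates into a bounded reconstruction error in the time domain, which is what makes coarse quantization safe for smooth trajectories.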
What makes this interesting is that the contribution is not a new giant policy architecture, but a better interface between continuous robot control and discrete sequence modeling. The authors show that this matters a lot in the settings where VLA-style models usually struggle most: high-frequency control and dexterous manipulation. In those regimes, the usual tokenization baselines can fail outright, while FAST makes autoregressive training viable again.
They also introduce FAST+, a tokenizer trained on around one million real robot trajectories and intended to work as a more universal drop-in tokenizer across robots, action spaces, and control rates. That part is important because action representations are often annoyingly tied to one embodiment or one dataset. A reusable tokenizer makes the whole VLA stack feel much more scalable.
This also connects nicely to Sander Dieleman’s “diffusion is spectral autoregression” perspective. That post argues that diffusion works well in images partly because generation happens in a frequency-aware coarse-to-fine order, rather than treating every pixel update as equally primitive. FAST is not a diffusion paper, but it borrows a very similar intuition for robotics: action sequences have structure that looks much cleaner in a spectral basis than in raw timestep-by-timestep coordinates. Using a DCT-based tokenizer effectively gives the autoregressive policy a compressed frequency-domain description of motion, so the model spends less capacity on tiny local wiggles and more on the underlying trajectory shape.
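The spectral intuition is easy to verify numerically. The signal below is my own toy example, not data from the paper: for a smooth trajectory, almost all of the DCT energy concentrates in the first few coefficients, so the frequency-domain description is dramatically more compact than the raw timestep sequence.

```python
import numpy as np
from scipy.fft import dct

# Toy demonstration of energy compaction: a smooth 1-D trajectory needs only
# a handful of DCT coefficients to capture 99% of its energy.
t = np.linspace(0, 1, 100)
trajectory = 0.5 * np.sin(2 * np.pi * t) + 0.1 * t  # smooth synthetic motion
coeffs = dct(trajectory, norm="ortho")
energy = np.cumsum(coeffs**2) / np.sum(coeffs**2)
k = int(np.argmax(energy > 0.99)) + 1  # coefficients needed for 99% energy
print(f"{k} of 100 coefficients capture 99% of the energy")
```

This is the same energy-compaction property that makes the DCT the workhorse of JPEG compression, and it is why an autoregressive policy predicting DCT-derived tokens can spend its capacity on trajectory shape rather than per-timestep noise.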
My main takeaway is that this paper highlights how much progress in robotics can come from representation choices rather than only bigger models. If you want transformer-style policies to compete with diffusion-based approaches, it is not enough to scale data and compute; you also need the action space to be presented in a form the model can actually learn. FAST is a clean example of that idea, and the reported speedup over diffusion VLAs makes it especially compelling from a practical deployment standpoint.