TL;DR: Hierarchical VLA architectures can enable robotic manipulation with semantic, visual, and geometric generalization when trained on cheap, off-domain data.
Motivation: Large models have shown strong open-world generalization on complex problems in vision and language, but they have been comparatively difficult to deploy in robotics. This challenge stems from several factors, foremost among them the lack of scalable robotic training data, which typically requires expensive on-robot collection.
Highlights: We study a class of hierarchical VLA models in which high-level VLMs are trained on relatively cheap data to produce semantically meaningful intermediate predictions, such as 2D paths indicating desired behavior. These predicted 2D paths then serve as guidance for low-level control policies that are 3D-aware and capable of precise manipulation. We show that separating high-level semantic prediction from 3D-aware low-level prediction allows such hierarchical VLA policies to transfer across significant domain gaps, for instance from simulation to the real world or across scenes with widely varying visual appearance. This separation makes it possible to use cheap, abundant data sources beyond teleoperated on-robot data, thereby enabling broad semantic and visual generalization.
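To make the decomposition concrete, here is a minimal Python sketch of the hierarchical interface described above. The class and function names (`HighLevelVLM`, `LowLevelPolicy`, `predict_2d_path`, `run_episode`) are illustrative assumptions, not the actual implementation: the high level maps an image and an instruction to a 2D path, and the low level maps that path plus 3D observations to actions.

```python
import numpy as np

class HighLevelVLM:
    """Hypothetical high-level model trained on cheap off-domain data
    (e.g. web video or simulation). Given an RGB image and a language
    instruction, it predicts a coarse 2D path in pixel coordinates."""

    def predict_2d_path(self, rgb_image: np.ndarray, instruction: str) -> np.ndarray:
        # Placeholder: a real system would query a fine-tuned VLM here.
        # Returns an (N, 2) array of pixel waypoints.
        h, w = rgb_image.shape[:2]
        return np.linspace([w * 0.2, h * 0.8], [w * 0.7, h * 0.3], num=8)


class LowLevelPolicy:
    """Hypothetical low-level 3D-aware controller. Conditioned on the
    predicted 2D path plus depth observations, it outputs precise
    end-effector actions. It never sees the language instruction directly,
    which lets the high level absorb the semantic and visual domain gap."""

    def act(self, path_2d: np.ndarray, depth: np.ndarray) -> np.ndarray:
        # Placeholder: lift the next 2D waypoint into 3D using depth.
        # A real policy would be learned from on-robot or simulated data.
        u, v = path_2d[0].astype(int)
        z = depth[v, u]
        return np.array([float(u), float(v), z, 0.0])  # toy (x, y, z, gripper) action


def run_episode(vlm, policy, rgb, depth, instruction, steps=10):
    """One control loop: the VLM plans a 2D path once (or at low frequency),
    and the low-level policy tracks it with 3D-aware actions."""
    path = vlm.predict_2d_path(rgb, instruction)
    actions = []
    for _ in range(steps):
        actions.append(policy.act(path, depth))
        if len(path) > 1:
            path = path[1:]  # consume waypoints as they are reached
    return actions


if __name__ == "__main__":
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)
    depth = np.ones((480, 640), dtype=np.float32)
    acts = run_episode(HighLevelVLM(), LowLevelPolicy(), rgb, depth,
                       "put the red block in the bowl")
    print(len(acts), "actions produced")
```

Because only the high level consumes language and appearance, it can be trained on off-domain data, while the low level can be trained on a much smaller amount of 3D-grounded robot data; this is one plausible reading of why the split transfers across domain gaps.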