(a): First, we fine-tune a pre-trained VLM on a large, general, off-domain dataset $ \tilde{\mathcal{D}}_\text{off} $ to produce paths and points.
This includes robot simulation data and robot teleoperation data from environments different from the target environment.
(b): Then, we train a data-efficient, 3D-input policy on a small, in-domain dataset $ \mathcal{D}_\text{path} $ to produce actions conditioned on path-drawn images.
These path-drawn images help the policy generalize to new semantic, visual, and geometric task variations!
At inference time, we use the VLM to predict paths for the first observation $o_1$, conditioned on the text instruction.
These paths are drawn onto every image the policy sees, $\tilde{o}_t$, from which the policy then executes low-level environment actions.
Visual generalization comes from the fine-tuned VLM; expert low-level control comes from the separate low-level 3D policy.
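To make the path-overlay step concrete, here is a minimal sketch in Python with OpenCV. It is an illustration under assumptions, not the released implementation: `draw_path` is a hypothetical helper, and the normalized (x, y) waypoint format follows the prompt examples shown further below.

```python
import cv2
import numpy as np

def draw_path(image: np.ndarray, path, color=(0, 255, 0), thickness=2) -> np.ndarray:
    """Overlay a path of normalized (x, y) waypoints in [0, 1] onto an image."""
    overlay = image.copy()
    h, w = image.shape[:2]
    pixels = [(int(x * (w - 1)), int(y * (h - 1))) for x, y in path]
    for start, end in zip(pixels[:-1], pixels[1:]):
        cv2.line(overlay, start, end, color, thickness)
    return overlay

# Demo on a synthetic frame; in the real pipeline the waypoints come from the
# VLM's answer for the first observation o_1 and the text instruction, and the
# same overlay is applied to every frame \tilde{o}_t the low-level policy sees.
frame = np.zeros((256, 256, 3), dtype=np.uint8)
waypoints = [(0.25, 0.32), (0.32, 0.17), (0.74, 0.21)]
frame_tilde = draw_path(frame, waypoints)
```

At each control step, the same overlay is applied to the current RGB frame before it is passed, together with depth, to the low-level 3D policy.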
Task: Put the M on Mona Lisa; Put the E on Einstein
Task: Put the coke can on Jensen Huang
Task: Put the S block on the plate pointed to by the arrow
| Method | VLM | Fine-tuning Data | Average Human Rank (lower is better) |
|---|---|---|---|
| RT-Traj. (Gu et al. 2023) | Zero-shot GPT-4o | - | 3.47 |
| RT-Traj. (Gu et al. 2023) | GPT-4o via Code as Policies (Liang et al. 2022) | - | 3.41 |
| HAMSTER (ours) | VILA (Lin et al. 2023) | Our off-domain data $ \tilde{\mathcal{D}}_\text{off} $ without simulated RLBench | 2.13 |
| HAMSTER (ours) | VILA (Lin et al. 2023) | Our off-domain data $ \tilde{\mathcal{D}}_\text{off} $ | 1.40 |
Human evaluation shows that HAMSTER, fine-tuned with simulation data included, excels at capturing spatial and semantic information across tasks, outperforming zero-shot VLM-based trajectory generation (Gu et al., 2023).
The low-level policy successfully puts the grapes into the white bowl.
The low-level 3D policy reasons about the milk carton's different height to put it into the white bowl.
The 3D policy reasons about the red mug's different height to successfully put the grapes into it.
| Prompt | Drawn Path |
|---|---|
| Original Prompt: In the image, please execute the command described in <quest>move the coke to Taylor Swift</quest>. Provide a sequence of points denoting the trajectory of a robot gripper to achieve the goal. Format your answer as a list of tuples enclosed by <ans> and </ans> tags. For example: <ans>[(0.25, 0.32), (0.32, 0.17), (0.13, 0.24), <action>Open Gripper</action>, (0.74, 0.21), <action>Close Gripper</action>, ...]</ans> The tuple denotes point x and y location of the end effector of the gripper in the image. The action tags indicate the gripper action. The coordinates should be floats ranging between 0 and 1, indicating the relative locations of the points in the image. | ![]() |
| Paraphrased Prompt: In the provided image, perform the task described in <quest>have the coke on the lady</quest>. Generate a sequence of points representing the trajectory of a robot gripper to accomplish the objective. Present the output as a list of tuples encapsulated within <ans> and </ans> tags. For instance: <ans>[(0.74, 0.21), <action>Close Gripper</action>, (0.25, 0.32), (0.32, 0.17), (0.13, 0.24), <action>Open Gripper</action>, ...]</ans> | ![]() |
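Because the prompts above fix a concrete output grammar, the VLM's answer can be parsed with a few lines of standard-library Python. This is a minimal sketch assuming answers follow exactly the tuple/`<action>` format shown; `parse_vlm_answer` is a hypothetical helper, not part of the released code.

```python
import re

def parse_vlm_answer(text: str):
    """Parse an <ans>...</ans> block into an ordered list of (x, y) waypoints
    and gripper-action strings, per the format shown in the prompts above."""
    match = re.search(r"<ans>(.*?)</ans>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <ans> block found in VLM output")
    token = re.compile(
        r"\(\s*([01]?\.\d+|[01])\s*,\s*([01]?\.\d+|[01])\s*\)"  # (x, y) waypoint
        r"|<action>(.*?)</action>"                              # gripper action
    )
    steps = []
    for m in token.finditer(match.group(1)):
        if m.group(3) is not None:
            steps.append(m.group(3))                            # e.g. 'Open Gripper'
        else:
            steps.append((float(m.group(1)), float(m.group(2))))
    return steps

demo = "<ans>[(0.25, 0.32), <action>Open Gripper</action>, (0.74, 0.21)]</ans>"
print(parse_vlm_answer(demo))
# -> [(0.25, 0.32), 'Open Gripper', (0.74, 0.21)]
```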
| Policy | Average Success Rate on New Tasks (%) |
|---|---|
| RVT2 (Goyal et al. 2024) | 22.11 |
| HAMSTER + RVT2 | 66.56 |
| 3D-DA (Ke et al. 2024) | 16.70 |
| HAMSTER + 3D-DA | 72.55 |
HAMSTER works with different low-level policies (RVT2, 3D-DA) and delivers a substantial improvement in success rate over both base policies.
@misc{li2025hamsterhierarchicalactionmodels,
title={HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation},
author={Yi Li and Yuquan Deng and Jesse Zhang and Joel Jang and Marius Memmel and Raymond Yu and Caelan Reed Garrett and Fabio Ramos and Dieter Fox and Anqi Li and Abhishek Gupta and Ankit Goyal},
year={2025},
eprint={2502.05485},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2502.05485},
}