The method achieves a state-of-the-art 94.3% success rate on LIBERO-90 and a 97.5% success rate across the four LIBERO suites (Goal, Spatial, Object, Long).
Even a single-demonstration policy leaps from 4% to 97% success in just 15 optimization iterations.
Turn on the stove and put the moka pot on it
Put cream cheese and butter in basket
Put yellow and white mug in microwave and close it
Place book in back compartment of the caddy
Put the black bowl in the bottom drawer and close it
Place mugs on different plates
Put mug in microwave
Put mug on plate and pudding to the right
Put the black bowl in the bottom drawer and close it
Put cream cheese box and butter in basket
All clips show LIBERO-LONG tasks executed by a 1-shot SFT policy after RIPT-VLA post-training.
Starting from a supervised VLA policy, we roll out K completions per context, observe sparse success/failure rewards, and compute leave-one-out advantages: each rollout's reward minus the mean reward of the other K-1 rollouts for the same context. We then update the policy with a PPO objective, as sketched below.
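A minimal PyTorch sketch of this update, assuming per-rollout binary rewards and per-rollout log-likelihoods: leave-one-out advantages over K rollouts feed a standard PPO clipped surrogate. Function names and the toy numbers are illustrative, not the released implementation.

```python
import torch

def leave_one_out_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (K,) tensor of 0/1 success indicators for K rollouts of one context."""
    K = rewards.numel()
    # Baseline for rollout i is the mean reward of the other K-1 rollouts.
    baseline = (rewards.sum() - rewards) / (K - 1)
    return rewards - baseline

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: K = 4 rollouts of one context, two of which succeed.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = leave_one_out_advantages(rewards)       # tensor([ 0.6667, -0.6667,  0.6667, -0.6667])
old_lp = torch.log(torch.full((4,), 0.5))     # log-likelihoods under the rollout policy (assumed)
new_lp = old_lp + 0.1 * torch.randn(4)        # log-likelihoods under the current policy (assumed)
loss = ppo_clip_loss(new_lp, old_lp, adv)     # backpropagate through new_lp in practice
```

Because successes and failures within the same context are compared against each other, a uniformly failing or uniformly succeeding context contributes zero advantage, so the sparse reward signal is used only where it is informative.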
Table 1: Multitask success rate (SR%) across the four LIBERO suites. Bold indicates the best result, underline marks the second-best. Improvements from RIPT-VLA are highlighted in red. *: OpenVLA-OFT results are obtained from the official per-suite checkpoints.
Table 2: Success rates (%) on LIBERO-90 and MetaWorld-45 benchmarks under both full-data and 5-shot settings. Bold indicates the best result, underline marks the second-best. Improvements from RIPT-VLA are highlighted in red in the bottom row.
Figure 1: Sample efficiency of RIPT-VLA on LIBERO-LONG under few-shot multitask training. RIPT-VLA significantly improves over standard SFT across all data scales, achieving a +20.8% absolute gain with just 1 demonstration per task. As the number of demos increases, RIPT-VLA continues to outperform SFT, demonstrating strong scalability and robustness in low-data regimes.
Figure 2: RIPT-VLA enables strong cross-scenario generalization: transferring skills from one scene to another with the same goal. Even with just 1 demo, RIPT-VLA achieves up to +82.7% improvement over standard SFT.
Figure 3: RIPT-VLA also supports cross-goal transfer within the same scene, generalizing to new instructions using minimal supervision. It consistently outperforms SFT, with gains as high as +84.7% using only 3–10 demos.
@misc{tan2025interactiveposttrainingvisionlanguageactionmodels,
      title={Interactive Post-Training for Vision-Language-Action Models},
      author={Shuhan Tan and Kairan Dou and Yue Zhao and Philipp Krähenbühl},
      year={2025},
      eprint={2505.17016},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.17016},
}