RIPT-VLA: Interactive Post-Training for Vision-Language-Action Models

University of Texas at Austin, Nankai University

RIPT-VLA is a lightweight yet powerful interactive post-training
framework that refines pretrained VLA models using only sparse binary success signals.

The method achieves a state-of-the-art 94.3% success rate on LIBERO-90 and a 97.5% success rate across the four LIBERO suites (Goal, Spatial, Object, Long).

Even a policy trained on a single demonstration leaps from 4% to 97% success in just 15 optimization iterations.

Results Visualization

All clips show LIBERO-LONG tasks executed by a 1-shot SFT policy after RIPT-VLA post-training.

The RIPT-VLA Loop

Starting from a supervised VLA policy, we roll out K completions per context, observe sparse success/failure rewards, and compute leave-one-out advantages. We then update the policy with a PPO objective.
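To make this concrete, below is a minimal PyTorch-style sketch of a single RIPT-VLA update step. It is an illustration under stated assumptions rather than the authors' released code: the `policy` object and its `log_prob` method, the `rollout` helper and its return signature, the value K = 8, and the 0.2 clipping range are all hypothetical placeholders.

  import torch

  K = 8           # completions sampled per context (assumed value, must be >= 2)
  CLIP_EPS = 0.2  # PPO clipping range (assumed value)

  def ript_vla_step(policy, optimizer, contexts, rollout):
      """One interactive post-training step with leave-one-out advantages.

      `rollout(policy, ctx)` is a hypothetical helper that executes one
      episode and returns (binary_reward, old_logp, actions, states),
      where old_logp is the episode's total action log-probability
      under the rollout-time policy.
      """
      for ctx in contexts:
          # 1. Roll out K completions for this context; the only feedback
          #    is a sparse binary reward (1 = task success, 0 = failure).
          rewards, old_logps, actions, states = zip(
              *(rollout(policy, ctx) for _ in range(K))
          )
          rewards = torch.tensor(rewards, dtype=torch.float32)

          # 2. Leave-one-out advantage: baseline each completion with the
          #    mean reward of the other K - 1 completions,
          #    A_i = r_i - (sum_{j != i} r_j) / (K - 1).
          advantages = rewards - (rewards.sum() - rewards) / (K - 1)

          # 3. PPO clipped surrogate against the rollout-time policy.
          old_logps = torch.stack(old_logps).detach()
          new_logps = torch.stack(
              [policy.log_prob(s, a) for s, a in zip(states, actions)]
          )
          ratio = torch.exp(new_logps - old_logps)
          loss = -torch.min(
              ratio * advantages,
              torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantages,
          ).mean()

          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

One consequence of the leave-one-out baseline is visible directly in this sketch: if all K rollouts for a context share the same reward (all succeed or all fail), every advantage is zero and that context contributes no gradient, so learning is driven entirely by contexts where outcomes differ.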

Key Results

LIBERO-Suites

Table 1: Multitask success rate (SR, %) across the four LIBERO suites. Bold indicates the best result; underline marks the second-best. Improvements from RIPT-VLA are highlighted in red. *: OpenVLA-OFT results are obtained from the official per-suite checkpoints.

LIBERO-90 and MetaWorld-45

Table 2: Success rates (%) on the LIBERO-90 and MetaWorld-45 benchmarks under both full-data and 5-shot settings. Bold indicates the best result; underline marks the second-best. Improvements from RIPT-VLA are highlighted in red in the bottom row.

LIBERO-LONG Few-Shot

Figure 1: Sample efficiency of RIPT-VLA on LIBERO-LONG under few-shot multitask training. RIPT-VLA significantly improves over standard SFT across all data scales, achieving a +20.8% absolute gain with just 1 demonstration per task. As the number of demos increases, RIPT-VLA continues to outperform SFT, demonstrating strong scalability and robustness in low-data regimes.

Cross-Scenario Generalization

Figure 2: RIPT-VLA enables strong cross-scenario generalization, transferring skills from one scene to another while pursuing the same goal. Even with just 1 demo, RIPT-VLA achieves up to a +82.7% improvement over standard SFT.

Cross-Goal Generalization

Figure 3: RIPT-VLA also supports cross-goal transfer within the same scene, generalizing to new instructions using minimal supervision. It consistently outperforms SFT, with gains as high as +84.7% using only 3–10 demos.

BibTeX


  @misc{tan2025interactiveposttrainingvisionlanguageactionmodels,
    title={Interactive Post-Training for Vision-Language-Action Models}, 
    author={Shuhan Tan and Kairan Dou and Yue Zhao and Philipp Krähenbühl},
    year={2025},
    eprint={2505.17016},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2505.17016}, 
  }