ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

Sombit Dey1, Jan-Nico Zaech1, Nikolay Nikolov1, Danda Pani Paudel1, Luc Van Gool1
[Teaser image]

ReVLA aims to improve the visual domain capabilities of robotic foundation models such as OpenVLA. To avoid catastrophic forgetting, we propose a gradual backbone reversal approach based on model merging. This enables OpenVLA, which requires adaptation of its visual backbones during initial training, to regain the visual generalization ability of the original backbones.
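To make the idea concrete, below is a minimal sketch (in PyTorch) of gradual backbone reversal via weight-space model merging. The names vla, vla.vision_backbone, and pretrained_backbone, as well as the four-stage schedule, are illustrative assumptions, not the exact ReVLA implementation.

import copy
import torch

@torch.no_grad()
def merge_backbones(adapted, original, alpha):
    """Interpolate parameters in weight space: alpha = 0 keeps the
    adapted weights, alpha = 1 fully reverts to the original backbone."""
    merged = copy.deepcopy(adapted)
    original_params = dict(original.named_parameters())
    for name, param in merged.named_parameters():
        # param <- (1 - alpha) * param + alpha * original_params[name]
        param.lerp_(original_params[name], alpha)
    return merged

# Assumed setup: `vla` is the policy and `pretrained_backbone` holds the
# original visual encoder weights. Revert gradually, fine-tuning the
# policy between merge steps so the action decoder can track the
# shifting visual features.
for alpha in (0.25, 0.5, 0.75, 1.0):
    vla.vision_backbone = merge_backbones(vla.vision_backbone,
                                          pretrained_backbone, alpha)
    # ... fine-tune `vla` on robot demonstrations before the next merge ...

At alpha = 1.0 the visual backbone matches the original pretrained encoder exactly, while the rest of the model has been trained to work with it.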


Hardware Experiments

We test ReVLA on the WidowX robot from the Bridge setup and compare it against OpenVLA. The robot is prompted with a language instruction and a single uncalibrated third-person-view image.

Task (category)                                                   | ReVLA | OpenVLA
Pick up the lobster and place it on the plate (OOD objects)       | 4/10  | 0/10
Place the cup on the blue plate (visual recognition)              | 5/10  | 2/10
Pick up the banana and place it in the pan (in-domain task)       | 8/10  | 9/10


We also evaluated OpenVLA-Bridge, but it performs worse than OpenVLA; we therefore compare only against OpenVLA.
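For reference, the prompting setup used in these evaluations follows the usage example published with OpenVLA; a minimal sketch is shown below. The image path and instruction are placeholders, and we assume a ReVLA checkpoint exposes the same predict_action interface.

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the policy (a ReVLA checkpoint would be loaded the same way).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

# A single uncalibrated third-person view plus a language instruction.
image = Image.open("third_person_view.jpg")
prompt = "In: What action should the robot take to pick up the banana and place it in the pan?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-DoF end-effector action, un-normalized with Bridge statistics.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)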


Qualitative Results

Below we show side-by-side rollouts of ReVLA and OpenVLA on the WidowX robot, using the same prompting setup as above.


[Video: ReVLA]  [Video: OpenVLA]
Task prompt: "Pick up the lobster and place it on the plate."


[Video: ReVLA]  [Video: OpenVLA]
Task prompt: "Pick up the cup and place it on the specified plate."


In-domain task and objects: these episodes use the same objects as in the Bridge dataset, testing whether performance drops on in-domain tasks.

[Video: ReVLA]  [Video: OpenVLA]
Task prompt: "Pick up the banana and place it on the pan."

BibTeX

@misc{dey2024revlarevertingvisualdomain,
      title={ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models}, 
      author={Sombit Dey and Jan-Nico Zaech and Nikolay Nikolov and Luc Van Gool and Danda Pani Paudel},
      year={2024},
      eprint={2409.15250},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.15250}, 
}