Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy


¹Inria, École normale supérieure, CNRS, PSL Research University
*Equal contribution

International Conference on Robotics and Automation (ICRA) 2025

Abstract


Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS’s motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.



Method Overview
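The abstract describes 3D-LOTUS++ as a modular pipeline: an LLM decomposes the language instruction into primitive steps, a VLM grounds the objects referenced by each step in the scene, and the 3D-LOTUS policy predicts the motions that execute each step from 3D observations. The Python sketch below illustrates only this control flow; all class and method names (TaskPlannerLLM, ObjectGrounderVLM, Lotus3DPolicy, run_episode) are hypothetical placeholders, not the released code.

```python
# Minimal control-flow sketch of the modular 3D-LOTUS++ pipeline described
# above. All names are hypothetical placeholders for illustration only.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str        # e.g. "grasp", "place"
    object_query: str  # e.g. "the frog toy", "the top part of the drawer"


class TaskPlannerLLM:
    """Wraps an LLM that splits an instruction into primitive steps."""
    def plan(self, instruction: str) -> List[Step]:
        ...


class ObjectGrounderVLM:
    """Wraps a VLM that localizes the queried object in the observation."""
    def ground(self, object_query: str, rgb, point_cloud):
        ...  # e.g. returns a 3D mask or bounding box of the object


class Lotus3DPolicy:
    """3D-LOTUS motion policy conditioned on the grounded step."""
    def act(self, step: Step, grounded_object, point_cloud):
        ...  # returns the next end-effector action


def run_episode(instruction, env, planner, grounder, policy, max_steps=100):
    """Execute one language instruction with the modular pipeline."""
    obs = env.reset()
    for step in planner.plan(instruction):           # LLM task planning
        target = grounder.ground(step.object_query,  # VLM object grounding
                                 obs["rgb"], obs["pcd"])
        for _ in range(max_steps):                   # 3D-LOTUS motion control
            action = policy.act(step, target, obs["pcd"])
            obs, step_done = env.step(action)
            if step_done:
                break
```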

GemBench tasks visualizer





Real robot experiments


Seen task variations
We train our 3D-LOTUS model on a set of tasks on the real-world robot and evaluate it on the same tasks with different object placements (a minimal evaluation sketch follows the examples below):

"stack the yellow cup on top of the pink cup"

"put the yellow cup on top of the pink one"

"place the navy cup onto the yellow cup"

"pick up and set the navy cup down into the yellow cup"

"put the frog toy in the top part of the drawer"

"take the frog toy and put it in the top compartiment of the drawer"

"take the pink mug and put it on the middle part of the hanger"

"put the pink mug on the middle part of the hanger"

"put the strawberry in the box"

"take the strawberry and put it inside the box"

"put the peach in the box"

"take the peach and put it inside the box"




Unseen task variations
We then train our improved 3D-LOTUS++ model on the previous set of tasks and leverage an LLM and a VLM to generalize to unseen task variations by interacting with new objects and instructions (an illustrative decomposition sketch follows the examples below):

"put the banana in the box"

"take the banana and put it inside the box"

"put the lemon in the box"

"take the lemon and put it inside the box"

"put the tuna can in the box, then put the corn in the box"

"pick the tuna can and place it on the box, then place the corn in the box"

"put the grape on the yellow plate, then put the banana on the pink plate"

"put the grape on the yellow plate, then put the banana on the pink plate"

"stack the black cup into the orange one"

"pick the black cup and put it in the orange cup"

"keeping the yellow cup on the table, stack the red one onto it"

"pick the red cup and put it in the yellow cup"

"place the yellow cup inside the red cup, then the cyan cup on top"

"place the yellow cup inside the red cup, then the cyan cup on top"

Project Contributors