Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy


¹Inria, École normale supérieure, CNRS, PSL Research University
*Equal contribution

International Conference on Robotics and Automation (ICRA) 2025

Abstract


Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS’s motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.



Method Overview
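The abstract describes 3D-LOTUS++ as a modular pipeline: an LLM decomposes the language instruction into primitive steps, a VLM grounds the objects referenced by each step in the scene, and the 3D-LOTUS policy predicts the motions that execute each step from 3D observations. The Python sketch below illustrates only this control flow; all class and method names (TaskPlannerLLM, ObjectGrounderVLM, Lotus3DPolicy, run_episode) are hypothetical placeholders, not the released code.

```python
# Minimal control-flow sketch of the modular 3D-LOTUS++ pipeline described
# above. All names are hypothetical placeholders for illustration only.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str        # e.g. "grasp", "place"
    object_query: str  # e.g. "the frog toy", "the top part of the drawer"


class TaskPlannerLLM:
    """Wraps an LLM that splits an instruction into primitive steps."""
    def plan(self, instruction: str) -> List[Step]:
        ...


class ObjectGrounderVLM:
    """Wraps a VLM that localizes the queried object in the observation."""
    def ground(self, object_query: str, rgb, point_cloud):
        ...  # e.g. returns a 3D mask or bounding box of the object


class Lotus3DPolicy:
    """3D-LOTUS motion policy conditioned on the grounded step."""
    def act(self, step: Step, grounded_object, point_cloud):
        ...  # returns the next end-effector action


def run_episode(instruction, env, planner, grounder, policy, max_steps=100):
    """Execute one language instruction with the modular pipeline."""
    obs = env.reset()
    for step in planner.plan(instruction):           # LLM task planning
        target = grounder.ground(step.object_query,  # VLM object grounding
                                 obs["rgb"], obs["pcd"])
        for _ in range(max_steps):                   # 3D-LOTUS motion control
            action = policy.act(step, target, obs["pcd"])
            obs, step_done = env.step(action)
            if step_done:
                break
```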

GemBench tasks visualizer





Real robot experiments


Seen task variations
We train our 3D-LOTUS model on a set of tasks on the real-world robot and evaluate it on the same tasks with different object placements (a minimal evaluation sketch follows the examples below):

"stack the yellow cup on top of the pink cup"

"put the yellow cup on top of the pink one"

"place the navy cup onto the yellow cup"

"pick up and set the navy cup down into the yellow cup"

"put the frog toy in the top part of the drawer"

"take the frog toy and put it in the top compartiment of the drawer"

"take the pink mug and put it on the middle part of the hanger"

"put the pink mug on the middle part of the hanger"

"put the strawberry in the box"

"take the strawberry and put it inside the box"

"put the peach in the box"

"take the peach and put it inside the box"




Unseen task variations
We then train our improved 3D-LOTUS++ model on the previous set of tasks and leverage an LLM and a VLM to generalize to unseen task variations by interacting with new objects and instructions (an illustrative decomposition sketch follows the examples below):

"put the banana in the box"

"take the banana and put it inside the box"

"put the lemon in the box"

"take the lemon and put it inside the box"

"put the tuna can in the box, then put the corn in the box"

"pick the tuna can and place it on the box, then place the corn in the box"

"put the grape on the yellow plate, then put the banana on the pink plate"

"put the grape on the yellow plate, then put the banana on the pink plate"

"stack the black cup into the orange one"

"pick the black cup and put it in the orange cup"

"keeping the yellow cup on the table, stack the red one onto it"

"pick the red cup and put it in the yellow cup"

"place the yellow cup inside the red cup, then the cyan cup on top"

"place the yellow cup inside the red cup, then the cyan cup on top"

Project Contributors