|
|
|
|
|
|
|
|
|
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM that takes multi-view images as input and performs detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also improves task success rates when integrated into a state-of-the-art manipulation system, both in simulation and on a real robot, demonstrating the impact of our generated failure data.
Failure Data Generation Pipeline. We introduce a pipeline that generates failure cases both online in simulation (RLBench) and offline on a real-world dataset (BridgeDataV2). For each positive example, given its correct plan and successful trajectory, we generate a corresponding incorrect plan and unsuccessful trajectory.
Chain-of-Thought (CoT) Generation. We introduce an automatic method to generate step-by-step CoTs for training reasoning models. For each sample, we first collect the object category, spatial location, and robot state from the RLBench simulator or from ECoT annotations, together with the corresponding failure reason. We then prompt a large reasoning-capable VLM (InternVL3-38B) to generate step-by-step reasoning traces based on the initial text–image inputs and the aforementioned information. For planning samples, the model is instructed to sequentially verify each subtask and subsequently analyze the overall plan. For execution samples, the model is guided to describe the pre- and post-action images before assessing subtask completion. The reasoning trace contains 118 tokens on average.
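The prompt construction step above can be sketched as follows. The field names and the branching on sample type are illustrative stand-ins for the authors' actual prompting setup, not their exact implementation.

```python
# Minimal sketch of metadata-grounded CoT prompt construction.
# Field names ('object_category', 'failure_reason', ...) are hypothetical;
# the real pipeline draws these from RLBench or ECoT annotations.

def build_cot_prompt(sample: dict) -> str:
    """Assemble the text prompt sent (with images) to the reasoning VLM."""
    lines = [
        f"Task instruction: {sample['instruction']}",
        f"Object category: {sample['object_category']}",
        f"Object location: {sample['object_location']}",
        f"Robot state: {sample['robot_state']}",
        f"Failure reason (ground truth): {sample['failure_reason']}",
    ]
    if sample["type"] == "planning":
        # Planning samples: verify each subtask, then the overall plan.
        lines.append("Verify each subtask of the plan in order, "
                     "then judge the overall plan.")
    else:
        # Execution samples: describe pre/post images, then judge completion.
        lines.append("Describe the pre- and post-action images, then judge "
                     "whether the subtask was completed.")
    return "\n".join(lines)
```

The returned string, paired with the corresponding images, would then be passed to the reasoning VLM (InternVL3-38B in the paper) to produce the step-by-step trace.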
Real-Robot, Policy-Driven Data Collection. We curate UR5-Fail, a real-robot dataset collected with a UR5 arm and three cameras. We run the 3D-LOTUS++ policy on 34 tasks, recording initial and final multi-view images for each subtask. Subtasks are manually labeled as success or failure to obtain execution failure data. For planning failures, we annotate ground-truth plans and generate failures using the method above. Unlike RoboFail, which is single-view and relies solely on teleoperation, UR5-Fail is three-view and features autonomous policy rollouts, yielding more realistic failures.
Model Architecture and Integration into a Robotic Manipulation Framework. Left: Overview of the Guardian model architecture. Right: Integration of Guardian model into a robot manipulation pipeline for planning and execution verification.
Architecture. The Guardian model is built upon InternVL3-8B. Rather than concatenating multiple images into a single grid-based image as in AHA, Guardian processes each image independently through the visual encoder. This design preserves fine-grained spatial details within each image and allows the model to explicitly reason about spatial and temporal changes for more accurate failure detection. Furthermore, unlike SuccessVQA and AHA that output direct classifications, Guardian leverages an explicit reasoning trace before concluding success or failure.
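The contrast between per-image encoding and grid concatenation can be illustrated with a toy sketch. Here `encode_image` is a stand-in for the real InternVL3 visual encoder, and the token budget is an assumption for illustration only.

```python
# Toy illustration of per-image encoding (Guardian) versus grid
# concatenation (as in AHA). 'encode_image' is a hypothetical stand-in
# for the visual encoder; token counts are illustrative.

def encode_image(image, tokens_per_image=4):
    """Stand-in encoder: one fixed-size token sequence per input image."""
    return [f"{image}_tok{i}" for i in range(tokens_per_image)]

def encode_independently(images):
    """Guardian-style: each view keeps its own full token sequence,
    preserving per-view spatial detail."""
    return [encode_image(img) for img in images]

def encode_as_grid(images, tokens_per_image=4):
    """Grid-style: all views are tiled into one image, so they share a
    single image's token budget and lose fine-grained detail."""
    return encode_image("grid(" + "+".join(images) + ")", tokens_per_image)
```

With three camera views, independent encoding yields three full token sequences, while the grid variant compresses all views into a single sequence of the same length as one image.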
Failure Detection and Recovery. Guardian can be seamlessly plugged into existing robotic manipulation pipelines as a verification layer without requiring any architectural modification. Without loss of generality, consider a modular robotic manipulation framework. Guardian can be inserted at each planning and subtask execution step to detect potential failures. Upon detection, it can trigger replanning or re-execute the corresponding motion policy to facilitate recovery.
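The verification loop described above can be sketched as follows. The `planner`, `policy`, and `verifier` interfaces, as well as the retry limits, are hypothetical stand-ins for the host pipeline and the Guardian model.

```python
# Sketch of inserting a verifier into a modular manipulation loop.
# All interfaces here (planner, policy, verifier) are illustrative
# assumptions, not the actual Guardian API.

def run_with_verification(task, planner, policy, verifier,
                          max_replans=2, max_retries=2):
    """Run a task with plan- and execution-level verification."""
    plan = planner(task)
    for _ in range(max_replans):
        if verifier.check_plan(task, plan):
            break
        plan = planner(task)  # planning failure detected: replan
    for subtask in plan:
        for _ in range(max_retries + 1):
            obs_before, obs_after = policy(subtask)  # run motion policy
            if verifier.check_execution(subtask, obs_before, obs_after):
                break  # subtask verified, move on
        else:
            return False  # execution failure not recovered within budget
    return True
```

Because the verifier only consumes observations and plans, it wraps the existing planner and policy without any architectural change to either.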
@article{pacaud2025guardiandetectingroboticplanning,
  author  = {Paul Pacaud and Ricardo Garcia and Shizhe Chen and Cordelia Schmid},
  title   = {Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models},
  journal = {arXiv preprint arXiv:2512.01946},
  year    = {2025},
}
This work was performed using HPC resources from GENCI-IDRIS (Grants 2025-AD011015795 and AD011015795R1). It was funded in part by the French government under the management of Agence Nationale de la Recherche as part of the "France 2030" program, reference ANR-23-IACL-0008 (PR[AI]RIE-PSAI project), and by the ANR project VideoPredict (ANR-21-FAI1-0002-01). Cordelia Schmid would like to acknowledge the support of the Körber European Science Prize.
The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.