CosyPose: Consistent multi-view multi-object 6D pose estimation

ECCV: European Conference on Computer Vision, 2020

[Paper] [Code] [Video (1 min)] [Video (10 min)] [Slides]
Winner of the BOP Challenge 2020 at ECCV'20 [slides] [BOP challenge paper]

CosyPose: 6D object pose estimation optimizing multi-view COnSistencY. Given (a) a set of RGB images depicting a scene with known objects taken from unknown viewpoints, our method (b) accurately reconstructs the scene, recovering all objects, their 6D poses, and the camera viewpoints. Objects are enlarged for visualization.


Abstract

We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets.
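
For concreteness, the object-level bundle adjustment of the third step can be written in the following simplified form (the notation is ours, and the paper's exact loss additionally uses a robust, symmetry-aware distance):

$$ \min_{\{T_v\},\,\{T_o\}} \; \sum_{(v,\,o,\,a)} \rho\Big( \sum_{p} \big\| \pi\big(K_v,\, T_v^{-1} T_o X_p\big) - \pi\big(K_v,\, T_a X_p\big) \big\|^2 \Big) $$

where $T_v$ are the camera poses, $T_o$ the object poses, $T_a$ the single-view pose candidate matched to object $o$ in view $v$, $X_p$ points sampled on the object model, $\pi(K_v,\cdot)$ the pinhole projection with intrinsics $K_v$, and $\rho$ a robust loss.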

Approach overview

Multi-view multi-object 6D pose estimation. In the first stage, we obtain initial object candidates in each view separately using our single-view pose estimation method. In the second stage, we robustly match these object candidates across views to recover a single consistent scene. In the third stage, we globally refine all object and camera poses to minimize multi-view reprojection error.
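
To make the second stage concrete, here is a minimal, self-contained sketch of the matching idea for two views. This is our own simplification with illustrative names, not the released implementation (which handles an arbitrary number of views and object symmetries): if a candidate with pose T_a in view 1 and a candidate with pose T_b in view 2 are the same physical object, the relative camera pose must be T_a T_b^{-1}, so each same-label candidate pair votes for a hypothesis, and the hypothesis consistent with the most pairs wins.

```python
import numpy as np

# Toy sketch of cross-view candidate matching (not the released CosyPose
# code): candidate pairs of the same object label vote for a relative
# camera pose hypothesis, scored RANSAC-style by counting agreeing pairs.

def inv(T):
    """Invert a 4x4 rigid transform."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def pose_distance(Ta, Tb):
    """Crude pose discrepancy: rotation angle (rad) plus translation distance."""
    dR = Ta[:3, :3].T @ Tb[:3, :3]
    ang = np.arccos(np.clip((np.trace(dR) - 1) / 2, -1, 1))
    return ang + np.linalg.norm(Ta[:3, 3] - Tb[:3, 3])

def match_two_views(cands1, cands2, thresh=0.1):
    """cands*: list of (label, 4x4 object pose in that camera's frame)."""
    best_T12, best_inliers = None, []
    for la, Ta in cands1:
        for lb, Tb in cands2:
            if la != lb:
                continue
            T12 = Ta @ inv(Tb)  # relative camera pose hypothesis
            inliers = [
                (i, j)
                for i, (lc, Tc) in enumerate(cands1)
                for j, (ld, Td) in enumerate(cands2)
                if lc == ld and pose_distance(Tc, T12 @ Td) < thresh
            ]
            if len(inliers) > len(best_inliers):
                best_T12, best_inliers = T12, inliers
    return best_T12, best_inliers
```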

Qualitative results

We provide additional qualitative examples of randomly selected 3D scene reconstructions on YCB-Video and T-LESS here. More 3D visualizations can be generated using the code provided on GitHub.

Paper

Y. Labbé, J. Carpentier, M. Aubry and J. Sivic
CosyPose: Consistent multi-view multi-object 6D pose estimation
ECCV: European Conference on Computer Vision, 2020
[Paper on arXiv]

BibTeX

@inproceedings{labbe2020,
author={Y. {Labbe} and J. {Carpentier} and M. {Aubry} and J. {Sivic}},
title={CosyPose: Consistent multi-view multi-object 6D pose estimation},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2020}}

Code

We provide code and pre-trained models for the full approach presented in the paper, including:
  • Single-view single-object 6D pose estimator: Given an RGB image and a 2D bounding box of an object with a known 3D model, the 6D pose estimator predicts the full 6D pose of the object with respect to the camera. Our method is inspired by DeepIM, with several simplifications and technical improvements. It is fully implemented in PyTorch and achieves state-of-the-art single-view results on YCB-Video and T-LESS. We provide the pre-trained models used in our experiments on both datasets, along with the training code, which can be parallelized over multiple GPUs.
  • Synthetic data generation: The single-view 6D pose estimation models are trained on a mix of synthetic and real images. We provide the code for generating the additional synthetic images.
  • Multi-view multi-object approach: We provide the full code, including robust object-level multi-view matching and global scene refinement. The method is agnostic to the 6D pose estimator used and can therefore be combined with many other existing single-view object pose estimation methods to solve problems on other datasets or in real scenarios. We provide a utility for running CosyPose given a set of input 6D object candidates in each image; a toy illustration of the refinement step follows this list.
  • BOP challenge 2020 models: We participated in the BOP Challenge 2020, which evaluates single-view single-object pose estimation methods on seven 6D pose estimation benchmarks. We provide pre-trained 2D detectors and 6D pose estimation models for all datasets of the challenge.
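
As promised above, here is a toy illustration of the global refinement step. It is a minimal sketch under simplifying assumptions (unit camera intrinsics, fixed known cameras, a single object, plain gradient descent), not the repository's API or solver, which jointly refines cameras and objects with a symmetry-aware loss:

```python
import torch

# Toy sketch of reprojection-error refinement (not the released CosyPose
# solver): refine one object pose, observed in two views with known
# camera poses, by gradient descent on the multi-view reprojection error.

def hat(k):
    """Skew-symmetric matrix of a 3-vector (built with stack to keep autograd)."""
    zero = torch.zeros((), dtype=k.dtype)
    return torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])

def so3_exp(w):
    """Axis-angle to rotation matrix (Rodrigues' formula)."""
    theta = w.norm() + 1e-12  # avoid division by zero; keep w away from exact 0
    K = hat(w / theta)
    return torch.eye(3, dtype=w.dtype) + torch.sin(theta) * K \
        + (1 - torch.cos(theta)) * (K @ K)

def project(T_cam, R_obj, t_obj, pts):
    """Pinhole projection (unit intrinsics) of object points into a camera."""
    pts_world = pts @ R_obj.T + t_obj
    pts_cam = pts_world @ T_cam[:3, :3].T + T_cam[:3, 3]
    return pts_cam[:, :2] / pts_cam[:, 2:3]

# Toy scene: a cube of model points, two cameras, a ground-truth object pose.
pts = torch.tensor([[x, y, z] for x in (-1., 1.) for y in (-1., 1.) for z in (-1., 1.)])
cams = [torch.eye(4), torch.eye(4)]
cams[0][:3, 3] = torch.tensor([0., 0., 10.])
cams[1][:3, 3] = torch.tensor([2., 0., 10.])
R_gt = so3_exp(torch.tensor([0.1, 0.2, 0.3]))
t_gt = torch.tensor([0.5, -0.3, 1.0])
targets = [project(T, R_gt, t_gt, pts) for T in cams]

# Refine a pose (axis-angle + translation) from a near-identity initialization.
w = torch.tensor([0.01, -0.01, 0.02], requires_grad=True)
t = torch.tensor([0., 0., 0.], requires_grad=True)
opt = torch.optim.Adam([w, t], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = sum(((project(T, so3_exp(w), t, pts) - tgt) ** 2).sum()
               for T, tgt in zip(cams, targets))
    loss.backward()
    opt.step()
print(f"final reprojection error: {loss.item():.2e}")
```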

Acknowledgements

This work was partially supported by the HPC resources from GENCI-IDRIS (Grant 011011181), the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468), Louis Vuitton ENS Chair on Artificial Intelligence, and the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.