Learning from Synthetic Humans

People

Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

Abstract

Estimating human pose, shape, and motion from images and video is a fundamental challenge with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time-consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth, and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground-truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Paper

BibTeX

@INPROCEEDINGS{varol17_surreal,
  title     = {Learning from Synthetic Humans},
  author    = {Varol, G{\"u}l and Romero, Javier and Martin, Xavier and Mahmood, Naureen and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {CVPR},
  year      = {2017}
}

SURREAL dataset

Here are example sequences from our synthetic human videos together with the ground truth segmentation and depth.
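
For readers who want to experiment with the released clips, below is a minimal loading sketch. It is not the official SURREAL tooling: it assumes a per-clip layout with an RGB video plus MATLAB files holding per-frame segmentation and depth maps, and the file names and the "segm_1" / "depth_1" keys are illustrative assumptions rather than guaranteed field names.

# Minimal loading sketch (assumed layout, not the official SURREAL tooling).
# File names and .mat keys below are illustrative assumptions.
import cv2
import scipy.io

clip = "01_01_c0001"            # hypothetical clip identifier

# Read the first RGB frame from the rendered video.
cap = cv2.VideoCapture(clip + ".mp4")
ok, rgb = cap.read()            # rgb: H x W x 3 (BGR, as returned by OpenCV)
cap.release()

# Load ground-truth body-part segmentation and depth for the same frame.
segm = scipy.io.loadmat(clip + "_segm.mat")["segm_1"]      # H x W part labels
depth = scipy.io.loadmat(clip + "_depth.mat")["depth_1"]   # H x W depth values

print(rgb.shape, segm.shape, depth.shape)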

Results on Human3.6M dataset

The following video presents segmentation and depth estimation results on Human3.6M images, obtained with the convolutional neural network pre-trained on synthetic images and then fine-tuned on Human3.6M training data. Note that the model is applied to every frame independently.
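
As a rough illustration of this pre-train-then-fine-tune setup, here is a hedged sketch in PyTorch. It is not the training code used in the paper: the torchvision FCN stands in for the actual network, random tensors stand in for Human3.6M frames and part labels, and the commented checkpoint path is a hypothetical placeholder for weights pre-trained on synthetic data.

# Generic fine-tuning sketch; architecture, data, and checkpoint are stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models.segmentation import fcn_resnet50

num_parts = 15                                   # background + body-part classes (assumed)
model = fcn_resnet50(num_classes=num_parts)      # stand-in for the paper's network
# model.load_state_dict(torch.load("surreal_pretrained.pth"))  # hypothetical synthetic pre-training

# Stand-in for real Human3.6M frames (B x 3 x H x W) and per-pixel part labels.
images = torch.randn(8, 3, 256, 256)
labels = torch.randint(0, num_parts, (8, 256, 256))
loader = DataLoader(TensorDataset(images, labels), batch_size=2, shuffle=True)

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)  # small LR for fine-tuning
criterion = nn.CrossEntropyLoss()                             # per-pixel classification

model.train()
for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()
        logits = model(x)["out"]          # B x num_parts x H x W
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()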

Results on YouTube Pose dataset

The following video presents qualitative results on the YouTube Pose dataset, obtained with the convolutional neural network pre-trained only on synthetic images, without fine-tuning on real images.

Acknowledgements

This work was supported in part by the Alexander von Humboldt Foundation, ERC grants ACTIVIA and ALLEGRO, the MSR-Inria joint lab, and Google and Facebook Research Awards.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.