Gül Varol

PhD student, Inria

I am a PhD student in the computer vision and machine learning research laboratory (WILLOW project team) in the Department of Computer Science of École Normale Supérieure (ENS) and at Inria Paris. I work on human understanding in videos with Ivan Laptev and Cordelia Schmid. I received my BS and MS degrees from the Computer Engineering Department of Boğaziçi University, Turkey.


News

04 / 2018
BodyNet is on arXiv!
04 / 2018
The 5th WiCV workshop will take place in conjunction with ECCV'18 in Munich. Submit your work before July 2nd.
04 / 2018
Back at MPI for a month-long visit.
10 / 2017
We are organizing a workshop on Multiview Relationships in 3D Data at ICCV'17 in Venice.
06 / 2017
I am interning at Adobe in San Jose during the summer.
05 / 2017
I am back in Tübingen for a month-long visit at MPI.
04 / 2017
02 / 2017
We are organizing the Women in Computer Vision Workshop at CVPR'17 in Honolulu, Hawaii.
12 / 2016
We released the code for the "Long-term Temporal Convolutions" paper.
07 / 2016
I participated in the International Computer Vision Summer School'16 in Sicily.
07 / 2016
We released the Charades dataset for our ECCV'16 paper on human action understanding!
04 / 2016
I visited the Perceiving Systems department of the Max Planck Institute for Intelligent Systems in Tübingen, Germany for a month!
05 / 2015
I moved to Paris to join the WILLOW project team!

Research

See my Google Scholar profile for a full list of publications.

BodyNet: Volumetric Inference of 3D Human Body Shapes
Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid
arXiv, 2018.
@INPROCEEDINGS{varol18_bodynet,
  title     = {{BodyNet}: Volumetric Inference of {3D} Human Body Shapes},
  author    = {Varol, G{\"u}l and Ceylan, Duygu and Russell, Bryan and Yang, Jimei and Yumer, Ersin and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {arXiv},
  year      = {2018}
}

Human shape estimation is an important task for the video editing, animation and fashion industries. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them improves performance, as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.

Long-term Temporal Convolutions for Action Recognition
Gül Varol, Ivan Laptev, and Cordelia Schmid
PAMI, 2018.
@ARTICLE{varol18_ltc,
  title     = {Long-term Temporal Convolutions for Action Recognition},
  author    = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia},
  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year      = {2018},
  volume    = {40},
  number    = {6},
  pages     = {1510--1517},
  doi       = {10.1109/TPAMI.2017.2712608}
}

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%).

Learning from Synthetic Humans
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid
CVPR, 2017.
@INPROCEEDINGS{varol17_surreal,
  title     = {Learning from Synthetic Humans},
  author    = {Varol, G{\"u}l and Romero, Javier and Martin, Xavier and Mahmood, Naureen and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {CVPR},
  year      = {2017}
}

Estimating human pose, shape, and motion from images and videos are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time-consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta
ECCV, 2016.
@INPROCEEDINGS{sigurdsson16_charades,
  title     = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  author    = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
  booktitle = {ECCV},
  year      = {2016}
}

Computer vision has great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents, and over 15% of the videos have more than one person. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.


Teaching

Fall 2018
Object recognition and computer vision, TA - Masters level - MVA, École normale supérieure de Cachan
Fall 2017
Object recognition and computer vision, TA - Masters level - MVA, École normale supérieure de Cachan
Fall 2016
Object recognition and computer vision, TA - Masters level - MVA, École normale supérieure de Cachan
Spring 2015
Computer analysis of human behavior, TA - Masters level - Boğaziçi University
Spring 2013
Signal processing, TA - Undergraduate level - Boğaziçi University
Spring 2013
Systems programming, TA - Undergraduate level - Boğaziçi University
Fall 2012
Systems programming, TA - Undergraduate level - Boğaziçi University