Gül Varol

PhD student, INRIA

I am a second-year PhD student in the computer vision and machine learning research laboratory (WILLOW project team) in the Department of Computer Science of École Normale Supérieure (ENS) and at Inria Paris. I work on deep learning methods for video analysis under the supervision of Ivan Laptev and Cordelia Schmid. I received my BS and MS degrees from the Computer Engineering Department of Boğaziçi University, Turkey. I completed my MS thesis on action recognition in videos under the supervision of Albert Ali Salah.


10 / 2017
We are organizing a workshop on Multiview Relationships in 3D Data at ICCV'17 in Venice.
06 / 2017
I am interning at Adobe in San Jose during the summer.
05 / 2017
I am back in Tübingen for a month-long visit at MPI.
04 / 2017
02 / 2017
We are organizing the Women in Computer Vision Workshop at CVPR'17 in Honolulu, Hawaii.
12 / 2016
We released the code for the "Long-term Temporal Convolutions" paper.
07 / 2016
I participated in the International Computer Vision Summer School'16 in Sicily.
07 / 2016
We released the Charades dataset for our ECCV'16 paper on human action understanding!
04 / 2016
I visited the Perceiving Systems department of the Max Planck Institute for Intelligent Systems in Tübingen, Germany for a month!
05 / 2015
I moved to Paris to join the WILLOW project team!

Selected Projects

Learning from Synthetic Humans
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid
CVPR, 2017.
  TITLE     = {{Learning from Synthetic Humans}},
  AUTHOR    = {Varol, G{\"u}l and Romero, Javier and Martin, Xavier and Mahmood, Naureen and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
  YEAR      = {2017}

Estimating human pose, shape, and motion from images and video are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually labeled training data for learning convolutional neural networks (CNNs). Such data is time-consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth, and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Long-term Temporal Convolutions for Action Recognition
Gül Varol, Ivan Laptev, and Cordelia Schmid
PAMI, 2017.
  TITLE     = {{Long-term Temporal Convolutions for Action Recognition}},
  AUTHOR    = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia},
  JOURNAL   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  YEAR      = {2017}

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%).

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta
ECCV, 2016.
  TITLE     = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  AUTHOR    = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
  YEAR      = {2016}

Computer vision has great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents, and over 15% of the videos have more than one person. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.