I am a third-year Computer Vision and Machine Learning Ph.D. student in the WILLOW project-team, which is part of Inria and Ecole Normale Supérieure, working with Ivan Laptev and Josef Sivic.
My main research interests are video understanding and weakly-supervised machine learning. More generally, I am interested in everything related to Computer Vision, Machine Learning and Natural Language Processing.
During the summer of 2018, I had the chance to collaborate with Du Tran, Heng Wang and Lorenzo Torresani at Facebook AI.
I was also lucky enough to be awarded the Google Ph.D. Fellowship in 2018.
Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2, CrossTask and MSR-VTT.
Abstract: Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim to learn text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show that MEE generalizes to other input modalities such as face descriptors.
Abstract: Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm based on the Block-Coordinate Frank-Wolfe algorithm. We apply it to the problem of weakly-supervised learning of actions and actors from movies, using the corresponding movie scripts as supervision.
Abstract: We present a state-of-the-art end-to-end learnable pooling method for video classification. Our method was used to achieve the best performance in the Kaggle YouTube-8M challenge out of 650 teams.
MEE Text-to-Video Search Engine is a text-to-video search engine web demo based on our proposed Mixture-of-Embedding-Experts (MEE) model. The model was trained on the MPII movie training set and is tested on the MPII validation and test sets as well as on the MSR-VTT dataset. Our web demo runs in real time on a CPU-based machine.
Video Dataset Overview is a searchable and sortable compilation of annotated video datasets that I currently maintain. It is meant to give a global overview of existing annotated video datasets, along with important features such as their size, publication year and annotation type.
LOUPE (Learnable mOdUle for Pooling fEatures) is a TensorFlow toolbox that implements several modules for pooling features, such as NetVLAD, NetRVLAD, NetFV and Soft-DBoW. It also provides gated versions of these modules. This toolbox was mainly used in the winning approach of the YouTube-8M Large-Scale Video Understanding challenge.