I am a third year Computer Vision and Machine Learning Ph.D. student in the WILLOW project-team which is part of Inria and Ecole Normale Supérieure, working with Ivan Laptev and Josef Sivic.
My main research interests are video understanding and weakly-supervised machine learning. More generally, I am interested in everything related to Computer Vision, Machine Learning and Natural Language Processing.
During the 2018 summer, I had the chance to collaborate with Du Tran, Heng Wang and Lorenzo Torresani at Facebook AI.
I was also lucky enough to be awarded the Google Ph.D. fellowship in 2018.
Abtsract: Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show the generalization of MEE to other input modalities such as face descriptors.
Abtsract: Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm based on the Block-Coordinate Frank-Wolfe algorithm. We apply it to the problem of weakly-supervised learning of actions and actors from movies and corresponding movie scripts as supervision.
Abtract: We present state-of-the-art end-to-end learnable pooling method for video classification. Our method was used to achieve the best performance in the kaggle Youtube 8M challenge out of 650 teams.
MEE Text-to-Video Search Engine is Text-to-Video web demo search engine based on our proposed Mixture-of-Embedding-Experts (MEE) model. The model was trained on the MPII movie training set and it is tested on both MPII validation and test set and the MSR-VTT dataset. Our web demo runs in real time on a CPU based machine.
Video Dataset Overview is a Searchable and sortable compilation of annotated video datasets I am currently maintaining. It is supposed to help people to have a global overview of the existing annotated video datasets as well as some important features such as their size, published year or annotation type.
LOUPE (Learnable mOdUle for Pooling fEatures) is a Tensorflow toolbox that implements several modules for pooling features such as NetVLAD, NetRVLAD, NetFV and Soft-DBoW. It also allows to use their Gated version. This toolbox was mainly use in the winning approach of the Youtube 8M Large Scale Video Understanding challenge.