Antoine Miech

Ph.D. student

Bio

I am a fourth-year Computer Vision Ph.D. student in the WILLOW project-team, which is part of Inria and Ecole Normale Supérieure, working with Ivan Laptev and Josef Sivic. My main research interests are video understanding and weakly-supervised machine learning. More generally, I am interested in everything related to Computer Vision, Machine Learning and Natural Language Processing. During the summer of 2018, I had the chance to collaborate with Du Tran, Heng Wang and Lorenzo Torresani at Facebook AI. I was also lucky enough to be awarded the Google Ph.D. fellowship in 2018. I will be joining DeepMind as a Research Scientist in August for more video and language research :D.

Invited talks:

November 27th 2018: University of Bristol, Bristol
July 24th 2018: Google, Mountain View
July 3rd 2018: Google, Paris
March 23rd 2018: LSCP-ENS, Paris
September 12th 2017: Facebook AI Research, Paris
July 26th 2017: CVPR17 Youtube-8M Workshop, Hawaii
July 7th 2017: DGA TIM2017 Seminar, Paris
July 4th 2017: Kaggle ML Meetup, Paris
June 27th 2017: Paris ML Meetup, Paris

Publications

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic and Andrew Zisserman

CVPR 2020 (Oral)

arXiv, webpage, I3D TF model, S3D TF model, S3D PT model, YouCook2 zero-shot search demo

Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev and Josef Sivic

ICCV 2019

arXiv, webpage, code

Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. We introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. We also demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2, CrossTask and MSR-VTT.

Leveraging the Present to Anticipate the Future in Videos

Antoine Miech, Ivan Laptev, Josef Sivic, Heng Wang, Lorenzo Torresani and Du Tran

CVPR 2019 Precognition workshop

paper

Abstract: Anticipating actions before they are executed is crucial for a wide range of practical applications including autonomous driving and the moderation of live video streaming. While most prior work in this area requires partial observation of executed actions, in the paper we focus on anticipating actions seconds before they start. Our proposed approach is the fusion of a purely anticipatory model with a complementary model constrained to reason about the present. In particular, the latter predicts present action and scene attributes, and reasons about how they evolve over time. By doing so, we aim at modeling action anticipation at a more conceptual level than directly predicting future actions. Our model outperforms previously reported methods on the EPIC-KITCHENS and Breakfast datasets.

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Antoine Miech, Ivan Laptev and Josef Sivic

arXiv preprint

arXiv, webpage, code

Abstract: Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show the generalization of MEE to other input modalities such as face descriptors.

Learning from Video and Text via Large-Scale Discriminative Clustering

Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev and Josef Sivic

ICCV 2017 (Spotlight: acceptance rate 4.70%)

arXiv, webpage, poster

Abstract: Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm based on the Block-Coordinate Frank-Wolfe algorithm. We apply it to the problem of weakly-supervised learning of actions and actors from movies, using the corresponding movie scripts as supervision.

Learnable pooling with Context Gating for video classification

Antoine Miech, Ivan Laptev and Josef Sivic

CVPR 2017 Youtube-8M Workshop (Oral + Kaggle winning approach)

arXiv, code, Kaggle submission code, poster, slides

Abstract: We present a state-of-the-art end-to-end learnable pooling method for video classification. Our method was used to achieve the best performance in the Kaggle Youtube 8M challenge out of 650 teams.

Misc.

MEE Text-to-Video Search Engine is a text-to-video web demo search engine based on our proposed Mixture-of-Embedding-Experts (MEE) model. The model was trained on the MPII movie training set and is tested on the MPII validation and test sets as well as the MSR-VTT dataset. Our web demo runs in real time on a CPU-based machine.

Video Dataset Overview is a searchable and sortable compilation of annotated video datasets that I am currently maintaining. It aims to give a global overview of the existing annotated video datasets, along with some important features such as their size, publication year or annotation type.

LOUPE (Learnable mOdUle for Pooling fEatures) is a TensorFlow toolbox that implements several modules for pooling features such as NetVLAD, NetRVLAD, NetFV and Soft-DBoW. It also provides gated versions of these modules. This toolbox was mainly used in the winning approach of the Youtube 8M Large Scale Video Understanding challenge.

The Data Science Game is a student-only, worldwide machine learning competition. I was involved in the project in 2016 and 2017 as an organizer.

The 2016 edition was very successful: we were invited to the NIPS 2016 CiML Workshop to present this poster.