End-to-End Learning of
Visual Representations from
Uncurated Instructional Videos

Annotating videos is cumbersome, expensive, and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing the misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on four downstream tasks spanning eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
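To make the MIL-NCE idea concrete, here is a minimal NumPy sketch of the objective, not the official implementation: each clip is paired with a small *bag* of candidate narrations (multiple-instance learning), and an NCE-style softmax contrasts the bag of positives against all other clip-narration pairs in the batch. The function name, shapes, and bag size are illustrative assumptions.

```python
import numpy as np

def mil_nce_loss(video_emb, text_emb):
    """Sketch of a MIL-NCE objective (assumed shapes, not the official code).

    video_emb: (B, D) one embedding per video clip.
    text_emb:  (B, k, D) k candidate narrations per clip, treated as a bag
               of positives, since any single narration may be misaligned
               with its clip.
    """
    B, k, D = text_emb.shape
    # Similarity between every clip and every candidate narration: (B, B*k).
    sim = np.exp(video_emb @ text_emb.reshape(B * k, D).T)
    # Numerator: for clip i, sum over its own bag of k candidate narrations.
    pos = np.array([sim[i, i * k:(i + 1) * k].sum() for i in range(B)])
    # Denominator: the positive bag plus all negative clip-narration pairs.
    denom = sim.sum(axis=1)
    return float(np.mean(-np.log(pos / denom)))
```

Summing the positive similarities before the log (rather than taking the best match) is what lets the model tolerate misaligned narrations: it only needs *some* candidate in the bag to match the clip.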

Zero-shot Text-to-Video retrieval on YouCook2

We have implemented an online Text-to-Video retrieval demo that performs search on the YouCook2 training and testing video clips.
Please note that the model was trained without any manually annotated dataset (e.g., no ImageNet, Kinetics, or YouCook2 annotations were involved). It was trained purely from scratch on the uncurated HowTo100M videos.
Query examples: cut salmon, cut tuna, cut pepper, cut tomato, dice tomato, fry samosa, grind nutmeg, crack eggs, whisk eggs, scramble eggs, boil eggs.
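Such a demo reduces to ranking precomputed clip embeddings by their similarity to the embedded text query. The sketch below shows this retrieval step only; the function name and the assumption of unit-comparable embeddings are hypothetical, and computing the embeddings themselves is out of scope here.

```python
import numpy as np

def retrieve_clips(query_emb, clip_embs, top_k=5):
    """Hypothetical zero-shot text-to-video retrieval step: rank stored
    clip embeddings (N, D) by dot-product similarity to a query embedding
    (D,), returning the indices and scores of the top_k clips."""
    scores = clip_embs @ query_emb
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]
```

For example, with three 2-D clip embeddings `[[1, 0], [0, 1], [0.7, 0.7]]` and the query `[1, 0]`, the top-ranked clip is index 0, followed by index 2.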



BibTeX

@inproceedings{miech2020endtoend,
   title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
   author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
   year={2020}
}

Model / Code