What is HowTo100M?

HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. HowTo100M features a total of:

136M video clips with captions sourced from 1.2M YouTube videos (15 years of video)
23k activities from domains such as cooking, hand crafting, personal care, gardening or fitness

Each video is associated with a narration, available as subtitles automatically downloaded from YouTube.

Real-Time Natural Language search on HowTo100M

We have implemented an online Text-to-Video retrieval demo that performs search and localization in videos using a simple Text-Video model trained on HowTo100M. The demo runs on a single CPU machine and uses FAISS for approximate nearest neighbour search.
Please note that to make the search through hundreds of millions of video clips run in real time, this demo uses a lighter (and less accurate) version of the model than the one described in the paper.
Query examples: Check voltage, Cut paper, Cut salmon, Measure window length, Animal dance, ...
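
As a rough illustration of how this kind of retrieval works, the sketch below ranks clip embeddings by cosine similarity to a query embedding. It is not the demo's code: it uses an exact brute-force search in pure Python as a stand-in for FAISS approximate search, and the clip identifiers, the 3-dimensional embeddings, and the `normalize` and `search` helpers are all hypothetical. In a real system the query vector would come from the text branch of the Text-Video model and the clip vectors from its video branch.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def search(query_emb, clip_index, k=3):
    """Return up to k (clip_id, score) pairs, most similar first.

    clip_index is a list of (clip_id, unit-length embedding) pairs.
    This is an exact linear scan; an approximate index would trade
    some accuracy for speed on hundreds of millions of clips.
    """
    q = normalize(query_emb)
    scored = (
        (sum(a * b for a, b in zip(q, emb)), clip_id)
        for clip_id, emb in clip_index
    )
    return [(clip_id, score) for score, clip_id in heapq.nlargest(k, scored)]


# Toy embeddings, made up for illustration only.
clips = {
    "video_a@12s": [0.9, 0.1, 0.0],
    "video_b@40s": [0.0, 1.0, 0.2],
    "video_c@05s": [0.7, 0.0, 0.7],
}
index = [(clip_id, normalize(emb)) for clip_id, emb in clips.items()]

# Stand-in for the text encoder output of a query such as "Cut paper".
query = [1.0, 0.0, 0.1]
results = search(query, index, k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

Because each returned identifier names a clip rather than a whole video, the same ranking step covers both search (which video) and localization (where in the video).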

Dataset statistics

Download

Data

Paper

BibTeX:

@inproceedings{miech19howto100m,
   title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
   author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
   booktitle={ICCV},
   year={2019},
}

Code

We note that the distribution of identities and activities in the HowTo100M dataset may not be representative of the global human population and the diversity in society. Please be careful of unintended societal, gender, racial and other biases when training or deploying models trained on this data.