NetVLAD: CNN architecture for weakly supervised place recognition

CNN architecture with the NetVLAD layer

Our trained NetVLAD descriptor correctly recognizes the location (b) of the query photograph (a) despite the large amount of clutter (people, cars), changes in viewpoint and completely different illumination (night vs daytime).

Authors

Abstract

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
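To make the layer and the loss described above more concrete, here is a minimal NumPy sketch of a forward pass. It is not the authors' released code. It assumes the formulation given in the paper: each local descriptor is softly assigned to K clusters by a softmax over linear scores, residuals to the cluster centres are accumulated per cluster, and the result is intra-normalised and then L2-normalised. The loss takes the closest of the potential positives and applies a hinge against every negative. The function names, argument names and the default margin are illustrative choices.

```python
import numpy as np


def netvlad_forward(x, centres, w, b):
    """Aggregate N local descriptors into one (K*D,) vector.

    x: (N, D) local descriptors, centres: (K, D) cluster centres,
    w: (K, D) and b: (K,) soft-assignment parameters (all hypothetical names).
    """
    logits = x @ w.T + b                                # (N, K) linear scores
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                   # softmax soft assignment

    resid = x[:, None, :] - centres[None, :, :]         # (N, K, D) residuals
    v = (a[:, :, None] * resid).sum(axis=0)             # (K, D) weighted sums

    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12  # intra-normalisation
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)              # final L2 normalisation


def weak_ranking_loss(q, potential_positives, negatives, margin=0.1):
    """Hinge loss on squared distances, using the best-matching positive.

    q: (M,) query descriptor, potential_positives: (P, M), negatives: (Q, M).
    """
    d_pos = ((potential_positives - q) ** 2).sum(axis=1).min()
    d_neg = ((negatives - q) ** 2).sum(axis=1)
    return np.maximum(0.0, d_pos + margin - d_neg).sum()
```

In a real network `x` would be the spatial activations of the last convolutional layer reshaped to N = H*W rows, and `centres`, `w` and `b` would be learnt by backpropagation; here they are plain arrays so that the arithmetic can be checked in isolation.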

Paper

[Paper on arXiv] [Presentation (54 MB)]

BibTeX

@InProceedings{Arandjelovic16,
  author       = "Arandjelovi\'c, R. and Gronat, P. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "{NetVLAD}: {CNN} architecture for weakly supervised place recognition",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2016",
}

Offline demo

Place recognition results for all queries in the 24/7 Tokyo dataset.

Code

Downloads

Trained Models (02 May 2016)

The best model (VGG-16 + NetVLAD + whitening, trained on Pittsburgh) (529 MB)
All models (3 GB)
Individual models can be downloaded below. The best models are VGG-16 + NetVLAD + whitening.

Base Network + Pooling Method	Off-the-shelf on Pitts30k	Off-the-shelf on TokyoTM	Trained on Pitts30k	Trained on Pitts250k	Trained on TokyoTM
VGG-16 + NetVLAD + whitening (530 MB)	download	download	download		download
VGG-16 + NetVLAD (53 MB)	download	download	download		download
VGG-16 + Max (53 MB)	download		download		download
AlexNet + NetVLAD + whitening (250 MB)			download		download
AlexNet + NetVLAD (10 MB)	download	download	download		download
AlexNet + Max (10 MB)	download		download	download	download

Additional data

All dataset specifications (2 MB):
Matlab structures that define the datasets, e.g. define train/validation/test splits, GPS coordinates of all points, time stamps for Tokyo Time Machine, etc.
Initialization data (395 MB):
Data needed to construct off-the-shelf networks with NetVLAD pooling, used as starting points for training. It includes cluster centres one can use to compute VLAD.
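As a rough illustration of how cluster centres are used to compute VLAD, the sketch below implements plain VLAD with hard assignment in NumPy. Loading the centres from the initialization data is not shown, since the file layout is not described here. The function name `vlad` and its arguments are hypothetical, and the normalisation is a simple global L2 step that may differ from what the released code does.

```python
import numpy as np


def vlad(descriptors, centres):
    """Plain VLAD: sum residuals to the nearest centre, then L2-normalise.

    descriptors: (N, D) local descriptors, centres: (K, D) cluster centres.
    """
    # Squared Euclidean distance from every descriptor to every centre: (N, K)
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                  # hard assignment per descriptor

    v = np.zeros_like(centres, dtype=float)      # (K, D) accumulator
    for k in range(centres.shape[0]):
        members = descriptors[nearest == k]
        if len(members):
            v[k] = (members - centres[k]).sum(axis=0)

    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

For K centres and D-dimensional descriptors the result has K*D entries, independent of the number of local descriptors in the image.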

Datasets

Place recognition datasets:
- Tokyo Time Machine, Tokyo 24/7, Pittsburgh 250k: available on request here. The train/val/test splits are provided with our code.
- Tiny subset of Tokyo Time Machine (21 MB). Contains 360 images and is intended only for checking that the NetVLAD code is set up correctly.
Object/image retrieval datasets (only used for testing):
- Oxford Buildings: Download the 5k images and ground truth files, and place them into images/ and groundtruth/ subfolders of the dataset root folder.
- Paris Buildings: Download the 6k images and ground truth files, and place them into images/ and groundtruth/ subfolders of the dataset root folder. Also get, from the same page, the list of corrupt images and place it into the dataset root.
- INRIA Holidays: Download the 1491 images and place them into the jpg/ subfolder of the dataset root folder. The rotated dataset is available on request (I got it from Artem Babenko) and should be placed into the jpg_rotated/ subfolder.
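The folder layout described above for the retrieval datasets can be checked with a few lines of Python before running anything. This is only a convenience sketch: the helper `missing_subfolders` and the dataset keys are hypothetical and not part of the NetVLAD code, and only the subfolder names come from the instructions above.

```python
from pathlib import Path

# Subfolders expected under each dataset root, as listed above.
# jpg_rotated/ is only needed if the rotated Holidays variant is used.
EXPECTED_SUBFOLDERS = {
    "oxford": ["images", "groundtruth"],
    "paris": ["images", "groundtruth"],
    "holidays": ["jpg", "jpg_rotated"],
}


def missing_subfolders(dataset_root, dataset):
    """Return the expected subfolders that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_SUBFOLDERS[dataset]
            if not (root / name).is_dir()]
```

An empty list means every expected subfolder is present. The check does not look at the contents of the folders or at the list of corrupt images for Paris Buildings.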

Acknowledgements

This work was partly supported by RVO13000 - Conceptual development of research organization, the ERC grant LEAP (no. 336845), ANR project Semapolis (ANR-13-CORD-0003), JSPS KAKENHI Grant Number 15H05313, the Inria CityLab IPL, and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory, contract FA8650-12-C-7212. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.