NetVLAD: CNN architecture for weakly supervised place recognition

CNN architecture with the NetVLAD layer

Our trained NetVLAD descriptor correctly recognizes the location (b) of the query photograph (a) despite the large amount of clutter (people, cars), changes in viewpoint and completely different illumination (night vs daytime).



We tackle the problem of large-scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
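The abstract names the two learnable ingredients, the NetVLAD layer and the weakly supervised ranking loss, without giving their form. The sketch below illustrates both in plain Python, following the formulation in the accompanying paper as we understand it: soft-assignment of local descriptors to cluster centres, aggregation of residuals, intra-normalization and a final L2 normalization; and a hinge loss that compares the closest potential positive against each negative. It covers the forward pass only. All function and parameter names (`netvlad_pool`, `weak_ranking_loss`, `margin`, and so on) are hypothetical and chosen for illustration, and the default margin is an assumption rather than a recommended setting.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def netvlad_pool(descriptors, centers, weights, biases):
    """Aggregate N local D-dim descriptors into one (K*D)-dim vector.

    descriptors: N lists of length D (e.g. one per location of a conv feature map)
    centers, weights: K lists of length D; biases: K floats (learnable in practice)
    """
    num_clusters, dim = len(centers), len(centers[0])
    vlad = [[0.0] * dim for _ in range(num_clusters)]
    for x in descriptors:
        # Soft assignment of x to every cluster: softmax over w_k . x + b_k
        scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        assign = softmax(scores)
        # Accumulate assignment-weighted residuals x - c_k for each cluster
        for k in range(num_clusters):
            for j in range(dim):
                vlad[k][j] += assign[k] * (x[j] - centers[k][j])
    # Intra-normalization (per cluster), flatten, then L2-normalize the whole vector
    vlad = [l2_normalize(row) for row in vlad]
    return l2_normalize([v for row in vlad for v in row])

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def weak_ranking_loss(query, potential_positives, negatives, margin=0.1):
    """Hinge ranking loss using the best-matching potential positive.

    Weak supervision: only geotags are known, so the true positive is taken
    to be the potential positive closest to the query in descriptor space.
    """
    best_pos = min(sq_dist(query, p) for p in potential_positives)
    return sum(max(0.0, best_pos + margin - sq_dist(query, n)) for n in negatives)

# Toy usage: three 2-D local descriptors, two clusters
descs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
image_vector = netvlad_pool(descs, centers, weights, biases)
print(len(image_vector))  # K * D = 4
```

In the architecture described above, the local descriptors would come from a convolutional layer of the base network, and the centres, weights and biases would be trained by backpropagation together with the rest of the network; here they are fixed toy values.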


[Paper on arXiv] [Presentation (54 MB)]


  author       = "Arandjelovi\'c, R. and Gronat, P. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "{NetVLAD}: {CNN} architecture for weakly supervised place recognition",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2016",

Offline demo

Place recognition results for all queries in the 24/7 Tokyo dataset.



Trained Models (02 May 2016)

Base Network + Pooling Method | Off-the-shelf on Pitts30k | Off-the-shelf on TokyoTM | Trained on Pitts30k | Trained on Pitts250k | Trained on TokyoTM
VGG-16 + NetVLAD + whitening (530 MB) download download download download
VGG-16 + NetVLAD (53 MB) download download download download
VGG-16 + Max (53 MB) download download download
AlexNet + NetVLAD + whitening (250 MB) download download
AlexNet + NetVLAD (10 MB) download download download download
AlexNet + Max (10 MB) download download download download

Additional data



This work was partly supported by RVO13000 - Conceptual development of research organization, the ERC grant LEAP (no. 336845), ANR project Semapolis (ANR-13-CORD-0003), JSPS KAKENHI Grant Number 15H05313, the Inria CityLab IPL, and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory, contract FA8650-12-C-7212. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.