VLAD fundamentals

VLAD can be seen as a feature encoding and pooling method, similar to Fisher vectors. VLAD encodes a set of local feature descriptors \(I=(\bx_1,\dots,\bx_N)\) extracted from an image, using a dictionary built with a clustering method such as Gaussian Mixture Models (GMMs) or K-means. Let \(q_{ik}\) be the strength of the association of data vector \(\bx_i\) to cluster \(\mu_k\), such that \(q_{ik} \geq 0\) and \(\sum_{k=1}^K q_{ik} = 1\). The association may be either soft (e.g. obtained as the posterior probabilities of the GMM clusters) or hard (e.g. obtained by vector quantization with K-means).

Here \(\mu_k\) are the cluster means, vectors of the same dimension as the data \(\bx_i\). VLAD encodes feature \(\bx_i\) by considering the residuals

\[ \bv_k = \sum_{i=1}^{N} q_{ik} (\bx_{i} - \mu_k). \]

The residuals are stacked together to obtain the vector

\[ \hat\Phi(I) = \begin{bmatrix} \vdots \\ \bv_k \\ \vdots \end{bmatrix} \]
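
As an illustration, the aggregation and stacking above can be sketched in plain C as follows. This is a minimal sketch, not the library implementation: the row-major buffer layout and the \(N \times K\) matrix of weights \(q_{ik}\) are assumptions made for the example.

```c
#include <stddef.h>

/* Sketch of VLAD aggregation: enc receives the stacked residuals
 * v_1, ..., v_K, each of length `dimension`. Assumed layouts:
 * data is numData x dimension (row-major), means is
 * numClusters x dimension, assignments is numData x numClusters
 * holding the weights q_ik. */
void vlad_aggregate (float * enc,
                     float const * means, size_t dimension, size_t numClusters,
                     float const * data, size_t numData,
                     float const * assignments)
{
  size_t i, k, d ;
  for (k = 0 ; k < numClusters * dimension ; ++k) enc[k] = 0.0f ;
  for (i = 0 ; i < numData ; ++i) {
    for (k = 0 ; k < numClusters ; ++k) {
      float q = assignments[i * numClusters + k] ;
      if (q == 0.0f) continue ; /* hard assignments are mostly zero */
      for (d = 0 ; d < dimension ; ++d) {
        /* v_k += q_ik * (x_i - mu_k), stored at offset k * dimension */
        enc[k * dimension + d] +=
          q * (data[i * dimension + d] - means[k * dimension + d]) ;
      }
    }
  }
}
```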

Before the VLAD encoding is used, it is usually globally \(L^2\) normalized:

\[ \Phi(I) = \hat\Phi(I) / \|\hat\Phi(I)\|_2. \]

In this manner, the Euclidean distance and the inner product between VLAD vectors become more meaningful.
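
For instance, the global normalization can be sketched in plain C as below; the small epsilon guarding against an all-zero encoding is an assumption added for numerical safety, not part of the formula.

```c
#include <math.h>
#include <stddef.h>

/* Globally L2-normalize the stacked encoding of `size` elements
 * in place, so that distances and inner products between VLAD
 * vectors are comparable across images. */
void vlad_l2_normalize (float * enc, size_t size)
{
  size_t k ;
  double norm = 0.0 ;
  for (k = 0 ; k < size ; ++k) norm += (double) enc[k] * enc[k] ;
  norm = sqrt(norm) + 1e-12 ; /* epsilon: assumed safeguard */
  for (k = 0 ; k < size ; ++k) enc[k] = (float) (enc[k] / norm) ;
}
```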

However, the size of each cluster can have a disproportionate impact on the encoding, since clusters with more assigned features accumulate larger residual sums. To compensate, each aggregated residual can be normalized by the total mass of its cluster:

\[ \bv_k = \frac{\sum_{i=1}^{N} q_{ik} \bx_i}{\sum_{i=1}^{N} q_{ik}} - \mu_k. \]

This normalization is controlled by the last argument of vl_vlad_encode.
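
A hedged usage sketch follows. The argument order mirrors the vl_vlad_encode declaration in vlad.h, and the flag name VL_VLAD_FLAG_NORMALIZE_MASS enabling this normalization should be verified against the installed VLFeat version; the inputs (means, data, assignments) are assumed to have been produced beforehand, e.g. by the K-means or GMM modules.

```c
#include <vl/generic.h>
#include <vl/vlad.h>

/* Sketch: encode numData descriptors of the given dimension into a
 * VLAD vector of numClusters * dimension floats, normalizing each
 * aggregated residual by its cluster mass. Check the flag name and
 * signature against your VLFeat version. */
float * encode_vlad (float const * means, vl_size dimension, vl_size numClusters,
                     float const * data, vl_size numData,
                     float const * assignments)
{
  float * enc = vl_malloc (sizeof(float) * dimension * numClusters) ;
  vl_vlad_encode (enc, VL_TYPE_FLOAT,
                  means, dimension, numClusters,
                  data, numData,
                  assignments,
                  VL_VLAD_FLAG_NORMALIZE_MASS) ;
  return enc ; /* caller releases with vl_free() */
}
```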