This page discusses the Fisher Kernel (FK) of [7] and shows how the FV of [16] can be derived from it as a special case. The FK induces a similarity measure between data points \(\bx\) and \(\bx'\) from a parametric generative model \(p(\bx|\Theta)\) of the data. The parameter \(\Theta\) of the model is selected to fit the a-priori distribution of the data, and is usually the Maximum Likelihood Estimate (MLE) obtained from a set of training examples. Once the generative model is learned, each particular datum \(\bx\) is represented by looking at how it affects the MLE parameter estimate. This effect is measured by computing the gradient of the log-likelihood term corresponding to \(\bx\):
\[ \hat\Phi(\bx) = \nabla_\Theta \log p(\bx|\Theta) \]
The vectors \(\hat\Phi(\bx)\) should be appropriately scaled before they can be meaningfully compared. This is obtained by whitening them, i.e. by multiplying the vectors by the inverse square root of their covariance matrix. The covariance matrix can be obtained from the generative model \(p(\bx|\Theta)\) itself. Since \(\Theta\) is the ML parameter and \(\hat\Phi(\bx)\) is the gradient of the log-likelihood function, the expected value \(E[\hat\Phi(\bx)]\) of the gradient is zero. Thus, since the vectors are already centered, their covariance matrix is simply:
\[ H = E_{\bx \sim p(\bx|\Theta)} [\hat\Phi(\bx) \hat\Phi(\bx)^\top] \]
Note that \(H\) is also the Fisher information matrix of the model. The final FV encoding \(\Phi(\bx)\) is given by the whitened gradient of the log-likelihood function, i.e.:
\[ \Phi(\bx) = H^{-\frac{1}{2}} \nabla_\Theta \log p(\bx|\Theta). \]
Taking the inner product of two such vectors yields the Fisher kernel:
\[ K(\bx,\bx') = \langle \Phi(\bx),\Phi(\bx') \rangle = \nabla_\Theta \log p(\bx|\Theta)^\top H^{-1} \nabla_\Theta \log p(\bx'|\Theta). \]
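As an illustration, the NumPy sketch below instantiates this construction for a toy model that is not part of the text: a univariate Gaussian whose only parameter is its mean, with known variance. For this model the gradient, the Fisher information \(H\), the whitened encoding \(\Phi\), and the kernel \(K\) all have closed forms, and the code simply evaluates them; the names grad_log_lik, fisher_vector, and fisher_kernel are made up for the example.

```python
import numpy as np

# Toy generative model (an assumption made for illustration, not part of the
# text): a univariate Gaussian with unknown mean theta and known variance
# sigma2. The only parameter is theta, so gradients and H are scalars.
theta, sigma2 = 0.5, 2.0

def grad_log_lik(x):
    # d/dtheta log N(x; theta, sigma2) = (x - theta) / sigma2
    return (x - theta) / sigma2

# Fisher information H = E[(d/dtheta log p)^2] = 1 / sigma2 for this model.
H = 1.0 / sigma2

def fisher_vector(x):
    # Whitened gradient Phi(x) = H^{-1/2} grad log p(x|theta) = (x - theta) / sigma
    return grad_log_lik(x) / np.sqrt(H)

def fisher_kernel(x, xp):
    # K(x, x') = <Phi(x), Phi(x')> = grad^T H^{-1} grad'
    return fisher_vector(x) * fisher_vector(xp)

x, xp = 1.3, -0.7
print(fisher_kernel(x, xp))                            # via the whitened encodings
print(grad_log_lik(x) * (1.0 / H) * grad_log_lik(xp))  # same value via H^{-1}
```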
Fisher vector derivation
The FV of [16] is a special case of the Fisher kernel construction, used to encode local image features in an easy-to-compare vector representation. In this construction, an image is modeled as a collection of \(D\)-dimensional feature vectors \(I=(\bx_1,\dots,\bx_n)\) generated by a GMM with \(K\) components \(\Theta=(\mu_k,\Sigma_k,\pi_k:k=1,\dots,K)\). The covariance matrices are assumed to be diagonal, i.e. \(\Sigma_k = \diag \bsigma_k^2\), \(\bsigma_k \in \real^D_+\).
The generative model of one feature vector \(\bx\) is given by the GMM density function:
\[ p(\bx|\Theta) = \sum_{k=1}^K \pi_k p(\bx|\Theta_k), \quad p(\bx|\Theta_k) = \frac{1}{(2\pi)^\frac{D}{2} (\det \Sigma_k)^{\frac{1}{2}}} \exp \left[ -\frac{1}{2} (\bx - \mu_k)^\top \Sigma_k^{-1} (\bx - \mu_k) \right] \]
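For concreteness, the GMM density above can be evaluated with a few lines of NumPy. The sketch below assumes the diagonal covariances are stored as a \(K \times D\) array of variances; all function and variable names are illustrative rather than part of any library API.

```python
import numpy as np

def gmm_density(x, means, sigmas2, priors):
    """Evaluate the GMM density p(x|Theta) of the text for one D-dimensional
    feature x, with diagonal covariances Sigma_k = diag(sigmas2[k]).
    means, sigmas2: (K, D) arrays; priors: (K,) array summing to one."""
    D = x.shape[0]
    # Per-component Gaussian densities p(x|Theta_k).
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.prod(sigmas2, axis=1))
    quad = np.sum((x - means) ** 2 / sigmas2, axis=1)   # Mahalanobis term
    comp = np.exp(-0.5 * quad) / norm
    return np.sum(priors * comp)

# Tiny example: K = 2 components, D = 3 dimensions (made-up numbers).
rng = np.random.default_rng(0)
means = rng.normal(size=(2, 3))
sigmas2 = np.ones((2, 3))
priors = np.array([0.4, 0.6])
x = rng.normal(size=3)
print(gmm_density(x, means, sigmas2, priors))
```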
where \(\Theta_k = (\mu_k,\Sigma_k)\). The Fisher Vector requires computing the derivative of the log-likelihood function with respect to the various model parameters. Consider in particular the parameters \(\Theta_k\) of a single Gaussian mode. Due to the exponential form of the Gaussian density, the derivative can be written as
\[ \nabla_{\Theta_k} p(\bx|\Theta_k) = p(\bx|\Theta_k) g(\bx|\Theta_k) \]
for a simple vector function \(g\). The derivative of the log-likelihood function is then
\[ \nabla_{\Theta_k} \log p(\bx|\Theta) = \frac{\pi_k p(\bx|\Theta_k)}{\sum_{t=1}^K \pi_t p(\bx|\Theta_t)} g(\bx|\Theta_k) = q_k(\bx) g(\bx|\Theta_k) \]
where \(q_k(\bx)\) is the soft-assignment of the point \(\bx\) to the mode \(k\). We make the approximation that \(q_k(\bx)\approx 1\) if \(\bx\) is sampled from mode \(k\) and \(\approx 0\) otherwise. Hence one gets:
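The soft-assignments \(q_k(\bx)\) are the usual GMM posterior responsibilities; a minimal NumPy sketch (computed in the log domain for numerical stability, with illustrative names) is:

```python
import numpy as np

def soft_assignments(x, means, sigmas2, priors):
    """Posterior responsibilities q_k(x) = pi_k p(x|Theta_k) / sum_t pi_t p(x|Theta_t).
    means, sigmas2: (K, D); priors: (K,); x: (D,)."""
    D = x.shape[0]
    # Log of pi_k p(x|Theta_k) for each component k.
    log_comp = (np.log(priors)
                - 0.5 * D * np.log(2 * np.pi)
                - 0.5 * np.sum(np.log(sigmas2), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / sigmas2, axis=1))
    log_comp -= log_comp.max()      # shift for numerical stability
    q = np.exp(log_comp)
    return q / q.sum()              # q_k(x), sums to one over k
```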
\[ E_{\bx \sim p(\bx|\Theta)} [ \nabla_{\Theta_k} \log p(\bx|\Theta) \nabla_{\Theta_t} \log p(\bx|\Theta)^\top ] \approx \begin{cases} \pi_k E_{\bx \sim p(\bx|\Theta_k)} [ g(\bx|\Theta_k) g(\bx|\Theta_k)^\top], & t = k, \\ 0, & t\not=k. \end{cases} \]
Thus under this approximation there is no correlation between the parameters of the various Gaussian modes.
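This block-diagonal behaviour can be checked numerically. The sketch below uses made-up parameters for a 1-D GMM with two well-separated modes, samples from it, and estimates the second-moment matrix of the gradients with respect to the two means: the off-diagonal entries come out near zero, while the diagonal entries approach \(\pi_k/\sigma_k^2\), consistent with the expressions derived next.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 1-D Gaussian modes (illustrative values).
priors = np.array([0.3, 0.7]); mus = np.array([-5.0, 5.0]); sig2 = np.array([1.0, 1.0])

# Sample n points from the GMM.
n = 100000
comp = rng.choice(2, size=n, p=priors)
x = rng.normal(mus[comp], np.sqrt(sig2[comp]))

# Gradient of log p(x|Theta) w.r.t. the two means: q_k(x) (x - mu_k) / sigma_k^2.
logc = np.log(priors) - 0.5 * np.log(2 * np.pi * sig2) - 0.5 * (x[:, None] - mus) ** 2 / sig2
q = np.exp(logc - logc.max(axis=1, keepdims=True))
q /= q.sum(axis=1, keepdims=True)
grad = q * (x[:, None] - mus) / sig2      # (n, 2): one column per mode

# Empirical second-moment matrix: nearly diagonal, diagonal ~ pi_k / sigma_k^2.
print(grad.T @ grad / n)
```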
The function \(g\) can be further broken down as the stacking of the derivatives w.r.t. the mean and the diagonal covariance:
\[ g(\bx|\Theta_k) = \begin{bmatrix} g(\bx|\mu_k) \\ g(\bx|\bsigma_k^2) \end{bmatrix}, \quad [g(\bx|\mu_k)]_j = \frac{x_j - \mu_{jk}}{\sigma_{jk}^2}, \quad [g(\bx|\bsigma_k^2)]_j = \frac{1}{2\sigma_{jk}^2} \left( \left(\frac{x_j - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1 \right) \]
Thus the covariance of the model (Fisher information) is diagonal and the diagonal entries are given by
\[ H_{\mu_{jk}} = \pi_k E[g(\bx|\mu_{jk})g(\bx|\mu_{jk})] = \frac{\pi_k}{\sigma_{jk}^2}, \quad H_{\sigma_{jk}^2} = \frac{\pi_k}{2 \sigma_{jk}^4}. \]
where the calculation uses the fact that the fourth moment of the standard Gaussian distribution is 3. Multiplying the derivative of the log-likelihood function by the inverse square root of the matrix \(H\) results in the Fisher vector encoding of one image feature \(\bx\):
\[ \Phi_{\mu_{jk}}(\bx) = H_{\mu_{jk}}^{-\frac{1}{2}} q_k(\bx) g(\bx|\mu_{jk}) = q_k(\bx) \frac{x_j - \mu_{jk}}{\sqrt{\pi_k}\sigma_{jk}}, \qquad \Phi_{\sigma^2_{jk}}(\bx) = \frac{q_k(\bx)}{\sqrt{2 \pi_k}} \left( \left(\frac{x_j - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1 \right) \]
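A direct per-feature implementation of these two formulas might look as follows (NumPy sketch; the whitening factors \(H_{\mu_{jk}}^{-1/2}\) and \(H_{\sigma^2_{jk}}^{-1/2}\) are already absorbed into the expressions above, and all names are illustrative):

```python
import numpy as np

def fv_encode_one(x, means, sigmas2, priors, q):
    """Per-feature Fisher vector blocks of the text:
    Phi_mu[k, j]  = q_k (x_j - mu_jk) / (sqrt(pi_k) sigma_jk)
    Phi_var[k, j] = q_k / sqrt(2 pi_k) * (((x_j - mu_jk) / sigma_jk)^2 - 1)
    q are the soft-assignments q_k(x); means, sigmas2: (K, D); priors, q: (K,)."""
    sigmas = np.sqrt(sigmas2)
    z = (x - means) / sigmas                                   # (K, D) standardized residuals
    phi_mu = q[:, None] * z / np.sqrt(priors)[:, None]
    phi_var = q[:, None] * (z ** 2 - 1) / np.sqrt(2 * priors)[:, None]
    # Stack the K mean blocks and K variance blocks into a 2KD vector.
    return np.concatenate([phi_mu.ravel(), phi_var.ravel()])
```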
Assuming that features are sampled i.i.d. from the GMM results in the formulas given in Fisher vector fundamentals (note the normalization factor). Note that:
The Fisher components corresponding to the prior probabilities \(\pi_k\) have been ignored. This is because they have little effect on the representation [17].
Strictly speaking, the derivation of the Fisher Vector for multiple image features requires the two images being compared to contain the same number of features. In practice, however, the representation can be computed from any number of features, as in the sketch below.
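Under the i.i.d. assumption, a possible image-level encoding simply accumulates the per-feature encodings. The sketch below averages them over the \(n\) features, which is one common normalization choice (the exact factor is the one discussed in Fisher vector fundamentals), and reuses the soft_assignments and fv_encode_one helpers sketched above.

```python
import numpy as np

def fv_encode_image(X, means, sigmas2, priors):
    """Image-level Fisher vector for a set of local features X (n, D):
    average of the per-feature encodings Phi(x_i)."""
    phis = []
    for x in X:
        q = soft_assignments(x, means, sigmas2, priors)        # responsibilities q_k(x)
        phis.append(fv_encode_one(x, means, sigmas2, priors, q))
    return np.mean(phis, axis=0)                               # length 2*K*D vector
```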