Documentation - C API
Covariant image frames detector
Author:
Andrea Vedaldi
Karel Lenc
Credits:
Many people have contributed with suggestions and bug reports. Although the following list is certainly incomplete, we would like to thank: Wei Dong, Loic, Giuseppe, Liu, Erwin, P. Ivanov, and Q. S. Luo.

framedet.h implements a Covariant frames detector, a reusable object to extract covariant image features (in the following text referred to as frames) [6] [9] from one or multiple images.

Overview

An image frame is a set of attributes of a selected image region (also called a keypoint), usually with an associated descriptor. Frames are extracted with the image frames detector, and the extracted frames vary in their covariance to image transformations depending on the class of the detected frames. Frame detection can also differ in the image response function used to detect distinct image regions. A special case of these frames are the SIFT frames produced by the SIFT keypoint detector. All detected frames can be accompanied by SIFT descriptors.

Image frame classes

frame-types.png
Classes of image frames.

Image frames differ in the attributes computed for the selected image region (blob). These attributes determine the covariance of each frame class to certain image transformations, summarised in the following table:

Attributes of image frame classes.
Frame class | Frame attributes | Covariant to | See also
Disc | coordinates $ (x,y) $, scale (radius) $ \sigma $ | translations and scalings | Disc detection
Oriented Disc | coordinates $ (x,y) $, scale $ \sigma $ and orientation $ \theta $ | translations, scalings and rotations (similarities) | Disc detection
Ellipse | coordinates $ (x,y) $ and shape matrix $ E $ with three degrees of freedom | translations and affinities up to residual rotations | Ellipse detection
Oriented Ellipse | coordinates $ (x,y) $ and affine transformation $ A $ between ellipse and circle | translations and affinities | Ellipse detection

The names of the frame classes are chosen to give a graphical intuition of the blob structure, and frames of a class share the way they behave under linear transformations.
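
For illustration, the attribute sets above might map to C structures along the following lines. This is only a sketch: the actual type definitions live in framedet.h, and their field names and layout may differ.

 {.c}
/* Hypothetical layouts, for illustration only; see framedet.h for the
   actual definitions. */
typedef struct _VlFrameDisc {
  float x, y ;    /* center coordinates */
  float sigma ;   /* scale (radius) */
} VlFrameDisc ;

typedef struct _VlFrameOrientedDisc {
  float x, y ;    /* center coordinates */
  float sigma ;   /* scale (radius) */
  float angle ;   /* orientation in radians */
} VlFrameOrientedDisc ;

typedef struct _VlFrameEllipse {
  float x, y ;             /* center coordinates */
  float e11, e12, e22 ;    /* shape matrix (three degrees of freedom) */
} VlFrameEllipse ;

typedef struct _VlFrameOrientedEllipse {
  float x, y ;                 /* center coordinates */
  float a11, a12, a21, a22 ;   /* affine transformation circle -> ellipse */
} VlFrameOrientedEllipse ;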

Image frames detection

The Covariant frames detector can be configured to detect any of the mentioned classes. Because only some attributes are shared among classes, the detection of each frame type differs.

frame-detection.png
Pipeline of frames detection.

In all cases the detection starts by detecting disc frames using the VlScaleSpace object. Detecting the dominant orientations then extends each disc frame into one or more oriented disc frames.

If the detector is configured to detect ellipses, the image region of each disc is examined for its affine shape based on its second moment matrix. In order to assign orientations, local anisotropic structures have to be transformed into isotropic circular regions. This step in effect transforms the elliptic blob into a circular blob, after which the orientation can be assigned in the same way as for disc frames.

Disc detection

See also:
Scale space technical details, Detector technical details

A disc frame is a circular image region, in the case of an oriented disc with an assigned orientation. It is described by a geometric frame of three or four parameters: the keypoint center coordinates x and y, its scale (the radius of the region), and optionally its orientation (an angle expressed in radians). The disc detector uses as keypoints image structures which resemble “blobs”. By searching for blobs at multiple scales and positions, the disc detector is invariant (or, more accurately, covariant) to translation, rotations, and rescaling of the image.

The keypoint orientation is also determined from the local image appearance and is covariant to image rotations. Depending on the symmetry of the keypoint appearance, determining the orientation can be ambiguous. In this case, the Covariant frames detector returns a list of up to four possible orientations, constructing up to four frames (differing only by their orientation) for each detected image blob.

sift-frame.png
Disc frames are circular image regions with an orientation.

There are several parameters that influence the detection of disc keypoints. First, searching for keypoints at multiple scales is obtained by constructing a so-called “Gaussian scale space”. The scale space is just a collection of images obtained by progressively smoothing the input image, which is analogous to gradually reducing the image resolution. Conventionally, the smoothing level is called the scale of the image. The construction of the scale space is influenced by the following parameters, set when creating the detector object by ::vl_covdet_new_disc_detector():

  • Number of octaves. Increasing the scale by an octave means doubling the size of the smoothing kernel, whose effect is roughly equivalent to halving the image resolution. By default, the scale space spans as many octaves as possible (i.e. roughly log2(min(width,height))), which has the effect of searching for keypoints of all possible sizes.
  • First octave index. By convention, the octave of index 0 starts with the full resolution image. Specifying an index greater than 0 starts the scale space at a lower resolution (e.g. 1 halves the resolution). Similarly, specifying a negative index starts the scale space at a higher resolution image, which can be useful to extract very small features (since this is obtained by interpolating the input image, it does not make much sense to go past -1).
  • Number of levels per octave. Each octave is sampled at this given number of intermediate scales (by default 3). Increasing this number might in principle return more refined keypoints, but in practice it can make their selection unstable due to noise (see [1]). A sketch of the default octave computation follows.
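
As a small worked example, the default number of octaves described above could be computed as follows. This is a sketch; the actual default computed by the library may differ slightly.

 {.c}
#include <math.h>

/* Sketch: default octave count for a w x h image, roughly
   log2(min(width, height)). */
int defaultNumOctaves(int w, int h)
{
  int m = (w < h) ? w : h ;
  return (int) floor(log2((double) m)) ;
}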

Each image of the scale space is then transformed using different image response functions in order to detect corner- and blob-like structures in the image which are stable under several image transformations. The Covariant frames detector supports the following response functions:

  • Difference of Gaussians. A computationally efficient approximation of the Laplacian of Gaussian (LoG), a scalar value based on second order derivatives of the image brightness function.
  • Hessian response. An image response based on the determinant of the Hessian matrix, which is constructed from the second order partial derivatives of the image brightness. Contrary to the LoG, it also takes the mixed derivatives into account.

Based on the response function, a new scale space of response values is constructed, in which the distinct image regions (keypoints) are detected as local extrema of the response function across spatial coordinates and scale. Keypoints are further refined by eliminating those that are likely to be unstable, either because they were selected near an image edge rather than an image blob, or because they were found on image structures with low contrast. Filtering is controlled by the following parameters:

  • Peak threshold. This is the minimum amount of contrast needed to accept a keypoint. It is set by configuring the detector object with vl_covdet_set_peak_thresh().
  • Edge threshold. This is the edge rejection threshold. It is set by configuring the detector object with vl_covdet_set_edge_thresh() (see the example after the following table).
Summary of the parameters influencing the disc Covariant frames detector.
Parameter | See also | Controlled by | Comment
image response function | Disc detection | vl_covdet_new_disc_detector | values: VL_IMRESP_DOG or VL_IMRESP_HESSIAN
number of octaves | Disc detection | vl_covdet_new_disc_detector |
first octave index | Disc detection | vl_covdet_new_disc_detector | set to -1 to extract very small features
number of scale levels per octave | Disc detection | vl_covdet_new_disc_detector | can affect the number of extracted keypoints
edge threshold | Disc detection | vl_covdet_set_edge_thresh | decrease to eliminate more keypoints
peak threshold | Disc detection | vl_covdet_set_peak_thresh | increase to eliminate more keypoints
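
For instance, the two thresholds can be adjusted after the detector has been created. The setters are the ones listed in the table above; `det` is assumed to be a constructed detector and the numeric values are arbitrary examples, not the library defaults.

 {.c}
/* Example values only; reasonable settings depend on the application. */
vl_covdet_set_peak_thresh(det, 0.01) ; /* increase to reject more low-contrast keypoints */
vl_covdet_set_edge_thresh(det, 10.0) ; /* decrease to reject more edge-like keypoints */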

Ellipse detection

As noted above, oriented discs are covariant only to similarities. However, under a viewpoint change an image blob changes not only its scale and position but also its affine shape. This is why attributes covariant to affine transformations are introduced.

The affine shape of a detected disc frame is estimated by an iterative procedure based on the Second Moment Matrix (SMM). This matrix is used as a local image texture descriptor which is covariant to affine transformations of the image domain.

The iterative procedure sequentially estimates the affine shape of the frame region by computing an affine transformation which maps the local anisotropic structure onto an isotropic one. In graphical terms, it transforms the elliptic region into a circular one. The size of the region used for calculating the SMM is controlled by the parameter Affine Window Size.

The procedure stops when the isotropy measure is closer to the ideal measure of a circular region than the parameter Convergence Threshold. The procedure is also limited to a maximal number of iterations, controlled by the parameter Maximum Iterations.

In order to obtain affine invariant descriptors and the residual rotation, these attributes are calculated from the transformed isotropic structure. To obtain higher precision, descriptors are usually calculated from larger windows than the windows used for affine shape estimation. The size of these windows is controlled by the parameter Descriptor Window Size.

Because the parameters Affine Window Size and Descriptor Window Size influence the size of the allocated memory, they have to be specified in the constructor of the Covariant frames detector.

Summary of the parameters influencing the ellipse Covariant frames detector.
Parameter | See also | Controlled by | Comment
disc detector parameters | Disc detection | see Disc detection |
size of the window for affine shape estimation | Ellipse detection | vl_covdet_new_ellipse_detector | affine shape estimation precision
size of the window for descriptor and orientation calculation | Ellipse detection | vl_covdet_new_ellipse_detector | descriptor calculation precision
maximum number of iterations of affine shape estimation | Ellipse detection | vl_affineshapeestimator_set_max_iter | affine shape estimation precision
convergence threshold of affine shape estimation | Ellipse detection | vl_affineshapeestimator_set_conv_thresh | affine shape estimation precision
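
The affine shape estimation can be tuned through the setters named in the table. The sketch below assumes that `est` is the detector's affine shape estimator object; how it is obtained from the detector is not shown in this documentation, and the numeric values are examples only.

 {.c}
/* `est` is assumed to be the detector's affine shape estimator. */
vl_affineshapeestimator_set_max_iter(est, 16) ;      /* at most 16 iterations */
vl_affineshapeestimator_set_conv_thresh(est, 0.05) ; /* epsilon_C, see Technical details */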

SIFT Descriptor

See also:
Descriptor technical details

A SIFT descriptor is a 3-D spatial histogram of the image gradients characterizing the appearance of a keypoint. The gradient at each pixel is regarded as a sample of a three-dimensional elementary feature vector, formed by the pixel location and the gradient orientation. Samples are weighted by the gradient norm and accumulated in a 3-D histogram h, which (up to normalization and clamping) forms the SIFT descriptor of the region. An additional Gaussian weighting function is applied to give less importance to gradients farther away from the keypoint center. Orientations are quantized into eight bins and the spatial coordinates into four each, as follows:

sift-descr-easy.png
The SIFT descriptor is a spatial histogram of the image gradient.

SIFT descriptors are computed by calling either ::vl_sift_calc_keypoint_descriptor or ::vl_sift_calc_raw_descriptor. Both accept as input a disc frame, which specifies the descriptor center, its size, and its orientation on the image plane. In the case of elliptic frames, the descriptor is calculated from the affine-normalised image region, based on the affine shape of the frame.

The following parameters influence the descriptor calculation:

  • Magnification factor. The descriptor size is determined by multiplying the keypoint scale by this factor. It is set by ::vl_siftdesc_set_magnif.
  • Gaussian window size. The descriptor support is determined by a Gaussian window, which discounts gradient contributions farther away from the descriptor center. The standard deviation of this window is set by ::vl_siftdesc_set_window_size and expressed in units of bins. An example of setting both parameters follows.
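
As an illustration, both parameters might be set as follows, using the setters listed in the summary table below. Here `desc` is assumed to be the descriptor object; the values shown are the defaults mentioned in this text (m = 3, window size 2).

 {.c}
/* `desc` is assumed to be the SIFT descriptor object. */
vl_siftdesc_set_magnif(desc, 3.0) ;      /* descriptor magnification factor m */
vl_siftdesc_set_window_size(desc, 2.0) ; /* Gaussian window sigma, in bins */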

VLFeat SIFT descriptor uses the following convention. The y axis points downwards and angles are measured clockwise (to be consistent with the standard image convention). The 3-D histogram (consisting of $ 8 \times 4 \times 4 = 128 $ bins) is stacked as a single 128-dimensional vector, where the fastest varying dimension is the orientation and the slowest the y spatial coordinate. This is illustrated by the following figure.

sift-conv-vlfeat.png
VLFeat conventions
Note:
D. Lowe's SIFT implementation uses a slightly different convention: the y axis points upwards and angles are measured counter-clockwise.
sift-conv.png
D. Lowe's SIFT implementation conventions
Summary of the parameters influencing the SIFT descriptor.
Parameter | See also | Controlled by | Comment
magnification factor | SIFT Descriptor | ::vl_siftdesc_set_magnif | increase this value to enlarge the image region described
Gaussian window size | SIFT Descriptor | ::vl_siftdesc_set_window_size | smaller values let the center of the descriptor count more

Extensions

Eliminating low-contrast descriptors. Near-uniform patches do not yield stable keypoints or descriptors. ::vl_sift_set_norm_thresh() can be used to set a threshold on the average norm of the local gradient to zero-out descriptors that correspond to very low contrast regions. By default, the threshold is equal to zero, which means that no descriptor is zeroed. Normally this option is useful only with custom keypoints, as detected keypoints are implicitly selected at high contrast image regions.

Using the VlFrameDet object

The code provided in this module can be used in different ways. You can instantiate and use a VlFrameDet object to extract arbitrary types of frames and descriptors from one or multiple images.

To use a VlFrameDet object, construct it with the desired frame class and parameters, run vl_covdet_detect() on each image, and retrieve the detected frames and descriptors with the getter functions, as in the examples below.

Please note that the frame storage can be reallocated for each image, so the frames may end up at a different memory location. Therefore fresh pointers to the frames and descriptors should be obtained after each detection.
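
A minimal usage sketch follows, built only from the functions shown in the examples below; `det` is assumed to be a constructed VlFrameDet configured for oriented discs, and `images`/`numImages` are assumed to hold the input images. Note how the pointers are re-fetched after every call to vl_covdet_detect().

 {.c}
int i ;
for (i = 0 ; i < numImages ; ++i) {
  vl_covdet_detect(det, images[i]) ;

  /* storage may have been reallocated: fetch fresh pointers */
  vl_size framesNum = vl_covdet_get_frames_num(det) ;
  VlFrameOrientedDisc const *frames = vl_covdet_get_oriented_discs(det) ;
  float const *descriptors = vl_covdet_get_descriptors(det) ;

  /* ... use frames and descriptors before the next detection ... */
}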

To compute SIFT descriptors of custom keypoints, use vl_sift_calc_descriptor().

SIFT keypoint detector

The ::VlFrameDet object can be used, for example, as a SIFT frames detector by constructing the ::VlFrameDet object and setting the parameters as follows:

 {.c}
VlImRespFunction respFunction = VL_IMRESP_DOG ; /* image response function */
vl_bool calcOrientation = VL_TRUE ;  /* assign orientations to the discs  */
vl_bool calcDescriptor = VL_TRUE ;   /* compute SIFT descriptors as well  */
VlFrameDet *siftDet = vl_covdet_new_disc_detector(w, h, respFunction,
                                                  O, firstOctave, S,
                                                  calcOrientation, calcDescriptor);

Then the detection is fairly simple:

 {.c}
vl_covdet_detect(siftDet, image);

vl_size framesNum = vl_covdet_get_frames_num(siftDet);
VlFrameOrientedDisc const *frames = vl_covdet_get_oriented_discs(siftDet);

vl_size descriptorSize = vl_covdet_get_descriptor_size(siftDet);
float const *descriptors = vl_covdet_get_descriptors(siftDet);

Hessian-Affine keypoint detector

Similarly, a Hessian-Affine detector can be constructed simply by creating the frames detector with the following parameters:

 {.c}
VlImRespFunction respFunction = VL_IMRESP_HESSIAN;
vl_bool calcOrientation = VL_TRUE;
vl_bool calcDescriptor = VL_TRUE;
VlFrameDet *hessaffDet = vl_covdet_new_ellipse_detector(w, h, respFunction,
                                                        O, firstOctave, S,
                                                        calcOrientation, calcDescriptor);

Keypoint conversion

The constructed ::VlFrameDet object can also be used for conversion between different classes of frames using the function vl_covdet_convert_frames(), where the output class is given by the class of frames detected by the detector. The missing attributes of the frames are computed in the same way as during detection. The following code shows how to convert disc frames into oriented ellipses and to calculate their descriptors:

 {.c}
VlFrameOrientedEllipse* dstFrames;
float const *dstDescriptors;

VlImRespFunction respFunction = VL_IMRESP_DOG; // Irrelevant value
vl_bool dstCalcOrientation = VL_TRUE;
vl_bool dstCalcDescriptor = VL_TRUE;
VlFrameDet *detector = vl_covdet_new_ellipse_detector(w, h, respFunction,
                         O, firstOctave, S, dstCalcOrientation, dstCalcDescriptor);

VlFrameDisc *srcFrames = ...
vl_size srcFramesNum = ...
VlFrameType srcFrameType = VL_FRAME_DISC;
vl_covdet_convert_frames(detector, image, (const void*)srcFrames, srcFramesNum,
                         srcFrameType);

dstFrames = vl_covdet_get_oriented_ellipses(detector);
dstDescriptors = vl_covdet_get_descriptors(detector);

Technical details

Disc Detector

The SIFT frames (keypoints) are extracted based on local extrema (peaks) of the DoG scale space. Numerically, local extrema are elements whose $ 3 \times 3 \times 3 $ neighbors (in space and scale) all have smaller (or larger) values. Once extracted, local extrema are quadratically interpolated (this is very important, especially at the lower resolution scales, in order to achieve accurate keypoint localization at full resolution). Finally, they are filtered to eliminate low-contrast responses or responses close to edges, and the orientation(s) are assigned, as explained next.

Eliminating low contrast responses

Peaks which are too weak may have been generated by noise and are discarded. This is done by comparing the absolute value of the DoG scale space at the peak with the peak threshold $t_p$ and discarding the peak if its value is below the threshold.

Eliminating edge responses

Peaks which are too flat are often generated by edges and do not yield stable features. These peaks are detected and removed as follows. Given a peak $x,y,\sigma$, the algorithm evaluates the x,y Hessian of the DoG scale space at the scale $\sigma$. Then the following score (similar to the Harris function) is computed:

\[ \frac{(\mathrm{tr}\,D(x,y,\sigma))^2}{\det D(x,y,\sigma)}, \quad D = \left[ \begin{array}{cc} \frac{\partial^2 \mathrm{DoG}}{\partial x^2} & \frac{\partial^2 \mathrm{DoG}}{\partial x\partial y} \\ \frac{\partial^2 \mathrm{DoG}}{\partial x\partial y} & \frac{\partial^2 \mathrm{DoG}}{\partial y^2} \end{array} \right]. \]

This score has a minimum (equal to 4) when both eigenvalues of $ D $ are equal (curved peak) and increases as one of the eigenvalues grows while the other stays small. Peaks are retained if the score is below the quantity $(t_e+1)^2/t_e$, where $t_e$ is the edge threshold. Notice that this quantity has a minimum equal to 4 when $t_e=1$ and grows thereafter. Therefore the range of the edge threshold is $[1,\infty)$.
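
For example, with the commonly used edge threshold $ t_e = 10 $ the rejection bound evaluates to

\[ \frac{(t_e+1)^2}{t_e} = \frac{11^2}{10} = 12.1, \]

so any peak whose score exceeds 12.1 is discarded.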

Orientation assignment

A peak in the DoG scale space fixes the position and scale of the keypoint. It remains to choose an orientation. To do this, SIFT computes a histogram of the gradient orientations in a Gaussian window with a standard deviation 1.5 times the scale $\sigma$ of the keypoint.

sift-orient.png

This histogram is then smoothed and its maximum selected. In addition to the strongest mode, up to three other modes whose amplitude is within 80% of the strongest one are retained and returned as additional orientations.
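
The selection rule can be sketched as follows. The 36-bin histogram size and the function itself are illustrative assumptions, not the library's internal code.

 {.c}
#define NBINS 36

/* Sketch: return up to four orientations from a smoothed orientation
   histogram, keeping local maxima within 80% of the strongest one. */
int selectOrientations(double const hist[NBINS], double angles[4])
{
  int t, num = 0 ;
  double best = 0 ;
  for (t = 0 ; t < NBINS ; ++t) if (hist[t] > best) best = hist[t] ;
  for (t = 0 ; t < NBINS && num < 4 ; ++t) {
    double h0 = hist[t] ;
    double hm = hist[(t - 1 + NBINS) % NBINS] ; /* circular neighbors */
    double hp = hist[(t + 1) % NBINS] ;
    if (h0 >= 0.8 * best && h0 > hm && h0 > hp) {
      angles[num++] = 2 * 3.14159265358979 * t / NBINS ;
    }
  }
  return num ;
}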

Hessian keypoint detector

The scale-adapted Hessian matrix is defined as:

\[ H(\mathbf{x}, \sigma_D) = \sigma_D^2 \left[\begin{array}{cc} L_{xx}(\mathbf{x}, \sigma_D) & L_{xy}(\mathbf{x}, \sigma_D) \\ L_{xy}(\mathbf{x}, \sigma_D) & L_{yy}(\mathbf{x}, \sigma_D) \end{array}\right] \]

where $ \sigma_D $ is the scale of the current level,

and $ L(\mathbf{x}, \sigma) $ is the smoothed image, given as:

\[ L(\mathbf{x}, \sigma) = g(\mathbf{x}, \sigma) * I(\mathbf{x}) \]

where $ I $ stands for the image intensities,

and $ L_{ab} $ stands for the second order partial derivatives:

\[ L_{ab} = \frac{\partial^2 L}{\partial a \, \partial b} \]

Note also that the normalisation factor $ \sigma_{D}^2 $ is needed to maintain the scale invariance of the response. The scale-normalised derivative of order $ m $ is defined as:

\[ \sigma^{m} L_{i_1 \dotsc i_m}(\mathbf{x},\sigma) = \sigma^{m} g_{i_1 \dotsc i_m}(\mathbf{x},\sigma) * I(\mathbf{x}) \]

This follows from the properties of the partial derivatives of the Gaussian function [mikolajczyk01index].

Having the scale adapted Hessian matrix, the Hessian response is given as determinant of the Hessian matrix:

\[ \mathrm{Resp}(\sigma_D) = \det H(\sigma_D) = \sigma_D^4 \left( L_{xx}(\sigma_D) L_{yy}(\sigma_D) - L_{xy}^2(\sigma_D) \right) \]

In the implementation the Hessian matrix is calculated only in a window of $ 3 \times 3 $ pixels, and the second order partial derivatives are approximated with the following filters:

\[ L_{xx} = \left[\begin{array}{ccc}1 & -2 & 1\end{array}\right] * L \quad L_{xy} = \left[\begin{array}{ccc} -\frac{1}{4} & 0 & \frac{1}{4} \\ 0 & 0 & 0 \\ \frac{1}{4} & 0 & -\frac{1}{4} \\ \end{array}\right] * L \quad L_{yy} = \left[\begin{array}{c}1 \\ -2 \\ 1\end{array}\right] * L \]
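
Applied at a single pixel, these filters give the following determinant-of-Hessian response. This is a sketch, not the library's code; `level` is assumed to be one smoothed scale space level stored row by row with the given stride.

 {.c}
/* Sketch: Hessian response at interior pixel (x, y) of a smoothed level. */
float hessianResponse(float const *level, int stride,
                      int x, int y, float sigmaD)
{
  float const *p = level + y * stride + x ;
  float Lxx = p[-1] - 2 * p[0] + p[1] ;
  float Lyy = p[-stride] - 2 * p[0] + p[stride] ;
  float Lxy = 0.25f * (- p[-stride - 1] + p[-stride + 1]
                       + p[ stride - 1] - p[ stride + 1]) ;
  float s2 = sigmaD * sigmaD ;
  return s2 * s2 * (Lxx * Lyy - Lxy * Lxy) ;
}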

Descriptor

A SIFT descriptor of a local region (keypoint) is a 3-D spatial histogram of the image gradients. The gradient at each pixel is regarded as a sample of a three-dimensional elementary feature vector, formed by the pixel location and the gradient orientation. Samples are weighted by the gradient norm and accumulated in a 3-D histogram h, which (up to normalization and clamping) forms the SIFT descriptor of the region. An additional Gaussian weighting function is applied to give less importance to gradients farther away from the keypoint center.

Construction in the canonical frame

Denote the gradient vector field computed at the scale $ \sigma $ by

\[ J(x,y) = \nabla I_\sigma(x,y) = \left[\begin{array}{cc} \frac{\partial I_\sigma}{\partial x} & \frac{\partial I_\sigma}{\partial y} \end{array}\right] \]

The descriptor is a 3-D spatial histogram capturing the distribution of $ J(x,y) $. It is convenient to describe its construction in the canonical frame. In this frame, the image and descriptor axes coincide and each spatial bin has side 1. The histogram has $ N_\theta \times N_x \times N_y $ bins (usually $ 8 \times 4 \times 4 $), as in the following figure:

sift-can.png
Canonical SIFT descriptor and spatial binning functions

Bins are indexed by a triplet of indexes t, i, j and their centers are given by

\begin{eqnarray*} \theta_t &=& \frac{2\pi}{N_\theta} t, \quad t = 0,\dots,N_{\theta}-1, \\ x_i &=& i - \frac{N_x -1}{2}, \quad i = 0,\dots,N_x-1, \\ y_j &=& j - \frac{N_y -1}{2}, \quad j = 0,\dots,N_y-1. \\ \end{eqnarray*}

The histogram is computed using trilinear interpolation, i.e. by weighting contributions with the binning functions

\begin{eqnarray*} \displaystyle w(z) &=& \mathrm{max}(0, 1 - |z|), \\ \displaystyle w_\mathrm{ang}(z) &=& \sum_{k=-\infty}^{+\infty} w\left( \frac{N_\theta}{2\pi} z + N_\theta k \right). \end{eqnarray*}

The gradient vector field is transformed in a three-dimensional density map of weighed contributions

\[ f(\theta, x, y) = |J(x,y)| \delta(\theta - \angle J(x,y)) \]

The histogram is localized in the keypoint support by a Gaussian window of standard deviation $ \sigma_{\mathrm{win}} $. The histogram is then given by

\begin{eqnarray*} h(t,i,j) &=& \int g_{\sigma_\mathrm{win}}(x,y) w_\mathrm{ang}(\theta - \theta_t) w(x-x_i) w(y-y_j) f(\theta,x,y) d\theta\,dx\,dy \\ &=& \int g_{\sigma_\mathrm{win}}(x,y) w_\mathrm{ang}(\angle J(x,y) - \theta_t) w(x-x_i) w(y-y_j) |J(x,y)|\,dx\,dy \end{eqnarray*}

In post processing, the histogram is $ l^2 $ normalized, then clamped at 0.2, and $ l^2 $ normalized again.
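
The post-processing step can be written compactly; the following is a sketch of the scheme just described, not the library routine.

 {.c}
#include <math.h>

/* Sketch: l2-normalise, clamp at 0.2, l2-normalise again. */
void postprocessDescriptor(float descr[128])
{
  int pass, i ;
  for (pass = 0 ; pass < 2 ; ++pass) {
    float norm = 0.f ;
    for (i = 0 ; i < 128 ; ++i) norm += descr[i] * descr[i] ;
    norm = sqrtf(norm) + 1e-12f ;
    for (i = 0 ; i < 128 ; ++i) {
      descr[i] /= norm ;
      if (pass == 0 && descr[i] > 0.2f) descr[i] = 0.2f ; /* clamp once */
    }
  }
}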

Calculation in the image frame

Invariance to similarity transformations is attained by attaching descriptors to SIFT keypoints (or other similarity-covariant frames). Projecting the image onto the canonical descriptor frame then has the effect of undoing the image deformation.

In practice, however, it is convenient to compute the descriptor directly in the image frame. To do this, denote with a hat quantities relative to the canonical frame and without a hat quantities relative to the image frame (so for instance $ \hat x $ is the x-coordinate in the canonical frame and $ x $ the x-coordinate in the image frame). Assume that canonical and image frame are related by an affinity:

\[ \mathbf{x} = A \hat{\mathbf{x}} + T, \qquad \mathbf{x} = \left[\begin{array}{c} x \\ y \end{array}\right], \quad \hat{\mathbf{x}} = \left[\begin{array}{c} \hat x \\ \hat y \end{array}\right]. \]

sift-image-frame.png

Then all quantities can be computed in the image frame directly. For instance, the images at infinite resolution in the two frames are related by

\[ \hat I_0(\hat{\mathbf{x}}) = I_0(\mathbf{x}), \qquad \mathbf{x} = A \hat{\mathbf{x}} + T. \]

The canonized image at scale $ \hat \sigma $ is related to the scaled image by

\[ \hat I_{\hat{\sigma}}(\hat{\mathbf{x}}) = I_{A\hat{\sigma}}(\mathbf{x}), \qquad \mathbf{x} = A \hat{\mathbf{x}} + T \]

where, by generalizing the previous definitions, we have

\[ I_{A\hat \sigma}(\mathbf{x}) = (g_{A\hat\sigma} * I_0)(\mathbf{x}), \quad g_{A\hat\sigma}(\mathbf{x}) = \frac{1}{2\pi|A|\hat \sigma^2} \exp \left( -\frac{1}{2} \frac{\mathbf{x}^\top A^{-\top}A^{-1}\mathbf{x}}{\hat \sigma^2} \right) \]

Differentiating shows that the gradient fields are related by

\[ \hat J(\hat{\mathbf{x}}) = J(\mathbf{x}) A, \quad J(\mathbf{x}) = (\nabla I_{A\hat\sigma})(\mathbf{x}), \qquad \mathbf{x} = A \hat{\mathbf{x}} + T. \]

Therefore we can compute the descriptor either in the image or canonical frame as:

\begin{eqnarray*} h(t,i,j) &=& \int g_{\hat \sigma_\mathrm{win}}(\hat{\mathbf{x}})\, w_\mathrm{ang}(\angle \hat J(\hat{\mathbf{x}}) - \theta_t)\, w_{ij}(\hat{\mathbf{x}})\, |\hat J(\hat{\mathbf{x}})|\, d\hat{\mathbf{x}} \\ &=& \int g_{A \hat \sigma_\mathrm{win}}(\mathbf{x} - T)\, w_\mathrm{ang}(\angle J(\mathbf{x})A - \theta_t)\, w_{ij}(A^{-1}(\mathbf{x} - T))\, |J(\mathbf{x})A|\, d\mathbf{x}. \end{eqnarray*}

where we defined the product of the two spatial binning functions

\[ w_{ij}(\hat{\mathbf{x}}) = w(\hat x - \hat x_i) w(\hat y - \hat y_j) \]

In the actual implementation, this integral is computed by visiting a rectangular area of the image that fully contains the keypoint grid (along with half a bin border to fully include the bin windowing function). Since the descriptor can be rotated, this area is a rectangle of sides $m/2\sqrt{2} (N_x+1,N_y+1)$ (see also the illustration).

Standard SIFT descriptor

For a SIFT-detected keypoint of center $ T $, scale $ \sigma $ and orientation $ \theta $, the affine transformation $ (A,T) $ reduces to the similarity transformation

\[ \mathbf{x} = m \sigma R(\theta) \hat{\mathbf{x}} + T \]

where $ R(\theta) $ is a counter-clockwise rotation of $ \theta $ radians, $ m\sigma $ is the size of a descriptor bin in pixels, and m is the descriptor magnification factor which expresses how much larger a descriptor bin is compared to the scale of the keypoint $ \sigma $ (the default value is m = 3). Moreover, the standard SIFT descriptor computes the image gradient at the scale of the keypoints, which in the canonical frame is equivalent to a smoothing of $ \hat \sigma = 1/m $. Finally, the default Gaussian window size is set to have standard deviation $ \hat \sigma_\mathrm{win} = 2 $. This yields the formula

\begin{eqnarray*} h(t,i,j) &=& m \sigma \int g_{\sigma_\mathrm{win}}(\mathbf{x} - T)\, w_\mathrm{ang}(\angle J(\mathbf{x}) - \theta - \theta_t)\, w_{ij}\left(\frac{R(\theta)^\top \mathbf{x} - T}{m\sigma}\right)\, |J(\mathbf{x})|\, d\mathbf{x}, \\ \sigma_{\mathrm{win}} &=& m\sigma\hat \sigma_{\mathrm{win}}, \\ J(\mathbf{x}) &=& \nabla (g_{m \sigma \hat \sigma} * I)(\mathbf{x}) = \nabla (g_{\sigma} * I)(\mathbf{x}) = \nabla I_{\sigma} (\mathbf{x}). \end{eqnarray*}

Ellipse Detector

Scale and affine invariant keypoint

affine-frame.png
Affine invariant keypoint.

An affine invariant keypoint (frame) consists of the following values:

\[ f_K = \left( \begin{array}{ccccc} x_K & y_K & e_{11} & e_{12} & e_{22} \end{array} \right)^T \]

The affine shape is described by the symmetric shape matrix $ E^{-1} $, where $ \mathbf{x}^T E \mathbf{x} = 1 $ is the matrix representation of the ellipse. The shape matrix is given as:

\[ E^{-1} = \left( \begin{array}{cc} e_{11} & e_{12} \\ e_{12} & e_{22} \end{array} \right) \]

An ellipse is used in a similar manner as circles are in scale invariant frames, because it can represent the affine shape of the feature: it is an affine transformation of a circle. In fact, this affine transformation $ A^{-1} $ (the matrix $ A $ is inverted because, in the following text, $ A $ is used as the transformation which maps the ellipse to a circle) can be converted into the ellipse matrix as:

\[ E = A^T A \]

The ellipse matrix, being real, symmetric and positive definite, can also be factored by the eigenvalue decomposition:

\[ E = Q^{-1} \left( \begin{array}{cc} \lambda_1 & 0 \\ 0 & \lambda_2 \end{array} \right) Q \]

The eigenvalues can be related to the sizes of the ellipse axes: the larger eigenvalue $ \lambda_{max} $ and its eigenvector represent the direction of the fastest change (the shorter axis), while the smaller eigenvalue $ \lambda_{min} $ and its eigenvector represent the direction of the slowest change (the longer axis).

Affine shape estimation

In order to obtain frames which are covariant to affine transformations, we have to compute attributes which are themselves covariant to affine transformations. These attributes can also be referred to as the 'affine shape' of the image region, because the blob can then be normalised into an isotropic blob using the affine transformation encoded in these attributes (which is an implication of the affine covariance).

The basic idea of affine shape estimation is to find an affine transformation which normalises the image blob. For simplicity, we first show how to find a transformation which normalises an ellipse into a circle.

affine-transf.png
Affine shape estimation.

An anisotropic image blob can be represented by an ellipse with equation:

\[ \bar{\mathbf{x}}^T E \bar{\mathbf{x}} = 1 \]

Where $ \bar{\mathbf{x}} = \mathbf{x} - \mathbf{x}_0 $ is a coordinate system based on the keypoint centre $ \mathbf{x}_0 $ and $ \mathbf{x} $ is a coordinate in the image $ I(\mathbf{x}) $.

The affine transformation can be defined as a positive-definite matrix $ A $ of size $ 2 \times 2 $ such that:

\[ \hat{\mathbf{x}} = A \bar{\mathbf{x}} \]

Where $ \hat{\mathbf{x}} $ are coordinates in the transformed image $ \hat{I}(\hat{\mathbf{x}}) $.

We want to find an affine transformation which transforms points $ \mathbf{x} $ on the ellipse to points $ \hat{\mathbf{x}} $ on a circle s.t.

\[ \hat{\mathbf{x}}^T \hat{\mathbf{x}} = 1 \]

This can be done by decomposing matrix $ E $ to:

\begin{eqnarray*} E = E^{\frac{T}{2}} R^T R E^{\frac{1}{2}} \end{eqnarray*}

Where $ R $ is an arbitrary rotation matrix. Substituting into the ellipse equation, we see that the affine transformation can be expressed as:

\[ A = R E^{\frac{1}{2}} \]

Therefore it transforms points on the ellipse to points on a circle.

Note: with the ellipse equation $ \bar{\mathbf{x}}^T E \bar{\mathbf{x}} = 1 $ used here, the derivation gives $ A^T A = E $; this $ E $ must not be confused with the shape matrix $ E^{-1} $ defined above, which stores the entries $ e_{ij} $.

In order to describe the anisotropic structure we need a descriptor $ \mu $ which is covariant to affine transformations, such that:

\[ \mu (\hat{I}(\hat{\mathbf{x}})) = \mu_0 = \mu (I( A^{-1} \hat{\mathbf{x}})) \]

It has been observed that the windowed second moment matrix has properties which make it well suited for estimating the local linear distortion of the image blob.

Windowed Second moment matrix

To show that the second moment matrix can be used as a descriptor of the affine shape of a local anisotropic structure, we have to extend the classic definition to our scale space, and the circular Gaussian distribution must be replaced by a multivariate Gaussian distribution in order to be able to detect anisotropic structures in our basically isotropic scale space.

The multivariate 2D Gaussian distribution is given as:

\[ g(\mathbf{x}, \Sigma) = \frac{1}{2 \pi | \Sigma |^{1/2}} \exp \left( -\frac{1}{2} \mathbf{x}^T \Sigma^{-1} \mathbf{x} \right) \]

Where $ \Sigma $ is a covariance matrix of size $ 2 \times 2 $. In the 1D case, $ \Sigma $ would be a real number, the variance $ \sigma^2 $ of the distribution.

Extending our framework we can now define the second moment matrix (SMM):

\[ \mu(\mathbf{x}, \Sigma_I, \Sigma_D) = |\Sigma_D| g(\Sigma_I) * (\nabla L(\mathbf{x}, \Sigma_D) \nabla L(\mathbf{x}, \Sigma_D)^T) \]

Here we define two covariance matrices. The matrix $ \Sigma_D $ determines the 'derivation' (or local) Gaussian kernel; it describes the initial smoothing of the data on which the derivatives are calculated, and in our scale space framework it is related to the scale of the current level. The covariance matrix $ \Sigma_I $ defines the 'integration' Gaussian kernel, i.e. the window function over which the SMM is calculated; in our case the window is a Gaussian function.

Transformation of SMM

Now, having extended the definition of the second moment matrix, we would like to see how the SMM behaves when the image is transformed by some affine transformation $ A $. Let $ \bar{\mathbf{x}} $ be coordinates in the 'left' image $ I(\bar{\mathbf{x}}) $ and $ \hat{\mathbf{x}} $ coordinates in the 'right' image $ \hat{I}(\hat{\mathbf{x}}) $. The image coordinates are related by $ \hat{\mathbf{x}} = A \bar{\mathbf{x}} $ and the image intensities by $ \hat{I}(\hat{\mathbf{x}}) = I(A^{-1} \hat{\mathbf{x}}) $.

The SMM $ M $ computed at point $ \bar{\mathbf{x}} $ in the left image is related to the SMM $ \hat{M} $ computed at point $ \hat{\mathbf{x}} $ in the right image by:

\begin{eqnarray*} M = \mu(\mathbf{x},\Sigma_{I}, \Sigma_{D}) &=& A^T \mu(\hat{\mathbf{x}}, \hat{\Sigma}_{I}, \hat{\Sigma}_{D}) A \\ &=& A^T \mu(\hat{\mathbf{x}}, A \Sigma_{I} A^T, A \Sigma_{D} A^T) A = A^T \hat{M} A \end{eqnarray*}

Where the Gaussian kernels are related by $ \hat{\Sigma} = A \Sigma A^T $.

This is an important property of the SMM, which is required by the affine shape estimation framework defined previously. Therefore we can use the SMM in a similar way as the ellipse matrix, and use it to find the affine transformation which normalises the anisotropic image blob.

Covariance matrices and the SMM

In the previous section we did not consider the shape of the covariance matrices in the SMM computation. In general they give us six degrees of freedom. In this algorithm, however, these matrices are chosen to be directly proportional to the second moment matrix, because this yields convenient properties:

\[ \hat{\Sigma} = A ~\Sigma~ A^T = R M^{\frac{1}{2}} ~\Sigma~ M^{\frac{T}{2}} R^T \]

And when we choose $ \Sigma = \sigma M^{-1} $ we get:

\[ \hat{\Sigma} = \sigma ~ R M^{\frac{1}{2}} ~M^{-1}~ M^{\frac{T}{2}} R^T = \sigma I \]

Which is consistent with the fact that the isotropic structure in the right image should be sampled with isotropic, circular Gaussian kernels. Therefore we define the integration scale $ \sigma_I $ and the derivation scale $ \sigma_D $ s.t. $ \Sigma_I = \sigma_I M^{-1} $ and $ \Sigma_D = \sigma_D M^{-1} $.

This choice also has the intuitive explanation that the image blob is sampled with a Gaussian window which has the same shape as its affine region.

Iterative procedure

We have shown that the SMM can be used as a descriptor of the affine shape of an image blob. However, to select the covariance kernels for the SMM computation, we need to know the SMM itself. That is why the first covariance matrices are estimated as circular, and an iterative procedure is used which in each step selects a more precise affine region, defined by the shape adaptation matrix $ U_k $.

The convergence criterion of the iterative procedure is a simple isotropy measure based on the eigenvalues $ \lambda_{min}(M) $ and $ \lambda_{max}(M) $ of the second moment matrix $ M $:

\[ \mathcal{Q} = \frac{\lambda_{min}(M)}{\lambda_{max}(M)} \]

Where $ \mathcal{Q} \in [0, 1] $, with $ \mathcal{Q} = 1 $ for a perfectly isotropic structure. The convergence criterion is then defined as:

\[ 1 - \mathcal{Q} < \epsilon_C \]

Where usually $ \epsilon_C = 0.05 $.
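
For a symmetric $ 2 \times 2 $ SMM the measure is cheap to evaluate in closed form. The following sketch (not the library code) computes $ \mathcal{Q} $ and the convergence test from the matrix entries.

 {.c}
#include <math.h>

/* Sketch: isotropy measure Q and convergence test for the symmetric
   second moment matrix [a b ; b c]. */
int hasConverged(double a, double b, double c, double epsilonC)
{
  double halfTr = (a + c) / 2 ;
  double det = a * c - b * b ;
  double disc = sqrt(halfTr * halfTr - det) ; /* = sqrt(((a-c)/2)^2 + b^2) */
  double lambdaMax = halfTr + disc ;
  double lambdaMin = halfTr - disc ;
  double Q = lambdaMin / lambdaMax ;
  return (1 - Q) < epsilonC ; /* typically epsilonC = 0.05 */
}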

The iterative procedure for estimating the affine shape of a point $ \mathbf{x} $ goes as follows:

Input: Scale invariant frame $ K $ with spatial coordinate $ \mathbf{x}_K = (x_k, y_k) $ and scale $ \sigma_K $

  1. initialize $ U_0 $ to the identity matrix and set $ k = 0 $
  2. set $ k = k+1 $ and if $ k \geq k_{max} $ reject the frame $ K $ as unstable.
  3. normalize the current window $ W(\hat{\mathbf{x}}) = I(\mathbf{x}) $ centered at $ \mathbf{x}_K = \sigma_K U_{k-1} \hat{\mathbf{x}}_K $. The window $ W $ now contains the affine region normalised to a circular one, based on the current estimate of the affine shape $ U_{k-1} $ and the detected derivation scale $ \sigma_K $ of the frame (because $ |U_{k-1}| = 1 $). The window size is defined by the argument window_size.
  4. calculate the second moment matrix of the normalised window $ W $

    \[ M_k = \mu_W(\hat{\mathbf{c}}, \sigma_I I, \sigma_K I) \]

    This is easily achieved by weighting the window values with a circular Gaussian window whose standard deviation is $ \sigma_W = \frac{0.5\,\mathrm{win\_size}}{3} $, based on the empirical 3-sigma rule. Because the frame's affine region is always fitted to a window of the same size, the integration scale is a constant multiple of the frame's derivation scale.
  5. concatenate the transformations

    \[ U_k = \frac{M_k^{1/2}}{\left|M_k^{1/2}\right|} ~ U_{k-1} \]

    Where the matrix $ M_k^{1/2} $ is normalised by its determinant in order not to scale the affine region but only change its shape. In this way the affine transformation is iteratively improved as the measurement region adapts to the real affine shape of the image structure.
  6. In the case of divergence:

    \[ \frac{\lambda_{max}(M_k^{1/2})}{\lambda_{min}(M_k^{1/2})} > 6 \]

    e.g. when the point $ K $ lies on an edge, reject the point as unstable.
  7. go to Step 2 if

    \[ 1 - \frac{\lambda_{min}(M_k)}{\lambda_{max}(M_k)} \geq \epsilon_C \]

    where the default value is $ \epsilon_C = 0.05 $. Please note that this criterion is calculated from the SMM of the estimated normalised isotropic shape, whereas the divergence criterion is calculated from the estimated affine transformation. If the procedure has converged, the estimated affine transformation is $ A = U_k $.

In the current implementation, the derivation kernel used in the windowed SMM is, contrary to the definition, the kernel used in the GSS level where the point was detected, for performance reasons. This also fixes the scale at which the frame was detected.

The affine transformation can be converted to the shape matrix, which also encodes the scale of the frame:

\[ E^{-1} = \sigma_{K}^{2} (A^T A) \]
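
Expanded entry-wise, with $ A = \left[\begin{array}{cc} a_{11} & a_{12} \\ a_{21} & a_{22} \end{array}\right] $, the conversion reads as follows (a sketch with hypothetical variable names):

 {.c}
/* Sketch: E^{-1} = sigmaK^2 * A^T A, expanded entry-wise. */
double s2  = sigmaK * sigmaK ;
double e11 = s2 * (a11 * a11 + a21 * a21) ;
double e12 = s2 * (a11 * a12 + a21 * a22) ;
double e22 = s2 * (a12 * a12 + a22 * a22) ;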

When oriented ellipse frames are detected, the affine image blob is first normalised into a larger window. This normalisation is also performed more precisely, from the original image rather than from the scale space level, which is usually blurred more than necessary.

Affine shape normalisation

In order to calculate the descriptor of an affine invariant frame, the associated image blob has to be normalised according to its shape. Generally, the frame neighbourhood multiplied by the measurement scale (parameter mr_scale) is normalised into a square patch $ P $ (the normalised region is isotropic, therefore circular) of size patch_size.

Because the scale space in general simulates downsampling of the input image and is smoothed with circular kernels, it does not take into account the affine shape of the frame. Therefore, if we used scale space values for the descriptor calculation, the data would be smoothed by a wrong kernel, one which does not reflect the affine shape of the point neighbourhood (where the distances between pixels are not constant).

That is why the data from the original image are used. Because the downsampling is done using bilinear interpolation, which takes into account only the nearest pixels, the image has to be smoothed accordingly, as described in the following procedure. This can also be viewed from the sampling theorem point of view: the smoothing is a low-pass 2D filter suppressing higher frequencies in order to prevent aliasing.

Input: Affine invariant frame $ K $ with spatial coordinate $ \mathbf{x}_K $, scale of the original scale invariant frame $ \sigma_K $ and its affine shape described by the affine transformation $ A_K, |A_K| = 1 $ and shape matrix $ E_K $.

  1. Test whether the frame measurement region touches the image boundary; if so, it is not possible to calculate the frame descriptor and the keypoint is rejected.
  2. Calculate the radius $ r_c $ of the circumscribed circle of the anisotropic image structure as:

    \[ r_{c_K} = \sigma_K \mathrm{mr\_scale} \]

    and the scale between the patch and the circumscribed circle:

    \[ s_{p \rightarrow c} = \frac{2 ~ r_{c_K}}{\mathrm{patch\_sz}} \]

  3. If $ s_{p \rightarrow c} > 0.4 $, i.e. the distance in the image between adjacent patch pixels is bigger than $ 0.4 $, the image must be smoothed in order to perform correct bilinear interpolation (a numeric example follows this list):
    1. Warp the measurement region into a temporary patch $ P_t $ of size $ 2 r_{c_K} $ with the affine transformation $ A_K $ using bilinear interpolation. The image is not scaled in any way ($ |A_K| = 1 $), so the pixel distance remains the same.
    2. Smooth $ P_t $ with a circular Gaussian kernel with $ \sigma = 1.5 s_{p \rightarrow c} $. The multiplication factor $ 1.5 $ has been chosen due to the properties of bilinear interpolation, so that the influence of each pixel value spreads appropriately over its neighbourhood.
    3. Downsample $ P_t $ to $ P $ using bilinear interpolation.
  4. If $ s_{p \rightarrow c} \leq 0.4 $, smoothing is not needed and the measurement region is in general oversampled. Warp the affine measurement region directly to the patch $ P $ using bilinear interpolation.
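
As a worked example of the smoothing decision in step 3, suppose $ \sigma_K = 5 $, mr_scale $ = 3 $ and patch_size $ = 41 $ (all values illustrative, not library defaults):

 {.c}
/* Illustrative values only. */
double r_c   = 5.0 * 3 ;       /* = 15.0, circumscribed circle radius  */
double s_pc  = 2 * r_c / 41 ;  /* ~ 0.73 > 0.4, so smoothing is needed */
double sigma = 1.5 * s_pc ;    /* ~ 1.10, Gaussian smoothing sigma     */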