- Credits:
- Many people have contributed with suggestions and bug reports. Although the following list is certainly incomplete, we would like to thank: Wei Dong, Loic, Giuseppe, Liu, Erwin, P. Ivanov, and Q. S. Luo.
framedet.h implements a covariant frame detector, a reusable object that extracts covariant image features (referred to as frames in the following text) [6] [9] from one or multiple images.
Overview
An image frame is a set of attributes of a selected image region (also called a keypoint), usually with an associated descriptor. Frames are extracted by an image frame detector, and depending on the class of the detected frames, they differ in their covariance to image transformations. Frame detection can also differ in the image response function used to detect distinct image regions. A special case are SIFT frames, produced by the SIFT keypoint detector. All detected frames can be accompanied by SIFT descriptors.
Image frame classes
Image frames differ in the attributes computed for the selected image region (blob). These attributes determine the covariance of each frame class to certain image transformations, as summarised in the following table:
| Frame class | Frame attributes | Covariant to | See also |
| --- | --- | --- | --- |
| Disc | Frame coordinates, scale (radius) | Translations and scalings | Disc detection |
| Oriented Disc | Frame coordinates, scale and orientation | Translations, scalings and rotations (similarities) | Disc detection |
| Ellipse | Frame coordinates and shape matrix with three degrees of freedom | Translations and affinities up to residual rotations | Ellipse detection |
| Oriented Ellipse | Frame coordinates and affine transformation between ellipse and circle | Translations and affinities | Ellipse detection |
The frame class names are chosen to give a graphical intuition of the blob structure, and frames of the same class behave in the same way under linear transformations.
Image frames detection
The covariant frame detector can be configured to detect any of the mentioned classes. Because only some attributes are shared among classes, the detection of each frame type differs.
In all cases the detection starts by detecting disc frames using the VlScaleSpace object. Detecting the major orientations extends each disc frame into several oriented disc frames.
If the detector is configured to detect ellipses, the image region of each disc is examined for its affine shape based on its second moment matrix. In order to assign orientations, local anisotropic structures have to be transformed into isotropic circular regions. This step in fact transforms the elliptic blob into a circular blob, and thereafter the orientation can be assigned in the same way as for disc frames.
Disc detection
- See also:
- Scale space technical details, Detector technical details
A disc frame is a circular image region; in the case of an oriented disc it also has an assigned orientation. It is described by a geometric frame of three or four parameters: the keypoint center coordinates x and y, its scale (the radius of the region), and optionally its orientation (an angle expressed in radians). The disc detector uses as keypoints image structures which resemble “blobs”. By searching for blobs at multiple scales and positions, the disc detector is invariant (or, more accurately, covariant) to translations, rotations, and rescalings of the image.
The keypoint orientation is also determined from the local image appearance and is covariant to image rotations. Depending on the symmetry of the keypoint appearance, determining the orientation can be ambiguous. In this case, the Covariant frames detector returns a list of up to four possible orientations, constructing up to four frames (differing only by their orientation) for each detected image blob.
There are several parameters that influence the detection of disc keypoints. First, searching keypoints at multiple scales is obtained by constructing a so-called “Gaussian scale space”. The scale space is just a collection of images obtained by progressively smoothing the input image, which is analogous to gradually reducing the image resolution. Conventionally, the smoothing level is called the scale of the image. The construction of the scale space is influenced by the following parameters, set when creating the detector object:
- Number of octaves. Increasing the scale by an octave means doubling the size of the smoothing kernel, whose effect is roughly equivalent to halving the image resolution. By default, the scale space spans as many octaves as possible (i.e. roughly log2(min(width, height))), which has the effect of searching keypoints of all possible sizes.
- First octave index. By convention, the octave of index 0 starts with the full image resolution. Specifying an index greater than 0 starts the scale space at a lower resolution (e.g. 1 halves the resolution). Similarly, specifying a negative index starts the scale space at a higher resolution image, and can be useful to extract very small features (since this is obtained by interpolating the input image, it does not make much sense to go past -1).
- Number of levels per octave. Each octave is sampled at this given number of intermediate scales (by default 3). Increasing this number might in principle return more refined keypoints, but in practice can make their selection unstable due to noise (see [1]).
Each image of the scale-space is then transformed using different image response functions in order to detect corner-like blobs in the image which are found to be stable under several image transformations. The Covariant frames detector supports the following response functions:
- Difference of Gaussians (DoG). A computationally efficient approximation of the Laplacian of Gaussian (LoG), a scalar value based on the second-order derivatives of the image brightness function.
- Hessian response. An image response based on the determinant of the Hessian matrix, which is constructed from the second-order partial derivatives of the image brightness. Contrary to the LoG, it also takes the mixed derivative into account.
Based on the response function, a new scale space is constructed in which distinct image regions (keypoints) are detected as local extrema of the response function across spatial coordinates and scale. Keypoints are further refined by eliminating those that are likely to be unstable, either because they are selected near an image edge rather than an image blob, or because they are found on image structures with low contrast. Filtering is controlled by the following parameters:
- Peak threshold. This is the minimum amount of contrast to accept a keypoint. It is set by configuring the detector with vl_covdet_set_peak_thresh().
- Edge threshold. This is the edge rejection threshold. It is set by configuring the detector with vl_covdet_set_edge_thresh().
| Parameter | See also | Controlled by | Comment |
| --- | --- | --- | --- |
| image response function | Disc detection | vl_covdet_new_disc_detector | Values: VL_IMRESP_DOG or VL_IMRESP_HESSIAN |
| number of octaves | Disc detection | vl_covdet_new_disc_detector | |
| first octave index | Disc detection | vl_covdet_new_disc_detector | set to -1 to extract very small features |
| number of scale levels per octave | Disc detection | vl_covdet_new_disc_detector | can affect the number of extracted keypoints |
| edge threshold | Disc detection | vl_covdet_set_edge_thresh | decrease to eliminate more keypoints |
| peak threshold | Disc detection | vl_covdet_set_peak_thresh | increase to eliminate more keypoints |
Ellipse detection
As noted above, oriented discs are covariant only to similarities. Under a viewpoint change, however, an image blob changes not only its scale and position but also its affine shape. This is why attributes covariant to affine transformations are introduced.
The affine shape of a detected disc frame is estimated by an iterative procedure based on the second moment matrix (SMM). This matrix is used as a local image texture descriptor which is covariant to affine transformations of the image domain.
The iterative procedure sequentially estimates the affine shape of the frame region by computing an affine transformation which maps the local anisotropic structure onto an isotropic one. In graphical terms, it transforms the elliptic region into a circular one. The size of the region used for calculating the SMM can be controlled by the parameter Affine Window Size.
The procedure stops when the isotropy measure is closer to the ideal measure of a circular region than the parameter Convergence Threshold. The procedure is also limited by a maximal number of iterations, controlled by the parameter Maximum Iterations.
In order to obtain affine invariant descriptors and the residual rotation, these attributes are calculated from the transformed isotropic structure. To obtain higher precision, the descriptors are usually calculated from larger windows than the windows used for affine shape estimation. The size of these windows can be controlled by the parameter Descriptor Window Size.
Because the parameters Affine Window Size and Descriptor Window Size influence the size of the allocated memory, they have to be defined in the constructor of the covariant frame detector.
| Parameter | See also | Controlled by | Comment |
| --- | --- | --- | --- |
| Disc detector parameters | Disc detection | | |
| Size of the window for affine shape estimation | Ellipse detection | vl_covdet_new_ellipse_detector | affine shape estimation precision |
| Size of the window for descriptor and orientation calculation | Ellipse detection | vl_covdet_new_ellipse_detector | descriptor calculation precision |
| Maximum number of iterations for affine shape estimation | Ellipse detection | vl_affineshapeestimator_set_max_iter | affine shape estimation precision |
| Convergence criterion threshold of affine shape estimation | Ellipse detection | vl_affineshapeestimator_set_conv_thresh | affine shape estimation precision |
SIFT Descriptor
- See also:
- Descriptor technical details
A SIFT descriptor is a 3-D spatial histogram of the image gradients characterizing the appearance of a keypoint. The gradient at each pixel is regarded as a sample of a three-dimensional elementary feature vector, formed by the pixel location and the gradient orientation. Samples are weighed by the gradient norm and accumulated in a 3-D histogram h, which (up to normalization and clamping) forms the SIFT descriptor of the region. An additional Gaussian weighting function is applied to give less importance to gradients farther away from the keypoint center. Orientations are quantized into eight bins and the spatial coordinates into four each.
SIFT descriptors are computed by either calling ::vl_sift_calc_keypoint_descriptor or ::vl_sift_calc_raw_descriptor. They accept as input a disc frame, which specifies the descriptor center, its size, and its orientation on the image plane. In the case of elliptic frames, the descriptor is calculated from the affine-normalised image region, based on the affine shape of the frame.
The following parameters influence the descriptor calculation:
- Magnification factor. The descriptor size is determined by multiplying the keypoint scale by this factor. It is set by ::vl_sift_set_magnif.
- Gaussian window size. The descriptor support is determined by a Gaussian window, which discounts gradient contributions farther away from the descriptor center. The standard deviation of this window is set by ::vl_sift_set_window_size and expressed in units of bins.
The VLFeat SIFT descriptor uses the following convention. The y axis points downwards and angles are measured clockwise (to be consistent with the standard image convention). The 3-D histogram (consisting of \(8 \times 4 \times 4 = 128\) bins) is stacked as a single 128-dimensional vector, where the fastest varying dimension is the orientation and the slowest the y spatial coordinate. This is illustrated by the following figure.
- Note:
- The convention of D. Lowe's SIFT implementation is slightly different: the y axis points upwards and the angles are measured counter-clockwise.
| Parameter | See also | Controlled by | Comment |
| --- | --- | --- | --- |
| magnification factor | sift-intro-descriptor | ::vl_siftdesc_set_magnif | increase this value to enlarge the image region described |
| Gaussian window size | sift-intro-descriptor | ::vl_siftdesc_set_window_size | smaller values let the center of the descriptor count more |
Extensions
Eliminating low-contrast descriptors. Near-uniform patches do not yield stable keypoints or descriptors. ::vl_sift_set_norm_thresh() can be used to set a threshold on the average norm of the local gradient to zero-out descriptors that correspond to very low contrast regions. By default, the threshold is equal to zero, which means that no descriptor is zeroed. Normally this option is useful only with custom keypoints, as detected keypoints are implicitly selected at high contrast image regions.
Using the VlFrameDet object
The code provided in this module can be used in different ways. You can instantiate and use a VlFrameDet object to extract arbitrary types of frames and descriptors from one or multiple images.
To use a VlFrameDet object:
- Initialize a VlFrameDet object with vl_covdet_new_ellipse_detector or vl_covdet_new_disc_detector. The filter can be reused for multiple images of the same size (e.g. for an entire video sequence).
- Call vl_covdet_detect() to detect frames in an image. This function also calculates descriptors if this was requested in the constructor.
- The number of detected keypoints can be accessed by ::vl_covdet_get_num_frames.
- Frames are accessed with the methods vl_covdet_get_discs, vl_covdet_get_oriented_discs, vl_covdet_get_ellipses or vl_covdet_get_oriented_ellipses. Alternatively, the array storing the frames can be accessed directly with vl_covdet_get_frames_storage and the type of the frames with vl_covdet_get_frames_type.
- Descriptors can be accessed with the method vl_covdet_get_descriptors(); the size of a frame descriptor is given by vl_covdet_get_descriptor_size().
- Delete the detector by vl_covdet_delete().
Please note that for each image the frame storage can be reallocated, therefore the frames can be stored in a different memory location. That is why new pointers to the frames and descriptors should be obtained after each detection.
To compute SIFT descriptors of custom keypoints, use vl_sift_calc_descriptor().
SIFT keypoint detector
The ::VlFrameDet can be used, for example, as a SIFT frame detector by constructing the ::VlFrameDet object and setting the parameters as follows:
```{.c}
VlImRespFunction respFunction = VL_IMRESP_DOG;
vl_bool calcOrientation = VL_TRUE;
vl_bool calcDescriptor = VL_TRUE;
VlFrameDet siftDet = vl_covdet_new_disc_detector(w, h, respFunction,
                                                 O, firstOctave, S,
                                                 calcOrientation, calcDescriptor);
```
Then the detection is fairly simple:
```{.c}
vl_covdet_detect(siftDet, image);
vl_size framesNum = vl_covdet_get_frames_num(siftDet);
VlFrameOrientedDisc const *frames = vl_covdet_get_oriented_discs(siftDet);
vl_size descriptorSize = vl_covdet_get_descriptor_size(siftDet);
float const *descriptors = vl_covdet_get_descriptors(siftDet);
```
Hessian-Affine keypoint detector
Similarly, the Hessian-Affine detector can be constructed simply by creating the frame detector with the following parameters:
```{.c}
VlImRespFunction respFunction = VL_IMRESP_HESSIAN;
vl_bool calcOrientation = VL_TRUE;
vl_bool calcDescriptor = VL_TRUE;
VlFrameDet hessaffDet = vl_covdet_new_ellipse_detector(w, h, respFunction,
                                                       O, firstOctave, S,
                                                       calcOrientation, calcDescriptor);
```
Keypoint conversion
The constructed ::VlFrameDet object can also be used for conversion between different classes of frames using the method vl_covdet_convert_frames(), where the output frame class is the class of frames detected by the detector. The missing attributes of the frames are computed in the same way as they would be during detection. The following code shows how to convert disc frames into oriented ellipses and calculate their descriptors:
```{.c}
VlFrameOrientedEllipse* dstFrames;
float const *dstDescriptors;
VlImRespFunction respFunction = VL_IMRESP_DOG; /* irrelevant for conversion */
vl_bool dstCalcOrientation = VL_TRUE;
vl_bool dstCalcDescriptor = VL_TRUE;
VlFrameDet detector = vl_covdet_new_ellipse_detector(w, h, respFunction,
                        O, firstOctave, S, dstCalcOrientation, dstCalcDescriptor);
VlFrameDisc *srcFrames = ...
vl_size srcFramesNum = ...
VlFrameType srcFramesType = VL_FRAME_DISC;
vl_covdet_convert_frames(detector, image, (const void*)srcFrames, srcFramesNum,
                         srcFramesType);
dstFrames = vl_covdet_get_oriented_ellipses(detector);
dstDescriptors = vl_covdet_get_descriptors(detector);
```
Technical details
Disc Detector
The SIFT frames (keypoints) are extracted based on local extrema (peaks) of the DoG scale space. Numerically, local extrema are elements whose 26 neighbors in space and scale (a \(3 \times 3 \times 3\) neighborhood) all have smaller (or larger) values. Once extracted, local extrema are quadratically interpolated (this is very important, especially at the lower resolution scales, in order to have accurate keypoint localization at the full resolution). Finally, they are filtered to eliminate low-contrast responses or responses close to edges, and the orientation(s) are assigned, as explained next.
Eliminating low contrast responses
Peaks which are too short may have been generated by noise and are discarded. This is done by comparing the absolute value of the DoG scale space at the peak with the peak threshold, and discarding the peak if its value is below the threshold.
Eliminating edge responses
Peaks which are too flat are often generated by edges and do not yield stable features. These peaks are detected and removed as follows. Given a peak, the algorithm evaluates the x, y Hessian \(D\) of the DoG scale space at the scale of the peak. Then the following score (similar to the Harris function) is computed:
\[ \frac{(\operatorname{tr} D)^2}{\det D} \]
This score has a minimum (equal to 4) when both eigenvalues of the Hessian are equal (curved peak) and increases as one of the eigenvalues grows while the other stays small. Peaks are retained if the score is below the quantity \((t_{\mathrm{edge}} + 1)^2 / t_{\mathrm{edge}}\), where \(t_{\mathrm{edge}}\) is the edge threshold. Notice that this quantity has a minimum equal to 4 when \(t_{\mathrm{edge}} = 1\) and grows thereafter. Therefore the range of the edge threshold is \([1, \infty)\).
Orientation assignment
A peak in the DoG scale space fixes two parameters of the keypoint: the position and scale. It remains to choose an orientation. To do this, SIFT computes a histogram of the gradient orientations in a Gaussian window with a standard deviation which is 1.5 times the scale of the keypoint.
This histogram is then smoothed and the maximum is selected. In addition to the biggest mode, up to three other modes whose amplitude is within 80% of the biggest mode are retained and returned as additional orientations.
Hessian keypoint detector
The scale-adapted Hessian matrix is defined as:
\[ H(\mathbf{x}, \sigma) = \sigma^2 \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \]
where \(\sigma\) is the scale of the current level, \(L(\mathbf{x}, \sigma) = g(\sigma) * I(\mathbf{x})\) is the scale space level (\(I\) stands for the image intensities), and \(L_{xx}\), \(L_{xy}\), \(L_{yy}\) stand for the second-order partial derivatives, e.g. \(L_{xy} = \partial^2 L / \partial x \, \partial y\).
Note also that the normalisation factor \(\sigma^2\) is needed to maintain scale invariance of the response: a scale-normalised derivative of order \(n\) is defined as \(\sigma^n\) times the plain derivative of order \(n\), due to the properties of partial derivatives of the Gaussian function [mikolajczyk01index].
Having the scale-adapted Hessian matrix, the Hessian response is given as the determinant of the Hessian matrix:
\[ \det H(\mathbf{x}, \sigma) = \sigma^4 \left( L_{xx} L_{yy} - L_{xy}^2 \right) \]
In the used implementation the Hessian matrix is calculated only within a small pixel window, and the second-order partial derivatives are approximated by central finite differences.
Descriptor
A SIFT descriptor of a local region (keypoint) is a 3-D spatial histogram of the image gradients. The gradient at each pixel is regarded as a sample of a three-dimensional elementary feature vector, formed by the pixel location and the gradient orientation. Samples are weighed by the gradient norm and accumulated in a 3-D histogram h, which (up to normalization and clamping) forms the SIFT descriptor of the region. An additional Gaussian weighting function is applied to give less importance to gradients farther away from the keypoint center.
Construction in the canonical frame
Denote by \(J(x, y) = \nabla I_\sigma(x, y)\) the gradient vector field computed at the scale \(\sigma\). The descriptor is a 3-D spatial histogram capturing the distribution of \(J(x, y)\). It is convenient to describe its construction in the canonical frame. In this frame, the image and descriptor axes coincide and each spatial bin has side 1. The histogram has \(N_\theta \times N_x \times N_y\) bins (usually \(8 \times 4 \times 4\)), as in the following figure:
Bins are indexed by a triplet of indexes \(t, i, j\) and their centers are given by
\[ \theta_t = \frac{2\pi}{N_\theta} t, \qquad x_i = i - \frac{N_x - 1}{2}, \qquad y_j = j - \frac{N_y - 1}{2}. \]
The histogram is computed by using trilinear interpolation, i.e. by weighing contributions by the binning functions, which spread the weight of each gradient sample over the adjacent bins. The gradient vector field is transformed into a three-dimensional density map of weighed contributions, which is localized in the keypoint support by a Gaussian window of standard deviation \(\sigma_{\mathrm{win}}\) and accumulated into the histogram.
In post processing, the histogram is \(l^2\) normalized, then clamped at 0.2, and \(l^2\) normalized again.
Calculation in the image frame
Invariance to similarity transformation is attained by attaching descriptors to SIFT keypoints (or other similarity-covariant frames). Then projecting the image in the canonical descriptor frames has the effect of undoing the image deformation.
In practice, however, it is convenient to compute the descriptor directly in the image frame. To do this, denote with a hat quantities relative to the canonical frame and without a hat quantities relative to the image frame (so, for instance, \(\hat{x}\) is the x-coordinate in the canonical frame and \(x\) the x-coordinate in the image frame). Assume that the canonical and image frames are related by an affinity:
\[ \mathbf{x} = A \hat{\mathbf{x}} + T. \]
Then all quantities can be computed in the image frame directly. For instance, the images at infinite resolution in the two frames are related by \(\hat{I}_0(\hat{\mathbf{x}}) = I_0(\mathbf{x})\); the canonized image at a given scale is in relation with a correspondingly scaled image; and deriving shows that the gradient fields are in relation \(\hat{J}(\hat{\mathbf{x}}) = J(\mathbf{x}) A\). Therefore the descriptor can be computed either in the canonical frame or directly in the image frame, with the product of the two spatial binning functions evaluated at the back-projected sample positions.
In the actual implementation, this integral is computed by visiting a rectangular area of the image that fully contains the keypoint grid (along with half a bin of border, to fully include the bin windowing function). Since the descriptor can be rotated, this area is a rectangle whose sides account for the diagonal of the descriptor grid (see also the illustration).
Standard SIFT descriptor
For a SIFT-detected keypoint of center \(T\), scale \(\sigma\) and orientation \(\theta\), the affine transformation \((A, T)\) reduces to the similarity transformation
\[ \mathbf{x} = m \sigma R(\theta) \hat{\mathbf{x}} + T, \]
where \(R(\theta)\) is a counter-clockwise rotation of \(\theta\) radians, \(m\sigma\) is the size of a descriptor bin in pixels, and \(m\) is the descriptor magnification factor, which expresses how much larger a descriptor bin is compared to the scale \(\sigma\) of the keypoint (the default value is \(m = 3\)). Moreover, the standard SIFT descriptor computes the image gradient at the scale of the keypoint, which in the canonical frame is equivalent to a smoothing of \(\hat{\sigma} = 1/m\). Finally, the default Gaussian window size is set to have standard deviation \(\hat{\sigma}_{\mathrm{win}} = 2\).
Ellipse Detector
Scale and affine invariant keypoints
An affine invariant keypoint (frame) consists of the following values: the frame coordinates and its affine shape. The affine shape is described by a symmetric shape matrix \(E\), where
\[ \mathbf{x}^\top E \, \mathbf{x} = 1 \]
is the matrix representation of the ellipse (with \(\mathbf{x}\) expressed relative to the ellipse center).
An ellipse is used in a similar manner as circles are used for scale invariant features, because it can represent the affine shape of the feature: it is an affine transformation of a circle. In fact, this affine transformation \(A^{-1}\) (the inverted matrix is used because in the following text \(A\) denotes the transformation which maps the ellipse to a circle) can be converted into the ellipse matrix as:
\[ E = A^\top A. \]
Because the ellipse matrix is real, symmetric and positive definite, it can also be decomposed using the eigen decomposition. The eigenvalues can be related to the sizes of the ellipse axes, where one eigenvalue with its eigenvector represents the direction of the fastest change and the other eigenvalue with its eigenvector the direction of the slowest change.
Affine shape estimation
In order to obtain frames which are covariant to affine transformations, we have to compute attributes which are themselves covariant to affine transformations. These attributes can also be regarded as the 'affine shape' of the image region, because the blob can then be normalised into an isotropic blob using the affine transformation encoded in these attributes (which is an implication of the affine covariance).
The basic idea of affine shape estimation is to find an affine transformation which normalises the image blob. For simplicity, we first show how to find a transformation which normalises an ellipse into a circle.
An anisotropic image blob can be represented by an ellipse with the equation:
\[ \mathbf{x}^\top E \, \mathbf{x} = 1 \]
where \(\mathbf{x}\) is a coordinate in the image, expressed in a coordinate system based on the keypoint centre.
The affine transformation can be defined as a positive-definite matrix \(A\) of size \(2 \times 2\) such that:
\[ \mathbf{x}' = A \mathbf{x} \]
where \(\mathbf{x}'\) are coordinates in the transformed image.
We want to find an affine transformation which transforms points on the ellipse \(\mathbf{x}^\top E \mathbf{x} = 1\) to points on a circle, i.e. such that \(\mathbf{x}'^\top \mathbf{x}' = 1\). This can be done by decomposing the matrix \(E\) as:
\[ E = A^\top A \]
where \(A\) is determined up to an arbitrary rotation matrix \(R\), since \((RA)^\top (RA) = A^\top A\). Substituting into the ellipse equation shows that the affine transformation can be expressed as the symmetric square root:
\[ A = E^{1/2} \]
because then \(\mathbf{x}'^\top \mathbf{x}' = \mathbf{x}^\top E \, \mathbf{x} = 1\). Therefore it transforms points on the ellipse to points on a circle.
In order to describe the anisotropic structure, we need a descriptor which is covariant to affine transformations, so that the descriptor of the transformed region can be computed from the descriptor of the original region and the transformation itself. It was observed that a windowed second moment matrix has properties which make it well suited for estimating the local linear distortion of the image blob.
Windowed Second moment matrix
To show that the second moment matrix can be used as a descriptor of the affine shape of a local anisotropic structure, we have to extend the classic definition to our scale space: the circular Gaussian distribution must be replaced by a multivariate Gaussian distribution in order to be able to detect anisotropic structures in our basically isotropic scale space.
The multivariate 2-D Gaussian distribution is given as:
\[ g(\mathbf{x}; \Sigma) = \frac{1}{2\pi \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} \mathbf{x}^\top \Sigma^{-1} \mathbf{x} \right) \]
where \(\Sigma\) is a covariance matrix of size \(2 \times 2\). In the 1-D case, \(\Sigma\) would be a real number, the variance \(\sigma^2\) of the distribution.
Extending our framework, we can now define the second moment matrix (SMM):
\[ \mu(\mathbf{x}; \Sigma_I, \Sigma_D) = g(\Sigma_I) * \left( \nabla L(\mathbf{x}; \Sigma_D) \, \nabla L(\mathbf{x}; \Sigma_D)^\top \right) \]
where we define two covariance matrices. \(\Sigma_D\) determines the 'derivation' (or local) Gaussian kernel; this covariance matrix describes what initial smoothing was used for the data on which the derivatives were calculated, and in our scale space framework it is related to the scale of the current level. The covariance matrix \(\Sigma_I\) defines the 'integration' Gaussian kernel, i.e. the window function over which the SMM is calculated; in our case the window is a Gaussian function.
Transformation of SMM
Now, having extended the definition of the second moment matrix, we would like to see how the SMM behaves when the image is transformed by some affine transformation \(A\). Let \(\mathbf{x}_L\) be coordinates in the 'left' image \(I_L\) and \(\mathbf{x}_R\) coordinates in the 'right' image \(I_R\). The image coordinates are in relation \(\mathbf{x}_L = A \mathbf{x}_R\) and the image intensity values are in relation \(I_L(\mathbf{x}_L) = I_R(\mathbf{x}_R)\).
The SMM \(\mu_L\) computed at point \(\mathbf{x}_L\) in the left image is in relation with the SMM \(\mu_R\) computed at point \(\mathbf{x}_R\) in the right image:
\[ \mu_L(\mathbf{x}_L) = A^{-\top} \mu_R(\mathbf{x}_R) \, A^{-1} \]
where the Gaussian kernels are in relation \(\Sigma_R = A^{-1} \Sigma_L A^{-\top}\).
This is an important property of the SMM, which is demanded by the previously defined affine shape estimation framework. Therefore we can use the SMM in a similar way as the ellipse matrix, and use it to find the affine transformation which normalises the anisotropic image blob.
Covariance matrices and the SMM
In the previous section we did not consider the shape of the covariance matrices in the SMM computation. A general choice gives us six degrees of freedom. In this algorithm, however, these matrices are chosen in direct proportion to the (inverse of the) second moment matrix, because this yields quite agreeable properties: with this choice, the kernels in the normalised image become isotropic, which complies with the fact that the isotropic structure in the right image should be sampled with isotropic, circular, Gaussian kernels. Therefore we define the integration scale \(\sigma_I\) and the derivation scale \(\sigma_D\) as the scalar factors of the integration and derivation kernels, respectively.
This decision also has an intuitive explanation: the image blob is sampled with a Gaussian window which has the same shape as its affine region.
This decision has got also intuitive explanation that the image blob is sampled with a gaussian window which has the same shape as its affine region.
Iterative procedure
We have shown that the SMM can be used as an approximate descriptor of the affine shape of an image blob. However, for selecting the covariance kernels of the SMM computation we need to know the SMM itself. That is why the first covariance matrices are estimated as circular, and an iterative procedure is then used which in each step selects a more precise affine region, defined by the shape adaptation matrix \(U\).
The convergence criterion of the iterative procedure is based on a simple isotropy measure based on the eigenvalues \(\lambda_{\min}\) and \(\lambda_{\max}\) of the second moment matrix:
\[ Q = \frac{\lambda_{\min}}{\lambda_{\max}}, \qquad 0 \le Q \le 1, \]
with \(Q = 1\) for a perfectly isotropic structure, when the SMM is close to a pure rotation. Then the convergence criterion is defined as:
\[ 1 - Q < \varepsilon_C \]
where \(\varepsilon_C\) is the convergence threshold.
The iterative procedure for estimating the affine shape of a point goes as follows.
Input: a scale invariant frame with spatial coordinate \(\mathbf{x}\) and scale \(\sigma\).
1. Initialize the shape adaptation matrix \(U\) to the identity matrix.
2. Compute the SMM and its eigenvalues; if they indicate a degenerate structure, reject the frame as unstable.
3. Normalize the actual window centered at \(\mathbf{x}\). The window now contains the affine region normalised into a circular one, based on the actual estimation of the affine shape \(U\) and the detected derivation scale of the frame. The window size is defined by the argument window_size.
4. Calculate the second moment matrix \(\mu\) of the normalised window. This is easily achieved by weighting the window values by a circular Gaussian window whose standard deviation is chosen based on the empirical 3-sigma rule. We can see that, because the frame affine region is always fitted to a window of the same size, the integration scale is a constant multiple of the frame derivation scale.
5. Concatenate the transformation: \(U \leftarrow \mu^{-1/2} U\), where the matrix \(\mu^{-1/2}\) is normalised by its determinant in order not to scale the affine region size but only change its shape. In this way the actual affine transformation is iteratively improved as the measurement region gets more adapted to the real affine shape of the image structure.
6. In the case of divergence, e.g. when the point lies on an edge, reject the point as unstable.
7. Go to step 2 if the convergence criterion \(1 - \lambda_{\min}/\lambda_{\max} < \varepsilon_C\) is not met. Please note that the convergence criterion is calculated from the SMM of the estimated normalised isotropic shape, whereas the divergence criterion is calculated from the estimated affine transformation. Otherwise, the estimated affine transformation is \(U\).
In the current implementation, the derivation kernel used in the windowed SMM is the kernel used at the GSS level where the point was detected, contrary to the definition, for performance reasons. This also fixes the scale at which the frame was detected.
The affine transformation can be converted to the shape matrix, which also encodes the scale of the frame.
In the case when oriented ellipse frames are detected, the affine image blob is first normalised into a larger window. This normalisation is also performed more precisely from the original image rather than from the scale space level, which is usually blurred more than necessary.
Affine shape normalisation
In order to calculate the descriptor of an affine invariant frame, we have to normalise the associated image blob according to its shape. Generally, we normalise the frame neighbourhood, multiplied by the measurement scale (parameter mr_scale), into a square patch (the normalised region is isotropic, therefore circular) of size patch_size.
Because the scale space generally simulates downsampling of the input image and is smoothed with circular kernels, it does not take into account the affine shape of the frame. Therefore, if we used scale space values for the descriptor calculation, the data would be smoothed by a wrong kernel, which would not reflect the affine shape of the point neighbourhood (where the distances between pixels are not constant).
That is why the data from the original image are used. Because the downsampling is done using bilinear interpolation, which takes into account only the nearest pixels, the image has to be smoothed accordingly, as described in the following procedure. This can also be viewed from the sampling theorem point of view, where the smoothing is just a low-pass 2-D filter suppressing higher frequencies in order to prevent aliasing.
Input: an affine invariant frame with spatial coordinate \(\mathbf{x}\), the scale \(\sigma\) of the original scale invariant frame, and its affine shape described by the affine transformation \(A\) and the shape matrix \(E\).
1. Test if the frame measurement region touches the image boundary; if so, it is not possible to calculate the frame descriptor and the keypoint is rejected.
2. Calculate the radius \(r\) of the circumscribed circle of the anisotropic image structure, and the scale between the patch and the circumscribed circle.
3. If this scale implies that the distance between the patch pixels in the image is larger than one pixel, the image must be smoothed in order to perform correct bilinear interpolation:
   - Warp the measurement region into a temporary patch using the affine transformation \(A\) and bilinear interpolation. The image is not scaled in this step, so the pixel distance remains the same.
   - Smooth the temporary patch with a circular Gaussian kernel. The multiplication factor of the kernel has been chosen due to the properties of the bilinear transformation, so that the influence of each pixel value spreads accordingly to its neighbourhood.
   - Downsample the temporary patch to the final patch using bilinear interpolation.
4. Otherwise, the smoothing is not needed and the measurement region is generally oversampled; warp the affine measurement region directly to the patch using bilinear interpolation.