Image Classification Practical, 2011
Andrea Vedaldi and Andrew Zisserman

See the most recent version of this assignment on the VGG website.
Goal  
In image classification, an image is classified according to its visual content. For example, does it contain an aeroplane or not? An important application is image retrieval: searching through an image dataset to obtain (or retrieve) those images with particular visual content.

The goal of this session is to get basic practical experience with image classification. It includes: (i) training a visual classifier for five different image classes (aeroplanes, motorbikes, people, horses and cars); (ii) assessing the performance of the classifier by computing a precision-recall curve; (iii) varying the visual representation used for the feature vector, and the feature map used for the classifier; and (iv) obtaining training data for new classifiers using Google image search.


Getting started 
  • Download the code and data (code only, data only ~450Mb). The data includes images and pre-computed features.
  • Unpack the code archive. This will create a directory called practical-image-classification.
  • Unpack the data archive in the directory practical-image-classification.
  • Finally, start MATLAB in the directory practical-image-classification.
  • Try running the setup.m command (type setup without the .m suffix). If all goes well, you should obtain a greeting message.


As you progress through the exercises you can use the MATLAB help command to display the documentation of the MATLAB functions that you need to use. For example, try typing help setup.


Exercise description

Open and edit the script exercise1.m in the MATLAB editor. The script contains commented code and a description for all steps of this exercise. You can cut and paste this code into the MATLAB window to run it, and you will need to modify it as you go through the session.

Part 1: Training and testing an Image Classifier

Stage A: Data Preparation

The data provided in the directory data consists of images and pre-computed feature vectors for each image. The JPEG images are contained in data/images. The data consists of three image classes (containing aeroplanes, motorbikes or persons) and `background' images (i.e. images that do not contain these three classes). In the data preparation stage, this data is divided as follows:


             aeroplane   motorbike   person   background
  Training      112         120       1025       1019
  Test          126         125        983       1077
  Total         238         245       2008       2096


The feature vector is computed from SIFT features extracted on a regular grid across the image (`dense SIFT') and vector quantized into visual words. The frequency of each visual word is then recorded in a histogram for each tile of a spatial tiling. The final feature vector for the image is a concatenation of these histograms. This process is summarized in the figure below:
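To make the representation concrete, here is a minimal sketch (not the practical's own code) of how quantized dense SIFT features could be accumulated into a 2 x 2 spatially tiled histogram. The image im, the feature locations frames, the visual word indices words and the vocabulary size are all assumptions for illustration only.

    % Sketch only: accumulate visual words into a 2x2 spatially tiled histogram.
    % Assumed inputs: im (the image), frames (2 x N matrix of [x; y] locations on
    % the dense grid), words (1 x N visual word indices in 1..numWords).
    numWords = 1000 ;                                          % assumed vocabulary size
    tx = min(max(ceil(2 * frames(1,:) / size(im,2)), 1), 2) ;  % horizontal tile index (1 or 2)
    ty = min(max(ceil(2 * frames(2,:) / size(im,1)), 1), 2) ;  % vertical tile index (1 or 2)
    tile = sub2ind([2 2], ty, tx) ;                            % linear tile index in 1..4
    h = accumarray([words(:) tile(:)], 1, [numWords 4]) ;      % word counts per tile
    h = h(:) ;                                                 % concatenate the tile histograms
    h = h / sum(h) ;                                           % L1 normalize, as in the practical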



  • Why is spatial tiling used in the image representation?

We will start by training a classifier for images that contain aeroplanes. The files data/aeroplane_train.txt and data/aeroplane_val.txt list images that contain aeroplanes. Look through example images of the aeroplane class and the background images by browsing the image files in the data directory.

Stage B: Train a classifier for images containing aeroplanes

The aeroplane training images will be used as the positives, and the background images as the negatives. The classifier is a linear Support Vector Machine (SVM). Train the classifier by following the steps in exercise1.m.
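As a rough guide to what happens when you train the classifier (the variable names below are assumptions, and the code provided in exercise1.m may differ), a linear SVM can be trained directly with VLFeat's vl_svmtrain in recent VLFeat releases:

    % Sketch only: train a linear SVM on the training feature vectors.
    % Assumed variables: trainHistograms is a D x N matrix (one column per image),
    % trainLabels is a 1 x N vector with +1 for aeroplane and -1 for background.
    C = 100 ;                                        % the practical's default C value
    lambda = 1 / (C * size(trainHistograms, 2)) ;    % vl_svmtrain is parameterized by lambda
    [w, b] = vl_svmtrain(trainHistograms, trainLabels, lambda) ;
    % w is the weight vector and b the bias of the learnt classifier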

We will first assess qualitatively how well the classifier works by using it to rank all the training images. What do you expect to happen? View the ranked list using the provided function displayRankedImageList as shown in exercise1.m.

You can use the function displayRelevantVisualWords to display the image patches that correspond to the visual words which the classifier thinks are most related to the class (see the example embedded in exercise1.m).

Stage C: Classify the test images and assess the performance

Now apply the learnt classifier to the test images. Again, you can look at the qualitative performance by using the classifier score to rank all the test images. Note the bias term is not needed for this ranking, only the classification vector w. Why?
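As a minimal sketch (the variable names are assumptions), ranking amounts to scoring every test image with the learnt classifier and sorting the scores:

    % Sketch only: score and rank the test images.
    % testHistograms is an assumed D x M matrix of test feature vectors.
    scores = w' * testHistograms ;            % classifier score for each test image (no bias needed here)
    [~, perm] = sort(scores, 'descend') ;     % rank the test images from highest to lowest score
    % perm(1) is the index of the highest-scoring test image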

Now we will measure the retrieval performance quantitatively by computing a Precision-Recall curve. Recall the definitions of Precision and Recall:

    Precision = (number of true positives) / (number of images retrieved)
    Recall    = (number of true positives) / (number of positive images in the test set)

The Precision-Recall curve is computed by varying the threshold on the classifier score (from high to low) and plotting the value of precision against recall for each threshold value. In order to assess the retrieval performance by a single number (rather than a curve), the Average Precision (AP, the area under the curve) is often computed.
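The provided code computes and plots this curve for you; as a rough alternative sketch, VLFeat's vl_pr can also produce the curve and the AP directly from ground-truth labels and classifier scores (the variable names below are assumptions):

    % Sketch only: precision-recall curve and average precision with VLFeat.
    % testLabels is assumed to contain +1 for aeroplane test images and -1 otherwise.
    [recall, precision, info] = vl_pr(testLabels, scores) ;
    plot(recall, precision) ; xlabel('recall') ; ylabel('precision') ;
    fprintf('AP: %.2f\n', info.ap) ;          % area under the precision-recall curve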

Stage D: Learn a classifier for the other classes and assess its performance

Now repeat stages (B) and (C) for each of the other two classes: motorbikes and persons. To do this you can simply rerun exercise1.m after changing the dataset loaded at the beginning in stage (A). Remember to change both the training and test data. In each case record the AP performance measure.

  • Does the AP performance match your expectations based on the variation of the class images?

Stage E: Vary the image representation

Up to this point, the image feature vector has used spatial tiling. Now, we are going to `turn this off' and see how the performance changes. In this part, the image will simply be represented by a single histogram recording the frequency of visual words (but not taking any account of their image position). This is a bag-of-visual-words representation.

A spatial histogram can be converted back to a simple histogram by merging the tiles. Edit exercise1.m to enable the part of the code that does this. Then evaluate the classifier performance on the test images.
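For reference, the merging operation itself amounts to something like the following sketch (the variable names and memory layout are assumptions; the code in exercise1.m may organise the histograms differently):

    % Sketch only: collapse spatially tiled histograms into plain bag-of-words histograms.
    % Assumes histograms is a (numWords*numTiles) x numImages matrix whose first
    % numWords rows belong to tile 1, the next numWords rows to tile 2, and so on.
    numTiles = 4 ;                                     % assumed 2x2 tiling
    numWords = size(histograms, 1) / numTiles ;
    histograms = reshape(histograms, numWords, numTiles, []) ;
    histograms = squeeze(sum(histograms, 2)) ;         % merge the tiles by summing counts
    % re-normalize each column to unit Euclidean norm, as used by the linear SVM
    histograms = bsxfun(@times, histograms, 1 ./ sqrt(sum(histograms.^2, 1))) ;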

  • Make sure you understand the reason for the change in performance.

Stage F: Vary the classifier

Up to this point we have used a linear SVM, treating the histograms representing each image as vectors normalized to unit Euclidean norm. Now we will use a Hellinger kernel classifier, but instead of computing kernel values we will explicitly compute the feature map, so that the classifier remains linear (in the new feature space). The Hellinger kernel (also known as the Bhattacharyya coefficient) is defined as

    K(h, h') = sum_i sqrt( h_i h'_i )

where h and h' are normalized histograms.

So, in fact, all that is involved in computing the feature map is taking the square root of the histogram values and normalizing the resulting vector to unit Euclidean norm.
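A minimal sketch of this feature map (the variable name is an assumption; the practical asks you to add the equivalent line to exercise1.m yourself):

    % Sketch only: explicit Hellinger feature map.
    % Assumes histograms is a D x N matrix of L1-normalized histograms (one column per image).
    histograms = sqrt(histograms) ;                                               % element-wise square root
    histograms = bsxfun(@times, histograms, 1 ./ sqrt(sum(histograms.^2, 1))) ;   % L2 normalize
    % the same transformation must be applied to both the training and the test histograms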

  • Edit exercise1.m so that the square roots of the histogram values are used for the feature vectors. Note, this involves writing a line of MATLAB code for the training and test histograms.
  • Retrain the classifier for the aeroplane class, and measure its performance on the test data.

  • Make sure you understand why this procedure is equivalent to using the Hellinger kernel.
  • Why is it an advantage to keep the classifier linear, rather than using a non-linear kernel?
  • Try removing the L2 normalization step. Does this affect the performance? Why? (Hint: the histograms are L1 normalized by construction.)
  • Go back to the linear kernel and remove the L2 normalization step. What do you observe?

Note: when learning the SVM, to save training time we are not changing the C parameter. This parameter influences the generalization error and should be learnt on a validation set when the kernel is changed.

Stage G: Vary the number of training images

Up to this point we have used all the available training images. Now edit the fraction variable in exercise1.m to use 10% and 50% of the training data.
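As a sketch of what this amounts to (the variable names other than fraction are assumptions; exercise1.m already contains the corresponding code):

    % Sketch only: keep a random fraction of the training images.
    fraction = 0.1 ;                                        % use 10% of the training data (also try 0.5)
    numTrain = numel(trainImages) ;                         % 'trainImages' is an assumed name
    sel = randperm(numTrain) ;
    keep = sel(1:max(1, round(fraction * numTrain))) ;      % random subset of the training images
    trainImages     = trainImages(keep) ;
    trainHistograms = trainHistograms(:, keep) ;            % keep the matching feature vectors
    trainLabels     = trainLabels(keep) ;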

  • What performance do you get with the linear kernel? And with the Hellinger kernel?
  • Do you think the performance has `saturated' if all the training images are used, or would adding more training images give an improvement?

Part 2: Training an Image Classifier for Retrieval using Google images

In Part 1 of this practical the training data was provided and all the feature vectors pre-computed. The goal of this second part is to choose the training data yourself in order to optimize the classifier performance. The task is the following: you are given a large corpus of images and asked to retrieve images of a certain class, e.g. containing a bicycle. You then need to obtain training images, e.g. using Google Image Search, in order to train a classifier for images containing bicycles and optimize its retrieval performance.

The MATLAB code exercise2.m provides the following functionality: it uses the images in the directory data/myImages and the default negative list data/background_train.txt to train a classifier and rank the test images. To get started, we will train a classifier for horses:

  • Use Google image search with `horses' as the text query (you can also turn on the photo option).

  • Pick 5 images and drag and drop (save) them into the directory data/myImages. These will provide the positive training examples.

  • Run the code exercise2.m and view the ranked list of images. Note, since feature vectors must be computed for all the training images, this may take a few moments.

  • Now, add in 5 more images and retrain the classifier.

The test data set contains 148 images with horses. Your goal is to train a classifier that can retrieve as many of these as possible at a high rank. You can measure your success by how many appear in the first 36 images (this performance measure is `precision at rank-36'; a small sketch for computing it is given after the list below). Here are some ways to improve the classifier:

  • Add more positive training images.
  • Add more positive training images, but choose these to be more varied than those you already have.
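The precision-at-rank-36 measure can be computed from the ranked list as in the following sketch (assumed variable names; testLabels is taken to mark the horse images with +1):

    % Sketch only: precision at rank 36.
    [~, perm] = sort(scores, 'descend') ;               % rank the test images by classifier score
    precAt36 = sum(testLabels(perm(1:36)) > 0) / 36 ;   % fraction of the top 36 that contain horses
    fprintf('precision at rank 36: %.2f\n', precAt36) ;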

Note: all images are automatically normalized to a standard size, and the descriptors for each newly added image are cached in the data/cache directory.

The test data also contains the category car. Train a classifier for it and compare the difficulty of this class with that of the horse class.

Links and further work:

  • The code for this practical is written using the software package VLFeat. This is a software library written in MATLAB and C, and is freely available as source code and binary, see http://www.vlfeat.org/.
  • The images for this practical are taken from the PASCAL VOC 2007 benchmark, see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/
  • If there is a significant difference between the training and test performance, then this indicates overfitting. The difference can often be reduced, and the test performance (generalization) improved, by changing the SVM C parameter. In Part 1, vary the C parameter in the range 0.1 to 1000 (the default is C=100), and plot the AP on the training and test data as C varies for the linear and Hellinger kernels (a sketch of such a sweep is given below).
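A rough sketch of such a sweep, reusing the hedged helpers from the earlier sketches (vl_svmtrain and vl_pr; all variable names are assumptions and the practical's own code may differ):

    % Sketch only: sweep the SVM C parameter and record training and test AP.
    Cvals = [0.1 1 10 100 1000] ;
    apTrain = zeros(size(Cvals)) ;
    apTest  = zeros(size(Cvals)) ;
    for i = 1:numel(Cvals)
      lambda = 1 / (Cvals(i) * size(trainHistograms, 2)) ;
      [w, b] = vl_svmtrain(trainHistograms, trainLabels, lambda) ;
      [~, ~, infoTrain] = vl_pr(trainLabels, w' * trainHistograms + b) ;
      [~, ~, infoTest]  = vl_pr(testLabels,  w' * testHistograms  + b) ;
      apTrain(i) = infoTrain.ap ;
      apTest(i)  = infoTest.ap ;
    end
    semilogx(Cvals, apTrain, 'o-', Cvals, apTest, 's-') ;
    xlabel('C') ; ylabel('AP') ; legend('train', 'test') ;
    % repeat with and without the Hellinger (square-root) feature map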

Acknowledgements:
  • Guidance from Ivan Laptev and Josef Sivic
  • Comments from Relja Arandjelovic, Yusuf Aytar and Varun Gulshan
  • Funding from ERC grant VisRec Grant No. 228180

History: