Object recognition and computer vision 2012

Ivan Laptev, Jean Ponce, Cordelia Schmid and Josef Sivic

Assignment 2: Image classification

(adapted from A. Zisserman and A. Vedaldi)

Goal  

In image classification, an image is classified according to its visual content: for example, does it contain an aeroplane or not? An important application is image retrieval - searching through an image dataset to obtain (or retrieve) those images with particular visual content.

The goal of this exercise is to get basic practical experience with image classification. It includes: (i) training a visual classifier for five different image classes (aeroplanes, motorbikes, people, horses and cars); (ii) assessing the performance of the classifier by computing a precision-recall curve; (iii) varying the visual representation used for the feature vector, and the feature map used for the classifier; and (iv) obtaining training data for new classifiers using Google / Bing image search.

Getting started

  1. Download the code and data (code only, data ~414MB).
  2. Unpack the code archive. This will make a directory called  image-classification.
  3. Unpack the data archive into image-classification/data/. This should create several sub-directories in the data directory such as image-classification/data/images/
  4. Start your MATLAB in the directory image-classification.
  5. Try running the setup command (type setup without the .m suffix). If all goes well, you should see a greeting message. Note: this exercise uses several functions from the open-source MATLAB package VLFeat by A. Vedaldi.

As you progress through the exercises you can use the MATLAB help command to display the help text of the functions you need to use. For example, try typing help setup.

Exercise description

The image-classification directory contains Matlab scripts exercise1.m and exercise2.m,  which contain a commented skeleton of the code. You will need to (1) complete these scripts by writing your code, and (2) produce a written report containing answers to questions given below (shown in red). Your answers should include results, graphs or images as specified in each question.

Part 1: Training and testing an Image Classifier

Stage A: Data preparation and feature extraction

The JPEG images are contained in data/images. The data consists of three image classes (containing aeroplanes, motorbikes or persons) and `background' images (i.e. images that do not contain these three classes). The data is divided as follows:

             aeroplane   motorbike   person   background
  Training         112         120     1025         1019
  Test             126         125      983         1077
  Total            238         245     2008         2096

The list of training images for each class is given in text file “data/image_lists/*_train.txt”, where * is the class name. The list of testing images for each class is given in text file “data/image_lists/*_val.txt”, where * is the class name.

The feature vector you will compute for each image consists of SIFT features computed on a regular grid across the image (`dense SIFT') and vector quantized into visual words. The frequency of each visual word is then recorded in a histogram for each tile of a spatial tiling. The final feature vector for the image is a concatenation of these histograms. This process is summarized in the figure below:

Start by opening the script exercise1.m. You can modify this script and add your code to it. Follow the steps below:

Step 1. Build the vocabulary. Compute a visual vocabulary of 1,000 visual words by running the k-means algorithm on features extracted from (a subset of) the training images of all classes. To achieve this, use the function computeVocabularyFromImageList.

To save computation time, you can use only a subset of the training images to train the vocabulary. Make sure you use at least 100 images from each class, including background.

Don’t forget to save the vocabulary to the file ./data/vocabulary/vocabulary.mat so you don’t have to re-compute it each time you run the code. Note that building a vocabulary from 500 images may take 10-20 minutes.
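The two steps above can be sketched as follows. This is a minimal sketch only: the signature of computeVocabularyFromImageList (a cell array of image names in, a vocabulary out) and the one-filename-per-line format of the image lists are assumptions; check the function's help text before use.

```
% Sketch of Step 1: build and save the vocabulary from a subset of the
% training images (at least 100 per class, including background).
classes = {'aeroplane', 'motorbike', 'person', 'background'};
names = {};
for c = 1:numel(classes)
  list = textread(fullfile('data', 'image_lists', [classes{c} '_train.txt']), '%s');
  names = [names; list(1:min(100, numel(list)))];
end
vocabulary = computeVocabularyFromImageList(names);
save('data/vocabulary/vocabulary.mat', 'vocabulary');
```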

Step 2. Compute the spatial histogram representation of all training and test images. To achieve this, use the function computeHistogramsFromImageList.m. Don’t forget to save the histograms so that you don’t have to re-compute them each time you run the code. To be able to use the provided example code in exercise1.m, save the computed histograms in the same format as the precomputed example files “./data/histograms/aeroplane_train.mat” and “./data/histograms/aeroplane_val.mat”.

Notes: (1) Your histogram representation will be different from the one given in the provided example mat files, as your visual vocabulary will be different (depending on the k-means initialization). (2) Extracting features from a single image takes about 1s, so extracting features from the entire dataset may take about 2 hours.
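A caching pattern for Step 2 might look like the sketch below. The signature of computeHistogramsFromImageList (vocabulary and image names in, one histogram per column out) is an assumption, and the save overwrites the provided example file with your own histograms, as the step intends; the variable vocabulary is assumed to be loaded from Step 1.

```
% Sketch of Step 2: compute spatial histograms once and cache them.
if exist('data/histograms/aeroplane_train.mat', 'file')
  load('data/histograms/aeroplane_train.mat', 'names', 'histograms');
else
  names = textread('data/image_lists/aeroplane_train.txt', '%s');
  histograms = computeHistogramsFromImageList(vocabulary, names);
  save('data/histograms/aeroplane_train.mat', 'names', 'histograms');
end
```

Repeat for the other class lists (*_train.txt and *_val.txt).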

Step 3: Visualize computed features for one image.

Use the following code to compute and show densely sampled SIFT keypoints for image “000060.jpg”. Note that for enhanced clarity, only every 50th keypoint is shown.  

im = imread('./data/images/000060.jpg');
[keypoints, descriptors] = computeFeatures(im);

col     = keypoints(1,:);
row     = keypoints(2,:);
binsize = keypoints(4,:); % bin size of the SIFT descriptor (pixels).
                          % Recall that SIFT is composed of a spatial grid of 4x4 bins.
radius  = binsize*2;      % visualize SIFT by a circle with radius of 2 bin widths,
                          % i.e. the diameter of the circle is 4 bin sizes.

figure(1); clf; imagesc(im);
vl_plotframe([col(1:50:end); row(1:50:end); radius(1:50:end)]); % only every 50th keypoint
axis image;

For Stage A, answer the following questions:

QA1: Show the extracted dense SIFT keypoints overlaid on image “000060.jpg” (show only every 50th keypoint for clarity). What scales are used for the extracted SIFT descriptors? Consider the scale to be the radius of the circle showing the spatial support of each SIFT descriptor.

Hint: Recall that the SIFT descriptor is composed of a 4x4 grid of bins. You can determine the bin sizes used by analyzing the output of the function “computeFeatures.m”.

QA2: How many SIFT descriptors are extracted at each scale for image “000060.jpg”? What is the total number of SIFT descriptors extracted for this image?
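One way to answer QA2 is to tabulate the bin sizes returned by computeFeatures, continuing from the visualization code above (keypoints(4,:) holds the bin size of each descriptor):

```
% Count dense SIFT descriptors per scale (bin size) for image 000060.jpg.
scales = unique(binsize);
for s = scales
  fprintf('bin size %g: %d descriptors\n', s, sum(binsize == s));
end
fprintf('total: %d descriptors\n', numel(binsize));
```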

QA3: Why is the spatial tiling used in the histogram image representation?

Stage B: Train a classifier for images containing aeroplanes

We will start by training a classifier for images that contain aeroplanes. The files data/image_lists/aeroplane_train.txt and data/image_lists/aeroplane_val.txt list images that contain aeroplanes. Look through example images of the aeroplane class and the background images by browsing the image files in the data/images directory.

The aeroplane training images will be used as the positives, and the background images as the negatives. The classifier is a linear Support Vector Machine (SVM). Train the classifier using the function trainLinearSVM by following the steps in exercise1.m.

We will first assess qualitatively how well the classifier works by using it to rank all the training images. View the ranked list using the provided function displayRankedImageList as shown in exercise1.m.

QB1: Show the ranked images in your report.

Use the function displayRelevantVisualWords to display the image patches corresponding to the visual words that the classifier considers most related to the class (see the example embedded in exercise1.m).

QB2: In your report, show relevant patches for the three most relevant visual words (in three separate figures) for the top-ranked image. Do the most relevant visual words appear only on the aeroplane, or also on the background?

Stage C: Classify the test images and assess the performance

Now apply the learnt classifier to the test images. Again, you can assess the qualitative performance by using the classifier score to rank all the test images. Note that the bias term is not needed for this ranking, only the weight vector w.

QC1: Why is the bias term not needed for the image ranking?

Now we will measure the retrieval performance quantitatively by computing a Precision-Recall curve. Recall the definitions of Precision and Recall: at a given score threshold, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP and FP are the numbers of true and false positives scored above the threshold, and FN is the number of positives scored below it.

The Precision-Recall curve is computed by varying the threshold on the classifier (from high to low) and plotting the values of precision against recall for each threshold value. In order to assess the retrieval performance by a single number (rather than a curve), the Average Precision (AP, the area under the curve) is often computed.
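As a sketch, precision and recall at every threshold can be computed directly from the sorted classifier scores. The variable names scores and labels are hypothetical (your classifier scores and the +1/-1 ground truth over the test images); VLFeat's vl_pr function implements a similar computation.

```
% Precision-recall curve and AP from scores and +1/-1 labels.
[~, order] = sort(scores, 'descend');
relevant  = labels(order) > 0;
tp = cumsum(relevant);              % true positives at each rank
fp = cumsum(~relevant);             % false positives at each rank
precision = tp ./ (tp + fp);
recall    = tp / sum(labels > 0);
ap = sum(precision(relevant)) / sum(labels > 0);  % average precision
plot(recall, precision); xlabel('recall'); ylabel('precision');
```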

Stage D: Learn a classifier for the other classes and assess its performance

Now repeat stages (B) and (C) for each of the other two classes: motorbikes and persons. Re-use your code after changing the dataset loaded at the beginning of stage (B). Remember to change both the training and test data.

QD1: In your report, show the top ranked images, precision-recall curves and APs for the test data of all the three classes (aeroplanes, motorbikes, and persons).  Does the AP performance for the different classes match your expectations based on the variation of the class images?

Stage E: Vary the image representation

Up to this point, the image feature vector has used spatial tiling. Now we are going to `turn this off' and see how the performance changes. In this part, the image will simply be represented by a single histogram recording the frequency of visual words (without taking any account of their image position). This is a bag-of-visual-words representation.

A spatial histogram can be converted back to a simple histogram by merging the tiles. To achieve that, use the provided function removeSpatialInformation. Then evaluate the classifier performance on the test images. Make sure you re-train the classifier after removing the spatial information.
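In code, Stage E amounts to something like the following sketch (the histogram variable names are hypothetical; re-use whichever variables hold your Stage B histograms):

```
% Collapse the spatial tiles into a single bag-of-visual-words histogram,
% then retrain the SVM on the collapsed training histograms.
trainHist = removeSpatialInformation(trainHist);
testHist  = removeSpatialInformation(testHist);
% Retrain with trainLinearSVM on trainHist and re-evaluate on testHist,
% exactly as in Stage B.
```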

QE1: Report precision-recall curves and APs, and compare the test performance to the spatially tiled representation in stage D. How does the performance change? Why?

Stage F: Vary the classifier

Up to this point we have used a linear SVM, treating the histograms representing each image as vectors normalized to unit Euclidean norm. Now we will use a Hellinger kernel classifier, but instead of computing kernel values we will explicitly compute the feature map, so that the classifier remains linear (in the new feature space). The definition of the Hellinger kernel (also known as the Bhattacharyya coefficient) is

K(h, h') = sum_i sqrt( h(i) h'(i) ),

where h and h' are L1 normalized histograms.

So, in fact, all that is involved in computing the feature map is taking the square root of the histogram values and normalizing the resulting vector to unit Euclidean norm.
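Assuming the histograms are stored one per column (as in the provided example files), the feature map is just:

```
% Explicit Hellinger feature map: element-wise square root of the
% L1-normalized histograms, then L2 normalization of each column.
phi = sqrt(histograms);
phi = bsxfun(@rdivide, phi, sqrt(sum(phi.^2, 1)));
% An inner product of two such columns equals the Hellinger kernel
% value of the corresponding histograms (up to the L2 normalization).
```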

  1. Take the square root of both the training and test histograms to be used for the feature vectors. Note, this involves writing a line of Matlab code to convert the training and test histograms. Use the Matlab function sqrt, which takes an element-wise square root of a vector/matrix.

  2. Retrain the classifier for the aeroplane class, and measure its performance on the test data.

  3. Make sure you understand why this procedure is equivalent to using the Hellinger kernel.

QF1: Why is it an advantage to keep the classifier linear, rather than using a non-linear kernel?

QF2: Try removing the L2 normalization step. Does this affect the performance? Why? (Hint: the initial histograms are L1 normalized by construction.)

QF3: Go back to the previously used linear kernel and remove the L2 normalization step. Do you observe a change in performance?

Note: when learning the SVM, to save training time we are not changing the C parameter. This parameter influences the generalization error and should be learnt on a validation set when the kernel is changed.

Stage G: Vary the number of training images

Up to this point you have used all the available training images. Now train the classifier with only 10% and 50% of the training data. To achieve this, you can use the fraction variable in the provided example code in exercise1.m.

QG1: Report and compare the performance you get with the linear kernel and with the Hellinger kernel for the different classes and proportions of training images (10%, 50% and 100%). You don’t have to report the precision-recall curves; just the APs are sufficient. Plot the APs into a graph, with AP on the y-axis and the proportion of training images on the x-axis, using the Matlab function plot. Plot three curves (one for each class) into one figure. Produce two figures, one for the linear kernel and one for the Hellinger kernel. Make sure to properly label the axes (use the functions xlabel and ylabel), show each curve in a different color, and include a legend (function legend) in each figure. Show the two figures in your report.
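A plotting sketch for QG1 follows; the AP matrix is a placeholder to be filled with your measured values (one row per class, one column per training fraction).

```
fractions = [0.1 0.5 1.0];              % proportion of training images used
apLinear  = zeros(3, numel(fractions)); % rows: aeroplane, motorbike, person
figure; plot(fractions, apLinear', '-o');
xlabel('proportion of training images'); ylabel('AP');
legend('aeroplane', 'motorbike', 'person', 'Location', 'SouthEast');
title('linear kernel'); % repeat in a second figure with the Hellinger APs
```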

QG2: By analyzing the two figures, do you think the performance has `saturated' if all the training images are used, or would adding more training images give an improvement?

Part 2: Training an Image Classifier for Retrieval using Internet image search

In Part 1, the training images were provided. The goal of this second part is to choose the training data yourself in order to optimize the classifier performance. The task is the following: you are given a large corpus of images and asked to retrieve images of a certain class, e.g. containing a bicycle. You then need to obtain training images, e.g. using Bing/Google Image Search, in order to train a classifier for images containing bicycles and optimize its retrieval performance.

Complete the missing Matlab code in exercise2.m using the detailed instructions provided inside the file. The aim is to implement the following functionality: use the images downloaded to the directory data/myImages as positive training data and the default negative list data/image_lists/background_train.txt to train a classifier (linear SVM with Hellinger kernel) and rank the test images. The list of testing negative images is given in /data/image_lists/background_val.txt.  The list of testing positive images is given in /data/image_lists/horse_val.txt (for horses) and /data/image_lists/car_val.txt (for cars). To get started, we will train a classifier for horses:

  1. Use Internet image search (e.g. Bing or Google) with `horses' as the text query (you can also set the photo option on).

  2. Pick 5 images and save them into the directory data/myImages. These will provide the positive training examples.

  3. Complete the code and run exercise2.m. View the ranked testing images. Note, since feature vectors must be computed for all the training images, this may take a few moments.

  4. Now add 5 more images and retrain the classifier.

The test data set contains 148 images with horses. Your goal is to train a classifier that can retrieve as many of these as possible in highly ranked positions. You can measure your success by counting how many appear among the first 36 images (this performance measure is `precision at rank-36') by visually inspecting the top-ranked results. Here are some ways to improve the classifier:

  1. Add more positive training images.
  2. Add more positive training images, but choose these to be more varied than those you already have.

Note: all images are automatically normalized to a standard size, and descriptors are saved for each new image added in the data/cache directory.
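Precision at rank-36 can also be computed from the classifier scores instead of counting by eye. This is a sketch; scores and labels are hypothetical names for your classifier scores and the +1/-1 ground truth over the test images (derivable from the *_val.txt lists).

```
% Fraction of true positives among the 36 highest-scoring test images.
[~, order] = sort(scores, 'descend');
precisionAt36 = sum(labels(order(1:36)) > 0) / 36;
fprintf('precision at rank-36: %.2f\n', precisionAt36);
```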

The test data also contains the category car. Train a classifier for it and compare the difficulty of this class with that of the horse class.

QP2.1: For the horse class, report the precision at rank-36 for 5 and 10 training images. Show the training images you used. Did the performance of the classifier improve when 10 images were used?

QP2.2: What is the best performance (measured by precision at rank-36) you were able to achieve for the horse and the car class? How many training images did you use? For each of the two classes, show examples of your training images, show the top-ranked 36 images, and report the precision at rank-36. Compare the difficulty of retrieving horses and cars.

Links and further work (optional):

  1. The code for this assignment is written using the software package VLFeat. This is a software library written in MATLAB and C, and is freely available as source code and binary, see http://www.vlfeat.org/.
  2. The images for this practical are taken from the PASCAL VOC 2007 benchmark, see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/
  3. If there is a significant difference between the training and test performance, then that indicates overfitting. The difference can often be reduced, and the test performance (generalization) improved by changing the SVM C parameter. In Part I, vary the C parameter in the range 0.1 to 1000 (the default is C=100), and plot the AP on the training and test data as C varies for the linear and Hellinger kernels.

What to hand-in:

Hand-in a written report containing brief answers to the above questions (shown in red). Include the label of the question, e.g. “QA1” at the beginning of each answer. Your answers should include results, graphs or images as requested in each question.

Instructions for formatting and handing-in assignments:

  1. At the top of the first page of your report include (i) your name, (ii) date, (iii) the assignment number and (iv) the assignment title.

  2. The report should be a single pdf file and should be named using the following format: A#_lastname_firstname.pdf, where you replace # with the assignment number and “firstname” and “lastname” with your name, e.g. A2_Sivic_Josef.pdf.

  3. Zip your code and any additional files (e.g. additional images) into a single zip file using the same naming convention as above, e.g. A2_Sivic_Josef.zip. We do not intend to run your code, but we may look at it and try to run it if your results look wrong.

 

Send your report (a single pdf file) and code (excluding the data) in a single zip file to Josef Sivic<Josef.Sivic@ens.fr>.