Object recognition and computer vision 2011/2012

Jean Ponce, Ivan Laptev, Cordelia Schmid and Josef Sivic

Final project

Description:

The final project amounts to 40% of the final grade. You will have the opportunity to choose your own research topic and to work on a method recently published at a top-quality computer vision conference (ECCV, ICCV, CVPR) or in a journal (IJCV, TPAMI). We also provide a list of interesting topics / papers below. If you would like to work on another topic (not from the list below), which you may have seen during the class or elsewhere, please discuss it with the class instructors (I. Laptev and J. Sivic). You may work alone or in a group of 2-3 people. If working in a group, we expect a more substantial project and an equal contribution from each student in the group.

Your task will be to:

(i) read and understand the research paper,

(ii) implement (a part of) the paper, and

(iii) perform qualitative/quantitative experimental evaluation.

Evaluation and due dates:

  1. Project proposal (due on Nov 15th). You will submit a 1-page project proposal indicating (i) your chosen topic, (ii) the plan of work, i.e. what you are going to implement, what data you are going to use, and what experiments you are going to do, and (iii) if working in a group, who the members of the group are and how you plan to share the work. The project proposal will represent 10% of the final project grade.
  2. Project report (due on Dec 23rd). You will write a short report (up to 3 pages) summarizing your work. The report will represent 70% of the final project grade.
  3. Project presentation (on Dec 9 or Dec 11). You will present your work in class. The project presentation will represent 20% of the final project grade.

Re-using other people’s code:

You can re-use other people’s code. However, you should clearly indicate in your report and presentation which code is your own and which was provided by others (do not forget to cite the source). We expect projects to balance implementation and experimental evaluation. For example, if you implement a difficult algorithm from scratch, a few qualitative experimental results may suffice. On the other hand, if you rely entirely on someone else’s implementation, we expect a strong quantitative experimental evaluation, with an analysis of the obtained results and a comparison with baseline methods.

 

Suggested papers / topics:

Below are some suggested papers and topics for the final projects. If you would like to work on a different topic, please consult your choice with the course instructors (I. Laptev and J. Sivic).

Topic 1. - Spatio-temporal alignment of videos

Paper: Aligning Sequences and Actions by Maximizing Space-Time Correlations (2006) Y. Ukrainitz and M. Irani, ECCV’06

Project page: http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeCorrelations.html

Description: Implement the spatio-temporal alignment algorithm described in Ukrainitz and Irani (2006). Demonstrate spatio-temporal alignment on their video sequences available here (focus on the alignment of human actions, i.e. you can skip Sections 5.1 and 6 of the paper). Also demonstrate spatio-temporal alignment on videos you capture yourself. Groups of 2-3 people should experiment with different features for alignment, e.g. HOG3D, and apply the resulting alignment cost to action retrieval in the feature-length movie Coffee and Cigarettes. The zip file with annotations is here. The summary of annotations is here. Ask the course instructors for the video.
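
To give a feel for the core computation, below is a minimal Python sketch (not the authors' implementation, which also handles spatial scaling and uses local correlation measures with iterative refinement): it exhaustively searches an integer spatio-temporal shift that maximizes the zero-mean normalized correlation between two grey-level video volumes. Function and parameter names are illustrative.

# Minimal sketch (not the authors' implementation): grid search over a
# spatio-temporal shift that maximizes zero-mean normalized correlation
# between two grey-level video volumes of shape (T, H, W).
import numpy as np

def zncc(a, b, eps=1e-8):
    """Zero-mean normalized cross-correlation of two equally shaped volumes."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def align_volumes(v1, v2, max_dt=5, max_dxy=10, step=2):
    """Exhaustively search integer shifts (dt, dy, dx) of v2 against v1."""
    T, H, W = v1.shape
    best = (-np.inf, (0, 0, 0))
    for dt in range(-max_dt, max_dt + 1):
        for dy in range(-max_dxy, max_dxy + 1, step):
            for dx in range(-max_dxy, max_dxy + 1, step):
                # overlapping region of the two volumes under this shift
                t0, t1 = max(0, dt), min(T, T + dt)
                y0, y1 = max(0, dy), min(H, H + dy)
                x0, x1 = max(0, dx), min(W, W + dx)
                if t1 - t0 < 2 or y1 - y0 < 8 or x1 - x0 < 8:
                    continue
                a = v1[t0:t1, y0:y1, x0:x1]
                b = v2[t0 - dt:t1 - dt, y0 - dy:y1 - dy, x0 - dx:x1 - dx]
                score = zncc(a, b)
                if score > best[0]:
                    best = (score, (dt, dy, dx))
    return best  # (correlation, (dt, dy, dx))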

Topic 2. - Action detection and recognition in still images

Paper: Articulated Pose Estimation with Flexible Mixtures of Parts (2011) Y. Yang, D. Ramanan, CVPR’11

Project page and code: http://phoenix.ics.uci.edu/software/pose/

Description: Yang and Ramanan present a trainable method for person detection and pose estimation based on manually annotated locations of body parts (see the project page for the training and detection code). Your goal is to apply this method to a different problem: action recognition in still images, as defined in the PASCAL VOC Challenge 2011 competition (see example images here). Since actions often involve objects (reading: books, magazines; playing music: guitar, piano), you should try to learn an action-specific model of a person, where your extended model combines person parts (head, hands, etc.) with objects or object parts (phone, photo camera, parts of a motorbike). You should manually annotate such parts in training images (we will provide an annotation tool) and re-train the modified model of Yang and Ramanan for selected action classes. Compare the performance of models trained with and without object parts. You will also be able to compare your method to the results of the VOC 2011 competition, to be published in November 2011. Groups of 2-3 students should experiment with more classes.
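
For intuition about how part-based models are scored, here is a minimal Python sketch of a chain-structured pictorial model (the released code uses a tree model and speeds up inference with distance transforms; the brute-force maximization below is only illustrative). The unary score maps, deformation offsets and weights are assumed inputs.

# Minimal sketch of chain-structured part scoring (the released code uses a
# tree model and distance transforms; a brute-force maximization over a
# chain of parts is enough to convey the idea). `unaries` is a list of
# per-part score maps of shape (H, W); `offsets` and `weights` are assumed
# deformation parameters for each link (part i relative to part i-1).
import numpy as np

def chain_part_score(unaries, offsets, weights):
    H, W = unaries[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    # msg[y, x] = best score of parts 0..i-1 given the previous part is at (y, x)
    msg = unaries[0].copy()
    for i in range(1, len(unaries)):
        dy0, dx0 = offsets[i - 1]       # preferred displacement of part i
        wy, wx = weights[i - 1]         # quadratic deformation penalties
        new_msg = np.full((H, W), -np.inf)
        for y in range(H):
            for x in range(W):
                # deformation cost of placing part i at (y, x) for every
                # possible location of the previous part, then take the best
                def_cost = wy * (ys - y + dy0) ** 2 + wx * (xs - x + dx0) ** 2
                new_msg[y, x] = (msg - def_cost).max() + unaries[i][y, x]
        msg = new_msg
    return msg.max()   # best total score over all placements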

Topic 3. - Crowd density estimation

Paper: Learning to Count Objects in Images (2010) V. Lempitsky and A. Zisserman, NIPS’10

Project page: http://www.robots.ox.ac.uk/~vgg/research/counting/index.html

Description: Crowd analysis is relevant to many applications. While person detection is difficult, especially in crowded scenes, crowd density estimation can be approached without explicit person detection and counting. Your goal is to train and evaluate a crowd density estimator using the discriminative training technique of Lempitsky and Zisserman (the code is available from the project page). You should test the algorithm on highly crowded scenes and experiment with alternative features such as HOG, Bag-of-Features, responses of a person detector, or Object Bank. Groups of students should apply crowd density estimation to improve person detection, as explored in Rodriguez et al., project page: http://www.di.ens.fr/willow/research/crowddensity/index.html
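
As a simple baseline for counting by density (not the MESA-distance objective used in the paper), one can regress per-pixel features onto a Gaussian-smoothed dot map, since the integral of the predicted density over a region then approximates the object count. A minimal sketch with illustrative function names:

# Minimal counting-by-density baseline (not the MESA objective from the
# paper): per-pixel ridge regression onto a Gaussian-smoothed dot map.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.linear_model import Ridge

def dot_map_to_density(dots, shape, sigma=4.0):
    """dots: list of (y, x) annotations; the density sums to the object count."""
    d = np.zeros(shape)
    for y, x in dots:
        d[int(y), int(x)] += 1.0
    return gaussian_filter(d, sigma)

def train_density_regressor(feature_maps, dot_annotations, sigma=4.0):
    """feature_maps: list of (H, W, D) per-pixel feature arrays, one per image."""
    X = np.concatenate([f.reshape(-1, f.shape[-1]) for f in feature_maps])
    y = np.concatenate([dot_map_to_density(d, f.shape[:2], sigma).ravel()
                        for f, d in zip(feature_maps, dot_annotations)])
    return Ridge(alpha=1.0).fit(X, y)

def count_objects(model, feature_map):
    density = model.predict(feature_map.reshape(-1, feature_map.shape[-1]))
    return density.sum()   # estimated count = integral of predicted density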

Topic 4. - Object detection with region cues

Paper: The Truth About Cats and Dogs (2011) O.M. Parkhi and A. Vedaldi and C.V. Jawahar and A. Zisserman, ICCV’11

Description: Object detection is especially challenging for non-rigid object classes such as cats and dogs. Animals, however, are often well described by regions of similar color or texture. Parkhi et al. demonstrate a significant improvement in localizing cats and dogs by combining a standard object detector with image segmentation. Your task is to implement their method and apply it to the detection of cats and/or dogs in the PASCAL VOC 2010 object detection task. You should use the available code for training and running object detection (DefPM) as well as the code for image segmentation (GrabCut; see links in the paper). Motivated students may further investigate how this approach applies to other object categories (horses, cars, buses, trains, …).
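
The interplay between detector and segmentation can be prototyped with OpenCV's GrabCut: a detection bounding box seeds the segmentation, and the resulting foreground mask yields a refined box. This is only an illustrative sketch in Python; the paper's way of scoring the segmentation is more involved.

# Minimal sketch: seed GrabCut with a detection box and derive a refined
# box from the foreground mask (only illustrates the detector + GrabCut
# interplay, not the paper's full scoring).
import cv2
import numpy as np

def refine_detection_with_grabcut(image_bgr, box, iters=5):
    """box = (x, y, w, h) from the object detector, in pixel coordinates."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, box, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    # pixels labelled as definite or probable foreground
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    ys, xs = np.nonzero(fg)
    if len(xs) == 0:
        return box, fg          # GrabCut found no foreground; keep the box
    refined = (xs.min(), ys.min(), xs.max() - xs.min(), ys.max() - ys.min())
    return refined, fg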

Topic 5. - Image classification with trained features

Paper: Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification (2010) L.-J. Li, H. Su, E.P. Xing and L. Fei-Fei, NIPS’10

Project page: http://vision.stanford.edu/projects/objectbank/

Description: Bag-of-features is a state-of-the-art technique for image classification. Li et al. have recently introduced an extension of this technique in which histograms of local features (quantized SIFTs, etc.) are replaced by response vectors of object detectors. This new descriptor, called Object Bank, describes how much an image region R_i resembles object O_j. A set of object detectors O_1, ..., O_N is assumed to be pre-trained on separate data and is applied at both training and testing time of an Object Bank classifier. Your task is to train and apply an Object Bank classifier (see the project page for the code) to the PASCAL VOC image classification task and to compare the results to the Bag-of-Features approach from your Assignment 2. Groups of students should try to improve classification performance by combining the Object Bank classifier with the Bag-of-Features classifier used in Assignment 2 and should experiment with different kernel combinations.
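
A minimal sketch of an Object Bank style feature, assuming the per-detector response maps have already been computed (e.g. with the code from the project page): each map is max-pooled over spatial-pyramid cells, the pooled values are concatenated, and a linear SVM is trained on top. Function names and pyramid levels are illustrative.

# Minimal sketch of an Object Bank style feature: max-pool each detector's
# response map over spatial-pyramid cells and concatenate, then train a
# linear SVM. Response maps are assumed precomputed (H, W) arrays, one per
# detector.
import numpy as np
from sklearn.svm import LinearSVC

def pyramid_max_pool(response_map, levels=(1, 2, 4)):
    H, W = response_map.shape
    feats = []
    for L in levels:
        for i in range(L):
            for j in range(L):
                cell = response_map[i * H // L:(i + 1) * H // L,
                                    j * W // L:(j + 1) * W // L]
                feats.append(cell.max())
    return np.array(feats)

def object_bank_feature(response_maps):
    """response_maps: list of per-detector (H, W) response maps for one image."""
    return np.concatenate([pyramid_max_pool(r) for r in response_maps])

# usage sketch: X = np.stack([object_bank_feature(maps) for maps in all_images])
#               clf = LinearSVC(C=1.0).fit(X, labels)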

Topic 6. - Scene clustering and alignment in TV series

Paper: Recognising panoramas (2003) M. Brown and D. G. Lowe, ICCV’03

Project page: http://www.cs.bath.ac.uk/brown/autostitch/autostitch.html

Description: If you enjoyed class Assignment #1 (stitching photo mosaics) and you like watching TV series, this project is for you. Local features can be used efficiently both for image alignment and for large-scale image search. Combining these two strengths, one can, for example, automatically cluster images and construct panoramas from a collection of holiday photos, as demonstrated by Brown and Lowe. In a similar way, one can cluster and align video shots showing the same view of a scene. Your task is to implement such video clustering and alignment and to run it on one (or several) episodes of the TV series “Friends”. More precisely, your algorithm should automatically group video shots by scene and view, and then spatially align the shots within each group to a common coordinate frame, so that one can watch e.g. a “video of a kitchen table” and see everything that has happened around it. We will provide the videos and a Matlab interface to extract the video frames.
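
The shot-linking step could be prototyped along these lines (an illustrative Python/OpenCV sketch; the course tools for frame extraction are Matlab-based): match SIFT features between two shot keyframes, verify a homography with RANSAC, and use the inlier count as the similarity when grouping shots into scene clusters.

# Minimal sketch: match two shot keyframes with SIFT and verify a homography
# with RANSAC; the inlier count can serve as the edge weight when grouping
# shots into scene clusters (e.g. connected components or spectral
# clustering over the shot-similarity matrix).
import cv2
import numpy as np

def match_keyframes(img1_gray, img2_gray, ratio=0.75):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1_gray, None)
    k2, d2 = sift.detectAndCompute(img2_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(d1, d2, k=2)
    good = [m for m, n in pairs if m.distance < ratio * n.distance]
    if len(good) < 8:
        return 0, None
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    n_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
    return n_inliers, H   # H maps points of frame 1 into frame 2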

Topic 7. - Single view reconstruction of movie sets.

Papers:

1. V. Hedau, D. Hoiem, D. Forsyth, Recovering the spatial layout of cluttered rooms, ICCV, 2009

2. D. Lee, M. Hebert, T. Kanade, Geometric reasoning for single image structure recovery, CVPR, 2009

Project pages: 1. http://www.cs.illinois.edu/homes/dhoiem/

2.  http://www.cs.cmu.edu/~dclee/projects/scene.html

Description: The goal of this project is to evaluate the performance of existing single-view 3D structure recovery algorithms on datasets from TV videos (e.g. sitcoms) and scenes from feature-length movies. The first dataset will be provided by the course instructors (the TV show “Friends” and office scenes from feature-length movies); however, students are free to choose additional data from TV sitcoms and movies.

The code is available for both methods (for 2. the code is available here; for 1. ask the course instructors). Single students are recommended to start with 1.; groups of 2-3 students are expected to try out and compare both methods. In this project you will: (i) annotate 50-100 video frames with surface orientations; (ii) quantitatively evaluate the existing scene recovery algorithms on this data and qualitatively analyze the successes and failures; and (iii) implement an extension of the approach to video. For extension (iii), the goal is to exploit temporal consistency in video by aggregating the resulting labels over multiple frames within a single video shot. Different aggregation strategies (e.g. max, average) should be tried and evaluated. Correspondences between video frames should be established using an existing dense tracking algorithm (binary here, paper here).
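
For extension (iii), one simple aggregation scheme is to warp the per-frame label probabilities to a reference frame using the dense correspondences and then take their average or maximum. The sketch below assumes precomputed per-frame probability maps and flow fields to frame 0; both the data layout and the nearest-neighbour splatting are simplifying assumptions.

# Minimal sketch of extension (iii): aggregate per-frame surface-orientation
# probabilities within a shot. `probs[t]` is an (H, W, K) array of per-pixel
# label probabilities for frame t (K orientation classes); `flows[t]` is an
# (H, W, 2) flow field mapping pixels of frame t to frame 0. Both are
# assumed precomputed (e.g. with the provided dense tracker).
import numpy as np

def aggregate_labels(probs, flows, mode="average"):
    H, W, K = probs[0].shape
    acc = probs[0].copy()
    for t in range(1, len(probs)):
        ys, xs = np.mgrid[0:H, 0:W]
        # destination coordinates of every pixel of frame t in frame 0
        x0 = np.clip(np.round(xs + flows[t][..., 0]).astype(int), 0, W - 1)
        y0 = np.clip(np.round(ys + flows[t][..., 1]).astype(int), 0, H - 1)
        warped = np.zeros_like(acc)
        warped[y0, x0] = probs[t]          # forward splat (nearest neighbour)
        acc = np.maximum(acc, warped) if mode == "max" else acc + warped
    if mode == "average":
        acc = acc / len(probs)
    labels = acc.argmax(axis=2)            # aggregated per-pixel orientation
    return labels, acc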

Topic 8. - Matching and retrieval of smooth objects.

Paper: R. Arandjelovic, A. Zisserman, Smooth Object Retrieval using a Bag of Boundaries, ICCV 2011.  

Description: The goal of this project is to implement (i) the object boundary segmentation, (ii) the object boundary descriptor, and (iii) the object boundary matching algorithm described by Arandjelovic and Zisserman. Some (at least qualitative) results should be shown on the smooth sculpture dataset described in the paper. Groups of 2-3 students should also implement the bag-of-boundaries representation and perform a quantitative evaluation of the retrieval algorithm. The data described in the paper can be obtained from the class instructors.
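
A much simplified stand-in for steps (ii)-(iii), only to illustrate the overall pipeline (the paper's descriptor is a multi-scale HOG computed on the boundary image): sample points along the boundaries, describe each with a local orientation histogram, quantize the descriptors into visual words, and compare images by their normalized word histograms.

# Simplified bag-of-boundaries style pipeline (not the descriptor from the
# paper): local orientation histograms on boundary points, quantized into
# visual words.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def boundary_point_descriptors(boundary_mask, patch=32, n_bins=8, step=10):
    """boundary_mask: uint8 binary image of object boundaries."""
    gx = cv2.Sobel(boundary_mask.astype(np.float32), cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(boundary_mask.astype(np.float32), cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)
    ys, xs = np.nonzero(boundary_mask)
    descs = []
    for y, x in list(zip(ys, xs))[::step]:        # subsample boundary points
        y0, x0 = max(0, y - patch // 2), max(0, x - patch // 2)
        m = mag[y0:y0 + patch, x0:x0 + patch].ravel()
        a = ang[y0:y0 + patch, x0:x0 + patch].ravel()
        hist, _ = np.histogram(a, bins=n_bins, range=(0, 2 * np.pi), weights=m)
        descs.append(hist / (hist.sum() + 1e-8))
    return np.array(descs)

def bag_of_boundaries(all_descs, image_descs, k=256):
    vocab = KMeans(n_clusters=k, n_init=4).fit(np.vstack(all_descs))
    hists = [np.bincount(vocab.predict(d), minlength=k).astype(float)
             for d in image_descs]
    return [h / (np.linalg.norm(h) + 1e-8) for h in hists]  # L2-normalized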

Topic 9. - Reconstructing an image from its local descriptors

Paper: P. Weinzaepfel, H. Jegou, and P. Perez, Reconstructing an image from its local descriptors, CVPR 2011

Description: The goal of this project is to (i) implement the image reconstruction method described by Weinzaepfel et al., (ii) demonstrate reconstruction results on several examples similar to those shown in the paper, and (iii) show example reconstructions on several sequences of video frames. You can pick a few example videos from here. Groups of 2-3 people should also experiment with reconstructions based on visual vocabularies rather than nearest-neighbour matching. Here the goal would be to demonstrate reconstructions of images from the Oxford Buildings dataset; the images and the extracted visual words, including their spatial positions and shapes, can be found here. The task is to reconstruct an image given only its visual word representation (and a database of images with the same visual words extracted). You can experiment with different approaches, such as (i) taking the mean or median representative for each visual word, or (ii) using for the reconstruction only a subset of images with high similarity, measured by the normalized scalar product of tf-idf vectors.
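
A minimal sketch of the nearest-neighbour variant: for each query keypoint, retrieve the closest descriptor from an external database of (descriptor, patch) pairs and splat the associated patch at the keypoint location, averaging where patches overlap. The paper additionally warps each patch with the keypoint's elliptical shape and blends the result seamlessly; the names below are illustrative.

# Minimal sketch of nearest-neighbour reconstruction (simplified: no affine
# warping of patches, no seamless blending). `db_descs` / `db_patches` come
# from an external image collection; patches are square (P, P, 3) crops
# around the keypoints they were extracted from.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reconstruct(query_kps, query_descs, db_descs, db_patches, out_shape):
    P = db_patches[0].shape[0]
    nn = NearestNeighbors(n_neighbors=1).fit(db_descs)
    _, idx = nn.kneighbors(query_descs)
    out = np.zeros(out_shape, dtype=np.float64)      # (H, W, 3)
    weight = np.zeros(out_shape[:2], dtype=np.float64)
    for (x, y), i in zip(query_kps, idx[:, 0]):
        y0, x0 = int(y) - P // 2, int(x) - P // 2
        y1, x1 = y0 + P, x0 + P
        if y0 < 0 or x0 < 0 or y1 > out_shape[0] or x1 > out_shape[1]:
            continue                                 # skip patches off-image
        out[y0:y1, x0:x1] += db_patches[i]
        weight[y0:y1, x0:x1] += 1.0
    weight = np.maximum(weight, 1e-8)
    return (out / weight[..., None]).astype(np.uint8)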

Your own chosen topic.

You can also choose your own topic, e.g. based on a paper discussed in class. Please validate the topic with the course instructors (I. Laptev or J. Sivic) first. You can discuss the topic with the course instructors after class or by email to Ivan.Laptev@ens.fr or Josef.Sivic@ens.fr.

 

Joint topics with the “Introduction to graphical models” class (F. Bach and G. Obozinski).

Topic J1 - Hierarchical Context

Paper: Exploiting Hierarchical Context on a Large Database of Object Categories, Myung Jin Choi, Joseph Lim, Antonio Torralba, and Alan S. Willsky, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, June 2010.

http://people.csail.mit.edu/torralba/publications/hcontext.pdf

Topic J2 - Tracking objects

Paper: Globally-Optimal Greedy Algorithms for Tracking a Variable Number of Objects, H. Pirsiavash, D. Ramanan, and C. Fowlkes, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, Colorado, June 2011.

http://www.ics.uci.edu/~dramanan/papers/tracking2011.pdf

The joint project is expected to be more substantial and will have a strong machine learning as well as computer vision component. Please contact the instructors of both courses if you are interested in a joint project. We will discuss and adjust the requirements of each course depending on the size of the group.

You can also define your own topic for a joint project between the two classes. You need to validate the topic with the instructors for both courses.

Instructions for writing and submitting the project proposal.

  1. You will submit a 1-page project proposal indicating (i) your chosen topic, (ii) the plan of work, i.e. what you are going to implement, what data you are going to use, and what experiments you are going to do, and (iii) if working in a group, who the members of the group are and how you plan to share the work. The due date for the proposal is given at the beginning of this page. The project proposal should be a single 1-page pdf file.
  2. The proposal pdf should be named using the following format: FP_lastname1_lastname2_lastname3.pdf, where you replace "lastname*" with last names of all members of your group in alphabetical order, e.g. for a group consisting of 3 people: I. Laptev, J. Ponce and C. Schmid, the file name should be FP_Laptev_Ponce_Schmid.pdf.
  3. Send the pdf file of your proposal to Ivan Laptev <Ivan.Laptev@ens.fr>.

Instructions for writing and submitting the final project report

  1. You will hand in a 3-page report in the format of a submission to the IEEE Computer Vision and Pattern Recognition conference (CVPR). Use the LaTeX or Word templates provided on the CVPR Author Guidelines webpage. Note that you are asked to produce only a 3-page double-column report (in contrast, a standard CVPR submission is up to 8 pages).

  2. At the top of the first page of your report, include (i) the names of all members of your group (up to 3 people), (ii) the date, and (iii) the title of your final project.

  3. The report should be a single pdf file named using the following format: FP_lastname1_lastname2_lastname3.pdf, where you replace "lastname*" with the last names of all members of your group in alphabetical order, e.g. for a group consisting of 3 people: I. Laptev, J. Ponce and C. Schmid, the file name should be FP_Laptev_Ponce_Schmid.pdf.

Send the pdf file of your report to Ivan Laptev <Ivan.Laptev@ens.fr>.

Instructions on preparing the project presentation.

  1. Each group will present their final project work in the class.
  2. Timing. Depending on the size of the group, you will have a 10-20 minute slot to present your work. The exact timing and schedule of the presentations will be determined during the course.
  3. Who should speak? If you are working in a group, one person can present for the whole group, but it is preferable that all members of the group present a part of the project.
  4. Content. You should introduce the topic and clearly state the goal of the project. Show the work you have done. When describing results, please show both the qualitative and quantitative results you have obtained, along with any interesting observations / findings you have made. Your audience is the other students in the class and the class instructors; you want to show us that you have done interesting work. Remember, it is good to illustrate your findings with images.
  5. Re-using material / figures / slides from other people. You can take figures from papers or other people's slides to illustrate an algorithm or explain a method. However, always properly acknowledge the source if you do so.