Description:
The final project counts for 40% of the final grade. You will have the opportunity to choose your own research topic and to work on a method recently published at a top-quality computer vision conference (ECCV, ICCV, CVPR) or journal (IJCV, TPAMI). We also provide a list of interesting topics / papers below. If you would like to work on another topic (not from the list below), which you may have seen during the class or elsewhere, please discuss the topic with the class instructors (I. Laptev and J. Sivic). You may work alone or in a group of 2-3 people. If working in a group, we expect a more substantial project and an equal contribution from each student in the group.
Your task will be to:
(i) read and understand the research paper,
(ii) implement (a part of) the paper, and
(iii) perform qualitative/quantitative experimental evaluation.
Evaluation and due dates:
Re-using other people’s code:
You can re-use other people’s code. However, you should clearly indicate in your report/presentation what is your own code and what was provided by others (don’t forget to indicate the source). We expect projects to balance implementation and experimental evaluation. For example, if you implement a difficult algorithm from scratch, a few qualitative experimental results may suffice. On the other hand, if you rely entirely on someone else’s implementation, we expect a strong quantitative experimental evaluation, with an analysis of the obtained results and a comparison with baseline methods.
Suggested papers / topics:
Below are some suggested papers and topics for the final projects. If you would like to work on a different topic, please discuss your choice with the course instructors (I. Laptev and J. Sivic).
Topic 1. - Spatio-temporal alignment of videos
Paper: Aligning Sequences and Actions by Maximizing Space-Time Correlations (2006) Y. Ukrainitz and M. Irani, ECCV’06
Project page: http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeCorrelations.html
Description: Implement the spatio-temporal alignment algorithm described in (Ukrainitz and Irani 2006). Demonstrate spatio-temporal alignment on their video sequences available here (focus on alignment of human actions, i.e. you can skip sections 5.1 and 6 of the paper), as well as on your own captured videos. Groups of 2-3 people should experiment with different features for alignment, e.g. HOG3D, and apply the resulting alignment cost to action retrieval in the feature-length movie Coffee and Cigarettes. The zip file with annotations is here. The summary of annotations is here. Ask the course instructors for the video.
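For concreteness, here is a minimal brute-force sketch of the space-time correlation objective that the alignment maximizes (Python/NumPy is an assumption; any language is fine). The actual method of Ukrainitz and Irani uses a coarse-to-fine, gradient-based search over a parametric transformation rather than this exhaustive scan over discrete shifts.

    import numpy as np

    def ncc(a, b):
        # normalized cross-correlation between two equally-sized video volumes
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8
        return (a * b).sum() / denom

    def align(ref, target, max_dt=10, max_dy=8, max_dx=8, step=2):
        # exhaustive search over (dt, dy, dx) shifts of `ref` inside `target`,
        # both given as grayscale volumes of shape (frames, height, width)
        T, H, W = ref.shape
        best, best_shift = -np.inf, None
        for dt in range(max_dt + 1):
            for dy in range(0, max_dy + 1, step):
                for dx in range(0, max_dx + 1, step):
                    window = target[dt:dt + T, dy:dy + H, dx:dx + W]
                    if window.shape != ref.shape:
                        continue
                    score = ncc(ref, window)
                    if score > best:
                        best, best_shift = score, (dt, dy, dx)
        return best_shift, best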
Topic 2. - Action detection and recognition in still images
Paper: Articulated Pose Estimation with Flexible Mixtures of Parts (2011) Y. Yang, D. Ramanan, CVPR’11
Project page and code: http://phoenix.ics.uci.edu/software/pose/
Description: Yang and Ramanan present a new trainable method for person detection and pose estimation based on manually annotated locations of body parts (see the project page for the training and detection code). Your goal is to apply this method to a different problem: action recognition in still images, as defined in the PASCAL VOC Challenge 2011 competition (see example images here). Since actions often involve objects (reading: books, magazines; playing music: guitar, piano), you should try to learn an action-specific model of a person, where your extended model combines person parts (head, hands, etc.) with objects or object parts (phone, photo camera, parts of the motorbike). You should manually annotate such parts in training images (we will provide an annotation tool) and re-train the modified model of Yang and Ramanan for selected action classes. You should compare the performance of models trained with and without objects (parts). You will also be able to compare your method to the results of the VOC 2011 competition, to be published in November 2011. Groups of 2-3 students should experiment with more classes.
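When comparing the models trained with and without object parts, a PASCAL VOC-style average precision per action class is the natural metric. Below is a minimal sketch, assuming per-image classifier scores and binary labels for one class; for official numbers, use the evaluation code from the VOC development kit, whose interpolation details may differ.

    import numpy as np

    def average_precision(scores, labels):
        # rank images by decreasing classifier score
        order = np.argsort(-np.asarray(scores, dtype=float))
        labels = np.asarray(labels)[order]
        tp = np.cumsum(labels)
        precision = tp / np.arange(1, len(labels) + 1)
        recall = tp / max(labels.sum(), 1)
        # make precision monotonically non-increasing (interpolated AP)
        for i in range(len(precision) - 2, -1, -1):
            precision[i] = max(precision[i], precision[i + 1])
        # area under the precision-recall curve
        ap, prev_r = 0.0, 0.0
        for p, r in zip(precision, recall):
            ap += p * (r - prev_r)
            prev_r = r
        return ap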
Topic 3. - Crowd density estimation
Paper: Learning to Count Objects in Images (2010) V. Lempitsky and A. Zisserman, NIPS’10
Project page: http://www.robots.ox.ac.uk/~vgg/research/counting/index.html
Description: Crowd analysis is relevant to many applications. While person detection is a complicated task, especially in crowded scenes, crowd density estimation can be approached without explicit person detection and counting. Your goal is to train and evaluate a crowd density estimator using the new discriminative training technique of Lempitsky and Zisserman (the code is available from the project page). You should test the algorithm on highly crowded scenes and experiment with alternative features such as HOG, Bag-of-Features, responses of a person detector, or ObjectBanks. Groups of students should apply crowd density estimation to improve person detection, as explored in Rodriguez et al., project page: http://www.di.ens.fr/willow/research/crowddensity/index.html
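To make the counting-by-regression idea concrete, here is a minimal sketch that assumes per-pixel feature vectors have already been computed; it substitutes plain ridge regression for the MESA-distance training of Lempitsky and Zisserman, so it illustrates the pipeline rather than the paper's learning objective.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def ground_truth_density(dots, shape, sigma=4.0):
        # unit of mass at each annotated person, smoothed into a density map
        d = np.zeros(shape)
        for (y, x) in dots:
            d[y, x] += 1.0
        return gaussian_filter(d, sigma)

    def train_density_regressor(X, density, lam=1.0):
        # ridge regression from per-pixel features X (pixels x dims)
        # to the per-pixel ground-truth density
        y = density.ravel()
        A = X.T @ X + lam * np.eye(X.shape[1])
        return np.linalg.solve(A, X.T @ y)

    def count(X, w):
        # the estimated count is the predicted density summed over the image
        return float(np.maximum(X @ w, 0.0).sum())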
Topic 4. - Object detection with region cues
Paper: The Truth About Cats and Dogs (2011) O.M. Parkhi and A. Vedaldi and C.V. Jawahar and A. Zisserman, ICCV’11
Description: Object detection is especially challenging for non-rigid object classes such as cats and dogs. Animals, however, are often described well by regions of similar color or texture. Parkhi et al. demonstrate a significant improvement in localizing cats and dogs by combining a standard object detector with image segmentation. Your task is to implement their method and to apply it to the detection of cats and/or dogs in the PASCAL VOC 2010 object detection task. You should use the available code for training and running object detection (DefPM) as well as the code for image segmentation (GrabCut, see links in the paper). Motivated students may further investigate how this approach applies to other object categories (horses, cars, buses, trains, …).
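As a starting point, here is a minimal sketch of one (assumed) way to combine a detector with segmentation: re-score each detection by how much of its box GrabCut labels as foreground. OpenCV's Python bindings are assumed; the actual combination used by Parkhi et al. should be taken from the paper.

    import numpy as np
    import cv2

    def foreground_fraction(img, box):
        # run GrabCut initialized from the detection box and return the
        # fraction of the box covered by the estimated foreground region
        x, y, w, h = box
        mask = np.zeros(img.shape[:2], np.uint8)
        bgd = np.zeros((1, 65), np.float64)
        fgd = np.zeros((1, 65), np.float64)
        cv2.grabCut(img, mask, (x, y, w, h), bgd, fgd, 5,
                    cv2.GC_INIT_WITH_RECT)
        fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
        return fg[y:y + h, x:x + w].mean()

    def rescore(detections, img, alpha=0.5):
        # blend detector score with segmentation evidence (an assumed rule)
        return [(box, alpha * s + (1 - alpha) * foreground_fraction(img, box))
                for box, s in detections]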
Topic 5. - Image classification with trained features
Paper: Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification (2010) L.-J. Li, H. Su, E.P. Xing and L. Fei-Fei, NIPS’10
Project page: http://vision.stanford.edu/projects/objectbank/
Description: Bag-of-features is the state-of-the-art technique for image classification. Li et al. have recently introduced an extension of this technique where histograms of local features (quantized SIFTs, etc.) are replaced by response vectors of object detectors. This new descriptor, called Object Bank, describes how much an image region R_i resembles object O_j. A set of object detectors O_1, ..., O_N is assumed to be pre-trained on separate data and applied at both training and test time of the Object Bank classifier. Your task is to train and apply an Object Bank classifier (see the project page for the code) to the PASCAL VOC image classification task and to compare the results to the Bag-of-Features approach from your Assignment 2. Groups of students should try to improve the classification performance by combining the Object Bank classifier with the Bag-of-Features classifier used in Assignment 2 and experiment with different kernel combinations.
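For the kernel-combination experiment, a simple baseline is a convex combination of the two precomputed Gram matrices, as sketched below; scikit-learn is an assumption, and the weight alpha should be chosen on a validation set.

    import numpy as np
    from sklearn.svm import SVC

    def combined_kernel(K_objectbank, K_bof, alpha):
        # convex combination of two kernel (Gram) matrices
        return alpha * K_objectbank + (1.0 - alpha) * K_bof

    def train(K_train, y_train, C=1.0):
        # K_train is (n_train x n_train); y_train holds the class labels
        clf = SVC(kernel='precomputed', C=C)
        clf.fit(K_train, y_train)
        return clf

    # at test time, build K_test of shape (n_test x n_train) with the same
    # alpha and call clf.decision_function(K_test) to rank the test images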
Topic 6. - Scene clustering and alignment in TV series
Paper: Recognising panoramas (2003) M. Brown and D. G. Lowe, ICCV’03
Project page: http://www.cs.bath.ac.uk/brown/autostitch/autostitch.html
Description: If you enjoyed the class Assignment #1 (stitching photo mosaics) and if you like watching TV series, this project is for you. Local features can be efficiently used for image alignment and for large-scale image search. Combining these two advantages, one can, for example, automatically cluster images and construct panoramas from a collection of holiday photos, as demonstrated by Brown and Lowe. In a similar way, one can try to cluster and align video shots showing the same view of a scene. Your task is to implement such video clustering and alignment and to run it on one (or multiple) episodes of the TV series “Friends”. More precisely, your algorithm should automatically group video shots with respect to the scenes and their views. The algorithm should then spatially align the shots within each group to a common coordinate frame, so that we can watch e.g. a “video of a kitchen table” and see everything that has happened around it. We will provide the videos and the Matlab interface to extract the video frames.
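A minimal sketch of the pairwise matching step is given below, assuming OpenCV with SIFT available (cv2.SIFT_create in recent builds): two shot keyframes are linked if a RANSAC-estimated homography between them has enough inliers, and connected components of the resulting graph give the scene clusters.

    import numpy as np
    import cv2

    def same_view(img1, img2, min_inliers=25):
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(img1, None)
        k2, d2 = sift.detectAndCompute(img2, None)
        if d1 is None or d2 is None:
            return False, None
        # Lowe's ratio test on the two nearest neighbours
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        matches = [m for m, n in matcher.knnMatch(d1, d2, k=2)
                   if m.distance < 0.8 * n.distance]
        if len(matches) < 4:
            return False, None
        src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        ok = (H is not None and inliers is not None
              and inliers.sum() >= min_inliers)
        return ok, H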
Topic 7. - Single view reconstruction of movie sets
Papers:
1. V. Hedau, D. Hoiem, D. Forsyth, Recovering the spatial layout of cluttered rooms, ICCV, 2009
2. D. Lee, M. Hebert, T. Kanade, Geometric reasoning for single image structure recovery, CVPR, 2009
Project pages: 1. http://www.cs.illinois.edu/homes/dhoiem/
2. http://www.cs.cmu.edu/~dclee/projects/scene.html
Description: The goal of this project is to evaluate the performance of existing single-view 3D structure recovery algorithms on datasets from TV videos (e.g. sitcoms) and scenes from feature-length movies. The first dataset will be provided by the course instructors (the TV show “Friends” and office scenes from feature-length movies); however, students are free to choose additional data from TV sitcoms and movies. The code is available for both methods (for 2. the code is available here; for 1. ask the course instructors). For single students we recommend starting with 1. Groups of 2-3 students are expected to try out and compare both methods. In this project you will: (i) annotate 50-100 video frames with surface orientations; (ii) quantitatively evaluate the existing scene recovery algorithms on this data and qualitatively analyze the successes and errors; and (iii) implement an extension of the approach to video. For the extension (iii), the goal is to exploit the temporal consistency in video by aggregating the resulting labels over multiple frames in a single video shot. Different aggregation strategies (e.g. max, average) should be tried and evaluated. Correspondences between the video frames should be established using an existing dense tracking algorithm (binary here, paper here).
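For the aggregation in extension (iii), here is a minimal sketch assuming each frame's output has already been converted into per-pixel class probability maps and warped into a common reference frame using the dense tracks; it implements the two aggregation rules mentioned above.

    import numpy as np

    def aggregate(probs, how='average'):
        # probs is a list of per-frame arrays of shape (classes, H, W),
        # already aligned to a common reference frame
        stack = np.stack(probs)
        if how == 'average':
            fused = stack.mean(axis=0)
        elif how == 'max':
            fused = stack.max(axis=0)
        else:
            raise ValueError(how)
        # per-pixel surface-orientation label after temporal fusion
        return fused.argmax(axis=0)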
Topic 8. - Matching and retrieval of smooth objects
Paper: R. Arandjelovic, A. Zisserman, Smooth Object Retrieval using a Bag of Boundaries, ICCV 2011.
Description: The goal of this project is to implement (i) the object boundary segmentation, (ii) the object boundary descriptor, and (iii) the object boundary matching algorithm described by Arandjelovic and Zisserman. Some (at least qualitative) results should be shown on the smooth sculpture dataset described in the paper. Groups of 2-3 students should also implement the bag-of-boundaries representation and perform a quantitative evaluation of the retrieval algorithm. The data described in the paper can be obtained from the class instructors.
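For the retrieval step, a minimal sketch is given below; it assumes the boundary descriptors and a visual vocabulary (e.g. from k-means) are already available, and ranks database images by the cosine similarity of their bag-of-boundaries histograms. The boundary segmentation and the descriptor itself must follow the paper.

    import numpy as np

    def quantize(descriptors, vocabulary):
        # assign each boundary descriptor to its nearest visual word
        d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def bob_histogram(descriptors, vocabulary):
        # L2-normalized bag-of-boundaries histogram for one image
        words = quantize(descriptors, vocabulary)
        h = np.bincount(words, minlength=len(vocabulary)).astype(float)
        return h / (np.linalg.norm(h) + 1e-8)

    def retrieve(query_hist, database_hists):
        # rank database images by cosine similarity to the query
        return np.argsort(-(database_hists @ query_hist))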
Topic 9. - Reconstructing an image from its local descriptors
Paper: P. Weinzaepfel, H. Jégou, and P. Pérez, Reconstructing an image from its local descriptors, CVPR 2011
Description: The goal of this project is to (i) implement the image reconstruction method described in Weinzaepfel et al., (ii) demonstrate reconstruction results on several examples similar to those shown in the paper, and (iii) show example reconstructions on several sequences of video frames. You can pick a few example videos from here. Groups of 2-3 people should also experiment with reconstructions based on visual vocabularies, rather than nearest-neighbour matching: reconstruct images from the Oxford buildings dataset given only their visual word representation (and a database of images with the same visual words extracted). The images and the extracted visual words, including their spatial positions and shape, can be found here. You can experiment with different approaches, such as (i) taking the mean or median representative for each visual word, or (ii) using for reconstruction only a subset of images with high similarity, measured by the normalized scalar product of tf-idf vectors.
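For option (ii), the similarity measure mentioned above can be implemented as follows; this is a minimal sketch assuming each image is represented as a list of visual word ids over a vocabulary of size V.

    import numpy as np

    def tfidf_matrix(images, V):
        # one L2-normalized tf-idf row per image
        tf = np.zeros((len(images), V))
        for i, words in enumerate(images):
            tf[i] = np.bincount(words, minlength=V)
        df = (tf > 0).sum(axis=0)
        idf = np.log(len(images) / np.maximum(df, 1))
        X = tf * idf
        return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-8)

    def most_similar(query_words, images, V, k=10):
        # indices of the k database images with the highest normalized
        # scalar product of tf-idf vectors with the query
        X = tfidf_matrix(images + [query_words], V)
        sims = X[:-1] @ X[-1]
        return np.argsort(-sims)[:k]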
Your own chosen topic.
You can also choose your own topic, e.g. based on a paper discussed in class. Please validate the topic with the course instructors (I. Laptev or J. Sivic) first. You can discuss the topic with the course instructors after the class or by email to Ivan.Laptev@ens.fr or Josef.Sivic@ens.fr.
Joint topics with the “Introduction to graphical models” class (F. Bach and G. Obozinski).
Topic J1 - Hierarchical Context
Paper: Exploiting Hierarchical Context on a Large Database of Object Categories (2010) M. J. Choi, J. Lim, A. Torralba and A. S. Willsky, CVPR’10
http://people.csail.mit.edu/torralba/publications/hcontext.pdf
Topic J2 - Tracking objects
Paper: Globally-Optimal Greedy Algorithms for Tracking a Variable Number of Objects (2011) H. Pirsiavash, D. Ramanan and C. Fowlkes, CVPR’11
http://www.ics.uci.edu/~dramanan/papers/tracking2011.pdf
The joint project is expected to be more substantial and will have a strong machine learning as well as computer vision component. Please contact the instructors of both courses if you are interested in the joint project. We will discuss and adjust the requirements from each course depending on the size of the group.
You can also define your own topic for a joint project between the two classes. You need to validate the topic with the instructors for both courses.
Send the pdf file of your report to Ivan Laptev <Ivan.Laptev@ens.fr>.