2013 Computer Vision Internships in the Willow Group

We are looking for strongly motivated candidates with an interest in computer vision and applications of machine learning to computer vision problems. A good background in applied mathematics, strong programming skills, and prior experience with Matlab are required. The internships can lead to a PhD in the Willow Group.

Proposed internship topics:

1. Large-scale image classification and object detection with Deep Convolutional Neural Networks

2. Predicting actions in places

3. Triangulation of point clouds

4. Learning discriminative part models

5. Modeling viewpoint variation in object detection

We will assign topics to qualified students on a first-come, first-served basis. To apply, please send us your CV and visit us in the lab to discuss the topics.


1. Large-scale image classification and object detection with Deep Convolutional Neural Networks

Project supervisors: Leon Bottou <leon@bottou.org>, Ivan Laptev <Ivan.Laptev@ens.fr> and Josef Sivic <Josef.Sivic@ens.fr>

Location: Willow Group, Laboratoire d'Informatique de l'École Normale Supérieure

Goal

You will experiment with a very recent and apparently groundbreaking approach to image classification based on deep convolutional neural networks [Krizhevsky12]. The goals are to replicate the state-of-the-art results of [Krizhevsky12] and to extend the method to object detection.

Motivation

Recognizing thousands of object categories in images is a long-standing goal of computer vision. In recent years, research on large-scale image classification, e.g., [Sanchez11], has been sparked by the large amounts of image data now available and by large-scale datasets such as ImageNet. Convolutional Neural Network (CNN) based methods have existed for several decades; until recently, however, successful applications of CNNs had only been shown for relatively limited problems such as handwritten digit recognition [LeCun90] and face detection [Rowley98]. The groundbreaking results of [Krizhevsky12], presented at the Large Scale Visual Recognition Challenge 2012 Workshop (ILSVRC2012), now indicate that CNNs are a highly competitive tool when powered by large amounts of image data. It may be that the amount of image data and the processing power of modern GPUs needed to train successful CNN classifiers have just been reached, and that many exciting new applications of CNNs lie ahead. This internship will investigate this very timely topic by first reproducing the results of [Krizhevsky12], then investigating the performance and properties of the method when applied to other classification tasks, such as PASCAL VOC, and finally extending the classification method of [Krizhevsky12] to the more challenging task of object detection. This is an exploratory internship topic in an exciting and emerging area, which may have a significant impact on the current state of visual recognition.

Project description

The project will build on the existing publicly available codebase of [Krizhevsky12] and will proceed in the following three steps:

  1. Understand the approach and the existing code of [Krizhevsky12]. Reproduce their quantitative and qualitative image classification results on the ImageNet database.

  2. Improve the image classification accuracy of [Krizhevsky12] by extending their work. The project will consider different extensions, such as enlarging the class of image transformations used when “jittering” the training data.

  3. Apply and extend the deep convolutional neural network approach to object detection/localization on the Pascal VOC dataset.
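The "jittering" mentioned in step 2 (random crops and horizontal flips of training images, as described in [Krizhevsky12]) can be sketched in a few lines; the crop size and random seed below are illustrative choices, not taken from the released code:

```python
import numpy as np

def jitter(image, crop=224, rng=np.random.default_rng(0)):
    """Return one randomly jittered view of `image` (H x W x 3).

    Mirrors the augmentation described in [Krizhevsky12]: a random
    crop of the full image plus a random horizontal flip. The crop
    size and RNG seed are illustrative, not the paper's exact settings.
    """
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)   # random crop origin (rows)
    left = rng.integers(0, w - crop + 1)  # random crop origin (cols)
    view = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                # random horizontal flip
        view = view[:, ::-1]
    return view

# Example: generate 10 jittered training views from one 256x256 image.
img = np.zeros((256, 256, 3), dtype=np.uint8)
views = [jitter(img) for _ in range(10)]
```

Enlarging this class of transformations (step 2) would amount to adding further operations here, e.g. small rotations or colour perturbations.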

The project will be co-supervised by Leon Bottou, one of the world's leading experts on neural networks and large-scale learning.

Requirements

We are looking for strongly motivated candidates with an interest in computer vision and machine learning. The project requires a strong background in applied mathematics and excellent programming skills. The project will also involve using, and possibly programming, GPUs. Prior experience with GPUs will also be useful, but is not required. If we find a mutual match, the project can lead to a PhD at the Willow group.

References

[Krizhevsky12] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks, In Proc. NIPS 2012.

[Sanchez11] J. Sanchez, F. Perronnin. High-dimensional signature compression for large-scale image classification, In Proc. CVPR 2011.

[LeCun90] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network, In Proc. NIPS 1990.

[Rowley98] H. A. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection, In IEEE PAMI, 20(1):23-38, 1998.


2. Predicting actions in places

Project supervisors: Ivan Laptev <Ivan.Laptev@ens.fr>, Josef Sivic <Josef.Sivic@ens.fr> and Aude Oliva, CSAIL, MIT, Visiting professor at Willow in Spring 2013.

Location: Willow Group, Laboratoire d'Informatique de l'École Normale Supérieure

Goal

The goal of this project is to design algorithms able to predict human actions for particular places. Given images or videos of places as input data, the aim is to learn visual predictors of actions using supervision (a) acquired by mining textual resources, e.g. thousands of movie scripts available on the Internet, and (b) obtained by large-scale manual image labelling using crowdsourcing.

Motivation

What is the person in the left figure below trying to do? What actions can we expect to happen in the scenes depicted on the right? Currently, there exists very little computer vision technology that can answer these and similar questions.

In computer vision, a classical framework for classifying objects, actions, or events from still images or videos involves identifying visually informative features within a category. Humans, on the other hand, heavily use other sources of information to predict which object or action is occurring or about to happen, such as the place or visual context (e.g., cooking and eating in a kitchen) and the affordances of objects and spatial structure in the world (e.g., a chair affords sitting). The aim of this internship is to bring this human-like strategy of action recognition to computer vision, enhancing both the number of different actions artificial systems can learn to discriminate and the overall recognition accuracy of current systems.

Given still images or dynamic visual scenes, we will train predictors of human actions. The problem will be formulated as an automatic image tagging task: action tags (run, sit, having a meeting, getting married, ...) will be obtained from discriminative classifiers trained directly on image data. We will in particular investigate different sources of supervision for training such classifiers, including (a) knowledge mined from generic textual resources (e.g., movie scripts) describing what people do in particular scenes, and (b) labels obtained by large-scale manual image annotation using Amazon Mechanical Turk. We will also aim to provide a functional categorization and grouping of places according to their similarity in typical human actions, and will identify places and situations that allow better action predictions. The project will build on our recent work on action recognition [3,4] (Willow) and scene understanding [1,2] (MIT), and will be co-supervised by Aude Oliva (CSAIL, MIT), who is a visiting professor at Willow in Spring 2013.
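The image tagging formulation above can be sketched as one independent binary classifier per action tag. The minimal example below trains per-tag logistic classifiers on toy 2-D "scene descriptors"; the features, learner, and hyperparameters are illustrative stand-ins for the much stronger models the project would actually use:

```python
import numpy as np

def train_tag_classifiers(X, tag_matrix, epochs=200, lr=0.5):
    """Train one independent logistic classifier per action tag.

    X          : (n_images, n_features) image descriptors
    tag_matrix : (n_images, n_tags) binary tag labels ("run", "sit", ...)
    Returns a weight matrix W of shape (n_features + 1, n_tags),
    including a bias row. This is only a sketch of the one-vs-rest
    tagging formulation, not the project's actual learning machinery.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])      # append bias feature
    W = np.zeros((Xb.shape[1], tag_matrix.shape[1]))
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ W))              # per-tag sigmoid scores
        W += lr * Xb.T @ (tag_matrix - p) / len(Xb)    # batch gradient step
    return W

def predict_tags(W, x, threshold=0.5):
    """Return the boolean tag vector predicted for one descriptor x."""
    xb = np.append(x, 1.0)
    return 1.0 / (1.0 + np.exp(-(xb @ W))) > threshold

# Toy example: two tags, each determined by one descriptor dimension.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = train_tag_classifiers(X, Y)
```

The two sources of supervision, (a) and (b), would both ultimately populate the `tag_matrix` used here.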

Requirements

We are looking for strongly motivated candidates with an interest in computer vision and machine learning. The project requires a strong background in applied mathematics and excellent programming skills. Prior experience with text processing will also be useful, but is not required. If we find a mutual match, the project can lead to a PhD at the Willow group.

References

[1] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN Database: Large-Scale Scene Recognition from Abbey to Zoo, In Proc. CVPR 2010, pp. 3485-3492.

[2] SUN dataset and scene category recognition benchmark: http://groups.csail.mit.edu/vision/SUN/

[3] M. Marszałek, I. Laptev, and C. Schmid. Actions in Context, In Proc. CVPR 2009, Miami, US.

[4] V. Delaitre, J. Sivic, and I. Laptev. Learning person-object interactions in still images, In Proc. NIPS 2011, Granada, Spain.


3. Triangulation of point clouds

Project supervisor: Jean Ponce <Jean.Ponce@ens.fr>

Location: Willow Group, Laboratoire d'Informatique de l'École Normale Supérieure

Project description

Extremely effective multi-view stereo algorithms are available today, for example the PMVS software (cf. [1] and http://www.di.ens.fr/pmvs/ ). Combined with structure-from-motion software such as Bundler (cf. [2] and http://phototour.cs.washington.edu/bundler ), they make it possible to model complex objects and environments easily and with great precision, generally in the form of a point cloud. Going from this point cloud to a triangulation, which is easier to manipulate and visualize, remains problematic, as most meshing algorithms available today do not take the sensor positions into account. One exception is a class of algorithms that build a Delaunay tetrahedralization of the point cloud and delete the tetrahedra crossed by the rays joining the sensors to the measured points [3,4]. The subject of this internship is the implementation of such an algorithm, adapted to very large-scale data (hundreds of millions of points) and able to take into account the (approximate) visibility and adjacency information available for the points reconstructed by PMVS.
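The core carving idea (build a Delaunay tetrahedralization, then delete tetrahedra crossed by camera-to-point rays) can be sketched on toy data as follows. Note one deliberate simplification: instead of the exact ray/tetrahedron traversal of [3,4], this sketch samples points along each ray and looks up the containing tetrahedra, which would not scale to hundreds of millions of points:

```python
import numpy as np
from scipy.spatial import Delaunay

def carve(points, cameras, visibility, n_samples=50):
    """Visibility-based carving of a Delaunay tetrahedralization (sketch).

    points     : (n, 3) reconstructed 3-D points (e.g. PMVS output)
    cameras    : (m, 3) camera centres
    visibility : list of (camera_index, point_index) pairs stating which
                 camera sees which point
    Returns the tetrahedralization and a boolean mask marking the
    tetrahedra to delete. Sampling along the rays replaces the exact
    ray/tetrahedron intersection of [3,4] for illustration only.
    """
    dt = Delaunay(points)
    dead = np.zeros(len(dt.simplices), dtype=bool)
    # t in (0, 0.99]: strictly between the camera and the measured
    # point, stopping just short of the point itself.
    ts = np.linspace(0.0, 0.99, n_samples)[1:]
    for cam_i, pt_i in visibility:
        ray = points[pt_i] - cameras[cam_i]
        samples = cameras[cam_i] + ts[:, None] * ray
        hit = dt.find_simplex(samples)   # -1 for samples outside the hull
        dead[hit[hit >= 0]] = True
    return dt, dead

# Toy example: slightly perturbed cube corners plus an interior point,
# observed by one camera outside the cube.
rng = np.random.default_rng(0)
corners = np.array([[x, y, z] for x in (0., 1.) for y in (0., 1.) for z in (0., 1.)])
pts = np.vstack([corners + 1e-3 * rng.standard_normal(corners.shape),
                 [[0.5, 0.5, 0.5]]])
cams = np.array([[5.0, 0.5, 0.5]])
dt, dead = carve(pts, cams, visibility=[(0, 8)])  # camera 0 sees the centre point
```

Only the tetrahedra on the camera side of the observed point get marked; the rest survive as candidate surface material.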

References

[1] Yasutaka Furukawa and Jean Ponce, Accurate, Dense, and Robust Multi-View Stereopsis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, Issue 8, Pages 1362-1376, August 2010.

[2] Noah Snavely, Steven M. Seitz, Richard Szeliski,  Photo Tourism: Exploring image collections in 3D, ACM Transactions on Graphics (Proceedings of SIGGRAPH 2006), 2006.

[3] Jean-Daniel Boissonnat, Olivier Faugeras, and E. Le Bras-Mehlman, Representing stereo data with the Delaunay triangulation, Artificial Intelligence, 44:41-87, 1990.

[4] Hoang-Hiep Vu, Patrick Labatut, Jean-Philippe Pons, and Renaud Keriven, High Accuracy and Visibility-Consistent Dense Multiview Stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 889-901, 2012.


4. Learning discriminative part models

Project supervisor: Jean Ponce <jean.ponce@ens.fr>

Location: Willow Group, Département d’informatique, Ecole normale supérieure (http://www.di.ens.fr/willow/)

Project description:

Object detection and categorization are fundamental and difficult computer vision tasks [1,2]. One of the difficulties arising in these problems is that the variation in appearance can be higher within a class than between classes, making direct comparison between instances of the same class meaningless. The objective of this project is to learn, from training exemplars, discriminative deformable sub-parts that can move independently, forming a flexible model robust to occlusion and high intra-class variations.
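The "deformable sub-parts that can move independently" idea can be sketched as scoring each part over a small window of displacements, trading appearance score against a quadratic deformation penalty, as in the part term of [1]. The costs and search radius below are fixed for illustration, whereas the actual method learns them:

```python
import numpy as np

def part_score(response, anchor, dx_cost=0.1, dy_cost=0.1, radius=2):
    """Score one deformable part around its anchor position.

    response : 2-D map of appearance scores (e.g. part filter responses)
    anchor   : (row, col) expected part location in the model
    Returns max over displacements d of
        appearance(anchor + d) - quadratic_deformation_cost(d),
    i.e. the part is allowed to move, paying for how far it strays.
    Costs and radius are illustrative, not learned as in [1].
    """
    r0, c0 = anchor
    best = -np.inf
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = r0 + dr, c0 + dc
            if 0 <= r < response.shape[0] and 0 <= c < response.shape[1]:
                best = max(best,
                           response[r, c] - dx_cost * dc**2 - dy_cost * dr**2)
    return best

# Toy example: the strongest response sits one pixel off the anchor,
# so the part moves there and pays a small deformation cost.
resp = np.zeros((9, 9))
resp[4, 5] = 1.0
score = part_score(resp, anchor=(4, 4))
```

A full model would sum such part scores (computed efficiently with distance transforms in [1]) with a root appearance score.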

References:

[1] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.

[2] S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2006.


5. Modeling viewpoint variation in object detection

Project supervisor: Jean Ponce <jean.ponce@ens.fr>

Location: Willow Group, Département d’informatique, Ecole normale supérieure (http://www.di.ens.fr/willow/)

Project description:

Image categorization and object detection are fundamental computer vision tasks [1]. Most existing methods, however, essentially ignore the effect of viewpoint on this problem, including appearance changes and occlusion [2]. The objective of this project is to construct new visual models capable of explicitly handling viewpoint variations. We will consider an approach where we construct an intermediary 2D/3D deformable model capable of simultaneously representing all possible viewpoints, while remaining compact and easy to learn.

References:

[1] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.

[2] O. Duchenne, A. Joulin, and J. Ponce, A Graph-matching Kernel for Object Categorization, Proc. Int. Conference on Computer Vision, 2011.


6. More internship topics available upon request.

Talk to the course instructors if you would like to hear about additional internship topics.