2015 Computer Vision Internships at Willow

We are looking for strongly motivated candidates with an interest in computer vision and in applications of machine learning to computer vision problems. A good background in applied mathematics, strong programming skills and prior experience with Matlab are required. The internships can lead to a PhD in the Willow group.

Proposed internship topics:

1. Learning to predict events in videos of crowded scenes

2. Learning spatial relations from images with natural text

3. Automatic generation of video description

4. Co-segmentation of object categories

5. More internship topics possibly available upon request

To apply, please send us your CV and visit us in the lab to discuss the topics.


1. Learning to predict events in videos of crowded scenes  

Project supervisors: Josef Sivic <Josef.Sivic@ens.fr> and Ivan Laptev <Ivan.Laptev@inria.fr>

Location: Willow Group, Département d'Informatique de l'École Normale Supérieure

Figure 1: Crowd panic at the Love Parade, Duisburg, 2010.

Scientific objectives

Recent work suggests that the human visual system constantly generates predictions that anticipate the relevant future to help understand, navigate and interact with the surrounding environment [Bar09]: people can predict what may happen when they encounter an angry dog or see a quickly approaching car, or even the likely color of a particular plant at a certain time of year. These predictions take into account the current situation as well as previous experiences and memories integrated across different time scales [Bar09]. Currently, no artificial system has a comparable level of visual analysis and prediction capability. The goal of this internship is to develop models, representations and learning algorithms for the automatic prediction of events in video, with specific application to videos of crowded scenes.

Motivation and expected outcome

Why should computers learn to predict events in video? One application with potentially high impact is the automatic analysis of crowded scenes. Imagine that a machine could automatically analyze a large database of videos of crowded scenes and identify sequences of events that may lead to accidents such as a crowd panic or stampede, with applications in early disaster warning and prevention. This internship is expected to make a step towards such applications by developing and testing new models and algorithms for learning to predict events in videos.

Project description

The internship will build on recent breakthrough advances in learning and transferring image representations using convolutional neural networks [Krizhevsky12, Oquab14], as well as recent advances in using neural networks for time-series modeling [Michalski14]. The aim of the internship is to develop and test convolutional neural network models of videos with long-term spatio-temporal dependencies.

  1. The density of a crowd is an important local statistic of the crowd [Rodriguez11]. The goal of the first part of the project is to develop a reliable regressor of crowd density from video using a convolutional neural network [Oquab14]. The aim is to investigate different representations of video and to train the network on real labelled data as well as on data synthesized using computer graphics techniques (a toy illustration is sketched after this list).

  2. The goal of the second part of the project is to develop a predictive model of the spatio-temporal evolution of crowd density in video using recurrent neural networks with long-term dependencies [Michalski14] (see the second sketch below).

  3. The goal of the third part of the project is to extend the analysis beyond crowd density to detecting and predicting crowd events, such as a person falling or a person moving against the flow of the crowd.
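For concreteness, below is a minimal sketch of the density regressor from part 1. It is written in Python with PyTorch rather than the Matlab tooling mentioned above, and the architecture, layer sizes and input resolution are all illustrative assumptions, not the project's prescribed design: a small fully convolutional network predicts a per-pixel density map whose integral gives the crowd count.

```python
import torch
import torch.nn as nn

class DensityRegressor(nn.Module):
    """Toy fully convolutional network mapping an RGB frame to a crowd
    density map; the predicted count is the integral of the map."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 1/2 resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 1/4 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1),                 # per-pixel density score
        )

    def forward(self, x):
        return self.features(x)

model = DensityRegressor()
frame = torch.randn(1, 3, 240, 320)    # dummy video frame
density = model(frame)                 # (1, 1, 60, 80) density map
count = density.sum()                  # estimated number of people
# Training would regress against ground-truth density maps, e.g. with
# nn.functional.mse_loss(density, target_density).
```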
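Part 2 could then operate on sequences of such density maps. The sketch below uses a plain GRU as a stand-in for the recurrent "grammar cells" of [Michalski14]; again, the choice of model, the map resolution and the hidden size are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class DensityForecaster(nn.Module):
    """Toy recurrent model: consumes a sequence of flattened density
    maps and predicts the map at the next time step."""
    def __init__(self, map_cells=60 * 80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(map_cells, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, map_cells)

    def forward(self, seq):              # seq: (batch, time, map_cells)
        states, _ = self.rnn(seq)
        return self.readout(states)      # at each t, a prediction for t+1

forecaster = DensityForecaster()
history = torch.randn(1, 16, 60 * 80)   # 16 past density maps, flattened
pred = forecaster(history)[:, -1]       # forecast for the next frame
# Training target: the same sequence shifted one step into the future.
```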

Requirements

We are looking for strongly motivated candidates with an interest in computer vision, visual recognition and machine learning. The project requires a strong background in applied mathematics and excellent programming skills. If we find a mutual match, the project can lead to a PhD in the Willow group.

References

[Bar09] M. Bar. The proactive brain: memory for predictions. Theme issue: Predictions in the brain: using our past to generate a future (M. Bar Ed.). Philosophical Transactions of the Royal Society B, 364(1521):1235–1243, 2009.

[Krizhevsky12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS 2012.

[Oquab14] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR 2014.

[Michalski14] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent "Grammar Cells". In NIPS 2014.

[Idrees13] H. Idrees et al. Multi-source multi-scale counting in extremely dense crowd images. In CVPR 2013.

[Rodriguez11] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. Density-aware person detection and tracking in crowds. In ICCV 2011.

2. Learning spatial relations from images with natural text  

Project supervisors: Ivan Laptev <Ivan.Laptev@inria.fr> and Josef Sivic <Josef.Sivic@ens.fr>

Location: Willow Group, Département d'Informatique de l'École Normale Supérieure

Figure: example natural-language queries, "A dog laying on a sidewalk next to a bicycle" and "A man catching a wave on a surfboard".

Scientific objectives

Learning from images and videos annotated with natural language descriptions is one of the fundamental problems of computer vision, natural language understanding and machine learning. Imagine a machine that learns to perceive by analyzing Internet images and videos together with their surrounding text. Such capabilities are, however, beyond the level of visual intelligence of current recognition systems. The goal of this internship is to make a step in that direction and to develop representations and learning algorithms for modeling and recognizing spatial relations between objects that can be learned from images with natural text.

Motivation and expected outcome

Such a capability would significantly impact the way visual information is accessed and searched. Current image search engines are limited to simple queries such as "red car" or "Eiffel tower". Imagine an automatic tool for searching public archives of images and videos using complex natural-language queries that involve spatial relations between multiple objects, such as "three people sitting on a sofa in front of a TV, one of them holding a bag of chips". The expected outcome of the internship is a representation of spatial relations between objects that can be learned from image/video data weakly annotated with natural language and that is amenable to efficient indexing for visual search.

Project description

The project will build on the weakly supervised convolutional neural network architecture of [Oquab14b] as well as the recently developed text embedding methods [Mikolov13,Karpathy14].

1. Extend the weakly supervised convolutional neural network method of [Oquab14b] to a large vocabulary of tags obtained from natural language (a toy illustration is sketched after this list). Investigate the sensitivity of the method to noise in the labels.

2. Extend the weakly supervised convolutional neural network to explicitly model spatial relations between objects, such as "bottle on a table". Develop a learning method to estimate the model parameters from images described with natural text. Demonstrate the recognition capabilities of the method on an image retrieval benchmark.

3. Investigate learning compact representations [Gong14] for efficient indexing of spatial relations between objects.
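As a toy illustration of step 1, the sketch below shows a weakly supervised multi-tag network in the spirit of [Oquab14b]: per-tag score maps are globally max-pooled into image-level scores, so training needs only image-level tags mined from natural text, not bounding boxes. It is written in Python with PyTorch, and the architecture, vocabulary size and loss are illustrative assumptions rather than the method of [Oquab14b] itself.

```python
import torch
import torch.nn as nn

class WeakTagCNN(nn.Module):
    """Toy weakly supervised classifier: a convolutional trunk yields
    one score map per tag; global max pooling turns the maps into
    image-level scores, so only image-level labels are needed."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
        )
        self.score_maps = nn.Conv2d(128, vocab_size, 1)  # one map per tag

    def forward(self, x):
        maps = self.score_maps(self.trunk(x))   # (B, vocab, H', W')
        scores = maps.amax(dim=(2, 3))          # global max pooling
        return scores, maps                     # the maps localize each tag

model = WeakTagCNN(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
scores, maps = model(images)
# Tags mined from noisy natural text give a multi-label target.
targets = torch.zeros(2, 1000)
targets[:, 42] = 1.0                            # e.g. the tag "dog"
loss = nn.functional.binary_cross_entropy_with_logits(scores, targets)
```

The peak location in each tag's score map also gives a rough localization of the tag, which is a natural starting point for modeling pairwise spatial relations in step 2.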

Requirements

We are looking for strongly motivated candidates with an interest in computer vision, visual recognition and machine learning. The project requires a strong background in applied mathematics and excellent programming skills. If we find a mutual match, the project can lead to a PhD in the Willow group.

References

[Gong14] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV 2014.

[Gupta08] A. Gupta and L. S. Davis. Beyond nouns: exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV 2008.

[Karpathy14] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Technical report, 2014. http://cs.stanford.edu/people/karpathy/deepimagesent/

[Oquab14b] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. Technical report, 2014. http://www.di.ens.fr/willow/research/weakcnn/

[Mikolov13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS 2013.

3. Automatic generation of video description

Project supervisors: Ivan Laptev <Ivan.Laptev@inria.fr> and Cordelia Schmid <cordelia.schmid@inria.fr>
Location: Willow Group, Département d'Informatique de l'École Normale Supérieure

Scientific objectives

Automatic annotation of video with natural text descriptions is a long-standing goal of computer vision. The task involves understanding many concepts, such as objects, actions, scenes, person-object relations, the temporal order of events and many others. Despite recent advances in image and video understanding, such capabilities are still beyond automatic systems (see an earlier attempt in [Gupta09]). The goal of this internship is to make a step in that direction and to develop representations and learning algorithms for the automatic generation of video descriptions, using a large corpus of videos and corresponding video scripts at training time.

Motivation and expected outcome

Automatic video description will enable video summarization in the form of natural language. Such descriptions will greatly facilitate the search and browsing of video archives. Different descriptions could be provided depending on the context of use and the expected length of a video summary. In addition, learning video interpretation and temporal relations of events in the video will likely contribute to other computer vision tasks, such as prediction of future events from the video. The expected outcome of the internship is the development of models and learning methods for text generation given video data.

Project description

The project will build on the recent advances in text generation [Sutskever11] and weakly-supervised models for video description [Oquab14b] as well as the recently developed text embedding methods [Mikolov13].

1. Extend the model for text generation [Sutskever11] to video by integrating visual information obtained from static and dynamic video representations (a toy illustration is sketched after this list).

2. Integrate reasoning about temporal relations among consecutive events.

3. Evaluate the method on a video retrieval task.
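As a toy illustration of step 1, the sketch below conditions a recurrent word generator on video: mean-pooled per-frame CNN features initialize an LSTM decoder trained with teacher forcing on script sentences. It is written in Python with PyTorch; the architecture, pooling scheme and all sizes are illustrative assumptions, not the internship's prescribed model.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Toy decoder: mean-pooled frame features initialize an LSTM that
    predicts the next word of the description at every step."""
    def __init__(self, feat_dim=512, vocab=10000, hidden=512, embed=256):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)  # video -> initial state
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, words):
        # frame_feats: (B, T_frames, feat_dim); words: (B, T_words)
        video = frame_feats.mean(dim=1)            # crude temporal pooling
        h0 = torch.tanh(self.init_h(video)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        states, _ = self.lstm(self.embed(words), (h0, c0))
        return self.out(states)                    # next-word logits

model = VideoCaptioner()
feats = torch.randn(2, 30, 512)           # per-frame CNN features
words = torch.randint(0, 10000, (2, 12))  # tokenized script sentence
logits = model(feats, words)              # teacher-forced training pass
# Training: cross-entropy between logits and the words shifted by one.
```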

References

[Gupta09] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR 2009.

[Mikolov13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS 2013.

[Oquab14b] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. Technical report, 2014. http://www.di.ens.fr/willow/research/weakcnn/

[Sutskever11] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML 2011.

4. Co-segmentation of object categories

Project supervisors: Jean Ponce <Jean.Ponce@ens.fr>, Cordelia Schmid <Cordelia.Schmid@inria.fr>, and Armand Joulin <ajoulin@fb.com>.

Please see more details at: http://lear.inrialpes.fr/allegro/jobs/coseg.pdf

5. More internship topics possibly available upon request

Talk to the course instructors if you would like to hear about additional internship topics.