MSR-INRIA Workshop on 
Computer Vision and Machine Learning

Date: 25/1/2010 
Location: Orange room, 
23 avenue d'Italie, Paris


Most talks will be 30 min blocks: 25 min talk + 5 min discussion.
Some talks will be "short": 15 min talk + 5 min discussion.


Programme

9:30-10:00 Welcome and coffee

10:00-11:30 Session I: Machine learning
Francis Bach (INRIA-WILLOW)
   Discriminative Clustering for Image Co-segmentation
Sebastian Nowozin (MSR Cambridge)
   Structured Prediction in Computer Vision: Take-off ahead
Carsten Rother (MSR Cambridge)
   Tractable Higher-order Models in Computer Vision

11:30-11:45 Coffee

11:45-12:45 Session II: Image restoration
Julien Mairal (INRIA-WILLOW)
   Non-local Sparse Models for Image Restoration
Oliver Whyte (INRIA-WILLOW)
   Non-uniform Deblurring for Shaken Images

12:45-14:00 French boxed lunch in the building

14:00-15:50 Session III: Video
Timothee Cour (INRIA-WILLOW)
   Weakly supervised learning for video understanding and object recognition
Adrien Gaidon (INRIA-LEAR)
   Mining visual actions from movies
Etienne Mémin (INRIA-Rennes)
   Data assimilation techniques for the analysis of geophysical flows from satellite images
Neva Cherniavsky (INRIA-WILLOW)
   Video analysis for sociology (short talk)

15:50-16:15 Coffee

16:15-17:55 Session IV: Image matching, retrieval and medical imaging
Jamie Shotton, Antonio Criminisi (MSR Cambridge)
   Object Recognition in Medical Imagery: Organ Detection and Brain Segmentation
Barbara Andre (INRIA)
   Introducing space and time in local feature-based endomicroscopic image retrieval
Bryan Russell (INRIA-WILLOW)
   Aligning paintings and images (short talk)
Andrew Zisserman (Oxford)
   Visual Search and Classification of Art Collections


Abstracts:


Sebastian Nowozin - Structured Prediction in Computer Vision: Take-off ahead

Abstract: Good models -- whatever the application domain -- capture the relations important to the task while generalizing across irrelevant variations. Many successful computer vision models possess rich structure, leading to good performance but hard inference and learning problems. Recent advances in machine learning make it possible to efficiently learn the parameters and structure of richer model classes than ever before. I will review some of these developments relevant to computer vision researchers. Despite rapid progress, some basic questions regarding parameter learning remain unanswered; I will discuss my own related work addressing some of these questions and pinpoint directions I consider worthwhile for future investigation.




Carsten Rother - Tractable Higher order models in Computer Vision

Abstract: In recent years the Markov Random Field (MRF) has become the de facto probabilistic model for many vision applications. However, by definition Markov models decompose into a product of low-order (often pairwise) interactions between latent variables. The goal of this talk is to motivate a) why higher-order (or global) interactions are useful, and b) how they can be optimized efficiently. One example of a global interaction is enforcing connectivity of an object. Another is overcoming the problem that MRFs inherently encourage delta-function marginal statistics of the training data. For this we introduce the global Marginal Probability Field (MPF), which can model arbitrary marginal statistics. By giving examples in various domains, such as segmentation, denoising, and synthesis, I hope to encourage a discussion on the role of higher-order models in other application domains.



Francis Bach, Armand Joulin, Jean Ponce - Discriminative Clustering for Image Co-segmentation

Abstract: Purely bottom-up, unsupervised segmentation of a single image into two segments remains a challenging task for computer vision. The co-segmentation problem is the process of jointly segmenting several images with similar foreground objects but different backgrounds. In this paper, we combine existing tools from bottom-up image segmentation, such as normalized cuts, with kernel methods commonly used in object recognition. These two sets of techniques are used within a discriminative clustering framework: we aim to assign foreground/background labels jointly to all images, so that a supervised classifier trained with these labels leads to maximal separation of the two classes. In practice, we obtain a combinatorial problem which is relaxed to a continuous convex optimization problem that can be solved efficiently for up to dozens of images. We show that our framework works well on images with very similar foreground objects, which are usually considered in the literature, as well as on more challenging problems with objects exhibiting higher intra-class variation.
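As a toy illustration of the discriminative clustering principle -- choose the labels for which a supervised classifier achieves maximal separation -- the sketch below brute-forces all labelings of a few points and scores each with a ridge classifier. This only conveys the objective; the contribution above is precisely the convex relaxation that avoids this combinatorial search. All data and parameters here are invented:

```python
import itertools
import numpy as np

def separation_score(X, y, lam=1e-2):
    """Train a ridge classifier on labels y; return the (negative)
    squared training error, i.e. how separable the labeling is."""
    Y = np.where(y == 1, 1.0, -1.0)
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return -np.sum((X @ w - Y) ** 2)

def discriminative_clustering(X):
    """Brute-force search over all non-trivial labelings (toy only)."""
    n = X.shape[0]
    best_y, best_s = None, -np.inf
    for bits in itertools.product([0, 1], repeat=n):
        y = np.array(bits)
        if y.sum() == 0 or y.sum() == n:   # forbid trivial labelings
            continue
        s = separation_score(X, y)
        if s > best_s:
            best_s, best_y = s, y
    return best_y

# two well-separated blobs, playing the role of image descriptors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (4, 2)), rng.normal(2, 0.3, (4, 2))])
y = discriminative_clustering(X)
print(y)  # first 4 points share one label, last 4 the other
```

The relaxation in the talk replaces the exhaustive loop with a continuous convex problem over the label matrix, which is what makes dozens of images tractable.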


Julien Mairal - Non-local Sparse Models for Image Restoration

Abstract: We propose to unify two different approaches to image restoration: On the one hand, learning a basis set (dictionary) adapted to sparse signal descriptions has proven to be very effective in image reconstruction and classification tasks. On the other hand, explicitly exploiting the self-similarities of natural images has led to the successful non-local means approach to image restoration. We propose simultaneous sparse coding as a framework for combining these two approaches in a natural manner. This is achieved by jointly decomposing groups of similar signals on subsets of the learned dictionary. Experimental results in image denoising and demosaicking tasks with synthetic and real noise show that the proposed method outperforms the state of the art, making it possible to effectively restore raw images from digital cameras at a reasonable speed and memory cost. (joint work with Francis Bach, Jean Ponce, Guillermo Sapiro and Andrew Zisserman)
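Simultaneous sparse coding, the key ingredient above, can be illustrated with a minimal Simultaneous Orthogonal Matching Pursuit (SOMP) sketch: a group of similar signals is forced to share the same few dictionary atoms. This generic greedy variant is only a stand-in for the authors' actual formulation; the dictionary and data below are synthetic:

```python
import numpy as np

def somp(D, Y, n_atoms):
    """Simultaneous OMP: make a group of similar signals Y (columns)
    share one small set of atoms from dictionary D."""
    support, R = [], Y.copy()
    for _ in range(n_atoms):
        # pick the atom most correlated with all residuals at once
        scores = np.abs(D.T @ R).sum(axis=1)
        scores[support] = -np.inf          # never pick an atom twice
        support.append(int(np.argmax(scores)))
        Ds = D[:, support]
        coeffs, *_ = np.linalg.lstsq(Ds, Y, rcond=None)
        R = Y - Ds @ coeffs                # joint residual update
    return support, coeffs

# orthonormal toy dictionary; a "group" of noisy signals that all
# use atoms 0 and 3, like similar patches sharing structure
rng = np.random.default_rng(1)
D, _ = np.linalg.qr(rng.normal(size=(16, 8)))
Y = D[:, [0, 3]] @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(16, 5))
support, _ = somp(D, Y, n_atoms=2)
print(sorted(support))  # the shared atoms used by the whole group
```

Coding each signal independently could pick different atoms per signal; the joint decomposition is what exploits the self-similarity of natural images.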



Oliver Whyte - Non-uniform Deblurring for Shaken Images

Abstract: We argue that blur resulting from camera shake is mostly due to the 3D rotation of the camera, causing a blur that can be significantly non-uniform across the image. However, most current deblurring methods model the observed image as a convolution of a sharp image with a uniform blur kernel. We propose a new parametrized geometric model of the blurring process in terms of the rotational velocity of the camera during exposure. We apply this model in the context of two different algorithms for camera shake removal: the first uses a single blurry image (blind deblurring), while the second uses both a blurry image and a sharp but noisy image of the same scene. We show that our approach makes it possible to model and remove a wider class of blurs than previous approaches, and demonstrate its effectiveness with experiments on real images.
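A minimal sketch of the underlying observation: blur generated by camera rotation is non-uniform across the image. Below, a "shaken" image is synthesized as a weighted average of slightly rotated copies (in-plane rotation only, a crude stand-in for the full 3D rotational model above); a dot at the rotation centre stays sharp while one near the border smears:

```python
import numpy as np
from scipy.ndimage import rotate

def rotational_blur(img, angles, weights):
    """Weighted average of rotated copies of `img` -- an in-plane
    stand-in for blur caused by camera rotation during exposure."""
    out = np.zeros_like(img, dtype=float)
    for a, w in zip(angles, weights):
        out += w * rotate(img, a, reshape=False, order=1)
    return out / sum(weights)

# one dot at the rotation centre, one far from it
img = np.zeros((65, 65))
img[32, 32] = 1.0                         # centre of rotation
img[32, 60] = 1.0                         # near the image border
angles = np.linspace(-4, 4, 9)            # small in-plane rotations
blur = rotational_blur(img, angles, np.ones(9))

# the blur is non-uniform: centre dot survives, border dot smears
print(blur[32, 32], blur[30:35, 55:65].max())
```

A uniform convolution kernel, by contrast, would smear both dots identically, which is exactly the modelling gap the talk addresses.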



Neva Cherniavsky - Video analysis for sociology

Abstract: The display of human actions in mass media and its implications for our society are intensively studied in sociology, marketing and health care. The video analysis required for these studies currently involves hours of tedious manual labeling, rendering large-scale experiments infeasible. Automating the detection and classification of human traits and actions in video will potentially increase the quantity and diversity of experimental data.

In our work we aim to describe and analyze human appearance over time. In particular, we investigate a weakly-supervised approach to learn person attributes from only a limited set of labeled images and tracks of people in the video. We show that the use of video information for training can significantly improve attribute classification without additional supervision. Preliminary results will be demonstrated for gender classification in the movie "Love Actually", showing that training from video tracks is better than training from still images alone, even though there is more variety across people in the still image data.



Timothee Cour - Weakly supervised learning for video understanding and object recognition

Abstract: The exponential growth of image datasets and online videos presents both a challenge and an opportunity for vision-based semantic search and indexing. The amount of labeled data and processing power grows at a much slower pace, posing a difficulty for traditional, heavily supervised learning methods. In this talk I will present scalable, weakly supervised algorithms for video understanding and object recognition, with a special focus on identifying people in movies. Key components of the algorithms we present are (1) alignment between multiple modalities: images, audio and text, and (2) a unified convex formulation with strong theoretical guarantees for learning under weak supervision.

Screenplays can tell us who is in a given movie scene, but not when and where they are on the screen. This is in fact a common scenario in many image and video collections, where only partial access to labels is available. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. The goal in each case is to learn a person classifier that can not only disambiguate the labels of the training faces, but also generalize classification to unseen data. We consider a partially-supervised multiclass classification setting where each detected face is labeled ambiguously with more than one label, using screenplay names. We propose a convex formulation based on minimization of a surrogate loss, and show theoretically and empirically that effective learning is possible even when all examples are ambiguously labeled.
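As an illustration of learning from ambiguously labeled examples, the toy below minimizes a simple partial-label log-likelihood (not the specific convex surrogate of the talk) with a linear softmax model: each sample carries its true label plus a random distractor, and gradient descent still disambiguates the training set. All data and names are synthetic:

```python
import numpy as np

def partial_label_loss_grad(W, X, label_sets):
    """-log P(one of the candidate labels) under a linear softmax."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)
    loss, grad = 0.0, np.zeros_like(W)
    for i, S in enumerate(label_sets):
        p_set = P[i, S].sum()
        loss -= np.log(p_set)
        g = P[i].copy()                 # gradient of the surrogate:
        g[S] -= P[i, S] / p_set         # posterior renormalised on S
        grad += np.outer(X[i], g)
    return loss / len(X), grad / len(X)

# toy: 3 names; each "face" is labeled with its true name plus one
# random distractor name (the screenplay scenario)
rng = np.random.default_rng(0)
means = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
X = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in means])
true = np.repeat([0, 1, 2], 20)
label_sets = [[t, (t + rng.integers(1, 3)) % 3] for t in true]

W = np.zeros((2, 3))
for _ in range(300):
    _, g = partial_label_loss_grad(W, X, label_sets)
    W -= 0.5 * g
accuracy = ((X @ W).argmax(axis=1) == true).mean()
print(accuracy)  # disambiguation accuracy on the training faces
```

Because each face's candidate set contains a different random distractor, the true name is the only label consistent across a cluster, which is the identifiability intuition behind learning from ambiguous labels.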

We also investigate the challenging scenario of naming people in video without a screenplay. Our only source of "supervision" is person references mentioned in dialog, such as "Hey, Jack!". We resolve identities by learning a classifier that incorporates multiple-instance constraints from dialog, gender, and local grouping constraints in a unified convex formulation. Grouping constraints are provided by a novel temporal grouping model that learns a partition classifier from a set of training videos using structured learning.

We have deployed our framework on hundreds of hours of movies and TV, and will show some sample videos of detected and named characters in TV series. If time permits, I will briefly present ongoing work for learning object categories from weakly labeled image datasets.



Adrien Gaidon, Marcin Marszalek, Cordelia Schmid - Mining visual actions from movies

Abstract: This paper presents an approach for mining visual actions from real-world videos. Given a large number of movies, we want to automatically extract short video sequences corresponding to visual human actions. Firstly, we retrieve actions by mining verbs extracted from the transcripts aligned with the videos. Not all of these samples visually characterize the action and, therefore, we rank these videos by visual consistency. We investigate two unsupervised outlier detection methods: one-class Support Vector Machine (SVM) and densest component estimation of a similarity graph. Alternatively, we show how to use automatic weak supervision provided by a random background class, either by directly applying a binary SVM, or by using an iterative re-training scheme for Support Vector Regression machines (SVR). Experimental results explore actions in 144 episodes of the TV series "Buffy the Vampire Slayer" and show: (a) the applicability of our approach to a large scale set of real-world videos, (b) the importance of visual consistency for ranking videos retrieved from text, (c) the added value of random non-action samples and (d) the ability of our iterative SVR re-training algorithm to handle weak supervision. The quality of the rankings obtained is assessed on manually annotated data for six different action classes.
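A crude numpy sketch of ranking retrieved samples by visual consistency: score each clip by its average similarity to the rest of the set, a simple stand-in for the densest-component and one-class SVM methods named above, so that mislabelled outliers fall to the bottom of the ranking. The features and bandwidth are invented for illustration:

```python
import numpy as np

def consistency_ranking(features, sigma=1.0):
    """Rank samples by mean Gaussian similarity to the rest of the
    set; visually consistent samples score high, outliers low."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))     # similarity graph
    np.fill_diagonal(S, 0.0)
    degree = S.mean(axis=1)                # ~ graph degree
    return np.argsort(-degree)             # most consistent first

# 10 clips of the same action forming a tight cluster, plus 3
# mislabelled outliers retrieved by the same transcript verb
rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(0, 0.3, (10, 5)),
                   rng.normal(5, 0.3, (3, 5))])
order = consistency_ranking(feats)
print(order[:10])  # the 10 consistent clips should come first
```

The densest-component method in the talk finds the subgraph maximizing average internal similarity; ranking by degree is the one-step approximation of that idea.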



Etienne Mémin - Data assimilation techniques for the analysis of geophysical flows from satellite images

Abstract: In this talk I will review different data assimilation strategies for the analysis of images depicting fluid flow phenomena. Data assimilation makes it possible to construct estimation procedures that couple the dynamics of the phenomenon of interest with image-based measurements. This coupling can be expressed within either stochastic or deterministic frameworks and yields dynamically consistent estimation processes. I will show how such frameworks can be used for the tracking of curves and velocity fields. Applications of these techniques to the tracking of layered atmospheric wind fields and convective cells will be described.
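As a minimal illustration of the data assimilation idea -- coupling a dynamical model with noisy measurements -- here is a textbook linear Kalman filter on a toy constant-drift "flow", recovering the drift velocity from noisy position observations. This is of course far simpler than the stochastic and variational schemes of the talk; all quantities are synthetic:

```python
import numpy as np

def kalman_assimilate(x0, P0, F, H, Q, R, observations):
    """Kalman filter: forecast with dynamics x_{k+1} = F x_k, then
    correct the forecast with measurement z_k = H x_k + noise."""
    x, P = x0, P0
    states = []
    for z in observations:
        x = F @ x                          # forecast step
        P = F @ P @ F.T + Q
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (z - H @ x)            # analysis (correction) step
        P = (np.eye(len(x)) - K @ H) @ P
        states.append(x.copy())
    return np.array(states)

# toy "flow": constant-velocity drift, observed via noisy positions
rng = np.random.default_rng(3)
F = np.array([[1.0, 1.0], [0.0, 1.0]])    # state = (position, velocity)
H = np.array([[1.0, 0.0]])                # only position is measured
true = np.array([0.0, 0.5])               # drift velocity 0.5 per step
obs = []
for _ in range(50):
    true = F @ true
    obs.append(H @ true + rng.normal(0, 0.5, 1))

states = kalman_assimilate(np.zeros(2), np.eye(2) * 10.0, F, H,
                           np.eye(2) * 1e-4, np.array([[0.25]]), obs)
print(states[-1])  # estimated position and drift velocity
```

The same forecast/correction structure underlies the curve and velocity-field trackers discussed in the talk, with far richer dynamics and image-based observation operators.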


Jamie Shotton, Antonio Criminisi - Object Recognition in Medical Imagery: Organ Detection and Brain Segmentation
Abstract


Barbara Andre, Nicholas Ayache - Introducing space and time in local feature-based endomicroscopic image retrieval

Abstract: Interpreting endomicroscopic images is still a significant challenge, especially since a single still image may not always contain enough information to make a robust diagnosis. To aid physicians, we investigated local feature-based retrieval methods that provide, given a query image, similar annotated images from a database of endomicroscopic images combined with high-level diagnoses represented as textual information. Local feature-based methods may be limited by the small field of view (FOV) of endomicroscopy, and by the fact that they take into account neither the spatial relationships between local features nor the temporal relationships between successive images of a video sequence. To extract discriminative information over the entire image field, our proposed method collects local features in a dense manner instead of using a standard salient region detector. After the retrieval process, we introduce a verification step, driven by the textual information in the database, in which the spatial relationships between local features are used: a spatial criterion is built from the co-occurrence matrix of local features, and outliers are removed by thresholding on this criterion.
To overcome the small FOV problem and take advantage of the video sequence, we propose to combine image retrieval and mosaicing. Mosaicing essentially projects the temporal dimension onto a large field of view image. In this framework, videos, represented by mosaics, and single images can be retrieved with the same tools. With a leave-n-out cross-validation, our results show that taking into account the spatial relationship between local features and the temporal information of endomicroscopic videos by image mosaicing improves the retrieval accuracy.
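The co-occurrence-based outlier removal can be sketched on quantized "visual words": build a co-occurrence matrix of neighbouring words from a small database, score each location of a query map by how plausible its word is next to its neighbours, and threshold. The maps, vocabulary size and threshold below are invented for illustration:

```python
import numpy as np

def cooccurrence_matrix(feature_maps, n_words):
    """Count how often two visual words appear at neighbouring
    positions, over a set of quantised feature maps."""
    C = np.zeros((n_words, n_words))
    for fm in feature_maps:
        h, w = fm.shape
        for i in range(h):
            for j in range(w):
                for di, dj in ((0, 1), (1, 0)):
                    if i + di < h and j + dj < w:
                        a, b = fm[i, j], fm[i + di, j + dj]
                        C[a, b] += 1
                        C[b, a] += 1
    return C / C.sum()

def spatial_scores(fm, C):
    """Score each location by the mean co-occurrence probability of
    its word with its 4-neighbours; low scores flag outliers."""
    h, w = fm.shape
    score = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            vals = []
            for di, dj in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    vals.append(C[fm[i, j], fm[ni, nj]])
            score[i, j] = np.mean(vals)
    return score

words = np.indices((6, 6)).sum(0) % 2     # checkerboard of words 0/1
C = cooccurrence_matrix([words] * 4, n_words=3)
query = words.copy()
query[2, 2] = 2                           # a spatially inconsistent match
score = spatial_scores(query, C)
outliers = np.argwhere(score < 1e-6)
print(outliers)  # only the inconsistent location survives the threshold
```

In the talk's setting the words are local endomicroscopic features and the co-occurrence statistics are conditioned on the textual diagnosis; the thresholding step is the same.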



Bryan Russell, Helene Dessales, Josef Sivic, Alyosha Efros, Fredo Durand, Jean Ponce - Aligning paintings and images

Abstract: Recently, there has been much success with the application of structure from motion algorithms to Internet-scale images depicting famous landmarks. While these techniques mostly cope with photographic images depicting a rigid scene, there are other depictions of a scene, such as paintings and drawings. In this work, we seek to align paintings, drawings, and photographs. Important for this task are the following: (i) how to describe and match regions undergoing drastic appearance changes, and (ii) the ability to handle deviations from rigid 3D structure. We consider images and paintings from the archaeological site at Pompeii, Italy, along with other famous landmarks, for tasks such as computational re-photography and image indexing.



Andrew Zisserman - Visual Search and Classification of Art Collections

Abstract: The talk will describe how state-of-the-art computer vision algorithms can be applied to art collections, both to provide access and information to novices and to help with organizing and updating the data. Two classes of algorithm are explored: large-scale search/matching for the same object, and classification into object categories. Examples and demos will be given on the Beazley Classical Art Collection of around 130K images of Greek vases.