Vincent Delaitre
Inria - WILLOW Project
23 avenue d'Italie
CS 81321
75214 PARIS Cedex 13

E-mail: vincent.delaitre -at-
Tel: +33.(0)1 3963 5550
Fax: +33.(0)1 3963 5575


I am a fourth-year Ph.D. student in WILLOW, a joint research team of Inria Rocquencourt, the École Normale Supérieure de Paris (ENS) and the Centre National de la Recherche Scientifique (CNRS). I am particularly interested in computer vision, image processing, machine learning and robotics. My advisors are Ivan Laptev and Josef Sivic.

Concerning my education, I entered the École Normale Supérieure de Lyon in 2007 and graduated from the MPRI (Parisian Master of Research in Computer Science) in 2010.

More details are in my resume here (in English) or here (in French).

I am also currently involved in my start-up Smyle, a visual search engine for fashion.


  • 2012:

    • Scene semantics from long-term observation of people
      V. Delaitre, D. Fouhey, I. Laptev, J. Sivic, A. Gupta, and A. Efros.
      European Conference on Computer Vision (ECCV), 2012
      Abstract | BibTeX | PDF | Poster | Project page | Code


      Our everyday objects support various tasks and can be used by people for different purposes. While object classification is a widely studied topic in computer vision, recognition of object function, i.e., what people can do with an object and how they do it, is rarely addressed. In this paper we construct a functional object description with the aim to recognize objects by the way people interact with them. We describe scene objects (sofas, tables, chairs) by associated human poses and object appearance. Our model is learned discriminatively from automatically estimated body poses in many realistic scenes. In particular, we make use of time-lapse videos from YouTube providing a rich source of common human-object interactions and minimizing the effort of manual object annotation. We show how the models learned from human observations significantly improve object recognition and enable prediction of characteristic human poses in new scenes. Results are shown on a dataset of more than 400,000 frames obtained from 146 time-lapse videos of challenging and realistic indoor scenes.


        title = {Scene semantics from long-term observation of people},
        author = {V. Delaitre and D. Fouhey and I. Laptev and J. Sivic and A. Gupta and A. Efros},
        booktitle = {Proc. 12th European Conference on Computer Vision},
        year = {2012},
    • People Watching: Human Actions as a Cue for Single-View Geometry
      D. Fouhey, V. Delaitre, A. Gupta, A. Efros, I. Laptev, and J. Sivic.
      European Conference on Computer Vision (ECCV), 2012
      Abstract | BibTeX | PDF | Project page | Code


      We present an approach which exploits the coupling between human actions and scene geometry. We investigate the use of human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints about the scene. These constraints are then used to improve state-of-the-art single-view 3D scene understanding approaches. The proposed method is validated on a collection of monocular time lapse sequences collected from YouTube and a dataset of still images of indoor scenes. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.


        title = {People Watching: Human Actions as a Cue for Single-View Geometry},
        author = {Fouhey, D. and Delaitre, V. and Gupta, A. and Efros, A. and Laptev, I. and Sivic, J.},
        booktitle = {Proc. 12th European Conference on Computer Vision},
        year = {2012},

  • 2011:

    • Learning person-object interactions for action recognition in still images
      V. Delaitre, J. Sivic, and I. Laptev.
      Advances in Neural Information Processing Systems (NIPS), 2011
      Abstract | BibTeX | PDF | Poster


      We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline.


        author = {Delaitre, V. and Sivic, J. and Laptev, I.},
        title = {Learning person-object interactions for action recognition in still images},
        booktitle = {Advances in Neural Information Processing Systems},
        year = {2011},

  • 2010:

    • Recognizing human actions in still images: a study of bag-of-features and part-based representations
      [Updated version with new database]
      V. Delaitre, I. Laptev, and J. Sivic.
      Proceedings of the British Machine Vision Conference (BMVC), 2010
      Abstract | BibTeX | PDF | Poster | Project page | Database | Code


      Recognition of human actions is usually addressed in the scope of video interpretation. Meanwhile, common human actions such as “reading a book”, “playing a guitar” or “writing notes” also provide a natural description for many still images. In addition, some actions in video such as “taking a photograph” are static by their nature and may require recognition methods based on static cues only. Motivated by the potential impact of recognizing actions in still images and the little attention this problem has received in computer vision so far, we address recognition of human actions in consumer photographs. We construct a new dataset with seven classes of actions in 968 Flickr images representing natural variations of human actions in terms of camera view-point, human pose, clothing, occlusions and scene background. We study action recognition in still images using the state-of-the-art bag-of-features methods as well as their combination with the part-based Latent SVM approach of Felzenszwalb et al. In particular, we investigate the role of background scene context and demonstrate that improved action recognition performance can be achieved by (i) combining the statistical and part-based representations, and (ii) integrating person-centric description with the background scene context. We show results on our newly collected dataset of seven common actions as well as demonstrate improved performance over existing methods on the datasets of Gupta et al. and Yao and Fei-Fei.


        author = {Delaitre, V. and Laptev, I. and Sivic, J.},
        title = {Recognizing human actions in still images: a study of bag-of-features and part-based representations},
        booktitle = {Proceedings of the British Machine Vision Conference},
        note = {updated version, available at},
        year = {2010},

  • 2009:

    • Classifying ELH Ontologies In SQL Databases
      V. Delaitre and Y. Kazakov.
      OWL: Experiences and Directions (OWLED), 2009
      Abstract | BibTeX | PDF


      The current implementations of ontology classification procedures use the main memory of the computer for loading and processing ontologies, which soon can become one of the main limiting factors for very large ontologies. We describe a secondary memory implementation of a classification procedure for ELH ontologies using an SQL relational database management system. Although secondary memory has much slower characteristics, our preliminary experiments demonstrate that one can obtain a comparable performance to those of existing in-memory reasoners using a number of caching techniques.


        title = {Classifying {ELH} Ontologies In {SQL} Databases},
        author = {Delaitre, V. and Kazakov, Y.},
        booktitle = {OWL: Experiences and Directions},
        year = {2009},