Joint Discovery of Object States and Manipulation Actions

People

Jean-Baptiste Alayrac
Josef Sivic
Ivan Laptev
Simon Lacoste-Julien

Abstract

Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa.
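To give a concrete picture of the discriminative clustering cost mentioned above, here is a minimal Python sketch of a DIFFRAC-style objective: for a candidate assignment matrix Y, the cost is the residual obtained when ridge-regressing Y onto the features X. This is an illustrative sketch, not the released code; the function name, toy data, and the regularization weight lam are our own assumptions.

```python
import numpy as np

def discriminative_clustering_cost(X, Y, lam=1e-3):
    """DIFFRAC-style discriminative clustering cost (illustrative sketch).

    X   : (n, d) feature matrix, one row per sample (e.g. tracklet or clip).
    Y   : (n, k) candidate assignment matrix (rows sum to 1).
    lam : ridge regularization weight (assumed value, not from the paper).

    Returns min_W ||Y - X W||^2 / n + lam * ||W||^2, evaluated in closed
    form via the ridge-regression solution for W.
    """
    n, d = X.shape
    # Closed-form ridge solution: W* = (X^T X + n*lam*I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)
    residual = Y - X @ W
    return (residual ** 2).sum() / n + lam * (W ** 2).sum()

# Toy usage: 6 samples, 4-dim features, 2 states (e.g. "empty" vs "full").
X = np.random.randn(6, 4)
Y = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
print(discriminative_clustering_cost(X, Y))
```

In the full model, this cost is minimized over assignments Y subject to ordering constraints (state 1 before the action, the action before state 2), which is what the optimization techniques in the paper address.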

Paper

[ICCV paper] [paper+appendix] [github (code & data)] [SVM visualization (Sec. 5.3)]

BibTeX

@InProceedings{alayrac16objectstates,
    author      = "Alayrac, Jean-Baptiste and Sivic, Josef and Laptev, Ivan and Lacoste-Julien, Simon",
    title       = "Joint Discovery of Object States and Manipulation Actions",
    booktitle   = "International Conference on Computer Vision (ICCV)",
    year        = "2017"
}

Qualitative results

Text-based clip retrieval

We provide additional visualizations of our text-based action retrieval technique (applied to Automatic Speech Recognition transcripts) here.

Video of the project

This video highlights the output of our method.

The first part presents positive results of our method in the "in the wild" setting (see Section 5.3 in the paper). We show results for five actions from our dataset. Each clip is first fast-forwarded to the segment retrieved by text analysis of the narration. At the bottom of the video, two timelines are shown: the first corresponds to our predicted action interval and the second displays the ground truth. Note that our algorithm predicts a single time interval for each action. The main frame (left part of the video) shows the current video, which we pause when making predictions. The right part displays three squares, state 1 (S1), action (A) and state 2 (S2), which are filled in as the corresponding predictions are made. The video also pauses at the end so the viewer can review the summary of predictions on the right.

The second part of the video shows the input and output of the method, in order to highlight the challenges and the importance of the introduced constraints. The input is a video along with detections of a given object in the form of tracklets; different tracklets are shown in different colors. Note the challenges posed by multiple temporally overlapping detections on different objects as well as false positive detections. The output of our joint method is threefold: (i) tracklet assignment to state 1, (ii) the predicted time interval of the action in the video, and (iii) tracklet assignment to state 2. The video pauses during predictions.

The last part presents example failure cases of the method.
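To make the threefold output described above concrete, here is a small hypothetical Python sketch of how the per-video prediction could be represented, together with a check of the temporal ordering constraint. The class and field names are our own illustration, not the data format of the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tracklet:
    """A short object track: an id plus its temporal extent (illustrative)."""
    track_id: int
    start_frame: int
    end_frame: int

@dataclass
class JointPrediction:
    """Hypothetical container for the threefold output of the joint model."""
    state1_tracklets: List[int]       # tracklet ids assigned to state 1 (e.g. "wheel attached")
    action_interval: Tuple[int, int]  # single predicted (start, end) frame interval for the action
    state2_tracklets: List[int]       # tracklet ids assigned to state 2 (e.g. "wheel detached")

    def is_temporally_ordered(self, tracklets: List[Tracklet]) -> bool:
        """Ordering constraint: state 1 before the action, action before state 2."""
        by_id = {t.track_id: t for t in tracklets}
        start, end = self.action_interval
        before = all(by_id[i].end_frame <= start for i in self.state1_tracklets)
        after = all(by_id[i].start_frame >= end for i in self.state2_tracklets)
        return before and after
```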

Code and Data

Code is available on GitHub. The raw data is available as follows: the metadata (containing the README and the complete annotations) is available here, and the raw images are available here (17 GB).

Acknowledgements

This research was supported in part by a Google research award and the ERC grants VideoWorld (no. 267907), Activia (no. 307574) and LEAP (no. 336845).