J. Lezama, K. Alahari, J. Sivic, I. Laptev
Track to the Future: Spatio-temporal Video Segmentation with Long-range Motion Cues
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011)
PDF | Abstract | BibTeX | Poster


Video provides not only rich visual cues such as motion and appearance, but also much less explored long-range temporal interactions among objects. We aim to capture such interactions and to construct a powerful intermediate-level video representation for subsequent recognition. Motivated by this goal, we seek to obtain a spatio-temporal over-segmentation of the video into regions that respect object boundaries and, at the same time, associate object pixels over many video frames. The contributions of this paper are two-fold. First, we develop an efficient spatio-temporal video segmentation algorithm, which naturally incorporates long-range motion cues from the past and future frames in the form of clusters of point tracks with coherent motion. Second, we devise a new track clustering cost function that includes occlusion reasoning, in the form of depth ordering constraints, as well as motion similarity along the tracks. We evaluate the proposed approach on a challenging set of video sequences of office scenes from feature-length movies.


@InProceedings{lezama11,
  author = {Lezama, J. and Alahari, K. and Sivic, J. and Laptev, I.},
  title = {Track to the Future: Spatio-temporal Video Segmentation with Long-range Motion Cues},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year = {2011}
}



The goal of this work is to provide a spatio-temporal segmentation of videos, i.e. a segmentation that is consistent with object boundaries and associates object pixels over time. For example, given a video sequence such as the one in figure (a) below (shown as three frames from the video sequence), the goal is to generate the segmentation shown in figure (b) below, where the scene is divided into two foreground regions consisting of two people -- one walking and the other sitting -- and the background region.

Sample frames from a video sequence, and their corresponding segmentations into three regions.

We propose a method for unsupervised spatio-temporal segmentation of videos, which is a building block for many other tasks such as object and human action recognition in videos. Whilst there have been many attempts to address the segmentation problem, most of them are restricted to a local analysis of the video. Our method uses point tracks to capture long-range motion cues, and also infers local depth ordering to separate objects. We build on the graph-based agglomerative segmentation work of [Felzenszwalb and Huttenlocher 2004, Grundmann et al. 2010], and group neighbouring pixels with similar colour and motion. Our framework is summarized in the figure below:

Pixels in one image frame are connected to corresponding pixels in the next frame using optical flow. We also introduce point-tracks for long-range support over time and encourage all pixels in a track to belong to the same segment. We ensure that dissimilar tracks are assigned to different segments.
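As a rough illustration of this graph construction, the sketch below builds a spatio-temporal pixel graph: spatial edges connect neighbouring pixels within a frame, with weights combining colour and motion differences, and temporal edges follow the optical flow into the next frame. This is a minimal sketch under assumed conventions (the array shapes, the build_video_graph name, the weighting), not the paper's implementation.

import numpy as np

def build_video_graph(video, flow, motion_weight=0.5):
    # video: (T, H, W, 3) float array; flow: (T-1, H, W, 2) forward optical flow.
    # Both names and shapes are assumptions made for this illustration.
    T, H, W, _ = video.shape
    edges = []  # (weight, node_a, node_b)

    def node(t, y, x):
        return t * H * W + y * W + x

    for t in range(T):
        for y in range(H):
            for x in range(W):
                # Spatial edges to right/bottom neighbours: similar colour and
                # similar motion give a low weight, so such pixels merge early.
                for dy, dx in ((0, 1), (1, 0)):
                    ny, nx = y + dy, x + dx
                    if ny < H and nx < W:
                        w = np.linalg.norm(video[t, y, x] - video[t, ny, nx])
                        if t + 1 < T:
                            w += motion_weight * np.linalg.norm(flow[t, y, x] - flow[t, ny, nx])
                        edges.append((w, node(t, y, x), node(t, ny, nx)))
                # Temporal edge to the flow-displaced pixel in the next frame.
                if t + 1 < T:
                    fy = int(round(y + flow[t, y, x, 1]))
                    fx = int(round(x + flow[t, y, x, 0]))
                    if 0 <= fy < H and 0 <= fx < W:
                        w = np.linalg.norm(video[t, y, x] - video[t + 1, fy, fx])
                        edges.append((w, node(t, y, x), node(t + 1, fy, fx)))

    edges.sort(key=lambda e: e[0])  # agglomerative merging considers cheapest edges first
    return edges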

We illustrate the benefits of using long-range point tracks with the following two examples. In the first example below, we consider two objects: A (the moving object) and B (the stationary object). The point tracks corresponding to these objects are shown in (a) below. If one were to use only short-term motion analysis (in the initial frames), the two objects would be merged into one segment. However, by observing the entire length of the point tracks, we note that the two objects belong to different segments. Using this (dis)similarity constraint on our video example leads to the track clustering shown in (b) below.

Using motion (dis)similarity to cluster point-tracks in our example video.
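To make the long-range cue concrete, here is a hedged sketch of one possible track (dis)similarity: the maximum difference in displacement between two tracks over the frames in which both are visible. The track representation (a dict mapping frame index to (x, y)), the track_dissimilarity name and the step size are assumptions for this example, not the exact measure used in the paper.

import numpy as np

def track_dissimilarity(track_a, track_b, step=5):
    # Each track maps frame index -> (x, y); compare displacements over a short
    # window at every frame where both tracks exist, and keep the worst case.
    common = sorted(set(track_a) & set(track_b))
    worst = 0.0
    for t in common:
        if t + step not in track_a or t + step not in track_b:
            continue
        va = np.subtract(track_a[t + step], track_a[t])  # displacement of track A
        vb = np.subtract(track_b[t + step], track_b[t])  # displacement of track B
        worst = max(worst, float(np.linalg.norm(va - vb)))
    return worst  # a large value suggests the tracks lie on different objects

# Toy example mirroring objects A and B above: identical motion in the first
# frames, but the maximum over the whole common lifespan separates them.
moving = {t: (max(0, t - 15) * 2.0, 10.0) for t in range(30)}  # starts moving at frame 15
static = {t: (0.0, 12.0) for t in range(30)}
print(track_dissimilarity(moving, static))  # clearly non-zero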

Although the above result is reasonable, it does not separate the sitting person from the background. In the second example below, we consider three objects: B (the moving object), and A and C (the stationary objects). Object B moves in front of object A, but behind object C. Thus, the point tracks corresponding to objects A and C should belong to different segments. This toy example corresponds to our example video, where "object" A is the background, object B is the walking person, and object C is the sitting person. We use such local depth-ordering constraints on our video example to obtain the track clustering shown in (b) below.

Using local depth-ordering constraints separates objects. In (b), the sitting person is separated from the background.
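The depth-ordering cue can be illustrated with a simple, hedged heuristic: if a track disappears while another track immediately passes over its last position, the surviving track's object is likely the occluder, i.e. it lies in front. The occlusion_cues function, the track format and the distance threshold below are illustrative assumptions, not the paper's exact occlusion reasoning.

import numpy as np

def occlusion_cues(tracks, radius=5.0):
    # tracks: dict track_id -> {frame: (x, y)}; returns (occluded_id, occluder_id)
    # pairs suggested by a track ending underneath a surviving track.
    cues = []
    for i, ta in tracks.items():
        t_end = max(ta)                          # last frame where track i is visible
        p_end = np.asarray(ta[t_end], dtype=float)
        for j, tb in tracks.items():
            if i == j or t_end + 1 not in tb:
                continue                         # track j must outlive track i
            if np.linalg.norm(np.asarray(tb[t_end + 1], dtype=float) - p_end) < radius:
                cues.append((i, j))              # j moved over the spot where i vanished
    return cues

Cues of the form (A occluded by B) and (B occluded by C) can then be combined into the constraint that A and C lie at different depths, and hence should be placed in different segments.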

We formulate the point track clustering problem as an energy function, and solve it using the sequential tree-reweighted message passing (TRW-S) algorithm [Kolmogorov 2005]. Further details can be found in our paper.
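As a loose, hedged sketch of the kind of pairwise track-labelling energy involved: dissimilar tracks placed in the same cluster, similar tracks split across clusters, and depth-ordered pairs sharing a cluster are all penalized. The paper minimizes its energy with TRW-S; the greedy ICM sweeps below stand in purely for illustration, and all names, thresholds and weights are assumptions.

import itertools
import numpy as np

def energy(labels, dissim, depth_pairs, tau=5.0, lam=10.0):
    # labels: cluster label per track; dissim: symmetric matrix of pairwise track
    # dissimilarities; depth_pairs: track pairs known to lie at different depths.
    e = 0.0
    n = len(labels)
    for i, j in itertools.combinations(range(n), 2):
        if labels[i] == labels[j] and dissim[i, j] > tau:
            e += dissim[i, j] - tau          # dissimilar tracks forced together
        elif labels[i] != labels[j] and dissim[i, j] < tau:
            e += tau - dissim[i, j]          # similar tracks torn apart
    for i, j in depth_pairs:
        if labels[i] == labels[j]:
            e += lam                         # depth-ordering constraint violated
    return e

def cluster_tracks(dissim, depth_pairs, num_labels, sweeps=10, seed=0):
    # Greedy coordinate descent (ICM) as a stand-in for the TRW-S inference used in the paper.
    rng = np.random.default_rng(seed)
    n = dissim.shape[0]
    labels = rng.integers(num_labels, size=n)
    for _ in range(sweeps):
        for i in range(n):
            costs = []
            for l in range(num_labels):
                trial = labels.copy()
                trial[i] = l
                costs.append(energy(trial, dissim, depth_pairs))
            labels[i] = int(np.argmin(costs))
    return labels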

Additional results and videos will be available here soon.


This work was partly supported by the Quaero Programme, funded by OSEO, and by the MSR-INRIA laboratory.