Paper

S. Kwak, M. Cho, I. Laptev, J. Ponce, C. Schmid
Unsupervised Object Discovery and Tracking in Video Collections
Proceedings of the IEEE International Conference on Computer Vision (2015)

Abstract

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision. We formulate the problem as a combination of two complementary processes: discovery and tracking. The first one establishes correspondences between prominent regions across videos, and the second one associates similar object regions within the same video. Interestingly, our algorithm also discovers the implicit topology of frames associated with instances of the same object class across different videos, a role normally left to supervisory information in the form of class labels in conventional image and video understanding methods. Indeed, as demonstrated by our experiments, our method can handle video collections featuring multiple object classes, and substantially outperforms the state of the art in colocalization, even though it tackles a broader problem with much less supervision.
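The two complementary processes can be illustrated with a deliberately tiny, deterministic sketch (not the paper's actual algorithm): a "discovery" step that matches candidate regions across videos, and a "tracking" step that strings the best-matched region in each frame into a spatio-temporal tube. Region features are 1-D numbers and the scoring function is a stand-in assumption, purely for illustration.

```python
def discovery_score(feat, other_video):
    """Cross-video correspondence: a region scores high when a
    similar region exists somewhere in the other video."""
    return -min(abs(feat - f) for frame in other_video for f, _ in frame)

def track(video, other_video):
    """Within-video association: per frame, keep the region with the
    best cross-video score; the per-frame picks form the 'tube'."""
    return [max(frame, key=lambda r: discovery_score(r[0], other_video))
            for frame in video]

# Two tiny "videos": each frame is a list of (feature, label) candidate
# regions. The shared object has feature ~5; background features differ
# between the two videos, so only the object recurs across them.
video0 = [
    [(0.1, "bg"), (5.0, "obj"), (9.3, "bg")],
    [(2.2, "bg"), (5.1, "obj"), (7.7, "bg")],
]
video1 = [
    [(1.4, "bg"), (4.9, "obj"), (8.8, "bg")],
    [(3.3, "bg"), (5.2, "obj"), (6.6, "bg")],
]

tube = track(video0, video1)
print([label for _, label in tube])  # ['obj', 'obj']
```

The key idea the sketch captures is that no class labels are needed: the object is identified simply because it is the only region that recurs across videos, which is how discovery can also reveal the implicit topology of frames sharing an object class.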

BibTeX

@InProceedings{kwak2015,
    author      = {Kwak, S. and Cho, M. and Laptev, I. and Ponce, J. and Schmid, C.},
    title       = {Unsupervised Object Discovery and Tracking in Video Collections},
    booktitle   = {Proceedings of the IEEE International Conference on Computer Vision},
    year        = {2015},
}

Demo videos

Spotlight video
More example results


Figure 1. Visualization of examples that are correctly localized by our full method: (red) our full method, (green) our method without motion information, (yellow) ground-truth localization. The sequences come from the (a) “aeroplane”, (b) “car”, (c) “cat”, (d) “dog”, (e) “motorbike”, and (f) “train” classes. Frames are ordered by time from top to bottom. The localization results of our full method are spatio-temporally consistent. In contrast, the simpler version often fails under object pose variations (a, c–f) or produces inconsistent tracks when multiple target objects are present (b).

Acknowledgements

This research was supported by the ERC advanced grants Allegro, Activia, and VideoWorld, and by the Institut Universitaire de France.