Upcoming Seminar:
How should a robot perceive the world?
Abstract: In order to perform assistive tasks, a robot needs a functional understanding of the environment. This comprises learning how to use the objects (i.e., object affordances) through the humans that use them. In this talk, I will discuss a few affordance-based representations together with data-driven algorithms to learn them. Specifically, I will present Infinite Latent CRFs (ILCRFs) that allow modeling the data with different plausible graph structures. Unlike CRFs, where the graph structure is fixed, ILCRFs learn distributions over possible graph structures in an unsupervised manner.
We then show that our idea of modeling environments using object affordances and hidden humans is useful not only for robot manipulation tasks such as arranging a disorganized house, haptic manipulation, and unloading items from a dishwasher, but also for significantly improving standard robotic tasks such as scene segmentation, 3D object detection, human activity detection and anticipation, and task and path planning.
Short Bio: Ashutosh Saxena is an assistant professor in the computer science department at Cornell University. His research interests include machine learning and robotics perception, especially in the domain of personal robotics. He received his MS in 2006 and Ph.D. in 2009 from Stanford University, and his B.Tech. in 2004 from the Indian Institute of Technology (IIT) Kanpur. He is a recipient of the National Talent Scholar award in India, a Google Faculty award, an Alfred P. Sloan Fellowship, a Microsoft Faculty Fellowship, and an NSF Career award.
In the past, Ashutosh developed Make3D (http://make3d.cs.cornell.edu), an algorithm that converts a single photograph into a 3D model. Tens of thousands of users used this technology to convert their pictures to 3D. He has also developed algorithms that enable robots (such as STAIR, POLAR, see http://pr.cs.cornell.edu) to perform household chores such as unloading items from a dishwasher, placing items in a fridge, etc. His work has received a substantial amount of attention in the popular press, including the front page of the New York Times, BBC, ABC, New Scientist, Discovery Science, and Wired Magazine. He has won best paper awards at 3DRR, IEEE ACE and RSS, and was named a co-chair of the IEEE technical committee on robot learning.
Multi-granularity steering for human interactions: pose, motion and intention
Abstract: Tracking people and their body pose in videos is a central problem in computer vision. Standard tracking representations reason about temporal coherence of detected people and body parts. They have difficulty tracking targets under partial occlusions or unusual body poses, where detectors are often inaccurate, due to the small number of training examples in comparison to the exponential variability of such configurations.
We propose novel tracking representations that track and segment people and their body pose in videos by exploiting information at multiple detection and segmentation granularities when available: whole body, parts or point trajectories. A key challenge is resolving contradictions among different information granularities, such as detections and motion estimates in the case of false alarm detections or leaking motion affinities. We introduce graph steering, a clustering-with-bias algorithm that targets graph partitioning under sparse unary potentials and dense pairwise node associations, a particular characteristic of the video signal with sparse confident detections and dense motion affinities.
First, we present powerful video segmentation representations from partitioning point trajectories and image regions that target articulated motion, going beyond rigid object motion assumptions. We show video segments adapt to target visibility masks under partial occlusions and deformations. Second, we augment bottom-up video segmentation representations with object detection for tracking under persistent occlusions. We demonstrate how to steer dense optical flow trajectory affinities with repulsions from sparse confident detections to reach a global consensus of detection and tracking in crowded scenes. Third, we study human motion and pose estimation. We segment hard to detect, fast moving body limbs from their surrounding clutter and match them against pose exemplars to detect body pose and improve body part motion estimates with kinematic constraints. We learn certainty of detections under various pose and motion specific contexts, and use such certainty during steering for jointly inferring multi-frame body pose and video segmentation.
We show empirically that such multi-granularity tracking representation is worthwhile, obtaining significantly more accurate body and pose tracking in popular datasets.
Short Bio: Katerina Fragkiadaki is a Ph.D. student in Computer and Information Science at the University of Pennsylvania. She received her diploma in Computer Engineering from the National Technical University of Athens. She works on tracking, segmentation and pose estimation of people under close interactions, for understanding their actions and intentions.
"Virtual Unwrapping" and Other Stories Around the Digitization of Antiquities
Abstract: I will talk about my experience digitizing materials in the following collections: the British Library (London), Lichfield Cathedral (Lichfield), the Marciana Library (Venice), el Escorial (Madrid), the Institut de France (Paris), the National Palace Museum (Taipei), and the Academia Sinica (Taipei). In particular, I will explain a technical approach I have been developing for revealing and discovering text within objects that are fragile, something I have termed "virtual unwrapping". More broadly, I will show results from having digitized collections with an eye toward digital restoration, discovery, visualization, and wide accessibility.
Static and dynamic texture mixing using optimal transport
Abstract: In this presentation, we will tackle static and dynamic texture mixing by combining the statistical properties of an input set of images or videos. We focus on spot noise textures that follow a stationary and Gaussian model which can be learned from the given exemplars. From here, we use optimal transport to define the distance between texture models, derive the geodesic path, and define the barycenter of several texture models. These derivations are useful because they allow the user to navigate inside the set of texture models, interpolating a new texture model at each element of the set. From these new interpolated models, new textures of arbitrary size in space and time can be synthesized. Numerical results obtained from a library of exemplars show the ability of our method to generate new complex realistic static and dynamic textures.
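As a rough illustration of the three operations named in this abstract (distance, geodesic, barycenter), the sketch below works with one-dimensional Gaussians, for which the 2-Wasserstein optimal transport quantities have simple closed forms. It is a simplified stand-in rather than the speakers' method: the texture models in the talk are high-dimensional, and the function names and the `(mean, std)` tuple representation are assumptions made for this example.

```python
import math

# Each 1-D Gaussian is a (mean, std) tuple; this representation is an
# assumption of the sketch, not something taken from the talk.

def w2_distance(g0, g1):
    """2-Wasserstein distance between two 1-D Gaussians."""
    (m0, s0), (m1, s1) = g0, g1
    # In 1-D: W2^2 = (m0 - m1)^2 + (s0 - s1)^2
    return math.hypot(m0 - m1, s0 - s1)

def geodesic(g0, g1, t):
    """Point at time t in [0, 1] on the W2 geodesic from g0 to g1."""
    (m0, s0), (m1, s1) = g0, g1
    # Means and standard deviations are both interpolated linearly.
    return ((1 - t) * m0 + t * m1, (1 - t) * s0 + t * s1)

def barycenter(gaussians, weights):
    """W2 barycenter of several 1-D Gaussians with non-negative weights."""
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, gaussians)) / total
    std = sum(w * s for w, (_, s) in zip(weights, gaussians)) / total
    return (mean, std)

a, b = (0.0, 1.0), (3.0, 5.0)
halfway = geodesic(a, b, 0.5)
print(w2_distance(a, b), halfway, barycenter([a, b], [1.0, 1.0]))
```

With two models and equal weights the barycenter coincides with the geodesic midpoint, which is the sense in which a user can "navigate" between exemplar models.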
Abstract: Humans often agree on matters of aesthetic judgment. If we can predict which visual content most humans will appreciate, we can use these predictions in automated and partially-automated tools for creating better visual content. I will describe a number of projects where I collect large-scale datasets of human aesthetic judgments and use machine learning techniques to predict them. First, I describe experiments in understanding the colors that humans find aesthetically compatible. Second, I discuss a method to predict which frames of a video of a face best serve as candid portraits that capture the moment. Third, I will show a statistical study of the factors that influence human judgment of realism of image composites, as well as an algorithm to make composites more realistic. Finally, I will give a brief overview of an ongoing project to help people choose fonts by first automatically predicting font attributes, such as whether a font is 'modern' or 'playful'.
Egocentric Recognition of Objects and Activities
Abstract: Advances in camera miniaturization and mobile computing have enabled the development of wearable camera systems which can capture both the user's view of the scene (the egocentric, or first-person, view) and their gaze behavior. In contrast to the established third-person video paradigm, the egocentric paradigm makes it possible to easily collect examples of naturally-occurring human behavior, such as activities of daily living, from a consistent vantage point. We will demonstrate that the consistency of the egocentric viewpoint can provide a powerful cue for the weakly-supervised learning of objects and activities. We focus on activities requiring hand-eye coordination and model the spatio-temporal relationship between the gaze point, the scene objects, and the action label. We demonstrate that gaze measurement can provide a powerful cue for recognition. In addition, we present an inference method that can predict gaze locations and use the predicted gaze to infer action labels. We demonstrate improvements in action recognition rates and gaze prediction accuracy relative to state-of-the-art methods, on a new dataset containing egocentric videos of daily activities and gaze. We will also describe some applications in psychology, where we are developing methods for automating the measurement of children's behavior, as part of a large effort targeting autism and other behavioral disorders. This is joint work with Alireza Fathi, Yin Li, and Zhefan Ye.
Bio: James M. Rehg (pronounced "ray") is a Professor in the School of Interactive Computing at the Georgia Institute of Technology, where he is the Director of the Center for Behavior Imaging, co-Director of the Computational Perception Lab, and Associate Director of Research in the Center for Robotics and Intelligent Machines. He received his Ph.D. from CMU in 1995 and worked at the Cambridge Research Lab of DEC (and then Compaq) from 1995 to 2001, where he managed the computer vision research group. He received the National Science Foundation (NSF) CAREER award in 2001, and the Raytheon Faculty Fellowship from Georgia Tech in 2005. He and his students have received a number of best paper awards, including best student paper awards at ICML 2005 and BMVC 2010. Dr. Rehg is active in the organizing committees of the major conferences in computer vision, most recently serving as the Program co-Chair for ACCV 2012. Dr. Rehg is currently leading a multi-institution effort to develop the science and technology of Behavior Imaging, funded by an NSF Expedition award (see www.cbs.gatech.edu for details).
Predicting Image Memorability
Abstract: When glancing at a magazine or browsing the Internet, we are continuously exposed to images. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with their visual details. But not all images are created equal. Whereas some images stick in our minds, others are ignored or quickly forgotten. What makes an image memorable? Our recent work shows that one can predict image memorability, opening a new domain of investigation at the interface between human cognition and computer vision.
Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
Semantic scene labeling using feature learning and graph variational approaches
Abstract: In this talk we address the problem of assigning a semantic category to every pixel of an image or video. We first introduce a model architecture that allows us to learn hierarchies of multi-scale features. We present results on 4 datasets, including one that contains depth information, which may be included in our trainable model very easily. As the output predictions may be noisy, we compare several smoothing approaches including multi-label energy minimization techniques. Finally, we propose a temporal smoothing method to deal with causal videos. It produces temporally consistent superpixels using minimum spanning trees, providing an efficient solution for embedded, real-time applications.
(Joint work with Clément Farabet, Laurent Najman and Yann LeCun)
Clustering Segmentations and Viewpoint-Aware Detection
Abstract: In this talk I will present an unsupervised, shape-based method for joint clustering of multiple image segmentations. Given two or more closely related images, along with an initial over-segmentation, our method computes a joint clustering of segments in the two frames. The clustering is computed as an approximate minimizer of a functional which gives preference to selections whose shape matches across frames and which are internally coherent within each frame. We introduce a novel contour-based representation that allows us to compute the shape similarity of a subset of segments in one frame to the other. The comparison looks only at the exterior bounding contours, receiving no contribution from segment boundaries which fall inside the union. Combining this contour-based score with region information gives rise to a quadratic semi-assignment problem whose solution we approximate by applying an efficient linear programming relaxation. This is joint work with Shiv N. Vitaladevuni and Ronen Basri.
In the second part of the talk I will describe a simple technique for generating models of object classes, which relates 3D shape and 2D appearance. I will demonstrate its application to viewpoint-invariant detection and pose estimation of a rigid object from a single 2D image. This is joint work with Meirav Galun, Sharon Alpert, Ronen Basri and Gregory Shakhnarovich.
The Role of V4 During Natural Vision
Abstract: The functional organization of area V4 in the mammalian ventral visual pathway is far from being well understood. V4 is believed to play an important role in the recognition of shapes and objects and in visual attention, but its complexity makes it hard to analyze. Individual cells in V4 have been shown to exhibit a large diversity of preferences to visual stimuli characteristics, including orientation, curvature, motion, color and texture. Such observations were for a large part obtained from electrophysiological and imaging studies, when a subject (monkey or human) is shown a sequence of artificial stimuli during data acquisition. In our study, we intend to go beyond such an approach and analyze a population of V4 neurons in naturalistic conditions. More precisely, we record responses from V4 neurons to grayscale still natural images---that is, discarding color and motion content. We propose a new computational model for V4 that does not rely on any pre-defined image features but only on invariance and sparse coding principles. Our approach is the first to achieve comparable prediction performance for V4 as for V1 cells on responses to natural images. Our model is also interpretable using sparse principal component analysis. In the neuron population observed and based on our computational model, we discover as our main finding two groups of neurons: those selective to texture versus those selective to contours. This supports the thesis that one primary role of V4 is to extract objects from background in the visual field. Moreover, our study also confirms the diversity of V4 neurons. Among those selective to contours, some of them are selective to orientation, others to acute curvature features.
This is joint work with Yuval Benjamini, Ben Willmore, Michael Oliver, Jack Gallant and Bin Yu. All of this work was performed at UC Berkeley.
Image Segmentation with Geometrical Constraints
Abstract: While images display objects very well, we often want to separate these objects of interest from their surroundings. It has been shown in the past that modeling foreground and background as a GMM is not enough to solve this problem. We therefore have to incorporate prior knowledge into image segmentation.
In this talk I will speak about my recent work (first presented at ECCV'12) that shows how geometrical constraints can be applied to image segmentation. In particular, I will talk about
* Constraining the distribution of appearances in binary segmentation
* Adding maximal Hausdorff distances to nested multi-label segmentation
The energies that arise from these formulations are no longer submodular and therefore cannot be minimized globally via graph cuts. I will show how we can nonetheless obtain meaningful solutions by introducing
* A continuous process of discrete graph cuts called 'line-search cut'
* A submodular-supermodular procedure governed by iterated distance maps.
e-Heritage, Cyber Archaeology, and Cloud Museum
Abstract: We have been conducting the e-Heritage project, which converts assets that form our cultural heritage into digital forms, by using computer vision and computer graphics technologies. We hope to utilize such forms 1) for preservation in digital form of our irreplaceable treasures for future generations, 2) for planning and physical restoration, using digital forms as basic models from which we can manipulate data, 3) for cyber archaeology, i.e., investigation of digitized data through computer analysis, and 4) for education and promotion through multimedia content based on the digital data. This talk briefly overviews our e-Heritage projects underway in Italy, Cambodia, and Japan. We will explain what hardware and software issues have arisen, how to overcome them by designing new sensors using recent computer vision technologies, as well as how to process these data using computer graphics technologies. We will also explain how to use such data for archaeological analysis, and review new findings. Finally, we will discuss a new way to display such digital data by using mixed reality systems, i.e., head-mounted displays on site, connected to cloud computers.
Bio: Dr. Katsushi Ikeuchi is a Professor at the University of Tokyo. He received a Ph.D. degree in Information Engineering from the University of Tokyo in 1978. After working at the Massachusetts Institute of Technology's AI Lab for two years, Electrotechnical Lab, Japan for five years, and Carnegie Mellon University for ten years, he joined the university in 1996. His research interest spans computer vision, robotics, and computer graphics. He has received several awards, including the IEEE Marr Award, the IEEE RAS "most active distinguished lecturer" award and the IEEE-CS ICCV Significant Researcher Award as well as ShijuHoushou (the Medal of Honor with Purple ribbon) from the Emperor of Japan. He is a fellow of IEEE, IEICE, IPSJ, and RSJ.
Counterfactual Reasoning and Learning Systems
Abstract: Using the search engine ad placement problem as an example, we explain the central role of causal inference for the design of learning systems interacting with their environment. Thanks to importance sampling techniques, data collected during randomized experiments gives precious cues to assist the designer of such learning systems and useful signals to drive learning algorithms. Thanks to a sharp distinction between the learning algorithms and the extraction of the signals that drive them, these methods can be tailored to causal models with different structures. Thanks to mathematical foundations shared with physics, these signals can describe the response of the system when equilibrium conditions are reached.
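To make the importance sampling idea concrete, here is a minimal sketch of an inverse-propensity-weighted estimate of how a different ad placement policy would have performed, computed only from logs of a randomized experiment. This is the textbook estimator, not necessarily the one used in the talk; the log format, the policy functions, and the slot names "top" and "side" are all hypothetical.

```python
def ips_estimate(logs, target_policy):
    """Importance-sampling estimate of a policy's average reward.

    logs: list of (context, action, logging_prob, reward) records, where
          logging_prob is the probability with which the randomized logging
          policy chose `action` (must be > 0).
    target_policy(context, action): probability that the policy being
          evaluated would choose `action` in `context`.
    """
    total = 0.0
    for context, action, logging_prob, reward in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        total += target_policy(context, action) / logging_prob * reward
    return total / len(logs)

# Hypothetical log: two queries, ad slot chosen uniformly at random.
logs = [
    ("q1", "top", 0.5, 1.0),
    ("q1", "side", 0.5, 0.0),
    ("q2", "top", 0.5, 1.0),
    ("q2", "side", 0.5, 1.0),
]

def always_top(context, action):
    return 1.0 if action == "top" else 0.0

def always_side(context, action):
    return 1.0 if action == "side" else 0.0

print(ips_estimate(logs, always_top), ips_estimate(logs, always_side))
```

The estimate is only meaningful when the logging policy gives non-zero probability to every action the target policy might take, which is why the randomized experiment matters.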
Visual Dictionary Learning and Latent Conditional Random Fields for Joint Object Categorization and Segmentation
Abstract: Object categorization and segmentation are very challenging problems in computer vision. While most of the prior literature solves these problems independently, recently there has been a growing interest in solving them simultaneously. However, this is difficult because most segmentation algorithms utilize local information to produce a dense set of labels in a bottom-up manner, while most categorization algorithms utilize global information to produce a sparse set of labels in a top-down manner. Moreover, categorization algorithms rely on dictionaries of visual words that are learned off-line and independently from the segmentation task.
In this talk, I will present a unified approach where object categorization, segmentation and dictionary learning are integrated within the same mathematical framework. In our approach, we represent objects in terms of a dictionary of visual words and propose a latent conditional random field (CRF) model in which the observed variables are category labels and the latent variables are visual word assignments. The CRF energy consists of a segmentation cost, a bag of (latent) words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, between visual words and object categories, and among visual words. The segmentation, categorization, and dictionary learning parameters are learned jointly using latent structural SVMs, and the segmentation and visual words are inferred jointly using energy minimization techniques.
Joint work with Dheeraj Singaraju and Aastha Jain.
Two Granularity Tracking: Mediating Trajectory and Detection Graphs for Tracking under Occlusions
Abstract: We propose a framework for mediating grouping cues from two levels of tracking granularities, detection tracklets (detectlets) and point trajectories (trajectlets), for tracking and segmenting objects in crowded scenes.
Detectlets capture objects when they are mostly visible. They may be sparse in time, spatially inaccurate especially during occlusions, and can contain false positives. Trajectlets are dense in space and time. Their affinities integrate long range motion and 3D disparity information but leak across similarly moving objects, since they lack model knowledge. We establish one trajectlet and one detectlet graph, encoding affinities in each space which, as we show, are complementary in nature.
We simultaneously classify detectlets as true or false positives and group in the joint trajectlet and detectlet space by resolving contradictions between affinities in the two graphs. Detectlet classification modifies trajectlet affinities to reflect object specific dis-associations. Non-accidental grouping alignments between detectlet and trajectlet clusters boost or reject corresponding detectlets, changing accordingly their classification. We show our model can track objects through sparse, inaccurate detections and persistent partial occlusions. It provides better spatial grounding in comparison to detection based trackers, by mediating effectively the contradictory information available in detections or point trajectories.
This is joint work with Katerina Fragkiadaki.
Tutorial: Research in Computational Perception and Cognition
Abstract: Computer vision started with the goal of building machines that can see like humans. Currently, many techniques in automatic visual understanding are inspired by how humans recognize and interact with objects and scenes, and how visual information is stored in memory. Research in Computational Perception and Cognition builds on the synergy between human and machine vision, and how it applies to solving high-level recognition problems like understanding scenes and events, perceiving space, recognizing objects, modeling attention, eye movements and visual memory, as well as predicting subjective properties of images (like image memorability).
Using Collaborative Filtering for Applications in Vision
Object and scene representation in the human brain
Abstract: Behavioral and computational studies suggest that visual scene analysis rapidly produces a rich description of both the objects and the spatial layout of surfaces in a scene. Recent neuroscience work in human neuro-imaging suggests that object and scene representation are distributed over a collection of high-level brain regions, with different regions capitalizing on the functional properties of the visual entity to be processed.
Minimal Solvers in Computer Vision
Abstract: I will give an overview of minimal problem solving in computer vision and will explain the state-of-the-art approach to systematic solving of polynomial equations using the Groebner basis method. I will also explain some special tricks for making minimal solvers simpler and faster for computer vision problems.
General link: Minimal Problems
CVPR 2012: PDF
On Template-Based Reconstruction from a Single View: Analytical Solutions and Proofs of Well-Posedness for Developable, Isometric and Conformal Surfaces
Abstract: Recovering a deformable surface's 3D shape from a single view registered to a 3D template requires one to provide additional constraints. A recent approach has been to constrain the surface to deform quasi-isometrically. This is applicable to surfaces of materials such as paper and cloth. Current 'closed-form' solutions solve a convex approximation of the original problem whereby the surface's depth is maximized under the isometry constraints (this is known as the maximum depth heuristic). No such convex approximation has yet been proposed for the conformal case. We give a unified problem formulation as a system of PDEs for developable, isometric and conformal surfaces that we solve analytically. This has important consequences. First, it gives the first analytical algorithms to solve this type of reconstruction problem. Second, it gives the first algorithms to solve for the exact constraints. Third, it allows us to study the well-posedness of this type of reconstruction: we establish that isometric surfaces can be reconstructed unambiguously and that conformal surfaces can be reconstructed up to a few discrete ambiguities and a global scale. In the latter case, the candidate solution surfaces are obtained analytically. Experimental results on simulated and real data show that our methods generally perform as well as or outperform state-of-the-art approaches in terms of reconstruction accuracy.
CornellNYC Tech: A New Campus for 21st Century Technology Research and Education
Abstract: Cornell University is building a campus for graduate education and research in technology-related fields on Roosevelt Island in New York City. This new campus is the result of an international competition held by Mayor Bloomberg's administration to create a new graduate school in order to help fuel the growth of the city's technology sector. The campus will start operation this fall in leased space in Manhattan's Chelsea neighborhood, and move to Roosevelt Island in 2017. The campus is organized around interdisciplinary "hubs" rather than traditional departments; the initial hubs are Connective Media, Healthier Life, and the Built Environment. The goal is to create an institution that brings together leading-edge academic research, commercial success and societal impact. A key component of the institution is the Technion-Cornell Innovation Institute, a global partnership for a global city. This partnership between Cornell and the Technion will offer a new Master of Science degree rooted in the interdisciplinary hubs.
Mosaicing, Segmentation and Categorization of Dynamic Scenes
Abstract: Non-rigid dynamical objects are common in our day to day life. Some examples include waves rippling on the surface of a lake, a flag fluttering in the wind, a moving person, etc. The categorization of videos of such scenes is incredibly challenging, because their appearance constantly changes as a function of time. Most existing video categorization algorithms are not useful for such objects as they either consider the object to be rigid or do not account for changes in scale, illumination and viewpoint.
In this talk, I will describe our recent work on modeling, segmentation and categorization of video sequences of non-rigid dynamical objects. We model the video as the output of a linear dynamical system and use features extracted from the model parameters to perform both segmentation and categorization of non-rigid dynamical objects with invariance to changes in scale, illumination and viewpoint.
A functional framework for the design of steerable wavelet frames with application to bioimaging
Abstract: We present a functional approach to the construction and parametrization of steerable wavelets in any number of dimensions. It relies on an $N$th-order extension of the Riesz transform that has the remarkable property of mapping any primary wavelet frame (or basis) of $L_2(R^d)$ into another "steerable" wavelet frame, while preserving the frame bounds. Concretely, this means that we can design reversible multi-scale decompositions in which the analysis wavelets (feature detectors) can be spatially rotated in any direction via a suitable linear combination of wavelet coefficients. The concept provides a rigorous functional counterpart to Simoncelli's steerable pyramid whose construction was entirely based on digital filter design.
The shaping of the steerable wavelets is controlled by an $M \times M$ unitary matrix (where $M$ is the number of wavelet channels) that can be selected arbitrarily; this allows for a much wider range of solutions than the traditional equiangular configuration (steerable pyramid). We describe some concrete examples of transforms, including a principal-component-based method for signal-adapted wavelet design, and perform a comparison of their denoising performance. The results are in favor of an optimized wavelet design (equalized PCA) which consistently performs best.
Bio: Michael Unser is Professor and Director of EPFL's Biomedical Imaging Group, Lausanne, Switzerland. His main research area is biomedical image processing. He has a strong interest in sampling theories, multiresolution algorithms, wavelets, and the use of splines for image processing. He has published about 200 journal papers on those topics.
From 1985 to 1997, he was with the Biomedical Engineering and Instrumentation Program, National Institutes of Health, Bethesda USA, conducting research on bioimaging and heading the Image Processing Group.
Dr. Unser is a fellow of the IEEE (1999), an EURASIP fellow (2009), and a member of the Swiss Academy of Engineering Sciences. He is the recipient of several international prizes including three IEEE-SPS Best Paper Awards and two Technical Achievement Awards from the IEEE (2008 SPS and EMBS 2010).
Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
Abstract: Deformable Part Models (DPMs) play a prominent role in current object recognition research. In this talk we will see how bounding-based techniques, such as Branch-and-Bound and Cascaded Detection can be used to efficiently detect objects with DPMs. Instead of evaluating the classifier score exhaustively over all image locations and scales, such techniques use bounding to focus on promising image locations and discard less promising ones.
The core problem that we will address is how to compute bounds that accommodate part deformations; for this we adapt the Dual-Tree data structure of Gray et al. to our problem. We evaluate our approach using the DPMs of Felzenszwalb et al.; we obtain exactly the same results but can perform the part combination substantially faster. For a conservative threshold the speedup can be twofold; for a less conservative one we obtain tenfold or higher speedups. These speedups refer to the part combination process, after the unary part scores have been computed.
We also develop a multiple-object detection variation of the system, where hypotheses for 20 categories are inserted in a common priority queue. For the problem of finding the strongest category in an image this can result in more than 100-fold speedups.
For additional information and code, please visit: http://vision.mas.ecp.fr/Personnel/iasonas/dpms.html.
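The bounding idea in this abstract can be illustrated with a toy best-first branch-and-bound search. The sketch below is not the Dual-Tree algorithm from the talk: it assumes a one-dimensional list of locations and a score that is simply the sum of two hypothetical per-location part scores, and it bounds an interval by adding the two per-part maxima. The function and variable names are invented for the example.

```python
import heapq

def best_location(part_a, part_b):
    """Return (score, index) maximising part_a[i] + part_b[i].

    Intervals of locations are kept in a priority queue ordered by an
    upper bound on the best score they can contain. Assumes non-empty
    lists of equal length.
    """
    def upper_bound(lo, hi):
        # max(a) + max(b) over [lo, hi) is never below max(a + b) there,
        # and it is exact when the interval holds a single location.
        return max(part_a[lo:hi]) + max(part_b[lo:hi])

    # heapq is a min-heap, so bounds are stored negated.
    heap = [(-upper_bound(0, len(part_a)), 0, len(part_a))]
    while heap:
        neg_bound, lo, hi = heapq.heappop(heap)
        if hi - lo == 1:
            # An exact score that beats every remaining bound is optimal.
            return -neg_bound, lo
        mid = (lo + hi) // 2
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
            heapq.heappush(heap, (-upper_bound(sub_lo, sub_hi), sub_lo, sub_hi))

part_a = [1, 5, 2, 0, 3, 4, 1, 2]
part_b = [4, 0, 1, 6, 4, 0, 2, 2]
print(best_location(part_a, part_b))
```

Here the bound is recomputed by scanning each interval, so the sketch shows only the search order, not a speedup; a practical system would obtain the bounds from a precomputed tree structure. Inserting the root intervals of several categories into the same queue gives the flavour of the multiple-object variation described above.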
Bio: Iasonas Kokkinos obtained the Diploma of Engineering in 2001 and the Ph.D. Degree in 2006, both from the School of Electrical and Computer Engineering of the National Technical University of Athens in Greece. In 2006 he joined the Center for Image and Vision Sciences in the University of California at Los Angeles as a postdoctoral scholar. As of 2008 he is an Assistant Professor at the Department of Applied Mathematics of Ecole Centrale Paris and is also affiliated with the Galen group of INRIA-Saclay in Paris.
His research interests are in the broader areas of computer vision, signal processing and machine learning, while he has worked on nonlinear speech processing, biologically motivated vision, texture analysis and image segmentation. His current research activity is focused on efficient algorithms for object detection, shape-based object recognition and learning-based approaches to feature detection.
He has been awarded a young researcher grant by the French National Research Agency, and serves regularly as a reviewer for all major computer vision conferences and journals; he has served as an area chair for CVPR 2012, co-organized POCV 2012 and is an associate editor for the Image and Vision Computing Journal.
Graph-based Image Matching and its Progressive Framework
Abstract: Establishing feature correspondence between images lies at the heart of computer vision problems, and a myriad of matching algorithms have been proposed to date for a wide range of applications such as object recognition, image retrieval, and image registration. Most of them, however, require some restrictive assumptions or supervised settings for reliable performance; e.g., a rigid motion assumption, a clear reference image, or distinctive patterns. Robust feature matching under non-rigid deformation or intra-category variation remains an open problem in general real-world scenarios. In this talk, after a brief review of my prior work at SNU, I will present a graph matching approach to robust image matching, and introduce its progressive framework for real-world matching problems.
In our random walk view on (hyper-)graph matching, matching between two graphs is formulated as node selection on an association graph whose nodes represent candidate correspondences between the two graphs. The solution is obtained by simulating random walks with reweighting jumps enforcing the matching constraints on the association graph. Our algorithm achieves noise-robust graph matching by iteratively updating and exploiting the confidences of candidate correspondences. Comparative experiments on synthetic graphs and real images demonstrate that it outperforms the state-of-the-art graph matching algorithms especially in the presence of outliers and deformation.
Despite its powerful performance and robustness, the computational complexity of graph matching limits the permissible size of input graphs in practice. Therefore, in real-world applications such as image matching, the initial construction of graphs to match becomes a critical factor for the matching performance, and often leads to unsatisfactory results. To resolve the issue, a novel progressive framework is introduced which combines probabilistic progression of graphs with matching of graphs. The algorithm re-estimates in a Bayesian manner the most plausible target graphs based on the current matching result, and guarantees to boost the matching objective at the subsequent graph matching. Experimental evaluation demonstrates that our general framework effectively handles the limits of conventional graph matching and achieves significant improvement in challenging image matching problems.
Bio: M. Cho received the PhD degree in 2012 under the supervision of Prof. Kyoung Mu Lee from Seoul National University. His work focused on graph-based approaches for robust image matching and object recognition. Recently, he joined the WILLOW team in INRIA/ENS as a postdoctoral researcher.
Virtual Production at ILM and Lucasfilm: Reinventing the Creative Process
Abstract: Virtual production technologies have matured dramatically over the past 15 years, and are now commonly used by directors and storytellers across many media. They not only bring the creative closer to the center of the production process, but also intermingle creative approaches between film, animation, and games. In this talk, we will review how ILM has advanced these techniques in films from AI to Transformers to Rango, then look forward to virtual production's likely evolution and impact across several types of media.
Helping each other to see: Humans and machines
Abstract: Humans and machines see the world differently, each having their own strengths and weaknesses. In this talk, I describe two projects exploring how they may help each other.
Visual object recognition by machines is notoriously difficult. To help in the learning process, humans are typically used to gather large hand-labeled training datasets from which the machines may learn. However, humans may also be used to "debug" the machine's recognition pipeline to learn what aspects are lacking. Specifically, we explore the various stages of part-based person detectors. We perform human studies in which subjects perform the same sub-tasks as their machine counterparts, and accuracies are compared.
The typical human has significant difficulty in drawing everyday objects containing complex structures, such as faces or bikes. When learning to draw, humans must learn to see the world differently. That is, they must not only recognize what they are seeing, but they must perceive the spacing and structural layout of an object. We demonstrate an application in which machines can recognize what a human is drawing and provide visual guidance to the drawer in the form of shadows. The shadows, which may be either used or ignored by the drawer, help the drawer achieve more realistic overall shapes and spacing, while maintaining their own unique drawing style.
Bio: C. Lawrence Zitnick received the PhD degree in robotics from Carnegie Mellon University in 2003. His thesis focused on efficient inference algorithms for large-problem domains. Previously, his work centered on stereo vision, including the development of a commercial portable 3D camera. Currently, he is a senior researcher at the Interactive Visual Media group at Microsoft Research where he is exploring object recognition and computational photography.
Navigation within an ocean of (trusted) audiovisual contents
Abstract: Ina contains almost 1.5 million hours of audiovisual material, representing 9 million objects. These collections are accessed daily by professionals and the general public searching for images and sounds, based on textual, production-based descriptions. Totally new possibilities of use appear when using discovery tools, since descriptions don't cover all the possible ways of locating or finding contents; this offers rich possibilities for research actions, based on content discovery and content structuring. Ina's research team concentrates on these issues and regularly collaborates with other research teams to improve access and exploitation, as well as long-term preservation. Use cases and examples will be presented during the talk.
Patch Complexity, Finite Pixel Correlations and Optimal Denoising
Abstract: Most image restoration tasks are ill-posed problems, typically solved with priors. While the optimal prior is the exact unknown density of natural images, actual priors are only approximate and typically restricted to small patches. This raises several questions: How much may we hope to improve current restoration results with future sophisticated algorithms? and more fundamentally, even with perfect knowledge of natural image statistics, what is the inherent ambiguity of the problem?
In addition, since most current methods are limited to finite support patches or kernels, what is the relation between the "patch complexity" of natural images, patch size, and restoration errors? Focusing on image denoising, we make several contributions. First, in light of computational constraints, we study the relation between denoising gain and sample size requirements in a non-parametric approach. We present a law of diminishing return, namely that with increasing patch size, rare patches not only require a much larger dataset, but also gain little from it. This result suggests novel adaptive variable-sized patch schemes for denoising. Second, we study absolute denoising limits, regardless of the algorithm used, and the convergence rate to them as a function of patch size. Scale invariance of natural images plays a key role here and implies both a strictly positive lower bound on denoising and a power law convergence. Extrapolating this parametric law gives a ballpark estimate of the best achievable denoising, suggesting that some improvement, although modest, is still possible.
Joint work with Boaz Nadler, Fredo Durand and Bill Freeman.
Microsoft Research India
Abstract: Anandan will talk about the recent work at Microsoft Research, India, and the Digital Heritage project.
Structured Prediction and Ranking in Computer Vision
Abstract: In this talk, I will discuss the application of structured output prediction techniques to the problem of object detection. Structured output prediction is a generalization of regression to complex and interdependent output spaces. We show novel variants of structured output objectives that incorporate ideas from ranking to enforce a (partial) ordering of output predictions. Efficient training can be achieved using a cutting plane approach. Examples will be given for weakly supervised and cascaded object detection.
Visual localization by linear combination of image descriptors
Abstract: We seek to predict the GPS location of a query image given a database of images localized on a map with known GPS locations. The contributions of this work are three-fold: (1) we formulate the image-based localization problem as a regression on an image graph with images as nodes and edges connecting close-by images; (2) we design a novel image matching procedure, which computes similarity between the query and pairs of database images using edges of the graph and considering linear combinations of their feature vectors. This improves generalization to unseen viewpoints and illumination conditions, while reducing the database size; (3) we demonstrate that the query location can be predicted by interpolating locations of matched images in the graph without the costly estimation of multi-view geometry. We demonstrate benefits of the proposed image matching scheme on the standard Oxford building benchmark, and show localization results on a database of 8,999 panoramic Google Street View images of Pittsburgh.
Around Helly numbers
Abstract: A Helly-type theorem is a result of the following form: Given a family F of sets, if every subfamily of cardinality at most h has non-empty intersection, then the whole family has non-empty intersection. In such a case, F is said to have Helly number at most h. Helly numbers have been investigated since the 1920's, when Helly showed that any family of convex sets in R^d has Helly number at most d+1.
In this talk, I'll explain how Helly numbers naturally arise in the context of optimization problems before outlining some results on Helly numbers in line geometry and homology.
A note from the speaker: I'll make an effort at keeping the presentation elementary, assuming no specific geometry/topology prerequisite (and interruptions will be welcome).
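To make the definition in the abstract concrete, here is a minimal sketch for the simplest case, d = 1, where convex sets are intervals and Helly's theorem gives Helly number at most 2. The helper names `intersects` and `every_subfamily_intersects` are hypothetical, and intervals are assumed to be closed and given as `(lo, hi)` pairs.

```python
from itertools import combinations


def intersects(intervals):
    """True if the closed intervals (lo, hi) share at least one common point."""
    # A common point exists exactly when the largest left endpoint does not
    # exceed the smallest right endpoint.
    return max(lo for lo, _ in intervals) <= min(hi for _, hi in intervals)


def every_subfamily_intersects(intervals, h):
    """True if every subfamily of exactly h intervals has a common point."""
    return all(intersects(sub) for sub in combinations(intervals, h))


family = [(0, 4), (1, 5), (3, 6)]
# On the line, checking pairs (h = 2) is enough to decide the whole family.
pairwise_ok = every_subfamily_intersects(family, 2)
whole_ok = intersects(family)
```

In the plane the same check with h = 2 is no longer sufficient: the three sides of a triangle intersect pairwise but have no common point, which is consistent with the bound d+1 = 3 for R^2.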
Two topics in MAP-MRF inference
(Part of the Energy Minimization Symposium)
Abstract: In the first part of the talk I will present a TRW-S algorithm, which is a message passing technique for MAP inference in MRFs with unary and pairwise terms. It is a variation of the tree-reweighted message algorithm by Wainwright et al. Unlike Wainwright's techniques, TRW-S has certain convergence guarantees, and also performs better in practice.
Then I will talk about maxflow-based inference for functions with high-order cliques. The maxflow technique is an important tool for minimizing functions with pairwise terms. It can compute a global minimum if all terms are submodular, and for general pairwise terms it can identify a part of an optimal solution by solving the roof duality relaxation. I will consider extensions of these techniques to higher-order terms. Previously proposed extensions converted the function to a sum of pairwise terms by introducing auxiliary variables. I will discuss a more direct approach that does not require such a conversion.
Time permitting, I will also talk briefly about a dual decomposition algorithm for the graph matching optimisation problem (joint work with L. Torresani and C. Rother).
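For readers unfamiliar with the submodularity condition mentioned in the abstract, it can be written out for a pairwise term over binary labels as follows. The symbol \theta_{pq} for the pairwise term between variables p and q is notation assumed here for illustration, not taken from the talk.

```latex
% Submodularity of a pairwise term over binary labels x_p, x_q \in \{0, 1\}:
\theta_{pq}(0,0) + \theta_{pq}(1,1) \;\le\; \theta_{pq}(0,1) + \theta_{pq}(1,0)
```

When every pairwise term satisfies this inequality, the energy can be minimized exactly with a single maxflow computation, which is the case the abstract refers to as computing a global minimum.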
Super-Labels and Hierarchical Label Costs
(Part of the Energy Minimization Symposium)
Abstract: The first part of the talk will be about a simple segmentation functional similar to Boykov-Jolly / GrabCut. The standard setup includes a GMM for the foreground label, another for the background, and an MRF for regularization. Everyone agrees that the MRF aspect is important, because without it the pixels are assumed i.i.d. and this is simply not a good model for natural images. However, if the i.i.d. assumption is inappropriate for generating an image, then it is equally inappropriate for generating complex objects *within* the image (person, car), and yet this is what the standard setup entails. I'll talk about how to do better using a "2-level MRF" defined over "super-labels."
The second part of the talk will present some unpublished work on energies with a hierarchy of "label costs" and how to optimize them effectively. Such energies facilitate a kind of "hierarchical MDL" criterion and should be useful in detecting multiple objects, motions, homographies, and much more. I'll briefly demonstrate a way to detect repetitive patterns using this framework.
Learning to Segment with Diverse Data
Abstract: Semantic segmentation (assigning a semantic class such as road, person or car to every pixel of a given image) is a classical problem in computer vision, with many applications such as automatic surveillance or autonomous driving. When learning a segmentation model, the main difficulty one faces is the lack of fully supervised data. To address this issue, we present a principled framework for learning with diverse data, where the training samples offer varying levels of supervision: from pixelwise segmentation to image-level labels.
Specifically, we formulate our problem using a latent SVM where the latent, or hidden, variables model any missing information in the human annotation. In order to deal with the large amount of noise inherent in our setting, we propose a new algorithm for learning the parameters of a latent SVM, called self-paced learning (SPL). SPL builds on the intuition that the learner should be presented with the data in a meaningful order: starting with easy examples and gradually introducing more difficult ones in subsequent iterations. At each iteration, SPL simultaneously selects easy examples and learns a new set of parameters. Using large, publicly available datasets we show that our approach is able to exploit the information present in different annotations to improve the accuracy of a state-of-the-art region-based model.
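The alternation described above (select easy examples, refit, then admit harder ones) can be sketched on a deliberately tiny problem. This is not a latent SVM: the model is a single scalar `w` fitted by least squares, and the function name `self_paced_mean`, the squared-error loss, and the doubling schedule for the easiness threshold are all assumptions made for the illustration.

```python
def self_paced_mean(xs, threshold=1.0, growth=2.0, iterations=5):
    """Estimate a scalar w from xs, admitting harder examples over time."""
    if not xs:
        raise ValueError("xs must be non-empty")
    w = sorted(xs)[len(xs) // 2]  # start from the median as a rough guess
    for _ in range(iterations):
        # Selection step: an example is "easy" if its current loss is small.
        easy = [x for x in xs if (x - w) ** 2 < threshold]
        # Learning step: refit the parameter on the selected examples only.
        if easy:
            w = sum(easy) / len(easy)
        # Relax the threshold so harder examples enter in later iterations.
        threshold *= growth
    return w


data = [0.0, 0.2, -0.1, 0.1, 10.0]  # the last value plays the "hard" example
w_early = self_paced_mean(data, iterations=5)
w_late = self_paced_mean(data, iterations=8)
```

With few iterations the hard example is never selected and the estimate stays near the bulk of the data; with enough iterations the threshold grows until every example is included, which mirrors the gradual introduction of difficult samples described in the abstract.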
Neuronal circuits can generate sparse representations for predictive coding in sensory systems
Abstract: Early sensory systems, such as retina and olfactory bulb, face a challenge of quickly and accurately transmitting information about the world to higher brain areas through a limited bandwidth channel ("Barlow's bottleneck"). To solve a similar problem, engineers introduce a predictor module that cancels out signal components which can be accurately predicted from preceding signals, freeing up limited transmission bandwidth. Such a strategy, known as predictive coding, was introduced in neuroscience by Srinivasan et al. (1982), who proposed that a linearly computed prediction is subtracted in ganglion cells. However, non-linear processing remains unexplained by this account. In a separate line of work, Rozell et al. (2008) showed that neural networks can compute sparse over-complete representations of the sensory input and Koulakov & Rinberg (2010) argued that granule cells of the olfactory bulb encode such representation in their steady state activity. Here, we advance the predictive coding view by proposing a general-purpose non-linear predictor module based on sparse approximation by network dynamics. We demonstrate that non-linear dynamics of the inhibitory interneurons can be seen as a traverse through a regularization path from highly sparse and more robust representations, to less sparse and more accurate representations. These regularized representations are not transmitted directly to higher brain areas (due to the large number of axons this would require) but are subtracted from the sensory signal in projection neurons to generate the residual. We demonstrate that the transmission of the residual of the regularization path prediction by the projection neurons trades off accuracy with speed subject to the bandwidth limitation. The proposed neural circuit implementation of the predictor module using sparse representations in inhibitory interneurons solidifies predictive coding as a conceptual framework of early sensory processing.
Procedural Reconstruction of Buildings: Towards Large Scale Reconstruction of Urban Environments
Abstract: 3D reconstruction has recently made significant progress. In this talk, we consider new challenges such as semantics and structure inference. We will describe two approaches involving shape grammars to represent buildings. In both approaches, the goal is to optimize the derivation tree of a (style specific) grammar so as to comply with some observations. The first approach tackles the problem of facade parsing from a single ortho-rectified image. Based on a 2D split grammar, semantic hierarchical segmentation of the facade domain can be randomly generated. We will present an efficient optimization algorithm based on reinforcement learning. From the optimal segmentation, a 3D model of the building can be inferred by extending the 2D grammar with 3D operators. In the second approach, we consider a 3D grammar and a sequence of images to derive the 3D structure of the building. The sequence is automatically calibrated and used to extract a noisy point cloud using a structure from motion algorithm. An evolutionary algorithm is proposed to recover the optimal parsing tree.
Body Part Recognition: Making Kinect Robust
Last November, Microsoft launched Xbox Kinect (http://www.xbox.com/kinect), a revolution in gaming where your whole body becomes the controller: you need not hold any device or wear anything special. Human pose estimation has long been a "grand challenge" of computer vision, and Kinect is the first product that meets the speed, cost, accuracy, and robustness requirements to take pose estimation out of the lab and into the living room.
In this talk we will discuss some of the challenges of pose estimation and the technology behind Kinect, detailing our new approach which forms one of the core algorithms inside Kinect: body part recognition. Deriving from our earlier work that uses machine learning to recognize categories of objects in photographs, body part recognition uses a classifier to produce an interpretation of pixels coming from the Kinect depth-sensing camera into different parts of the body: head, left hand, right knee, etc. Estimating this pixel-wise classification is extremely efficient, as each pixel can be processed independently on the GPU. The classifications can then be pooled across pixels to produce hypotheses of 3D body joint positions for use by any suitable skeletal tracking algorithm. Our approach has been designed to be robust, in two ways in particular. Firstly, we train the system with a vast and highly varied training set of synthetic images to ensure the system works for all ages, body shapes & sizes, clothing and hair styles. Secondly, the recognition does not rely on any temporal information, and this ensures that the system can initialize from arbitrary poses and prevents catastrophic loss of track, enabling extended gameplay for the first time. We further discuss the huge promise this technology holds for many other applications.
Recent Advances in Large-Scale Image Retrieval and Real-time Object Detection and Tracking
Abstract: In this talk, I will give an overview of recent work by my group at RWTH Aachen University along two research directions:
* Object detection and tracking for mobile robotics and automotive applications
* Landmark building discovery in large-scale photo collections
Shape Matching in Subcubic Runtime
Abstract: To understand the world, humans have always used certain forms of abstraction to model their surroundings. In Computer Vision, the concept of 'shape' has been very helpful to model observed objects. In this talk, I will address the problem of shape matching in order to measure the similarity of two shapes. The classical approach solves this problem in O(N^3) where N is the number of boundary points of each shape. I will present two different approaches to reduce the runtime to O(N^2 log(N)).
* The first method computes N shortest paths on a simple grid of size N^2. By computing these paths in a specific order, the runtime can be reduced to N^2 log(N).
* The second method finds a cut in a graph. Also here, we profit from the graph's planarity. In order to reduce the complexity to N^2 log(N), we need to use the data structure of 'dynamic trees' which I will also briefly present in my talk.
Wasserstein Methods in Imaging
Abstract: In this talk I will review the use of optimal transport methods to tackle various imaging problems such as texture synthesis and mixing, color transfer, and shape retrieval. Representing texture variations as well as shape geometry can be achieved by recording histograms of high dimensional feature distributions. I will present a fast approximate Wasserstein distance to achieve fast optimal transport manipulations of these high dimensional histograms. The resulting approximate distance can be optimized using standard first order optimization schemes to perform color equalization and texture synthesis. It is also possible to use this optimal transport as a data fidelity term in standard inverse problems regularization. One can try online several ideas related to Wasserstein imaging (as well as many other imaging methods) by visiting www.numerical-tours.com (computer graphics section). This is joint work with Julien Rabin, Julie Delon and Marc Bernot.
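One common way to make optimal transport cheap, assumed here purely for illustration, is to exploit the fact that in one dimension the optimal matching between two equal-size samples is obtained by sorting, and then to average 1D distances over random projection directions (often called a sliced distance). The abstract does not specify the approximation used, and the function names below are hypothetical.

```python
import math
import random


def wasserstein_1d(xs, ys):
    """Squared 2-Wasserstein distance between two equal-size 1D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In 1D the optimal transport plan pairs the i-th smallest values.
    return sum((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def sliced_wasserstein_2d(points_a, points_b, n_directions=64, seed=0):
    """Average the 1D distance over random directions in the plane."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_directions):
        theta = rng.uniform(0.0, math.pi)
        dx, dy = math.cos(theta), math.sin(theta)
        # Project both point sets onto the unit direction (dx, dy).
        proj_a = [px * dx + py * dy for px, py in points_a]
        proj_b = [px * dx + py * dy for px, py in points_b]
        total += wasserstein_1d(proj_a, proj_b)
    return total / n_directions


cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in cloud]
same = sliced_wasserstein_2d(cloud, cloud)
moved = sliced_wasserstein_2d(cloud, shifted)
```

Each projection costs only a sort, so the per-direction cost is O(n log n) for n points, in contrast to solving a full assignment problem in the original feature space.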
Classification and Invariant Representations
Abstract: Image classification requires finding distances and representations that are locally invariant, stable with respect to elastic deformations, and discriminative. We introduce a class of representations satisfying these properties, which is implemented with a convolutional network. The connection with SIFT and with state-of-the-art vision algorithms will be discussed. A learning architecture will be presented, with applications to image and texture classification.
Energy Minimization with Label costs and Applications in Multi-Model Fitting
Abstract: The α-expansion algorithm has had a significant impact in computer vision due to its generality, effectiveness, and speed. Until recently, it could only minimize energies that involve unary, pairwise, and specialized higher-order terms. We propose an extension of α-expansion that can simultaneously optimize "label costs" with certain optimality guarantees. An energy with label costs can penalize a solution based on the set of labels that appear in it. The simplest special case is to penalize the number of labels in the solution, but the proposed energy is significantly more general than this. Usefulness of label costs is demonstrated by a number of specific applications in vision that appeared in the last year.
Our work (see CVPR 2010, IJCV accepted) studies label costs from a general perspective, including discussion of multiple algorithms, optimality bounds, extensions, and fast special cases (e.g. UFL heuristics). In this talk we focus on natural generic applications of label costs in multi-model fitting and demonstrate several examples: homography detection, motion segmentation, unsupervised image segmentation, compression, and FMM. We also discuss a method for effective exploration of the continuum of labels - an important practical obstacle for α-expansion in model fitting. We discuss why our optimization-based approach to multi-model fitting is significantly more robust than standard extensions of RANSAC (e.g. sequential RANSAC) currently dominant in vision.
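The kind of energy described in this abstract can be written schematically as below. The notation is assumed for illustration and is not quoted from the talk: f is a labeling, D_p are unary terms, V_{pq} are pairwise terms over a neighbourhood system \mathcal{N}, \mathcal{L} is the label set, and h_L \ge 0 is the cost charged when any label from the subset L is used.

```latex
E(f) \;=\; \sum_{p} D_p(f_p)
      \;+\; \sum_{(p,q) \in \mathcal{N}} V_{pq}(f_p, f_q)
      \;+\; \sum_{L \subseteq \mathcal{L}} h_L \, \delta_L(f),
\qquad
\delta_L(f) \;=\;
\begin{cases}
  1 & \text{if } \exists\, p : f_p \in L, \\
  0 & \text{otherwise.}
\end{cases}
```

The simplest special case mentioned in the abstract, penalizing the number of labels in the solution, corresponds to a constant cost h on each single-label subset L = \{l\} and zero cost on all other subsets.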
Robot Vision for the Visually Impaired
Abstract: Vision is one of the primary sensory modalities for humans that assists in performing several life-sustaining and life-enhancing tasks, including the execution of actions such as obstacle avoidance and path-planning necessary for independent locomotion. Visual impairment has a debilitating impact on such independence and the visually impaired are often forced to restrict their movements to familiar locations or employ assistive devices such as the white cane. More recently, various electronic travel aids have been proposed that incorporate electronic sensor configurations and the mechanism of sensory substitution to provide relevant information - such as obstacle locations and body position - via audio or tactile cues. By providing higher information bandwidth (compared to the white cane) and at a greater range, it is hypothesized that the independent mobility performance of the visually impaired can be improved. The challenge is to extract and deliver information in a manner that keeps cognitive load at a level suitable for a human user to interpret in real time. We present a novel mobility aid for the visually impaired that consists of only a pair of cameras as input sensors and a tactile vest to deliver navigation cues. By adopting a head-mounted camera design, the system creates an implicit interaction scheme where scene interpretation is done in a context-driven manner, based on head rotations and body movements of the user. Novel computer vision algorithms are designed and implemented to build a rich, 3D map of the environment, estimate current position and motion of the user and detect obstacles in the vicinity. A multi-threaded and factored simultaneous localization and mapping framework is used to tie all the different software modules together for interpreting the scene accurately and in real time.
The system always maintains a safe path for traversal through the current map, and tactile cues are generated to keep the person on this path, and delivered only when deviations are detected. With this strategy, the end user only needs to focus on making incremental adjustments to the direction of travel. We also present one of the very few computer-vision based mobility aids that have been tested with visually impaired subjects. Standard techniques employed in the assessment of mobility for people with vision loss were used to quantify performance through an obstacle course. Experimental evidence demonstrates that the number of contacts with objects in the path is reduced with the proposed system. Qualitatively, subjects with the device also follow safer paths compared to white cane users in terms of proximity to obstacles.
Optimization for Pixel Labeling Problems With Structured Layout
Abstract: Pixel labeling problems are pervasive in computer vision research. In this talk, we discuss optimization approaches for labeling problems which have some structure imposed on the layout of the labels. In other words, the relationships between labels are not arbitrary but have a well defined spatial structure. We will describe two approaches for structured layout scenes. The first approach is for a more restrictive type of scenes, for which we develop new graph-cut moves which we call order-preserving. The advantage of order-preserving moves is that they act on all labels simultaneously, unlike the popular expansion algorithm, and, therefore, escape local minima more easily. The second approach is for a more general type of structured layout scenes and it is based on dynamic programming. In the second case, the exact minimum can be found efficiently. It is very rare for a 2D labeling problem to have an efficient and global optimizer. For both approaches, our applications include geometric class labeling and segmentation with a shape prior.
Abstract: I'll talk about our recent work in which we investigate long-term tracking of objects in a video stream. The object is given by a user-specified bounding box in a single frame. In every frame that follows, the task is to determine the object location or indicate that the object is not present. We design a novel tracking framework (TLD) that decomposes the long-term tracking task into three sub-tasks: tracking, learning and detection, which are running in parallel. The tracker follows the object from frame to frame. The detector localizes all appearances that have been previously observed and corrects the tracker if necessary. Exploiting the spatio-temporal structure in the data, the learning recognizes errors performed by the detector and updates it in order to avoid these errors in the future. In particular, we focus on: (i) detection of tracking failures, (ii) online learning of an object detector from an unlabeled video stream, and (iii) offline learning of an object detector from a large labeled data set.
Modelling Multi-Feature Active Contours
Abstract: Active contours are a powerful method used in image and video processing as well as computer vision for applications such as segmentation and tracking of non-rigid objects. Formally, active contours are deformable curves that evolve in the image plane from an initial position to the foreground boundaries, thereby characterizing the shape and the location of target objects.
Hence, this technique is well suited for accurate delineation of deformable objects evolving in scenes captured by either static or mobile cameras. However, most of the current approaches rely on a single feature and therefore do not usually offer enough robustness in highly complex environments. In these difficult situations, combining several types of information could potentially increase global system performance. Few works on active contours use multiple features. Mostly, those approaches suffer from a lack of generality and/or flexibility as they propose solutions for only some specific feature associations. Other methods try to incorporate features sequentially, before or after the active contour technique. This leads to an incomplete or redundant use of information carried in the different features and a considerable rise in computational load.
My presented approach for multi-feature active contours mainly differs in that it proposes the combination of multiple features into a unified, generic, and mathematical framework I called Multi-Feature Vector Flow (MFVF). By means of MFVF, features of different structure, nature, and level could be homogeneously integrated into the core itself of the active contour process. The resulting multi-feature active contours were successfully tested for detection and extraction of objects with highly-changing shape and appearance in real-world image and video sequences. As demonstrated, the proposed system presents high accuracy and strong robustness in complex natural situations, while being computationally efficient.
Interactive Multi-Agent and Crowd Simulation
Abstract: Modeling of multiple agents and crowd-like behaviors has been widely studied in virtual reality, robotics, computer animation, psychology, social sciences, and civil and traffic engineering. Realistic visual simulation of many avatars requires modeling of group behaviors, pedestrian dynamics, motion synthesis, and graphical rendering. In this talk, we give an overview of the work related to multi-agent and crowd simulation at UNC Chapel Hill. This includes new algorithms for local collision avoidance based on reciprocal velocity obstacles, automatic generation of emerging behaviors using composite agents or proxies, directing crowd simulation using navigation functions, data-driven crowd simulation, and new parallel algorithms that can exploit the capabilities of upcoming multi-core and many-core processors and can handle up to 200K agents at interactive rates. We demonstrate their application to evacuation planning, urban simulations, traffic engineering and simulating large crowds at social or religious gatherings.
Joint work with GAMMA group members at UNC Chapel Hill.
Bio: Dinesh Manocha is currently the Phi Delta Theta/Mason Distinguished Professor of Computer Science at the University of North Carolina at Chapel Hill. He received his Ph.D. in Computer Science at the University of California at Berkeley in 1992. He has received a Junior Faculty Award, an Alfred P. Sloan Fellowship, an NSF Career Award, an Office of Naval Research Young Investigator Award, a Honda Research Initiation Award, and the Hettleman Prize for Scholarly Achievement. Along with his students, Manocha has also received 12 best paper & panel awards at the leading conferences on graphics, geometric modeling, visualization, multimedia and high-performance computing. He is an ACM Fellow.
Manocha has published more than 300 papers in the leading conferences and journals on computer graphics, geometric computing, robotics, and scientific computing. He has also served as a program committee member and program chair for more than 75 conferences in these areas, and on the editorial boards of many leading journals. Some of the software systems related to collision detection, GPU-based algorithms and geometric computing developed by his group have been downloaded by more than 100,000 users and are widely used in the industry. He has supervised 18 Ph.D. dissertations.
Scene Understanding in an Energy Minimization Framework
Abstract: One of the goals of computer vision is to interpret a scene semantically given an image. It involves various individual tasks, such as object recognition, image segmentation, object detection, and 3D scene recovery. Substantial progress has been made in each of these tasks in the past few years. In light of these successes, the challenging problem now is to put these individual elements together to achieve the grand goal: "scene understanding", a problem which has received increasing attention recently, with the introduction of applications such as Google Street View and Microsoft Bing Maps. The problem of scene understanding is particularly challenging in these scenarios owing to the large variability in classes. For instance, road scene datasets contain classes with specific shapes such as person, car, bicycle (known as "things"), as well as classes such as road, sky, grass (known as "stuff"), which lack a distinctive shape. We address the problems of "what", "where", and "how many": we recognize objects, find their location and spatial extent, segment them, and also provide the number of instances of objects.
We formulate the scene understanding problem in an energy minimization framework, defined on pixels, segments, and objects. In the context of such labelling problems, the talk will present: (i) How to model the problem; (ii) How to learn the parameters of the energy function; and (iii) How to solve the problem efficiently for gigapixel images. We will also look at other labelling problems such as stereo matching, structure detection, and single view reconstruction.
Learning human actions in video
Abstract: The presented work targets the recognition of human actions in realistic video data, such as movies. To this end, we develop state-of-the-art feature extraction algorithms that robustly encode video information for both action classification and action localization.
In the first part, we discuss bag-of-features approaches for action classification. Recent approaches that use bag-of-features as a representation have shown excellent results in the case of realistic video data. We therefore conduct an extensive comparison of existing methods for local feature detection and description. We then propose a new approach that extends the concept of histograms over gradient orientations to the spatio-temporal domain.
In the second part, we investigate how human detection can help action localization in Hollywood-style movies. To this end, we extend a human tracking approach to work robustly on realistic video data. Furthermore, we develop an action representation that is adapted to human tracks. Our experiments suggest that action localization benefits significantly from human detection. In addition, our system shows a large improvement over current state-of-the-art approaches.
The Information is in the Maps
Abstract: Geometric data in the form of 3D scans, images, videos, or GPS traces is becoming abundantly available on the Web and increasingly important to our economy and life. The usual pipeline in transforming such data to useful models involves data analysis operations such as feature extraction, interpolation, smoothing, fitting, segmentation, etc. In this talk we argue for a different perspective on understanding geometric data that is based on the study of informative mappings between different data sets, within a single data set, or from a data set to a simpler space that captures its essential structure. The computation of such good mappings leads to interesting but challenging optimization problems. When our data acquisition samples the world in a dense fashion, correlations between multiple data sets create networks of maps that provide additional information both about the structure of the data as well as about the acquisition process itself. We present examples of this approach for understanding isometries between 3D scans, or for connecting large image corpora into useful webs through map networks.
Beyond Perspective Cameras: Multi-perspective Imaging, Reconstruction, Rendering, and Projection
Abstract: A perspective image represents the spatial relationships of objects in a scene as they appear from a single viewpoint. In contrast, a multi-perspective image combines what is seen from several viewpoints into a single image. Despite their incongruity of view, effective multi-perspective images are able to preserve spatial coherence and can depict, within a single context, details of a scene that are simultaneously inaccessible from a single view.
In this talk, I will provide a complete framework for using multi-perspective imaging models for computer vision and graphics. Our multi-perspective framework consists of four key components: acquisition, reconstruction, rendering, and display. A multi-perspective camera captures a scene from multiple viewpoints in a single image. From the input image, intelligent software can recover 3D scene geometry using multi-perspective stereo matching algorithms or shape-from-distortion approaches. A specific class of surfaces that are suitable to be reconstructed using shape-from-distortion are specular (reflective and refractive) surfaces, which can also be viewed as general multi-perspective cameras. The recovered geometry, along with lighting and surface reflectance, can then be loaded into the multi-perspective graphics pipeline for real-time rendering. Finally, we visualize the rendering results on a unique multi-perspective display that combines a single consumer projector and specially-shaped mirrors/lenses. Such displays will offer an unprecedented level of flexibility in terms of aspect ratio, size, field of view, etc.
Bio: Jingyi Yu is an Associate Professor at the Computer and Information Sciences Department at the University of Delaware. He received his B.S. from Caltech in 2000 and his Ph.D. degree in EECS from MIT in 2005. His research interests span a range of topics in computer vision, computer graphics, and the emerging field of computational photography. He received an NSF Career Award in 2009 and an Air Force Young Investigator Award in 2010.
Shape Representations for Object Detection
Abstract: The problem of visual object detection in still images has been widely addressed in the last decade in computer vision research. The most successful approaches up to now use mainly edge- or texture-based representations. The shape of an object outline, albeit having been widely studied as an object representation, has found limited applicability to the detection problem in real imagery. The fact that shape is a truly holistic global percept proves to be challenging because background objects and interior object contours can easily clutter a global descriptor and render it unusable. Therefore, bottom-up grouping, which selects object boundaries and regions, is of paramount importance. However, image segmentation is very often unstable, and selecting regions is a hard combinatorial problem.
In this talk, we present a contour-based holistic shape representation, called a chordiogram, which addresses the above challenges. The chordiogram is based on geometric relationships of object boundary edges and can be compactly parametrized in terms of a selection of object boundaries and regions. This allows us to link shape detection with bottom-up grouping which is defined in terms of perceptual cues such as region coherence and small object perimeter. The resulting method performs shape-based detection and segmentation simultaneously. Thus it handles image clutter and results in exact object localization and segmentation. The method is formulated compactly as an integer quadratic program and solved in a single step using a semidefinite programming relaxation. Our approach improves over state-of-the-art methods on several object detection and segmentation benchmarks.
Bio: Alexander Toshev is a PhD candidate at the Computer and Information Science Department of the University of Pennsylvania, advised by Prof. Kostas Daniilidis. Alexander holds an MSc degree in Computer Science from the University of Pennsylvania and a Dipl.-Inform. (MSc equivalent) in Computer Science from the University of Karlsruhe, Germany. In 2005, he was a research intern at INRIA, France, and in 2007 at Google Research, New York.
Alexander's research interests lie mainly in computer vision, especially object recognition and detection, shape representation and matching, and analysis of 3D range data.
Actively Using Vision and Context for Home Robotics
Abstract: Increasingly we want computers and robots to observe us and know who we are and what we are doing, and to understand the objects and tasks in our world, both at work and in the home. I will describe how we've built systems for mobile robots to find objects using visual cues and learn about shared workspaces. Further I will review how a range of visual capabilities permits the robot to work for and with humans.
We've demonstrated these abilities on Curious George, our visually-guided mobile robot that has competed in and won the Semantic Robot Vision Challenge (SRVC) at AAAI (2007), CVPR (2008) and ISVC (2009), in a completely autonomous visual search task. In the SRVC, visual classifiers are learned from images gleaned from the Web. Challenges include poor image quality, badly labelled data and confusing semantics (e.g., synonyms). Clustering of training data, image quality analysis, and viewpoint-guided visual attention enable effective object search by a home robot.
Statistical models for analyzing human genetic variation
Abstract: Advances in sequencing and genomic technologies are providing new opportunities to understand the genetic basis of phenotypes including complex human diseases. Translating the large volumes of heterogeneous, often noisy, data from these technologies into biological insights presents challenging problems of statistical inference. In this talk, I will describe two (if time permits three) important statistical problems that arise in our efforts to understand human genetic variation:
- identifying amino acid residues that are critical for the function of a protein. I will describe a statistical predictor that uses a combination of evolutionary and 3-D structural information to accurately predict these residues. Case studies of well-characterized enzymes show that these predictions recover known functional residues and suggest new targets for future experiments.
- inferring the fine-scale genetic structure of human populations such as African-Americans and Latinos. I will describe a probabilistic model for these "admixed" populations as well as efficient algorithms to infer their ancestries. These algorithms can infer the ancestries even when the ancestral populations are unknown or extinct and can be used to estimate other parameters of biological interest such as the allele frequencies of ancestral populations.
- understanding genomic privacy. Sharing individual genomic data, while essential for scientific discovery, brings with it the risk of breaches to individual privacy. In the context of genomewide association studies, I will present an analysis that provides limits on the achievable privacy as well as guidelines on how to safely share data from these studies.
Sparse factor analysis and related methods: applications to three biological problems
Abstract: Working in the context of low-rank, sparse matrix factorization, latent structure in biological data is captured in a small set of identifiable factors. In this talk, a novel matrix factorization method, sparse factor analysis (SFA) is introduced and discussed in the context of related methods. We show results from applying SFA to three different problems in biology. First, on the problem of identifying population structure, SFA produced results that correspond well to those from principal components analysis and latent Dirichlet allocation models. Second, in a genome-wide association study with a complex phenotype, SFA identified factors with simple phenotypic interpretations that enabled possible SNP associations to the phenotypes of interest to be found. Third, we present preliminary results from applying SFA to identify networks in gene expression data from different human tissues.
Visual Recognition with Humans in the Loop
Abstract: We present an interactive, hybrid human-computer method for object classification. The method applies to classes of problems that are difficult for most people, but are recognizable by people with the appropriate expertise (e.g., animal species or airplane model recognition). The classification method can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. Incorporating user input drives up recognition accuracy to levels that are good enough for practical applications; at the same time, computer vision reduces the amount of human interaction required. The resulting hybrid system is able to handle difficult, large multi-class problems with tightly-related categories. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate the accuracy and computational properties of different computer vision algorithms and the effects of noisy user responses on a dataset of 200 bird species and on the Animals With Attributes dataset. Our results demonstrate the effectiveness and practicality of the hybrid human-computer classification paradigm.
This work is part of the Visipedia project, in collaboration with Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder and Pietro Perona.
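The question-selection step of the visual 20 questions game described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the next question is chosen by maximum expected information gain over the current class probabilities; the abstract does not state the selection criterion actually used. The class probabilities stand in for the output of a computer vision classifier, and the per-class "yes" probabilities (values strictly between 0 and 1 model imperfect user responses) are hypothetical, as are all function and attribute names.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def posterior(prior, p_yes_given_class, answer):
    """Bayes update of the class probabilities after a yes/no answer."""
    like = [p if answer else 1.0 - p for p in p_yes_given_class]
    joint = [pr * l for pr, l in zip(prior, like)]
    z = sum(joint)
    # If the answer has zero probability under the model, keep the prior.
    return [j / z for j in joint] if z > 0 else list(prior)


def expected_information_gain(prior, p_yes_given_class):
    """Expected drop in class entropy from asking one binary question."""
    p_yes = sum(pr * p for pr, p in zip(prior, p_yes_given_class))
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, p_yes_given_class, answer))
    return gain


def pick_question(prior, questions):
    """Return the name of the question with the largest expected gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))


# Four equally likely classes; each question lists P(user says yes | class).
prior = [0.25, 0.25, 0.25, 0.25]
questions = {
    "has_red_head": [1.0, 1.0, 0.0, 0.0],  # splits the classes in half
    "is_a_bird": [1.0, 1.0, 1.0, 1.0],     # uninformative for this class set
}
best = pick_question(prior, questions)
```

In this toy setting the question that splits the candidate classes evenly is preferred, since an answer that is the same for every class carries no information. A sharper prior from the vision system shrinks the entropy to begin with, which is one way to read the claim that computer vision reduces the amount of human interaction required.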
Rich Photography on a Budget
Abstract: Computation is playing an increasingly central role in how we capture and process our images, opening up richer forms of imaging that go beyond conventional photography. Recent examples of rich photography involve merging multiple shots to obtain seamless panoramas, 3D shape, deeper focus, or a wider range of tones. In this talk, I will argue that the future of photography lies in richer capture, paying special attention to our limited budget of light, time, and sensor throughput. By analyzing tradeoffs and limits in imaging, we can develop ways to enrich photography while making efficient use of our cameras.
First, I will address the basic problem of capturing an in-focus image in a fixed time budget. As our analysis shows, the number of shots captured is a crucial determinant of quality, and taking this into account places the conventional camera in a surprisingly favorable light. Second, I will describe how existing cameras can be used more efficiently to capture scenes with a wide range of tones. By adjusting the camera's amplifier as well as its shutter speed, we can achieve up to 10x noise reduction in the darkest parts of the scene. Both of these projects demonstrate not only how computation enables rich photography, but also how a deeper understanding of imaging can lead to significant gains over the state-of-the-art.
Bio: Sam Hasinoff received the BSc degree in computer science from the University of British Columbia in 2000, and the MSc and PhD degrees in computer science from the University of Toronto in 2002 and 2008, respectively. He is currently an NSERC Postdoctoral Fellow at the Massachusetts Institute of Technology. In 2006, he received an honorable mention for the Longuet-Higgins Best Paper Award at the European Conference on Computer Vision. He is the recipient of the Alain Fournier Award for the top Canadian dissertation in computer graphics in 2008.
Pruned dynamic programming for optimal multiple change-point detection, by Guillen Rigaill
Multiple change-point detection models assume that the observed data is a
realization of an independent random process affected by K-1 abrupt changes,
called change-points, at some unknown positions. For off-line detection a
dynamic programming (DP) algorithm retrieves the K-1 change-points minimizing
the quadratic loss and reduces the complexity from \Theta(n^K) to \Theta(Kn^2)
where n is the number of observations. The quadratic complexity in n still
restricts the use of such an algorithm to small or intermediate values of n. We
propose a pruned DP algorithm that recovers the optimal solution. We
demonstrate that at worst the complexity is in O(Kn^2) time and O(Kn) space and
is therefore at worst equivalent to the classical DP algorithm. We show
empirically that the run-time of our proposed algorithm is drastically reduced
compared to the classical DP algorithm. More precisely, our algorithm is able
to process a million points in a matter of minutes compared to several days
with the classical DP algorithm. Moreover, the principle of the proposed
algorithm can be extended to other convex losses (for example the Poisson loss)
and, as the algorithm processes one observation after the other, it could be
adapted for on-line problems.
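For reference, the classical (unpruned) DP baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard O(Kn^2) recursion for the quadratic loss, not the pruned algorithm proposed in the talk; the function names are hypothetical. Prefix sums make the cost of any segment available in constant time.

```python
def make_segment_cost(y):
    """Return cost(i, j): squared error of fitting y[i:j] by its mean."""
    n = len(y)
    s1 = [0.0] * (n + 1)  # prefix sums of y
    s2 = [0.0] * (n + 1)  # prefix sums of y squared
    for t, v in enumerate(y):
        s1[t + 1] = s1[t] + v
        s2[t + 1] = s2[t] + v * v

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    return cost


def classical_dp_changepoints(y, K):
    """Split y into K segments (K-1 change-points) with minimal quadratic loss.

    Returns (change_points, loss), where each change-point is the index at
    which a new segment starts. Runs in O(K n^2) time and O(K n) space.
    """
    n = len(y)
    if not 1 <= K <= n:
        raise ValueError("K must satisfy 1 <= K <= len(y)")
    cost = make_segment_cost(y)
    inf = float("inf")
    # best[k][t]: minimal loss of splitting y[:t] into exactly k segments
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):  # s is the start of the last segment
                c = best[k - 1][s] + cost(s, t)
                if c < best[k][t]:
                    best[k][t] = c
                    back[k][t] = s
    # Backtrack the starts of segments K, K-1, ..., 2.
    change_points = []
    t = n
    for k in range(K, 1, -1):
        t = back[k][t]
        change_points.append(t)
    change_points.reverse()
    return change_points, best[K][n]


signal = [0, 0, 0, 5, 5, 5, 1, 1, 1]
cps, loss = classical_dp_changepoints(signal, K=3)
```

The inner loop over s is what makes the cost quadratic in n. According to the abstract, the pruned algorithm recovers the same optimal solution while having the same worst-case complexity and a drastically reduced run-time in practice.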
Fast methods for change-point detection in single or multiple profiles, by Kevin Bleakley and Jean-Philippe Vert
Abstract: In this talk we will discuss two recent works for the detection of change-points in noisy piecewise constant profiles.
1) When a single profile is considered, we consider the problem of approximating a signal by a piecewise-constant function that minimizes a sum-of-square approximation error penalized by a total variation penalty. Harchaoui and Lévy-Leduc (NIPS 2008) reformulated the problem as a LASSO regression problem and showed that finding k breakpoints in a signal of length n takes O(n*k) using the LARS algorithm. Here we show that the solution can in fact be found in O(n*ln(k)) with a dichotomic segmentation approach.
2) When several profiles are considered jointly, we propose to detect joint breakpoints by extending the total-variation based penalty to multiple dimensions. The resulting estimator is formulated as the solution of a group LASSO problem, which we propose to approximate by a fast group LARS procedure in O(n*k*p) to find k breakpoints in n p-dimensional profiles. We illustrate the use of this method to detect genomic regions of frequent amplification and deletion in cancer tumors from comparative genomic hybridization profiles.
SuperParsing: Scalable Nonparametric Image Parsing with Superpixels
Abstract: We will present a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach requires no training, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. It works by scene-level matching with global image descriptors, followed by superpixel-level matching with local features and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art nonparametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 15,150 images and 170 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem.
This is joint work with Joseph Tighe.
Object Recognition by Ranking and Tiling Figure-Ground Hypotheses
Abstract: I will present an approach to visual object-class segmentation based on models that combine multiple, holistic figure-ground hypotheses generated in a bottom-up, object independent processing, and classification decisions based on continuous value ranking of their estimated spatial overlap with putative classes. We differ from existing approaches not only in our seemingly unreasonable assumption that plausible object level segments can be obtained in a feed-forward fashion using a novel parametric max-flow procedure (Constrained Parametric Min-Cuts), but also in framing recognition as a regression problem. Instead of focusing on a one-vs-all winning margin that may not preserve ordering inside the non-maximum (non-winning) set, learning produces a globally consistent ranking with close ties to segment quality, hence to the extent entire object or part hypotheses spatially overlap with the ground truth. Time permitting, I will also describe models that explain entire images by tiling multiple figure-ground segment hypotheses using depth and perceptual grouping features. I will show results for image classification, object detection and semantic segmentation, in a number of challenging datasets including Caltech-101, ETHZ-Shape and PASCAL VOC 2009, where the system ranked first.
This is joint work with J. Carreira, F. Li and A. Ion at the University of Bonn.
Approximate Bayesian Inference and Experimental Design for Large Scale Sparse Linear Models
Abstract: Recent advances in sparse reconstruction can be seen as maximizing the posterior of a sparse linear model (SLM). Beyond its mode's location, the posterior contains much more useful information. Its mean is in general a more robust estimator, while its covariance represents remaining uncertainty in the signal, which is used in Bayesian experimental design to improve data acquisition. Unfortunately, SLM Bayesian inference is intractable and has to be approximated.
Compared to MAP, most previous approximate inference relaxations are not well understood algorithmically and are orders of magnitude slower on large problems. I will describe recent advances in variational inference for SLMs: (1) a relaxation which is a convex problem iff MAP is one, avoiding any factorization assumptions needed in Variational Bayes; (2) a fast double loop algorithm to solve it for SLMs with many different sparsity and other potentials, which reduces to well known scalable problems of convex reconstruction and numerical mathematics. I will show how this method drives automatic sampling optimization for speeding up magnetic resonance image acquisition, and motivate potential applications to sparse bilinear models.
A Latent Variable Graphical Model Derivation of Diversity for Set-based Retrieval
Abstract: Diversity has been heavily motivated as an objective criterion for result sets in the information retrieval literature, and various ad-hoc heuristics have been proposed to explicitly optimize for it. In this talk, we will start from first principles and show that optimizing a simple criterion of set-based relevance in a latent variable graphical model, a framework we refer to as probabilistic latent accumulated relevance (PLAR), leads to diversity as a naturally emergent property of the solution. PLAR derives variants of latent semantic indexing (LSI) kernels for relevance and diversity and does not require ad-hoc tuning parameters to balance them. PLAR also directly motivates the general form of many other ad-hoc diversity heuristics in the literature, albeit with important modifications that we show can lead to improved performance on a diversity testbed from the TREC 6-8 Interactive Track.
Bio: I received a Bachelor's degree in computer science in 2004, and a Master's degree in pattern recognition and intelligent systems in 2007. I started my PhD program in computer science in July 2007. Currently, I am a PhD candidate at the Australian National University, and a Graduate Researcher at National ICT Australia.
An investigation of discrete-state discriminant approaches to single-sensor audio source separation
Abstract: In this talk, we will present a new discriminant scheme derived from an existing generative approach for single-sensor audio source separation. Audio source separation consists of separating several speakers, musical instruments or other sources from a mixture, i.e. a recording in which the sources are simultaneously present. The underdetermined case (fewer sensors than sources) can be attacked by modeling each source with a finite number of discrete states. In this context, we propose new discriminant approaches derived from the existing spectral Gaussian mixture model approach. The study includes algorithms to establish oracle performance bounds by considering each model as a set of constraints to derive optimal parameters. This leads to more realistic performance bounds than the existing ones. The promising theoretical results are completed by actual separation algorithms.
As a side discussion, we will address the question of the audio vs image processing methods, in terms of similar schemes (e.g. patch/frame-based processing), related algorithmic complexity and ongoing works on sparse representations.
Combining Color and Shape Information for Image Classification
Abstract: Generally, the bag-of-words based image representation follows a bottom-up paradigm. The successive stages of the process (feature detection, feature description, vocabulary construction and image representation) are performed independently of the intended object classes to be detected. In such a framework, combining multiple cues such as shape and color often provides below-expected results.
The two main strategies to combine multiple cues, known as early and late fusion, both suffer from significant drawbacks. In this talk I present a novel method that separates the shape and color cues. Color is used to construct a top-down category-specific attention map. The color attention map is then further deployed to modulate the shape features by taking more features from regions within an image that are likely to contain an object instance. This procedure leads to a category-specific image histogram representation for each category.
Evaluation on several data sets shows that the proposed method outperforms both early- and late fusion. Additionally, I will comment on its usage in our submission to the VOC PASCAL 2009 image classification challenge.
Bio: Joost van de Weijer is a Ramon y Cajal fellow in the Color in Context group (CIC) in the Computer Vision Center in Barcelona. He received his M.Sc. degree in applied physics at Delft University of Technology in 1998. In 2005, he obtained his Ph.D. in the ISLA group at the University of Amsterdam. From 2005 to 2007 he was a Marie Curie Intra-European Fellow in the LEAR team at INRIA Rhone-Alpes in France. His main research interest is the use of color information in computer vision applications. He has published in the fields of color constancy, color feature extraction and detection, color image filtering, color edge detection and color naming.
Bands and bi-clusters in binary matrices
Abstract: I will discuss two combinatorial problems in binary matrices that involve finding simultaneous permutations of rows and columns to exhibit uniform patterns of 1s. On the one hand, banded structures correspond to a maximal number of 1s grouped close to the main diagonal; on the other hand, bi-clusters correspond to the simultaneous clustering of rows and columns such that each of the sub-matrices induced by those clusters is as uniform as possible. These structures are useful for different applications, for example in the physical mapping problem of the human genome, in micro-array data for describing the interaction of genes under certain conditions, for predicting species in paleontological data, or in network data for the discovery of overlapping communities without cycles. I will discuss the combinatorial properties of these problems and the algorithmic consequences of finding such structures in real binary data.
Learning a generative model of images by factoring appearance and shape
Abstract: Computer vision has grown tremendously in the last two decades. Despite all efforts, existing attempts at matching parts of the human visual system's extraordinary ability to understand visual scenes lack either scope or power. This work aims at combining advantages of general low-level generative models and powerful layer-based and hierarchical models. Starting from our basic model, the masked RBM, which explicitly models occlusion boundaries in image patches by factoring out the appearance of any patch region from its shape, we propose a generative model of larger images using a field of such RBMs. Finally, we also discuss how masked RBMs can be stacked to form a deep model for hierarchical segmentation.
Efficient(?) nonconvex optimization algorithms for large-scale machine learning applications
Abstract: Optimization on manifolds is a relatively recent algorithmic framework tailored to solve particular optimization problems characterized by an "easy" cost function and "highly symmetric" constraints. The framework sometimes produces very efficient algorithms that result from a solid geometric foundation and well-chosen numerical parametrizations. The talk will survey a few such examples relevant to machine learning applications. In particular, we will show that the approach is potentially relevant for rank-constrained SDP relaxations in large-scale problems.
Learning Components for Human Sensing
Abstract: Providing computers with the ability to understand human behavior from sensory data (e.g. video, audio, or wearable sensors) is an essential part of many applications that can benefit society such as clinical diagnosis, human computer interaction, and social robotics. A critical element in the design of any behavioral sensing system is to find a good representation of the data for encoding, segmenting, classifying and predicting subtle human behavior. In this talk I will propose several extensions of Component Analysis (CA) techniques (e.g. kernel principal component analysis, support vector machines, and spectral clustering) that are able to learn spatio-temporal representations or components useful in many human sensing tasks.
In the first part of the talk I will give an overview of several ongoing projects in the CMU Human Sensing Laboratory, including our current work on depression assessment and deception detection from video, as well as hot-flash detection from wearable sensors. In the second part of the talk I will show how several extensions of the CA methods outperform state-of-the-art algorithms in problems such as temporal alignment of human behavior, temporal segmentation/clustering of human activities, joint segmentation and classification of human behavior, and facial feature detection in images. The talk will be adaptive, and I will discuss the topics of major interest to the audience.
Bio: Fernando De la Torre received his B.Sc. degree in Telecommunications (1994), M.Sc. (1996), and Ph.D. (2002) degrees in Electronic Engineering from La Salle School of Engineering in Ramon Llull University, Barcelona, Spain. In 1997 and 2000 he was an Assistant and Associate Professor in the Department of Communications and Signal Theory in Enginyeria La Salle. Since 2005 he has been a Research Assistant Professor in the Robotics Institute at Carnegie Mellon University. Dr. De la Torre's research interests include computer vision and machine learning, in particular face analysis, optimization and component analysis methods, and their applications to human sensing. Dr. De la Torre co-organized the first workshop on component analysis methods for modeling, classification and clustering problems in computer vision in conjunction with CVPR'07 and the workshop on human sensing from video jointly with CVPR'06. He has also given several tutorials at international conferences (ECCV'06, CVPR'06, ICME'07, ICPR'08) on the use and extensions of component analysis methods. Currently he leads the Component Analysis Lab and the Human Sensing Lab at CMU.
Learning Distinguishing Marks for Image Classification
Abstract: We tackle here the problem of multi-class image classification from few training examples, where only small parts of the image help discriminate between classes. Such problems arise when classifying images of objects/persons in the wild. In such settings, standard kernel-based classifiers perform well only when combined with strong prior knowledge and efficient discriminative part detectors. We propose here a convex sparsity-enforcing kernel-based method for this task, introducing a pool-L1 penalty which automatically singles out discriminant "distinguishing marks" to leverage classification performance. We report experimental results on a horses-in-the-wild dataset and on several benchmark datasets.
Issues in Event and Object Recognition
Abstract: I will first review some of the on-going projects in object/event recognition, scene analysis and reconstruction, reviewing in particular some of the open problems that we are currently investigating which might be of interest for this group to pursue with us. I'll spend more time on ideas in event/action recognition in videos. In this area, we are trying to investigate approaches that capture the temporal structure of the event models better than existing approaches inherited from BoW models (which lose too much structure) and spatio-temporal volumetric approaches (which are too rigid and don't generalize well). Our approach naturally uses fragments of trajectories to represent actions. The difficulty is to represent this information in a way that is usable for classification and that can be estimated from relatively limited training data. We'll discuss some preliminary ideas.
Light field photography, microscopy, and illumination
Abstract: The light field is a four-dimensional function representing radiance along rays as a function of position and direction in space. At Stanford we have built a number of devices for capturing light fields, including (1) an array of 128 synchronized video cameras, (2) a handheld camera in which a microlens array has been inserted between the main lens and sensor plane, and (3) a microscope in which a similar microlens array has been inserted at the intermediate image plane.
The third device permits us to capture light fields of microscopic biological (or man-made) objects in a single snapshot. From these light fields, we can generate perspective flyarounds using light field rendering, or 3D focal stacks using digital refocusing. Applying 3D deconvolution to these focal stacks, we can produce a set of cross sections, which can be visualized using volume rendering. By inserting a second microlens array and video projector into the microscope's illumination path, one can control the light field falling on a specimen, as well as record the light field leaving it.
In this talk I will describe a prototype system we have built that implements these ideas, and I will demonstrate three applications for it: microscope scatterometry - measuring reflectance as a function of incident and reflected angle, "designer illumination" - illuminating one part of a microscopic object while avoiding illuminating another, and correcting optical aberrations digitally - using the illumination system as a "guide star" and the recording system as a Shack-Hartmann sensor.
Learning query-dependent prefilters for scalable image retrieval
Abstract: I argue that very few of the current crop of object-recognition or search papers are designed to really scale to a billion images. Effectively, the only game in town is min-hash, which is proving hard to extend from near-duplicate search to similar-content search. In this talk, we describe an algorithm for similar-image search which is designed to be efficient for extremely large collections. For each query, a small response set is selected by a fast prefilter, after which a more accurate ranker may be applied to each image in the response set. We consider a class of prefilters comprising disjunctions of conjunctions ("ORs of ANDs") of Boolean features. Because "AND" filters can be implemented efficiently using skipped inverted files, these structures permit search in time proportional to the response set size. The prefilters are learned from training examples, and refined at query time to yield a response set of bounded size.
We cast prefiltering as an optimization problem: for each test query, select the OR-of-AND filter which maximizes training-set recall for an adjustable bound on response set size. Tests on object class recognition show that this relatively simple filter is nevertheless powerful enough to capture some semantic information.
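The OR-of-ANDs prefilter described above can be illustrated with a small sketch. This is not the speaker's implementation: the function names, the plain dictionary standing in for a skipped inverted file, and the greedy selection rule are all assumptions made for illustration. It only shows how a disjunction of conjunctions maps onto unions of intersected posting lists, and how a filter might be chosen to raise recall under a bound on response set size.

```python
def build_inverted_index(images):
    """Map each Boolean feature to the set of image ids that have it."""
    index = {}
    for image_id, features in images.items():
        for feature in features:
            index.setdefault(feature, set()).add(image_id)
    return index


def apply_filter(index, or_of_ands):
    """Evaluate an OR of ANDs: union of the intersected posting lists."""
    response = set()
    for conjunction in or_of_ands:
        postings = [index.get(feature, set()) for feature in conjunction]
        if postings:
            response |= set.intersection(*postings)
    return response


def select_filter(index, candidate_ands, relevant, max_response):
    """Greedy selection (an assumption, not the talk's optimizer): repeatedly
    add the AND term that gains the most relevant images while the response
    set stays within max_response."""
    chosen, response = [], set()
    while True:
        best = None
        for conjunction in candidate_ands:
            if conjunction in chosen:
                continue
            new_response = response | apply_filter(index, [conjunction])
            if len(new_response) > max_response:
                continue
            gain = len(new_response & relevant) - len(response & relevant)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, conjunction, new_response)
        if best is None:
            return chosen, response
        chosen.append(best[1])
        response = best[2]


# Hypothetical toy collection: image id -> set of Boolean features.
images = {
    "a": {"red", "round"},
    "b": {"red", "square"},
    "c": {"blue", "round"},
    "d": {"red", "round", "shiny"},
}
index = build_inverted_index(images)
chosen, response = select_filter(
    index,
    candidate_ands=[("red", "round"), ("blue",), ("red",)],
    relevant={"a", "c", "d"},
    max_response=3,
)
```

In this toy run the broad filter `("red",)` is never chosen, because adding it would push the response set past the bound of three images.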
Simultaneous pose Tracking and Action Recognition (STAR)
Abstract: Inferring human activity in a video is an important task in video content extraction; it is needed for applications such as monitoring and alerts and content-based indexing of videos and for human-computer interaction.
A standard approach to activity recognition is to first track objects, such as humans and vehicles, and then to infer actions from the tracks. Many actions can be inferred just from the knowledge of the positions of the objects of interest, but for finer activity differentiation, such as for gesture recognition, it is necessary to also infer the body poses (i.e. the limb positions and joint angles). However, human pose tracking, in a purely "bottom-up" fashion, from a single video stream, is an extremely difficult task and existing methods are slow and highly limited.
We propose that the task can be simplified by simultaneous computation of pose tracks and activities (STAR), in analogy with the SLAM approach commonly used in robot navigation. In this approach, pose inference is guided by the activity models whereas the activity inferences, in turn, depend on evidence for different poses in the images. Our experience is that such simultaneous processing allows for efficient computation and much more robust performance. However, it does come at the cost of being applicable only to a pre-defined set of basic actions, though these basic actions can be composed into arbitrarily complex sequences of actions. This talk will describe our recent work using this approach.
Geometry, Motion and Appearance Modeling in Multi-camera and Multi-lighting (MVML) Dome
Abstract: Images of an object captured at multiple time instants, under multiple lights, and from multiple viewpoints constitute the visual field of the object. How to fuse the available images for high-fidelity 3D reconstruction and other vision enhancement applications is crucial for computer vision and computer graphics research. This fusion problem covers many hot topics in state-of-the-art research, such as multi-view stereo, photometric stereo, image-based relighting, performance capture and animation, structure from motion, etc. To help understand the basic mechanisms behind this complicated fusion problem, a multi-camera and multi-lighting dome has been developed at Tsinghua University, China. The talk begins with this dome and covers the following topics:
Multi-camera and multi-lighting (MVML) dome: The MVML dome is composed of 40 cameras and 31 light sources. Such a dome can be regarded as a combination of 3D production stereo and light stage. The goals, the design and the future of this dome will be introduced.
Continuous depth map based multi-view stereo (CMVS): CMVS is a new multi-view stereo technique for 3D reconstruction from sparse multi-view images. CMVS is robust, and achieves the most accurate reconstruction results for geometry details such as wrinkles on cloth.
Multi-view photometric stereo (MPS) for free-viewpoint video: An MPS algorithm is designed for 3D reconstruction from the MVML datasets. Compared with the MVS algorithm using only the multi-view images captured under constant lights, MPS takes advantage of lighting information and improves reconstruction accuracy and robustness.
Future work: Finally, the talk will introduce some basic ideas for fusing the images captured in the MVML dome for high-quality motion and geometry reconstruction and appearance relighting.
Learning Hierarchies of Invariant Visual Features
Abstract: Intelligent tasks, such as visual perception, auditory perception, and language understanding require the construction of good internal representations of the world. Internal representations (or "features") must be invariant (or robust) to irrelevant variations of the input, but must preserve the information relevant to the task. An important goal of our research is to devise methods that can automatically learn good internal representations from labeled and unlabeled data. Results from theoretical analysis, and experimental evidence from visual neuroscience, suggest that the visual world is best represented by a multi-stage hierarchy, in which features in successive stages are increasingly global, invariant, and abstract. The main question is how can one train such deep architectures from unlabeled data and limited amounts of labeled data.
Several methods have recently been proposed to train deep architectures in an unsupervised fashion. Each layer of the deep architecture is composed of a feed-forward encoder which computes a feature vector from the input, and a feed-back decoder which reconstructs the input from the features. The training shapes an energy landscape with low valleys around the training samples and high plateaus everywhere else. A number of such layers can be stacked and trained sequentially. A particular class of methods for deep energy-based unsupervised learning will be described that imposes sparsity constraints on the features. When applied to natural image patches, the method produces hierarchies of filters similar to those found in the mammalian visual cortex. A simple modification of the sparsity criterion produces locally-invariant features with similar characteristics as hand-designed features, such as SIFT.
An application to category-level object recognition with invariance to pose and illumination will be described. By stacking multiple stages of sparse features, and refining the whole system with supervised training, state-of-the-art accuracy can be achieved on standard datasets with very few labeled samples. A real-time demo will be shown. Another application to vision-based navigation for off-road mobile robots will be shown. After a phase of off-line unsupervised learning, the system autonomously learns to discriminate obstacles from traversable areas at long range using labels produced with stereo vision for nearby areas.
This is joint work with Y-Lan Boureau, Karol Gregor, Raia Hadsell, Koray Kavukcuoglu, and Marc'Aurelio Ranzato.
Abstract: Sparse signal and image models were first developed for data compression but are now getting integrated at all stages of the data processing chain, including acquisition (compressed sensing) and interpretation (machine learning). The recent in-depth mathematical analysis of these models and the related algorithms has demonstrated their ability to provide concise descriptions of complex data collections, together with algorithms of bounded complexity and provable performance. Yet, to exploit these models for efficient data processing, a crucial assumption is needed: one must know a "dictionary" of atoms providing concise descriptions adapted to the data of interest. Besides off the shelf dictionaries (Fourier, wavelets, etc.), a promising approach consists in learning the dictionary from a corpus of training samples. However, this raises many difficulties due for example to the high dimensionality of the raw data, the limited availability of training data, and the presence of outliers in the corpus. Moreover, even though several empirically successful algorithms have been proposed, it remains a challenge to characterize the relevance and robustness of these in a mathematically founded framework.
In this lecture, I will propose elements of a theoretical framework to analyze dictionary learning. Many questions will be raised, a few will be answered. NB: this is joint work with Karin Schnass (EPFL).
Power Watersheds: A Unifying Graph Based Optimization Framework
Abstract: In this work, we extend a common framework for seeded image segmentation that includes the graph cuts, random walker, and shortest path optimization algorithms. Viewing an image as a weighted graph, these algorithms can be expressed by means of a common energy function with differing choices of a parameter q acting as an exponent on the differences between neighboring nodes. Introducing a new parameter p that fixes a power for the edge weights allows us to also include the optimal spanning forest algorithm for watersheds in this same framework. We then propose a new family of segmentation algorithms that fixes p to produce an optimal spanning forest but varies the power q beyond the usual watershed algorithm, which we term power watersheds. In particular when q=2, the power watershed leads to a unique global minimum obtained in practice in quasi-linear time. Placing the watershed algorithm in this energy minimization framework also opens new possibilities for using unary terms in traditional watershed segmentation and using watersheds to optimize more general models of use in applications beyond image segmentation.
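As a reading aid, the common energy described in this abstract can be sketched as follows. The notation is an assumption made for illustration and is not given in the abstract: x_i denotes the value assigned to node i, w_ij the weight of the edge between neighboring nodes i and j, and S the set of seeded nodes with prescribed labels s_i. Unary terms are omitted.

```latex
\min_{x} \; \sum_{e_{ij} \in E} w_{ij}^{\,p} \, \lvert x_i - x_j \rvert^{q}
\qquad \text{subject to } x_i = s_i \ \text{for all seeded nodes } i \in S
```

Under this reading, and with p finite, q = 1 corresponds to graph cuts, q = 2 to the random walker, and q → ∞ to shortest paths. Letting p → ∞ gives the optimal spanning forest (watershed) regime, and the power watershed family keeps p → ∞ while varying q.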
Learning to predict where people look
Abstract: For many applications in graphics, design, and human computer interaction, it is essential to understand where humans look in a scene. Where eye tracking devices are not a viable option, models of saliency can be used to predict fixation locations. Most saliency approaches are based on bottom-up computation that does not consider top-down image semantics and often does not match actual eye movements. To address this problem, we collected a large database of eye tracking data of 15 viewers on 1003 images and use this database as training and testing examples to learn a model of saliency based on low, middle and high-level image features. We have made the eye tracking database available at http://people.csail.mit.edu/tjudd/wherepeoplelook.html. This work was published in ICCV 2009. In addition I will be showing results from new experiments that follow up on this work.
Bio: Tilke Judd is a PhD student in computer graphics at MIT working with professors Fredo Durand and Antonio Torralba. She got her Bachelors in Math in 2003 and Masters in CS in 2007 from MIT and has studied abroad at Cambridge and Ecole Polytechnique. Her research work lies in non-photorealistic rendering, computational photography and image saliency. Outside of work she enjoys making short films and ballroom dancing.
On the geometry of some non-central projections
Abstract: Pushbroom, omnivergent, spherical and many other mosaics are constructed by acquiring a large number of images by a moving camera and then stitching some pixels from each image. The rays corresponding to individual pixels of the mosaics thus do not pass through one center of projection. Instead, they may be incident to, e.g., a line, a circle, or a sphere. Such mosaics can thus be viewed as images taken by a non-central camera, a camera that does not have a single projection center. The mosaics can be, and actually are, used to reconstruct scenes because they possess an important advantage over classical pinhole cameras: an almost complete view of the surroundings is seen from each viewpoint. However, the mosaic geometry and stereo geometry can be rather complicated, depending on the way the mosaic was acquired. I will review some results on geometry and stereo geometry of non-central cameras corresponding to mosaics and make links to the general geometrical concepts.
Beyond Mere Novelty Detection
Abstract: For novel class identification we propose to rely on the natural hierarchy of object classes, using a new approach to detect incongruent events. Here detection is based on the discrepancy between the responses of two different classifiers trained at different levels of generality: novelty is detected when the general level classifier accepts, and the specific level classifier rejects. Thus our approach is arguably more robust than traditional approaches to novelty detection, and more amenable to effective information transfer between known and new classes.
Specifically, we start from a given hierarchical tree organization of known objects, where siblings at lower levels of the tree (closer to the leaves) are more similar to each other than siblings at higher levels. Each non-leaf node is assigned two classifiers - one based on all training examples from the class (the general level classifier), and one based on the disjunction of discriminative classifiers from all descendant nodes (the specific level classifier). For each new sample the prediction of these two classifiers is compared, and whenever the general level classifier accepts while the specific level classifier rejects, the algorithm detects a novel object class. The new object is judged to be a member of the general class represented by the current node, but not a member of any known object classes. The algorithm's performance was evaluated on two sets of objects (two hierarchies), and two embedded recognition algorithms (one generative and one discriminative).
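The decision rule at a single non-leaf node can be sketched as below. This is a minimal illustration and not the authors' code: the function name, the three return labels, and the toy threshold classifiers on a one-dimensional feature are all hypothetical. Only the accept/reject logic follows the description above.

```python
def detect_novelty(sample, general_accepts, specific_classifiers):
    """Compare the general and specific level classifiers at one node.

    general_accepts: callable returning True if the sample belongs to the
        general class of this node.
    specific_classifiers: callables for the known descendant classes; the
        specific level classifier is their disjunction.
    """
    if not general_accepts(sample):
        return "rejected"  # not a member of this node's general class
    specific_accepts = any(clf(sample) for clf in specific_classifiers)
    # General accepts but specific rejects: a novel sub-class is detected.
    return "known" if specific_accepts else "novel"


# Hypothetical toy classifiers on a 1-D feature.
def general(x):
    return 0 <= x <= 10


known_subclasses = [lambda x: 0 <= x <= 3, lambda x: 7 <= x <= 10]

labels = [detect_novelty(x, general, known_subclasses) for x in (2, 5, 12)]
```

Here the sample 5 falls inside the general class but outside both known sub-classes, so it is the one flagged as novel.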
Solving Image Matching Problems Using Interior Point Methods
Abstract: The problem of finding correspondences between two images is central to many applications in computer vision. This talk will describe a new approach to tackling a variety of image matching problems using Linear Programming. The approach proceeds by constructing a piecewise-linear, convex approximation to the match score function associated with each of the pixels being matched. Regularization terms related to the first and second derivatives of the displacement function can also be modeled in this framework as convex functions. Once this has been done, the global image matching problem can be reformulated as a large scale Linear Program which can be solved using Interior Point methods. The resulting optimization problems are highly structured and efficient algorithms which exploit these regularities will be presented. The talk will describe applications of this approach to stereo matching and to recovering parametric image deformations that optimally register two frames.
Bio: Dr. Taylor received his A.B. degree in Electrical Computer and Systems Engineering from Harvard College in 1988 and his M.S. and Ph.D. degrees from Yale University in 1990 and 1994 respectively. Dr. Taylor was the Jamaica Scholar in 1984, a member of the Harvard chapter of Phi Beta Kappa and held a Harvard College Scholarship from 1986-1988. From 1994 to 1997 Dr. Taylor was a postdoctoral researcher and lecturer with the Department of Electrical Engineering and Computer Science at the University of California, Berkeley. He joined the faculty of the Computer and Information Science Department at the University of Pennsylvania in September 1997. He received an NSF CAREER award in 1998 and the Lindback Minority Junior Faculty Award in 2001. Dr. Taylor's research interests lie primarily in the fields of Computer Vision and Robotics and include: reconstruction of 3D models from images, vision-guided robot navigation and smart camera networks. Dr. Taylor has served as an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He has also served on numerous conference organizing committees and was a Program Chair of the 2006 edition of the IEEE Conference on Computer Vision and Pattern Recognition.
Organizing Visual Data
Abstract: What is an image? There is a growing trend in the last decade or so to treat images as a bag of patches. This can be seen in texture synthesis, object and action recognition and even image denoising. This approach has great success but it comes at a price. All geometric information is lost. In this talk I argue that much geometric information is implicitly encoded in the bag of patches, demonstrate how to recover it and show a number of potential image editing applications.
In particular, I show that reconstructing an image from a bag of patches is akin to solving a jigsaw puzzle subject to user constraints. When no constraints are given this defaults to solving a standard jigsaw puzzle. Image editing operations are mapped to various user constraints such as fixing the spatial location of some of the patches, the size of the target image or the pool of patches to use. We define terms in a Markov network to specify a good image reconstruction from patches: neighboring patches must fit to form a plausible image, and each patch should be used only once. We find an approximate solution to the Markov network using loopy belief propagation, introducing an approximation to handle the combinatorially difficult patch exclusion constraint. We show that this approach allows for various image editing operations such as object cut and paste, hole filling, image retargeting and object removal.
Research done with Taeg Sang Cho, Moshe Butman and Bill Freeman.
The Confluence of Sparse Representation and Computer Vision
Abstract: In the past few years, sparse representation and compressive sensing have arisen as a very powerful and popular framework for signal and image processing. It has armed people with new mathematical principles and computational tools that can effectively and efficiently harness sparse, low-dimensional structures of high-dimensional data such as images and videos. In this talk, we contend that the same principles and tools are equally important for analyzing the meaning and semantics of images and help solve many outstanding problems in computer vision.
As an example, we will focus on the recent success of sparse representation in human face recognition. On one hand, tools from sparse representation such as L1-minimization have seen great empirical success in enhancing the robustness of face recognition with occlusion, illumination change, and registration error, leading to striking recognition performance far exceeding human expectation or capability. On the other hand, the peculiar structures of face images have led to new mathematical discovery of remarkable properties of L1 minimization that far exceed the existing sparse representation theory.
We will also illustrate with many other examples in computer vision the importance of sparsity as a guiding principle for extracting and harnessing the structures of high-dimensional visual data. In return, we will see that overwhelming empirical evidence from those examples suggests that an even richer set of new mathematical results can be developed if we systematically extend the theory of sparse representation to clustering or classification of high-dimensional visual data. The confluence of sparse representation and computer vision is leading us to a brand new mathematical foundation for high-dimensional pattern analysis and recognition.
This is joint work with my former PhD students John Wright, Allen Yang, and Shankar Rao.
Bio: Yi Ma is an associate professor at the Electrical & Computer Engineering Department of the University of Illinois at Urbana-Champaign. He is currently on leave as research manager of the Visual Computing group at Microsoft Research Asia in Beijing. His research interests include computer vision, image processing, and systems theory. Yi Ma received two Bachelor's degrees in Automation and Applied Mathematics from Tsinghua University (Beijing, China) in 1995, a Master of Science degree in EECS in 1997, a Master of Arts degree in Mathematics in 2000, and a PhD degree in EECS in 2000, all from the University of California at Berkeley. Yi Ma received the David Marr Best Paper Prize at the International Conference on Computer Vision 1999 and the Longuet-Higgins Best Paper Prize at the European Conference on Computer Vision 2004. He also received the CAREER Award from the National Science Foundation in 2004 and the Young Investigator Award from the Office of Naval Research in 2005. He is an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. He is a senior member of IEEE and a member of ACM, SIAM, and ASEE.
Multi-People Tracking through Global Optimization
Abstract: Given three or four synchronized videos taken at eye level and from different angles, we show that we can effectively detect and track people, even when the only available data comes from the binary output of a simple blob detector and the number of present individuals is a priori unknown.
We start from occupancy probability estimates in a top view and rely on a generative model to yield probability images to be compared with the actual input images. We then refine the estimates so that the probability images match the binary input images as well as possible. Finally, having performed this computation independently at each time step, we use dynamic programming to accurately follow individuals across thousands of frames. Our algorithm yields metrically accurate trajectories for each one of them, in spite of very significant occlusions.
In short, we combine a mathematically well-founded generative model that works in each frame individually with a simple approach to global optimization. This yields excellent performance using very simple models that could be further improved.
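The dynamic programming step can be illustrated with a heavily simplified sketch. It assumes a single person, a one-dimensional row of ground locations, and a motion model that allows moving at most `max_step` cells per frame. None of these simplifications, nor the function name, come from the talk, which handles an unknown number of people in a top view. The sketch only shows how per-frame occupancy estimates, computed independently, can be linked into one consistent trajectory.

```python
import math


def track_one_person(occupancy, max_step=1):
    """Viterbi-style DP over per-frame occupancy probabilities.

    occupancy[t][k] is the estimated probability that location k is
    occupied at frame t. Returns the most probable location per frame.
    """
    n_locs = len(occupancy[0])
    eps = 1e-9  # avoids log(0) for empty locations
    score = [math.log(p + eps) for p in occupancy[0]]
    back = []  # back[t - 1][k] = best predecessor of location k at frame t
    for frame in occupancy[1:]:
        new_score, pointers = [], []
        for k in range(n_locs):
            lo, hi = max(0, k - max_step), min(n_locs, k + max_step + 1)
            prev = max(range(lo, hi), key=lambda j: score[j])
            new_score.append(score[prev] + math.log(frame[k] + eps))
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final location.
    k = max(range(n_locs), key=lambda j: score[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]


# Hypothetical occupancy maps: frame 2 has a strong spurious response at
# location 3, which a person at location 1 cannot reach in a single step.
occupancy = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.9],
    [0.0, 0.0, 0.9, 0.1],
]
path = track_one_person(occupancy)
```

Picking the per-frame maximum would jump to the spurious peak at location 3 in frame 2, whereas the global path stays physically plausible.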
Computation is the New Optics: Coded Imaging in Computational Photography
The principle of image formation has remained largely unchanged since the invention of photography: a lens focuses light rays from the scene onto a two-dimensional sensor that records this information directly into a picture. The final image is a simple copy of the optical image reaching the sensor and image quality enhancement is usually obtained through improvement in the optics. The emerging field of computational photography challenges this view and proposes to leverage computation between the optical images and the final picture to alleviate physical limitations, enable flexible post-capture editing, record new types of information such as depth, and enable novel visual experiences.
The addition of computation means that the optical image does not need to be similar to the final image, which greatly expands the possible imaging strategies. Computation can be more than simple post-processing that takes as input a traditionally formed image; it also deeply changes the rules of the game for the optical side of imaging. New optics must be designed together with the computation to optimize the whole imaging process. Until now, optics has been the key to enhancing our ability to view and image the world, but digital processing provides us with a new tool that vastly expands our ability to form and enhance images.
Differential Colour Structure: from edges to image statistics to visual categorisation
Abstract: Colour is an intriguing source of information in images. Many aspects of physics and human vision are inextricably linked in local image patches. This talk aims to shed light on the aspects of physics and the differential observation of colour images. A natural starting point is to investigate the physical formation process of images. Visual appearance depends on the illuminant position and intensity, the incidental viewpoint of the camera, material properties of the objects, and the composition of the scene resulting in occlusions and clutter. We follow the physics of light, as it is emitted from the source, reflected by the materials in the scene, and recorded by the camera (or, equally, by the human eye). By modeling the image observation by a linear diffusion process, we end up with a scale-space theory of colour image formation. This theory includes and explains many properties of vision, like receptive fields, colour edge detection, colour scale-space blurring, colour constancy, and last but not least colour SIFT descriptors. Further combining colour scale-space theory with natural image statistics yields a solid basis for compact local statistics of images, making it possible to capture gist-of-scene-like information.
We will show how colour information can be exploited in visual categorisation. So far, intensity-based descriptors have been widely used. To increase illumination invariance and discriminative power, colour descriptors have been proposed only recently. As many descriptors exist, a structured overview is required of colour invariant descriptors in the context of image categorisation. Therefore, we have studied 1. the invariance properties and 2. the distinctiveness of colour descriptors in a structured way. The invariance properties of colour descriptors are shown analytically using a taxonomy based on invariance properties with respect to photometric transformations. The distinctiveness of colour descriptors is assessed experimentally by means of the PASCAL Visual Object Classification challenge and the TRECvid video retrieval benchmark.