Researcher at INRIA - Sierra team
DI - École Normale Supérieure
NIPS 2017 ORAL Prova
slides
Abstract
bibtex
We study the generalization properties of ridge regression with random features in the statistical learning framework. We show for the first time that O(1/sqrt(n)) learning bounds can be achieved with only O(n sqrt(logn)) random features rather than O(n) as suggested by previous results. Further, we prove faster learning rates and show that they might require more random features, unless they are sampled according to a possibly problem dependent distribution. Our results shed light on the statistical computational trade-offs in large scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while keeping optimal generalization properties.
pdf code slides video@article{rudi2016generalization, title={Generalization properties of learning with random features}, author={Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo}, journal={arXiv preprint arXiv:1602.04474}, year={2016} }
Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that allows to efficiently process millions of points. FALKON is derived combining several algorithmic principles, namely stochastic projections, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially O(n) memory and O(n sqrt(n)) time. Extensive experiments show that state of the art results on available large scale datasets can be achieved even on a single machine.
pdf code slides video@article{rudi2017falkon, title={FALKON: An Optimal Large Scale Kernel Method}, author={Rudi, Alessandro and Carratino, Luigi and Rosasco, Lorenzo}, journal={arXiv preprint arXiv:1705.10958}, year={2017} }
Key to multitask learning is exploiting relationships between different tasks to improve prediction performance. If the relations are linear, regularization approaches can be used successfully. However, in practice assuming the tasks to be linearly related might be restrictive, and allowing for nonlinear structures is a challenge. In this paper, we tackle this issue by casting the problem within the framework of structured prediction. Our main contribution is a novel algorithm for learning multiple tasks which are related by a system of nonlinear equations that their joint outputs need to satisfy. We show that the algorithm is consistent and can be efficiently implemented. Experimental results show the potential of the proposed method.
pdf code slides video@article{ciliberto2017consistent, title={Consistent Multitask Learning with Nonlinear Output Relations}, author={Ciliberto, Carlo and Rudi, Alessandro and Rosasco, Lorenzo and Pontil, Massimiliano}, journal={arXiv preprint arXiv:1705.08118}, year={2017} }
Early stopping is a well known approach to reduce the time complexity for performing training and model selection of large scale learning machines. On the other hand, memory/space (rather than time) complexity is the main constraint in many applications, and randomized subsampling techniques have been proposed to tackle this issue. In this paper we ask whether early stopping and subsampling ideas can be combined in a fruitful way. We consider the question in a least squares regression setting and propose a form of randomized iterative regularization based on early stopping and subsampling. In this context, we analyze the statistical and computational properties of the proposed method. Theoretical results are complemented and validated by a thorough experimental analysis.
pdf code slides video@inproceedings{camoriano2016nytro, title={NYTRO: When subsampling meets early stopping}, author={Camoriano, Raffaello and Angles, Tom{\'a}s and Rudi, Alessandro and Rosasco, Lorenzo}, booktitle={Artificial Intelligence and Statistics}, pages={1403--1411}, year={2016} }
We propose and analyze a regularization approach for structured prediction problems. We characterize a large class of loss functions that allows to naturally embed structured outputs in a linear space. We exploit this fact to design learning algorithms using a surrogate loss approach and regularization techniques. We prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed method. Experimental results are provided to demonstrate the practical usefulness of the proposed approach.
pdf code slides video@inproceedings{ciliberto2016consistent, title={A Consistent Regularization Approach for Structured Prediction}, author={Ciliberto, Carlo and Rosasco, Lorenzo and Rudi, Alessandro}, booktitle={Advances in Neural Information Processing Systems}, pages={4412--4420}, year={2016} }
We study Nystrom type subsampling approaches to large scale kernel methods, and prove learning bounds in the statistical learning setting, where random sampling and high probability estimates are considered. In particular, we prove that these approaches can achieve optimal learning bounds, provided the subsampling level is suitably chosen. These results suggest a simple incremental variant of Nystr\"om Kernel Regularized Least Squares, where the subsampling level implements a form of computational regularization, in the sense that it controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state of the art performances on benchmark large scale datasets.
pdf code slides video@incollection{rudi2015less, title = {Less is More: Nystr\"{o}m Computational Regularization}, author = {Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo}, booktitle = {Advances in Neural Information Processing Systems 28}, editor = {C. Cortes and N. D. Lawrence and D. D. Lee and M. Sugiyama and R. Garnett}, pages = {1657--1665}, year = {2015}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization.pdf} }
We consider here the classic problem of support estimation, or learning a set from random samples, and propose a natural but novel approach to address it. We do this by investigating its connection with a seemingly distinct problem, namely subspace learning.
pdf code slides video@incollection{rudi2014learning, title={Learning Sets and Subspaces}, author={Rudi, Alessandro and Canas, G. D. and De Vito, Ernesto and Rosasco, Lorenzo}, booktitle={Regularization, Optimization, Kernels, and Support Vector Machines}, pages={337--357}, year={2014}, publisher={CRC Press} }
In this paper we discuss the Spectral Support Estimation algorithm [1] by analyzing its geometrical and computational properties. The estimator is non-parametric and the model selection depends on three parameters whose role is clarified by simulations on a two-dimensional space. The performance of the algorithm for novelty detection is tested and compared with its main competitors on a collection of real benchmark datasets of different sizes and types.
pdf code slides video@inproceedings{rudi2013sample, title={On the sample complexity of subspace learning}, author={Rudi, Alessandro and Canas, Guillermo D and Rosasco, Lorenzo}, booktitle={Advances in Neural Information Processing Systems}, pages={2067--2075}, year={2013} }
A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods.
pdf code slides video@inproceedings{rudi2013sample, title={On the sample complexity of subspace learning}, author={Rudi, Alessandro and Canas, Guillermo D and Rosasco, Lorenzo}, booktitle={Advances in Neural Information Processing Systems}, pages={2067--2075}, year={2013} }
The process of model selection and assessment aims at finding a subset of parameters that minimize the expected test error for a model related to a learning algorithm. Given a subset of tuning parameters, an exhaustive grid search is typically performed. In this paper an automatic algorithm for model selection and assessment is proposed. It adaptively learns the error function in the parameters space, making use of scale space theory and statistical learning theory in order to estimate a reduced number of models and, at the same time, to make them more reliable. Extensive experiments are performed on the MNIST dataset.
pdf code slides video@inproceedings{rudi2012adaptive, title={Adaptive Optimization for Cross Validation.}, author={Rudi, Alessandro and Chiusano, Gabriele and Verri, Alessandro}, booktitle={ESANN}, year={2012} }
A novel approach to 3D gaze estimation for wearable multi-camera devices is proposed and its effectiveness is demonstrated both theoretically and empirically. The proposed approach, firmly grounded on the geometry of the multiple views, introduces a calibration procedure that is efficient, accurate, highly innovative but also practical and easy. Thus, it can run online with little intervention from the user. The overall gaze estimation model is general, as no particular complex model of the human eye is assumed in this work. This is made possible by a novel approach, that can be sketched as follows: each eye is imaged by a camera; two conics are fitted to the imaged pupils and a calibration sequence, consisting in the subject gazing a known 3D point, while moving his/her head, provides information to 1) estimate the optical axis in 3D world; 2) compute the geometry of the multi-camera system; 3) estimate the Point of Regard in 3D world. The resultant model is being used effectively to study visual attention by means of gaze estimation experiments, involving people performing natural tasks in wide-field, unstructured scenarios.
pdf code slides video@inproceedings{pirri2011general, title={A general method for the point of regard estimation in 3D space}, author={Pirri, Fiora and Pizzoli, Matia and Rudi, Alessandro}, booktitle={Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on}, pages={921--928}, year={2011}, organization={IEEE} }
The Viewing Graph [1] represents several views linked by the corresponding fundamental matrices, estimated pairwise. Given a Viewing Graph, the tuples of consistent camera matrices form a family that we call the Solution Set. This paper provides a theoretical framework that formalizes different properties of the topology, linear solvability and number of solutions of multi-camera systems. We systematically characterize the topology of the Viewing Graph in terms of its solution set by means of the associated algebraic bilinear system. Based on this characterization, we provide conditions about the linearity and the number of solutions and define an inductively constructible set of topologies which admit a unique linear solution. Camera matrices can thus be retrieved efficiently and large viewing graphs can be handled in a recursive fashion. The results apply to problems such as the projective reconstruction from multiple views or the calibration of camera networks.
pdf code slides video@inproceedings{rudi2010linear, title={Linear solvability in the viewing graph}, author={Rudi, Alessandro and Pizzoli, Matia and Pirri, Fiora}, booktitle={Asian Conference on Computer Vision}, pages={369--381}, year={2010}, organization={Springer} }