Alessandro Rudi

Researcher at INRIA - Sierra team 
DI - École Normale Supérieure  

NIPS 2017 ORAL Prova
Autori slides




  • Learning with SGD and Random Features

    Luigi Carratino, Alessandro Rudi, Lorenzo Rosasco NIPS 2018

    Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlight how different parameters, such as number of features, iterations, step-size and mini-batch size control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.

    pdf code slides video
      title={Learning with sgd and random features},
      author={Carratino, Luigi and Rudi, Alessandro and Rosasco, Lorenzo},
      booktitle={Advances in Neural Information Processing Systems},
  • Manifold Structured Prediction

    Alessandro Rudi, Carlo Ciliberto, Gian Maria Marconi, Lorenzo Rosasco NIPS 2018

    Structured prediction provides a general framework to deal with supervised problems where the outputs have semantically rich structure. While classical approaches consider finite, albeit potentially huge, output spaces, in this paper we discuss how structured prediction can be extended to a continuous scenario. Specifically, we study a structured prediction approach to manifold valued regression. We characterize a class of problems for which the considered approach is statistically consistent and study how geometric optimization can be used to compute the corresponding estimator. Promising experimental results on both simulated and real data complete our study.

    pdf code slides video
      title={Manifold Structured Prediction},
      author={Rudi, Alessandro and Ciliberto, Carlo and Marconi, GianMaria and Rosasco, Lorenzo},
      booktitle={Advances in Neural Information Processing Systems},

    structured-prediction manifold manifold-learning manifold-optimization learning kernel-methods

  • Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance

    Giulia Luise, Alessandro Rudi, Massimiliano Pontil, Carlo Ciliberto NIPS 2018

    Applications of optimal transport have recently gained remarkable attention thanks to the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation of the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In this work we characterize the differential properties of the original Sinkhorn distance, proving that it enjoys the same smoothness as its regularized version and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on one hand, high order smoothness confers statistical guarantees to learning with Wasserstein approximations. On the other hand, the gradient formula allows us to efficiently solve learning and optimization problems in practice. Promising preliminary experiments complement our analysis.

    pdf code slides video
       title = {Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance},
       author = {Luise, Giulia and Rudi, Alessandro and Pontil, Massimiliano and Ciliberto, Carlo},
       booktitle = {Advances in Neural Information Processing Systems 31},
       pages = {5864--5874},
       year = {2018},

    wasserstein optimal-transport smoothness barycenters gradient structured-prediction learning kernel-methods

  • On Fast Leverage Score Sampling and Optimal Learning

    Alessandro Rudi, Daniele Calandriello, Luigi Carratino, Lorenzo Rosasco NIPS 2018

    Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage scores sampling is a challenge in its own right and further approximations are typically needed. In this paper, we study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold. First we provide a novel algorithm for leverage score sampling. We provide theoretical guarantees as well as empirical results proving that the proposed algorithm is currently the fastest and most accurate solution to this problem. Second, we analyze the properties of the proposed method in a downstream supervised learning task. Combining several algorithmic ideas, we derive the fastest solver for kernel ridge regression and Gaussian process regression currently available. Also in this case, theoretical findings are corroborated by experimental results.

    pdf code slides video
      title={On fast leverage score sampling and optimal learning},
      author={Rudi, Alessandro and Calandriello, Daniele and Carratino, Luigi and Rosasco, Lorenzo},
      booktitle={Advances in Neural Information Processing Systems},

    falkon leverage scores nystrom statistical-machine-learning large-scale-machine-learning kernel-methods

  • Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

    Loucas Pillaud-Vivien, Alessandro Rudi, Francis Bach NIPS 2018

    We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

    pdf code slides video
       title = {Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes},
       author = {Pillaud-Vivien, Loucas and Rudi, Alessandro and Bach, Francis},
       booktitle = {Advances in Neural Information Processing Systems 31},
       pages = {8125--8135},
       year = {2018},

    statistical-machine-learning stochastic-gradient-descent SGD averaged-SGD optimal learning rates learning kernel-methods

  • Localized Structured Prediction

    Carlo Ciliberto, Francis Bach, Alessandro Rudi Arxiv 2018

    Key to structured prediction is exploiting the problem structure to simplify the learning process. A major challenge arises when data exhibit a local structure (e.g., are made by "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, has shown that capturing these aspects is indeed essential to achieve state-of-the-art performance. While such algorithms are typically derived on a case-by-case basis, in this work we propose the first theoretical framework to deal with part-based data from a general perspective. We derive a novel approach to deal with these problems and study its generalization properties within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.

    pdf code slides video
      title={Localized Structured Prediction},
      author={Ciliberto, Carlo and Bach, Francis and Rudi, Alessandro},
      journal={arXiv preprint arXiv:1806.02402},

    structured-prediction parts locality graphical-models random-fields learning kernel-methods

  • Approximating Hamiltonian dynamics with the Nyström method

    Alessandro Rudi, Leonard Wossnig, Carlo Ciliberto, Andrea Rocchetto, Massimiliano Pontil, Simone Severini Arxiv 2018

    Simulating the time-evolution of quantum mechanical systems is BQP-hard and expected to be one of the foremost applications of quantum computers. We consider the approximation of Hamiltonian dynamics using subsampling methods from randomized numerical linear algebra. We propose conditions for the efficient approximation of state vectors evolving under a given Hamiltonian. As an immediate application, we show that sample based quantum simulation, a type of evolution where the Hamiltonian is a density matrix, can be efficiently classically simulated under specific structural conditions. Our main technical contribution is a randomized algorithm for approximating Hermitian matrix exponentials. The proof leverages the Nyström method to obtain low-rank approximations of the Hamiltonian. We envisage that techniques from randomized linear algebra will bring further insights into the power of quantum computation.

    pdf code slides video
      title={Approximating Hamiltonian dynamics with the Nystr\"om method},
      author={Rudi, Alessandro and Wossnig, Leonard and Ciliberto, Carlo and Rocchetto, Andrea and Pontil, Massimiliano and Severini, Simone},
      journal={arXiv preprint arXiv:1804.02484},

    quantum-mechanics nystrom low-rank-approximation classical-simulation quantum-computing learning kernel-methods

  • Optimal Rates for Spectral-regularized Algorithms with Least-Squares Regression over Hilbert Spaces

    Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, Volkan Cevher ACHA 2018

    In this paper, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral-regularized algorithms, including ridge regression, principal component analysis, and gradient methods. We prove optimal, high-probability convergence results in terms of variants of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.

    pdf code slides video
      title={Optimal rates for spectral algorithms with least-squares regression over hilbert spaces},
      author={Lin, Junhong and Rudi, Alessandro and Rosasco, Lorenzo and Cevher, Volkan},
      journal={Applied and Computational Harmonic Analysis},

    statistical-machine-learning kernel-ridge-regression spectral-filters least-squares hilbert-spaces optimal-learning-rates learning kernel-methods

  • Exponential convergence of testing error for stochastic gradient methods

    Loucas Pillaud-Vivien, Alessandro Rudi, Francis Bach COLT 2018

    We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed.

    pdf code slides video
      title={Exponential convergence of testing error for stochastic gradient methods},
      author={Pillaud-Vivien, Loucas and Rudi, Alessandro and Bach, Francis},
      journal={arXiv preprint arXiv:1712.04755},

    statistical-machine-learning stochastic-gradient-descent SGD averaged-SGD optimal learning rates learning kernel-methods

  • Generalization Properties of Learning with Random Features

    Alessandro Rudi, Lorenzo Rosasco NIPS 2017

    We study the generalization properties of ridge regression with random features in the statistical learning framework. We show for the first time that O(1/sqrt(n)) learning bounds can be achieved with only O(n sqrt(logn)) random features rather than O(n) as suggested by previous results. Further, we prove faster learning rates and show that they might require more random features, unless they are sampled according to a possibly problem dependent distribution. Our results shed light on the statistical computational trade-offs in large scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while keeping optimal generalization properties.

    pdf code slides video
      title={Generalization properties of learning with random features},
      author={Rudi, Alessandro and Rosasco, Lorenzo},
      booktitle={Advances in Neural Information Processing Systems},

    random-features large-scale-machine-learning kernel-methods

  • FALKON: An Optimal Large Scale Kernel Method

    Alessandro Rudi, Luigi Carratino, Lorenzo Rosasco NIPS 2017

    Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that allows to efficiently process millions of points. FALKON is derived combining several algorithmic principles, namely stochastic projections, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially O(n) memory and O(n sqrt(n)) time. Extensive experiments show that state of the art results on available large scale datasets can be achieved even on a single machine.

    pdf code slides video
      title={FALKON: An optimal large scale kernel method},
      author={Rudi, Alessandro and Carratino, Luigi and Rosasco, Lorenzo},
      booktitle={Advances in Neural Information Processing Systems},

    nystrom statistical-machine-learning large-scale-machine-learning kernel-methods

  • Consistent Multitask Learning with Nonlinear Output Relations

    Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, Massimiliano Pontil NIPS 2017

    Key to multitask learning is exploiting relationships between different tasks to improve prediction performance. If the relations are linear, regularization approaches can be used successfully. However, in practice assuming the tasks to be linearly related might be restrictive, and allowing for nonlinear structures is a challenge. In this paper, we tackle this issue by casting the problem within the framework of structured prediction. Our main contribution is a novel algorithm for learning multiple tasks which are related by a system of nonlinear equations that their joint outputs need to satisfy. We show that the algorithm is consistent and can be efficiently implemented. Experimental results show the potential of the proposed method.

    pdf code slides video
      title={Consistent Multitask Learning with Nonlinear Output Relations},
      author={Ciliberto, Carlo and Rudi, Alessandro and Rosasco, Lorenzo and Pontil, Massimiliano},
      booktitle={Advances in Neural Information Processing Systems},

    structured-prediction multitask-learning non-linear-multitask learning kernel-methods

  • A Consistent Regularization Approach for Structured Prediction

    Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco NIPS 2016

    We propose and analyze a regularization approach for structured prediction problems. We characterize a large class of loss functions that allows to naturally embed structured outputs in a linear space. We exploit this fact to design learning algorithms using a surrogate loss approach and regularization techniques. We prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed method. Experimental results are provided to demonstrate the practical usefulness of the proposed approach.

    pdf code slides video
      title={A Consistent Regularization Approach for Structured Prediction},
      author={Ciliberto, Carlo and Rosasco, Lorenzo and Rudi, Alessandro},
      booktitle={Advances in Neural Information Processing Systems},

    relevance2 nips-2016 structured-prediction algorithm theory kernel-methods statistical-learning-theory consistency

  • NYTRO: When Subsampling Meets Early Stopping

    Tomas Angles, Raffaello Camoriano, Alessandro Rudi, Lorenzo Rosasco AISTATS 2016

    Early stopping is a well known approach to reduce the time complexity for performing training and model selection of large scale learning machines. On the other hand, memory/space (rather than time) complexity is the main constraint in many applications, and randomized subsampling techniques have been proposed to tackle this issue. In this paper we ask whether early stopping and subsampling ideas can be combined in a fruitful way. We consider the question in a least squares regression setting and propose a form of randomized iterative regularization based on early stopping and subsampling. In this context, we analyze the statistical and computational properties of the proposed method. Theoretical results are complemented and validated by a thorough experimental analysis.

    pdf code slides video
      title={NYTRO: When subsampling meets early stopping},
      author={Camoriano, Raffaello and Angles, Tom{\'a}s and Rudi, Alessandro and Rosasco, Lorenzo},
      booktitle={Artificial Intelligence and Statistics},

    relevance1 aistats-2016 early-stopping iterative-regularization gradient-descent nystrom large-scale-machine-learning kernel-methods

  • Less is More: Nyström Computational Regularization

    Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco NIPS 2015

    We study Nystrom type subsampling approaches to large scale kernel methods, and prove learning bounds in the statistical learning setting, where random sampling and high probability estimates are considered. In particular, we prove that these approaches can achieve optimal learning bounds, provided the subsampling level is suitably chosen. These results suggest a simple incremental variant of Nystr\"om Kernel Regularized Least Squares, where the subsampling level implements a form of computational regularization, in the sense that it controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state of the art performances on benchmark large scale datasets.

    pdf code slides video
    title = {Less is More: Nystr\"{o}m Computational Regularization},
    author = {Rudi, Alessandro and Camoriano, Raffaello and Rosasco, Lorenzo},
    booktitle = {Advances in Neural Information Processing Systems 28},
    editor = {C. Cortes and N. D. Lawrence and D. D. Lee and M. Sugiyama and R. Garnett},
    pages = {1657--1665},
    year = {2015},
    publisher = {Curran Associates, Inc.},
    url = {}

    large-scale-machine-learning big-data theory nystrom subsampling stochastic-projection kernel-methods statistical-learning-theory consistency

  • Learning Sets and Subspaces

    Alessandro Rudi, Guillermo D Canas, Ernesto De Vito, Lorenzo Rosasco Book Chapter 2014

    We consider here the classic problem of support estimation, or learning a set from random samples, and propose a natural but novel approach to address it. We do this by investigating its connection with a seemingly distinct problem, namely subspace learning.

    pdf code slides video
           title={Learning Sets and Subspaces}, 
           author={Rudi, Alessandro and Canas, G. D. and De Vito, Ernesto and Rosasco, Lorenzo}, 
           booktitle={Regularization, Optimization, Kernels, and Support Vector Machines}, 
           publisher={CRC Press} 

    relevance-0 book-chapter set-learning subspace-learning pca kernel-pca principal-component-analysis kernel-methods statistical-learning-theory consistency

  • Geometrical and Computational aspects of Spectral Support Estimation for Novelty Detection

    Alessandro Rudi, Francesca Odone, Ernesto De Vito Pattern Recognition Letters 2014

    In this paper we discuss the Spectral Support Estimation algorithm [1] by analyzing its geometrical and computational properties. The estimator is non-parametric and the model selection depends on three parameters whose role is clarified by simulations on a two-dimensional space. The performance of the algorithm for novelty detection is tested and compared with its main competitors on a collection of real benchmark datasets of different sizes and types.

    pdf code slides video
            title={On the sample complexity of subspace learning}, 
            author={Rudi, Alessandro and Canas, Guillermo D and Rosasco, Lorenzo}, 
            booktitle={Advances in Neural Information Processing Systems}, 

    set-learning pca kernel-pca principal-component-analysis kernel-methods statistical-learning-theory consistency

  • On the Sample Complexity of Subspace Learning

    Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco NIPS 2013

    A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods.

    pdf code slides video
            title={On the sample complexity of subspace learning}, 
            author={Rudi, Alessandro and Canas, Guillermo D and Rosasco, Lorenzo}, 
            booktitle={Advances in Neural Information Processing Systems}, 

    subspace-learning pca kernel-pca principal-component-analysis kernel-methods statistical-learning-theory consistency

  • Adaptive Optimization for Cross Validation

    Alessandro Rudi, Gabriele Chiusano, Alessandro Verri ESANN 2012

    The process of model selection and assessment aims at finding a subset of parameters that minimize the expected test error for a model related to a learning algorithm. Given a subset of tuning parameters, an exhaustive grid search is typically performed. In this paper an automatic algorithm for model selection and assessment is proposed. It adaptively learns the error function in the parameters space, making use of scale space theory and statistical learning theory in order to estimate a reduced number of models and, at the same time, to make them more reliable. Extensive experiments are performed on the MNIST dataset.

    pdf code slides video
      title={Adaptive Optimization for Cross Validation.},
      author={Rudi, Alessandro and Chiusano, Gabriele and Verri, Alessandro},

    automatic-model-selection kernel-methods scale-space-theory statistical-learning-theory consistency

  • A general method for the point of regard estimation in 3D space

    Fiora Pirri, Matia Pizzoli, Alessandro Rudi CVPR 2011

    A novel approach to 3D gaze estimation for wearable multi-camera devices is proposed and its effectiveness is demonstrated both theoretically and empirically. The proposed approach, firmly grounded on the geometry of the multiple views, introduces a calibration procedure that is efficient, accurate, highly innovative but also practical and easy. Thus, it can run online with little intervention from the user. The overall gaze estimation model is general, as no particular complex model of the human eye is assumed in this work. This is made possible by a novel approach, that can be sketched as follows: each eye is imaged by a camera; two conics are fitted to the imaged pupils and a calibration sequence, consisting in the subject gazing a known 3D point, while moving his/her head, provides information to 1) estimate the optical axis in 3D world; 2) compute the geometry of the multi-camera system; 3) estimate the Point of Regard in 3D world. The resultant model is being used effectively to study visual attention by means of gaze estimation experiments, involving people performing natural tasks in wide-field, unstructured scenarios.

    pdf code slides video
      title={A general method for the point of regard estimation in 3D space},
      author={Pirri, Fiora and Pizzoli, Matia and Rudi, Alessandro},
      booktitle={Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on},

    computer vision 3d-eyetracking

  • Linear solvability in the viewing graph

    Alessandro Rudi, Matia Pizzoli, Fiora Pirri ACCV 2011

    The Viewing Graph [1] represents several views linked by the corresponding fundamental matrices, estimated pairwise. Given a Viewing Graph, the tuples of consistent camera matrices form a family that we call the Solution Set. This paper provides a theoretical framework that formalizes different properties of the topology, linear solvability and number of solutions of multi-camera systems. We systematically characterize the topology of the Viewing Graph in terms of its solution set by means of the associated algebraic bilinear system. Based on this characterization, we provide conditions about the linearity and the number of solutions and define an inductively constructible set of topologies which admit a unique linear solution. Camera matrices can thus be retrieved efficiently and large viewing graphs can be handled in a recursive fashion. The results apply to problems such as the projective reconstruction from multiple views or the calibration of camera networks.

    pdf code slides video
      title={Linear solvability in the viewing graph},
      author={Rudi, Alessandro and Pizzoli, Matia and Pirri, Fiora},
      booktitle={Asian Conference on Computer Vision},

    computer-vision viewing-graphs fundamental-matrix projective-reconstruction
2 rue Simone Iff - Office C415
Paris, France 75012