Loucas Pillaud-Vivien

 

Briefly

I am an assistant professor at Ecole des Ponts ParisTech and a researcher at CERMICS, in the Applied Probability team. Prior to this, I was a Courant Instructor/Flatiron Fellow, jointly at the Courant Institute of NYU and the Flatiron Institute, where I worked mainly with Joan Bruna. Before that, I was a postdoc in the Theory of Machine Learning group of Nicolas Flammarion at EPFL. I did my Ph.D. in the SIERRA team, under the supervision of Francis Bach and Alessandro Rudi, working on stochastic approximation for high-dimensional learning problems.

Contact

  • Physical address: 6 et 8 avenue Blaise Pascal, Cité Descartes, Champs-sur-Marne. [Cermics].

  • E-mail: loucas.pillaud-vivien [at] enpc [dot] fr

Research interests

My main research interests are centred on optimization, statistics and stochastic models. More precisely, here is a selection of research topics I am interested in:

  • Gradient flow for non-convex learning problems (as we barely understand them, why bother with discrete updates?)

  • Implicit bias of optimization algorithms for overparametrized architectures

  • Stochastic Differential Equations (and PDEs) and how they can model and help us understand machine learning problems

  • Stochastic approximations in Hilbert spaces

  • Kernel methods

Selected Publications

For a complete list of my publications, head over to my Google Scholar page. Do not worry, you'll find the same turtle photo there.

  • A. Bietti, J. Bruna, L. Pillaud-Vivien.
    On Learning Gaussian Multi-index Models with Gradient Flow. [arXiv:2310.19793], Submitted, 2023. [Show Abstract]

    We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated ‘saddle-to-saddle’ dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related planted problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
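
    Since the saddle timescales mentioned above are governed by the Hermite decomposition of the link function, here is a small, hypothetical numerical sketch of how such Hermite coefficients can be estimated by Monte Carlo with NumPy. The function hermite_coefficients and all parameter choices are mine, for illustration only; nothing here is taken from the paper.

        # Hypothetical sketch: Hermite coefficients c_k = E[f(Z) He_k(Z)] / k! of a link
        # function f under Z ~ N(0,1), with He_k the probabilists' Hermite polynomials.
        import numpy as np
        from numpy.polynomial.hermite_e import hermeval
        from math import factorial

        def hermite_coefficients(f, max_degree, n_samples=10**6, seed=0):
            z = np.random.default_rng(seed).standard_normal(n_samples)
            fz = f(z)
            coeffs = []
            for k in range(max_degree + 1):
                e_k = np.zeros(k + 1); e_k[k] = 1.0      # coefficient vector selecting He_k
                coeffs.append(np.mean(fz * hermeval(z, e_k)) / factorial(k))
            return np.array(coeffs)

        # Example: f(z) = z**3 decomposes as 3*He_1(z) + He_3(z).
        print(np.round(hermite_coefficients(lambda z: z**3, 5), 2))
        # approximately [0.  3.  0.  1.  0.  0.]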

  • M. Andriushchenko, A.V. Varre, L. Pillaud-Vivien, N. Flammarion.
    SGD with large step sizes learns sparse features. [arXiv:2210.05337], ICML, 2023. [Show Abstract]

    We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that biases it implicitly toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used: the regularization effect comes solely from the SGD dynamics influenced by the large step sizes schedule. Therefore, these observations unveil how, through the step size schedules, both gradient and noise drive together the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired from stochastic processes. This analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.
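
    A toy, hypothetical sketch of the mechanism (not the paper's experiments): single-sample SGD on a diagonal linear network w = u * v for an underdetermined sparse regression problem, run first with a large step size and then with a small one. All dimensions, step sizes and the parametrisation are illustrative choices of mine.

        # Hypothetical toy sketch: SGD with a large-then-small step-size schedule on a
        # diagonal linear network w = u * v; the late iterates tend to be sparse.
        import numpy as np

        rng = np.random.default_rng(0)
        n, d, k = 40, 100, 3                       # samples, dimension, target sparsity
        w_star = np.zeros(d); w_star[:k] = 1.0     # sparse ground-truth regressor
        X = rng.standard_normal((n, d))
        y = X @ w_star + 0.1 * rng.standard_normal(n)

        u = np.full(d, 0.1); v = np.full(d, 0.1)   # small, balanced initialisation

        def run_sgd(u, v, lr, n_steps):
            for _ in range(n_steps):
                i = rng.integers(n)                # single-sample stochastic gradient
                r = X[i] @ (u * v) - y[i]          # residual on the sampled point
                u, v = u - lr * r * X[i] * v, v - lr * r * X[i] * u
            return u, v

        u, v = run_sgd(u, v, lr=0.05, n_steps=20000)   # large steps: loss stabilisation + noise
        u, v = run_sgd(u, v, lr=0.005, n_steps=20000)  # small steps: settle on a sparse predictor
        w = u * v
        print("largest |w_j| off the true support:", np.max(np.abs(w[k:])))
        print("coefficients on the true support  :", np.round(w[:k], 2))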

  • L. Pillaud-Vivien, F. Bach.
    Kernelized Diffusion Maps. [arXiv:2302.06757], COLT, 2023. [Show Abstract]

    Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nyström subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.
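
    For reference, here is a hypothetical sketch of the classical graph-kernel diffusion maps construction referred to in the abstract (this is the local-average baseline, not the kernelized RKHS estimator proposed in the paper). The bandwidth and the toy data are arbitrary choices.

        # Hypothetical sketch of classical diffusion maps via a Gaussian graph kernel
        # (the local-average construction mentioned above, not the paper's estimator).
        import numpy as np

        def diffusion_maps(X, eps=0.5, n_components=2, t=1):
            sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
            W = np.exp(-sq_dists / eps)                      # Gaussian affinities
            P = W / W.sum(axis=1, keepdims=True)             # random-walk transition matrix
            evals, evecs = np.linalg.eig(P)
            order = np.argsort(-evals.real)                  # sort eigenvalues decreasingly
            evals, evecs = evals.real[order], evecs.real[:, order]
            # drop the trivial constant eigenvector (eigenvalue 1)
            return evecs[:, 1:1 + n_components] * evals[1:1 + n_components] ** t

        # Toy usage: a noisy circle; the first diffusion coordinates recover the angle.
        rng = np.random.default_rng(0)
        theta = rng.uniform(0.0, 2.0 * np.pi, 300)
        data = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))
        embedding = diffusion_maps(data, eps=0.1)
        print(embedding.shape)                               # (300, 2)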

  • E. Boursier, L. Pillaud-Vivien, N. Flammarion.
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. [arXiv:2206.00939], NeurIPS, 2022. [Show Abstract]

    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
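
    To make the setting concrete, here is a minimal, hypothetical simulation: an explicit Euler discretisation standing in for the gradient flow, applied to a one-hidden-layer ReLU network with small initialisation, squared loss and orthonormal inputs. All sizes and step choices are illustrative, not taken from the paper.

        # Hypothetical sketch: discretised gradient flow for a one-hidden-layer ReLU
        # network with squared loss, orthonormal inputs and small initialisation.
        import numpy as np

        rng = np.random.default_rng(3)
        d, m = 5, 50                                # input dimension, hidden width
        X = np.eye(d)                               # orthonormal inputs
        y = np.array([1.0, -0.5, 2.0, 0.0, -1.0])   # arbitrary targets

        scale = 1e-3                                # small initialisation scale
        W = scale * rng.standard_normal((m, d))     # hidden-layer weights
        a = scale * rng.standard_normal(m)          # output weights

        lr, n_steps = 1e-2, 200000                  # small step size mimicking the flow
        for _ in range(n_steps):
            H = np.maximum(X @ W.T, 0.0)            # hidden activations, shape (d, m)
            r = H @ a - y                           # residuals
            grad_a = H.T @ r / d
            grad_W = ((r[:, None] * (H > 0)) * a).T @ X / d
            a -= lr * grad_a
            W -= lr * grad_W

        print("final training loss:", 0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2))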

  • L. Pillaud-Vivien, J. Reygner, N. Flammarion.
    Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation. [arXiv:2206.09841], COLT, 2022. [Show Abstract]

    Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the role of the label noise in the training dynamics of a quadratically parametrised model through its continuous time version. We explicitly characterise the solution chosen by the stochastic flow and prove that it implicitly solves a Lasso program. To fully complete our analysis, we provide nonasymptotic convergence guarantees for the dynamics as well as conditions for support recovery. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and help explain the greater performances of stochastic dynamics as observed in practice.
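
    A minimal, hypothetical illustration of the statement above (not the paper's code; the exact parametrisation and scalings studied in the paper may differ): full-batch gradient descent with fresh label noise at every step, on the quadratic parametrisation w = u*u - v*v, tends to select a sparse, Lasso-like solution.

        # Hypothetical sketch: gradient descent with artificial label noise on the
        # quadratic parametrisation w = u*u - v*v of a linear model.
        import numpy as np

        rng = np.random.default_rng(1)
        n, d, k = 50, 100, 3                        # underdetermined sparse problem
        w_star = np.zeros(d); w_star[:k] = 1.0
        X = rng.standard_normal((n, d))
        y = X @ w_star

        u = np.full(d, 0.3); v = np.full(d, 0.3)    # w = u*u - v*v starts at zero
        lr, sigma, n_steps = 0.01, 0.5, 50000
        for _ in range(n_steps):
            noise = sigma * rng.standard_normal(n)  # fresh label noise at each step
            r = X @ (u * u - v * v) - (y + noise)   # noisy residuals
            g = X.T @ r / n                         # least-squares gradient in w
            u, v = u - lr * 2 * u * g, v + lr * 2 * v * g

        w = u * u - v * v
        print("coordinates with |w_j| > 0.1:", np.where(np.abs(w) > 0.1)[0])  # ideally the true support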

  • L. Pillaud-Vivien, A. Rudi, F. Bach.
    Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes. [arXiv:1805.10074], NeurIPS, 2018. [Show Abstract]

    We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
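
    As a toy, hypothetical illustration of the single-pass versus multi-pass comparison (not the paper's experiments, and the feature scaling below is only a crude stand-in for eigenvalue decay): averaged SGD on a least-squares problem, reusing the data for several epochs.

        # Hypothetical sketch: single-pass vs multi-pass averaged SGD for least squares.
        import numpy as np

        rng = np.random.default_rng(2)
        n, d = 500, 200
        scales = 1.0 / np.sqrt(np.arange(1, d + 1))       # crude decaying feature scales
        w_star = rng.standard_normal(d) / np.sqrt(d)
        X = rng.standard_normal((n, d)) * scales
        y = X @ w_star + 0.1 * rng.standard_normal(n)
        X_test = rng.standard_normal((2000, d)) * scales
        y_test = X_test @ w_star

        def test_error_after(n_passes, lr=0.1):
            w, w_avg, t = np.zeros(d), np.zeros(d), 0
            for _ in range(n_passes):
                for i in rng.permutation(n):              # one pass = one shuffled epoch
                    w -= lr * (X[i] @ w - y[i]) * X[i]
                    t += 1
                    w_avg += (w - w_avg) / t              # running average of the iterates
            return np.mean((X_test @ w_avg - y_test) ** 2)

        for p in (1, 5, 20):
            print(f"{p:>2} pass(es): test error = {test_error_after(p):.4f}")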

PhD Thesis

I defended my thesis in October 2020.

You can download the final version of the manuscript via this link [Thesis].

You can also have a look at the slides [Slides].

Some Presentations

  • Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation
    [Slides]. COLT. July 2022.

  • Some results on the role of stochasticity in learning algorithms
    [Slides]. BIRS workshop, Banff. May 2022.

  • Some results on the role of stochasticity in learning algorithms
    [Slides]. One World ML Seminar. March 2022.

  • Some results on the role of stochasticity in learning algorithms
    [Slides]. NYU-Flatiron. January 2022.

  • Model order reduction by spectral gap optimization
    [Slides]. CEMRACS. August 2021.

  • Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
    [Slides]. Theory of Deep Learning, EPFL. June 2021.

  • Two results on Stochastic gradient descent in Hilbert spaces for Machine Learning problems
    [Slides]. Cermics seminar, Ecole des Ponts. November 2019.

  • Statistical Estimation of the Poincaré constant and Application to Sampling Multimodal Distributions.
    [Slides]. Sierra group seminar, INRIA. April 2019.

  • Statistical Optimality of Stochastic Gradient Descent through Multiple Passes.
    [Slides, Poster]. Optimization and Statistical Learning, Workshop in Les Houches. March 2019.

  • Langevin dynamics and applications to Machine Learning.
    [Slides]. Sierra group seminar, INRIA. February 2019.

  • Comparing Dynamics: Deep Neural Networks versus Glassy systems.
    [Slides]. Statistical Machine Learning in Paris (SMILE seminar). December 2018.

  • Statistical Optimality of Stochastic Gradient Descent through Multiple Passes.
    [Slides, Poster]. Advances in Neural Information Processing Systems (NeurIPS). December 2018.

  • Exponential convergence of testing error for stochastic gradient methods.
    [Slides, Video, Poster]. Conference on Learning Theory (COLT). July 2018.

Reviewing

Reviewer for Journals:

  • Annals of Statistics

  • Journal of Machine Learning Research

  • Applied and Computational Harmonic Analysis

  • Machine Learning: Science and Technology (IOP Science)

Reviewer for Conferences:

  • International Conference on Learning Representations (ICLR 2021)

  • Advances in Neural Information Processing Systems (NeurIPS 2019-20-21)

  • International Conference on Machine Learning (ICML 2020-21)

  • Algorithmic Learning Theory (ALT 2020)