Learning Theory from First Principles

Francis Bach

Mastère MASH 2020/2021


Mandatory registration

The class will be taught in French or English, depending on attendance (all slides and class notes are in English).


The goal of this class is to present old and recent results in learning theory for the most widely used learning architectures. This class is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of the algorithms used throughout the master's program.

A particular effort will be made to prove many results from first principles, while keeping the exposition as simple as possible. This naturally leads to a choice of key results that showcase, in simple but relevant instances, the important concepts in learning theory. Some general results will also be presented without proofs.

The class will be organized in 9 three-hour sessions, each devoted to a specific topic, except the last one, which is dedicated to recent learning theory results. See the tentative schedule below.

Prerequisites: We will prove results in class, so a good knowledge of undergraduate mathematics is important, as well as basic notions in probability. Having followed an introductory class on machine learning is beneficial.


Given the public health situation, all classes will be held online, with the following tentative plan. Detailed class notes will be made available two days before each class, and connection details the night before (I will use GoToMeeting).

Each student is expected to read the class notes before the class. During class, I will go over them, provide additional details, and answer questions. Classes will be held on Fridays between 8:30am and 11:30am.

Date Topics Class notes
September 18
Learning with infinite data (population setting)
-Decision theory (loss, risk, optimal predictors)
-Decomposition of excess risk into approximation and estimation errors
-No free lunch theorems
-Basic notions of concentration inequalities (McDiarmid, Hoeffding, Bernstein)
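To give a flavor of the (very) simple coding assignments, here is an illustrative numpy sketch (not from the class notes; the sample size, threshold, and distribution are my own choices) that checks Hoeffding's inequality numerically:

```python
import numpy as np

# Hoeffding's inequality for i.i.d. X_i in [0, 1]:
# P(|mean - E[X]| >= t) <= 2 exp(-2 n t^2).
rng = np.random.default_rng(0)
n, t, n_trials = 100, 0.1, 10_000

# X_i ~ Bernoulli(1/2), so E[X] = 1/2 and X_i is bounded in [0, 1].
means = rng.binomial(n, 0.5, size=n_trials) / n
empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)

print(f"empirical deviation probability: {empirical:.4f}")
print(f"Hoeffding bound:                 {hoeffding:.4f}")
```

The empirical deviation probability stays below the bound; the bound is not tight, which motivates the sharper inequalities (Bernstein) covered in class.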
September 25
Least-squares regression
-Guarantees in the fixed design setting (simple closed-form analysis)
-Guarantees in the random design setting
-Ridge regression: dimension-independent bounds
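As an illustration (not part of the official materials; dimensions, noise level, and the regularization parameter are arbitrary choices), the ridge estimator in the fixed design setting can be computed in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.1

# Fixed design X, noisy observations y = X theta* + noise.
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Ridge estimator in closed form: (X^T X / n + lam I)^{-1} X^T y / n.
theta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# As lam -> 0, this recovers the ordinary least-squares estimator.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(theta_hat - theta_star))
```

Regularization shrinks the estimator towards zero (its norm is never larger than that of the least-squares solution), trading a little bias for reduced variance.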

October 2
Classical risk decomposition
-Approximation error
-Convex surrogates
-Estimation error through covering numbers (basic example of ellipsoids)
-Modern tools (no proof): Rademacher complexity, Gaussian complexity + Slepian and Lipschitz results
-Minimax rates (at least one proof)
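As a small numerical illustration (not from the class notes; the finite class and all sizes are arbitrary), the empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo and compared to the classical finite-class (Massart) bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_funcs, n_draws = 50, 20, 2000

# A finite function class, represented by its values on n data points.
F = rng.uniform(-1, 1, (n_funcs, n))

# Empirical Rademacher complexity: E_sigma [ sup_f (1/n) sum_i sigma_i f(x_i) ],
# estimated by averaging over random sign vectors sigma in {-1, +1}^n.
sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
rademacher = np.mean(np.max(sigmas @ F.T / n, axis=1))

# Finite-class bound: R_n <= max_f ||f||_2 * sqrt(2 log |F|) / n.
massart = np.max(np.linalg.norm(F, axis=1)) * np.sqrt(2 * np.log(n_funcs)) / n
print(rademacher, massart)
```

The Monte Carlo estimate sits below the bound, as expected.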

October 16
Optimization for machine learning
-Gradient descent
-Stochastic gradient descent
-Generalization bounds through stochastic gradient descent
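As an illustrative sketch (not the official assignment; the problem sizes and iteration count are my own choices), gradient descent with step size 1/L on a smooth least-squares objective produces a monotonically decreasing loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def loss(theta):
    # Smooth convex objective f(theta) = ||y - X theta||^2 / (2n).
    return 0.5 * np.mean((y - X @ theta) ** 2)

# Step size 1/L, with L the smoothness constant (largest Hessian eigenvalue).
L = np.linalg.eigvalsh(X.T @ X / n).max()
theta = np.zeros(d)
losses = [loss(theta)]
for _ in range(200):
    theta -= (X.T @ (X @ theta - y) / n) / L   # gradient step
    losses.append(loss(theta))

print(losses[0], losses[-1])
```

Plotting `losses` against the iteration count is exactly the kind of convergence illustration the coding assignments ask for.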

October 23
Local averaging techniques
-Kernel density estimation
-Nadaraya-Watson estimators (simplest available proof, exhibiting the apparent curse of dimensionality)
-Decision trees and associated methods
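For illustration (not from the class materials; the kernel choice, bandwidth, and target function are arbitrary), a Nadaraya-Watson estimator in one dimension is a kernel-weighted local average:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Kernel-weighted local average with a Gaussian kernel."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    # Estimate at x: sum_i w_i(x) y_i / sum_i w_i(x).
    return weights @ y_train / weights.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

x_test = np.linspace(0.1, 0.9, 50)
y_hat = nadaraya_watson(x, y, x_test, bandwidth=0.05)
print(np.max(np.abs(y_hat - np.sin(2 * np.pi * x_test))))
```

The bandwidth controls the bias-variance trade-off studied in this session.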

October 30
Kernel methods
-Modern analysis of non-parametric techniques (simplest proof with results depending on s and d)
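As a minimal sketch (not from the class notes; the Gaussian kernel, its width, and the regularization parameter are arbitrary choices), kernel ridge regression reduces to a finite-dimensional linear system via the representer theorem:

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.2):
    # Gaussian (RBF) kernel matrix between rows of A and rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, (n, 1))
y = np.sign(X[:, 0]) * X[:, 0] ** 2 + 0.05 * rng.standard_normal(n)

# Representer theorem: the minimizer is f(x) = sum_i alpha_i k(x_i, x),
# with alpha = (K + n lam I)^{-1} y.
lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

X_test = np.linspace(-0.9, 0.9, 30)[:, None]
y_hat = rbf_kernel(X_test, X) @ alpha
print(np.max(np.abs(y_hat - np.sign(X_test[:, 0]) * X_test[:, 0] ** 2)))
```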

November 6
Model selection
-L0 penalty with AIC
-L1 penalty
-High-dimensional estimation
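As an illustrative sketch (not part of the official materials; the sparsity level, regularization parameter, and iteration count are arbitrary), the L1-penalized least-squares problem can be solved by iterative soft-thresholding (ISTA):

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of the L1 norm: shrinks each coordinate towards 0.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, k = 100, 50, 5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:k] = 1.0                      # sparse ground truth
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Minimize ||y - X theta||^2 / (2n) + lam * ||theta||_1.
lam = 0.05
L = np.linalg.eigvalsh(X.T @ X / n).max()
theta = np.zeros(d)
for _ in range(500):
    # ISTA: gradient step on the quadratic part, then soft-thresholding.
    theta = soft_threshold(theta - (X.T @ (X @ theta - y) / n) / L, lam / L)

print(np.sum(np.abs(theta) > 1e-6))       # number of selected variables
```

The soft-thresholding step sets many coordinates exactly to zero, which is how the L1 penalty performs variable selection in high dimension.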

November 13
Neural networks
-Approximation properties (simplest approximation result)
-Two layers
-Deep networks
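For illustration (not from the class notes; the hidden width, weight distribution, and target function are my own choices), a two-layer ReLU network with random hidden weights and a least-squares fit of the output layer already approximates a nonsmooth function well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer network with random hidden weights (random ReLU features):
# f(x) = sum_j c_j max(w_j x + b_j, 0); only the output layer is fit.
m = 200                                   # hidden units
w = rng.standard_normal(m)
b = rng.uniform(-1, 1, m)

def features(x):
    return np.maximum(np.outer(x, w) + b, 0.0)

x = np.linspace(-1, 1, 300)
target = np.abs(x)                        # a simple nonsmooth target

c, *_ = np.linalg.lstsq(features(x), target, rcond=None)
approx = features(x) @ c
print(np.max(np.abs(approx - target)))
```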

November 20
Special topics
-Generalization/optimization properties of infinitely wide neural networks
-Double descent


One written in-class exam, and (very) simple coding assignments (to illustrate convergence results), to be sent to learning.theory.first.principles@gmail.com.