Learning Theory from First Principles

Francis Bach

Mastere MASH 2021/2022


Mandatory registration

The class will be taught in French or English, depending on attendance (all slides and class notes are in English).


The goal of this class is to present old and recent results in learning theory, for the most widely-used learning architectures. This class is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of algorithms used throughout the masters program.

A particular effort will be made to prove many results from first principles, while keeping the exposition as simple as possible. This will naturally lead to a choice of key results that show-case in simple but relevant instances the important concepts in learning theory. Some general results will also be presented without proofs.

The class will be organized in 9 three-hour sessions, each with a precise topic except the last one dedicated to recent learning theory results. See tentative schedule below.

Prerequisites: We will prove results in class so a good knowledge of undergraduate mathematics is important, as well as basic notions in probability. Having followed an introductory class on machine learning is beneficial.


All classes will be "in real life" at ENS (rue d'Ulm), on Friday between 9am and 12pm.

The class will follow the book in preparation (draft available here).

Each student will benefit more from the class is the corresponding sections are read before class.

Date Topics Book chapters
Figures to reproduce
October 8
Salle des Résistants
Learning with infinite data (population setting)
-Decision theory (loss, risk, optimal predictors)
-Decomposition of excess risk into approximation and estimation errors
-No free lunch theorems
-Basic notions of concentration inequalities (MacDiarmid, Hoeffding, Bernstein)
Chapter 2 

October 15
Salle Favard (46, rue d'Ulm)

Liner Least-squares regression
-Guarantees in the fixed design settings (simple in closed-form)
-Ridge regression: dimension independent bounds
-Guarantees in the random design settings
-Lower bound of performance
Chapter 3
Figures: 3.1, 3.2
November 5
Salle Dussane
Empirical risk minimization
-Convexification of the risk
-Risk decomposition
-Estimation error: finite number of hypotheses and covering numbers
-Rademacher complexity
-Penalized problems
Chapter 4
November 12
Salle Dussane
Optimization for machine learning
-Gradient descent
-Stochastic gradient descent
-Generalization bounds through stochastic gradient descent
Chapter 5
Figures: p. 90, p. 108
November 19
Salle Favard (46, rue d'Ulm)
Local averaging techniques
-Partition estimators
-Nadaraya-Watson estimators
-Universal consistency
Chapter 6
Figures: 6.2, 6.3, 6.4
November 26
Salle Favard (46, rue d'Ulm)
Kernel methods
-Kernels and representer theorems
-Analysis of well-specified models
-Sharp analysis of ridge regression
-Universal consistency
Chapter 7
December 3
Salle Dussane
Model selection
-L0 penalty
-L1 penalty
-High-dimensional estimation
Chapter 8
December 10
Salle Dussane
Neural networks
-Single hidden layer neural networks
- Estimation error
- Approximation properties and universality
Chapter 9
December 17
Emmy NOETHER (U / V), 45 rue d'Ulm


One written in-class exam, and (very) simple coding assignments (to illustrate convergence results, to be sent to learning.theory.first.principles@gmail.com). For all classes, the coding assignment is to reproduce the experiments shown in the book draf and send only the figures to the address above.