Optimisation et Apprentissage Statistique

Francis Bach

INRIA - Ecole Normale Supérieure

Master M2 Mathématiques de l'aléatoire

(Université Paris-Sud), 2eme semestre, 2016/2017

The class will be taught in French or English, depending on attendance (all slides and class notes are in English).

Résumé

De nombreux problèmes d'estimation en apprentissage statistique sont formulés comme des problèmes d'optimisation. Ces formulations ont permis une separation entre l'analyse des performances de l'estimateur et le développement d'algorithmes de résolution du problème. Face aux grands volumes de données, une telle séparation n'est plus efficace et l'analyse doit mêler statistique et optimisation. Dans ce cours, seront présentés de manière unifiée les résultats statistiques classiques sur la M-estimation et l'optimisation convexe, en montrant les liens entre statistique et optimisation. Un accent sera mis sur l'analyse non-asymptotique de l'approximation stochastique et du gradient stochastique.

Références

Transparents (Preliminary version, will be updated before every class)

S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.

Dates et contenu des cours

Les cours ont lieu au département de Mathématiques de l'Université Paris-Sud (Orsay, Petit Amphi du Batiment 425), le jeudi de 14h à 17h (sauf le 02/02).

Dates	Sujets du cours	Notes de cours de l'annee 2015/2016
Jeudi 26/01 de 14h à 17h	Introduction au probleme generique d'apprentissage Classes de fonctions (convexes, lisses, fortement convexes) Analyse statistique par complexite de Rademacher	lecture1.pdf
Jeudi 02/02 de 14h à 17h Amphi G3, Bat. 450	Analyse des algorithmes deterministes d'optimisation (descente de gradient, Newton, etc.)	lecture2.pdf
Jeudi 23/02 de 14h à 17h Salle B13, Bat. 460	Approximation stochastique pour problemes convexes non-lisses	lecture4.pdf
Jeudi 09/03 de 14h à 17h	Approximation stochastique classique (analyse asymptotique)	lecture3.pdf
Jeudi 16/03 de 14h à 17h	Approximation stochastique pour problemes convexes lisses
Jeudi 23/03 de 14h à 17h	Minimisation de sommes finies
Jeudi 06/04 de 14h à 17h	Evaluation Orale - Room 225-227, Building 425

Assignments

The goal of these assignments is to implement the main algorithms seen in class, with any language of your choice (Python, Matlab or R are advised). Every assignment will be a few lines of code run on small-scale data. The only required outputs for each homework are a few (readable) curves and comments, to be sent as a single pdf file by email to fbachorsay2017@gmail.com with the subject [HWn], with n being the homework number (no acknowledgements will be sent back).

In case of any problems with loading the data, please contact the lecturer.

Homework 1 (Batch algorithms), due Thursday 23/02	The file data_orsay_2017.mat (matlab format, readable from Matlab and Python) contains training data and testing data for a binary classification problem with d=100, n=10 000 training observations, and n=100 000 testing observations. For the two loss functions (square, logistic), and two numbers of training observations (10 000 and 1000), plot training and testing errors (on the same plot) as a function of the number of iterations of gradient descent with constant step-size. This makes a total of 4 plots (2 loss functions x 2 numbers of observartions), to be put in a single pdf file, with two sentences (max) of comments for each plot.
Homework 2 (Nonsmooth stochastic algorithms), due Thursday 09/03	With the same data, and the full training data (10 000 observations), we consider logistic regression, and three plots for stochastic gradient descent, all with 20 passes over the data. For all 4 plots, compare averaged testing logistic losses (as a function of the number of iterations). Advice: only compute testing losses every 100 iterations. (a) No averaging vs. averaging: compare averaged SGD and plain SGD with step-sizes decreasing as 1/sqrt(n). Single plot + one sentence comment. (b) Strongly-convex vs. convex (averaged SGD): consider adding mu/2 times the squared norm to the objective (do not forget to add it to the gradient and the testing objective as well). Compare the step size in 1/(mu n) and the step size in 1/sqrt(n). For mu = 1e-1 and 1e-5. Two plots overall + two sentences of comments. (c) cycling vs. sampling with replacement: compare running SGD with multiple passes on the training data, comparing cycling with sampling with replacement. Single plot + one sentence comment.
Homework 3 (Smooth stochastic algorithms+finite data sets), due Thursday 06/04	We consider the same data, for logistic regression and least-squares regression with regularization by mu times the squared Euclidean norm (with mu=1e-1 and mu=1e-5). Compare constant-step-size stochastic gradient (sampling with replacement with 50 passes) and SAG. This makes overall: 2 (two lossess) x 2 (two mus) x 2 (two algorithms) = 8 plots

Validation

Le cours sera valide par (1) une lecture d'article avec restitution orale et (2) des implementations tres simples en Matlab/R/Python.

In order to obtain a grade for this class, students have to (1) read a research article with an oral presentation and a one-page report (or simpler, the slides themselves), as well as (2) do simple implementations in Matlab/R/Python.

Given the number of students, the session where students present research articles will be a poster session, where every student will prepare less than 8 slides (printing several A4 pages is perfectly fine) to (a) present the main ideas of the papers, (b) its relationship to the class, (c) potentially the main elements of proofs, and (d) if applicable a simple simulation showing the benefits of the proposed method.

The slides may be prepared in French or English. Note that in some cases it may be worth selecting a relevant subset of results to present.

There are two target audiences: (a) the lecturer (supposed to be an expert who already read the papers) and (b) other students (supposed to know the material seen in class, but typically not the papers). The poster should be adaptive to these two target audiences.

The one-page report (or the slides) (to be sent to fbachorsay2017@gmail.com within a week of the poster session) should reuse the same elements as presented during the poster session.

Here is a list of potential papers (if you have any suggestions, please let me know). Papers will be allocated in a first-come / first-serve basis by sending email to Francis Bach (two students may take the same paper).

A. Dieuleveut, F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. [pdf] [assigned to Samuel Doucet] [assigned to Mohammed Amine Alaoui]

Hongzhou Lin, Julien Mairal, Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. [pdf] [assigned to Louise Gahery]

M. Schmidt, N. Le Roux, F. Bach. Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization. [pdf] [assigned to Minh Quang Pham] [assigned to Nguyen Thanh Huy]

Shai Shalev-Shwartz, Nathan Srebro, Karthik Sridharan. Fast Rates for Regularized Objectives. [pdf] [assigned to Raphael Berdugo] [assigned to Chenjie Yang]

F. Orabona. Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning. [pdf] [assigned to Brahim Khalil Abid] [assigned to Vincent Divol]

Cameron Musco, Christopher Musco. Randomized Block Krylov Methods for Stronger and Faster Approximate Singular Value Decomposition. [pdf] [assigned to Ernesto Araya]

A. Defossez and F. Bach. Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. [pdf] [assigned to Simon Grah] [assigned to Younes Belkhayat Zougari]

G. Lan. An optimal method for stochastic composite optimization. [pdf] [assigned to Nicolas Brichler] [assigned to Marie Goga]

A. Dieuleveut, N. Flammarion, and F. Bach. Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression. [pdf] [assigned to Geoffrey Chinot] [assigned to Wei Jiang]

A. Defazio, F. Bach, S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. [pdf] [assigned to Hamza Cherkaoui] [assigned to Jyotsna Budideti]

Anatoli Juditsky, Arkadi Nemirovski. First Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods. [pdf] [assigned to Amin Selim]

Anatoli Juditsky, Arkadi Nemirovski. First Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem's Structure. [pdf]

Y. Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. [pdf] [assigned to Emile Chapuis] [assigned to Shan-Conrad Wolf]

S. Lacoste-Julien and M. Jaggi. On the Global Linear Convergence of Frank-Wolfe Optimization Variants. [pdf] [assigned to Mario Ynocente Castro]

Elad Hazan and Satyen Kale. Beyond the Regret Minimization Barrier: Optimal Algorithms for Stochastic Strongly-Convex Optimization. [pdf] [assigned to Loïc Arnault] [assigned to Hedi Hadiji]

Hamed Karimi, Julie Nutini, Mark Schmidt. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. [pdf] [assigned to Jacques de Catelan] [assigned to Benjamin Adjadj]

D. Scieur, V. Roulet, F. Bach, A. d'Aspremont. Integration Methods and Accelerated Optimization Algorithms. [pdf] [assigned to Raphaël Berthier] [assigned to Konstantinos Varelas]

N. Flammarion, F. Bach. Stochastic Composite Least-Squares Regression with convergence rate O(1/n). [pdf] [assigned to Xiaoling Zhu]