{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient Descent Methods\n", "\n", "### Author: Aymeric DIEULEVEUT\n", "\n", "### Read the following two cells before you start clicking everywhere !! :D\n", "\n", "The aim of this material is to implement and compare several optimization methods for Least Squares Regression and Logistic Regression. We focus on the empirical risk minimization problem.\n", "\n", "To do so we will:\n", "\n", "| Part | Objective | What you have to code |\n", "| :- |:- | :- |\n", "| I. | Simulate a linear and logistic model | nothing !|\n", "| II. | Create two classes: ModelLinReg and ModelLogisticReg,| Loss and gradients for the logistic loss |\n", "|| this will allow us to access the gradients and Lipshitz/Smoothness constants of the loss functions|\n", "|III. | Code some methods listed afterwards: GD, SGD, AGD, SVRG, etc. | The update steps|\n", "| |only the most crucial part -- the update step -- has to be implemented|\n", "|IV. | Compare the methods of your choice|nearly nothing: choose your methods |\n", "\n", "\n", "\n", "##### Main methods \n", "- gradient descent (GD)\n", "- accelerated gradient descent (AGD)\n", "- coordinate gradient descent (CD)\n", "- stochastic gradient descent (SGD)\n", "- stochastic variance reduced gradient descent (SVRG)\n", "- optimization methods for Deep Learning\n", "\n", "### Remarks\n", "##### Most of the content is elementary, but serves multiple objectives ...\n", "- hands on on the optimization methods you have encountered\n", "- compare optimization methods (stochastic vs non-stochastic), impact of Variance reduction\n", "- what can we read on convergence curves: deduct convergence speed from the learning curve\n", "- how to obtain a meaningful optimization curve\n", "\n", "##### and can be extended in **many** directions\n", "- framework to play with hyper-parameters \n", "- framework to implement other methods\n", "- back to test error: what happens?\n", "- adapative methods, non uniform sampling in stochastic, federated learning methods, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How this lab works\n", "\n", "\n", "## Organisation \n", "- Most of the content is already implemented (obviously)\n", "- You have to complete the code when there is a blank with\n", "```` \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", "````\n", "- A corrected version will be put online at the end \n", "\n", "## Interactive tool\n", "To follow your progress, adapt correction spped, and provide better feedback, we have developped a tool with Evan Courdier at EPFL:\n", "- At the beginning of the lab, when running the first cell, you will be able to enter your name\n", "- Afterwards, each time a ```send``` function appears, the first argument of the send function (typically the code of your function, a value of an array, a plot, etc), is send to a server, which allows me to follwo your progress, correct some mistakes, etc.\n", "- This is totally transparent for you. The code is OS, all results are erased afterwards. You can use a pseudo if you want (try to make it unique). \n", "\n", "\n", "\n", " If it works: you will get a message after each send. Thank you for helping me give a more interactive (hopefully better) lab!\n", " If it does not, or if really do not want to use the tool Do not try debugging it :D. 
Just redefine the send function as: ```def send(a,b): return()```\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of contents\n", "\n", "[1. Introduction](#intro)
\n", "\n", "[2. Models gradients and losses](#models)
\n", "\n", "[2.1 Linear regression](#models_regression)
\n", "[2.2 Check for Linear regression](#models_regression_check)
\n", "[2.3 Logistic regression](#models_logistic)
\n", "[2.4 Check for logistic regression](#models_logistic_check)
\n", "\n", "\n", "[3. Solvers](#solvers)
\n", "\n", "[3.1 Tools for solvers](#tools)
\n", "[3.2 Gradient descent](#gd)
\n", "[3.3 Accelerated Gradient descent](#agd)
\n", "[3.4 Coordinate Gradient descent](#cgd)
\n", "[3.5 Stochastic Gradient descent](#sgd)
\n", "[3.6 Stochastic Average Gradient descent](#sag)
\n", "[3.7 Stochastic Variance Reduced Gradient descent](#svrg)
\n", "[3.8 Adagrad](#adagrad)
\n", "[3.9 RMSProp](#rmsprop)
\n", "[3.10 AdaDelta](#adadelta)
\n", "[3.11 Adam](#adam)
\n", "[3.12 Adamax](#adamax)
\n", "\n", "[4. Comparison of all algorithms](#comparison)
\n", "\n", "\n", "# 1. Introduction\n", "\n", "## 1.1. Getting model weights\n", "\n", "We'll start by generating sparse vectors and simulating data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "\n", "np.set_printoptions(precision=2) # to have simpler print outputs with numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2. Simulation of a linear model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numpy.random import multivariate_normal\n", "from scipy.linalg.special_matrices import toeplitz\n", "from numpy.random import randn\n", "\n", "import requests\n", "exec(requests.get(\"https://courdier.pythonanywhere.com/get-send-code\").content)\n", "npt_config = {\n", " 'session_name': 'Optimization-Gradient-Methods',\n", " 'session_owner': 'aymeric',\n", " 'sender_name': input(\"Your name:\"),\n", "}\n", "send('start', 0) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def simu_linreg(w0, n_samples=1000, corr=0.5, std=0.5):\n", " \"\"\"Simulation of a linear regression model with Gaussian features\n", " and a Toeplitz covariance, with Gaussian noise.\n", " \n", " Parameters\n", " ----------\n", " w0 : `numpy.array`, shape=(n_features,)\n", " Model weights\n", " \n", " n_samples : `int`, default=1000\n", " Number of samples to simulate\n", " \n", " corr : `float`, default=0.5\n", " Correlation of the features\n", " \n", " std : `float`, default=0.5\n", " Standard deviation of the noise\n", "\n", " Returns\n", " -------\n", " X : `numpy.ndarray`, shape=(n_samples, n_features)\n", " Simulated features matrix. It contains samples of a centered \n", " Gaussian vector with Toeplitz covariance.\n", " \n", " y : `numpy.array`, shape=(n_samples,)\n", " Simulated labels\n", " \"\"\"\n", " n_features = w0.shape[0]\n", " # Construction of a covariance matrix\n", " cov = toeplitz(corr ** np.arange(0, n_features))\n", " # Simulation of features\n", " X = multivariate_normal(np.zeros(n_features), cov, size=n_samples)\n", " # Simulation of the labels\n", " y = X.dot(w0) + std * randn(n_samples)\n", " return X, y" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "n_samples = 500\n", "w0 = np.array([0.5])\n", "\n", "X, y = simu_linreg(w0, n_samples=n_samples, corr=0.3, std=0.5)\n", "plt.scatter(X, y)\n", "plt.xlabel(r\"$x_i$\", fontsize=16)\n", "plt.ylabel(r\"$y_i$\", fontsize=16)\n", "plt.title(\"Linear regression simulation\", fontsize=18)\n", "plt.scatter(X, y, label='data')\n", "plt.legend()\n", "send(plt, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3. Simulation of a logistic regression model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def sigmoid(t):\n", " \"\"\"Sigmoid function (overflow-proof)\"\"\"\n", " idx = t > 0\n", " out = np.empty(t.size) \n", " out[idx] = 1 / (1. + np.exp(-t[idx]))\n", " exp_t = np.exp(t[~idx])\n", " out[~idx] = exp_t / (1. 
+ exp_t)\n", " return out\n", "\n", "def simu_logreg(w0, n_samples=1000, corr=0.5):\n", " \"\"\"Simulation of a logistic regression model with Gaussian features\n", " and a Toeplitz covariance.\n", " \n", " Parameters\n", " ----------\n", " w0 : `numpy.array`, shape=(n_features,)\n", " Model weights\n", " \n", " n_samples : `int`, default=1000\n", " Number of samples to simulate\n", " \n", " corr : `float`, default=0.5\n", " Correlation of the features\n", "\n", " Returns\n", " -------\n", " X : `numpy.ndarray`, shape=(n_samples, n_features)\n", " Simulated features matrix. It contains samples of a centered \n", " Gaussian vector with Toeplitz covariance.\n", " \n", " y : `numpy.array`, shape=(n_samples,)\n", " Simulated labels\n", " \"\"\"\n", " n_features = w0.shape[0]\n", " cov = toeplitz(corr ** np.arange(0, n_features))\n", " X = multivariate_normal(np.zeros(n_features), cov, size=n_samples)\n", " p = sigmoid(X.dot(w0))\n", " y = np.random.binomial(1, p, size=n_samples)\n", " # Put the label in {-1, 1}\n", " y[:] = 2 * y - 1\n", " return X, y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "u = np.array([1,2,3])\n", "sigmoid(u)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_samples = 500\n", "w0 = np.array([-3, 3.])\n", "\n", "X, y = simu_logreg(w0, n_samples=n_samples, corr=0.4)\n", "\n", "plt.scatter(*X[y == 1].T, color='b', s=10, label=r'$y_i=1$')\n", "plt.scatter(*X[y == -1].T, color='r', s=10, label=r'$y_i=-1$')\n", "plt.legend(loc='upper left')\n", "plt.xlabel(r\"$x_i^1$\", fontsize=16)\n", "plt.ylabel(r\"$x_i^2$\", fontsize=16)\n", "plt.title(\"Logistic regression simulation\", fontsize=18)\n", "send(plt, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 2. 
Models, gradients and losses\n", "\n", "We want to minimize a goodness-of-fit function $f$ with ridge regularization, namely to solve\n", "$$\n", "\\arg\\min_{w \\in \\mathbb R^d} f(w),\n", "$$\n", "where $d$ is the number of features, the ridge penalty is included in $f$, and we will assume that $f$ is $L$-smooth.\n", "We will consider the following two cases.\n", "\n", "**Linear regression**, where \n", "$$\n", "f(w) = \\frac 1n \\sum_{i=1}^n f_i(w) = \\frac{1}{2n} \\sum_{i=1}^n (y_i - x_i^\\top w)^2 + \\frac{\\lambda}{2} \\|w\\|_2^2 = \\frac{1}{2 n} \\| y - X w \\|_2^2 + \\frac{\\lambda}{2} \\|w\\|_2^2,\n", "$$\n", "where $n$ is the sample size, $y = [y_1 \\cdots y_n]$ is the vector of labels and $X$ is the matrix of features whose rows contain the feature vectors $x_i \\in \\mathbb R^d$.\n", "\n", "**Logistic regression**, where\n", "$$\n", "f(w) = \\frac 1n \\sum_{i=1}^n f_i(w) = \\frac{1}{n} \\sum_{i=1}^n \\log(1 + \\exp(-y_i x_i^\\top w)) + \\frac{\\lambda}{2} \\|w\\|_2^2,\n", "$$\n", "where $n$ is the sample size, and where the labels satisfy $y_i \\in \\{ -1, 1 \\}$ for all $i$.\n", "\n", "In order to solve this problem, we need to be able to compute $f(w)$ and its gradient $\\nabla f(w)$, as well as $\\nabla f_i(w)$ for stochastic gradient descent methods and $\\frac{\\partial f(w)}{\\partial w_j}$ for coordinate descent.\n", "\n", "Below is the full implementation for linear regression.\n", "\n", "\n", "\n", "## 2.1 Linear regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "from numpy.linalg import norm\n", "\n", "\n", "class ModelLinReg:\n", "    \"\"\"A class giving first order information for linear regression\n", "    with least-squares loss\n", "    \n", "    Parameters\n", "    ----------\n", "    X : `numpy.array`, shape=(n_samples, n_features)\n", "        The features matrix\n", "    \n", "    y : `numpy.array`, shape=(n_samples,)\n", "        The vector of labels\n", "    \n", "    strength : `float`\n", "        The strength of ridge penalization\n", "    \"\"\" \n", "    def __init__(self, X, y, strength):\n", "        self.X = X\n", "        self.y = y\n", "        self.strength = strength\n", "        self.n_samples, self.n_features = X.shape\n", "    \n", "    def loss(self, w):\n", "        \"\"\"Computes f(w)\"\"\"\n", "        y, X, n_samples, strength = self.y, self.X, self.n_samples, self.strength\n", "        return 0.5 * norm(y - X.dot(w)) ** 2 / n_samples + strength * norm(w) ** 2 / 2\n", "    \n", "    def grad(self, w):\n", "        \"\"\"Computes the gradient of f at w\"\"\"\n", "        y, X, n_samples, strength = self.y, self.X, self.n_samples, self.strength\n", "        return X.T.dot(X.dot(w) - y) / n_samples + strength * w\n", "\n", "    def grad_i(self, i, w):\n", "        \"\"\"Computes the gradient of f_i at w\"\"\"\n", "        x_i = self.X[i]\n", "        return (x_i.dot(w) - self.y[i]) * x_i + self.strength * w\n", "\n", "    def grad_coordinate(self, j, w):\n", "        \"\"\"Computes the partial derivative of f with respect to \n", "        the j-th coordinate\"\"\"\n", "        y, X, n_samples, strength = self.y, self.X, self.n_samples, self.strength\n", "        return X[:, j].T.dot(X.dot(w) - y) / n_samples + strength * w[j]\n", "\n", "    def lip(self):\n", "        \"\"\"Computes the Lipschitz constant of the gradient of f\"\"\"\n", "        X, n_samples = self.X, self.n_samples\n", "        return norm(X.T.dot(X), 2) / n_samples + self.strength\n", "\n", "    def lip_coordinates(self):\n", "        \"\"\"Computes the Lipschitz constants of the partial derivatives of f\n", "        with respect to each coordinate\"\"\"\n", "        X, n_samples = self.X, self.n_samples\n", "        return (X ** 2).sum(axis=0) / n_samples + self.strength\n", "    \n", "    def lip_max(self):\n", "        \"\"\"Computes the maximum of the Lipschitz constants of the gradients of f_i\"\"\"\n", "        X, n_samples = self.X, self.n_samples\n", "        return ((X ** 2).sum(axis=1) + self.strength).max()\n", "    \n", "    def hessian(self):\n", "        \"\"\"Computes the Hessian of f\"\"\"\n", "        X, n_samples = self.X, self.n_samples\n", "        return X.T.dot(X) / n_samples + self.strength * np.eye(self.n_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 2.2 Checks for the linear regression model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Simulation setting\n", "n_features = 50\n", "nnz = 20\n", "idx = np.arange(n_features)\n", "w0 = (-1) ** (idx + 1) * np.exp(-idx / 10.)\n", "w0[nnz:] = 0.\n", "\n", "plt.figure(figsize=(5, 3))\n", "plt.stem(w0)\n", "plt.title(\"Model weights\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.optimize import check_grad\n", "\n", "X, y = simu_linreg(w0, corr=0.6)\n", "model = ModelLinReg(X, y, strength=1e-3)\n", "w = np.random.randn(n_features)\n", "\n", "send(float(check_grad(model.loss, model.grad, w)), 3)\n", "print(check_grad(model.loss, model.grad, w)) # This must be a small number (of order 1e-6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(\"lip=\", model.lip())\n", "print(\"lip_max=\", model.lip_max())\n", "print(\"lip_coordinates=\", model.lip_coordinates())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.hessian().shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 2.3 Logistic regression\n", "\n", "**1) Compute (on paper) the gradient $\\nabla f$, the gradients $\\nabla f_i$ and the partial derivatives $\\frac{\\partial f(w)}{\\partial w_j}$ of $f$ for logistic regression (they are needed to fill in the class given below).**\n", "\n", "**2) Fill in the functions below for the computation of $f$, $\\nabla f$, $\\nabla f_i$ and $\\frac{\\partial f(w)}{\\partial w_j}$ for logistic regression in the ModelLogReg class below.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class ModelLogReg:\n", "    \"\"\"A class giving first order information for logistic regression\n", "    \n", "    Parameters\n", "    ----------\n", "    X : `numpy.array`, shape=(n_samples, n_features)\n", "        The features matrix\n", "    \n", "    y : `numpy.array`, shape=(n_samples,)\n", "        The vector of labels\n", "    \n", "    strength : `float`\n", "        The strength of ridge penalization\n", "    \"\"\" \n", "    def __init__(self, X, y, strength):\n", "        self.X = X\n", "        self.y = y\n", "        self.strength = strength\n", "        self.n_samples, self.n_features = X.shape\n", "    \n", "    def loss(self, w):\n", "        \"\"\"Computes f(w)\"\"\"\n", "        y, X, n_samples, strength = self.y, self.X, self.n_samples, self.strength\n", "        \n", "        #\n", "        #\n", "        # YOUR CODE HERE\n", "        #\n", "        #\n", "        \n", "    \n", "    def grad(self, w):\n", "        \"\"\"Computes the gradient of f at w\"\"\"\n", "        y, X, n_samples, strength = self.y, self.X, self.n_samples, self.strength\n", "        \n", "        #\n", "        #\n", "        # YOUR CODE HERE\n", "        #\n", "        #\n", "        \n", "    \n", "    def grad_i(self, i, w):\n", "        \"\"\"Computes the gradient of f_i at w\"\"\"\n", "        x_i = self.X[i]\n", "        strength = self.strength\n", "        \n", "        #\n", "        #\n", "        # YOUR CODE HERE\n", "        #\n", "        #\n", "        \n", "    \n", "    def grad_coordinate(self, j, w):\n", "        \"\"\"Computes the partial 
derivative of f with respect to \n", " the j-th coordinate\"\"\"\n", " y, X, n_samples, strength = self.y, self.X, self.n_samples, self.strength\n", " u = y*np.exp(- y * (X.dot(w)))/(1 + np.exp(- y * (X.dot(w))))\n", " return - (X[:, j].T.dot(u))/n_samples + strength * w[j]\n", "\n", "\n", " def lip(self):\n", " \"\"\"Computes (an upper bound on) the Lipschitz constant of the gradient of f\"\"\"\n", " X, n_samples = self.X, self.n_samples\n", "\n", " return norm(X.T.dot(X), 2) / (4*n_samples) + self.strength\n", "\n", "\n", " def lip_coordinates(self):\n", " \"\"\"Computes (an upper bound on) the Lipschitz constant of the gradient of f with respect to \n", " the j-th coordinate\"\"\"\n", " X, n_samples = self.X, self.n_samples\n", "\n", " return (X ** 2).sum(axis=0) / (4*n_samples) + self.strength\n", "\n", "\n", " def lip_max(self):\n", " \"\"\"Computes (an upper bound on) the maximum of the lipschitz constants of the gradients of f_i\"\"\"\n", " X, n_samples = self.X, self.n_samples\n", " return ((X ** 2).sum(axis=1)/4 + self.strength).max()\n", "\n", " \n", " def hessian(self):\n", " X, n_samples = self.X, self.n_samples\n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", " \n", "send(ModelLogReg.loss, 4)\n", "send(ModelLogReg.grad, 5)\n", "send(ModelLogReg.grad_i, 6)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "## 2.4 Checks for the logistic regression model\n", "\n", "**3) Use the function `simu_logreg` to simulate data according to the logistic regression model. Check numerically the gradient using the function ``checkgrad`` from ``scipy.optimize``, as we did for linear regression above.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Simulation setting\n", "n_features = 50\n", "nnz = 20\n", "idx = np.arange(n_features)\n", "w0 = (-1) ** (idx + 1) * np.exp(-idx / 10.)\n", "w0[nnz:] = 0.\n", "\n", "plt.figure(figsize=(5, 3))\n", "plt.stem(w0)\n", "plt.title(\"Model weights\")\n", "\n", "from scipy.optimize import check_grad\n", "\n", "\n", "\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "send('Checkgrad returns %.2e' % (check_grad(model.loss, model.grad, w)), 7)\n", "print('Checkgrad returns %.2e' % (check_grad(model.loss, model.grad, w))) # This must be a number (of order 1e-6)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(\"lip=\", model.lip())\n", "print(\"lip_max=\", model.lip_max())\n", "print(\"lip_coordinates=\", model.lip_coordinates())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Solvers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have classes `ModelLinReg` and `ModelLogReg` that allow to compute $f(w)$, $\\nabla f(w)$, \n", "$\\nabla f_i(w)$ and $\\frac{\\partial f(w)}{\\partial w_j}$ for the objective $f$\n", "given by linear and logistic regression. We want now to code and compare several solvers to minimize $f$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.1. 
Tools for the solvers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tools store the loss after each epoch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Starting point of all solvers\n", "w0 = np.zeros(model.n_features)\n", "\n", "# Number of iterations\n", "n_iter = 50\n", "\n", "# Random samples indices for the stochastic solvers (sgd, sag, svrg) \n", "idx_samples = np.random.randint(0, model.n_samples, model.n_samples * n_iter)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def inspector(model, n_iter, verbose=True):\n", " \"\"\"A closure called to update metrics after each iteration.\n", " Don't even look at it, we'll just use it in the solvers.\"\"\"\n", " objectives = []\n", " it = [0] # This is a hack to be able to modify 'it' inside the closure.\n", " def inspector_cl(w):\n", " obj = model.loss(w)\n", " objectives.append(obj)\n", " if verbose == True:\n", " if it[0] == 0:\n", " print(' | '.join([name.center(8) for name in [\"it\", \"obj\"]]))\n", " if it[0] % (n_iter / 5) == 0:\n", " print(' | '.join([(\"%d\" % it[0]).rjust(8), (\"%.6e\" % obj).rjust(8)]))\n", " it[0] += 1\n", " inspector_cl.objectives = objectives\n", " return inspector_cl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.2 Gradient descent\n", "\n", "\n", "**4) Finish the function `gd` below that implements the gradient descent algorithm and test it using the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "def gd(model, w0, step, n_iter, callback, verbose=True):\n", " \"\"\"Gradient descent\n", " \"\"\"\n", " #step = 1 / model.lip()\n", " w = w0.copy()\n", " w_new = w0.copy()\n", " if verbose:\n", " print(\"Lauching GD solver...\")\n", " callback(w)\n", " for k in range(n_iter + 1):\n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", " callback(w)\n", " return w\n", "send(gd, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The following code runs GD and stores the objectivre values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "callback_gd = inspector(model, n_iter=n_iter)\n", "w_gd = gd(model, w0, step= 1/model.lip(), n_iter=n_iter, callback=callback_gd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Which step size was chosen?\n", "- What is the expected rate of convergence?" 
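, "\n", "\n",
    "One way to answer the second question empirically is to plot the suboptimality gap on a log scale: a straight line then indicates linear (geometric) convergence. Below is a minimal sketch; it assumes the cells above have been run, and it uses the best objective value seen so far as a rough stand-in for the true minimum (a much longer run of a fast solver, as done at the end of the notebook, gives a better estimate).\n",
    "```python\n",
    "# Rough sketch: read the convergence rate of GD off the recorded objectives.\n",
    "# obj_min is only a proxy for the true minimum F(w*).\n",
    "objs = np.array(callback_gd.objectives)\n",
    "obj_min = objs.min()\n",
    "plt.semilogy(objs - obj_min + 1e-16)  # straight line <=> linear convergence\n",
    "plt.xlabel('iteration')\n",
    "plt.ylabel(r'$F(w^k) - F(w^*)$')\n",
    "```\n"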
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.3 Accelerated gradient descent\n", "\n", "**5) Finish the function `agd` below that implements the accelerated gradient descent algorithm and test it using the next cell.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What choice of momentum coefficient is recommended for AGD ?\n", "- for strongly convex\n", "- for convex functions\n", "\n", "Here you can make different choices:\n", "- an arbitrary value (0.9)\n", "- using the strong convexity\n", "- using only convexity" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def agd(model, w0, n_iter, callback, verbose=True):\n", " \"\"\"Accelerated gradient descent\n", " \"\"\"\n", " mu = model.strength\n", " step = 1 / model.lip()\n", " w = w0.copy()\n", " w_new = w0.copy()\n", " # An extra variable is required for acceleration\n", " z = w0.copy() # the auxiliari point at which the gradient is taken\n", " t = 1. # beta this morning = momentum coefficient \n", " t_new = 1. \n", " if verbose:\n", " print(\"Lauching AGD solver...\")\n", " callback(w)\n", " for k in range(n_iter + 1):\n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", " callback(w)\n", " return w\n", "\n", "send(agd, 9)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "callback_agd = inspector(model, n_iter=n_iter)\n", "w_agd = agd(model, w0, n_iter=n_iter, callback=callback_agd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 3.4 Coordinate gradient descent\n", "\n", "**6) Finish the function `cgd` below that implements the coordinate gradient descent algorithm and test it using the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cgd(model, w0, n_iter, callback, verbose=True):\n", " \"\"\"Coordinate gradient descent\n", " \"\"\"\n", " w = w0.copy()\n", " n_features = model.n_features\n", " steps = 1 / model.lip_coordinates()\n", " if verbose:\n", " print(\"Lauching CGD solver...\")\n", " callback(w)\n", " for k in range(n_iter + 1):\n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", " callback(w)\n", " return w\n", "send(cgd, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "callback_cgd = inspector(model, n_iter=n_iter)\n", "w_cgd = cgd(model, w0, n_iter=n_iter, callback=callback_cgd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.5. 
Stochastic gradient descent\n", "\n", "**7) Finish the function `sgd` below that implements the st stochastic gradient descent algorithm and test it using the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sgd(model, w0, idx_samples, n_iter, step, callback, verbose=True):\n", " \"\"\"Stochastic gradient descent\n", " \"\"\"\n", " mu = model.strength\n", " w = w0.copy()\n", " w_ave = w0.copy()\n", " callback(w)\n", " n_samples = model.n_samples\n", " for idx in range(n_iter):\n", " i = idx_samples[idx]\n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " # every n_samples iterations\n", " if idx % n_samples == 0:\n", " callback(w) #w_ave\n", " return w\n", "\n", "\n", "send(sgd, 11)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "step = 0.1\n", "callback_sgd = inspector(model, n_iter=n_iter)\n", "w_sgd = sgd(model, w0, idx_samples, n_iter=model.n_samples * n_iter, \n", " step=step, callback=callback_sgd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Which of the method is the fastest during the first 5 passes over the data?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.6. Stochastic average gradient descent\n", "\n", "\n", "**8) Finish the function `sag` below that implements the stochastic averaged gradient algorithm and test it using the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sag(model, w0, idx_samples, n_iter, step, callback, verbose=True):\n", " \"\"\"Stochastic average gradient descent\n", " \"\"\"\n", " w = w0.copy()\n", " n_samples, n_features = model.n_samples, model.n_features\n", " gradient_memory = np.zeros((n_samples, n_features)) # one gradient per sample n= 60k, d= 50M => 3 10^12\n", " y = np.zeros(n_features)\n", " callback(w)\n", " for idx in range(n_iter):\n", " i = idx_samples[idx] \n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", " if idx % n_samples == 0:\n", " callback(w)\n", " return w\n", "send(sag, 12)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step = 1 / model.lip_max()\n", "callback_sag = inspector(model, n_iter=n_iter)\n", "w_sag = sag(model, w0, idx_samples, n_iter=model.n_samples * n_iter, \n", " step=step, callback=callback_sag)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What happens during the first pass of SAG?\n", "- What is the main problem of SAG?\n", "- What is the size of the gradient_memory matrix?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.7. 
Stochastic variance reduced gradient\n", "\n", "**9) Finish the function `svrg` below that implements the stochastic variance reduced gradient algorithm and test it using the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def svrg(model, w0, idx_samples, n_iter, step, callback, verbose=True):\n", " \"\"\"Stochastic variance reduced gradient descent\n", " \"\"\"\n", " w = w0.copy()\n", " w_old = w.copy()\n", " temp_sum = 0\n", " n_samples = model.n_samples\n", " callback(w)\n", " for idx in range(n_iter): \n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", " if idx % n_samples == 0:\n", " callback(w)\n", " return \n", "send(svrg, 13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step = 1 / model.lip_max()\n", "callback_svrg = inspector(model, n_iter=n_iter)\n", "w_svrg = svrg(model, w0, idx_samples, n_iter=model.n_samples * n_iter,\n", " step=step, callback=callback_svrg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 3.8.1 ADAGRAD\n", "\n", "**10) Create the function `adagrad` that implements the adagrad solver and test it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "\n", "def adagrad(model, w0, n_iter, step, callback, verbose=True):\n", " \"\"\"Adagrad\"\"\"\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################\n", "send(adagrad, 14)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What is the set of parameters for ADAGRAD? \n", "- What are default choices for those parameters?\n", "- Is Adagrad typically implemented with deterministic or Stochastic Gradients?" 
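, "\n", "\n",
    "As a (hedged) reminder, the usual presentation of the AdaGrad update accumulates squared gradients coordinate-wise and rescales the step accordingly:\n",
    "$$\n",
    "s_k = s_{k-1} + g_k \\odot g_k, \\qquad w_{k+1} = w_k - \\frac{\\gamma}{\\sqrt{s_k} + \\epsilon} \\odot g_k,\n",
    "$$\n",
    "where $g_k$ is the (full or stochastic) gradient used at iteration $k$, $\\gamma$ is a base step size (often taken around $10^{-2}$) and $\\epsilon$ is a small constant (often around $10^{-8}$) that avoids divisions by zero; the square root and the division are taken coordinate-wise.\n"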
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Do/Shall we use momentum for adagrad adadelta, etc?\n", "# - we should\n", "# - for simplicity, we don't" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 3.8.2 ADAGRAD_Stochastic\n", "\n", "**10 bis) Create the function `adagrad` that implements the adagrad solver and test it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "\n", "def adagrad_Sto(model, w0, idx_samples, n_iter, step, callback, verbose=True):\n", " \"\"\"Adagrad\"\"\"\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################\n", "send(adagrad, 15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "## 3.9 RMSProp\n", "\n", "**11) Create the function `rmsprop` that implements the RMSProp solver and test it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "def rmsprop(model, w0, n_iter, step, rho, epsilon, callback, verbose=True):\n", "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################\n", "send(rmsprop, 16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What is the set of parameters for RMSPROP? \n", "- What are default choices for those parameters?\n", "- Is RMSPROP typically implemented with deterministic or Stochastic Gradients?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 3.10 AdaDelta\n", "\n", "**12) Create the function `adadelta` that implements adadelta solver and test it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def adadelta(model, w0, n_iter, rho, epsilon, callback, verbose=True):\n", " ##################\n", " \n", " #\n", " #\n", " # YOUR CODE HERE\n", " #\n", " #\n", " \n", "##################\n", "send(adadelta, 17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What is the set of parameters for AdaDelta? \n", "- What are default choices for those parameters? Compare r= 0.95 and rho = 0.999\n", "\n", "- What is Adadelta equivalent to for rho =1 (GD with step size 1)\n", "\n", "- Is AdaDelta typically implemented with deterministic or Stochastic Gradients?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "## 3.11 Adam\n", "\n", "**13) Create the function `adam` that implements the adam algorithm and test it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def adam(model, w0, n_iter, step, beta1, beta2, epsilon, callback, verbose=True):\n", "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################\n", "send(adam, 18)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What is the set of parameters for Adam? \n", "- What are default choices for those parameters?\n", "- Is Adam typically implemented with deterministic or Stochastic Gradients?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "## 3.12 Adamax\n", "\n", "**14) Create the function `adamax` that implements the adamax solver and test it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################\n", "send(adamax, 19)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################\n", "\n", "#\n", "#\n", "# YOUR CODE HERE\n", "#\n", "#\n", "\n", "##################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 4. Comparison of all algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**15) Plot the values of the loss for the different iteration and for each solver. Comment. **" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify here to only call the methods you have implemented\n", "callbacks = [callback_gd, callback_agd, callback_cgd, callback_sgd, \n", " callback_sag, callback_svrg, \n", " callback_adagrad, callback_rmsprop, callback_adadelta, callback_adam, callback_adamax]\n", "names = [\"GD\", \"AGD\", \"CGD\", \"SGD\",\n", " \"SAG\", \"SVRG\", \n", " \"ADAGRAD\", \"RMSPROP\", \"ADADELTA\", \"ADAM\", \"ADAMAX\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's look at the convergence curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 10))\n", "\n", "for callback, name in zip(callbacks, names):\n", " objectives = np.array(callback.objectives)\n", " objectives_dist = objectives \n", " plt.plot(objectives_dist, label=name, lw=2)\n", "\n", "plt.tight_layout()\n", "plt.xlim((0, n_iter))\n", "plt.xlabel(\"Number of passes on the data\", fontsize=16)\n", "plt.ylabel(r\"$F(w^k) $\", fontsize=16)\n", "plt.legend(loc='lower left')\n", "plt.tight_layout()\n", "send(plt, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What do you think? How can we improve the plot to make it more insightful ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1 . 
Compute the minimal value of the function by running CGD (or another fast method)\n", "# for 20 times more iterations than any other method\n", "\n", "# 2. Set the y-scale to log\n", "\n", "callback_long = inspector(model, n_iter=1000, verbose=False)\n", "w_cgd = cgd(model, w0, n_iter=1000, callback=callback_long, verbose=False)\n", "obj_min = callback_long.objectives[-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 10))\n", "plt.yscale(\"log\")\n", "\n", "for callback, name in zip(callbacks, names):\n", "    objectives = np.array(callback.objectives)\n", "    objectives_dist = objectives - obj_min \n", "    plt.plot(objectives_dist, label=name, lw=2)\n", "\n", "plt.tight_layout()\n", "plt.xlim((0, n_iter))\n", "plt.xlabel(\"Number of passes on the data\", fontsize=16)\n", "plt.ylabel(r\"$F(w^k) - F(w^*)$\", fontsize=16)\n", "plt.legend(loc='lower left')\n", "plt.tight_layout()\n", "send(plt, 21)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try to answer some of the following questions:\n", "\n", "- A. What is the speed of convergence of GD, AGD and SGD? (You may need another plot for that.)\n", "- B. Compare GD and SGD: is there a \"best\" algorithm between GD and SGD?\n", "- C. Compare GD and AGD. Compare the slopes: how does that reflect the theory? What step sizes did we choose?\n", "- D. Which curve oscillates (i.e., is \"bumpy\")? Check this great blog post on the topic: https://distill.pub/2017/momentum/\n", "- E. Tune the step size for GD, SGD, etc. Find the value of the maximal step size that avoids divergence.\n", "- F. Compare the convergence rates of different strategies for SGD (learning rate decay, averaging or not).\n", "- G. Compare the different strategies for tuning the momentum coefficient of AGD.\n", "- H. What is the impact of variance reduction? Does it match what is expected?\n", "- I. In logistic regression, study the influence of the correlation of the features on the performance of the optimization algorithms. Explain.\n", "- J. In logistic regression, study the influence of the level of ridge penalization on the performance of the optimization algorithms. Explain.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### If you generate interesting plots, share them !! \n", "\n", "Use \n", "\n", "````send(plt, 22)````\n", "\n", " (with increasing numbers starting at 22)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 1 }