{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TD1: Linear and Polynomial Regressions; Application to Classification" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.io as sio\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear and polynomial regression: curve fitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $(X,Y)$ be a pair of real random variables such that $X$ is uniform on $[0,1]$ and $Y = f_*(X)+\\sigma \\varepsilon$, where $f_*(x) = \\sin(6x)$, $\\sigma = 0.5$, and $\\varepsilon$ is a standard Gaussian random variable, independent of $X$. \n", "\n", "(1) Generate $n = 40$ realizations $(x_i, y_i), i = 1, \\dots, n$ of $(X,Y)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(2) Plot the realizations of $(X,Y)$, along with the curve $y = f_*(x)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we try to learn the function $f_*$ from the $n$ samples. We start with empirical risk minimization over the set of linear functions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(3) What is the input space $\\mathcal{X}$ of the linear regression here? What is the output space $\\mathcal{Y}$? What is the risk $R(f)$ of a predictor $f:\\mathcal{X} \\to \\mathcal{Y}$ in terms of $\\sigma$? What is the optimal predictor among all $L^2$ functions $f:\\mathcal{X} \\to \\mathcal{Y}$? (Here optimal means that it minimizes the risk $R(f)$.) What is the risk of the optimal predictor?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Empirical risk minimization over the set of linear functions means that we estimate\n", "$$ \\hat{f} = {\\rm argmin}_{f \\in F} \\hat{R}_n(f) $$ \n", "where \n", "$$ F = \\{f(x) = \\theta_1 x + \\theta_0 | \\theta_0, \\theta_1 \\in \\mathbb{R}\\} \\, , $$\n", "$$ \\hat{R}_n(f) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f(x_i))^2 \\, .$$\n", "\n", "(4) Writing $\\hat{f}(x) = \\theta_1 x + \\theta_0$, find a closed-form formula for $\\theta_0$, $\\theta_1$ in terms of the observations $(x_i,y_i)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(5) Using this formula, complete the previous plot with this estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(6) Repeat the computation of the coefficients, now using the function `numpy.linalg.lstsq`. Check on the plot that the results are the same." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(7) Minimize the empirical risk over the set of polynomials of order 2. Plot the optimal polynomial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(8) Generalize your code in order to compute the optimal polynomial of order $k$. Vary $k$ and the number of samples $n$, and plot the results. Comment." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(9) Let $\\hat{f}_k$ denote the minimizer of the empirical risk over the polynomials of order $k$. Plot the risk $R(\\hat{f}_k)$ and the empirical risk $\\hat{R}_n(\\hat{f}_k)$ as functions of $k$, for $n=40$ and $n=400$. Comment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(10) Repeat questions (8) and (9) with $f_*(x) = 1.2x + 4x^2 + 4.4x^3 - 3.8x^4 + 3.6 x^5$ and $n=20$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear and polynomial classifiers" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | x | y | class |\n",
"|---|---|---|---|\n",
"| 0 | -3.603405 | 1.3266 | 1.0 |\n",
"| 1 | -4.219011 | 2.0150 | 1.0 |\n",
"| 2 | -1.515658 | 0.5059 | 1.0 |\n",
"| 3 | -1.169757 | 0.3815 | 1.0 |\n",
"| 4 | 0.522741 | -0.6572 | 1.0 |\n",
"