class: center, middle # Lecture 5: ### Gradient descent, Backpropagation, Hand-crafted features, Neural Networks Florent Krzakala - Marc Lelarge - Andrei Bursuc
.center[
] .footnote.small[ With slides from A. Karpathy, F. Fleuret, J. Johnson, S. Yeung, A. Vedaldi ...] --- ## Recap -- - Image classification: - K-Nearest Neighbors: - Linear classifier: - Loss functions: Multi-class SVM and Softmax - Regularization --- ## Recap .center[
] --- ## Recap .center[
] --- ## Recap .center[
] --- ## Today .left[ - Gradient descent - Backpropagation - Hand-crafted features - Feedforward Networks - Practical PyTorch: Clustering, Recsys, Triplet Loss ] --- ## Optimization Given: - a dataset of $(x,y)$ - a score function $s=f(x,W)=Wx$ - a loss function: + $L_i = -\log\frac{e^{s_{y_i}}}{\sum_j{e^{s_j}}}$ .green[per sample] + $L = \frac{1}{N}\sum^{N}_{i=1}{L_i} + R(W)$ .green[for all samples] How to find the best $W$? Modularization into basic blocks helps build intuition (also for deep networks) .center[
] --- ## Optimization .center[
] --- ## Optimization .center[
] .center[Follow the slope!] --- ## Optimization - Follow the slope - In 1D, the derivative of a function: .center[$\frac{df(x)}{dx} = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}$] - In multiple dimensions, the gradient is a vector of partial derivatives along each dimension + The slope in any direction is the dot product of the (unit) direction with the gradient + The direction of the steepest descent is the negative gradient --- ## (Naive) finite differences
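To make the limit definition above concrete, here is a minimal numerical-gradient sketch (naive finite differences) in plain NumPy; the helper name `numerical_gradient`, the toy function, and the step `h` are illustrative choices, not from the original slides.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # naive finite differences: one extra function evaluation per dimension
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h            # perturb a single coordinate
        f_plus = f(x)
        x.flat[i] = old                # restore it
        grad.flat[i] = (f_plus - f(x)) / h   # (f(x+h) - f(x)) / h
    return grad

# toy check: f(x) = ||x||^2 has gradient 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v**2), x))   # approx. [ 2. -4.  6.]
```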
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## Optimization - The loss function is just a function of $W$: .center[$L= \frac{1}{N}\sum^{N}_{i=1}{L_i} + \sum_k{W^2_k}$] - We want $\nabla_W L$ - We can use calculus to compute an analytic gradient - In practice: always use the analytic gradient, but check the implementation against a numerical gradient -> __gradient check__ (see the sketch on the next slide) --- ## Optimization
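A minimal sketch of such a gradient check, comparing an analytic gradient with a centered finite-difference estimate; the helper names, tolerance, and toy function are illustrative, not from the original slides.

```python
import numpy as np

def gradient_check(f, analytic_grad, x, h=1e-5):
    # centered finite differences, then a relative-error comparison
    num_grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; f_plus = f(x)
        x.flat[i] = old - h; f_minus = f(x)
        x.flat[i] = old
        num_grad.flat[i] = (f_plus - f_minus) / (2 * h)
    rel_err = np.abs(num_grad - analytic_grad) / np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    return rel_err.max()   # should be tiny (e.g. < 1e-6) if the analytic gradient is correct

# toy check on f(x) = ||x||^2, whose analytic gradient is 2x
x = np.array([0.5, -1.0, 2.0])
print(gradient_check(lambda v: np.sum(v**2), 2 * x, x))
```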
.credit[Slide credit: J. Johnson] --- ## Gradient descent - Code for simple gradient descent:
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad  # perform parameter update
```
.center[
] .credit[Slide credit: A. Karpathy] --- ## Gradient descent - gradient descent uses local linear information to iteratively move towards a (local) minimum - the iterative rule: `weights += - step_size * weights_grad` corresponds to _"following the steepest descent"_ - it converges to a local minimum, so the choices of $w_0$ (initial weights) and `step_size` are important (a toy example follows) --- ## Gradient descent
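A minimal runnable illustration of the update rule on a toy quadratic; the loss, starting point, and learning rates below are illustrative, not from the original slides.

```python
import numpy as np

target = np.array([1.0, -2.0])

def loss(w):
    return np.sum((w - target) ** 2)    # toy loss with minimum at w = target

def grad(w):
    return 2 * (w - target)             # its analytic gradient

w = np.array([4.0, 3.0])                # w_0: initial weights
step_size = 0.1                         # try e.g. 1.1 to see the iterates diverge
for t in range(100):
    w = w - step_size * grad(w)         # the same update rule as above
print(w, loss(w))                       # w ends up close to [1, -2]
```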
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Mini-batch gradient descent - _a.k.a_ Stochastic Gradient Descent (SGD) - Use only a small portion of the training set to compute the gradient
```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 128)  # sample 128 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update
```
- Common mini-batch sizes are 32/64/128/256 examples - __step_size == learning rate__ .center[
] --- ## Mini-batch gradient descent - Example of optimization progress while training a neural network - Showing loss over mini-batches as it goes down over time .center[
] --- ## Mini-batch gradient descent - Example of optimization progress while training a neural network - __Epoch__ = one full pass of the training dataset through the network .center[
] .credit[Slide credit: A. Karpathy] --- ## Mini-batch gradient descent - The effects of different optimization techniques .right.green.small[we'll cover them in more detail later on]
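A minimal PyTorch sketch of a mini-batch training loop with a built-in optimizer; swapping the optimizer line is how the techniques compared in the figure (momentum, Adam, ...) are selected. The toy data, model, and hyper-parameters are illustrative, not from the original slides.

```python
import torch

X, y = torch.randn(1000, 10), torch.randn(1000, 1)   # toy regression data
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# alternatives: torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
#               torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    idx = torch.randint(0, X.size(0), (128,))   # sample a mini-batch of 128 examples
    loss = loss_fn(model(X[idx]), y[idx])
    optimizer.zero_grad()
    loss.backward()        # gradient from the mini-batch only
    optimizer.step()       # parameter update (step_size == lr)
```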
.center[
] --- ## Backpropagation Given: - a dataset of $(x,y)$ - a score function $s=f(x,W)=Wx$ - a loss function: + $L_i = -\log\frac{e^{s_{y_i}}}{\sum_j{e^{s_j}}}$ .green[per sample] + $L = \frac{1}{N}\sum^{N}_{i=1}{L_i} + R(W)$ .green[for all samples] How to find the best $W$? Modularization into basic blocks helps build intuition (also for deep networks) .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Computational graphs
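A minimal PyTorch autograd sketch of a computational graph of the kind used on the following slides: the forward pass builds the graph, `backward()` propagates gradients through it. The particular values are illustrative.

```python
import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y        # intermediate node of the graph
f = q * z        # output node
f.backward()     # backpropagation through the graph

print(x.grad, y.grad, z.grad)   # df/dx = z = -4, df/dy = z = -4, df/dz = q = 3
```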
.center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - For _deep_ you just replicate modules in this manner .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - patterns in backward flow + __Add__ gate: distributes gradient evenly + __Max__ gate: gradient router to max input + __Mul__ gate: doing some sort of gradient switching between inputs .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - gradients add at branches: .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Implementation: forward/backward functions + (x,y,z) are scalars here .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - vectorized - Example: $f(x,W)= \left\lVert W \cdot x \right\rVert^2 = \sum_{i=1}^{n}{(W \cdot x)^2_i}$ -- + $ x \in \mathbb{R}^n$ + $ W \in \mathbb{R}^{n \times n}$ -- .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- class: center, middle .center[
] --- class: center, middle .center[
] --- class: center, middle # Human-engineered features ### once upon a time ... --- ## Feature extraction - early methods - Concatenation of pixels into 1D descriptors .center[
] --- ## Feature extraction - early methods - Concatenation of pixels into 1D descriptors - Applied to: + face recognition .center[
] + digit recognition .center[
] --- ## Color histogram - A histogram is a summary of the data, describing in this case its color characteristics .center[
] --- ## Color histogram - A histogram is a summary of the data, describing in this case its color characteristics .center[
] --- ## Color histogram Still used here and there .center[
] --- ## Color histogram .center[
] -- .center[
] --- ## Texture features * Features corresponding to human perception * Tamura examined 6 different features (found 3 to correspond strongly to human perception): - Coarseness -- coarse vs. fine - Contrast -- high vs. low - Directionality -- directional vs. non-directional - Line-likeness -- line-like vs. non-line-like - Regularity -- regular vs. irregular - Roughness -- rough vs. smooth .center[
] --- ## Gradient based representation - Compute differences between sums of pixels in rectangles - Captures contrast in adjacent spatial regions - Similar to Haar wavelets, efficient to compute .center[
] .center.citation.tiny[Rapid object detection using a boosted cascade of simple features, P. Viola, CVPR 2001] --- ## Histogram of Oriented Gradients (HOG) .center[
] .center.citation.tiny[Histogram of oriented gradients for human detection, N. Dalal et al., CVPR 2005] --- ## Local features - Identify small patterns of interest in the image (_i.e._ interest points, keypoints, corners) .center[
] --- ## Local features - SIFT - Describe content and context around interest points - Scale Invariant Feature Transform (SIFT) - Output is a $128d$ vector .center[
] .center.citation.tiny[Distinctive image features from scale-invariant keypoints, D. Lowe, IJCV 2004] --- ## Exhaustive matching - Matching everything with everything .center[
] --- ## Exhaustive matching .center[
] --- ## Exhaustive matching .center[
] - The left image has $m$ features - The right image has $n$ features --- ## Exhaustive matching .center[
] - Match the $i$-th left feature to its nearest neighbor $nn(i)$ in the right image, where .center[
] --- ## Exhaustive matching .center[
] --- ## Exhaustive matching .center[
] --- ## Going large scale? --
.center[
] --- ## Visual words .center[
] .center.citation.tiny[Video Google: A text retrieval approach to object matching in videos, J. Sivic et al., ICCV 2003] --- ## Visual words - Dictionary is typically learned using _k-means clustering_ - Value of $k$ depends on the task: from 8 to 16M .center[
] --- ## Visual words - Visual word examples: each row is an equivalence class of patches mapped to the same cluster by _k-means_ - Visual words = iconic image fragments .center[
] --- ## Visual words ### Quantisation .center[
] --- ## Histogram of visual words - A simple but efficient global image descriptor - Vector of the number of occurrences of the $K$ visual words in the image (_i.e._ __embedding__) - If there are $K$ visual words, then $h \in \mathbb{R}^K$ - The vector $h$ is a global image descriptor - $h$ is also called _bag of (visual) words (__BoW__)_ .center[
] --- ## Histogram of visual words ### Intuition .center[
] .center.citation.tiny[Video Google: A text retrieval approach to object matching in videos, J. Sivic et al., ICCV 2003] --- ## BoW extensions ### VLAD - _Vector of Locally Aggregated Descriptors_ .center[
] .center.citation.tiny[Aggregating local descriptors into a compact image representation, H. Jegou et al., CVPR 2010] --- ## BoW extensions ### Fisher Vectors .center[
] .center.citation.tiny[Fisher kernels on visual vocabularies for image categorization, F. Perronnin et al., ECCV 2010] --- ## BoW extensions - dim(BoW) = $K$ + $K$ = size of vocabulary + $K = [1e3, 1e4]$ for classification + $K = [2e5, 16e6]$ for retrieval -- - dim(VLAD) = $K \times d$ + $d$ = size of SIFT descriptors + $K = [64, 2048]$ -- - dim(Fisher) = $K \times d \times 2$ + $d$ = size of SIFT descriptors + $2$ = GMM moments + $K = [64, 2048]$ --- ## In the meantime ...
.center[
] --- class: center, middle .center[
] --- ## Why neural networks? Why _deep_? - Traditional recognition: "shallow" architecture + each block is designed and implemented individually .center[
] - Deep learning: "deep" architecture (Convolutional Neural Network) .center[
] --- ## Why neural networks? Why _deep_? - Deep learning: train and optimize all blocks jointly + 1 -- 140M trainable parameters .center[
] --- ## Disclaimer - Not trying to sell you the _Kool-Aid_ for doing only _end-to-end learning_ - _End-to-end_ worked quite well in the past few years - Researchers have typically turned classic computer vision operations into differentiable ones - Domain expertise is highly important and a key asset for progress in the coming years .center[
] --- class: center, middle # Neural Networks --- ## Neural Network for classification (__Before__) Linear score function: $f = Wx$ --- ## Neural Network for classification (__Before__) Linear score function: $f = Wx$ (__Now__) 2-layer neural network: $f = W_2 \max(0, W_1 x)$ .footnote.center[
] --- ## Neural Network for classification (__Before__) Linear score function: $f = Wx$ (__Now__) 2-layer neural network: $f = W_2 \max(0, W_1 x)$ Or a 3-layer neural network: $f = W_3 \max(0, W_2 \max(0, W_1 x))$ .footnote.center[
] --- ## Neural Network for classification ### The neuron - Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] - In fact there are several types of neurons with different functions, and the metaphor does not hold everywhere .credit[Slide credit: A. Karpathy] --- ## Neural Network for classification ### The neuron Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] .credit[Slide credit: A. Karpathy] --- ## Neural Network for classification Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] .credit[Slide credit: A. Karpathy] --- ## Multi-layer neural networks - __Training__: find network weights $w$ to minimize the error between true training labels $y_i$ and estimated labels $f_w(x_i)$: $$ E(w)= \sum_{i=1}^{N}{(y_i - f_w(x_i))^2} $$ - Minimization can be done by gradient descent (if $f$ is differentiable) + the training method is called __backpropagation__ .center[
] --- ## Discovery of oriented cells in the visual cortex .center[
] .citation.center.tiny[Hubel & Wiesel, 1959] --- ## Discovery of oriented cells in the visual cortex Find out more in this [video](https://www.youtube.com/watch?v=IOHayh06LJ4) .center[
] .citation.center.tiny[Hubel & Wiesel, 1959] --- ## Mark I Perceptron - first implementation of the perceptron algorithm - the machine was connected to a camera that used 20x20 cadmium sulfide photocells to produce a 400-pixel image - it recognized letters of the alphabet .left-column[ .center[
] .center[
] ] .right-column[
] .reset-column[ ] .citation.center.tiny[Rosenblatt, 1957] --- ## Neural Network for classification - Vector function with tunable parameters $\theta$ / $W$ $$ \mathbf{f}(\cdot; \mathbf{\theta}): \mathbb{R}^N \rightarrow (0, 1)^K $$ - for a sample $s$ in dataset $S$: - input: $\mathbf{x}^s \in \mathbb{R}^N$ - expected output: $y^s \in [0, K-1]$ - probability: $\mathbf{f}(\mathbf{x}^s;\mathbf{\theta})_c = p(Y=c|X=\mathbf{x}^s)$ .credit[Slide credit: C. Ollion & O. Grisel] ??? the model parametrizes a conditional distribution of Y given X example: - x is the vector of the pixel values of a photo in an online fashion store - y is the type of the piece of clothing (shoes, dress, shirt) represented in the photo --- ## Artificial Neuron .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --
.center[ $z(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ $f(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + b)$ ] - $\mathbf{x}, f(\mathbf{x}) \,\,$ input and output - $z(\mathbf{x})\,\,$ pre-activation - $\mathbf{w}, b\,\,$ weights and bias - $g$ activation function .credit[Slide credit: C. Ollion & O. Grisel] ??? McCulloch & Pitts: inspiration from the brain, but a simplistic model with no ambition to be biologically faithful --- ## More neurons -> more capacity .center[
] --- ## Layer of Neurons .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --
.center[ $\mathbf{f}(\mathbf{x}) = g(\mathbf{z}(\mathbf{x})) = g(\mathbf{W} \mathbf{x} + \mathbf{b})$ ]
- $\mathbf{W}, \mathbf{b}\,\,$ now matrix and vector .credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
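A minimal NumPy sketch of this forward pass; the dimensions, the random initialization, and the choice $g = \tanh$ are illustrative, not from the original slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

N, H, K = 4, 8, 3             # input dim, hidden dim, number of classes
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(H, N)), np.zeros(H)
W_o, b_o = rng.normal(size=(K, H)), np.zeros(K)

x = rng.normal(size=N)
z_h = W_h @ x + b_h           # z^h(x)
h = np.tanh(z_h)              # h(x) = g(z^h(x))
z_o = W_o @ h + b_o           # z^o(x)
f = softmax(z_o)              # f(x): probabilities over the K classes
print(f, f.sum())             # sums to 1
```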
.credit[Slide credit: C. Ollion & O. Grisel] ??? also named multi-layer perceptron (MLP) feed forward, fully connected neural network logistic regression is the same without the hidden layer --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
.credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
.credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
.credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
] ### Alternate representation .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
### PyTorch implementation
```py
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),    # linear map: D_in -> H
    torch.nn.Tanh(),
    torch.nn.Linear(H, D_out),   # linear map: H -> D_out
    torch.nn.Softmax(dim=-1),
)
```
--- ## Element-wise activation functions
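A minimal sketch of applying such activations element-wise in PyTorch, with their derivatives obtained from autograd; the sample values are illustrative.

```python
import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)
for name, g in {"tanh": torch.tanh, "sigmoid": torch.sigmoid, "relu": torch.relu}.items():
    y = g(x)             # applied element-wise
    y.sum().backward()   # fills x.grad with the element-wise derivatives g'(x_i)
    print(name, y.detach(), x.grad)
    x.grad = None        # reset before the next activation
```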
.center[
]
- blue: activation function - green: derivative .credit[Slide credit: C. Ollion & O. Grisel] --- ## Element-wise activation functions - [Many other activation functions available](https://dashee87.github.io/data%20science/deep%20learning/visualising-activation-functions-in-neural-networks/):
.center[
] --- ## Softmax function $$ softmax(\mathbf{x}) = \frac{1}{\sum_{i=1}^{n}{e^{x_i}}} \cdot \begin{bmatrix} e^{x_1}\\\\ e^{x_2}\\\\ \vdots\\\\ e^{x_n} \end{bmatrix} $$ $$ \frac{\partial softmax(\mathbf{x})_i}{\partial x_j} = \begin{cases} softmax(\mathbf{x})_i \cdot (1 - softmax(\mathbf{x})_i) & i = j\\\\ -softmax(\mathbf{x})_i \cdot softmax(\mathbf{x})_j & i \neq j \end{cases} $$ -- - vector of values in (0, 1) that add up to 1 - $p(Y = c|X = \mathbf{x}) = \text{softmax}(\mathbf{z}(\mathbf{x}))_c$ - the pre-activation vector $\mathbf{z}(\mathbf{x})$ is often called "the logits" .credit[Slide credit: C. Ollion & O. Grisel] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] This is true for other activation functions under mild assumptions --- # Dropout .center[
] .citation.tiny[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., JMLR 2014] --- # Dropout ### Interpretation - Reduces the network's dependency on individual neurons - More redundant representation of the data ### Ensemble interpretation - Equivalent to training a large ensemble of parameter-sharing, binary-masked models - Each model is only trained on a single data point --- # Dropout .center[
]
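A minimal PyTorch sketch of dropout behaviour in train vs. eval mode; note that `torch.nn.Dropout` uses the equivalent "inverted" formulation, scaling kept activations by $1/(1-p)$ during training so that nothing needs to be rescaled at test time. The values are illustrative.

```python
import torch

drop = torch.nn.Dropout(p=0.5)   # here p is the probability of zeroing a unit
x = torch.ones(8)

drop.train()                     # training mode: random binary mask + 1/(1-p) scaling
print(drop(x))                   # e.g. tensor([2., 0., 2., 2., 0., 2., 0., 2.])

drop.eval()                      # test mode: identity, no masking
print(drop(x))                   # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```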
At test time, multiply the weights by $p$ to keep the same expected level of activation .citation.tiny[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., JMLR 2014] --- class: center, middle # Applications of deep learning --- ## Deep Learning for Flow Sculpting .center[
] .citation.tiny[Deep Learning for Flow Sculpting: Insights into Efficient Learning using Scientific Simulation Data, D. Stoecklein et al., Nature Scientific Reports 2017] --- ## Deep Learning for Flow Sculpting
.center[
] --- ## Deep Learning for Flow Sculpting What about using a CNN? .center[
] --- ## Deep Learning for Flow Sculpting What about using a CNN? .center[
] --- ## Deep Learning for Flow Sculpting What about using a CNN? .center[
] --- ## Recap - Gradient descent - Backpropagation - Hand-crafted features - Intro to Neural Networks