class: center, middle # Lecture 5: ### Gradient descent, Backpropagation, Hand-crafted features, Neural Networks Florent Krzakala - Marc Lelarge - Andrei Bursuc
.center[
] .footnote.small[ With slides from A. Karpathy, F. Fleuret, J. Johnson, S. Yeung, A. Vedaldi ...] --- ## Recap -- - Image classification: - K-Nearest Neighbors: - Linear classifier: - Loss functions: Multi-class SVM and Softmax - Regularization --- ## Recap .center[
] --- ## Recap .center[
] --- ## Recap .center[
] --- ## Today .left[ - Gradient descent - Backpropagation - Hand-crafted features - Feedforward Networks - Practical PyTorch: Clustering, Recsys, Triplet Loss ] --- ## Optimization Given: - a dataset of $(x,y)$ - a score function $s=f(x,W)=Wx$ - a loss function: + $L_i = -\log\frac{e^{s_{y_i}}}{\sum_j{e^{s_j}}}$ .green[per sample] + $L = \frac{1}{N}\sum^{N}_{i=1}{L_i} + R(W)$ .green[for all samples] How to find the best $W$? Modularization into basic blocks helps build intuition (also for deep networks) .center[
] --- ## Optimization .center[
] --- ## Optimization .center[
] .center[Follow the slope!] --- ## Optimization - Follow the slope - In 1D, the derivative of a function: .center[$\frac{df(x)}{dx} = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}$] - In multiple dimensions, the gradient is a vector of partial derivatives along each dimension + The slope in any direction is the dot product of the (unit) direction with the gradient + The direction of the steepest descent is the negative gradient --- ## (Naive) finite differences
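To make the limit definition above concrete, here is a minimal numerical-gradient sketch (naive finite differences) in plain NumPy; the helper name `numerical_gradient`, the toy function, and the step `h` are illustrative choices, not from the original slides.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # naive finite differences: one extra function evaluation per dimension
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h            # perturb a single coordinate
        f_plus = f(x)
        x.flat[i] = old                # restore it
        grad.flat[i] = (f_plus - f(x)) / h   # (f(x+h) - f(x)) / h
    return grad

# toy check: f(x) = ||x||^2 has gradient 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v**2), x))   # approx. [ 2. -4.  6.]
```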
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## (Naive) finite differences
.credit[Slide credit: J. Johnson] --- ## Optimization - The loss function is just a function of $W$: .center[$L= \frac{1}{N}\sum^{N}_{i=1}{L_i} + \sum_k{W^2_k}$] - We want $\nabla_W L$ - We can use calculus to compute an analytic gradient - In practice: always use the analytic gradient, but check the implementation against a numerical gradient -> __gradient check__ (see the sketch on the next slide) --- ## Optimization
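A minimal sketch of such a gradient check, comparing an analytic gradient with a centered finite-difference estimate; the helper names, tolerance, and toy function are illustrative, not from the original slides.

```python
import numpy as np

def gradient_check(f, analytic_grad, x, h=1e-5):
    # centered finite differences, then a relative-error comparison
    num_grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; f_plus = f(x)
        x.flat[i] = old - h; f_minus = f(x)
        x.flat[i] = old
        num_grad.flat[i] = (f_plus - f_minus) / (2 * h)
    rel_err = np.abs(num_grad - analytic_grad) / np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    return rel_err.max()   # should be tiny (e.g. < 1e-6) if the analytic gradient is correct

# toy check on f(x) = ||x||^2, whose analytic gradient is 2x
x = np.array([0.5, -1.0, 2.0])
print(gradient_check(lambda v: np.sum(v**2), 2 * x, x))
```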
.credit[Slide credit: J. Johnson] --- ## Gradient descent - Code for simple gradient descent:
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad  # perform parameter update
```
.center[
] .credit[Slide credit: A. Karpathy] --- ## Gradient descent - gradient descent uses local linear information to iteratively move towards a (local) minimum - the iterative rule: `weights += - step_size * weights_grad` corresponds to _"following the steepest descent"_ - it converges to a local minimum, so the choices of $w_0$ (initial weights) and `step_size` are important (a toy example follows) --- ## Gradient descent
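A minimal runnable illustration of the update rule on a toy quadratic; the loss, starting point, and learning rates below are illustrative, not from the original slides.

```python
import numpy as np

target = np.array([1.0, -2.0])

def loss(w):
    return np.sum((w - target) ** 2)    # toy loss with minimum at w = target

def grad(w):
    return 2 * (w - target)             # its analytic gradient

w = np.array([4.0, 3.0])                # w_0: initial weights
step_size = 0.1                         # try e.g. 1.1 to see the iterates diverge
for t in range(100):
    w = w - step_size * grad(w)         # the same update rule as above
print(w, loss(w))                       # w ends up close to [1, -2]
```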
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Gradient descent
.center[
] .credit[Slide credit: F. Fleuret] --- ## Mini-batch gradient descent - _a.k.a_ Stochastic Gradient Descent (SGD) - Use only a small portion of the training set to compute the gradient
```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 128)  # sample 128 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update
```
- Common mini-batch sizes are 32/64/128/256 examples - __step_size == learning rate__ .center[
] --- ## Mini-batch gradient descent - Example of optimization progress while training a neural network - Showing loss over mini-batches as it goes down over time .center[
] --- ## Mini-batch gradient descent - Example of optimization progress while training a neural network - __Epoch__ = one full pass of the training dataset through the network .center[
] .credit[Slide credit: A. Karpathy] --- ## Mini-batch gradient descent - The effects of different optimization techniques .right.green.small[we'll cover them in more detail later on]
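A minimal PyTorch sketch of a mini-batch training loop with a built-in optimizer; swapping the optimizer line is how the techniques compared in the figure (momentum, Adam, ...) are selected. The toy data, model, and hyper-parameters are illustrative, not from the original slides.

```python
import torch

X, y = torch.randn(1000, 10), torch.randn(1000, 1)   # toy regression data
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# alternatives: torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
#               torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    idx = torch.randint(0, X.size(0), (128,))   # sample a mini-batch of 128 examples
    loss = loss_fn(model(X[idx]), y[idx])
    optimizer.zero_grad()
    loss.backward()        # gradient from the mini-batch only
    optimizer.step()       # parameter update (step_size == lr)
```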
.center[
] --- ## Backpropagation Given: - a dataset of $(x,y)$ - a score function $s=f(x,W)=Wx$ - a loss function: + $L_i = -\log\frac{e^{s_{y_i}}}{\sum_j{e^{s_j}}}$ .green[per sample] + $L = \frac{1}{N}\sum^{N}_{i=1}{L_i} + R(W)$ .green[for all samples] How to find the best $W$? Modularization into basic blocks helps build intuition (also for deep networks) .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Computational graphs
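A minimal PyTorch autograd sketch of a computational graph of the kind used on the following slides: the forward pass builds the graph, `backward()` propagates gradients through it. The particular values are illustrative.

```python
import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y        # intermediate node of the graph
f = q * z        # output node
f.backward()     # backpropagation through the graph

print(x.grad, y.grad, z.grad)   # df/dx = z = -4, df/dy = z = -4, df/dz = q = 3
```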
.center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - What happens in a single unit/function/neuron/layer? .credit[Slide credit: A. Karpathy] --- ## Backpropagation .center[
] - For _deep_ you just replicate modules in this manner .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Another example:   $f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$ .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - patterns in backward flow + __Add__ gate: distributes gradient evenly + __Max__ gate: gradient router to max input + __Mul__ gate: doing some sort of gradient switching between inputs .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - gradients add at branches: .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - Implementation: forward/backward functions + (x,y,z) are scalars here .center[
] .credit[Slide credit: A. Karpathy] --- ## Backpropagation - vectorized - Example: $f(x,W)= \left\lVert W \cdot x \right\rVert^2 = \sum_{i=1}^{n}{(W \cdot x)^2_i}$ -- + $ x \in \mathbb{R}^n$ + $ W \in \mathbb{R}^{n \times n}$ -- .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- ## Backpropagation - vectorized .center[
] .credit[Slide credit: S. Yeung] --- class: center, middle .center[
] --- class: center, middle .center[
] --- class: center, middle # Human-engineered features ### once upon a time ... --- ## Feature extraction - early methods - Concatenation of pixels into 1D descriptors .center[
] --- ## Feature extraction - early methods - Concatenation of pixels into 1D descriptors - Applied to: + face recognition .center[
] + digit recognition .center[
] --- ## Color histogram - A histogram is a summary of the data, describing in this case its color characteristics .center[
] --- ## Color histogram - A histogram is a summary of the data, describing in this case its color characteristics .center[
] --- ## Color histogram Still used here and there .center[
] --- ## Color histogram .center[
] -- .center[
] --- ## Texture features * Features corresponding to human perception * Tamura examined 6 different features (found 3 to correspond strongly to human perception): - Coarseness -- coarse vs. fine - Contrast -- high vs. low - Directionality -- directional vs. non-directional - Line-likeness -- line-like vs. non-line-like - Regularity -- regular vs. irregular - Roughness -- rough vs. smooth .center[
] --- ## Gradient based representation - Compute differences between sums of pixels in rectangles - Captures contrast in adjacent spatial regions - Similar to Haar wavelets, efficient to compute .center[
] .center.citation.tiny[Rapid object detection using a boosted cascade of simple features, P. Viola, CVPR 2001] --- ## Histogram of Oriented Gradients (HOG) .center[
] .center.citation.tiny[Histogram of oriented gradients for human detection, N. Dalal et al., CVPR 2005] --- ## Local features - Identify small patterns of interest in the image (_i.e._ interest points, keypoints, corners) .center[
] --- ## Local features - SIFT - Describe content and context around interest points - Scale Invariant Feature Transform (SIFT) - Output is a $128d$ vector .center[
] .center.citation.tiny[Distinctive image features from scale-invariant keypoints, D. Lowe, IJCV 2004] --- ## Exhaustive matching - Matching everything with everything .center[
] --- ## Exhaustive matching .center[
] --- ## Exhaustive matching .center[
] - The left image has $m$ features - The right image has $n$ features --- ## Exhaustive matching .center[
] - Match the $i$-th left feature to its nearest neighbor $nn(i)$ in the right image, where .center[
] --- ## Exhaustive matching .center[
] --- ## Exhaustive matching .center[
] --- ## Going large scale? --
.center[
] --- ## Visual words .center[
] .center.citation.tiny[Video Google: A text retrieval approach to object matching in videos, J. Sivic et al., ICCV 2003] --- ## Visual words - Dictionary is typically learned using _k-means clustering_ - Value of $k$ depends on the task: from 8 to 16M .center[
] --- ## Visual words - Visual word examples: each row is an equivalence class of patches mapped to the same cluster by _k-means_ - Visual words = iconic image fragments .center[
] --- ## Visual words ### Quantisation .center[
] --- ## Histogram of visual words - A simple but efficient global image descriptor - Vector of the number of occurrences of the $K$ visual words in the image (_i.e._ __embedding__) - If there are $K$ visual words, then $h \in \mathbb{R}^K$ - The vector $h$ is a global image descriptor - $h$ is also called _bag of (visual) words (__BoW__)_ .center[
] --- ## Histogram of visual words ### Intuition .center[
] .center.citation.tiny[Video Google: A text retrieval approach to object matching in videos, J. Sivic et al., ICCV 2003] --- ## BoW extensions ### VLAD - _Vector of Locally Aggregated Descriptors_ .center[
] .center.citation.tiny[Aggregating local descriptors into a compact image representation, H. Jegou et al., CVPR 2010] --- ## BoW extensions ### Fisher Vectors .center[
] .center.citation.tiny[Fisher kernels on visual vocabularies for image categorization, F. Perronnin et al., ECCV 2010] --- ## BoW extensions - dim(BoW) = $K$ + $K$ = size of vocabulary + $K = [1e3, 1e4]$ for classification + $K = [2e5, 16e6]$ for retrieval -- - dim(VLAD) = $K \times d$ + $d$ = size of SIFT descriptors + $K = [64, 2048]$ -- - dim(Fisher) = $K \times d \times 2$ + $d$ = size of SIFT descriptors + $2$ = GMM moments + $K = [64, 2048]$ --- ## In the meantime ...
.center[
] --- class: center, middle .center[
] --- ## Why neural networks? Why _deep_? - Traditional recognition: "shallow" architecture + each block is designed and implemented individually .center[
] - Deep learning: "deep" architecture (Convolutional Neural Network) .center[
] --- ## Why neural networks? Why _deep_? - Deep learning: train and optimize all blocks jointly + 1 -- 140M trainable parameters .center[
] --- ## Disclaimer - Not trying to sell you the _Kool-Aid_ for doing only _end-to-end learning_ - _End-to-end_ worked quite well in the past few years - Researchers have typically turned classic computer vision operations into differentiable ones - Domain expertise is highly important and a key asset for progress in the coming years .center[
] --- class: center, middle # Neural Networks --- ## Neural Network for classification (__Before__) Linear score function: $f = Wx$ --- ## Neural Network for classification (__Before__) Linear score function: $f = Wx$ (__Now__) 2-layer neural network: $f = W_2 \max(0, W_1 x)$ .footnote.center[
] --- ## Neural Network for classification (__Before__) Linear score function: $f = Wx$ (__Now__) 2-layer neural network: $f = W_2 \max(0, W_1 x)$ Or a 3-layer neural network: $f = W_3 \max(0, W_2 \max(0, W_1 x))$ .footnote.center[
] --- ## Neural Network for classification ### The neuron - Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] - In fact there are several types of neurons with different functions, and the metaphor does not hold everywhere .credit[Slide credit: A. Karpathy] --- ## Neural Network for classification ### The neuron Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] .credit[Slide credit: A. Karpathy] --- ## Neural Network for classification Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] .credit[Slide credit: A. Karpathy] --- ## Multi-layer neural networks - __Training__: find network weights $w$ to minimize the error between true training labels $y_i$ and estimated labels $f_w(x_i)$: $$ E(w)= \sum_{i=1}^{N}{(y_i - f_w(x_i))^2} $$ - Minimization can be done by gradient descent (if $f$ is differentiable) + the training method is called __backpropagation__ .center[
] --- ## Discovery of oriented cells in the visual cortex .center[
] .citation.center.tiny[Hubel & Wiesel, 1959] --- ## Discovery of oriented cells in the visual cortex Find out more in this [video](https://www.youtube.com/watch?v=IOHayh06LJ4) .center[
] .citation.center.tiny[Hubel & Wiesel, 1959] --- ## Mark I Perceptron - first implementation of the perceptron algorithm - the machine was connected to a camera that used 20x20 cadmium sulfide photocells to produce a 400-pixel image - it recognized letters of the alphabet .left-column[ .center[
] .center[
] ] .right-column[
] .reset-column[ ] .citation.center.tiny[Rosenblatt, 1957] --- ## Neural Network for classification - Vector function with tunable parameters $\theta$ / $W$ $$ \mathbf{f}(\cdot; \mathbf{\theta}): \mathbb{R}^N \rightarrow (0, 1)^K $$ - for a sample $s$ in dataset $S$: - input: $\mathbf{x}^s \in \mathbb{R}^N$ - expected output: $y^s \in [0, K-1]$ - probability: $\mathbf{f}(\mathbf{x}^s;\mathbf{\theta})_c = p(Y=c|X=\mathbf{x}^s)$ .credit[Slide credit: C. Ollion & O. Grisel] ??? the model parametrizes a conditional distribution of Y given X example: - x is the vector of the pixel values of a photo in an online fashion store - y is the type of the piece of clothing (shoes, dress, shirt) represented in the photo --- ## Artificial Neuron .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --
.center[ $z(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ $f(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + b)$ ] - $\mathbf{x}, f(\mathbf{x}) \,\,$ input and output - $z(\mathbf{x})\,\,$ pre-activation - $\mathbf{w}, b\,\,$ weights and bias - $g$ activation function .credit[Slide credit: C. Ollion & O. Grisel] ??? McCulloch & Pitts: inspiration from the brain, but a simplistic model with no ambition to be biologically faithful --- ## More neurons -> more capacity .center[
] --- ## Layer of Neurons .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --
.center[ $\mathbf{f}(\mathbf{x}) = g(\mathbf{z}(\mathbf{x})) = g(\mathbf{W} \mathbf{x} + \mathbf{b})$ ]
- $\mathbf{W}, \mathbf{b}\,\,$ now matrix and vector .credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
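A minimal NumPy sketch of this forward pass; the dimensions, the random initialization, and the choice $g = \tanh$ are illustrative, not from the original slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

N, H, K = 4, 8, 3             # input dim, hidden dim, number of classes
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(H, N)), np.zeros(H)
W_o, b_o = rng.normal(size=(K, H)), np.zeros(K)

x = rng.normal(size=N)
z_h = W_h @ x + b_h           # z^h(x)
h = np.tanh(z_h)              # h(x) = g(z^h(x))
z_o = W_o @ h + b_o           # z^o(x)
f = softmax(z_o)              # f(x): probabilities over the K classes
print(f, f.sum())             # sums to 1
```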
.credit[Slide credit: C. Ollion & O. Grisel] ??? also named multi-layer perceptron (MLP) feed forward, fully connected neural network logistic regression is the same without the hidden layer --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
.credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
.credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
- $\mathbf{z}^h(\mathbf{x}) = \mathbf{W}^h \mathbf{x} + \mathbf{b}^h$
- $\mathbf{h}(\mathbf{x}) = g(\mathbf{z}^h(\mathbf{x})) = g(\mathbf{W}^h \mathbf{x} + \mathbf{b}^h)$
- $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$
- $\mathbf{f}(\mathbf{x}) = softmax(\mathbf{z}^o) = softmax(\mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o)$
.credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
] ### Alternate representation .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --- ## One Hidden Layer Network .center[
]
### PyTorch implementation
```py
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),    # linear map: D_in -> H
    torch.nn.Tanh(),
    torch.nn.Linear(H, D_out),   # linear map: H -> D_out
    torch.nn.Softmax(dim=-1),
)
```
--- ## Element-wise activation functions
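A minimal sketch of applying such activations element-wise in PyTorch, with their derivatives obtained from autograd; the sample values are illustrative.

```python
import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)
for name, g in {"tanh": torch.tanh, "sigmoid": torch.sigmoid, "relu": torch.relu}.items():
    y = g(x)             # applied element-wise
    y.sum().backward()   # fills x.grad with the element-wise derivatives g'(x_i)
    print(name, y.detach(), x.grad)
    x.grad = None        # reset before the next activation
```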
.center[
]
- blue: activation function - green: derivative .credit[Slide credit: C. Ollion & O. Grisel] --- ## Element-wise activation functions - [Many other activation functions available](https://dashee87.github.io/data%20science/deep%20learning/visualising-activation-functions-in-neural-networks/):
.center[
] --- ## Softmax function $$ softmax(\mathbf{x}) = \frac{1}{\sum_{i=1}^{n}{e^{x_i}}} \cdot \begin{bmatrix} e^{x_1}\\\\ e^{x_2}\\\\ \vdots\\\\ e^{x_n} \end{bmatrix} $$ $$ \frac{\partial softmax(\mathbf{x})_i}{\partial x_j} = \begin{cases} softmax(\mathbf{x})_i \cdot (1 - softmax(\mathbf{x})_i) & i = j\\\\ -softmax(\mathbf{x})_i \cdot softmax(\mathbf{x})_j & i \neq j \end{cases} $$ -- - vector of values in (0, 1) that add up to 1 - $p(Y = c|X = \mathbf{x}) = \text{softmax}(\mathbf{z}(\mathbf{x}))_c$ - the pre-activation vector $\mathbf{z}(\mathbf{x})$ is often called "the logits" .credit[Slide credit: C. Ollion & O. Grisel] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] --- ## Universal approximation We can approximate any $f \in \mathscr{C}([a,b],\mathbb{R})$ with a linear combination of translated/scaled ReLU functions .center[
] .credit[Slide credit: F. Fleuret] This is true for other activation functions under mild assumptions --- # Dropout .center[
] .citation.tiny[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., JMLR 2014] --- # Dropout ### Interpretation - Reduces the network's dependency on individual neurons - More redundant representation of the data ### Ensemble interpretation - Equivalent to training a large ensemble of parameter-sharing, binary-masked models - Each model is only trained on a single data point --- # Dropout .center[
]
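A minimal PyTorch sketch of dropout behaviour in train vs. eval mode; note that `torch.nn.Dropout` uses the equivalent "inverted" formulation, scaling kept activations by $1/(1-p)$ during training so that nothing needs to be rescaled at test time. The values are illustrative.

```python
import torch

drop = torch.nn.Dropout(p=0.5)   # here p is the probability of zeroing a unit
x = torch.ones(8)

drop.train()                     # training mode: random binary mask + 1/(1-p) scaling
print(drop(x))                   # e.g. tensor([2., 0., 2., 2., 0., 2., 0., 2.])

drop.eval()                      # test mode: identity, no masking
print(drop(x))                   # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```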
At test time, multiply the weights by $p$ to keep the same expected level of activation .citation.tiny[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., JMLR 2014] --- class: center, middle # Applications of deep learning --- ## Deep Learning for Flow Sculpting .center[
] .citation.tiny[Deep Learning for Flow Sculpting: Insights into Efficient Learning using Scientific Simulation Data, D. Stoecklein et al., Nature Scientific Reports 2017] --- ## Deep Learning for Flow Sculpting
.center[
] --- ## Deep Learning for Flow Sculpting What about using a CNN? .center[
] --- ## Deep Learning for Flow Sculpting What about using a CNN? .center[
] --- ## Deep Learning for Flow Sculpting What about using a CNN? .center[
] --- ## Recap - Gradient descent - Backpropagation - Hand-crafted features - Intro to Neural Networks