class: center, middle

# Lecture 7:
### Convolutions, CNN Architectures, Visualizations, GPU, Training NNs in practice

Andrei Bursuc - Florent Krzakala - Marc Lelarge
.center[
] .citation.tiny[ With slides from A. Karpathy, F. Fleuret, J. Johnson, S. Yeung, E. Gavves ...] --- ## Recap .left[ - Neural networks - Activation functions - Deep regularization - Convolutional layers - CNN architectures - Practical PyTorch: Sentiment analysis ] --- ## Recap .center[
] --- ## Recap .center[
] --- ## Recap .center[
] --- ## Recap .center[
] --- ## Recap .center[
] --- ## Today .left[ - Review of convolutions - CNN architectures (continued) - Visualizing and understanding CNNs - Tips & tricks for training deep networks - Practical PyTorch: RNNs, a training pipeline ] --- ## Previously: One Hidden Layer Network .center[
]
### PyTorch implementation

```py
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),    # weight matrix dim [D_in x H]
    torch.nn.Tanh(),
    torch.nn.Linear(H, D_out),   # weight matrix dim [H x D_out]
    torch.nn.Softmax(),
)
```

---

## Previously: Dropout

.center[
]

- One has to decide on which units/layers to use dropout, and with what probability $p$ units are dropped.
- During training, for each sample, as many Bernoulli variables as units are sampled independently to select units to remove.
- To keep the means of the inputs to layers unchanged, the initial version of dropout was multiplying activations by $p$ during test.
- The standard variant is the "inverted dropout": multiply activations by $\frac{1}{1-p}$ during training and keep the network untouched during test.

---

## Previously: Dropout

```py
>>> x = Variable(torch.Tensor(3, 9).fill_(1.0), requires_grad = True)
>>> x.data
 1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1
[torch.FloatTensor of size 3x9]
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y.data
 4  0  4  4  4  0  4  0  0
 4  0  0  0  0  0  0  0  0
 0  0  0  0  4  0  4  0  4
[torch.FloatTensor of size 3x9]
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad.data
 1.7889  0.0000  1.7889  1.7889  0.0000  0.0000  1.7889  0.0000  0.0000
 4.0000  0.0000  0.0000  1.7889  0.0000  0.0000  0.0000  2.3094  0.0000
 0.0000  0.0000  0.0000  0.0000  2.3094  0.0000  0.0000  0.0000  2.3094
[torch.FloatTensor of size 3x9]
```

$\frac{1}{1-0.75}=4$

---

## Previously: Why would we need convolutions?

- One neuron gets specialized for detecting a full-image pattern, while being sensitive to translations

.center[
]

---

## Previously: Why would we need convolutions?

- Each neuron gets specialized for detecting a full-image pattern.
- Neurons from later layers behave similarly.
- This wastes a lot of parameters without delivering good performance.

.center[
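]

To make the waste concrete, here is a quick parameter count (a minimal sketch; the $224\times224$ RGB input and the 64 output maps/units are arbitrary choices for illustration):

```py
import torch.nn as nn

# A fully connected layer mapping a flattened 224x224x3 image to 64 units:
fc = nn.Linear(224 * 224 * 3, 64)
print(sum(p.numel() for p in fc.parameters()))    # 9633856 parameters

# A convolutional layer producing 64 feature maps with 3x3 kernels:
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 1792 parameters
```

.center[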
]

---

## Previously: Convolutions

- Since convolutions output one scalar at a time, they can be seen as an individual neuron from an MLP with a receptive field limited to the dimensions of the kernel
- The same neuron is "fired" over multiple areas of the input.

.center[
]

---

## Previously: Convolutions

- Since convolutions output one scalar at a time, they can be seen as an individual neuron from an MLP with a receptive field limited to the dimensions of the kernel
- The same neuron is "fired" over multiple areas of the input.

.left-column[
.center[
] ] .right-column[ .center[.green[Remember this?]] .center[
] ]

---

## Previously: Convolutions

- Since convolutions output one scalar at a time, they can be seen as an individual neuron from an MLP with a receptive field limited to the dimensions of the kernel
- The same neuron is "fired" over multiple areas of the input.

.left-column[
.center[
] ] .right-column[ .center[.green[Remember this?]] .center[
] ]

---

## Receptive field

- The receptive field is defined as the region in the input space that a particular CNN feature is looking at (_i.e._, is affected by).
- A receptive field of a feature can be fully described by its center location and its size
- Example: $k = 3\times3; p = 1\times1; s = 2\times2; input = 3\times3$

.center[
]

.left-column[
.tiny[Common way to visualize a CNN feature map.]
]
.right-column[
.tiny[Fixed-sized CNN feature map visualization, where the size of each feature map is fixed, and the feature is located at the center of its receptive field.]
]

---

## Receptive field

- The receptive field is defined as the region in the input space that a particular CNN feature is looking at (_i.e._, is affected by).
- A receptive field of a feature can be fully described by its center location and its size
- Example: $k = 3\times3; p = 1\times1; s = 2\times2; input = 7\times7$

.center[
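]

These sizes follow a simple recurrence over the layers: a layer with kernel $k$ and stride $s$ grows the receptive field by $(k-1)$ times the current jump between feature centers. A minimal sketch (the helper below is illustrative, not from the lecture code; padding shifts the centers but does not change the size):

```py
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, from input to output."""
    r, j = 1, 1                 # receptive field size and jump of the input "features" (pixels)
    for k, s in layers:
        r = r + (k - 1) * j     # the new layer sees (k-1) extra jumps of the previous layer
        j = j * s               # its features are s previous-jumps apart in the input
    return r

print(receptive_field([(3, 2)]))          # one k=3, s=2 layer          -> 3
print(receptive_field([(3, 2), (3, 2)]))  # two stacked k=3, s=2 layers -> 7
```

.center[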
]

---

## Receptive field

- The receptive field is defined as the region in the input space that a particular CNN feature is looking at (_i.e._, is affected by).
- A receptive field of a feature can be fully described by its center location and its size

.center[
] .center[.tiny[Receptive fields for convolutional and pooling layers of VGG-16]] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions - Can we do better? - ... Without adding parameters? --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions .center[
] .credit[Figure credit: F. Fleuret] --- ## Dilated convolutions - also goes by the name _convolutions à trous_
.left-column[ .center[
] ] .right-column[ .center[
] ] .reset-column[ ] .citation.tiny[DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs; Chen et al., PAMI 2016] --- ## Dilated convolutions Usage .left-column[ .center[In parallel] .center[
] ] .right-column[ .center[Stacked] .center[
]
.center[.green[More frequently used] ] .citation.tiny[DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs; Chen et al., PAMI 2016 Multi-scale context aggregation by dilated convolutions; Yu and Koltun, ICLR 2016] ] --- ## Dilated convolutions - works for 1d as well - appealing alternative to recurrent neural networks
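
In PyTorch, dilation is exposed directly as an argument of the convolution modules. A minimal 1d sketch (channel counts and dilation rates are arbitrary; the stacked dilations 1, 2, 4, 8 mimic the WaveNet-style pattern):

```py
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64)   # (batch, channels, time)

# A 3-tap filter with dilation 4 covers 9 time steps with only 3 weights per channel pair.
conv = nn.Conv1d(16, 16, kernel_size=3, dilation=4, padding=4)
print(conv(x).shape)         # -> torch.Size([1, 16, 64])

# Stacking layers with exponentially growing dilations grows the receptive field
# exponentially, with the same number of parameters per layer.
stack = nn.Sequential(*[nn.Conv1d(16, 16, 3, dilation=d, padding=d) for d in [1, 2, 4, 8]])
print(stack(x).shape)        # -> torch.Size([1, 16, 64])
```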
.center[
] .citation.tiny[WaveNet: A Generative Model for Raw Audio, A. van den Oord et al., 2016] --- ## Previously: GoogLeNet / Inception Szegedy et al. (2015) also introduce the idea of "auxiliary classifiers" to help the propagation of the gradient in the early layers. This is motivated by the reasonable performance of shallow networks that indicates early layers already encode informative and invariant features. .center[
]

---

## Previously: GoogLeNet / Inception

The resulting GoogLeNet has 12 times fewer parameters than AlexNet and is more accurate on ILSVRC14 (Szegedy et al., 2015).

.center[
]

It was later extended with batch normalization (Ioffe and Szegedy, 2015) and pass-through connections à la ResNet (Szegedy et al., 2016).

---

## Previously: GoogLeNet / Inception
.center[
] .credit[Slide credit: A. Karpathy] --- ## A saturation point If we continue stacking more layers on a CNN:
.center[
] -- .center[.red[Deeper models are harder to optimize]] .credit[Slide credit: J. Johnson] --- .left-column[ ## ResNet ] .citation.tiny[ .left-column[Deep residual learning for image recognition, He et al., CVPR 2016. ] ] .right-column[ .center[
] ] A block learns the residual w.r.t. identity .center[
] -- - Good optimization properties .credit[Slide credit: C. Ollion & O. Grisel] --- .left-column[ ## ResNet ] .citation.tiny[ .left-column[ Deep residual learning for image recognition, He et al., CVPR 2016. ] ] .right-column[ .center[
] ] Even deeper models: 34, 50, 101, 152 layers .credit[Slide credit: C. Ollion & O. Grisel] --- .left-column[ ## ResNet ] .citation.tiny[ .left-column[ Deep residual learning for image recognition, He et al., CVPR 2016. ] ] .right-column[ .center[
] ] ResNet50 Compared to VGG: #### Superior accuracy in all vision tasks
**5.25%** top-5 error vs 7.1%

--

#### Fewer parameters
**25M** vs 138M -- #### Computational complexity
**3.8B Flops** vs 15.3B Flops -- #### Fully Convolutional until the last layer .credit[Slide credit: C. Ollion & O. Grisel] --- ## ResNet Performance on ImageNet .center[
] --- ## ResNet The output of a residual network can be understood as an ensemble, which explains in part its stability .center[
] .citation.tiny[Residual Networks Behave Like Ensembles of Relatively Shallow Networks, A. Veit et al., NIPS 2016] --- ## ResNet Results .center[
] --- ## ResNet Results .center[
]

---

## ResNet

In PyTorch:

```py
def make_resnet_block(num_feature_maps, kernel_size = 3):
    return nn.Sequential(
        nn.Conv2d(num_feature_maps, num_feature_maps,
                  kernel_size = kernel_size,
                  padding = (kernel_size - 1) // 2),
        nn.BatchNorm2d(num_feature_maps),
        nn.ReLU(inplace = True),
        nn.Conv2d(num_feature_maps, num_feature_maps,
                  kernel_size = kernel_size,
                  padding = (kernel_size - 1) // 2),
        nn.BatchNorm2d(num_feature_maps),
    )
```

---

## ResNet

In PyTorch:

```py
def __init__(self, num_residual_blocks, num_feature_maps):
    ...
    self.resnet_blocks = nn.ModuleList()
    for k in range(num_residual_blocks):
        self.resnet_blocks.append(make_resnet_block(num_feature_maps, 3))
    ...
```

```py
def forward(self, x):
    ...
    for b in self.resnet_blocks:
*       x = x + b(x)
    ...
    return x
```

---

## Deeper is better

.center[
] .citation.tiny[ from Kaiming He slides "Deep residual learning for image recognition." ICML. 2016. ] --- ## Resnet variants: Stochastic Depth Networks - DropOut at layer level - Allows training up to 1K layers .center[
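]

A minimal sketch of the idea, not the authors' exact formulation (the survival probability is a constant here, while the paper decays it linearly with depth):

```py
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.block(x)    # keep the residual branch
            return x                        # drop the whole branch: identity only
        # at test time, keep the branch but scale it by its survival probability
        return x + self.survival_prob * self.block(x)
```

.center[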
] .citation.tiny[Deep Networks with Stochastic Depth, Huang et al., ECCV 2016] --- ## Resnet variants: DenseNet - Copying feature maps to upper layers via skip-connections - Better reuse of parameters and redundancy avoidance .center[
] .center[
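]

A minimal sketch of the dense connectivity pattern (simplified: real DenseNet layers are BN-ReLU-Conv compositions, and blocks are separated by transition layers):

```py
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # each layer sees the concatenation of all previous feature maps
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(TinyDenseBlock(16, growth_rate=12, num_layers=3)(x).shape)  # -> [1, 52, 32, 32]
```

.center[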
]

.citation.tiny[Densely Connected Convolutional Networks, Huang et al., CVPR 2017]

---

## Inception-V4 / -ResNet-V2

Deep, modular and state-of-the-art

Achieves **3.1% top-5** classification error on ImageNet

.center[
] .citation.tiny[Inception-v4, inception-resnet and the impact of residual connections on learning, C. Szegedy et al., 2016 ] .credit[Slide credit: C. Ollion & O. Grisel] --- ## Inception-V4 / -ResNet-V2 More building blocks engineering... .center[
]

.citation.tiny[Inception-v4, inception-resnet and the impact of residual connections on learning, C. Szegedy et al., 2016]

.credit[Slide credit: C. Ollion & O. Grisel]

--

- Active area of research
- See also DenseNets, Wide ResNets, Fractal ResNets, ResNeXts, Pyramidal ResNets...

---

## Comparison of models

Top-1 accuracy, performance and size on ImageNet

.center[
] .citation.tiny[An Analysis of Deep Neural Network Models for Practical Applications, Canziani et al., 2016 ] --- ## Comparison of models Forward pass time and power consumption .center[
] .citation.tiny[An Analysis of Deep Neural Network Models for Practical Applications, Canziani et al., 2016 ] --- ## Comparison of models .center[
] .credit[Slide credit: A. Vedaldi] --- ## Comparison of models 3 x more accurate in 3 years .center[
] 101 ResNet Layers same size/speed as 16 VGG-VD layers .credit[Slide credit: A. Vedaldi] --- ## Comparison of models Number of parameters is about the same .center[
] .credit[Slide credit: A. Vedaldi] --- ## Comparison of models 5 x slower .center[
] .credit[Slide credit: A. Vedaldi] --- class: center, middle # Understanding and visualizing CNNs .center[
] --- ## What happens inside a CNN?
.center[
] --- ## What happens inside a CNN? .left-column[
.center[Visualize first layers filters/weights] ] .right-column[.center[
]] .reset-column[] .center[
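]

For reference, a minimal sketch of how such first-layer filters can be read out of a trained network (assuming torchvision's pretrained AlexNet; the rescaling to $[0, 1]$ is only for display):

```py
import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True)

# first conv layer: 64 filters of shape 3 x 11 x 11
w = model.features[0].weight.data.clone()
print(w.shape)                       # -> torch.Size([64, 3, 11, 11])

# rescale to [0, 1] so the filters can be displayed as small RGB images
w = (w - w.min()) / (w.max() - w.min())
grid = torchvision.utils.make_grid(w, nrow=8, padding=1)
print(grid.shape)                    # a single image tiling the 64 filters
```

.center[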
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[
.center[Visualize first layers filters/weights] ] .right-column[.center[
]] .reset-column[ ] .left[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[ - Visualize behavior in higher layers - We can visualize filters at higher layers, but they are less intuitive ] .right-column[.center[
]] .reset-column[ ]
.center[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[
.center[Visualize first layers filters/weights] ] .right-column[.center[
]] .reset-column[ ] .left[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[
.center[Visualize first layers filters/weights] ] .right-column[.center[
]] .reset-column[ ] .left[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[
.center[Visualize first layers filters/weights] ] .right-column[.center[
]] .reset-column[ ] .left[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[
.center[Visualize first layers filters/weights] ] .right-column[.center[
]] .reset-column[ ] .left[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## What happens inside a CNN? .left-column[ - 4096d "signature" for an image (layer right before the classifier) - Visualize with t-SNE: [here](http://cs.stanford.edu/people/karpathy/cnnembed/) ] .right-column[.center[
]] .reset-column[ ] .center[
] --- ## Feature evolution during training - For a particular neuron (that generates a feature map) - Pick the strongest activation during training - For epochs 1, 2, 5, 10, 20, 30, 40, 64
.center[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## Visualize layer activations/feature maps AlexNet .center[
] .center[
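]

Feature maps like these can be captured with a forward hook. A minimal sketch (assuming torchvision's pretrained AlexNet and a dummy input; in practice, use a real, normalized image):

```py
import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True).eval()

activations = {}
def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

# register a hook on the first conv layer (index 0 of model.features)
model.features[0].register_forward_hook(save_activation('conv1'))

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)
print(activations['conv1'].shape)    # -> torch.Size([1, 64, 55, 55]), one 2d map per channel
```

.center[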
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps AlexNet .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps AlexNet .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps AlexNet .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps AlexNet .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps ResNet152 .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps ResNet152 .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize layer activations/feature maps ResNet152 .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Occlusion sensitivity .center[
] .citation.tiny[Visualizing and Understanding Convolutional Networks, M. Zeiler & R. Fergus, ECCV 2014] --- ## Occlusion sensitivity An approach to understand the behavior of a network is to look at the output of the network "around" an image. We can get a simple estimate of the importance of a part of the input image by computing the difference between: 1. the value of the maximally responding output unit on the image, and 2. the value of the same unit with that part occluded. --- ## Occlusion sensitivity An approach to understand the behavior of a network is to look at the output of the network "around" an image. We can get a simple estimate of the importance of a part of the input image by computing the difference between: 1. the value of the maximally responding output unit on the image, and 2. the value of the same unit with that part occluded. .red[This is computationally intensive since it requires as many forward passes as there are locations of the occlusion mask, ideally the number of pixels.] --- ## Occlusion sensitivity .center[
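]

A minimal sketch of this procedure (assuming a pretrained classifier `model` and a preprocessed input `x` of shape $1\times3\times H\times W$; the patch size, stride and fill value are arbitrary):

```py
import torch

def occlusion_map(model, x, target_class, patch=32, stride=16, fill=0.0):
    model.eval()
    with torch.no_grad():
        base = model(x)[0, target_class].item()   # score on the unoccluded image
        _, _, H, W = x.shape
        rows = (H - patch) // stride + 1
        cols = (W - patch) // stride + 1
        heatmap = torch.zeros(rows, cols)
        for i in range(rows):
            for j in range(cols):
                occluded = x.clone()
                occluded[:, :, i*stride:i*stride+patch, j*stride:j*stride+patch] = fill
                # importance of the patch = drop in the class score when it is hidden
                heatmap[i, j] = base - model(occluded)[0, target_class].item()
    return heatmap
```

.center[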
] .credit[Figure credit: F. Fleuret] --- ## Occlusion sensitivity .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Occlusion sensitivity .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Occlusion sensitivity .center[
] .center[
] .credit[Figure credit: F. Fleuret] --- ## Visualize arbitrary neurons DeepVis toolbox [https://www.youtube.com/watch?v=AgkfIQ4IGaM ](https://www.youtube.com/watch?v=AgkfIQ4IGaM ) .center[
] --- ## Many more visualization techniques .center[
] --- ## Other resources DrawNet [http://people.csail.mit.edu/torralba/research/drawCNN/drawNet.html](http://people.csail.mit.edu/torralba/research/drawCNN/drawNet.html) .center[
] --- ## Other resources Basic CNNs [http://scs.ryerson.ca/~aharley/vis/](http://scs.ryerson.ca/~aharley/vis/) .center[
] --- ## Other resources Keras-JS [https://transcranial.github.io/keras-js/](https://transcranial.github.io/keras-js/) .center[
] --- ## Other resources TensorFlow playground [http://playground.tensorflow.org](http://playground.tensorflow.org) .center[
] --- class: center, middle # GPUs .center[
] --- ## CPU vs GPU
.left[ .center[CPU]
.center[
] ] .right[ .center[GPU]
.center[
] ] --- ## CPU vs GPU .left-column[ - CPU: + fewer cores; each core is faster and more powerful + useful for sequential tasks ] .right-column[ - GPU: + more cores; each core is slower and weaker + great for parallel tasks ] .reset-column[] .center[
] --- ## CPU vs GPU .left-column[ - CPU: + fewer cores; each core is faster and more powerful + useful for sequential tasks ] .right-column[ - GPU: + more cores; each core is slower and weaker + great for parallel tasks ] .reset-column[] .center[
] .credit[Figure credit: J. Johnson] --- ## CPU vs GPU - SP = single precision, 32 bits / 4 bytes - DP = double precision, 64 bits / 8 bytes .center[
] --- ## CPU vs GPU .center[
] .citation.tiny[Benchmarking State-of-the-Art Deep Learning Software Tools, Shi et al., 2016] --- ## CPU vs GPU - more benchmarks available at [https://github.com/jcjohnson/cnn-benchmarks](https://github.com/jcjohnson/cnn-benchmarks) .center[
] .credit[Figure credit: J. Johnson] --- ## CPU vs GPU - more benchmarks available at [https://github.com/jcjohnson/cnn-benchmarks](https://github.com/jcjohnson/cnn-benchmarks) .center[
] .credit[Figure credit: J. Johnson] --- ## System .center[
] .credit[Figure credit: F. Fleuret] --- ## System .center[
] .credit[Figure credit: F. Fleuret] --- ## System .center[
] .credit[Figure credit: F. Fleuret] --- ## System .center[
] .credit[Figure credit: F. Fleuret] --- ## System .center[
] .credit[Figure credit: F. Fleuret] --- ## System .center[
] .credit[Figure credit: F. Fleuret] --- ## System .center[
]

.credit[Figure credit: F. Fleuret]

---

## GPU

- NVIDIA GPUs are programmed through CUDA (.purple[Compute Unified Device Architecture])
- The alternative is OpenCL, supported by several manufacturers, but with significantly less investment than NVIDIA
- NVIDIA and CUDA dominate the field by far, though some alternatives are emerging: Google TPUs, embedded devices.

---

## Libraries

- BLAS (.purple[Basic Linear Algebra Subprograms]): vector/matrix products, and the cuBLAS implementation for NVIDIA GPUs
- LAPACK (.purple[Linear Algebra Package]): linear system solving, Eigen-decomposition, etc.
- cuDNN (.purple[NVIDIA CUDA Deep Neural Network library]): computations specific to deep learning on NVIDIA GPUs

---

## GPU usage in PyTorch

- Tensors of torch.cuda types are in the GPU memory. Operations on them are done by the GPU and the resulting tensors are stored in its memory.
- Operations cannot mix different tensor types (CPU vs. GPU, or different numerical types), except for `copy_()`
- Moving data between the CPU and the GPU memories is far slower than moving it inside the GPU memory.

---

## GPU usage in PyTorch

- The `Tensor` method `cuda()` returns a clone on the GPU if the tensor is not already there, or returns the tensor itself if it was already there, keeping the bit precision.
- The method `cpu()` makes a clone on the CPU if needed.
- They both keep the original tensor unchanged.

---

class: center, middle

# Training deep networks

### Tricks of the trade

---

## Data pre-processing

- Input variables should be as decorrelated as possible
  + Input variables are "more independent"
  + Network is forced to find non-trivial correlations between inputs
  + Decorrelated inputs $\rightarrow$ better optimization
- Input variables should follow a more or less Gaussian distribution
- In practice:
  + compute mean and standard deviation
    * per pixel: $(\mu, \sigma^2)$
    * per color channel:

.center[
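]

A minimal sketch of how these per-channel statistics can be estimated over the training set (assuming a `DataLoader` yielding `(image, label)` batches of shape $B\times3\times H\times W$ with fixed-size images):

```py
import torch

def channel_stats(loader):
    n, mean, sq_mean = 0, torch.zeros(3), torch.zeros(3)
    for images, _ in loader:
        b = images.size(0)
        images = images.view(b, 3, -1)                   # flatten the spatial dimensions
        mean += images.mean(dim=2).sum(dim=0)            # accumulate per-image channel means
        sq_mean += images.pow(2).mean(dim=2).sum(dim=0)  # and second moments
        n += b
    mean /= n
    std = (sq_mean / n - mean.pow(2)).sqrt()
    return mean, std                                     # feed these to transforms.Normalize
```

.center[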
] --- ## Data pre-processing Code from `torchvision/transforms/functional.py` ```py def normalize(tensor, mean, std): ... for t, m, s in zip(tensor, mean, std): t.sub_(m).div_(s) return tensor ``` --- ## Data augmentation - Changing the pixels without changing the label - Train on transformed data - Widely used .center[
] .credit[Figure credit: E. Gavves] --- ## Data augmentation ### Horizontal flips .center[
] .credit[Figure credit: A. Karpathy] --- ## Data augmentation ### Random crops/scales .center[
] .credit[Figure credit: A. Karpathy] --- ## Data augmentation ### Random crops/scales .center[
] + __Training__: sample random crops/scales + __Testing__: average a fixed set of crops .credit[Figure credit: A. Karpathy] --- ## Data augmentation ### Color jitter .center[
] + randomly jitter color, brightness, contrast, etc. + other more complex alternatives exist (PCA-jittering) .credit[Figure credit: A. Karpathy] --- ## Data augmentation - Various techniques can be mixed - Domain knowledge helps in finding new data augmentation techniques - Very useful for small datasets
.center[
]

---

## Data augmentation

```py
from torchvision import transforms

data_transforms = {
    'train': transforms.Compose([
*       transforms.RandomSizedCrop(224),
*       transforms.RandomHorizontalFlip(),
*       transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Scale(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
```

No need for data augmentation on the validation set.

---

## Weight initialization

.big[There are a few contradictory requirements:]

- Weights need to be small enough
  + around origin for symmetric activation functions (tanh, sigmoid) $\rightarrow$ stimulate activation functions near their linear regime
  + larger gradients $\rightarrow$ faster training
- Weights need to be large enough
  + otherwise the signal is too weak for any serious learning

.center[
] --- ## Weight initialization - Weights should evolve at the same rate across layers during training, and no layer should reach a saturation behavior before others. - Weights must be initialized to preserve the variance of the activations during the forward and backward computations + neurons will operate in their full capacity - Initialize weights to be asymmetric + if all weights are 0, neurons generate same gradient - Initialization depends on .purple[non-linearities] and .purple[data normalization] --- ## Weight initialization From `torch/nn/modules/linear.py` ```py def reset_parameters(self): stdv = 1. / math.sqrt(self.weight.size(1)) self.weight.data.uniform_(-stdv, stdv) if self.bias is not None: self.bias.data.uniform_(-stdv, stdv) ``` --- ## Weight initialization From `torch/nn/modules/linear.py` ```py def reset_parameters(self): stdv = 1. / math.sqrt(self.weight.size(1)) self.weight.data.uniform_(-stdv, stdv) if self.bias is not None: self.bias.data.uniform_(-stdv, stdv) ``` .red[When used with tanh almost all neurons get completely either -1 and 1. Gradients will be zero] --- ## Xavier initialization - We get a better compromise with "Xavier initialization" - From `torch/nn/init.py`: ```py def xavier_normal(tensor, gain=1): if isinstance(tensor, Variable): xavier_normal(tensor.data, gain=gain) return tensor fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor) std = gain * math.sqrt(2.0 / (fan_in + fan_out)) return tensor.normal_(0, std) ``` `fan_in` = num neurons in the input `fan_out` = num neurons at the output .citation.tiny[ Understanding the difficulty of training deep feedforward neural networks, Glorot & Bengio, 2010] --- ## Xavier initialization - We get a better compromise with "Xavier initialization" - From `torch/nn/init.py`: ```py def xavier_normal(tensor, gain=1): if isinstance(tensor, Variable): xavier_normal(tensor.data, gain=gain) return tensor fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor) std = gain * math.sqrt(2.0 / (fan_in + fan_out)) return tensor.normal_(0, std) ``` .red[Unlike sigmoids, ReLUs ground to 0 the linear activation about half the time] .citation.tiny[ Understanding the difficulty of training deep feedforward neural networks, Glorot & Bengio, 2010] --- ## Kaiming He initialization - Double weight variance (_i.e._ multiply with $\sqrt{2}$) in order to: + compensate for the zero flat area $\rightarrow$ input and output maintain same variance + very similar to _Xavier_ initialization - From `torch/nn/init.py`: ```py def kaiming_normal(tensor, a=0, mode='fan_in'): if isinstance(tensor, Variable): kaiming_normal(tensor.data, a=a, mode=mode) return tensor fan = _calculate_correct_fan(tensor, mode) gain = calculate_gain('leaky_relu', a) std = gain / math.sqrt(fan) return tensor.normal_(0, std) ``` $gain = \sqrt{2}$ .citation.tiny[Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015] --- ## Kaiming He initialization The same type of reasoning can be applied to other activation functions From `torch/nn/init.py`: ```py def calculate_gain(nonlinearity, param=None): linear_fns = ['linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d', 'conv_transpose2d', 'conv_transpose3d'] * if nonlinearity in linear_fns or nonlinearity == 'sigmoid': * return 1 * elif nonlinearity == 'tanh': * return 5.0 / 3 * elif nonlinearity == 'relu': * return math.sqrt(2.0) elif nonlinearity == 'leaky_relu': if param is None: negative_slope = 0.01 elif not isinstance(param, bool) and isinstance(param, int) or 
isinstance(param, float): # True/False are instances of int, hence check above negative_slope = param else: raise ValueError("negative_slope {} not a valid number".format(param)) return math.sqrt(2.0 / (1 + negative_slope ** 2)) else: raise ValueError("Unsupported nonlinearity {}".format(nonlinearity)) ``` --- ## Weight initialization Does it actually matter that much? --- ## Weight initialization Does it actually matter that much? .center[
] .left-column[ .center[
] ] .right-column[ .center[
] ]

---

## Hyper-parameter search

- Coarse $\rightarrow$ fine cross-validation stage
- First stage: only a few epochs to get a rough idea of what params work
- Second stage: longer running time, finer search
- Usually there are some typical values for:
  + Learning rate: [1e-1, 1e-5] (log space steps)
  + weight-decay: 0.0005
  + momentum: 0.5, 0.9, 0.99
- Learning rate:
  + For learning rate use log scale when checking values
  + If loss == NaN, learning rate is too big
  + If loss stagnates, learning rate is too small

---

## Architecture hyperparameters

.big[There is no silver bullet.]

- Re-use something well known that works and start from there
- Modulate the capacity until it overfits a small subset, but does not overfit / underfit the full set
- Capacity increases with more layers, more channels, larger receptive fields, or more units
- Regularization to reduce the capacity or induce sparsity
- Use prior knowledge about the "scale of meaningful context" to size filters / combinations of filters (e.g. knowing the size of objects in a scene, the max duration of a sound snippet that matters)
- Grid-search all the variations that come to mind (if you can afford to)

.credit.tiny[Slide credit: F. Fleuret]

---

## Architecture hyperparameters

- Number of hidden layers
  + start small (a few layers) and increase complexity gradually
  + add more layers $\rightarrow$ check if performance (on validation set) improves
  + add more neurons $\rightarrow$ check if performance (on validation set) improves
---

## Architecture hyperparameters

- Number of hidden layers
  + start small (a few layers) and increase complexity gradually
  + add more layers $\rightarrow$ check if performance (on validation set) improves
  + add more neurons $\rightarrow$ check if performance (on validation set) improves
- Activation function
  + start with ReLU, then check out others: LeakyReLU, PReLU, etc.

---

## Architecture hyperparameters

- Number of hidden layers
  + start small (a few layers) and increase complexity gradually
  + add more layers $\rightarrow$ check if performance (on validation set) improves
  + add more neurons $\rightarrow$ check if performance (on validation set) improves
- Activation function
  + start with ReLU, then check out others: LeakyReLU, PReLU, etc.
- Type and amount of regularization
  + use $L_2$ even if the network is deep or wide
  + weight decay = $5e-5$
  + you can set weight decay to 0 if the learning rate is very small

---

## Learning rate

The most tweaked hyperparameter

.center[
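]

Following the earlier advice to explore learning rates on a log scale, a minimal random-search sketch (the range and the number of trials are arbitrary):

```py
import math
import random

def sample_log_uniform(low=1e-5, high=1e-1):
    # sample the exponent uniformly, so 1e-5..1e-4 is as likely as 1e-2..1e-1
    return 10 ** random.uniform(math.log10(low), math.log10(high))

for trial in range(10):
    lr = sample_log_uniform()
    # train for a few epochs with this lr, keep the most promising configurations
    print("trial %d: lr = %.2e" % (trial, lr))
```

.center[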
] .citation.tiny[Ben Recht] --- ## Learning rate The most tweaked hyperparameter .center[
] .center.red[Very active area of research!] .citation.tiny[Ben Recht] --- ## Learning rate The appropriate learning rate will lead to faster convergence by: - reducing the loss quickly $\rightarrow$ large learning rate - not be trapped in bad minimum $\rightarrow$ large learning rate - not bounce around in narrow valleys $\rightarrow$ small learning rate - not oscillate around a minimum $\rightarrow$ small learning rate .credit.tiny[Slide credit: F. Fleuret] --- ## Learning rate The appropriate learning rate will lead to faster convergence by: - reducing the loss quickly $\rightarrow$ large learning rate - not be trapped in bad minimum $\rightarrow$ large learning rate - not bounce around in narrow valleys $\rightarrow$ small learning rate - not oscillate around a minimum $\rightarrow$ small learning rate So learning rate should be larger at the beginning and smaller in the end. The practical strategy is to look at the losses and error rates across epochs and pick a learning rate and learning rate adaptation. .credit.tiny[Slide credit: F. Fleuret] --- ## Learning rate .center[CIFAR10 dataset] .center[
] .center[32 x 32 color images, 50k train samples, 10k test samples, 10 classes] --- ## Learning rate Small CNN on CIFAR10, cross-entropy, batch size 100, $\eta$ = 1e-1 .center[
] .credit.tiny[Figure credit: F. Fleuret] --- ## Learning rate Small CNN on CIFAR10, cross-entropy, batch size 100 .center[
] .credit.tiny[Figure credit: F. Fleuret] --- ## Learning rate Using $\eta$=1e-1 for 25 epochs, then reducing it. .center[
]

.credit.tiny[Figure credit: F. Fleuret]

---

## Learning rate

Using $\eta$=1e-1 for 25 epochs, then reducing it to 1e-2

.center[
] .credit.tiny[Figure credit: F. Fleuret] --- ## Learning rate The test loss is a poor performance indicator, as it may increase even more on misclassified examples, and decrease less on the ones getting fixed. .center[
] .credit.tiny[Figure credit: F. Fleuret] --- ## Learning rate schedules .big[Decay learning rate over time: ] - .purple[constant]: learning rate remains constant for all epochs (not a good idea) - .purple[step decay]: decay learning by fixed amount (_e.g._ half) every few epochs - .purple[exponential decay]: $\eta = \eta_0 e^{-kt}$ - .purple[inverse decay]: $\eta = \frac{\eta_0}{1+kt}$ .big[In many cases, step decay is preferred.] --- ## Learning rate schedules .center[
] Decay is more common for SGD+momentum and less for Adam. --- ## Learning rate schedules Cyclic learning rates Use multiple snapshots of a single model. .left-column[ .center[
] ] .right-column[ .center[
] ]

.citation.tiny[Snapshot ensembles: train 1, get M for free, Huang et al., ICLR 2017]

---

## Learning rate schedules

Using `torch.optim.lr_scheduler`:

Vanilla variants: `StepLR`, `MultiStepLR`, `ExponentialLR`

```py
# Assuming optimizer uses lr = 0.05 for all groups
# lr = 0.05    if epoch < 30
# lr = 0.005   if 30 <= epoch < 60
# lr = 0.0005  if 60 <= epoch < 90
# ...
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    scheduler.step()
    train(...)
    validate(...)
```

```py
# Assuming optimizer uses lr = 0.05 for all groups
# lr = 0.05    if epoch < 30
# lr = 0.005   if 30 <= epoch < 80
# lr = 0.0005  if epoch >= 80
scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
for epoch in range(100):
    scheduler.step()
    train(...)
    validate(...)
```

---

## Learning rate schedules

Using `torch.optim.lr_scheduler`:

Novel variants: `ReduceLROnPlateau`

```py
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
```

---

## Early stopping

- To avoid overfitting, another popular technique is early stopping
- Monitor performance on the validation set
- Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
- Stop when the validation error starts increasing
  + most likely the network starts to overfit
  + use a _patience_ term to let it degrade for a while and then stop

.center[
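]

A minimal sketch of early stopping with a patience counter (`train_one_epoch` and `evaluate` are hypothetical helpers; the model, loaders and optimizer are assumed to exist):

```py
import torch

max_epochs, patience = 100, 10
best_val, wait = float('inf'), 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        torch.save(model.state_dict(), 'best.pth')   # keep the best weights so far
    else:
        wait += 1
        if wait >= patience:    # validation loss has not improved for `patience` epochs
            break
```

.center[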
]

---

## Loss functions

- Typically, training is easier for classification than for regression to a scalar
- However, many Computer Vision papers rely on regression losses (`MSE`, `L1`, `Huber`, etc.) with good results
- Multiple losses can be considered:
  + on the same output
  + by adding multiple heads to the network (e.g. classification + localization)
- PyTorch already provides many loss functions/criteria

---

## Summary

- Preprocess data to be centered on zero
- Initialize weights based on activation functions
- Always use $L_2$ regularization and dropout
- Use batch normalization generously
- Start with Adam, but switch to SGD once you are more familiar with the data and the problem

---

## Babysitting your network

.big[Lots of curve monitoring]

.left-column[
.center[
] ] .right-column[ .center[
]

.center[Discover more bizarre-looking curves [https://lossfunctions.tumblr.com/](https://lossfunctions.tumblr.com/)]
]

---

## Babysitting your network

- Always check gradients if not computed automatically
- Check that in the first steps you get a random loss
- Check the network with a few samples
  + turn off regularization. You should predictably overfit and reach a 0 loss
  + turn on regularization. The loss should increase
- Have a separate validation set
  + Compare the curves between training and validation sets
  + There should be a gap, but not too large

---

## Other common pitfalls

- inputs in range $[0,255]$ instead of $[0,1]$
- different pre-processing between _train_, _valid_, _test_
- non-shuffled dataset
- class imbalance
- too much data augmentation
- too much regularization

---

## Other common pitfalls

- too much/too little capacity
- bugs in the loss function: wrong input, wrong gradients
- wrong dimensions of the layers
- exploding/vanishing gradients
- training for too little time
- forgetting to set the appropriate `.train()`/`.eval()` mode

---

## Transfer learning

- Assume two datasets $S$ and $T$
- Dataset $S$ is fully annotated, with plenty of images, and we can train a model $CNN_S$ on it
- Dataset $T$ is not as thoroughly annotated and/or has fewer images
  + annotations of $T$ do not necessarily overlap with $S$
- We can use the model $CNN_S$ to learn a better $CNN_T$
- This is transfer learning

---

## Transfer learning

- Even if our dataset $T$ is not large, we can train a CNN for it
- Pre-train a CNN on the dataset $S$
- Then we can do:
  + fine-tuning
  + use the CNN as a feature extractor

---

## Fine-tuning

- Assume the parameters of $CNN_S$ are already a good start near our final local optimum
- Use them as the initial parameters for our new CNN for the target dataset
- This is a good solution when the dataset $T$ is relatively big
  + e.g. for ImageNet $S$ with 1M images, $T$ with a few thousand images

---

## Fine-tuning

.center[
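]

A minimal sketch of this recipe, illustrating the guidelines listed below (assuming torchvision's ResNet-18 pre-trained on ImageNet as $CNN_S$; `num_classes` is the hypothetical label count of $T$):

```py
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
num_classes = 10                                   # hypothetical size of T's label set

# freeze everything: the early layers trained on S are kept as-is
for param in model.parameters():
    param.requires_grad = False

# unfreeze the last residual stage so it can adapt to T
for param in model.layer4.parameters():
    param.requires_grad = True

# replace the classifier head (new layer, trained from scratch)
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# lower learning rate for the fine-tuned layers, more aggressive one for the new head
optimizer = torch.optim.SGD([
    {'params': model.layer4.parameters(), 'lr': 1e-3},
    {'params': model.fc.parameters(),     'lr': 1e-2},
], momentum=0.9)
```

.center[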
]

- Depending on the size of $T$, decide which layers to freeze and which to fine-tune/replace
- Use a lower learning rate when fine-tuning: about $\frac{1}{10}$ of the original learning rate
  + for new layers use a more aggressive learning rate
- If $S$ and $T$ are very similar, fine-tune only the fully-connected layers
- If the datasets are different and you have enough data, fine-tune all layers

---

## Recap

- Review of convolutions
- CNN architectures
- Visualizing and understanding CNNs
- Tips & tricks for training deep networks