# 1. Introduction to Dropout

## 1.1 Why do we need Dropout?
> Many machine learning models have a large number of parameters to learn while the number of training instances is limited. A model trained under such conditions is likely to overfit. This is especially common in deep learning: the training loss is small and the training accuracy is high, while the validation/testing loss is very high and the model cannot make good predictions on the validation/testing set.

> A common way to mitigate overfitting in ML is to use an ensemble of models. However, training and testing multiple models is computationally expensive.

> Dropout addresses both of these issues: overfitting and costly computation. It acts as a form of regularization.

## 1.2 What is Dropout?

> Dropout was proposed in 2012 by Hinton et al. in the paper "Improving neural networks by preventing co-adaptation of feature detectors". When a complex feedforward network is trained on a small training set, it is likely to overfit. To avoid overfitting, we can remove some feature detectors (e.g., neurons in hidden layers) to prevent their co-adaptation. Co-adaptation means that a neuron is only useful in combination with specific other neurons.

> The idea of Dropout is simple. During the forward pass, we remove neurons at random, each with a fixed probability (the dropout rate). Such an operation makes the model "thinner" and more general, since no neuron can rely too much on the others (see the figure below).

![dropout](https://drive.google.com/uc?id=1qw-nrqNkORUFs3Xk2W0c5UaeOMo4fH2m)

# 2. Dropout Mechanism

## 2.1 Algorithm of Dropout
> (1) Drop some neurons (e.g., those drawn with dashed lines in the Figure) at random, each with the dropout probability, which makes them temporarily inactive. The output neurons remain unchanged.

> (2) For each training batch, run the forward pass through the thinned network to obtain the loss, then backpropagate to update the parameters (w, b) via gradient descent. Only the parameters associated with the neurons that were not dropped are updated; the parameters associated with the dropped neurons keep the values they had before the drop. Afterwards, restore the dropped neurons to recover the original network.

> (3) Repeat the above steps (1) and (2).

<figure>
<center>
<img src='https://drive.google.com/uc?id=1GcEcf2MmIUSoFPX_OgrGlDFn9cKfXYni' />
<figcaption>Figure: Dropout in a simple fully connected feedforward NN.</figcaption></center>
</figure>
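
Steps (1) and (2) can be illustrated with a minimal pure-Python sketch. It assumes a single linear output unit with a squared-error loss; the function name `dropout_sgd_step` and all numbers are hypothetical and only serve to show that the weights of dropped neurons are left untouched by the update.

```python
import random

def dropout_sgd_step(weights, hidden, target, drop_prob, lr, rng):
    """One SGD step for output = sum_j w_j * r_j * h_j with loss 0.5 * (output - target)**2."""
    # Step (1): sample a mask for the hidden units (1 = keep, 0 = drop).
    mask = [0 if rng.random() < drop_prob else 1 for _ in hidden]
    # Step (2): forward pass through the thinned network.
    output = sum(w * r * h for w, r, h in zip(weights, mask, hidden))
    error = output - target
    # The gradient w.r.t. w_j is error * r_j * h_j, which is exactly zero
    # for every dropped unit, so its weight keeps its previous value.
    new_weights = [w - lr * error * r * h for w, r, h in zip(weights, mask, hidden)]
    return new_weights, mask

rng = random.Random(0)
weights = [0.5, -0.3, 0.8, 0.1]
hidden = [1.0, 2.0, -1.0, 0.5]
new_weights, mask = dropout_sgd_step(weights, hidden, target=1.0,
                                     drop_prob=0.5, lr=0.1, rng=rng)
for w_old, w_new, r in zip(weights, new_weights, mask):
    print("kept   " if r else "dropped", w_old, "->", w_new)
```

Calling the function again draws a new mask, which corresponds to step (3).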

## 2.2 Application of Dropout in NN

How can we implement Dropout in a NN? Let's discuss it from the perspective of the math involved.

> (1) Training period

<figure>
<center>
<img src='https://drive.google.com/uc?id=11JNbe99uuoUGlO-1fF2HIJagxahrdn3i' />
<figcaption>Figure: Comparison of standard NN and NN with Dropout.</figcaption></center>
</figure>

>> * forward pass without Dropout:

$z_i^{(l+1)} = W_i^{(l+1)}y^{(l)}+b_i^{(l+1)}$,

$y_i^{(l+1)} = f(z_i^{(l+1)})$

>> * forward pass with Dropout:

$r_j^{(l)} \sim \text{Bernoulli}(p)$

$\tilde{y}^{(l)} = r^{(l)}*y^{(l)}$

$z_i^{(l+1)} = W_i^{(l+1)}\tilde{y}^{(l)}+b_i^{(l+1)}$,

$y_i^{(l+1)} = f(z_i^{(l+1)})$

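Here $r^{(l)}$ is a random binary vector whose entries are drawn independently from a Bernoulli distribution, and $*$ denotes the element-wise product. Since a neuron is kept when $r_j^{(l)} = 1$, $p$ in these equations is the probability of keeping a neuron, and $1-p$ is the dropout rate. In code, dropping a neuron means setting its output (after the activation function) to 0. For example, in a layer with 1000 neurons, the outputs after the activation functions are $y_1, y_2, \cdots, y_{1000}$. If we set the dropout rate to 0.4 (i.e., $p = 0.6$), then approximately 400 of these outputs are set to 0.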
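
The forward pass with Dropout can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `dropout_forward` is a hypothetical helper, $f$ is taken to be tanh, the weights are random, and `keep_prob` is the probability that a neuron is kept (so the dropout rate is `1 - keep_prob`).

```python
import math
import random

def dropout_forward(y_prev, W, b, keep_prob, rng):
    """Compute y^(l+1) = f(W (r * y^(l)) + b) with r_j ~ Bernoulli(keep_prob)."""
    # r_j = 1 with probability keep_prob, otherwise 0
    r = [1.0 if rng.random() < keep_prob else 0.0 for _ in y_prev]
    # thinned activations: element-wise product r * y
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y_prev)]
    # pre-activations z_i = W_i . y_tilde + b_i
    z = [sum(w_ij * y_j for w_ij, y_j in zip(W_i, y_tilde)) + b_i
         for W_i, b_i in zip(W, b)]
    # f is assumed to be tanh here; any activation function could be used
    return [math.tanh(z_i) for z_i in z], r

rng = random.Random(0)
n_in, n_out = 1000, 3
y_prev = [rng.uniform(0.1, 1.0) for _ in range(n_in)]
W = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
b = [0.0] * n_out

y_next, r = dropout_forward(y_prev, W, b, keep_prob=0.6, rng=rng)
n_dropped = r.count(0.0)
print("dropped units:", n_dropped, "of", n_in)  # roughly 400 for a dropout rate of 0.4
```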

> (2) Testing period

Every weight matrix is multiplied by the keep probability $p$: $W_{test}^{(l)} = pW^{(l)}$. 

>> We cannot drop neurons at test time, because the results would be unstable: the same test input could produce output a on one run and output b on another, which is not acceptable for end users. One solution is to multiply the weights by $p$, which makes training and testing consistent. For example, suppose the output of a neuron is $x$. During training the neuron is kept with probability $p$ and dropped with probability $1-p$, so its expected output is $p \cdot x + (1-p) \cdot 0 = px$. Multiplying the outgoing weights by $p$ at test time therefore gives the same expected output as during training.

<figure>
<center>
<img src='https://drive.google.com/uc?id=1QJXSeHaauFI7sCxhyiYja05ReZ7uYPNR' />
<figcaption>Figure: Dropout at test time.</figcaption></center>
</figure>
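
The expectation argument can be checked numerically. The sketch below is a small Monte Carlo simulation with hypothetical values ($x = 2$, keep probability $0.6$): it averages the output of a single neuron over many random keep/drop decisions and compares the result with the deterministic test-time value $px$.

```python
import random

def mean_dropped_output(x, keep_prob, n_trials, rng):
    """Average of r * x over many draws of r ~ Bernoulli(keep_prob)."""
    total = 0.0
    for _ in range(n_trials):
        r = 1.0 if rng.random() < keep_prob else 0.0
        total += r * x
    return total / n_trials

rng = random.Random(0)
x, keep_prob = 2.0, 0.6
train_mean = mean_dropped_output(x, keep_prob, n_trials=100_000, rng=rng)
test_output = keep_prob * x  # deterministic value used at test time
print("average during training:", train_mean)
print("scaled test-time output:", test_output)
```

The two printed values should agree up to sampling noise, which is what makes the scaled test-time network consistent with training on average.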


# 3. Why can Dropout avoid overfitting?

> (1) Averaging. When we drop different sets of neurons, it’s equivalent to training different neural networks (as in an ensemble). So, the dropout procedure is like averaging the effects of a large number of different networks. The different networks will overfit in different ways, so the net effect of dropout will be to reduce overfitting. Also, these networks all share weights, i.e., we aren’t optimizing weights separately for each network. (Note: this means that each individual network is trained very rarely.) Nevertheless, it works and serves its purpose of regularization.

> (2) Preventing co-adaptation. In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes.

> (3) A motivation for dropout comes from a theory of the role of sex in evolution (Livnat et al., 2010). Please refer to the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting."
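
The averaging view in point (1) can be made concrete for a very small case. The sketch below assumes a single linear unit with four inputs (all values hypothetical), enumerates all $2^4$ thinned networks, weights each by the probability of its mask, and compares the resulting ensemble average with the output of one network whose weights are scaled by the keep probability.

```python
from itertools import product

def ensemble_average(weights, inputs, keep_prob):
    """Average a linear unit's output over all 2^n thinned networks."""
    avg = 0.0
    for mask in product([0, 1], repeat=len(inputs)):
        # probability of sampling this particular mask
        prob = 1.0
        for r in mask:
            prob *= keep_prob if r else (1.0 - keep_prob)
        # output of the thinned network defined by this mask
        out = sum(w * r * x for w, r, x in zip(weights, mask, inputs))
        avg += prob * out
    return avg

weights = [0.5, -0.3, 0.8, 0.1]
inputs = [1.0, 2.0, -1.0, 0.5]
keep_prob = 0.6

averaged = ensemble_average(weights, inputs, keep_prob)
scaled = sum(keep_prob * w * x for w, x in zip(weights, inputs))
print(averaged, scaled)
```

For a linear unit the two numbers coincide up to floating-point rounding. With nonlinear activations, weight scaling is only an approximation of the ensemble average.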

# 4. Dropout in PyTorch

In this section, we will 

(1) implement Dropout using PyTorch and 

(2) apply the built-in Dropout in PyTorch for a real application.

All our implementations are based on PyTorch. The model is trained on the GPU and all other tasks run on the CPU. To run this notebook without a GPU, remove the `.cuda()` calls in the code; add them back to switch to the GPU again.

In [None]:
import numpy as np
import pandas as pd
import time

import torch
import torchvision
from torchvision import datasets, transforms
from torch.autograd import Variable
import torch.nn as nn 
import torch.optim as optim
from torch.utils.data.sampler import SubsetRandomSampler

## 4.1 Dropout implementation

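> We choose to multiply the dropout output by $\frac{1}{1-p}$, where $p$ is the dropout rate, to compensate for the dropped neurons during training. Note that this $p$ is the probability of dropping a neuron, not the keep probability used in Section 2.2 (or, for example, the `keep_prob` argument of `tf.nn.dropout` in TensorFlow 1.x). With this scaling, the expected output during training already matches the test-time output, so no rescaling is needed at test time.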

In [None]:
class MyDropout(nn.Module):

  def __init__(self,p=0.5):
    super(MyDropout,self).__init__()
    self.p = p

    if self.p < 1:
      self.multiplier = 1.0/(1.0-p)
    else:
      self.multiplier = 0.0 # to avoid division by zero error

  def forward(self, x):
    # dropout is only active in training mode; in eval mode it is the identity
    if not self.training:
      return x

    # one Bernoulli(1-p) sample per element of x: 1 = keep, 0 = drop;
    # rand_like creates the random tensor on the same device (CPU/GPU) as x
    selected = (torch.rand_like(x) > self.p).type_as(x)

    # zero out the dropped entries and rescale the kept ones by 1/(1-p)
    return torch.mul(selected,x)*self.multiplier
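
As a point of comparison, the built-in `nn.Dropout` follows the same convention: `p` is the drop probability, kept entries are rescaled by $\frac{1}{1-p}$ in training mode, and the layer is the identity in evaluation mode. The following sketch checks this on a tensor of ones; the tensor size and the seed are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p=p)
x = torch.ones(1000)

# training mode: every entry is either 0 (dropped) or 1/(1-p) (kept and rescaled)
drop.train()
y_train = drop(x)
kept = y_train != 0
print("kept fraction:", kept.float().mean().item())
print("values of kept entries:", y_train[kept].unique())

# evaluation mode: the input is passed through unchanged
drop.eval()
y_eval = drop(x)
print("identity in eval mode:", torch.equal(y_eval, x))
```

Replacing `nn.Dropout` with `MyDropout` in this snippet should show the same behavior, which is a quick sanity check of the custom implementation.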

## 4.2 Dropout in Fully Connected FeedForward Networks

> (1) we load the MNIST data from torchvision

> (2) we build a multilayer perceptron (MLP) and show our implementation of Dropout is correct. 

In [None]:
transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize([0],[1])])

trainset = datasets.MNIST(root='data/',train=True,download=True,transform=transform)
testset = datasets.MNIST(root='data/',train=False,transform=transform)
print(len(trainset),len(testset))

60000 10000


In [None]:
# define a fully-connected feedforward network
class FFNN(nn.Module):

  def __init__(self,hidden_layers=[800,800],droprates=[0,0]):
    super(FFNN,self).__init__()
    self.model = nn.Sequential()
    self.model.add_module('dropout0',MyDropout(p=droprates[0]))
    self.model.add_module('input',nn.Linear(28*28,hidden_layers[0]))
    self.model.add_module('tanh',nn.Tanh())

    for i,d in enumerate(hidden_layers[:-1]):
      self.model.add_module('dropout_hidden'+str(i+1),MyDropout(p=droprates[1]))
      self.model.add_module('hidden'+str(i+1),nn.Linear(hidden_layers[i],hidden_layers[i+1]))
      self.model.add_module('tanh_hidden'+str(i+1),nn.Tanh())
      
    self.model.add_module('final',nn.Linear(hidden_layers[-1],10))

  def forward(self,x):
    x = x.view(x.shape[0],28*28)
    x = self.model(x)

    return x
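
For reference, the same layout can be written with the built-in `nn.Dropout`. This is a sketch rather than the notebook's own model: `build_mlp` is a hypothetical helper, the layer sizes follow the experiments in this section, and the input is a random batch shaped like MNIST images so the snippet runs on the CPU without downloading data.

```python
import torch
import torch.nn as nn

def build_mlp(hidden_layers=(256, 256), droprates=(0.2, 0.5)):
    """Same layer order as FFNN, but using nn.Dropout instead of MyDropout."""
    # dropout on the input, then the first hidden layer
    layers = [nn.Dropout(p=droprates[0]), nn.Linear(28 * 28, hidden_layers[0]), nn.Tanh()]
    # dropout before every further hidden layer
    for n_in, n_out in zip(hidden_layers[:-1], hidden_layers[1:]):
        layers += [nn.Dropout(p=droprates[1]), nn.Linear(n_in, n_out), nn.Tanh()]
    # output layer with 10 classes
    layers.append(nn.Linear(hidden_layers[-1], 10))
    # nn.Flatten turns (batch, 1, 28, 28) images into (batch, 784) vectors
    return nn.Sequential(nn.Flatten(), *layers)

model = build_mlp()
x = torch.randn(4, 1, 28, 28)  # a fake batch shaped like MNIST images

model.eval()  # dropout disabled: repeated calls give identical outputs
with torch.no_grad():
    out1, out2 = model(x), model(x)
print(out1.shape, torch.equal(out1, out2))
```

In training mode (`model.train()`), repeated calls on the same input generally give different outputs because a new mask is sampled each time.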

In [None]:
# define a classifier for training

class FFNNClassifier:
  
  def __init__(self,hidden_layers=[800,800],droprates=[0,0],batch_size=128,max_epoch=10,lr=0.1,momentum=0):
    self.hidden_layers = hidden_layers
    self.droprates = droprates
    self.batch_size = batch_size
    self.max_epoch = max_epoch
    self.model = FFNN(hidden_layers=hidden_layers,droprates=droprates)
    self.model.cuda()
    self.criterion = nn.CrossEntropyLoss().cuda()
    self.optimizer = optim.SGD(self.model.parameters(),lr=lr,momentum=momentum)
    self.loss = []
    self.test_accuracy = []
    self.test_error = []
    #print(self.model)

  def fit(self,trainset,testset,verbose=True):
    trainloader = torch.utils.data.DataLoader(trainset,batch_size=self.batch_size,shuffle=True)
    testloader = torch.utils.data.DataLoader(testset,batch_size=len(testset),shuffle=True)
    X_test,y_test = next(iter(testloader))
    X_test = X_test.cuda()

    for epoch in range(self.max_epoch):
      running_loss = 0
      for i,data in enumerate(trainloader):
        inputs,labels = data
        # tensors can be moved to the GPU directly; the Variable wrapper is no longer needed
        inputs,labels = inputs.cuda(),labels.cuda()
        
        self.optimizer.zero_grad()
        outputs = self.model(inputs)
        loss = self.criterion(outputs,labels)
        loss.backward()
        self.optimizer.step()
        
        running_loss += loss.item()
      self.loss.append(running_loss / len(trainloader))

      if verbose:
        print('Epoch {} loss: {}'.format(epoch+1, self.loss[-1]))
      
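      y_test_pred = self.predict(X_test).cpu()
      # count correct predictions exactly; int(len(testset)*(1-accuracy)) can be off by one due to float rounding
      n_correct = int((y_test_pred == y_test).sum())
      self.test_accuracy.append(n_correct / len(y_test))
      self.test_error.append(len(y_test) - n_correct)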

      if verbose:
        print('# Misclassified: {}, test accuracy: {}'.format(self.test_error[-1],self.test_accuracy[-1]))
      
    return self

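  def predict(self,x):
    # switch to evaluation mode so that dropout is disabled
    self.model.eval()
    # no gradients are needed for inference
    with torch.no_grad():
      outputs = self.model(x)
    _,pred = torch.max(outputs,1)
    # switch back to training mode for the next epoch
    self.model.train()

    return pred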

  def __str__(self):
    return 'Hidden layers: {}, dropout rates: {}'.format(self.hidden_layers, self.droprates)


In [None]:
# run various testing experiments using different settings
hidden_layers = [256,256]
max_epoch = 10

# define networks
ffnn = [FFNNClassifier(hidden_layers=hidden_layers, droprates=[0,0]),
        FFNNClassifier(hidden_layers=hidden_layers, droprates=[0,0.5]),
        FFNNClassifier(hidden_layers=hidden_layers, droprates=[0.2,0.5])]

# training
for i in range(len(ffnn)):
  m = ffnn[i]
  print('Processing model {}\n====='.format(i+1))
  m.fit(trainset,testset) 

# save torch models
for ind,m in enumerate(ffnn):
  torch.save(m.model,'model_'+str(ind)+'.pth')
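
An alternative to pickling the whole model object is to save only its `state_dict`, which is the approach the PyTorch documentation generally recommends. The sketch below uses a small hypothetical model and an in-memory buffer in place of a file such as `model_0.pth`; with a real file, the buffer would be replaced by the file path.

```python
import io
import torch
import torch.nn as nn

# a small hypothetical model standing in for m.model
model = nn.Sequential(nn.Dropout(0.5), nn.Linear(4, 2))

# save only the parameters; the buffer stands in for a file on disk
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# to restore, rebuild the same architecture and load the parameters into it
buffer.seek(0)
restored = nn.Sequential(nn.Dropout(0.5), nn.Linear(4, 2))
restored.load_state_dict(torch.load(buffer))
restored.eval()  # disable dropout before making predictions

print(torch.equal(model[1].weight, restored[1].weight))
```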