CS231n (winter 2016) : Assignment2

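Preface:

This post follows the Python programming assignments of Stanford's CS231n course and uses them to work through the course's main ideas, along with some of the mathematical derivations. It is best read on a desktop browser. The course materials and code are listed below:
Videos and slides
Notes
Assignment 2 starter code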

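Part 1: Deep fully-connected neural networks (Python programming task)

In Assignment 1 we built a simple 2-layer fully-connected network, but the code was not modular: every computation (the loss, the gradients, and so on) lived in a single function, so the network structure could not be changed freely. Here we take a more modular approach. Each module is independent of the others and they can call one another at run time, which makes the network architecture very flexible. A layer's forward pass looks like this: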

def layer_forward(x, w):
    """ Receive inputs x and weights w """
    # Do some computations ...
    z = None    # ... some intermediate value
    # Do some more computations ...
    out = None  # the output
    cache = (x, w, z, out)  # Values we need to compute gradients

    return out, cache

The backward pass will receive upstream derivatives and the cache object, 
and will return gradients with respect to the inputs and weights, like this:

def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivative with respect to inputs.
    """
    # Unpack cache values
    x, w, z, out = cache

    # Use values in cache to compute derivatives
    dx = None   # Derivative of loss with respect to x
    dw = None   # Derivative of loss with respect to w

    return dx, dw
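As a concrete instance of the forward/backward pattern above, here is a minimal sketch of an element-wise sigmoid layer. The names sigmoid_forward and sigmoid_backward are hypothetical (they are not among the functions listed in this post); the point is only to show what goes into the cache and how the backward pass uses it.

```python
import numpy as np

def sigmoid_forward(x):
    """Forward pass of an element-wise sigmoid layer (hypothetical helper)."""
    out = 1.0 / (1.0 + np.exp(-x))
    cache = out  # the derivative can be written in terms of the output alone
    return out, cache

def sigmoid_backward(dout, cache):
    """Backward pass: d(sigmoid)/dx = out * (1 - out), times the upstream gradient."""
    out = cache
    return dout * out * (1.0 - out)

x = np.array([[-2.0, 0.0, 3.0]])
out, cache = sigmoid_forward(x)
dx = sigmoid_backward(np.ones_like(x), cache)
```

Because the layer has no weights, the backward function returns only dx; a layer with parameters would also return their gradients, as in layer_backward above.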

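In addition, we will build the parameter update rules covered earlier into this modular framework, so that we can compare how different update rules perform. We will also add Batch Normalization and Dropout as modules in order to optimize deep networks more efficiently.

Because this part involves a lot of programming, we break it into steps and complete them one at a time:

1. 2-layer fully-connected network

In this part we need to complete the following programming tasks (you also need to read and understand solver.py):
--> the TwoLayerNet class in fc_net.py
--> the first four functions in layers.py
--> optim.py

The code is as follows:
---> fc_net.py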

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

import numpy as np
from layer_utils import *

class TwoLayerNet(object):   
    """    
    A two-layer fully-connected neural network with ReLU nonlinearity and    
    softmax loss that uses a modular layer design. We assume an input dimension    
    of D, a hidden dimension of H, and perform classification over C classes.    

    The architecture should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it    
    will interact with a separate Solver object that is responsible for running    
    optimization.    

    The learnable parameters of the model are stored in the dictionary    
    self.params that maps parameter names to numpy arrays.   
    """
    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,           
                              weight_scale=1e-3, reg=0.0):    
        """    
        Initialize a new network.   
        Inputs:    
        - input_dim: An integer giving the size of the input    
        - hidden_dim: An integer giving the size of the hidden layer    
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random 
                        initialization of the weights.    
        - reg: Scalar giving L2 regularization strength.    
        """    
        self.params = {}    
        self.reg = reg   
        self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)     
        self.params['b1'] = np.zeros((1, hidden_dim))    
        self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)  
        self.params['b2'] = np.zeros((1, num_classes))

    def loss(self, X, y=None):    
        """   
        Compute loss and gradient for a minibatch of data.    
        Inputs:    
        - X: Array of input data of shape (N, d_1, ..., d_k)    
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].  
        Returns:   
        If y is None, then run a test-time forward pass of the model and return:    
        - scores: Array of shape (N, C) giving classification scores, where              
                  scores[i, c] is the classification score for X[i] and class c. 
        If y is not None, then run a training-time forward and backward pass and    
        return a tuple of:    
        - loss: Scalar value giving the loss   
        - grads: Dictionary with the same keys as self.params, mapping parameter             
                 names to gradients of the loss with respect to those parameters.    
        """
        scores = None
        N = X.shape[0]
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        h1, cache1 = affine_relu_forward(X, W1, b1)
        out, cache2 = affine_forward(h1, W2, b2)
        scores = out              # (N,C)
        # If y is None then we are in test mode so just return scores
        if y is None:   
            return scores

        loss, grads = 0, {}
        data_loss, dscores = softmax_loss(scores, y)
        reg_loss = 0.5 * self.reg * np.sum(W1*W1) + 0.5 * self.reg * np.sum(W2*W2)
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dh1, dW2, db2 = affine_backward(dscores, cache2)
        dX, dW1, db1 = affine_relu_backward(dh1, cache1)
        # Add the regularization gradient contribution
        dW2 += self.reg * W2
        dW1 += self.reg * W1
        grads['W1'] = dW1
        grads['b1'] = db1
        grads['W2'] = dW2
        grads['b2'] = db2

        return loss, grads

---> layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 

import numpy as np

def affine_forward(x, w, b):   
    """    
    Computes the forward pass for an affine (fully-connected) layer. 
    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N   
    examples, where each example x[i] has shape (d_1, ..., d_k). We will    
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and    
    then transform it to an output vector of dimension M.    
    Inputs:    
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)    
    - w: A numpy array of weights, of shape (D, M)    
    - b: A numpy array of biases, of shape (M,)   
    Returns a tuple of:    
    - out: output, of shape (N, M)    
    - cache: (x, w, b)   
    """
    out = None
    # Reshape x into rows
    N = x.shape[0]
    x_row = x.reshape(N, -1)         # (N,D)
    out = np.dot(x_row, w) + b       # (N,M)
    cache = (x, w, b)
    
    return out, cache

def affine_backward(dout, cache):   
    """    
    Computes the backward pass for an affine layer.    
    Inputs:    
    - dout: Upstream derivative, of shape (N, M)    
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, as passed to affine_forward
    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d_1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (1, M)
    """    
    x, w, b = cache    
    dx, dw, db = None, None, None   
    dx = np.dot(dout, w.T)                       # (N,D)    
    dx = np.reshape(dx, x.shape)                 # (N,d1,...,d_k)   
    x_row = x.reshape(x.shape[0], -1)            # (N,D)    
    dw = np.dot(x_row.T, dout)                   # (D,M)    
    db = np.sum(dout, axis=0, keepdims=True)     # (1,M)    

    return dx, dw, db

def relu_forward(x):   
    """    
    Computes the forward pass for a layer of rectified linear units (ReLUs).    
    Input:    
    - x: Inputs, of any shape    
    Returns a tuple of:    
    - out: Output, of the same shape as x    
    - cache: x    
    """   
    out = None    
    out = ReLU(x)    
    cache = x    

    return out, cache

def relu_backward(dout, cache):   
    """  
    Computes the backward pass for a layer of rectified linear units (ReLUs).   
    Input:    
    - dout: Upstream derivatives, of any shape    
    - cache: Input x, of same shape as dout    
    Returns:    
    - dx: Gradient with respect to x    
    """    
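    x = cache
    # Build a new array instead of zeroing dout in place, so the caller's
    # upstream gradient is left untouched.
    dx = np.where(x > 0, dout, 0)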

    return dx

def svm_loss(x, y):   
    """    
    Computes the loss and gradient for multiclass SVM classification.
    Inputs:    
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class         
         for the ith input.    
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and         
         0 <= y[i] < C   
    Returns a tuple of:    
    - loss: Scalar giving the loss   
    - dx: Gradient of the loss with respect to x    
    """    
    N = x.shape[0]   
    correct_class_scores = x[np.arange(N), y]    
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)    
    margins[np.arange(N), y] = 0   
    loss = np.sum(margins) / N   
    num_pos = np.sum(margins > 0, axis=1)    
    dx = np.zeros_like(x)   
    dx[margins > 0] = 1    
    dx[np.arange(N), y] -= num_pos    
    dx /= N    

    return loss, dx

def softmax_loss(x, y):    
    """    
    Computes the loss and gradient for softmax classification.
    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
         for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and         
         0 <= y[i] < C   
    Returns a tuple of:    
    - loss: Scalar giving the loss    
    - dx: Gradient of the loss with respect to x   
    """    
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))    
    probs /= np.sum(probs, axis=1, keepdims=True)    
    N = x.shape[0]   
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N    
    dx = probs.copy()    
    dx[np.arange(N), y] -= 1    
    dx /= N    

    return loss, dx

def ReLU(x):    
    """ReLU non-linearity."""    
    return np.maximum(0, x)
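Two quick sanity checks are useful before plugging these layers into a network. The sketch below repeats softmax_loss from the listing above so that it runs on its own, and adds a centered-difference helper, numerical_gradient, which is a hypothetical name and not part of the assignment code. It checks that all-zero scores give a loss of log(C), and that the analytic gradient agrees with a numerical one.

```python
import numpy as np

def softmax_loss(x, y):
    """Same computation as the softmax_loss listed above."""
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient of a scalar function f at x (hypothetical helper)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old  # restore the entry before moving on
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.RandomState(0)
N, C = 4, 10
y = rng.randint(C, size=N)

# With all-zero scores every class has probability 1/C, so the loss is log(C).
loss_zero, _ = softmax_loss(np.zeros((N, C)), y)

# Compare the analytic gradient with the numerical one on random scores.
x = rng.randn(N, C)
_, dx = softmax_loss(x, y)
dx_num = numerical_gradient(lambda s: softmax_loss(s, y)[0], x)
max_abs_err = np.max(np.abs(dx - dx_num))
```

The same helper can be pointed at any other layer by wrapping its forward pass in a function that returns a scalar.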

---> optim.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 

import numpy as np

def sgd(w, dw, config=None):    
    """    
    Performs vanilla stochastic gradient descent.    
    config format:    
    - learning_rate: Scalar learning rate.    
    """    
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw

    return w, config

def sgd_momentum(w, dw, config=None):    
    """    
    Performs stochastic gradient descent with momentum.    
    config format:    
    - learning_rate: Scalar learning rate.    
    - momentum: Scalar between 0 and 1 giving the momentum value.                
    Setting momentum = 0 reduces to sgd.    
    - velocity: A numpy array of the same shape as w and dw used to store a moving    
    average of the gradients.   
    """   
    if config is None: config = {}    
    config.setdefault('learning_rate', 1e-2)   
    config.setdefault('momentum', 0.9)    
    v = config.get('velocity', np.zeros_like(w))    
    next_w = None    
    v = config['momentum'] * v - config['learning_rate'] * dw    
    next_w = w + v    
    config['velocity'] = v    

    return next_w, config

def rmsprop(x, dx, config=None):    
    """    
    Uses the RMSProp update rule, which uses a moving average of squared gradient    
    values to set adaptive per-parameter learning rates.    
    config format:    
    - learning_rate: Scalar learning rate.    
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared                  
    gradient cache.    
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.    
    - cache: Moving average of second moments of gradients.   
    """    
    if config is None: config = {}    
    config.setdefault('learning_rate', 1e-2)  
    config.setdefault('decay_rate', 0.99)    
    config.setdefault('epsilon', 1e-8)    
    config.setdefault('cache', np.zeros_like(x))    
    next_x = None    
    cache = config['cache']    
    decay_rate = config['decay_rate']    
    learning_rate = config['learning_rate']    
    epsilon = config['epsilon']    
    cache = decay_rate * cache + (1 - decay_rate) * (dx**2)    
    x += - learning_rate * dx / (np.sqrt(cache) + epsilon)  
    config['cache'] = cache    
    next_x = x    

    return next_x, config

def adam(x, dx, config=None):    
    """    
    Uses the Adam update rule, which incorporates moving averages of both the  
    gradient and its square and a bias correction term.    
    config format:    
    - learning_rate: Scalar learning rate.    
    - beta1: Decay rate for moving average of first moment of gradient.    
    - beta2: Decay rate for moving average of second moment of gradient.   
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.    
    - m: Moving average of gradient.    
    - v: Moving average of squared gradient.    
    - t: Iteration number.   
    """    
    if config is None: config = {}    
    config.setdefault('learning_rate', 1e-3)    
    config.setdefault('beta1', 0.9)    
    config.setdefault('beta2', 0.999)    
    config.setdefault('epsilon', 1e-8)    
    config.setdefault('m', np.zeros_like(x))    
    config.setdefault('v', np.zeros_like(x))    
    config.setdefault('t', 0)   
    next_x = None    
    m = config['m']    
    v = config['v']    
    beta1 = config['beta1']    
    beta2 = config['beta2']    
    learning_rate = config['learning_rate']    
    epsilon = config['epsilon']   
    t = config['t']    
    t += 1    
    m = beta1 * m + (1 - beta1) * dx    
    v = beta2 * v + (1 - beta2) * (dx**2)    
    m_bias = m / (1 - beta1**t)    
    v_bias = v / (1 - beta2**t)    
    x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)    
    next_x = x    
    config['m'] = m    
    config['v'] = v    
    config['t'] = t    

    return next_x, config
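To see what the bias correction in adam does, the sketch below writes one update as a pure function (adam_step is a hypothetical name, not part of optim.py) and applies it starting from m = v = 0. On the very first step the corrected moments equal dx and dx**2, so every parameter moves by roughly learning_rate in the direction opposite to its gradient, whatever the gradient's magnitude.

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update written functionally; returns the new x, m, v and t."""
    t += 1
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x, m, v, t

x = np.array([1.0, -2.0, 3.0])
dx = np.array([100.0, -0.01, 5.0])   # gradients of very different magnitudes
x_new, m, v, t = adam_step(x, dx, np.zeros_like(x), np.zeros_like(x), 0)
step = x_new - x                     # approximately -1e-3 * sign(dx)
```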

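Once the code is written, we can use the code in FullyConnectedNets.ipynb to check our implementation for errors. After it passes, we can train on CIFAR-10 and compare against the 2-layer network from Assignment 1; the results should be about the same.

Here are the code I ran on CIFAR-10 and the resulting plots:
---> two_layer_fc_net_start.py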

__coauthor__ = 'Deeplayer'
# 6.22.2016

import matplotlib.pyplot as plt
from fc_net import *
from data_utils import get_CIFAR10_data
from solver import Solver

data = get_CIFAR10_data()
model = TwoLayerNet(reg=0.9)
solver = Solver(model, data,                
                lr_decay=0.95,                
                print_every=100, num_epochs=40, batch_size=400, 
                update_rule='sgd_momentum',                
                optim_config={'learning_rate': 5e-4, 'momentum': 0.5})

solver.train()                 

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()


best_model = model
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print 'Validation set accuracy: ', (y_val_pred == data['y_val']).mean()
print 'Test set accuracy: ', (y_test_pred == data['y_test']).mean()
# Validation set accuracy:  about 52.9%
# Test set accuracy:  about 54.7%


# Visualize the weights of the best network
from vis_utils import visualize_grid

def show_net_weights(net):    
    W1 = net.params['W1']    
    W1 = W1.reshape(3, 32, 32, -1).transpose(3, 1, 2, 0)    
    plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))   
    plt.gca().axis('off')
    plt.show()

show_net_weights(best_model)

Figure_1.png

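2. Multilayer fully-connected network + Batch Normalization

In this part we need to complete the following programming tasks:
--> the FullyConnectedNet class in fc_net.py
--> the batchnorm_forward and batchnorm_backward functions in layers.py

The code is as follows:
---> fc_net.py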

__coauthor__ = 'Deeplayer'
# 6.22.2016

import numpy as np
from layer_utils import *

class FullyConnectedNet(object):    
    """    
    A fully-connected neural network with an arbitrary number of hidden layers,    
    ReLU nonlinearities, and a softmax loss function. This will also implement    
    dropout and batch normalization as options. For a network with L layers,    
    the architecture will be    
    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax    
    where batch normalization and dropout are optional, and the {...} block is    
    repeated L - 1 times.   
    Similar to the TwoLayerNet above, learnable parameters are stored in the    
    self.params dictionary and will be learned using the Solver class. 
    """
    def __init__(self, hidden_dims, input_dim=3*32*32, 
                 num_classes=10,           
                 dropout=0, use_batchnorm=False, reg=0.0,      
                 weight_scale=1e-2, dtype=np.float32, seed=None):

        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        layers_dims = [input_dim] + hidden_dims + [num_classes]
        for i in xrange(self.num_layers):    
            self.params['W' + str(i+1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i+1])    
            self.params['b' + str(i+1)] = np.zeros((1, layers_dims[i+1]))    
            if self.use_batchnorm and i < len(hidden_dims): 
                self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))        
                self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:    
            self.dropout_param = {'mode': 'train', 'p': dropout}    
            if seed is not None:        
                self.dropout_param['seed'] = seed
        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:    
            self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.iteritems():    
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):    
        """    
        Compute loss and gradient for the fully-connected net.    
        Input / output: Same as TwoLayerNet above.    
        """    
        X = X.astype(self.dtype)    
        mode = 'test' if y is None else 'train'    
        # Set train/test mode for batchnorm params and dropout param since they    
        # behave differently during training and testing.    
        if self.dropout_param is not None: 
            self.dropout_param['mode'] = mode    
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None    
        h, cache1, cache2, cache3, bn, out = {}, {}, {}, {}, {}, {}    
        out[0] = X

        # Forward pass: compute loss
        for i in xrange(self.num_layers-1):    
            # Unpack variables from the params dictionary    
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:        
                gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]        
                h[i], cache1[i] = affine_forward(out[i], W, b)        
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])        
                out[i+1], cache3[i] = relu_forward(bn[i])    
            else:        
                out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)

        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers-1], W, b)

        # If test mode return early
        if mode == 'test':   
            return scores

        loss, reg_loss, grads = 0.0, 0.0, {}
        data_loss, dscores = softmax_loss(scores, y)
        for i in xrange(self.num_layers):    
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh = {}, {}, {}
        t = self.num_layers-1
        dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
        for i in xrange(t):    
            if self.use_batchnorm:        
                dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i]) 
                dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])       
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])    
            else:        
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])

        # Add the regularization gradient contribution
        for i in xrange(self.num_layers):    
            grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]

        return loss, grads

Before giving the code for batchnorm_forward and batchnorm_backward, here are the Batch Normalization algorithm and its backpropagation formulas:

Batch Normalization, algorithm1.png

Backpropagate the gradient of loss ℓ .png
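For reference, the formulas are also written out here, assuming the two figures above reproduce Algorithm 1 and the backpropagation equations of the Batch Normalization paper (Ioffe and Szegedy, 2015). Here m is the mini-batch size (N in the code), and each equation is applied independently to every feature dimension.

```latex
% Forward pass over a mini-batch {x_1, ..., x_m}
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
\end{aligned}

% Backward pass for a loss \ell
\begin{aligned}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i}\,\gamma \\
\frac{\partial \ell}{\partial \sigma_B^2} &= -\frac{1}{2}\,(\sigma_B^2 + \epsilon)^{-3/2}
  \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_B) \\
\frac{\partial \ell}{\partial \mu_B} &= -\frac{1}{\sqrt{\sigma_B^2 + \epsilon}}
  \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}
  \;-\; \frac{2}{m}\,\frac{\partial \ell}{\partial \sigma_B^2}\sum_{i=1}^{m} (x_i - \mu_B) \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma_B^2 + \epsilon}}
  + \frac{\partial \ell}{\partial \sigma_B^2}\,\frac{2\,(x_i - \mu_B)}{m}
  + \frac{1}{m}\,\frac{\partial \ell}{\partial \mu_B} \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
\end{aligned}
```

In the code that follows, sample_mean and sample_var play the roles of \mu_B and \sigma_B^2, x_normalized is \hat{x}, and dsample_var, dsample_mean and dx correspond to the three middle gradients in order.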

---> layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 

import numpy as np

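def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization on x of shape (N, D).
    In 'train' mode the batch statistics are used and the running averages in
    bn_param are updated; in 'test' mode the running averages are used instead.
    Returns (out, cache); cache is None in 'test' mode.
    """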
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':    
        sample_mean = np.mean(x, axis=0, keepdims=True)       # [1,D]    
        sample_var = np.var(x, axis=0, keepdims=True)         # [1,D] 
        x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)    # [N,D]    
        out = gamma * x_normalized + beta    
        cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)    
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean    
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':    
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)    
        out = gamma * x_normalized + beta
    else:    
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache

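def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization (training mode).
    Returns the gradients (dx, dgamma, dbeta) given the upstream gradient dout
    and the cache produced by batchnorm_forward.
    """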
    dx, dgamma, dbeta = None, None, None
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N, D = x.shape
    dx_normalized = dout * gamma       # [N,D]
    x_mu = x - sample_mean             # [N,D]
    sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]
    dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
    dsample_mean = (-1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True)
                    - 2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True))
    dx1 = dx_normalized * sample_std_inv
    dx2 = 2.0/N * dsample_var * x_mu
    dx = dx1 + dx2 + 1.0/N * dsample_mean
    dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)

    return dx, dgamma, dbeta
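The backward pass is easy to get subtly wrong, so a numerical check is worth running. The sketch below is self-contained: bn_forward_train and bn_backward are trimmed copies of the two functions above (training mode only, no running averages), and numerical_gradient_array is a hypothetical centered-difference helper written for this example.

```python
import numpy as np

def bn_forward_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm forward pass (running averages omitted)."""
    mu = np.mean(x, axis=0, keepdims=True)
    var = np.var(x, axis=0, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta
    return out, (x_hat, gamma, mu, var, x, eps)

def bn_backward(dout, cache):
    """Backward pass following the same steps as batchnorm_backward above."""
    x_hat, gamma, mu, var, x, eps = cache
    N = x.shape[0]
    dx_hat = dout * gamma
    x_mu = x - mu
    std_inv = 1.0 / np.sqrt(var + eps)
    dvar = -0.5 * np.sum(dx_hat * x_mu, axis=0, keepdims=True) * std_inv ** 3
    dmu = (-np.sum(dx_hat * std_inv, axis=0, keepdims=True)
           - 2.0 * dvar * np.mean(x_mu, axis=0, keepdims=True))
    dx = dx_hat * std_inv + 2.0 / N * dvar * x_mu + dmu / N
    dgamma = np.sum(dout * x_hat, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)
    return dx, dgamma, dbeta

def numerical_gradient_array(f, x, df, h=1e-5):
    """Centered-difference gradient of sum(f(x) * df) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x).copy()
        x[idx] = old - h
        neg = f(x).copy()
        x[idx] = old  # restore the entry before moving on
        grad[idx] = np.sum((pos - neg) * df) / (2 * h)
    return grad

rng = np.random.RandomState(1)
N, D = 4, 5
x = 5 * rng.randn(N, D) + 12
gamma = rng.randn(1, D)
beta = rng.randn(1, D)
dout = rng.randn(N, D)

out, cache = bn_forward_train(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)

dx_num = numerical_gradient_array(lambda a: bn_forward_train(a, gamma, beta)[0], x, dout)
dgamma_num = numerical_gradient_array(lambda g: bn_forward_train(x, g, beta)[0], gamma, dout)
dbeta_num = numerical_gradient_array(lambda b: bn_forward_train(x, gamma, b)[0], beta, dout)
```

If the analytic and numerical gradients agree to within about 1e-6, the backward pass is very likely correct.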

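After finishing the code, we can use BatchNormalization.ipynb to check it for errors. Below I report the performance of a 6-layer network with Batch Normalization on CIFAR-10. As one would expect, the 6-layer network should not do much better than the 2-layer network (because of problem 1, which I mentioned at the end of Assignment 1).

Before that, let's look at how well Batch Normalization alleviates vanishing gradients, and at how it behaves under different weight_scales. As a test, we take 6-layer networks with sigmoid and with ReLU as the activation function: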

---> batchnorm_and_weight_scales.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

from fc_net import *
from solver import *
import matplotlib.pyplot as plt
from data_utils import get_CIFAR10_data

# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()

hidden_dims = [100, 100, 100, 100, 100]
num_train = 5000
small_data = {  
       'X_train': data['X_train'][:num_train],  
       'y_train': data['y_train'][:num_train],  
       'X_val': data['X_val'],  
       'y_val': data['y_val'],
}
bn_solvers = {}
solvers = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):    
    print 'Running weight scale %d / %d' % (i + 1, len(weight_scales)) 
    bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)    
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)    

    bn_solver = Solver(bn_model, small_data,        
                       num_epochs=10, batch_size=100,           
                       update_rule='adam',                  
                       optim_config={'learning_rate': 1e-3, },                  
                       verbose=False, print_every=1000)    
    bn_solver.train()    
    bn_solvers[weight_scale] = bn_solver    

    solver = Solver(model, small_data,                  
                    num_epochs=10, batch_size=100,      
                    update_rule='adam',                 
                    optim_config={'learning_rate': 1e-3, },  
                    verbose=False, print_every=1000)    
    solver.train()    
    solvers[weight_scale] = solver

# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []

for ws in weight_scales: 
    best_train_accs.append(max(solvers[ws].train_acc_history))
    bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))  

    best_val_accs.append(max(solvers[ws].val_acc_history))  
    bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))  

    final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))  
    bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))

plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')

plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend(loc='upper left')

plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend(loc='upper left')

plt.gcf().set_size_inches(10, 15)
plt.show()

Activation Function: Sigmoid.png

Activation Function: ReLU.png

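The plots above show that:

1) Batch Normalization largely resolves the saturation (vanishing-gradient) problem of sigmoid that troubled the research community for more than a decade. Bravo! If the results above do not seem direct enough, here are the weight gradients of each layer:

Left: without Batch Normalization --- Right: with Batch Normalization

2) Even without vanishing gradients, sigmoid still does not perform as well as ReLU.
3) With a well-chosen weight_scale and ReLU as the activation function, Batch Normalization improves accuracy only slightly.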
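A rough back-of-the-envelope sketch of why sigmoid suffers here: during backpropagation each sigmoid layer multiplies the gradient by its local derivative, which is at most 0.25. The example below looks only at these activation factors and ignores the weight matrices, which is a simplifying assumption; the pre-activation values 0 and 5 are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Local derivative of the sigmoid, s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

num_hidden_layers = 5  # matches hidden_dims = [100, 100, 100, 100, 100]

# Best case: every pre-activation sits at 0, where the derivative peaks at 0.25.
best_case = sigmoid_grad(0.0) ** num_hidden_layers

# Saturated case: every pre-activation sits at 5, far into the flat region.
saturated = sigmoid_grad(5.0) ** num_hidden_layers
```

Even in the best case the activation factors alone shrink the gradient to 0.25**5, roughly 1e-3, and saturated units shrink it by many more orders of magnitude. Normalizing the pre-activations keeps them closer to the region where the derivative is largest, which is consistent with the healthier per-layer gradients shown on the right.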

Now here are the results of the 6-layer network on CIFAR-10 (with ReLU as the activation function):
· Validation set accuracy: 0.554
· Test set accuracy: 0.54

3. Dropout

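In this part we need to complete the following programming tasks:
--> modify fc_net.py to add dropout
--> the dropout_forward and dropout_backward functions in layers.py

Dropout is a regularization technique that is used very often when training (deep) neural networks in practice, and it is effective at suppressing overfitting. During training, each neuron is kept active with probability p and is otherwise set to zero. The figure below illustrates dropout on a 3-layer network: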

CS231n Convolutional Neural Networks for Visual Recognition.png

The code is as follows:

For fc_net.py we only need to modify the loss function:

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

    def loss(self, X, y=None):    
        """    
        Compute loss and gradient for the fully-connected net.    
        Input / output: Same as TwoLayerNet above.    
        """    
        X = X.astype(self.dtype)    
        mode = 'test' if y is None else 'train'    
        # Set train/test mode for batchnorm params and dropout param since they    
        # behave differently during training and testing.    
        if self.dropout_param is not None: 
            self.dropout_param['mode'] = mode    
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None    
        h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}    
        out[0] = X

        # Forward pass: compute loss
        for i in xrange(self.num_layers-1):    
            # Unpack variables from the params dictionary    
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:        
                gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]        
                h[i], cache1[i] = affine_forward(out[i], W, b)        
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])        
                out[i+1], cache3[i] = relu_forward(bn[i])
                if self.use_dropout:    
                    out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param) 
            else:        
                out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
                if self.use_dropout:    
                    out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers-1], W, b)

        # If test mode return early
        if mode == 'test':   
            return scores

        loss, reg_loss, grads = 0.0, 0.0, {}
        data_loss, dscores = softmax_loss(scores, y)
        for i in xrange(self.num_layers):    
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh, ddrop = {}, {}, {}, {}
        t = self.num_layers-1
        dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
        for i in xrange(t):    
            if self.use_batchnorm:
                if self.use_dropout:    
                    ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])    
                    dout[t-i] = ddrop[t-1-i]     
                dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i]) 
                dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])       
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])    
            else:
                if self.use_dropout:    
                    ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])    
                    dout[t-i] = ddrop[t-1-i]
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])

        # Add the regularization gradient contribution
        for i in xrange(self.num_layers):    
            grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]

        return loss, grads

---> dropout_forward and dropout_backward in layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

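def dropout_forward(x, dropout_param):
    # Inverted dropout. In this implementation p is the probability of
    # KEEPING a unit; kept units are scaled by 1/p at training time.
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None
    if mode == 'train':
        # Rescaling by 1/p keeps the expected activation unchanged,
        # so test time needs no extra scaling
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        # Identity at test time
        out = x

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache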


def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']
    dx = None

    if mode == 'train':    
        dx = dout * mask
    elif mode == 'test':    
        dx = dout

    return dx
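
As a quick sanity check of the inverted-dropout idea, the sketch below verifies that scaling the kept units by 1/keep_prob leaves the mean activation roughly unchanged. It is a standalone illustration, not part of the assignment code: the helper name `inverted_dropout` and the argument name `keep_prob` are made up here, and `keep_prob` plays the role that `p` plays in `dropout_forward`.

```python
import numpy as np


def inverted_dropout(x, keep_prob, rng):
    """Training-time inverted dropout: keep with keep_prob, scale by 1/keep_prob."""
    mask = (rng.rand(*x.shape) < keep_prob) / keep_prob
    return x * mask, mask


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    x = np.ones((1000, 100))
    for keep_prob in (0.3, 0.5, 0.8):
        out, mask = inverted_dropout(x, keep_prob, rng)
        kept = (out != 0).mean()
        # The mean stays close to 1.0 while the kept fraction tracks keep_prob
        print(keep_prob, round(out.mean(), 2), round(kept, 2))
```

Because the rescaling happens at training time, the test-time branch can simply return its input, which is what the `mode == 'test'` case above does.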

Once the code is written, use the checks in Dropout.ipynb to verify it. The last part of Dropout.ipynb also compares training with and without dropout:

Dropout vs Overfitting.png

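Part 2: Convolutional Neural Networks (CNNs)

We now turn to the core of this course: convolutional neural networks. For visual recognition tasks, CNNs are clearly the strongest performers. What do they offer over the fully-connected networks discussed earlier? I would list the following points:

1) Weight sharing and local (receptive field) connectivity make CNNs closer to biological neural networks: neurons in the visual cortex receive information locally, i.e. each responds only to stimuli within a particular receptive field.
2) When images are fairly large (e.g. 96x96, 224x224, 384x384, 512x512), a fully-connected network has an enormous number of parameters (weights and biases) to train. This makes computation very slow and also makes overfitting worse. Weight sharing and local connectivity cut the number of trainable parameters in CNNs drastically, often by several orders of magnitude.
3) CNNs are powerful feature extractors (from edges to parts to whole objects), whereas a fully-connected network has essentially no built-in feature extraction ability.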
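
To make the parameter argument in point 2 concrete, here is a minimal sketch that counts the parameters of one fully-connected layer and one convolutional layer on a 224x224x3 image. The layer sizes (100 hidden units, 64 filters of size 3x3) are assumptions picked only for illustration, and the function names are hypothetical.

```python
def fc_params(height, width, channels, hidden_units):
    """Weights plus biases of one fully-connected layer on a flattened image."""
    inputs = height * width * channels
    return inputs * hidden_units + hidden_units


def conv_params(num_filters, channels, k):
    """Weights plus biases of one conv layer with num_filters k x k filters."""
    return num_filters * channels * k * k + num_filters


if __name__ == "__main__":
    fc = fc_params(224, 224, 3, hidden_units=100)
    conv = conv_params(num_filters=64, channels=3, k=3)
    print("fully-connected:", fc)     # 15052900
    print("convolutional:  ", conv)   # 1792
```

The convolutional count does not depend on the image size at all, which is the practical consequence of weight sharing.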

Before discussing the structure of CNNs in detail, here is a picture that gives a feel for the overall architecture:

CS231n Convolutional Neural Networks for Visual Recognition.png

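1. Convolutional Layer

The convolutional layer, which can also be called the feature extraction layer, is the most important part of a CNN. Its trainable parameters are a set of filters (I prefer the term convolution kernels), all of the same size and usually square. Suppose we have n kernels, each of size kxk (k is usually 3 or 5). Here n counts the 2-D kernels channel by channel, so n/c is the number of complete filters, where c is the number of channels (c=1 for a grayscale image, c=3 for a color image). The layer then has nxkxk+n/c trainable parameters: nxkxk weights plus one bias per complete filter. Weight sharing means that one filter extracts only one kind of feature: as the filter convolves (slides) over the image, it looks for the same feature everywhere in the image. So n/c complete filters extract n/c different features. Here is an animation of the convolution process. It contains 6 kernels, but they form only 2 filters (because there are three channels), so two features are extracted: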

CS231n Convolutional Neural Networks for Visual Recognition.gif

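In the animation, the image is surrounded by a ring of zeros, and the filter moves with a stride of 2. Adding the zeros is called zero-padding. Writing p for the number of rings of zeros and s for the stride, the side length of the output convolved feature (also called an activation map) is L=(input_dim-k+2p)/s+1, and the output volume has size LxLxn/c. Zero-padding is used so that the filter slides exactly from one edge to the other, i.e. so that the division in the formula above leaves no remainder. p, s and n are three hyperparameters that must be chosen in advance. As for the stride: the smaller s is, the richer the extracted information but the higher the computational cost; the larger s is, the lower the cost but the less information is extracted. The usual choice is s=1.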
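
The output-size formula L=(input_dim-k+2p)/s+1 is easy to get wrong by one, so a small helper can be handy. The sketch below is a hypothetical utility (the name `conv_output_size` is not from the assignment) that also rejects settings where the filter does not tile the padded input evenly.

```python
def conv_output_size(input_dim, k, p, s):
    """Side length L = (input_dim - k + 2p) / s + 1 of the activation map."""
    span = input_dim - k + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // s + 1


if __name__ == "__main__":
    print(conv_output_size(7, k=3, p=1, s=2))    # 4
    print(conv_output_size(32, k=7, p=3, s=1))   # 32
    try:
        conv_output_size(7, k=3, p=0, s=3)
    except ValueError as err:
        print("invalid setting:", err)
```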

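---> PS: Why does convolution work?
Natural images have the inherent property of being stationary: the statistics of one part of the image are the same as those of any other part. This means that features learned on one part can also be applied to other parts, so we can use the same learned features at every location in the image. (Quoted from UFLDL)

2. Pooling Layer

A pooling layer follows the convolutional layer; note, however, that the output of the convolutional layer first passes through an activation function (such as ReLU) before entering the pooling layer. The pooling layer further reduces the spatial size of the convolutional output, which cuts the amount of computation and the number of parameters in the layers that follow. Concretely, the output is split into non-overlapping sub-regions, and for each sub-region we take the maximum, the average, or the L2 norm. We use max pooling as the example (it tends to work better, so it is the usual choice); here is a diagram: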

CS231n Convolutional Neural Networks for Visual Recognition.png

The pooling window is usually 2x2.
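
A minimal sketch of 2x2 max pooling on a small 4x4 grid, written in plain Python so that each window is easy to follow. The input values are made up for illustration, and the helper name `max_pool_2d` is hypothetical.

```python
def max_pool_2d(grid, size=2):
    """Max pooling over non-overlapping size x size windows of a 2-D list."""
    rows, cols = len(grid), len(grid[0])
    if rows % size or cols % size:
        raise ValueError("input must split evenly into windows")
    pooled = []
    for r in range(0, rows, size):
        row = []
        for c in range(0, cols, size):
            window = [grid[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    x = [[1, 1, 2, 4],
         [5, 6, 7, 8],
         [3, 2, 1, 0],
         [1, 2, 3, 4]]
    print(max_pool_2d(x))   # [[6, 8], [3, 4]]
```

Each side of the input is halved, so a 2x2 window discards 75% of the activations.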

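Some argue that pooling layers are not necessary, e.g. Striving for Simplicity: The All Convolutional Net. It has also been found that removing pooling layers matters for generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). Pooling layers may well become rarer, or disappear, in future architectures.

3. Fully-connected Layer

Many current CNN models use full connectivity in the last few layers (typically 1 to 3) to combine the extracted features into the final prediction. Note that the last fully-connected layer is the output layer; every fully-connected layer except the last includes an activation function.

4. CNN Architectures

The usual structure of a CNN can be written as follows: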

INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC(OUTPUT)

Here "?" means the pooling layer is optional, and N (usually 0 to 3), K (usually 0 to 2) and M (M>=0) are the numbers of repetitions.
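
The pattern above can be expanded mechanically. The sketch below is a hypothetical helper (`cnn_pattern` is not part of the assignment) that turns N, M and K into a flat list of layer names, which makes it easy to see which concrete architectures the pattern covers.

```python
def cnn_pattern(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*n -> POOL?]*m -> [FC -> RELU]*k -> FC."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC")
    return layers


if __name__ == "__main__":
    print(" --> ".join(cnn_pattern(n=1, m=2, k=1)))
    # INPUT --> CONV --> RELU --> POOL --> CONV --> RELU --> POOL --> FC --> RELU --> FC
    print(" --> ".join(cnn_pattern(n=0, m=0, k=0)))
    # INPUT --> FC
```

With n=0, m=0 and k=0 the pattern collapses to a single FC layer on the raw input, i.e. a linear classifier.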

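Note that we prefer a stack of several convolutional layers with small filters over a single convolutional layer with a large filter.
As an example, compare three 3x3 convolutional layers with one 7x7 convolutional layer. As the figure below shows, both produce an activation map of the same size, but the three 3x3 layers are clearly better:
1) A nonlinear composition of 3 layers yields more expressive features than a single linear layer;
2) The 3 small layers have fewer parameters: 3x(3x3)=27 < 7x7=49 for each pair of input and output channels;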
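3) The 3 small layers also need less computation, for the same reason. One caveat: backpropagation has to keep the intermediate results of every layer, so the stack may need more memory for them, not less.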

3_3x3 VS 1_7x7.png
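
The comparison between three 3x3 layers and one 7x7 layer can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions and the same number of channels in and out of every layer; the channel count of 64 and the function names are assumptions chosen only for illustration.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 conv layers stacked on top of each other."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


def stacked_weights(kernel_sizes, channels):
    """Weight count when every layer maps `channels` maps to `channels` maps."""
    return sum(k * k * channels * channels for k in kernel_sizes)


if __name__ == "__main__":
    c = 64  # hypothetical channel count, chosen only for illustration
    print(stacked_receptive_field([3, 3, 3]), stacked_receptive_field([7]))  # 7 7
    print(stacked_weights([3, 3, 3], c))   # 110592
    print(stacked_weights([7], c))         # 200704
```

Both options see the same 7x7 region of the input, but the stack reaches it with roughly 55% of the weights and with two extra nonlinearities in between.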

Below is a diagram of the simplest CNN structure (input + 1 conv + 1 pool + 2 fc):

A simple CNNs architecture.png

Here are a few common types of CNN architecture:

· INPUT --> FC/OUT      This is really just a linear classifier
· INPUT --> CONV --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> POOL]*2 --> FC --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> CONV --> RELU --> POOL]*3 --> [FC --> RELU]*2 --> FC/OUT

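---> PS:
1. For the input (image) layer, images are usually resized to squares whose side length can be divided by 2 many times. For example, CIFAR-10 is 32x32x3, STL-10 is 96x96x3, and ImageNet models typically use 224x224x3 or 512x512x3.

2. In practice, we have to estimate the memory footprint and choose the settings accordingly. For example, with 224x224x3 input images and 64 filters of size 3x3 with zero-padding of 1, each image needs about 72MB of memory (this figure covers the image together with the corresponding parameters, gradients and activations). On a GPU this may be too much (GPUs have far less memory than the CPU's main memory), so the settings need adjusting, e.g. a 7x7 filter with stride 2 (ZF Net), or an 11x11 filter with stride 4 (AlexNet).

3. The biggest bottleneck in building a practical deep CNN is GPU memory. At the time of writing many GPUs have only 3/4/6GB of memory, and the largest single card (NVIDIA) has just 12GB, so when designing a CNN we should keep in mind where the memory goes:

· A large number of activations and intermediate gradients;
· The parameters, their gradients during backpropagation, and the caches used by momentum, Adagrad or RMSProp all take up storage, so the memory estimated for the parameters should be multiplied by at least 3;
· The data batch, together with similar bookkeeping such as labels or source information, also consumes some memory.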
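
A back-of-the-envelope memory estimate along these lines can be scripted. The sketch below assumes float32 storage (4 bytes per value), counts each activation twice (value plus gradient) and each parameter three times (weight, gradient, update cache); the layer sizes are illustrative assumptions and the function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def activation_mib(height, width, depth, copies=2):
    """Memory for one activation volume; copies=2 counts values and gradients."""
    return height * width * depth * BYTES_PER_FLOAT32 * copies / float(2 ** 20)


def parameter_mib(num_params, copies=3):
    """Memory for parameters; copies=3 counts weights, gradients, update cache."""
    return num_params * BYTES_PER_FLOAT32 * copies / float(2 ** 20)


if __name__ == "__main__":
    per_layer = activation_mib(224, 224, 64)
    print(per_layer)        # 24.5
    print(3 * per_layer)    # 73.5 for three such layers
    print(parameter_mib(64 * 3 * 3 * 3 + 64))
```

Under these assumptions, activations dominate by far: a single 224x224x64 volume costs about 24.5 MiB per image, while the parameters of the layer that produces it cost a small fraction of one MiB.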

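Some famous convolutional networks:
· LeNet: the earliest successful application of convolutional networks, proposed by Yann LeCun in the LeNet paper.
· AlexNet: won ILSVRC 2012 by a wide margin over the runner-up and set off the deep learning wave.
· ZF Net: winner of ILSVRC 2013; it adjusted the architecture hyperparameters of AlexNet and enlarged the middle convolutional layers.
· GoogLeNet: winner of ILSVRC 2014; it drastically reduced the number of parameters (from 60M to 4M).
· VGGNet: runner-up of ILSVRC 2014; it showed that the depth of a CNN is critical to its final performance.
· ResNet: winner of ILSVRC 2015 and, as of May 10, 2016, the state-of-the-art model. Kaiming He et al. recently proposed an improved version, Identity Mappings in Deep Residual Networks.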

From Kaiming He's ICML16 tutorial

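Part 3: Python Programming Task (3-layer CNNs)

In this part we need to complete the following programming tasks:
1) The following functions in layers.py:
---> conv_forward_naive
---> conv_backward_naive
---> max_pool_forward_naive
---> max_pool_backward_naive

Before giving the code for the convolutional layer, let us look at how its forward and backward passes are actually computed. To make this concrete, take the first image in a batch, x[0, :, :, :], with three RGB channels of size 7x7 each, and use a padding of 1 and a stride of 2, so that the padded image has size 3x9x9. Assume further that there are 3 filters of size 3x3 each, and let w denote the weights of all filters (e.g. the first channel of the first filter is w[0, 0, :, :]). The bias b holds 3 values, one per filter, and the activation maps are denoted out, with size 3x4x4 (e.g. the first map is out[0, :, :]).

Using this example, the figures below show the concrete computations of the forward and backward passes (the backward-pass figure has a high resolution; open it in a new tab and zoom in, or download it):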

Forward.png

Backward.jpg
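
For readers who prefer formulas to figures, the same computations can be written out as follows. This is a sketch in index notation, assuming the naming used in the text (x_pad is the zero-padded input, s the stride, HH and WW the filter height and width); the indices (r, q) for positions in the padded input are introduced here only for illustration.

```latex
\begin{aligned}
\mathrm{out}[i,f,j,k] &= \sum_{c=0}^{C-1}\sum_{u=0}^{HH-1}\sum_{v=0}^{WW-1}
    x_{\mathrm{pad}}[i,c,js+u,ks+v]\,w[f,c,u,v] + b[f] \\
db[f] &= \sum_{i,j,k} d\mathrm{out}[i,f,j,k] \\
dw[f,c,u,v] &= \sum_{i,j,k} x_{\mathrm{pad}}[i,c,js+u,ks+v]\,d\mathrm{out}[i,f,j,k] \\
dx_{\mathrm{pad}}[i,c,r,q] &= \sum_{f}\ \sum_{\substack{j,k,u,v \\ js+u=r,\ ks+v=q}}
    w[f,c,u,v]\,d\mathrm{out}[i,f,j,k]
\end{aligned}
```

The gradient with respect to the unpadded input is obtained by cropping the padding rows and columns from dx_pad.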

The code is as follows:

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def conv_forward_naive(x, w, b, conv_param):
    # x: (N, C, H, W) inputs, w: (F, C, HH, WW) filters, b: (F,) biases
    stride, pad = conv_param['stride'], conv_param['pad']
    N, C, H, W = x.shape
    F, C, HH, WW = w.shape
    # Zero-pad only the two spatial axes
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    # Integer division keeps the output sizes usable as array shapes
    H_new = 1 + (H + 2 * pad - HH) // stride
    W_new = 1 + (W + 2 * pad - WW) // stride
    s = stride
    out = np.zeros((N, F, H_new, W_new))

    for i in range(N):       # ith image
        for f in range(F):   # fth filter
            for j in range(H_new):
                for k in range(W_new):
                    # Dot product of filter f with the window at (j, k), plus bias
                    out[i, f, j, k] = np.sum(x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f]) + b[f]

    cache = (x, w, b, conv_param)

    return out, cache


def conv_backward_naive(dout, cache):
    x, w, b, conv_param = cache
    pad = conv_param['pad']
    stride = conv_param['stride']
    F, C, HH, WW = w.shape
    N, C, H, W = x.shape
    H_new = 1 + (H + 2 * pad - HH) // stride
    W_new = 1 + (W + 2 * pad - WW) // stride

    dx = np.zeros_like(x)
    dw = np.zeros_like(w)
    db = np.zeros_like(b)

    s = stride
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
    # Gradients are accumulated on the padded input and cropped at the end
    dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')

    for i in range(N):       # ith image
        for f in range(F):   # fth filter
            for j in range(H_new):
                for k in range(W_new):
                    window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
                    # Every output value contributes to db, dw and dx
                    db[f] += dout[i, f, j, k]
                    dw[f] += window * dout[i, f, j, k]
                    dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]

    # Unpad
    dx = dx_padded[:, :, pad:pad+H, pad:pad+W]

    return dx, dw, db

Once the code is written, use the checks in ConvolutionalNetworks.ipynb to verify it.

Here is the code for the pooling layer (max pooling):

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def max_pool_forward_naive(x, pool_param):
    HH, WW = pool_param['pool_height'], pool_param['pool_width']
    s = pool_param['stride']
    N, C, H, W = x.shape
    # Integer division keeps the output sizes usable as array shapes
    H_new = 1 + (H - HH) // s
    W_new = 1 + (W - WW) // s
    out = np.zeros((N, C, H_new, W_new))
    for i in range(N):
        for j in range(C):
            for k in range(H_new):
                for l in range(W_new):
                    # Each channel is pooled independently
                    window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
                    out[i, j, k, l] = np.max(window)

    cache = (x, pool_param)

    return out, cache


def max_pool_backward_naive(dout, cache):
    x, pool_param = cache
    HH, WW = pool_param['pool_height'], pool_param['pool_width']
    s = pool_param['stride']
    N, C, H, W = x.shape
    H_new = 1 + (H - HH) // s
    W_new = 1 + (W - WW) // s
    dx = np.zeros_like(x)
    for i in range(N):
        for j in range(C):
            for k in range(H_new):
                for l in range(W_new):
                    window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
                    m = np.max(window)
                    # Route the gradient to the max position(s); += accumulates
                    # correctly when windows overlap (stride < pool size)
                    dx[i, j, k*s:HH+k*s, l*s:WW+l*s] += (window == m) * dout[i, j, k, l]

    return dx

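Again, use the checks in ConvolutionalNetworks.ipynb to verify the code.

The code above uses several nested for loops, which makes it very slow. To speed things up, Assignment2 provides fast_layers.py, which relies on a C extension generated with Cython. Here is a speed comparison between the naive and fast versions; as the figure below shows, the speedup is dramatic: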

Naive vs Fast.png
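
Cython is not the only way to avoid slow Python loops. As a separate illustration (not a description of how fast_layers.py works), the sketch below vectorizes max pooling with a NumPy reshape. It only applies to the common special case of non-overlapping square windows, and the function names are hypothetical.

```python
import numpy as np


def max_pool_loops(x, size=2):
    """Reference max pooling with explicit loops over output positions."""
    N, C, H, W = x.shape
    out = np.zeros((N, C, H // size, W // size), dtype=x.dtype)
    for k in range(H // size):
        for l in range(W // size):
            window = x[:, :, k*size:(k+1)*size, l*size:(l+1)*size]
            out[:, :, k, l] = window.max(axis=(2, 3))
    return out


def max_pool_reshape(x, size=2):
    """Loop-free max pooling; only valid for non-overlapping square windows."""
    N, C, H, W = x.shape
    assert H % size == 0 and W % size == 0
    # Split each spatial axis into (block index, offset inside the block)
    blocks = x.reshape(N, C, H // size, size, W // size, size)
    return blocks.max(axis=(3, 5))


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    x = rng.randn(2, 3, 8, 8)
    print(np.array_equal(max_pool_loops(x), max_pool_reshape(x)))   # True
```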

2) cnn.py; the code is as follows:

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

from layer_utils import *

class ThreeLayerConvNet(object):    
    """    
    A three-layer convolutional network with the following architecture:       
       conv - relu - 2x2 max pool - affine - relu - affine - softmax
    """

    def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,             
                 hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                 dtype=np.float32):
        self.params = {}
        self.reg = reg
        self.dtype = dtype

        # Initialize weights and biases
        C, H, W = input_dim
        self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
        self.params['b1'] = np.zeros((1, num_filters))
        # The conv layer keeps H x W and the 2x2 pool halves both, so the
        # flattened input of the first affine layer has num_filters*H*W/4 entries
        self.params['W2'] = weight_scale * np.random.randn(num_filters * H * W // 4, hidden_dim)
        self.params['b2'] = np.zeros((1, hidden_dim))
        self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params['b3'] = np.zeros((1, num_classes))

        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        W3, b3 = self.params['W3'], self.params['b3']

        # pass conv_param to the forward pass for the convolutional layer
        filter_size = W1.shape[2]
        # This padding preserves the spatial size for odd filter sizes at stride 1
        conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2}

        # pass pool_param to the forward pass for the max-pooling layer
        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

        # compute the forward pass
        a1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
        a2, cache2 = affine_relu_forward(a1, W2, b2)
        scores, cache3 = affine_forward(a2, W3, b3)

        if y is None:    
            return scores

        # compute the backward pass
        data_loss, dscores = softmax_loss(scores, y)
        da2, dW3, db3 = affine_backward(dscores, cache3)
        da1, dW2, db2 = affine_relu_backward(da2, cache2)
        dX, dW1, db1 = conv_relu_pool_backward(da1, cache1)

        # Add regularization
        dW1 += self.reg * W1
        dW2 += self.reg * W2
        dW3 += self.reg * W3
        reg_loss = 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])

        loss = data_loss + reg_loss
        grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}

        return loss, grads

Once the code is written, use the checks in ConvolutionalNetworks.ipynb to verify it.
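
It can help to trace the activation shapes through ThreeLayerConvNet, in particular to see where the first dimension of W2 comes from. The sketch below is a hypothetical helper that mirrors the default constructor arguments and the conv and pool settings used in loss (stride-1 convolution with padding (filter_size - 1) / 2, 2x2 pooling with stride 2).

```python
def three_layer_convnet_shapes(input_dim=(3, 32, 32), num_filters=32,
                               filter_size=7, hidden_dim=100, num_classes=10):
    """Activation shape (per image) after each stage of the network."""
    C, H, W = input_dim
    pad = (filter_size - 1) // 2
    conv_h = H + 2 * pad - filter_size + 1      # stride-1 convolution
    conv_w = W + 2 * pad - filter_size + 1
    pool_h, pool_w = conv_h // 2, conv_w // 2   # 2x2 max pool, stride 2
    return [
        ("input", (C, H, W)),
        ("conv - relu", (num_filters, conv_h, conv_w)),
        ("2x2 max pool", (num_filters, pool_h, pool_w)),
        ("flatten", (num_filters * pool_h * pool_w,)),
        ("affine - relu", (hidden_dim,)),
        ("affine", (num_classes,)),
    ]


if __name__ == "__main__":
    for name, shape in three_layer_convnet_shapes():
        print("%-14s %s" % (name, shape))
```

With the defaults, the flattened pooled volume has 32 x 16 x 16 = 8192 entries, which equals num_filters*H*W/4, the number of rows of W2.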

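3) The spatial_batchnorm_forward and spatial_batchnorm_backward functions in layers.py. Before the code, here is a figure that helps explain how Batch Normalization in CNNs computes the mean and standard deviation (std) for a convolutional layer: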

ConvNet Batch Normalization.png

The code is as follows:

__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    N, C, H, W = x.shape
    x_new = x.transpose(0, 2, 3, 1).reshape(N*H*W, C)
    out, cache = batchnorm_forward(x_new, gamma, beta, bn_param)
    out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return out, cache


def spatial_batchnorm_backward(dout, cache):
    N, C, H, W = dout.shape
    dout_new = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C)
    dx, dgamma, dbeta = batchnorm_backward(dout_new, cache)
    dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)

    return dx, dgamma, dbeta

Once the code is written, use the checks in ConvolutionalNetworks.ipynb to verify it.
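
The transpose-and-reshape trick in spatial_batchnorm_forward can be checked in isolation. The sketch below, using made-up random data, confirms that after the rearrangement each column holds all values of one channel, so a per-column mean equals the per-channel mean over N, H and W, and that the inverse rearrangement restores the original layout.

```python
import numpy as np

rng = np.random.RandomState(0)
N, C, H, W = 4, 3, 5, 5
x = rng.randn(N, C, H, W)

# Same rearrangement as spatial_batchnorm_forward: one row per (n, h, w) position
x_rows = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)

mean_from_rows = x_rows.mean(axis=0)         # what vanilla batchnorm would compute
mean_per_channel = x.mean(axis=(0, 2, 3))    # statistics over N, H and W
print(np.allclose(mean_from_rows, mean_per_channel))   # True

# Undoing the rearrangement recovers the original layout exactly
x_back = x_rows.reshape(N, H, W, C).transpose(0, 3, 1, 2)
print(np.array_equal(x_back, x))   # True
```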

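Taking the ThreeLayerConvNet completed above as an example, we compare convergence with and without Batch Normalization. As the results below show, Batch Normalization clearly speeds up convergence, so training takes far fewer epochs: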

with BN --vs-- without BN.png

---> PS:
1. Data Augmentation
When the dataset is small, data augmentation is quite effective and can improve recognition accuracy to some extent. The specific augmentation methods are:
1) Horizontal flips

Horizontal flips.png

2) Random crops/scales

Random crops/scales.png

3) Color jitter

Randomly jitter contrast.png

4) Get creative
For example: translation, rotation, stretching, shearing, lens distortion, and so on.
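
The first two augmentations are short enough to sketch with NumPy. The code below assumes images stored as (N, C, H, W) arrays, as elsewhere in the assignment; the random data merely stands in for CIFAR-10, and the function names and the crop size of 28 are hypothetical choices for illustration.

```python
import numpy as np


def horizontal_flip(images):
    """Mirror a batch of (N, C, H, W) images along the width axis."""
    return images[:, :, :, ::-1]


def random_crop(images, crop, rng):
    """Cut one random crop x crop patch per image."""
    N, C, H, W = images.shape
    out = np.empty((N, C, crop, crop), dtype=images.dtype)
    for i in range(N):
        top = rng.randint(0, H - crop + 1)
        left = rng.randint(0, W - crop + 1)
        out[i] = images[i, :, top:top + crop, left:left + crop]
    return out


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.rand(8, 3, 32, 32)                # stand-in for CIFAR-10 images
    y = rng.randint(0, 10, size=8)
    X_aug = np.concatenate([X, horizontal_flip(X)], axis=0)   # doubles the set
    y_aug = np.concatenate([y, y], axis=0)                    # labels are unchanged
    print(X_aug.shape, y_aug.shape)           # (16, 3, 32, 32) (16,)
    print(random_crop(X, 28, rng).shape)      # (8, 3, 28, 28)
```

Appending the flipped copies to the originals doubles the number of training images while leaving the labels unchanged.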

Below is a CNN model tested on CIFAR-10 (using simple horizontal flips for data augmentation), with training set: 49000x2, validation set: 1000, test set: 10000. The layer structure of the CNN is:

           [[conv - relu]x3 - pool]x3 - affine - relu - affine - softmax

Training results:
· Validation set accuracy: 0.904
· Test set accuracy: 0.892

Training loss & Accuracy

CONV layer 1: filters

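Part 4: Visualizing Convolutional Neural Networks

Visualization lifts the veil on CNNs and helps us understand what they have actually learned. We discuss several concrete visualization techniques below:

1. Visualizing weights and activations

Taking AlexNet as an example, here are visualizations of some of the weights and activations of each layer: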

CONV layer 1: filters(left) and activations(right)

CONV layer 2: filters(left) and activations(right)

CONV layer 3: activations

CONV layer 4: activations

CONV layer 5: activations

Fully-connected layer 1 & 2

Output layer

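2. Retrieving images that maximally activate a neuron

We can feed a large set of images through the network, keep track of the images that activate a given neuron the most, and then visualize those images to understand what the neuron is looking for in its receptive field in order to classify images correctly. The figure below shows the pool5 layer of AlexNet (bald heads caught in the crossfire O__O):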

AlexNet: pooling layer 5

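3. Visualizing images with t-SNE and CNN feature vectors

A CNN can be seen as transforming the input image layer by layer into a representation that a linear classifier can separate. This final representation is the CNN code, i.e. the feature vector (for example, the 4096-dimensional vector fed to the classifier in AlexNet).

t-SNE is one of the best methods for reducing high-dimensional data to a form that can be visualized, and its results look very good. We can feed the CNN codes into t-SNE to obtain a two-dimensional vector for each image (each image corresponds to one feature vector), and then visualize the result as follows (the closer two images are, the more similar they look to the CNN):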

t-SNE visualization of CNN codes

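4. Occluding parts of the image

To check whether a CNN classifies an image based on the right object (rather than guessing), we can occlude parts of the image and test the CNN. As the figure below shows, the CNN does rely on the right object: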

Occluding parts of the image
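
The occlusion experiment boils down to sliding a blank patch over the image and recording how much the class score drops. The sketch below shows the mechanics on a toy example: `toy_score` is a hypothetical stand-in for a trained CNN's class probability (it only looks at the centre of a single-channel image), and the patch size of 4 is an arbitrary choice.

```python
import numpy as np


def occlusion_map(image, score_fn, patch=4, fill=0.0):
    """Score drop caused by blanking out each patch x patch region in turn."""
    H, W = image.shape
    base = score_fn(image)
    drops = np.zeros((H // patch, W // patch))
    for r in range(H // patch):
        for c in range(W // patch):
            occluded = image.copy()
            occluded[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = fill
            drops[r, c] = base - score_fn(occluded)
    return drops


def toy_score(image):
    """Hypothetical stand-in for a class probability: mean of the centre region."""
    return image[4:8, 4:8].mean()


if __name__ == "__main__":
    img = np.ones((12, 12))
    heat = occlusion_map(img, toy_score)
    print(heat)   # only the centre cell shows a drop
```

With a real network, `score_fn` would run a forward pass and return the probability of the correct class; large drops then mark the regions the classifier depends on.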

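Part 5: Transfer Learning

In practice we rarely train a CNN from scratch, because we usually do not have enough data. The common approach is to take a CNN already trained on a large dataset (e.g. ImageNet) and use it either as the initial model or as a fixed feature extractor on the new dataset. The figure below illustrates this: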

CS231n Convolutional Neural Networks for Visual Recognition.png

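When the new dataset is not similar to the pretraining dataset (e.g. medical images), the strategy in the figure above needs a slight adjustment: if the new dataset is small, we need to train a few of the layers before the linear classifier in addition to the classifier itself; if the new dataset is large, we need to fine-tune all layers.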

---> CS231n: Assignment 1
---> CS231n: Assignment 3

Last edited: 2020.07.15 20:29:46

Reproduction is prohibited; to reproduce this article, please contact the author via private message or comment.
