Part 1 of the deep learning can be found here.


In part one, we prepare a model with 4 layers of computational nodes. The solutions for are not unique. Should we prefer one solution over the other?

Layer 1:
      [[ 1.10727659,  0.22189273,  0.13302861,  0.2646622 ,  0.2835898 ],
       [ 0.23522207,  0.01791731, -0.01386124,  0.28925567,  0.187561  ]]
Layer 2:
 	[[ 0.9450821 ,  0.14869831,  0.07685842,  0.23896402,  0.15320876],
       [ 0.33076781,  0.02230716,  0.01925127,  0.30486342,  0.10669098],
       [ 0.18483084, -0.03456052, -0.01830806,  0.28216702,  0.07498301],
       [ 0.11560201,  0.05810744,  0.021574  ,  0.10670155,  0.11009798],
       [ 0.17446553,  0.12954657,  0.03042245,  0.03142454,  0.04630127]]), 
Layer 3:	   
	[[ 0.79405847,  0.10679984,  0.00465651,  0.20686431,  0.11202472],
       [ 0.31141474,  0.01717969,  0.00995529,  0.30057041,  0.10141655],
       [ 0.13030365, -0.09887915,  0.0265004 ,  0.29536237,  0.07935725],
       [ 0.07790114,  0.04409276,  0.01333717,  0.10145275,  0.10112565],
       [ 0.12152267,  0.11339623,  0.00993313,  0.02115832,  0.03268988]]), 
Layer 4:	  
	[[ 0.67123192],
       [ 0.48754364],
       [-0.2018187 ],
       [ 0.07363663]])]

This leads us to a very important topic. When we increase the complexity of our model, we risk the chance of modeling the noise also. If we do not have enough sample data to cancel out the noise, we make bad predictions. But, even without the noise, we can still have a bad model. Let’s walk through an example. We start with training samples with input values and output range from 0 to 20. How will you model an equation to link the data points below?

One possibility is which is simple and just misses 2 on the left and 2 on the right.

But, when we show it to our android, Pieter, who has a much higher computational capability than us, he models it as:

Which model is correct? The answer is “don’t know”. Someone thinks the first one is simpler. Simple explanations deserve more credit. But, if you show it to a stock broker, they may say the second curve looks closer to the market closing price of a stock. Instead, we should ask whether our model is too “custom tailored” to the training dataset and fails to make generalized predictions. Even the second model fits the training data completely, but it can make poor predictions if the true model is proven to be a simple straight line.

When we have a complex model but not enough data, we can easily overfit the model which memorizes the training data but fails to make generalized predictions.


Machine learning is about making predictions. A model that has 100% accuracy in training can still be a bad model. For that, we split our data into three parts: say 80% for training, 10% for validation and 10% for testing. During training, we use the training dataset to build models with different hyperparameters. We run those models with the validation dataset and pick the one with the highest accuracy. This strategy works if the validation dataset is similar to what we want to predict. But, as a last safeguard, we use the 10% testing data for a final insanity check. This testing data is for a final verification, but not for model selection. If your testing result is dramatically different from the validation result, the data should be randomized more, or more data should be collected.

Validation data is used to select hyperparameters and models with the highest accuracy. The testing data should never use to select a model. It should be used very infrequently just for insanity check.


Let’s us visualize some of the overfit problem. We can train a model to create a boundary to separate the blue dots from the white dots below. A complex model (the left picture) can produce ad hoc boundaries compared to a low complexity model. In the yellow-circled area, if the two left white dot samples is missing in our training dataset, a complex model may create an odd shaped boundary just to include this white dot. A low complexity model produces a smoother surface, which sometimes makes more generalized predictions. A complex model is vulnerable to outliers. For example, the white dot in the green circle may be incorrectly labeled as white. In a simple model, the boundary is much smoother and may simply ignore this outlier.

Recall from Pieter’s equation, our sample data can be modeled nicely with the following equations:

In fact, there are infinite solutions having different polynomial orders .

Compared with the linear model , the in Pieter’s equation is higher but it seems they are cancelling out each other. In addition, the higher the order, the harder to train the model because of the much bigger search space for the parameters. In additional, some areas in the search space have very steep gradients that making training tough.

Let us create a polynomial model with order 5 to fit our sample data.

We need more iterations to train this model and it is less accurate than a model with order three.

So why don’t we focus on making a model with the right complexity. In real life problems, a complex model is the only way to push accuracy to an acceptable level. A better solution is to introduce methods to reduce overfitting rather than make the model simpler. One simple solution is to add more training data such that it is much harder to overfit. Here, double our training data produces a model closer to a straight line. Unfortunately, labeling large training dataset in real problems can be expensive.

Complex models sometimes create ad hoc ways to fit training data that does not generalize. Increase the volume of training data reduces overfit effectively but can be expensive to do.


Regularization punishes over-complexity. As we have observed before, there are many solutions to a DL problem. To have a close fit, however, the coefficient usually has a larger magnitude.

Overfit model tends to have larger magnitude. To have our training not to overfit the training data, we add a penalty in the cost function to penalize large magnitude. In this example, we use a L2 norm (L2 regularization) as the penalty.

This is called regularization. Here, we introduce another hyperparameter called regularization factor to penalize overfitting.

Regularization favors less complex model if the cost reduction is less significant.

After many repetitions of trial, we pick to be 1. With the regularization, our model makes better predictions.

Like other hyperparameters, the selection process is trial and error. In fact, we use a relatively high because there are only a few trainable parameters in the model. In real life problems, is much lower because the number of trainable parameters are usually in millions.

There is another interesting observation. The loss may jump up sharply and drop to the previous level after a few thousand iterations.

Iteration 87000 [2.5431744485195127]
Iteration 88000 [2.525734745522529]
Iteration 89000 [223.88938197268865]
Iteration 90000 [195.08231216279583]
Iteration 91000 [3.0582387198108449]
Iteration 92000 [2.4587727305339286]

If we look into the equation, we realize the gradient

can be very steep which cause the learning rate problem discussed before. For example, from iteration 10,000 to 11,000, the coefficient for changes slightly from -0.000038 to -0.000021 but the cost jumps from 82 to 34,312. (It takes 2000 iterations to undo the jump.)

Iteration 10000 [82.128486144319155, 
 array([[  1.66841311e+00],
       [  5.15883024e-01],
       [  1.05449372e-01],
       [ -1.15910560e-01],
       [  1.32065116e-02],
       [ -3.83863265e-04]])]
Iteration11000 [34312.355686493174, 
  array([[  1.82722611e+00],
       [  5.83582807e-01],
       [  1.28332499e-01],
       [ -1.11263342e-01],
       [  1.24705708e-02],
       [ -2.05131433e-04]])]

When we build our model, we try out a polynomial model with order of nine. Even after a long training, the model still makes poor predictions. We decide to start with a model with order of three and increase it gradually. This is another example to demonstrate why we should start with a simple model first. At seven, we find the model is too hard to train. The following is what a seven-order model predicts:

Diminishing and exploding gradient

Cannot train a model if gradients explode or diminish. From our previous example, we demonstrate the importance of tracing the gradient at different layers to troubleshoot problems. In our online dating model, we log for each layers.

iteration 0: loss=45.6
layer 0: gradient = 226.1446016395799
layer 1: gradient = 566.6340440894377
layer 2: gradient = 371.4818585197662
layer 3: gradient = 371.7283667292019
iteration 10000: loss=12.28
layer 0: gradient = 39.087735791986816
layer 1: gradient = 70.66776450168192
layer 2: gradient = 40.95339598248693
layer 3: gradient = 49.27868977928858
iteration 90000: loss=11.78
layer 0: gradient = 8.695315741501654
layer 1: gradient = 13.149909360278247
layer 2: gradient = 9.97983678446837
layer 3: gradient = 7.053793667949491

A couple of things we need to monitor. Is the magnitude too high or too small? If the magnitude remains high, some parameters may be oscillating.

iteration 0: ... dW1=1.183e+04 dW2=5.929e+06 ...
iteration 200: ... dW1=4.458e+147 dW2=2.203e+150 ...
iteration 400: ... dW1=1.656e+291 dW2=8.184e+293 ...
iteration 600: ... dW1=nan dW2=nan ...

If the gradient is small, the network learns slowly. In the following log, the gradient diminishes from the right layer (layer 6) to the left layer (layer 0). Layer 0 almost learns nothing.

iteration 0: loss=553.5
layer 0: gradient = 2.337481559834108e-05
layer 1: gradient = 0.00010808796151264163
layer 2: gradient = 0.0012733936924033608
layer 3: gradient = 0.01758514040640722
layer 4: gradient = 0.20165907211476816
layer 5: gradient = 3.3937365923146308
layer 6: gradient = 49.335409914253
iteration 1000: loss=170.4
layer 0: gradient = 0.0005143399278199742
layer 1: gradient = 0.0031069449720360883
layer 2: gradient = 0.03744160389724748
layer 3: gradient = 0.7458109132993136
layer 4: gradient = 5.552521662655173
layer 5: gradient = 16.857110777922465
layer 6: gradient = 37.77102597043024
iteration 2000: loss=75.93
layer 0: gradient = 4.881626633589997e-05
layer 1: gradient = 0.0015526594728625706
layer 2: gradient = 0.01648262093048127
layer 3: gradient = 0.35776408953278077
layer 4: gradient = 1.6930852548061421
layer 5: gradient = 4.064949014764085
layer 6: gradient = 12.7578637206897

Many deep networks suffer from this gradient diminishing problem. Let’s come back to backpropagation to understand the problems.

The gradient descent is computed as:

As indicated, the gradient descent depends on the loss as well as the gradients . Let’s look at a sigmoid activation function. If is higher than 5 or smaller than -5, the gradient is close to 0. Hence, in these regions the node learns close to nothing regardless of the loss.

The derivative of a sigmoid function behaves like a gate to the loss signal. If the input is > 5 or <-5, the derivative is so small, it blocks most of the loss signal to propagate backwards. So, nodes on its left sides learn little.

In addition, the chain rule in the gradient descent has a multiplication effect. If we multiple numbers smaller than one, it diminishes quickly. On the contrary, if we multiple numbers greater than one, it explodes.

The gradient of deep network may diminish quickly or explode. To solve this problem, we have designed new computation nodes that can bypass an individual layer. Microsoft Resnet (2015) has 152 layers. Instead of backpropagate the cost through a long chain of nodes, Resnet have both pathways to backpropagate it through the convolution layer or by-passed through the arc. (Source Kaiming He, Xiangyu Zhang … etc)

Let’s get into details in demonstrating a node called LSTM that used in the natural language processing (NLP). In LSTM, the state of a cell is updated by

If we want to bypass this node such that the input becomes the output , we just need to set and . So one way to address the diminishing gradient problem is to design a computation node with pathway that can bypass a node.

Diminish gradient can be mitigated by designing computation nodes that providing a bypassing pathway for the node.

Gradient clipping

To avoid gradient explosion, we can apply gradient clipping to restrict values of the gradient.

In the example below, we use TensorFlow ( a popular deep learning platform from Google) to clip the gradient if it is too big.

params = tf.trainable_variables()
opt = tf.train.GradientDescentOptimizer(learning_rate)
for b in xrange(time_steps):
    gradients = tf.gradients(losses[b], params)
    clipped_gradients, norm = tf.clip_by_global_norm(gradients, 5.0)

The gradients are rescaled according to the ratio

# t is each individual gradient
# global_norm is the norm for all gradients
# clip_norm is set to 5.0
global_norm = sqrt(sum([l2norm(t)**2 for t in t_list]))
t_list[i] * clip_norm / max(global_norm, clip_norm)


A very important part of deep learning is classification. We have mentioned object recognition before. These are classification problems asking the question: what is this? For example, for Android Pieter to safely walk in a street, he needs to learn what is a traffic light, and is there a pedestrian walking towards him. Classification applies to non-visual problems also. We classify whether an email is a spam or should we approve/disapprove a loan etc…

Like solving regression problem using DL, we use a deep network to compute a value. In classification, we call this a score. We apply a classifier to convert the score to a probability value between 0 and 1. To train the network, the training dataset provides the answers to the classification which we call true label. (like classifying an image as a school bus, a truck or an airplane)

Logistic function (sigmoid function)

A score computed by a network takes on any value. A classifier squashes it to a probability value. For example, to classify whether an email is a spam or the medical test is positive, we apply a logistic function (also called a sigmoid function) to the score. If the output probability is lower than 0.5, we predict “no”, otherwise we predict “yes”.

Softmax classifier

For many classification problems, we categorize an input to one of the many classes. For example, we can classify an image to one of the 100 possible object classes. We use a softmax classifier to compute K probabilities, one per class for an input image (the combined probabilities remains 1).

The network computes K scores per image. The probability that an image belongs to the class will be.

For example, the school bus above may have a score of (3.2, 0.8, 0) for the class school bus, truck and airplane respectively. The probability for the corresponding class is

def softmax(z):
    z -= np.max(z)
    return np.exp(z) / np.sum(np.exp(z))

a = np.array([3.2, 0.8, 0])   # [ 0.88379809  0.08017635  0.03602556]

To avoid the numerical stability problem caused by adding large exponential values, we subtract the inputs by its maximum. Adding or subtract a number from the input does not change the probability value in softmax.

z -= np.max(z)

logits is defined as a mean to measure odd.

If we combine the softmax equation with the logits equation, it is easy to see that the score is the logit.

That is why in many literatures and APIs, logit and score are interchangeable when a softmax classifier is used. However, more than one functions that map scores to probabilities will meet the definition of logits. Sigmoid function is one of them.

Softmax is the most common classifier among others.

SVM classifier

A SVM classifier uses the maximum margin loss (Hinge loss) and the L2-regularization for the cost function:

which is the true label for datapoint .

The class having the highest score will be the class predicted. The maximum margin loss function will be discussed later. Its main objective is to create a boundary to separate classes with the largest possible margin.

Softmax vs SVM

Softmax outputs probabilities which have better interpretations than scores in SVM. The value of a score or the difference between 2 score has little interpretable values. In SVM, if a datapoints is not inside a margin, it has 0 data cost. SVM cares more about in fixing wrong predictions or predictions that are not too certain. Softmax add data cost even the predictions are close. In practice, the performance between both are small.


Entropy measures the amount of information. In data compression, it represents the minimum number of bits in representing data. By definition, entropy is defined as:

Suppose we have a string composed of “a”, “b” or “c” with a 25% chance of occurrence for “a”, 25% for “b”, and 50% for “c”. The entropy is:

We will use bit 0 to represent ‘c’ and bits 10 for ‘a’ and bits 11 for ‘b’. In average, we need 1.5 bits per character to represent a string.

Cross entropy

If entropy measures the minimum of bits to encode information using the most optimized scheme. Cross entropy measures the minimum of bits to encode using the wrong optimized scheme from . The cross entropy is always higher than entropy unless both distributions are the same: you need more bits to encode the information if you use a less optimized scheme.

KL Divergence:

KL divergence is simply cross entropy minus entropy (the extra bits needed to encode the data):

In machine learning, KL Divergence measures the difference between 2 probability distributions.

Source wikipedia.

It becomes a very good cost function to penalize the difference between the true label and the prediction made by the model. In our previous example, we classify a picture as either a bus, a truck or an airplane (bus, truck, airplane). The true label probability distribution for a bus is (1, 0, 0) and our model prediction can be (0.88, 0.08, 0.04). The KL Divergence becomes a very good cost function to train our model.

KL Divergence is even better for stochastic processes when the true label is stochastic rather than deterministic. (true label values smaller than 1)


The entropy of the true label is constant, therefore .

i.e. Optimizing KL-divergence is the same as optimizing cross entropy.

We can optimize the simpler cross entropy instead of the KL-divergence.

The cross entropy cost of our example is:

Since is always equal to 1 for the true label and 0 otherwise, the cross entropy can be simplified as:

Cross entropy is a very popular cost function for deep learning.

Maximum likelihood estimation (MLE)

In the last section, we state without proof that KL divergence and cross entropy is a good cost function. Let’s derive it. What is our objective in training a model? Our objective is to train the model parameters so that the likelihood of our model is maximized (MLE). We want to find such that our training dataset is most possible comparing to other . The probability of our model matching the true labels of all training datapoints is:

for training datapoints and true label


For our first datapoint, and = .

Negative log-likelihood (NLL)

Logarithm is a monotonic function. Hence optimize is the same as optimize . However, change the sign and the direction of the function. Maximum likelihood translates to minimize the negative log-likelihood (NLL).

Maximize the “maximum likelihood estimation” (MLE) is the same as minimizing the negative log-likelihood.

i.e. Maximize:

is the same as minimize

With no coincident, this is the cross entropy in our previous example.

NLL is a very common method in finding the cost function.

Logistic loss

As an exercise to demonstrate NLL, we derive the logistic loss from NLL.

In logistic regression, we compute the probability by

Apply NLL,

This becomes the logistic loss:

NLL for softmax

This is the Softmax cost function defined as the NLL and the corresponding gradient:

To proof it, we compute the Softmax gradient for one datapoint:


The code in calculating softmax loss:

def softmax_loss(x, y):
    probs = np.exp(x - np.max(x, axis=1, keepdims=True)) # Subtract from the max for better numeric stability.
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

Cost function

Deep learning is about knowing your costs. Good cost function builds good models. But, how do you define the cost? San Francisco is about 400 miles from Los Angeles. It costs about $80 for the gas. When you order food from a restaurant, they do not deliver to homes more than a few miles away. From their perspective, the cost grows exponentially with distance. So, there are many definitions of cost. In DL, our objective is not knowing the value of the cost, but finding a set of to make cost the lowest. Therefore, we have much flexibility for the cost function as long as we can find this set of . For example, some cost functions ignore outliers better than others. There are other important considerations including how easy can we optimize the cost function.

Cross entropy cost function

Apply NLL to find the cost function for a classification problem:

Recall from the previous section:

A cross entropy cost function and a softmax classifier is one of the most popular combinations in solving classification problems.

SVM loss (aka Hinge loss or Max margin loss)

If the margin between the score of the true label and the score of a class is less than 1, we add it to the cost. For example, with a score of (8, 14, 9.5, 10) and the true label is the last entry:

For SVM, the cost function creates a boundary with a maximum margin to separate classes.

For training points that are predicted correctly and outside the margin, the cost is 0. Within the margin, all training datapoints add a penalty to the cost function. i.e. If you prediction is certain and it is correct, there is no penalty. However, we penalize predictions that are in-correct or below some certainty.

Mean square error (MSE)

We have used MSE for regression problems before. We can use MSE in classification. But, in practice, we use cross entropy loss instead. Classification uses a classifier to squash scores to a probability between 0 and 1. The mapping is not linear. For a sigmod classifier, a large range of values is squeezed to 0 or 1 (for or ). As shown before, the partial derivative is close to 0 for those region. Based on the chain rule in the back propagation:

The cost signal is hard to propagate backward in those regions regardless of cost. However, there is a way to solve this issue. Even the partial derivative of the sigmoid function in those regions is small but we can make very large if the prediction is bad. The sigmoid function squashes values exponentially. We need a cost function that punishes bad predictions in the same scale to counter that. Squaring the error in MSE is not good enough. Cross entropy punishes bad predictions exponentially. That is why the cross entropy cost function trains better than MSE in the classification problems.


As mentioned before, we add regularization cost into the cost function to avoid overfitting. How does it impact our models? The idea of regularization is to train a model that makes generalized predictions. It forces the model not to memorize the small bits of an individual sample that is not part of the generalized features. The most effective solution is to add more training data. But it can be expensive to label data. Part of the overfit problem is that the model is too powerful. We can reduce the complexity of the model by reducing the number of features. We can remove features that can be directly or indirectly derived by others. We can eliminate some layers or switch to a design that explores better on the locality of the information. For example, using CNN for images to explore the spatial locality and LSTM for temporal locality.

L0, L1, L2 regularization

Large W tends to overfit. L2 regularization adds the L2 norm to a cost function as the regularization cost.

Beside L2 norm, we can use L0 or L1 norm.

L0 regularization

L1 regularization

Comparison among L0, L1 and L2 regularization

The derivative of L2 norm is easy to compute and it is smooth which works well with gradient descent. Hence L2 regularization is very popular. is the optimal without the regularization. By adding the L2 regularization (the circle), we shift the optimal point closer to 0.

The shape of the L1 regularization is a rotated square instead of a circle since regularization cost is measured as but not . L1 regularization tries to push some to zero rather than closer to 0. So it increase the sparsity of .

L0 regularization adds a constant cost to the regular cost (blue curve) except at . The green curve is the new cost which have a sudden drop at 0 to to the green dot. Therefore, it can push more to 0 comparing to L1 regularization.

L1 and L0 regularization promotes sparsity for which can be desirable to avoid overfitting when we have too many input features.


A non-intuitive regularization method called dropout discourages weights with large values. To avoid overfit, we do not want weights to be too dominating. By randomly dropping connections from one layer to the other layer, we force the network not to depend too much on a single node and try to draw prediction from many different pathways. This has an effect similar to forcing the weights smaller.

In the following diagram, for each iteration during training, we randomly drop off some connections.

Here is the code to implement the dropout for the forward feed and back propagation. In the forward feed, it takes a parameter on the percentage of nodes to be dropout. Notice that, dropout applies to training only.

def dropout_forward(x, dropout_param):
  p, mode = dropout_param['p'], dropout_param['mode']
  if 'seed' in dropout_param:

  mask = None
  out = None

  if mode == 'train':
    drop = 1 - p
    mask = (np.random.rand(*x.shape) < drop) / drop
    out = x * mask
  elif mode == 'test':
    out = x

  cache = (dropout_param, mask)
  out = out.astype(x.dtype, copy=False)

  return out, cache

def dropout_backward(dout, cache):
  dropout_param, mask = cache
  mode = dropout_param['mode']
  if mode == 'train':
    dx = dout * mask
  elif mode == 'test':
    dx = dout
  return dx

Dropout forces the decision not to dependent on few features. It behave like other regularization in constraining the magnitude of .

Deep learning network (Fully-connected layers) CIFAR-10

Let’s put together everything to solve the CIFRA-10. CIFAR-10 is a computer vision dataset for object classification. It has 60,000 32x32 color images belonging to one of 10 object classes, with 6000 images per class.

(Source Alex Krizhevsky)

We implement a fully connected network similar to the following to classify the CIFRA images. In our implementation, we allow the user to control how many hidden layers to create and the number of nodes per layer.

Let’s have some boiler plate code that we did before. This is the forward feed and the back propagation code for and the ReLU.

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    cache = (x, w, b)
    return out, cache

def affine_backward(dout, cache):
    x, w, b = cache
    dx =
    dw = x.reshape(x.shape[0], -1)
    db = np.sum(dout, axis=0)
    return dx, dw, db

def relu_forward(x):
    out = np.maximum(0, x)
    cache = x
    return out, cache

def relu_backward(dout, cache):
    dx, x = None, cache
    dx = dout
    dx[x < 0] = 0
    return dx

We combine them to form a forward feed and a backpropagation “affine relu” layer. (Each affine relu layer is used as a hidden layer.)

def affine_relu_forward(x, w, b):
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

def affine_relu_backward(dout, cache):
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db

This is the softmax function.

def softmax_loss(x, y):
    probs = np.exp(x - np.max(x, axis=1, keepdims=True)) # Subtract from the max for better numeric stability.
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

We are creating a FullyConnectedNet network with 3 hidden layers with (100, 50, 25) nodes respectively. We also initialize with a random normal distribution.

class FullyConnectedNet(object):

    def __init__(self, hidden_dims, input_dim=3 * 32 * 32, num_classes=10, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32):
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        layers = [input_dim] + hidden_dims + [num_classes]
        # Initialize the W & b for each layers 
        for i in range(self.num_layers):
            self.params['W%d' % (i + 1)] = np.random.randn(layers[i], layers[i + 1]) * weight_scale
            self.params['b%d' % (i + 1)] = np.zeros(layers[i + 1])

        # Cast all parameters to the correct datatype
        for k, v in self.params.iteritems():
            self.params[k] = v.astype(dtype)
model = FullyConnectedNet([100, 50, 25], weight_scale=5e-2, dtype=np.float64)			

Here is the key part in computing the loss in the model class. We do a feed forward for the hidden and output layers. Use softmax to compute the loss. Add L2 regularization cost and compute all the gradients using the backpropagation.

def loss(self, X, y=None):
    X = X.astype(self.dtype)
    # We reuse the same method for prediction. So we have the train mode and the test mode (make prediction.)
    mode = 'test' if y is None else 'train'

    layer = [None] * (1 + self.num_layers)
    cache_layer = [None] * (1 + self.num_layers)

    layer[0] = X

    # Feed forward for each layer define in FullyConnectedNet([100, 50, 25], ...)
    for i in range(1, self.num_layers):
        # Retrieve the W & b
        W = self.params['W%d' % i]
        b = self.params['b%d' % i]
        # Feed forward for one affine relu layer
        layer[i], cache_layer[i] = affine_relu_forward(layer[i - 1], W, b)

    last_W_name = 'W%d' % self.num_layers
    last_b_name = 'b%d' % self.num_layers
    # From the last hidden layer to the output layer, we do an affine op but not ReLU
    scores, cache_scores = affine_forward(layer[self.num_layers - 1],

    # If just making prediction, we return the scores										  
    if mode == 'test':
        return scores

    loss, grads = 0.0, {}
    # Compute the loss
    loss, dscores = softmax_loss(scores, y)

    # For each layer, add the regularization loss
    for i in range(self.num_layers):
        loss += 0.5 * self.reg * np.sum(self.params['W%d' % (i + 1)] ** 2)

    # Back progagation the output to the last hidden layer
    dx = [None] * (1 + self.num_layers)
    dx[self.num_layers], grads[last_W_name], grads[last_b_name] = affine_backward(dscores, cache_scores)
    grads[last_W_name] += self.reg * self.params[last_W_name]

    # Back progagation from the last hidden layer to the first hidden layer
    for i in reversed(range(1, self.num_layers)):
        dx[i], grads['W%d' % i], grads['b%d' % i] = affine_relu_backward(dx[i + 1], cache_layer[i])
        grads['W%d' % i] += self.reg * self.params['W%d' % i]

    return loss, grads

Here is the code of training the model. The Solver takes in the model we created and performs the gradient descent with the training dataset to train the model. All deep learning software platform provides similar services. People use those built-in services to optimize . Since we had covered the gradient descent, we will not list the code for the Solver here.

model = FullyConnectedNet([100, 50, 25],
              weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, data,
                print_every=100, num_epochs=10, batch_size=200,
                  'learning_rate': learning_rate,

And this is the code of making predictions and calculating the accuracy.

X_test, y_test, X_val, y_val = data['X_test'], data['y_test'], data['X_val'], data['y_val']

y_test_pred = np.argmax(model.loss(X_test), axis=1)
y_val_pred = np.argmax(model.loss(X_val), axis=1)
print('Validation set accuracy: ', (y_val_pred == y_val).mean())
print('Test set accuracy: ', (y_test_pred == y_test).mean())

Deep learning is about creating a model by learning from data. We have solved a visual recognition problem that is otherwise difficult to solve. Instead of coding all the rules, which is impossible for the CIFRA problem, we create a FC network to learn the model from data. The accuracy of our model can be further improved by adding convolution layer (CNN) on top of the FC network. CNN will be covered in a separate article.


One of the first deep learning datasets that most people learn is the MNist. It is a dataset for handwritten numbers from 0 to 9.

We will implement code with the TensorFlow to solve the problem with 98%+ accuracy.

Unlike the code with numpy, TensorFlow constructs a graph describing the network first. Here, we declare a placeholder for our input features (the pixel values of the image) and the labels. The real data will be provided later in the execution phase.

x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])  # True label.

We declare and as variables. Those are trainable variables that we want to learn from the training dataset. In the declaration, we also declare the method to initialize the variables. Nevertheless, the initialization is performed in the execution phase later.

W1 = tf.Variable(tf.truncated_normal([784, 256], stddev=np.sqrt(2.0 / 784)))
b1 = tf.Variable(tf.zeros([256]))
W2 = tf.Variable(tf.truncated_normal([256, 100], stddev=np.sqrt(2.0 / 256)))
b2 = tf.Variable(tf.zeros([100]))
W3 = tf.Variable(tf.truncated_normal([100, 10], stddev=np.sqrt(2.0 / 100)))
b3 = tf.Variable(tf.zeros([10]))

We define 2 hidden layers. Each has a matrix multiplication operation followed by ReLU. Then we define an output layer with the matrix multiplication only. Once again, this is just declaration and no operation is performed yet.

### Building a model
# Create a fully connected network with 2 hidden layers
# 2 hidden layers using relu (z = max(0, x)) as an activation function.
h1 = tf.nn.relu(tf.matmul(x, W1) + b1)
h2 = tf.nn.relu(tf.matmul(h1, W2) + b2)
y = tf.matmul(h2, W3) + b3

Now, we define our loss function including a cross entropy cost with L2 regularization penalties for our . We use an Adam optimizer (named as train_step) to perform the gradient descent to optimize . We also have a placeholder for so a user can supply the value later to control the regularization.

# Cost function & optimizer
# Use a cross entropy cost fuction with a L2 regularization.
lmbda = tf.placeholder(tf.float32)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=y) +
        lmbda * (tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) + tf.nn.l2_loss(W3)))
train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cross_entropy)

Once the declaration code is completed for the computation graph, we created a session to execute the graph and train the network for 10,000 iterations. For each iteration, we retrieve the next batch of training data and run the operation node “train_step” (i.e. the Adam optimizer we defined above). TensorFlow runs the operation as well as all operations that it depends on.

# Create an operation to initialize the variable
init = tf.global_variables_initializer()
# Now we create a session to execute the operations.
with tf.Session() as sess:
    # Train
    for _ in range(10000):
      batch_xs, batch_ys = mnist.train.next_batch(100), feed_dict={x: batch_xs, labels: batch_ys, lmbda:5e-5})

Once the training is complete, we create 2 more operations. The “correct_prediction” compare the predictions with the true labels. “Accuracy” computes the accuracy of our prediction.

with tf.Session() as sess:
    # Test trained model
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

We run the accuracy operation with our testing dataset and print out the results.

with tf.Session() as sess:
    print(, feed_dict={x: mnist.test.images,
                                          labels: mnist.test.labels}))

For completeness, here is the code listing. This file depends on “tensorflow.examples.tutorials.mnist” which is used to read the MNist data.

import argparse
import sys

from tensorflow.examples.tutorials.mnist import input_data

import tensorflow as tf
import numpy as np

FLAGS = None

def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

  ### Building a model
  # Create a fully connected network with 2 hidden layers
  # Initialize the weight with a normal distribution.
  x = tf.placeholder(tf.float32, [None, 784])
  labels = tf.placeholder(tf.float32, [None, 10])  # True label.
  W1 = tf.Variable(tf.truncated_normal([784, 256], stddev=np.sqrt(2.0 / 784)))
  b1 = tf.Variable(tf.zeros([256]))
  W2 = tf.Variable(tf.truncated_normal([256, 100], stddev=np.sqrt(2.0 / 256)))
  b2 = tf.Variable(tf.zeros([100]))
  W3 = tf.Variable(tf.truncated_normal([100, 10], stddev=np.sqrt(2.0 / 100)))
  b3 = tf.Variable(tf.zeros([10]))

  # 2 hidden layers using relu (z = max(0, x)) as an activation function.
  h1 = tf.nn.relu(tf.matmul(x, W1) + b1)
  h2 = tf.nn.relu(tf.matmul(h1, W2) + b2)
  y = tf.matmul(h2, W3) + b3

  # Cost function & optimizer
  # Use a cross entropy cost fuction with a L2 regularization.
  lmbda = tf.placeholder(tf.float32)
  cross_entropy = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=y) +
         lmbda * (tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) + tf.nn.l2_loss(W3)))
  train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cross_entropy)

  init = tf.global_variables_initializer()
  with tf.Session() as sess:
      # Train
      for _ in range(10000):
        batch_xs, batch_ys = mnist.train.next_batch(100), feed_dict={x: batch_xs, labels: batch_ys, lmbda:5e-5})

      # Test trained model
      correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(labels, 1))
      accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
      print(, feed_dict={x: mnist.test.images,
                                          labels: mnist.test.labels}))

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--data_dir', type=str, default='/tmp/tensorflow/mnist/input_data',
                      help='Directory for storing input data')
  FLAGS, unparsed = parser.parse_known_args(), argv=[sys.argv[0]] + unparsed)

# 0.9816

This code demonstrates the power to solve a complex visual problem with few lines of DL code. With 10,000 iterations, we achieved an accuracy above 98%.

Weight initialization

Weight initialization is one critical area in implementing a deep network. If you start the parameters incorrectly, you may not even beat the random odd of guessing. Sometimes people accidentally initialize with 0s. This is close to having every neurons dead and we will not be able to backpropagate the loss. We want some non-symmetry on . So nodes are learning different features. To introduce such non-symmetry, we initialize with a Gaussian distribution with . Initially, we do not want the output of the matrix multiplication falls into the saturation regions of the activation function. In these regions, cost is in-sensitive to changes in the input. Their partial derivatives are close to 0 and kill the gradient in the backpropagation. learns too slow in saturation regions.

Let’s generate 20,000 values of with and . Here is the distribution of .

Let’s create a node with 1000 input. We initialize half of the as 1 and the other half as 0. We generate 20,000 values of with initialized above.

The variance of the output jumps almost 500 times and the plot is almost flat.

If we feed this output of the matrix multiplication into the activation function. Huge portion of the values are in the saturation region for any activation functions we discussed as far. Instead, we should ask if we want the input of the activation to be Gaussian distributed with and , how should be distributed?

Hence, if we want the input to the activation function to have and , we need to scale by . With ReLU activation, we want to scale it with

w = np.random.randn(n) * sqrt(2.0/n)

Training parameters

In previous sections, we discuss many problems in training a network and how bad learning rate produces bad predictions. We now come back to the gradient descent and discuss different methods for updating the trainable parameters. This is not an easy topic. Fortunately, people can simply use most of the built-in optimization methods provided by the DL software platforms.

Rate of decay

To maintain a constant learning rate is not a good idea. In later phase, we want much smaller steps to avoid oscillation. After some initial training, we can start decaying the learning rate for every N iterations. For example, after 10,000 iterations, the learning rate will be decay by the formula below for every 20,000 iterations:

where decay factor is another hyperparameter say 0.95.

Momentum update

We mentioned that gradient descent is like dropping a ball in a bowl. But our gradient descent adjusts the parameters by the gradient of the current location of only. In the physical world, the movement of the ball depends on the location and also on the momentum of the ball. We could adjust by the gradient and its history rather than throwing all the history away. If we recall the stochastic gradient descent, it follows a zip zap pattern rather than a smooth curve. With this historical information, we can make stochastic gradient or mini-batch gradient to behave more smoothly.

Here we introduce a variable , which behaves like the momentum in the physical word. In each iteration, we update by keeping a portion of v minus the change caused by the gradient at that location. controls how much history information to keep, and this will be another hyperparameter. Researchers may describe as fraction. If you recall the oscillation problem before, this actually becomes a damper to stop the oscillations. Momentum based gradient descent often has a smoother path and settles to a minimal closer and faster.

v = mu * v - learning_rate * dw
w += v

Nesterov Momentum

In Momentum update, we use the current location of to compute . In Nesterov Momentum, the current location is replaced with a look-ahead location: the location where the “ball” should go without taking the current location into account. Nesterov Momentum converges better than Momentum update.

In the program below, instead of using location to compute , we use location to compute .

w_ahead = w + mu * v
v = mu * v - learning_rate * dw_ahead
w += v

In practice, people use the following formula:

v_prev = v
v = mu * v - learning_rate * dw 
w += -mu * v_prev + (1 + mu) * v 


If the input features are not scaled correctly, it is impossible to find the right learning rate that works for all the features. This indicates the learning rate needs to be self-adapted for each tunable parameter. One way to do it is to remember how much change has made to a specific . We should reduce the rate of change for parameters that already have many updates. This mitigates the oscillation problem because it acts like a damper again. In Adagrad, we reduce the learning rate with a ratio inversely proportional to the L2 norm of all previous gradients .

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + 1e-7) # add a tiny value to avoid division by 0.

Parameter based methods like Adagrad reduce the learning rate for parameters according to their accumulated changes.


Like Adagrad, RMSprop adjusts the learning rate according to the history of the gradients. But it uses a slightly different formular with the introduction of another hyperparameter decay_rate.

cache = decay_rate * cache + (1 - decay_rate) * dw**2
w += - learning_rate * dx / (np.sqrt(cache) + 1e-7)

RMSprop behaves like Adagrad but have mechanism to forget older changes.


Adam combines the concepts of momentum with RMSprop:

m = beta1*m + (1-beta1)*dw
v = beta2*v + (1-beta2)*(dw**2)
w += - learning_rate * m / (np.sqrt(v) + 1e-7)

Adam combines both momentum and parameter based gradient descent. Adam is the most often used method now.

Here is an example of using Adam Optimizer in TensorFlow

loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)) \
           + lmbda * tf.nn.l2_loss(weights1) + lmbda * tf.nn.l2_loss(weights2) \
           + lmbda * tf.nn.l2_loss(weights3) + lmbda * tf.nn.l2_loss(weights4)

optimizer = tf.train.AdamOptimizer(0.0005).minimize(loss)


with tf.Session(graph=graph) as session:
    for step in range(num_steps):
        _, l, predictions =
            [optimizer], feed_dict=feed_dict)

Visualization of Gradient descent methods

Here are some animations produced by Alec Radford in demonstrating how the gradient descent behaves for different algorithms. Regular gradient descent (red) learns the slowest. Momentum based algorithm has a tendency to overshoot the target initially. Adagrad learns faster initially than RMSprop. But RMSprop provides a mechanism to erase old history which make learning more adaptable later.

Feature Scaling (normalization)

As discussed before, we want the feature input to be scaled correctly (normalized). If the features do not have similar scale, it will be hard for the gradient descent to work: the training parameters oscillate like the red arrow.

For example, with 2 input features, we want the shape to be as close to a circle as possible.

To achieve that we normalize the features in the dataset to have zero mean and unit variance.

For images, we normalize every pixel independently. We compute a mean and a variance at each pixel location for the entire training dataset. Therefore, for an image with NxN pixels, we use NxN means and variances to normalize the image.

In practice, we do not read all the training data at once to compute the mean or variance. We compute a running mean during the training. Here is the formula for the running mean:

where is a small constant.

Use the running mean and variance from training dataset to normalize validation and testing data.


In machine learning, we can train a model faster if features are not correlated (whitened). In a dating application, someone may prefer a tall person but not too thin. If we rescale the weight and height independent of each other, the rescaled features are still co-related. From the rescaled weight, we can tell easily whether a person is heavier than the average population. But usually, we are more interested to know whether the person is thinner in the same height group which is very hard to tell from the rescaled weight alone.

We can express the co-relations between features and in a covariance matrix:

Which is the expected value.

Consider 2 data samples (10, 20) and (32, 52). The mean of is and

The expected value of the first element in the second row will be:

From the covariance matrix , we can find a matrix to convert to whose features are un-correlated with each other. For example, the contour plotting on the left below indicates weight and height are correlated (weight increase with height). On the right, the features are whitened with their correlation-ship recalibrate (taken out). For example, if the new feature is smaller than 0, we know the person is thinner in the same height group.

Whitening un-correlates features that make training more effective.

This sounds complicated, but can be done easily by Numpy linear algebra library.

X -= np.mean(X, axis = 0)    
cov =, X) / X.shape[0]

U,S,V = np.linalg.svd(cov)
Xdecorelate =, U)
Xwhite = Xdecorelate / np.sqrt(S + 1e-5)

Image data usually require 0 centered but does not require whitening.

Batch normalization

We have emphasized so many times the benefits of having features with mean = 0 and

But, why do we stop at the input layer only. Batch normalization renormalizes a layer output. For example, we renormalized the output of the affine layer (matrix multiplication) before feeding it into the ReLU.

def affine_batchnorm_relu_forward(x, w, b, gamma, beta, bn_param):
  h, h_cache = affine_forward(x, w, b)
  norm, norm_cache = batchnorm_forward(h, gamma, beta, bn_param)
  relu, relu_cache = relu_forward(norm)
  cache = (h_cache, norm_cache, relu_cache)
  return relu, cache

We apply the normalization formula below:

which the mean and variance is computed from the current mini-batch data.

We feed to a linear equation with the trainable scalar values and (1 pair for each normalized layer).

The normalization can be undone if and . We initialize and , so the input is normalized and therefore learns faster, and the parameters will be learned during the training.

def batchnorm_forward(x, gamma, beta, bn_param):
    sample_mean = np.mean(x, axis=0)
    sample_var = np.var(x, axis=0)

    sqrt_var = np.sqrt(sample_var + eps)
    xmu = (x - sample_mean)
    xhat = xmu / sqrt_var

    out = gamma * xhat + beta

Batch normalization solves a problem called internal covariate shift. As weights are updated, the distribution of outputs at each layer changes. Batch normalization normalized data at each layer again. So we can use a higher learning rate that speed up learning. Batch normalization also help regularize .

In the training, we use the mean and variance of the current training sample. But for testing, we do not use the mean/variance of the testing data. Instead, we record a running mean & variance during the training and apply it in validation or testing.

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

We use this running mean and variance to normalize our testing data.

xhat = (x - running_mean) / np.sqrt(running_var + eps)
out = gamma * xhat + beta

Batch normalization is common practice to use before or after the activation functions including CNN layers.

This is the code to implement batch normalization in TensorFlow:

w_bn = tf.Variable(w_initial)
z_bn = tf.matmul(x, w_bn)

bn_mean, bn_var = tf.nn.moments(z_bn, [0])
scale = tf.Variable(tf.ones([100]))
beta = tf.Variable(tf.zeros([100]))

bn_layer = tf.nn.batch_normalization(z_bn, bn_mean, bn_var, beta, scale, 1e-3)
l_bn = tf.nn.relu(bn_layer)

Hyperparameter tuning

Because we know very little in our model for real life problems, the hyperparameter tuning is usually a try and error process. Some parameters are dependent on each other and cannot tune separately. Sometimes those relationship is subtle. For example, the regularization rate changes the shape of the cost function. Therefore it impacts how we tune the learning rate. We can create a mesh of values to tune both parameters together. For example, with learning rates of 1e-1, 1e-2, … and 1e-8 and regularization rate of 1e-3, 1e-4, … and 1e-6, we have a potential of 8x4 combinations to test (1e-1, 1e-3), (1e-1, 1e-3), …, (1e-8, 1e-5) and (1e-8, 1e-6) ). We may not want to use an exactly rectangular shape of a mesh. For example, we may want to slight deviations at each mesh point with the hope that some irregularity may help us to explore more information.

Start tuning parameters from coarse grain with fewer iterations before fine tuning.


Many places can go wrong when training a deep network. Here are some simple tips:

  • Unit test the forward pass and back propagation code.
    • At the beginning, test with non-random data.
  • Compare the backpropagation result with the naive gradient check.
  • Always start with a simple network that works.
    • Increasing accuracy should not be the first priority.
    • Handling multiple challenges in a complex network is not the way to go. Issues grow exponentially in DL.
  • Create simple scenarios to verify the network:
    • Compute the loss for random guesses. It should match the loss at early training when the model is no better than a random guess.
    • Train with a small dataset with few iterations.
    • Verify if loss drops and/or accuracies increase during training.
    • Drop regularization - training accuracies should go up.
    • Overfit with a small dataset to see if the loss is small or 0. (Turn off the regularization cost temporarily.)
  • Do not waste time on a large dataset with long iterations during early development.
  • Verify how trainable parameters are initialized.
  • Always keep track of the shape of the data and document it in the code.
  • Display and verify some training samples and the predictions.

Monitor loss

We want to plot the cost vs iterations. Monitor the loss to see its trend:

  • If loss goes up early, the learning rate is way too high.
  • If loss drops fast and flattens very quickly, the learning rate is high.
  • If loss drops too slow, the learning rate is too slow.

Train vs validation accuracy

Plot out accuracy between validation and training to identify overfit issues.

  • If validation error is much lower than the training error, the model is overfit.

Monitor Gradient descent

Monitor the updates to ratio:

  • If the ratio is , consider lower the learning rate.
  • If the ratio is , consider increase the learning rate.

Plot weight, activation and gradient histograms for all layers.

  • Verify if you are having gradient diminishing or explode problem.
  • Identify layers that have very low gradients.
  • Verify whether you have too many saturated nodes.

Visualize filters and activation

For visualization problem, we can visualize the matrix in 2D for FC network, or the filters for CNN in the first couple layers. This helps us to identify what type of features that the model is extracting. The first layer should extract simple structures like edge and color.

We can locate pictures that have the highest activation for each feature map. Once again, it verifies what features are used in building the model.

In later layers, we should see more complex structures evolved:

(Source from Visualizing and Understanding Convolutional Networks, Matthew D Zeiler et al.)

Data augmentation

We have focused on the mechanics of the DL. One significant improvement for the network training is to have more data. This avoids overfitting and has better coverage of your feature spaces. However, getting labeled dataset can be expensive. One alternative is data augmentation. For example, for visual recognition, we can flip the image, slightly rotate or skew the images with software libraries. This helps us to avoid overfitting and produces generalized predictions invariant of the spatial location of the objects. Semi-learning may even expand the classification training dataset further by allowing some data without labels to be used as training data if the model can classify them with high certainty.

Very simple effort to augment your data can have a significant impact on the training.

Model ensembles

So far, we try to find the best models. In machine learning, we can take a vote from different decision trees to make the final prediction. This is based on the assumption that mistakes are localized. There is a smaller chance for 2 different models to make the same mistake. In DL, each training starts with random guesses and therefore the models optimized are not unique. We can pick the best models after repeating the training multiple times. We take votes from those models for the final predictions. This requires us to run the program multiple times, and can be prohibitively expensive. Alternatively, we run the training once and checkpoints multiple times. We pick the best models from the checkpoints. We can also use the validation phase to pick our best models to ensemble. Instead of running multiple models, we can also run a running average for our training parameters. We can have one vote per model, taking an average or use weights based on the confidence level for each prediction.

A lot of production system uses model ensembles to push the accuracy up for few percentage points.

Convolution Net (CNN) & Long short term memory (LSTM)

FC network is rarely used alone. Exploring all possible connections among nodes in the previous layer provides a complex model that is wasteful with small returns. A lot of information is localized. For an image, we want to extract features from neighboring pixels. CNN applies filters to explore localized features, and then apply FC to make predictions. LSTM applies time feedback loop to extract time sequence information. CNN & LSTM add complexity to the model to explore localized information. Topics here covers most of the fundermental of CNN and LSTM.

Two separated articles on CNN and LSTM will explain each methods in details.


For the CIFRA 10 example, we start with assignment 2 in the Stanford class “CS231n Convolutional Neural Networks for Visual Recognition”. We start with some skeleton codes provided by the assignment and put it into our code to complete the assignment.