Bias and variance

Prediction error has two main sources. Variance is the sensitivity of the model to small changes in the training set. It is often the result of overfitting: a powerful model trained on too little data. Bias, on the other hand, arises when the model is not powerful enough to make accurate predictions (underfitting). A model with high variance but low bias usually makes inconsistent predictions when trained on different batches of input, but its average prediction is close to the true value. The orange dots below are predictions made by a deep network. A high-variance model generates predictions that are widely spread around the true value. A high-bias model makes consistent predictions, but they are off from the true value.

We define bias and variance as:
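For reference, the standard textbook definitions are sketched here. The notation is an assumption, since the text does not fix one: $\hat{\theta}$ is the estimate obtained from a randomly drawn training set, $\theta$ is the true value, and the expectation is taken over training sets.

```latex
\mathrm{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
\qquad
\mathrm{Var}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]
```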

Here we prove that the mean squared error is composed of a squared bias term and a variance term.
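One way to write out the argument is sketched below, assuming the usual notation: $\hat{\theta}$ is the estimate, $\theta$ is the true value, and $\mathbb{E}$ averages over training sets. The trick is to add and subtract $\mathbb{E}[\hat{\theta}]$ inside the square.

```latex
\begin{aligned}
\mathrm{MSE}
&= \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]
 + 2\,(\mathbb{E}[\hat{\theta}] - \theta)\,\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big]
 + (\mathbb{E}[\hat{\theta}] - \theta)^2 \\
&= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2
\end{aligned}
```

The cross term vanishes because $\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] = 0$.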


A simple model has high bias. Overfitting a high-complexity model causes high variance.
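The trade-off can be checked with a small simulation. The sketch below is illustrative only: the target function, the noise level, and the two stand-in models (a constant predictor as the "simple" model and a nearest-neighbor predictor as the "high-complexity" model) are assumptions chosen to keep the example short.

```python
import math
import random

def true_f(x):
    return math.sin(2 * math.pi * x)

XS = [i / 10 for i in range(11)]   # fixed training inputs (assumed)
X0 = 0.2                           # input at which predictions are compared
SIGMA = 0.5                        # noise standard deviation (assumed)
TRIALS = 500                       # number of resampled training sets

def sample_training_labels(rng):
    """Draw one noisy training set."""
    return [true_f(x) + rng.gauss(0.0, SIGMA) for x in XS]

def predict_constant(ys):
    """Simple model: always predict the mean label."""
    return sum(ys) / len(ys)

def predict_nearest(ys):
    """Flexible model: copy the label of the closest training input."""
    idx = min(range(len(XS)), key=lambda i: abs(XS[i] - X0))
    return ys[idx]

def bias_variance(predict, rng):
    """Estimate squared bias, variance, and MSE at X0 over many training sets."""
    preds = [predict(sample_training_labels(rng)) for _ in range(TRIALS)]
    mean_pred = sum(preds) / TRIALS
    bias_sq = (mean_pred - true_f(X0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / TRIALS
    mse = sum((p - true_f(X0)) ** 2 for p in preds) / TRIALS
    return bias_sq, variance, mse

rng = random.Random(0)
simple = bias_variance(predict_constant, rng)
flexible = bias_variance(predict_nearest, rng)
print("simple   (bias^2, variance, mse):", simple)
print("flexible (bias^2, variance, mse):", flexible)
```

In this setup the constant model shows the larger squared bias and the nearest-neighbor model the larger variance, and for each model the measured MSE equals squared bias plus variance up to rounding.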

L1 regularization

L1 regularization tends to push weights to exactly 0. Therefore, L1 regularization increases the sparsity of the weights.
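One way to see why the zeros are exact is to look at a single proximal update. This is a sketch and not taken from the text: the soft-thresholding step is the standard proximal operator of the L1 penalty, and the L2 step is shown only for contrast. The function names and the strength `lam` are hypothetical.

```python
def l1_step(w, lam):
    """Soft-thresholding: shrink |w| by lam and clip at exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_step(w, lam):
    """Proximal step for (lam / 2) * w**2: scales w toward 0 without reaching it."""
    return w / (1.0 + lam)

weights = [3.0, -0.4, 0.05, -2.0, 0.3]
lam = 0.5  # regularization strength (assumed)

l1_weights = [l1_step(w, lam) for w in weights]
l2_weights = [l2_step(w, lam) for w in weights]
print("L1:", l1_weights)
print("L2:", l2_weights)
```

Every weight whose magnitude is below `lam` becomes exactly 0 under the L1 step, while the L2 step only rescales the weights.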

L2 regularization

The origin is where the regularization cost is 0, and the center of the ellipses is where the MSE is at its minimum. The optimal solution is where a concentric circle meets an ellipse. This is the same as minimizing the mean squared error subject to an L2-norm constraint.

MSE with L2-regularization is also called ridge regression.
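For a single feature without an intercept, ridge regression has a simple closed form that is easy to check by hand. The sketch below assumes the objective is the sum of squared errors plus `lam * w**2`. The data, the function name, and the symbol `lam` are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Minimize sum((y - w * x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x (hypothetical data)

ridge_path = [ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(ridge_path)
```

With `lam = 0` this is ordinary least squares. Larger values of `lam` pull the weight toward the origin, where the regularization cost is 0.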

L1 vs L2 norm

Even though the squared L2-norm is mathematically simpler and easier to differentiate, its value increases very slowly near the origin. For some machine learning applications, the sparsity of the weights is important. In this situation, we may consider the L1-norm instead.
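A quick numerical comparison illustrates the behavior near the origin. This is a minimal sketch for a single scalar weight, and the helper names are hypothetical.

```python
def l1_penalty(w):
    return abs(w)

def l2_penalty(w):
    return w * w

def l1_grad(w):
    """Sign of w: the pull toward 0 keeps the same strength however small w is."""
    return (w > 0) - (w < 0)

def l2_grad(w):
    """Gradient of w**2: the pull toward 0 fades as w approaches 0."""
    return 2 * w

for w in (1.0, 0.1, 0.01):
    print(w, l1_penalty(w), l2_penalty(w), l1_grad(w), l2_grad(w))
```

For small weights the squared L2 penalty and its gradient both become tiny, so there is little pressure to move a small weight all the way to 0. The L1 gradient stays at magnitude 1.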

Histogram of gradients


Crop and scale the image to a fixed size patch.

Calculate gradient

Calculate the gradient at each pixel from the difference between its horizontal neighbors and the difference between its vertical neighbors.

The gradient angle ranges from 0 to 360 degrees, but we treat gradients pointing in opposite directions as the same. Therefore, our gradient angle ranges from 0 to 180 degrees. Experiments indicate that this performs better in pedestrian detection.
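The gradient step can be sketched in plain Python as follows. The centered difference, the border handling (reusing the edge pixel), and the name `pixel_gradient` are assumptions made for illustration. The image is a small list of rows of gray values.

```python
import math

def pixel_gradient(img, r, c):
    """Return (magnitude, unsigned angle in degrees) at pixel (r, c)."""
    h, w = len(img), len(img[0])
    # Horizontal and vertical differences between neighboring pixels.
    gx = img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]
    gy = img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]
    magnitude = math.hypot(gx, gy)
    # Fold the 0-360 range into 0-180 so opposite directions coincide.
    angle = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, angle

ramp_right = [[0, 10, 20], [0, 10, 20], [0, 10, 20]]   # brighter to the right
ramp_left = [[20, 10, 0], [20, 10, 0], [20, 10, 0]]    # brighter to the left
ramp_down = [[0, 0, 0], [10, 10, 10], [20, 20, 20]]    # brighter downward

print(pixel_gradient(ramp_right, 1, 1))
print(pixel_gradient(ramp_left, 1, 1))
print(pixel_gradient(ramp_down, 1, 1))
```

The first two images have gradients pointing in opposite directions, yet both map to the same unsigned angle.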

We compute a histogram for each 8x8 image patch. The histogram has 9 bins covering the angles from 0 to 180 degrees. Each pixel adds its value (the gradient magnitude) to the bin matching its gradient angle. For the first pixel, whose angle falls exactly on a bin, we add its value of 10 to that bin. For the second pixel, the angle falls between two bins, so we split its value in proportion to the distance from each bin. In this case, the angle lies halfway between them, so half of the value goes to one bin and half goes to the other. We go through every pixel and add its value to the corresponding bins. Each 8x8 image patch therefore yields an input feature with 9 values.
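The voting scheme can be sketched as follows. The bin positions (0, 20, ..., 160 degrees) are an assumption based on 9 bins over 0 to 180 degrees, and the example pixels are made up. A vote past the last bin wraps around to the first, since 180 degrees is the same direction as 0.

```python
NUM_BINS = 9
BIN_WIDTH = 180.0 / NUM_BINS   # 20 degrees per bin (assumed layout)

def add_vote(hist, magnitude, angle):
    """Split one pixel's magnitude between the two nearest bins."""
    pos = (angle % 180.0) / BIN_WIDTH
    lo = int(pos) % NUM_BINS
    hi = (lo + 1) % NUM_BINS          # wraps from the 160 bin back to the 0 bin
    frac = pos - int(pos)             # 0.0 means the angle sits exactly on bin lo
    hist[lo] += magnitude * (1.0 - frac)
    hist[hi] += magnitude * frac

def cell_histogram(pixels):
    """pixels: iterable of (magnitude, angle) pairs for one 8x8 patch."""
    hist = [0.0] * NUM_BINS
    for magnitude, angle in pixels:
        add_vote(hist, magnitude, angle)
    return hist

# Hypothetical pixels: one exactly on a bin, one halfway between two bins,
# and one that wraps around the end of the range.
hist = cell_histogram([(10.0, 40.0), (8.0, 30.0), (6.0, 170.0)])
print(hist)
```

A real patch would pass all 64 (magnitude, angle) pairs to `cell_histogram`.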


Images with different lighting will result in different gradients. We apply normalization so the histogram is not sensitive to lighting.

For each 8x8 patch, we have 9 histogram values. We can normalize each value with the equation below.

It can easily be shown that even if we double every histogram value, the normalized value remains the same. Hence, we reduce its sensitivity to lighting. We make one more improvement: instead of normalizing every patch separately, we normalize 4 patches at a time with a sliding window. The red rectangle below is composed of 4 patches (16x16 pixels), corresponding to 4x9 = 36 histogram values, which we normalize together to generate 36 normalized features.

Next, we slide the window by 8 pixels to compute another 4x9 histogram.
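The block normalization can be sketched as below, assuming the normalization divides by the L2 norm of the block. That is the common choice for HOG, but it is an assumption here. `cells[r][c]` holds the 9-bin histogram of one 8x8 patch, the example values are made up, and moving the window by one cell corresponds to sliding it by 8 pixels.

```python
import math

def l2_normalize(values):
    """Divide by the L2 norm; leave an all-zero block unchanged."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)

def block_features(cells):
    """Normalize every 2x2 block of cells into 4 x 9 = 36 features."""
    features = []
    for r in range(len(cells) - 1):
        for c in range(len(cells[0]) - 1):
            block = cells[r][c] + cells[r][c + 1] + cells[r + 1][c] + cells[r + 1][c + 1]
            features.append(l2_normalize(block))
    return features

# Hypothetical 2 x 3 grid of cell histograms.
cells = [[[float((r + c + b) % 5 + 1) for b in range(9)] for c in range(3)] for r in range(2)]
features = block_features(cells)

# Doubling every histogram value (for example, a brighter image) changes nothing.
brighter = [[[2.0 * v for v in hist] for hist in row] for row in cells]
brighter_features = block_features(brighter)
print(len(features), len(features[0]))
```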

Markov Chain Monte Carlo

Consider the following transition matrix for the weather:

If today is sunny, there is a 0.9 chance that tomorrow is also sunny and a 0.1 chance that it is going to rain. If today is sunny, the chance that it is also sunny 2 days later is:
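The two-day probability is the sunny-to-sunny entry of the transition matrix multiplied by itself. In the sketch below, the sunny row (0.9, 0.1) comes from the text, while the rainy row (0.5, 0.5) is a made-up value used only so the example can run.

```python
# Rows: today's weather. Columns: tomorrow's weather. Order: [sunny, rainy].
P = [
    [0.9, 0.1],   # sunny row, from the text
    [0.5, 0.5],   # rainy row, hypothetical
]

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

two_step = mat_mul(P, P)
sunny_after_two_days = two_step[0][0]   # sunny today -> sunny two days later
print(sunny_after_two_days)
```

Under the assumed rainy row this is 0.9 × 0.9 + 0.1 × 0.5 = 0.86: either both days stay sunny, or it rains tomorrow and clears up the day after.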

So what is the chance of a sunny day and of a rainy day in the long run? The answer is

We generate random walks whose equilibrium probability distribution is our target distribution, regardless of the initial state. For example, we simulate whether the 20th day is sunny or rainy. We repeat this 50 times, and the average will be close to the result shown above.
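The random-walk experiment can be sketched as follows. The rainy row of the transition table is again a hypothetical value, and the sketch uses 5000 repetitions instead of 50 so that the estimate is less noisy. The function names are made up for illustration.

```python
import random

# Transition probabilities; the rainy row is an assumption for illustration.
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def walk(start, days, rng):
    """Simulate the weather for a number of days and return the final state."""
    state = start
    for _ in range(days):
        state = "sunny" if rng.random() < P[state]["sunny"] else "rainy"
    return state

def sunny_fraction(start, days, runs, rng):
    """Fraction of independent walks that end on a sunny day."""
    return sum(walk(start, days, rng) == "sunny" for _ in range(runs)) / runs

rng = random.Random(0)
estimate_from_sunny = sunny_fraction("sunny", 20, 5000, rng)
estimate_from_rainy = sunny_fraction("rainy", 20, 5000, rng)
print(estimate_from_sunny, estimate_from_rainy)
```

With the assumed rainy row, the equilibrium distribution is 5/6 sunny and 1/6 rainy, and both estimates land near 5/6 whether the walk starts on a sunny or a rainy day.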


Term frequency–Inverse document frequency (tf-idf)

tf-idf measures the frequency of a term in a document, corrected by how common the term is across all documents.

Term frequency counts how often a term occurs in a document. Inverse document frequency is typically the logarithm of n divided by the number of documents containing the term, where n is the total number of documents.
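A minimal sketch of one common variant follows: raw counts for the term frequency and an unsmoothed natural logarithm for the inverse document frequency. Libraries often use smoothed or normalized versions, so the exact numbers are specific to this variant. The function name and the toy documents are made up.

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    # Number of documents containing each term.
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        scores.append({t: c * math.log(n / doc_freq[t]) for t, c in counts.items()})
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "ran"],
]
scores = tf_idf(docs)
print(scores[2])
```

A term that appears in every document, such as "the", scores 0, while a term that is frequent in one document and absent elsewhere, such as "ran", scores highest.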

Skip-gram model and Continuous Bag-of-Words model (CBOW)

The skip-gram model tries to predict each context word from its target word. For example, in the sentence:

“The quick brown fox jumped over the lazy dog.”

The target word “fox” has 2 context words in a bigram model (2-gram). The training data (input, label) will look like: (quick, the), (quick, brown), (brown, quick), (brown, fox).
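The training pairs can be generated with a few lines of Python. This sketch assumes a context window of one word on each side and lowercased, whitespace-split tokens. The function name is hypothetical.

```python
def skip_gram_pairs(tokens, window=1):
    """Return (input, label) pairs: each target word paired with each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumped over the lazy dog".split()
pairs = skip_gram_pairs(tokens)
print(pairs[:7])
```

The pairs for "quick" and "brown" match the ones listed above, and "fox" contributes (fox, brown) and (fox, jumped).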

The continuous bag-of-words model is the opposite of the skip-gram model. It predicts the target word from its context words.
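For comparison, CBOW training examples group all context words into a single input. The sketch assumes a window of one word on each side and a lowercased example sentence. The function name is hypothetical.

```python
def cbow_examples(tokens, window=1):
    """Return (context words, target word) examples."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumped over the lazy dog".split()
examples = cbow_examples(tokens)
print(examples[3])
```

Here the input for the target "fox" is the pair of context words "brown" and "jumped", which a CBOW model would combine (for example by averaging their embeddings) before predicting the target.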

Unsupervised learning

Unsupervised learning tries to understand the relationships and latent structure of the input data. In contrast to supervised learning, an unsupervised training set contains input data but no labels.

Examples of unsupervised learning:


  • Clustering
    • K-means
    • Density based clustering - DBSCAN
  • Gaussian mixture models
    • Expectation–maximization algorithm (EM)
  • Anomaly detection
    • Gaussian model
    • Clustering
  • Deep Networks
    • Generative Adversarial Networks
    • Restricted Boltzmann machines
  • Latent factor model/Matrix factorization
    • Principal component analysis
    • Singular value decomposition
    • Non-negative matrix factorization
  • Manifold/Visualization
    • MDS, IsoMap, t-SNE
  • Self-organized map
  • Association rule
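As a concrete instance of learning structure without labels, here is a minimal K-means sketch in plain Python. The synthetic two-blob data, the farthest-point initialization, and the fixed iteration count are all assumptions chosen for brevity. Practical implementations usually add random restarts and a convergence check.

```python
import random

def dist_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    return min(range(len(centers)), key=lambda i: dist_sq(point, centers[i]))

def k_means(points, k, iterations=10):
    """Alternate between assigning points to centers and moving the centers."""
    # Farthest-point initialization (an illustrative choice, not the only one).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist_sq(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

# Two synthetic blobs; the cluster labels are never shown to the algorithm.
rng = random.Random(0)
blob_a = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)) for _ in range(50)]
centers, labels = k_means(blob_a + blob_b, k=2)
print(centers)
```

The recovered centers sit near (0, 0) and (5, 5), and the points of each blob share a label.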


Deep learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville