Maxwell–Boltzmann distribution

The Maxwell–Boltzmann distribution models the distribution of speeds of gas molecules at a given temperature:

(Source Wikipedia)

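Here is the Maxwell–Boltzmann distribution equation:

$$
f(v) = \left( \frac{m}{2 \pi k T} \right)^{3/2} 4 \pi v^2 e^{-\frac{m v^2}{2 k T}}
$$

$f(v)$ gives the probability density function (PDF) of molecules with speed $v$, which depends on the temperature $T$. Here $m$ is the mass of a molecule and $k$ is the Boltzmann constant. At high speeds the exponential factor dominates, so the density decays exponentially with $v^2$. As the temperature increases, the probability of a molecule having a higher speed also increases. But since the number of molecules (the area under the curve) remains constant, the curve broadens and its peak is lower.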
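As a rough numerical illustration of the temperature effect, the sketch below evaluates the standard Maxwell–Boltzmann speed density at its peak for two temperatures. The function names, the choice of nitrogen and its approximate molecular mass are assumptions made for this example only.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def maxwell_boltzmann_pdf(v, mass, temperature):
    """Density of speed v (m/s) for molecules of `mass` (kg) at `temperature` (K)."""
    a = mass / (2.0 * K_B * temperature)
    # (m / (2 pi k T))^(3/2) * 4 pi v^2 * exp(-m v^2 / (2 k T))
    return 4.0 * math.pi * (a / math.pi) ** 1.5 * v * v * math.exp(-a * v * v)

def most_probable_speed(mass, temperature):
    """Speed at which the density peaks: sqrt(2 k T / m)."""
    return math.sqrt(2.0 * K_B * temperature / mass)

# Assumed example molecule: N2, roughly 4.65e-26 kg.
MASS_N2 = 4.65e-26

for T in (300.0, 600.0):
    v_peak = most_probable_speed(MASS_N2, T)
    # At the higher temperature the peak moves right and becomes lower.
    print(T, v_peak, maxwell_boltzmann_pdf(v_peak, MASS_N2, T))
```

Doubling the temperature shifts the peak to a higher speed and lowers its height, while the area under the curve stays at 1.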

Energy based model

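In Maxwell–Boltzmann statistics, the probability distribution is defined using an energy function $E(x)$:

$$
P(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x'} e^{-E(x')}
$$

$Z$ sums $e^{-E(x')}$ over all possible states $x'$ and is called the partition function. It normalizes the probabilities so that each lies between 0 and 1 and they sum to 1.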
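A minimal sketch of this normalization, assuming a small discrete set of states with made-up energies (the function name and the energy values are illustrative only):

```python
import math

def boltzmann_distribution(energies):
    """Return P(x) = exp(-E(x)) / Z for each state, given a list of energies."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)  # partition function: sum over all states
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0]  # made-up energies for three states
probs = boltzmann_distribution(energies)
print(probs)  # lower energy gives higher probability
```

Each unit increase in energy divides the probability by $e$, and dividing by $Z$ makes the values sum to 1.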

By defining an energy function $E(x)$ for an energy-based model like the Boltzmann Machine or the Restricted Boltzmann Machine, we can compute its probability distribution $P(x)$.

Boltzmann Machine

A Boltzmann Machine projects input data $x$ from a higher-dimensional space to a lower-dimensional space, forming a condensed representation of the data: the latent factors. It contains visible units for the data input $x$ and hidden units (blue nodes) which are the latent factors of the input $x$. All nodes are connected to each other with bi-directional weights $W_{ij}$.

(Source Wikipedia)

Each unit $i$ is in a binary state $s_i \in \{0, 1\}$. We use $W_{ij}$ to model the connection between units $i$ and $j$. If $s_i$ and $s_j$ tend to be the same, we want $W_{ij} > 0$. Otherwise, we want $W_{ij} < 0$. Intuitively, $W_{ij}$ indicates whether two units are positively or negatively related. If they are negatively related, the activation of one may turn off the other.

The energy between units $i$ and $j$ is defined as:

$$
E_{ij} = -W_{ij} s_i s_j
$$

Hence, the energy increases if $s_i = s_j = 1$ while $W_{ij}$ has the wrong sign ($W_{ij} < 0$). The likelihood of a configuration decreases as its energy increases.

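The energy function of the system sums the pairwise energy over all pairs of units, plus a bias term $b_i$ for each unit:

$$
E = -\sum_{i<j} W_{ij} s_i s_j - \sum_i b_i s_i
$$

The energy function plays a role analogous to the cost function in deep learning.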
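To make the sign convention concrete, here is a small sketch of the standard Boltzmann Machine energy, $E = -\sum_{i<j} W_{ij} s_i s_j - \sum_i b_i s_i$. The per-unit bias term, the function name and the weight values are assumptions for illustration.

```python
def boltzmann_energy(states, weights, biases):
    """E = -sum_{i<j} W[i][j] s_i s_j - sum_i b_i s_i for binary states."""
    n = len(states)
    pair_term = sum(weights[i][j] * states[i] * states[j]
                    for i in range(n) for j in range(i + 1, n))
    bias_term = sum(b * s for b, s in zip(biases, states))
    return -pair_term - bias_term

# Made-up symmetric weights: units 0 and 1 are positively related,
# units 0 and 2 are negatively related.
W = [[0.0, 2.0, -1.0],
     [2.0, 0.0, 0.5],
     [-1.0, 0.5, 0.0]]
b = [0.0, 0.0, 0.0]

e_agree = boltzmann_energy([1, 1, 0], W, b)     # both on, positive weight
e_conflict = boltzmann_energy([1, 0, 1], W, b)  # both on, negative weight
print(e_agree, e_conflict)
```

Turning on two units joined by a positive weight lowers the energy, while turning on two units joined by a negative weight raises it.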

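Using Boltzmann statistics, the probability of $x$ is

$$
P(x) = \sum_{x'} P(x, x') = \frac{1}{Z} \sum_{x'} e^{-E(x, x')}
$$

where $x'$ ranges over the states of all the neighboring units.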

Restricted Boltzmann Machine (RBM)

In a Boltzmann Machine, all units, visible or hidden, are fully connected with each other. In a Restricted Boltzmann Machine (RBM), units in the same layer are not connected. The units in one layer are connected only to the units in the other layer, and fully so.


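The energy function for an RBM with visible units $v_i$ (biases $a_i$), hidden units $h_j$ (biases $b_j$) and weights $w_{ij}$:

$$
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j
$$

In vector form:

$$
E(v, h) = -a^T v - b^T h - v^T W h
$$

Probability of a joint configuration of the visible and hidden units:

$$
P(v, h) = \frac{1}{Z} e^{-E(v, h)}
$$

where the partition function is:

$$
Z = \sum_{v, h} e^{-E(v, h)}
$$

Probability of a visible vector (summing over all hidden configurations):

$$
P(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}
$$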
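For a very small RBM these quantities can be computed exactly by enumeration. The sketch below assumes the usual RBM energy $E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$. The parameter values and function names are made up, and the approach only works when the number of units is tiny.

```python
import itertools
import math

def rbm_energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W[i][j] h_j."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction

def visible_probabilities(a, b, W):
    """Return {v: P(v)} by enumerating every (v, h) pair."""
    visible_states = list(itertools.product([0, 1], repeat=len(a)))
    hidden_states = list(itertools.product([0, 1], repeat=len(b)))
    # Unnormalized P(v): sum over all hidden configurations.
    unnormalized = {
        v: sum(math.exp(-rbm_energy(v, h, a, b, W)) for h in hidden_states)
        for v in visible_states
    }
    z = sum(unnormalized.values())  # partition function over all (v, h)
    return {v: p / z for v, p in unnormalized.items()}

# Made-up parameters: 3 visible units, 2 hidden units.
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
W = [[0.5, -0.3],
     [0.2, 0.8],
     [-0.6, 0.1]]

p_v = visible_probabilities(a, b, W)
print(sum(p_v.values()))  # total probability over all visible vectors
```

The number of terms in $Z$ grows as $2^{n_v + n_h}$, which is why exact computation is impractical for realistic models.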

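The probability that the model assigns to a training image can be raised by lowering the energy of that image and raising the energy of other images. The derivative of the log probability of a training vector with respect to a weight can be found to be:

$$
\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
$$

where $\langle v_i h_j \rangle_{data}$ is the expected value of $v_i h_j$ with $v$ drawn from the training samples. In $\langle v_i h_j \rangle_{model}$, however, $v$ is sampled from the model, i.e. $v \sim P_{model}(v)$. Hence, the gradient vanishes and the log probability is at its maximum when the expected value of $v_i h_j$ under the model matches that of the training samples.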

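During training, we can adjust the weight with a learning rate $\epsilon$ by:

$$
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)
$$

To calculate $\langle v_i h_j \rangle_{data}$, we sample an image $v$ from the training dataset and set the binary state $h_j$ of each hidden unit to 1 with probability:

$$
P(h_j = 1 \mid v) = \sigma\left( b_j + \sum_i v_i w_{ij} \right)
$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic sigmoid. Calculating $\langle v_i h_j \rangle_{model}$ is hard because the partition function $Z$ is unknown.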
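The data-driven half of the update, where each hidden unit is switched on with probability $\sigma(b_j + \sum_i v_i w_{ij})$, could be sketched as follows. The helper names and the parameter values are assumptions chosen so that the probabilities are easy to check by hand.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probabilities(v, b, W):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W[i][j]) for every hidden unit j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def sample_binary(probs, rng):
    """Set each unit to 1 with its given probability, otherwise 0."""
    return [1 if rng.random() < p else 0 for p in probs]

rng = random.Random(0)  # seeded only to make the sketch repeatable
v = [1, 0, 1]           # a made-up binary training vector
b = [0.0, -1.0]
W = [[2.0, 0.5],
     [0.3, 0.3],
     [-2.0, 0.5]]

p_h = hidden_probabilities(v, b, W)  # both total inputs are 0 here
h = sample_binary(p_h, rng)
print(p_h, h)
```

With these values the total input to each hidden unit is 0, so each unit is on with probability 0.5.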

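One possibility is to use Gibbs sampling (which will not be covered here). The other is to use an approximation, and $\Delta w_{ij}$ becomes:

$$
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right)
$$

First we pick $v$ from the training samples. Then the probabilities of the hidden units are computed and we sample a binary value for each unit from them. Once the binary states $h_j$ have been chosen for the hidden units, a reconstruction $v'$ is produced by setting each $v'_i$ to 1 with probability:

$$
P(v'_i = 1 \mid h) = \sigma\left( a_i + \sum_j h_j w_{ij} \right)
$$

To train the biases, the steps are similar except we use the individual state $v_i$ or $h_j$ instead of the product $v_i h_j$.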

Simple walk through

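  1. Start with a sample $v$ from the training dataset.
  2. Compute $p(h \mid v) = \sigma(b + W^T v)$ and sample $h$ from it.
  3. $\text{positive} = v h^T$ (outer product).
  4. Compute $p(v' \mid h) = \sigma(a + W h)$ and sample $v'$ from it.
  5. Compute $p(h' \mid v')$, sample $h'$ from it, and set $\text{negative} = v' h'^T$.
  6. $W \leftarrow W + \epsilon (\text{positive} - \text{negative})$.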
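The walk-through can be sketched as a single weight update in plain Python. This is one possible reading of the steps, not a reference implementation. The function names are hypothetical, binary samples are used for both the data and the reconstruction statistics, and the biases are left untouched to keep the example short.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def p_hidden(v, b, W):
    """P(h_j = 1 | v) for all hidden units, with W indexed as W[i][j]."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def p_visible(h, a, W):
    """P(v_i = 1 | h) for all visible units."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def cd1_step(v, a, b, W, lr, rng):
    """Update W in place from one training vector v; returns the reconstruction."""
    h = sample(p_hidden(v, b, W), rng)              # step 2
    v_recon = sample(p_visible(h, a, W), rng)       # step 4
    h_recon = sample(p_hidden(v_recon, b, W), rng)  # step 5
    for i in range(len(v)):
        for j in range(len(h)):
            positive = v[i] * h[j]                  # step 3 (outer product entry)
            negative = v_recon[i] * h_recon[j]      # step 5 (outer product entry)
            W[i][j] += lr * (positive - negative)   # step 6
    return v_recon

# Made-up setup: 3 visible units, 2 hidden units, one repeated training vector.
rng = random.Random(0)
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
W = [[0.0, 0.0] for _ in range(3)]
for _ in range(100):
    cd1_step([1, 0, 1], a, b, W, lr=0.1, rng=rng)
print(W)
```

A practical implementation would process mini-batches, update the biases as well, and typically rely on an array library rather than nested lists.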

Free energy

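The free energy $F(v)$ of a visible vector $v$ is the energy that a single configuration would need to have in order to have the same probability as all of the configurations that contain $v$:

$$
e^{-F(v)} = \sum_h e^{-E(v, h)}
$$

which is also the expected energy minus the entropy:

$$
F(v) = -\sum_i v_i a_i - \sum_j p_j x_j + \sum_j \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right)
$$

where $x_j = b_j + \sum_i v_i w_{ij}$ is the total input to hidden unit $j$ and $p_j = \sigma(x_j)$ is the probability that $h_j = 1$ given $v$.

The free energy of an RBM can be simplified as:

$$
F(v) = -\sum_i v_i a_i - \sum_j \log \left( 1 + e^{x_j} \right)
$$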
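As a sanity check, the three expressions for the free energy (the sum over hidden states, expected energy minus entropy, and the simplified form $-\sum_i v_i a_i - \sum_j \log(1 + e^{x_j})$) can be compared numerically on a tiny RBM. The function names and parameter values below are made up for this sketch.

```python
import itertools
import math

def hidden_inputs(v, b, W):
    """x_j = b_j + sum_i v_i W[i][j], the total input to hidden unit j."""
    return [b[j] + sum(v[i] * W[i][j] for i in range(len(v)))
            for j in range(len(b))]

def free_energy_enumerated(v, a, b, W):
    """F(v) = -log sum_h exp(-E(v, h)), by brute force over all h."""
    x = hidden_inputs(v, b, W)
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    total = 0.0
    for h in itertools.product([0, 1], repeat=len(b)):
        # E(v, h) = -a.v - sum_j h_j x_j, because x_j already contains b_j.
        energy = -visible_term - sum(hj * xj for hj, xj in zip(h, x))
        total += math.exp(-energy)
    return -math.log(total)

def free_energy_entropy_form(v, a, b, W):
    """Expected energy minus entropy of the hidden units given v."""
    x = hidden_inputs(v, b, W)
    p = [1.0 / (1.0 + math.exp(-xj)) for xj in x]
    expected_energy = (-sum(ai * vi for ai, vi in zip(a, v))
                       - sum(pj * xj for pj, xj in zip(p, x)))
    entropy = -sum(pj * math.log(pj) + (1.0 - pj) * math.log(1.0 - pj)
                   for pj in p)
    return expected_energy - entropy

def free_energy_closed_form(v, a, b, W):
    """F(v) = -sum_i a_i v_i - sum_j log(1 + exp(x_j))."""
    x = hidden_inputs(v, b, W)
    return (-sum(ai * vi for ai, vi in zip(a, v))
            - sum(math.log1p(math.exp(xj)) for xj in x))

# Made-up parameters: 3 visible units, 2 hidden units.
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
v = [1, 0, 1]

f_enum = free_energy_enumerated(v, a, b, W)
f_entropy = free_energy_entropy_form(v, a, b, W)
f_closed = free_energy_closed_form(v, a, b, W)
print(f_enum, f_entropy, f_closed)
```

The simplified form is the one to use in practice, since it avoids the sum over $2^{n_h}$ hidden configurations.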

Energy based model (Gradient)



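With the hidden units summed out, the probability of $x$ can be written in terms of the free energy $F(x)$:

$$
P(x) = \frac{e^{-F(x)}}{Z}, \qquad Z = \sum_{\tilde{x}} e^{-F(\tilde{x})}
$$

Take the negative log:

$$
-\log P(x) = F(x) + \log Z
$$

Its gradient with respect to a model parameter $\theta$ is:

$$
-\frac{\partial \log P(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}
$$

where $p$ is the probability distribution formed by the model.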
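The model-expectation term in the gradient comes from differentiating $\log Z$. A short worked derivation, assuming $Z = \sum_{\tilde{x}} e^{-F(\tilde{x})}$ and $p(\tilde{x}) = e^{-F(\tilde{x})} / Z$ for a parameter $\theta$:

$$
\begin{aligned}
\frac{\partial \log Z}{\partial \theta}
  &= \frac{1}{Z} \sum_{\tilde{x}} \frac{\partial}{\partial \theta} e^{-F(\tilde{x})} \\
  &= -\sum_{\tilde{x}} \frac{e^{-F(\tilde{x})}}{Z} \frac{\partial F(\tilde{x})}{\partial \theta} \\
  &= -\sum_{\tilde{x}} p(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}
\end{aligned}
$$

Adding $\partial F(x) / \partial \theta$ to this gives the gradient of $-\log P(x) = F(x) + \log Z$.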

Contrastive Divergence (CD-k)

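For an RBM, the free energy and its derivative with respect to a weight are:

$$
F(v) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{x_j}\right), \qquad \frac{\partial F(v)}{\partial w_{ij}} = -v_i \, \sigma(x_j) = -v_i \, P(h_j = 1 \mid v)
$$

where $x_j = b_j + \sum_i v_i w_{ij}$. (Without proof) We combine this with the gradient from the last section, approximating the expectation under the model by a single sample $v^{(k)}$:

$$
-\frac{\partial \log p(v)}{\partial w_{ij}} \approx -v^{(0)}_i \, P(h_j = 1 \mid v^{(0)}) + v^{(k)}_i \, P(h_j = 1 \mid v^{(k)})
$$

Here $h^{(t)}$ is the prediction (a sample) from $v^{(t)}$, and $v^{(t+1)}$ is the prediction from $h^{(t)}$.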

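So we sample an image from the training data as $v^{(0)}$ and compute $v^{(k)}$ by alternating between sampling the hidden units and the visible units $k$ times. In practice, $k = 1$ already gives reasonable results.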
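A sketch of the $k$-step chain and the resulting gradient estimate follows, using hidden probabilities rather than sampled hidden states in the final statistics. The function names (`gibbs_chain`, `cdk_weight_gradient`) are hypothetical, the parameters are made up, and this is one common way to write CD-k rather than the only one.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def p_hidden(v, b, W):
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def p_visible(h, a, W):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def gibbs_chain(v0, a, b, W, k, rng):
    """Run k alternating sampling steps v -> h -> v, starting from v0."""
    v = list(v0)
    for _ in range(k):
        h = sample(p_hidden(v, b, W), rng)
        v = sample(p_visible(h, a, W), rng)
    return v

def cdk_weight_gradient(v0, a, b, W, k, rng):
    """Estimate of -d log p(v0) / d W[i][j], with v^(k) standing in for the model."""
    vk = gibbs_chain(v0, a, b, W, k, rng)
    ph0 = p_hidden(v0, b, W)
    phk = p_hidden(vk, b, W)
    return [[-v0[i] * ph0[j] + vk[i] * phk[j] for j in range(len(b))]
            for i in range(len(v0))]

# Made-up parameters: 3 visible units, 2 hidden units.
rng = random.Random(0)
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]

grad = cdk_weight_gradient([1, 0, 1], a, b, W, k=1, rng=rng)
print(grad)  # subtract lr * grad from W to take a gradient step
```

Because this estimates the gradient of the negative log probability, a training step subtracts it from the weights.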


For those interested in the technical details of Restricted Boltzmann Machines, please read "A Practical Guide to Training Restricted Boltzmann Machines" by Hinton.