### Maxwell–Boltzmann distribution

The Maxwell–Boltzmann distribution models the distribution of speeds of gas molecules at a given temperature:

(Source Wikipedia)

Here is the Maxwell–Boltzmann distribution equation:

$$
f(v) = \left( \frac{m}{2 \pi k T} \right)^{3/2} 4 \pi v^2 e^{- \frac{m v^2}{2 k T}}
$$

$f(v)$ is the probability density function (PDF) of the molecular speed $v$ at temperature $T$, where $m$ is the mass of a molecule and $k$ is the Boltzmann constant. For large $v$, the density decays exponentially with $v^2$. As the temperature increases, the probability of finding a molecule at a higher speed also increases. But since the total probability (the number of molecules) remains constant, the curve spreads out and its peak is lower.
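To see the effect of temperature numerically, here is a minimal sketch that evaluates the standard Maxwell–Boltzmann speed density at its most probable speed for two temperatures. The function names and the choice of a nitrogen molecule are illustrative assumptions, not part of the original text.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def maxwell_boltzmann_pdf(v, mass, temperature):
    """Density of speed v (m/s) for particles of `mass` (kg) at `temperature` (K)."""
    a = mass / (2.0 * K_B * temperature)
    # (m / (2 pi k T))^(3/2) * 4 pi v^2 * exp(-m v^2 / (2 k T))
    return 4.0 * math.pi * (a / math.pi) ** 1.5 * v * v * math.exp(-a * v * v)

def most_probable_speed(mass, temperature):
    """Speed at which the density peaks: sqrt(2 k T / m)."""
    return math.sqrt(2.0 * K_B * temperature / mass)

# Illustrative assumption: a nitrogen molecule of about 28 atomic mass units.
M_N2 = 28.0 * 1.66053906660e-27

for T in (300.0, 600.0):
    v_peak = most_probable_speed(M_N2, T)
    # The peak moves to a higher speed and its height drops as T grows.
    print(T, v_peak, maxwell_boltzmann_pdf(v_peak, M_N2, T))
```

The peak shifts to the right and gets lower at the higher temperature, while the area under each curve stays at 1.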

### Energy-based model

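In Maxwell–Boltzmann statistics, the probability distribution is defined using an energy function $E(x)$:

$$
P(x) = \frac{e^{-E(x)}}{Z}
$$

where

$$
Z = \sum_{x^{'}} e^{-E(x^{'})}
$$

$Z$ is the sum over all possible states and is called the partition function. It normalizes the probabilities so that each lies between 0 and 1 and together they sum to 1.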

By defining an energy function $E(x)$ for an energy-based model like the Boltzmann Machine or the Restricted Boltzmann Machine, we can compute its probability distribution $P(x)$.
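As a small illustration of turning energies into probabilities, the sketch below applies $P(x) = e^{-E(x)} / Z$ to a hand-picked list of energies. The function name and the three energy values are hypothetical.

```python
import math

def boltzmann_probabilities(energies):
    """Map state energies E(x) to probabilities P(x) = exp(-E(x)) / Z."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)  # partition function: sum over all states
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0]  # hypothetical energies of three states
probs = boltzmann_probabilities(energies)
print(probs)  # lower energy gets higher probability
```

Each unit increase in energy divides the probability by $e$, and dividing by $Z$ makes the values sum to 1.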

### Boltzmann Machine

A Boltzmann Machine projects input data $x$ from a higher-dimensional space to a lower-dimensional space, forming a condensed representation of the data: the latent factors. It contains visible units ($A, B, C, D, \dots$) for the data input $x$ and hidden units (blue nodes), which are the latent factors of the input $x$. All nodes are connected to each other with bi-directional weights $W_{ij}$.

(Source Wikipedia)

Each unit $i$ is in a binary state $s_i \in \{0, 1 \}$. We use $W_{ij}$ to model the connection between units $i$ and $j$. If $s_i$ and $s_j$ tend to be the same, we want $W_{ij} > 0$. Otherwise, we want $W_{ij} < 0$. Intuitively, $W_{ij}$ indicates whether two units are positively or negatively related. If they are negatively related, the activation of one may turn off the other.

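The energy between units $i$ and $j$ is defined as:

$$
E_{ij} = - W_{ij} s_i s_j
$$

Hence, the energy is increased if $s_i = s_j = 1$ and $W_{ij}$ is wrong ($W_{ij} < 0$). The probability of a configuration decreases as its energy increases.

The energy function of the system is the sum over all pairs of units, plus a bias term $b_i$ for each unit:

$$
E = - \sum_{i < j} W_{ij} s_i s_j - \sum_i b_i s_i
$$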

The energy function plays a role analogous to the cost function in deep learning.

Using Boltzmann statistics, the PDF for $x$ is

$$
P(x) = \frac{e^{-E(x)}}{\sum_{x^{'}} e^{-E(x^{'})}}
$$

where $x^{'}$ ranges over all possible states of the units.
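For a machine with only a few units, the distribution can be computed exactly by enumerating every state. The sketch below assumes the standard Boltzmann Machine energy $E(s) = -\sum_{i<j} W_{ij} s_i s_j - \sum_i b_i s_i$; the weights, biases, and function names are made up for illustration.

```python
import itertools
import math

def bm_energy(state, weights, biases):
    """Energy of one binary state: pairwise terms plus per-unit bias terms."""
    n = len(state)
    pair_term = sum(weights[i][j] * state[i] * state[j]
                    for i in range(n) for j in range(i + 1, n))
    bias_term = sum(biases[i] * state[i] for i in range(n))
    return -pair_term - bias_term

def bm_distribution(weights, biases):
    """Return {state: P(state)} by summing over all 2^n states (tiny n only)."""
    states = list(itertools.product([0, 1], repeat=len(biases)))
    unnormalized = [math.exp(-bm_energy(s, weights, biases)) for s in states]
    z = sum(unnormalized)  # partition function
    return {s: u / z for s, u in zip(states, unnormalized)}

# Hypothetical 3-unit machine: units 0 and 1 attract, units 0 and 2 repel.
W = [[0.0, 2.0, -2.0],
     [2.0, 0.0, 0.0],
     [-2.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]
dist = bm_distribution(W, b)
print(dist[(1, 1, 0)], dist[(1, 0, 1)])
```

The state where the positively connected units are both on has lower energy, so it comes out more probable than the state where the negatively connected units are both on. The enumeration grows as $2^n$, which is why exact computation is only practical for toy examples.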

### Restricted Boltzmann Machine (RBM)

In a Boltzmann Machine, all units, visible or hidden, are fully connected with each other. In a Restricted Boltzmann Machine (RBM), units in the same layer are not connected. The units in one layer are fully connected only with the units in the other layer.

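The energy function for an RBM, where $a_i$ and $b_j$ are the biases of the visible and hidden units:

$$
E(v, h) = - \sum_i a_i v_i - \sum_j b_j h_j - \sum_{i, j} v_i h_j w_{ij}
$$

In vector form:

$$
E(v, h) = - a^T v - b^T h - v^T W h
$$

The joint probability of a visible vector $v$ and a hidden vector $h$:

$$
p(v, h) = \frac{1}{Z} e^{-E(v, h)}
$$

where the partition function $Z$ is:

$$
Z = \sum_{v, h} e^{-E(v, h)}
$$

The probability of a visible vector (summing over all hidden vectors):

$$
p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}
$$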
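A tiny RBM makes these definitions concrete. The sketch below assumes the standard RBM energy $E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}$ and computes $p(v)$ by brute force. The parameter values and function names are hypothetical.

```python
import itertools
import math

def rbm_energy(v, h, a, b, w):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij."""
    visible = sum(a[i] * v[i] for i in range(len(v)))
    hidden = sum(b[j] * h[j] for j in range(len(h)))
    interaction = sum(v[i] * w[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - interaction

def rbm_marginal(a, b, w):
    """Return {v: p(v)} by enumerating all (v, h) pairs (tiny models only)."""
    vs = list(itertools.product([0, 1], repeat=len(a)))
    hs = list(itertools.product([0, 1], repeat=len(b)))
    # Unnormalized p(v): sum of exp(-E(v, h)) over every hidden vector h.
    unnorm = {v: sum(math.exp(-rbm_energy(v, h, a, b, w)) for h in hs) for v in vs}
    z = sum(unnorm.values())  # partition function Z
    return {v: u / z for v, u in unnorm.items()}

# Hypothetical parameters: 3 visible units, 2 hidden units.
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
w = [[0.5, -0.4], [0.2, 0.1], [-0.3, 0.6]]
p_v = rbm_marginal(a, b, w)
print(sum(p_v.values()))
```

Computing $Z$ this way needs $2^{n_v + n_h}$ terms, which is the reason training relies on sampling-based approximations instead.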

The probability that the model assigns to a training image can be raised by lowering the energy of that image and raising the energy of other images. The derivative of the log probability of a training vector with respect to a weight is:

$$
\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
$$

where $\langle v_i h_j \rangle_{data}$ is the expected value of $v_i h_j$ with $v$ drawn from the training samples. In contrast, the $v$ in $\langle v_i h_j \rangle_{model}$ is sampled from the model, i.e. $v \sim P_{model}(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$. Hence, the training data has the highest probability when the expected value of $v_i h_j$ under the model matches that of the training samples.

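During training, we adjust the weights by:

$$
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)
$$

where $\epsilon$ is the learning rate.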

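To calculate $\langle v_i h_j \rangle_{data}$, we sample an image $v$ from the training dataset. The binary state $h_j$ of each hidden unit is set to 1 with probability:

$$
p(h_j = 1 \vert v) = \sigma(b_j + \sum_i v_i w_{ij})
$$

Similarly, given a hidden vector $h$, the binary state $v_i$ of each visible unit is set to 1 with probability:

$$
p(v_i = 1 \vert h) = \sigma(a_i + \sum_j h_j w_{ij})
$$

where $\sigma$ is the logistic sigmoid function.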

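Calculating $\langle v_i h_j \rangle_{model}$ is hard because $Z$ is unknown.

One possibility is to use Gibbs sampling (which will not be covered here). The other is to use an approximation based on a reconstruction of the data, and $\Delta w_{ij}$ becomes:

$$
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right)
$$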

First, we pick $v$ from the training samples. Then the probabilities of the hidden units are computed, and we sample a binary value for each unit. Once the binary states $h_j$ have been chosen for the hidden units, a reconstruction $v_i$ is produced from the $h_j$.

To train the biases, the steps are similar except that we use the individual states $v_i$ or $h_j$ instead.

#### Simple walk through

1. Start with a sample $v$ from the training dataset.
2. Compute $p(h_j = 1 \vert v) = \sigma(b_j + \sum_i v_i w_{ij})$ and sample $h_j \in \{ 0, 1\}$ from it.
3. $positive_{ij} = v_i h_j$.
4. Compute $p(v^{(1)}_i = 1 \vert h) = \sigma(a_i + \sum_j h_j w_{ij})$ and sample $v^{(1)}_i \in \{ 0, 1\}$ from it.
5. $negative_{ij} = v^{(1)}_i h_j$.
6. $w_{ij} = w_{ij} + \epsilon (positive_{ij} - negative_{ij})$.
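The six steps above can be sketched for a single training vector as follows. This is a minimal illustration in plain Python, not a tuned implementation; the function names, the zero initialization, and the learning rate of 0.1 are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sample_bernoulli(probs, rng):
    """Draw one binary value per probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v, w, a, b, epsilon, rng):
    """Apply one weight update for a binary visible vector v (w is changed in place)."""
    n_v, n_h = len(a), len(b)
    # Step 2: p(h_j = 1 | v), then a binary sample h.
    p_h = [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(n_v))) for j in range(n_h)]
    h = sample_bernoulli(p_h, rng)
    # Step 4: p(v_i = 1 | h), then a binary reconstruction v1.
    p_v1 = [sigmoid(a[i] + sum(h[j] * w[i][j] for j in range(n_h))) for i in range(n_v)]
    v1 = sample_bernoulli(p_v1, rng)
    # Steps 3, 5, 6: positive minus negative statistics, scaled by epsilon.
    for i in range(n_v):
        for j in range(n_h):
            w[i][j] += epsilon * (v[i] * h[j] - v1[i] * h[j])
    return h, v1

rng = random.Random(0)  # fixed seed so the run is repeatable
a = [0.0, 0.0, 0.0]     # visible biases
b = [0.0, 0.0]          # hidden biases
w = [[0.0, 0.0] for _ in range(3)]
v = [1, 0, 1]           # one hypothetical training vector
h, v1 = cd1_step(v, w, a, b, epsilon=0.1, rng=rng)
print(h, v1, w)
```

A weight only changes where the reconstruction disagrees with the data and the hidden unit is on. In a real training loop this step would run over many samples, and the biases would be updated in the same way using $v_i$ and $h_j$ alone.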

### Free energy

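The free energy of a visible vector $v$ is the energy of a single configuration that has the same probability as all of the configurations that contain $v$:

$$
e^{-F(v)} = \sum_h e^{-E(v, h)}
$$

which is also the expected energy minus the entropy:

$$
F(v) = - \sum_i v_i a_i - \sum_j p_j x_j + \sum_j \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right)
$$

where $x_j = b_j + \sum_i v_i w_{ij}$ is the total input to hidden unit $j$ and $p_j = \sigma(x_j)$ is the probability that $h_j = 1$ given $v$.

The free energy of an RBM can be simplified as:

$$
F(v) = - \sum_i v_i a_i - \sum_j \log (1 + e^{x_j})
$$

Recall:

$$
p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}
$$

Therefore:

$$
p(v) = \frac{e^{-F(v)}}{Z}, \quad Z = \sum_{\tilde{v}} e^{-F(\tilde{v})}
$$

Take the negative log, $-\log p(v) = F(v) + \log Z$, and differentiate with respect to a parameter $\theta$:

$$
- \frac{\partial \log p(v)}{\partial \theta} = \frac{\partial F(v)}{\partial \theta} - \sum_{\tilde{v}} p(\tilde{v}) \frac{\partial F(\tilde{v})}{\partial \theta}
$$

where $p$ is the probability distribution formed by the model.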
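The closed form of the free energy can be checked numerically against its definition. The sketch below assumes the standard RBM expressions $F(v) = -\sum_i a_i v_i - \sum_j \log(1 + e^{x_j})$ with $x_j = b_j + \sum_i v_i w_{ij}$, and $e^{-F(v)} = \sum_h e^{-E(v, h)}$. The parameters and function names are hypothetical.

```python
import itertools
import math

def free_energy(v, a, b, w):
    """Closed form: -sum_i a_i v_i - sum_j log(1 + exp(x_j))."""
    x = [b[j] + sum(v[i] * w[i][j] for i in range(len(v))) for j in range(len(b))]
    # log1p(exp(x)) is fine for the small inputs used here; large x would overflow.
    return -sum(a[i] * v[i] for i in range(len(v))) - sum(math.log1p(math.exp(xj)) for xj in x)

def free_energy_brute_force(v, a, b, w):
    """Definition: -log sum_h exp(-E(v, h)), enumerating every hidden vector h."""
    total = 0.0
    for h in itertools.product([0, 1], repeat=len(b)):
        energy = (-sum(a[i] * v[i] for i in range(len(v)))
                  - sum(b[j] * h[j] for j in range(len(b)))
                  - sum(v[i] * w[i][j] * h[j]
                        for i in range(len(v)) for j in range(len(b))))
        total += math.exp(-energy)
    return -math.log(total)

# Hypothetical parameters: 3 visible units, 2 hidden units.
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
w = [[0.5, -0.4], [0.2, 0.1], [-0.3, 0.6]]
v = [1, 0, 1]
print(free_energy(v, a, b, w), free_energy_brute_force(v, a, b, w))
```

The two values agree up to floating-point error. The closed form costs one pass over the hidden units, while the brute-force sum grows as $2^{n_h}$.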

### Contrastive Divergence (CD-k)

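For an RBM,

$$
F(v) = - \sum_i a_i v_i - \sum_j \log (1 + e^{x_j})
$$

where

$$
x_j = b_j + \sum_i v_i w_{ij}
$$

i.e.

$$
\frac{\partial F(v)}{\partial w_{ij}} = - v_i \, \sigma(x_j)
$$

(Without proof) We combine $F(v)$ with the gradient from the last section, approximating the expectation over the model with a single sample $v^{(k)}$:

$$
- \frac{\partial \log p(v)}{\partial w_{ij}} \approx - v^{(0)}_i \, \sigma(x^{(0)}_j) + v^{(k)}_i \, \sigma(x^{(k)}_j)
$$

in which $v = v^{(0)}$, $h^{(1)}$ is the prediction from $v^{(0)}$, $v^{(1)}$ is the prediction from $h^{(1)}$, and so on up to $v^{(k)}$. Here $x^{(k)}_j$ denotes $x_j$ evaluated at $v^{(k)}$.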

So we sample an image from the training data as $v$ and compute $v^{(k)}$. In practice, $k=1$ already gives reasonable results.
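As a rough sketch of CD-k for the weights, the code below runs $k$ alternating sampling steps starting from a training vector and then contrasts $v_i \, \sigma(x_j)$ at the start and at the end of the chain. It assumes the usual RBM conditionals $p(h_j = 1 \vert v) = \sigma(b_j + \sum_i v_i w_{ij})$ and $p(v_i = 1 \vert h) = \sigma(a_i + \sum_j h_j w_{ij})$; all names and parameter values are illustrative.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def hidden_probs(v, w, b):
    return [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v)))) for j in range(len(b))]

def visible_probs(h, w, a):
    return [sigmoid(a[i] + sum(h[j] * w[i][j] for j in range(len(h)))) for i in range(len(a))]

def cdk_weight_gradient(v0, w, a, b, k, rng):
    """CD-k estimate of d log p(v0) / d w_ij, using k >= 1 sampling steps."""
    vk = list(v0)
    for _ in range(k):
        # Sample h from the current visible vector, then a new visible vector from h.
        h = [1 if rng.random() < p else 0 for p in hidden_probs(vk, w, b)]
        vk = [1 if rng.random() < p else 0 for p in visible_probs(h, w, a)]
    p_h0 = hidden_probs(v0, w, b)  # sigma(x_j) at the data
    p_hk = hidden_probs(vk, w, b)  # sigma(x_j) at the end of the chain
    return [[v0[i] * p_h0[j] - vk[i] * p_hk[j] for j in range(len(b))]
            for i in range(len(a))]

rng = random.Random(0)  # fixed seed so the run is repeatable
w = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]  # hypothetical weights
grad = cdk_weight_gradient([1, 0, 1], w, [0.0, 0.0, 0.0], [0.0, 0.0], k=1, rng=rng)
print(grad)
```

Adding `epsilon * grad[i][j]` to each weight performs one gradient ascent step on the log probability. Using the hidden probabilities rather than sampled hidden states in the final statistics is a common variance-reduction choice, and it is a design decision of this sketch rather than something the text prescribes.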

### Credit

For those interested in the technical details of Restricted Boltzmann Machines, please read "A Practical Guide to Training Restricted Boltzmann Machines" by Geoffrey Hinton.