Gaussian distribution (Quick review)

We define a function to map input to . In statistic, we use a stochastic model to define a probability distribution for such relationship. For example, a 3.8 GPA student can earn an average of $60K salary with a variance () of $10K.

Probability density function (PDF)

In the following diagram, follows a gaussian distribution:

In a Gaussian distribution, 68% of data is within 1 from the and 95% of data is within 2 .

We can sample data based on the probability distribution. The notation to sample data from a distribution is:

In many real world examples, data follows a gaussian distribution.

Here, let’s model the relationship between the body height and the body weight for San Francisco residents. We collect the information from 1000 adult residents and plot the data below with each red dot represents 1 person:

The corresponding PDF in 3D:

Let us generalize the model first with a multivariate gaussian distribution. i.e. PDF depends on multiple variables.

For a multivariate vector:

The PDF of a multivariate Gaussian distribution is defined as:

with covariance matrix :

Let’s go back to our weight and height example to illustrate it.

From our training data, we calculate :

What is the covariance matrix for? Each element in the covariance matrix measures how one variable is related to another. For example, measures how is related with . If weight increases with height, we expect to be positive.

Let’s get into the detail on computing above. To simplify, we consider that we got only 2 data points (150 lb, 66 inches) and (200 lb, 72 inches).

After computing all 1000 data, here is the value of :

Positive element values in means 2 variables are positively related. With no surprise, is positive because weight increases with height. If two variables are independent of each other, the value should be 0 like:

Calculate the probability of

To calculate the probability of given :

which is the accumulative probability distribution:

and we rewrote the covariance variable into the following form:


Here we sample data from a 2-variable gaussian distribution. From the covariance matrix, we can tell x is positively related with y as is positive.

mean = [0, 2]
cov = [[1, 2], [3, 1]]

x, y = np.random.multivariate_normal(mean, cov, 5000).T
plt.plot(x, y, 'x')

Here, we plot the probability distribution for (y, x).

from scipy.stats import multivariate_normal

x, y = np.mgrid[-1:1:.01, -1:1:.01]  # x (200, 200) y (200, 200)
pos = np.empty(x.shape + (2,))
pos[:, :, 0] = x; pos[:, :, 1] = y   # pos (200, 200, 2)

mean = [-0.4, -0.3]
cov = [[2.1, 0.2], [0.4, 0.5]]
rv = multivariate_normal(mean, cov)
p = rv.pdf(pos)                      # (200, 200)
plt.contourf(x, y, p)

Multivariate Gaussian Theorem

Given a Gaussian Distribution for

The posterior conditional for is given below. This formular is particular important for the Gaussian process in the later section. For example, if we have samples of 1000 graduates with their GPAs and their salaries, we can use this theorem to predict salary given a GPA by creating a Gaussian distribution model with our 1000 training datapoints.

Proof of this theorem can be found here. We will not go into details on the proof. But with the assumption that is Gaussian distributed. The co-relation of and is defined by and . So given the value of , we can compute the probability distribution of :

Source: Nando de Freitas UBC Machine learning class.

For example, we know that height of SF residents is gaussian distributed. In the next section, we will apply GP to make a prediction of weight given a height.

Gaussian Process (GP)

The intuition of the Gaussian Process GP is simple. If 2 points have similar input, their output should be similar. With 2 datapoints, if one is closer to a known training datapoint, its prediction is more certain than the other one.

If a student with a GPA 3.5 earns $70K a year, another student with a GPA 3.45 should earn something very similar. In GP, we use our training dataset to build a Gaussian distribution to make prediction. For each prediction, we output a mean and a . For example, with GP, we can predict a 3.3 GPA student can earn with while a 2.5 GPA student can earn and . measures the un-certainty of our prediction. Because a 3.3 GPA is closer to our 3.5 GPA training data, we are more confident about the salary prediction for a 3.3 GPA student than a 2.5 GPA student.

In GP, instead of computing , we compute trying to measure the similarity between datapoint and .


which kernel is a function measuring the similarity of the 2 datapoints (1 means the same). There are many possible kernel functions, we will use an exponential square distance for our kernel.

Note: In previous section means the weight of a datapoint. Here means datapoint 1. means the weight of datapoint 3.

With all the training data, we can create a Gaussian model:

Let’s demonstrate it again with 2 training datapoints (150 lb, 66 inches) and (200lb, 72 inches). Here we are building a Gaussian model for our training data.

with 175 as the mean of the weight and measures the similarity of the height of the datapoints . The notation above just mean we can sample a vector on weight

from which is model by the datapoint (150, 66) and (200, 72).

Now let’s say we want to predict for input . The model change to:

Let’s understand what does it mean again. For example, we have a vector contain 4 persons height

We can use to sample the possible weight for these people:

We know the first 2 values from the training data and we try to compute the distribution for . (what is their and .) Now, instead of predicting just 2 values, we can have input over a range of values

and use to sample vector:

For example, our first output sample from is

We can plot the output against input. The line below is just a regular non-scholastic function. i.e. .

In this section, we use the notation as a non-scholastic function while as a scholastic function.

To generate training data much easier, we are switching to a new model . We use the equation to generate 2 training datapoints (2 blue dots below) to build a Gaussian model. We then sample three times shown as the three solid lines below.

We see that the 2 training data forces to have all lines interest at the blue dots. If we keep sampling, we will start visually recognize the mean and the range of for each . For example, the red dot and the blue line below estimates the mean and the variance of for . Since is between 2 training points, the estimation has a relatively high un-certainty (indicated by ).

In the plot below, we have 5 training data and we sample 30 lines from . The red dotted line indicates the mean output value for and the gray area are within 2 from .

As mentioned before, each line acts like a function to map input to output: . We start with many possible functions but the training dataset reduce or increase the likelihood of some functions. Technically, model the possibility distribution of the functions given the training dataset. (the probability distribution of the lines drawn above.)

GP is charactered as building a Gaussian model to discribe the distribution of functions.

We are not going to solve the problem by sampling. Instead we will solve it analytically.

Back to

We can generalize the expression as follow which is the label (weights) of the training set and is the weights that we want to predict for . Now we need to solve with the Gaussian model.

Recall from the previous section on Multivariate Gaussian Theorem, if we have a model

We can solve by:

Now, we apply the equations to solve

For the training dataset, let output labels has the gaussian distribution:

And let the Gaussian distribution for be:

which L is defined as

and from the Multivariate Gaussian Theorem:

We are applying the equations to model training data from . In this example, as the mean value of a function is 0. Our equation will therefore simplify to:

Note, may be poorly condition to find the inverse. So we apply Cholesky to decompose K first and then apply linear algebra to solve .

The notation

means using linear algebra to solve x for .

We are going to pre-compute some terms before solving .

Apply and the equation above

Now, we we have the equations to compute and .


First we are going to prepare our training data and label it with a function. The training data contains 5 datapoints. .

Xtrain = np.array([-4, -3, -2, -1, 1]).reshape(5,1)
ytrain = np.sin(Xtrain)      # Our output labels.

Testing data: We create 50 new datapoint linear distributed between -5 and 5 to be predicted by the Gaussian process.

# 50 Test data
n = 50
Xtest = np.linspace(-5, 5, n).reshape(-1,1)

Here we define a kernel measure the similarity of 2 datapoint using an exponential square kernel.

# A kernel function (aka Gaussian) measuring the similarity between a and b. 1 means the same.
def kernel(a, b, param):
    sqdist = np.sum(a**2,1).reshape(-1,1) + np.sum(b**2,1) - 2*, b.T)
    return np.exp(-.5 * (1/param) * sqdist)

To compute the Kernel (

K = kernel(Xtrain, Xtrain, param)                        # Shape (5, 5)
K_s = kernel(Xtrain, Xtest, param)                       # Shape (5, 50)
K_ss = kernel(Xtest, Xtest, param)                       # Kss Shape (50, 50)

We will use the Cholesky decomposition to decompose .

L = np.linalg.cholesky(K + 0.00005*np.eye(len(Xtrain)))  # Shape (5, 5)

Compute the mean output for our prediction. As we assumem , the equation becomes:

L = np.linalg.cholesky(K + 0.00005*np.eye(len(Xtrain)))  # Add some nose to make the solution stable 
                                                         # Shape (5, 5)

# Compute the mean at our test points.
Lk = np.linalg.solve(L, K_s)                             # Shape (5, 50)
mu =, np.linalg.solve(L, ytrain)).reshape((n,)) # Shape (50, )


# Compute the standard deviation.
s2 = np.diag(K_ss) - np.sum(Lk**2, axis=0)               # Shape (50, )
stdv = np.sqrt(s2)                                       # Shape (50, )

Sample so we can plot it.

Sample it using , and as variance:

L = np.linalg.cholesky(K_ss + 1e-6*np.eye(n) -, Lk))    # Shape (50, 50)
f_post = mu.reshape(-1,1) +, np.random.normal(size=(n,5))) # Shape (50, 3)

We sample 3 possible output shown in orange, blue and green line. The gray area is where within 2 of . Blue dot is our training dataset. Here, at blue dot, is closer to 0. For points between the training datapoints, increases reflect its un-certainty because it is not close to the training dataset points. When we move beyond , we do not have any more training data and result in large .

Here is another plot of posterior after seeing 5 evidence. The blue dot is where we have training datapoint and the gray area demonstrate the un-certainty (variance) of the prediction.

Source Wikipedia.

Gaussian mixture model

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions.

For K=2, we will have 2 Gaussian distributions and . We start with a random initialization of parameters and . Gaussian mixture model tries to fit the training datapoints into and and then re-compute their parameters. The datapoints are re-fitted and the parameters are re-calculated again. The iterations continues until the solution converges.

Expectation Maximization (EM)

Details in Expectation Maximization:

  • Initialize the and ’s parameters and with random values. Set
  • For all the training datapoints , compute the probability that it belongs to () or ().
  • Now, we recalculate the parameters for and
  • Recalculate the priors

For Gaussian Distribution with multiple variate, the probability distribution function is: