Bayes’ theorem

Bayes’ theorem calculates the conditional probability (the probability of A given B):

$$ P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)} $$

Sometimes the result of Bayes’ theorem can surprise you. Let’s say there is an uncommon disease that only 0.1% of the population has. We develop a test that is 99% accurate for both positive and negative results. What is the chance that you have this disease if your test is positive? The answer is only 9%, which is much smaller than you may think. The intuition is that even though the test looks accurate, it generates more false positives than true positives because the disease is not common. With Bayes’ theorem, we can demonstrate why it is only 9%.

Notation: $$ d $$ stands for having the disease and $$ + $$ stands for testing positive. $$ \neg d $$ means not $$ d $$.

Convention: $$ P(A, B) $$ means $$ P(A \text{ and } B) $$. $$ P(A \vert B, C) $$ stands for $$ P(A \vert (B \text{ and } C)) $$.
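The 9% figure can be checked numerically. The sketch below plugs the numbers from the example (0.1% prevalence, 99% accuracy for both positive and negative results) into Bayes’ theorem; the variable names are only illustrative.

```python
# Figures taken from the example in the text.
p_disease = 0.001            # 0.1% of the population has the disease
p_pos_given_disease = 0.99   # the test is 99% accurate for people with the disease
p_neg_given_healthy = 0.99   # the test is 99% accurate for people without it

p_healthy = 1 - p_disease
p_pos_given_healthy = 1 - p_neg_given_healthy   # false positive rate

# Total probability of a positive test: true positives + false positives.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"true positives : {p_pos_given_disease * p_disease:.5f}")
print(f"false positives: {p_pos_given_healthy * p_healthy:.5f}")
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")
```

The false positives (about 0.01 of the population) outnumber the true positives (about 0.001) roughly ten to one, which is why the answer is only about 0.09.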

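Proof of Bayes’ theorem:

\begin{equation} \begin{split} P(A, B) & = P(A \vert B) P(B) = P(B \vert A) P(A) \\ P(A \vert B) & = \frac{P(B \vert A) P(A)}{P(B)} \end{split} \end{equation}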

Naive Bayes Classifier

The Naive Bayes classifier classifies objects given observations, based on Bayes’ theorem. For example, if we draw an object from a basket and it is red and round, what is the chance that it is an apple?

$$ P(apple \vert red, round) = \frac{P(red, round \vert apple) P(apple)}{P(red, round)} $$

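Assume $$ red $$ and $$ round $$ are independent of each other given $$ apple $$. (Knowing that an apple is red does not change the chance that it is round.) i.e.:

$$ P(red, round \vert apple) = P(red \vert apple) P(round \vert apple) $$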


Let’s say we want to determine whether an object we pick from a basket is an apple or a grape, given that the object is red, round and sweet:

$$ P(apple \vert red, round, sweet) = \frac{P(red \vert apple) P(round \vert apple) P(sweet \vert apple) P(apple)}{P(red, round, sweet)} $$

$$ P(red, round, sweet) $$ is the same regardless of which object we consider, therefore we ignore the denominator when comparing the objects’ probabilities.

\begin{equation} \begin{split} P(apple \vert red, round, sweet) & \propto P(red \vert apple) P(round \vert apple) P(sweet \vert apple) P(apple) \end{split} \end{equation}

Say half of the objects in the basket are apples, and

The chance that the object is an apple:

If a quarter of the objects in the basket are grapes, the corresponding chance for a grape is:

So if an object is red, round and sweet, it is likely an apple.
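The comparison can be sketched in a few lines of Python. The priors (one half apples, one quarter grapes) come from the text; the per-feature probabilities below are hypothetical values chosen only for illustration, and the function name is an assumption as well.

```python
# Priors from the text: half of the objects are apples, a quarter are grapes.
priors = {"apple": 0.5, "grape": 0.25}

# HYPOTHETICAL values of P(feature | object); they are not from the original text.
likelihoods = {
    "apple": {"red": 0.7, "round": 0.9, "sweet": 0.8},
    "grape": {"red": 0.5, "round": 0.9, "sweet": 0.9},
}

def unnormalized_score(label, features):
    """P(label) times the product of P(feature | label), ignoring the shared denominator."""
    score = priors[label]
    for feature in features:
        score *= likelihoods[label][feature]
    return score

observed = ["red", "round", "sweet"]
scores = {label: unnormalized_score(label, observed) for label in priors}
best = max(scores, key=scores.get)

print(scores)
print("most likely:", best)
```

Because both scores share the same denominator, comparing the unnormalized products is enough to pick the more likely object.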

E-mail spam filter

We use a bag-of-words model to construct the features:

  money inheritance rich quick vicodin free fee bank illegal alcohol
Message 1 0 1 0 0 0 1 0 0 1
If p(yi = ‘spam’ | xi) > p(yi = ‘not spam’ | xi), 
	classify as spam


To avoid

We count as

To avoid underflow when multiplying many small numbers, we take the logarithm of the equation, which turns the multiplication of probabilities into addition.
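A small sketch of why the logarithm helps. The word probabilities and the 50/50 class priors below are hypothetical; the point is only that the direct product underflows to zero for both classes, while the sums of logarithms can still be compared.

```python
import math

def product(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(prior, probs):
    """log P(class) + sum of log P(word | class)."""
    return math.log(prior) + sum(math.log(p) for p in probs)

# HYPOTHETICAL per-word probabilities for a 400-word message.
spam_word_probs = [0.01] * 400
ham_word_probs = [0.005] * 400
prior_spam = prior_ham = 0.5

# Direct multiplication: both products are too small for a float and become 0.0.
direct_spam = prior_spam * product(spam_word_probs)
direct_ham = prior_ham * product(ham_word_probs)

# Log space: the scores stay finite and comparable.
log_spam = log_score(prior_spam, spam_word_probs)
log_ham = log_score(prior_ham, ham_word_probs)
is_spam = log_spam > log_ham

print(direct_spam, direct_ham)
print(log_spam, log_ham, is_spam)
```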

People may be more tolerant of false negatives than of false positives when classifying e-mail as spam. So, instead of comparing the two probabilities directly, we give a weight in the comparison.

Bayesian inference

Let’s go through an example of determining the infection rate ($$ \theta $$) for flu. On the first day of the month, we got 10 flu test results back. We call this evidence $$ x $$. For example, our new evidence shows that 4 samples are positive ($$ x = 4 $$ out of 10). We could simply conclude that $$ \theta = 4/10 = 0.4 $$. Nevertheless, we want to study further by calculating the probability of $$ x $$ given a specific $$ \theta $$.

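The chance of $$ x $$ given $$ \theta $$, $$ P(x \vert \theta) $$, is the likelihood of our evidence. In simple terms, the likelihood is how likely the evidence (4 samples are positive) is to happen (or be generated) under different infection rates. Based on simple probability theory:

$$ P(x \vert \theta) = \binom{10}{4} \theta^{4} (1 - \theta)^{6} $$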

We plot the probability $$ P(x \vert \theta) $$ for values of $$ \theta $$ ranging from 0 to 1. The maximum likelihood estimate (MLE) is the value of $$ \theta $$ at which $$ P(x \vert \theta) $$ is highest. Here the MLE is 0.4, i.e. the most likely value for $$ \theta $$ given the evidence is 0.4.
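The curve and its peak can be reproduced with a simple grid search. This is a minimal sketch that assumes a binomial likelihood for 4 positives out of 10 samples; the grid resolution and names are arbitrary choices.

```python
from math import comb

n_samples, n_positive = 10, 4   # the evidence from the text

def likelihood(theta):
    """Binomial probability of seeing n_positive positives given infection rate theta."""
    return (comb(n_samples, n_positive)
            * theta ** n_positive
            * (1 - theta) ** (n_samples - n_positive))

# Evaluate the likelihood on a grid of candidate infection rates from 0 to 1.
grid = [i / 100 for i in range(101)]
values = [likelihood(theta) for theta in grid]

# The maximum likelihood estimate is the grid point with the highest likelihood.
mle = grid[values.index(max(values))]
print("MLE:", mle, "likelihood at MLE:", round(likelihood(mle), 4))
```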

We call $$ P(x \vert \theta) $$ the likelihood. The blue line above is the likelihood.

In a stochastic process, the computed value is not a scalar but a probability distribution. We may predict that $$ \theta $$ has a mean value of 0.3 with a variance of 0.1. If we need a scalar value, we sample it from the probability distribution.

We had also collected 100 months of data on the infection rate for flu. This data forms a belief about the flu infection rate, which we usually call the prior $$ P(\theta) $$. The orange line below is $$ P(\theta) $$. It is the prior belief about the probability distribution of $$ \theta $$. The distribution centers around 0.14, meaning we generally believe the average infection rate of flu is 0.14. The blue line is the likelihood, which is based only on the new evidence. Obviously the new evidence is different from the prior (belief).

We suspect either that the much higher infection rate in the new evidence is caused by its small sample size, or that we have encountered a new strain of flu and need to re-adjust the prior.

Recap the terminology

People discuss Bayes with a lot of terminology. We pause for a while to summarize the terms again.

Evidence is some data we observed. For example, 4 samples out of 10 are infected. We can treat a belief as a hypothesis. For example, we can start with a belief of a 0.14 infection rate with a variance of 0.02 and later use Bayesian inference to refine it with the data that we collect.

| Term | Notation | Meaning |
| --- | --- | --- |
| posterior probability | $$ P(H \vert D) $$ | The refined belief given new evidence. Example: the new belief after we collect 1000 samples. |
| likelihood | $$ P(D \vert H) $$ | The probability of the evidence given the belief. Example: the chance of having 4 positive samples out of 10 for different values of the infection rate. |
| prior | $$ P(H) $$ | The probability of the belief prior to new evidence. Example: our hypothesis, which will later be combined with new data and refined into a posterior. |
| marginal probability | $$ P(D) $$ | The probability of seeing that data. Example: the probability of seeing 4 positive samples under all possible infection rate values. |

Bayesian inference (continued)

Bayesian inference tries to draw better predictions based on evidence and prior belief.

With Bayesian inference, we calculate the posterior probability using Bayes’ theorem. We re-calibrate our belief given the new evidence:

$$ P(H \vert D) = \frac{P(D \vert H) P(H)}{P(D)} $$

Here is the plot of the posterior. The yellow line is the prior, the blue line is the likelihood, and the green line is the posterior calculated with Bayes’ theorem.

It is clear that the posterior moves closer to the likelihood with the new evidence. The posterior $$ P(H \vert D) \propto P(D \vert H) P(H) $$ depends on the prior and the likelihood. Even though the prior peaks at 0.14 (the red line), the new evidence has a likelihood that contradicts it. The likelihood at the red line is lower, and therefore the posterior there (their product), shown as the green dot, is lower. At the vertical green line, the prior is lower, but this is compensated by the higher likelihood, and therefore the peak of the posterior moves towards the likelihood.

In short, with new evidence, we shift our belief towards the direction of the newly observed data. We re-calibrate the belief with new data.
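The shift can be reproduced on a grid without any inference library. The sketch below assumes the Beta(10, 60) prior that the PyMC3 code later in this article uses, together with the evidence of 4 positives out of 10; constant factors are dropped because the posterior is normalized at the end.

```python
def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

# Candidate infection rates (endpoints excluded).
grid = [i / 1000 for i in range(1, 1000)]

# Prior: Beta(10, 60) up to a constant, i.e. theta^9 * (1 - theta)^59.
prior = normalize([t ** 9 * (1 - t) ** 59 for t in grid])

# Likelihood of 4 positives out of 10, up to a constant.
like = [t ** 4 * (1 - t) ** 6 for t in grid]

# Posterior is proportional to likelihood * prior.
posterior = normalize([p * l for p, l in zip(prior, like)])

def peak(weights):
    """Grid value where the curve is highest."""
    return grid[weights.index(max(weights))]

prior_peak, like_peak, post_peak = peak(prior), peak(like), peak(posterior)
print(prior_peak, post_peak, like_peak)
```

The posterior peak lands between the prior peak and the likelihood peak, closer to the prior here because the evidence contains only 10 samples.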

In Bayesian inference, we can start with a certain belief for the probability distribution of $$ \theta $$. Alternatively, we can start with a random guess, say that it is uniformly distributed like the orange line below, i.e. every value of $$ \theta $$ is equally likely. The first posterior calculated with this prior will have the same shape as the likelihood.

With each new round of evidence, we compute the new posterior given the evidence. This posterior becomes our belief (prior) in the next iteration, when another round of evidence is collected. After many iterations, our prior will converge to the true probability distribution of $$ \theta $$.

How far will the posterior move towards the new evidence? It depends on the size of the evidence. Obviously, a larger sample size will move the posterior closer. The following plot indicates that the larger the size of the evidence, the further the posterior moves towards the evidence. The variance also decreases, so we are more certain of its value.
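A rough numerical check of this effect: the sketch below keeps the observed infection rate at 40% while growing the sample size, and reports the posterior mean and variance on a grid. The Beta(10, 60) prior is the one used in the PyMC3 code later in this article; the sample sizes and the function name are assumptions for illustration.

```python
import math

def posterior_mean_var(n_positive, n_samples, a=10, b=60, steps=2000):
    """Grid approximation of the posterior for a Beta(a, b) prior and binomial evidence."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    # Work in log space so that large sample sizes do not underflow.
    log_post = [
        (a - 1 + n_positive) * math.log(t)
        + (b - 1 + n_samples - n_positive) * math.log(1 - t)
        for t in grid
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, var

# Same observed rate (40% positive), increasing amounts of evidence.
sizes = (10, 100, 1000)
results = {n: posterior_mean_var(4 * n // 10, n) for n in sizes}
for n in sizes:
    mean, var = results[n]
    print(f"n={n:5d}  posterior mean={mean:.3f}  variance={var:.5f}")
```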

Beta distribution for prior

In this section, we show how to use a beta distribution to model the prior and solve for the posterior in the example above.

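The definition of a beta distribution is:

$$ Beta(\theta; a, b) = \frac{\theta^{a-1} (1 - \theta)^{b-1}}{B(a, b)} $$

For positive integers $$ a $$ and $$ b $$, the beta function $$ B $$ is defined as:

$$ B(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a + b)}, \quad \Gamma(a) = (a - 1)! $$

For general positive real $$ a $$ and $$ b $$, the beta function is:

$$ B(a, b) = \int_0^1 \theta^{a-1} (1 - \theta)^{b-1} d\theta $$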
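As a sanity check, the two usual ways of computing the beta function, the gamma-function formula and the integral form, can be compared numerically. This is a minimal sketch using only the Python standard library; the function names are illustrative.

```python
import math

def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_function_integral(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of t^(a-1) * (1-t)^(b-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** (a - 1) * (1 - t) ** (b - 1)
    return total * h

def beta_pdf(theta, a, b):
    """Density of the beta distribution at theta."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_function(a, b)

print(beta_function(2, 3), beta_function_integral(2, 3))   # both close to 1/12
print(beta_pdf(0.3, 1, 1))                                  # uniform case: density is 1
```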

Here are the beta distributions for different values of a and b. For $$ a = b = 1 $$, the probability is uniformly distributed:

(Plots of the beta distribution for four other settings of a and b.)

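We can model the likelihood with a binomial distribution:

$$ P(x \vert \theta) = \binom{N}{x} \theta^{x} (1 - \theta)^{N - x} $$

Let’s apply Bayes’ theorem to calculate the posterior:

\begin{equation} \begin{split} P(\theta \vert x) & \propto P(x \vert \theta) P(\theta) \\ & \propto \theta^{x} (1 - \theta)^{N - x} \theta^{a - 1} (1 - \theta)^{b - 1} \\ & = \theta^{a + x - 1} (1 - \theta)^{b + N - x - 1} \end{split} \end{equation}

So the posterior is again a beta distribution, $$ Beta(a + x, b + N - x) $$.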

If we start with a uniformly distributed prior $$ Beta(1, 1) $$, which is a good start if we do not have any prior knowledge of $$ \theta $$ (say the infection rate of a disease), and we have $$ N = 10, x = 3 $$ (say 3 infections out of 10 samples), the posterior will be $$ Beta(4, 8) $$, which has a peak at 0.3, the same as the maximum likelihood estimate from our sample:
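With a beta prior and binomial evidence, the update reduces to adding counts to the two parameters. The sketch below assumes the standard conjugate rule (a beta prior with parameters a and b becomes a beta posterior with a plus the number of positives and b plus the number of negatives) and checks the uniform-prior example; the function names are illustrative.

```python
def update_beta(a, b, n_positive, n_samples):
    """Posterior parameters after observing n_positive positives in n_samples trials."""
    return a + n_positive, b + n_samples - n_positive

def beta_mode(a, b):
    """Peak of Beta(a, b); valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

# Uniform prior Beta(1, 1) with 3 infections out of 10 samples.
a_post, b_post = update_beta(1, 1, 3, 10)
peak = beta_mode(a_post, b_post)
print(a_post, b_post, peak)

# Feeding the same evidence in two rounds (1 of 4, then 2 of 6) gives the same posterior:
# the posterior of the first round acts as the prior of the second.
round_one = update_beta(1, 1, 1, 4)
round_two = update_beta(*round_one, 2, 6)
print(round_two)
```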

If we start with a biased prior geared towards 100% infection:

and we have , the posterior will be with peak at 0.62.

By increasing the sample size, the posterior moves closer to the maximum likelihood estimate.

When we enter a new flu season, our sample size for the new flu strain is small. The error can be large if we use only this small sample to compute the infection rate. Instead, we use prior knowledge to compute a prior for the infection rate over the last 12 months. Then we use Bayes’ theorem with the prior and the likelihood to compute the posterior infection probability. When the data size is small, the posterior relies more on the prior, but once the sample size increases, it readjusts itself to the new sample. Hence, Bayes’ theorem can give a better prediction when the sample size is small, while re-adjusting the prediction according to the size of the sampled data.


In the code below, we use PyMC3 as the Bayesian inference engine to compute the posterior from the likelihood and the prior. Then we draw 5000 samples from the posterior distribution. Since the code is self-explanatory, we encourage you to understand the concept directly from the code.

The source code can be found here.

import numpy as np
import pymc3 as pm

# Markov chain Monte Carlo model
# Create the evidence: 4 infections out of 10 people
people_count = 10
chance_of_infection = 0.4
infection_count = round(people_count * chance_of_infection)  # Number of infections

# Create a model using pymc3
with pm.Model() as model:
    # Wrap the evidence in arrays (a single observation)
    n_people = np.array([people_count])
    infections = np.array([infection_count])

    # We use a beta distribution to model the prior.
    # The beta distribution takes 2 parameters.
    # For example, if both are 1, the distribution is uniform.
    # We use 10, 60, which is the prior distribution shown in the plot.
    theta_prior = pm.Beta('prior', 10, 60)
    # We combine the evidence with the prior (assuming a binomial likelihood)
    observations = pm.Binomial('obs', n=n_people,
                               p=theta_prior,
                               observed=infections)

    # Start sampling from the maximum a posteriori (MAP) estimate
    start = pm.find_MAP()

    # Draw 5000 samples from the posterior distribution.
    trace = pm.sample(5000, start=start)
    # Code for plotting the graph goes here

Gaussian distribution

We can use Gaussian distribution for the prior and the likelihood:

Without proof, the posterior is:
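One standard form of this result, written with symbols that are assumptions here rather than the original notation: if the prior over $$ \theta $$ is a Gaussian with mean $$ \mu_0 $$ and variance $$ \sigma_0^2 $$, and the likelihood is a Gaussian in $$ \theta $$ with mean $$ \mu $$ and variance $$ \sigma^2 $$, their product is proportional to a Gaussian with mean $$ \frac{\mu_0 \sigma^2 + \mu \sigma_0^2}{\sigma_0^2 + \sigma^2} $$ and variance $$ \frac{\sigma_0^2 \sigma^2}{\sigma_0^2 + \sigma^2} $$. A minimal sketch, with hypothetical numbers:

```python
def gaussian_posterior(mu_prior, var_prior, mu_like, var_like):
    """Mean and variance of the (normalized) product of two Gaussians in the same variable."""
    var_post = var_prior * var_like / (var_prior + var_like)
    mu_post = (mu_prior * var_like + mu_like * var_prior) / (var_prior + var_like)
    return mu_post, var_post

# HYPOTHETICAL numbers: equal variances pull the posterior mean to the midpoint
# and halve the variance.
mu_post, var_post = gaussian_posterior(0.0, 1.0, 2.0, 1.0)
print(mu_post, var_post)

# A more confident likelihood (smaller variance) pulls the posterior mean closer to it.
mu_sharp, var_sharp = gaussian_posterior(0.0, 1.0, 2.0, 0.25)
print(mu_sharp, var_sharp)
```

The posterior variance is always smaller than both input variances, which matches the earlier observation that more evidence makes us more certain.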