“Machine learning - Naive Bayes classifier, Bayesian inference”
Bayes’ theorem
Bayes’ theorem calculates the conditional probability (the probability of A given B):
\begin{equation} P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)} \end{equation}
Sometimes the result of Bayes’ theorem can surprise you. Let’s say there is an uncommon disease that only 0.1% of the population has. We develop a test that is 99% accurate for both positive and negative results: 99% of people with the disease test positive, and 99% of people without it test negative. What is the chance that you have the disease if your test is positive? The answer is only 9%, which is much smaller than you may think. The intuition is that even though the test looks accurate, it generates more false positives than true positives because the disease is not common. With Bayes’ theorem, we can demonstrate why it is only 9%.
Notation: $d$ stands for having the disease and $+$ stands for testing positive. $\neg d$ means not $d$.
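The 9% figure can be checked numerically. With $d$ for having the disease and $+$ for testing positive, Bayes’ theorem with the denominator expanded over both cases reads:
\begin{equation} P(d \vert +) = \frac{P(+ \vert d) P(d)}{P(+ \vert d) P(d) + P(+ \vert \neg d) P(\neg d)} \end{equation}
The sketch below plugs in the numbers from the example. The function and parameter names (`posterior_positive`, `prevalence`, `sensitivity`, `specificity`) are chosen here for illustration, and "99% accuracy" is assumed to mean 99% for both sick and healthy people.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(+ | d) P(d)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+ | not d) P(not d)
    return true_pos / (true_pos + false_pos)

# 0.1% of the population has the disease; the test is 99% accurate both ways.
p = posterior_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(d | +) = {p:.3f}")
```

The false-positive term (0.01 × 0.999) is roughly ten times larger than the true-positive term (0.99 × 0.001), which is why the result is close to 9%.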
Convention: $P(A, B)$ means $P(A \cap B)$. $P(A \vert B, C)$ stands for $P(A \vert B \cap C)$.
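Proof of Bayes’ theorem:
\begin{equation} \begin{split} P(A, B) & = P(A \vert B) P(B) = P(B \vert A) P(A) \\ P(A \vert B) & = \frac{P(B \vert A) P(A)}{P(B)} \end{split} \end{equation}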
Naive Bayes Classifier
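A naive Bayes classifier classifies objects given observations, based on Bayes’ theorem. For example, if we draw an object from a basket and it is red and round, what is the chance that it is an apple?
\begin{equation} P(apple \vert red, round) = \frac{P(red, round \vert apple) P(apple)}{P(red, round)} \end{equation}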
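Assume $red$ and $round$ are independent of each other given $apple$. (Knowing that an apple is red does not change the chance that it is round.) i.e.:
\begin{equation} P(red, round \vert apple) = P(red \vert apple) P(round \vert apple) \end{equation}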
Example
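Let’s say we want to determine whether an object we pick from a basket is an apple or a grape, given that the object is red, round and sweet.
\begin{equation} P(apple \vert red, round, sweet) = \frac{P(red \vert apple) P(round \vert apple) P(sweet \vert apple) P(apple)}{P(red, round, sweet)} \end{equation}
$P(red, round, sweet)$ is the same regardless of which object we pick, so we ignore the denominator when comparing the objects’ probabilities. \begin{equation} \begin{split} P(apple \vert red, round, sweet) & \propto P(red \vert apple) P(round \vert apple) P(sweet \vert apple) P(apple) \end{split} \end{equation}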
Say, half of the objects in the basket are apples, and
The chance that the object is an apple:
If a quarter of the objects in the basket are grapes, the corresponding chance for a grape is:
So if an object is red, round and sweet, it is likely an apple.
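The comparison can be sketched in a few lines of Python. The priors (one half apples, one quarter grapes) come from the text; the per-feature likelihoods in `likelihoods` are hypothetical values picked only for illustration, as are all the names used.

```python
# Priors from the text: half the basket is apples, a quarter is grapes.
priors = {"apple": 0.5, "grape": 0.25}

# Hypothetical P(feature | class) values, for illustration only.
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "sweet": 0.7},
    "grape": {"red": 0.3, "round": 0.9, "sweet": 0.8},
}

def unnormalized_posterior(label, features):
    """P(class) times the product of P(feature | class); the denominator is dropped."""
    score = priors[label]
    for feature in features:
        score *= likelihoods[label][feature]
    return score

features = ["red", "round", "sweet"]
scores = {label: unnormalized_posterior(label, features) for label in priors}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because both scores share the same denominator $P(red, round, sweet)$, comparing the unnormalized products is enough to pick the more probable class.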
Email spam filter
We use a bag of words to construct the features $x_i$ of each message.
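| | money | inheritance | rich | quick | vicodin | free | fee | bank | illegal | alcohol | … |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Message | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | … |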
If $p(y_i = \text{spam} \vert x_i) > p(y_i = \text{not spam} \vert x_i)$, classify the message as spam.
Tips:
To avoid
We count as
To avoid underflow when multiplying many small numbers, we take the logarithm of the equation, which turns the multiplication of probabilities into additions.
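People are usually more tolerant of a false negative (spam reaching the inbox) than of a false positive (a legitimate email marked as spam). So, instead of comparing the two posteriors directly, we give a weight in the comparison, so that a message is classified as spam only when the spam posterior exceeds the not-spam posterior by some factor.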
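The sketch below puts these tips together on a toy example: binary bag-of-words features, sums of log probabilities instead of products, and a weighted comparison. The training data, the shortened vocabulary, the `weight` parameter and the add-one smoothing of the word counts are all assumptions made for illustration, not the article’s exact formulas.

```python
import math

VOCAB = ["money", "inheritance", "rich", "quick", "free"]

# Hypothetical toy training set: (binary bag-of-words vector, label).
TRAIN = [
    ([1, 1, 1, 0, 1], "spam"),
    ([1, 0, 1, 1, 1], "spam"),
    ([0, 0, 0, 1, 0], "not spam"),
    ([0, 0, 0, 0, 1], "not spam"),
    ([1, 0, 0, 0, 0], "not spam"),
]

def train(data, smoothing=1.0):
    """Estimate log P(y) and P(word present | y), with add-one smoothing (an assumption)."""
    log_prior, p_word = {}, {}
    for y in sorted({label for _, label in data}):
        rows = [x for x, label in data if label == y]
        log_prior[y] = math.log(len(rows) / len(data))
        p_word[y] = [
            (sum(r[j] for r in rows) + smoothing) / (len(rows) + 2 * smoothing)
            for j in range(len(VOCAB))
        ]
    return log_prior, p_word

def log_score(x, y, log_prior, p_word):
    """log P(y) + sum_j log P(x_j | y): a sum of logs instead of a product."""
    total = log_prior[y]
    for xj, p in zip(x, p_word[y]):
        total += math.log(p if xj else 1.0 - p)
    return total

def classify(x, log_prior, p_word, weight=1.0):
    """Flag as spam only if P(spam | x) > weight * P(not spam | x)."""
    spam = log_score(x, "spam", log_prior, p_word)
    ham = log_score(x, "not spam", log_prior, p_word)
    return "spam" if spam > math.log(weight) + ham else "not spam"

log_prior, p_word = train(TRAIN)
message = [1, 0, 1, 0, 1]  # contains "money", "rich" and "free"
print(classify(message, log_prior, p_word))
print(classify(message, log_prior, p_word, weight=10.0))
```

With `weight=1.0` the message is flagged as spam; raising the weight to 10 demands stronger evidence, and the same message is let through.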
Bayesian inference
Let’s go through an example of determining the infection rate ($\theta$) for flu. On the first day of the month, we get 10 flu test results back. We call this evidence $x$. For example, our new evidence shows that 4 samples are positive ($x = 4$ out of 10). We could simply conclude that $\theta = 4/10 = 0.4$. Nevertheless, we want to study further by calculating the probability of $x$ given a specific $\theta$.
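The chance of the evidence $x$ given the infection rate $\theta$, written $P(x \vert \theta)$, is the likelihood of our evidence. In simple terms, the likelihood is how likely the evidence (4 samples are positive) is to happen (or be generated) under different infection rates. Based on simple probability theory:
\begin{equation} P(x \vert \theta) = \binom{10}{4} \theta^4 (1 - \theta)^6 \end{equation}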
We plot the probability $P(x \vert \theta)$ for values of $\theta$ ranging from 0 to 1. The maximum likelihood estimate (MLE) is the value of $\theta$ with the highest probability. Here the MLE is 0.4, i.e. the evidence is most probable when $\theta$ is 0.4.
We call $P(x \vert \theta)$ the likelihood. The blue line above is the likelihood.
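As a minimal sketch, the likelihood curve and its maximum can be computed on a grid of candidate infection rates, assuming the binomial model for 4 positives out of 10. The names `likelihood`, `grid` and `mle` are illustrative.

```python
from math import comb

def likelihood(theta, positives=4, n=10):
    """Binomial probability of `positives` out of `n` given infection rate theta."""
    return comb(n, positives) * theta**positives * (1 - theta)**(n - positives)

# Evaluate the likelihood on candidate infection rates 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
values = [likelihood(t) for t in grid]

# The MLE is the grid point where the likelihood is highest.
mle = grid[values.index(max(values))]
print(mle, round(likelihood(mle), 4))
```

Note that the likelihood is a function of $\theta$ for fixed evidence, so the curve does not have to integrate to 1 over $\theta$.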
In a stochastic process, the computed value is not a scalar but a probability distribution. For example, we may predict that $\theta$ has a mean value of 0.3 with a variance of 0.1. If we need a scalar value, we sample it from the probability distribution.
We have also collected 100 months of data on the infection rate for flu. This data forms a belief about the flu infection rate, which we usually call the prior $P(\theta)$. The orange line below is $P(\theta)$. It is the prior belief of the probability distribution for $\theta$. The distribution centers around 0.14, meaning we generally believe the average flu infection rate is 0.14. The blue line is the probability distribution of $\theta$ based only on the new evidence (the likelihood). Obviously the new evidence differs from the prior (belief).
We suspect that the much higher infection rate in the new evidence is caused either by its small sample size, or by a new strain of flu for which we need to readjust the prior.
Recap the terminology
People discuss Bayes with a lot of terminology. We pause a while to summarize the terms again.
Evidence is some data we observed. For example, 4 samples out of 10 are infected. We can treat a belief as a hypothesis. For example, we can start with a belief of a 0.14 infection rate with a variance of 0.02 and later use Bayesian inference to refine it with the data that we collect.
| Term | Symbol | Definition | Example |
| --- | --- | --- | --- |
| posterior probability | $P(\theta \vert x)$ | The refined belief given new evidence. | The new belief after we collect 1000 samples. |
| likelihood | $P(x \vert \theta)$ | The probability of the evidence given the belief. | The chance of having 4 positive samples out of 10 for different values of the infection rate. |
| prior | $P(\theta)$ | The probability of the belief prior to new evidence. | Our hypothesis, which is later combined with new data and refined into a posterior. |
| marginal probability | $P(x)$ | The probability of seeing the data. | The probability of seeing 4 positive samples under all possible infection rate values. |
Bayesian inference (continued)
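Bayesian inference tries to draw better predictions based on the evidence and the prior belief.
With Bayesian inference, we calculate the posterior probability using Bayes’ theorem. We recalibrate our belief about the infection rate $\theta$ given the new evidence $x$:
\begin{equation} P(\theta \vert x) = \frac{P(x \vert \theta) P(\theta)}{P(x)} \end{equation}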
Here is the plot of the posterior. The yellow line is the prior, the blue line is the likelihood, and the green line is the posterior calculated with Bayes’ theorem.
It is clear that, with the new evidence, the posterior moves away from the prior and closer to the likelihood. The posterior $P(\theta \vert x) \propto P(x \vert \theta) P(\theta)$ depends on both the prior and the likelihood. Even though the prior peaks at 0.14 (the red line), the new evidence has a likelihood that contradicts it. The likelihood at the red line is low, and therefore the posterior there (the product of the two), shown as the green dot, is lower. At the vertical green line, the prior is lower, but this is compensated by the higher likelihood, and therefore the peak of the posterior moves towards the likelihood.
In short, with new evidence, we shift our belief towards the direction of the newly observed data. We recalibrate the belief with new data.
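The shift can be reproduced with a simple grid approximation: multiply the prior by the likelihood at each candidate infection rate and normalize. This sketch assumes a Beta(10, 60)-shaped prior (mean about 0.14) and the evidence of 4 positives out of 10; the function names are illustrative.

```python
def prior(theta, a=10, b=60):
    """Unnormalized Beta(a, b) shape used as the prior belief."""
    return theta**(a - 1) * (1 - theta)**(b - 1)

def likelihood(theta, positives=4, n=10):
    """Binomial likelihood of the evidence, up to a constant factor."""
    return theta**positives * (1 - theta)**(n - positives)

grid = [i / 1000 for i in range(1001)]

# Posterior is proportional to prior * likelihood; dividing by the sum normalizes it.
unnormalized = [prior(t) * likelihood(t) for t in grid]
evidence = sum(unnormalized)  # plays the role of the marginal probability on the grid
posterior = [u / evidence for u in unnormalized]

def peak(values):
    """Grid point at which `values` is largest."""
    return grid[values.index(max(values))]

prior_peak = peak([prior(t) for t in grid])
likelihood_peak = peak([likelihood(t) for t in grid])
posterior_peak = peak(posterior)
print(prior_peak, posterior_peak, likelihood_peak)
```

The posterior peak lands between the prior peak and the likelihood peak, closer to the prior here because 10 samples are weak evidence against a fairly concentrated prior.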
In Bayesian inference, we can start with a certain belief about the probability distribution of $\theta$. Alternatively, we can start with an uninformed guess, say that it is uniformly distributed like the orange line below, i.e. every value of $\theta$ is equally likely. The first posterior calculated with this prior has the same shape as the likelihood.
With each new round of evidence, we compute a new posterior given that evidence. This posterior becomes our belief (prior) in the next iteration, when another round of evidence is collected. After many iterations, our belief converges towards the true value of $\theta$.
How far will the posterior move towards the new evidence? It depends on the size of the evidence. A larger sample size moves the posterior closer. The following plot indicates that the larger the size of the evidence, the further the posterior moves towards it. The variance also decreases, so we are more certain of the value.
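A rough numerical check of this effect: keep the observed positive rate at 40% and grow the sample from 10 to 1000, again on a grid and assuming a Beta(10, 60)-shaped prior. Working in log space avoids underflow for the larger samples. All names are illustrative.

```python
import math

def log_unnormalized_posterior(theta, positives, n, a=10, b=60):
    """log of (Beta(a, b)-shaped prior * binomial likelihood), up to a constant."""
    return ((a - 1 + positives) * math.log(theta)
            + (b - 1 + n - positives) * math.log(1 - theta))

def summarize(positives, n, steps=2000):
    """Return (mean, standard deviation) of the grid posterior."""
    grid = [(i + 0.5) / steps for i in range(steps)]  # midpoints avoid theta = 0 and 1
    logs = [log_unnormalized_posterior(t, positives, n) for t in grid]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]       # subtract the max to avoid underflow
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

results = {}
for n in (10, 100, 1000):
    results[n] = summarize(positives=4 * n // 10, n=n)
    print(n, [round(v, 3) for v in results[n]])
```

As the sample grows, the posterior mean moves from near the prior towards 0.4 and the standard deviation shrinks.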
Beta distribution for prior
In this section, we show how to use a beta distribution to model the prior and solve for the posterior in the example above.
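The definition of a beta distribution is:
\begin{equation} Beta(\theta; a, b) = \frac{\theta^{a-1} (1 - \theta)^{b-1}}{B(a, b)}, \quad 0 \le \theta \le 1 \end{equation}
For positive integers $a$ and $b$, the beta function $B$ is defined as:
\begin{equation} B(a, b) = \frac{(a-1)! \, (b-1)!}{(a+b-1)!} \end{equation}
For real $a, b > 0$, the beta function is:
\begin{equation} B(a, b) = \int_0^1 t^{a-1} (1 - t)^{b-1} dt = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \end{equation}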
Here is the beta distribution for different values of $a$ and $b$. For $a = b = 1$, the probability is uniformly distributed:
[Figures: beta distributions for four other settings of $a$ and $b$]
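To get a feel for how $a$ and $b$ shape the distribution, the density can be evaluated directly. This is a small sketch using the gamma-function form of the beta function; apart from $(1, 1)$, the parameter pairs below are illustrative choices, not the ones from the original plots.

```python
import math

def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1."""
    return theta**(a - 1) * (1 - theta)**(b - 1) / beta_function(a, b)

# Illustrative (a, b) pairs; only (1, 1) is taken from the text.
for a, b in [(1, 1), (2, 2), (2, 5), (5, 2), (0.5, 0.5)]:
    print((a, b), [round(beta_pdf(t, a, b), 3) for t in (0.1, 0.5, 0.9)])
```

Equal parameters give a symmetric curve, a larger $b$ pushes the mass towards 0, a larger $a$ pushes it towards 1, and values below 1 pile the mass up at both ends.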
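We can model the likelihood with a binomial distribution, where $x$ is the number of positive samples out of $n$:
\begin{equation} P(x \vert \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n-x} \end{equation}
Let’s apply Bayes’ theorem to calculate the posterior with a $Beta(a, b)$ prior:
\begin{equation} \begin{split} P(\theta \vert x) & \propto P(x \vert \theta) P(\theta) \\ & \propto \theta^x (1 - \theta)^{n-x} \cdot \theta^{a-1} (1 - \theta)^{b-1} \\ & = \theta^{a+x-1} (1 - \theta)^{b+n-x-1} \end{split} \end{equation}
This has the form of a beta distribution, so the posterior is $Beta(a + x, b + n - x)$.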
If we start with a uniformly distributed prior, $Beta(1, 1)$, which is a good start if we do not have any prior knowledge of $\theta$ (say the infection rate of a disease),
and we have $x = 3$ positives out of $n = 10$ (say 3 infections out of 10 samples), the posterior will be $Beta(4, 8)$, which has a peak at 0.3, the same as the maximum likelihood estimate from our sample:
If we start with a biased prior geared towards 100% infection:
and we have , the posterior will be with peak at 0.62.
By increasing the sample size, the posterior moves closer to the maximum likelihood estimate.
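Because a beta prior combined with a binomial likelihood gives another beta distribution, the update reduces to adding counts to the two parameters. The sketch below reproduces the uniform-prior case from the text and then uses a hypothetical skewed prior, Beta(10, 1), to show the peak moving towards the sample rate as the sample grows. That skewed prior is an assumption for illustration and is not necessarily the one plotted in the article.

```python
def update(a, b, positives, n):
    """Beta(a, b) prior plus `positives` out of `n` gives the Beta posterior parameters."""
    return a + positives, b + n - positives

def mode(a, b):
    """Peak of Beta(a, b); valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

# Uniform prior Beta(1, 1) with 3 infections out of 10 samples (from the text).
a, b = update(1, 1, positives=3, n=10)
print((a, b), mode(a, b))

# Hypothetical skewed prior Beta(10, 1), updated with growing samples at a 30% rate.
for n in (10, 100, 1000):
    a2, b2 = update(10, 1, positives=3 * n // 10, n=n)
    print(n, (a2, b2), round(mode(a2, b2), 3))
```

With only 10 samples the skewed prior still dominates the posterior peak; by 1000 samples the peak is close to the sample rate of 0.3.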
When we enter a new flu season, the sample size for the new flu strain is small. The error can be large if we use only this small sample to compute the infection rate. Instead, we use prior knowledge, the infection rate over the last 12 months, to form a prior. Then we use Bayes’ theorem with the prior and the likelihood to compute the posterior for the infection rate. When the sample is small, the posterior relies more on the prior, but as the sample size increases, it readjusts itself towards the new samples. Hence, Bayesian inference can give better predictions when the sample size is small, while readjusting the prediction as the sample grows.
Programming
In the code below, we use PyMC3 as the Bayesian inference engine to compute the posterior from the likelihood and the prior. Then we draw 5000 samples from the posterior distribution. Since the code is self-explanatory, we encourage you to understand the concept directly from the code.
The source code can be found here.
import numpy as np
import pymc3 as pm

# Markov chain Monte Carlo model
# Create the evidence: 4 infections out of 10 people
people_count = 10
chance_of_infection = 0.4
infection_count = int(people_count * chance_of_infection)  # Number of infections

# Create a model using pymc3
with pm.Model() as model:
    # Wrap the evidence in arrays
    people_count = np.array([people_count])
    infections = np.array([infection_count])

    # We use a beta distribution to model the prior.
    # The beta distribution takes 2 parameters.
    # For example, if both are 1, the distribution is uniform.
    # We use 10, 60, which is the prior distribution shown in the plot.
    theta_prior = pm.Beta('prior', 10, 60)

    # We model the evidence with a binomial likelihood whose rate is the prior
    observations = pm.Binomial('obs', n=people_count,
                               p=theta_prior,
                               observed=infections)

    # Use the maximum a posteriori (MAP) estimate as the starting point
    start = pm.find_MAP()

    # Draw 5000 samples from the posterior distribution
    trace = pm.sample(5000, start=start)

# Code for plotting the graph
...
Gaussian distribution
We can use Gaussian distribution for the prior and the likelihood:
Without proof, the posterior is:
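As a sketch of the standard conjugate result, assume (the notation here is chosen for illustration and may differ from the article’s) a Gaussian prior on $\theta$ with mean $\mu_0$ and variance $\sigma_0^2$, and $n$ observations $x_1, \dots, x_n$ drawn from a Gaussian with unknown mean $\theta$ and known variance $\sigma^2$:
\begin{equation} \begin{split} P(\theta) & = \mathcal{N}(\theta; \mu_0, \sigma_0^2) \\ P(x_i \vert \theta) & = \mathcal{N}(x_i; \theta, \sigma^2) \end{split} \end{equation}
Under these assumptions the posterior is again Gaussian, $P(\theta \vert x_1, \dots, x_n) = \mathcal{N}(\theta; \mu_n, \sigma_n^2)$, with
\begin{equation} \begin{split} \frac{1}{\sigma_n^2} & = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \\ \mu_n & = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma^2} \right) \end{split} \end{equation}
The posterior mean is a precision-weighted average of the prior mean and the sample mean, so with more data it moves towards the sample mean and its variance shrinks, mirroring the behavior of the beta-binomial case.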