# “Machine learning - Information theory, probability and game theory.”

### Entropy

A staff misses an all-hands meeting and ask what is he or she missing. A co-worker said nothing. It is not nothing is covered but probably just the same message over and over again. In this scenario, and the message contains no information since it is all predictable. In information theory, entropy measures the amount of information which expresses as the number of bits to encode it. In Huffman encoding, since letters does not occur randomly in a text, we can use fewer bits to it.

In information theory, information and random-ness are positively correlated. High entropy equals high randomness and more bits to encode it.

We define entropy as:

The entropy of a coin peaks when .

For a fair coin,

and therefore we can use 1 bit of 0 to represent head and 1 bit of 1 to represent tail.

So for a fair dice, . It means a fair dice gives more surprise than a fair coin.

### Cross entropy

If entropy measures the minimum of bits to encode information using the most optimized scheme. Cross entropy measures the minimum of bits to encode using the wrong optimized scheme from .

Cross entropy is defined as:

Cross entropy is encoding y with the distribution from .

### KL Divergence

KL Divergence measures the difference between 2 probability distribution as

Diagram source Wikipedia.

Recall:

Compute cross entropy:

So cross entropy is the sum of entropy and KL-divergence.

### Maximum Likelihood Estimation

**Maximum Likelihood Estimation (MLE) is the same as minimize KL Divergence.**

which is the ground truth.

### Gaussian distribution/Normal distribution

### Bernoulli distributions

### Binomial distributions

Source: wiki

The Gaussian distribution is the limiting case for the binomial distribution with:

### Poisson Distribution

Assuming a rare event with an event rate , the probability of observing x events within an interval is:

Example: If there were 2 earthquakes per 10 years, what is the chance of having 3 earth quakes in 20 years.

Given:

Proof:

### Beta distribution

The definition of a beta distribution is:

For discret variable, the beta function is defined as:

For continuos variable, the beta function is:

Here are the beta distribution for different values of a and b. For , the probability is uniformly distributed:

For :

For :

For :

For :

### Exponential and Laplace distribution

which is 1 if . Otherwise 0.

Both have a sharp point at 0 which sometimes used for machine learning.

### Probabilities

Basic:

Given 2 events are independent:

#### Probability Mass Functions with discrete variables

#### Probability Density Functions with continuous variables

Note: lower case for probability density functions

#### Marginal probability

#### Bayes’ theorem

Apply Bayes’ theorem to conditional probability

Proof:

#### Naive Bayes’ theorem

Naive Bayes’ theorem assume and are independent. i.e.

We often ignore the marginal property (the denominator) in Naive Bayes theorem because it is constant and therefore not important when we are optimizing or comparing the parameters for the model.

### Chain rule

### Expectation

### Variance and covariance

### Nash Equilibrium

In game theory, the Nash Equilibrium is when no player will change its strategy after considering all possible strategy of opponents. i.e. if we reveal every strategy of all players and no one will change their strategy for a better gain, the Nash equilibrium is reached. A game can have 0, 1 or multiple Nash Equilibria.

#### The Prisoner’s Dilemma

In the prisoner’s dilemma problem, police arrests 2 suspects but only have evidence to charger them for a lesser crime with 1 month jail time. But if one of them confess, the other party will receive a 12 months jail time and the one confess will be released. Yet, if both confess, both will receive a jail time of 6 months.

For Mary, if she thinks Peter will keep quiet, her best strategy will be confess to receive no jail time instead of 1 month.

On the other hand, if she thinks Peter will confess, her best strategy will be confess also to get 6 months jail time.

In either cases, she should confess. Similarly, Peter should confess also. Therefore (-6, -6) is the Nash Equilibrium even (-1, -1) is the least jail time combined. Why (-1, -1) is not a Nash Equilibrium? Because if Mary knows Peter will keep quiet, she can switch to confess and get a lesser sentence which Peter will response by confessing the crime also. (Providing that Peter and Mary cannot co-ordinate their strategy.)

### Jensen-Shannon Divergence

It measures how distinguishable two or more distributions are from each other.