Association rules

Support: the probability that the events in a set S all happen, $P(S)$.

Confidence: how often T happens when S happens, $P(T \mid S)$.

In association rule mining, we look for rules $S \Rightarrow T$ with both high support and high confidence, that is, with both values above chosen thresholds.

For computational reasons, we restrict the maximum number of variables considered in S and T (that is, we bound $|S|$ and $|T|$).

Apriori algorithm

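Support set pruning: to reduce unnecessary computation, we prune the support set tree. If a set S has support below the threshold, every superset of S (its children in the tree) also has support below the threshold, so those children never need to be evaluated:

  • Generate the list of all sets S of size k = 1.
  • Prune candidates whose support is below the threshold.
  • Set k = k + 1 and generate the list again without the pruned items.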
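The pruning loop above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `frequent_itemsets`, the `max_size` cap and the toy shopping baskets are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the set.
        return sum(itemset <= b for b in baskets) / n

    # k = 1: every single item is a candidate.
    items = sorted({item for b in baskets for item in b})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates and k <= max_size:
        # Prune candidates whose support is below the threshold.
        scored = {c: support(c) for c in candidates}
        survivors = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(survivors)
        # k = k + 1: join the survivors, and keep a candidate only if all of
        # its (k - 1)-subsets survived, so children of pruned sets are skipped.
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in survivors for sub in combinations(c, k - 1))
        ]
    return frequent

# Toy data, made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = frequent_itemsets(transactions, min_support=0.6)
```

With this data, every single item survives the first round, but only the pair {bread, milk} survives the second, so no set of size 3 is ever counted.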

Generating rules

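For a frequent set such as $\{a, b, c\}$, we generate candidate rules from all possible subsets, for example $\{a, b\} \Rightarrow \{c\}$ and $\{a\} \Rightarrow \{b, c\}$.

Once again, we use pruning to reduce the number of evaluations.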
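As a rough sketch of the rule generation step, the snippet below splits one frequent set into every antecedent/consequent pair and keeps the rules whose confidence clears a threshold. The helper names and the baskets are hypothetical, the set is assumed to occur at least once in the data, and the extra pruning mentioned above is left out to keep the example short.

```python
from itertools import combinations

def support(itemset, baskets):
    # Fraction of baskets that contain every item in the set.
    return sum(itemset <= b for b in baskets) / len(baskets)

def rules_from_itemset(itemset, baskets, min_confidence):
    """Return (S, T, confidence) for every split of itemset into S => T."""
    itemset = frozenset(itemset)
    joint = support(itemset, baskets)            # P(S, T)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            s = frozenset(antecedent)
            t = itemset - s
            confidence = joint / support(s, baskets)   # P(T | S) = P(S, T) / P(S)
            if confidence >= min_confidence:
                rules.append((s, t, confidence))
    return rules

# Toy data, made up for illustration.
baskets = [frozenset(b) for b in (
    {"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"bread", "beer"}, {"milk"},
)]
rules = rules_from_itemset({"bread", "milk"}, baskets, min_confidence=0.6)
```

Here milk ⇒ bread is kept (confidence 2/3) while bread ⇒ milk is dropped (confidence 1/2), even though both rules come from the same set.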


With a large amount of data, we may generate a rule with high confidence even though S and T have no special correlation. For example, if $P(T)$ is very high, the corresponding $P(T \mid S)$ can be pulled above the confidence threshold simply because of how common T is.


One alternative to confidence is lift:

$$\text{lift}(S \Rightarrow T) = \frac{P(T \mid S)}{P(T)} = \frac{P(S, T)}{P(S)\,P(T)}$$

which reduces the score of a rule when $P(T)$ is high.
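A small numeric check may help here. The sketch below computes lift as $P(S, T) / (P(S) P(T))$ on made-up baskets in which milk appears in every basket; the function name `lift` and the data are assumptions for illustration only.

```python
def lift(baskets, s, t):
    """lift(S => T) = P(S, T) / (P(S) * P(T)), estimated from the baskets."""
    n = len(baskets)
    s, t = frozenset(s), frozenset(t)
    p_s = sum(s <= b for b in baskets) / n
    p_t = sum(t <= b for b in baskets) / n
    p_st = sum((s | t) <= b for b in baskets) / n
    return p_st / (p_s * p_t)

# Toy data: milk is in every basket, so any rule "=> milk" has confidence 1.
baskets = [frozenset(b) for b in (
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "eggs"},
    {"milk"}, {"milk", "bread", "eggs"},
)]

bread_to_milk = lift(baskets, {"bread"}, {"milk"})
eggs_to_bread = lift(baskets, {"eggs"}, {"bread"})
```

The rule bread ⇒ milk has confidence 1 but a lift of exactly 1, which says that buying bread tells us nothing extra about milk. A lift below 1, as for eggs ⇒ bread in this data, suggests the two appear together less often than independence would predict.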


We build a user-product matrix containing information on which products users purchase.

Each row of this matrix records what one user purchased.

Each column of this matrix records who purchased one product.

Column j of the matrix is the list of users that purchased product j. We can apply clustering to group products (columns) together, under the assumption that products are similar if they are purchased by similar users.

We can apply the association rules to make product recommendations.

Note: the columns here are the same as the bags of customers in the Amazon recommendation algorithm.

Amazon recommendation algorithm

We need a scalable solution to handle a large number of users and products. First, we can limit the size of S and T to 1. Then we compute the similarity of two products as follows:

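  • Use a bag of customers to represent who bought each item:

              User 1   User 2   User N
    Item i    1        0        1
    Item j    0        0        1

  • Measure the similarity of two items by the cosine similarity of their bags $x_i$ and $x_j$:

$$\text{sim}(i, j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$$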

Then we can recommend products by finding the k nearest neighbors of an item, using the cosine similarity as the distance measure.
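The item-to-item lookup can be sketched with NumPy as below. The purchase matrix is invented for illustration, the function names are hypothetical, and a brute-force similarity matrix like this one is only practical for small catalogs.

```python
import numpy as np

# Rows are items, columns are users (1 = the user bought the item).
# The purchase data is made up for illustration.
bags = np.array([
    [1, 0, 1, 1, 0],   # item 0
    [1, 0, 1, 0, 0],   # item 1
    [0, 1, 0, 0, 1],   # item 2
    [0, 1, 0, 1, 1],   # item 3
], dtype=float)

def cosine_similarity_matrix(x):
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)   # guard against items nobody bought
    return unit @ unit.T

def most_similar(item, x, k=2):
    """Indices of the k items whose bags are closest to the given item's bag."""
    sims = cosine_similarity_matrix(x)[item].copy()
    sims[item] = -np.inf                     # never recommend the item itself
    return np.argsort(-sims)[:k]

neighbors_of_item_0 = most_similar(0, bags)
```

For item 0, the nearest neighbor is item 1 (they share two buyers), followed by item 3 (one shared buyer).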

Content-based filtering

In movie rating, content-based filtering is a supervised learning approach: we extract the genre of a movie and predict how a person may rate the movie based on its genre. For example, we may describe the genre of a movie as a vector of scores, one per genre such as romance, action and SciFi.

We apply supervised learning to learn the genre of a movie, say from its marketing material. For example, a romantic movie would get a high romance score and low scores for the other genres.

Then we can learn how a person rates a movie based on its genre. For example, a person may have a 0.9 chance of giving high marks to action or SciFi movies but no chance for romance movies. With the genres ordered as (romance, action, SciFi), the preference for this person will be $(0, 0.9, 0.9)$.

Content-based filtering builds a model to predict a rating or recommendation given the genre preferences of a person and the genre features of a movie.
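One simple way to combine the two vectors is an inner product, sketched below. The genre order (romance, action, SciFi), the movie names, their genre scores and the choice of an inner product are all assumptions for illustration; a real model would learn both vectors from labeled data.

```python
import numpy as np

# Genre order assumed for this sketch: (romance, action, SciFi).
movie_features = {
    "space_battle": np.array([0.0, 0.8, 0.9]),   # hypothetical genre scores
    "love_story": np.array([0.9, 0.1, 0.0]),
}

# The person from the example: likes action and SciFi, not romance.
preference = np.array([0.0, 0.9, 0.9])

def predicted_score(person, movie):
    """Inner product of a preference vector and a genre vector."""
    return float(person @ movie)

scores = {name: predicted_score(preference, x) for name, x in movie_features.items()}
best = max(scores, key=scores.get)
```

The action and SciFi heavy movie scores far higher than the romance for this person, which matches the stated preference.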

Content-based filtering looks simple but is very hard in practice. Collecting labels for the training data is hard. Using only the genre to classify a movie may oversimplify why a person likes a movie. We can add more features, but deciding what to add is hard.

Collaborative Filtering

Collaborative filtering is an unsupervised learning approach in which we make predictions from ratings supplied by people. In the rating matrix, each row holds one person's ratings of movies and each column holds the ratings given to one movie.

In content-based filtering, we define the feature set ourselves, and the rating is computed from a person's preferences over those features and a movie's feature values.

In collaborative filtering, we do not know the feature set beforehand. Instead, we try to learn it. Just as in handwritten digit recognition on MNIST, we do not know what features to extract at the beginning, but eventually the program learns those latent features (edges, corners, circles) itself.

So let us say the latent features that the program learns for a person are $u$ and those for a movie are $v$. The predicted rating for the movie will be the inner product $u^T v$.

Low rank matrix factorization

We are given a matrix in which the rows represent people and the columns represent products. Low rank matrix factorization means we decompose this matrix into two lower rank matrices: one representing the latent features of people and the other representing the latent features of products.

In collaborative filtering:

  • Decide the number of latent features to learn, i.e. the dimension of the latent vectors for people and products. More latent features let us build a more complex model, but it will be harder to train.
  • Collect data
    • User data: User ID, Movie ID, Rating
  • Normalize the ratings to be zero centered
  • Define the cost function as the mean squared error (MSE) with L2-regularization
  • Optimize the latent features of people and products using gradient descent
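The steps above can be sketched with plain NumPy as follows. This is a minimal full-batch gradient descent under assumed hyperparameters (`k`, `lr`, `reg`, `epochs`) and a tiny synthetic rating set; the function name `factorize` and all the numbers are illustrative, not values from any reference implementation.

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, lr=0.2, reg=0.002, epochs=1000, seed=0):
    """ratings: list of (user_id, item_id, rating). Returns (U, V, mean)."""
    rng = np.random.default_rng(seed)
    users = np.array([r[0] for r in ratings])
    items = np.array([r[1] for r in ratings])
    y = np.array([r[2] for r in ratings], dtype=float)
    mean = y.mean()
    y = y - mean                                   # zero-center the ratings
    U = 0.1 * rng.standard_normal((n_users, k))    # latent features of people
    V = 0.1 * rng.standard_normal((n_items, k))    # latent features of products
    m = len(y)
    for _ in range(epochs):
        err = np.sum(U[users] * V[items], axis=1) - y   # residual per rating
        grad_U = np.zeros_like(U)
        grad_V = np.zeros_like(V)
        # Accumulate the gradient of 0.5 * squared error for each rating.
        np.add.at(grad_U, users, err[:, None] * V[items])
        np.add.at(grad_V, items, err[:, None] * U[users])
        # Mean over ratings plus the L2 penalty, applied to both factors
        # simultaneously.
        U, V = U - lr * (grad_U / m + reg * U), V - lr * (grad_V / m + reg * V)
    return U, V, mean

# Synthetic data: users 0-1 like items 0-1, users 2-3 like items 2-3.
held_out = {(0, 1), (1, 2), (2, 3), (3, 0)}
ratings = [(u, i, 5.0 if (u < 2) == (i < 2) else 1.0)
           for u in range(4) for i in range(4) if (u, i) not in held_out]

U, V, mean = factorize(ratings, n_users=4, n_items=4)
predict = lambda u, i: float(U[u] @ V[i] + mean)
```

After training, `predict(u, i)` adds the stored mean back to the inner product of the two latent vectors, so it returns a rating on the original scale.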

To find similar products, we can measure the similarity between the latent features of two products, for example with the cosine similarity between the two latent vectors.


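We do not need to assume the ratings are zero centered. We can add a general bias, a bias for the user and a bias for the product to the calculation. Writing $\mu$ for the general bias, $b_i$ for the bias of user $i$, $c_j$ for the bias of product $j$, and $u_i$, $v_j$ for their latent features, the predicted rating becomes:

$$\hat{r}_{ij} = \mu + b_i + c_j + u_i^T v_j$$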

Hybrid approach (SVDFeature)

We can also adopt a hybrid approach combining both collaborative filtering and content-based filtering. The rating is calculated by combining the predictions from both methods.

Explicit vs. implicit feedback

One common approach to collaborative filtering treats the entries in the user-product matrix as explicit preferences given by the user to a product, for example, users' ratings of products. Alternatively, implicit feedback (views, clicks, shares, etc.) is often more widely available. For example, Spark MLlib can model collaborative filtering with either explicit or implicit feedback. With implicit feedback, MLlib treats the data as the strength of user actions (such as the number of clicks). These numbers are then related to the level of confidence in the observed user preferences, rather than being treated as explicit ratings.

The documentation for building a collaborative filtering model with MLlib, using explicit or implicit feedback, can be found in []

Other recommendation considerations

  • New content
  • We may want to suggest something new or different
  • How long should we keep showing a recommendation when the user does or does not show interest
  • Give editorial recommendations
  • Get recommendations from friends or communities


Probability ratio for logistic regression

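Logistic regression

With labels $y_i \in \{-1, +1\}$, logistic regression models $p(y_i \mid x_i, w) = \frac{1}{1 + \exp(-y_i w^T x_i)}$.

Probability ratio for predicting the correct label $y_i$ over the wrong label $-y_i$:

$$\frac{p(y_i \mid x_i, w)}{p(-y_i \mid x_i, w)} = \frac{1 + \exp(y_i w^T x_i)}{1 + \exp(-y_i w^T x_i)} = \exp(y_i w^T x_i)$$

We want this ratio to be at least some constant $\beta > 1$. Taking logarithms and choosing $\beta = e$ gives the constraint $y_i w^T x_i \ge 1$. We can define a loss function by the amount of constraint violation:

$$f(w) = \sum_i \max\{0, 1 - y_i w^T x_i\}$$

Note: this is the hinge loss, and we have derived it from the perspective of probability ratio and constraint violation.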
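To make the "amount of constraint violation" concrete, the sketch below evaluates the standard hinge loss $\sum_i \max\{0, 1 - y_i w^T x_i\}$ on three made-up points. The function name and the data are assumptions for illustration.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Sum of max(0, 1 - y_i * w^T x_i), with labels y_i in {-1, +1}."""
    margins = y * (X @ w)
    return float(np.sum(np.maximum(0.0, 1.0 - margins)))

# Three positive examples with margins 2, 0.5 and -1 under this w.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 0.0])

loss = hinge_loss(w, X, y)
```

The first point satisfies the constraint and contributes nothing, the second is on the correct side but inside the margin and contributes 0.5, and the misclassified third point contributes 2.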

Relevance Ranking

We want to rank the relevance of candidate items based on a query.

We can create a cost function that sums all constraint violations.

Or a cost function that penalizes only the highest-scoring alternative.

Pairwise ranking

We can rank two items against each other using a probability ratio.

We can create a cost function based on all constraint violations, summed over the pairs whose order we know.


We can expand our approach beyond linear regression:

  • Define constraint based on probability ratio
  • Minimize violation of logarithm of constraint

For pairwise relevance
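A pairwise cost of this kind can be sketched as a hinge penalty on score differences: for every pair where item i should rank above item j, we penalize $\max\{0, 1 - (w^T x_i - w^T x_j)\}$. This is one common form and not necessarily the exact cost intended here; the linear scoring function, the function name and the data are all assumptions.

```python
import numpy as np

def pairwise_hinge_loss(w, X, preferred_pairs):
    """preferred_pairs holds (i, j) tuples meaning item i should outrank item j."""
    scores = X @ w
    return float(sum(
        max(0.0, 1.0 - (scores[i] - scores[j]))   # violation of score_i >= score_j + 1
        for i, j in preferred_pairs
    ))

# Made-up item features; with this w the scores are 2, 0 and 1.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
w = np.array([2.0, 0.0])

# Item 0 should beat item 1 (satisfied); item 2 should beat item 0 (violated).
loss = pairwise_hinge_loss(w, X, [(0, 1), (2, 0)])
```

Only the second pair contributes, because item 2 scores lower than item 0 although it was supposed to rank higher.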


Random walk view

  • Start at a random webpage
  • Follow a random link in each iteration
  • PageRank is the probability of landing on a page as the number of iterations goes to infinity
  • The random walk may get stuck in part of the graph or never reach some webpages

Damped PageRank algorithm

  • Start at a random webpage.
  • With a fixed probability $\alpha$, go to a random webpage. Otherwise, follow a random link on the page.
  • Keep iterating and compute the probability of landing on each page.
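The damped random walk can be simulated by repeatedly updating the landing probabilities, as in the sketch below. The jump probability `alpha=0.15`, the treatment of dead-end pages and the four-page link graph are assumptions chosen for illustration.

```python
import numpy as np

def damped_pagerank(links, alpha=0.15, iterations=100):
    """links[i] lists the pages that page i links to; alpha is the jump probability."""
    n = len(links)
    # Column-stochastic matrix: M[j, i] = P(next page is j | current page is i).
    M = np.zeros((n, n))
    for i, outgoing in enumerate(links):
        if outgoing:
            M[outgoing, i] = 1.0 / len(outgoing)
        else:
            M[:, i] = 1.0 / n          # assumption: a dead-end page jumps anywhere
    rank = np.full(n, 1.0 / n)         # start at a random webpage
    for _ in range(iterations):
        # With probability alpha jump to a random page, otherwise follow a link.
        rank = alpha / n + (1.0 - alpha) * (M @ rank)
    return rank

# Hypothetical link graph: page 3 links out but nobody links to it.
links = [[1, 2], [2], [0], [2]]
rank = damped_pagerank(links)
```

Page 3 has no incoming links, so without damping the walk would never return to it; with damping it keeps the small probability `alpha / n` that comes from random jumps alone.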