Principal Component Analysis (PCA)

PCA is a linear model in mapping d-dimensional input features to k-dimensional latent factors (k principal components).


Dimension reduction can be used for data compression or visualization.


For example, here is the to match a RGB pixel to one of the 4 color palete:

In our example, we have 4 principal components/latent factors.

The value for is reconstructed as.

For d = 2 and k = 1, 2D features (purple dots) are projected onto the 1D blue line.

PCA selects a projection that can maximize the variance of their output. Hence, PCA will pick the blue line over the green line if it has a higher variance.

Matrix factorization

PCA can formulated as an approximation to the matrix factorization.

PCA Cost Function

We want to minimize the MSE for and the corresponding value for the latent variable . :


Solving W

First, we need to perform feature scaling on input features :

is the ith training datapoints.

which are the mean and standard deviation for the feature . For an image, they are the means and standard deviations of each pixel. For a 100x100x3 image, we will have 30,000 .

PCA is based on unsupervised learning. We want to optimize the latent factors and latent variables for the cost function

Gradient descent

One of the method to solve PCA is to use Gradient descent to optimize the trainable parameters and with the cost function above.

Alternating minimization:

Alternating minimization is another method to find a solution for PCA.

  • Optimize ‘W’ with ‘Z’ fixed
  • Optimize ‘Z’ with ‘W’ fixed
  • Keep repeating

Singular value decomposition (SVD)

Both methods above solve the PCA using empirical method. SVD solves the PCA analytically. Before discussing it in details, we discuss the Singular value decomposition first (SVD). SVD decompose a matrix into 3 matrice as:

The matrix and is later used to transfrom to in PCA.

SVD consists of

  • Finding the eigenvalues and eigenvectors of and
  • The eigenvectors of make up the columns of U
  • The eigenvectors of make up the columns of V
  • The singular values in S are square roots of eigenvalues from or

Let’s go through an example:

The eigenvector and eigenvalue of is defined as:

Now solving:

The eigenvalues are:

We always sort lambda in the descending order. The kth highest eigenvectors will be used for .


which is the first column of .


which is the second column of .

We can skip all the eigenvalues = 0. Hence:

Similarly, we calculate to find which is:

The singular values in S are square roots of eigenvalues from or :

The sample above is originated from []

Now we apply SVD to solve PCA. First we compute the covariance matrix with our n-Dimensional training datapoints .

is a nxn matrix

Apply SVD to decompose to :

Which have the dimension of:

We only take the first k columns:

To transform to :

Let’s check the dimensionality again:

To convert to :

Predicting Z

Given and testing data , is computed by first subtract the training mean from the features and then calculate .

Choosing the number of latent factors k

PCA is about maximizing variance. Say to keep 90% of variance, we set .

Alternatively, we can start from k=1 and increment it until say .

stores the eigenvalues of the eigenvectors but also reflect how important for a particular latent factors. Hence, we can also determine k by:

W uniqueness (orthogonal)

The solution for W is not unique. But we can impose a few restrictions to solve this.

  • Set the magnitude of the principal component to 1. (: row of )
  • Each latent factors are independent of each other (cross product = 0).

We can still have the factors rotated or label switching switch to . To fix this, we can

  • First set k = 1, and solve W with the constraint above
  • Set k = 2 with the first factor set and solve for the second factor
  • Repeat until reaching our target k

The blue line is our first optimized W for . The optimal solution for is the red line orthogonal to the blue line.

With these constraints:

Solving Z becomes:


Eigenfaces apply PCA to represent a facial image with latent factors. First we compute the mean image and the top k eigenvectors (Principal components)

The image is encoded with the latent factors:

Non-negative matrix factorization (NMF):

NMP is similar to the matrix factorization methods above with the same cost functions, except that we add an additional constraint that and are non-negative.


  • and are non-negative instead of orthogonal
  • Promote sparsity
    • Avoiding postive & negative matrix elements cancelling each other
    • Brain seem to use sparse representation
    • Energy efficient
    • Increase the number of concepts that can memorize
  • Learning the parts of objects

In NMF. the latent factors are more close to individual facial features.

Credit: Daniel D. Lee: Learning the parts of objects by non-negative matrix factorization

Sparse Matrix Factorization

Sparse Matrix Factorization promotes the sparsity of .

Here is the plot of J with an optimized smaller than 0.

With the constraint of , the optimized cost is now at (promote sparsity).

We are going to use Gradient descent to train and reset the value to 0 if it is negative:


We can also use L1 regularization to control sparsity:

Here is the visualization of latent factors with different techniques:

Credit: Julien Mairal etc… Online Learning for Matrix Factorization and Sparse Coding

Regularized Matrix Factorization

Instead of forcing orthogonality, we can add a L2 regularization cost to control how is optimized.

Latent Factor Model using logistic loss

We can also use logistic loss for our cost function

Robust PCA

To reduce the effects of outlier, Robust PCA switch to a L1-norm in calculating the errors.