### Unsupervised learning

Unsupervised learning tries to uncover the grouping or the latent structure of the input data. In contrast to supervised learning, an unsupervised training dataset contains input data but no labels.

Examples of unsupervised learning:

Clustering

• K-means
• Density-based clustering (DBSCAN)

Gaussian mixture models

• Expectation–maximization algorithm (EM)

Latent factor model/Matrix factorization

• Principal component analysis
• Singular value decomposition
• Non-negative matrix factorization

Manifold/Visualization

• MDS, Isomap, t-SNE

Anomaly detection

• Gaussian model
• Clustering

Deep Networks

• Restricted Boltzmann machines

Self-organizing map

Association rules
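To make the clustering entry in the list above concrete, here is a minimal NumPy sketch of K-means (Lloyd's algorithm). The function name `kmeans`, the random initialization from datapoints, and the toy data are assumptions made for illustration; note that no labels are passed in.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct datapoints (an assumption; other schemes exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if its cluster is empty.
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids

# Two well-separated blobs; only the inputs are given, no labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
assign, centroids = kmeans(X, k=2)
```

On data like this, the two recovered groups should coincide with the two blobs, although which blob gets index 0 or 1 is arbitrary.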

### Semi-supervised learning

In semi-supervised learning, we start by building a model from the labeled training data. Because labeling data is expensive, we then use that model to assign labels to unlabeled data and combine all the data to refit a better model. The purpose of semi-supervised learning is to augment the training data using a model built from the labeled data, so we need far less labeled data than in supervised training.

#### Self-training

• Use clustering to fit a model with the labeled training dataset
• Locate the unlabeled datapoints relative to the clusters using the model
• Ignore unlabeled datapoints that are far from every cluster
• Label the remaining unlabeled datapoints
• Combine the labeled and the newly labeled datapoints to refit the model

However, we do not need to put the same weight on the labeled and the unlabeled datapoints. The following cost function uses $$\lambda$$ to lower the error cost of the $$m$$ newly labeled datapoints ($$\hat{x}$$ with predicted labels $$\hat{y}$$) relative to the $$n$$ labeled ones when refitting the model.

$J(w) = \sum^n_{i=1} \log{(1+e^{-y_i w^T x_i})} + \lambda \sum^m_{i=1} \log{(1+e^{-\hat{y}_i w^T \hat{x}_i})}$
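The weighted cost above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: labels are taken to be in $$\{-1, +1\}$$ (as the form of the loss implies), the model is fit by plain gradient descent, and the helper names, the confidence threshold of 0.4, and the toy data are all hypothetical.

```python
import numpy as np

def weighted_loss(w, X, y, X_hat, y_hat, lam):
    """J(w): logistic loss on labeled data plus lam times the loss on pseudo-labeled data."""
    labeled = np.logaddexp(0.0, -y * (X @ w)).sum()
    pseudo = np.logaddexp(0.0, -y_hat * (X_hat @ w)).sum()
    return labeled + lam * pseudo

def weighted_grad(w, X, y, X_hat, y_hat, lam):
    """Gradient of J(w) with respect to w."""
    def grad_term(A, b):
        z = b * (A @ w)
        s = np.exp(-np.logaddexp(0.0, z))        # sigmoid(-z), computed stably
        return -(A * (b * s)[:, None]).sum(axis=0)
    return grad_term(X, y) + lam * grad_term(X_hat, y_hat)

def fit(X, y, X_hat, y_hat, lam, lr=0.1, n_steps=500):
    """Minimise J(w) by fixed-step gradient descent (step size chosen for this toy data)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= lr * weighted_grad(w, X, y, X_hat, y_hat, lam)
    return w

def with_bias(x):
    """Turn a 1-D feature into rows of [x, 1] so the model has a bias term."""
    return np.column_stack([x, np.ones_like(x)])

X = with_bias(np.array([-2.0, -1.0, 1.0, 2.0]))
y = np.array([-1.0, -1.0, 1.0, 1.0])
X_unl = with_bias(np.array([-3.0, -1.5, 0.0, 1.5, 3.0]))

# Step 1: fit on the labeled data only (empty pseudo-labeled set).
w0 = fit(X, y, np.empty((0, 2)), np.empty(0), lam=0.0)

# Step 2: pseudo-label only the unlabeled points the model is confident about.
p = 1.0 / (1.0 + np.exp(-(X_unl @ w0)))
keep = np.abs(p - 0.5) > 0.4                     # hypothetical confidence threshold
X_hat, y_hat = X_unl[keep], np.where(p[keep] > 0.5, 1.0, -1.0)

# Step 3: refit on everything, down-weighting the pseudo-labeled points.
w1 = fit(X, y, X_hat, y_hat, lam=0.3)
```

Setting `lam` below 1 means a wrong pseudo-label costs less than a wrong true label, which limits the damage when the first model mislabels a point.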

#### Co-training

We may have two sets of features ($$x_{1s}, x_{2s}$$) that are conditionally independent given the class ($$Y$$). Each feature set provides different and complementary information.

$P(x_{1s}, x_{2s} \vert Y) = P(x_{1s} \vert Y) P(x_{2s} \vert Y)$

For example, we may have the spectrogram of a voice clip and its closed captions. Can we use them to solve a classification problem, say identifying the age group of the speaker?

• Using the labeled data, fit one model with feature set 1 and a second model with feature set 2.
• Label a subset of the unlabeled data separately using model 1 and model 2.
• For datapoints that either model labels with high confidence, move them with their predicted labels into the training dataset and refit the models.

Co-training works best when one classifier correctly labels a datapoint that the other misclassifies. If both classifiers agree on all the unlabeled data, labeling the data does not create new information. We hope that refitting one model with the high-confidence data from the other model gives it extra information that a standalone model would not have. Co-training performs worse as the dependence between the classifiers increases.
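The co-training steps above can be sketched as follows. Everything specific here is an assumption for illustration: the two views are 1-D synthetic features, each model is a tiny logistic regression written in NumPy, and `threshold=0.9` is a hypothetical confidence cut-off. When both models are confident, this sketch takes the label from the more confident one.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, n_steps=500):
    """Tiny logistic regression (labels in {0, 1}) trained by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        grad = Xb.T @ (p - y) / len(y)
        w -= lr * grad
    return w

def predict_proba(w, X):
    """Probability of class 1 under the fitted model."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))

def co_train(X1, X2, y, U1, U2, n_rounds=5, threshold=0.9):
    """X1/X2: the two views of the labeled data; U1/U2: the same views of the unlabeled pool."""
    X1, X2, y = X1.copy(), X2.copy(), y.astype(float)
    pool = np.arange(len(U1))                     # indices that are still unlabeled
    added_idx, added_labels = [], []
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        w1, w2 = fit_logreg(X1, y), fit_logreg(X2, y)
        p1, p2 = predict_proba(w1, U1[pool]), predict_proba(w2, U2[pool])
        c1, c2 = np.maximum(p1, 1 - p1), np.maximum(p2, 1 - p2)
        best_p = np.where(c1 >= c2, p1, p2)       # trust the more confident view
        accept = np.maximum(c1, c2) >= threshold  # high confidence in either model
        if not accept.any():
            break
        new_labels = (best_p[accept] >= 0.5).astype(float)
        # Move the accepted points, with predicted labels, into the training set.
        X1 = np.vstack([X1, U1[pool[accept]]])
        X2 = np.vstack([X2, U2[pool[accept]]])
        y = np.concatenate([y, new_labels])
        added_idx.extend(pool[accept].tolist())
        added_labels.extend(new_labels.tolist())
        pool = pool[~accept]
    return fit_logreg(X1, y), fit_logreg(X2, y), np.array(added_idx), np.array(added_labels)

rng = np.random.default_rng(0)

def make_views(n):
    """Two noisy 1-D views that are independent given the class (synthetic data)."""
    labels = np.arange(n) % 2
    centers = np.where(labels == 1, 2.0, -2.0)
    v1 = (centers + rng.normal(size=n))[:, None]
    v2 = (centers + rng.normal(size=n))[:, None]
    return v1, v2, labels

X1, X2, y = make_views(10)
U1, U2, y_true = make_views(100)
w1, w2, idx, pseudo = co_train(X1, X2, y, U1, U2)
```

Because the two views have independent noise given the class, a point that is ambiguous in one view is often clear in the other, which is where the extra information comes from.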

#### Entropy regularization

Entropy is a measure of randomness. We want to use the labeled data to create a model under which the entropy of the predictions on the unlabeled data is as low as possible, i.e. the resulting model should have high certainty in predicting the labels of the unlabeled data.
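One way to express this idea, sketched here under assumptions, is to add the total prediction entropy on the unlabeled data to the labeled loss. The logistic model, labels in $$\{-1, +1\}$$, the weight `lam`, and the two candidate weight vectors are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def entropy_regularized_loss(w, X, y, X_unl, lam):
    """Labeled logistic loss plus lam times the prediction entropy on unlabeled data."""
    labeled = np.logaddexp(0.0, -y * (X @ w)).sum()
    p_unl = 1.0 / (1.0 + np.exp(-(X_unl @ w)))
    return labeled + lam * binary_entropy(p_unl).sum()

def with_bias(x):
    """Turn a 1-D feature into rows of [x, 1]."""
    return np.column_stack([x, np.ones_like(x)])

X = with_bias(np.array([-3.0, 3.0]))
y = np.array([-1.0, 1.0])
# Unlabeled points form two groups with a gap between x = -2 and x = 0.5.
X_unl = with_bias(np.array([-3.0, -2.5, -2.0, 0.5, 1.0, 1.5]))

w_gap = np.array([4.0, 3.0])    # decision boundary at x = -0.75, inside the gap
w_cut = np.array([4.0, -4.0])   # decision boundary at x = 1, through the right group

loss_gap = entropy_regularized_loss(w_gap, X, y, X_unl, lam=1.0)
loss_cut = entropy_regularized_loss(w_cut, X, y, X_unl, lam=1.0)
```

Both candidates classify the two labeled points correctly, but the boundary that passes through a group of unlabeled points leaves the model uncertain about them, so its entropy term and total loss are higher.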

#### Graph based

In clustering, DBSCAN connects neighboring high-density points together to form a cluster. In semi-supervised learning, we want to connect the labeled data with the unlabeled data using density.

For example, we have labeled data $$(y_1, \cdots, y_n)$$ and unlabeled data $$(y_{n+1}, \cdots, y_{n+m})$$.

The cost function is:

$J(\hat{y}) = \frac{1}{2} \sum^n_{i=1} \sum^{n+m}_{j=n+1} w_{ij} (y_i - \hat{y}_j)^2 + \frac{1}{2} \sum^{n+m}_{i=n+1} \sum^{n+m}_{j=n+1} w_{ij}(\hat{y}_i - \hat{y}_j)^2$

We optimize this cost with respect to $$\hat{y}$$. The first term in the cost function tries to cluster similar labeled and unlabeled data, while the second term tries to cluster similar unlabeled data.

$$w_{ij}$$ is the graph weight between datapoints:
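The formula for $$w_{ij}$$ is not given here, so the sketch below assumes a Gaussian (RBF) kernel on pairwise distances, which is one common choice among several. It evaluates $$J(\hat{y})$$ as written above and minimizes it by setting the gradient to zero, which for symmetric weights gives a linear system in $$\hat{y}$$. The function names, `sigma`, and the toy data are hypothetical.

```python
import numpy as np

def rbf_weights(X, sigma=1.0):
    """Assumed weights: w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero on the diagonal."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def graph_cost(y_lab, y_hat, W):
    """J(y_hat) as written above; the first n rows/columns of W are the labeled points."""
    n = len(y_lab)
    W_lu, W_uu = W[:n, n:], W[n:, n:]
    term1 = 0.5 * (W_lu * (y_lab[:, None] - y_hat[None, :]) ** 2).sum()
    term2 = 0.5 * (W_uu * (y_hat[:, None] - y_hat[None, :]) ** 2).sum()
    return term1 + term2

def solve_labels(y_lab, W):
    """Minimise J by solving dJ/dy_hat = 0, which for symmetric W reads
    (D_lu + 2 (D_uu - W_uu)) y_hat = W_ul y_lab."""
    n = len(y_lab)
    W_ul, W_uu = W[n:, :n], W[n:, n:]
    A = np.diag(W_ul.sum(axis=1)) + 2.0 * (np.diag(W_uu.sum(axis=1)) - W_uu)
    return np.linalg.solve(A, W_ul @ y_lab)

# Two labeled points (labels 0 and 1) and four unlabeled points on a line.
X_lab = np.array([[0.0], [10.0]])
y_lab = np.array([0.0, 1.0])
X_unl = np.array([[1.0], [2.0], [8.0], [9.0]])

W = rbf_weights(np.vstack([X_lab, X_unl]), sigma=1.5)
y_hat = solve_labels(y_lab, W)
hard_labels = (y_hat > 0.5).astype(int)   # threshold the soft labels
```

The solution $$\hat{y}$$ is real-valued; thresholding it at 0.5 is one simple way to turn the soft values into class labels.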

#### Markov chain based

We can label data using a random walk with a Markov chain model.

• Start at an unlabeled state
• Move to the next state based on the transition probability $$w_{ij}$$
• If the next state has a label, mark the unlabeled datapoint with this label
• Otherwise, continue the random walk
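The random-walk steps above can be sketched as follows. Two things here are assumptions rather than part of the description: the weights are row-normalized so that each row is a valid probability distribution, and the walk is repeated with a majority vote (`label_by_votes`) because a single walk is noisy. The chain graph and label names are made up for the example.

```python
from collections import Counter

import numpy as np

def random_walk_label(start, W, labels, rng, max_steps=1000):
    """Walk from an unlabeled node until a labeled node is hit and return its label.
    labels[i] is None for unlabeled nodes; returns None if max_steps runs out.
    Assumes every node has at least one edge, so each row of W has a positive sum."""
    P = W / W.sum(axis=1, keepdims=True)   # row-normalise weights into transition probabilities
    state = start
    for _ in range(max_steps):
        state = rng.choice(len(W), p=P[state])
        if labels[state] is not None:
            return labels[state]
    return None

def label_by_votes(start, W, labels, n_walks=200, seed=0):
    """Repeat the walk and take the most frequent label (an added assumption)."""
    rng = np.random.default_rng(seed)
    votes = Counter(random_walk_label(start, W, labels, rng) for _ in range(n_walks))
    votes.pop(None, None)                  # discard walks that never reached a label
    return votes.most_common(1)[0][0] if votes else None

# Chain graph 0 - 1 - 2 - 3 - 4 with unit weights; only the two end nodes are labeled.
W = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
labels = ["A", None, None, None, "B"]

label_1 = label_by_votes(1, W, labels)
label_3 = label_by_votes(3, W, labels)
```

A node tends to receive the label of the labeled node that its walks reach first, so nodes that are strongly connected to a labeled node usually inherit its label.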