What is Unsupervised Machine Learning?
Unsupervised machine learning refers to algorithms that are trained on datasets where the labels or classes are unknown.
For a better understanding, imagine that our input training data contains a variety of fruits. The machine does not know what they are; based on similarities such as their colour and shape, it groups them into different categories, as shown in the figure above. Unsupervised learning is basically used to find the structure of a given dataset.
Types of Unsupervised Machine Learning Algorithms
Clustering: Clustering is a type of unsupervised machine learning algorithm. As the name suggests, it works by grouping the dataset, and every group contains similar observations.
Types of Clustering:
- Centroid-based Model
- Density-based Model
- Distribution-based Model
- Connectivity-based Model
Centroid-based model
- Centroid-based clustering organises the data into non-hierarchical clusters.
- k-means is the most widely used centroid-based clustering algorithm.
K-means:
The main aim of the k-means algorithm is to find the centre of each group of data, where k refers to the number of clusters.
Based on Euclidean distance, it assigns each point to the cluster whose centre is nearest.
It then recalculates the centroid of each cluster, and the assignment and update steps are repeated until the centroids stop moving.
The value of k can be chosen by the elbow method.
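As a rough, minimal sketch of these steps, here is how k-means could be run with scikit-learn's KMeans on a toy dataset (the dataset and the choice k = 3 are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs    # toy data, stands in for your own dataset

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters=3 is an illustrative choice of k; see the elbow method below.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the centroid of each cluster
print(kmeans.labels_[:10])       # cluster assignment of the first ten points
```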
ELBOW method
For various values of k, the inertia (the sum of squared distances from each point to its nearest centroid) is computed and plotted against k. As k increases, each cluster contains fewer elements, so the points lie closer to their centroid and the inertia decreases. At some point the rate of decrease drops sharply; that point is known as the elbow point, and the corresponding k is chosen.
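A possible sketch of the elbow method, assuming a toy dataset and a candidate range of k values, using KMeans inertia from scikit-learn:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs    # toy data, stands in for your own dataset

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias = []
k_values = range(1, 11)                    # candidate values of k to try (an assumption)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)           # sum of squared distances to the nearest centroid

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow method: choose k at the bend of the curve")
plt.show()
```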
Advantages
- It works well on large datasets.
- It is fast compared to many other clustering algorithms.
Disadvantages
- For text data, it is very difficult to define meaningful centroids.
- It is sensitive to outliers.
- The value of k should be chosen carefully, as every value of k gives different centroids.
Applications of K-means clustering
- Character recognition
- Biometrics
- Diagnostic systems
- Military applications
Density-based model
In density-based clustering, clusters are detected from the density of the data points. Density-based spatial clustering of applications with noise (DBSCAN) is an example of a density-based clustering algorithm. Unlike k-means, it does not require the number of clusters in advance, and it can form clusters of arbitrary shape.
First, we should know about ε and minPts.
ε defines the radius of the neighbourhood around each data point. If the distance between two points is at most ε, they are considered neighbours. The ε value should be chosen carefully: if it is too small, many points will look like outliers; if it is too large, separate clusters will be merged into one. Hence the ε value is usually chosen from the k-distance graph.
MinPts is the minimum number of data points required inside the ε-neighbourhood for a point to be considered dense. The larger the dataset, the larger minPts should be. A common rule of thumb is MinPts ≥ D + 1, where D is the number of dimensions of the dataset.
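As a sketch of how ε might be read off a k-distance graph (the toy dataset and the minPts value are assumptions; scikit-learn's NearestNeighbors is used here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons          # toy data, stands in for your own X
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.07, random_state=0)

min_pts = 4                                      # assumed via MinPts >= D + 1 with D = 2
neighbors = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = neighbors.kneighbors(X)           # distances to the min_pts nearest neighbours

k_distances = np.sort(distances[:, -1])          # farthest of those, sorted to expose the knee
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {min_pts}-th nearest neighbour")
plt.title("k-distance graph: pick eps near the knee of the curve")
plt.show()
```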
Core Points
Core points are the building blocks from which clusters are formed, based on a density approximation. Since the same ε is used for every point, the volume of each neighbourhood is the same, but the mass (the number of points inside it) varies. A point whose ε-neighbourhood contains at least minPts points is a core point, so minPts acts as a minimum density threshold that can be tuned to control how dense the clusters must be.
Border Points
Points in the dataset that are not core points may still belong to a cluster as border points. A border point is a point that has fewer than minPts points inside its ε-neighbourhood but lies within the ε-neighbourhood of a core point.
DBSCAN algorithm
- Randomly choose an unvisited point and compute its ε-neighbourhood. If the point is a core point, begin a cluster around it; if not, mark it as an outlier for the time being.
- Whenever a core point is found, expand the cluster by adding all points that are directly reachable from it, and keep making “neighbourhood jumps” to add every density-reachable point to the cluster. If a point previously marked as an outlier is joined to the cluster in this way, reassign it as a border point.
- Repeat the above steps until every point has been visited and assigned either to a cluster or as an outlier, as in the sketch below.
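A minimal sketch of this procedure using scikit-learn's DBSCAN implementation (the toy dataset and the values of eps and min_samples are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons   # toy data, stands in for your own X

X, _ = make_moons(n_samples=500, noise=0.07, random_state=0)

# eps and min_samples are illustrative; in practice, pick eps from the k-distance graph above.
db = DBSCAN(eps=0.2, min_samples=4).fit(X)

labels = db.labels_                        # -1 marks outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```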
Advantages
- The results are easy to interpret because of the two intuitive parameters, minPts and ε.
- Unlike K-means clustering, it is not sensitive to outliers.
- Arbitrary-shaped clusters can be found by the DBSCAN algorithm.
- The single-link effect is reduced thanks to minPts.
Disadvantages
- If the clusters differ widely in density, forming clusters becomes difficult, and so does choosing suitable values of minPts and ε.
- Selecting ε becomes difficult if the data is not well understood.
- For high-dimensional data, the quality of DBSCAN varies because a suitable value of ε is hard to choose.
Applications of DBSCAN
- Scientific literature
- Satellite images
- X-ray crystallography
- Anomaly detection in temperature data
Distribution-based model
- Each cluster is assumed to be generated by data drawn from some type of distribution.
- The most commonly used method is Gaussian mixture models.
- The farther a data point is from the centre of a distribution, the lower the chance that the point belongs to that distribution.
Gaussian Mixture Models (GMMs)
Gaussian mixture models are fitted with the expectation–maximisation (EM) algorithm, which estimates the parameters of each Gaussian, namely its mean and standard deviation (covariance).
As with k-means clustering, we first select the number of clusters and randomly initialise the Gaussian distribution parameters for each cluster.
In the expectation step, we compute, for every data point, the probability that it belongs to each cluster; points closer to a cluster's centre receive a higher probability for that cluster.
In the maximisation step, we recalculate the mean and standard deviation of each Gaussian so as to increase the probability of the data points assigned to it; the new parameters are computed as a probability-weighted sum of the data point positions.
The last two steps are repeated until the parameters converge.
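A minimal sketch of fitting a Gaussian mixture model with scikit-learn's GaussianMixture (the toy dataset and the choice of three components are assumptions):

```python
from sklearn.datasets import make_blobs       # toy data, stands in for your own X
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=0)

# n_components=3 is an assumption; like k, it must be chosen for your data.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely cluster for each point
soft_labels = gmm.predict_proba(X)  # probability of each cluster per point (mixed membership)
print(gmm.means_)                   # fitted cluster centres
print(soft_labels[:3].round(3))     # soft assignments for the first three points
```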
Advantages
- Because standard deviations are used, clusters can take an elliptical shape rather than only a circular one; this makes GMMs more flexible than k-means clustering.
- Because GMMs use probabilities, a data point can be associated with more than one cluster. This mixed membership means that a point lying between two clusters can partially belong to both.
- Using a fixed number of Gaussian distributions helps to limit overfitting.
Disadvantages
- Distribution-based clustering produces complex cluster models, which can make the results difficult for the user to interpret.
Connectivity-based model
- Hierarchical clustering falls under the category of Connectivity-based clustering.
- Building a tree diagram is the main goal of hierarchical clustering.
- Hierarchical clustering is used to discover hierarchical structure in data, such as taxonomies.
- Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.
- Each cluster contains objects of a similar type.
- It has two main categories: Divisive and Agglomerative.
Hierarchical clustering
- Each observation is treated as a separate cluster.
- The closest clusters are identified and joined together.
- This process continues until all the clusters are joined, as in the sketch below.
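A minimal sketch of this bottom-up process using scikit-learn's AgglomerativeClustering (the toy dataset, the number of clusters, and Ward linkage are illustrative assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs     # toy data, stands in for your own observations

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters=3 and Ward linkage are illustrative choices, not the only options.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_[:10])   # cluster assignment of the first ten observations
```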
Dendrogram
A dendrogram is a diagrammatic representation of this tree.
It is used to show how objects are arranged into clusters.
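As a rough sketch, a dendrogram can be drawn with SciPy's hierarchy module (the toy dataset and Ward linkage are assumptions):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs     # toy data, stands in for your own observations

X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

Z = linkage(X, method="ward")   # build the hierarchical tree (Ward linkage is an assumption)
dendrogram(Z)                   # plot the tree; the y-axis shows the merge distance
plt.xlabel("Observation index")
plt.ylabel("Distance")
plt.show()
```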
How to read dendrogram
A dendrogram can be drawn as a column graph, a row graph, or in a circular or fluid shape, but our system will produce a column graph or a row graph. Irrespective of the shape, all dendrograms contain the same parts.
A branch is called a clade. Clades are usually named with Greek letters and can be read from left to right, e.g., α, β, δ.
A clade can have any number of leaves, such as:
- Single (simplicifolius): f
- Double (bifolius): d, e
- Triple (trifolius): a, b, c
A clade can, in principle, have an enormous number of leaves, but a clade with many leaves becomes difficult to read.
Clades are arranged based on the similarities and dissimilarities between them. Clades at the same height are considered similar, while clades at different heights are considered dissimilar. Similarity is commonly measured with Pearson's correlation coefficient.
- Leaves a and b are more similar to each other than they are to c, d, e, and f.
- Leaves d and e are more similar to each other than they are to a, b, c, and f.
- Leaf f is completely dissimilar to all the other leaves.
In the above diagram, the same clade β joins leaves a, b, c, d, and e. That means the two groups (a, b, c) and (d, e) are more similar to each other than they are to f.
Divisive method
- In divisive or top-down clustering, we start with every object in a single cluster and then divide that cluster into the two least similar clusters.
- We keep dividing each cluster until there is one cluster for every object.
Agglomerative
- In the agglomerative or bottom-up clustering method, we start with every object in its own cluster.
- Then we calculate the similarity (i.e., distance) between each pair of clusters and merge the two most similar clusters, repeating until only one cluster remains.
Advantages
- Information about the number of clusters is not needed.
- It is easy to implement.
- By looking at the dendrogram, we can decide on the number of clusters easily.
Disadvantages
- No objective function is minimised directly.
- It may break large clusters into smaller clusters.
- Clusters of different shapes are challenging to handle.
- Sensitive to outliers.
Applications of Hierarchical clustering
- US Senator Clustering through Twitter
- Charting Evolution through Phylogenetic Trees
- Tracking Viruses through Phylogenetic Trees
How to measure cluster distance
Using a distance function, we first compute a proximity matrix, which contains the distance between every pair of points; only then do we create the clusters.
As clusters are merged, the matrix is updated to reflect the distances between clusters.
There are five common ways to measure the distance between clusters:
- Single linkage
- Complete linkage
- Average linkage
- Centroid method
- Ward’s method
Single linkage
In single linkage, the distance between two clusters is determined by a single pair of elements: the two closest elements, one from each cluster. At each step, the two clusters with the shortest such pairwise distance are merged; this is also known as nearest-neighbour clustering or minimum-distance linkage.
Mathematically, for two clusters X and Y, the single-linkage distance is d(X, Y) = min{ d(x, y) : x ∈ X, y ∈ Y }, where x and y are elements of the two clusters.
Complete linkage
Complete linkage clustering is the opposite of single linkage clustering: the distance between two clusters is the distance between their two farthest elements. This is otherwise known as maximum-linkage clustering. The resulting clusters tend to be more compact than those of single linkage, because the diameter of a merged cluster stays below the distance threshold.
Average linkage
In average linkage, the distance between two clusters is defined as the average of the distances from each point in one cluster to every point in the other cluster. Outliers therefore have less influence on the linkage. This is also known as UPGMA (unweighted pair group method with arithmetic mean).
Centroid Method
In the centroid method, the mean vector (centroid) of each cluster is computed, and the distance between two clusters is taken as the distance between their centroids.
Ward's Method
In this method, the total within-cluster variance is minimised. At each step, the pair of clusters whose merge causes the minimum increase in the error sum of squares (ESS), i.e. the minimum information loss, is merged.
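A rough sketch comparing these five distance measures with SciPy's linkage function (the toy dataset and the cut into three flat clusters are assumptions; note that SciPy's centroid and Ward methods expect raw observations with Euclidean distances):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs    # toy data, stands in for your own observations

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# The five distance measures discussed above, as named in SciPy.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                      # build the hierarchy with this criterion
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 flat clusters
    print(method, labels[:10])
```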
Conclusion
With this, we come to the end of this article. If you want to go one step further, you can also have a look at Introduction to supervised machine learning.