4 Clustering
a) Euclidean Distance
Euclidean distance is the traditional metric for problems with a geometric
interpretation: it is simply the ordinary straight-line distance between two
points, and it is one of the most widely used measures in cluster analysis.
K-means is one of the algorithms that relies on it. Mathematically, it is the
square root of the sum of squared differences between the coordinates of the
two objects.
[Figure: the total length of the red stepwise path gives the Manhattan
distance between the two points, in contrast to the straight-line Euclidean
distance.]
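As a small illustration, both distances can be computed in a few lines of Python (the function names here are illustrative, not from the text):

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (the stepwise path length)."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean_distance((0, 0), (3, 4)))  # -> 5.0
print(manhattan_distance((0, 0), (3, 4)))  # -> 7
```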
a) Centroid-based Clustering
The centroid-based algorithm is a non-hierarchical approach that lets data
analysts group data points into different clusters according to their
attributes or characteristics.
It is an iterative clustering algorithm in which clusters are formed by the
closeness of data points to the cluster centroid: the cluster centre, i.e.
the centroid, is chosen so that the distance of the data points from it is
minimal. Finding the optimal centroids is an NP-hard problem, so solutions
are commonly approximated over a number of trials.
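The iteration described above can be sketched in Python, assuming Euclidean distance and simple random initialisation (the function and variable names are mine, not from the text):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal centroid-based clustering (k-means): assign each point to
    its nearest centroid, then recompute each centroid as the mean of its
    cluster, repeating until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # convergence: centroids no longer move
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
cents, clus = kmeans(pts, 2)
```

Because the result depends on the random initialisation, practical implementations rerun the algorithm several times and keep the best solution, which is the "number of trials" mentioned above.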
b) Density-based Clustering
d) Connectivity-based Clustering
The core idea of the connectivity-based model is similar to the
centroid-based model: clusters are defined by the closeness of data points,
on the notion that points which are closer together behave more similarly
than points which are farther apart.
Rather than a single partitioning of the data set, it provides an extensive
hierarchy of clusters that merge with one another at certain distances. The
choice of distance function is subjective. These models are easy to
interpret but lack scalability.
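A minimal sketch of such a hierarchy, assuming Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members; names are illustrative):

```python
import math

def single_linkage_merges(points):
    """Connectivity-based (agglomerative) clustering: start with one
    cluster per point and repeatedly merge the two closest clusters,
    recording the distance at which each merge happens."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
        merges.append(d)
    return merges  # the sequence of merge distances forms the hierarchy

merges = single_linkage_merges([(0, 0), (0, 1), (5, 5)])
```

Cutting this merge sequence at a chosen distance yields a flat clustering, which is why the choice of distance (and linkage) function matters so much for these models. The O(n^3) pairwise scans above also illustrate why they scale poorly.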
e) Hierarchical Clustering
K-medoid algorithm
Algorithm:
Step-1: Randomly choose ‘k’ points from the input data as the initial
medoids (‘k’ is the number of clusters to be formed). The suitability of the
chosen value of k can be assessed using methods such as the silhouette
method.
Step-2: Each data point gets assigned to the cluster to which its nearest
medoid belongs.
Step-3: For each data point of cluster i, its distance from all other data
points in that cluster is computed and summed. The point of the i-th cluster
for which this sum of distances is minimal is assigned as the new medoid of
that cluster.
Step-4: Steps (2) and (3) are repeated until convergence is reached, i.e.
the medoids stop moving.
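The four steps above can be sketched in Python, assuming Euclidean distance. For a deterministic demo, the initial medoids are taken as the first k points rather than chosen randomly as in Step-1 (an assumption; names are illustrative):

```python
import math

def k_medoids(points, k):
    """K-medoid clustering following Steps 1-4 above."""
    medoids = list(points[:k])  # Step 1 (deterministic stand-in for random choice)
    while True:
        # Step 2: assign each point to the cluster of its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda m: math.dist(p, medoids[m]))
            clusters[i].append(p)
        # Step 3: within each cluster, pick the point whose summed
        # distance to the other members is minimal as the new medoid
        new = [min(c, key=lambda p: sum(math.dist(p, q) for q in c))
               for c in clusters]
        # Step 4: repeat until the medoids stop moving
        if new == medoids:
            return medoids, clusters
        medoids = new

meds, groups = k_medoids([(1, 1), (2, 2), (10, 10), (11, 11)], 2)
```

Unlike k-means, each medoid is always an actual data point, which makes the result robust to outliers and usable with arbitrary dissimilarity measures.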
Divisive Clustering