
Unit-4 Clustering

4.1 Introduction to Clustering


Clustering groups data objects so that objects within the same group are
similar to one another; each such group is called a cluster. In cluster
analysis, a data set is divided into groups based on the similarity of
the data, and each group can then be assigned a label. Because the
grouping is driven by the data itself, clustering helps in adapting to
changes in the data.
 Distance measuring
In a data mining sense, a similarity measure is a distance whose
dimensions describe the features of the objects. If the distance between
two data points is small, there is a high degree of similarity between
the objects, and vice versa. Similarity is subjective and depends heavily
on the context and application; for example, similarity among vegetables
can be determined from their taste, size, color, etc.
Most clustering approaches use distance measures to assess the similarity
or difference between a pair of objects. The most popular are the two
classical measures, Euclidean and Manhattan distance, defined as follows:

a) Euclidean Distance
Euclidean distance is the traditional metric for geometric problems: it
is simply the ordinary straight-line distance between two points, and it
is one of the most used measures in cluster analysis. K-means, for
example, relies on it. Mathematically, it is the square root of the sum
of the squared differences between the coordinates of the two objects:
for P at (x1, y1) and Q at (x2, y2),
Euclidean distance between P and Q = √((x1 – x2)² + (y1 – y2)²)

Fig: Euclidean Distance


b) Manhattan Distance

 Manhattan distance is the sum of the absolute differences between
the coordinates of a pair of points.
 Suppose we have two points P and Q. To determine the distance
between these points, we add up the distances traveled parallel to the
X-axis and the Y-axis, as if moving along a grid of city blocks.
 In a plane with P at coordinate (x1, y1) and Q at (x2, y2),
 Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

Fig: Manhattan Distance

 Here the total length of the red line gives the Manhattan distance
between the two points.
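As a concrete illustration, here is a minimal sketch of both measures in
Python; the function names and the sample points are our own choices for
the example.

import math

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

# Example with P = (1, 2) and Q = (4, 6):
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(manhattan_distance((1, 2), (4, 6)))  # 7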

 Categories of clustering algorithms

There are several different categories of clustering algorithms.

a) Centroid-based Clustering
Centroid-based clustering is a non-hierarchical approach that allows
data analysts to group data points into clusters according to their
attributes or characteristics. It is an iterative algorithm in which
clusters are formed by the closeness of data points to the centroid of a
cluster: the cluster center, i.e. the centroid, is placed so that the
total distance from the data points to the center is minimized. This
problem is NP-hard, so solutions are commonly approximated over a number
of trials.

b) Density-based Clustering

Density-based clustering connects areas of high data-point density into
clusters. This allows for arbitrarily shaped distributions, as long as
the dense areas can be connected. These algorithms have difficulty with
data of varying densities and with high-dimensional data. Further, by
design, they do not assign outliers to clusters.
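For instance, scikit-learn's DBSCAN behaves this way; in this brief
sketch the eps and min_samples values and the sample data are
illustrative assumptions, not recommendations.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense area
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # another dense area
              [25.0, 25.0]])                         # isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks an outlier left unassigned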
c) Distribution-based Clustering
This clustering approach assumes the data is composed of distributions,
such as Gaussian distributions. A distribution-based algorithm might,
for example, cluster the data into three Gaussian distributions. As the
distance from a distribution's center increases, the probability that a
point belongs to that distribution decreases. When you do not know the
type of distribution in your data, you should use a different algorithm.
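A common concrete example is a Gaussian mixture model; the sketch below
uses scikit-learn's GaussianMixture with three components and synthetic
data, both of which are assumptions made for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three well-separated Gaussians.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gm.predict_proba(X[:1]))  # membership probabilities for one point;
                                # they fall off away from a component's center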

d) Connectivity-based Clustering
The core idea of the connectivity-based model is similar to the
centroid-based model: clusters are defined on the basis of the closeness
of data points. Here we work on the notion that data points that are
closer together behave more similarly than data points that are farther
apart. It is not a single partitioning of the data set; instead, it
provides an extensive hierarchy of clusters that merge with each other
at certain distances. The choice of distance function is subjective.
These models are very easy to interpret, but they lack scalability.
e) Hierarchical Clustering

Hierarchical clustering algorithms, unlike centroid-based algorithms,
take a different approach to clustering. These algorithms focus on
constructing a hierarchy among all data points and, from there, generate
a tree of clusters that lays out the relations between all the data
inputs.
f) Fuzzy Clustering

Quite distinct from the other methods of clustering, the fuzzy
clustering algorithm creates clusters of data points in such a manner
that one data point can belong to more than one cluster. Based on the
notion that some data inputs can overlap in terms of characteristics,
this algorithm places a particular data input in more than one cluster,
with a degree of membership in each, according to the parameters of the
different clusters.
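As an illustration of overlapping membership, here is a minimal fuzzy
c-means sketch in NumPy; the fuzzifier m = 2, the random initialization,
and the fixed iteration count are assumptions made for the example.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100):
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2 / (m - 1))               # closer centers get more weight
        U /= U.sum(axis=1, keepdims=True)          # re-normalize memberships
    return centers, U

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))  # the middle point gets split membership across both clusters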

4.2 K-means and K-medoid Algorithms


 K-means Algorithm

K-means is an iterative algorithm that tries to partition the dataset
into K pre-defined, distinct, non-overlapping subgroups (clusters),
where each data point belongs to only one group. It tries to make the
intra-cluster data points as similar as possible while also keeping the
clusters as different (far apart) as possible. It assigns data points to
a cluster such that the sum of the squared distances between the data
points and the cluster's centroid (the arithmetic mean of all the data
points that belong to that cluster) is at a minimum. The less variation
we have within clusters, the more homogeneous (similar) the data points
are within the same cluster.

Algorithm:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids (they need not
be points from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the
predefined K clusters.
Step-4: Recompute the centroid of each cluster as the mean of the points
assigned to it.
Step-5: Repeat step 3, i.e. reassign each data point to the new closest
centroid.
Step-6: If any reassignment occurred, go to step 4; otherwise, FINISH.
Step-7: The model is ready.
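A compact sketch of these steps in NumPy follows; the random
initialization, the iteration cap, and the convergence test are simple
illustrative choices, not part of the algorithm's definition.

import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # Step 2
    for _ in range(iters):
        # Steps 3/5: assign each point to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                         # Step 6: done
            break
        centroids = new
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
centroids, labels = k_means(X, k=2)
print(labels)  # e.g. [0 0 1 1]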

 K-medoid algorithm

K-Medoids is a clustering algorithm resembling the K-Means clustering
technique, and it also falls under the category of unsupervised machine
learning. It differs from the K-Means algorithm mainly in the way it
selects the clusters' centers: K-Means selects the average of a
cluster's points as its center (which may or may not be one of the data
points), while K-Medoids always picks actual data points from the
clusters as their centers (also known as 'exemplars' or 'medoids').
K-Medoids also differs in this respect from the K-Medians algorithm,
which is the same as K-Means except that it chooses the medians (instead
of the means) of the clusters as centers.

Algorithm:

Step-1: Randomly choose k points from the input data (k is the number of
clusters to be formed). The quality of a choice of k can be assessed
using methods such as the silhouette method.

Step-2: Each data point is assigned to the cluster whose medoid is
nearest to it.

Step-3: For each data point of cluster i, its distance to all other
points of the cluster is computed and summed. The point of the ith
cluster whose sum of distances to the other points is minimal becomes
the medoid of that cluster.

Step-4: Steps 2 and 3 are repeated until convergence is reached, i.e.
until the medoids stop moving.
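The following NumPy sketch implements exactly these steps (this
alternating variant is sometimes called Voronoi-iteration K-Medoids);
the random seed and the iteration cap are illustrative choices.

import numpy as np

def k_medoids(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)        # Step 1
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)                  # Step 2
        # Step 3: within each cluster, pick the point whose summed
        # distance to the cluster's other points is minimal.
        new = np.array([np.flatnonzero(labels == j)[
            D[np.ix_(labels == j, labels == j)].sum(axis=1).argmin()]
            for j in range(k)])
        if np.array_equal(new, medoids):                       # Step 4: converged
            break
        medoids = new
    return X[medoids], labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
centers, labels = k_medoids(X, k=2)
print(centers, labels)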

4.3 Agglomerative Clustering, Concept of Divisive Clustering


 Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC). It produces a structure that is more informative than
the unstructured set of clusters returned by flat clustering, and this
clustering algorithm does not require us to prespecify the number of
clusters. Bottom-up algorithms treat each data point as a singleton
cluster at the outset and then successively agglomerate pairs of
clusters until all clusters have been merged into a single cluster that
contains all the data.
Steps of Agglomerative Clustering

1. Preparing the data.
2. Computing (dis)similarity information between every pair of objects
in the data set.
3. Using a linkage function to group objects into a hierarchical cluster
tree, based on the distance information generated at step 2.
Objects/clusters that are in close proximity are linked together using
the linkage function.
4. Determining where to cut the hierarchical tree into clusters. This
creates a partition of the data (a short sketch of these steps follows
below).
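SciPy's hierarchical clustering covers steps 2-4 directly; in this
sketch the "ward" linkage, the cut into two clusters, and the sample
data are assumptions made for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.8]])  # step 1: data
Z = linkage(X, method="ward")                    # steps 2-3: distances + tree
labels = fcluster(Z, t=2, criterion="maxclust")  # step 4: cut into 2 clusters
print(labels)  # e.g. [1 1 2 2]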

 Divisive Clustering

It is known as the top-down approach. This algorithm also does not
require us to prespecify the number of clusters. Top-down clustering
requires a method for splitting a cluster: it starts from one cluster
that contains the whole data set and proceeds by splitting clusters
recursively until the individual data points have been split into
singleton clusters.
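One common way to realize this top-down scheme is bisecting K-means,
which repeatedly splits a cluster in two; the rough sketch below stops
at a target number of clusters instead of at singletons, and always
splitting the largest cluster is our own simplification.

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]        # start with one all-inclusive cluster
    while len(clusters) < n_clusters:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)             # split the largest cluster in two
        halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]
    return clusters

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
print(divisive(X, 2))  # e.g. [array([0, 1]), array([2, 3])]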
