
Unit-4 Clustering

4.1 Introduction to Clustering


Clustering groups data objects so that objects within the same group are
similar to one another; each such group is called a cluster. In cluster
analysis, a data set is divided into groups based on the similarity of
the data, and each group can then be assigned a label. Because the
grouping is driven by the data itself, clustering helps in adapting to
changes in the data.
 Distance measuring
In a data mining sense, a similarity measure is a distance whose
dimensions describe the features of the objects. If the distance between
two data points is small, there is a high degree of similarity between
the objects, and vice versa. Similarity is subjective and depends heavily
on the context and application; for example, similarity among vegetables
can be determined from their taste, size, color, etc.
Most clustering approaches use distance measures to assess the similarity
or difference between a pair of objects. The most popular are the two
classical measures, Euclidean and Manhattan distance, defined as follows:

a) Euclidean Distance
Euclidean distance is the traditional metric for geometric problems: it
is simply the ordinary straight-line distance between two points, and it
is one of the most used measures in cluster analysis. K-means, for
example, relies on it. Mathematically, it is the square root of the sum
of the squared differences between the coordinates of the two objects:
for P at (x1, y1) and Q at (x2, y2),
Euclidean distance between P and Q = √((x1 – x2)² + (y1 – y2)²)

Fig: Euclidean Distance


b) Manhattan Distance

 Manhattan distance is the sum of the absolute differences between
the coordinates of a pair of points.
 Suppose we have two points P and Q. To determine the distance
between these points, we add up the distances traveled parallel to the
X-axis and the Y-axis, as if moving along a grid of city blocks.
 In a plane with P at coordinate (x1, y1) and Q at (x2, y2),
 Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

Fig: Manhattan Distance

 Here the total length of the red line gives the Manhattan distance
between the two points.
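As a concrete illustration, here is a minimal sketch of both measures in
Python; the function names and the sample points are our own choices for
the example.

import math

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

# Example with P = (1, 2) and Q = (4, 6):
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(manhattan_distance((1, 2), (4, 6)))  # 7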

 Categories of clustering algorithms

There are several different categories of clustering algorithms.

a) Centroid-based Clustering
Centroid-based clustering is a non-hierarchical approach that allows
data analysts to group data points into clusters according to their
attributes or characteristics. It is an iterative algorithm in which
clusters are formed by the closeness of data points to the centroid of a
cluster: the cluster center, i.e. the centroid, is placed so that the
total distance from the data points to the center is minimized. This
problem is NP-hard, so solutions are commonly approximated over a number
of trials.

b) Density-based Clustering

Density-based clustering connects areas of high data-point density into
clusters. This allows for arbitrarily shaped distributions, as long as
the dense areas can be connected. These algorithms have difficulty with
data of varying densities and with high-dimensional data. Further, by
design, they do not assign outliers to clusters.
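For instance, scikit-learn's DBSCAN behaves this way; in this brief
sketch the eps and min_samples values and the sample data are
illustrative assumptions, not recommendations.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense area
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # another dense area
              [25.0, 25.0]])                         # isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks an outlier left unassigned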
c) Distribution-based Clustering
This clustering approach assumes the data is composed of distributions,
such as Gaussian distributions. A distribution-based algorithm might,
for example, cluster the data into three Gaussian distributions. As the
distance from a distribution's center increases, the probability that a
point belongs to that distribution decreases. When you do not know the
type of distribution in your data, you should use a different algorithm.
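A common concrete example is a Gaussian mixture model; the sketch below
uses scikit-learn's GaussianMixture with three components and synthetic
data, both of which are assumptions made for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three well-separated Gaussians.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gm.predict_proba(X[:1]))  # membership probabilities for one point;
                                # they fall off away from a component's center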

d) Connectivity-based Clustering
The core idea of the connectivity-based model is similar to the
centroid-based model: clusters are defined on the basis of the closeness
of data points. Here we work on the notion that data points that are
closer together behave more similarly than data points that are farther
apart. It is not a single partitioning of the data set; instead, it
provides an extensive hierarchy of clusters that merge with each other
at certain distances. The choice of distance function is subjective.
These models are very easy to interpret, but they lack scalability.
e) Hierarchical Clustering

Hierarchical clustering algorithms, unlike centroid-based algorithms,
take a different approach to clustering. These algorithms focus on
constructing a hierarchy among all data points and, from there, generate
a tree of clusters that lays out the relations between all the data
inputs.
f) Fuzzy Clustering

Quite distinct from the other methods of clustering, the fuzzy
clustering algorithm creates clusters of data points in such a manner
that one data point can belong to more than one cluster. Based on the
notion that some data inputs can overlap in terms of characteristics,
this algorithm places a particular data input in more than one cluster,
with a degree of membership in each, according to the parameters of the
different clusters.
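As an illustration of overlapping membership, here is a minimal fuzzy
c-means sketch in NumPy; the fuzzifier m = 2, the random initialization,
and the fixed iteration count are assumptions made for the example.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100):
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2 / (m - 1))               # closer centers get more weight
        U /= U.sum(axis=1, keepdims=True)          # re-normalize memberships
    return centers, U

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))  # the middle point gets split membership across both clusters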

4.2 K-means and K-medoid Algorithms


 K-means Algorithm

K-means is an iterative algorithm that tries to partition the dataset
into K pre-defined, distinct, non-overlapping subgroups (clusters),
where each data point belongs to only one group. It tries to make the
intra-cluster data points as similar as possible while also keeping the
clusters as different (far apart) as possible. It assigns data points to
a cluster such that the sum of the squared distances between the data
points and the cluster's centroid (the arithmetic mean of all the data
points that belong to that cluster) is at a minimum. The less variation
we have within clusters, the more homogeneous (similar) the data points
are within the same cluster.

Algorithm:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids (they need not
be points from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the
predefined K clusters.
Step-4: Recompute the centroid of each cluster as the mean of the points
assigned to it.
Step-5: Repeat step 3, i.e. reassign each data point to the new closest
centroid.
Step-6: If any reassignment occurred, go to step 4; otherwise, FINISH.
Step-7: The model is ready.
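A compact sketch of these steps in NumPy follows; the random
initialization, the iteration cap, and the convergence test are simple
illustrative choices, not part of the algorithm's definition.

import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # Step 2
    for _ in range(iters):
        # Steps 3/5: assign each point to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                         # Step 6: done
            break
        centroids = new
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
centroids, labels = k_means(X, k=2)
print(labels)  # e.g. [0 0 1 1]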

 K-medoid algorithm

K-Medoids is a clustering algorithm resembling the K-Means clustering
technique, and it also falls under the category of unsupervised machine
learning. It differs from the K-Means algorithm mainly in the way it
selects the clusters' centers: K-Means selects the average of a
cluster's points as its center (which may or may not be one of the data
points), while K-Medoids always picks actual data points from the
clusters as their centers (also known as 'exemplars' or 'medoids').
K-Medoids also differs in this respect from the K-Medians algorithm,
which is the same as K-Means except that it chooses the medians (instead
of the means) of the clusters as centers.

Algorithm:

Step-1: Randomly choose k points from the input data (k is the number of
clusters to be formed). The quality of a choice of k can be assessed
using methods such as the silhouette method.

Step-2: Each data point is assigned to the cluster whose medoid is
nearest to it.

Step-3: For each data point of cluster i, its distance to all other
points of the cluster is computed and summed. The point of the ith
cluster whose sum of distances to the other points is minimal becomes
the medoid of that cluster.

Step-4: Steps 2 and 3 are repeated until convergence is reached, i.e.
until the medoids stop moving.
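The following NumPy sketch implements exactly these steps (this
alternating variant is sometimes called Voronoi-iteration K-Medoids);
the random seed and the iteration cap are illustrative choices.

import numpy as np

def k_medoids(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)        # Step 1
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)                  # Step 2
        # Step 3: within each cluster, pick the point whose summed
        # distance to the cluster's other points is minimal.
        new = np.array([np.flatnonzero(labels == j)[
            D[np.ix_(labels == j, labels == j)].sum(axis=1).argmin()]
            for j in range(k)])
        if np.array_equal(new, medoids):                       # Step 4: converged
            break
        medoids = new
    return X[medoids], labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
centers, labels = k_medoids(X, k=2)
print(centers, labels)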

4.3 Agglomerative Clustering, Concept of Divisive Clustering


 Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC). It produces a structure that is more informative than
the unstructured set of clusters returned by flat clustering, and this
clustering algorithm does not require us to prespecify the number of
clusters. Bottom-up algorithms treat each data point as a singleton
cluster at the outset and then successively agglomerate pairs of
clusters until all clusters have been merged into a single cluster that
contains all the data.
Steps of Agglomerative Clustering

1. Preparing the data.
2. Computing (dis)similarity information between every pair of objects
in the data set.
3. Using a linkage function to group objects into a hierarchical cluster
tree, based on the distance information generated at step 2.
Objects/clusters that are in close proximity are linked together using
the linkage function.
4. Determining where to cut the hierarchical tree into clusters. This
creates a partition of the data (a short sketch of these steps follows
below).
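SciPy's hierarchical clustering covers steps 2-4 directly; in this
sketch the "ward" linkage, the cut into two clusters, and the sample
data are assumptions made for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.8]])  # step 1: data
Z = linkage(X, method="ward")                    # steps 2-3: distances + tree
labels = fcluster(Z, t=2, criterion="maxclust")  # step 4: cut into 2 clusters
print(labels)  # e.g. [1 1 2 2]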

 Divisive Clustering

It is known as the top-down approach. This algorithm also does not
require us to prespecify the number of clusters. Top-down clustering
requires a method for splitting a cluster: it starts from one cluster
that contains the whole data set and proceeds by splitting clusters
recursively until the individual data points have been split into
singleton clusters.
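One common way to realize this top-down scheme is bisecting K-means,
which repeatedly splits a cluster in two; the rough sketch below stops
at a target number of clusters instead of at singletons, and always
splitting the largest cluster is our own simplification.

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]        # start with one all-inclusive cluster
    while len(clusters) < n_clusters:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)             # split the largest cluster in two
        halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]
    return clusters

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
print(divisive(X, 2))  # e.g. [array([0, 1]), array([2, 3])]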
