CS60050: Machine Learning

Autumn 2024

Sudeshna Sarkar
Clustering
K-Means
26-27 Sep 2024
Supervised learning vs. unsupervised learning
• Supervised learning: discover discriminative patterns in the data that relate the data attributes to a target (class) attribute.
• These patterns are then used to predict the values of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
• Discover the patterns that describe all of the data (generative patterns).
Clustering
Unsupervised learning: the data have no target attribute.
– Requires data, but no labels
– Detects patterns, e.g. in:
• Grouping emails or search results
• Customer shopping patterns
• Regions of images
– Useful when you don't know what you're looking for
– But: you can get gibberish
Applications
• Segmenting customers with similar market characteristics: pricing, loyalty, spending behaviors, etc.
• Grouping products based on their properties
• Identifying customers with similar energy-use profiles, where x = a time series of energy usage
• Clustering weblog data to discover groups of similar access patterns
• Recognizing communities in social networks
• Finding the top 20 topics on Twitter
Clustering Algorithms
Clustering Examples
Customer Segmentation
• Group customers based on their demographics and activity:
• Purchase history
• Demographics
• Content engagement
• Behavior
• Customer lifecycle stage
• Cater to customer groups for promotion, recommendation, and product development strategies
INCOME  SPEND
233     150
250     187
204     172
236     178
354     163
192     148
294     153
263     173
199     162
168     174
239     160
275     139
266     171
211     144

[Figure: "Customer Segments" scatter plot of Annual Spend vs. Annual Income]
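As an illustration of this example, here is a minimal scikit-learn sketch that clusters the income/spend table above. The choice of k = 3, the feature scaling, and the variable names are assumptions for illustration, not from the slides.

```python
# Minimal sketch: segmenting the income/spend table with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = np.array([
    [233, 150], [250, 187], [204, 172], [236, 178], [354, 163],
    [192, 148], [294, 153], [263, 173], [199, 162], [168, 174],
    [239, 160], [275, 139], [266, 171], [211, 144],
], dtype=float)  # columns: annual income, annual spend

X = StandardScaler().fit_transform(data)          # put both features on the same scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                             # cluster index for each customer
```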
Applications: News Clustering
Fundamental Aspects of clustering
• A clustering algorithm
• Partitional clustering
• Hierarchical clustering
•…
• A distance (similarity, or dissimilarity) function
• Euclidean, cosine, Mahalanobis
• Clustering quality
• Inter-cluster distance ⇒ maximized
• Intra-cluster distance ⇒ minimized
• The quality of a clustering result depends on the algorithm, the distance function, and the application.
How many clusters? An illustration
This data set has four natural clusters.
[Figure: scatter plot of the points in the X1–X2 plane, showing four natural clusters]
Aspects of clustering
Similarity / Distance Measures
Depends on the problem domain and data type:
• Customers
• Time series
• Text
• Images

Similarity or distance measures:
• Euclidean distance
• Manhattan distance
• Cosine similarity
• Pearson correlation
• …
Distance / Similarity measures
[Figure: two points x in the x–y plane, illustrating the distance/similarity between them]
Similarity measures
(Dis)similarity measures
Types of Clustering: Hard vs Soft
• Exclusive (Hard)
• Non-overlapping subsets
• Each item is a member of a single cluster

• Overlapping (Soft)
• Potentially overlapping subsets
• An item can simultaneously belong to multiple clusters
Challenges in Clustering
• Data is very large
• High-dimensional data space
• Data space is not Euclidean (e.g. NLP problems)
K-means Clustering
Clustering by Partitioning

K-means algorithm (MacQueen, 1967)

Given K:
1. Initialization: Randomly choose K data points (seeds) to be the initial cluster centres.
2. Cluster Assignment: Assign each data point to the closest cluster centre.
3. Move Centroid: Re-compute the cluster centres using the current cluster memberships.
4. If a convergence criterion is not met, go to 2.
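Below is a minimal NumPy sketch of the algorithm just listed (random seeds, assignment, centroid update, repeat until the assignments stop changing). The function and parameter names are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain K-means on an (n, d) array X; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 1. Initialization: pick k distinct data points as the initial centres.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # 2. Cluster assignment: each point goes to its closest centre (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. Convergence check: stop when the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Move centroid: mean of the points currently assigned to each cluster.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels
```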
[Figures: 1. Random Initialization → 2. Cluster Assignment → 3. Move Centroid]
K-Means & Its Stopping Criterion
Given K:
1. Initialization: Randomly choose K data points (seeds) to be the initial cluster centres.
2. While not converged:
   I. Cluster Assignment: Assign each data point to the closest cluster centre.
   II. Move Centroid: Re-compute the cluster centres using the current cluster memberships.
Typical stopping criteria: the cluster assignments no longer change, the centroids move by less than a small threshold, the decrease in SSE falls below a threshold, or a maximum number of iterations is reached.
K-Means Optimization Objective
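For reference, a standard way to write the objective that K-means minimizes is the within-cluster sum of squared errors (SSE); the notation below is ours, not the slide's:

$$ J \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i $$

The cluster-assignment step minimizes J over the assignments with the centres fixed; the move-centroid step minimizes J over the centres with the assignments fixed.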

K-Means Convergence Property

Convergence of K-Means
Convergence of K-Means (summary)
Convergence Property
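A brief sketch of why the algorithm converges, in terms of the SSE objective J above (a standard argument, stated here in our own words):
• The cluster-assignment step never increases J, since each point is reassigned to its nearest centre.
• The move-centroid step never increases J, since the mean of a cluster minimizes the sum of squared distances to its points: $\mu_k = \arg\min_{\mu} \sum_{x_i \in C_k} \lVert x_i - \mu \rVert^2$.
• J is bounded below by 0 and there are only finitely many ways to partition n points into K clusters, so the assignments must eventually stop changing.
The fixed point reached this way is a local optimum of J, not necessarily the global one.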

K-means illustrated
Picking Cluster Seeds (Initial Values)
1. Lloyd's Method: Random initialization.
2. K-Means++: Iteratively construct a random sample with good spacing across the dataset.
Picking Cluster Seeds (Initial Values)
1. Lloyd's Method: Random initialization.
   May converge at a local optimum.
2. K-Means++: Iteratively construct a random sample with good spacing across the dataset.
To mitigate local optima with random initialization:
1. Perform multiple runs, each with a different set of randomly chosen seeds.
2. Select the configuration that gives the minimum SSE.
Picking Cluster Seeds (K-means++)
• Choose centres at random from the data points.
• Weight the probability of choosing each centre according to its squared distance from the closest centre already chosen.

In other words:
• Choose the 1st centre uniformly at random.
• Choose the 2nd centre far from the first (weighted by squared distance).
• Choose the 3rd centre far from the first and second.
• … and so on.
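A minimal NumPy sketch of the K-means++ seeding rule just described; the function name and the use of squared Euclidean distance are assumptions for illustration.

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """K-means++ seeding: return k initial centres chosen from the rows of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # First centre: uniformly at random from the data points.
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its closest centre chosen so far.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centres)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Next centre: sampled with probability proportional to that squared distance.
        probs = d2 / d2.sum()
        centres.append(X[rng.choice(len(X), p=probs)])
    return np.array(centres)
```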
How to select K?
1: Use cross validation to select K.
   • What should we optimize?
2: Let the domain expert look at the clustering and decide.
3: The "knee" solution:
   • Plot the objective function values for different values of K.
   • "Knee finding" or "elbow finding".
[Figure: objective function vs. K for K = 1…6; figure from slide by Eamonn Keogh]
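A sketch of the knee/elbow procedure, assuming scikit-learn and matplotlib are available: run K-means for a range of K and plot the objective (SSE, exposed by scikit-learn as inertia_) against K.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_values=range(1, 7)):
    """Plot SSE (inertia) against K to look for the 'knee'/'elbow'."""
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in k_values]
    plt.plot(list(k_values), sse, marker="o")
    plt.xlabel("K")
    plt.ylabel("Objective function (SSE)")
    plt.show()
```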


K-Means Time Complexity
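For reference, a standard back-of-the-envelope count (assuming Euclidean distance in d dimensions): each iteration computes n·K distances of cost O(d) for the assignment step plus O(nd) for the centroid update, so t iterations cost O(t · K · n · d). In practice t is usually small, which is what makes K-means fast and scalable.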

K-Means Pros and Cons


K-means Getting Stuck (Varying K)
K-means is not able to properly cluster some data sets.
Changing the features or the distance function (kernel) may help.
Some bad cases for k-means
• Clusters may overlap
• Some clusters may be "wider" than others
• Clusters may not be linearly separable

Slide credit: CMU MLD Aarti Singh


Hierarchical Clustering
Hierarchical Algorithms

Agglomerative (bottom-up): Start with each point as a cluster. Clusters are combined based on their "closeness".
Divisive (top-down): Start with one cluster including all points and recursively split each cluster.
Types of hierarchical clustering
1. Divisive (top-down) hierarchical clustering
2. Agglomerative (bottom-up) hierarchical clustering

Slide credit: Min Zhang
Hierarchical Clustering: Example
1. C = {1},{2},{3},{4},{5},{6},{7}
2. C = {1,6},{2},{3},{4},{5},{7}
3. C = {1,6},{2,4},{3},{5},{7}
4. C = {1,6},{2,4},{3},{5,7}
5. C = {1,6},{2,4,5,7},{3}
6. C = {1,6,3},{2,4,5,7}
7. C = {1,6,3,2,4,5,7}
Dendrogram: Hierarchical Clustering
• Input set S
• Nodes represent subsets of S

Features of the tree:
• The root is S.
• The leaves are the individual elements of S.
• The internal nodes are defined as the union of their children.
Dendrogram: Definition


Hierarchical clustering
Cutting the dendrogram at different heights yields different numbers of clusters K.
[Figure: dendrogram cut at several heights, each cut giving a different K]
Example (leaf order: 1 4 5 8 3 2 9 6 7):
• Height = 0 → 9 clusters
• Height = 1 → 8 clusters
• Height = 2 → 7 clusters
• Height = 3 → 6 clusters
• Height = 4 → 5 clusters
• Height = 5 → 4 clusters
• Height = 6 → 3 clusters
• Height = 7 → 2 clusters
• Height = 8 → 1 cluster
Hierarchical Agglomerative clustering

Different definitions of the distance lead to different algorithms.


Distance Measures
Real variables:
• Euclidean
• Cosine
• Correlation
• Manhattan
• Minkowski
• Mahalanobis
• …
Discrete variables:
• Hamming
• Jaccard
• …
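For concreteness, a few of these measures written out in NumPy (Hamming and Jaccard shown for discrete/binary vectors). These are illustrative implementations, not the slides' notation.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def hamming(x, y):
    # Fraction of positions at which the two discrete vectors differ.
    return np.mean(x != y)

def jaccard(x, y):
    # For binary (0/1) vectors: |intersection| / |union|.
    x, y = x.astype(bool), y.astype(bool)
    return np.sum(x & y) / np.sum(x | y)
```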
Linkage: Definition

Initialization
• Each individual point is taken as a cluster.
• Construct the distance/proximity matrix.
[Figure: points p1–p12 and the corresponding distance/proximity matrix]
Intermediate State
After some merging steps, we have some clusters (C1–C5) and the corresponding distance/proximity matrix.
[Figure: clusters C1–C5 and their distance/proximity matrix]
Intermediate State
Merge the two closest clusters (C2 and C5) and update the distance matrix.
[Figure: clusters C1–C5, with C2 and C5 about to be merged]
After Merging
Update the distance matrix: the distances from the new cluster C2 ∪ C5 to C1, C3, and C4 (marked "?") must be recomputed.
[Figure: clusters C1, C2 ∪ C5, C3, C4 and the partially updated distance matrix]
Closest Pair
• A few ways to measure the distance between two clusters:
• Single-link
  • Similarity of the most similar pair of points
• Complete-link
  • Similarity of the least similar pair of points
• Centroid
  • Clusters whose centroids (centers of gravity) are the most similar
• Average-link
  • Average similarity (e.g. cosine) between pairs of elements
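A small NumPy sketch of these four cluster-distance (linkage) definitions, using Euclidean distance between points (an assumption; the slide phrases some of them in terms of similarity).

```python
import numpy as np

def pairwise_dists(A, B):
    """Euclidean distances between every point of A (m, d) and every point of B (n, d)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):      # distance of the closest (most similar) pair
    return pairwise_dists(A, B).min()

def complete_link(A, B):    # distance of the farthest (least similar) pair
    return pairwise_dists(A, B).max()

def average_link(A, B):     # average over all cross-cluster pairs
    return pairwise_dists(A, B).mean()

def centroid_link(A, B):    # distance between the clusters' centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```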
Distance between two clusters: single link
The distance between two clusters is the distance between the closest pair of points, one from each cluster.
It can result in long and thin clusters.
Single-link clustering: example
• Determined by one pair of points, i.e., by one link in the proximity graph.
[Figure: points 1–5 with nested single-link clusters]
Complete link method
Complete-link clustering: example
• The distance between clusters is determined by the two most distant points in the different clusters.
[Figure: points 1–5 with nested complete-link clusters]
Computational Complexity
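For reference, the standard analysis for agglomerative clustering: storing the full proximity matrix takes O(n²) space, and the naive algorithm performs n − 1 merges, each rescanning the matrix, for O(n³) time overall; with priority queues this drops to O(n² log n), and single link can be computed in O(n²). This is why hierarchical clustering is usually applied to moderately sized data sets.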

Average Link Clustering
A compromise between single and complete link. Less susceptible to noise and outliers.
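A usage sketch with SciPy, assuming scipy and matplotlib are available: build the linkage matrix with single, complete, or average link, draw the dendrogram, and cut it to obtain a flat clustering. The toy data and the choice of 4 clusters are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))    # toy data: 20 points in 2-D

Z = linkage(X, method="average")     # also: "single", "complete", "centroid", "ward"
dendrogram(Z)                        # visualize the merge tree
plt.show()

labels = fcluster(Z, t=4, criterion="maxclust")       # cut to get (at most) 4 flat clusters
print(labels)
```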
