Clustering-Part1.pptx
Autumn 2024
Sudeshna Sarkar
Clustering
K-Means
26-27 Sep 2024
Supervised learning vs. unsupervised learning
• Supervised learning: discover discriminative patterns in the data that
relate data attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in
future data instances.
• Unsupervised learning: The data have no target attribute.
• The goal is to discover the patterns underlying the data as a whole (all patterns), rather than patterns tied to a target attribute.
Clustering
Unsupervised learning: The data have no target attribute.
– Requires data, but no labels
– Detect patterns e.g. in
• Group emails or search results
• Customer shopping patterns
• Regions of images
– Useful when you don’t know what you’re looking for
– But: can get gibberish
Applications
• Segmenting customers with similar market characteristics
— pricing, loyalty, spending behaviors, etc.
• Grouping products based on their properties
• Identifying customers with similar energy-use profiles
<x> = time series of energy usage
• Clustering weblog data to discover groups of similar access patterns
• Recognizing communities in social networks
• Top 20 topics on Twitter
Clustering Algorithms
Clustering Examples
Customer Segmentation
• Group customers based on their demographics and activity:
• Purchase history
• Demographics
• Content engagement
• Behavior
• Customer lifecycle stage
[Figure: scatter plot of data points along two features X1 and X2, showing natural groupings (clusters)]
Aspects of clustering
Similarity / Distance Measures
Similarity measures
(Dis)similarity measures
Types of Clustering: Hard vs Soft
• Exclusive (Hard)
• Non-overlapping subsets
• Each item is a member of a single cluster
• Overlapping (Soft)
• Potentially overlapping subsets
• An item can simultaneously belong to multiple clusters
Challenges in Clustering
• The data may be very large.
• The data space may be high-dimensional.
• The data space may not be Euclidean (e.g., NLP problems).
K-means Clustering
Clustering by Partitioning
• Partition the n data points into K clusters so that a chosen criterion (e.g., the within-cluster sum of squared distances) is optimized.
K-means algorithm (MacQueen, 1967)
Given K:
1. Initialization: Randomly choose K data points (seeds) to be the initial cluster centres.
2. Cluster Assignment: Assign each data point to the closest cluster centre.
3. Move Centroid: Re-compute the cluster centres using the current cluster memberships.
4. If a convergence criterion is not met, go to 2.
[Figures: the algorithm illustrated step by step: random initialization, cluster assignment (repeated each iteration), move centroid]
K-Means & Its Stopping Criterion
Given K:
1. Initialization: Randomly choose K data points (seeds) to be the initial cluster centres.
2. While not converged:
   I. Cluster Assignment: Assign each data point to the closest cluster centre.
   II. Move Centroid: Re-compute the cluster centres using the current cluster memberships.
   III. Check the convergence criterion; if it is not met, repeat from step I.
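A minimal NumPy sketch of this loop (the toy data, the stopping test, and the function name kmeans are illustrative assumptions, not taken from the slides):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: random seeds, assign points to the nearest centre, move centroids."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K distinct data points as the initial centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Cluster assignment: index of the nearest centre for each point.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # 3. Move centroid: mean of the points assigned to each cluster
        #    (an empty cluster keeps its old centre).
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # 4. Convergence criterion: stop when the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Usage on toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5])])
labels, centres = kmeans(X, k=2)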
K-Means Optimization Objective
• K-means minimizes the within-cluster sum of squared errors (SSE):
  J = Σ_{k=1..K} Σ_{x ∈ C_k} ‖x − μ_k‖²
  where C_k is the k-th cluster and μ_k is its centre (the mean of the points assigned to C_k).
• Both steps of the loop, cluster assignment and move centroid, aim to reduce J.
K-Means Convergence Property
• The cluster-assignment step and the move-centroid step can each only decrease J or leave it unchanged.
• There are only finitely many ways to partition the data into K clusters, so the algorithm cannot cycle and must converge after a finite number of iterations, though possibly to a local optimum.
Convergence of K-Means
Given K:
1. Initialization: Randomly choose K data points (seeds) to be the initial cluster centres.
2. While not converged:
   I. Cluster Assignment: Assign each data point to the closest cluster centre. (This step never increases J.)
   II. Move Centroid: Re-compute each cluster centre as the mean of its current members. (This step never increases J either.)
   III. Check the convergence criterion; if it is not met, repeat from step I.
Convergence of K-Means (summary)
Convergence Property
• K-means always terminates: the objective J never increases and no partition can repeat, so the loop stops after finitely many iterations.
• The result is only a local optimum of J; different initial seeds can lead to different clusterings.
K-means illustrated
Picking Cluster Seeds (Initial Values)
1. Lloyd’s Method: Random initialization.
2. K-Means++: Iteratively construct a random sample of seeds with good spacing across the dataset.
Picking Cluster Seeds (Initial Values)
1. Lloyd’s Method: Random initialization.
   • May converge at a local optimum.
   • Remedy: perform multiple runs, each with a different set of randomly chosen seeds, and select the configuration that gives the minimum SSE (see the sketch below).
2. K-Means++: Iteratively construct a random sample of seeds with good spacing across the dataset.
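A minimal sketch of the multiple-runs remedy, assuming scikit-learn is available: KMeans with n_init=10 runs the algorithm ten times from different random seeds and keeps the run with the lowest SSE (its inertia_ attribute). The toy data below is illustrative only.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# n_init=10: ten random initializations; the best (minimum-SSE) result is kept.
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print("best SSE over 10 runs:", km.inertia_)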
Picking Cluster Seeds (K-means++)
• Choose centres at random from the data points.
• Weight the probability of choosing each centre according to its squared distance from the closest centre already chosen.
In other words, roughly:
• Choose the first centre at random.
• Choose the second far from the first.
• Choose the third far from the first and second.
• … and so on.
(Strictly, k-means++ does not pick the farthest point deterministically; it samples the next centre at random with probability proportional to its squared distance from the nearest centre already chosen.)
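A minimal NumPy sketch of the k-means++ seeding rule just described; the function name kmeans_pp_seeds and its interface are my own illustrative choices:

import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """Pick k seeds from X (n x d) with k-means++ style sampling."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]                      # first centre: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centre.
        d2 = np.min(((X[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # Sample the next centre with probability proportional to that squared distance.
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(seeds)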
How to select K?
1: Use cross-validation to select K
   • What objective function should we optimize?
2: Let the domain expert look at the clustering and decide
3: The “knee” solution
   • Plot the objective function values for different values of K
   • “knee finding” or “elbow finding”
[Figure: objective function (SSE) plotted against K = 1 to 6; the bend (“knee”) of the curve suggests a good value of K]
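A minimal sketch of the knee/elbow procedure, assuming scikit-learn and matplotlib: compute the SSE (inertia) for a range of K and look for the bend in the curve. The toy data is illustrative.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

ks = list(range(1, 7))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")        # SSE drops sharply up to the "right" K,
plt.xlabel("K"); plt.ylabel("SSE")   # then flattens; the bend is the knee/elbow.
plt.show()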
K-means Getting Stuck (Varying K)
Cases where K-means is not able to cluster properly
Agglomerative (bottom-up): Start with each point as a cluster. Clusters are combined based on their “closeness”.
Divisive (top-down): Start with one cluster including all points and recursively split each cluster.
Types of hierarchical clustering
1. Agglomerative (bottom-up) hierarchical clustering
2. Divisive (top-down) hierarchical clustering
Hierarchical clustering: Divisive (Top-down)
• Start with a single cluster containing all points, then recursively split clusters until each point is in its own cluster (or a stopping criterion is met).
Hierarchical clustering
[Figure: dendrogram over the points 1 4 5 8 3 2 9 6 7. Cutting the dendrogram at a given height determines the number of clusters: Height = 0 gives 9 clusters, Height = 1 gives 8, Height = 2 gives 7, …, Height = 8 gives 1 cluster. The higher the cut, the fewer (and larger) the clusters.]
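A minimal sketch, assuming SciPy, of building the merge tree with agglomerative clustering and cutting it either at a desired number of clusters or at a chosen height (the nine toy points stand in for the labelled points above):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))                           # nine toy points

Z = linkage(X, method="complete")                     # bottom-up merge tree
labels_k3 = fcluster(Z, t=3, criterion="maxclust")    # cut so that 3 clusters remain
labels_h = fcluster(Z, t=1.5, criterion="distance")   # or cut at height 1.5
print(labels_k3, labels_h)
# dendrogram(Z) draws the tree; cutting it higher yields fewer clusters.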
Hierarchical Agglomerative clustering
• Start with each point in its own cluster; repeatedly merge the two closest clusters until only one cluster remains (or a stopping criterion is met).
• Requires a distance / similarity measure between points, e.g.:
  • Euclidean
  • Cosine
  • Correlation
  • Manhattan
  • Minkowski
  • Mahalanobis
  • Hamming
  • Jaccard
  • …
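For concreteness, a small sketch (assuming SciPy) computing several of the measures listed above for a pair of vectors:

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 0.0, 2.0])
v = np.array([0.0, 1.0, 2.0])

print(distance.euclidean(u, v))                 # square root of the sum of squared differences
print(distance.cityblock(u, v))                 # Manhattan (L1) distance
print(distance.minkowski(u, v, p=3))            # Minkowski distance with p = 3
print(distance.cosine(u, v))                    # 1 - cosine similarity
print(distance.correlation(u, v))               # 1 - Pearson correlation
print(distance.hamming([1, 0, 1], [1, 1, 1]))   # fraction of positions that differ
print(distance.jaccard([1, 0, 1], [1, 1, 1]))   # Jaccard distance for binary vectors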
Linkage: Definition
• A linkage criterion defines the distance between two clusters as a function of the pairwise distances between their member points (e.g., single, complete, average, centroid linkage).
Initialization
• Each individual point is taken as a cluster
• Construct distance/proximity matrix
[Figure: twelve points p1 to p12 plotted in feature space, and the corresponding distance/proximity matrix with one row and column per point]
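A minimal sketch of this initialization step, assuming SciPy: every point starts as its own cluster and the pairwise distance matrix is computed once up front (the twelve toy points stand in for p1 to p12):

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                       # twelve toy points

D = squareform(pdist(X, metric="euclidean"))       # 12 x 12 distance/proximity matrix
clusters = [[i] for i in range(len(X))]            # each point is its own cluster
print(D.shape, len(clusters))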
Intermediate State
[Figure: after several merges, five clusters C1 to C5 remain, and the distance/proximity matrix has one row and column per cluster. When C2 and C5 are merged into C2 ∪ C5, the distances from C2 ∪ C5 to C1, C3, and C4 (marked “?”) must be recomputed.]
Closest Pair
• There are several ways to measure the distance between two clusters:
• Single-link: similarity of the most similar pair of points, one from each cluster
• Complete-link: similarity of the least similar pair of points
• Centroid: similarity of the cluster centroids (centres of gravity)
• Average-link: average of the pairwise similarities (e.g., average cosine) between elements of the two clusters
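A minimal sketch, assuming SciPy, comparing how the linkage choice (single, complete, average, centroid) affects the resulting clusters on illustrative toy data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
               for c in ([0, 0], [4, 4])])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # cluster-distance rule
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes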
Distance between two clusters
[Figure: five example points labelled 1 to 5, used to illustrate the linkage methods below]
Complete link method
• The distance between two clusters is the maximum distance between any pair of points, one from each cluster; the clusters with the smallest such distance are merged first.
Complete-link clustering: example
• Distance between clusters is determined by the two most distant
points in the different clusters
[Figure: complete-link merges of the five example points 1 to 5]
Computational Complexity
• Computing and storing the full proximity matrix takes O(n²) time and space.
• A naive agglomerative algorithm performs n − 1 merges, each requiring a scan of the matrix, for O(n³) time overall; with a priority queue of pairwise distances this can be reduced to roughly O(n² log n).
Average Link Clustering