Chapter 3 Unsupervised Learning
Supervised learning vs. unsupervised learning
• Supervised learning learns a mapping from inputs to outputs using labeled examples; unsupervised learning looks for structure (e.g., clusters) in unlabeled data.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related)
to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized; inter-cluster distances are maximized.
General Applications of Clustering
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Image Processing
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Common Distance/Similarity measures
• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
Common measures include the Minkowski, Manhattan, Euclidean, Hamming, and cosine distances.
1. The Manhattan distance (also called taxicab norm or 1-norm) is given by: d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
Reading assignment: Try to understand the difference between each of the distance measures.
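To make the measures concrete, here is a minimal sketch (not from the original slides) that computes the Manhattan, Euclidean, and cosine distances between two made-up feature vectors with NumPy:

```python
import numpy as np

# Two example feature vectors (made-up values, for illustration only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Manhattan (taxicab / 1-norm) distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

# Euclidean (2-norm) distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Cosine distance: 1 minus the cosine of the angle between the vectors.
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(manhattan, euclidean, cosine)  # 5.0, ~3.61, ~0.31
```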
Challenges: Notion of a Cluster can be Ambiguous
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or
all of the hidden patterns.
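The intra-/inter-class trade-off can also be quantified numerically. The silhouette coefficient is not mentioned on this slide, but as an illustrative sketch (assuming scikit-learn is available) it captures exactly this idea: values near 1 mean points are much closer to their own cluster than to other clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up 2-D data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# Cluster the data, then score how compact and well separated the clusters are.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for this easy example
```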
Additional Requirements of Clustering in ML
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Ability to deal with noise and outliers
• Insensitive to order of input records
• Interpretability and usability
Partitioning Algorithms: Basic Concept
• Goal: Construct a partition of a dataset D of n objects/rows into a set of k clusters,
where the value of k is provided by the user.
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps (a minimal sketch follows below):
Step 1: Randomly select k objects as the initial cluster centers (centroids),
Step 2: Compute the distance between each cluster center and the other data points,
Step 3: Assign each object to the cluster with the nearest cluster center (centroid),
Step 4: Replace each cluster center with the mean of its cluster, and go back to Step 2; stop when a stopping criterion is met (e.g., no more reassignments, no change in SSE, a maximum number of iterations reached).
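A minimal NumPy sketch of these four steps (the function and variable names are mine, not from the slides; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: replace each centroid by the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move (no more replacement).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on made-up data:
# centroids, labels = kmeans(np.random.default_rng(1).normal(size=(20, 2)), k=2)
```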
A simple example showing the implementation of the k-means algorithm (using k = 2)
Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
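For instance, for a hypothetical point x = (2.0, 3.0) (not taken from the original table), the Euclidean distances to the two centroids would be d(x, m1) = sqrt((2.0 − 1.0)² + (3.0 − 1.0)²) = sqrt(5) ≈ 2.24 and d(x, m2) = sqrt((2.0 − 5.0)² + (3.0 − 7.0)²) = sqrt(25) = 5.0, so this point would be assigned to the cluster of m1.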
Step 2:
• Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
• Their new centroids are:
Step 3:
• Now, using these new centroids, we compute the Euclidean distance of each object to them, as shown in the table.
• Step 4:
The clusters obtained are:
{1,2} and {3,4,5,6,7}
[Plot of the clustering result]
Variations of the K-Means Method
• A few variants of k-means exist, differing in
– Selection of the initial k means
– Dissimilarity calculations
What is the problem with the k-means method?
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Hierarchical Clustering
Hierarchical Clustering
• There are two types of hierarchical clustering:
– Agglomerative: a bottom-up approach where each data point starts as its own cluster and the closest clusters are repeatedly merged until one big cluster (or a desired number of clusters) remains.
– Divisive: a top-down approach where all the data points are treated as one big cluster and the clustering process involves dividing the one big cluster into several smaller clusters.
Agglomerative Clustering Algorithm
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
[Figure: proximity matrix over points p1, p2, p3, p4, p5, …]
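As an illustration (my own sketch, not from the slides), the MIN, MAX, group-average, and centroid criteria can all be computed directly from the pairwise distances between two made-up clusters:

```python
import numpy as np

# Two hypothetical clusters of 2-D points.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0], [4.0, 4.0]])

# Pairwise Euclidean distances between every point in A and every point in B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pairwise.min()    # MIN: closest pair of points across the clusters
complete = pairwise.max()    # MAX: farthest pair of points across the clusters
average  = pairwise.mean()   # Group Average: mean of all cross-cluster distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centroids

print(single, complete, average, centroid)
```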
Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: initial proximity matrix with one row and column per point, and the individual data points below]
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: proximity matrix over the current clusters C1–C5 and the corresponding groups of points]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: proximity matrix over clusters C1–C5, with C2 and C5 as the closest pair]
After Merging
• The question is: "How do we update the proximity matrix?"
[Figure: proximity matrix after merging C2 and C5 into C2 ∪ C5; the entries involving the merged cluster are marked "?"]
Simulation of Agglomerative Clustering
Summary
• A dendrogram shows how the clusters are merged hierarchically.
• Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
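A minimal sketch of agglomerative clustering and dendrogram cutting, assuming SciPy is available (the data points are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, size=(10, 2)),
               rng.normal(3, 0.4, size=(10, 2))])

# Build the merge tree (dendrogram) with single linkage (the MIN criterion).
Z = linkage(X, method='single')

# Cut the dendrogram so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib).
```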
Density-Based Clustering Methods
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering: Background
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-neighbourhood of that
point
[Figure: points p and q with example parameters MinPts = 5 and Eps = 1 cm]
Density-Based Clustering: Background (II)
• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi.
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: Density Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of
density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts; if p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.
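A minimal sketch using scikit-learn's DBSCAN implementation (assuming scikit-learn is available; the data is made up). Here eps corresponds to Eps and min_samples to MinPts:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D data: two dense blobs plus a few scattered noise points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
               rng.normal(4, 0.3, size=(40, 2)),
               rng.uniform(-2, 6, size=(5, 2))])

# eps plays the role of Eps, min_samples the role of MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labelled -1 are treated as noise/outliers; the rest get cluster ids.
print(set(labels))
```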
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: original points and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Cluster Evaluation
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary