
Chapter 3

Unsupervised Learning (Clustering)

1
Supervised learning vs. unsupervised learning

• Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
– These patterns are then utilized to predict the values of the target attribute in future data instances.

• Unsupervised learning: the data have no target attribute.
– We want to explore the data to find some intrinsic structures in them.

3
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related)
to one another and different from (or unrelated to) the objects in other groups

[Figure: clusters of points; intra-cluster distances are minimized, inter-cluster distances are maximized]

4
General Applications of Clustering
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location

• Image Processing

• Economic Science (especially market research)

• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns

6
Common Distance/Similarity measures
• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
Common measures include the Minkowski, Manhattan, Euclidean, Hamming, and Cosine distances.

1. The Manhattan distance (also called taxicab norm or 1-norm) is given by:

   d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|

2. The Euclidean distance (also called 2-norm distance) is given by:

   d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
Reading assignment: Try to understand the difference between each of the distance measures. 7
Challenges: Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped into two, four, or six clusters; how many clusters are there?]

8
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity

• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.

• The quality of a clustering method is also measured by its ability to discover some or
all of the hidden patterns.
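One rough way to make "high intra-class similarity, low inter-class similarity" concrete is to compute the within-cluster SSE (cohesion, lower is better) and the average distance between cluster centroids (separation, higher is better). The sketch below assumes NumPy and a label vector produced by some clustering method; it is only one of several possible quality measures:

import numpy as np

def cohesion_and_separation(X, labels):
    """Within-cluster SSE (cohesion, lower is better) and mean distance
    between cluster centroids (separation, higher is better)."""
    uniq = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in uniq])
    sse = sum(((X[labels == c] - centroids[i]) ** 2).sum() for i, c in enumerate(uniq))
    separation = np.mean([np.linalg.norm(a - b)
                          for i, a in enumerate(centroids)
                          for b in centroids[i + 1:]])
    return sse, separation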

9
Additional Requirements of Clustering in ML
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Interpretability and usability

10
Partitioning Algorithms: Basic Concept
• Goal: Construct a partition of a dataset D of n objects/rows into a set of k clusters,
where the value of k is provided by the user.

– Global optimal: exhaustively enumerate all partitions

– Heuristic methods: k-means and k-medoids algorithms

12
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:

Step 1: Randomly select K data points as the initial cluster centers,

Step 2: Compute the distance between each cluster center and other data points,

Step 3: Assign each object to the cluster with the nearest cluster center (or centroid) ,

Step 4: Replace each cluster center by the mean of its cluster, and go back to Step 2; stop when a stopping criterion is met (e.g., no more reassignments, no change in SSE, or after n iterations).
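A minimal NumPy sketch of these four steps (the function name, the random seed, and the convergence test are illustrative choices rather than anything prescribed by the slide):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        # Step 3: assign each point to its nearest center
        labels = dists.argmin(axis=1)
        # Step 4: replace each center by the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # stopping criterion: no more change
            break
        centers = new_centers
    return labels, centers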

13
A Simple example showing the implementation of k-means
algorithm (using K=2)

14
Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

15
Step 2:
• Thus, we obtain two clusters containing {1,2,3} and {4,5,6,7}.
• Their new centroids are:

16
Step 3:
• Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.

• Therefore, the new clusters are: {1,2} and {3,4,5,6,7}.

• The next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

17
• Step 4:
The clusters obtained are: {1,2} and {3,4,5,6,7}.

• Therefore, there is no change in the clusters.

• Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.

18
[Figure: plot of the final two clusters]

19
Variations of the K-Means Method
• A few variants of k-means exist, which differ in
– Selection of the initial k means

– Dissimilarity calculations

– Strategies to calculate cluster means

• Handling categorical data: k-modes (Huang’98)
– Replacing means of clusters with modes

– Using new dissimilarity measures to deal with categorical objects

– Using a frequency-based method to update modes of clusters

– A mixture of categorical and numerical data: k-prototype method
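A small sketch (assuming NumPy and purely categorical data stored as strings) of two of the ingredients listed above: a simple matching dissimilarity and the frequency-based mode update. The data values and function names are hypothetical:

import numpy as np

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects disagree."""
    return int((a != b).sum())

def cluster_mode(rows):
    """Frequency-based "center" of a cluster: for each attribute,
    take the most frequent category (the mode)."""
    mode = []
    for j in range(rows.shape[1]):
        values, counts = np.unique(rows[:, j], return_counts=True)
        mode.append(values[counts.argmax()])
    return np.array(mode)

# Hypothetical categorical cluster
rows = np.array([["red", "small"], ["red", "large"], ["blue", "small"]])
# cluster_mode(rows)                        -> ['red', 'small']
# matching_dissimilarity(rows[0], rows[1])  -> 1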

20
What is the problem with the k-Means method?

21
Limitations of K-means: Differing Density

[Figure: original points vs. the k-means result (3 clusters)]

22
Limitations of K-means: Non-globular Shapes

[Figure: original points vs. the k-means result (2 clusters)]

23
Hierarchical Clustering

24
Hierarchical Clustering
• There are two types of hierarchical clustering:

– Agglomerative: data points are clustered using a bottom-up approach, starting with individual data points.

– Divisive: a top-down approach is followed, where all the data points are treated as one big cluster and the clustering process involves dividing the one big cluster into several small clusters.

25
Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique

• The basic algorithm is straightforward:
1. At the start, treat each data point as its own cluster; the number of clusters at the start is therefore K, where K is the number of data points.
2. Compute the similarity matrix (store it in a proximity matrix)
3. Repeat
4. Merge the two closest clusters based on their similarity measure
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms
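A naive NumPy sketch of this loop, using single-link (MIN) proximity and stopping when a target number of clusters remains instead of merging all the way down to one; the function and parameter names are illustrative:

import numpy as np

def agglomerative_single_link(X, target_k=1):
    """Bottom-up clustering: start with one cluster per point and
    repeatedly merge the two closest clusters (single-link / MIN)."""
    # Steps 1-2: each point is its own cluster; precompute pairwise distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]

    # Steps 3-6: merge the two closest clusters until target_k remain
    while len(clusters) > target_k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link (MIN) proximity between clusters a and b
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]                          # update: cluster b no longer exists
    return clusters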

26
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward’s Method uses squared error

[Figure: two candidate clusters and a proximity matrix over points p1 … p5; how should their similarity be defined?]
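To make the first three options concrete, here is a small NumPy helper that computes the proximity of two clusters from a point-to-point distance matrix; the function name and the "how" codes are illustrative:

import numpy as np

def cluster_proximity(dist, cluster_a, cluster_b, how="min"):
    """Inter-cluster proximity from a pairwise distance matrix dist,
    given the point indices belonging to each cluster."""
    block = dist[np.ix_(cluster_a, cluster_b)]  # all cross-cluster distances
    if how == "min":     # MIN / single link: closest pair of points
        return block.min()
    if how == "max":     # MAX / complete link: farthest pair of points
        return block.max()
    return block.mean()  # group average: mean of all cross-cluster distances

Distance between centroids would instead compare the mean vectors of the two clusters.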
27
Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: each point p1 … p12 starts as its own cluster, with a proximity matrix over the points]

32
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1 … C5 after some merges, together with their proximity matrix]

33
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
the proximity matrix.

[Figure: clusters C1 … C5 with the closest pair, C2 and C5, about to be merged; proximity matrix]

34
After Merging
• The question is: “How do we update the proximity matrix?”

[Figure: proximity matrix after merging C2 and C5; the row and column for the new cluster C2 U C5 are marked with “?”]

35
Simulation of Agglomerative Clustering

36
Summary
• A Dendrogram Shows How the Clusters are Merged Hierarchically

• Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
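A short sketch of this "cut the dendrogram" idea using SciPy's hierarchical clustering utilities (the data, the linkage method, and the cut thresholds are arbitrary illustrative choices):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)         # hypothetical 2-D data points
Z = linkage(X, method="average")  # merge history, i.e. the dendrogram

# Cut at a chosen distance level: each connected component becomes a cluster
labels_by_height = fcluster(Z, t=0.5, criterion="distance")

# Or cut so that at most a desired number of clusters remains
labels_by_k = fcluster(Z, t=3, criterion="maxclust")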

37
Density-Based Clustering Methods

38
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
39
Density-Based Clustering: Background
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-neighbourhood of that
point

[Figure: point q with point p in its Eps-neighbourhood; MinPts = 5, Eps = 1 cm]

40
Density-Based Clustering: Background (II)
• Density-reachable:
– A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi.

• Density-connected:
– A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

[Figure: a chain of points from q to p illustrating density-reachability, and a point o from which both p and q are density-reachable, illustrating density-connectivity]

41
DBSCAN: Density Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of
density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

42
DBSCAN: The Algorithm
• Arbitrarily select a point p.

• Retrieve all points density-reachable from p wrt Eps and MinPts.

• If p is a core point, a cluster is formed.

• If p is a border point, no points are density-reachable from p and DBSCAN visits the
next point of the database.

• Continue the process until all of the points have been processed.
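Rather than re-implementing these steps, a typical way to run DBSCAN in practice is through scikit-learn (assumed to be installed); eps and min_samples play the roles of Eps and MinPts, and the data below are hypothetical:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                              # hypothetical 2-D points
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)  # label -1 marks noise/outliers

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))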

43
When DBSCAN Works Well

[Figure: original points and the clusters DBSCAN finds]

• Resistant to Noise
• Can handle clusters of different shapes and sizes

45
When DBSCAN Does NOT Work Well

[Figure: original points and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]

• Varying densities
• High-dimensional data
46
Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Cluster Evaluation
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary

47
