Clustering-Part1
1
What is Cluster Analysis?
• Given a set of objects, place them in groups such that the objects in
a group are similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.
2
Applications of Cluster Analysis
• Understanding
• Group related documents for
browsing, group genes and proteins
that have similar functionality, or
group stocks with similar price
fluctuations
• Summarization
• Reduce the size of large data sets
Clustering precipitation
in Australia
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
4
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Ability to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
5
Notion of a Cluster can be Ambiguous
6
Types of Clustering
• A clustering is a set of clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters)
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
7
Partitional Clustering
8
Hierarchical Clustering
9
Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
• In non-exclusive clusterings, points may belong to multiple clusters.
• Can belong to multiple classes or could be ‘border’ points
• Fuzzy clustering (one type of non-exclusive)
• In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
• Weights must sum to 1
• Probabilistic clustering has similar characteristics
• Partial versus complete
• In some cases, we only want to cluster some of the data
10
Types of Clusters
• Well-separated clusters
• Prototype-based clusters
• Contiguity-based clusters
• Density-based clusters
11
Types of Clusters: Well-Separated
• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer (or more
similar) to every other point in the cluster than to any point not in the
cluster.
3 well-separated clusters
12
Types of Clusters: Prototype-Based
• Prototype-based
• A cluster is a set of objects such that an object in a cluster is closer (more
similar) to the prototype or “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
13
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
• A cluster is a set of points such that a point in a cluster is closer (or more
similar) to one or more other points in the cluster than to any point not in
the cluster.
8 contiguous clusters
14
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points, separated from other regions of high density by regions of low density.
• Used when the clusters are irregular or intertwined, and when noise and
outliers are present.
6 density-based clusters
15
Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP-hard)
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a 'mixture' of a number of statistical distributions (see the sketch below).
16
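To make the parameterized-model idea concrete, here is a minimal sketch that fits a two-component Gaussian mixture with scikit-learn. The library choice, the synthetic data, and all parameter values are assumptions for illustration, not part of the slides.

```python
# Minimal mixture-model sketch (illustrative only): fit a 2-component Gaussian
# mixture and read off soft (probabilistic) cluster memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from two Gaussian blobs (made up for the example)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard_labels = gmm.predict(X)         # hard cluster assignments
soft_weights = gmm.predict_proba(X)  # membership weights, one row per point, summing to 1
print(gmm.means_)                    # estimated component means (the model parameters)
```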
Characteristics of the Input Data Are Important
• Type of proximity or density measure
• Central to clustering
• Depends on data and application
17
What Is Good Clustering?
• A good clustering method will produce high
quality clusters with
• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on
both the similarity measure used by the
method and its implementation.
18
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
19
Partitioning Clustering Approach
• Partitioning algorithms construct a partition of a database of N objects into a set of K clusters.
• The partitioning clustering algorithm usually adopts the iterative optimization paradigm.
• It starts with an initial partition and uses an iterative control strategy.
• It tries swapping data points between clusters to see whether such a swap improves the quality of the clustering.
• When no swap yields any improvement, a locally optimal partitioning has been found.
• In principle, the optimal partition is achieved by minimizing the sum of squared distances from each object to the "representative object" (center) of its cluster.
20
K-means algorithm
• Given the cluster number K, the K-means algorithm is carried out in three steps after initialization (a code sketch follows below):
• Initialization: set K seed points (randomly)
1. Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
2. Compute new seed points as the centroids (means) of the current clusters
3. Repeat steps 1 and 2 until no assignment changes (convergence)
21
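A minimal NumPy sketch of these three steps follows. It is an illustration rather than the slides' own code: the function name, the random seeding, and the Euclidean distance choice are assumptions, and empty clusters are not handled.

```python
# Illustrative k-means sketch: random seeds, assign to nearest centroid,
# recompute centroids, repeat until the centroids stop moving.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the seed centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```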
K-means - Example
• Problem:
• Suppose we have 4 types of medicines and each has two attributes (weight and pH index). Our goal is to group these objects into K = 2 groups of medicines (a run of the k-means sketch on this data follows below).

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4

(Figure: scatter plot of A, B, C, D in the weight / pH-index plane.)
22
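As a quick check, the kmeans() sketch above can be run on this data (variable names are ours; for this data set the iterations converge to grouping A and B together and C and D together).

```python
# Running the kmeans() sketch on the medicine data with K = 2
import numpy as np

X = np.array([[1, 1],   # A: weight 1, pH index 1
              [2, 1],   # B
              [4, 3],   # C
              [5, 4]])  # D
labels, centroids = kmeans(X, k=2)
print(labels)      # e.g. [0 0 1 1]: {A, B} and {C, D}
print(centroids)   # e.g. [[1.5 1. ] [4.5 3.5]]
```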
K-means - Example
• Step 1: Use initial seed points for partitioning
23
K-means - Example
• Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
24
K-means - Example
• Step 2: Renew membership based on new centroids
25
K-means - Example
• Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
26
K-means - Example
• Step 3: Repeat the first two steps until convergence
27
Strengths of k-means
• Strengths:
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn), where n is the
number of data points, k is the number of clusters,
and t is the number of iterations.
28
Weaknesses of k-means
• The algorithm is only applicable if the mean is
defined.
• The user needs to specify k.
• The algorithm is sensitive to outliers
• Outliers are data points that are very far away
from other data points.
• Outliers could be errors in the data recording or
some special data points with very different
values.
29
Weaknesses of k-means: Problems with outliers
30
Weaknesses of k-means: To deal with outliers
• One method is to remove some data points
in the clustering process that are much
further away from the centroids than other
data points.
• To be safe, we may want to monitor these
possible outliers over a few iterations and then
decide to remove them.
• Another method is to perform random
sampling. Since in sampling we only choose
a small subset of the data points, the
chance of selecting an outlier is very small.
• Assign the rest of the data points to the clusters
by distance or similarity comparison, or
classification
31
Weaknesses of k-means
• The algorithm is sensitive to initial seeds.
32
Weaknesses of k-means
• If we use different seeds: good results
There are some
methods to help
choose good seeds
33
Weaknesses of k-means
• The k-means algorithm is not suitable for discovering clusters that
are not hyper-ellipsoids (or hyper-spheres).
34
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids)
• The algorithm is intended to find a sequence of objects
called medoids that are centrally located in clusters
• The goal of the algorithm is to minimize the average
dissimilarity of objects to their closest selected object.
• PAM works effectively for small data sets, but does not
scale well for large data sets
• CLARA
• CLARANS
35
PAM Partition Around Medoids
1) Pick k of the data items at random as the initial medoids
2) For each pair (m, h) of a medoid m and a non-medoid object h, calculate the impact of swapping them on clustering quality
3) Perform the swap with the smallest (most negative) cost, i.e. the pair with the best impact on clustering quality, and repeat; stop when no swap improves the clustering
36
Swapping Cost
• For each pair of a medoid m and a non-medoid object h, measure whether h is better than m as a medoid
• For example, we can use the squared-error criterion
• Compute E_h - E_m (the error after the swap minus the error before)
• Negative: swapping brings benefit
• Choose the swap with the minimum swapping cost (a code sketch follows below)
37
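The sketch below illustrates this swap evaluation (it is not the slides' code). It assumes a precomputed pairwise distance matrix D, uses the total distance of every object to its closest medoid as the error criterion, and returns the swap with the most negative cost change E_h - E_m.

```python
# Illustrative PAM swap-cost sketch, assuming a full pairwise distance matrix D
import numpy as np

def total_cost(D, medoids):
    # Sum, over all objects, of the distance to the closest medoid
    return D[:, list(medoids)].min(axis=1).sum()

def best_swap(D, medoids):
    medoids = list(medoids)
    non_medoids = [i for i in range(len(D)) if i not in medoids]
    E_m = total_cost(D, medoids)                 # error before any swap
    best = (0.0, None, None)                     # (cost change, medoid m, non-medoid h)
    for m in medoids:
        for h in non_medoids:
            candidate = [h if x == m else x for x in medoids]
            change = total_cost(D, candidate) - E_m   # E_h - E_m
            if change < best[0]:
                best = (change, m, h)
    return best   # a negative change means the swap improves the clustering
```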
K-medoids Example
Data points:
X1 (2, 6)    X6 (6, 4)
X2 (3, 4)    X7 (7, 3)
X3 (3, 8)    X8 (7, 4)
X4 (5, 7)    X9 (8, 5)
X5 (6, 2)    X10 (7, 6)

Assume k = 2. Select X5 and X9 as the initial medoids.

Distances:
Object  to X5  to X9
X1      8      7
X2      5      6
X3      9      8
X4      7      6
X6      2      3
X7      2      3
X8      3      2
X10     5      2
38
K-medoids Example
• So, now let us choose some other point to be a medoid instead of X5 (6, 2). Let us randomly choose X1 (2, 6).
• Now the new medoid set is: (2, 6) and (8, 5). Repeating the same task as earlier:

Replace X5 by X1:
Object  Before  to X1  to X9  Change
X1      7       0      0      -7
X2      5       3      6      -2
X3      8       3      8      -5
X4      6       4      6      -2
X5      0       8      5      +5
X6      2       6      3      +1
X7      2       8      3      +1
X8      2       7      2       0
X9      0       0      0       0
X10     2       5      2       0
Total change: -9

Current clustering: {X1, X2, X3, X4}, {X5, X6, X7, X8, X9, X10}
39
K-medoids Properties
40
CLARA (Clustering Large Applications)
• CLARA (Clustering Large
Applications) uses a
sampling-based method to deal
with large data sets
• A random sample should closely
represent the original data
• The chosen medoids will likely be
similar to what would have been
chosen from the whole data set
41
CLARA (Clustering Large Applications)
• Draw multiple samples of the data set
• Apply PAM to each sample
• Return the best clustering
42
CLARA Properties
43
CLARA - Algorithm
• Set mincost to MAXIMUM;
• Repeat q times // draws q samples
• Create S by drawing s objects randomly from D;
• Generate the set of medoids K from S by applying the
PAM algorithm;
• Compute cost(K,D)
• If cost(K, D)<mincost
Mincost = cost(K, D);
Bestset = K;
• Endif;
• Endrepeat;
• Return Bestset;
44
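A rough Python sketch of this pseudo-code is shown below for illustration. It reuses total_cost and best_swap from the PAM sketch earlier; the pam helper, the default sample size s, and the number of samples q are assumptions, not values from the slides.

```python
# Illustrative CLARA sketch: run PAM on q random samples and keep the medoid
# set with the lowest cost on the whole data set D (a full distance matrix).
import numpy as np

def pam(D, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    while True:
        change, m, h = best_swap(D, medoids)     # from the PAM sketch above
        if change >= 0:                          # no improving swap: local optimum
            return medoids
        medoids[medoids.index(m)] = h

def clara(D, k, q=5, s=40, seed=0):
    rng = np.random.default_rng(seed)
    mincost, bestset = np.inf, None
    for _ in range(q):                           # draw q samples
        sample = rng.choice(len(D), size=min(s, len(D)), replace=False)
        sub_medoids = pam(D[np.ix_(sample, sample)], k)
        medoids = [sample[i] for i in sub_medoids]   # map sample indices back to D
        cost = total_cost(D, medoids)            # cost(K, D) on the full data set
        if cost < mincost:
            mincost, bestset = cost, medoids
    return bestset, mincost
```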
Complexity of CLARA
• Set mincost to MAXIMUM;   O(1)
• Repeat q times   O(t(s-k)²k + (n-k)k)
• Create S by drawing s objects randomly from D;   O(1)
• Generate the set of medoids K from S by applying the PAM algorithm;   O(t(s-k)²k)
• Compute cost(K, D)   O((n-k)k)
• If cost(K, D) < mincost   O(1)
  Mincost = cost(K, D);
  Bestset = K;
  Endif;
• Endrepeat;
• Return Bestset;
45
CLARANS (“Randomized” CLARA)
• CLARANS (A Clustering Algorithm based on
Randomized Search)
• The clustering process can be presented as searching a
graph where every node is a potential solution, that
is, a set of k medoids
• Two nodes are neighbours if their sets differ by only
one medoid
• Each node can be assigned a cost that is defined to be
the total dissimilarity between every object and the
medoid of its cluster
• The problem corresponds to searching for a minimum on the graph
• At each step, all neighbours of the current node are searched; the neighbour which corresponds to the deepest descent in cost is chosen as the next solution
46
CLARANS (“Randomized” CLARA)
• CLARANS (A Clustering Algorithm
based on Randomized Search)
• The clustering process can be
presented as searching a graph
where every node is a potential
solution, that is, a set of k medoids
• Graph Abstraction
• Every node is a potential solution
(k-medoid)
• Two nodes are adjacent if they differ
by one medoid
• Every node has k(n−k) adjacent nodes
47
CLARANS (“Randomized” CLARA)
• For large values of n and k, examining k(n-k) neighbours is time
consuming.
• At each step, CLARANS draws a sample of neighbours to examine.
• Note that CLARA draws a sample of nodes at the beginning of the search and stays within it; CLARANS therefore has the benefit of not confining the search to a restricted area.
• If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum. The number of local optima to search for is a parameter.
• It is more efficient and scalable than both PAM and CLARA; returns
higher quality clusters.
48
CLARANS
(Figure: the CLARANS search. From each of numlocal randomly chosen starting nodes, the current node C is compared with no more than maxneighbor randomly selected neighbour nodes N, moving to any better neighbour found; when none is better, C is a local minimum. The best of the numlocal local minima is returned as the best node.)
49
CLARANS - Algorithm
• Set mincost to MAXIMUM;
• For i = 1 to numlocal do   // find numlocal local optima
• Randomly select a node as the current node C in the graph;
• J = 1;   // counter of examined neighbours
• Repeat
  Randomly select a neighbour N of C;
  If Cost(N, D) < Cost(C, D)
    Assign N as the current node C;
    J = 1;
  Else J++;
  Endif;
• Until J > maxneighbor
• Update mincost and bestnode with Cost(C, D) if applicable;
• End For
• Return bestnode;
(A code sketch follows below.)
50
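The sketch below mirrors this pseudo-code for illustration (again assuming a precomputed distance matrix D and reusing total_cost from the PAM sketch); the parameter names follow the slides' numlocal and maxneighbor.

```python
# Illustrative CLARANS sketch: numlocal random restarts, and at each current
# node at most maxneighbor randomly chosen neighbours are examined.
import numpy as np

def clarans(D, k, numlocal=2, maxneighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(D)
    mincost, bestnode = np.inf, None
    for _ in range(numlocal):                    # find numlocal local optima
        current = list(rng.choice(n, size=k, replace=False))
        cost_c = total_cost(D, current)          # from the PAM sketch above
        j = 1
        while j <= maxneighbor:
            # A neighbour differs from the current node by exactly one medoid
            m = current[rng.integers(k)]
            h = int(rng.integers(n))
            while h in current:
                h = int(rng.integers(n))
            neighbour = [h if x == m else x for x in current]
            cost_n = total_cost(D, neighbour)
            if cost_n < cost_c:                  # move to the better neighbour
                current, cost_c, j = neighbour, cost_n, 1
            else:
                j += 1
        if cost_c < mincost:                     # keep the best local optimum
            mincost, bestnode = cost_c, current
    return bestnode, mincost
```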
Hierarchical Clustering
• Hierarchical Clustering Approach
• A typical clustering analysis approach via partitioning data set
sequentially
• Construct nested partitions layer by layer via grouping objects into
a tree of clusters (without the need to know the number of
clusters in advance)
• Use (generalised) distance matrix as clustering criteria
• Agglomerative vs Divisive
• Two sequential clustering strategies for constructing a tree of clusters
• Agglomerative: a bottom-up strategy
• Initially each data object is in its own (atomic) cluster
• Then merge these atomic clusters into larger and larger clusters
• Divisive: a top-down strategy
• Initially all objects are in one single cluster
• Then the cluster is subdivided into smaller and smaller clusters
51
Hierarchical Clustering
• Agglomerative approach
Initialization: each object is a cluster
Iteration: merge the two clusters which are most similar to each other;
until all objects are merged into a single cluster
(Diagram: a, b, c, d, e → ab, de → cde → abcde)
52
Hierarchical Clustering
• Divisive Approaches
Initialization: all objects stay in one cluster
Iteration: select a cluster and split it into two sub-clusters;
until each leaf cluster contains only one object
(Diagram: abcde → ab, cde → de → a, b, c, d, e)
53
Dendrogram
• A binary tree that shows how clusters are
merged/split hierarchically
• Each node on the tree is a cluster; each leaf
node is a singleton cluster
54
Dendrogram
• A clustering of the data objects is obtained by
cutting the dendrogram at the desired level,
then each connected component forms a cluster
55
Dendrogram
• A clustering of the data objects is obtained by
cutting the dendrogram at the desired level, then
each connected component forms a cluster
56
How to Merge Clusters?
• How to measure the distance between clusters?
Single-link
Complete-link
Average-link
Centroid distance
57
How to Define Inter-Cluster Distance
Single-link
Complete-link
Average-link
Centroid distance

Single-link: the distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.
58
How to Define Inter-Cluster Distance
Single-link
Complete-link
Average-link
Centroid distance

Complete-link: the distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.
59
How to Define Inter-Cluster Distance
Single-link
Complete-link
Average-link
Centroid distance

Average-link: the distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.
60
How to Define Inter-Cluster Distance
Single-link
Complete-link
Average-link
Centroid distance

Centroid distance: the distance between two clusters is represented by the distance between the means m_i, m_j of the clusters C_i, C_j.
61
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature, assume that there are two clusters: C1 = {a, b} and C2 = {c, d, e}, with feature values a = 1, b = 2, c = 4, d = 5, e = 6.

Distance matrix:
    a  b  c  d  e
a   0  1  3  4  5
b   1  0  2  3  4
c   3  2  0  1  2
d   4  3  1  0  1
e   5  4  2  1  0

Single link:   dist(C1, C2) = d(b, c) = 2
Complete link: dist(C1, C2) = d(a, e) = 5
Average:       dist(C1, C2) = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 3.5
(These values can be checked with the short computation below.)
62
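The three inter-cluster distances for this example can be verified with a few lines of Python (a small illustrative computation over the values above):

```python
# Single-, complete-, and average-link distances between C1 = {a, b} and
# C2 = {c, d, e} for the one-dimensional example above.
points = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

pair_dists = [abs(points[p] - points[q]) for p in C1 for q in C2]
print(min(pair_dists))                      # single link   -> 2
print(max(pair_dists))                      # complete link -> 5
print(sum(pair_dists) / len(pair_dists))    # average link  -> 3.5
```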
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1) Convert all object features into
a distance matrix
2) Set each object as a cluster
(thus if we have N objects, we
will have N clusters at the
beginning)
3) Repeat until number of cluster
is one (or known # of clusters)
▪ Merge two closest clusters
▪ Update “distance matrix”
63
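The three steps above can be illustrated with SciPy (a library choice the slides do not prescribe). The coordinates below are made up, but they are chosen so that D and F merge at distance 0.50 and A and B at 0.71, matching the dendrogram walk-through on a later slide.

```python
# Illustrative agglomerative clustering with SciPy: distance matrix, repeated
# merging of the two closest clusters (single link), and a dendrogram cut.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

names = list('ABCDEF')
X = np.array([[1.0, 1.0],    # A   (assumed coordinates, for illustration)
              [1.5, 1.5],    # B
              [5.0, 5.0],    # C
              [3.0, 4.0],    # D
              [4.0, 4.0],    # E
              [3.0, 3.5]])   # F

dists = pdist(X, metric='euclidean')    # step 1: distance matrix (condensed form)
Z = linkage(dists, method='single')     # steps 2-3: merge the two closest clusters until one remains
tree = dendrogram(Z, labels=names, no_plot=True)   # dendrogram structure (plot it with matplotlib if desired)

clusters = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(dict(zip(names, clusters)))
```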
Example
• Problem: clustering analysis with agglomerative algorithm
data matrix
Euclidean distance
distance matrix
64
Example
• Merge two closest clusters (iteration 1)
65
Example
• Update distance matrix (iteration 1)
66
Example
• Merge two closest clusters (iteration 2)
67
Example
• Update distance matrix (iteration 2)
68
Example
• Merge two closest clusters/update distance matrix (iteration 3)
69
Example
• Merge two closest clusters/update distance matrix (iteration 4)
70
Example
• Final result (meeting termination condition)
71
Example
• Dendrogram tree representation
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
(Dendrogram figure; the vertical axis shows the object lifetime.)
73
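Using the linkage matrix Z from the SciPy sketch shown after the agglomerative algorithm (with its assumed coordinates), the first two rows reproduce exactly these merges:

```python
# First two merges recorded in Z: [cluster i, cluster j, merge distance, size]
print(Z[:2])
# [[3.     5.     0.5    2.   ]   -> D and F merge at distance 0.50
#  [0.     1.     0.7071 2.   ]]  -> A and B merge at distance 0.71
```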
Hierarchical Clustering: Comparison
(Figure: the same six points clustered hierarchically with MIN (single link), MAX (complete link), and Group Average, showing the different nested clusterings each produces.)
74
Which Distance Measure is Better?
• Each method has both advantages and disadvantages;
application-dependent, single-link and complete-link
are the most common methods
• Single-link
• Can find irregular-shaped clusters
• Sensitive to outliers, suffers the so-called chaining effects
• In order to merge two groups, only need one pair of points to be
close, irrespective of all others. Therefore clusters can be too spread
out, and not compact enough
• Average-link and centroid distance
• Robust to outliers
• Tend to break large clusters
75
AGNES
• AGNES : Agglomerative Nesting
• Use single-link method
• Merge nodes that have the least dissimilarity
• Eventually all objects belong to the same cluster
76
UPGMA
• UPGMA: Un-weighted Pair-Group Method Average.
• Merge Strategy:
• Average-link approach
• The distance between two clusters is measured by the average distance
between two objects belonging to different clusters.
d_avg(C_i, C_j) = (1 / (n_i n_j)) · Σ_{p ∈ C_i} Σ_{q ∈ C_j} d(p, q)

where n_i, n_j are the numbers of objects in clusters C_i, C_j.
77
DIANA
• DIANA: Divisive Analysis
• First, all of the objects form one cluster.
• The cluster is split according to some principle, such as the minimum
Euclidean distance between the closest neighboring objects in the
cluster.
• The cluster splitting process repeats until, eventually, each new
cluster contains a single object, or a termination condition is met.
78
(Figure: splitting a cluster C into sub-clusters C1 and C2.)
4. Choose the object Ok with the greatest D score.
5. If Dk > 0, move Ok from C2 to C1, and repeat steps 3-5.
6. Otherwise, stop the splitting process.
79