Clustering Lecture
Distance Measure
In this algorithm, the similarity between instances is captured by a distance measure.
The originally proposed measure of distance is the Euclidean distance.
Given $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Instance    x      f(x)
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0
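As a quick illustration (a minimal sketch of my own, not from the slides), the Euclidean distance between instances 1 and 2 in the table can be computed as follows:

```python
import math

# The six instances from the table: (x, f(x)) pairs.
instances = {
    1: (1.0, 1.5),
    2: (1.0, 4.5),
    3: (2.0, 1.5),
    4: (2.0, 3.5),
    5: (3.0, 2.5),
    6: (5.0, 6.0),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Distance between instance 1 and instance 2: sqrt(0^2 + 3^2) = 3.0
print(euclidean(instances[1], instances[2]))  # 3.0
```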
[Figure: scatter plot of the six instances in the x–f(x) plane]
Three possible two-cluster outcomes and their squared errors:

Cluster Centers     Cluster Points     Squared Error
(2.67, 4.67)        2, 4, 6            14.50
(2.00, 1.83)        1, 3, 5
(1.50, 1.50)        1, 3               15.94
(2.75, 4.125)       2, 4, 5, 6
(1.80, 2.70)        1, 2, 3, 4, 5       9.60
(5.00, 6.00)        6
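As a check (my own sketch, assuming the `instances` dictionary and `euclidean` function from the earlier example), the squared error of the third outcome can be recomputed directly; it prints roughly 9.60:

```python
# Squared error of the third outcome: clusters {1,2,3,4,5} and {6}.
clusters = {
    (1.8, 2.7): [1, 2, 3, 4, 5],
    (5.0, 6.0): [6],
}

sse = sum(
    euclidean(center, instances[i]) ** 2
    for center, members in clusters.items()
    for i in members
)
print(round(sse, 2))  # 9.6
```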
General Considerations
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Requires real-valued data. Categorical data can be converted to real values, but the distance function needs to be worked out carefully.
We must select the number of clusters present in the data.
Data normalization may be required if attribute ranges vary significantly (a sketch follows this list).
Alternative distance measures may generate different clusters.
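One common normalization option (my own sketch; the slides do not prescribe a specific method) is min-max normalization, which rescales each attribute to the range [0, 1]:

```python
def min_max_normalize(column):
    """Rescale a list of attribute values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant attribute: map everything to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

print(min_max_normalize([1.0, 1.0, 2.0, 2.0, 3.0, 5.0]))
# [0.0, 0.0, 0.25, 0.25, 0.5, 1.0]
```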
K-means Clustering
Complexity is O(n * K * I * d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.
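The following is a minimal K-means sketch of my own (not the slides' reference implementation), written with NumPy and run on the six instances from the earlier table:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-means: returns (centroids, labels) for an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels

# The six instances from the earlier table.
data = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                 [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```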
[Figure: the original points, an optimal clustering, and a sub-optimal clustering of the same data]
[Figure: successive K-means iterations (iterations 2–6), showing cluster assignments and centroids being updated at each step]
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

where $C_i$ is the $i$-th cluster and $m_i$ is its centroid (mean).
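Since K-means can land in a sub-optimal solution, a common remedy (my own sketch, assuming the `kmeans` function and `data` array from the earlier example) is to run it several times with different random seeds and keep the run with the lowest SSE:

```python
def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((points - centroids[labels]) ** 2).sum())

# Several random restarts; keep the clustering with the smallest SSE.
best = min(
    (kmeans(data, k=2, seed=s) for s in range(10)),
    key=lambda cl: sse(data, *cl),
)
print(sse(data, *best))
```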
[Figure: K-means iterations (iterations 2–5) for a different choice of initial centroids, ending in the sub-optimal clustering]
If there are K "real" clusters, then the chance of selecting one initial centroid from each cluster is small.
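To make this concrete, here is the standard back-of-the-envelope calculation (under the simplifying assumption, not stated on the slide, that each of the K true clusters contains the same number n of points and the K initial centroids are drawn uniformly at random):

$$P(\text{one initial centroid in each cluster}) \approx \frac{K!\, n^{K}}{(Kn)^{K}} = \frac{K!}{K^{K}}$$

For K = 10 this is $10!/10^{10} \approx 0.00036$, so a random initialization almost never starts with one centroid in every true cluster.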
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: example of nested clusters and the corresponding dendrogram]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters.
Any desired number of clusters can be obtained by cutting the dendrogram at the proper level (see the sketch after this list).
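A minimal sketch (my own, using SciPy's hierarchical-clustering utilities, which the slides do not mention) of cutting a dendrogram to obtain a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points (the instances from the K-means example).
data = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                 [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# Build the full merge hierarchy with single linkage (MIN).
Z = linkage(data, method="single")

# "Cut" the dendrogram so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # cluster id (1 or 2) for each of the six points
```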
Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters and, at each step, merge the closest pair of clusters until only one cluster (or K clusters) remains.
Divisive: start with one all-inclusive cluster and, at each step, split a cluster until each cluster contains a single point (or there are K clusters).
Agglomerative Clustering Algorithm
More popular hierarchical clustering technique.
Basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains.
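A from-scratch sketch of these six steps (my own illustrative code, using single-link / MIN as the cluster proximity; not the slides' implementation):

```python
import math

def single_link(c1, c2, points):
    """MIN proximity: distance between the two closest points across clusters."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points):
    """Return the sequence of merges performed, from n clusters down to 1."""
    clusters = [[i] for i in range(len(points))]    # step 2: every point is a cluster
    merges = []
    while len(clusters) > 1:                        # steps 3/6: repeat until one cluster
        # steps 1 & 4: find and merge the two closest clusters
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]     # step 5: proximities are recomputed
        del clusters[b]                             #         from the merged cluster
    return merges

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
for left, right in agglomerative(points):
    print(left, "+", right)
```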
Starting Situation
Start with clusters of individual points (p1, p2, p3, p4, p5, ...) and a proximity matrix.
[Figure: individual points p1–p5 and their proximity matrix]
Intermediate Situation
After some merging steps, we have some clusters (C1, C2, C3, C4, C5).
[Figure: current clusters and the updated proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1–C5 and their proximity matrix before the merge]
After Merging
The question is: how do we update the proximity matrix?
[Figure: proximity matrix after merging C2 and C5, with the entries for the new cluster C2 U C5 marked "?"]
How to Define Inter-Cluster Similarity?
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
(A code sketch of the first three criteria follows.)
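The first three criteria can be written down directly (a sketch with my own function names; the point-to-point distance is the Euclidean distance defined earlier):

```python
import math

def dist(p, q):
    """Euclidean point-to-point distance."""
    return math.dist(p, q)

def d_min(c1, c2):
    """MIN (single link): closest pair of points across the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def d_max(c1, c2):
    """MAX (complete link): farthest pair of points across the two clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def d_group_average(c1, c2):
    """Group average: mean of all pairwise distances across the two clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))
```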
Hierarchical Clustering: MIN (Single Link)
Similarity matrix for the example:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
[Figure: nested clusters produced by MIN and the corresponding dendrogram]
Strength of MIN
Can handle non-elliptical shapes.
[Figure: original points and the two clusters found by MIN]
Limitations of MIN
Sensitive to noise and outliers.
[Figure: original points and the two clusters found by MIN]
Hierarchical Clustering: MAX (Complete Link)
Uses the same similarity matrix as above.
[Figure: nested clusters produced by MAX and the corresponding dendrogram]
Strength of MAX
Less susceptible to noise and outliers.
[Figure: original points and the two clusters found by MAX]
Limitations of MAX
Tends to break large clusters.
Biased towards globular clusters.
[Figure: original points and the two clusters found by MAX]
Group Average: the proximity of two clusters is the average of the pairwise proximities between points in the two clusters.

$$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\displaystyle\sum_{p_i \in \mathrm{Cluster}_i} \sum_{p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \times |\mathrm{Cluster}_j|}$$
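For instance (a worked example of my own, using the similarity matrix shown earlier), the group-average proximity between clusters $\{I_1, I_2\}$ and $\{I_4, I_5\}$ is

$$\frac{0.65 + 0.20 + 0.60 + 0.50}{2 \times 2} = \frac{1.95}{4} \approx 0.49.$$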
Hierarchical Clustering: Group Average
Uses the same similarity matrix as before.
[Figure: nested clusters produced by Group Average and the corresponding dendrogram]
Limitations of Group Average
Biased towards globular clusters.
Hierarchical Clustering: Comparison
[Figure: the same data set clustered with MIN, MAX, Group Average, and Ward's Method]