Chapter 3 Unsupervised Learning
Supervised learning vs. unsupervised learning
• Supervised learning learns a mapping from inputs to outputs using labeled examples; unsupervised learning looks for structure (e.g., clusters) in unlabeled data.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related)
to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized; inter-cluster distances are maximized.
General Applications of Clustering
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Image Processing
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Common Distance/Similarity measures
• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
Common measures include the Minkowski, Manhattan, Euclidean, Hamming, and cosine distances.
1. The Manhattan distance (also called taxicab norm or 1-norm) is given by: d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
Reading assignment: Try to understand the difference between each of the distance measures.
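To make the measures concrete, here is a minimal sketch (not from the original slides) that computes the Manhattan, Euclidean, and cosine distances between two made-up feature vectors with NumPy:

```python
import numpy as np

# Two example feature vectors (made-up values, for illustration only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Manhattan (taxicab / 1-norm) distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

# Euclidean (2-norm) distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Cosine distance: 1 minus the cosine of the angle between the vectors.
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(manhattan, euclidean, cosine)  # 5.0, ~3.61, ~0.31
```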
Challenges: Notion of a Cluster can be Ambiguous
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or
all of the hidden patterns.
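The intra-/inter-class trade-off can also be quantified numerically. The silhouette coefficient is not mentioned on this slide, but as an illustrative sketch (assuming scikit-learn is available) it captures exactly this idea: values near 1 mean points are much closer to their own cluster than to other clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up 2-D data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# Cluster the data, then score how compact and well separated the clusters are.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for this easy example
```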
Additional Requirements of Clustering in ML
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Ability to deal with noise and outliers
• Insensitive to order of input records
• Interpretability and usability
Partitioning Algorithms: Basic Concept
• Goal: Construct a partition of a dataset D of n objects/rows into a set of k clusters,
where the value of k is provided by the user.
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps (a minimal sketch follows below):
Step 1: Randomly select k objects as the initial cluster centers (centroids),
Step 2: Compute the distance between each cluster center and the other data points,
Step 3: Assign each object to the cluster with the nearest cluster center (centroid),
Step 4: Replace each cluster center with the mean of its cluster, and go back to Step 2; stop when a stopping criterion is met (e.g., no more reassignments, no change in SSE, a maximum number of iterations reached).
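A minimal NumPy sketch of these four steps (the function and variable names are mine, not from the slides; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: replace each centroid by the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move (no more replacement).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on made-up data:
# centroids, labels = kmeans(np.random.default_rng(1).normal(size=(20, 2)), k=2)
```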
A simple example showing the implementation of the k-means algorithm (using k = 2)
Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
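For instance, for a hypothetical point x = (2.0, 3.0) (not taken from the original table), the Euclidean distances to the two centroids would be d(x, m1) = sqrt((2.0 − 1.0)² + (3.0 − 1.0)²) = sqrt(5) ≈ 2.24 and d(x, m2) = sqrt((2.0 − 5.0)² + (3.0 − 7.0)²) = sqrt(25) = 5.0, so this point would be assigned to the cluster of m1.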
Step 2:
• Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
• Their new centroids are:
Step 3:
• Now, using these new centroids, we compute the Euclidean distance of each object to them, as shown in the table.
• Step 4:
The clusters obtained are:
{1,2} and {3,4,5,6,7}
[Plot of the clustering result]
Variations of the K-Means Method
• A few variants of k-means exist, differing in
– Selection of the initial k means
– Dissimilarity calculations
What is the problem with the k-means method?
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Hierarchical Clustering
Hierarchical Clustering
• There are two types of hierarchical clustering:
– Agglomerative: a bottom-up approach where each data point starts as its own cluster and the closest clusters are repeatedly merged until one big cluster (or a desired number of clusters) remains.
– Divisive: a top-down approach where all the data points are treated as one big cluster and the clustering process involves dividing the one big cluster into several smaller clusters.
Agglomerative Clustering Algorithm
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
[Figure: proximity matrix over points p1, p2, p3, p4, p5, …]
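As an illustration (my own sketch, not from the slides), the MIN, MAX, group-average, and centroid criteria can all be computed directly from the pairwise distances between two made-up clusters:

```python
import numpy as np

# Two hypothetical clusters of 2-D points.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0], [4.0, 4.0]])

# Pairwise Euclidean distances between every point in A and every point in B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pairwise.min()    # MIN: closest pair of points across the clusters
complete = pairwise.max()    # MAX: farthest pair of points across the clusters
average  = pairwise.mean()   # Group Average: mean of all cross-cluster distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centroids

print(single, complete, average, centroid)
```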
Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: initial proximity matrix with one row and column per point, and the individual data points below]
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: proximity matrix over the current clusters C1–C5 and the corresponding groups of points]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: proximity matrix over clusters C1–C5, with C2 and C5 as the closest pair]
After Merging
• The question is: "How do we update the proximity matrix?"
[Figure: proximity matrix after merging C2 and C5 into C2 ∪ C5; the entries involving the merged cluster are marked "?"]
Simulation of Agglomerative Clustering
Summary
• A dendrogram shows how the clusters are merged hierarchically.
• Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
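A minimal sketch of agglomerative clustering and dendrogram cutting, assuming SciPy is available (the data points are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, size=(10, 2)),
               rng.normal(3, 0.4, size=(10, 2))])

# Build the merge tree (dendrogram) with single linkage (the MIN criterion).
Z = linkage(X, method='single')

# Cut the dendrogram so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib).
```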
Density-Based Clustering Methods
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering: Background
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-neighbourhood of that
point
[Figure: points p and q with example parameters MinPts = 5 and Eps = 1 cm]
Density-Based Clustering: Background (II)
• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi.
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: Density Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of
density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts; if p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.
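A minimal sketch using scikit-learn's DBSCAN implementation (assuming scikit-learn is available; the data is made up). Here eps corresponds to Eps and min_samples to MinPts:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D data: two dense blobs plus a few scattered noise points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
               rng.normal(4, 0.3, size=(40, 2)),
               rng.uniform(-2, 6, size=(5, 2))])

# eps plays the role of Eps, min_samples the role of MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labelled -1 are treated as noise/outliers; the rest get cluster ids.
print(set(labels))
```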
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: original points and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Cluster Evaluation
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary