Lecture 6 - Clustering
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
November 1, 2022 Data Mining: Concepts and Techniques 1
Clustering High-Dimensional Data
Clustering high-dimensional data
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measures become meaningless because points become nearly equidistant
Clusters may exist only in some subspaces
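The equi-distance effect named above can be observed numerically. The following is a minimal sketch, assuming uniformly random data in the unit cube and Euclidean distance; the function name `relative_contrast` and the chosen sizes are illustrative and not part of the lecture.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min of the distances from one random query
    point to n_points random points in the n_dims-dimensional unit cube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast between the farthest and the nearest neighbour shrinks
# as the number of dimensions grows.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(1000, n_dims), 3))
```

A small contrast means the nearest and the farthest point are almost equally far from the query, which is what makes distance-based clustering unreliable in the full space.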
Methods
Feature transformation: only effective if most dimensions are relevant
PCA & SVD useful only when features are highly correlated/redundant
Feature selection: wrapper or filter approaches
useful to find a subspace where the data have nice clusters
Subspace-clustering: find clusters in all the possible subspaces
CLIQUE, ProClus, and frequent pattern-based clustering
The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)
[Figure: grid plots of Salary vs. age and Vacation (week) vs. age, with age from 20 to 60 and the vertical axes from 0 to 7, plus a 3-D view over Salary, Vacation, and age]
Strength
automatically finds subspaces of the highest
where
$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}$
$d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}$
$d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} d_{ij}$
A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
Problems with bi-cluster
No downward closure property
Because of averaging, a δ-cluster may contain outliers and still stay within the δ threshold
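The averaging problem can be illustrated with a short sketch. It assumes that H(I, J) is the standard mean squared residue of Cheng and Church, that is, the average over the submatrix of $(d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2$; that definition is not shown in this part of the slides, so treat it as an assumption. The function names and the test matrix are illustrative.

```python
import numpy as np

def mean_squared_residue(d: np.ndarray) -> float:
    """H(I, J) for a submatrix d (rows = I, columns = J), assuming the
    mean squared residue definition stated in the lead-in."""
    row_mean = d.mean(axis=1, keepdims=True)  # d_iJ, one value per row
    col_mean = d.mean(axis=0, keepdims=True)  # d_Ij, one value per column
    all_mean = d.mean()                       # d_IJ
    residue = d - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

def is_delta_cluster(d: np.ndarray, delta: float) -> bool:
    """A submatrix is a delta-cluster if H(I, J) <= delta."""
    return mean_squared_residue(d) <= delta

# A perfectly additive 10 x 10 pattern has residue 0.
coherent = np.add.outer(np.arange(10.0), np.arange(10.0))

# One strongly deviating cell is averaged away over 100 cells.
with_outlier = coherent.copy()
with_outlier[0, 0] += 3.0

print(mean_squared_residue(coherent))
print(mean_squared_residue(with_outlier), is_delta_cluster(with_outlier, 0.1))
```

With a threshold of δ = 0.1 (an arbitrary value chosen for the example), the submatrix with the outlier is still accepted, which is the weakness the slide points out.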
p-Clustering: Clustering by Pattern Similarity
Given objects x, y in O and features a, b in T, a pCluster is defined on a 2 by 2 matrix:
$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$
A pair (O, T) forms a δ-pCluster if, for every 2 by 2 submatrix X of (O, T),
pScore(X) ≤ δ for some δ > 0
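The definition can be checked directly by enumerating every pair of objects and every pair of features. This is a brute-force sketch for illustration only, not the pattern-growth mining algorithm; the function names and the sample matrices are assumptions.

```python
from itertools import combinations
import numpy as np

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 by 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(d: np.ndarray, delta: float) -> bool:
    """Check every 2 by 2 submatrix of d (rows = objects O, columns = features T)."""
    n_objects, n_features = d.shape
    for x, y in combinations(range(n_objects), 2):
        for a, b in combinations(range(n_features), 2):
            if p_score(d[x, a], d[x, b], d[y, a], d[y, b]) > delta:
                return False
    return True

# Three objects following the same pattern, shifted by 10 and 20.
shifted = np.array([[1, 5, 3],
                    [11, 15, 13],
                    [21, 25, 23]])

# The same data with one deviating entry.
broken = shifted.copy()
broken[2, 1] += 4

print(is_delta_pcluster(shifted, 0))
print(is_delta_pcluster(broken, 1))
```

Because every 2 by 2 submatrix is tested, a single deviating entry is enough to reject the pair (O, T), unlike the averaged bi-cluster score.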
Properties of δ-pCluster
Downward closure
Clusters are more homogeneous than bi-cluster (thus the name:
pair-wise Cluster)
Pattern-growth algorithm has been developed for efficient mining
For scaling patterns, one can observe that taking the logarithm of
$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$
leads to the pScore form
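Written out, the step for scaling patterns is the following, assuming all entries are positive so that the logarithms exist. The ratio is that of $d_{xa}/d_{ya}$ to $d_{xb}/d_{yb}$:

```latex
\log \frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}
  = (\log d_{xa} - \log d_{ya}) - (\log d_{xb} - \log d_{yb})
  = (\log d_{xa} - \log d_{xb}) - (\log d_{ya} - \log d_{yb})
```

The absolute value of the right-hand side is the pScore of the 2 by 2 matrix of log-transformed values, so a scaling pattern in the original data becomes a shifting pattern after the transform.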
Why Constraint-Based Cluster Analysis?
Need user feedback: users know their applications best
Fewer parameters but more user-desired constraints, e.g., an
ATM allocation problem with obstacles and desired clusters
Customer segmentation
Medical analysis
Drawbacks
most tests are for a single attribute
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the
objects in T lie at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
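The DB(p, D) definition can be applied literally with two nested loops over the data. The sketch below is a naive quadratic scan in the spirit of the nested-loop approach, without the block buffering a real implementation would use; it assumes Euclidean distance, and the function name and sample points are illustrative.

```python
import numpy as np

def db_outliers(points: np.ndarray, p: float, D: float) -> list:
    """Return indices of DB(p, D)-outliers: objects for which at least a
    fraction p of all objects in the dataset lie farther away than D."""
    n = len(points)
    outliers = []
    for i in range(n):
        far = 0
        for j in range(n):
            # The object itself is at distance 0 and never counts as far.
            if np.linalg.norm(points[i] - points[j]) > D:
                far += 1
        if far >= p * n:
            outliers.append(i)
    return outliers

# Five points close together and one isolated point.
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]])

print(db_outliers(data, p=0.8, D=3.0))
```

Only the isolated point has at least 80% of the dataset farther than 3 units away; each clustered point has just one such neighbour.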