Lecture 6 - Clustering

This document discusses different types of clustering methods and challenges in clustering high-dimensional data. It covers partitioning, hierarchical, density-based, grid-based, and model-based clustering algorithms. For high-dimensional data, it discusses how many irrelevant dimensions can mask clusters and distance becomes meaningless. Feature selection/transformation and subspace clustering are introduced as methods to address these challenges. The document also discusses constraint-based clustering and how user constraints can guide the clustering process.


Chapter 6.

Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Clustering High-Dimensional Data
 Clustering high-dimensional data
 Many applications: text documents, DNA micro-array data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
 Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly correlated/redundant
 Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
 Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering
The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)

 Data in only one dimension is relatively densely packed
 Adding a dimension "stretches" the points across that dimension, making them further apart
 Adding more dimensions makes the points further apart still: high-dimensional data is extremely sparse
 Distance measures become nearly meaningless, since all pairs of points tend to be almost equidistant (a small numeric illustration follows below)
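The following is not part of the original slides; it is a minimal NumPy sketch (assuming NumPy is available; the sample sizes are arbitrary) of the equi-distance effect: as dimensionality grows, the relative spread of pairwise distances between uniformly random points shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pairs = 2000, 5000
for d in (1, 2, 10, 100, 1000):
    X = rng.random((n, d))                 # n points drawn uniformly from the unit hypercube [0, 1]^d
    i = rng.integers(0, n, pairs)
    j = rng.integers(0, n, pairs)
    keep = i != j                          # ignore self-pairs (distance 0)
    dist = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    # std/mean of the pairwise distances shrinks toward 0 as d grows:
    # nearest and farthest neighbors become nearly indistinguishable
    print(f"d={d:4d}  mean distance={dist.mean():7.3f}  std/mean={dist.std() / dist.mean():.3f}")
```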


Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)

 Clusters may exist only in some subspaces


 Subspace-clustering: find clusters in all the subspaces



CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
 CLIQUE can be considered both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
 A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
 Partition the data space and find the number of points that lie inside each cell of the partition
 Identify the subspaces that contain clusters using the Apriori principle
 Identify clusters
    Determine dense units in all subspaces of interest (a small sketch of this step follows below)
    Determine connected dense units in all subspaces of interest
 Generate a minimal description for the clusters
    Determine maximal regions that cover each cluster of connected dense units
    Determine a minimal cover for each cluster
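The sketch below is not from the original slides. It is a simplified Python illustration of the grid-partitioning and dense-unit steps with Apriori-style growth of subspaces; the connected-component and minimal-description steps are omitted, and the names dense_units, n_intervals, and tau are invented for this example.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals=10, tau=3):
    """CLIQUE-style sketch: partition each dimension into equal-length intervals,
    count points per grid unit, and keep units with at least tau points.
    Subspaces are grown bottom-up with the Apriori principle."""
    n_dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(n_dims)]
    hi = [max(p[d] for p in points) for d in range(n_dims)]

    def cell(p, d):
        # index of the equal-length interval that p falls into along dimension d
        width = (hi[d] - lo[d]) / n_intervals or 1.0
        return min(int((p[d] - lo[d]) / width), n_intervals - 1)

    def units_of(subspace):
        counts = Counter(tuple(cell(p, d) for d in subspace) for p in points)
        return {unit for unit, c in counts.items() if c >= tau}

    dense, frontier = {}, []
    for d in range(n_dims):                          # 1-dimensional subspaces first
        units = units_of((d,))
        if units:
            dense[(d,)] = units
            frontier.append((d,))

    k = 2
    while frontier:
        involved = sorted({d for sub in frontier for d in sub})
        # candidate k-dim subspaces: every (k-1)-dim projection must already be dense
        candidates = [s for s in combinations(involved, k)
                      if all(proj in dense for proj in combinations(s, k - 1))]
        frontier = []
        for s in candidates:
            units = units_of(s)
            if units:
                dense[s] = units
                frontier.append(s)
        k += 1
    return dense                                     # subspace -> set of dense grid units

# toy example: two groups dense in both dimensions, plus one isolated point
pts = [(25, 3.0), (26, 3.1), (27, 3.2), (60, 9.0), (61, 9.1), (62, 9.2), (40, 1.0)]
print(dense_units(pts, n_intervals=5, tau=3))
```

The Apriori principle enters in the candidate generation: a k-dimensional subspace is only counted if every one of its (k-1)-dimensional projections already contains dense units.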


[Figure: CLIQUE example over the attributes age (20-60), salary (unit: $10,000), and vacation (unit: weeks), with density threshold τ = 3. Dense units found in the (age, salary) and (age, vacation) subspaces are intersected to locate a candidate cluster region around ages 30-50.]


Strength and Weakness of CLIQUE

 Strength
    Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
    Insensitive to the order of records in the input and does not presume any canonical data distribution
    Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
 Weakness
    The accuracy of the clustering result may be degraded for the sake of the method's simplicity
Frequent Pattern-Based Approach
 Clustering in high-dimensional space (e.g., clustering text documents, microarray data)
 Projected subspace clustering: which dimensions should be projected on?
    CLIQUE, ProClus
 Feature extraction: costly and may not be effective?
 Using frequent patterns as "features"
    Frequent patterns are inherent features
    Mining frequent patterns may not be so expensive
 Typical methods
 Frequent-term-based document clustering
 Clustering by pattern similarity in micro-array data (pClustering)



Clustering by Pattern Similarity (p-Clustering)

 Right: the microarray "raw" data shows 3 genes and their values in a multi-dimensional space
    Difficult to find their patterns
 Bottom: some subsets of dimensions form nice shift and scaling patterns


Why p-Clustering?
 Microarray data analysis may require
    Clustering on thousands of dimensions (attributes)
    Discovery of both shift and scaling patterns
 Clustering with a Euclidean distance measure? It cannot find shift patterns
 Clustering on the derived attributes Aij = ai − aj? It introduces N(N−1) dimensions
 Bi-cluster using the mean-squared residue score of a submatrix (I, J), built from the means
  $$d_{iJ} = \frac{1}{|J|}\sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|}\sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J} d_{ij}$$
  where the mean-squared residue is
  $$H(I, J) = \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J} \left(d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}\right)^2$$
 A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
 Problems with bi-clusters
    No downward closure property
    Due to averaging, a bi-cluster may contain outliers and still stay within the δ-threshold (a small numeric sketch follows below)
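Not part of the original slides: a minimal NumPy sketch of the mean-squared residue H(I, J) (the function name mean_squared_residue is invented here), showing that a pure shift pattern scores 0.

```python
import numpy as np

def mean_squared_residue(D, rows, cols):
    """H(I, J) for the submatrix of D given by row set I (`rows`) and column set J (`cols`):
    the mean of (d_ij - d_iJ - d_Ij + d_IJ)^2 over the submatrix."""
    S = D[np.ix_(rows, cols)]
    d_iJ = S.mean(axis=1, keepdims=True)   # row means d_iJ
    d_Ij = S.mean(axis=0, keepdims=True)   # column means d_Ij
    d_IJ = S.mean()                        # overall mean d_IJ
    return ((S - d_iJ - d_Ij + d_IJ) ** 2).mean()

# rows that differ only by an additive shift form a perfect bi-cluster (H = 0)
D = np.array([[1.0, 2.0, 5.0],
              [3.0, 4.0, 7.0],
              [0.0, 1.0, 4.0]])
print(mean_squared_residue(D, [0, 1, 2], [0, 1, 2]))   # 0.0
```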
p-Clustering: Clustering by Pattern Similarity
 Given objects x, y in O and features a, b in T, the pScore of the 2-by-2 submatrix is
  $$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = \left|\,(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})\,\right|$$
 A pair (O, T) forms a δ-pCluster if, for any 2-by-2 submatrix X in (O, T), pScore(X) ≤ δ for some δ > 0
 Properties of δ-pCluster
 Downward closure
 Clusters are more homogeneous than bi-cluster (thus the name:
pair-wise Cluster)
 Pattern-growth algorithm has been developed for efficient mining
 For scaling patterns, taking the logarithm of $\dfrac{d_{xa}/d_{ya}}{d_{xb}/d_{yb}}$ leads to the same pScore form (see the sketch below)
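Not part of the original slides: a small Python sketch (the helpers p_score and is_delta_pcluster are invented names) that checks the δ-pCluster condition by evaluating the pScore of every 2-by-2 submatrix. For scaling patterns, the same check would be applied to the logarithms of the values.

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(D, objects, features, delta):
    """(O, T) is a delta-pCluster if every 2x2 submatrix over O x T has pScore <= delta."""
    return all(p_score(D[x][a], D[x][b], D[y][a], D[y][b]) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(features, 2))

# two genes whose expression values differ by a constant shift form a 0-pCluster
D = {"g1": {"c1": 3.0, "c2": 5.0, "c3": 9.0},
     "g2": {"c1": 4.0, "c2": 6.0, "c3": 10.0}}
print(is_delta_pcluster(D, ["g1", "g2"], ["c1", "c2", "c3"], delta=0.1))   # True
```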
Chapter 6. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Why Constraint-Based Cluster Analysis?
 Need user feedback: users know their applications best
 Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles & desired clusters



A Classification of Constraints in Cluster Analysis
 Clustering in applications: desirable to have user-guided (i.e.,
constrained) cluster analysis
 Different constraints in cluster analysis:
 Constraints on individual objects (do selection first)

 Cluster on houses worth over $300K


 Constraints on distance or similarity functions
 Weighted functions, obstacles (e.g., rivers, lakes)
 Constraints on the selection of clustering parameters
 # of clusters, MinPts, etc.
 User-specified constraints
 Contain at least 500 valued customers and 5000 ordinary ones
 Semi-supervised: giving small training sets as
“constraints” or hints
Clustering With Obstacle Objects
 K-medoids is preferable, since k-means may locate an ATM center in the middle of a lake
 Visibility graph and shortest path
 Triangulation and micro-clustering
 Two kinds of join indices (shortest paths) are worth pre-computing
    VV index: indices for any pair of obstacle vertices
    MV index: indices for any micro-cluster and obstacle vertex pair



An Example: Clustering With Obstacle Objects

[Figure: clustering results without taking obstacles into account vs. taking obstacles into account.]
Clustering with User-Specified Constraints
 Example: locating k delivery centers, each serving at least m valued customers and n ordinary ones
 Proposed approach
    Find an initial "solution" by partitioning the data set into k groups that satisfy the user constraints
    Iteratively refine the solution by micro-cluster relocation (e.g., moving δ μ-clusters from cluster Ci to Cj) and "deadlock" handling (break the micro-clusters when necessary)
    Efficiency is improved by micro-clustering
 How to handle more complicated constraints?
    E.g., having approximately the same number of valued customers in each cluster?! Can you solve it?
Chapter 6. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What Is Outlier Discovery?
 What are outliers?
    Objects that are considerably dissimilar from the remainder of the data
    Example: in sports, Michael Jordan, Wayne Gretzky, ...
 Problem: define and find outliers in large data sets
 Applications:
    Credit card fraud detection
    Telecom fraud detection
    Customer segmentation
    Medical analysis



Outlier Discovery: Statistical Approaches

 Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
 Use discordancy tests, which depend on
    the data distribution
    the distribution parameters (e.g., mean, variance)
    the number of expected outliers
 Drawbacks
    Most tests are for a single attribute (a simple single-attribute sketch follows below)
    In many cases, the data distribution may not be known
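Not part of the original slides: a minimal single-attribute discordancy check, using a simple 3σ rule under an assumed normal model, to make the idea concrete (zscore_outliers is an invented name; real discordancy tests are more elaborate).

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean,
    under the assumption that the data come from a normal distribution."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > threshold)

data = np.concatenate([np.random.default_rng(1).normal(50, 5, 200), [120.0]])
print(zscore_outliers(data))   # flags the injected extreme value at index 200
```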


Outlier Discovery: Distance-Based Approach

 Introduced to counter the main limitations imposed by statistical methods
    We need multi-dimensional analysis without knowing the data distribution
 Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
 Algorithms for mining distance-based outliers (a naive nested-loop sketch follows below)
    Index-based algorithm
    Nested-loop algorithm
    Cell-based algorithm

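Not part of the original slides: a naive nested-loop sketch of the DB(p, D)-outlier definition above (db_outliers and the toy data are invented for illustration); practical index- and cell-based algorithms add pruning to avoid the quadratic cost.

```python
import numpy as np

def db_outliers(X, p, D):
    """Return indices of DB(p, D)-outliers: objects for which at least a fraction p
    of all objects lies at distance greater than D."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    flagged = []
    for i in range(n):                                   # naive O(n^2) nested loop
        far = np.sum(np.linalg.norm(X - X[i], axis=1) > D)
        if far >= p * n:
            flagged.append(i)
    return flagged

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])   # one point far from the main cluster
print(db_outliers(X, p=0.95, D=4.0))                         # expected: [100], the far-away point
```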


Density-Based Local Outlier Detection
 Distance-based outlier detection is based on the global distance distribution
    It has difficulty identifying outliers if the data are not uniformly distributed
 Ex.: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
    A distance-based method cannot identify o2 as an outlier
    Need the concept of a local outlier
 Local outlier factor (LOF)
    Assume being an outlier is not a crisp (binary) property
    Each point is assigned a LOF value (a small sketch of this example follows below)
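Not part of the original slides: a small illustration of the C1/C2/o1/o2 scenario using scikit-learn's LocalOutlierFactor, assuming scikit-learn is available (the cluster parameters and outlier coordinates below are invented).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
C1 = rng.normal(0.0, 2.0, (400, 2))     # 400 loosely distributed points
C2 = rng.normal(10.0, 0.2, (100, 2))    # 100 tightly condensed points
o1 = [0.0, 12.0]                        # far from everything (a global outlier)
o2 = [10.0, 8.5]                        # close to C2 in absolute distance, but far for C2's local density
X = np.vstack([C1, C2, [o1], [o2]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks points whose local density is unusually low
print(np.flatnonzero(labels == -1))     # indices 500 (o1) and 501 (o2) should be among those flagged
```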


Outlier Discovery: Deviation-Based Approach

 Identifies outliers by examining the main characteristics of objects in a group
 Objects that "deviate" from this description are considered outliers
 Sequential exception technique
 simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in
large multidimensional data



Summary
 Cluster analysis groups objects based on their similarity and has wide applications
 Measures of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
 Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
 There are still many open research issues in cluster analysis

