Lecture 6 - Clustering

This document discusses different types of clustering methods and challenges in clustering high-dimensional data. It covers partitioning, hierarchical, density-based, grid-based, and model-based clustering algorithms. For high-dimensional data, it discusses how many irrelevant dimensions can mask clusters and distance becomes meaningless. Feature selection/transformation and subspace clustering are introduced as methods to address these challenges. The document also discusses constraint-based clustering and how user constraints can guide the clustering process.


Chapter 6.

Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Clustering High-Dimensional Data
 Clustering high-dimensional data
 Many applications: text documents, DNA micro-array data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
 Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly correlated/redundant
 Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
 Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering
The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)

 Data in only one dimension is relatively densely packed
 Adding a dimension "stretches" the points across that dimension, making them further apart
 Adding more dimensions makes the points further apart still: high-dimensional data is extremely sparse
 Distance measures become nearly meaningless, since all pairs of points tend to be almost equidistant (a small numeric illustration follows below)
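The following is not part of the original slides; it is a minimal NumPy sketch (assuming NumPy is available; the sample sizes are arbitrary) of the equi-distance effect: as dimensionality grows, the relative spread of pairwise distances between uniformly random points shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pairs = 2000, 5000
for d in (1, 2, 10, 100, 1000):
    X = rng.random((n, d))                 # n points drawn uniformly from the unit hypercube [0, 1]^d
    i = rng.integers(0, n, pairs)
    j = rng.integers(0, n, pairs)
    keep = i != j                          # ignore self-pairs (distance 0)
    dist = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    # std/mean of the pairwise distances shrinks toward 0 as d grows:
    # nearest and farthest neighbors become nearly indistinguishable
    print(f"d={d:4d}  mean distance={dist.mean():7.3f}  std/mean={dist.std() / dist.mean():.3f}")
```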


Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)

 Clusters may exist only in some subspaces


 Subspace-clustering: find clusters in all the subspaces



CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
 CLIQUE can be considered both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
 A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
 Partition the data space and find the number of points that lie inside each cell of the partition
 Identify the subspaces that contain clusters using the Apriori principle
 Identify clusters
    Determine dense units in all subspaces of interest (a small sketch of this step follows below)
    Determine connected dense units in all subspaces of interest
 Generate a minimal description for the clusters
    Determine maximal regions that cover each cluster of connected dense units
    Determine a minimal cover for each cluster
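The sketch below is not from the original slides. It is a simplified Python illustration of the grid-partitioning and dense-unit steps with Apriori-style growth of subspaces; the connected-component and minimal-description steps are omitted, and the names dense_units, n_intervals, and tau are invented for this example.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals=10, tau=3):
    """CLIQUE-style sketch: partition each dimension into equal-length intervals,
    count points per grid unit, and keep units with at least tau points.
    Subspaces are grown bottom-up with the Apriori principle."""
    n_dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(n_dims)]
    hi = [max(p[d] for p in points) for d in range(n_dims)]

    def cell(p, d):
        # index of the equal-length interval that p falls into along dimension d
        width = (hi[d] - lo[d]) / n_intervals or 1.0
        return min(int((p[d] - lo[d]) / width), n_intervals - 1)

    def units_of(subspace):
        counts = Counter(tuple(cell(p, d) for d in subspace) for p in points)
        return {unit for unit, c in counts.items() if c >= tau}

    dense, frontier = {}, []
    for d in range(n_dims):                          # 1-dimensional subspaces first
        units = units_of((d,))
        if units:
            dense[(d,)] = units
            frontier.append((d,))

    k = 2
    while frontier:
        involved = sorted({d for sub in frontier for d in sub})
        # candidate k-dim subspaces: every (k-1)-dim projection must already be dense
        candidates = [s for s in combinations(involved, k)
                      if all(proj in dense for proj in combinations(s, k - 1))]
        frontier = []
        for s in candidates:
            units = units_of(s)
            if units:
                dense[s] = units
                frontier.append(s)
        k += 1
    return dense                                     # subspace -> set of dense grid units

# toy example: two groups dense in both dimensions, plus one isolated point
pts = [(25, 3.0), (26, 3.1), (27, 3.2), (60, 9.0), (61, 9.1), (62, 9.2), (40, 1.0)]
print(dense_units(pts, n_intervals=5, tau=3))
```

The Apriori principle enters in the candidate generation: a k-dimensional subspace is only counted if every one of its (k-1)-dimensional projections already contains dense units.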


[Figure: CLIQUE example over the attributes age (20-60), salary (unit: $10,000), and vacation (unit: weeks), with density threshold τ = 3. Dense units found in the (age, salary) and (age, vacation) subspaces are intersected to locate a candidate cluster region around ages 30-50.]


Strength and Weakness of CLIQUE

 Strength
    Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
    Insensitive to the order of records in the input and does not presume any canonical data distribution
    Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
 Weakness
    The accuracy of the clustering result may be degraded for the sake of the method's simplicity
Frequent Pattern-Based Approach
 Clustering in high-dimensional space (e.g., clustering text documents, microarray data)
 Projected subspace clustering: which dimensions should be projected on?
    CLIQUE, ProClus
 Feature extraction: costly and may not be effective?
 Using frequent patterns as "features"
    Frequent patterns are inherent features
    Mining frequent patterns may not be so expensive
 Typical methods
 Frequent-term-based document clustering
 Clustering by pattern similarity in micro-array data (pClustering)



Clustering by Pattern Similarity (p-Clustering)

 Right: the microarray "raw" data shows 3 genes and their values in a multi-dimensional space
    Difficult to find their patterns
 Bottom: some subsets of dimensions form nice shift and scaling patterns


Why p-Clustering?
 Microarray data analysis may require
    Clustering on thousands of dimensions (attributes)
    Discovery of both shift and scaling patterns
 Clustering with a Euclidean distance measure? It cannot find shift patterns
 Clustering on the derived attributes Aij = ai − aj? It introduces N(N−1) dimensions
 Bi-cluster using the mean-squared residue score of a submatrix (I, J), built from the means
  $$d_{iJ} = \frac{1}{|J|}\sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|}\sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J} d_{ij}$$
  where the mean-squared residue is
  $$H(I, J) = \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J} \left(d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}\right)^2$$
 A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
 Problems with bi-clusters
    No downward closure property
    Due to averaging, a bi-cluster may contain outliers and still stay within the δ-threshold (a small numeric sketch follows below)
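Not part of the original slides: a minimal NumPy sketch of the mean-squared residue H(I, J) (the function name mean_squared_residue is invented here), showing that a pure shift pattern scores 0.

```python
import numpy as np

def mean_squared_residue(D, rows, cols):
    """H(I, J) for the submatrix of D given by row set I (`rows`) and column set J (`cols`):
    the mean of (d_ij - d_iJ - d_Ij + d_IJ)^2 over the submatrix."""
    S = D[np.ix_(rows, cols)]
    d_iJ = S.mean(axis=1, keepdims=True)   # row means d_iJ
    d_Ij = S.mean(axis=0, keepdims=True)   # column means d_Ij
    d_IJ = S.mean()                        # overall mean d_IJ
    return ((S - d_iJ - d_Ij + d_IJ) ** 2).mean()

# rows that differ only by an additive shift form a perfect bi-cluster (H = 0)
D = np.array([[1.0, 2.0, 5.0],
              [3.0, 4.0, 7.0],
              [0.0, 1.0, 4.0]])
print(mean_squared_residue(D, [0, 1, 2], [0, 1, 2]))   # 0.0
```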
p-Clustering: Clustering by Pattern Similarity
 Given objects x, y in O and features a, b in T, the pScore of the 2-by-2 submatrix is
  $$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = \left|\,(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})\,\right|$$
 A pair (O, T) forms a δ-pCluster if, for any 2-by-2 submatrix X in (O, T), pScore(X) ≤ δ for some δ > 0
 Properties of δ-pCluster
 Downward closure
 Clusters are more homogeneous than bi-cluster (thus the name:
pair-wise Cluster)
 Pattern-growth algorithm has been developed for efficient mining
 For scaling patterns, taking the logarithm of $\dfrac{d_{xa}/d_{ya}}{d_{xb}/d_{yb}}$ leads to the same pScore form (see the sketch below)
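Not part of the original slides: a small Python sketch (the helpers p_score and is_delta_pcluster are invented names) that checks the δ-pCluster condition by evaluating the pScore of every 2-by-2 submatrix. For scaling patterns, the same check would be applied to the logarithms of the values.

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(D, objects, features, delta):
    """(O, T) is a delta-pCluster if every 2x2 submatrix over O x T has pScore <= delta."""
    return all(p_score(D[x][a], D[x][b], D[y][a], D[y][b]) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(features, 2))

# two genes whose expression values differ by a constant shift form a 0-pCluster
D = {"g1": {"c1": 3.0, "c2": 5.0, "c3": 9.0},
     "g2": {"c1": 4.0, "c2": 6.0, "c3": 10.0}}
print(is_delta_pcluster(D, ["g1", "g2"], ["c1", "c2", "c3"], delta=0.1))   # True
```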
Chapter 6. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Why Constraint-Based Cluster Analysis?
 Need user feedback: users know their applications best
 Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles & desired clusters



A Classification of Constraints in Cluster Analysis
 Clustering in applications: desirable to have user-guided (i.e.,
constrained) cluster analysis
 Different constraints in cluster analysis:
 Constraints on individual objects (do selection first)

 Cluster on houses worth over $300K


 Constraints on distance or similarity functions
 Weighted functions, obstacles (e.g., rivers, lakes)
 Constraints on the selection of clustering parameters
 # of clusters, MinPts, etc.
 User-specified constraints
 Contain at least 500 valued customers and 5000 ordinary ones
 Semi-supervised: giving small training sets as
“constraints” or hints
Clustering With Obstacle Objects
 K-medoids is preferable, since k-means may locate an ATM center in the middle of a lake
 Visibility graph and shortest path
 Triangulation and micro-clustering
 Two kinds of join indices (shortest paths) are worth pre-computing
    VV index: indices for any pair of obstacle vertices
    MV index: indices for any micro-cluster and obstacle vertex pair



An Example: Clustering With Obstacle Objects

[Figure: clustering results without taking obstacles into account vs. taking obstacles into account.]
Clustering with User-Specified Constraints
 Example: locating k delivery centers, each serving at least m valued customers and n ordinary ones
 Proposed approach
    Find an initial "solution" by partitioning the data set into k groups that satisfy the user constraints
    Iteratively refine the solution by micro-cluster relocation (e.g., moving δ μ-clusters from cluster Ci to Cj) and "deadlock" handling (break the micro-clusters when necessary)
    Efficiency is improved by micro-clustering
 How to handle more complicated constraints?
    E.g., having approximately the same number of valued customers in each cluster?! Can you solve it?
Chapter 6. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What Is Outlier Discovery?
 What are outliers?
    Objects that are considerably dissimilar from the remainder of the data
    Example: in sports, Michael Jordan, Wayne Gretzky, ...
 Problem: define and find outliers in large data sets
 Applications:
    Credit card fraud detection
    Telecom fraud detection
    Customer segmentation
    Medical analysis



Outlier Discovery: Statistical Approaches

 Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
 Use discordancy tests, which depend on
    the data distribution
    the distribution parameters (e.g., mean, variance)
    the number of expected outliers
 Drawbacks
    Most tests are for a single attribute (a simple single-attribute sketch follows below)
    In many cases, the data distribution may not be known
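Not part of the original slides: a minimal single-attribute discordancy check, using a simple 3σ rule under an assumed normal model, to make the idea concrete (zscore_outliers is an invented name; real discordancy tests are more elaborate).

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean,
    under the assumption that the data come from a normal distribution."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > threshold)

data = np.concatenate([np.random.default_rng(1).normal(50, 5, 200), [120.0]])
print(zscore_outliers(data))   # flags the injected extreme value at index 200
```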


Outlier Discovery: Distance-Based Approach

 Introduced to counter the main limitations imposed by statistical methods
    We need multi-dimensional analysis without knowing the data distribution
 Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
 Algorithms for mining distance-based outliers (a naive nested-loop sketch follows below)
    Index-based algorithm
    Nested-loop algorithm
    Cell-based algorithm

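Not part of the original slides: a naive nested-loop sketch of the DB(p, D)-outlier definition above (db_outliers and the toy data are invented for illustration); practical index- and cell-based algorithms add pruning to avoid the quadratic cost.

```python
import numpy as np

def db_outliers(X, p, D):
    """Return indices of DB(p, D)-outliers: objects for which at least a fraction p
    of all objects lies at distance greater than D."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    flagged = []
    for i in range(n):                                   # naive O(n^2) nested loop
        far = np.sum(np.linalg.norm(X - X[i], axis=1) > D)
        if far >= p * n:
            flagged.append(i)
    return flagged

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])   # one point far from the main cluster
print(db_outliers(X, p=0.95, D=4.0))                         # expected: [100], the far-away point
```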


Density-Based Local Outlier Detection
 Distance-based outlier detection is based on the global distance distribution
    It has difficulty identifying outliers if the data are not uniformly distributed
 Ex.: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
    A distance-based method cannot identify o2 as an outlier
    Need the concept of a local outlier
 Local outlier factor (LOF)
    Assume being an outlier is not a crisp (binary) property
    Each point is assigned a LOF value (a small sketch of this example follows below)
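Not part of the original slides: a small illustration of the C1/C2/o1/o2 scenario using scikit-learn's LocalOutlierFactor, assuming scikit-learn is available (the cluster parameters and outlier coordinates below are invented).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
C1 = rng.normal(0.0, 2.0, (400, 2))     # 400 loosely distributed points
C2 = rng.normal(10.0, 0.2, (100, 2))    # 100 tightly condensed points
o1 = [0.0, 12.0]                        # far from everything (a global outlier)
o2 = [10.0, 8.5]                        # close to C2 in absolute distance, but far for C2's local density
X = np.vstack([C1, C2, [o1], [o2]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks points whose local density is unusually low
print(np.flatnonzero(labels == -1))     # indices 500 (o1) and 501 (o2) should be among those flagged
```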


Outlier Discovery: Deviation-Based Approach

 Identifies outliers by examining the main characteristics of objects in a group
 Objects that "deviate" from this description are considered outliers
 Sequential exception technique
 simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in
large multidimensional data



Summary
 Cluster analysis groups objects based on their similarity and has wide applications
 Measures of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
 Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
 There are still many open research issues in cluster analysis

