Week 9
[Figure: supervised learning workflow — known data with known responses (e.g., apple vs. banana) is used for training a model, which is then tested on new data]
Supervised Learning
Supervised Learning is basically of two types:
• Classification
  • When the target variable is categorical, i.e., has 2 or more classes (yes/no, true/false, apple/banana), classification is used.
• Regression
  • Regression models the relationship between two or more variables, where a change in one variable is associated with a change in another, continuous variable (see the sketch below).
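To make the two types concrete, here is a minimal scikit-learn sketch (assuming scikit-learn is installed); the fruit and temperature data are invented for illustration, not taken from the slides:

```python
# Minimal sketch of classification vs. regression with scikit-learn.
# The toy fruit and temperature data below are invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a categorical label (apple vs. banana)
# from simple numeric features, e.g. [weight_g, length_cm].
X_fruit = [[150, 7], [120, 6], [160, 20], [140, 19]]
y_fruit = ["apple", "apple", "banana", "banana"]
clf = DecisionTreeClassifier().fit(X_fruit, y_fruit)
print(clf.predict([[155, 18]]))   # -> a class label, e.g. ['banana']

# Regression: predict a continuous value (temperature)
# from a numeric input, e.g. the hour of the day.
X_hours = [[6], [9], [12], [15], [18]]
y_temp = [14.0, 18.5, 24.0, 23.0, 19.5]
reg = LinearRegression().fit(X_hours, y_temp)
print(reg.predict([[13]]))        # -> a continuous estimate
```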
Supervised Learning - Classification
[Figure: spam filtering — a model learns from past emails (categorical labels: spam / non-spam), separates the two classes, and classifies new email]
Supervised Learning - Regression
[Figure: temperature prediction — a model learns from past data and predicts a continuous value for new data]
Supervised Learning Applications
• Signature Recognition
• Risk Assessment
• Image Classification
• Face Detection
• Fraud Detection
• Pattern Recognition
Unsupervised Learning
Unsupervised Learning is basically of two types:
• Clustering
  • A method of dividing objects into clusters such that objects in the same cluster are as similar as possible, and objects in different clusters are as dissimilar as possible.
• Association
  • A method for discovering interesting relations between variables in large data collections.
Unsupervised Learning - Clustering
[Figure: customers plotted by total call duration vs. internet usage form two clusters, A and B]
A telecom service provider groups customers by usage and offers personalized data and call plans to retain them.
Unsupervised Learning - Association
[Figure: association — the purchase histories of Customer 1 and Customer 2 are used to recommend items to a new customer]
Unsupervised Learning Applications
• Identification of Similarity
• Recommendation Systems
• Detection of Human Errors during Data Entry
• Identifying Accident-Prone Areas
• Anomaly Detection
• Search Engine
Summary of Classical Machine Learning
[Figure: taxonomy of classical machine learning methods]
Types of Clustering
• Partitional Clustering
  • Divide objects into clusters such that each object is in exactly one cluster, not several clusters
• Hierarchical Clustering
  • Divide objects into a hierarchy of nested clusters, so that an object can belong to more than one cluster (at different levels)
Hierarchical Clustering - Agglomerative
[Dendrogram: pairs such as (d,e) and (f,g) merge first, then defg, cdefg, bcdefg, and finally abcdefg]
Bottom-Up Approach: Begin with each object as a separate cluster, and then merge them into larger clusters
Hierarchical Clustering - Divisive
[Dendrogram: the single cluster abcdefg is split into bcdefg, then defg, then de and fg, down to the individual objects a, b, c, d, e, f, g]
Top-Down Approach: Begin with all objects as one cluster, and then divide them into smaller clusters
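As a rough illustration of the bottom-up (agglomerative) approach on the previous slide, here is a minimal SciPy sketch; the seven 2-D points are invented, and note that SciPy's linkage implements agglomerative merging only (not the divisive approach):

```python
# Minimal sketch of agglomerative (bottom-up) hierarchical clustering with SciPy;
# the seven 2-D points are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8],   # a, b
                   [2.0, 1.5],               # c
                   [5.0, 5.0], [5.2, 5.1],   # d, e
                   [8.0, 8.0], [8.3, 7.9]])  # f, g

# Single-linkage: the distance between two clusters is the distance
# between their closest members.
Z = linkage(points, method="single")

# Cut the merge tree into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z, labels=list("abcdefg"))  # draws the merge tree (needs matplotlib)
```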
Distance Measure in K-means Clustering
1. Euclidean distance measure
2. Squared Euclidean distance measure
3. Manhattan distance measure
4. Cosine distance measure
Distance Measure in K-means Clustering
1. Euclidean distance measure
2. Squared Euclidean distance measure
3. Manhattan distance measure
4. Cosine distance measure
• Manhattan distance is the sum of the distances between two points measured along axes at right angles:
$d = \sum_{i=1}^{n} \left( |q_x - p_x| + |q_y - p_y| \right)$
[Figure: Manhattan (right-angle) path between points P(x, y) and Q(x, y)]
Distance Measure in K-means Clustering
1. Euclidean distance measure
2. Squared Euclidean distance measure
3. Manhattan distance measure
4. Cosine distance measure
• Cosine distance measures the angle between two vectors p and q:
$d = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$
[Figure: angle between vectors p and q]
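A small NumPy sketch of the four distance measures listed above, computed between two invented vectors. Note that the formula on the slide is the cosine similarity; cosine distance is commonly taken as one minus that value:

```python
# The four distance measures between two example vectors p and q (values invented).
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))          # straight-line distance
squared_euclidean = np.sum((p - q) ** 2)           # same, without the square root
manhattan = np.sum(np.abs(p - q))                  # sum of axis-aligned distances
cosine_similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
cosine_distance = 1.0 - cosine_similarity          # angle-based dissimilarity

print(euclidean, squared_euclidean, manhattan, cosine_distance)
```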
Clustering Basics
• Clustering algorithm
  • Partitional clustering
  • Hierarchical clustering
• Distance function
  • Decides which cluster is the nearest
  • Can be Euclidean distance
• Clustering quality depends on
  • Algorithm
  • Distance function
  • Application
• Inter-cluster distance maximized
• Intra-cluster distance minimized
K-means Algorithm
• Partitional clustering
• Partitions the given data into k clusters.
• Each cluster has a cluster center, called centroid.
• k is specified by the user
• Each data point
• Vector X = {x1, x2, …, xn} → n-dimensional data
• n attributes
• Attributes could be weighted or non-weighted
K-means Algorithm
• User decides on the k value
• Given k:
1) Randomly choose k data points as the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to step 2
[Figure: K-means iterations with K = 3 — assign points to the nearest center, readjust the centers, and repeat until no assignments change]
K-means Clustering Algorithm
• Step 1
• Randomly select K cluster centroids
• C is the set of all centroids:
  $C = \{c_1, c_2, \ldots, c_k\}$
• Step 2
  • Calculate the Euclidean distance from each data point to the centroids and assign each data point to the centroid with the minimum distance:
  $\arg\min_{c_i \in C} d(x, c_i)^2$
K-means Clustering Algorithm
• Step 3
  • Calculate the new centroid for each cluster:
  $c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i$
  where $c_i$ is the new centroid and $S_i$ is the set of all data points $x_i$ assigned to the $i$-th cluster
• Step 4
  • Repeat Step 2 and Step 3 until the cluster assignments are stable (a from-scratch sketch follows)
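A from-scratch NumPy sketch of the four steps above, meant to mirror the formulas rather than replace a library implementation; the data, k, and seed are toy values, and empty clusters are not handled:

```python
# From-scratch K-means following the four steps above (toy data, no empty-cluster handling).
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids (and hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```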
Strengths of K-means
• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
• Since both k and t are usually small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• It terminates at a local optimum
• The global optimum is hard to find.
Weaknesses of K-means
• The algorithm is only applicable if the mean is defined.
  • For categorical data, k-modes, in which the centroid is represented by the most frequent values, is used.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  • Outliers are data points that are very far away from other data points.
  • Outliers could be errors in the data recording or special data points with very different values.
• The algorithm is very sensitive to the initial assignment of the centroids (see the sketch below).
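One common way to soften the sensitivity to the initial centroids is to run K-means several times from different (k-means++) seedings and keep the best run. A minimal scikit-learn sketch, with toy data invented for illustration:

```python
# Mitigating initial-centroid sensitivity with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])

km = KMeans(n_clusters=2,        # k must still be chosen by the user
            init="k-means++",    # spread-out initial centroids
            n_init=10,           # run 10 times, keep the best result
            random_state=0).fit(X)
print(km.cluster_centers_)
print(km.inertia_)               # within-cluster sum of squared distances
```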
Outliers in K-means
[Figure: an outlier pulls one K-means centroid away from the desired output; without it the desired clusters are recovered]
Sensitive to initial points in K-means
[Figure: different initial centroids can lead to different final clusterings]
Calculating WSS for a range of values of k
• Within-cluster sum of squares: $\mathrm{WSS} = \sum_{i=1}^{n} (x_i - c_i)^2$, where $x_i$ is a data point, $c_i$ is its centroid, and $n$ is the total number of data points
• Calculate WSS for k = 1 to max k
• Plot WSS vs. k
• Choose the k at which WSS stops decreasing sharply (the "elbow"), as in the sketch below
[Figure: WSS vs. k plot used to locate the elbow]
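A minimal sketch of this elbow procedure using scikit-learn, whose `inertia_` attribute is the WSS of a fitted model; the three toy clusters are invented for illustration:

```python
# Elbow method: compute WSS (KMeans.inertia_) for k = 1..10 and look for the
# k where the curve stops dropping sharply.
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)      # sum of squared distances to the closest centroid

for k, w in zip(range(1, 11), wss):
    print(k, round(w, 1))
# Plotting wss vs. k (e.g. with matplotlib) shows the "elbow".
```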
Source: https://en.wikipedia.org/wiki/Single-linkage_clustering
Single-Linkage Clustering
Working Example
• Five elements (a, b, c, d, e) and a matrix D1 of their pairwise distances
• First Step
  • The closest pair in D1, a and b, is merged into cluster (a, b)
  [Dendrogram: a and b joined at the lowest height]
• Second Step
  • Proximity matrix D2 (single-linkage distances after merging a and b):

            (a,b)   c    d    e
    (a,b)     0    20   30   22
    c        20     0   28   39
    d        30    28    0   50
    e        22    39   50    0

  • The smallest entry is D2((a,b),c) = 20, so cluster (a,b) is joined with element c → cluster ((a,b),c)
  • Update the proximity matrix D2 to a new proximity matrix D3:

               ((a,b),c)   d    e
    ((a,b),c)      0      28   22
    d             28       0   50
    e             22      50    0
Single-Linkage Clustering
Working Example
• Third Step
  • D3(((a,b),c),d) = min(D2((a,b),d), D2(c,d)) = min(30, 28) = 28
  • D3(((a,b),c),e) = min(D2((a,b),e), D2(c,e)) = min(22, 39) = 22
  • The smallest entry is D3(((a,b),c),e) = 22, so cluster ((a,b),c) is joined with element e → cluster (((a,b),c),e)
  [Dendrogram: a, b, c, and e joined; d still separate]
  • Update the proximity matrix D3 to a new proximity matrix D4:

                    (((a,b),c),e)   d
    (((a,b),c),e)         0        28
    d                    28         0
Single-Linkage Clustering
Working Example
• Final Step
  • D4((((a,b),c),e),d) = min(D3(((a,b),c),d), D3(e,d)) = min(28, 50) = 28
  • The two remaining clusters, (((a,b),c),e) and d, are merged → cluster ((((a,b),c),e),d)
  [Dendrogram: all five elements joined in a single tree]
In general:
3. At each step, the two clusters separated by the shortest distance are combined; in the example above, D2((a,b),c) = 20 is the lowest value of D2, so cluster (a,b) is joined with element c → cluster ((a,b),c).
4. Repeat Step 2 and Step 3 so that the clusters are sequentially combined into larger clusters until all elements end up in the same cluster, as in the SciPy sketch below.
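As a rough cross-check of the worked example, here is a SciPy sketch. The example never shows the full D1 matrix, so the D1 below is a hypothetical choice, picked only so that its single-linkage merges reproduce the D2/D3/D4 values and merge order above:

```python
# Reproducing the single-linkage worked example with SciPy.
# D1 is HYPOTHETICAL: only its derived values (20, 22, 28, ...) are taken from the example.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["a", "b", "c", "d", "e"]
D1 = np.array([[ 0, 16, 20, 30, 22],
               [16,  0, 25, 34, 39],
               [20, 25,  0, 28, 39],
               [30, 34, 28,  0, 50],
               [22, 39, 39, 50,  0]], dtype=float)

# Convert the square matrix to condensed form and run single-linkage clustering.
Z = linkage(squareform(D1), method="single")
print(Z)                        # each row: the two clusters merged and the merge distance
# dendrogram(Z, labels=labels)  # draws the tree (requires matplotlib)
```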
Average Linkage Clustering and Centroid method