CH-5 DM Classification
Classification: Definition
• Classification is the process of finding a model, which
describes and distinguishes data classes or concepts,
for the purpose of being able to use the model to
predict the class of objects whose class label is
unknown
• Generally, classification is a data mining technique
used to predict group membership for data instances.
• Given a collection of records (the training set), each
record contains a set of attributes, one of which
is the class
– Find a model for the class attribute as a function of the values
of the other attributes
• The goal of classification is to accurately
predict the target class for each case in the data
– For example, a classification model could be used
to identify loan applicants as low, medium, or high
credit risks
• A classification task begins with a data set in
which the class assignments are known.
– For example, a classification model that predicts
credit risk could be developed based on observed
data for many loan applicants over a period of time
– In addition to the historical credit rating, the data
might track employment history, home ownership
or rental, years of residence, number and type of
investments, and so on
– Credit rating would be the target, the other
attributes would be the predictors, and the data for
each customer would constitute a case
• The simplest type of classification problem is
binary classification
– In binary classification, the target attribute has
only two possible values: for example, high credit
rating or low credit rating
• In the model building (training) process, a classification
algorithm finds relationships between the values of the
predictors and the values of the target
– Different classification algorithms use different techniques for
finding relationships
– These relationships are summarized in a model, which can then be
applied to a different data set in which the class assignments are
unknown
• Classification models are tested by comparing the
predicted values to known target values in a set of test
data
• The historical data for a classification project is typically
divided into two data sets: one for building the model
(training data set) and one for testing the model (test
data set)
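The split into training and test data can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the records, the "income" attribute, the 30% test fraction, and the midpoint-threshold learner are all hypothetical choices made for the example.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the labelled records and split them into (training, test) lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def learn_threshold(training_set):
    """Toy learner (assumption): threshold halfway between the two class means."""
    low = [r["income"] for r, label in training_set if label == "low"]
    high = [r["income"] for r, label in training_set if label == "high"]
    threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2
    return lambda record: "high" if record["income"] >= threshold else "low"

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the known class."""
    correct = sum(1 for record, label in test_set if model(record) == label)
    return correct / len(test_set)

# Hypothetical historical data: (attributes, known class label)
records = [({"income": i * 10}, "high" if i >= 5 else "low") for i in range(10)]

training_set, test_set = train_test_split(records)
model = learn_threshold(training_set)      # built from the training data only
test_accuracy = accuracy(model, test_set)  # measured on the held-out test data
print(len(training_set), len(test_set), test_accuracy)
```

The model never sees the test records while it is being built, so the test accuracy estimates how it behaves on data with unknown class labels.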
Illustrating Classification Task
Training Set (input to the learning algorithm, which produces the model):
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
6     No        Medium    60K       No
Test Set (the model is applied to predict the class):
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
15    No        Large     67K       ?
Classification
• There are different classification algorithms, such as:
– Decision tree,
– Naïve Bayes method,
– Bayesian Belief Network,
– Artificial Neural Network,
– Support Vector Machine, etc.
• Most classification algorithms involve two steps:
1. Model construction
2. Model usage
• Some other classification algorithms, such as the K-Nearest
Neighbor approach, do not require building any model
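To illustrate classification without a model construction step, here is a minimal K-Nearest Neighbor sketch. The training records, the two numeric features, and the choice k = 3 are hypothetical; Euclidean distance is one common choice, not the only one.

```python
import math
from collections import Counter

def knn_predict(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    by_distance = sorted(training_set, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training records: ((feature_1, feature_2), class label)
training_set = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.5), "high"),
]

print(knn_predict(training_set, (1.2, 1.4)))  # low
print(knn_predict(training_set, (6.8, 6.9)))  # high
```

All the work happens at prediction time: the stored training records are searched directly for each new instance.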
April 27, 2024 Data Mining: Concepts and Techniques 7
Classification
1. Model construction:
• Refers to describing a set of predetermined classes
using the training data set
• The training data is a set of tuples, where each
tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
• The model is represented as classification rules,
decision trees, or mathematical formulae
2. Model usage:
• Refers to using the model for classifying future or
unknown objects
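[Figure: the training data is fed to the classification algorithms, which produce a classifier; the classifier is checked on testing data and then applied to unseen data, e.g. (Jeff, Professor, 4)]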
Confusion matrix (cell counts a, b, c, d):
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL   Class=Yes     a (TP)      b (FN)
CLASS    Class=No      c (FP)      d (TN)
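As a sketch of how the four cells can be counted, the snippet below tallies a (true positives), b (false negatives), c (false positives), and d (true negatives) from two lists of labels and derives accuracy as (a + d) / (a + b + c + d). The label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (a, b, c, d) = (TP, FN, FP, TN) for a binary problem."""
    a = b = c = d = 0
    for y, p in zip(actual, predicted):
        if y == positive and p == positive:
            a += 1  # actual Yes, predicted Yes
        elif y == positive:
            b += 1  # actual Yes, predicted No
        elif p == positive:
            c += 1  # actual No, predicted Yes
        else:
            d += 1  # actual No, predicted No
    return a, b, c, d

# Hypothetical known classes and model predictions for six test records
actual    = ["Yes", "Yes", "No", "No",  "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]

a, b, c, d = confusion_counts(actual, predicted)
accuracy = (a + d) / (a + b + c + d)
print(a, b, c, d)          # 2 1 1 2
print(round(accuracy, 3))  # 0.667
```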
Decision Tree
Decision tree classifier
• A decision tree performs classification by
constructing a tree from the training instances,
with class labels at the leaves
– The tree is traversed for each test instance to find a
leaf, and the class of the leaf is the predicted class
• Widely used learning method
• It has been applied to classify:
– medical patients by disease
– equipment malfunctions by cause
– loan applicants by likelihood of payment
Decision Trees
• Tree where internal nodes are simple decision rules on one or more
attributes and leaf nodes are predicted class labels; i.e. a Boolean
classifier for the input instance
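• Given an instance of an object or situation, which is specified by a
set of properties, the tree returns a "yes" or "no" decision about that
instance
[Figure: generic decision tree; the root tests Attribute_1, whose branches value-1, value-2 and value-3 lead to Attribute_2, Class1 and Attribute_2]
[Figure: split on Outlook, with class counts per branch: 2 “yes” / 3 “no”; 4 “yes”; 3 “yes” / 2 “no”]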
The Recursive Procedure for Constructing a
Decision Tree
• The operation discussed above is applied to each
branch recursively to construct the decision tree
• For example, for the branch “Outlook = Sunny”, we
evaluate the information gained by applying each of
the remaining 3 attributes
– Gain(Outlook=sunny;Temperature) = 0.971 – 0.4 = 0.571
– Gain(Outlook=sunny;Humidity) = 0.971 – 0 = 0.971
– Gain(Outlook=sunny;Windy) = 0.971 – 0.951 = 0.02
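The three gains for the “Outlook = Sunny” branch can be reproduced with a short sketch. The (yes, no) counts per attribute value are an assumption: they are taken from the standard 14-row weather data set, which yields the figures quoted above.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a tuple of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partitions):
    """Information gain: parent entropy minus the weighted entropy of the partitions."""
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

# Assumed (yes, no) counts for the 5 instances with Outlook = sunny
sunny = (2, 3)
splits = {
    "Temperature": [(0, 2), (1, 1), (1, 0)],  # hot, mild, cool
    "Humidity":    [(0, 3), (2, 0)],          # high, normal
    "Windy":       [(1, 2), (1, 1)],          # false, true
}

for attribute, partitions in splits.items():
    print(attribute, round(gain(sunny, partitions), 3))
# Temperature 0.571
# Humidity 0.971
# Windy 0.02
```

Humidity has the largest gain, so it is the attribute chosen to split this branch.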
• Similarly, we also evaluate the information
gained by applying each of the remaining 3
attributes for the branch “Outlook = rainy”.
– Gain(Outlook=rainy;Temperature) = 0.971 – 0.951 = 0.02
– Gain(Outlook=rainy;Humidity) = 0.971 – 0.951 = 0.02
– Gain(Outlook=rainy;Windy) = 0.971 – 0 = 0.971
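The recursive procedure as a whole can be sketched as below: pick the attribute with the highest gain, split on it, and repeat on each branch until a branch is pure. The 14 rows are an assumption (the standard weather data set, which is consistent with the gains quoted above), and the function names are illustrative rather than taken from the slides.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Assumed data: (outlook, temperature, humidity, windy, class)
ROWS = [
    ("sunny", "hot", "high", "false", "no"), ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"), ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]

def entropy(rows):
    """Entropy in bits of the class column (last field) of the rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * math.log2(n / len(rows)) for n in counts.values())

def gain(rows, index):
    """Information gain obtained by splitting the rows on attribute `index`."""
    remainder = 0.0
    for value in {row[index] for row in rows}:
        subset = [row for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def build_tree(rows, indices):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    classes = Counter(row[-1] for row in rows)
    if len(classes) == 1 or not indices:
        return classes.most_common(1)[0][0]  # pure branch, or no attributes left
    best = max(indices, key=lambda i: gain(rows, i))
    remaining = [i for i in indices if i != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining)  # recurse on each branch
    return {ATTRIBUTES[best]: branches}

tree = build_tree(ROWS, [0, 1, 2, 3])
print(tree)
```

On this data the root is outlook, the sunny branch splits on humidity, the rainy branch splits on windy, and the overcast branch is already pure.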
Output: A Decision Tree for “play_football”
[Figure: decision tree rooted at Outlook, with leaves labelled “yes” and “no”]
Classification Rules
IF outlook= “sunny” & humidity= “high” THEN play_football = “no”
IF outlook= “sunny” & humidity= “normal” THEN play_football = “yes”
IF outlook= “overcast” THEN play_football = “yes”
IF outlook= “rainy” & windy= “false” THEN play_football = “yes”
IF outlook= “rainy” & windy= “true” THEN play_football = “no”
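The five rules correspond to one root-to-leaf path each. A minimal sketch of applying them, with the tree written by hand as nested dictionaries (the representation and the `classify` helper are illustrative assumptions):

```python
# Each inner node is {attribute: {value: subtree}}; each leaf is a class label.
TREE = {"outlook": {
    "sunny": {"humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rainy": {"windy": {"false": "yes", "true": "no"}},
}}

def classify(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

print(classify(TREE, {"outlook": "sunny", "humidity": "high", "windy": "false"}))     # no
print(classify(TREE, {"outlook": "overcast", "humidity": "high", "windy": "true"}))   # yes
print(classify(TREE, {"outlook": "rainy", "humidity": "normal", "windy": "true"}))    # no
```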
Pros and Cons of decision trees
Pros:
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features
Cons:
• Cannot handle complicated relationships between features
• Problems with lots of missing data