Classification and Prediction
*
Classification
*
Prediction
Numeric Prediction
Regression analysis
*
Classification—A Two-Step Process
1. Learning step (where a classification model is constructed)
2. Classification step (where the model is used to predict class
labels for given data)
If the training set is used to measure the classifier’s accuracy, the estimate is likely to be optimistic, because the classifier tends to overfit the training data
*
Process (1): Model Construction
Figure: the training data are fed to a classification algorithm, which constructs the classifier (model), e.g. the rule:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
*
Process (2): Using the Model in Prediction
Figure: the classifier is applied first to the testing data to estimate accuracy, and then to unseen data, e.g. (Jeff, Professor, 4) → Tenured?
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”
Decision trees can easily be converted to classification rules.
Figure: an internal node denotes a test on an attribute (here the root tests age?); each branch represents an outcome of the test (<=30 / youth, 31..40 / middle-aged, >40 / senior); leaf nodes hold class labels (no, yes, yes).
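As an illustration of the two-step process and the tree above, here is a minimal sketch (assumed libraries: pandas and scikit-learn, which are not part of these notes) that fits an entropy-based decision tree to the buys_computer training data:

```python
# A minimal sketch (assumed libraries: pandas, scikit-learn) that fits an
# entropy-based decision tree to the buys_computer training data shown above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":           ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
                      "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"],
    "income":        ["high", "high", "high", "medium", "low", "low", "low",
                      "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student":       ["no", "no", "no", "no", "yes", "yes", "yes",
                      "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(data.drop(columns="buys_computer"))   # one-hot encode the nominal attributes
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # information-gain-style splits
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

The printed tree may differ in shape from the figure, since scikit-learn uses binary splits on the one-hot-encoded attributes.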
*
Decision Tree Induction
Basic algorithm (a greedy approach)
■ The tree is constructed in a top-down, recursive, divide-and-conquer manner
*
Decision Tree Induction Algorithm
Input: D (the training tuples and their associated class labels), attribute_list, and Attribute_selection_method.
■ Create a node N.
■ If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class.
■ Otherwise, node N is labeled with the splitting criterion returned by Attribute_selection_method, which serves as a test at the node. A branch is grown from node N for each outcome of the splitting criterion, the tuples in D are partitioned accordingly, and the procedure is applied recursively to each partition.
*
Decision Tree Induction
■ Conditions for stopping partitioning
  ■ All tuples in the partition belong to the same class
  ■ There are no remaining attributes on which to partition further (the leaf is labeled by majority voting)
  ■ There are no tuples left for a given branch
*
Let A be the splitting attribute:
■ If A is discrete-valued, one branch is grown for each known value of A.
■ If A is continuous-valued, two branches are grown, corresponding to A <= split_point and A > split_point.
■ If A is discrete-valued and a binary tree must be produced, the test is of the form A ∈ SA, where SA is the splitting subset for A.
Algorithm for forming a decision tree from training tuples
Attribute Selection Measure:
Information Gain
■ This measure is based on Claude Shannon’s pioneering work on information theory
■Select the attribute with the highest information gain as the splitting
attribute
*
Attribute Selection Measure: Information Gain
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
■ Expected information (entropy) needed to classify a tuple in D:
  Info(D) = −Σ(i=1..m) pi log2(pi)
■ Information still required after using attribute A to split D into v partitions {D1, …, Dv}:
  Info_A(D) = Σ(j=1..v) (|Dj| / |D|) × Info(Dj)
■ Information gained by branching on A:
  Gain(A) = Info(D) − Info_A(D)
*
Attribute Selection: Information Gain
Class C1: buys_computer = “yes” has 9 tuples; class C2: buys_computer = “no” has 5 tuples
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Info_age(D) = (5/14) Info(2 yes, 3 no) + (4/14) Info(4 yes, 0 no) + (5/14) Info(3 yes, 2 no) = 0.694 bits
Hence Gain(age) = Info(D) − Info_age(D) = 0.246 bits
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute
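The numbers above can be checked with a short plain-Python sketch; the `rows` list simply restates the buys_computer training data from the earlier slide:

```python
# A minimal plain-Python sketch that checks the information-gain numbers above;
# `rows` simply restates the buys_computer training data.
from collections import Counter
from math import log2

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    """Information gain obtained by splitting on the attribute at attr_index."""
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([r[-1] for r in rows]) - info_a

for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(i):.3f}")   # age has the highest gain (about 0.25 bits)
```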
*
Computing Information-Gain for
Continuous-Valued Attributes
■ Let attribute A be a continuous-valued attribute (e.g., the raw age values)
■ The best split point for A must be determined: sort the values of A in increasing order, consider the midpoint between each pair of adjacent values as a possible split point, and choose the split point that gives the minimum expected information requirement (maximum gain)
*
Gain Ratio for Attribute Selection
■ C4.5 (a successor of ID3) uses gain ratio to overcome the bias of information gain toward attributes with many values (a normalization of information gain)
■ SplitInfo_A(D) = −Σ(j=1..v) (|Dj| / |D|) log2(|Dj| / |D|)
  This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A
■ GainRatio(A) = Gain(A) / SplitInfo_A(D)
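A small continuation of the sketch above (it reuses rows, ATTRS, info and gain) shows the gain-ratio computation:

```python
# Gain ratio, continuing the information-gain sketch above (reuses rows, ATTRS, gain).
from collections import Counter
from math import log2

def split_info(attr_index):
    """SplitInfo_A(D): potential information generated by the split on attribute A."""
    n = len(rows)
    counts = Counter(r[attr_index] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(attr_index):
    return gain(attr_index) / split_info(attr_index)

for i, name in enumerate(ATTRS):
    print(f"GainRatio({name}) = {gain_ratio(i):.3f}")
```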
*
Gini Index (CART)
■ Measures the impurity of D, a data partition or set of training tuples
■ If a data set D contains examples from n classes, the Gini index, gini(D), is defined as
  gini(D) = 1 − Σ(j=1..n) pj²
  where pj is the relative frequency of class j in D (i.e., the probability that a tuple in D belongs to class Cj)
■ If data set D is split on A into two subsets D1 and D2, the Gini index of the split is defined as
  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
*
Gini Index (CART)
■ Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)
■ The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points need to be enumerated for each attribute)
*
Gini Index (CART)
■ The Gini index considers a binary split for each attribute.
■ Let us first consider the case where A is a discrete-valued attribute having v distinct values, e.g., income with the 3 values low, medium, and high
■ Examine all the possible subsets that can be formed using the known values of income, excluding the full set and the empty set (neither represents a real split)
■ There are therefore 2^v − 2 possible ways to form two partitions of the data D based on a binary split on A
*
Computation of Gini Index
■ Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:
  gini(D) = 1 − (9/14)² − (5/14)² = 0.459
■ Suppose the attribute income partitions D into D1 = {low, medium} (10 tuples) and D2 = {high} (4 tuples):
  gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443
  while the splits {low, high} vs {medium} and {medium, high} vs {low} give 0.458 and 0.450
■ Thus, split on {low, medium} (and {high}), since it has the lowest Gini index
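A small continuation of the earlier plain-Python sketch (reusing the rows list) verifies these Gini values:

```python
# Gini-index values for the binary splits of income (reuses `rows` from above).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(f"gini(D) = {gini([r[-1] for r in rows]):.3f}")         # ≈ 0.459

for subset in ({"low", "medium"}, {"low", "high"}, {"medium", "high"}):
    d1 = [r[-1] for r in rows if r[1] in subset]               # income is column 1
    d2 = [r[-1] for r in rows if r[1] not in subset]
    g = len(d1) / len(rows) * gini(d1) + len(d2) / len(rows) * gini(d2)
    print(f"split {sorted(subset)} vs rest: {g:.3f}")          # 0.443, 0.458, 0.450
```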
*
Comparing Attribute Selection Measures
■ Information gain: biased toward tests with many outcomes (multivalued attributes)
■ Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
*
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on combinations of multiple variables)
  CART: finds multivariate splits based on a linear combination of attributes
Which attribute selection measure is the best?
  Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to
noise or outliers
Poor accuracy for unseen samples
C4.5 uses a method called pessimistic pruning, which uses the training set; it is similar to the cost-complexity method but does not require a separate prune set
Decision Tree
Decision trees can suffer from repetition and replication
Enhancements to Basic Decision Tree Induction
Rule Based Classification
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R1: IF age = youth AND student = yes THEN buys_computer = yes
Rule-based ordering (decision list): rules are organized into one long priority list, ordered by some measure of rule quality (accuracy, coverage, or size, i.e., the number of attribute tests in the rule antecedent) or by experts; the rule that appears earliest in the list has the highest priority
Using IF-THEN Rules for Classification
A default rule can be added; its class is the majority class overall, or the majority class of the tuples that were not covered by any rule.
The default rule is evaluated last, and fires if and only if no other rule covers X.
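A minimal sketch of applying such a decision list with a default rule; the rules shown are illustrative ones read off the buys_computer tree, not part of these notes:

```python
# A minimal sketch of a rule-based ordering (decision list) with a default rule.
rules = [
    # (antecedent: attribute -> required value, consequent class), in priority order
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31..40"}, "yes"),
    ({"age": ">40", "credit_rating": "fair"}, "yes"),
    ({"age": ">40", "credit_rating": "excellent"}, "no"),
]
DEFAULT_CLASS = "no"   # e.g., the majority class of the uncovered tuples

def classify(x):
    """Return the class of the earliest (highest-priority) rule that covers x."""
    for antecedent, consequent in rules:
        if all(x.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return DEFAULT_CLASS   # the default rule fires only when no other rule covers x

print(classify({"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # -> yes
```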
Rules are easier to understand than large trees. One rule is created for each path from the root to a leaf: the attribute tests along the path (e.g., age? with outcomes <=30, 31..40, >40, then student? or credit_rating?) form the rule antecedent, and the leaf's class is the rule consequent.
Given a rule precondition, any condition that does not improve the estimated accuracy of the rule can be pruned.
Any rule that does not contribute to the overall accuracy of the entire rule set can also be pruned.
Rule Induction: Sequential Covering Method
Sequential covering algorithm: Extracts rules directly from training
data
Typical sequential covering algorithms: AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of other classes
Steps:
  Rules are learned one at a time
  Each time a rule is learned, the tuples covered by the rule are removed
  Repeat the process on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold (a sketch of this loop follows the figure below)
Figure: each learned rule covers a region of the positive examples (examples covered by Rule 1, Rule 2, Rule 3).
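A minimal sketch of the sequential covering loop described above; the helper functions learn_one_rule, rule_quality and covers are supplied by the caller and stand in for the details of AQ, CN2 or RIPPER:

```python
# A minimal sketch of the sequential covering loop (helpers are caller-supplied).
def sequential_covering(examples, target_class, learn_one_rule, rule_quality, covers, threshold):
    rule_set, remaining = [], list(examples)
    while remaining:
        rule = learn_one_rule(remaining, target_class)             # learn one rule at a time
        if rule is None or rule_quality(rule, remaining) < threshold:
            break                                                  # termination condition
        rule_set.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]  # remove covered tuples
    return rule_set
```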
Rule Generation
Rules are generated by starting from an empty rule and greedily adding predicates, e.g.:
  1. IF ⟨empty condition⟩ THEN loan_decision = accept
  2. IF income = high THEN loan_decision = accept
  3. IF income = high AND credit_rating = excellent THEN loan_decision = accept
Growth loop:
  while (true)
    find the best predicate p
    if foil_gain(p) > threshold then add p to the current rule
    else break
Figure: general-to-specific rule growth (A3=1, then A3=1 AND A1=2, then A3=1 AND A1=2 AND A8=5), progressively separating the positive examples from the negative examples.
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Add new attribute tests by adopting a greedy depth-first strategy
  Pick the test that most improves the rule quality (e.g., FOIL_Gain)
Rule pruning is based on an independent set of pruning tuples:
  FOIL_Prune(R) = (pos − neg) / (pos + neg)
  where pos and neg are the numbers of positive and negative tuples covered by rule R; if FOIL_Prune is higher for the pruned version of R, then R is pruned
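A minimal sketch of the two FOIL measures referred to above; the FOIL_Gain formula is the standard one used by FOIL-style rule learners, and the pos/neg counts are supplied by the caller:

```python
# FOIL measures: pos/neg are counts of positive/negative tuples covered before
# extending the rule, pos_new/neg_new after adding a candidate test.
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    if pos == 0 or pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) for a rule R covering pos positive and neg negative tuples."""
    return (pos - neg) / (pos + neg)
```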
Classifier Eager or Lazy?
Eager learners, when given a set of training tuples, construct a model before receiving new (e.g., test) tuples to classify.
The learned model is, so to speak, ready and eager to classify previously unseen tuples.
A lazy learner instead waits until the last minute before doing any model construction to classify a given test tuple.
Given a training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until it is given a test tuple.
Because lazy learners store the training tuples or “instances,” they are also referred to as instance-based learners.
Classification with lazy learners can be computationally expensive, so they can benefit from implementation on parallel hardware.
They require efficient storage techniques; examples of lazy learners are case-based reasoning and kNN.
kNN Classifier
1-Nearest Neighbor
3-Nearest Neighbor
kNN Algorithm
Store all input data in the training set
Lazy Learners
kNN Algorithm
Let X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n) be two tuples.
For each numeric attribute, take the difference between the corresponding values of that attribute in tuples X1 and X2, square this difference, and accumulate it.
The Euclidean distance is the square root of the total accumulated sum:
  dist(X1, X2) = sqrt(Σ(k=1..n) (x1k − x2k)²)
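A minimal plain-Python sketch of this Euclidean distance together with a majority-vote k-nearest-neighbor classifier:

```python
# Euclidean distance and a majority-vote kNN classifier.
from collections import Counter
from math import sqrt

def euclidean(x1, x2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(query, training, k=3):
    """training: list of (feature_tuple, class_label) pairs."""
    neighbors = sorted(training, key=lambda tc: euclidean(query, tc[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```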
Customer  Age  Income  No. cards  Response
Hannah    63   200K    1          No
Tom       59   170K    1          No
David     37   50K     2          ?
K-Nearest Neighbor Classifier Example (income in $1000s)

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)² + (35-50)² + (3-2)²]  = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)² + (50-50)² + (2-2)²]  = 15
Hannah    63   200         1          No        sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.23
Tom       59   170         1          No        sqrt[(59-37)² + (170-50)² + (1-2)²] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)² + (40-50)² + (4-2)²]  = 15.74
David     37   50          2          Yes (predicted)
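Applying the knn_classify sketch from above to this example, the three nearest neighbors of David are Rachel, John and Nellie, so the predicted response is Yes:

```python
# Usage of the knn_classify sketch above on the customer example (income in $1000s).
training = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]
print(knn_classify((37, 50, 2), training, k=3))   # -> Yes
```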
K Nearest Neighbors
Advantage
Nonparametric architecture
Simple
Powerful
Requires no training time
Disadvantage
Memory intensive
Classification/estimation is slow
Training dataset (nominal attributes)
Customer ID Debt Income Marital Status Risk
K=3
Distance (nominal attributes):
  The score for an attribute is 1 for a match and 0 otherwise
  The distance measure used here is the sum of these scores over all attributes (i.e., the number of matching attribute values, so a higher value indicates a closer neighbor)
Test Set

Customer ID  Debt      Income    Marital Status  Risk
Zeb          High      Medium    Married         ?
Yong         Low       High      Married         ?
Xu           Very low  Very low  Unmarried       ?
Vasco        High      Low       Married         ?
Unace        High      Low       Divorced        ?
Trey         Very low  Very low  Married         ?
Steve        Low       High      Unmarried       ?
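A minimal sketch of this matching-score measure for nominal attributes and the corresponding k-nearest-neighbor vote (following the convention stated above, where a higher score means a closer neighbor); the labeled training tuples are not listed on this slide and would be supplied by the caller:

```python
# Matching-score "distance" for nominal attributes and a kNN vote over the k best matches.
from collections import Counter

def match_score(x1, x2):
    return sum(1 for a, b in zip(x1, x2) if a == b)

def knn_nominal(query, training, k=3):
    """training: list of (attribute_tuple, class_label); vote among the k best matches."""
    neighbors = sorted(training, key=lambda tc: match_score(query, tc[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# e.g. knn_nominal(("High", "Medium", "Married"), labeled_training_tuples, k=3) for Zeb
```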
Bayesian Classification
■A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
*
Bayes’ Theorem: Basics
■ Let X be a data sample (“evidence”); its class label is unknown
■ Let H be the hypothesis that X belongs to class C
■ P(H|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X
■ P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  ■ E.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income
■ Bayes’ Theorem: P(H|X) = P(X|H) P(H) / P(X)
*
Naïve Bayesian Classification
■Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
■Suppose there are m classes C1, C2, …, Cm.
*
Naïve Bayes Classifier: An Example
*
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <= 30, income = high, student = yes, credit_rating = fair)
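A minimal sketch of the naïve Bayes computation for this X, reusing the `rows` list of the buys_computer data from the earlier information-gain sketch:

```python
# Naive Bayes scores P(Ci) * prod_k P(xk | Ci) for the query X (reuses `rows`).
from collections import Counter

def naive_bayes_scores(x):
    n = len(rows)
    class_counts = Counter(r[-1] for r in rows)            # 9 "yes", 5 "no"
    scores = {}
    for c, nc in class_counts.items():
        p = nc / n                                         # prior P(Ci)
        for k, value in enumerate(x):
            match = sum(1 for r in rows if r[-1] == c and r[k] == value)
            p *= match / nc                                # conditional P(xk | Ci)
        scores[c] = p                                      # proportional to P(Ci | X)
    return scores

print(naive_bayes_scores(("<=30", "high", "yes", "fair")))
# the class with the larger score, here "yes", is the predicted class
```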
*
Avoiding the Zero-Probability
Problem
■ Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero
■ Remedy: use the Laplacian correction (Laplace estimator), adding 1 to each count (and the number of distinct attribute values to the denominator); the “corrected” probability estimates remain close to their “uncorrected” counterparts
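A one-line sketch of the Laplacian correction applied to a conditional probability estimate; V here denotes the number of distinct values of the attribute being conditioned on:

```python
# Laplace-smoothed conditional probability estimate.
def smoothed_conditional(match_count, class_count, V):
    return (match_count + 1) / (class_count + V)   # never zero, close to match_count / class_count
```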
*
Naïve Bayes Classifier: Comments
Advantages
■ Easy to implement
■ Good results obtained in most of the cases
Disadvantages
■ The assumption of class-conditional independence can cause a loss of accuracy
■ In practice, dependencies exist among variables
  ■ E.g., hospital patient data: profile attributes such as age and family history are typically not independent
Bayes Classifier
*
Draw decision tree for given training data
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
Class Imbalance Problem: one class of interest may be rare (e.g., fraud or a positive medical diagnosis), so overall accuracy can be misleading; sensitivity and specificity are then more informative
Accuracy = (TP + TN) / All;  Error rate = (FP + FN) / All
Sensitivity (true positive recognition rate) = TP/P
Specificity (true negative recognition rate) = TN/N
Classifier Evaluation Metrics: Example
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision (exactness): the fraction of tuples labeled positive that are actually positive, precision = TP / (TP + FP)
Recall (completeness): the fraction of positive tuples that are labeled positive, recall = TP / (TP + FN)
F measure (F1 or F-score): the harmonic mean of precision and recall,
  F1 = 2 × precision × recall / (precision + recall)
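A minimal sketch computing these metrics from a confusion matrix; the TP/FP/TN/FN counts are purely illustrative:

```python
# Evaluation metrics from a confusion matrix (counts are illustrative).
TP, FP, TN, FN = 90, 10, 80, 20
P, N = TP + FN, TN + FP

accuracy    = (TP + TN) / (P + N)
error_rate  = (FP + FN) / (P + N)
sensitivity = TP / P                      # recall / true positive rate
specificity = TN / N                      # true negative rate
precision   = TP / (TP + FP)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, error_rate, sensitivity, specificity, precision, f1)
```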
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
  The given data are randomly partitioned into two independent sets: a training set (typically about 2/3 of the data) for model construction and a test set (the remaining 1/3) for accuracy estimation
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Cross-validation (k-fold, where k = 10 is most popular)
  Randomly partition the data into k mutually exclusive subsets (folds), each of approximately equal size
  At iteration i, use fold Di as the test set and the remaining folds as the training set; the accuracy estimate is the overall number of correct classifications divided by the total number of tuples
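A minimal sketch of 10-fold cross-validation, assuming scikit-learn and using the iris data only as a stand-in dataset:

```python
# 10-fold cross-validation of a decision tree (assumed library: scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores)          # one accuracy per fold
print(scores.mean())   # overall estimate = average over the 10 folds
```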
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
  A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. On average, about 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
  Repeat the sampling procedure k times; the overall accuracy of the model is obtained by averaging, over the k repetitions, 0.632 × Acc(Mi) on the test set + 0.368 × Acc(Mi) on the training set
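A minimal sketch of one bootstrap split (indices only; model training is omitted), illustrating why roughly 63.2% of the tuples end up in the training sample:

```python
# One bootstrap split: sample d indices with replacement; out-of-bag indices form the test set.
import random

def bootstrap_split(d, seed=0):
    rng = random.Random(seed)
    train_idx = [rng.randrange(d) for _ in range(d)]     # sample d times with replacement
    chosen = set(train_idx)
    test_idx = [i for i in range(d) if i not in chosen]  # out-of-bag tuples form the test set
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(1000)
print(len(set(train_idx)) / 1000, len(test_idx) / 1000)  # ≈ 0.632 and ≈ 0.368
```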
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Suppose we have 2 classifiers, M1 and M2, which one is better?
These mean error rates are just estimates of error on the true
population of future data cases
Obtain confidence limits for our error estimates (“One model is better
than the other by a margin of error of ±4%.”)
Estimating Confidence Intervals:
Null Hypothesis
Perform 10-fold cross-validation
Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
Use t-test (or Student’s t-test)
Null Hypothesis: M1 & M2 are the same ( the difference in mean error rate between the
two is zero)
Estimating Confidence Intervals: t-test
If only one test set is available (pairwise comparison), the same cross-validation partitioning is used for M1 and M2, and
  t = (mean_err(M1) − mean_err(M2)) / sqrt(var(M1 − M2) / k)
If two test sets are available, a non-paired version is used with variance var(M1)/k1 + var(M2)/k2 in the denominator,
  where k1 and k2 are the numbers of cross-validation samples used for M1 and M2, respectively
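A minimal sketch of the paired t-test on per-fold error rates, assuming the scipy library; the ten error values per model are purely illustrative:

```python
# Paired t-test on per-fold error rates of two classifiers (assumed library: scipy).
from scipy import stats

err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.09, 0.12, 0.10, 0.15, 0.11]
err_m2 = [0.13, 0.12, 0.15, 0.12, 0.14, 0.11, 0.13, 0.12, 0.16, 0.12]

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # paired: same CV partitions for M1 and M2
print(t_stat, p_value)   # reject the null hypothesis if p_value < sig (e.g., 0.05)
```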
Estimating Confidence Intervals:
Table for t-distribution
The t-distribution is symmetric.
Significance level: e.g., sig = 0.05 (5%) means M1 & M2 are considered significantly different for 95% of the population.
Confidence limit: look up the table at z = sig/2 (two-tailed).
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
  Compute t and select a significance level (e.g., sig = 5%)
  Consult the t-distribution table for k − 1 degrees of freedom, at confidence limit z = sig/2
  If t > z or t < −z, then t lies in the rejection region: reject the null hypothesis that the mean error rates of M1 & M2 are the same
    Conclude: there is a statistically significant difference between M1 & M2
  Otherwise, conclude that any difference is chance
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
Ensemble methods
  Use a combination of models to increase accuracy
  Popular ensemble methods:
    Bagging: averaging the prediction over a collection of classifiers
    Boosting: weighted vote with a collection of classifiers
Bagging: Boostrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample), and a classifier model Mi is learned for each Di
The bagged classifier M* counts the votes and assigns the class with the
most votes to X
Prediction: bagging can be applied to the prediction of continuous values by taking the average of the values returned by each model for a given test tuple
Accuracy
  Often significantly better than a single classifier derived from D
Two Methods to construct Random Forest:
Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes, reducing the correlation between the individual classifiers
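A minimal sketch contrasting bagging and a Forest-RI-style random forest, assuming scikit-learn and using the iris data only as a stand-in dataset:

```python
# Bagging vs. random forest (assumed library: scikit-learn; iris used as a stand-in).
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=0)
# max_features="sqrt": F randomly chosen candidate attributes at each node (Forest-RI style)

print(cross_val_score(bagging, X, y, cv=10).mean())
print(cross_val_score(forest, X, y, cv=10).mean())
```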
One Neuron as a Network
Input values x1 and x2, multiplied by weight values w1 and w2, are the input to the neuron.
The neuron receives the weighted sum as input and calculates the output as a function of this input (the activation function).
Bias of a Neuron
Figure: the bias shifts the decision boundary, e.g., the lines x1 − x2 = −1, x1 − x2 = 0, and x1 − x2 = 1 in the (x1, x2) plane.
Bias as an extra input: the bias b can be treated as an extra input x0 = +1 with weight w0 = b. The input attribute values x1, x2, …, xm, with weights w1, w2, …, wm, feed a summing function whose result
  v = Σ(j=0..m) wj xj
is passed through an activation function φ(·) to produce the output (class) y.
Neuron with Activation
Linearly separable: a single neuron can separate the two classes of x and y with one hyperplane.
Linearly inseparable: no single hyperplane separates the classes. Solution? A multilayer feed-forward network.
A Multilayer Feed-Forward Neural
Network
Figure: the input record xi feeds the input nodes; weights wij connect the input nodes to the hidden nodes (outputs Oj); weights wjk connect the hidden nodes to the output nodes (outputs Ok, the output class). The network is fully connected.
Neural Network Learning
Outputs Ok, k = 1, 2, …, #classes
• The network is fully connected, i.e., each unit provides input to each unit in the next forward layer.
Classification by Backpropagation
Figure: a unit j receives the outputs x1, …, xn of the previous layer through weights w1j, …, wnj and applies an activation function f to produce its output y.
Net input to unit j: Ij = Σi wij Oi + θj
where wij is the weight of the connection from unit i in the previous layer to unit j, Oi is the output of unit i from the previous layer, and θj is the bias of unit j.
Output of unit j (sigmoid / logistic activation): Oj = 1 / (1 + e^(−Ij))
For a unit j in a hidden layer, the error is Errj = Oj (1 − Oj) Σk Errk wjk
Update weights and biases (with learning rate l):
  wij = wij + (l) Errj Oi
  θj = θj + (l) Errj
Epoch: one iteration through the training set is called an epoch.
Backpropagation through the network:
  Output vector; at the output nodes: Errk = Ok (1 − Ok)(Tk − Ok)
  At the hidden nodes: Errj = Oj (1 − Oj) Σk Errk wjk, with Oj = 1 / (1 + e^(−Ij))
  Input vector: xi
Example of Backpropagation
Network topology: 3 input units, 2 hidden units, 1 output unit.
Initialize the weights with random numbers in [−1.0, 1.0]; the biases θ4, θ5, θ6 are also initialized randomly.
For output unit 6, the net input is
  I6 = (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105
and the output is
  Oj = 1 / (1 + e^(0.105)) = 0.475
Calculation of the Error at Each Node
Output nodes: Errk = Ok (1 − Ok)(Tk − Ok)
Hidden nodes: Errj = Oj (1 − Oj) Σk Errk wjk
Assuming the target output T6 = 1:
  Unit 6: Err6 = 0.475 (1 − 0.475)(1 − 0.475) = 0.1311
The errors of the remaining units (and then the weight and bias updates) are computed similarly.
Calculation of the Weight and Bias Updates
Learning Rate l =0.9
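A minimal sketch reproducing the computations for output unit 6; the hidden outputs 0.332 and 0.525, the weights −0.3 and −0.2, the bias 0.1, the target T6 = 1 and the learning rate l = 0.9 are the values from the worked example above:

```python
# Forward pass, error and updates for output unit 6 of the worked example.
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

o4, o5 = 0.332, 0.525
w46, w56, theta6 = -0.3, -0.2, 0.1
l, t6 = 0.9, 1.0

i6 = w46 * o4 + w56 * o5 + theta6          # net input  ≈ -0.105
o6 = sigmoid(i6)                           # output     ≈ 0.474 (shown as 0.475 on the slide)
err6 = o6 * (1 - o6) * (t6 - o6)           # error      ≈ 0.1311

# weight and bias updates for the connections into unit 6
w46 += l * err6 * o4
w56 += l * err6 * o5
theta6 += l * err6
print(round(i6, 3), round(o6, 3), round(err6, 4))
```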
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
Poor interpretability: Difficult to interpret the symbolic meaning behind
the learned weights and of “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on an array of real-world data, e.g., hand-written letters
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules
from trained neural networks
SVM—Support Vector Machines
A relatively new classification method for both linear and
nonlinear data
It uses a nonlinear mapping to transform the original training
data into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
SVM—History and Applications
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi its associated class label
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solvable by Quadratic Programming (QP) using Lagrangian multipliers
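A minimal sketch of a linear SVM, assuming scikit-learn and synthetic 2-D data; the fitted coef_ and intercept_ correspond to W and b of the separating hyperplane W●X + b = 0:

```python
# Linear SVM on synthetic 2-D data (assumed library: scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)   # stand-in two-class data
svm = SVC(kernel="linear", C=1.0).fit(X, y)

print(svm.coef_, svm.intercept_)    # the learned hyperplane parameters W and b
print(svm.support_vectors_)         # the "essential" training tuples on/near the margin
```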
Why Is SVM Effective on High Dimensional Data?
SVM—Linearly Inseparable