6 Classification
Lecturer: Dr. Nguyen Thi Ngoc Anh
Email: ngocanhnt@ude.edu.vn
Classification vs. Prediction
— Classification:
◦ predicts categorical class labels (discrete or nominal)
◦ constructs a model from the training set and the class labels of a
classifying attribute, and uses the model to classify new data
— Prediction:
◦ models continuous-valued functions, i.e., predicts unknown or
missing values
— Typical Applications
◦ credit approval
◦ target marketing
◦ medical diagnosis
◦ treatment effectiveness analysis
Classification Process (1): Model Construction
[Figure: Training Data → Classification Algorithms → Classifier (model); the classifier is then applied to Testing Data and to Unseen Data, e.g. (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Dataset
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
[Figure: decision tree for buys_computer — root test on age with branches <=30, 31…40, >40 and leaf labels no/yes]
Supervised vs. Unsupervised Learning
— Supervised learning (classification)
◦ Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the
observations
◦ New data is classified based on the training set
— Unsupervised learning (clustering)
◦ The class labels of the training data are unknown
◦ Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Issues (1): Data Preparation
— Data cleaning
◦ Preprocess data in order to reduce noise and
handle missing values
— Relevance analysis (feature selection)
◦ Remove the irrelevant or redundant
attributes
— Data transformation
◦ Generalize and/or normalize data
Classification and Prediction
— What is classification? What is prediction?
— Issues regarding classification and prediction
— Bayesian Classification
— Classification by decision tree induction
— Classification by Neural Networks
— Classification by Support Vector Machines (SVM)
— Instance Based Methods
— Prediction
— Classification accuracy
— Summary
Bayesian Theorem: Basics
— Let X be a data sample whose class label is unknown
— Let H be a hypothesis that X belongs to class C
— For classification problems, determine P(H|X): the
probability that the hypothesis holds given the observed
data sample X
— P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
— P(X): probability that sample data is observed
— P(X|H) : probability of observing the sample X, given
that the hypothesis holds
Bayes’ Theorem
— Given training data X, posteriori probability of a hypothesis
H, P(H|X) follows the Bayes theorem
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
— Informally, this can be written as
posterior = likelihood × prior / evidence
— MAP (maximum a posteriori) hypothesis
$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
— Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
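As a concrete illustration of the formula, here is a minimal Python sketch; the numbers are assumed for illustration, not taken from the lecture.

```python
# Bayes' theorem with hypothetical numbers:
# H = "customer buys a computer", X = "customer is a student".
p_h = 0.64          # prior P(H), assumed
p_x_given_h = 0.67  # likelihood P(X|H), assumed
p_x = 0.50          # evidence P(X), assumed

p_h_given_x = p_x_given_h * p_h / p_x   # posterior = likelihood * prior / evidence
print(f"P(H|X) = {p_h_given_x:.3f}")
```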
Naïve Bayes Classifier
— A simplified assumption: attributes are conditionally
independent:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
— The probability of observing, say, two attribute values x1 and x2 together,
given that the class is C, is the product of the probabilities of each value
taken separately given the same class: P([x1, x2] | C) = P(x1 | C) × P(x2 | C)
— No dependence relation between attributes
— Greatly reduces the computation cost, only count the
class distribution.
— Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci)*P(Ci)
Training dataset
Class: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: Example
— Compute P(X|Ci) for each class
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci):  P(X|buys_computer=“yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
          P(X|buys_computer=“no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci)·P(Ci):  P(X|buys_computer=“yes”) · P(buys_computer=“yes”) = 0.028
                P(X|buys_computer=“no”) · P(buys_computer=“no”) = 0.007
X belongs to class “buys_computer=yes”
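The hand computation above can be reproduced with a short Python sketch that simply re-counts the training table; no library is used and the variable names are illustrative.

```python
from collections import Counter

# Training data from the slide: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high",   "no",  "fair",      "no"),
    ("<=30", "high",   "no",  "excellent", "no"),
    ("31..40","high",  "no",  "fair",      "yes"),
    (">40",  "medium", "no",  "fair",      "yes"),
    (">40",  "low",    "yes", "fair",      "yes"),
    (">40",  "low",    "yes", "excellent", "no"),
    ("31..40","low",   "yes", "excellent", "yes"),
    ("<=30", "medium", "no",  "fair",      "no"),
    ("<=30", "low",    "yes", "fair",      "yes"),
    (">40",  "medium", "yes", "fair",      "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40","medium","no",  "excellent", "yes"),
    ("31..40","high",  "yes", "fair",      "yes"),
    (">40",  "medium", "no",  "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")           # the unseen sample

class_counts = Counter(row[-1] for row in data)
n = len(data)
scores = {}
for c, nc in class_counts.items():
    p = nc / n                                  # prior P(Ci)
    for k, value in enumerate(X):               # product of P(xk | Ci)
        match = sum(1 for row in data if row[k] == value and row[-1] == c)
        p *= match / nc
    scores[c] = p

print(scores)                                   # approx {'yes': 0.028, 'no': 0.007}
print("predicted class:", max(scores, key=scores.get))   # 'yes'
```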
Bayesian Networks
— A Bayesian belief network allows conditional independence among subsets of
the variables
— A graphical model of causal relationships
◦ Represents dependency among the variables
◦ Gives a specification of joint probability distribution
[Figure: example belief network with variables Family History and Smoker, and a conditional probability table with columns (FH, S), (FH, ~S), (~FH, S), (~FH, ~S)]
Learning Bayesian Networks
— Several cases
◦ Given both the network structure and all variables
observable: learn only the CPTs
◦ Network structure known, some hidden variables:
method of gradient descent, analogous to neural
network learning
◦ Network structure unknown, all variables observable:
search through the model space to reconstruct graph
topology
◦ Unknown structure, all hidden variables: no good
algorithms known for this purpose
— D. Heckerman, Bayesian networks for data
mining
Training Dataset
This follows an example from Quinlan’s ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
[Figure: resulting decision tree for buys_computer — root test on age with branches <=30, 31…40, >40 and leaf labels no/yes]
Algorithm for Decision Tree Induction
— Basic algorithm (a greedy algorithm)
◦ Tree is constructed in a top-down recursive divide-and-conquer
manner
◦ At start, all the training examples are at the root
◦ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◦ Examples are partitioned recursively based on selected
attributes
◦ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
— Conditions for stopping partitioning
◦ All samples for a given node belong to the same class
◦ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
◦ There are no samples left
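A minimal Python sketch of this greedy top-down procedure, using information gain as the selection measure; the function names and dict-based tree representation are illustrative choices, not a specific library.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """rows: list of dicts {attribute name: value}; labels: parallel class labels."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all samples in one class -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):  # partition recursively on 'best'
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree
```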
Attribute Selection by Information
Gain Computation
— Class P: buys_computer = “yes”
— Class N: buys_computer = “no”
— I(p, n) = I(9, 5) = 0.940
— Compute the entropy for age:

age      pi  ni  I(pi, ni)
<=30     2   3   0.971
31…40    4   0   0
>40      3   2   0.971

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
Here (5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
$$Gain(age) = I(p, n) - E(age) = 0.246$$
Similarly,
$$Gain(income) = 0.029, \quad Gain(student) = 0.151, \quad Gain(credit\_rating) = 0.048$$
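The entropies and gains above can be checked with a few lines of Python (standard library only):

```python
import math

def I(p, n):
    """Expected information (entropy) for p positive and n negative samples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

# buys_computer: 9 yes, 5 no
print(round(I(9, 5), 3))          # 0.940

# Partition by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(E_age, 3))            # 0.694
print(round(I(9, 5) - E_age, 3))  # ~0.247; the slide's 0.246 comes from rounding 0.940 - 0.694
```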
Gini Index (IBM IntelligentMiner)
— If a data set T contains examples from n classes, the gini index gini(T) is defined as
$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where pj is the relative frequency of class j in T.
— If a data set T is split into two subsets T1 and T2 with sizes N1 and N2
respectively, the gini index of the split data is defined as
$$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$$
— The attribute that provides the smallest gini_split(T) is chosen
to split the node (need to enumerate all possible splitting
points for each attribute).
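A small Python sketch of the two formulas; the function names and example labels are illustrative.

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(t1, t2):
    """Weighted gini index of a binary split of T into T1 and T2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))                                   # 0.459
# Evaluate one candidate split; in practice the split with the smallest value wins
print(round(gini_split(["yes"] * 4, ["yes"] * 5 + ["no"] * 5), 3))
```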
Avoid Overfitting in Classification
— Overfitting: An induced tree may overfit the training
data
◦ Too many branches, some may reflect anomalies due to noise or
outliers
◦ Poor accuracy for unseen samples
— Two approaches to avoid overfitting
◦ Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
– Difficult to choose an appropriate threshold
◦ Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
– Use a set of data different from the training data to decide
which is the “best pruned tree”
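As an illustration only (not part of the original slides), scikit-learn exposes both styles: prepruning via construction-time thresholds and postpruning via cost-complexity pruning. This is a minimal sketch assuming the iris data set as a stand-in for the training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: stop splitting early via depth / goodness thresholds
pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01).fit(X_tr, y_tr)

# Postpruning: grow the tree fully, then prune with cost-complexity parameter ccp_alpha
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))   # compare on held-out data
```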
Enhancements to basic decision tree
induction
— Allow for continuous-valued attributes
◦ Dynamically define new discrete-valued attributes that partition
the continuous attribute value into a discrete set of intervals
— Handle missing attribute values
◦ Assign the most common value of the attribute
◦ Assign probability to each of the possible values
— Attribute construction
◦ Create new attributes based on existing ones that are sparsely
represented
◦ This reduces fragmentation, repetition, and replication
Scalable Decision Tree Induction Methods in
Data Mining Studies
— SLIQ (EDBT’96 — Mehta et al.)
◦ builds an index for each attribute and only class list and the
current attribute list reside in memory
— SPRINT (VLDB’96 — J. Shafer et al.)
◦ constructs an attribute list data structure
— PUBLIC (VLDB’98 — Rastogi & Shim)
◦ integrates tree splitting and tree pruning: stop growing the tree
earlier
— RainForest (VLDB’98 — Gehrke, Ramakrishnan &
Ganti)
◦ separates the scalability aspects from the criteria that determine
the quality of the tree
◦ builds an AVC-list (attribute, value, class label)
Presentation of Classification
Results
Visualization of a Decision Tree in
SGI/MineSet 3.0
Classification
— Classification:
◦ predicts categorical class labels
— Typical Applications
◦ {credit history, salary}-> credit approval ( Yes/No)
◦ {Temp, Humidity} --> Rain (Yes/No)
Mathematically, x ∈ X = {0,1}^n, y ∈ Y = {0,1}, and the classifier is a function h: X → Y with y = h(x).
Linear Classification
— Binary classification problem
— The data above the red line belongs to class ‘x’
— The data below the red line belongs to class ‘o’
— Examples: SVM, Perceptron, Probabilistic Classifiers
[Figure: 2-D data with classes ‘x’ and ‘o’ separated by a straight (red) line]
Discriminative Classifiers
— Advantages
◦ prediction accuracy is generally high
– (as compared to Bayesian methods – in general)
◦ robust, works when training examples contain errors
◦ fast evaluation of the learned target function
– (Bayesian networks are normally slow)
— Criticism
◦ long training time
◦ difficult to understand the learned function (weights)
– (Bayesian networks can be used easily for pattern discovery)
◦ not easy to incorporate domain knowledge
– (easy in the form of priors on the data or distributions)
Neural Networks
— Analogy to Biological Systems (Indeed a great example
of a good learning system)
— Massive Parallelism allowing for computational efficiency
— The first learning algorithm came in 1959, when Rosenblatt suggested that if
a target output value is provided for a single neuron with fixed inputs, the
weights can be incrementally changed to learn to produce these outputs using
the perceptron learning rule
A Neuron
[Figure: a neuron with inputs x0 … xn, weights w0 … wn, and bias −μk; the weighted sum Σ wi xi − μk is passed through an activation function f to produce the output y]
Multi-Layer Perceptron
[Figure: network with an input vector xi feeding input nodes, a layer of hidden nodes connected by weights wij, and output nodes producing the output vector]

$$I_j = \sum_i w_{ij} O_i + \theta_j$$
$$O_j = \frac{1}{1 + e^{-I_j}}$$
$$Err_j = O_j (1 - O_j)(T_j - O_j) \quad \text{(output nodes)}$$
$$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk} \quad \text{(hidden nodes)}$$
$$w_{ij} = w_{ij} + (l)\, Err_j\, O_i$$
$$\theta_j = \theta_j + (l)\, Err_j$$
Network Training
— The ultimate objective of training
◦ obtain a set of weights that makes almost all the tuples in the
training data classified correctly
— Steps
◦ Initialize weights with random values
◦ Feed the input tuples into the network one by one
◦ For each unit
– Compute the net input to the unit as a linear combination of all the
inputs to the unit
– Compute the output value using the activation function
– Compute the error
– Update the weights and the bias
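A compact Python sketch of these steps for one hidden layer, using the sigmoid output and the error formulas from the previous slide; the array shapes, the synthetic data, and the learning rate l are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 3))                               # 20 training tuples, 3 inputs
T = (X.sum(axis=1, keepdims=True) > 1.5) * 1.0        # synthetic target outputs

n_in, n_hid, n_out, l = 3, 4, 1, 0.5
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hid)); b1 = np.zeros(n_hid)   # random initial weights
W2 = rng.uniform(-0.5, 0.5, (n_hid, n_out)); b2 = np.zeros(n_out)

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

for epoch in range(1000):
    for x, t in zip(X, T):                            # feed tuples one by one
        O1 = sigmoid(x @ W1 + b1)                     # hidden outputs
        O2 = sigmoid(O1 @ W2 + b2)                    # network output
        err2 = O2 * (1 - O2) * (t - O2)               # Err_j at output nodes
        err1 = O1 * (1 - O1) * (err2 @ W2.T)          # Err_j at hidden nodes
        W2 += l * np.outer(O1, err2); b2 += l * err2  # update weights and biases
        W1 += l * np.outer(x, err1);  b1 += l * err1
```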
Classification and Prediction
— What is classification? What is prediction?
— Issues regarding classification and prediction
— Classification by decision tree induction
— Bayesian Classification
— Classification by Neural Networks
— Classification by Support Vector Machines (SVM)
— Instance Based Methods
— Prediction
— Classification accuracy
— Summary
Support Vector Machine (SVM)
— Classification is essentially finding the best
boundary between classes.
— A support vector machine finds the best
boundary points, called support vectors,
and builds a classifier on top of them.
— Linear and non-linear support vector
machines.
Optimal hyperplane, separable case.
Analysis Cont.
3. So the problem of finding the optimal hyperplane becomes: maximize C over β, β0 with ||β|| = 1,
subject to the constraint
$$y_i (x_i^T \beta + \beta_0) \ge C, \quad i = 1, \ldots, N.$$
4. It is the same as: minimize ||β|| subject to
$$y_i (x_i^T \beta + \beta_0) \ge 1, \quad i = 1, \ldots, N.$$
Non-separable case
Non-separable Cont.
1. The constraint changes to the following:
$$y_i (x_i^T \beta + \beta_0) \ge C (1 - \xi_i), \quad \text{where } \forall i,\ \xi_i \ge 0,\ \sum_{i=1}^{N} \xi_i \le \text{const.}$$
2. Thus the optimization problem changes to:
$$\min \|\beta\| \ \text{ subject to } \ y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i,\ i = 1, \ldots, N,\ \ \xi_i \ge 0,\ \sum_{i=1}^{N} \xi_i \le \text{const.}$$
Compute SVM.
SVM computing Cont.
The Lagrange function for this problem is:
$$L_P = \frac{1}{2}\|\beta\|^2 + \gamma \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \right] - \sum_{i=1}^{N} \mu_i \xi_i$$
By formal Lagrange procedures, we get the dual problem:
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}$$
The solution is
$$\hat{\beta} = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i$$
General SVM
Can we do better?
A non-linear boundary as
shown will do fine.
General SVM Cont.
— The idea is to map the feature space into
a much bigger space so that the boundary
is linear in the new space.
— Generally linear boundaries in the
enlarged space achieve better training-
class separation, and it translates to non-
linear boundaries in the original space.
Mapping
— Mapping: Φ: ℝ^d → H
◦ Need distances in H: Φ(xi) · Φ(xj)
— Kernel function: K(xi, xj) = Φ(xi) · Φ(xj)
◦ Example: K(xi, xj) = exp(−||xi − xj||² / 2σ²)
— In this example, H is infinite-dimensional
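For example, the Gaussian (RBF) kernel above can be evaluated directly, without ever forming Φ(x); a tiny Python sketch with an illustrative function name:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2)) -- an implicit dot product in H."""
    diff = np.asarray(xi) - np.asarray(xj)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))   # similarity of two points in the input space
```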
Degree 3 Example
Resulting Surfaces
General SVM Cont.
Now suppose our mapping from the original feature space to the new space is h(xi). The dual problem changes to:
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \alpha_i \alpha_{i'} y_i y_{i'} \langle h(x_i), h(x_{i'}) \rangle$$
Note that the transformation only appears through the dot product.
Reproducing Kernel.
Looking at the dual problem, the solution only depends on ⟨h(xi), h(xi′)⟩.
Traditional functional analysis tells us we need only look at the kernel
representation K(x, x′) = ⟨h(x), h(x′)⟩, which lies in a much
smaller-dimensional space than “h”.
Example of a polynomial kernel
Degree-d polynomial kernel: K(x, x′) = (1 + ⟨x, x′⟩)^d.
For a feature space with two inputs, x1 and x2, and a polynomial kernel of degree 2:
K(x, x′) = (1 + ⟨x, x′⟩)²
Let
h1(x) = 1, h2(x) = √2 x1, h3(x) = √2 x2, h4(x) = x1², h5(x) = x2², h6(x) = √2 x1x2,
then
K(x, x′) = ⟨h(x), h(x′)⟩.
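The identity on this slide is easy to verify numerically; a short Python sketch (the variable names are mine):

```python
import numpy as np

def h(x):
    """Explicit degree-2 feature map for two inputs x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = (1.0 + x @ xp) ** 2           # K(x, x') = (1 + <x, x'>)^2
rhs = h(x) @ h(xp)                  # <h(x), h(x')>
print(lhs, rhs)                     # both equal 4.0
```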
Performance of SVM
— For optimal hyperplanes passing through the origin, we have:
$$E[P(error)] \le \frac{E[D^2 / M^2]}{\ell}$$
— For a general support vector machine:
$$E[P(error)] \le \frac{E[\#\text{ of support vectors}]}{\#\text{ of training samples}}$$
— SVM has been very successful in lots of applications.
Results
Open problems of SVM.
— How do we choose the kernel function for a
specific set of problems? Different kernels
give different results, although generally
the results are better than using
hyperplanes.
— Comparison with the Bayes risk for the
classification problem. The minimum Bayes
risk is provably the best. When can
SVM achieve this risk?
SVM Related Links
— http://svm.dcs.rhbnc.ac.uk/
— http://www.kernel-machines.org/
— C. J. C. Burges. A Tutorial on Support Vector Machines
for Pattern Recognition. Data Mining and Knowledge
Discovery, 2(2), 1998.
— SVMlight – Software (in C)
http://ais.gmd.de/~thorsten/svm_light
— BOOK: An Introduction to Support Vector
Machines
N. Cristianini and J. Shawe-Taylor
Cambridge University Press
Other Classification Methods
— k-nearest neighbor classifier
— case-based reasoning
— Genetic algorithm
— Rough set approach
— Fuzzy set approaches
Instance-Based Methods
— Instance-based learning:
◦ Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
— Typical approaches
◦ k-nearest neighbor approach
– Instances represented as points in a Euclidean space.
◦ Locally weighted regression
– Constructs local approximation
◦ Case-based reasoning
– Uses symbolic representations and knowledge-based
inference
The k-Nearest Neighbor Algorithm
— All instances correspond to points in the n-D space.
— The nearest neighbors are defined in terms of Euclidean
distance.
— The target function could be discrete- or real- valued.
— For discrete-valued, the k-NN returns the most common
value among the k training examples nearest to xq.
— Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples.
[Figure: 1-NN decision regions (Voronoi diagram) for a set of “+” and “−” training examples around a query point xq]
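A minimal Python sketch of the discrete-valued k-NN classifier described above; no library is used and the names and sample points are illustrative.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples: list of (point, label); returns the most common label among
    the k training examples nearest to `query` in Euclidean distance."""
    nearest = sorted(examples, key=lambda e: math.dist(e[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "+"), ((1, 2), "+"), ((4, 4), "-"), ((5, 4), "-"), ((5, 5), "-")]
print(knn_classify((2, 2), train, k=3))     # '+'
```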
Case-Based Reasoning
— Also uses: lazy evaluation + analyze similar instances
— Difference: Instances are not “points in a Euclidean
space”
— Example: Water faucet problem in CADET (Sycara et al. ’92)
— Methodology
◦ Instances represented by rich symbolic descriptions (e.g.,
function graphs)
◦ Multiple retrieved cases may be combined
◦ Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving
— Research issues
◦ Indexing based on syntactic similarity measures and, when this fails,
backtracking and adapting to additional cases
Genetic Algorithms
— GA: based on an analogy to biological evolution
— Each rule is represented by a string of bits
— An initial population is created consisting of randomly
generated rules
◦ e.g., IF A1 and Not A2 then C2 can be encoded as 100
— Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules and
their offspring
— The fitness of a rule is represented by its classification
accuracy on a set of training examples
— Offspring are generated by crossover and mutation
Fuzzy Set Approaches
• Fuzzy logic uses truth values between 0.0 and 1.0 to
represent the degree of membership (such as using
fuzzy membership graph)
• Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories {low,
medium, high} with fuzzy values calculated
• For a given new sample, more than one fuzzy value may
apply
• Each applicable rule contributes a vote for membership
in the categories
• Typically, the truth values for each predicted category
are summed
What Is Prediction?
— Prediction is similar to classification
◦ First, construct a model
◦ Second, use model to predict unknown value
– Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
— Prediction is different from classification
◦ Classification refers to predict categorical class label
◦ Prediction models continuous-valued functions
Regression Analysis and Log-Linear
Models in Prediction
— Linear regression: Y = a + b X
◦ The two parameters, a and b, specify the line and are to
be estimated using the data at hand,
◦ applying the least-squares criterion to the known values
of Y1, Y2, …, X1, X2, ….
— Multiple regression: Y = b0 + b1 X1 + b2 X2.
◦ Many nonlinear functions can be transformed into the
above.
— Log-linear models:
◦ The multi-way table of joint probabilities is
approximated by a product of lower-order tables.
◦ Probability: p(a, b, c, d) = αab βac χad δbcd
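Returning to the linear case, a minimal least-squares sketch of Y = a + b X in Python; the data are synthetic and not from the slides.

```python
import numpy as np

# Synthetic data roughly following Y = 2 + 3 X + noise
rng = np.random.default_rng(0)
X = rng.random(50) * 10
Y = 2.0 + 3.0 * X + rng.normal(scale=0.5, size=50)

# Least-squares estimates of the two parameters a (intercept) and b (slope)
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
print(f"Y = {a:.2f} + {b:.2f} X")   # close to Y = 2 + 3 X
```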
Prediction: Numerical Data
Classification and Prediction
— What is classification? What is prediction?
— Issues regarding classification and prediction
— Classification by decision tree induction
— Bayesian Classification
— Classification by Neural Networks
— Classification by Support Vector Machines (SVM)
— Instance Based Methods
— Prediction
— Classification accuracy
— Summary
Bagging and Boosting
— General idea
[Figure: a classification method (CM) is applied repeatedly to (altered) training data to produce multiple classifiers, whose predictions are then combined]
Bagging
— Given a set S of s samples
— Generate a bootstrap sample T from S. Cases in S may
not appear in T or may appear more than once.
— Repeat this sampling procedure, getting a sequence of k
independent training sets
— A corresponding sequence of classifiers C1,C2,…,Ck is
constructed for each of these training sets, by using the
same classification algorithm
— To classify an unknown sample X, let each classifier
predict or vote
— The Bagged Classifier C* counts the votes and assigns X
to the class with the “most” votes
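A sketch of this procedure in Python, using scikit-learn's decision tree as the common base classifier and the iris data as a stand-in training set; the bagging logic itself follows the steps above.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
k, s = 15, len(X)

# Build k classifiers, each on a bootstrap sample T drawn with replacement from S
classifiers = []
for _ in range(k):
    idx = rng.integers(0, s, size=s)          # cases may repeat or be left out
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    votes = [int(c.predict([x])[0]) for c in classifiers]
    return Counter(votes).most_common(1)[0][0]   # class with the most votes

print(bagged_predict(X[0]), y[0])
```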
Boosting Technique — Algorithm
— Assign every example an equal weight 1/N
— For t = 1, 2, …,T Do
◦ Obtain a hypothesis (classifier) h(t) under w(t)
◦ Calculate the error of h(t) and re-weight the
examples based on the error. Each classifier is
dependent on the previous ones: samples that are
incorrectly predicted are weighted more heavily
◦ Normalize w(t+1) to sum to 1 (the weights assigned to
the examples sum to 1)
— Output a weighted sum of all the hypotheses,
with each hypothesis weighted according to its
accuracy on the training set
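A Python sketch of the reweighting loop (an AdaBoost-style variant, not necessarily the exact algorithm intended on the slide), again with a shallow scikit-learn tree as the weak learner and a built-in data set as a stand-in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 0, -1, 1)                      # labels in {-1, +1}
N, T = len(X), 10
w = np.full(N, 1.0 / N)                          # equal initial weights 1/N

hypotheses, alphas = [], []
for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)       # weighted error of h(t)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)               # misclassified samples weighted more
    w /= w.sum()                                 # normalize w(t+1) to sum to 1
    hypotheses.append(h); alphas.append(alpha)

# Final prediction: sign of the weighted sum of all hypotheses
F = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, hypotheses)))
print("training accuracy:", np.mean(F == y))
```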
Classification and Prediction
— What is classification? What is prediction?
— Issues regarding classification and prediction
— Classification by decision tree induction
— Bayesian Classification
— Classification by Neural Networks
— Classification by Support Vector Machines (SVM)
— Instance Based Methods
— Prediction
— Classification accuracy
— Summary
Summary
— Classification is an extensively studied problem (mainly in statistics,
machine learning & neural networks)
— Classification is probably one of the most widely used data mining
techniques with a lot of extensions
— Scalability is still an important issue for database applications: thus
combining classification with database techniques should be a
promising topic
— Research directions: classification of non-relational data, e.g., text,
spatial, multimedia, etc.
References (1)
— C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation
Computer Systems, 13, 1997.
— L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth
International Group, 1984.
— C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and
Knowledge Discovery, 2(2): 121-168, 1998.
— P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling
machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages
39-44, Montreal, Canada, August 1995.
— U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf.,
pages 601-606, AAAI Press, 1994.
— J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427,
New York, NY, August 1998.
— J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree
Construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.
References (2)
— M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree
induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on
Data Engineering (RIDE'97), Birmingham, England, April 1997.
— B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. Proc. 1998 Int.
Conf. Knowledge Discovery and Data Mining (KDD'98) New York, NY, Aug. 1998.
— W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association
Rules. Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.
— J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction
detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159.
Blackwell Business, Cambridge, Massachusetts, 1994.
— M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. (EDBT'96),
Avignon, France, March 1996.
References (3)
— T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
— S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey.
Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
— J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
— J. R. Quinlan. Bagging, boosting, and c4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence
(AAAI'96), 725-730, Portland, OR, Aug. 1996.
— R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In
Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August 1998.
— J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. In Proc.
1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept. 1996.
— S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction
Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann,
1991.
— S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.