2 - Decision Tree
Classification is a supervised learning task, i.e., the model is trained on examples whose input and output values are known, and it assigns data to predefined groups (classes) that need not share similar properties. Decision tree induction was developed by Ross Quinlan, whose original decision tree algorithm is known as ID3 (Iterative Dichotomiser). A decision tree is a classifier in the form of a tree structure, where:
Decision node: specifies a test on a single attribute
Leaf node: indicates the value of the target attribute
Arc/edge: a split on one attribute
Path: a conjunction of tests that leads to the final decision
ID3, C4.5, and CART are greedy algorithms for the induction of decision trees. Each algorithm uses an attribute selection measure to select the attribute tested at each nonleaf node in the tree. Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. Early decision tree algorithms typically assume that the data are memory resident, which is a limitation when mining very large databases. Several scalable algorithms, such as SLIQ, SPRINT, and RainForest, have been proposed to address this issue.
Key Requirements:
Attribute-value description: the object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
Predefined classes (target values): the target function has discrete output values (Boolean or multiclass).
Sufficient data: enough training cases must be provided to learn the model.
CLASSIFICATION TYPES
1) Decision Tree
2) Bayesian classification
3) Rule-based classification
1. Decision Tree Induction: Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node. Decision trees classify instances or examples by starting at the root of the tree and following branches until a leaf node is reached. They can generate understandable rules, perform classification without much computation, handle both continuous and categorical variables, and provide a clear indication of which fields are most important for prediction or classification. However, they are not well suited to predicting continuous attributes, and they perform poorly when there are many classes and little training data.
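To make the root-to-leaf traversal concrete, here is a minimal sketch (not part of the original notes) of a tree represented as nested Python dictionaries, using the age, student, and credit rating attributes of the buys computer example; the branch outcomes are assumed to follow the standard AllElectronics tree shown later in this section.

```python
# A minimal sketch of root-to-leaf classification with a hand-built tree.
# Internal nodes are dicts holding a test attribute and branches; leaves are class labels.

tree = {
    "attribute": "age",
    "branches": {
        "<=30":   {"attribute": "student",
                   "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",                           # leaf node holding a class label
        ">40":    {"attribute": "credit_rating",   # assumed branch from the standard example
                   "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def classify(node, x):
    """Start at the root and follow the branch matching x's attribute value until a leaf."""
    while isinstance(node, dict):                  # internal (non-leaf) node: a test
        node = node["branches"][x[node["attribute"]]]
    return node                                    # leaf node: the predicted class

print(classify(tree, {"age": "<=30", "student": "yes"}))   # -> yes
```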
Let A be the splitting attribute. Based on the training data, A has v distinct values, {a1, a2, ..., av}.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the known values of A.
2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, respectively.
3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute selection measure or algorithm being used): In this case, the test at node N is of the form "A in S_A?", where S_A is the splitting subset for A.
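As a rough illustration of these three cases, the following sketch (with assumed parameter names such as split_point and splitting_subset, which the notes do not fix) shows how a tuple's value of A would be mapped to a branch under each kind of test.

```python
# A sketch of the three branching rules for splitting attribute A.

def branch_of(value, case, split_point=None, splitting_subset=None):
    if case == "discrete":            # 1. one outcome per known value of A
        return value
    if case == "continuous":          # 2. two outcomes: A <= split_point or A > split_point
        return "<= split_point" if value <= split_point else "> split_point"
    if case == "discrete_binary":     # 3. two outcomes: A in S_A or A not in S_A
        return "in S_A" if value in splitting_subset else "not in S_A"

# e.g. branch_of(35, "continuous", split_point=30)                      -> "> split_point"
#      branch_of("fair", "discrete_binary", splitting_subset={"fair"})  -> "in S_A"
```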
[Figure: decision tree rooted at the test age?, with branches <=30 (leading to a student? test with outcomes no and yes), 31..40 (leaf: yes), and >40.]
Fig: Decision tree for the concept buys computer, indicating whether a customer at AllElectronics is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys computer = yes or buys computer = no).
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
i. Data partition, D, which is a set of training tuples and their associated class labels;
ii. Attribute_list, the set of candidate attributes;
iii. Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split_point or a splitting_subset.
Output: A decision tree.
Algorithm
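A rough Python transcription of the basic Generate_decision_tree procedure described above, under the simplifying assumptions that all attributes are discrete-valued and multiway splits are allowed; attribute_selection_method stands in for any of the measures discussed in the next section.

```python
# Sketch of Generate_decision_tree for discrete-valued attributes with multiway splits.
# D is a list of (tuple_dict, class_label) pairs.

from collections import Counter

def majority_class(D):
    """Most frequent class label in partition D."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    labels = [label for _, label in D]
    if len(set(labels)) == 1:                 # all tuples belong to the same class: leaf
        return labels[0]
    if not attribute_list:                    # no attributes left: majority-vote leaf
        return majority_class(D)

    A = attribute_selection_method(D, attribute_list)     # choose the splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]      # remove A (multiway split)

    for v in {tup[A] for tup, _ in D}:        # one branch per known value of A
        Dj = [(tup, label) for tup, label in D if tup[A] == v]
        # In the full algorithm, an empty partition Dj would become a leaf labeled with
        # the majority class of D; here Dj is never empty since v is taken from D itself.
        node["branches"][v] = generate_decision_tree(Dj, remaining,
                                                     attribute_selection_method)
    return node
```

For example, passing attribute_selection_method=lambda D, attrs: max(attrs, key=lambda A: info_gain(D, A)) reproduces ID3-style splitting, with info_gain as sketched under Attribute Selection Measure below.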
2. Attribute Selection Measure: An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition, D, of class-labeled training tuples into individual classes. i. Information gain: ID3 uses information gain as its attribute selection measure: select the attribute with the highest information gain. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D.
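Using the notation of this section, Info(D) = -Σ p_i log2(p_i) is the entropy of D, Info_A(D) = Σ_j (|Dj|/|D|) Info(Dj) is the expected information required after splitting D on attribute A, and Gain(A) = Info(D) - Info_A(D). A minimal sketch of these computations, assuming D is a list of (tuple, class label) pairs as in the algorithm sketch above:

```python
# A minimal sketch of information gain (ID3's attribute selection measure).

from collections import Counter
from math import log2

def info(D):
    """Info(D) = -sum(p_i * log2(p_i)), the entropy of partition D."""
    counts = Counter(label for _, label in D)
    total = len(D)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(D, A):
    """Gain(A) = Info(D) - Info_A(D): the expected reduction in entropy from splitting on A."""
    total = len(D)
    partitions = {}
    for tup, label in D:
        partitions.setdefault(tup[A], []).append((tup, label))
    info_A = sum((len(Dj) / total) * info(Dj) for Dj in partitions.values())
    return info(D) - info_A
```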
ii. Gain ratio: The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values. C4.5 uses an extension of information gain known as gain ratio, which attempts to overcome this bias.
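The gain ratio normalizes information gain by the split information of the attribute: SplitInfo_A(D) = -Σ_j (|Dj|/|D|) log2(|Dj|/|D|), and GainRatio(A) = Gain(A) / SplitInfo_A(D). Continuing the sketch above (info_gain is the function defined there):

```python
# Gain ratio, building on the information gain sketch above.

from collections import Counter
from math import log2

def split_info(D, A):
    """SplitInfo_A(D) = -sum((|Dj|/|D|) * log2(|Dj|/|D|)) over the partitions induced by A."""
    total = len(D)
    counts = Counter(tup[A] for tup, _ in D)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(D, A):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); undefined when SplitInfo_A(D) is 0."""
    return info_gain(D, A) / split_info(D, A)
```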
iii. Gini index: The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of D, a data partition or set of training tuples, as Gini(D) = 1 - Σ_{i=1}^{m} p_i^2, where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_i,D|/|D|.
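Continuing the same sketch, the Gini index and the weighted Gini of a binary split (CART considers binary splits, choosing the attribute and split that maximize the reduction Gini(D) - Gini_A(D)) can be computed as:

```python
# Gini index, continuing the attribute selection sketches above.

from collections import Counter

def gini(D):
    """Gini(D) = 1 - sum(p_i^2), the impurity of partition D."""
    counts = Counter(label for _, label in D)
    total = len(D)
    return 1 - sum((c / total) ** 2 for c in counts.values())

def gini_split(D1, D2):
    """Weighted Gini of a binary split of D into D1 and D2, i.e. Gini_A(D)."""
    n = len(D1) + len(D2)
    return (len(D1) / n) * gini(D1) + (len(D2) / n) * gini(D2)
```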
3. Tree Pruning: Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. There are two common approaches to tree pruning: prepruning and postpruning.
i. In the prepruning approach, a tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node).
ii. In postpruning, subtrees are removed from a fully grown tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.
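As one concrete illustration (scikit-learn is an assumption about tooling, not something these notes prescribe), its DecisionTreeClassifier supports both styles: prepruning through growth limits such as max_depth and min_samples_leaf, and postpruning through cost-complexity pruning controlled by ccp_alpha.

```python
# Illustration of prepruning vs. postpruning using scikit-learn (assumed library).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: halt construction early by limiting depth and minimum leaf size.
prepruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Postpruning: grow the tree fully, then prune subtrees whose removal costs little
# accuracy relative to their complexity (a larger ccp_alpha prunes more aggressively).
postpruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(prepruned.get_depth(), postpruned.get_depth())
```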
4. Scalability and Decision Tree Induction: The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets. Efficiency becomes an issue of concern when these algorithms are applied to the mining of very large real-world databases, because they require the training tuples to reside in main memory. The topics below cover (a) repetition and (b) replication, two forms of redundancy that can appear in induced trees, and (c) SLIQ and (d) SPRINT, two scalable decision tree algorithms designed for disk-resident data.
a) Repetition
Fig: An example of subtree repetition, where an attribute is repeatedly tested along a given branch of the tree (e.g., age).
b) Replication
Fig: An example of subtree replication, where duplicate subtrees exist within a tree (e.g., the subtree headed by the node credit rating?).
c) SLIQ
Fig: Attribute list and class list data structures used in SLIQ for the tuple data.
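In SLIQ, each attribute has a (typically disk-resident) attribute list of (attribute value, record identifier) pairs, while a single memory-resident class list maps each record identifier to its class label and current tree node. A toy sketch of these structures (the tuple values below are invented for illustration):

```python
# Toy sketch of the SLIQ attribute list / class list structures.

tuples = [
    {"RID": 1, "age": 25, "credit_rating": "fair",      "class": "no"},
    {"RID": 2, "age": 38, "credit_rating": "excellent", "class": "yes"},
    {"RID": 3, "age": 45, "credit_rating": "fair",      "class": "yes"},
]

# Disk-resident attribute lists, pre-sorted on the attribute value: (value, RID).
attribute_lists = {
    attr: sorted((t[attr], t["RID"]) for t in tuples)
    for attr in ("age", "credit_rating")
}

# Memory-resident class list, indexed by record id: class label and current tree node.
class_list = {t["RID"]: {"class": t["class"], "node": "root"} for t in tuples}
```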
d) SPRINT
Fig: Attribute list data structure used in SPRINT for the tuple data
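SPRINT removes the memory-resident class list: each attribute list entry carries the class label itself as (attribute value, class label, RID), so the lists can be partitioned and distributed along with the tree. A sketch reusing the toy tuples from the SLIQ example above:

```python
# Toy sketch of SPRINT attribute lists: (value, class label, RID), no separate class list.

sprint_attribute_lists = {
    attr: sorted((t[attr], t["class"], t["RID"]) for t in tuples)
    for attr in ("age", "credit_rating")
}
```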
Issues of Classification
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability

Typical applications
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock marketing