Decision tree

Classification is a key data mining task that categorizes data points into predefined classes using techniques like decision trees. Decision trees are supervised learning models that recursively split data based on feature values to predict outcomes, utilizing measures such as information gain and Gini index for attribute selection. The process involves building a tree structure with decision nodes and leaf nodes, where the model can classify new instances based on learned patterns from training data.


Classification

● Classification is a fundamental task in data mining that involves categorizing data points into
predefined classes or categories based on their features or attributes.
● It aims to learn a mapping from input variables to categorical output variables.

● One of the popular techniques for classification is building decision trees, which employ a tree-like structure to represent and classify data.
● In data mining, classification is the process of predicting the categorical class labels of new
instances based on past observations or labeled data.
● It involves training a model on a dataset with known class labels and then using this model to
classify new, unseen data.
● Supervised Learning: Classification is a form of supervised learning, where the algorithm learns
from labeled training data to make predictions on unseen data.
● Applications: Classification has various applications, including spam detection, disease diagnosis,
customer segmentation, sentiment analysis, and more.
Decision Tree
● Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
● It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
● In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
● Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
● The decisions or tests are performed on the basis of features of the given dataset.
● It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
● It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
● To build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
● A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
● The diagram below explains the general structure of a decision tree:

Advantages
o Decision Trees usually mimic human thinking while making a decision, so they are easy to understand.
o The logic behind a decision tree can be easily understood because it follows a tree-like structure.
Decision Tree Terminologies
● Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
● Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
● Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
● Branch/Sub-Tree: A tree formed by splitting the tree.
● Pruning: Pruning is the process of removing unwanted branches from the tree.
● Parent/Child Node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.
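To make these terms concrete, here is a minimal sketch that represents decision nodes and leaf nodes as plain Python classes; the attribute names and branch values are illustrative assumptions, not taken from the notes.

```python
# A minimal sketch of a decision tree's structure using the terminology above.
# Attribute names and branch values are illustrative assumptions only.

class LeafNode:
    """Final output node; it has no further branches."""
    def __init__(self, outcome):
        self.outcome = outcome                  # e.g. "Yes" / "No"

class DecisionNode:
    """Parent node that tests one attribute and branches to child nodes."""
    def __init__(self, attribute, branches):
        self.attribute = attribute              # attribute tested at this node
        self.branches = branches                # value -> child node (sub-tree)

# Root node: the whole dataset is first split on "Outlook";
# each branch leads either to another decision node or to a leaf node.
root = DecisionNode("Outlook", {
    "Overcast": LeafNode("Yes"),
    "Rain": DecisionNode("Windy", {"True": LeafNode("No"),
                                   "False": LeafNode("Yes")}),
    "Sunny": DecisionNode("Humidity", {"High": LeafNode("No"),
                                       "Normal": LeafNode("Yes")}),
})
```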


BUILDING A DECISION TREE - THE TREE INDUCTION METHOD:
● Decision tree is a popular and intuitive method for classification. It recursively partitions the data
into subsets based on the values of input features, with the goal of maximizing homogeneity within
each subset.
Selecting the Best Attribute:
● The tree induction process starts with selecting the best attribute to split the data.

● Various criteria can be used to measure the attribute's effectiveness, such as information gain, Gini
impurity, or entropy.
Splitting the Data:
● Once the best attribute is selected, the data is partitioned into subsets based on the attribute values.

● Each subset represents a branch or node in the decision tree.


Recursive Partitioning:
● The process of selecting the best attribute and splitting the data is repeated recursively for each
subset until one of the stopping conditions is met.
● Stopping conditions may include reaching a maximum depth, having a minimum number of
instances in a node, or no further improvement in purity.
Handling Categorical and Continuous Attributes:
● Decision trees can handle both categorical and continuous attributes.

● For categorical attributes, the tree creates branches for each category.

● For continuous attributes, the tree selects split points to partition the data into intervals.
Pruning (Optional):
● After the tree is built, pruning techniques may be applied to reduce overfitting and improve
generalization.
● Pruning involves removing branches that do not provide significant improvement in performance
on a validation dataset.
Classification:
● Once the tree is constructed, new instances can be classified by traversing the tree from the root to a
leaf node based on the attribute values of the instance.
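The induction loop described in this section can be sketched in a few lines of Python. This is a simplified, ID3-style illustration (information gain as the selection measure, categorical attributes, a maximum-depth stopping condition); it is a sketch of the idea under those assumptions, not a full CART implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on one categorical attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs, depth=0, max_depth=5):
    """Recursively split on the attribute with the highest information gain."""
    # Stopping conditions: pure node, no attributes left, or max depth reached.
    if len(set(labels)) == 1 or not attrs or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]      # majority-class leaf
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    remaining = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       remaining, depth + 1, max_depth)
    return tree
```

For continuous attributes, the same loop would instead search over candidate split points (thresholds) rather than enumerating category values.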
How does the Decision Tree algorithm Work?
● In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree.
● The algorithm compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
● For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
● It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:


Steps:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
o Example: Suppose a candidate has a job offer and wants to decide whether he should accept the offer or not.
o To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM).
o The root node splits further into the next decision node (distance from the office) and one leaf node, based on the corresponding labels.
o The next decision node further splits into one decision node (cab facility) and one leaf node.
o Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the diagram below; a small code sketch of this example follows.
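Since the original diagram is not reproduced here, the nested dictionary below is a hypothetical rendering of that example tree; the salary and distance thresholds are invented purely for illustration.

```python
# Hypothetical encoding of the job-offer example as a nested dict.
# Thresholds ("> $50K", "< 10 km") are invented for illustration only.
offer_tree = {
    "Salary > $50K?": {
        "No":  "Declined offer",                      # leaf node
        "Yes": {
            "Distance from office < 10 km?": {
                "Yes": "Accepted offer",              # leaf node
                "No": {
                    "Cab facility provided?": {
                        "Yes": "Accepted offer",      # leaf node
                        "No":  "Declined offer",      # leaf node
                    }
                },
            }
        },
    }
}

def decide(tree, answers):
    """Walk the tree using a dict of question -> 'Yes'/'No' answers."""
    while isinstance(tree, dict):
        question, branches = next(iter(tree.items()))
        tree = branches[answers[question]]
    return tree

print(decide(offer_tree, {"Salary > $50K?": "Yes",
                          "Distance from office < 10 km?": "No",
                          "Cab facility provided?": "Yes"}))  # Accepted offer
```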
Attribute Selection Measures
● While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
● To solve such problems, there is a technique called the Attribute Selection Measure (ASM).
● Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:
o Information Gain
o Gini Index
1. Split Algorithm based on Information Theory:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the below
formula:
Information Gain = Entropy(S) - [Weighted Avg * Entropy(each feature)]
Entropy: Entropy is a metric that measures the impurity in a given attribute. It specifies the randomness in the data.
Entropy can be calculated as:
Entropy(S) = -P(yes) log2(P(yes)) - P(no) log2(P(no))
Where,
o S = the set of training samples
o P(yes) = probability of yes
o P(no) = probability of no
Now, let's calculate the information gain for each attribute (outlook, temperature, humidity, and windy) and
decide which attribute to split on first.

1. Calculate the entropy of the target variable ("Play Tennis"):

There are 9 instances where tennis is played (Yes) and 5 instances where tennis is not played (No).
Entropy = -((9/14) * log2(9/14) + (5/14) * log2(5/14))
≈ -((0.643) * (-0.637) + (0.357) * (-1.485))
≈ 0.410 + 0.530
≈ 0.940
2. Calculate the information gain for each attribute:
a. Outlook:
● Split the dataset based on different outlook values: Sunny, Overcast, Rain.

● Calculate the entropy for each split and use it to compute the information gain.
b. Temperature:
● Split the dataset based on different temperature values: Hot, Mild, Cool.

● Calculate the entropy for each split and use it to compute the information gain.
c. Humidity:
● Split the dataset based on different humidity values: High, Normal.

● Calculate the entropy for each split and use it to compute the information gain.
d. Windy:
● Split the dataset based on windy values: True, False.

● Calculate the entropy for each split and use it to compute the information gain.

● Choose the attribute with the highest information gain to split on.

● The attribute with the highest information gain will be chosen as the root node of the decision tree.

● Continue recursively splitting the dataset based on the selected attribute until a stopping condition is
met (e.g., all instances belong to the same class).
● To calculate the information gain for each attribute, we need to follow these steps:
o Calculate the entropy of the target variable before splitting.
o Calculate the weighted entropy after splitting for each attribute.
o Calculate the information gain for each attribute as the difference between the entropy
before splitting and the weighted entropy after splitting.
o Let's start by calculating the entropy of the target variable "Play Tennis":
Entropy(S) = -((9/14) * log2(9/14) + (5/14) * log2(5/14))

Entropy(S) ≈ -((0.643) * (-0.637) + (0.357) * (-1.485))

≈ 0.410 + 0.530
≈ 0.940

Now, let's calculate the information gain for each attribute:


Outlook:
● Split the dataset based on different outlook values: Sunny, Overcast, Rain.

● Calculate the entropy for each split and use it to compute the information gain.
Entropy(Sunny) = -(2/5 * log2(2/5) + 3/5 * log2(3/5)) ≈ 0.971
Entropy(Overcast) = 0 (all instances belong to the same class)
Entropy(Rain) = -(3/5 * log2(3/5) + 2/5 * log2(2/5)) ≈ 0.971
Weighted Entropy(Outlook) = (5/14) * Entropy(Sunny) + (4/14) * Entropy(Overcast) + (5/14) * Entropy(Rain)
≈ (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971
≈ 0.694
Information Gain(Outlook) = Entropy(S) - Weighted Entropy(Outlook)
≈ 0.940 - 0.694
≈ 0.246

Temperature:
● Split the dataset based on different temperature values: Hot, Mild, Cool.

● Calculate the entropy for each split and use it to compute the information gain.
Entropy(Hot) = -(2/4 * log2(2/4) + 2/4 * log2(2/4)) = 1.0
Entropy(Mild) = -(4/6 * log2(4/6) + 2/6 * log2(2/6)) ≈ 0.918
Entropy(Cool) = -(3/4 * log2(3/4) + 1/4 * log2(1/4)) ≈ 0.811
Weighted Entropy(Temperature) = (4/14) * Entropy(Hot) + (6/14) * Entropy(Mild) + (4/14) * Entropy(Cool)
≈ (4/14) * 1.0 + (6/14) * 0.918 + (4/14) * 0.811
≈ 0.911
Information Gain(Temperature) = Entropy(S) - Weighted Entropy(Temperature)
≈ 0.940 - 0.911
≈ 0.029
Humidity:
● Split the dataset based on different humidity values: High, Normal.

● Calculate the entropy for each split and use it to compute the information gain.
Entropy(High) = -(3/7 * log2(3/7) + 4/7 * log2(4/7)) ≈ 0.985
Entropy(Normal) = -(6/7 * log2(6/7) + 1/7 * log2(1/7)) ≈ 0.592
Weighted Entropy(Humidity) = (7/14) * Entropy(High) + (7/14) * Entropy(Normal)
≈ (7/14) * 0.985 + (7/14) * 0.592
≈ 0.788
Information Gain(Humidity) = Entropy(S) - Weighted Entropy(Humidity)
≈ 0.940 - 0.788
≈ 0.152
Windy:
● Split the dataset based on windy values: True, False.

● Calculate the entropy for each split and use it to compute the information gain.
Entropy(True) = -(3/6 * log2(3/6) + 3/6 * log2(3/6)) = 1.0
Entropy(False) = -(6/8 * log2(6/8) + 2/8 * log2(2/8)) ≈ 0.811
Weighted Entropy(Windy) = (6/14) * Entropy(True) + (8/14) * Entropy(False)
≈ (6/14) * 1.0 + (8/14) * 0.811
≈ 0.892
Information Gain(Windy) = Entropy(S) - Weighted Entropy(Windy)
≈ 0.940 - 0.892
≈ 0.048
● Based on the information gain calculations, the "Outlook" attribute has the highest information gain (≈ 0.246). Therefore, we choose "Outlook" as the root node for the decision tree.
The final decision tree would look like this:

This decision tree reflects the decision-making process based on the outlook conditions: the Overcast branch leads directly to a leaf, the Sunny branch splits further on humidity, and the Rain branch splits further on windy conditions.
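The hand calculations above can be double-checked programmatically. The sketch below uses only the per-value class counts quoted in this section (for example, Sunny: 2 yes / 3 no) and recomputes the entropy and information gain for each attribute.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as (yes, no) counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class counts (yes, no) per attribute value, as used in the notes.
splits = {
    "Outlook":     {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)},
    "Temperature": {"Hot": (2, 2), "Mild": (4, 2), "Cool": (3, 1)},
    "Humidity":    {"High": (3, 4), "Normal": (6, 1)},
    "Windy":       {"True": (3, 3), "False": (6, 2)},
}

total = 14
base = entropy((9, 5))                       # Entropy(S) ≈ 0.940
for attr, values in splits.items():
    weighted = sum(sum(c) / total * entropy(c) for c in values.values())
    print(f"Information Gain({attr}) ≈ {base - weighted:.3f}")
# Expected output: Outlook ≈ 0.247, Temperature ≈ 0.029,
#                  Humidity ≈ 0.152, Windy ≈ 0.048
```

The Outlook figure prints as 0.247 rather than 0.246 only because the script avoids the intermediate rounding used in the hand calculation.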

2. Split Algorithm based on Gini Index:


o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm uses the Gini index and creates only binary splits.
o The Gini index can be calculated using the formula below:
Gini Index = 1 - ∑j (Pj)²

● To use the Gini index for splitting, we follow a similar process to information gain, but instead of
using entropy, we calculate the Gini index for each attribute.
● The attribute with the lowest Gini index (indicating the least impurity) is chosen for splitting.
The Gini index is calculated as follows:
● Calculate the Gini index for the target variable before splitting.

● Calculate the weighted Gini index after splitting for each attribute.

● Choose the attribute with the lowest weighted Gini index for splitting.
Let's calculate the Gini index for each attribute using the same dataset:
Calculate the Gini index for the target variable "Play Tennis":
Gini(S) = 1 - (P(Yes))^2 - (P(No))^2
Where P(Yes) is the probability of the positive class (playing tennis) and P(No) is the probability of the
negative class (not playing tennis).
Gini(S) = 1 - ((9/14)^2) - ((5/14)^2)
Gini(S) ≈ 1 - 0.413 - 0.128
Gini(S) ≈ 0.459
Now, let's calculate the Gini index for each attribute:
Outlook:
● Split the dataset based on different outlook values: Sunny, Overcast, Rain.

● Calculate the Gini index for each split and use it to compute the weighted Gini index.
Gini(Sunny) = 1 - ((2/5)^2) - ((3/5)^2) ≈ 0.48
Gini(Overcast) = 0 (All instances belong to the same class)
Gini(Rain) = 1 - ((3/5)^2) - ((2/5)^2) ≈ 0.48
Weighted Gini(Outlook) = (5/14) * Gini(Sunny) + (4/14) * Gini(Overcast) + (5/14) * Gini(Rain)
≈ (5/14) * 0.48 + (4/14) * 0 + (5/14) * 0.48
≈ 0.343
Temperature:
● Split the dataset based on different temperature values: Hot, Mild, Cool.

● Calculate the Gini index for each split and use it to compute the weighted Gini index.
Gini(Hot) = 1 - ((2/4)^2) - ((2/4)^2) = 0.5
Gini(Mild) = 1 - ((4/6)^2) - ((2/6)^2) ≈ 0.444
Gini(Cool) = 1 - ((3/4)^2) - ((1/4)^2) ≈ 0.375
Weighted Gini(Temperature) = (4/14) * Gini(Hot) + (6/14) * Gini(Mild) + (4/14) * Gini(Cool)
≈ (4/14) * 0.5 + (6/14) * 0.444 + (4/14) * 0.375
≈ 0.439
Humidity:
● Split the dataset based on different humidity values: High, Normal.
● Calculate the Gini index for each split and use it to compute the weighted Gini index.
Gini(High) = 1 - ((3/7)^2) - ((4/7)^2) ≈ 0.49
Gini(Normal) = 1 - ((6/7)^2) - ((1/7)^2) ≈ 0.245
Weighted Gini(Humidity) = (7/14) * Gini(High) + (7/14) * Gini(Normal)
≈ (7/14) * 0.49 + (7/14) * 0.245
≈ 0.367

Windy:
● Split the dataset based on windy values: True, False.
● Calculate the Gini index for each split and use it to compute the weighted Gini index.
Gini(True) = 1 - ((3/6)^2) - ((3/6)^2) = 0.5
Gini(False) = 1 - ((6/8)^2) - ((2/8)^2) ≈ 0.375
Weighted Gini(Windy) = (6/14) * Gini(True) + (8/14) * Gini(False)
≈ (6/14) * 0.5 + (8/14) * 0.375
≈ 0.429

Based on the Gini index calculations, the "Outlook" attribute has the lowest weighted Gini index (0.343).
Therefore, we will choose "Outlook" as the root node for the decision tree.
Overfitting and Pruning
● Overfitting and pruning are essential concepts in the context of decision trees and other machine
learning models. Let's discuss each of them in detail:
Overfitting:
● Overfitting occurs when a model learns the training data too well, capturing noise and random
fluctuations rather than the underlying pattern. As a result, the model performs well on the training
data but poorly on unseen data.
● In decision trees, overfitting often happens when the tree is allowed to grow too deep or become too
complex. This can lead to nodes in the tree that are specific to the training data and do not
generalize well.
● Overfitting can be detected by evaluating the model's performance on a separate validation set or
through cross-validation.
● Signs of overfitting include a large difference between training and validation performance or
deteriorating performance on the validation set as the model becomes more complex.
Pruning:
● Pruning is a technique used to combat overfitting by removing parts of the decision tree that do not
contribute significantly to its predictive accuracy.
● There are two main approaches to pruning: pre-pruning and post-pruning.

● Pre-pruning involves stopping the tree's growth early based on predefined criteria, such as limiting
the maximum depth of the tree, the minimum number of samples required to split a node, or the
maximum number of leaf nodes.
● Post-pruning (of which cost-complexity pruning is a common example) involves growing the full tree and then removing nodes that do not improve the tree's performance on a validation set.
● This is typically done by recursively replacing subtrees with a single node, such as the most
common class in the subtree, while monitoring the tree's performance on the validation set.
● Pruning helps simplify the model, reducing its complexity and improving its generalization
performance on unseen data.
● Common pruning techniques include reduced error pruning and minimal cost-complexity pruning (used in the CART algorithm); C4.5 uses its own error-based pruning.
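As a concrete illustration of both approaches, the sketch below assumes scikit-learn is available: pre-pruning is expressed through tree-growth limits, and post-pruning through minimal cost-complexity pruning (the ccp_alpha parameter), selected on a held-out validation set.

```python
# Hedged sketch: pre-pruning and cost-complexity pruning with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via depth / sample-count / leaf-count limits.
pre_pruned = DecisionTreeClassifier(max_depth=3,
                                    min_samples_leaf=5,
                                    max_leaf_nodes=8)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then pick the
# penalty (ccp_alpha) that performs best on the held-out validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best ccp_alpha: {best_alpha:.4f}, validation accuracy: {best_score:.3f}")
```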
Rules for decision tree
Decision trees are a popular method in machine learning and data mining for both classification and
regression tasks. Here are the basic rules for constructing a decision tree:
● Selecting a Root Node:
Choose the attribute that best splits the data into distinct classes or groups. This is typically done
using metrics such as entropy, Gini impurity, or information gain.
● Splitting Nodes:
For each node in the tree, choose the best attribute to split the data into child nodes.
Continue this process recursively for each child node until a stopping criterion is met.
● Stopping Criteria:
A decision tree can grow very complex and may overfit the training data. To prevent this, stopping
criteria are applied, such as:
● Maximum tree depth:
Limit the maximum depth of the tree.
● Minimum samples per leaf:
Stop splitting nodes when the number of samples in a node is below a certain threshold.
● Maximum number of leaf nodes:
Limit the total number of leaf nodes in the tree.
● Pruning:
After growing the full tree, remove nodes that do not provide much additional predictive power.
● Handling Missing Values:
Decide how to handle missing values in the dataset. Common methods include ignoring them,
imputing them with the mean or median, or treating missing values as a separate category.
● Handling Categorical and Numerical Attributes:
Different algorithms handle categorical and numerical attributes differently. For example, for
categorical attributes, the decision tree algorithm may use techniques like one-hot encoding or label
encoding, while numerical attributes are split based on threshold values.
● Measuring Node Impurity:
The impurity of a node measures the homogeneity of the target variable within that node. Common
impurity measures include Gini impurity and entropy. The attribute that reduces impurity the most
when used for splitting is chosen.
● Pruning:
Pruning is a technique used to reduce the size of decision trees by removing parts of the tree that do
not provide additional predictive power. This helps prevent overfitting.
● Tree Evaluation:
Once the tree is constructed, it can be evaluated using various metrics depending on the task (e.g.,
accuracy, precision, recall, F1-score for classification; mean squared error, R-squared for
regression).
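As a small illustration of the missing-value and categorical-attribute points above, the sketch below uses pandas with an invented toy table; the column names and imputation choices are assumptions, not part of the original notes.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are invented.
df = pd.DataFrame({
    "outlook": ["Sunny", "Rain", None, "Overcast"],
    "temperature": [30.0, 21.0, 25.0, None],
    "play": ["No", "Yes", "Yes", "Yes"],
})

# Handling missing values: impute the numeric column with its median,
# and treat a missing category as a separate category.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())
df["outlook"] = df["outlook"].fillna("Missing")

# Handling categorical attributes: one-hot encode before fitting a tree
# implementation that expects numeric inputs.
X = pd.get_dummies(df[["outlook", "temperature"]], columns=["outlook"])
y = df["play"]
print(X)
```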

NAÏVE BAYES METHOD


● Naive Bayes is a simple yet powerful probabilistic classification algorithm widely used in data
mining tasks.
● It is based on Bayes' theorem, with the "naive" assumption of independence between features.
Despite its simplicity, Naive Bayes often performs well in practice, especially for text classification
and spam filtering tasks.

Consider a dataset of emails labeled as "spam" or "not spam," with two features: "contains the word
'free'" and "contains the word 'discount'."
Estimating Predictive Accuracy of Classification Methods
Performance Metrics
Confusion Matrix
● A confusion matrix is a table that is often used to evaluate the performance of a classification
model. It provides a summary of the predictions made by the model compared to the actual ground
truth across different classes.
● In a binary classification scenario, the confusion matrix is a 2x2 matrix, but it can be extended to
accommodate multiple classes.
● Here's how a confusion matrix is structured for binary classification:

                     Predicted Positive     Predicted Negative
  Actual Positive    True Positive (TP)     False Negative (FN)
  Actual Negative    False Positive (FP)    True Negative (TN)

True Positive (TP): Instances that are actually positive and are correctly classified as positive by the model.
True Negative (TN): Instances that are actually negative and are correctly classified as negative by the
model.
False Positive (FP): Instances that are actually negative but are incorrectly classified as positive by the
model (Type I error).
False Negative (FN): Instances that are actually positive but are incorrectly classified as negative by the
model (Type II error).
● Confusion matrices are valuable for understanding the types of errors made by the model and for
selecting appropriate evaluation metrics based on the specific requirements of the problem.
● They provide a clear visualization of the model's performance across different classes and can help
identify areas for improvement.
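To connect the four cells to concrete metrics, here is a small sketch that derives accuracy, precision, recall, and F1-score from hypothetical counts.

```python
# Hypothetical confusion-matrix cells for a binary classifier.
TP, FN = 40, 10   # actual positives: correctly / incorrectly classified
FP, TN = 5, 45    # actual negatives: incorrectly / correctly classified

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)          # of predicted positives, how many are right
recall    = TP / (TP + FN)          # of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Precision: {precision:.2f}")  # 0.89
print(f"Recall:    {recall:.2f}")     # 0.80
print(f"F1-score:  {f1:.2f}")         # 0.84
```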
Methods for estimating Accuracy
Holdout Method:
● The holdout method involves splitting the dataset into two subsets: a training set and a testing set.
The model is trained on the training set and evaluated on the independent testing set.
Random Subsampling Method:
● In the random subsampling method, multiple random splits of the dataset are created into training
and testing sets. The model is trained and evaluated on each split, and the performance metrics are
averaged across all iterations.
K-Fold Cross-Validation Method:
● K-fold cross-validation divides the dataset into k subsets (folds). The model is trained k times, each
time using k-1 folds for training and the remaining fold for validation. The performance metrics are
averaged across all folds.
Leave One Out Method:
● The leave-one-out method (LOOCV) is a special case of k-fold cross-validation where k is equal to
the number of instances in the dataset. In each iteration, one instance is used for validation, and the
model is trained on the remaining instances.
Bootstrap Method:
● The bootstrap method involves creating multiple bootstrap samples by randomly sampling instances
from the dataset with replacement. The model is trained and evaluated on each bootstrap sample,
and the performance metrics are averaged across all iterations.
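As an illustration of the holdout and k-fold methods, the sketch below assumes scikit-learn is available and uses one of its built-in toy datasets; it is a generic usage pattern, not something taken from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Holdout method: a single train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))

# K-fold cross-validation: average accuracy over k = 5 folds.
scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```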
