UNIT-II
Picking the best splitting attribute: entropy and information gain, Searching for simple trees and computational
complexity, Occam's razor, Overfitting, noisy data, and pruning.
Comparing learning algorithms: cross-validation, learning curves, and statistical hypothesis testing.
……………………………………………………………………………………………………………………………
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned
function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to
improve human readability.
These learning methods are among the most popular of inductive inference algorithms and have been
successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess
credit risk of loan applicants.
• Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes
are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into
subtrees.
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Machine learning offers many algorithms, so choosing the one best suited to the given dataset and
problem is a key consideration when creating a machine learning model. Below are two reasons
for using a decision tree:
• Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which is then
divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is
reached.
Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given
conditions.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the
sub-nodes are its child nodes; the root node is therefore a parent node with no parent of its own.
• Requires little data preparation. Other techniques often require data normalization, the creation of
dummy variables, and the removal of blank values. Note, however, that the scikit-learn tree module does
not support missing values.
• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train
the tree.
• Able to handle both numerical and categorical data. However, the scikit-learn implementation does not
support categorical variables for now. Other techniques are usually specialized in analyzing datasets
that have only one type of variable.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition
is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural
network), results may be more difficult to interpret.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability
of the model.
• Performs well even if its assumptions are somewhat violated by the true model from which the data
were generated.
• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called
overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf
node or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different
tree being generated. This problem is mitigated by using decision trees within an ensemble.
• Predictions of decision trees are neither smooth nor continuous, but piecewise constant
approximations. Therefore, they are not good at extrapolation.
• The problem of learning an optimal decision tree is known to be NP-complete under several aspects
of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms
are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made
at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can
be mitigated by training multiple trees in an ensemble learner, where the features and samples are
randomly sampled with replacement.
• There are concepts that are hard to learn because decision trees do not express them easily, such as
XOR, parity or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to
balance the dataset prior to fitting with the decision tree.
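The XOR difficulty listed above can be checked numerically. The following is a minimal sketch, assuming the entropy-based information gain that ID3 uses (the helper names `entropy` and `info_gain` are illustrative, not from any library). On the four XOR examples, splitting on either input alone gives zero gain, so a greedy learner sees no reason to prefer any first split.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on column index `attr`."""
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# The four XOR examples: label = a XOR b
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_a = info_gain(rows, labels, 0)
gain_b = info_gain(rows, labels, 1)
print(gain_a, gain_b)  # both gains are zero
```

Only after a first, apparently useless split do the second-level splits become informative, which is why XOR-like concepts need deeper trees than their simplicity suggests.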
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g.
whether a coin flip comes up heads or tails), each leaf node represents a class label (the decision taken after
computing all features), and branches represent conjunctions of features that lead to those class labels. The
paths from root to leaf represent classification rules. The diagram below illustrates the basic flow of a decision
tree for decision making, with the labels Rain (Yes) and No Rain (No).
Decision tree is one of the predictive modelling approaches used in statistics, data mining and machine learning.
Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on
different conditions. It is one of the most widely used and practical methods for supervised learning. Decision
Trees are a non-parametric supervised learning method used for both classification and regression tasks.
Tree models where the target variable can take a discrete set of values are called classification trees. Decision
trees where the target variable can take continuous values (typically real numbers) are called regression trees.
Classification and Regression Tree (CART) is the general term for both.
Root Node – the node at the beginning of a decision tree; from this node the population starts
dividing according to various features.
Decision Nodes – the nodes obtained after splitting the root node.
Leaf Nodes – the nodes where further splitting is not possible; also called terminal nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph, a sub-section of a decision tree is
called a sub-tree.
• Prone to overfitting.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes
a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents
a class.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
J. Ross Quinlan, a machine learning researcher, developed a decision tree algorithm known as ID3
(Iterative Dichotomiser 3) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a
greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive,
divide-and-conquer manner.
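Algorithm: Generate_decision_tree
Input:
• Data partition, D, which is a set of training tuples and their associated class labels.
• attribute_list, the set of candidate attributes.
• Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a
splitting subset.
Output:
A decision tree
Method:
create a node N;
if the tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting criterion;
label node N with the splitting criterion;
for each outcome j of the splitting criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;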
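The Generate_decision_tree procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: attributes are categorical, information gain is used as the attribute selection method, the tree is a nested dict, and the toy rows and labels are hypothetical. Because the loop only visits attribute values that actually occur in the partition, the "Dj is empty" case never arises here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Attribute selection method: highest information gain wins."""
    def gain(attr):
        remainder = 0.0
        for value in set(r[attr] for r in rows):
            subset = [y for r, y in zip(rows, labels) if r[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def generate_decision_tree(rows, labels, attributes):
    """Return a class label (leaf) or a dict {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:              # all tuples belong to one class
        return labels[0]
    if not attributes:                     # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {attr: {}}
    for value in sorted(set(r[attr] for r in rows)):
        part = [(r, y) for r, y in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in part]
        sub_labels = [y for _, y in part]
        node[attr][value] = generate_decision_tree(sub_rows, sub_labels, remaining)
    return node

# Hypothetical toy data: each row maps attribute name -> value
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]
tree = generate_decision_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data the root tests `outlook`, and only the `rainy` branch needs a second test on `windy`; the other branches are already pure and become leaves.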
Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset
into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed.
The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more
branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The
topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision
trees can handle both categorical and numerical data.
Algorithm
The core algorithm for building decision trees is ID3, by J. R. Quinlan, which employs a top-down, greedy
search through the space of possible branches with no backtracking. ID3 uses entropy and information gain
to construct a decision tree. For comparison: the ZeroR model uses no predictor; the OneR model tries to find
the single best predictor; naive Bayes includes all predictors using Bayes' rule and an independence
assumption between predictors; a decision tree includes all predictors and allows for dependence between
them.
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain
instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of
a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided
between two classes, the entropy is one.
To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:
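In standard notation the two entropies can be written as below. The symbols are chosen here for illustration: $S$ is the sample, $p_i$ the proportion of class $i$ among $c$ classes, $T$ the target, $X$ the attribute being split on, $P(v)$ the fraction of examples with $X = v$, and $T_v$ the target values in that branch. The first is the entropy of the target from its own frequency table; the second is the entropy of the target after splitting on one attribute.

```latex
E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad\qquad
E(T, X) = \sum_{v \in \mathrm{values}(X)} P(v)\, E(T_v)
```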
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing
a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated. Then it
is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy
before the split. The result is the Information Gain, or decrease in entropy.
Step 3: Choose attribute with the largest information gain as the decision node, divide the dataset by its
branches and repeat the same process on every branch.
Step 4b: A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
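The entropy and information-gain steps above can be checked with a short calculation. This is a sketch on assumed, illustrative counts (9 positive and 5 negative examples, split three ways by one attribute); the function names are not from any library.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """Entropy before the split minus the weighted entropy of the branches."""
    total = sum(parent_counts)
    after = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - after

# Illustrative counts (assumed): [yes, no] overall and per branch
parent = [9, 5]
branches = [[2, 3], [4, 0], [3, 2]]

before = entropy(parent)
gain = information_gain(parent, branches)
print(round(before, 3), round(gain, 3))  # prints 0.94 0.247
```

The middle branch `[4, 0]` has entropy 0, so it becomes a leaf, while the other two branches still have entropy above 0 and need further splitting.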
A decision tree can easily be transformed to a set of rules by mapping from the root node to the leaf nodes
one by one.
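As a sketch of this mapping, the function below walks every root-to-leaf path of a tree stored as a nested dict and emits one if-then rule per leaf. The dict layout and the example tree are hypothetical, chosen to match the Outlook/Play example used earlier.

```python
def tree_to_rules(tree, conditions=()):
    """Walk every root-to-leaf path and emit one if-then rule per leaf."""
    if not isinstance(tree, dict):           # leaf: a class label
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN {tree}"]
    rules = []
    (attribute, branches), = tree.items()    # one attribute is tested per node
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

# Hypothetical tree: {attribute: {value: subtree-or-label}}
tree = {"Outlook": {"Sunny": {"Windy": {"False": "Play", "True": "Don't Play"}},
                    "Overcast": "Play",
                    "Rainy": "Don't Play"}}
rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Each rule is the conjunction of the tests along one path, so the number of rules equals the number of leaves.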
Let N = number of training examples, k = number of features, and d = depth of the decision tree.
A decision tree calculates a quality function for each candidate split of the data, and it does this for every
feature at every node that is not a leaf node. This continues for as long as there are further levels (depth) to
build, so if evaluating all features costs on the order of N·k operations per level, the total cost is on the order
of N·k·d.
In the best case of a balanced tree the depth is in O(log N), but the decision tree makes locally optimal
splits without caring much about balance. This means the worst case of depth in O(N) is possible,
basically when each split divides the data into 1 and n-1 examples, where n is the number of examples at the
current node.
The training cost is therefore somewhere between O(N k log N) and O(N² k).
The uncertainty is due to the data-dependent way in which decision trees are built, always splitting the
data at locally optimal thresholds with close to no consideration for overall balance. Keep in mind that
building a globally optimal decision tree is an NP-hard problem.
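The gap between the two bounds can be made concrete with a rough operation count. This sketch assumes that scoring all k features at a node costs (node size × k) operations and ignores any sorting cost; it compares splitting every node in half with always splitting off a single example. The function name and the values of N and k are illustrative.

```python
def split_work(n, k, balanced):
    """Count example-feature evaluations when every node with more than one
    example is split, either in half (balanced) or into 1 and n-1 (skewed)."""
    work, depth = 0, 0
    level = [n]                        # sizes of the nodes at the current depth
    while any(size > 1 for size in level):
        next_level = []
        for size in level:
            if size > 1:
                work += size * k       # score every feature on every example
                left = size // 2 if balanced else 1
                next_level += [left, size - left]
        level = next_level
        depth += 1
    return work, depth

N, k = 1024, 10
balanced_work, balanced_depth = split_work(N, k, balanced=True)
skewed_work, skewed_depth = split_work(N, k, balanced=False)
print(balanced_depth, balanced_work)   # depth grows like log2(N)
print(skewed_depth, skewed_work)       # depth grows like N
```

With N = 1024 the balanced tree has depth 10 while the skewed tree has depth 1023, and the work differs by roughly the same factor as N k log N versus N² k.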
Occam’s razor argues that the simplest explanation is the one most likely to be correct.
Occam’s Razor is one of the principles that guides us when we are trying to select the appropriate model for
a particular machine learning problem. If the model is too simple, it will make useless predictions. If the
model is too complex (loaded with attributes), it will not generalize well.
Imagine, for example, you are trying to predict a student’s college GPA. A simple model would be one that
is based entirely on a student’s SAT score.
While this model is very simple, it might not be very accurate because often a college student’s GPA is
dependent on factors other than just his or her SAT score. It is severely underfit and inflexible. In machine
learning jargon, we would say this type of model has high bias, but low variance.
In general, the more inflexible a model, the higher the bias. Conversely, the more flexible a model, the more it
fits the noise in the training data and the higher the variance. This is known as the bias–variance tradeoff.
If the model is too complex and loaded with attributes, it is at risk of capturing noise in the data that could
be due entirely to random chance. It would make amazing predictions on the training data set, but it would
perform poorly when faced with a new data set. It won’t generalize well because it is severely overfit. It has
high variance and low bias.
A real-world example would be trying to predict a person’s college GPA based on his or her SAT score,
high school GPA, middle school GPA, socio-economic status, city of birth, hair color, favorite NBA team,
favorite food, and average daily sleep duration.
College GPA = Θ1 * (SAT Score) + Θ2 * High School GPA + Θ3 * Middle School GPA + Θ4 * Socio-Economic
status + Θ5 * City of Birth + Θ6 * Hair Color + Θ7 * Favorite NBA Team + Θ8 * Favorite Food + Θ9 *
Average Daily Sleep Duration
[Figures: model fit on the training data; model fit on the test data]
Overfitting:
Overfitting means the tree has too many unnecessary branches. It produces anomalies in the tree that
result from outliers and noise in the training data. Two pruning approaches are used to avoid it:
1. Pre-pruning
2. Post-pruning
1. Pre-Pruning:
Pre-pruning means stopping the growth of the tree before it is fully grown.
2. Post-Pruning:
Post-pruning means allowing the tree to grow with no size limit and pruning it once it is
complete.
Practically, the second approach of post-pruning overfit trees is more successful because it is not easy to
precisely estimate when to stop growing the tree.
An important step in tree pruning is to define the criterion used to determine the correct final tree size,
using one of the following methods:
1. Use a distinct dataset from the training set (called validation set), to evaluate the effect of post-pruning
nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to estimate whether pruning or
expanding a particular node is likely to produce an improvement beyond the training set.
• Error estimation
• Significance testing (e.g., Chi-square test)
The first method is the most common approach. In this approach, the available data are separated into two
sets of examples: a training set, which is used to build the decision tree, and a validation set, which is used to
evaluate the impact of pruning the tree. The second method is also a common approach. Here, we explain
error estimation and the Chi-square (χ²) test.
Error estimate for a sub-tree is weighted sum of error estimates for all its leaves. The error estimate (e) for a
node is:
In the following example we set Z to 0.69 which is equal to a confidence level of 75%.
The error rate at the parent node is 0.46 and since the error rate for its children (0.51) increases with the
split, we do not want to keep the children.
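The comparison of parent and children can be reproduced with a short sketch. The formula used here is an assumption: it is the upper confidence bound on the error rate commonly used for this kind of pessimistic pruning, with f the observed error rate at a node, N the number of instances and z the z-score. The leaf counts are read from the Good/Bad frequency table given with the Chi-square test, treating the minority class in each leaf as errors. Computed values may differ slightly from the rounded figures quoted above.

```python
from math import sqrt

def error_estimate(f, n, z=0.69):
    """Upper confidence bound on a node's error rate (assumed formula).
    f: observed error rate, n: number of instances, z: z-score."""
    numerator = f + z**2 / (2 * n) + z * sqrt(f / n - f**2 / n + z**2 / (4 * n**2))
    return numerator / (1 + z**2 / n)

# Leaves as (misclassified, total): branches with 4/2, 1/1 and 4/2 Bad/Good
leaves = [(2, 6), (1, 2), (2, 6)]
total = sum(n for _, n in leaves)

# Sub-tree estimate: weighted sum of the leaf estimates
children = sum(n / total * error_estimate(e / n, n) for e, n in leaves)
# Parent estimate: 5 of the 14 instances belong to the minority class
parent = error_estimate(5 / 14, 14)

prune = children > parent   # the split does not lower the estimate
print(round(parent, 2), round(children, 2), prune)
```

Because the children's combined estimate is higher than the parent's, the sub-tree is replaced by a single leaf.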
In the Chi-square test we construct the corresponding frequency table and calculate the χ² value and its
probability.
Bad 4 1 4
Good 2 1 2
If we require the probability to be less than a limit (e.g., 0.05) before accepting a split, and the probability
computed for this table exceeds that limit, we decide not to split the
node.
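The χ² value and its probability for the table above can be computed directly. This is a minimal sketch using only the standard library; it relies on the fact that with exactly 2 degrees of freedom the chi-square tail probability is exp(-x/2), so it does not generalize to other table shapes without a proper chi-square distribution function.

```python
from math import exp

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[4, 1, 4],   # Bad
         [2, 1, 2]]   # Good
chi2 = chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1) * (columns - 1)
p_value = exp(-chi2 / 2)                        # valid because dof == 2
split = p_value < 0.05
print(round(chi2, 2), round(p_value, 2), split)  # prints 0.21 0.9 False
```

The probability is far above 0.05, so the class distribution in the branches is not significantly different from the parent's and the node is not split.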
In many cases it is important to evaluate the performance of learned hypotheses as precisely as possible.
One reason is simply to understand whether to use the hypothesis. For instance, when learning from a
limited-size database indicating the effectiveness of different medical treatments, it is important to
understand as precisely as possible the accuracy of the learned hypotheses.
A second reason is that evaluating hypotheses is an integral component of many learning methods. For
example, in post-pruning decision trees to avoid overfitting, we must evaluate the impact of possible pruning
steps on the accuracy of the resulting decision tree. Therefore, it is important to understand the likely errors
inherent in estimating the accuracy of the pruned and unpruned tree.
Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when
we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key
difficulties arise:
Suppose you have formed a hypothesis from a given training data set, for example a hypothesis
for the EnjoySport example, where the attributes of the instances decide whether a person will be able to
enjoy their favorite sport.
To test or evaluate how accurate this hypothesis is, we use different statistical measures.
Evaluating hypotheses is an important step in training the model.
Some Definitions
1. What is the best estimate of the accuracy of h over future instances taken from the same distribution,
given a hypothesis h and a data sample containing n examples picked at random according to the
distribution D?
True Error
The true error is the probability that the hypothesis will misclassify a single randomly drawn
sample from the population. Here the population represents all the data in the world.
Consider a hypothesis h(x) and the true/target function f(x) over a population P. The probability that h
will misclassify an instance drawn at random, i.e. the true error, is:
$\mathrm{TE} = \Pr_{x \in P}\,[\,f(x) \neq h(x)\,]$
Sample Error
The sample error of h with respect to the target function f and data sample S is the proportion of examples
in S that h misclassifies.
Suppose hypothesis h misclassifies 7 out of 33 examples in the sample. Then the sample
error is:
SE = 7/33 = 0.21
Confidence Interval
Generally, the true error is complex and difficult to calculate. It can be estimated with the help of a
confidence interval, which is computed as a function of the sample error.
• Randomly draw n samples S (independently of each other) from the population P, where n should be
greater than 30.
Here we assume that the sample error is an unbiased estimator of the true error. The formula for
estimating the true error is:
$\mathrm{TE} = \mathrm{SE} \pm z_s \sqrt{\dfrac{\mathrm{SE}\,(1 - \mathrm{SE})}{n}}$
where z_s is the z-score corresponding to a confidence level of s percent:
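As a worked sketch, the interval for the 7-out-of-33 example can be computed with the standard normal approximation, true error ≈ SE ± z·sqrt(SE(1 − SE)/n). The z-score is obtained from Python's `statistics.NormalDist` rather than a lookup table; the function name and the 95% default are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def error_confidence_interval(errors, n, confidence=0.95):
    """Approximate two-sided interval for the true error from the sample error."""
    sample_error = errors / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - margin, sample_error + margin

# The example from the text: 7 misclassified out of 33 samples
low, high = error_confidence_interval(7, 33)
print(round(low, 3), round(high, 3))
```

With only 33 samples the interval is wide (roughly 0.07 to 0.35), which illustrates why small test sets give only a coarse estimate of the true error.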
Comparing learning algorithms: cross-validation, learning curves, and statistical hypothesis testing
Humans have invented a very large number of machine learning (ML) algorithms. Of course,
most of the time only a small subset is used in research and in industry. Yet it is still a bit overwhelming
to understand and remember all the nitty-gritty details of all these ML models. Some people might
also have the wrong impression that all these algorithms are totally unrelated. More importantly, how might
one choose to use algorithm A over B when both seem to be effective algorithms?
• Time complexity
• Space complexity
• Sample complexity
• Online and Offline learning
• Parallelizability
• Parametricity
• Methodology, Assumptions and Objectives
1. Cross Validation
In machine learning, we cannot fit a model on the training data and simply assume that it will work
accurately on real data. We must make sure that the model has captured the correct patterns from the data
and is not picking up too much noise. For this purpose, we use the cross-validation technique.
Cross-validation is a technique in which we train our model using a subset of the data set and then
evaluate it using the complementary subset of the data set.
a) Validation
In this method, we train on 50% of the given data set and use the remaining 50% for testing.
The major drawback of this method is that, because we train on only 50% of the data, the
remaining 50% may contain important information that the model never sees
during training, i.e. higher bias.
Note:
It is commonly suggested that the value of k should be 10, as a lower value of k moves towards the simple
validation approach and a higher value of k leads towards the LOOCV method.
Example
The diagram below shows an example of the training subsets and evaluation subsets generated in k-fold
cross-validation. Here, we have a total of 25 instances. In the first iteration we use the first 20 percent of the
data for evaluation and the remaining 80 percent for training ([1-5] testing and [6-25] training), while in the
second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets of the
data for training ([6-10] testing and [1-5 and 11-25] training), and so on. The listing below numbers the same
instances from 0.
Total instances: 25
Value of k: 5
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
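The index lists shown above can be generated with a few lines of Python. This is a minimal sketch that assumes contiguous folds without shuffling and a number of instances divisible by k, as in the 25-instance example; the function name is illustrative.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, equal-sized folds.
    Assumes n is divisible by k, as in the 25-instance example."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j < i * fold or j >= (i + 1) * fold]
        yield train, test

splits = list(k_fold_indices(25, 5))
for number, (train, test) in enumerate(splits, start=1):
    print(number, train, test)
```

Every instance lands in the test set exactly once and in the training set k − 1 times, which is what makes the resulting accuracy estimate use all of the data.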
1. A single train/test split runs K times faster than K-fold cross-validation, because K-fold cross-validation
repeats the train/test split K times.
Advantages of cross-validation:
2. More “efficient” use of data as every observation is used for both training and testing.
2. Learning Curves
A learning curve is a plot showing how a specific learning-related metric progresses with experience
during the training of a machine learning model. It is just a mathematical representation of the
learning process.
According to this, we’ll have a measure of time or progress in the x-axis and a measure of error or
performance in the y-axis.
We use these charts to monitor the evolution of our model during learning so we can diagnose problems
and optimize the prediction performance.
The most popular example of a learning curve is loss over time. Loss (or cost) measures our model error, or
“how bad our model is doing”. So, for now, the lower our loss becomes, the better our model performance
will be.
In the picture below, we can see the expected behavior of the learning process:
Despite the fact it has slight ups and downs, in the long term, the loss decreases over time, so the model is
learning.
One of the most widely used metrics combinations is training loss + validation loss over time.
The training loss indicates how well the model is fitting the training data, while the validation loss indicates
how well the model fits new data.
We will see this combination later on, but for now, see below a typical plot showing both metrics:
• Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of
the model are being optimized, such as loss or Mean Squared Error
• Performance Learning Curves: Learning curves calculated on the metric by which the model will be
evaluated and selected, such as accuracy, precision, recall, or F1 score
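The training-loss and validation-loss curves described above can be recorded with a simple loop. The sketch below is illustrative only: the data points are hypothetical, the model is a one-parameter line y = w·x, and plain gradient descent on the mean squared error stands in for whatever optimizer a real model would use.

```python
# Hypothetical 1-D data, roughly y = 2x, split into training and validation sets
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
valid = [(1.5, 3.0), (3.5, 7.1)]

def mse(w, data):
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, learning_rate = 0.0, 0.01
train_curve, valid_curve = [], []
for epoch in range(50):
    # Gradient of the training MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
    w -= learning_rate * grad
    train_curve.append(mse(w, train))   # loss on the data being fitted
    valid_curve.append(mse(w, valid))   # loss on held-out data

for epoch in (0, 9, 49):
    print(epoch + 1, round(train_curve[epoch], 4), round(valid_curve[epoch], 4))
```

Plotting `train_curve` and `valid_curve` against the epoch number gives the two learning curves; here both fall as training proceeds, whereas a validation curve that starts rising while the training curve keeps falling would indicate overfitting.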
Below you can see an example in Machine Translation showing BLEU (a performance score) together with
the loss (optimization score) for two different models (orange and green):
Any data science project starts with exploring the data. When we perform an analysis on a sample through
exploratory data analysis and inferential statistics we get information about the sample. Now, we want to
use this information to predict values for the entire population.
Hypothesis testing is done to confirm our observation about the population using sample data, within the
desired error level. Through hypothesis testing, we can determine whether we have enough statistical
evidence to conclude if the hypothesis about the population is true or not.
We use hypothesis testing in order to trust a model and its predictions. When we use sample data to
train our model, we make assumptions about our population. By performing hypothesis testing, we validate
these assumptions at a desired significance level.
1. Formulate a Hypothesis
5. Make Decision
One of the key steps to do this is to formulate the below two hypotheses:
• The null hypothesis represented as H₀ is the initial claim that is based on the prevailing belief about
the population.
• The alternate hypothesis represented as H₁ is the challenge to the null hypothesis. It is the claim
which we would like to prove as True
One of the main points to consider while formulating the null and alternative hypotheses is
that the null hypothesis always looks at confirming the existing notion. Hence, it carries a sign such as ≥, ≤
or =, while the alternate hypothesis carries >, < or ≠.
2. Determine the significance level also known as alpha or α for Hypothesis Testing
The significance level is the proportion of the sampling distribution of the sample mean that lies in the
critical regions. It is usually set to 5% or 0.05, which means that there is a 5% chance that we would accept
the alternate hypothesis even when our null hypothesis is true.
Based on the criticality of the requirement, we can choose a lower significance level of 1% as well.
3. Determine the Test Statistic and calculate its value for Hypothesis Testing
Hypothesis testing uses Test Statistic which is a numerical summary of a data-set that reduces the data to one
value that can be used to perform the hypothesis test.
We choose the type of test statistic based on the predictor variable – quantitative or categorical. Below are a
few of the commonly used test statistics for quantitative data
The test statistic is then used to calculate the P-value. A P-value measures the strength of evidence in
support of a null hypothesis. If the P-value is less than the significance level, we reject the null hypothesis.
If the p-value < α, then we have statistically significant evidence against the null hypothesis, so we reject the
null hypothesis and accept the alternate hypothesis.
If the p-value > α, then we do not have statistically significant evidence against the null hypothesis, so we fail
to reject the null hypothesis.
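The decision rule can be illustrated with a small worked test. This sketch assumes a two-sided one-sample z-test for a proportion, which is only one of many possible test statistics; the numbers are hypothetical (H₀ says a model's accuracy is 0.80, and the model classifies 172 of 200 test cases correctly).

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-sided one-sample z-test for a proportion against the null value p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided tail probability
    return z, p_value

# Hypothetical numbers: H0 accuracy = 0.80, observed 172 correct out of 200
alpha = 0.05
z, p_value = proportion_z_test(172, 200, 0.80)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(z, 2), round(p_value, 3), decision)
```

Here the p-value falls below α = 0.05, so the null hypothesis is rejected; with a stricter α of 0.01 the same data would fail to reject it.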
As we make decisions, it is important to understand the errors that can happen while testing.
P-value: The P-value, or calculated probability, is the probability of finding the observed, or more extreme,
results when the null hypothesis (H₀) of a study question is true; the definition of ‘extreme’ depends on
how the hypothesis is being tested.