
JNTUK B.TECH R19 4-1 MACHINE LEARNING

UNIT-II

Decision Tree Learning

Representing concepts as decision trees, Recursive induction of decision trees,

Picking the best splitting attribute: entropy and information gain, Searching for simple trees and computational
complexity, Occam's razor, Overfitting, noisy data, and pruning.

Experimental Evaluation of Learning Algorithms: Measuring the accuracy of learned hypotheses.

Comparing learning algorithms: cross-validation, learning curves, and statistical hypothesis testing.

……………………………………………………………………………………………………………………………

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned
function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to
improve human readability.

These learning methods are among the most popular of inductive inference algorithms and have been
successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess
credit risk of loan applicants.

• Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.

• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes
are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.

• The decisions or the test are performed on the basis of features of the given dataset.

• It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.

• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.

• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into
subtrees.

• Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
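
As a brief illustration (added here; not part of the original notes), the sketch below shows how a CART-style decision tree can be trained with scikit-learn. The toy dataset, the feature encoding and the parameter values are invented for the example.

# A minimal sketch of training a CART-style decision tree with scikit-learn.
# The toy "play"-style data below is made up for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0=sunny, 1=overcast, 2=rainy), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = ["No", "Yes", "Yes", "No", "Yes", "Yes"]   # class label for each example

# criterion="entropy" uses information gain; the default "gini" is also common
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["outlook", "humidity"]))
print(clf.predict([[0, 1]]))   # predict for a new (sunny, high-humidity) day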


Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:

• Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.

• The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further
gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is
reached.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.

Branch/Sub Tree: A subtree formed by splitting a node of the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the
sub-nodes are called its child nodes.

Some advantages of decision trees are:

• Simple to understand and to interpret. Trees can be visualized.

• Requires little data preparation. Other techniques often require data normalization, dummy variables
to be created and blank values to be removed. Note, however, that scikit-learn's tree module does not
support missing values.

• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train
the tree.

• Able to handle both numerical and categorical data. However, the scikit-learn implementation does not
currently support categorical variables. Other techniques are usually specialized in analyzing datasets
that have only one type of variable. See the algorithms documentation for more information.

• Able to handle multi-output problems.

• Uses a white box model. If a given situation is observable in a model, the explanation for the condition
is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural
network), results may be more difficult to interpret.

• Possible to validate a model using statistical tests. That makes it possible to account for the reliability
of the model.

• Performs well even if its assumptions are somewhat violated by the true model from which the data
were generated.

The disadvantages of decision trees include:

• Decision-tree learners can create over-complex trees that do not generalize well from the training data.
This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required
at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.

• Decision trees can be unstable because small variations in the data might result in a completely different
tree being generated. This problem is mitigated by using decision trees within an ensemble.


• Predictions of decision trees are neither smooth nor continuous, but piecewise constant
approximations. Therefore, they are not good at extrapolation.

• The problem of learning an optimal decision tree is known to be NP-complete under several aspects
of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms
are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made
at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can
be mitigated by training multiple trees in an ensemble learner, where the features and samples are
randomly sampled with replacement.

• There are concepts that are hard to learn because decision trees do not express them easily, such as
XOR, parity or multiplexer problems.

• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to
balance the dataset prior to fitting with the decision tree.

1. Q) Representing concepts as decision trees

A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g.
whether a coin flip comes up heads or tails), each leaf node represents a class label (the decision taken after
computing all features), and branches represent conjunctions of features that lead to those class labels. The
paths from root to leaf represent classification rules. The diagram below illustrates the basic flow of a decision
tree for decision making with the labels Rain (Yes) and No Rain (No).

Fig: Decision Tree for Rain Forecasting

Decision tree is one of the predictive modelling approaches used in statistics, data mining and machine learning.

Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on
different conditions. It is one of the most widely used and practical methods for supervised learning. Decision
Trees are a non-parametric supervised learning method used for both classification and regression tasks.

Tree models where the target variable can take a discrete set of values are called classification trees. Decision
trees where the target variable can take continuous values (typically real numbers) are called regression trees.
Classification And Regression Tree (CART) is a general term covering both.

Root Node – the node present at the beginning of a decision tree; from this node the population starts
dividing according to various features.

Decision Nodes – the nodes we get after splitting the root node are called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or terminal nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph, a sub-section of a decision tree is
called a sub-tree.

Advantage of Decision Tree

• Easy to use and understand.


• Can handle both categorical and numerical data.

• Resistant to outliers, hence requires little data preprocessing.

Disadvantage of Decision Tree

• Prone to overfitting.

• Require some kind of measure of how well they are performing.

• Need to be careful with parameter tuning.

• Can create biased learned trees if some classes dominate.

2. Q) Recursive induction of decision trees

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes
a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents
a class.

The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.

• It is easy to comprehend.

• The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm

The machine learning researcher J. Ross Quinlan developed a decision tree algorithm known as ID3
(Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a
greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive,
divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D


Algorithm : Generate_decision_tree

Input:

Data partition, D, which is a set of training tuples and their associated class labels; attribute_list, the set of
candidate attributes.

Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting
subset.

Output:

A Decision Tree

Method:

    create a node N;
    if the tuples in D are all of the same class C then
        return N as a leaf node labeled with the class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the majority class in D;   // majority voting
    apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and multiway splits are allowed then   // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;   // remove the splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;   // a partition
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
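
As an illustration of this recursive induction procedure (added here; not part of the original notes), below is a compact Python sketch of an ID3-style tree builder. The function and variable names are chosen for the example, and information gain is used as the attribute-selection method (discussed in the next section).

# A minimal ID3-style recursive tree builder (illustrative sketch only).
# Each example is a dict of attribute -> value, paired with a class label.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, labels, attribute):
    total_entropy = entropy(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return total_entropy - remainder

def build_tree(examples, labels, attributes):
    # Stop if all examples share one class, or no attributes remain (majority vote).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(ex[best] for ex in examples):
        subset = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        sub_examples = [ex for ex, _ in subset]
        sub_labels = [lab for _, lab in subset]
        tree[best][value] = build_tree(sub_examples, sub_labels, remaining)
    return tree

# Tiny made-up example:
data = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "yes"}, {"outlook": "overcast", "windy": "no"}]
target = ["yes", "no", "no", "yes"]
print(build_tree(data, target, ["outlook", "windy"]))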

3. Q) Picking the best splitting attribute:

3.1 entropy and information gain

Decision tree learning builds classification or regression models in the form of a tree structure. It breaks down
a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally
developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has
two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or
decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can handle both categorical and numerical data.

Algorithm

The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan, employs a top-down,
greedy search through the space of possible branches with no backtracking. ID3 uses entropy and information
gain to construct a decision tree. In the ZeroR model there is no predictor; in the OneR model we try to find
the single best predictor; naive Bayes includes all predictors using Bayes' rule and independence assumptions
between predictors; a decision tree, by contrast, includes all predictors and can model dependences between
predictors.

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain
instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of
a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it
has an entropy of one.


To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:

a) Entropy using the frequency table of one attribute:

b) Entropy using the frequency table of two attributes:
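
The two formulas referenced above are not reproduced in this extract; in standard ID3 notation they can be written as follows, where p_i is the proportion of examples in S belonging to class i, X is the splitting attribute, P(c) is the fraction of examples with value c of X, and E(c) is the entropy of that subset:

E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

E(T, X) = \sum_{c \in X} P(c)\, E(c)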

Information Gain

Information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing
a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
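
The gain formula itself is not reproduced in this extract; in the notation above it can be written as

Gain(T, X) = Entropy(T) - E(T, X)

i.e., the entropy of the target T before the split minus the expected entropy after splitting on attribute X.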


Step 1: Calculate entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated. Then it
is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy
before the split. The result is the Information Gain, or decrease in entropy.

Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its
branches, and repeat the same process on every branch.


Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
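
To make Steps 1–3 concrete (this worked example is added here and is not part of the original notes), the sketch below computes entropy and information gain for a small made-up weather dataset; the attribute and value names are purely illustrative.

# Worked example of Steps 1-3: entropy of the target, then information gain per attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    # Entropy before the split minus the weighted entropy of each branch.
    n = len(labels)
    before = entropy(labels)
    after = 0.0
    for value in set(attribute_values):
        branch = [lab for v, lab in zip(attribute_values, labels) if v == value]
        after += (len(branch) / n) * entropy(branch)
    return before - after

# Made-up "Play" target and two candidate attributes.
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"]
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast", "Sunny"]
windy   = ["No", "Yes", "No", "No", "No", "Yes", "Yes", "No"]

print("Entropy(Play)       =", round(entropy(play), 3))                    # Step 1
print("Gain(Play, Outlook) =", round(information_gain(outlook, play), 3))  # Step 2
print("Gain(Play, Windy)   =", round(information_gain(windy, play), 3))
# Step 3: the attribute with the larger gain becomes the decision node.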

Decision Tree to Decision Rules

A decision tree can easily be transformed to a set of rules by mapping from the root node to the leaf nodes
one by one.


3.2 Searching for simple trees and computational complexity

Let N = number of training examples, k = number of features, and d = depth of the decision tree.

A decision tree calculates a quality function for each candidate split of the data, and it does this for each
feature at every node that is not a leaf node. This continues for as long as there are levels (depth) left to grow.
In the best case of a balanced tree the depth would be in O(log N), but the decision tree makes locally optimal
splits without caring much about balance. This means the worst case of depth in O(N) is possible; essentially,
when each split separates the data into 1 and n-1 examples, where n is the number of examples at the
current node.

So, to conclude, the time complexity for decision trees is in O(Nkd).

This means that it is actually somewhere between O(Nk log N) and O(N²k).

The uncertainty here comes from the data-dependent way in which decision trees are built, always splitting
data on locally optimal thresholds with close to no consideration for overall balance. Keep in mind that
building a globally optimal decision tree is an NP-hard problem.
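
As a short reconstruction of the reasoning above (added for clarity), substituting the two extreme depths into O(Nkd) gives the stated bounds:

d = O(\log N) \Rightarrow O(Nkd) = O(Nk \log N)   (balanced tree, best case)

d = O(N) \Rightarrow O(Nkd) = O(N^2 k)            (degenerate 1 vs. n-1 splits, worst case)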

3.3 Occam's razor

Occam’s razor argues that the simplest explanation is the one most likely to be correct.

How is Occam’s Razor Relevant in Machine Learning?

Occam’s Razor is one of the principles that guides us when we are trying to select the appropriate model for
a particular machine learning problem. If the model is too simple, it will make useless predictions. If the
model is too complex (loaded with attributes), it will not generalize well.

Imagine, for example, you are trying to predict a student’s college GPA. A simple model would be one that
is based entirely on a student’s SAT score.

College GPA = Θ * (SAT Score)

While this model is very simple, it might not be very accurate because often a college student’s GPA is
dependent on factors other than just his or her SAT score. It is severely underfit and inflexible. In machine
learning jargon, we would say this type of model has high bias, but low variance.


In general, the more inflexible a model, the higher the bias; and the more flexible a model (the more it
responds to noise), the higher the variance. This is known as the bias–variance tradeoff.

If the model is too complex and loaded with attributes, it is at risk of capturing noise in the data that could
be due entirely to random chance. It would make amazing predictions on the training data set, but it would
perform poorly when faced with a new data set. It won’t generalize well because it is severely overfit. It has
high variance and low bias.

A real-world example would be trying to predict a person's college GPA based on his or her SAT score,
high school GPA, middle school GPA, socio-economic status, city of birth, hair color, favorite NBA team,
favorite food, and average daily sleep duration.

College GPA = Θ1 * (SAT Score) + Θ2 * High School GPA + Θ3 * Middle School GPA + Θ4 * Socio-Economic
status + Θ5 * City of Birth + Θ6 * Hair Color + Θ7 * Favorite NBA Team + Θ8 * Favorite Food + Θ9 *
Average Daily Sleep Duration
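
To illustrate the contrast between the two models above (this sketch is not from the original notes; the synthetic data and the number of extra features are invented), the code below fits a one-feature model and a model padded with irrelevant features, then compares training and test scores. With few samples and many noise features, the complex model typically scores much better on the training data than on held-out data.

# Illustration of Occam's razor: a simple model vs. one loaded with irrelevant features.
# All data here is synthetic and for demonstration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 40
sat = rng.uniform(800, 1600, size=n)                  # "SAT score"
gpa = 1.0 + 0.002 * sat + rng.normal(0, 0.2, size=n)  # "college GPA" driven mostly by SAT
noise_features = rng.normal(size=(n, 30))             # hair color, favorite food, ... (pure noise)

X_simple = sat.reshape(-1, 1)
X_complex = np.hstack([X_simple, noise_features])

for name, X in [("simple (SAT only)", X_simple), ("complex (SAT + 30 noise features)", X_complex)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, gpa, test_size=0.5, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"{name}: train R^2 = {model.score(X_tr, y_tr):.2f}, test R^2 = {model.score(X_te, y_te):.2f}")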

3.4 Overfitting, noisy data, and pruning.

Training Data:

Training data is the data used to build (train) the predictive model.

Test Data:

Test data is used to assess how well the model trained on the training data predicts on new examples.

Overfitting:

Overfitting shows up in a decision tree as too many unnecessary branches. These extra branches model
anomalies in the training data caused by outliers and noise rather than the true underlying concept.

How to avoid overfitting?

There are two techniques to avoid overfitting;

1. Pre-pruning


2. Post-pruning

1.Pre-Pruning:

Pre-Pruning means to stop the growing tree before a tree is fully grown.

2. Post-Pruning:

Post-Pruning means allowing the tree to grow with no size limit; after the tree is complete, we start to prune
it.

Advantages of pre-pruning and post-pruning:

• Pruning prevents the tree from growing unnecessarily large.

• Pruning reduces the complexity of the tree.

Practically, the second approach, post-pruning overfit trees, is more successful because it is not easy to
precisely estimate when to stop growing the tree.
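
As a practical illustration (added here; not from the original notes), pre-pruning in scikit-learn is done with parameters such as max_depth and min_samples_leaf, while post-pruning is available as minimal cost-complexity pruning via ccp_alpha; the dataset and parameter values below are arbitrary choices.

# Pre-pruning vs. post-pruning with scikit-learn decision trees (illustrative values).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop the tree early while it is being grown.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned, for comparison

for name, model in [("unpruned", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12s} train={model.score(X_tr, y_tr):.3f} "
          f"test={model.score(X_te, y_te):.3f} leaves={model.get_n_leaves()}")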

An important step of tree pruning is to define a criterion to be used to determine the correct final tree size,
using one of the following methods:

1. Use a distinct dataset from the training set (called validation set), to evaluate the effect of post-pruning
nodes from the tree.

2. Build the tree by using the training set, then apply a statistical test to estimate whether pruning or
expanding a particular node is likely to produce an improvement beyond the training set.

• Error estimation
• Significance testing (e.g., Chi-square test)

The first method is the most common approach. In this approach, the available data are separated into two
sets of examples: a training set, which is used to build the decision tree, and a validation set, which is used to
evaluate the impact of pruning the tree. The second method is also a common approach. Here, we explain
the error estimation and Chi2 test.

Post-pruning using Error estimation

The error estimate for a sub-tree is the weighted sum of the error estimates for all of its leaves. The error
estimate (e) for a node is:
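
The formula itself is not reproduced in this extract; the estimate commonly used here (a C4.5-style pessimistic error derived from the upper bound of a binomial confidence interval) is, with f the observed error rate at the node, N the number of examples reaching the node, and z the z-score for the chosen confidence level:

e = \frac{f + \frac{z^2}{2N} + z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}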

In the following example we set Z to 0.69 which is equal to a confidence level of 75%.


The error rate at the parent node is 0.46 and since the error rate for its children (0.51) increases with the
split, we do not want to keep the children.

Post-pruning using Chi2 test

In Chi2 test we construct the corresponding frequency table and calculate the Chi2 value and its probability.

         Bronze   Silver   Gold
Bad         4        1       4
Good        2        1       2

Chi2 = 0.21, probability = 0.90, degrees of freedom = 2

If we require the probability to be less than a limit (e.g., 0.05) before accepting a split, then since 0.90 is well
above that limit we decide not to split the node.
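
The same numbers can be reproduced with SciPy (this check is added here for illustration and is not part of the original notes):

# Reproducing the Chi2 test above with SciPy.
from scipy.stats import chi2_contingency

table = [[4, 1, 4],   # Bad:  Bronze, Silver, Gold
         [2, 1, 2]]   # Good: Bronze, Silver, Gold

chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2 = {chi2:.2f}, probability = {p:.2f}, degrees of freedom = {dof}")
# Chi2 ~ 0.21, p ~ 0.90, dof = 2 -> p > 0.05, so we do not split the node.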


Experimental Evaluation of Learning Algorithms: Measuring the accuracy of learned hypotheses

In many cases it is important to evaluate the performance of learned hypotheses as precisely as possible.

One reason is simply to understand whether to use the hypothesis. For instance, when learning from a
limited-size database indicating the effectiveness of different medical treatments, it is important to
understand as precisely as possible the accuracy of the learned hypotheses.

A second reason is that evaluating hypotheses is an integral component of many learning methods. For
example, in post-pruning decision trees to avoid overfitting, we must evaluate the impact of possible pruning
steps on the accuracy of the resulting decision tree. Therefore, it is important to understand the likely errors
inherent in estimating the accuracy of the pruned and unpruned tree.

Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when
we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key
difficulties arise:

• Bias in the estimate: the accuracy observed over the training examples is typically an optimistically biased
estimate of accuracy over future examples, so the hypothesis should be tested on examples that were not
used for training.

• Variance in the estimate: even when accuracy is measured on an independent test set, the measured
accuracy can still vary from the true accuracy; the smaller the test set, the larger the expected variance.

ESTIMATING HYPOTHESIS ACCURACY

Suppose you form a hypothesis from a given training data set; for example, a hypothesis for the EnjoySport
problem, where the attributes of the instances decide whether a person will be able to enjoy their favorite
sport or not.

To test or evaluate how accurate the considered hypothesis is, we use different statistical measures.
Evaluating hypotheses is an important step in training the model.

What reasons do we have for evaluating hypotheses?

• to decide whether to use a particular hypothesis or not


• to use the evaluation as part of the learning algorithm

Some Definitions

• X – the space of possible instances

• D – the unknown probability distribution that defines the probability of encountering each instance in X
• f(x) – the target function
• h(x) – the hypothesized function

The following two questions are of particular relevance to us in this context,

1. Given a hypothesis h and a data sample containing n examples drawn at random according to the
distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same
distribution?

2. What is the margin of error in this estimate of accuracy?

Sample Error and True Error

True Error

The true error is the probability that the hypothesis will misclassify a single randomly drawn sample from
the population. Here the population represents all the data in the world.

Let us consider a hypothesis h(x) and the true/target function f(x) over population P. The probability that h
will misclassify an instance drawn at random, i.e. the true error, is:
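
The formula is not reproduced in this extract; using the distribution D defined earlier, the standard definition of the true error is:

error_D(h) \equiv \Pr_{x \sim D} \big[ f(x) \neq h(x) \big]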


Sample Error

The sample error of a hypothesis h with respect to the target function f and data sample S is the proportion
of examples in S that h misclassifies.

Equivalently, the sample error can be written with the following formula:
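
The formula itself is missing from this extract; the standard definition of the sample error over a sample S of n examples is:

error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta\big(f(x), h(x)\big), \quad \text{where } \delta(f(x), h(x)) = 1 \text{ if } f(x) \neq h(x) \text{ and } 0 \text{ otherwise}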

Suppose hypothesis h misclassifies 7 out of the 33 examples in the sample. Then the sample error is:

error_S(h) = 7/33 ≈ 0.21

Confidence Interval

Generally, the true error is difficult to calculate directly. It can be estimated with the help of a confidence
interval, and the confidence interval can be computed as a function of the sample error.

Below are the steps for the confidence interval:

• Randomly draw a sample S of n examples (independently of each other), where n should be > 30,
from the population P.

• Calculate the sample error of the sample S.

Here we assume that the sample error is an unbiased estimator of the true error. The following is the formula
for the confidence interval on the true error:
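
The formula is not reproduced in this extract; the standard approximate N%-confidence interval for the true error in terms of the sample error is:

error_D(h) \in \; error_S(h) \pm z_s \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}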

where z_s is the z-score corresponding to the s% confidence interval:

% Confidence Interval   50     80     90     95     99     99.5
Z-score                 0.67   1.28   1.64   1.96   2.58   2.80
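
As a worked example (added here for illustration, reusing the sample error computed earlier): with error_S(h) = 0.21 and n = 33, the 95% confidence interval is approximately

0.21 \pm 1.96 \sqrt{\frac{0.21 \times 0.79}{33}} \;=\; 0.21 \pm 0.14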


True Error vs Sample Error

True Error: The true error represents the probability that a random sample from the population is misclassified.
Sample Error: The sample error represents the fraction of the sample which is misclassified.

True Error: True error is used to estimate the error over the population.
Sample Error: Sample error is used to estimate the error over the sample.

True Error: True error is difficult to calculate; it is estimated from the confidence interval range on the basis of the sample error.
Sample Error: Sample error is easy to calculate; you just have to calculate the fraction of the sample that is misclassified.

True Error: The true error can be caused by poor data collection methods, selection bias, or non-response bias.
Sample Error: Sampling error can be of type population-specific error (wrong people to survey), selection error, sample-frame error (wrong frame window selected for the sample), or non-response error (when a respondent fails to respond).


Comparing learning algorithms: cross-validation, learning curves, and statistical hypothesis testing

There are countless machine learning (ML) algorithms, although only a small subset is used regularly in
research and in industry. It is still somewhat overwhelming to understand and remember all the nitty-gritty
details of these ML models, and some people have the wrong impression that the algorithms are totally
unrelated. More importantly, how might one choose algorithm A over B when both seem to be effective?
Some useful dimensions for comparing learning algorithms include:

• Time complexity
• Space complexity
• Sample complexity
• Online and Offline learning
• Parallelizability
• Parametricity
• Methodology, Assumptions and Objectives

1. Cross Validation

In machine learning, we cannot simply fit a model on the training data and claim that it will work accurately
on real data. We must make sure that the model has picked up the correct patterns from the data and is not
fitting too much noise. For this purpose, we use the cross-validation technique.

Cross-validation is a technique in which we train our model using a subset of the data-set and then
evaluate it using the complementary subset of the data-set.

The three steps involved in cross-validation are as follows:

1. Reserve some portion of the sample data-set.

2. Train the model using the rest of the data-set.

3. Test the model using the reserved portion of the data-set.

Methods of Cross Validation

a) Validation
In this method, we perform training on 50% of the given data-set and the remaining 50% is used for testing.
The major drawback of this method is that, since we train on only 50% of the dataset, the remaining 50%
may contain important information that the model never sees while training, i.e. higher bias.

b) LOOCV (Leave One Out Cross Validation)

In this method, we perform training on the whole data-set except for a single data-point, which is left out
for testing, and we iterate this for each data-point. It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, hence it has low bias.
The major drawback is that it leads to higher variation in the testing estimate, because we test against a
single data point each time; if that data point is an outlier it can lead to high variation. Another drawback
is that it takes a lot of execution time, since it iterates as many times as there are data points.

c) K-Fold Cross Validation

In this method, we split the data-set into k subsets (known as folds), train on k-1 of the subsets, and leave
one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for
testing each time.

Note:
It is often suggested that the value of k should be 10, as a lower value of k moves the procedure towards the
simple validation approach, while a higher value of k approaches the LOOCV method.


Example
The table below shows the training subsets and evaluation subsets generated in k-fold cross-validation.
Here, we have a total of 25 instances. In the first iteration we use the first 20 percent of the data for
evaluation and the remaining 80 percent for training (observations [0-4] for testing and [5-24] for training),
while in the second iteration we use the second subset of 20 percent for evaluation and the remaining
subsets for training (observations [5-9] for testing and [0-4] plus [10-24] for training), and so on.

Total instances: 25

Value of k: 5

No. Iteration Training set observations Testing set observations

1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]

2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]

3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]

4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]

5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
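
These fold indices can be reproduced with scikit-learn's KFold (this snippet is added for illustration and is not part of the original notes):

# Reproducing the 5-fold split over 25 instances shown above.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(25, 1)   # 25 dummy instances
kf = KFold(n_splits=5, shuffle=False)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(i, "train:", train_idx.tolist(), "test:", test_idx.tolist())
# In practice, a model would be fit on each training split and scored on the
# matching test split, e.g. with sklearn.model_selection.cross_val_score.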

Advantages of train/test split:

1. Runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test
split K times.

2. Simpler to examine the detailed results of the testing process.

Advantages of cross-validation:

1. More accurate estimate of out-of-sample accuracy.

2. More “efficient” use of data as every observation is used for both training and testing.


2. learning curves

A learning curve is a plot showing, over the course of training (experience), the progress of a specific metric
related to learning. Learning curves are simply a graphical representation of the learning process.

According to this, we’ll have a measure of time or progress in the x-axis and a measure of error or
performance in the y-axis.

We use these charts to monitor the evolution of our model during learning so we can diagnose problems
and optimize the prediction performance.

2.1 Single Curves

The most popular example of a learning curve is loss over time. Loss (or cost) measures our model error, or
“how bad our model is doing”. So, for now, the lower our loss becomes, the better our model performance
will be.

In the picture below, we can see the expected behavior of the learning process:

Despite the fact it has slight ups and downs, in the long term, the loss decreases over time, so the model is
learning.

2.2 Multiple Curves

One of the most widely used metrics combinations is training loss + validation loss over time.

The training loss indicates how well the model is fitting the training data, while the validation loss indicates
how well the model fits new data.

We will see this combination later on, but for now, see below a typical plot showing both metrics:
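
Since the referenced plot is not reproduced in this extract, the sketch below shows one common way to generate training- and validation-score curves with scikit-learn; the estimator, dataset and training sizes are arbitrary choices for illustration.

# Plotting training and validation scores as a function of training set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()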


2.3 Two Main Types

We often see these two types of learning curves appearing in charts:

• Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of
the model are being optimized, such as loss or Mean Squared Error

• Performance Learning Curves: Learning curves calculated on the metric by which the model will be
evaluated and selected, such as accuracy, precision, recall, or F1 score

Below you can see an example in Machine Translation showing BLEU (a performance score) together with
the loss (optimization score) for two different models (orange and green):

3. statistical hypothesis testing

What is Hypothesis Testing?

Any data science project starts with exploring the data. When we perform an analysis on a sample through
exploratory data analysis and inferential statistics we get information about the sample. Now, we want to
use this information to predict values for the entire population.


Hypothesis testing is done to confirm our observation about the population using sample data, within the
desired error level. Through hypothesis testing, we can determine whether we have enough statistical
evidence to conclude if the hypothesis about the population is true or not.

How do we perform hypothesis testing in machine learning?

When we use sample data to train our model, we make assumptions about the population. By performing
hypothesis testing, we validate these assumptions at a desired significance level, which helps us trust the
model and its predictions.

Key steps to perform hypothesis test are as follows:

1. Formulate a Hypothesis

2. Determine the significance level

3. Determine the type of test

4. Calculate the Test Statistic values and the p values

5. Make Decision

Now let’s look into the steps in detail:

1. Formulating the hypothesis

One of the key steps to do this is to formulate the below two hypotheses:

• The null hypothesis represented as H₀ is the initial claim that is based on the prevailing belief about
the population.
• The alternate hypothesis represented as H₁ is the challenge to the null hypothesis. It is the claim
which we would like to prove as True

One of the main points to consider while formulating the null and alternative hypotheses is that the null
hypothesis always looks at confirming the existing notion. Hence, the null hypothesis carries the signs =, ≥
or ≤, while the alternate hypothesis carries the signs ≠, > or <.

2. Determine the significance level also known as alpha or α for Hypothesis Testing

The significance level is the probability that the test statistic falls in the critical region when the null
hypothesis is true. It is usually set at 5% or 0.05, which means there is a 5% chance that we would accept
the alternate hypothesis even when the null hypothesis is true.

Based on the criticality of the requirement, we can choose a lower significance level of 1% as well.


3. Determine the Test Statistic and calculate its value for Hypothesis Testing

Hypothesis testing uses Test Statistic which is a numerical summary of a data-set that reduces the data to one
value that can be used to perform the hypothesis test.

4. Select the type of Hypothesis test

We choose the type of test statistic based on the predictor variable – quantitative or categorical. Below are a
few of the commonly used test statistics for quantitative data

Type of predictor variable | Distribution type | Desired test | Attributes

• Quantitative | Normal distribution | Z-test | Large sample size; population standard deviation known

• Quantitative | T distribution | T-test | Sample size less than 30; population standard deviation unknown

• Quantitative | Positively skewed distribution | F-test | When you want to compare 3 or more variables

• Quantitative | Negatively skewed distribution | NA | Requires feature transformation to perform a hypothesis test

• Categorical | NA | Chi-square test | Test of independence; goodness of fit

5. The decision about your model

The test statistic is then used to calculate the P-value. The P-value measures the strength of the evidence
against the null hypothesis: the smaller the P-value, the stronger the evidence against it. If the P-value is less
than the significance level, we reject the null hypothesis.

If the p-value < α, then we have statistically significant evidence against the null hypothesis, so we reject the
null hypothesis and accept the alternate hypothesis.

If the p-value > α, then we do not have statistically significant evidence against the null hypothesis, so we fail
to reject the null hypothesis.
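
As an illustration of how such a test can be used to compare two learning algorithms (this snippet is added here and is not part of the original notes; the models, dataset and α are arbitrary choices), the sketch below runs both models on the same cross-validation folds and applies a paired t-test to their per-fold accuracies.

# Comparing two learning algorithms with cross-validation + a paired t-test.
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models

scores_tree = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
scores_nb   = cross_val_score(GaussianNB(), X, y, cv=cv)

# H0: the two algorithms have the same mean accuracy on these folds.
t_stat, p_value = ttest_rel(scores_tree, scores_nb)
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the difference in accuracy is statistically significant.")
else:
    print("Fail to reject H0: no statistically significant difference detected.")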

As we make decisions, it is important to understand the errors that can happen while testing.

P-value: The P-value, or calculated probability, is the probability of finding the observed, or more extreme,
results when the null hypothesis (H₀) of a study question is true; the definition of "extreme" depends on
how the hypothesis is being tested.
