U4 ML Updated
• The logic behind a decision tree is easy to understand because the model
mirrors a tree-like structure.
Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be
split further after reaching a leaf node.
• Splitting: Splitting is the process of dividing a decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub-Tree: A sub-tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the
tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent
of those sub-nodes, and the sub-nodes are its children. (A minimal
data-structure sketch follows this list.)
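The terminology above maps directly onto a simple data structure. Below is a minimal sketch (the Node class and the job-offer branch values are illustrative assumptions, not from any particular library): a node holding a class label is a leaf, while a node holding a splitting attribute with child branches is a decision node, and each child is the root of a sub-tree.

```python
# Minimal illustration of decision tree terminology (illustrative sketch).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    attribute: Optional[str] = None               # splitting attribute (root/decision node)
    label: Optional[str] = None                   # class label (leaf node)
    children: dict = field(default_factory=dict)  # branch value -> child Node (sub-tree)

# The root is the parent of its children; a child with a label is a leaf.
root = Node(attribute="Salary",
            children={"below expectation": Node(label="Declined offer"),
                      "meets expectation": Node(attribute="Distance from office")})
print(root.children["below expectation"].label)   # a leaf: "Declined offer"
```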
How does the Decision Tree algorithm Work?
• In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the corresponding
attribute of the record (from the real dataset) and, based on the comparison, follows the
branch and jumps to the next node.
• At the next node, the algorithm again compares the record's attribute value with those of
the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the algorithm below (a code
sketch follows the steps):
• Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best
attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot classify the
nodes any further; the final node is called a leaf node.
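The five steps translate into a short recursive procedure. The sketch below is an assumed, self-contained illustration (dict-based records, with Gini impurity standing in for the ASM), not a specific library's implementation:

```python
# Recursive decision tree construction following Steps 1-5 (illustrative sketch).
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels (one possible ASM ingredient).
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    # Step-2: pick the attribute whose split yields the lowest weighted impurity.
    def weighted_impurity(attr):
        return sum(len([r for r in rows if r[attr] == v]) / len(rows) *
                   gini([r[target] for r in rows if r[attr] == v])
                   for v in {r[attr] for r in rows})
    return min(attributes, key=weighted_impurity)

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:      # cannot classify further:
        return Counter(labels).most_common(1)[0][0]  # leaf node (majority label)
    best = best_attribute(rows, attributes, target)  # Step-2: ASM
    node = {best: {}}                                # Step-4: decision node
    for value in {r[best] for r in rows}:            # Step-3: split S into subsets
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, target)  # Step-5: recurse
    return node

rows = [{"Outlook": "Sunny", "Play": "No"}, {"Outlook": "Overcast", "Play": "Yes"},
        {"Outlook": "Rainy", "Play": "Yes"}, {"Outlook": "Sunny", "Play": "No"}]
print(build_tree(rows, ["Outlook"], "Play"))
# e.g. {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'}}
```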
How does the Decision Tree algorithm Work?
• Example: Suppose there is a candidate who has a job offer and wants
to decide whether he should accept the offer or not. To solve this
problem, the decision tree starts with the root node (the Salary attribute,
selected by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node, based on the
corresponding labels. The next decision node further splits into
one decision node (cab facility) and one leaf node. Finally, the
decision node splits into two leaf nodes (Accepted offer and Declined
offer).
[Diagram: decision tree for the job-offer example, splitting on Salary,
Distance from office, and Cab facility]
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve this problem,
there is a technique called the Attribute Selection Measure (ASM). Using this
measure, we can easily select the best attribute for the nodes of the tree.
There are two popular ASM techniques (a small sketch of both follows this list):
• Information Gain(IG)
• Gini Index(GI)
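Both measures can be computed in a few lines. The helper names below are an assumed sketch, not a library API; the example numbers come from the Play dataset used later in this unit (10 Yes / 4 No overall, split on Outlook):

```python
# Entropy, Gini index, and information gain over lists of class labels (sketch).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini_index(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    # IG = entropy(parent) - weighted average entropy of the children.
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_label_lists)
    return entropy(parent_labels) - weighted

parent = ["Yes"] * 10 + ["No"] * 4                 # 10 Yes, 4 No overall
children = [["Yes"] * 5,                           # Overcast: 5 Yes, 0 No
            ["Yes"] * 2 + ["No"] * 2,              # Rainy:    2 Yes, 2 No
            ["Yes"] * 3 + ["No"] * 2]              # Sunny:    3 Yes, 2 No
print(information_gain(parent, children))          # ~0.23
print(gini_index(parent))                          # ~0.41
```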
Information Gain
• ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm
creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical
feature that will yield the largest information gain for categorical targets. Trees are grown
to their maximum size and then a pruning step is usually applied to improve the ability of
the tree to generalize to unseen data.
• C4.5 is the successor to ID3 and removed the restriction that features must be categorical
by dynamically defining a discrete attribute (based on numerical variables) that partitions
the continuous attribute value into a discrete set of intervals. C4.5 converts the trained
trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each
rule is then evaluated to determine the order in which they should be applied. Pruning is
done by removing a rule’s precondition if the accuracy of the rule improves without it.
• CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it
supports numerical target variables (regression) and does not compute rule sets. CART
constructs binary trees using the feature and threshold that yield the largest information
gain at each node.
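As a concrete reference point, scikit-learn's DecisionTreeClassifier uses an optimised version of CART; the data below is a toy stand-in, not from the slides:

```python
# CART in practice via scikit-learn (toy data for illustration).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1], [1, 1], [1, 0], [0, 0]]                # two binary features
y = [0, 1, 1, 0]                                    # class follows feature f0
clf = DecisionTreeClassifier(criterion="entropy")   # "entropy" ~ information gain
clf.fit(X, y)
print(export_text(clf, feature_names=["f0", "f1"])) # binary splits only (CART)
```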
Naïve Bayes Classifier Algorithm
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends
on the conditional probability.
• The formula for Bayes' theorem is given as:

  P(A|B) = P(B|A) * P(A) / P(B)
• Where,
• P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
• P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.
• P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
• P(B) is Marginal Probability: Probability of Evidence.
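The formula is a one-liner in code. A minimal sketch with placeholder numbers (the function name is an assumption for illustration):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B) (sketch, placeholder numbers).
def posterior(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

print(posterior(p_b_given_a=0.8, p_a=0.4, p_b=0.5))  # 0.64
```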
Working of Naïve Bayes' Classifier:
• Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
• Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should
play or not on a particular day according to the weather conditions.
To solve this problem, we need to follow the steps below:
• Convert the given dataset into frequency tables.
• Generate a likelihood table by finding the probabilities of the given features.
• Now, use Bayes' theorem to calculate the posterior probability.
• Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first consider the below dataset:
Dataset

No.   Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes
Frequency table for the weather conditions:

Weather     Yes   No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4

Likelihood table for the weather conditions:

Weather     Yes            No             P(Weather)
Overcast     5              0             5/14 = 0.35
Rainy        2              2             4/14 = 0.29
Sunny        3              2             5/14 = 0.35
Total       10/14 = 0.71    4/14 = 0.29
Applying Bayes' theorem
• P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
• P(Sunny|Yes)= 3/10= 0.3
• P(Sunny)= 0.35
• P(Yes)=0.71
• So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
• P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
• P(Sunny|No)= 2/4= 0.5
• P(No)= 0.29
• P(Sunny)= 0.35
• So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
• From the above calculation, we can see that P(Yes|Sunny) > P(No|Sunny).
• Hence, on a sunny day, the player can play the game.
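The whole worked example can be reproduced from the raw dataset in a few lines. This is an assumed sketch in plain Python (no library), using the 14 rows listed above:

```python
# Reproduce P(Yes|Sunny) from the 14-row weather dataset (sketch).
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
p_yes = play.count("Yes") / n                    # P(Yes)   = 10/14 ~ 0.71
p_sunny = outlook.count("Sunny") / n             # P(Sunny) =  5/14 ~ 0.35
p_sunny_given_yes = sum(o == "Sunny" and p == "Yes"
                        for o, p in zip(outlook, play)) / play.count("Yes")  # 3/10

print(p_sunny_given_yes * p_yes / p_sunny)       # P(Yes|Sunny) = 0.6
```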
Advantages & Disadvantages
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and simplest ML algorithms for predicting
the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions as compared to the other
algorithms.
• It is a very popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
• It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model:
• There are three types of Naive Bayes Model, which are given below:
• Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the
model assumes that these values are sampled from the Gaussian distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification
problems, i.e., deciding which category a particular document belongs to, such
as sports, politics, or education.
The classifier uses the frequencies of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier,
but the predictor variables are independent Boolean variables, such as whether
a particular word is present in a document or not. This model is also well known
for document classification tasks. (A short sketch of all three follows this list.)
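scikit-learn ships one class per model; the snippet below pairs each with the kind of feature it expects (the arrays are toy stand-ins, not real data):

```python
# The three Naive Bayes variants in scikit-learn (toy data for illustration).
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])
GaussianNB().fit(np.array([[1.2], [0.9], [3.1], [2.8]]), y)         # continuous values
MultinomialNB().fit(np.array([[3, 0], [2, 1], [0, 4], [1, 5]]), y)  # word counts
BernoulliNB().fit(np.array([[1, 0], [1, 1], [0, 1], [0, 1]]), y)    # word present or not
```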
What are Bayesian networks?
• A Bayesian network (also known as a Bayes network, Bayes net, belief
network, or decision network) is a probabilistic graphical model that
represents a set of variables and their conditional dependencies via a
directed acyclic graph (DAG).
• Bayesian networks are ideal for taking an event that occurred and predicting
the likelihood that any one of several possible known causes was the
contributing factor.
• For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms. Given symptoms, the
network can be used to compute the probabilities of the presence of
various diseases.
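For the smallest possible version of the disease/symptom example, take a two-node network Disease → Symptom. The conditional probability numbers below are made up purely for illustration; the posterior is computed by enumeration:

```python
# Two-node Bayesian network Disease -> Symptom (illustrative, made-up CPTs).
p_disease = 0.01                       # prior: P(Disease = true)
p_sym_given = {True: 0.9, False: 0.1}  # CPT:   P(Symptom = true | Disease)

# Enumerate the joint over Disease with Symptom observed true, then normalise.
joint_true = p_disease * p_sym_given[True]           # P(D = true,  S = true)
joint_false = (1 - p_disease) * p_sym_given[False]   # P(D = false, S = true)
print(joint_true / (joint_true + joint_false))       # P(D = true | S = true) ~ 0.083
```

Even with a 90%-accurate symptom, the low prior keeps the posterior small; this is exactly the kind of reasoning a Bayesian network automates over larger DAGs.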
[Figure: A simple Bayesian network with conditional probability tables]
Inference