
UNIT IV

TREE BASED AND PROBABILISTIC MODELS


CONTENTS
• Tree Based Model: Decision Tree – Concepts and Terminologies,
Impurity Measures -Gini Index, Information gain, Entropy, Tree
Pruning -ID3/C4.5, Advantages and Limitations
• Probabilistic Models: Conditional Probability and Bayes Theorem,
Naïve Bayes Classifier, Bayesian network for Learning and Inferencing.
Decision Tree Classification Algorithm
• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is
a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees (a minimal code sketch follows below).
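As a quick illustration (not from the slides), here is a minimal scikit-learn sketch of a CART-style decision tree; the job-offer feature names and values are made-up assumptions:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_in_lakhs, distance_km, has_cab_facility] -- made-up toy data.
X = [[12, 5, 1], [4, 20, 0], [10, 25, 1], [11, 30, 0], [3, 2, 1], [15, 8, 0]]
y = ["Accept", "Decline", "Accept", "Decline", "Decline", "Accept"]

# CART builds a binary tree, choosing splits that minimise Gini impurity.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Inspect the learned rules (root node, branches, leaf nodes) and classify a new offer.
print(export_text(clf, feature_names=["salary", "distance", "cab_facility"]))
print(clf.predict([[13, 10, 1]]))
```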
Why use Decision Trees?

• Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies

• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split further once a leaf node is reached.
• Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub Tree: A subtree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the
tree.
• Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called its child nodes (the root node, for example, is the parent of the first nodes it splits into).
How does the Decision Tree algorithm Work?

• In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the values of the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified any further; call these final nodes leaf nodes. (A compact sketch of these steps appears below.)
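The following is a rough Python sketch of Steps 1-5, assuming the rows are stored as dicts (e.g. {'Outlook': 'Sunny', 'Play': 'Yes'}) and using entropy-based information gain as the ASM; the function and column names are my own, not from the slides:

```python
import math
from collections import Counter

def entropy(rows, target="Play"):
    # Impurity of a set of rows with respect to the target column.
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attributes, target="Play"):
    # Step-2: pick the attribute with the highest information gain (the ASM).
    def gain(attr):
        value_counts = Counter(r[attr] for r in rows)
        weighted = sum(
            (n / len(rows)) * entropy([r for r in rows if r[attr] == v], target)
            for v, n in value_counts.items()
        )
        return entropy(rows, target) - weighted
    return max(attributes, key=gain)

def build_tree(rows, attributes, target="Play"):
    labels = {r[target] for r in rows}
    if len(labels) == 1 or not attributes:            # Step-5: stop at a leaf node
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)   # Step-2 / Step-4
    tree = {attr: {}}
    for value in {r[attr] for r in rows}:             # Step-3: split into subsets
        subset = [r for r in rows if r[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, target)  # Step-5: recurse
    return tree

# Usage: build_tree(rows, ["Outlook", "Humidity", "Wind"]) returns a nested dict tree.
```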
How does the Decision Tree
algorithm Work?
• Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by the ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).
Attribute Selection Measures

• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem, there is a technique called the Attribute Selection Measure (ASM). Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

• Information Gain(IG)
• Gini Index(GI)
Information Gain

• Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
• Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
• Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
• Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
• Where,
• S = the set of all samples
• P(yes) = probability of yes
• P(no) = probability of no
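A worked computation of these formulas, using the Yes/No counts from the weather dataset that appears later in this unit (the helper function is a sketch, not from the slides):

```python
import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)*log2 P(yes) - P(no)*log2 P(no); 0*log2(0) is treated as 0.
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

# Class counts from the weather dataset shown later in this unit: 10 "Yes", 4 "No".
E_S = entropy(10 / 14, 4 / 14)                      # ≈ 0.863

# Splitting on Outlook: Overcast (5 Yes, 0 No), Rainy (2, 2), Sunny (3, 2).
E_overcast = entropy(5 / 5, 0 / 5)                  # 0.0
E_rainy = entropy(2 / 4, 2 / 4)                     # 1.0
E_sunny = entropy(3 / 5, 2 / 5)                     # ≈ 0.971

# Information Gain = Entropy(S) - weighted average entropy of the subsets.
weighted = (5 / 14) * E_overcast + (4 / 14) * E_rainy + (5 / 14) * E_sunny
print(round(E_S - weighted, 3))                     # ≈ 0.231
```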
Gini Index

• The Gini index is a measure of impurity or purity (how mixed a dataset is) used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• CART creates only binary splits, and it uses the Gini index to choose them.
• The Gini index can be calculated using the below formula:
• Gini Index = 1 − Σj (Pj)², where Pj is the proportion of samples of class j in the node. (A short computation follows below.)
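For comparison, the same Outlook split scored with the Gini index (again a sketch with my own helper name):

```python
def gini(p_yes, p_no):
    # Gini = 1 - sum of squared class probabilities.
    return 1.0 - (p_yes ** 2 + p_no ** 2)

# Gini impurity of each Outlook branch (same counts as the entropy example above).
g_overcast = gini(5 / 5, 0 / 5)   # 0.0  -> pure node
g_rainy = gini(2 / 4, 2 / 4)      # 0.5  -> maximally mixed
g_sunny = gini(3 / 5, 2 / 5)      # 0.48

# CART prefers the split with the lowest weighted Gini impurity.
weighted_gini = (5 / 14) * g_overcast + (4 / 14) * g_rainy + (5 / 14) * g_sunny
print(round(weighted_gini, 3))    # ≈ 0.314
```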
Pruning: Getting an Optimal Decision tree

• Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision tree.
• A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. Therefore, a technique that decreases the size of the learned tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques used:
• Cost Complexity Pruning
• Reduced Error Pruning.
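Cost complexity pruning can be sketched with scikit-learn's ccp_alpha mechanism; the snippet below is an assumption-laden example (it presumes a training split X_train, y_train already exists), not code from the slides:

```python
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train and y_train are an existing training split.
# 1. Compute the effective alphas along the cost complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# 2. Refit one tree per alpha; a larger alpha prunes more aggressively,
#    giving a smaller tree that usually generalises better.
trees = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas
]
print([t.tree_.node_count for t in trees])  # node counts shrink as alpha grows
```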
Advantages & Disadvantages of the Decision Tree

• It is simple to understand, as it follows the same process that a human follows while making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.
• Disadvantages of the Decision Tree
• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
• For more class labels, the computational complexity of the decision tree
may increase.
ID3/C4.5

• ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm
creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical
feature that will yield the largest information gain for categorical targets. Trees are grown
to their maximum size and then a pruning step is usually applied to improve the ability of
the tree to generalize to unseen data.
• C4.5 is the successor to ID3 and removed the restriction that features must be categorical
by dynamically defining a discrete attribute (based on numerical variables) that partitions
the continuous attribute value into a discrete set of intervals. C4.5 converts the trained
trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each
rule is then evaluated to determine the order in which they should be applied. Pruning is
done by removing a rule’s precondition if the accuracy of the rule improves without it.
• CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it
supports numerical target variables (regression) and does not compute rule sets. CART
constructs binary trees using the feature and threshold that yields the largest information
gain at each node.
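For reference, scikit-learn's DecisionTreeClassifier implements an optimised version of CART; switching the split criterion to entropy gives information-gain-based splits in the spirit of ID3/C4.5. A small sketch using the built-in iris dataset (an illustrative assumption, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# scikit-learn's tree learner is an optimised CART (binary splits only).
cart_like = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# The "entropy" criterion gives information-gain based splits in the spirit of
# ID3/C4.5 (though the tree is still binary, unlike a true multiway ID3 tree).
id3_like = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(cart_like.get_depth(), id3_like.get_depth())
```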
Naïve Bayes Classifier Algorithm

• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:
• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:

• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends
on the conditional probability.
• The formula for Bayes' theorem is given as:
• P(A|B) = P(B|A) * P(A) / P(B)
• Where,
• P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
• P(B|A) is the Likelihood: the probability of the evidence given that the hypothesis is true.
• P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
• P(B) is the Marginal probability: the probability of the evidence. (A small sketch of this formula as code follows below.)
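A tiny sketch of the formula as code; the disease-test numbers are illustrative assumptions, not from the slides:

```python
def posterior(likelihood, prior, evidence):
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return likelihood * prior / evidence

# Illustrative numbers (not from the slides): a test with P(pos|disease) = 0.9,
# a prior P(disease) = 0.01, and an overall positive rate P(pos) = 0.05.
print(posterior(0.9, 0.01, 0.05))   # 0.18
```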
Working of Naïve Bayes' Classifier:

• Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
• Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
• Convert the given dataset into frequency tables.
• Generate Likelihood table by finding the probabilities of given features.
• Now, use Bayes theorem to calculate the posterior probability.
• Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first consider the below dataset:
Dataset:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes
Frequency table for the weather conditions:

Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table for the weather conditions:

Weather     Yes             No              P(Weather)
Overcast    5               0               5/14 = 0.35
Rainy       2               2               4/14 = 0.29
Sunny       3               2               5/14 = 0.35
Total       10/14 = 0.71    4/14 = 0.29
Applying Bayes' theorem:
• P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
• P(Sunny|Yes) = 3/10 = 0.3
• P(Sunny) = 0.35
• P(Yes) = 0.71
• So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
• P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
• P(Sunny|No) = 2/4 = 0.5
• P(No) = 0.29
• P(Sunny) = 0.35
• So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
• As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
• Hence, on a sunny day, the player can play the game.
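The same calculation can be reproduced directly from the 14-row dataset; the variable names below are mine, not from the slides:

```python
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No",
        "Yes", "No", "Yes", "Yes"]

n = len(play)
p_yes, p_no = play.count("Yes") / n, play.count("No") / n   # 10/14, 4/14
p_sunny = outlook.count("Sunny") / n                        # 5/14

# Likelihoods read off the frequency table.
p_sunny_given_yes = sum(o == "Sunny" and p == "Yes"
                        for o, p in zip(outlook, play)) / play.count("Yes")  # 3/10
p_sunny_given_no = sum(o == "Sunny" and p == "No"
                       for o, p in zip(outlook, play)) / play.count("No")    # 2/4

print(p_sunny_given_yes * p_yes / p_sunny)   # P(Yes|Sunny) = 0.60
print(p_sunny_given_no * p_no / p_sunny)     # P(No|Sunny)  = 0.40 (0.41 above comes from rounding)
```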
Advantages & Disadvantages
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other
Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.
Applications of Naïve Bayes
Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
• It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model:

• There are three types of Naive Bayes Model, which are given below:
• Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the
model assumes that these values are sampled from the Gaussian distribution.

• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.

• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also popular for document classification tasks.
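A sketch of the three variants using scikit-learn's naive_bayes module; the toy measurements, documents, and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous features assumed to be normally distributed per class.
gnb = GaussianNB().fit([[5.1, 3.5], [6.2, 2.9], [4.9, 3.1]], ["A", "B", "A"])

# Multinomial NB: word-count features, the usual choice for document classification.
docs = ["team wins the match today", "election results announced today",
        "player scores the winning goal"]
labels = ["Sports", "Politics", "Sports"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)

# Bernoulli NB: binary word present/absent features (binarize=0.0 derives them from counts).
bnb = BernoulliNB().fit(counts, labels)

print(mnb.predict(vec.transform(["match results announced today"])))
```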
What are Bayesian networks?
• A Bayesian network (also known as a Bayes network, Bayes net, belief
network, or decision network) is a probabilistic graphical model that
represents a set of variables and their conditional dependencies via a
directed acyclic graph (DAG).
• Bayesian networks are ideal for taking an event that occurred and predicting
the likelihood that any one of several possible known causes was the
contributing factor.
• For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms. Given symptoms, the
network can be used to compute the probabilities of the presence of
various diseases.
A simple Bayesian network with
conditional probability tables
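The original figure is not reproduced here; as a stand-in, the sketch below encodes a two-node Disease → Symptom network with its conditional probability tables as Python dicts (all numbers are illustrative assumptions) and computes P(Disease | Symptom) as described above:

```python
# Illustrative numbers only: a two-node network Disease -> Symptom,
# with each conditional probability table stored as a plain dict.
P_disease = 0.01                               # prior P(Disease = True)
P_symptom_given = {True: 0.90, False: 0.05}    # P(Symptom = True | Disease)

# "Given symptoms, compute the probability of the presence of the disease":
p_symptom = (P_symptom_given[True] * P_disease
             + P_symptom_given[False] * (1 - P_disease))   # marginal P(Symptom = True)
p_disease_given_symptom = P_symptom_given[True] * P_disease / p_symptom
print(round(p_disease_given_symptom, 3))       # ≈ 0.154
```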
Inference

• Inference is the process of calculating a probability distribution of interest, e.g. P(A | B=True) or P(A, B | C, D=True). The terms inference and query are used interchangeably. The following terms are all forms of inference, with slightly different semantics:
• Prediction - focused around inferring outputs from inputs.
• Diagnostics - inferring inputs from outputs.
• Supervised anomaly detection - essentially the same as prediction
• Decision making under uncertainty - optimization and inference combined.
• A few examples of inference in practice:
• Given a number of symptoms, which diseases are most likely?
• How likely is it that a component will fail, given the current state of the system?
• Given recent behavior of 2 stocks, how will they behave together for the next 5
time steps?
Inference
• Exact inference
• Exact inference is the term used when inference is performed exactly
(subject to standard numerical rounding errors).
• Exact inference is applicable to a large range of problems, but may not
be possible when combinations/paths get large.
• Problems that are too large for exact inference can be handled by approximate inference techniques instead.
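A minimal sketch of exact inference by enumeration over a tiny hand-built network (Rain → WetGrass ← Sprinkler); all probabilities and names are illustrative assumptions, not from the slides:

```python
from itertools import product

# Conditional probability tables for the toy network.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.1, False: 0.9}
P_wet = {  # P(WetGrass = True | Rain, Sprinkler)
    (True, True): 0.99, (True, False): 0.90,
    (False, True): 0.80, (False, False): 0.00,
}

def joint(rain, sprinkler, wet):
    # Chain rule over the DAG: P(R, S, W) = P(R) * P(S) * P(W | R, S).
    p_w = P_wet[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * (p_w if wet else 1 - p_w)

# Exact inference by enumeration: P(Rain = True | WetGrass = True).
numerator = sum(joint(True, s, True) for s in (True, False))
evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(round(numerator / evidence, 3))   # ≈ 0.74
```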
