The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior Probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood Probability: the probability of the evidence given that the hypothesis is true.
P(A) is Prior Probability: the probability of the hypothesis before observing the evidence.
P(B) is Marginal Probability: the probability of the evidence.
The working of the Naïve Bayes classifier can be understood with the help of the example below:
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather    No            Yes            Total
Overcast   0             5              5/14 = 0.35
Rainy      2             2              4/14 = 0.29
Sunny      2             3              5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player can play the game on a sunny day.
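The same calculation is easy to reproduce in a few lines of Python. This is only a sketch that plugs in the likelihoods read from the tables above; the small differences from the figures above come from rounding 4/14 and 5/14 to two decimals.
# Bayes' theorem for the sunny-day example, using exact fractions
p_sunny_given_yes = 3 / 10          # from the frequency table
p_sunny_given_no = 2 / 4
p_yes, p_no, p_sunny = 10 / 14, 4 / 14, 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # = 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # = 0.40 (0.41 above, due to rounding)

print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")   # Play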
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions compared to many other algorithms.
o It is one of the most popular choices for text classification problems.
There are three types of Naïve Bayes model, which are given below:
o Gaussian: assumes that the continuous features follow a normal distribution.
o Multinomial: used when the features represent discrete counts, such as word frequencies in a document.
o Bernoulli: used when the features are binary, such as whether a particular word occurs in a document or not.
Now we will implement the Naïve Bayes algorithm using Python. For this, we will use the "user_data" dataset, which we have used in our other classification models, so we can easily compare the Naïve Bayes model with the other models.
Steps to implement:
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar to what we did in the data pre-processing step. The code for this is given below:
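The pre-processing code is not reproduced in this excerpt; the sketch below shows the usual steps, under the assumption that user_data.csv contains Age and EstimatedSalary (columns 2 and 3) as the features and Purchased (column 4) as the target:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset (file name and column positions are assumptions)
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)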
After the pre-processing step, now we will fit the Naive Bayes model to the Training
set. Below is the code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training
dataset. We can also use other classifiers as per our requirement.
Output:
Now we will predict the test set result. For this, we will create a new prediction vector y_pred and use the predict function to make the predictions. Below is the code for it:
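A minimal sketch of this step, reusing the classifier fitted above:
# Predicting the Test set results
y_pred = classifier.predict(x_test)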
Now we will check the accuracy of the Naive Bayes classifier using the Confusion
matrix. Below is the code for it:
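A short sketch of this step, using scikit-learn's confusion_matrix:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)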
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.
Next, we will visualize the training set result using the Naïve Bayes classifier. Below is the code for it:
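The plotting code is not included in this excerpt; the sketch below shows one common way to draw the decision regions, assuming x_train holds the two scaled features (Age and EstimatedSalary) and classifier is the fitted model:
# Visualising the Training set results
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
X1, X2 = np.meshgrid(
    np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
    np.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))

# Colour every point of the grid by the class the classifier predicts for it
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Overlay the training points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=('purple', 'green')[i], label=j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()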
In the above output, we can see that the Naïve Bayes classifier has segregated the data points with a fine boundary. The boundary is Gaussian-shaped because we have used the GaussianNB classifier in our code.
Naive Bayes Classifiers
The dataset is divided into two parts, namely, feature matrix and the response
vector.
The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the dependent features. In the above dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.
The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is 'Play golf'.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: The features of the data are conditionally independent of
each other, given the class label.
Continuous features are normally distributed: If a feature is continuous, then it
is assumed to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it
is assumed to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to
the prediction of the class label.
No missing data: The data should not contain any missing values.
With relation to our dataset, this concept can be understood as:
We assume that no pair of features are dependent. For example, the temperature
being ‘Hot’ has nothing to do with the humidity or the outlook being ‘Rainy’ has
no effect on the winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and humidity can't predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice. Now, before moving to the formula for Naive Bayes, it is important to know about Bayes' theorem.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of
another event that has already occurred. Bayes’ theorem is stated mathematically as
the following equation:
P(A|B) = P(B|A) * P(A) / P(B)
where A and B are events and P(B) ≠ 0
Basically, we are trying to find probability of event A, given the event B is true.
Event B is also termed as evidence.
P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, it is event B).
P(B) is Marginal Probability: Probability of Evidence.
P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
P(B|A) is the likelihood, i.e. the probability of the evidence given that the hypothesis is true.
Now, with regards to our dataset, we can apply Bayes’ theorem in following way:
P(y|X) = P(X|y) * P(y) / P(X)
where, y is class variable and X is a dependent feature vector (of size n) where:
X = (x1, x2, x3, …, xn)
Just to be clear, an example of a feature vector and corresponding class variable is (refer to the 1st row of the dataset):
X = (Rainy, Hot, High, False)
y = No
So basically, P(y|X) here means the probability of "not playing golf" given that the weather conditions are "Rainy outlook", "Hot temperature", "High humidity" and "no wind".
Now, it's time to put the naive assumption into Bayes' theorem, which is independence among the features. So now, we split the evidence into its independent parts.
Now, if any two events A and B are independent, then,
P(A,B) = P(A)P(B)
Hence, we reach the result:
P(y|x1,…,xn) = P(x1|y) * P(x2|y) * … * P(xn|y) * P(y) / [P(x1) * P(x2) * … * P(xn)]
which can be expressed as:
P(y|x1,…,xn) = P(y) * ∏(i=1..n) P(xi|y) / [P(x1) * P(x2) * … * P(xn)]
Now, as the denominator remains constant for a given input, we can remove that term:
P(y|x1,…,xn) ∝ P(y) * ∏(i=1..n) P(xi|y)
Now, we need to create a classifier model. For this, we find the probability of given
set of inputs for all possible values of the class variable y and pick up the output with
maximum probability. This can be expressed mathematically as:
y = argmax_y P(y) * ∏(i=1..n) P(xi|y)
So, finally, we are left with the task of calculating P(y) and P(xi|y).
Please note that P(y) is also called the class probability and P(xi|y) is called the conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi|y).
Let us try to apply the above formula manually on our weather dataset. For this, we
need to do some precomputations on our dataset.
We need to find P(xi|yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below:
[Tables 1-5: likelihood tables P(xi|yj) for Outlook, Temperature, Humidity and Windy, and the class probabilities P(y)]
So, in the figure above, we have calculated P(xi|yj) for each xi in X and yj in y manually in tables 1-4. For example, the probability of playing golf given that the temperature is cool, i.e. P(temp. = Cool | play golf = Yes) = 3/9.
Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it today):
today = (Sunny, Hot, Normal, False)
P(Yes|today) = P(Sunny Outlook|Yes) * P(Hot Temperature|Yes) * P(Normal Humidity|Yes) * P(No Wind|Yes) * P(Yes) / P(today)
and probability to not play golf is given by:
P(No|today) = P(Sunny Outlook|No) * P(Hot Temperature|No) * P(Normal Humidity|No) * P(No Wind|No) * P(No) / P(today)
Since P(today) is common to both probabilities, we can ignore it and work with proportional probabilities:
P(Yes|today) ∝ (2/9) * (2/9) * (6/9) * (6/9) * (9/14) ≈ 0.0141
and
P(No|today) ∝ (3/5) * (2/5) * (1/5) * (2/5) * (5/14) ≈ 0.0068
Now, since P(Yes|today) + P(No|today) must equal 1, these numbers can be converted into probabilities by normalizing them so that their sum is 1:
P(Yes|today) = 0.0141 / (0.0141 + 0.0068) ≈ 0.67
and
P(No|today) = 0.0068 / (0.0141 + 0.0068) ≈ 0.33
Since P(Yes|today) > P(No|today), the prediction is that golf would be played, i.e. 'Yes'.
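As a quick sanity check, the hand calculation above can be reproduced with a few lines of Python, plugging in the likelihoods read off tables 1-5:
# Unnormalized posterior scores for today = (Sunny, Hot, Normal, False)
p_yes_score = (2/9) * (2/9) * (6/9) * (6/9) * (9/14)   # ≈ 0.0141
p_no_score = (3/5) * (2/5) * (1/5) * (2/5) * (5/14)    # ≈ 0.0068

# Normalize so the two probabilities sum to 1
total = p_yes_score + p_no_score
p_yes, p_no = p_yes_score / total, p_no_score / total
print(round(p_yes, 2), round(p_no, 2))                 # 0.67 0.33
print("Play golf:", "Yes" if p_yes > p_no else "No")   # Play golf: Yes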
The method that we discussed above is applicable for discrete data. In case of
continuous data, we need to make some assumptions regarding the distribution of
values of each feature. The different naive Bayes classifiers differ mainly by the
assumptions they make regarding the distribution of P(xi|y).
Gaussian Naive Bayes
In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian (normal) distribution within each class. The example below trains a Gaussian Naive Bayes classifier on the iris dataset:
Python
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
Output:
Gaussian Naive Bayes model accuracy(in %): 95.0
Multinomial Naive Bayes
Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
Bernoulli Naive Bayes
In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing the inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).
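The contrast between the two event models can be seen in a small sketch. The toy documents and labels below are made up purely for illustration; the point is that the multinomial model consumes term counts while the Bernoulli model consumes binary term occurrences:
# Toy comparison of the multinomial and Bernoulli event models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["win money now", "meeting schedule attached",
        "win a free prize now", "project schedule update"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam (illustrative labels)

# Multinomial model: features are word counts
X_counts = CountVectorizer().fit_transform(docs)
print(MultinomialNB().fit(X_counts, labels).predict(X_counts))

# Bernoulli model: features are binary word occurrences
X_binary = CountVectorizer(binary=True).fit_transform(docs)
print(BernoulliNB().fit(X_binary, labels).predict(X_binary))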
Advantages of Naive Bayes Classifier
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
Performs well even with limited training data.
It performs well in the presence of categorical features.
For numerical features, the data is assumed to come from a normal distribution.
Disadvantages of Naive Bayes Classifier
Assumes that features are independent, which may not always hold in real-world
data.
Can be influenced by irrelevant attributes.
May assign zero probability to unseen events, leading to poor generalization.
Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-spam based on features.
Text Classification: Used in sentiment analysis, document categorization, and
topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a disease based on
symptoms.
Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
Weather Prediction: Classifies weather conditions based on various factors.
As we reach the end of this article, here are some important points to ponder upon:
In spite of their apparently over-simplified assumptions, naive Bayes classifiers
have worked quite well in many real-world situations, famously document
classification and spam filtering. They require a small amount of training data to
estimate the necessary parameters.
Naive Bayes learners and classifiers can be extremely fast compared to more
sophisticated methods. The decoupling of the class conditional feature distributions
means that each distribution can be independently estimated as a one dimensional
distribution. This in turn helps to alleviate problems stemming from the curse of
dimensionality.
Conclusion
In conclusion, Naive Bayes classifiers, despite their simplified assumptions, prove
effective in various applications, showcasing notable performance in document
classification and spam filtering. Their efficiency, speed, and ability to work with
limited data make them valuable in real-world scenarios, compensating for their naive
independence assumption.
Author's Note
Hello everyone 👋🏻,
Welcome to a beginner-friendly guide to Naive Bayes classification! This notebook
has been carefully crafted to serve as a comprehensive companion for those who are
taking their first steps into the world of machine learning.
If you're just starting out with machine learning, this guide is designed specifically
for you. We'll walk through the Naive Bayes classification technique in a way that's
easy to understand, even if you're new to this exciting field 🤩.
By the time you finish this guide, you'll have a solid grasp of how Naive Bayes
works and how it can be used to make predictions and when you should use it 🙌.
Let's dive in and unlock the power of Naive Bayes classification for beginners!
Naïve Bayes Classification : Spam Email Detection
Classifier to identify spam emails from legitimate ones
What is Naive Bayes Classification?
The Bayes Theorem, named after the English statistician and philosopher Thomas Bayes and published in 1763, serves as the fundamental principle of conditional probability. This theorem states that the likelihood of an event occurring, given the occurrence of another event, is equal to the conditional probability of the second event given the first event, multiplied by the probability of the first event itself, divided by the probability of the second event.
Naive Bayes is a popular classification approach rooted in Bayes' theorem. The posterior class probability of a test data point can be calculated using class-conditional density estimation and the class prior probability. The test point is then assigned to the class with the highest posterior class probability.
In [1]:
# Try it out! (Uncomment the code in this cell to run it.)
# Calculating conditional probability using Bayes' theorem
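The body of that cell is not included in this excerpt. A minimal sketch of what such a calculation could look like is given below; the function name and the probability values are purely illustrative:
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes_theorem(p_b_given_a, p_a, p_b):
    """Return the conditional probability P(A|B)."""
    return p_b_given_a * p_a / p_b

# Made-up example values: P(spam) = 0.2, P("offer" | spam) = 0.6, P("offer") = 0.15
print(bayes_theorem(p_b_given_a=0.6, p_a=0.2, p_b=0.15))   # 0.8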
Displaying the first 5 rows gives us an idea of how the data is arranged in the table. This helps us estimate which features are necessary and which are not. Note: the label "ham" denotes non-spam emails.
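For reference, the loading-and-inspection cell referred to above would look roughly like this; the file name and encoding are assumptions, not taken from the original notebook:
# Load the spam dataset and show its first 5 rows
import pandas as pd

data = pd.read_csv('spam.csv', encoding='latin-1')   # assumed file name/encoding
print(data.head())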
Data Preprocessing
What is Data Preprocessing?
In simpler terms, data preprocessing refers to cleaning the data. It is like chopping and cleaning the veggies before cooking them.
Definition: Data preprocessing is the act of cleaning, converting, and organizing raw data so that it can be fed into a machine learning or data analysis algorithm in a more usable and structured shape.
Why is it necessary?
Quality assurance: Raw data may have errors, inconsistencies, or missing numbers.
By addressing these challenges, preprocessing ensures data quality.
Better Results: Accurate, dependable insights are generated by good data. Clean,
well-organized data helps algorithms function better.
Feature Engineering: By combining existing features, you can construct new, useful
ones that improve the model's capacity to grasp the data.
Reduced Noise: Outliers or extreme values might cause results to be distorted.
Preprocessing assists in identifying and dealing with them.
Standardization: Different data sources may have varying units or scales. Data is
more similar after preprocessing.
Missing Values: Algorithms may struggle to handle missing values. Preprocessing
aids in the filling or removal of missing data.
Efficiency: Preparing data correctly saves time and computational resources during
analysis.
In [5]:
# Drop the columns with NaN values
data = data.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
We do not need the Not-a-Number (NaN) values, as they do not
provide any insights into the data or impact other features. Therefore,
we are dropping them.
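Before the split, the notebook defines the feature matrix X and the label vector y; that cell is not shown in this excerpt. A rough sketch, assuming the label column is 'v1' and the message column is 'v2' (the usual layout of this spam dataset), using bag-of-words counts:
# Define labels and bag-of-words features (column names are assumptions)
from sklearn.feature_extraction.text import CountVectorizer

y = data['v1']                              # 'ham' or 'spam'
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['v2'])    # term-count matrix for the email text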
In [6]:
# Split the data into training and testing sets (80% training, 20% testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Train the Classifier (Multinomial Naive Bayes)
In [9]:
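The training code itself is not reproduced in this excerpt; a minimal sketch (the variable name classifier is an assumption) is:
# Train a Multinomial Naive Bayes classifier on the vectorized emails
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)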
MultinomialNB()
Make Predictions on the Test Data
In this step, we evaluate the model by checking how accurately it predicts outcomes on new, unseen data.
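The prediction and evaluation cell is not shown in the excerpt; a sketch that defines the quantities printed below (the variable names match the print statements) could be:
# Predict on the test set and compute evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)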
In [13]:
print(f"Accuracy: {accuracy:.2f}")print("Confusion
Matrix:")print(conf_matrix)print("Classification Report:")print(classification_rep)
Accuracy: 0.98
Confusion Matrix:
[[963 2]
[ 16 134]]
Classification Report:
precision recall f1-score support
# Count the number of spam and non-spam emails in the test set
spam_counts = y_test.value_counts()