
INSTITUTE OF INFORMATION TECHNOLOGY & MANAGEMENT
Accredited ‘A’ Grade by NAAC & Recognised U/s 2(f) of UGC Act
Rated Category ‘A+’ by SFRC & ‘A’ by JAC, Govt. of NCT of Delhi
Approved by AICTE & Affiliated to GGS Indraprastha University, New Delhi

Machine Learning with Python

Programme : BCA
Semester : V
Subject Code : BCAT311
Subject : Machine Learning with Python
Topic : Decision Tree, Naïve Bayes, Support Vector Machine Classifier, Rule-Based Classifier
Faculty : Ms. Shilpi Bansal
List of Topics

Classification: Basic Concepts
Decision Tree Induction
Naïve Bayes Classification Methods
Support Vector Machines Classifier
Rule-Based Classifier
Topic: Classification
Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set

Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction

Classification
  Predicts categorical class labels (discrete or nominal)
  Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data

Numeric Prediction
  Models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
  Credit/loan approval
  Medical diagnosis: is a tumor cancerous or benign?
  Fraud detection: is a transaction fraudulent?
  Web page categorization: which category does a page belong to?
Classification—A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    The accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set is independent of the training set (otherwise overfitting occurs)
  If the accuracy is acceptable, use the model to classify new data
Process (1): Model Construction

The training data is fed to a classification algorithm, which produces a classifier (model).

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is first evaluated on testing data and then applied to unseen data.

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Decision Tree Induction: An Example

Training data set: Buys_computer
The data set follows the style of Quinlan’s ID3 (Playing Tennis) example.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Resulting tree:

age?
  <=30  → student?
            no  → no
            yes → yes
  31…40 → yes
  >40   → credit_rating?
            excellent → no
            fair      → yes
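As a hands-on illustration (not part of the original slides), the same Buys_computer table can be handed to scikit-learn's DecisionTreeClassifier. The ordinal encoding and the criterion="entropy" setting are assumptions made for this sketch; the entropy criterion mirrors ID3's information-gain idea.

```python
# Illustrative sketch: fitting a decision tree to the Buys_computer data with scikit-learn.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])

# Encode the categorical attributes as integers before fitting
X = OrdinalEncoder().fit_transform(df.drop(columns="buys_computer"))
y = df["buys_computer"]

# criterion="entropy" selects splits by information gain, as in ID3/C4.5
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
```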
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
  The tree is constructed in a top-down, recursive, divide-and-conquer manner
  At the start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning
Brief Review of Entropy

[Figure: entropy curve for the two-class case, m = 2]
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

(Using the Buys_computer training data shown earlier.)

Class P: buys_computer = "yes" (9 tuples)
Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that "age <= 30" covers 5 out of 14 samples, with 2 yes's and 3 no's.

Hence:
  Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly:
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
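The slide's arithmetic for the "age" attribute can be reproduced with a few lines of Python. This is an illustrative sketch of my own, using the class counts per age partition taken from the table above.

```python
# Illustrative sketch: information gain for the "age" attribute of Buys_computer.
from math import log2

def info(counts):
    """Entropy I(p1, p2, ...) for a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info((9, 5))                     # I(9,5) = 0.940
partitions = [(2, 3), (4, 0), (3, 2)]     # age <=30, 31..40, >40 as (yes, no) counts
n = sum(sum(p) for p in partitions)       # 14 tuples in total

info_age = sum(sum(p) / n * info(p) for p in partitions)   # 0.694
gain_age = info_D - info_age                                # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```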
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
  Sort the values of A in increasing order
  Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
  The point with the minimum expected information requirement for A is selected as the split point for A

Split:
  D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)

The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):

  SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

Example:
  gain_ratio(income) = 0.029 / 1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute.
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index gini(D) is defined as
  gini(D) = 1 - Σ_{j=1}^{n} p_j²
where p_j is the relative frequency of class j in D.

If a data set D is split on A into two subsets D1 and D2, the Gini index of the split is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node.
(All possible splitting points for each attribute need to be enumerated.)
Computation of Gini Index

Example: D has 9 tuples with buys_computer = "yes" and 5 with "no":
  gini(D) = 1 - (9/14)² - (5/14)² = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since this split has the lowest Gini index.

In CART, all attributes are assumed to be continuous-valued; other tools (e.g., clustering) may be needed to get the possible split values.
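A short illustrative sketch (mine, not from the slides) of the Gini arithmetic above. The per-partition class counts (7 yes / 3 no for {low, medium} and 2 yes / 2 no for {high}) are taken from the Buys_computer table; treat them as an assumption if you recount differently.

```python
# Illustrative sketch: the Gini computations from the slide above.
def gini(counts):
    """Gini index 1 - sum(p_j^2) for a tuple of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini((9, 5))                      # 0.459

# Split on income: D1 = {low, medium} (10 tuples), D2 = {high} (4 tuples)
d1, d2 = (7, 3), (2, 2)                    # (yes, no) counts per partition
gini_split = 10/14 * gini(d1) + 4/14 * gini(d2)
print(round(gini_D, 3), round(gini_split, 3))
```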
Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
  Information gain: biased towards multivalued attributes
  Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions
Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Bayes' Theorem: Basics

Total probability theorem:
  P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)

Bayes' theorem:
  P(H | X) = P(X | H) P(H) / P(X)

Let X be a data sample ("evidence"); its class label is unknown.
Let H be the hypothesis that X belongs to class C.
Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X.
P(H) (prior probability): the initial probability.
  E.g., X will buy a computer, regardless of age, income, ...
P(X): the probability that the sample data is observed.
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds.
  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income.
Prediction Based on Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H | X) = P(X | H) P(H) / P(X)

Informally, this can be viewed as
  posterior = likelihood × prior / evidence

X is predicted to belong to class C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all k classes.

Practical difficulty: it requires initial knowledge of many probabilities, which involves significant computational cost.
Why is it called Naïve Bayes?

Naïve: it is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.

Bayes: it is called Bayes because it depends on the principle of Bayes' theorem.
Naïve Bayes Classifier: Training Dataset

Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayes Classifier: An Example

(Using the Buys_computer training data above.)

P(C_i):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357

Compute P(X|C_i) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no")  = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no")  = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no")  = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|C_i):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no")  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|C_i) × P(C_i):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no")  × P(buys_computer = "no")  = 0.007

Therefore, X is predicted to belong to class buys_computer = "yes".
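The same arithmetic can be scripted; the following sketch is my own illustration and simply multiplies out the conditional probabilities listed above.

```python
# Illustrative sketch: the naive Bayes arithmetic from the slide above.
priors = {"yes": 9/14, "no": 5/14}

# P(attribute value | class), as counted from the Buys_computer table
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    likelihood = 1.0
    for feature in x:
        likelihood *= cond[c][feature]
    scores[c] = likelihood * priors[c]      # P(X|C_i) * P(C_i)

print(scores)                               # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))          # 'yes'
```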
Avoiding the Zero-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i)

Example: suppose a dataset with 1000 tuples has income = low (0), income = medium (990), and income = high (10).

Use the Laplacian correction (or Laplacian estimator): add 1 to each case:
  Prob(income = low)    = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high)   = 11/1003
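A minimal sketch of that correction (my own illustration of the add-one rule on the counts above):

```python
# Illustrative sketch: Laplacian (add-one) correction for the income counts above.
from collections import Counter

counts = Counter({"low": 0, "medium": 990, "high": 10})
k = len(counts)                    # number of distinct income values (3)
n = sum(counts.values())           # 1000 tuples

smoothed = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(smoothed)   # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```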
Advantages and Disadvantages

Advantages of the Naïve Bayes classifier:
  Naïve Bayes is one of the fastest and simplest ML algorithms for predicting the class of a dataset.
  It can be used for binary as well as multi-class classification.
  It performs well in multi-class prediction compared to other algorithms.
  It is a popular choice for text classification problems.

Disadvantages of the Naïve Bayes classifier:
  Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.
Applications

It is used for credit scoring.
It is used in medical data classification.
It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.
It is used in text classification, such as spam filtering and sentiment analysis.
Types of Naïve Bayes

There are three types of Naïve Bayes model:
  Gaussian: the Gaussian model assumes that features follow a normal distribution. If predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
  Multinomial: the Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as sports, politics, or education. The classifier uses the frequency of words as the predictors.
  Bernoulli: the Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
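For reference, scikit-learn exposes these three variants directly. The snippet below is an illustrative sketch on made-up toy data, not part of the original slides.

```python
# Illustrative sketch: the three naive Bayes variants in scikit-learn on toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# GaussianNB: continuous features assumed normally distributed per class
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]]))

# MultinomialNB: count features (e.g., a bag-of-words document matrix)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# BernoulliNB: binary presence/absence features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))
```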
Topic: Support Vector Machines Classifier
Introduction

Support Vector Machines (SVM) are a powerful class of supervised machine learning algorithms primarily used for classification and regression tasks.
They are particularly well suited for tasks where the goal is to separate data points into different classes or groups.
At its core, SVM aims to find the optimal hyperplane that best separates data points of different classes in a high-dimensional space.
This hyperplane serves as the decision boundary, allowing SVM to classify new, unseen data points based on their position relative to this boundary.
Features of SVM

Maximizing margin: SVM's primary objective is to maximize the margin, which is the distance between the decision boundary (hyperplane) and the nearest data points from each class. This maximization helps SVM achieve robust generalization to new data.
Support vectors: support vectors are the data points that lie closest to the decision boundary. They play a crucial role in defining the hyperplane and determining the SVM's performance.
Linear and non-linear classification: SVM can be used for both linear and non-linear classification tasks. In the linear case, it finds a straight-line hyperplane, while in the non-linear case, it employs kernel functions to handle more complex decision boundaries.
Linear Separability

Linear separability is a fundamental concept in Support Vector Machines (SVM) and plays a pivotal role in understanding how SVM works.
It refers to the property of data points being separable into distinct classes by a straight line (in two dimensions) or a hyperplane (in higher dimensions).
Why is Linear Separability Important?

SVM's objective: SVM aims to find the best hyperplane that separates data points into classes. Linear separability ensures that a clear decision boundary (hyperplane) exists.
Margin maximization: linearly separable data allows SVM to maximize the margin effectively. The margin is the distance between the decision boundary and the nearest data points from each class. A larger margin leads to better generalization.
Non-Linear Separability

It is important to note that not all datasets are linearly separable.
In cases where the data is not linearly separable, SVM can still be used effectively by employing kernel functions, which allow it to handle more complex decision boundaries.
Linearly Separable vs. Non-Linearly Separable Data Points

[Figure: examples of linearly separable and non-linearly separable data points]
The Margins

In Support Vector Machines (SVM), "margins" refer to the region or space between the decision boundary (hyperplane) and the nearest data points from each class.
Understanding margins is crucial, as they are central to SVM's optimization process.

Why are margins important in SVM?
  Maximizing separation: SVM's primary goal is to maximize the margin. A larger margin means a better separation of data points, leading to improved classification performance.
  Robust generalization: a wide margin results in a more robust model that can generalize better to new, unseen data. It helps reduce overfitting, where the model performs well on training data but poorly on test data.
  Support vector identification: the support vectors are the data points closest to the hyperplane and define the margin. They are critical in determining the decision boundary and the model's performance.
Support Vectors

Support vectors are a crucial concept in Support Vector Machines (SVM).
They are the subset of data points that lie closest to the decision boundary (hyperplane) and play a pivotal role in defining and optimizing the SVM model.

Characteristics of support vectors:
  Closest to the hyperplane: support vectors are the data points that lie at the shortest distance from the decision boundary. They are essentially the "border" points.
  Determine the margin: the positions of the support vectors dictate the width of the margin. SVM aims to maximize this margin while ensuring that the support vectors remain correctly classified.
  Robustness: since only the support vectors determine the boundary, SVM is relatively robust to outliers and noisy data, because such anomalies are unlikely to become support vectors.
Hyperplane

In Support Vector Machines (SVM), the term "hyperplane" plays a pivotal role, as it defines the decision boundary that separates data points into different classes.

What is a hyperplane?
  A hyperplane, in SVM, is a flat surface in the feature space that acts as the decision boundary.
  In two dimensions (2D), a hyperplane is simply a straight line.
  In three dimensions (3D), it becomes a flat plane.

Purpose of the hyperplane:
  The hyperplane's primary purpose is to separate data points from different classes.
  It serves as the foundation for SVM's classification process.
  Data points are classified based on which side of the hyperplane they fall on.

[Figure: 2D and 3D representations of a hyperplane]
Linear SVM

Linear Support Vector Machines (SVM) are a fundamental concept in machine learning, particularly in classification tasks.
They are designed to work with linearly separable data, where a straight line (in 2D) or a hyperplane (in higher dimensions) can effectively separate the data points into distinct classes.

Key characteristics of linear SVM:
  Straight line or hyperplane: linear SVM seeks to find a straight line (in 2D) or a hyperplane (in higher dimensions) that best separates the data.
  Maximizing margin: it aims to maximize the margin, the distance between the hyperplane and the nearest data points from each class. A larger margin leads to better generalization.
  Efficiency: linear SVM is computationally efficient and works well with high-dimensional data.
Advantages of Linear SVM
Simplicity: Linear SVM is conceptually
straightforward and easy to implement.
Efficiency: It works efficiently with large datasets
and high-dimensional feature spaces.
Generalization: Linear SVM often generalizes well,
especially when the data is linearly separable.
Linear SVM is suitable when the data is linearly
separable or when simplicity and efficiency are
essential for the task.
Limitations of Linear SVM
Limited for Non-Linear Data: Linear SVM performs
poorly when data is not linearly separable, as it
cannot capture complex decision boundaries.
Sensitivity to Outliers: It can be sensitive to
outliers, as outliers may affect the position of the
hyperplane.
Less Expressive: Compared to non-linear SVM, it
has less expressive power to capture intricate
patterns in data.
Kernel Functions

In Support Vector Machines (SVM), kernel functions are essential tools that enable SVM to handle data that is not linearly separable.
They transform the data into a higher-dimensional space, making it possible to find a hyperplane that can separate non-linearly separable data.

Why use kernel functions?
  Handling non-linearity: many real-world datasets are not linearly separable, meaning a straight line or hyperplane cannot effectively separate the classes. Kernel functions allow SVM to handle complex, non-linear decision boundaries.
  Avoiding dimensionality issues: kernel functions enable SVM to operate implicitly in a higher-dimensional space without computing the transformation explicitly.
Types of Kernel Functions

Several types of kernel functions are commonly used in SVM:
  Linear kernel: used for linearly separable data. It does not transform the data; it uses the original feature space.
  Polynomial kernel: transforms the data into a polynomial feature space, allowing SVM to capture non-linear patterns.
  Radial Basis Function (RBF) kernel: one of the most popular choices. It corresponds to an infinite-dimensional feature space, making it highly flexible for capturing complex patterns.
  Sigmoid kernel: used for problems where the data distribution is not well known. It is often used in neural-network-inspired SVMs.
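As a quick illustration (not from the slides), scikit-learn's SVC accepts each of these kernels via its kernel parameter. The toy two-moons dataset below is an assumption made for the sketch; it is a standard non-linearly separable example.

```python
# Illustrative sketch: trying the common SVM kernels in scikit-learn on toy data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # non-linearly separable data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X_tr, y_tr)
    print(kernel, round(clf.score(X_te, y_te), 3))   # test accuracy per kernel
```

On data like this, the RBF and polynomial kernels typically outperform the linear kernel, which matches the motivation given above.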
Tuning Parameters

In Support Vector Machines (SVM), several key tuning parameters allow you to customize and fine-tune the model's behavior to achieve the best possible performance for your specific dataset.

C (cost parameter):
  Definition: the C parameter controls the trade-off between maximizing the margin and minimizing classification errors on the training data.
  Higher C: results in a smaller margin but fewer training errors, potentially leading to overfitting.
  Lower C: results in a larger margin but more training errors, potentially improving generalization.
Tuning Parameters
Gamma (Kernel Coefficient):
Definition: The gamma parameter determines how far
the influence of a single training example reaches.
Higher Gamma: Leads to a more complex, non-linear
decision boundary, potentially causing overfitting.
Lower Gamma: Results in a simpler, smoother decision
boundary, potentially improving generalization.
Tuning Parameters
Kernel Type
Definition: SVM can use different kernel functions (e.g.,
linear, polynomial, radial basis function) to transform
data into higher dimensions for non-linear classification.
Choice of Kernel: Depends on the nature of the data
and the problem.
Tuning Parameters
Degree (for Polynomial Kernel)
Definition: In the case of a polynomial kernel, the
degree parameter determines the degree of the
polynomial used for transformation.
Higher Degree: Results in a more complex polynomial
transformation, potentially overfitting.
Lower Degree: Leads to a simpler polynomial
transformation, potentially improving generalization.
Tuning Parameters
Class Weights
Definition: SVM allows you to assign different weights
to different classes to handle class imbalance issues.
Use Case: Useful when one class has significantly
fewer samples than others.
Tuning Parameters
Convergence Parameters
Definition: These parameters control the convergence
criteria for the SVM optimization algorithm.
Tolerance (tol): Sets the tolerance for stopping
criterion.
Maximum Iterations (max_iter): Determines the
maximum number of iterations for the solver to
converge.
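A hedged sketch of how these parameters are typically tuned together with a grid search; the parameter grid values and the breast-cancer dataset are arbitrary choices for illustration, not recommendations from the slides.

```python
# Illustrative sketch: tuning C, gamma and the kernel with cross-validated grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# tol and max_iter are the convergence parameters mentioned above
pipe = make_pipeline(StandardScaler(), SVC(tol=1e-3, max_iter=-1))
param_grid = {
    "svc__C": [0.1, 1, 10],             # margin vs. training-error trade-off
    "svc__gamma": ["scale", 0.01, 0.1], # reach of a single training example
    "svc__kernel": ["linear", "rbf"],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```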
Soft Margin SVM

In Support Vector Machines (SVM), we often encounter datasets that are not perfectly linearly separable, where no hyperplane can separate all data points without errors.
In such cases, SVM introduces the concept of a "soft margin."

What is a soft margin?
  A soft margin is a modification of the traditional SVM approach that allows some degree of misclassification of data points.
  It acknowledges the presence of noise or outliers in real-world datasets, making the model more robust.
When is a Soft Margin Used?
Noisy Data: Soft margin SVM is beneficial when the
dataset contains noisy or mislabeled data points that
are difficult to separate correctly.
Overlapping Classes: In situations where classes
overlap to some extent, a hard margin (perfect
separation) may not be feasible or result in an overly
complex model.
Imbalanced Datasets: Soft margin can be
advantageous when dealing with imbalanced
datasets, where one class significantly outnumbers
the other.
Trade-off Parameter (C)

The trade-off parameter C controls the balance between maximizing the margin and minimizing classification errors.
A smaller C allows a wider margin but permits more misclassifications, while a larger C leads to a narrower margin but fewer misclassifications.
Objective (soft margin SVM): minimize (1/2)||w||² + C × (sum of the slack variables ξ_i), i.e., a margin term plus C times the total classification error.
Benefits of Soft Margin SVM
Increased Robustness: Soft margin SVM can handle
imperfect datasets and generalize better to unseen
data.
Better Convergence: It is often more stable during
training as it accounts for minor variations in data.
Multi-Class SVM

Support Vector Machines (SVM) are naturally designed for binary classification, meaning they distinguish between two classes. However, in many real-world scenarios, we encounter problems with more than two classes.
Multi-class SVM extends binary SVM to handle such situations effectively.

Challenges in multi-class classification:
  In multi-class classification there are more than two classes, making the problem more complex.
  Traditional binary SVM cannot be applied directly to these situations.
Approaches for Multi-Class SVM
One-vs-Rest (OvR) or One-vs-All (OvA)
In the OvR approach, a separate SVM is trained for each
class against the rest.
For N classes, N different binary classifiers are trained.
During prediction, all N classifiers are used, and the class
with the highest confidence score is selected.
One-vs-One (OvO)
In the OvO approach, a binary classifier is trained for each
pair of classes.
For N classes, N(N-1)/2 classifiers are trained.
During prediction, each classifier votes for a class, and the
class with the most votes is the winner.
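Both strategies are available as wrappers in scikit-learn; the following sketch (mine, on the Iris data, chosen purely for illustration) shows the two decompositions side by side.

```python
# Illustrative sketch: one-vs-rest and one-vs-one multi-class SVMs.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # N = 3 binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # N(N-1)/2 = 3 classifiers
print(len(ovr.estimators_), len(ovo.estimators_))        # 3, 3
print(ovr.predict(X[:1]), ovo.predict(X[:1]))
```

With N = 3 the two strategies happen to train the same number of classifiers; for larger N, one-vs-one grows quadratically while one-vs-rest stays linear in N.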
Advantages of Multi-Class SVM
SVM's ability to maximize the margin is preserved in
multi-class scenarios, making it robust and reliable.
It can handle complex decision boundaries and is
effective in high-dimensional spaces.
Larger datasets often favor OvR, while OvO can be
preferable for smaller datasets.
Topic: Rule-Based Classifiers
Rule-Based Classifier

• Classify records by using a collection of "if…then…" rules
• Rule: (Condition) → y
  – where
    • Condition is a conjunction of attribute tests
    • y is the class label
  – LHS: rule antecedent or condition
  – RHS: rule consequent
  – Examples of classification rules:
    (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
    (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Rule-Based Classifier (Example)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

• A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Rule R1 covers the hawk => Bird
Rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy

• Coverage of a rule:
  – Fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
  – Fraction of records that satisfy both the antecedent and the consequent of the rule (over those that satisfy the antecedent)

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Rule: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
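The coverage and accuracy figures above can be checked with a short script. This is an illustrative sketch of my own, assuming the table is loaded as a pandas DataFrame with the columns shown.

```python
# Illustrative sketch: coverage and accuracy of the rule (Status = Single) -> No
import pandas as pd

df = pd.DataFrame({
    "Status": ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "Class":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

covered = df[df["Status"] == "Single"]            # records satisfying the antecedent
coverage = len(covered) / len(df)                 # 4/10 = 40%
accuracy = (covered["Class"] == "No").mean()      # 2/4 = 50%
print(coverage, accuracy)
```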
Decision Trees vs. Rules

From trees to rules:
• Easy: converting a tree into a set of rules
• One rule for each leaf:
  – The antecedent contains a condition for every node on the path from the root to the leaf
  – The consequent is the class assigned by the leaf
• Straightforward, but the rule set might be overly complex
Decision Trees vs. Rules

From rules to trees:
• More difficult: transforming a rule set into a tree
  – A tree cannot easily express disjunction between rules
• Example:
  If a and b then x
  If c and d then x
  – The corresponding tree contains identical subtrees (=> the "replicated subtree problem")

[Figure: a tree for a simple disjunction]
How does a Rule-Based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

A lemur triggers rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
Desiderata for Rule-Based Classifiers

• Mutually exclusive rules
  – No two rules are triggered by the same record.
  – This ensures that every record is covered by at most one rule.

• Exhaustive rules
  – There exists a rule for each combination of attribute values.
  – This ensures that every record is covered by at least one rule.

Together these properties ensure that every record is covered by exactly one rule.
Rules

Non-mutually exclusive rules
  A record may trigger more than one rule
  Solution? Use an ordered rule set

Non-exhaustive rules
  A record may not trigger any rule
  Solution? Use a default class
Ordered Rule Set

Rules are rank-ordered according to their priority (e.g., based on their quality)
  An ordered rule set is known as a decision list

When a test record is presented to the classifier
  It is assigned to the class label of the highest-ranked rule it triggers
  If none of the rules fire, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
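A minimal sketch of decision-list classification with a default class (my own illustration of the scheme above; the rule encoding and the "Unknown" default label are assumptions for the example):

```python
# Illustrative sketch: first-match classification over an ordered rule set.
RULES = [
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "Birds"),         # R1
    (lambda r: r["give_birth"] == "no" and r["live_in_water"] == "yes", "Fishes"),  # R2
    (lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm", "Mammals"),  # R3
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "Reptiles"),       # R4
    (lambda r: r["live_in_water"] == "sometimes", "Amphibians"),                    # R5
]
DEFAULT = "Unknown"   # default class for records no rule covers

def classify(record):
    for condition, label in RULES:
        if condition(record):        # the highest-ranked rule that fires wins
            return label
    return DEFAULT

turtle = {"blood_type": "cold", "give_birth": "no", "can_fly": "no", "live_in_water": "sometimes"}
print(classify(turtle))   # 'Reptiles' -- R4 fires before R5 in the ordering
```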
Building Classification Rules: Sequential Covering

1. Start from an empty rule
2. Grow a rule using some Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until a stopping criterion is met

[Figure: (i) original data, (ii) step 1 learns rule R1, (iii) step 2 adds rule R2, (iv) step 3 continues]

• This approach is called a covering approach because at each stage a rule is identified that covers some of the instances.
Example: generating a rule

[Figure: scatter plot of classes "a" and "b" with candidate splits at x = 1.2 and y = 2.6]

Possible rule set for class "b":
  If x <= 1.2 then class = b
  If x > 1.2 and y <= 2.6 then class = b
More rules could be added for a "perfect" rule set.
Asimple covering algorithm
Generates a rule by adding tests
that maximize rule’s accuracy
Similar to situation in decision
trees: problem of selecting an
attribute to split on.
But: decision tree inducer
maximizes overall purity
space of
Here, each new test (growing the examples
rule) reduces rule’s coverage.
rule so far

rule after
adding new
term
Selecting a test

Goal: maximize accuracy
  t: total number of instances covered by the rule
  p: positive examples of the class covered by the rule
  t - p: number of errors made by the rule

=> Select the test that maximizes the ratio p/t

We are finished when p/t = 1 or the set of instances cannot be split any further.
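A compact sketch of the covering loop with the p/t test-selection heuristic. This is my own illustration under simplifying assumptions (categorical attributes, records as dicts with a "class" key, no pruning), not the exact algorithm from the slides.

```python
# Illustrative sketch: sequential covering with greedy p/t rule growing.
def grow_rule(records, target, attributes):
    """Greedily add attribute=value tests that maximize p/t for the target class."""
    rule, covered = {}, records
    while covered and any(r["class"] != target for r in covered):
        best = None
        for attr in attributes:
            if attr in rule:
                continue
            for value in {r[attr] for r in covered}:
                subset = [r for r in covered if r[attr] == value]
                p = sum(r["class"] == target for r in subset)
                t = len(subset)
                if best is None or p / t > best[0]:
                    best = (p / t, attr, value, subset)
        if best is None:            # no attributes left to test
            break
        _, attr, value, covered = best
        rule[attr] = value          # add the winning test to the rule
    return rule, covered

def sequential_covering(records, target, attributes):
    """Learn rules one at a time, removing the records each rule covers."""
    rules, remaining = [], list(records)
    while any(r["class"] == target for r in remaining):
        rule, covered = grow_rule(remaining, target, attributes)
        rules.append(rule)
        remaining = [r for r in remaining if r not in covered]
    return rules
```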
Example: contact lenses data

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
young myope no reduced none
young myope no normal soft
young myope yes reduced none
young myope yes normal hard
young hypermetrope no reduced none
young hypermetrope no normal soft
young hypermetrope yes reduced none
young hypermetrope yes normal hard
pre-presbyopic myope no reduced none
pre-presbyopic myope no normal soft
pre-presbyopic myope yes reduced none
pre-presbyopic myope yes normal hard
pre-presbyopic hypermetrope no reduced none
pre-presbyopic hypermetrope no normal soft
pre-presbyopic hypermetrope yes reduced none
pre-presbyopic hypermetrope yes normal none
presbyopic myope no reduced none
presbyopic myope no normal none
presbyopic myope yes reduced none
presbyopic myope yes normal hard
presbyopic hypermetrope no reduced none
presbyopic hypermetrope no normal soft
presbyopic hypermetrope yes reduced none
presbyopic hypermetrope yes normal none
Example: contact lenses data (continued)

The numbers on the right of the figure show the fraction of "correct" instances in the set singled out by each choice.
In this case, correct means that the recommendation is "hard."

Modified rule and resulting data
The rule is not very accurate, getting only 4 out of the 12 instances it covers, so it needs further refinement.

Further refinement
Modified rule and resulting data
Should we stop here? Perhaps. But suppose we are going for exact rules, no matter how complex they become. Then we refine further.

Further refinement
The result