Decision Tree 2
Classification is a two-step process in machine learning: a learning step and a prediction step. In the learning step, a model is developed from the given training data. In the prediction step, the model is used to predict the response for new data. The Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.
Decision trees classify the examples by sorting them down the tree from the root to
some leaf/terminal node, with the leaf/terminal node providing the classification of
the example.
Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test case. This
process is recursive in nature and is repeated for every subtree rooted at the new
node.
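As a minimal illustration of this root-to-leaf sorting, here is a small sketch using scikit-learn on a toy dataset (this is not the article's data, just an example):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# predict() routes each example from the root down through attribute tests to a
# leaf node; decision_path() exposes the exact root-to-leaf path that was taken.
sample = iris.data[:1]
print(clf.predict(sample))
print(clf.decision_path(sample))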
Entropy
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a fair coin is an example of an action whose outcome is completely random.
As the entropy curve for a binary outcome shows, the entropy H(X) is zero when the probability is either 0 or 1. Entropy is maximal when the probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.
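The entropy curve described above can be reproduced with a short sketch (the plotting code here is an illustration, not part of the original article):

import numpy as np
import matplotlib.pyplot as plt

# Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), e.g. for a biased coin flip.
p = np.linspace(0.001, 0.999, 200)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("Probability of the positive class")
plt.ylabel("Entropy H(X)")
plt.title("Entropy is 0 at p = 0 or p = 1 and maximal at p = 0.5")
plt.show()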
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting.
Mathematically, entropy for one attribute is represented as:

E(S) = -∑ p_i log2(p_i), summed over the classes i = 1, ..., c

where S is the current state and p_i is the probability of an event i of state S, or the percentage of class i in a node of state S.
Mathematically, entropy for multiple attributes is represented as:

E(T, X) = ∑ P(c) E(c), summed over the values c of the attribute X

where T is the current state and X is the attribute being evaluated.
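As a minimal sketch (not part of the original article's code), both formulas can be computed directly from class labels; the toy columns below are made up for illustration:

import numpy as np
import pandas as pd

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the classes present in the node.
    probs = pd.Series(labels).value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def entropy_given_attribute(df, attribute, target):
    # E(T, X) = sum over values c of attribute X of P(c) * E(subset where X == c).
    total = len(df)
    return sum((len(subset) / total) * entropy(subset[target])
               for _, subset in df.groupby(attribute))

df = pd.DataFrame({"Outlook": ["Sunny", "Sunny", "Rain", "Rain"],
                   "Play":    ["No",    "No",    "Yes",  "Yes"]})
print(entropy(df["Play"]))                              # 1.0 bit
print(entropy_given_attribute(df, "Outlook", "Play"))   # 0.0, Outlook separates the classes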
Information Gain
Information gain or IG is a statistical property that measures how well a given
attribute separates the training examples according to their target classification.
Constructing a decision tree is all about finding an attribute that returns the highest
information gain and the smallest entropy.
Information gain is a decrease in entropy: it computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset based on the given attribute values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain.
Mathematically, IG is represented as:

IG = entropy(before) - ∑ w_j entropy(j, after), summed over j = 1, ..., K

where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and w_j is the proportion of examples that fall into subset j.
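A minimal, self-contained sketch of this computation (the data frame and column names are made up for illustration):

import numpy as np
import pandas as pd

def entropy(labels):
    probs = pd.Series(labels).value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, attribute, target):
    # entropy(before) minus the weighted average entropy of the K subsets
    # produced by splitting on `attribute`.
    before = entropy(df[target])
    total = len(df)
    after = sum((len(s) / total) * entropy(s[target])
                for _, s in df.groupby(attribute))
    return before - after

df = pd.DataFrame({"Wind": ["Weak", "Strong", "Weak", "Strong"],
                   "Play": ["Yes",  "No",     "Yes",  "Yes"]})
print(information_gain(df, "Wind", "Play"))   # roughly 0.31 bits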
Gini Index
You can understand the Gini index as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values.
Mathematically, the Gini index is represented as:

Gini = 1 - ∑ p_i², summed over the classes i
The Gini index works with the categorical target variable "Success" or "Failure" and performs only binary splits. With the definition above, the lower the Gini index, the higher the homogeneity (purity) of a node.
Steps to calculate the Gini index for a split (sketched in code below):
1. Calculate the Gini index for each sub-node using the formula above with success (p) and failure (q): 1 - (p² + q²).
2. Calculate the Gini index for the split using the weighted Gini score of each node of that split.
CART (Classification and Regression Tree) uses the Gini index method to create
split points.
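A short sketch of the two-step Gini calculation above; the (success, failure) counts per sub-node are hypothetical:

def gini(p_success, p_failure):
    # Gini index of a node: 1 - (p^2 + q^2).
    return 1.0 - (p_success ** 2 + p_failure ** 2)

def gini_for_split(nodes):
    # Weighted Gini index of a split; `nodes` is a list of (n_success, n_failure) counts.
    total = sum(s + f for s, f in nodes)
    score = 0.0
    for s, f in nodes:
        n = s + f
        score += (n / total) * gini(s / n, f / n)
    return score

print(gini_for_split([(8, 2), (3, 7)]))   # lower values indicate purer sub-nodes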
Gain ratio
Information gain is biased towards choosing attributes with a large number of values
as root nodes. It means it prefers the attribute with a large number of distinct values.
C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information
gain that reduces its bias and is usually the best option. Gain ratio overcomes the
problem with information gain by taking into account the number of branches that
would result before making the split. It corrects information gain by taking the
intrinsic information of a split into account.
Consider a dataset of users and their movie genre preferences based on variables like gender, age group, rating, and so on. With the help of information gain, you split at 'Gender' (assuming it has the highest information gain); now the variables 'Age Group' and 'Rating' could be equally important, and the gain ratio will penalize the variable with more distinct values, which helps us decide the split at the next level.
Mathematically, the gain ratio is represented as:

Gain Ratio = Information Gain / Split Information, with Split Information = -∑ w_j log2(w_j), summed over j = 1, ..., K

where the information gain is computed as above for the dataset "before" the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and w_j is the proportion of examples in subset j (the intrinsic information of the split).
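A minimal sketch of the gain-ratio computation described above; the data frame, column names, and values are made up for illustration:

import numpy as np
import pandas as pd

def entropy(labels):
    probs = pd.Series(labels).value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def gain_ratio(df, attribute, target):
    total = len(df)
    info_gain = entropy(df[target])
    split_info = 0.0
    for _, subset in df.groupby(attribute):
        w = len(subset) / total
        info_gain -= w * entropy(subset[target])   # weighted entropy after the split
        split_info -= w * np.log2(w)               # intrinsic information of the split
    return info_gain / split_info if split_info > 0 else 0.0

df = pd.DataFrame({"Gender": ["M", "F", "M", "F"],
                   "Genre":  ["Action", "Drama", "Action", "Drama"]})
print(gain_ratio(df, "Gender", "Genre"))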
Reduction in Variance
Reduction in variance is an algorithm used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split: the split with the lower variance is selected as the criterion to split the population.

Variance = ∑ (X - X̄)² / n

where X̄ is the mean of the values, X is an actual value, and n is the number of values.
Steps to calculate variance (a sketch follows the list):
1. Calculate variance for each node.
2. Calculate variance for each split as the weighted average of each node variance.
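A minimal sketch of the two steps above; the node values below are hypothetical continuous targets:

import numpy as np

def variance(values):
    # Step 1: mean of squared deviations from the node mean, i.e. sum((X - X_bar)^2) / n.
    x = np.asarray(values, dtype=float)
    return float(np.mean((x - x.mean()) ** 2))

def variance_for_split(nodes):
    # Step 2: weighted average of each sub-node's variance.
    total = sum(len(n) for n in nodes)
    return sum((len(n) / total) * variance(n) for n in nodes)

left, right = [10, 12, 11], [30, 28, 29, 31]
print(variance_for_split([left, right]))   # lower values indicate a better split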
Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It finds the statistical significance of the differences between sub-nodes and the parent node, measured as the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
It works with the categorical target variable "Success" or "Failure" and can perform two or more splits. The higher the value of chi-square, the higher the statistical significance of the differences between a sub-node and its parent node.
Mathematically, chi-square is represented as:

χ² = ∑ (Observed - Expected)² / Expected

with the sum taken over the Success and Failure counts of each node.
Steps to calculate chi-square for a split (a sketch follows the list):
1. Calculate the chi-square for an individual node by computing the deviation for both Success and Failure.
2. Calculate the chi-square of the split as the sum of the chi-square values of Success and Failure for each node of the split.
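A small sketch of the two steps above, assuming the expected Success/Failure counts in each sub-node come from the parent node's class distribution (the observed counts are hypothetical):

def chi_square_for_node(obs_success, obs_failure, exp_success, exp_failure):
    # Sum of squared standardized differences between observed and expected counts.
    return ((obs_success - exp_success) ** 2 / exp_success
            + (obs_failure - exp_failure) ** 2 / exp_failure)

def chi_square_for_split(nodes, parent_success_rate):
    # Step 2: add up the chi-square of Success and Failure for every sub-node.
    total = 0.0
    for obs_s, obs_f in nodes:
        n = obs_s + obs_f
        exp_s = n * parent_success_rate
        exp_f = n * (1 - parent_success_rate)
        total += chi_square_for_node(obs_s, obs_f, exp_s, exp_f)
    return total

# Parent node with a 50% success rate, split into two sub-nodes of 10 examples each.
print(chi_square_for_split([(8, 2), (3, 7)], parent_success_rate=0.5))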
Pruning
Pruning removes the branches of a tree that contribute little to classification, which reduces the complexity of the model and therefore overfitting. For example, the 'Age' attribute on the left-hand side of a tree may be pruned when it carries more importance on the right-hand side, hence removing overfitting.
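As a hedged sketch (not the article's code), scikit-learn can prune a tree via cost-complexity pruning; the ccp_alpha value and the toy dataset below are arbitrary illustrations:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unpruned tree versus a cost-complexity-pruned one; a larger ccp_alpha
# removes more low-importance branches and yields a shallower tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
print("Unpruned depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("Pruned depth:  ", pruned_tree.get_depth(), "leaves:", pruned_tree.get_n_leaves())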
Random Forest
Random Forest is an example of ensemble learning, in which we combine multiple
machine learning algorithms to obtain better predictive performance.
Why the name “Random”?
Two key concepts that give it the name random:
1. A random sampling of training data set when building trees.
2. Random subsets of features considered when splitting nodes.
A technique known as bagging is used to create an ensemble of trees where multiple
training sets are generated with replacement.
In the bagging technique, N samples are drawn from the dataset using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each of the samples. Finally, the resulting predictions are combined in parallel using voting (for classification) or averaging (for regression).
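As a minimal sketch of the idea (not the article's code), scikit-learn's RandomForestClassifier bags trees in exactly this way, with bootstrap samples and random feature subsets; the toy dataset here is only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree is trained on a bootstrap sample (random sampling with replacement)
# and considers a random subset of features at every split; the trees' votes
# are combined into the final prediction.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)
print("Out-of-bag accuracy:", forest.oob_score_)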
Dataset
We will take only Age and EstimatedSalary as our independent variables X, because other features like Gender and User ID are irrelevant and have no effect on a person's purchasing capacity. Purchased is our dependent variable y.
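The loading step is not shown in this excerpt; assuming the data sits in a CSV with the columns User ID, Gender, Age, EstimatedSalary and Purchased, a minimal sketch would be (the file name is an assumption):

import pandas as pd

# Hypothetical file name; adjust the path to wherever your copy of the dataset lives.
data = pd.read_csv('Social_Network_Ads.csv')
print(data.head())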
feature_cols = ['Age', 'EstimatedSalary']
X = data.iloc[:, [2, 3]].values   # Age and EstimatedSalary
y = data.iloc[:, 4].values        # Purchased
The next step is to split the dataset into training and test.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
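The model-fitting step itself is not reproduced in this excerpt; a minimal sketch, assuming an entropy-based tree with a depth limit as the "optimized" model (the hyperparameters here are assumptions, not the article's exact settings), might look like this:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed hyperparameters; the original article's exact settings are not shown here.
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))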
Well, the classification rate increased to 94%, which is better accuracy than the
previous model.
Now let us again visualize the pruned Decision tree after optimization.
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Export the fitted tree to DOT format and render it as an image.
dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Conclusion
In this article, we have covered a lot of details about decision trees: how they work; attribute selection measures such as information gain, gain ratio, and the Gini index; and building, visualizing, and evaluating a decision tree model on a supermarket dataset using the Python scikit-learn package, as well as optimizing decision tree performance with parameter tuning.