Basics of Machine Learning
• Data Everywhere
• Conclusion
• References
Introduction
• Machine learning has come to stay: it touches our daily life in many
ways and is often used without us even knowing it.
• Application areas:
- Effective web search
- Automatic translation of documents
- Image recognition
- Information security (e.g. access control)
- Number and Speech recognition
- Collaborative filtering
Introduction
Figure 1: Discipline of ML
Machine Learning
The primary aim of machine learning is to allow computers to learn
automatically, without human intervention or assistance, and to adjust their
actions accordingly.
Modern machine learning is a statistical process that starts with a body of data and
tries to derive a rule or procedure that explains the data or can predict future data.
In traditional programming, data and a program are fed into the computer, which produces the output.
In machine learning, data and the desired output are fed into the computer, which produces the program (the learned model).
Evaluation (how candidate models are scored):
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• etc.

Optimization (how the search for a good model is carried out):
• Combinatorial optimization, e.g. greedy search
• Convex optimization, e.g. gradient descent
• Constrained optimization, e.g. linear programming
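As an illustration of convex optimization, the sketch below minimises a simple squared-error objective with gradient descent. This is a minimal Python sketch; the synthetic data, learning rate and iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np

# Minimal gradient-descent sketch: fit y = w*x + b by minimising squared error.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 50)   # synthetic data, true w=3, b=1

w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    pred = w * x + b
    error = pred - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach w = 3, b = 1
```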
Types of Learning
• Supervised (inductive) learning: training data includes the desired outputs
• Unsupervised learning: training data does not include the desired outputs
• Semi-supervised learning: training data includes a few desired outputs
• Reinforcement learning: rewards are received from a sequence of actions
Inductive Learning
Feature Selection
Supervised learning
Unsupervised learning
ML in Practice
All Possible Values Imputation – replaces the missing data for a given attribute by all possible values of
that attribute, or by all possible values of the attribute within the class.
Regression Imputation – replaces missing data with estimated values: the other variables are used to
make a prediction, and the predicted value is substituted as if it were an actually observed value.
A forward fill or backfill method can be used to propagate the previous value forward, or the next value
backward, to replace a missing value.
$$Av = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (4)$$
where Av is the average, n is the number of items, and $\sum_{i=1}^{n} x_i$ is the total sum of all the numbers.
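A minimal pandas sketch of the imputation strategies above (mean/average imputation as in equation (4), forward fill and backfill); the column name and values here are hypothetical.

```python
import pandas as pd

# Toy frame with missing values; the column and its values are illustrative only.
df = pd.DataFrame({"duration": [0.011, None, 0.005, None, 0.003]})

# Mean (average) imputation, as in equation (4): Av = sum(x_i) / n
df["dur_mean"] = df["duration"].fillna(df["duration"].mean())

# Forward fill: propagate the previous observed value forward
df["dur_ffill"] = df["duration"].ffill()

# Backfill: propagate the next observed value backward
df["dur_bfill"] = df["duration"].bfill()

print(df)
```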
Table 1: Training-Sample Dataset
Id  duration  State  Dwin  trans_depth  spkts  dpkts  sbytes  dbytes  Rate  Output (Label)
1 0.000011 INT 0 0 2 0 496 0 90909.09 Normal
2 0.000008 INT 0 0 2 0 1762 0 125000 Attack
3 0.000005 INT 0 0 2 0 1068 0 200000 Attack
4 0.000006 INT 0 0 2 0 900 0 166666.7 Attack
5 0.00001 ACC 1 1 2 0 2126 0 100000 Attack
6 0.000003 INT 0 0 2 0 784 0 333333.3 Attack
7 0.000006 INT 0 0 2 0 1960 0 166666.7 Attack
8 0.000028 ACC 1 0 2 0 1384 0 35714.29 Attack
9 0 ACC 1 1 1 0 46 0 0 Attack
10 0 ACC 1 0 1 0 46 0 0 Attack
11 0 ACC 1 0 1 0 46 0 0 Attack
12 0 ACC 1 0 1 0 46 0 0 Attack
13 0.000004 INT 0 0 2 0 1454 0 250000 Normal
14 0.000007 INT 0 0 2 0 2062 0 142857.1 Normal
15 0.000011 INT 0 0 2 0 2040 0 90909.09 Normal
16 0.000004 INT 0 0 2 0 1052 0 250000 Normal
17 0.000003 INT 0 0 2 0 314 0 333333.3 Normal
18 0.00001 INT 0 0 2 0 1774 0 100000 Normal
19 0.000002 INT 0 0 2 0 1568 0 500000 Attack
20 0.000005 INT 0 0 2 0 1658 0 500050 Attack
Table 2: Binarization of the Training-Sample Dataset
Attribute  Minimum  Average       Maximum    Binarized to 0  Binarized to 1
dur        0        1.006756146   59.999989  below 1         above 1
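A short sketch of the binarization in Table 2, assuming, as the table suggests, that values below the attribute's average map to 0 and values above it map to 1; the data frame here is hypothetical.

```python
import pandas as pd

# Hypothetical 'dur' values; the real table uses the training-sample dataset.
df = pd.DataFrame({"dur": [0.000011, 0.000008, 2.5, 59.999989, 0.000003]})

threshold = df["dur"].mean()                          # average value (about 1 in Table 2)
df["dur_bin"] = (df["dur"] > threshold).astype(int)   # above average -> 1, below -> 0

print(threshold)
print(df)
```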
Feature selection removes noise (redundant features) and reduces the computational time.
There are different feature selection methods; the discussion here is limited to entropy-based
information gain and correlation-based feature selection (CFS).
$$\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \qquad (6)$$
where v ranges over the possible values of A, $S_v$ is the subset of S for which A = v, and $P_i$ is the
proportion of instances with class i; a class is a category to which an instance may belong
with a certain probability.
Feature Selection
The split information is calculated using equation (7), while the Gain Ratio is computed using
equation (8).
$$\text{SplitInfo}(S, A) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \qquad (7)$$
$$\text{GainRatio} = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(S, A)} \qquad (8)$$
Table 3 gives the subset grouping of the dataset for the features state, dwin, trans_depth
and the output label.
$$\text{Info}(D) = 0.403 + 0.522 = 0.925$$
Using equation (6), the gain values for each feature are calculated as follows:
1. $\text{Info}(\text{State}) = \frac{14}{20}\left(-\frac{7}{14}\log_2\frac{7}{14} - \frac{7}{14}\log_2\frac{7}{14}\right) + \frac{6}{20}\left(-\frac{6}{6}\log_2\frac{6}{6} - \frac{0}{6}\log_2\frac{0}{6}\right) = 0$
   $\text{Gain}(\text{State}) = \text{Info}(D) - \text{Info}(\text{State}) = 0.925 - 0 = 0.925$
2. $\text{Info}(\text{Dwin}) = \frac{14}{20}\left(-\frac{7}{14}\log_2\frac{7}{14} - \frac{7}{14}\log_2\frac{7}{14}\right) + \frac{6}{20}\left(-\frac{6}{6}\log_2\frac{6}{6} - \frac{0}{6}\log_2\frac{0}{6}\right) = 0$
   $\text{Gain}(\text{Dwin}) = \text{Info}(D) - \text{Info}(\text{Dwin}) = 0.925 - 0 = 0.925$
3. $\text{Info}(\text{Trans\_d}) = \frac{18}{20}\left(-\frac{11}{18}\log_2\frac{11}{18} - \frac{7}{18}\log_2\frac{7}{18}\right) + \frac{2}{20}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) = 0.8681$
   $\text{Gain}(\text{Trans\_d}) = \text{Info}(D) - \text{Info}(\text{Trans\_d}) = 0.925 - 0.8681 = 0.0569$
The split information for each feature is calculated using equation (7), and the Gain Ratio using equation (8):
1. $\text{SplitInfo}(\text{State}) = \frac{7}{14}\log_2\frac{7}{14} + \frac{6}{6}\log_2\frac{6}{6} = 0.5\log_2 0.5 + 1\log_2 1 = 0.5 \times (-1) + 0 = -0.5$
   $\text{GainRatio}(\text{State}) = \frac{\text{Gain}(\text{State})}{\text{SplitInfo}(\text{State})} = \frac{0.925}{-0.5} = -1.85$
2. $\text{SplitInfo}(\text{Dwin}) = \frac{7}{14}\log_2\frac{7}{14} + \frac{6}{6}\log_2\frac{6}{6} = 0.5\log_2 0.5 + 1\log_2 1 = 0.5 \times (-1) + 0 = -0.5$
   $\text{GainRatio}(\text{Dwin}) = \frac{\text{Gain}(\text{Dwin})}{\text{SplitInfo}(\text{Dwin})} = \frac{0.925}{-0.5} = -1.85$
3. $\text{SplitInfo}(\text{Trans\_d}) = \frac{11}{18}\log_2\frac{11}{18} + \frac{2}{2}\log_2\frac{2}{2} = 0.6111 \times (-0.7111) + 0 = -0.4345$; its magnitude 0.4345 is used in equation (8).
   $\text{GainRatio}(\text{Trans\_d}) = \frac{\text{Gain}(\text{Trans\_d})}{\text{SplitInfo}(\text{Trans\_d})} = \frac{0.0569}{0.4345} = 0.1310$
Table 6 summarizes the results; based on the Gain Ratio, Trans_d is the most valuable feature.

Table 6: Summary of Info and Gain values
Feature   State   Dwin    Trans_d
Info      0       0       0.8681
Gain      0.925   0.925   0.0569
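The entropy, information gain, split information and gain ratio above can also be computed programmatically. The sketch below uses the standard textbook formulas and a toy feature/label pairing, not the exact Table 3 grouping.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_and_ratio(feature_values, labels):
    """Information gain (eq. 6) and gain ratio (eq. 8) for one discrete feature."""
    n = len(labels)
    base = entropy(labels)
    info, split = 0.0, 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        p = len(subset) / n
        info += p * entropy(subset)      # weighted entropy of the partition
        split += -p * math.log2(p)       # split information (eq. 7)
    gain = base - info
    return gain, (gain / split if split else 0.0)

# Toy example: a binary feature against normal/attack labels.
state = ["INT"] * 14 + ["ACC"] * 6
label = ["normal"] * 7 + ["attack"] * 7 + ["attack"] * 6
print(info_gain_and_ratio(state, label))
```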
$$\text{CFS} = \frac{N\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left(N\sum X^2 - \left(\sum X\right)^2\right)\left(N\sum Y^2 - \left(\sum Y\right)^2\right)}} \qquad (9)$$
$$\text{CFS} = \frac{20(6) - (6)(6)}{\sqrt{\left(20(6) - (6)^2\right)\left(20(6) - (6)^2\right)}} = \frac{84}{84} = 1$$
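The CFS score in equation (9) is the Pearson correlation between a feature and the class. A quick check with hypothetical 0/1 vectors chosen to reproduce the sums in the worked example:

```python
import numpy as np

# Hypothetical binarized feature X and class label Y (0/1 vectors of length N = 20).
X = np.array([1] * 6 + [0] * 14)
Y = np.array([1] * 6 + [0] * 14)

N = len(X)
num = N * np.sum(X * Y) - np.sum(X) * np.sum(Y)
den = np.sqrt((N * np.sum(X**2) - np.sum(X)**2) * (N * np.sum(Y**2) - np.sum(Y)**2))
print(num / den)                 # 1.0: feature and class are perfectly correlated
print(np.corrcoef(X, Y)[0, 1])   # same result via numpy's built-in correlation
```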
$$c_{MAP} = \arg\max_{c \in C} P(x \mid c)\, P(c)$$
$\arg\max_{c \in C}$ means that we choose the $c \in C$ for which the corresponding probability is
maximal, and $C = \{c_1, \ldots, c_m\}$ is the set of class labels.
Bayesian Classifier
• On the naïve Bayes assumption that the attribute values $x_1, \ldots, x_n$ are
conditionally independent given the classification, we obtain
$$P(x_1, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)$$
and substituting this expression into the former equation gives the naïve Bayes classifier
$$c = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j)$$
• We need to maximize $P(X \mid C_i)P(C_i)$ for $i = 1, 2$. The prior probability of each class,
$P(C_i)$, can be computed from the training tuples.
It should be noted that the Laplace adjustment (add-one smoothing) is used in the computation of the
conditional probabilities.
Similarly,
$$P(X \mid \text{Category} = \text{intrusion}) = 0.997 \times 0.003 \times 0.003 \approx 8.973 \times 10^{-6}$$
Therefore, the naïve Bayesian classifier predicts Category = normal for tuple X.
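A minimal naïve Bayes sketch with Laplace (add-one) smoothing; the tiny categorical dataset and feature names below are illustrative, not the slide's actual tuples.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate the priors P(c) and per-feature value counts for P(x_i | c)."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[c][i][v] += 1
    return priors, cond

def predict(row, priors, cond, values_per_feature):
    """Pick the class maximising P(c) * prod_i P(x_i | c) with add-one smoothing."""
    best_c, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for i, v in enumerate(row):
            counts = cond[c][i]
            # Laplace adjustment: add 1 to every count
            p *= (counts[v] + 1) / (sum(counts.values()) + values_per_feature[i])
        if p > best_p:
            best_c, best_p = c, p
    return best_c, best_p

# Toy categorical data: (state, dwin) -> label; values are illustrative.
rows = [("INT", 0), ("INT", 0), ("ACC", 1), ("ACC", 1), ("INT", 0)]
labels = ["normal", "attack", "attack", "attack", "normal"]
priors, cond = train_naive_bayes(rows, labels)
print(predict(("INT", 0), priors, cond, values_per_feature=[2, 2]))
```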
Rough Set-Based Approach
• Rough set theory (RST) is a useful mathematical tool to deal with imprecise and
insufficient knowledge, find hidden patterns in data, and reduce dataset size.
• Rough sets are used to evaluate the significance of data and to ease the interpretation of results.
• RST contributes immensely through the concept of reducts: a reduct is a minimal subset of
attributes that preserves the predictive power of the full attribute set.
• RST is very effective in removing redundant features from discrete data sets.
RS - Theoretical Background
The equivalence relation induced by a subset of attributes B is called the B-indiscernibility relation,
denoted as IND(B); for example, B could be an attribute subset such as {..., Service, Flag}.
For any set X ⊆ U, the B-lower and B-upper approximations of X are defined as:
$$\underline{B}X = \{x \in U \mid [x]_B \subseteq X\}$$
$$\overline{B}X = \{x \in U \mid [x]_B \cap X \neq \emptyset\} \qquad (13)$$
RS - Theoretical Background
Given attributes A = C ∪ D with C ∩ D = ∅, the positive POS_C(D), negative NEG_C(D)
and boundary BND_C(D) regions for a given set of condition attributes C in relation to
IND(D) can be defined as
$$POS_C(D) = \bigcup_{X \in D^*} \underline{C}X$$
$$NEG_C(D) = U - \bigcup_{X \in D^*} \overline{C}X \qquad (14)$$
$$BND_C(D) = \bigcup_{X \in D^*} \overline{C}X - \bigcup_{X \in D^*} \underline{C}X$$
where D* denotes the set of equivalence classes of IND(D). POS_C(D) contains all objects of U that can be classified correctly into
the classes of D* using the attributes in C. The boundary region, BND_C(D), is the set of objects that can possibly, but not
certainly, be classified in this way. The negative region, NEG_C(D), is the set of objects that cannot be classified in this way at all.
For example, the IND sets (s4, s6) and (s1, s5) can be formed for attributes such as Service.
Decision rules are then presented in the form of "if ... then ..." rules.
In Table 7, for instance, there are no total dependencies on any of the attributes.
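A compact sketch of the lower and upper approximations (and hence the boundary region) on a hypothetical decision table; the object names and attribute values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical decision table: object -> (condition attribute values, decision)
table = {
    "s1": (("tcp", "INT"), "normal"),
    "s2": (("tcp", "INT"), "attack"),
    "s3": (("udp", "ACC"), "attack"),
    "s4": (("udp", "ACC"), "attack"),
}

def ind_classes(objs):
    """Partition objects into IND(B) equivalence classes by their condition values."""
    classes = defaultdict(set)
    for o, (cond, _) in objs.items():
        classes[cond].add(o)
    return list(classes.values())

X = {o for o, (_, d) in table.items() if d == "attack"}   # target concept: attack
classes = ind_classes(table)

lower = set().union(*[c for c in classes if c <= X])      # certainly attack
upper = set().union(*[c for c in classes if c & X])       # possibly attack
boundary = upper - lower
print(lower, upper, boundary)   # {'s3','s4'}  {'s1','s2','s3','s4'}  {'s1','s2'}
```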
$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
Given a new example to classify, the steps involved in the kNN algorithm are highlighted below:
- Calculate the distance between the query instance and all the training tuples.
- Sort the distances and determine the nearest neighbours based on the k-th minimum distance.

Applying this to the training samples:
- Calculate the distance between the query instance and all the training samples using the
Euclidean distance.
Table 8: Distance between query instance and all the training samples
- Sort the distances and determine the nearest neighbours based on the k-th minimum
distance.
Table 9: Sorted Distance based on k-th minimum distance
- Gather the categories of the nearest neighbours; note that Table 9 shows only the first 3
sorted records.
- Use the simple majority of the categories of the nearest neighbours as the prediction for the
query instance. There are 2 normal and 1 intrusion; since 2 > 1, the new sample is classified as normal.
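A minimal kNN sketch following the steps above (distance, sort, majority vote); the two-feature training points and the query are hypothetical rather than the Table 8 values.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 1: Euclidean distance from the query to every training sample.
    dists = [(math.dist(query, x), label) for x, label in training]
    # Step 2: sort by distance and keep the k nearest neighbours.
    nearest = sorted(dists)[:k]
    # Step 3: simple majority of the neighbours' categories.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training samples: (features, label)
training = [((0.2, 0.1), "normal"), ((0.3, 0.2), "normal"),
            ((0.9, 0.8), "intrusion"), ((0.1, 0.3), "normal")]
print(knn_classify((0.25, 0.15), training, k=3))   # -> "normal"
```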
Objects/Attributes X Y
A1 1 1
A2 2 1
A3 4 3
A4 5 4
Initial value of centroids: objects A1 and A2 are chosen as the first centroids, c1 and c2, denoting the
centres of group 1 and group 2, respectively.
Compute the object-centroid distances using the Euclidean distance to obtain the distance
matrix (columns correspond to objects A1..A4):

D0:
  0     1     3.61   5      c1 = (1, 1), group 1
  1     0     2.83   4.24   c2 = (2, 1), group 2

Each column in the distance matrix corresponds to one object; the first row gives the distance of each
object to the first centroid, and the second row its distance to the second centroid. Assigning each
object to its nearest centroid gives the group matrix

G0:
  1  0  0  0   group 1
  0  1  1  1   group 2
Group 1 has only one member, so its centroid remains c1 = (1, 1). Group 2
has three members, so its centroid becomes the average coordinate:
$$c_2 = \left(\frac{2 + 4 + 5}{3}, \frac{1 + 3 + 4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right)$$
Iteration 1, object-centroid distances: compute the distance of each object to the new
centroids and reassign each object to the group with the minimum distance. Based on the new
distance matrix (D1), object A2 is moved to Group 1:

G1:
  1  1  0  0   group 1
  0  0  1  1   group 2
Iteration 2, determine centroids: repeat step 4 to compute the new centroids based on the
grouping G1, compute the new distance matrix D2, and again assign each object to the group with the
minimum distance:

G2:
  1  1  0  0   group 1
  0  0  1  1   group 2

We obtain G2 = G1. Comparing the grouping of the last iteration with this iteration, the
computation of the k-means clustering has reached stability and no more
iterations are needed.
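A short k-means sketch that reproduces the four-point example above, with A1 and A2 as the initial centroids; only numpy is assumed.

```python
import numpy as np

# The four objects from the worked example and the two initial centroids.
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
centroids = X[:2].copy()                      # c1 = A1, c2 = A2

for iteration in range(10):
    # Distance matrix: rows = centroids, columns = objects (as in D0, D1, ...).
    D = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
    groups = D.argmin(axis=0)                 # assign each object to its nearest centroid
    new_centroids = np.array([X[groups == g].mean(axis=0) for g in range(len(centroids))])
    if np.allclose(new_centroids, centroids): # grouping is stable: stop iterating
        break
    centroids = new_centroids

print(groups)      # [0 0 1 1]: A1, A2 in group 1 and A3, A4 in group 2
print(centroids)   # final centroids, (1.5, 1) and (4.5, 3.5)
```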
Fuzzy C-means Algorithm
Step 1: Input the fuzzy coefficient m, the stopping threshold ε and the number of clusters, and randomly
initialise the membership matrix (U) so that it satisfies the membership constraint (the memberships of
each object across the clusters sum to 1).
Step 2: Compute the centroids using
$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$
Step 3: Compute the objective function
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \lVert x_i - c_j \rVert^2$$
and update each membership value $u_{ij}$ from the relative distances $\lVert x_i - c_j \rVert / \lVert x_i - c_k \rVert$,
$k = 1, \ldots, c$; repeat from Step 2 until the change in $J_m$ (or in U) falls below ε.
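A compact fuzzy c-means sketch following the three steps above (random U, centroid update, membership/objective update); numpy is assumed, and m = 2 and ε = 1e-5 are arbitrary choices.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random membership matrix U (N x c) whose rows sum to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids  c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: update memberships  u_ij = 1 / sum_k (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1))
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        if np.linalg.norm(U_new - U) < eps:   # stop when U barely changes
            U = U_new
            break
        U = U_new
    return centroids, U

# The same four points as the k-means example, now with soft memberships.
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
centroids, U = fuzzy_c_means(X)
print(centroids)
print(U.round(2))   # soft membership of each object in each cluster
```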
The Ensemble Classifier
An ensemble classifier uses a combination of a set of models or classifiers, each of which solves
the same original task, in order to obtain a better composite global classifier with more
accurate and reliable estimates or decisions than a single classifier (Ali and Pazzani,
1996).
Figure: An ensemble of classifiers M1, M2, ..., Mk is built from the training data set (e.g. by bagging or boosting); their combined predictions assign labels to new, unlabeled tuples.
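A minimal bagging-style ensemble sketch: k decision trees are trained on bootstrap resamples and vote by simple majority on new tuples. scikit-learn's DecisionTreeClassifier and a synthetic dataset are assumed here purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Illustrative two-class training data.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Bagging: train k models M1..Mk on bootstrap resamples of the training set.
models = []
for _ in range(15):
    idx = rng.integers(0, len(X), size=len(X))        # sample with replacement
    models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Combine the k predictions by majority vote to label new (unlabeled) tuples.
X_new = rng.normal(size=(5, 5))
votes = np.stack([m.predict(X_new) for m in models])  # shape (k, n_new)
predicted = (votes.mean(axis=0) >= 0.5).astype(int)
print(predicted)
```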
Artificial intelligence systems are commonly grouped into four categories:
(1) systems that think like humans (e.g., cognitive architectures and neural networks);
(2) systems that act like humans (e.g., pass the Turing test via natural language processing,
knowledge representation, automated reasoning, and learning);
(3) systems that think rationally (e.g., logic solvers, inference, and optimization); and
(4) systems that act rationally (e.g., intelligent software agents and embodied robots that
achieve goals via perception, planning, reasoning, learning, communicating, decision-
making, and acting).
References
• Adetunmbi, A.O., Falaki, S.O., Adewale, O.S. and Alese, B.K. (2007) A
Rough Set Approach for Detecting known and novel Network Intrusion,
Second International Conference on Application of Information and
Communication Technologies to Teaching, Research and Administrations
(AICTTRA, 2007), Ife, pp. 190 – 200.
• Adetunmbi A.O. (2008) Intrusion Detection based on Machine Learning
Techniques, PhD Thesis, Federal University of Technology, Akure.
• Diksha S. (2018) Decision Making: Meaning, Process and Factors.
http://www.businessmanagementideas.com/decision-making/decision-making-meaning-process-and-factors/3422
retrieved 8th Dec., 2018.
• Pawlak, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning About
Data. Kluwer Academic Publishing, Dordrecht.
• Elston S. and Rudin C. (2017) Data Science and Machine Learning
Essentials Video, Coursera
References
• Ayogu, I.I. (2008). Development of a Machine Translation System for
English, Igbo and Yoruba Languages, A Ph.D. Thesis submitted to the
Department of Computer Science, Federal University of Technology,
Akure.
• National Science and Technology Council Committee on Technology,
(2016). Preparing for the future of Artificial Intelligence
Thank you for Listening