
Developing Machine Translation Systems for Nigerian Languages:

The Federal University of Technology, Akure

Basics of Machine Learning

Adetunmbi, Adebayo Olusola (Ph.D., MCPN, MIEEE)


aoadetunmbi@futa.edu.ng
Presentation Outline
• Introduction

• Data Everywhere

• Basic Machine Learning Concepts

• Conclusion

• References
Introduction
• Machine learning has come to stay: it touches our daily lives in many
ways and is often used without our knowing it.

• Application areas:
- Effective web search
- Automatic translation of documents
- Image recognition
- Information security (e.g. access control)
- Number and Speech recognition
- Collaborative filtering
Introduction

• Classical modelling and analysis
  - based on first-principles mathematical modelling to estimate parameters
    that are difficult to measure.
  - sometimes the system under study is too complex to model mathematically.

• Learning from data – i.e. Machine Learning (ML)
  - ML can be used to discover knowledge from data.
  - ML emanates from Artificial Intelligence.
Data Everywhere
 Agriculture
 Banks
 Drivers License
 Government Parastatals
 Higher Institutions – Universities, Polytechnics, Colleges, etc
 Hospitals
 Industries
 Primary and Secondary Schools
 National ID card
 Social Media – Twitter, facebook, etc

What is the essence of generating Big or Huge Data that is never used?


Machine Learning
ML is a product of many disciplines (Figure 1).

Figure 1: Disciplines of ML
Machine Learning
 The primary aim of machine learning is to allow computers to learn
automatically, without human intervention or assistance, and to adjust their actions
accordingly.

 Modern machine learning is a statistical process that starts with a body of data and
tries to derive a rule or procedure that explains the data or can predict future data.

This approach—learning from data—contrasts with the older “expert system”


approach to AI, in which programmers sit down with human domain experts to learn
the rules and criteria used to make decisions, and translate those rules into software
code.
Traditional Programming

  Data + Program  →  Computer  →  Output

Machine Learning

  Data + Output  →  Computer  →  Program

Coursera: Data Science and Machine Learning


Machine Learning

 Hundreds of new ML algorithms are proposed every year.

 Every machine learning algorithm has three components:
• Representation
• Evaluation
• Optimization
Representation
 Decision trees
 Sets of rules / Logic programs
 Instances
 Graphical models (Bayes/Markov nets)
 Neural networks
 Support vector machines
 Model ensembles
 Etc.
Evaluation

 Accuracy
 Precision and recall
 Squared error
 Likelihood
 Posterior probability
 Cost / Utility
 Margin
 Entropy
 K-L divergence
 Etc.
Optimization
Combinatorial optimization
 E.g.: Greedy search
Convex optimization
 E.g.: Gradient descent
Constrained optimization
 E.g.: Linear programming
Types of Learning
Supervised (inductive) learning
 Training data includes desired outputs
Unsupervised learning
 Training data does not include desired outputs
Semi-supervised learning
 Training data includes a few desired outputs
Reinforcement learning
 Rewards from sequence of actions
Inductive Learning

 Given examples of a function (X, F(X))


 Predict function F(X) for new examples X
 Discrete F(X): Classification
 Continuous F(X): Regression
 F(X) = Probability(X): Probability estimation
What We’ll Cover
 Preprocessing

 Feature Selection

 Supervised learning

 Unsupervised learning
ML in Practice

• Understanding the domain, prior knowledge, and goals
• Data Integration & Selection: obtain data from various sources.
• Preprocessing: cleanse the data.
• Transformation: convert the data to a common format; transform it to a new
  format where required.
• Learning models: obtain the desired results.
• Interpreting results: present results to the user in a meaningful manner.
• Consolidating and deploying the discovered knowledge
Preprocessing (outliers)
(a) Outlier detection (and removal): These are unusual data values that are not
consistent with most observations.
• Causes: measurement errors, coding and recording errors, and abnormal values.
Strategies for dealing with outliers:
(i) Detect and remove outliers, or
(ii) Develop robust modeling methods that are insensitive to outliers.
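As an illustration of strategy (i), here is a minimal Python sketch of z-score screening; the sample readings and the three-standard-deviation threshold are illustrative assumptions, not values from the slides.

import numpy as np

def remove_outliers_zscore(values, threshold=3.0):
    # Keep only values lying within `threshold` standard deviations of the mean.
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) < threshold]

readings = [10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 12, 900]   # one abnormal measurement
print(remove_outliers_zscore(readings))                        # 900 is detected and dropped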
Preprocessing (Missing Data)
• Missing data refers to absent data items that hide some unknown information about
the data which may be important.
• Missing data has a negative impact on the performance of classification algorithms.
• Peng and Lei (2004) opined that:
  - 1% missing data is considered trivial;
  - 1-5% missing data is manageable in a dataset;
  - 5-15% requires sophisticated methods to handle;
  - above 15% missing data may severely impact the real representation and
    interpretation of the entire dataset.
Missing data in a dataset could be caused by different factors, such as:
• incorrect measurements as a result of faulty equipment
• values missed due to human errors during the data collation stage
Missing Data Treatment Methods
 Case Deletion Approach – ignores cases with missing data and performs analysis on the remaining cases.

 Mean/Mode Imputation Approach – replaces missing data with the mean (numeric attribute) or mode
(nominal attribute) of all observed cases for that attribute, or with the mean or mode of the known values of
the attribute in the class to which the instance with missing data belongs.

 All Possible Values Imputation – replaces the missing data for a given attribute with every possible value of
that attribute, or with every possible value of the attribute within the class.

 Regression Imputation Approach – replaces missing data with estimated values: the other variables are used
to make a prediction, and the predicted value is substituted as if it were an actual observed value.

 Forward fill or backfill – propagates the previous value forward, or the next value backward, to replace a
missing value.

 Missing data treatment with C4.5

 Hot deck imputation approach

 K-Nearest Neighbour Imputation (kNN)
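A minimal Python/pandas sketch of the simpler treatments above (case deletion, forward/backward fill, and mean/mode imputation); the small data frame and its column names are illustrative assumptions.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "duration": [0.11, np.nan, 0.05, 0.07, np.nan],   # numeric attribute with missing values
    "state":    ["INT", "ACC", None, "INT", "INT"],   # nominal attribute with a missing value
})

complete_cases  = df.dropna()   # case deletion: analyse only the fully observed cases
filled_forward  = df.ffill()    # forward fill propagates the previous value
filled_backward = df.bfill()    # backfill propagates the next value

# Mean imputation for the numeric attribute, mode imputation for the nominal one.
df["duration"] = df["duration"].fillna(df["duration"].mean())
df["state"]    = df["state"].fillna(df["state"].mode()[0])
print(df)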


Pre-Processing (Data Discretization/Normalization)
• Data discretization converts nominal and non-nominal attributes to discrete values.

• Decimal Scaling:
  V'(i) = V(i) / 10^k                                          (1)
  for the smallest k such that max(|V'(i)|) < 1

• Min-Max Normalization:
  V'(i) = (V(i) - min(V(i))) / (max(V(i)) - min(V(i)))         (2)

• Standard Deviation Normalization:
  V'(i) = (V(i) - mean(V)) / Sd(V)                             (3)
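A minimal Python sketch of equations (1)-(3); the sample values are the first entries of the rate column of Table 1, and the helper names are assumptions.

import numpy as np

def decimal_scaling(v):
    # Eq. (1): divide by 10^k for the smallest k with max(|v'|) < 1.
    k = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** k)

def min_max(v):
    # Eq. (2): rescale to the [0, 1] range.
    return (v - v.min()) / (v.max() - v.min())

def z_score(v):
    # Eq. (3): centre on the mean and divide by the standard deviation.
    return (v - v.mean()) / v.std()

rate = np.array([90909.09, 125000, 200000, 166666.7, 100000])   # first rate values of Table 1
print(decimal_scaling(rate), min_max(rate).round(3), z_score(rate).round(3))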
Pre-Processing (Data Discretization/Normalization)
• Other methods include:
  • Equal-width discretization
  • Entropy-based discretization
• Binarization is implemented by assigning a threshold value, derived from the mean of
  the attribute values within each feature, to obtain a Boolean value.

• Av = (Σ x) / n                                               (4)
  where Av is the average,
  n is the number of items, and
  Σ x is the total sum of all the values.
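A minimal Python sketch of mean-threshold binarization, with equation (4) supplying the threshold; the sample values are the first sbytes entries of Table 1.

import numpy as np

def binarize_by_mean(column):
    # Eq. (4) gives the threshold: values above the column average map to 1, others to 0.
    return (column > column.mean()).astype(int)

sbytes = np.array([496, 1762, 1068, 900, 2126, 784])   # first sbytes values of Table 1
print(binarize_by_mean(sbytes))                         # average ~ 1189 -> [0 1 0 0 1 0]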
Table 1: Training-Sample Dataset
Id  duration  state  dwin  trans_depth  spkts  dpkts  sbytes  dbytes  rate  Output (Label2)
1 0.000011 INT 0 0 2 0 496 0 90909.09 Normal
2 0.000008 INT 0 0 2 0 1762 0 125000 Attack
3 0.000005 INT 0 0 2 0 1068 0 200000 Attack
4 0.000006 INT 0 0 2 0 900 0 166666.7 Attack
5 0.00001 ACC 1 1 2 0 2126 0 100000 Attack
6 0.000003 INT 0 0 2 0 784 0 333333.3 Attack
7 0.000006 INT 0 0 2 0 1960 0 166666.7 Attack
8 0.000028 ACC 1 0 2 0 1384 0 35714.29 Attack
9 0 ACC 1 1 1 0 46 0 0 Attack
10 0 ACC 1 0 1 0 46 0 0 Attack
11 0 ACC 1 0 1 0 46 0 0 Attack
12 0 ACC 1 0 1 0 46 0 0 Attack
13 0.000004 INT 0 0 2 0 1454 0 250000 Normal
14 0.000007 INT 0 0 2 0 2062 0 142857.1 Normal
15 0.000011 INT 0 0 2 0 2040 0 90909.09 Normal
16 0.000004 INT 0 0 2 0 1052 0 250000 Normal
17 0.000003 INT 0 0 2 0 314 0 333333.3 Normal
18 0.00001 INT 0 0 2 0 1774 0 100000 Normal
19 0.000002 INT 0 0 2 0 1568 0 500000 Attack
20 0.000005 INT 0 0 2 0 1658 0 500050 Attack
Table 2 Binarization Formation of Training-Sample Dataset
BINARIZATION
Attributes Minimum Average Maximum 0 1
dur 0 1.006756146 59.999989 below 1 above 1

proto 131 unique discretized values


service 13 unique discretized values
state 7 unique discretized values
spkts 1 18.66647233 10646 below 19 above 19

dpkts 0 17.54593597 11018 below 18 above 18

sbytes 24 7993.908165 14355774 below 7994 above 7994

dbytes 0 13233.78556 14657531 below 13234 above 13234

rate 0 82410.88674 1000000.003 below 82411 above 82411


Table 3: Discretized Training-Sample
Id  dur  state  dwin  trans_depth  spkts  dpkts  sbytes  dbytes  rate  Output (Label2)
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 1 1
3 0 0 0 0 0 0 0 0 1 1
4 0 0 0 0 0 0 0 0 1 1
5 0 1 1 1 0 0 0 0 1 1
6 0 0 0 0 0 0 0 0 1 1
7 0 0 0 0 0 0 0 0 1 1
8 0 1 1 0 0 0 0 0 1 1
9 0 1 1 1 0 0 0 0 0 1
10 0 1 1 0 0 0 0 0 0 1
11 0 1 1 0 0 0 0 0 0 1
12 0 1 1 0 0 0 0 0 0 1
13 0 0 0 0 0 0 0 0 1 0
14 0 0 0 0 0 0 0 0 1 0
15 0 0 0 0 0 0 0 0 1 0
16 0 0 0 0 0 0 0 0 1 0
17 0 0 0 0 0 0 0 0 1 0
18 0 0 0 0 0 0 0 0 1 0
19 0 0 0 0 0 0 0 0 1 1
20 0 0 0 0 0 0 0 0 1 1
Feature Selection
Feature Selection (FS) is a method of identifying the relevant features in a data set.

It removes noise (redundant features) and reduces computational time.

There are different feature selection methods; the discussion here is limited to entropy
information gain and correlation-based feature selection (CFS).

The entropy information gain per attribute in Table 3 is computed as stated in equations (5)
and (6):

Info(D) = H(S) = - Σ_{i=1}^{n} P_i log2(P_i)                          (5)

Gain(S, A) = H(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) H(S_v)            (6)

where v ranges over the possible values of A, S_v is the subset of S in which the value of
A = v, and P_i is the proportion of instances with class i; a class is a category to which an
instance may belong with a certain probability.
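A minimal Python sketch of equations (5) and (6); the two arrays transcribe the output (1 = attack, 0 = normal) and trans_depth columns of Table 3, and exact numeric results depend on rounding.

import numpy as np
from collections import Counter

def entropy(labels):
    # Eq. (5): H(S) = -sum_i p_i log2(p_i) over the class proportions.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(feature, labels):
    # Eq. (6): Gain(S, A) = H(S) - sum_v (|S_v| / |S|) H(S_v).
    remainder = 0.0
    for v in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

output      = [0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1]
trans_depth = [0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
print(round(entropy(output), 3), round(info_gain(trans_depth, output), 3))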
Feature Selection
The split information is calculated using equation (7), while the Gain Ratio is computed using
equation (8):

Split info(S, A) = Σ_{i=1}^{r} (|S_i| / |S|) log2(|S_i| / |S|)        (7)

Gain Ratio = Gain(S, A) / Split info(S, A)                            (8)

From Table 3, the description of the subset grouping of the dataset is generated for the
features state, dwin, trans_depth, and output.

Computation of Information Gain, Split info and Gain Ratio:


Table 4: Attribute information of Training-Sample Dataset

S/No Feature Attribute set


1 State 2
2 Dwin 2
3 trans_depth 2
4 Output 2
Using the entropy information model of equation (5):

From Table 4, the number of records with output = 1 is 13 and the total number of records is 20, so

Info(D) = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.403 + 0.522 = 0.925

To further calculate the gain value for each feature using equation (6), we have:

1. Info(State)   = 14/20 [(7/14) log2(7/14) - (7/14) log2(7/14)] + 6/20 [(6/6) log2(6/6) - (0/6) log2(0/6)] = 0
   Gain(State)   = Info(D) - Info(State) = 0.925 - 0 = 0.925

2. Info(Dwin)    = 14/20 [(7/14) log2(7/14) - (7/14) log2(7/14)] + 6/20 [(6/6) log2(6/6) - (0/6) log2(0/6)] = 0
   Gain(Dwin)    = Info(D) - Info(Dwin) = 0.925 - 0 = 0.925

3. Info(Trans_d) = 18/20 [-(11/18) log2(11/18) - (7/18) log2(7/18)] + 2/20 [-(2/2) log2(2/2) - (0/2) log2(0/2)] = 0.8681
   Gain(Trans_d) = Info(D) - Info(Trans_d) = 0.925 - 0.8681 = 0.0569
To calculate the split info for each feature we use equation (7), and the Gain Ratio is then obtained
from equation (8):

1. Split info(State)   = (7/14) log2(7/14) + (6/6) log2(6/6) = 0.5 × (-1) + 0 = -0.5
   Gain Ratio(State)   = Gain(State) / Split info(State) = 0.925 / -0.5 = -1.85

2. Split info(Dwin)    = (7/14) log2(7/14) + (6/6) log2(6/6) = 0.5 × (-1) + 0 = -0.5
   Gain Ratio(Dwin)    = Gain(Dwin) / Split info(Dwin) = 0.925 / -0.5 = -1.85

3. Split info(Trans_d) = (11/18) log2(11/18) + (2/2) log2(2/2) = 0.4345
   Gain Ratio(Trans_d) = Gain(Trans_d) / Split info(Trans_d) = 0.0569 / 0.4345 = 0.1310
Table 6 gives the result; Trans_d is the most valuable feature.

Table 6: Rank of Training-Sample Dataset Features

Entropy information model, Info(D) = 0.925

              State    Dwin     Trans_d
Info          0        0        0.8681
Gain          0.925    0.925    0.0569
Split info    -0.5     -0.5     0.4345
Gain Ratio    -1.85    -1.85    0.1310


Correlation Coefficient Feature Selection
The Pearson correlation coefficient given in equation (9) is computed to determine the
correlation between each pair of features:

CFS = (N ΣXY - (ΣX)(ΣY)) / sqrt[(N ΣX² - (ΣX)²)(N ΣY² - (ΣY)²)]        (9)

where N is the number of feature values (pairs of scores),
X is parameter 1, and
Y is parameter 2.

Using the sample dataset of Table 3 to illustrate the Pearson correlation coefficient of the
state and dwin features:

CFS = (20(6) - (6)(6)) / sqrt[(20(6) - (6)²)(20(6) - (6)²)] = 84 / 84 = 1

The CFS application shows a perfect correlation between the two features.
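A minimal Python sketch of equation (9), applied to the state and dwin columns of Table 3; it reproduces the perfect correlation computed above.

import numpy as np

def pearson_cfs(x, y):
    # Eq. (9): Pearson correlation between two feature columns.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt((n * (x**2).sum() - x.sum()**2) * (n * (y**2).sum() - y.sum()**2))
    return num / den

state = [0,0,0,0,1,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0]   # state column of Table 3
dwin  = [0,0,0,0,1,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0]   # dwin column of Table 3
print(pearson_cfs(state, dwin))                      # 1.0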


Supervised learning
Predicting a class label using naïve Bayesian

Table 7: Sample data 2

S/N Protocol Service Flag Category


s1 Tcp http SF normal
s2 Tcp http REJ intrusion
s3 Tcp telnet REJ intrusion
s4 Udp ftp REJ normal
s5 Udp telnet SF normal
s6 Tcp ftp REJ normal
Bayesian Classifier
 Instances to be classified are described by attribute vectors x = (x1, ..., xn).

 An instance is assigned the most probable, or maximum a posteriori (MAP),
classification from a finite set C of classes:

   c_MAP = argmax_{c ∈ C} P(x | c) P(c)

 argmax_{c ∈ C} means that we choose the c ∈ C for which the corresponding
probability is maximal, and C = {c1, ..., cm} is the set of class labels.
Bayesian Classifier
• Under the naïve Bayes assumption that the attribute values x1, ..., xn are
conditionally independent given the classification, we obtain

   P(x1, ..., xn | cj) = Π_{i=1}^{n} P(xi | cj)

and substituting this into the former equation gives the naïve Bayes classifier

   c = argmax_{cj ∈ C} P(cj) Π_{i=1}^{n} P(xi | cj)

• The naïve Bayes classifier is trained on a set of labelled training data presented to it in
relational form.
Bayesian Classifier
• The training data in Table 7 will be used as an illustration. The class label attribute
(category) takes the values normal and intrusion. Let C1 correspond to category = normal and
C2 to category = intrusion. Assume we wish to classify the tuple

X = (Protocol = Tcp, Service = ftp, Flag = SF).

• We need to maximize P(X|Ci)P(Ci) for i = 1, 2. The prior probability of each class
P(Ci) is computed from the training tuples:

P(Category = normal) = 4/6 = 0.667

P(Category = intrusion) = 2/6 = 0.333

Bayesian Classifier
• To compute P(X|Ci) for i = 1,2, the following conditional probabilities are computed

• P(Protocol = tcp | Category = normal) = 0.5

• P(Protocol = tcp | Category = intrusion) = 0.997

• P(Protocol = udp | Category = normal) = 0.5

• P(Protocol = udp | Category = intrusion) = 0.003

• P(Service = http | Category = normal) = 0.250

• P(Service = http | Category = intrusion) = 0.499

• P(Service = telnet | Category = normal) = 0.250


Bayesian Classifier
P(Service = telnet | Category = intrusion) = 0.499

P(Service = ftp | Category = normal) = 0.499

P(Service = ftp | Category = intrusion) = 0.003

P(Flag = SF | Category = normal) = 0.50

P(Flag = SF | Category = intrusion) = 0.003

P(Flag = REJ| Category = normal) = 0.50

P(Flag = REJ | Category = intrusion) = 0.997

It should be noted here that a Laplace adjustment is used in the computation of the
conditional probabilities, with ni(xk) set to 0.0001.
Bayesian Classifier
Using the above probabilities, we obtain

P(X | Category = normal) = P(Protocol = Tcp | Category = normal)
                           × P(Service = ftp | Category = normal)
                           × P(Flag = SF | Category = normal)
                         = 0.5 × 0.499 × 0.5 = 0.125

Similarly,

P(X | Category = intrusion) = 0.997 × 0.003 × 0.003 = 8.773 × 10^-6

To find the class Ci that maximizes P(X|Ci)P(Ci), we compute

P(X | Category = normal) P(Category = normal) = 0.125 × 0.667 = 0.083

P(X | Category = intrusion) P(Category = intrusion) = 8.773 × 10^-6 × 0.333 = 2.921 × 10^-6

Therefore, the naïve Bayesian classifier predicts Category = normal for tuple X.
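A minimal Python sketch of the naïve Bayes computation on Table 7. The slides use a Laplace adjustment with ni(xk) = 0.0001, whereas this sketch uses ordinary add-one smoothing, so the intermediate probabilities differ slightly; the predicted class for X is the same (normal).

from collections import Counter

# Table 7, with the tuple X = (Tcp, ftp, SF) to classify.
train = [("Tcp", "http",   "SF",  "normal"),
         ("Tcp", "http",   "REJ", "intrusion"),
         ("Tcp", "telnet", "REJ", "intrusion"),
         ("Udp", "ftp",    "REJ", "normal"),
         ("Udp", "telnet", "SF",  "normal"),
         ("Tcp", "ftp",    "REJ", "normal")]
X = ("Tcp", "ftp", "SF")

priors = Counter(row[-1] for row in train)            # class counts: normal 4, intrusion 2
scores = {}
for c, count_c in priors.items():
    score = count_c / len(train)                      # P(Ci)
    for i, value in enumerate(X):
        match = sum(1 for row in train if row[-1] == c and row[i] == value)
        distinct = len(set(row[i] for row in train))  # number of values of attribute i
        score *= (match + 1) / (count_c + distinct)   # add-one smoothed P(xi | Ci)
    scores[c] = score

print(max(scores, key=scores.get), scores)            # predicts 'normal' for X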
Rough Set-Based Approach

• Rough set theory (RST) is a useful mathematical tool for dealing with imprecise and
insufficient knowledge, finding hidden patterns in data, and reducing dataset size.

• Rough sets are used for evaluating the significance of data and for easy interpretation of results.

• RST contributes immensely through the concept of reducts. A reduct is a minimal subset of
attributes that preserves the predictive power of the full attribute set.

• RST is very effective in removing redundant features from discrete data sets.

• Rough sets generate explainable rules.


RS - Theoretical Background
Central to RS is indiscernibility. An indiscernibility relation holds between two elements
when they cannot be differentiated.

Let S = (U, A, V, f) be an information system, where U is a universe containing a finite
set of N objects {x1, x2, ..., xN}.

A is a non-empty finite set of attributes used in the description of objects.
RS - Theoretical Background

V describes the values of all attributes, that is, V = ∪_{a ∈ A} V_a, where V_a is the set of
values of attribute a. f : U × A → V is the total decision function such that f(x, a) ∈ V_a
for every a ∈ A and x ∈ U.

An information system is referred to as a decision table (DT) if the attributes in S are
divided into two disjoint sets called the condition (C) and decision (D) attributes,
where A = C ∪ D and C ∩ D = ∅.
RS - Theoretical Background
DT = (U, C ∪ D, V, f)                                                 (10)

A subset of attributes B ⊆ A defines an equivalence relation on U, denoted IND(B):

IND(B) = {(x, y) ∈ U × U | f(x, b) = f(y, b) for all b ∈ B}           (11)

The equivalence classes of the B-indiscernibility relation are denoted [x]_B:

[x]_B = {y ∈ U | (x, y) ∈ IND(B)}                                     (12)


RS - Theoretical Background

The indiscernibility (IND) relation is illustrated using Table 7. The non-empty subsets of
the conditional attributes are {Protocol}, {Service}, {Flag}, {Protocol, Service},
{Protocol, Flag}, {Service, Flag}, and {Protocol, Service, Flag}.

IND({Protocol}) = {{s1,s2,s3,s6},{s4,s5}}
IND({Service}) = {{s1,s2},{s3,s5},{s4,s6}}
IND({Flag}) = {{s1,s5},{s2,s3,s4,s6}}

RS - Theoretical Background
IND({Protocol, Service}) = {{s1,s2},{s3},{s4},{s5},{s6}}
IND({Protocol, Flag}) = {{s1},{s2,s3,s6},{s4},{s5}}
IND({Service, Flag}) = {{s1},{s2},{s3},{s4,s6},{s5}}
IND({Protocol, Service, Flag}) = {{s1},{s2},{s3},{s4},{s5},{s6}}
IND({Category}) = {{s1,s4,s5,s6},{s2,s3}}

Given B ⊆ A and X ⊆ U, X can be approximated using only the information contained in B
by constructing the B-lower and B-upper approximations of X, defined as:

B-lower(X) = {x ∈ U | [x]_B ⊆ X}
B-upper(X) = {x ∈ U | [x]_B ∩ X ≠ ∅}                                  (13)
RS - Theoretical Background
Given attributes A = C ∪ D with C ∩ D = ∅, the positive POS_C(D), negative NEG_C(D)
and boundary BND_C(D) regions for a set of condition attributes C with respect to IND(D)
are defined as

POS_C(D) = ∪_{X ∈ D*} C-lower(X)

NEG_C(D) = U − ∪_{X ∈ D*} C-upper(X)                                  (14)

BND_C(D) = ∪_{X ∈ D*} C-upper(X) − ∪_{X ∈ D*} C-lower(X)

where D* denotes the family of equivalence classes defined by the relation IND(D).
POS_C(D) contains all objects of U that can be classified correctly into the distinct
classes defined by IND(D).


RS - Theoretical Background

The boundary region, BND_C(D), is the set of objects that can possibly, but not certainly,
be classified in this way. The negative region, NEG_C(D), is the set of objects that cannot
be classified into the classes of U/D.

The indiscernibility relation provides a means of generating rules. For example, from the
IND sets {s4,s6} and {s1,s5} for the attributes Service and Flag respectively, the following
rules are generated:

Rule 1. (Service, ftp) → (Category, normal)

Rule 2. (Flag, SF) → (Category, normal)

The number of consistent rules in a DT can be used as the consistency factor of the DT,
denoted γ(C, D), where C and D are the condition and decision attributes respectively.
RS - Theoretical Background

In Table 7, γ(C, D) = 6/6 = 1, which shows that the DT is consistent.

Decision rules are then presented in the form of "if ... then ..." rules. For example,
Rule 1 above can be presented as follows:

If (Service, ftp) then (Category, normal)

A set of decision rules is called a decision algorithm.


Dependency of Attributes

In Table 7, for instance, there are no total dependencies on any single attribute:

γ(Protocol, Category) = 2/6
γ(Service, Category) = 2/6
γ(Flag, Category) = 2/6
γ({Protocol, Service}, Category) = 4/6
γ({Protocol, Flag}, Category) = 3/6
γ({Service, Flag}, Category) = 6/6
γ({Protocol, Service, Flag}, Category) = 6/6

The degree of dependency for Table 7 shows that the data is consistent.
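A minimal Python sketch of the indiscernibility classes of equation (11) and the degree of dependency γ on Table 7; the function names ind and gamma are assumptions used here for illustration.

from collections import defaultdict

# Table 7 as a decision table: three condition attributes plus the decision attribute.
rows = {
    "s1": {"Protocol": "Tcp", "Service": "http",   "Flag": "SF",  "Category": "normal"},
    "s2": {"Protocol": "Tcp", "Service": "http",   "Flag": "REJ", "Category": "intrusion"},
    "s3": {"Protocol": "Tcp", "Service": "telnet", "Flag": "REJ", "Category": "intrusion"},
    "s4": {"Protocol": "Udp", "Service": "ftp",    "Flag": "REJ", "Category": "normal"},
    "s5": {"Protocol": "Udp", "Service": "telnet", "Flag": "SF",  "Category": "normal"},
    "s6": {"Protocol": "Tcp", "Service": "ftp",    "Flag": "REJ", "Category": "normal"},
}

def ind(attrs):
    # Equivalence classes of the indiscernibility relation IND(attrs), equation (11).
    classes = defaultdict(set)
    for obj, vals in rows.items():
        classes[tuple(vals[a] for a in attrs)].add(obj)
    return list(classes.values())

def gamma(cond, dec):
    # Degree of dependency: |POS_C(D)| / |U|.
    decision_classes = ind(dec)
    pos = set()
    for eq in ind(cond):
        if any(eq <= d for d in decision_classes):   # eq lies wholly inside one decision class
            pos |= eq
    return len(pos) / len(rows)

print(ind(["Protocol"]))                             # {{s1,s2,s3,s6},{s4,s5}}, as above
print(gamma(["Protocol"], ["Category"]))             # 2/6
print(gamma(["Service", "Flag"], ["Category"]))      # 6/6 = 1.0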


Nearest Neighbour Classifiers
• kNN is a supervised learning algorithm in which a new query instance is classified
based on the majority category of its k nearest neighbours.

• Closeness is defined in terms of a distance metric, such as the Euclidean distance.
The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and
X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = sqrt( Σ_{i=1}^{n} (x1i − x2i)² )
Given this example to classify:

(Protocol = Tcp, Service = ftp, Flag = SF, Category = ?) based on Table 7.

The steps involved in the kNN algorithm are:

- Determine the parameter k, the number of nearest neighbours.
- Calculate the distance between the query instance and all the training tuples.
- Sort the distances and determine the nearest neighbours based on the k-th minimum distance.
- Gather the categories of the nearest neighbours.
- Use the simple majority of the categories of the nearest neighbours as the prediction.

Applying the steps:

- Determine the parameter k: suppose we use k = 3.
- Calculate the distance between the query instance and all the training samples using the
  Euclidean distance (Table 8).

Table 8: Distance between the query instance and all the training samples

S/N  Protocol  Service  Flag  Category   Euclidean distance to query instance
S1   Tcp       http     SF    normal     √(0² + 1² + 0²) = √1 = 1
S2   Tcp       http     REJ   intrusion  √(0² + 1² + 1²) = √2 = 1.414
S3   Tcp       telnet   REJ   intrusion  √(0² + 1² + 1²) = √2 = 1.414
S4   Udp       ftp      REJ   normal     √(1² + 0² + 1²) = √2 = 1.414
S5   Udp       telnet   SF    normal     √(1² + 1² + 0²) = √2 = 1.414
S6   Tcp       ftp      REJ   normal     √(0² + 0² + 1²) = √1 = 1

- Sort the distances and determine the nearest neighbours based on the k-th minimum distance.

Table 9: Sorted distances based on the k-th minimum distance

S/N  Protocol  Service  Flag  Category   Euclidean distance to query instance
S1   Tcp       http     SF    normal     √(0² + 1² + 0²) = √1 = 1
S6   Tcp       ftp      REJ   normal     √(0² + 0² + 1²) = √1 = 1
S2   Tcp       http     REJ   intrusion  √(0² + 1² + 1²) = √2 = 1.414
- Gather the categories of the nearest neighbours. Note that Table 9 shows only the first 3
sorted records.

- Use the simple majority of the categories of the nearest neighbours as the prediction for
the query instance. There are 2 normal and 1 intrusion; since 2 > 1, the new sample
(Protocol = Tcp, Service = ftp, Flag = SF) is placed in the normal category.

The kNN algorithm is easy to implement but computationally intensive, especially when
given a large training set.
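A minimal Python kNN sketch over Table 7, using the 0/1 attribute-mismatch encoding of Table 8; it reproduces the prediction above.

import math
from collections import Counter

train = [(("Tcp", "http",   "SF"),  "normal"),
         (("Tcp", "http",   "REJ"), "intrusion"),
         (("Tcp", "telnet", "REJ"), "intrusion"),
         (("Udp", "ftp",    "REJ"), "normal"),
         (("Udp", "telnet", "SF"),  "normal"),
         (("Tcp", "ftp",    "REJ"), "normal")]

def distance(a, b):
    # Euclidean distance over 0/1 mismatch indicators (reproduces Table 8).
    return math.sqrt(sum(int(x != y) for x, y in zip(a, b)))

def knn_predict(query, k=3):
    neighbours = sorted(train, key=lambda t: distance(query, t[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_predict(("Tcp", "ftp", "SF")))   # 'normal': 2 of the 3 nearest neighbours are normal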
Clustering Techniques
• Clustering is the process of grouping a set of physical or abstract objects into classes
of similar objects.

• Object clustering is used across scientific disciplines, from mathematics and statistics
to biology and genetics.

• Clustering techniques discussed here include k-means, fuzzy c-means and fuzzy rough
c-means.
K-means Clustering Technique

k-means clustering(X, K):
  X, the instance set X = {x1, x2, ..., xn}
  K, the number of clusters
  Mk, the K cluster centres (means)

  repeat until there are no changes in the K cluster centres:
    determine the centroid coordinates
    determine the distance of each object to the centroids
    group the objects based on minimum distance
  end
Table 10: A sample relational data set

Objects/Attributes   X   Y
A1                   1   1
A2                   2   1
A3                   4   3
A4                   5   4

Initial value of the centroids: objects A1 and A2 are chosen as c1 and c2, the coordinates
of the centroids, so c1 = (1,1) and c2 = (2,1).

Compute the object-centroid distances using the Euclidean distance to obtain the distance
matrix

D⁰ = | 0   1   3.61   5    |   c1 = (1,1)  group 1
     | 1   0   2.83   4.24 |   c2 = (2,1)  group 2

Each column in the distance matrix symbolizes the object. The first row

of the distance corresponds to the distance of each object to the first

centroid and the second row in each object to the second centroid.

Object Clustering – Each object is assigned based on the minimum


distance. Thus object A1 is assigned to group1, objects A2,A3 and A4
are assigned to group2. The element of group matrix (G0) is 1 if and
only if the object is assigned to that group.
1 0 0 0 group1
G 
0

0 1 1 1 group2

Iteration 1, determine centroids: knowing the members of each group, new centroids are
computed from the new membership of each group. Group 1 has only one member, so its
centroid remains c1 = (1,1). Group 2 has three members, so its centroid becomes the
average coordinate of the three members:

c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
Iteration 1, object-centroid distances: compute the distances to the new centroids to
obtain the distance matrix D¹:

D¹ = | 0      1      3.61   5    |   c1 = (1,1)        group 1
     | 3.14   2.36   0.47   1.89 |   c2 = (11/3, 8/3)  group 2

Iteration 1, object clustering: as in step 3, assign each object based on the minimum
distance. Based on the new distance matrix D¹, object A2 is moved to group 1:

G¹ = | 1   1   0   0 |   group 1
     | 0   0   1   1 |   group 2
Iteration 2, determine centroids: repeat step 4 to compute the new centroids based on the
clustering of the previous iteration. The new centroids are

c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)  and  c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)

Iteration 2, object-centroid distances: repeating step 2, we obtain the new distance
matrix D²:

D² = | 0.5    0.5    3.20   4.61 |   c1 = (1.5, 1)    group 1
     | 4.30   3.54   0.71   0.71 |   c2 = (4.5, 3.5)  group 2

Iteration 2, object clustering: again, assign each object based on the minimum distance:

G² = | 1   1   0   0 |   group 1
     | 0   0   1   1 |   group 2

We obtain G² = G¹. Comparing the grouping of the last iteration with this iteration, no
object has moved from or to any group. The k-means clustering has therefore reached
stability and no further iteration is needed.
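A minimal NumPy sketch of the loop above, reproducing the worked example on Table 10 with A1 and A2 as the initial centroids.

import numpy as np

def kmeans(X, centroids, max_iter=100):
    # Plain k-means: assign each point to its nearest centroid, then recompute the means.
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        groups = dists.argmin(axis=1)
        new_centroids = np.array([X[groups == k].mean(axis=0) for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):   # no centroid moved: converged
            break
        centroids = new_centroids
    return groups, centroids

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # Table 10
groups, centres = kmeans(X, X[:2].copy())                     # A1 and A2 as initial centroids
print(groups, centres)   # groups [0 0 1 1]; centres (1.5, 1) and (4.5, 3.5), as in the example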
Fuzzy C-means Algorithm
Step 1: Input the fuzzy coefficient m, the stopping threshold ε and the number of clusters c;
randomly initialize the membership matrix U subject to the condition in equation 5.9.

Step 2: Compute the centroids based on

c_j = ( Σ_{i=1}^{N} u_ij^m · x_i ) / ( Σ_{i=1}^{N} u_ij^m )

Step 3: Compute the objective function

J_m = Σ_{i=1}^{N} Σ_{j=1}^{c} u_ij^m ||x_i − c_j||²

Step 4: If ||J(k+1) − J(k)|| < ε then stop; otherwise continue with Step 5.

Step 5: Compute u_ij based on equation 5.12 and go to Step 2:

u_ij = 1 / Σ_{k=1}^{c} ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1)),   i = 1, 2, ..., N;  j = 1, 2, ..., c
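A minimal NumPy sketch of Steps 2-5 above; the random initialisation, fuzziness m = 2 and tolerance are assumptions, and the sample points reuse Table 10.

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # memberships of each point sum to 1
    J_prev = np.inf
    for _ in range(max_iter):
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]                  # Step 2
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        J = float((W * d ** 2).sum())                                   # Step 3: objective J_m
        if abs(J_prev - J) < eps:                                       # Step 4: stopping test
            break
        J_prev = J
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)   # Step 5
    return U, centroids

U, centres = fuzzy_c_means(np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float))
print(U.round(2), centres.round(2))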
The Ensemble Classifier
An ensemble classifier uses a combination of a set of models or classifiers, each of which
solves the same original task, in order to obtain a better composite global classifier with
more accurate and reliable estimates or decisions than a single classifier (Ali and Pazzani,
1996).

Figure 2: Multiple classifiers M1, ..., Mk are trained on the data set and combined
(e.g. by bagging or boosting) to predict labels for unlabelled tuples, increasing model
accuracy.
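A minimal scikit-learn sketch comparing a bagging and a boosting ensemble, assuming scikit-learn is available; the synthetic data set stands in for the labelled training data and is an assumption.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the labelled training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

bagging = BaggingClassifier(n_estimators=25, random_state=0)    # M1..Mk fitted on bootstrap samples
boosting = AdaBoostClassifier(n_estimators=25, random_state=0)  # sequentially re-weighted learners

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))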


Conclusion
Machine Learning has been used to address the taxonomy of AI problems spelt out by
the National Science and Technology Council Committee on Technology (2016):

(1) systems that think like humans (e.g., cognitive architectures and neural networks);

(2) systems that act like humans (e.g., pass the Turing test via natural language processing,
knowledge representation, automated reasoning, and learning);

(3) systems that think rationally (e.g., logic solvers, inference, and optimization); and

(4) systems that act rationally (e.g., intelligent software agents and embodied robots that
achieve goals via perception, planning, reasoning, learning, communicating, decision-
making, and acting).
References
• Adetunmbi, A.O., Falaki, S.O., Adewale, O.S. and Alese, B.K. (2007) A Rough Set
Approach for Detecting Known and Novel Network Intrusions. Second International
Conference on Application of Information and Communication Technologies to Teaching,
Research and Administrations (AICTTRA 2007), Ife, pp. 190–200.
• Adetunmbi, A.O. (2008) Intrusion Detection Based on Machine Learning Techniques.
PhD Thesis, Federal University of Technology, Akure.
• Diksha, S. (2018) Decision Making: Meaning, Process and Factors.
http://www.businessmanagementideas.com/decision-making/decision-making-meaning-process-and-factors/3422, retrieved 8th Dec. 2018.
• Pawlak, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer
Academic Publishers, Dordrecht.
• Elston, S. and Rudin, C. (2017) Data Science and Machine Learning Essentials (video),
Coursera.
References
• Ayogu, I.I. (2008) Development of a Machine Translation System for English, Igbo and
Yoruba Languages. PhD Thesis, Department of Computer Science, Federal University of
Technology, Akure.
• National Science and Technology Council Committee on Technology (2016) Preparing
for the Future of Artificial Intelligence.
Thank you for Listening
