Practical Machine Learning R
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools
and many iterations to get reader feedback, pivot until you have the right book and build
traction once you do.
Part I - Introduction
Chapter 1 - Introduction to R
1.1 The basics
1.2 Vectors
1.3 Matrices
1.4 Data frames
1.5 R Scripts
1.6 Functions
1.7 for loops
1.8 Making decisions (if & else)
1.9 Datasets and statistics
1.10 Factors
1.11 Challenge
Part II - Classification
Chapter 3 - Classification with Decision trees
3.1 Introduction
3.2 Splitting Criteria and Decision Tree Construction
3.3 Application with Pruning and Evaluation Metrics
3.4 Exercise
6.1 Introduction
6.2 Support Vector Machines Model Construction and Classification
6.3 Exercise
getwd()
To set the working directory, one can use the setwd function. For example, to set the working
directory to the Desktop directory on a Linux machine:
setwd("~/Desktop")
What you type at the R prompt is an expression, which R attempts to evaluate and print the result.
For example, getwd() is an expression that is evaluated by calling the function getwd() with no
arguments. The same holds for 42:
42
## [1] 42
Console Output
Whatever starts with ## in the book signifies what the reader should see in the console
output.
(100 * 2 - 12 ^ 2) / 7 * 5 + 2
## [1] 42
sin(pi/2)
## [1] 1
To find out the documentation of a specific function you can enter ?sum or help(sum). To
search for functions, you can use help.search("sin"). For certain functions one can see examples
of use by typing the expression example(plot). Comments start with #, while to assign values
to variables you can use <- or =. For example:
a <- 42
b <- (42 + a) / 2
print(a)
## [1] 42
print(b)
## [1] 42
With ls() one can check all the variables existing in the current R session.
ls()
ls() output
The output of the ls function will differ from the above, based on which variables currently
exist in your session.
To delete all the variables in the current session you can use the call:
rm(list=ls())
1.2 Vectors
A vector is a collection of elements of the same type, for example integers. Some things you can
do with vectors in R are shown below.
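The vector a used in the examples below is not defined in this excerpt; a definition consistent with the outputs shown later in this section would be:

a <- c(10, 5, 3, 100, -2, 5, -50)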
a[c(1,3:4)]
## [1] 10 3 100
The above expression uses the c() function for combining values and the : operator that
generates sequences from:to with step 1. Another easy way of specifying sequences is to use
the seq function.
c(1, 2, 7, 10)
## [1] 1 2 7 10
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 6, by=1)
## [1] 1 2 3 4 5 6
seq(1,6, by=2)
## [1] 1 3 5
seq(1,by=2, length=6)
## [1] 1 3 5 7 9 11
class(a)
## [1] "numeric"
a > 5
To return the indices of vector a for which the comparison a > 5 is TRUE:
which(a>5)
## [1] 1 4
b <- a > 0
positives <- a[b]
positives
## [1] 10 5 3 100 5
# or more succinctly
positives <- a[a>0]
positives
## [1] 10 5 3 100 5
length(a)
## [1] 7
c <- 1:7
rbind(a,c)
cbind(a,c)
## a c
## [1,] 10 1
## [2,] 5 2
## [3,] 3 3
## [4,] 100 4
## [5,] -2 5
## [6,] 5 6
## [7,] -50 7
1.3 Matrices
To create matrices, use the matrix() function:
matrix(10,3, 2)
## [,1] [,2]
## [1,] 10 10
## [2,] 10 10
## [3,] 10 10
# or
matrix(c(1,2,3,4,5,6), 3, 2)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# or
matrix(c(1,2,3), 3, 2)
## [,1] [,2]
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
args(matrix)
So the first argument is the data; with the nrow or ncol arguments we can declare the number
of rows and columns, and with the argument byrow we declare whether we want to fill in the matrix
column-by-column (byrow=FALSE) or row-by-row (byrow=TRUE). In the above calls we didn't
use the byrow argument because the function matrix has the default value byrow=FALSE, as we can
also check from the documentation, ?matrix.
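The matrix m described in the next paragraph and reused further below is not shown in this excerpt; the call consistent with the description would be:

m <- matrix(1:9, nrow = 3, byrow = TRUE)
m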
Here we have filled in a matrix with values 1 to 9, by row, with the number of rows equal to
3. This gives us a square 3x3 matrix. R is pretty smart in knowing that the number of columns
should be 3 as well!
We can also call cbind and rbind and other functions like rowSums, colSums, mean and t for
transpose:
m2 <- rbind(m, m)
m2
rowSums(m2)
## [1] 6 15 24 6 15 24
colSums(m2)
## [1] 24 30 36
mean(m2)
## [1] 5
For element-wise multiplication one can use the * operator, while for matrix multiplication you
can use the %*% operator.
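The matrices am and bm used below are not defined in this excerpt; any two square matrices of the same size would do, for example:

am <- matrix(1:4, nrow = 2)
bm <- matrix(5:8, nrow = 2)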
am * bm
am %*% bm
t(am)
Notice the use of the path including data/, since we previously set the working directory to the
Desktop with setwd().
dir()
3 http://swcarpentry.github.io/r-novice-inflammation/files/r-novice-inflammation-data.zip
The dir function returns the files and directories of the file system. The argument header=FALSE
lets the read.csv function know that there is no header row to give the columns names.
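The read.csv call itself is not shown in this excerpt; assuming the Software Carpentry inflammation files referenced above were unzipped into a data/ directory, it would look something like this (the file name is an assumption):

data <- read.csv("data/inflammation-01.csv", header = FALSE)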
With head(data) I can check if the data are loaded correctly. It returns the first few rows:
head(data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 0 0 1 3 1 2 4 7 8 3 3 3 10 5 7 4 7 7 12 18
## 2 0 1 2 1 2 1 3 2 2 6 10 11 5 9 4 4 7 16 8 6
## 3 0 1 1 3 3 2 6 2 5 9 5 7 4 5 4 15 5 11 9 10
## 4 0 0 2 0 4 2 2 1 6 7 10 7 9 13 8 8 15 10 10 7
## 5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17
## 6 0 0 1 2 2 4 2 1 6 4 7 6 6 9 9 15 4 16 18 12
## V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
## 1 6 13 11 11 7 7 4 6 8 8 4 4 5 7 3 4 2 3
## 2 18 4 12 5 12 7 11 5 11 3 3 5 4 4 5 5 1 1
## 3 19 14 12 17 7 12 11 7 4 2 10 5 4 2 2 3 2 2
## 4 17 4 4 7 6 15 6 4 9 11 3 5 6 3 3 4 2 3
## 5 9 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0
## 6 12 5 18 9 5 3 10 3 12 7 8 4 7 3 5 4 4 3
## V39 V40
## 1 0 0
## 2 0 1
## 3 1 1
## 4 2 1
## 5 1 1
## 6 2 1
## [1] "data.frame"
# dimensions
dim(data)
## [1] 60 40
which tells me that I have a data frame with 60 observations (instances) and 40 variables.
# structure
str(data)
which returns the class and dimensions of the variable data, along with a list of the variables
with their type and their first values.
## V1 V2 V3 V4 V5
## Min. :0 Min. :0.00 Min. :0.000 Min. :0.00 Min. :1.000
## 1st Qu.:0 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000
## Median :0 Median :0.00 Median :1.000 Median :2.00 Median :2.000
## Mean :0 Mean :0.45 Mean :1.117 Mean :1.75 Mean :2.433
## 3rd Qu.:0 3rd Qu.:1.00 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :0 Max. :1.00 Max. :2.000 Max. :3.00 Max. :4.000
## V6 V7 V8 V9
## Min. :1.00 Min. :1.0 Min. :1.000 Min. :2.000
## 1st Qu.:2.00 1st Qu.:2.0 1st Qu.:2.000 1st Qu.:4.000
## Median :3.00 Median :4.0 Median :4.000 Median :5.000
## Mean :3.15 Mean :3.8 Mean :3.883 Mean :5.233
## 3rd Qu.:4.00 3rd Qu.:5.0 3rd Qu.:5.250 3rd Qu.:7.000
## Max. :5.00 Max. :6.0 Max. :7.000 Max. :8.000
## V10 V11 V12 V13
## Min. :2.000 Min. : 2.00 Min. : 2.00 Min. : 3.00
## 1st Qu.:3.750 1st Qu.: 4.00 1st Qu.: 3.75 1st Qu.: 5.00
## Median :6.000 Median : 6.00 Median : 5.50 Median : 9.50
## Mean :5.517 Mean : 5.95 Mean : 5.90 Mean : 8.35
## 3rd Qu.:7.000 3rd Qu.: 9.00 3rd Qu.: 8.00 3rd Qu.:11.00
## Max. :9.000 Max. :10.00 Max. :11.00 Max. :12.00
## V14 V15 V16 V17
## Min. : 3.000 Min. : 3.000 Min. : 3.0 Min. : 4.000
## 1st Qu.: 5.000 1st Qu.: 5.000 1st Qu.: 6.0 1st Qu.: 6.750
## Median : 8.000 Median : 8.000 Median :10.0 Median : 8.500
## Mean : 7.733 Mean : 8.367 Mean : 9.5 Mean : 9.583
## 3rd Qu.:10.000 3rd Qu.:12.000 3rd Qu.:13.0 3rd Qu.:13.000
## Max. :13.000 Max. :14.000 Max. :15.0 Max. :16.000
## V18 V19 V20 V21
## Min. : 5.00 Min. : 5.00 Min. : 5.00 Min. : 5.00
## 1st Qu.: 7.75 1st Qu.: 8.00 1st Qu.: 8.75 1st Qu.: 9.00
## Median :11.00 Median :11.50 Median :13.00 Median :14.00
## Mean :10.63 Mean :11.57 Mean :12.35 Mean :13.25
## 3rd Qu.:13.00 3rd Qu.:15.00 3rd Qu.:16.00 3rd Qu.:16.25
## Max. :17.00 Max. :18.00 Max. :19.00 Max. :20.00
There are also different ways I can select slices of data from the data frame:
## [1] 0
## [1] 16
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 0 0 1 3 1 2 4 7 8 3
## 2 0 1 2 1 2 1 3 2 2 6
## 3 0 1 1 3 3 2 6 2 5 9
## 4 0 0 2 0 4 2 2 1 6 7
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 5 0 1 1 3 3 1 3 5 2 4
## 6 0 0 1 2 2 4 2 1 6 4
## 7 0 0 2 2 4 2 2 5 5 8
## 8 0 0 1 2 3 1 2 3 5 3
## 9 0 0 0 3 1 5 6 5 5 8
## 10 0 1 1 2 1 3 5 3 5 8
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17
## V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
## 5 9 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0
## V39 V40
## 5 1 1
## [1] 4 4 15 8 10 15 13 9 11 6 3 8 12 3 5 10 11 4 11 13 15 5 14
## [24] 13 4 9 13 6 7 6 14 3 15 4 15 11 7 10 15 6 5 6 15 11 15 6
## [47] 11 15 14 4 10 15 11 6 13 8 4 13 12 9
A subtle point is that the last selection returned a vector instead of a data frame. This is because
we selected only a single column. If you don’t want this behavior do:
## [1] "data.frame"
## [1] "integer"
## [1] "data.frame"
Other functions you can call are min, max, mean, sd and median to get statistical values of interest:
## [1] 18
## [1] 18
## [1] 1
## [1] 3.8
## [1] 4
## [1] 1.725187
To do more complex calculations, like the maximum inflammation for all patients or
the average for each day, we need to apply the function max or mean per row or column
respectively. Luckily there is the function apply that applies a function over each one of the
"margins": 1 for rows and 2 for columns.
args(apply) # args returns NULL because it prints the information, but every function must return something!
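The apply calls that produce the two outputs below are not shown in this excerpt; they would be along these lines (max_patient_inflammation is an assumed name, while avg_day_inflammation is the name used later for the plot):

max_patient_inflammation <- apply(data, 1, max)   # maximum per patient (margin 1 = rows)
avg_day_inflammation <- apply(data, 2, mean)      # average per day (margin 2 = columns)
max_patient_inflammation
avg_day_inflammation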
## [1] 18 18 19 17 17 18 17 20 17 18 18 18 17 16 17 18 19 19 17 19 19 16 17
## [24] 15 17 17 18 17 20 17 16 19 15 15 19 17 16 17 19 16 18 19 16 19 18 16
## [47] 19 15 16 18 14 20 17 15 17 16 17 19 18 18
## V1 V2 V3 V4 V5 V6
## 0.0000000 0.4500000 1.1166667 1.7500000 2.4333333 3.1500000
## V7 V8 V9 V10 V11 V12
## 3.8000000 3.8833333 5.2333333 5.5166667 5.9500000 5.9000000
## V13 V14 V15 V16 V17 V18
## 8.3500000 7.7333333 8.3666667 9.5000000 9.5833333 10.6333333
## V19 V20 V21 V22 V23 V24
## 11.5666667 12.3500000 13.2500000 11.9666667 11.0333333 10.1666667
## V25 V26 V27 V28 V29 V30
## 10.0000000 8.6666667 9.1500000 7.2500000 7.3333333 6.5833333
## V31 V32 V33 V34 V35 V36
## 6.0666667 5.9500000 5.1166667 3.6000000 3.3000000 3.5666667
## V37 V38 V39 V40
## 2.4833333 1.5000000 1.1333333 0.5666667
plot(avg_day_inflammation)
plot is a function with many arguments, so you will probably need to study a lot of examples to
do what you want (change an axis, name an axis, change the plot points and/or lines, add a title,
add grids, add a legend, color the graph, add arrows and text, etc.).
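As a small illustration of such arguments, here is a sketch of a more customized call (all argument values are illustrative):

plot(avg_day_inflammation,
     type = "b",                                # points and lines
     xlab = "Day", ylab = "Average inflammation",
     main = "Average inflammation per day")
grid()                                          # add grid lines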
1.5 R Scripts
So far we have been typing directly into the R command line. What we could also do is save a
sequence of commands in an R source file to run it at will. The way to do this is to have such a
file with an .R extension and use the function source to run it.
If the source file contains an analysis from beginning to end, it is good practice to always
clear your session of variables using rm(list=ls()). On the other hand, if it is used as a library,
for example to load some functions you have created, then you probably should not do that. You
can also include other source files inside your current source file using the function (you guessed
it): source
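A minimal sketch of the idea (the file names are hypothetical):

source("helpers.R")    # load functions defined in another source file
source("analysis.R")   # run a complete analysis script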
Because it is good practice to have coding guidelines/conventions that standardize the way
you write your scripts and make them more readable, Google provides some: Google's R Style Guide4
1.6 Functions
Let's learn how to create functions by creating a function fahr_to_kelvin that converts
temperatures from Fahrenheit to Kelvin:
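The function definition is not shown in this excerpt; a minimal sketch consistent with the conversion described and the outputs below would be:

fahr_to_kelvin <- function(temp) {
  # convert a temperature from Fahrenheit to Kelvin
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}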
To run a function:
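For example, calling it on the freezing and boiling points of water gives the outputs shown below:

fahr_to_kelvin(32)
fahr_to_kelvin(212)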
## [1] 273.15
## [1] 373.15
4 https://google.github.io/styleguide/Rguide.xml
## [1] -273.15
## [1] 0
## [1] 0
Let’s do an example:
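The best_practice vector and the print_words function are not shown in this excerpt; a reconstruction consistent with the output below would be:

best_practice <- c("Let", "the", "computer", "do", "the", "work")
print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}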
print_words(best_practice)
## [1] "Let"
## [1] "the"
## [1] "computer"
## [1] "do"
## [1] "the"
## [1] "work"
or another example:
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
len <- len + 1
}
# Number of vowels
len
## [1] 5
num <- 37
if (num > 100) {
print("greater")
} else {
print("not greater")
}
sign(-3)
## [1] -1
• equal (==)
• greater than or equal to (>=),
• less than or equal to (<=),
• and not equal to (!=).
We can also combine tests. An ampersand, &, symbolizes the logical and. A vertical bar, |,
symbolizes the logical or.
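For example, a few illustrative comparisons (the values are chosen arbitrarily):

num <- 37
num >= 37                  # TRUE
num != 37                  # FALSE
(num > 0) & (num < 100)    # TRUE, both conditions hold
(num < 0) | (num > 100)    # FALSE, neither condition holds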
head(iris)
str(iris)
summary(iris)
For the Species column, one can observe that the summary function did not compute the standard
statistics as it did for the other variables. From the str function we can see that the
Species column is a special type of value called factor, which is what R uses to declare categorical
or ordinal values. More on that in the next section.
We can also access the Sepal.Length attribute by name through the use of the $ operator:
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
mean(iris$Sepal.Length)
## [1] 5.843333
median(iris$Sepal.Length)
## [1] 5.8
min(iris$Sepal.Length)
## [1] 4.3
max(iris$Sepal.Length)
## [1] 7.9
sd(iris$Sepal.Length)
## [1] 0.8280661
var(iris$Sepal.Length)
## [1] 0.6856935
range(iris$Sepal.Length)
sort(iris$Sepal.Length)
## [1] 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 4.7 4.8 4.8 4.8 4.8 4.8 4.9
## [18] 4.9 4.9 4.9 4.9 4.9 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.1 5.1
## [35] 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.2 5.2 5.2 5.2 5.3 5.4 5.4 5.4 5.4 5.4
## [52] 5.4 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.6 5.6 5.6 5.6 5.6 5.6 5.7 5.7 5.7
## [69] 5.7 5.7 5.7 5.7 5.7 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.9 5.9 5.9 6.0 6.0
## [86] 6.0 6.0 6.0 6.0 6.1 6.1 6.1 6.1 6.1 6.1 6.2 6.2 6.2 6.2 6.3 6.3 6.3
## [103] 6.3 6.3 6.3 6.3 6.3 6.3 6.4 6.4 6.4 6.4 6.4 6.4 6.4 6.5 6.5 6.5 6.5
## [120] 6.5 6.6 6.6 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.8 6.8 6.8 6.9 6.9 6.9
## [137] 6.9 7.0 7.1 7.2 7.2 7.2 7.3 7.4 7.6 7.7 7.7 7.7 7.7 7.9
length(iris$Sepal.Length)
## [1] 150
1.10 Factors
The factor() command is used to create and modify factors in R:
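The sex factor discussed below is not defined in this excerpt; an example consistent with the description would be:

sex <- factor(c("male", "female", "female", "male"))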
R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even
though the first element in this vector is "male"). You can check this by using the function
levels(), and check the number of levels using nlevels():
levels(sex)
nlevels(sex)
## [1] 2
Sometimes, the order of the factors does not matter; other times you might want to specify the
order because it is meaningful (e.g., "low", "medium", "high") or it is required by a particular type
of analysis. Additionally, specifying the order of the levels allows us to compare levels:
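The food factor used below is not defined in this excerpt; a definition consistent with the error message would be:

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"))
min(food)   # fails, because the factor is not ordered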
## Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("l\
ow", : 'min' not meaningful for factors
min(food) ## works!
## [1] low
## Levels: low < medium < high
Acknowledgments
Parts of this chapter were taken and adapted from the Software Carpentry lessons5
for R. You can check them out to gain a more in-depth knowledge of R.
1.11 Challenge
• Use the help functionality to try and learn about the functions used above.
• How many passengers traveled on average in the year 1951?
• Which is the maximum number of passengers for the months January and
February?
• Calculate the summation per year and assign the result to a vector.
• Plot the vector nicely (names in axes, point and lines for the graph, title the graph,
add grid lines)
• Repeat the last two bullets for every month for all the years.
Tip: to transform a row of the data frame into a vector you can use unlist (e.g.
unlist(air["Jan",])).
5 https://software-carpentry.org/lessons/
Chapter 2 - Introduction to Machine Learning
2.1 Definition
Machine Learning (ML) is a subset of Artificial Intelligence (AI) in the field of computer science
that often uses statistical techniques to give computers the ability to “learn” (i.e., progressively
improve performance on a specific task) with data, without being explicitly programmed.
Machine Learning is often closely related to, if not used as an alternate term for, fields like
Data Mining (the process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems), Pattern Recognition,
Statistical Inference or Statistical Learning. All these areas often employ the same methods,
and perhaps the name changes based on the practitioner's expertise or the application domain.
• Supervised Learning: The system is presented with example inputs and their desired outputs
provided by the "teacher", and the goal of the machine learning algorithm is to create a
mapping from the inputs to the outputs. The mapping can be thought of as a function that,
given one of the training samples as input, should output the desired value.
• Unsupervised Learning: In the unsupervised learning case, the machine learning algorithm
is not given any examples of desired output, and is left on its own to find structure in its
input.
The main machine learning tasks, and the ones we will examine in the present textbook, are
separated based on what the system tries to accomplish in the end:
• Classification: inputs are divided into two or more classes, and the learner must produce
a model that assigns unseen inputs to one or more (multi-label classification) of these
classes. This is typically tackled in a supervised manner. Spam filtering is an example of
classification, where the inputs are email (or other) messages and the classes are “spam”
and “not spam”.
• Regression: also a supervised problem, the outputs are continuous rather than discrete.
• Clustering: a set of inputs is to be divided into groups. Unlike in classification, the groups
are not known beforehand, making this typically an unsupervised task.
• Dimensionality Reduction: simplifies inputs by mapping them into a lower-dimensional
space. Topic modeling is a related problem, where a program is given a list of human
language documents and is tasked with finding out which documents cover similar topics.
• Association Rules learning (or dependency modelling): Searches for relationships between
inputs. For example, a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently
bought together and use this information for marketing purposes. This is sometimes referred
to as market basket analysis.
1. Problem definition
2. Data preparation
3. Algorithm application
4. Algorithm optimization
5. Result presentation
If someone wants more formally defined processes for working on data problems there are
CRISP-DM and OSEMN. The steps for CRISP-DM6 (check out the diagram below7 ) are:
6 https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
7 By Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf
(Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610
CRISP-DM diagram
• Obtain
• Scrub
• Explore
• Models
• Interpret
8 http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
Part II - Classification
Chapter 3 - Classification with Decision trees
3.1 Introduction
library(rpart)
Furthermore, we shall use the library rpart.plot to plot the produced trees:
library(rpart.plot)
model <- rpart(Target ~ ., method = "class", data = ..., minsplit = ..., minbucket = ..., cp = ...)
We can plot the tree using plot(model) and text(model, use.n = TRUE), or, alternatively, using
the following command:
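The command referred to is presumably rpart.plot, which is also used later in this chapter:

rpart.plot(model, extra = 104, nn = TRUE)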
In order to check the parameters of rpart, we can run ?rpart. Additionally, if we execute
?rpart.control, we can also check the main parameters:
We are going to apply a decision tree algorithm on the above dataset. We will answer the
following questions:
a) If we split using the Gini index, on which feature (out of Outlook, Temperature, and Humidity)
should we perform the first split?
b) If we split using the Information gain, on which feature (out of Outlook, Temperature, and
Humidity) should we perform the first split?
c) Build the full decision tree using the Gini index. Also, plot the tree.
weather = read.csv("weather.txt")
library(rpart)
library(rpart.plot)
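The command for building a single-split tree on Outlook is not shown in this excerpt; a sketch would be (the column names Play and Outlook, and the control values forcing a single split, are assumptions):

model <- rpart(Play ~ Outlook, method = "class", data = weather,
               minsplit = 1, minbucket = 1, cp = -1)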
If we also run this command for Temperature and Humidity, and then plot all three with
rpart.plot(model, extra = 104, nn = TRUE), the following visualizations are produced:
Intuitively, we can already understand which of the three splits will be selected, right?
We compute the Gini index for Outlook using the following formulas:
GINI(Sunny) = 1 - Freq(Play = No \mid Outlook = Sunny)^2 - Freq(Play = Yes \mid Outlook = Sunny)^2

GINI(Rainy) = 1 - Freq(Play = No \mid Outlook = Rainy)^2 - Freq(Play = Yes \mid Outlook = Rainy)^2

GINI_{Outlook} = Freq(Outlook = Sunny) \cdot GINI(Sunny) + Freq(Outlook = Rainy) \cdot GINI(Rainy)
Similarly, the Gini index for Temperature and Humidity is GINI_{Temperature} = 0.367 and GINI_{Humidity} = 0.394,
respectively. Hence, answering question (a), the optimal first split using the Gini index is on
Outlook.
We can also make these computations using R. For Outlook, we build the following frequency
arrays:
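The code is not shown in this excerpt; a sketch that matches the formulas above and the freq/freqSum names used in the entropy code further below would be:

counts  <- table(weather$Outlook, weather$Play)   # raw Yes/No counts per Outlook value
freq    <- prop.table(counts, 1)                  # P(Play | Outlook)
freqSum <- prop.table(table(weather$Outlook))     # P(Outlook)

GINI_Sunny   <- 1 - freq["Sunny", "No"]^2 - freq["Sunny", "Yes"]^2
GINI_Rainy   <- 1 - freq["Rainy", "No"]^2 - freq["Rainy", "Yes"]^2
GINI_Outlook <- freqSum["Sunny"] * GINI_Sunny + freqSum["Rainy"] * GINI_Rainy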
We compute the Information Gain for Outlook using the following formulas:
Entropy(All) = -Freq(No) \cdot \log(Freq(No)) - Freq(Yes) \cdot \log(Freq(Yes))

Entropy(Sunny) = -Freq(No \mid Sunny) \cdot \log(Freq(No \mid Sunny)) - Freq(Yes \mid Sunny) \cdot \log(Freq(Yes \mid Sunny))

Entropy(Rainy) = -Freq(No \mid Rainy) \cdot \log(Freq(No \mid Rainy)) - Freq(Yes \mid Rainy) \cdot \log(Freq(Yes \mid Rainy))

GAIN_{Outlook} = Entropy(All) - Freq(Sunny) \cdot Entropy(Sunny) - Freq(Rainy) \cdot Entropy(Rainy)
Similarly, the Information Gain for Temperature and Humidity is GAIN_{Temperature} = 0.105 and
GAIN_{Humidity} = 0.071, respectively. Hence, answering question (b), the optimal first split
using the Information Gain is on Outlook.
We can also make these computations using R. Initially, we compute the total entropy of the
dataset:
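A sketch of this computation (freqAll is an assumed helper name):

freqAll <- prop.table(table(weather$Play))    # P(No), P(Yes) over the whole dataset
Entropy_All <- -freqAll["No"] * log(freqAll["No"]) - freqAll["Yes"] * log(freqAll["Yes"])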
Entropy_Sunny = -freq["Sunny", "No"] * log(freq["Sunny", "No"]) -
  freq["Sunny", "Yes"] * log(freq["Sunny", "Yes"])

Entropy_Rainy = -freq["Rainy", "No"] * log(freq["Rainy", "No"]) -
  freq["Rainy", "Yes"] * log(freq["Rainy", "Yes"])

GAIN_Outlook = Entropy_All -
  freqSum["Sunny"] * Entropy_Sunny -
  freqSum["Rainy"] * Entropy_Rainy
After that, we split the dataset into training and testing data:
trainingdata = iris2[c(1:40, 51:90, 101:140),]
testdata = iris2[c(41:50, 91:100, 141:150),]
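The training command itself is not shown in this excerpt; it would be along the lines of:

model <- rpart(Species ~ ., method = "class", data = trainingdata)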
The . in the formula stands for all the remaining variables in the data frame trainingdata. We
can also execute our model on the test set using the commands:
xtest = testdata[,1:2]
ytest = testdata[,3]
pred = predict(model, xtest, type="class")
library(MLmetrics)
cm = ConfusionMatrix(pred, ytest)
accuracy = Accuracy(pred, ytest)
precision = Precision(ytest, pred, 'versicolor')
recall = Recall(ytest, pred, 'versicolor')
f1 = F1_Score(ytest, pred, 'versicolor')
data.frame(precision, recall, f1)
(alternatively we can compute TP, FP, TN, and FN and find precision as TP/(TP + FP) and recall
as TP/(TP + FN))
Finally, the F-measure for the 3 models is shown in the following table (question (d)):
3.4 Exercise
You are given the training data in the following table for a binary classification problem.
where we notice that the probability of the set of features is given by the product of the individual
probabilities, as their values are assumed independent of one another.
library(e1071)
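The construction of the model is not shown here; with e1071 it follows this pattern (Target and traindata are placeholder names):

model <- naiveBayes(Target ~ ., data = traindata)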
predict(model, trvalue)
while if we further add the parameter type = "raw", we are also provided with the posterior
probabilities.
traffic = read.csv("traffic.txt")
library(e1071)
Since P(No | Hot, Vacation) > P(Yes | Hot, Vacation), the algorithm classifies the instance to No.
Similarly, for question (b):
Since P(No | Hot, Weekend) < P(Yes | Hot, Weekend), the algorithm classifies the instance to Yes.
Concerning question (c), we will re-compute the probabilities, this time adding 1 as the Laplacian
parameter. So, given that the new probabilities are P(Hot | Yes) = (1 + 1)/(4 + 3) = 2/7,
P(Weekend | Yes) = (2 + 1)/(4 + 3) = 3/7, P(Hot | No) = (2 + 1)/(4 + 3) = 3/7, and P(Weekend | No) =
(0 + 1)/(4 + 3) = 1/7, we compute:
Since P(No | Hot, Weekend) < P(Yes | Hot, Weekend), the algorithm classifies the instance to Yes.
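In R, the corresponding model (without Laplace smoothing) can be built as follows; printing it shows the a-priori and conditional probability tables used in the computations above:

model <- naiveBayes(HighTraffic ~ ., data = traffic)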
print(model)
A-priori probabilities:
Y
No Yes
0.5 0.5
Conditional probabilities:
Weather
Y Cold Hot Normal
No 0.25 0.50 0.25
Yes 0.50 0.25 0.25
Day
Y Vacation Weekend Work
No 0.25 0.00 0.75
Yes 0.25 0.50 0.25
For question (c), we build a naive bayes model with laplace smoothing:
model <- naiveBayes(HighTraffic ~ ., data = traffic, laplace = 1)
Furthermore, we import the libraries required to execute and evaluate the model:
library(e1071)
library(MLmetrics)
library(ROCR)
After that, we split the dataset into training and testing data:
trainingdata = votes[1:180,]
testingdata = votes[181:232,]
xtest = testingdata[,-1]
ytest = testingdata[,1]
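The model construction is not shown in this excerpt; since the class is the first column of votes, a sketch using e1071's default interface would be:

model <- naiveBayes(trainingdata[, -1], trainingdata[, 1])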
pred = predict(model, xtest)
predprob = predict(model, xtest, type = "raw")
ConfusionMatrix(ytest, pred)
Precision(ytest, pred, "democrat")
Recall(ytest, pred)
Plotting the ROC curve (question (c)) initially requires computing the TPR and FPR with the
following commands:
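The ROCR calls are not shown in this excerpt; they would look something like this (pred_obj and ROCcurve match the object names used below, and treating democrat as the positive class is an assumption):

pred_obj <- prediction(predprob[, "democrat"], ytest)
ROCcurve <- performance(pred_obj, "tpr", "fpr")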
If we plot the ROCcurve object, we shall see the TPR and FPR values and the corresponding thresholds.
Finally, we can plot the curve as follows:
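A sketch of the plotting command:

plot(ROCcurve)
abline(0, 1, lty = 2)   # diagonal reference line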
And find the area under the curve using the command:
performance(pred_obj, "auc")
ROC curves
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical
plot that illustrates the performance of a binary classifier system as its discrimination
threshold is varied. The curve is created by plotting the true positive rate (TPR) against
the false positive rate (FPR) at various threshold settings.
An ROC curve demonstrates several things:
• It shows the tradeoff between sensitivity and specificity (any increase in sensitivity
will be accompanied by a decrease in specificity).
• The closer the curve follows the left-hand border and then the top border of the
ROC space, the more accurate the classifier.
• The closer the curve comes to the 45-degree diagonal of the ROC space, the less
accurate the test.
• The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that
value of the test.
• The area under the curve is a measure of test accuracy. This is discussed further
in the next section.
Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect
test; an area of .5 represents a worthless test. A rough guide for classifying the accuracy
of a diagnostic test is the traditional academic point system:
library(class)
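The command referred to below is not shown in this excerpt; the knn function of the class package has this general shape:

pred <- knn(X_train, X_test, Y_train, k = 3)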
where we have to provide the training data (X_train, Y_train), the new instances (X_test), and
the value of k. If we also add the parameter prob = TRUE, then the probabilities for the class are
also given.
X1 X2 Y
0.7 0.7 A
0.7 0.8 A
0.6 0.6 A
0.5 0.5 A
0.5 0.6 A
0.5 0.7 A
0.5 0.8 A
0.7 0.5 B
0.8 0.7 B
0.8 0.5 B
0.8 0.6 B
1.0 0.3 B
1.0 0.5 B
1.0 0.6 B
knndata = read.csv("knndata.txt")
library(class)
X_train = knndata[,c("X1","X2")]
Y_train = knndata$Y
When k = 1 (question (b)), we have to find the nearest instance (using euclidean distance) to
(0.7, 0.4), which is (0.7, 0.5) that belongs to class B. So, we execute the command:
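A sketch of that command (the new instance is supplied as a one-row data frame with the same column names):

knn(X_train, data.frame(X1 = 0.7, X2 = 0.4), Y_train, k = 1)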
rm(list=ls())
# download the dataset
fileURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cance\
r-wisconsin/breast-cancer-wisconsin.data"
download.file(fileURL, destfile="breast-cancer-wisconsin.data", method="curl")
# read the data
data <- read.table("breast-cancer-wisconsin.data", na.strings = "?", sep=",")
# remove the id column
data <- data[,-1]
# put names in the columns (attributes)
names(data) <- c("ClumpThickness",
"UniformityCellSize",
"UniformityCellShape",
"MarginalAdhesion",
"SingleEpithelialCellSize",
"BareNuclei",
"BlandChromatin",
"NormalNucleoli",
"Mitoses",
"Class")
# make the class a factor
data$Class <- factor(data$Class, levels=c(2,4), labels=c("benign", "malignant"))
# set the seed
set.seed(1234)
# split the dataset
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]
library(class)
Because all the attributes are between 1 and 10, there is no need to normalize them between 0
and 1, since no attribute will dominate the others in the distance calculation of kNN. Because
knn accepts the training and testing datasets without the target column, which it takes as a 3rd
argument, we are going to do some data manipulation to get the data the way the knn function
likes them (check the manual with ?knn). Also, because no missing values are allowed in kNN,
let's remove those too.
Let's predict; there is no training step when using kNN, as the training instances themselves are the
model.
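A sketch of the data manipulation and the prediction described above (the Class column is the 10th one; prediction is the name used in the evaluation below):

trainData <- trainData[complete.cases(trainData), ]
validationData <- validationData[complete.cases(validationData), ]

prediction <- knn(trainData[, -10], validationData[, -10], trainData$Class, k = 3)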
You can play with the values of k to look for a better model.
5.3.3 Evaluation
Make the predictions for the validation dataset and print the confusion matrix:
cat("Confusion matrix:\n")
xtab = table(prediction, validationData$Class)
print(xtab)
cat("\nEvaluation:\n\n")
accuracy = sum(prediction == validationData$Class)/length(validationData$Class)
precision = xtab[1,1]/sum(xtab[,1])
recall = xtab[1,1]/sum(xtab[1,])
f = 2 * (precision * recall) / (precision + recall)
cat(paste("Accuracy:\t", format(accuracy, digits=2), "\n",sep=" "))
cat(paste("Precision:\t", format(precision, digits=2), "\n",sep=" "))
cat(paste("Recall:\t\t", format(recall, digits=2), "\n",sep=" "))
cat(paste("F-measure:\t", format(f, digits=2), "\n",sep=" "))
## Confusion matrix:
##
## prediction benign malignant
## benign 99 4
## malignant 6 60
##
## Evaluation:
## Accuracy: 0.94
## Precision: 0.94
## Recall: 0.96
## F-measure: 0.95
The function createDataPartition does a stratified random split of the data, similar to what
we did above ourselves (though ours was not stratified). Then we will use the train function to build the
kNN model.
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
kNNFit <- train(Class ~ .,
data = training,
method = "knn",
preProc = c("center", "scale"))
print(kNNFit)
## k-Nearest Neighbors
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7683412 0.5331399
## 7 0.7559100 0.5053696
## 9 0.7378175 0.4715006
12 https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
We can also search for the best k value given the training dataset.
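The call is not shown in this excerpt; a sketch using caret's tuneLength argument (15 matches the 15 values of k in the output below; kNNFit1 is an assumed name):

kNNFit1 <- train(Class ~ .,
                 data = training,
                 method = "knn",
                 tuneLength = 15,
                 preProc = c("center", "scale"))
print(kNNFit1)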
## k-Nearest Neighbors
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7477453 0.4840235
## 7 0.7189901 0.4225515
## 9 0.7211797 0.4275156
## 11 0.7140987 0.4150135
## 13 0.7031182 0.3932055
## 15 0.7034819 0.3945755
## 17 0.6914316 0.3698916
## 19 0.6855830 0.3588189
## 21 0.6847459 0.3619330
## 23 0.6821917 0.3571894
## 25 0.6626137 0.3186673
## 27 0.6551801 0.3042504
## 29 0.6660760 0.3291024
## 31 0.6643681 0.3273283
## 33 0.6697744 0.3389183
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
To create a 10-fold cross-validation based search of k, repeated 3 times, we have to use the function
trainControl:
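A sketch of the setup described (kNNFit2 is an assumed name):

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
kNNFit2 <- train(Class ~ .,
                 data = training,
                 method = "knn",
                 tuneLength = 15,
                 trControl = ctrl,
                 preProc = c("center", "scale"))
print(kNNFit2)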
the output:
## k-Nearest Neighbors
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 142, 142, 142, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7820261 0.5548350
## 7 0.7709314 0.5321546
## 9 0.7605147 0.5111479
## 11 0.7504657 0.4924263
## 13 0.7396324 0.4705332
## 15 0.7247386 0.4401370
## 17 0.7058170 0.4006357
## 19 0.7179330 0.4246010
## 21 0.7055229 0.4003491
## 23 0.6930065 0.3779050
## 25 0.6933742 0.3791319
## 27 0.7052451 0.4028364
## 29 0.7163562 0.4277789
## 31 0.7332108 0.4629731
## 33 0.7176389 0.4335254
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
For predictions:
## k-Nearest Neighbors
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 142, 142, 142, 141, 141, 141, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8846768 0.8773148 0.6988095
## 7 0.8613467 0.8726852 0.6678571
## 9 0.8642361 0.8601852 0.6500000
## 11 0.8592634 0.8518519 0.6494048
## 13 0.8527364 0.8282407 0.6250000
## 15 0.8384011 0.8009259 0.5994048
## 17 0.8389550 0.8004630 0.5904762
## 19 0.8246280 0.8087963 0.5988095
## 21 0.8219783 0.8125000 0.6125000
## 23 0.8125951 0.8055556 0.6125000
## 25 0.8139716 0.7898148 0.6351190
## 27 0.8118676 0.7893519 0.6535714
## 29 0.8090112 0.7842593 0.6577381
## 31 0.8090939 0.7958333 0.6625000
## 33 0.7983300 0.7606481 0.6815476
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
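The prediction code is not shown in this excerpt; it would look something like this (kNNFit2 stands for whichever fitted model is used), with caret's confusionMatrix producing the statistics listed below:

knnPredict <- predict(kNNFit2, newdata = testing)
confusionMatrix(knnPredict, testing$Class)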
## Sensitivity : 0.9259
## Specificity : 0.6250
## Pos Pred Value : 0.7353
## Neg Pred Value : 0.8824
## Prevalence : 0.5294
## Detection Rate : 0.4902
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.7755
##
## 'Positive' Class : M
##
Reference
http://rstudio-pubs-static.s3.amazonaws.com/16444_caf85a306d564eb490eebdbaf0072df2.html13
13 http://rstudio-pubs-static.s3.amazonaws.com/16444_caf85a306d564eb490eebdbaf0072df2.html
Chapter 6 - Classification with Support Vector Machines
6.1 Introduction
library(e1071)
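The training command referred to below is not shown in this excerpt; it has this general shape (Target and traindata are placeholder names):

model <- svm(Target ~ ., data = traindata, kernel = "radial",
             type = "C-classification")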
where parameter kernel controls which kernel is going to be used (one of linear, polynomial,
radial), while we may also define more parameters for each kernel, such as degree, gamma, coef0.
By executing ?svm, we can see all the parameters.
Having a model, a new instance may be classified using the command:
predict(model, test)
If we also want to view the probabilities for each class, then we have to add the parameter
probability = TRUE. Note that the parameter probability = TRUE must be added both while
training (command svm) and while running (command predict).
X1 X2 y
Min. :-3.25322007 Min. :-4.664161 Min. :1.0
1st Qu.:-0.70900886 1st Qu.: 0.625361 1st Qu.:1.0
Median :-0.01162501 Median : 1.641370 Median :1.5
Mean : 0.03493570 Mean : 1.939284 Mean :1.5
3rd Qu.: 0.73337179 3rd Qu.: 3.115350 3rd Qu.:2.0
Max. : 3.63957363 Max. : 9.784076 Max. :2.0
Initially, we import the dataset and split it into training and testing data:
alldata = read.csv("alldata.txt")
trainingdata = alldata[1:600, ]
testdata = alldata[601:800, ]
library(MLmetrics)
library(e1071)
For questions (b), (c), we initially build a grid, on top of which we are going to plot the
hyperplanes:
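The grid construction is not shown in this excerpt; one possible sketch, spanning the range of the two features with the X1 and X2 column names used in the dataset:

x1range = range(trainingdata$X1)
x2range = range(trainingdata$X2)
grid = expand.grid(X1 = seq(x1range[1], x1range[2], length.out = 100),
                   X2 = seq(x2range[1], x2range[2], length.out = 100))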
For question (b), we train the SVM with gamma = 1 and apply it on the grid:
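A sketch of that step, under the assumption that the grid built above is used to visualize the decision regions; the blocks that follow then compute the training and testing error for a range of gamma values (gammavalues is defined elsewhere in the chapter):

svm_model = svm(y ~ ., kernel = "radial", type = "C-classification",
                data = trainingdata, gamma = 1)
grid_pred = predict(svm_model, grid)
plot(grid, col = as.integer(grid_pred), pch = 20, cex = 0.3)
points(trainingdata[, 1:2], col = trainingdata$y)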
training_error = c()
for (gamma in gammavalues) {
  svm_model = svm(y ~ ., kernel = "radial", type = "C-classification",
                  data = trainingdata, gamma = gamma)
  pred = predict(svm_model, trainingdata[, c(1:2)])
  training_error = c(training_error, 1 - Accuracy(trainingdata$y, pred))
}
testing_error = c()
for (gamma in gammavalues) {
  svm_model = svm(y ~ ., kernel = "radial", type = "C-classification",
                  data = trainingdata, gamma = gamma)
  pred = predict(svm_model, testdata[, c(1:2)])
  testing_error = c(testing_error, 1 - Accuracy(testdata$y, pred))
}
We may plot the two errors on the same diagram using the commands:
plot(training_error, type = "l", col = "blue", ylim = c(0, 0.5),
     xlab = "Gamma", ylab = "Error", xaxt = "n")
axis(1, at = 1:length(gammavalues), labels = gammavalues)
lines(testing_error, col = "red")
legend("right", c("Training Error", "Testing Error"), pch = c("-", "-"),
       col = c("blue", "red"))
6.2.4 Cross-validation
We will apply k-fold cross validation to compute the best value for gamma (question (f)). Initially,
we construct k folds:
k = 10
dsize = nrow(trainingdata)
folds = split(sample(1:dsize), ceiling(seq(dsize) * k / dsize))
After that, we make a loop over gamma, and inside the loop we create a new (nested) loop over
the folds:
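A sketch of the nested loop described above, accumulating one mean accuracy value per gamma (accuracies is the name used below):

accuracies = c()
for (gamma in gammavalues) {
  fold_accuracies = c()
  for (fold in folds) {
    cvtrain = trainingdata[-fold, ]
    cvtest  = trainingdata[fold, ]
    svm_model = svm(y ~ ., kernel = "radial", type = "C-classification",
                    data = cvtrain, gamma = gamma)
    pred = predict(svm_model, cvtest[, c(1:2)])
    fold_accuracies = c(fold_accuracies, Accuracy(cvtest$y, pred))
  }
  accuracies = c(accuracies, mean(fold_accuracies))
}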
We calculate the accuracy for each fold, and, finally, select the gamma value with the maximum
accuracy:
print(accuracies)
bestgamma = gammavalues[which.max(accuracies)]
6.3 Exercise
You are given the following data where Y is the class vector of a binary classification problem.
X1 X2 Y
2 2 1
2 -2 1
-2 -2 1
-2 2 1
1 1 2
1 -1 2
-1 -1 2
-1 1 2
c) Using your model, in which class would you classify a new instance with values (4, 5)?
Part III - Data Processing
Chapter 7 - Feature Selection
7.1 Introduction
What is feature selection? What are filter methods? What are wrapper methods?
data = iris
# split into training and validation datasets
set.seed(1234)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]
# keep only instances that do not have missing values.
trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]
library(FSelector)
subset <- cfs(Species ~ ., trainData)
f <- as.simple.formula(subset, "Species")
print(f)
The output is a formula that we can use in various classification algorithms and says that
according to the CFS algorithm and the training dataset, the best features in order to predict
the Species of the Iris flowers are the Petal Length and the Petal Width.
For example in the Naive Bayes algorithm, we will use both the formula that includes all the
attributes for predicting the Species and the formula derived from CFS.
library(e1071)
model <- naiveBayes(Species ~ ., data=trainData, laplace = 1)
simpler_model <- naiveBayes(f, data=trainData, laplace = 1)
library(MLmetrics)
train_pred <- predict(model, trainData)
train_simpler_pred <- predict(simpler_model, trainData)
paste("Accuracy in training all attributes",
Accuracy(train_pred, trainData$Species), sep=" - ")
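The corresponding validation-set comparison (not shown in this excerpt) would follow the same pattern:

validation_pred <- predict(model, validationData)
validation_simpler_pred <- predict(simpler_model, validationData)
paste("Accuracy in validation all attributes",
      Accuracy(validation_pred, validationData$Species), sep=" - ")
paste("Accuracy in validation CFS attributes",
      Accuracy(validation_simpler_pred, validationData$Species), sep=" - ")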
The accuracy on the training set is higher when using all the attributes, but lower on the
validation set compared to the simpler model that only used 2 attributes.
First, as always, we will split the dataset in train and validation datasets.
Next we will use the rfe method of the caret package, setting it up using the rfeControl method.
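The rfe setup is not shown in this excerpt; a sketch, assuming the PimaIndiansDiabetes data from the mlbench package (which matches the 8 predictors and the diabetes target used below) and random-forest-based selection functions, which is an assumption since the excerpt does not show which wrapper functions were used:

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(1234)
ind <- sample(2, nrow(PimaIndiansDiabetes), replace=TRUE, prob=c(0.7, 0.3))
trainData <- PimaIndiansDiabetes[ind==1,]
validationData <- PimaIndiansDiabetes[ind==2,]

control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
results <- rfe(trainData[, 1:8], trainData$diabetes, sizes = 1:8, rfeControl = control)
print(results)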
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.7453 0.3681 0.04390 0.12449
## 2 0.7546 0.4182 0.04598 0.12317
## 3 0.7715 0.4707 0.05759 0.14388 *
## 4 0.7584 0.4479 0.04145 0.10478
## 5 0.7546 0.4457 0.04429 0.10095
## 6 0.7490 0.4344 0.04393 0.09394
## 7 0.7509 0.4290 0.04686 0.11122
## 8 0.7491 0.4201 0.03507 0.07767
##
## The top 3 variables (out of 3):
## glucose, age, mass
Out of the 8 variables, RFE selected the three (glucose, age, mass) that provided the best accuracy
under 10-fold cross-validation.
We will use the 3 top variables that come out of RFE in the Naive Bayes algorithm:
library(e1071)
(f <- as.formula(paste("diabetes", paste(results$optVariables, collapse=" + "), sep=" ~ ")))
Outer parentheses
Note the usage of the outer parentheses. We both set the variable f equal to the
formula and print it to the console.
library(MLmetrics)
train_pred <- predict(model, trainData)
train_simpler_pred <- predict(simpler_model, trainData)
paste("Accuracy in training all attributes",
Accuracy(train_pred, trainData$diabetes), sep=" - ")
Using a simpler formula with 3 out of the 8 attributes, we were able to get better generalization results
in the validation dataset.
We will use the forward.search method of the FSelector package and the rpart decision trees
training method of the homonymous package. The forward.search method needs an evaluation
function that evaluates a subset of attributes. In our case the evaluator function performs 5-fold
cross-validation on the training dataset.
library(FSelector)
library(rpart)
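The evaluator and the search call are not shown in this excerpt; a sketch in the spirit of the FSelector documentation, printing each candidate subset and its cross-validated accuracy as in the output below (trainData is assumed to be the training split of the breast cancer data):

evaluator <- function(subset) {
  k <- 5
  set.seed(2)
  ind <- sample(k, nrow(trainData), replace = TRUE)
  results <- sapply(1:k, function(i) {
    train <- trainData[ind != i, ]
    test  <- trainData[ind == i, ]
    tree  <- rpart(as.simple.formula(subset, "Class"), train)
    error.rate <- sum(test$Class != predict(tree, test, type = "class")) / nrow(test)
    return(1 - error.rate)
  })
  print(subset)
  print(mean(results))
  return(mean(results))
}

subset <- forward.search(setdiff(names(trainData), "Class"), evaluator)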
## [1] "ClumpThickness"
## [1] 0.8535402
## [1] "UniformityCellSize"
## [1] 0.8982188
## [1] "UniformityCellShape"
## [1] 0.9132406
## [1] "MarginalAdhesion"
## [1] 0.8564007
## [1] "SingleEpithelialCellSize"
## [1] 0.9055005
## [1] "BareNuclei"
## [1] 0.9079083
## [1] "BlandChromatin"
## [1] 0.9027608
## [1] "NormalNucleoli"
## [1] 0.895089
## [1] "Mitoses"
## [1] 0.7942549
## [1] "ClumpThickness" "UniformityCellShape"
## [1] 0.9297849
## [1] "UniformityCellSize" "UniformityCellShape"
## [1] 0.9335359
## [1] "UniformityCellShape" "MarginalAdhesion"
## [1] 0.9167509
## [1] "UniformityCellShape" "SingleEpithelialCellSize"
## [1] 0.9404055
## [1] "UniformityCellShape" "BareNuclei"
## [1] 0.9384218
## [1] "UniformityCellShape" "BlandChromatin"
## [1] 0.9283328
## [1] "UniformityCellShape" "NormalNucleoli"
## [1] 0.944158
## [1] "UniformityCellShape" "Mitoses"
## [1] 0.9162663
## [1] "ClumpThickness" "UniformityCellShape" "NormalNucleoli"
## [1] 0.9405466
## [1] "UniformityCellSize" "UniformityCellShape" "NormalNucleoli"
## [1] 0.9322648
## [1] "UniformityCellShape" "MarginalAdhesion" "NormalNucleoli"
## [1] 0.9387149
## [1] "UniformityCellShape" "SingleEpithelialCellSize"
## [3] "NormalNucleoli"
## [1] 0.9314805
## [1] "UniformityCellShape" "BareNuclei" "NormalNucleoli"
## [1] 0.9470806
## [1] "UniformityCellShape" "BlandChromatin" "NormalNucleoli"
## [1] 0.9267064
## [1] "UniformityCellShape" "NormalNucleoli" "Mitoses"
## [1] 0.9401233
## [1] "ClumpThickness" "UniformityCellShape" "BareNuclei"
## [4] "NormalNucleoli"
## [1] 0.9335856
## [1] "UniformityCellSize" "UniformityCellShape" "BareNuclei"
## [4] "NormalNucleoli"
## [1] 0.9498017
## [1] "UniformityCellShape" "MarginalAdhesion" "BareNuclei"
## [4] "NormalNucleoli"
## [1] 0.9418352
## [1] "UniformityCellShape" "SingleEpithelialCellSize"
## [3] "BareNuclei" "NormalNucleoli"
## [1] 0.9415623
## [1] "UniformityCellShape" "BareNuclei" "BlandChromatin"
## [4] "NormalNucleoli"
## [1] 0.9506024
## [1] "UniformityCellShape" "BareNuclei" "NormalNucleoli"
## [4] "Mitoses"
## [1] 0.9382617
## [1] "ClumpThickness" "UniformityCellShape" "BareNuclei"
## [4] "BlandChromatin" "NormalNucleoli"
## [1] 0.9515459
## [1] "UniformityCellSize" "UniformityCellShape" "BareNuclei"
## [4] "BlandChromatin" "NormalNucleoli"
## [1] 0.930654
## [1] "UniformityCellShape" "MarginalAdhesion" "BareNuclei"
## [4] "BlandChromatin" "NormalNucleoli"
## [1] 0.9454745
## [1] "UniformityCellShape" "SingleEpithelialCellSize"
## [3] "BareNuclei" "BlandChromatin"
## [5] "NormalNucleoli"
## [1] 0.9385037
## [1] "UniformityCellShape" "BareNuclei" "BlandChromatin"
## [4] "NormalNucleoli" "Mitoses"
## [1] 0.935793
## [1] "ClumpThickness" "UniformityCellSize" "UniformityCellShape"
## [4] "BareNuclei" "BlandChromatin" "NormalNucleoli"
## [1] 0.9460072
## [1] "ClumpThickness" "UniformityCellShape" "MarginalAdhesion"
## [4] "BareNuclei" "BlandChromatin" "NormalNucleoli"
## [1] 0.9494458
## [1] "ClumpThickness" "UniformityCellShape"
## [3] "SingleEpithelialCellSize" "BareNuclei"
## [5] "BlandChromatin" "NormalNucleoli"
## [1] 0.9506376
## [1] "ClumpThickness" "UniformityCellShape" "BareNuclei"
## [4] "BlandChromatin" "NormalNucleoli" "Mitoses"
## [1] 0.939423
After the search we get the following formula, where 5 out of the 9 variables were kept.
f <- as.simple.formula(subset, "Class")
print(f)
The fact that we performed forward search using decision trees in order to get a formula with
a subset of attributes doesn't stop us from using another classification model for training and
prediction. For example, as in the previous examples in this chapter, we can use the Naive Bayes
algorithm to evaluate the forward selection result both on the training and the validation
datasets under the accuracy metric.
library(e1071)
model <- naiveBayes(Class ~ ., data=trainData, laplace = 1)
simpler_model <- naiveBayes(f, data=trainData, laplace = 1)
library(MLmetrics)
train_pred <- predict(model, trainData)
train_simpler_pred <- predict(simpler_model, trainData)
paste("Accuracy in training all attributes",
Accuracy(train_pred, trainData$Class), sep=" - ")
In the breast cancer Wisconsin dataset, the feature selection algorithm did not outperform the use
of all attributes. The likely cause is that the dataset's 9 attributes were handpicked
by domain experts and indeed have a combined predictive power that cannot be outperformed by any
subset of them. So removing features in this case does not produce better results.
Backward Search
One can also replace the forward.search method with backward.search to perform the
backward search wrapper selection method. How many attributes are kept in the case
of backward search?
Chapter 8 - Dimensionality Reduction
8.1 Principal Components Analysis
d <- c(2.5, 2.4, 0.5, 0.7, 2.2, 2.9, 1.9, 2.2, 3.1, 3.0, 2.3,
2.7, 2, 1.6, 1, 1.1, 1.5, 1.6, 1.1, 0.9)
data <- matrix(d, ncol=2, byrow = T)
plot(data, xlab="x1", ylab="x2")
We subtract the mean of each attribute (x1, x2) from the respective values (centering the data)
using the function scale.
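That centering step would be:

data_norm <- scale(data, center = TRUE, scale = FALSE)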
S <- cov(data_norm)
print(S)
## [,1] [,2]
## [1,] 0.6165556 0.6154444
## [2,] 0.6154444 0.7165556
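The decomposition producing the output below (udv is the name used afterwards) is obtained with svd:

udv <- svd(S)
udv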
## $d
## [1] 1.2840277 0.0490834
##
## $u
## [,1] [,2]
## [1,] -0.6778734 -0.7351787
## [2,] -0.7351787 0.6778734
##
## $v
## [,1] [,2]
## [1,] -0.6778734 -0.7351787
## [2,] -0.7351787 0.6778734
barplot(udv$d)
print(cumsum(udv$d)/sum(udv$d))
We can see that the 1st component accounts for more than 95% of the variance.
Step 6: Picking the 1st component
We transform the 2D dataset into a 1D dataset using just the 1st Principal Component (PC).
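A sketch of that projection, using the first column of udv$u as the principal direction:

data_new <- data_norm %*% udv$u[, 1]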
plot(data_new,data_new,asp=1,xlab="x1", ylab="x2")
library(class)
library(stats)
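The PCA fit whose summary appears below is not shown in this excerpt; a sketch would be (train.pca and validation.pca are the names used afterwards, and trainDataX/validationDataX hold the iris predictors of the two splits):

train.pca <- prcomp(trainDataX, center = TRUE, scale. = TRUE)
validation.pca <- predict(train.pca, validationDataX)
summary(train.pca)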
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7226 0.9313 0.36484 0.17916
## Proportion of Variance 0.7419 0.2168 0.03328 0.00802
## Cumulative Proportion 0.7419 0.9587 0.99198 1.00000
From the summary, we observe that with 2 PCs, we can explain more than 95% of the variance.
# no pca prediction
prediction = knn(trainDataX, validationDataX, trainDataY, k = 3)
# So let's predict using only the 2 principal components
prediction_pca = knn(train.pca$x[,1:2], validation.pca[,1:2], trainDataY, k = 3)
cat("Confusion matrix:\n")
## Confusion matrix:
##
## prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 12 1
## virginica 0 0 15
cat("\nEvaluation:\n\n")
##
## Evaluation:
## Accuracy: 0.97
##
## prediction_pca setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 3
## virginica 0 2 13
cat("\nEvaluation PCA:\n\n")
##
## Evaluation PCA:
Even though the validation dataset is small, one can observe that using 50% fewer predictor
attributes, the algorithm made only 4 extra mistakes. Of course, in real-world scenarios,
cross-validation should be used with such small datasets.
In this case PCA is not the best method for dimensionality reduction to improve the accuracy.
From the plot one can observe that the versicolor and virginica species are intermixed in 2D
and thus the k-NN algorithm will have trouble classifying between the two.
Part IV - Clustering
Chapter 9 - Centroid-based clustering and Evaluation
Centroid-based clustering is a family of clustering techniques where clusters are represented
by a central vector, which may not necessarily be a member of the data set. When the number
of clusters is fixed to a given number k, k-Means clustering can be formally defined as the
following optimization problem:
Given a dataset containing N samples:
X = \{x_1, x_2, x_3, \ldots, x_N\}
k-Means Algorithm Purpose: Find the distribution of all samples among clusters so as to
minimize the sum of squares of the distances between each sample and the center of the cluster
it belongs to. This metric corresponds to the Within-cluster Sum of Squares (WSS or SSE) and
is given by the following formula:
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - m_i\|^2
In order to solve the aforementioned optimization problem, the k-Means algorithm involves two
distinct steps:
1. Given certain centers, formulate clusters by assigning each sample to the closest cluster
2. Calculate the new centers based on the formulated clusters
In the initialization step, the algorithm randomly selects K samples as the initial centers and
then, using these centers, assigns each sample to the closest cluster. Once every sample has been
assigned to a certain cluster, the new center of each cluster is calculated as the mean of all
samples that belong to the cluster. This iterative process is continued until there is no change
in the clusters in two consecutive iterations.
The most well-known and widely used variation of the k-Means algorithm is k-Medoids, which is mainly
used for categorical data and allows the use of various different distance functions. It is worth
noting that, unlike k-Means, k-Medoids always uses an actual sample as the medoid of each
cluster.
9.1 - k-Means in R
For performing k-Means clustering in R, we can use the following command:
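That command is not shown in this excerpt; it has the following shape (3 clusters here is just an illustration):

model = kmeans(data, centers = 3)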
The centers parameter can either take the number of the initial centers (and thus the number of
clusters) or the centers themselves as argument.
We can use the following commands in order to show the final centroids and the distribution of
the samples into the formulated clusters:
# Get centroids
model$centers
9.2 - k-Medoids in R
In order to perform k-Medoids clustering in R, we need to import the library cluster:
library(cluster)
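The clustering command itself (not shown here) uses the pam function of that library, for example:

model = pam(data, k = 3)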
We can use the following commands in order to show the final medoids and the distribution of
the samples into the formulated clusters:
# Get centroids
model$medoids
If the dataset contains categorical variables, then we can show the medoids using the respective
variable with the following command:
data[model$id.med,]
• Cohesion (Within-cluster Sum of Squares)

WSS = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - m_i\|^2

• Separation (Between-cluster Sum of Squares)

BSS = \sum_{i=1}^{K} |C_i| \cdot \|m - m_i\|^2
• Silhouette coefficient
Silhouette(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}
where a(i) is the average distance between the i-th sample and all other samples within the
same cluster and b(i) is the lowest average distance of the i-th sample to all samples in any
other clusters, of which i is not a member.
We can compute the aforementioned metrics using the following R commands:
# Compute WSS
cohesion = model$tot.withinss
# Compute BSS
separation = model$betweenss
# Compute Silhouette
model_silhouette = silhouette(model$cluster, dist(data))
In the computation of the silhouette coefficient, data stands for the original data. In order to plot the
silhouette coefficient and the mean silhouette we can use the following commands:
# Plot Silhouette
plot(model_silhouette)
Another way that can provide us with an overview of the clustering and enables us to evaluate
its performance is the construction of heatmaps. At first, we can order our data samples using
the following command:
data_ord = data[order(model$cluster),]
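The heatmap itself can then be drawn from the distances of the ordered samples, for example:

heatmap(as.matrix(dist(data_ord)), Rowv = NA, Colv = NA, col = heat.colors(256))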
The following figure illustrates a sample heatmap constructed using data samples that are
distributed among 3 clusters. The color of the heatmap refers to the distance between the ordered data
samples; red denotes a small distance while yellow denotes a large distance.
X Y
x1 7 1
x2 3 4
x3 1 5
x4 5 8
x5 1 3
x6 7 8
x7 8 2
x8 5 9
# Construct vector containing the data labels which correspond to the row names
rnames = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8")
Once having created our data frame, we can plot our data using the following commands:
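The construction of the data frame and the plotting commands are not shown in this excerpt; a sketch based on the table above would be:

x = c(7, 3, 1, 5, 1, 7, 8, 5)
y = c(1, 4, 5, 8, 3, 8, 2, 9)
data = data.frame(X = x, Y = y, row.names = rnames)

plot(data, pch = 15)
text(data, labels = rownames(data), pos = 4)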
1st iteration:
At first, we will compute the distances between all data points contained in the dataset and the
initial centers. These distances are given in the following distance matrix:
After having computed the distances, we find the minimum distance for each data point in order
to assign it to the respective cluster. Once all data points are assigned to a cluster, we compute
the new centers.
C_1^{(1)} = \frac{x_1 + x_7}{2} = (7.5, 1.5) \qquad C_2^{(1)} = \frac{x_2 + x_4 + x_6 + x_8}{4} = (5, 7.25) \qquad C_3^{(1)} = \frac{x_3 + x_5}{2} = (1, 4)
2nd iteration:
The following table contains the distances of the data points from the updated centers.
C_2^{(2)} = \frac{x_4 + x_6 + x_8}{3} = (5.67, 8.33) \qquad C_3^{(2)} = \frac{x_2 + x_3 + x_5}{3} = (1.67, 4)
Given that the centers have changed, the execution of the algorithm continues with the next
iteration.
3rd iteration:
Once again, we compute the updated distances.
C_1^{(3)} = \frac{x_1 + x_7}{2} = (7.5, 1.5) \qquad C_2^{(3)} = \frac{x_4 + x_6 + x_8}{3} = (5.67, 8.33) \qquad C_3^{(3)} = \frac{x_2 + x_3 + x_5}{3} = (1.67, 4)
As we can see, the centers have not changed and thus the algorithm converges after the 3rd
iteration.
We can simply perform the aforementioned manual procedure with R using the following
commands:
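A minimal sketch of this step (the resulting clusters depend on the randomly chosen initial centers, so they may differ from the manual example):

# Apply k-Means with 3 clusters
model = kmeans(data, centers = 3)
# Cluster assignments and final centers
model$cluster
model$centers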
• Cohesion computation
$$WSS = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^2$$

1st Cluster:

$$WSS(C_1) = \sum_{x \in C_1} \lVert x - m_1 \rVert^2 = \lVert x_1 - m_1 \rVert^2 + \lVert x_7 - m_1 \rVert^2$$

$$= \left[ (7 - 7.5)^2 + (1 - 1.5)^2 \right] + \left[ (8 - 7.5)^2 + (2 - 1.5)^2 \right]$$
2nd Cluster:
$$WSS(C_2) = \sum_{x \in C_2} \lVert x - m_2 \rVert^2 = \lVert x_4 - m_2 \rVert^2 + \lVert x_6 - m_2 \rVert^2 + \lVert x_8 - m_2 \rVert^2$$

$$= \left[ (5 - 5.67)^2 + (8 - 8.33)^2 \right] + \left[ (7 - 5.67)^2 + (8 - 8.33)^2 \right] + \left[ (5 - 5.67)^2 + (9 - 8.33)^2 \right]$$
3rd Cluster:
$$WSS(C_3) = \sum_{x \in C_3} \lVert x - m_3 \rVert^2 = \lVert x_2 - m_3 \rVert^2 + \lVert x_3 - m_3 \rVert^2 + \lVert x_5 - m_3 \rVert^2$$

$$= \left[ (3 - 1.67)^2 + (4 - 4)^2 \right] + \left[ (1 - 1.67)^2 + (5 - 4)^2 \right] + \left[ (1 - 1.67)^2 + (3 - 4)^2 \right]$$
$$WSS_{Total} = \sum_{i} WSS(C_i) = WSS(C_1) + WSS(C_2) + WSS(C_3) = 1 + 3.3334 + 4.6667 = 9$$
• Separation computation
The Separation (Between cluster Sum of Squares) metric is given by the following equation:
$$BSS_{Total} = \sum_{i=1}^{K} |C_i| \cdot \lVert m - m_i \rVert^2$$
So, our first step is to compute the center of our dataset, which is given by the mean of all data
points.
$$m = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{8} = \begin{bmatrix} 4.625 \\ 5 \end{bmatrix}$$

$$BSS_{Total} = 2 \cdot \left[ (4.625 - 7.5)^2 + (5 - 1.5)^2 \right] + 3 \cdot \left[ (4.625 - 5.67)^2 + (5 - 8.33)^2 \right] + 3 \cdot \left[ (4.625 - 1.67)^2 + (5 - 4)^2 \right]$$
We can easily compute the above metrics using R with the following commands:
# Compute cohesion
cohesion = model$tot.withinss
# Compute separation
separation = model$betweenss
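# Plot the clustered data before displaying the centers on top of it
# (a sketch; `data` is the data frame constructed for this example)
plot(data, col = model$cluster)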
# Display centers
points(model$centers, col = 1:length(model$centers),
pch = "+", cex = 2)
X1 X2
Min. -4.720 -3.974
1st Qu. -1.237 -1.059
Median 1.757 2.096
Mean 1.322 1.532
3rd Qu. 3.592 3.675
Max. 6.077 6.689
We will use the aforementioned data in order to answer the following questions:
(a) Plot the data using a different color for each cluster, apply the k-Means algorithm and identify
the optimal number of clusters based on the SSE metric.
(b) Apply k-Means in order to formulate 3 clusters (plot the outcome using a different color for
each cluster along with the centroids) and compute the cohesion and the separation.
(c) Calculate the silhouette coefficient and create the silhouette plot.
# Read data
cdata = read.csv("cdata.txt")
# The third column contains the cluster for each data point
target = cdata[, 3]
After that, we can plot the dataset using the following command:
# Plot data
plot(cdata, col = target)
The output is the following figure, where the data points for each cluster are illustrated with a
different color (red, black, green):
In order to identify the optimal number of clusters for the given dataset, we will rely on
the values of the SSE metric. To that end, we will perform k-Means clustering setting the number of
clusters from 1 to 10 and calculate the SSE for each clustering. This process can
be done using the following commands:
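One possible sketch (assuming the coordinates are in the first two columns of cdata):

# Compute SSE (total within-cluster sum of squares) for k = 1 to 10
sse = sapply(1:10, function(k) kmeans(cdata[, 1:2], centers = k)$tot.withinss)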
Once finished, we can plot the SSE values using the command:
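A sketch consistent with the loop above:

# Plot SSE against the number of clusters
plot(1:10, sse, type = "b", xlab = "Number of clusters", ylab = "SSE")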
As we can see from the SSE values, following the elbow-method, the optimal number of clusters
for the given dataset is 2 or 3.
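For question (b), the commands might look like the following sketch (assuming that the first two columns of cdata hold the coordinates and that the fitted model is stored in model, as in the commands below):

# Apply k-Means with 3 clusters
model = kmeans(cdata[, 1:2], centers = 3)
# Plot the clusters along with the centroids
plot(cdata[, 1:2], col = model$cluster + 1)
points(model$centers, col = 1:3, pch = "+", cex = 2)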
# Calculate cohesion
cohesion = model$tot.withinss
# Calculate separation
separation = model$betweenss
# Compute silhouette
model_silhouette = silhouette(model$cluster, dist(cdata))
# Plot silhouette
plot(model_silhouette)
In addition, we can compute the mean silhouette using the following command:
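A sketch of such a command (the third column of the silhouette object holds the silhouette width of each sample):

# Compute mean silhouette
mean(model_silhouette[, 3])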
To construct a heatmap of the clustering, we first order the data samples by cluster:

cdata_ord = cdata[order(model$cluster),]
Rank Topic
Conference_1 High SE
Conference_2 Low SE
Conference_3 High ML
Conference_4 Low DM
Conference_5 Low ML
Conference_6 High SE
The above dataset presents the rank and the topic of 6 different conferences. SE stands for
Software Engineering, ML for Machine Learning and DM for Data Mining.
At first, we will construct our data frame containing the above dataset with the following
commands:
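The Rank and Topic vectors can be reconstructed from the table above (a sketch):

# Construct factors containing the rank and the topic of each conference
Rank = factor(c("High", "Low", "High", "Low", "Low", "High"))
Topic = factor(c("SE", "SE", "ML", "DM", "ML", "SE"))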
# Data frame
conferences = data.frame(Rank, Topic)
# Import library
library(cluster)
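The k-Medoids model can then be fitted, for instance, on Gower dissimilarities computed with daisy (using 3 clusters is an assumption):

# Compute Gower dissimilarities for the categorical variables and fit k-Medoids
model = pam(daisy(conferences), k = 3)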
We can display the medoids and the distribution of the data instances into clusters using the
following commands:
# Get medoids
model$medoids
We can also display the medoids by their names using the following command:
conferences[model$id.med,]
# Plot data
plot(model$data, xaxt = "n", yaxt = "n", pch = 15, col = model$cluster)
# Display medoids
points(conferences[model$id.med,], col = 1:3, pch = "o", cex = 2)
Chapter 10 - Connectivity-based clustering
Connectivity-based clustering, also known as hierarchical clustering, is based on the core
idea that every instance of a given dataset is related to every other instance, and that the degree
of this relation is stronger for "nearby" objects than for objects farther away. The term nearby
implies that these algorithms connect data instances to form clusters based on their distance. A
cluster can be described largely by the maximum distance needed to connect its parts. At different
distances, different clusters will form, which can be represented using a dendrogram; this also
explains where the common name "hierarchical clustering" comes from: these algorithms do not
provide a single partitioning of the dataset, but instead provide an extensive hierarchy of clusters
that merge with each other at certain distances.
X Y
x1 2 0
x2 8 4
x3 0 6
x4 7 2
x5 6 1
First of all, we need to properly construct our data frame containing the above dataset. In order
to do so, we can use the following commands:
# Construct vector containing the data labels which correspond to the row names
rnames = c("x1", "x2", "x3", "x4", "x5")
Once having created our data frame, we can plot our data using the following commands:
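The remaining construction and plotting commands might look like the following sketch (hdata is the name used for this data frame later in the chapter):

# Construct the data frame from the table above
X = c(2, 8, 0, 7, 6)
Y = c(0, 4, 6, 2, 1)
hdata = data.frame(X, Y, row.names = rnames)
# Plot data with the point labels
plot(hdata, pch = 15)
text(hdata, labels = rnames, pos = 3)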
1st iteration:
The following table presents the distance matrix for the aforementioned dataset.
x1 x2 x3 x4 x5
x1 0 7.21 6.32 5.39 4.12
x2 7.21 0 8.25 2.24 3.61
x3 6.32 8.25 0 8.06 7.81
x4 5.39 2.24 8.06 0 1.41
x5 4.12 3.61 7.81 1.41 0
As shown in the table, the points x4 and x5 are the ones having the smallest distance. As a result,
we link them together and thus we formulate our first cluster.
2nd iteration:
Since we are using single linkage, the distance between the cluster containing the points x4 and
x5 and every other point is the minimum of the distance between this point and the points x4
and x5.
The following table presents the updated distance matrix.
x1 x2 x3 x4 - x5
x1 0 7.21 6.32 4.12
x2 7.21 0 8.25 2.24
x3 6.32 8.25 0 7.81
x4 - x5 4.12 2.24 7.81 0
Again, using the same strategy, we link the point x2 with the cluster (x4 - x5).
3rd iteration:
In the next iteration, the distance matrix is updated again and the point x1 is linked with the
cluster (x2 - x4 - x5).
x1 x2 - x4 - x5 x3
x1 0 4.12 6.32
x2 - x4 - x5 4.12 0 7.81
x3 6.32 7.81 0
4th iteration:
Finally, the clustering is completed in the 4th iteration where the final point x3 is linked with the
cluster (x1 - x2 - x4 - x5).
x1 - x2 - x4 - x5 x3
x1 - x2 - x4 - x5 0 6.32
x3 6.32 0
At this point we will go through the same iterative process again, but for the case of complete
linkage.
1st iteration:
The first step is the same as in the case of single linkage, where we compute the distance matrix
of our original dataset and link the points x4 and x5 which are the ones having the smallest
distance.
x1 x2 x3 x4 x5
x1 0 7.21 6.32 5.39 4.12
x2 7.21 0 8.25 2.24 3.61
x3 6.32 8.25 0 8.06 7.81
x4 5.39 2.24 8.06 0 1.41
x5 4.12 3.61 7.81 1.41 0
2nd iteration:
As opposed to the case of single linkage, in complete linkage the distance between the cluster
containing the points x4 and x5 and every other point is the maximum of the distance between
this point and the points x4 and x5.
The following table presents the updated distance matrix.
x1 x2 x3 x4 - x5
x1 0 7.21 6.32 5.39
x2 7.21 0 8.25 3.61
x3 6.32 8.25 0 8.06
x4 - x5 5.39 3.61 8.06 0
Based on the updated distances, we link the point x2 with the cluster (x4 - x5). Up to this point,
the results are the same as in the case of single linkage.
3rd iteration:
In the next iteration, the distance matrix is updated again and this time the closest point are
found to be x1 and x3. As a result, they are linked together and formulate the cluster (x1 - x3)
x1 x2 - x4 - x5 x3
x1 0 7.21 6.32
x2 - x4 - x5 7.21 0 8.25
x3 6.32 8.25 0
4th iteration:
Finally, the clustering is completed in the 4th iteration where the clusters (x1 - x3) and (x2 - x4 -
x5) are linked together.
x1 - x3 x2 - x4 - x5
x1 - x3 0 8.25
x2 - x4 - x5 8.25 0
d = dist(hdata)
Have in mind that hdata is the already constructed data frame that contains our dataset (see
Data Construction section).
We can use R commands in order to perform hierarchical clustering and plot the respective
dendrogram that illustrates the formulated clusters.
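For single linkage, the commands might look like this (a sketch; the object name is an assumption):

# Hierarchical clustering with single linkage
hc_single = hclust(d, method = "single")
# Plot dendrogram
plot(hc_single)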
We can use the same commands for the case of complete linkage.
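For instance (again a sketch):

# Hierarchical clustering with complete linkage
hc_complete = hclust(d, method = "complete")
plot(hc_complete)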
The following figure depicts the produced dendrogram for the case of complete linkage.
It is worth noticing that the Y-axis of the dendrograms refers to the distance metric.
Once having performed the hierarchical clustering, we can split the original data into a number
of clusters based on our preference using the following command:
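A sketch of the split (cutting into 2 clusters is a hypothetical choice):

# Split each tree into 2 clusters
clusters_single = cutree(hc_single, k = 2)
clusters_complete = cutree(hc_complete, k = 2)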
In order to demonstrate the formulated clusters from the two aforementioned approaches (single
and complete linkage), we can plot them using the following commands:
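A plotting sketch consistent with the objects defined above:

# Plot the clusters for single and complete linkage side by side
par(mfrow = c(1, 2))
plot(hdata, col = clusters_single, pch = 15, main = "Single linkage")
plot(hdata, col = clusters_complete, pch = 15, main = "Complete linkage")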
The following figure depicts the two formulated clusters for both the cases of single and complete
linkage.
(c) Use the given data in order to form 7 clusters and produce the respective dendrogram that
illustrates the results
(d) Create a 3D plot of the data using a different color for each cluster
(e) Calculate and plot the silhouette coefficient
# Import libraries
library(cluster)
library(scatterplot3d)
After reading the given dataset, we scale the data and we compute the distance matrix using
Euclidean distance as our distance metric.
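A sketch of these steps (the file name is an assumption; europe is the object name used later in this exercise, and all of its columns are assumed to be numeric):

# Read and scale the data
europe = read.csv("europe.csv")
europe = scale(europe)
# Compute the distance matrix using Euclidean distance
d = dist(europe, method = "euclidean")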
After the computation of the distance matrix, we are ready to perform hierarchical clustering
using complete linkage.
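A sketch of the clustering step (hc is the object name used in the commands that follow):

# Hierarchical clustering with complete linkage
hc = hclust(d, method = "complete")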
We can plot the respective dendrogram (illustrated in the following Figure) using the following
command:
# Plot dendrogram
plot(hc)
After the computation of the silhouette values for the different clusterings, we can plot them
using the following command:
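Both the computation and the plot might look like the following sketch (the range k = 2, ..., 10 is an assumption):

# Mean silhouette for different numbers of clusters
avg_sil = sapply(2:10, function(k) mean(silhouette(cutree(hc, k = k), d)[, 3]))
# Plot mean silhouette against the number of clusters
plot(2:10, avg_sil, type = "b", xlab = "Number of clusters", ylab = "Mean silhouette")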
The following figure depicts the mean silhouette values for the different numbers of clusters. As
shown from the graph, the optimal number of clusters that maximizes the silhouette coefficient is 7.
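The split itself can be sketched as follows:

# Split the data into 7 clusters
clusters = cutree(hc, k = 7)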
Using the aforementioned command, we have split our data into 7 clusters. The variable
clusters holds the information of how the clusters are distributed among the data instances.
In order to display the constructed clusters on the produced dendrogram, we can use the
following commands:
# Plot dendrogram
plot(hc)
# Display clusters
rect.hclust(hc, k = 7)
The following figure is the output of the aforementioned commands, where each red rectangle
refers to each cluster.
We can also create a 3D plot of the clusters using the following commands:
# 3D scatter plot of the clusters
s3d = scatterplot3d(europe, angle = 125, scale.y = 1.5,
                    color = clusters)
# Calculate silhouette
model_silhouette = silhouette(clusters, d)
# Plot silhouette
plot(model_silhouette)
Chapter 11 - Density-based clustering
Density-based clustering is based on the idea that, given a certain dataset, clusters
can be defined as areas of higher density than the remainder of the dataset. Data instances
within the sparse areas, which are required to separate clusters, are usually considered to be
noise. The most famous and widely used density-based algorithm is DBSCAN (Density-Based
Spatial Clustering of Applications with Noise), which was introduced by Martin Ester, Hans-Peter
Kriegel, Jörg Sander and Xiaowei Xu in 1996. The main idea behind the DBSCAN algorithm is
that, given a set of points in some n-dimensional space, points that are closely packed together
(points with many nearby neighbors and thus of higher density) are grouped into clusters, while
points that lie alone in low-density regions are considered as outliers.
11.1 - DBSCAN in R
In order to use DBSCAN in R, we need the dbscan library, which can be imported with the
following command:
library(dbscan)
In order to perform clustering using DBSCAN and get the distribution of the input dataset into
clusters, we can use the dbscan function, which requires the following two parameters (an example
call is sketched after the list):
• epsilon (eps), which specifies the maximum distance that two data points can have in
order to be considered as part of the same cluster.
and
• minPoints (minPts), which specifies the minimum number of data points required to form
a cluster.
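A generic sketch of the call (x stands for any numeric dataset and the parameter values are placeholders):

# Perform DBSCAN clustering
model = dbscan(x, eps = 2, minPts = 2)
# Distribution of the data points into clusters (0 denotes noise)
model$cluster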
X Y
x1 2 10
x2 2 5
x3 8 4
x4 5 8
x5 7 5
x6 6 4
x7 1 2
x8 4 9
In order to visualize the data using R, we will first need to create the respective data frame using
the following commands:
# Construct vector containing the data labels which correspond to the row names
rnames = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8")
Once having created our data frame, we can plot our data using the following commands:
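The remaining construction and plotting commands might look like the following sketch (ddata is the name used for this data frame later in the chapter):

# Construct the data frame from the table above
X = c(2, 2, 8, 5, 7, 6, 1, 4)
Y = c(10, 5, 4, 8, 5, 4, 2, 9)
ddata = data.frame(X, Y, row.names = rnames)
# Plot data with the point labels
plot(ddata, pch = 15)
text(ddata, labels = rnames, pos = 3)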
We first need to compute the distance matrix for the given dataset which is given in the following
table:
x1 x2 x3 x4 x5 x6 x7 x8
x1 0.00
x2 5.00 0.00
x3 8.49 6.08 0.00
x4 3.61 4.24 5.00 0.00
x5 7.07 5.00 1.41 3.61 0.00
x6 7.21 4.12 2.00 4.12 1.41 0.00
x7 8.06 3.16 7.28 7.21 6.71 5.39 0.00
x8 2.24 4.47 6.40 1.41 5.00 5.39 7.62 0.00
Given that the minPoints value is 2, each cluster should contain at least two points. In
addition, based on the epsilon value, which is 2, DBSCAN links the pairs of points whose distance
does not exceed 2 (x3 - x5, x3 - x6, x5 - x6 and x4 - x8) and thus forms the clusters (x3 - x5 - x6)
and (x4 - x8), while the points x1, x2 and x7 are considered as noise.
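The same clustering can be reproduced in R with the following sketch:

# Apply DBSCAN with the parameters of the manual example
model = dbscan(ddata, eps = 2, minPts = 2)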
In order to get the distribution of the data points into clusters, we can use the following command:
clusters = model$cluster
Finally, we can plot the clustering results using the following commands:
# Plot clusters
plot(ddata, col=clusters+1, pch=15, main="DBSCAN(eps = 2, minPts = 2)")
In the following figure, we can see the formulated clusters from the previous subsection.
X1 X2
Min. 5.09 7.54
1st Qu. 10.45 9.78
Median 13.30 12.01
Mean 13.28 12.03
3rd Qu. 16.13 14.27
Max. 21.43 16.48
We will use the aforementioned data in order to answer the following questions:
(a) Plot the given data points
(b) Apply k-Means algorithm in order to create two clusters and plot the result using different
color for each cluster. Is the result satisfying? Do the created clusters correspond to the actual
ones?
(c) Optimize the eps parameter for DBSCAN using the kNN distance (use k = 10).
(d) Apply DBSCAN using eps = 0.4 and minPts = 10 into the given dataset. Plot the result using
different color for each cluster and black for the data points considered as noise.
# Import library
library(dbscan)
# Read data
mdata = read.csv("mdata.txt")
# Plot data
plot(mdata)
The following figure illustrates the given dataset. As we can see from the plot, we can identify
two distinct clusters.
Then, we apply k-Means and plot the clustering results using the following commands:
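The k-Means step itself might look like this (a sketch with 2 clusters, as requested in question (b)):

# Apply k-Means with 2 clusters
model = kmeans(mdata, centers = 2)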
# Plot data
plot(mdata, col = model$cluster + 1)
As we can see, k-Means is not able to sufficiently split the data into clusters. This result
originates from the fact that k-Means relies on the Euclidean distance and thus cannot distinguish
clusters organized in complex structures.
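For question (c), the 10-NN distances can be computed, for instance, with the kNNdist function of the dbscan package (a sketch):

# Compute the 10-NN distance of each data point
kdist = kNNdist(mdata, k = 10)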
Then, we can plot the respective graph after sorting the data into ascending order:
# Plot distances
plot(sort(kdist), type = 'l', xlab = "Points sorted by distance",
ylab = "10-NN distance")
In order to optimize eps, we follow the elbow method. As we can see from the graph, the
optimized eps values are in the interval [0.35, 0.45].
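For question (d), DBSCAN can then be applied as follows (a sketch):

# Apply DBSCAN with eps = 0.4 and minPts = 10
model = dbscan(mdata, eps = 0.4, minPts = 10)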
# Plot clusters (noise points are drawn with a different symbol)
plot(mdata, col = model$cluster + 1, pch = ifelse(model$cluster, 1, 4))
The following figure depicts the clustering result. As we can see, DBSCAN is able to efficiently
split the given dataset into clusters.
Chapter 12 - Distribution-based clustering
12.1 - Theoretical background
Distribution-based clustering is closely related to statistics and is based on the idea that
clusters can be defined as groups of data instances that most likely belong to the same distribution.
While the theoretical foundation of the methods in this clustering category is excellent, they
suffer from one key problem known as overfitting, unless certain constraints are imposed on the
complexity of the model. A more complex model will usually be able to explain the data better,
which makes choosing the appropriate level of complexity inherently difficult. One prominent
and widely used method is known as Gaussian Mixture Models or GMMs (using the
Expectation-Maximization algorithm or EM).
$$\ln P(X/\theta) = \ln \sum_{Z} P(X, Z/\theta)$$
The EM iteration alternates between performing an expectation step (E-step), which creates
a function for the expectation of the log-likelihood evaluated using the current estimate for
the parameters, and a maximization step (M-step), which computes parameters maximizing
the expected log-likelihood found on the E step. These parameter estimates are then used to
determine the distribution of the latent variables in the next E-step.
E-step: using the current parameter estimate $\theta^{old}$, evaluate the posterior distribution of the latent variables, $P(Z/X, \theta^{old})$.

M-step:

$$\theta^{new} = \arg\max_{\theta} \left( \sum_{Z} P(Z/X, \theta^{old}) \ln P(X, Z/\theta) \right)$$
The EM algorithm reaches convergence when the log-likelihood value is no longer updated or
the update is smaller than a given threshold.
$$P(x) = \sum_{k=1}^{K} \pi_k \, N(x/\mu_k, \Sigma_k)$$

where

$$\pi_k: \text{Mixing Coefficients} \in [0, 1] \quad \text{and} \quad \sum_{k=1}^{K} \pi_k = 1$$

$$N: \quad N(x/\mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
GMMs use EM as the optimization algorithm that enables calculating the latent variables of the
distributions that describe the given data. As a result, the EM algorithm in the case of GMMs
maximizes the following quantity:
$$\ln P(X/\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left( \sum_{k=1}^{K} \pi_k \, N(x_n/\mu_k, \Sigma_k) \right)$$
D1_X D2_X
Min. -1.905 3.992
1st Qu. 0.223 6.322
Median 0.907 6.944
Mean 1.000 6.985
3rd Qu. 1.664 7.681
Max. 3.659 10.810
# Read data
gdata = read.csv("gdata.txt")
In order to plot the data along with the probability density function, we can use the following
commands:
# Plot data
plot(data.frame(x, 0), ylim = c(-0.01, 0.25), col = y,
xlab = "Data",
ylab = "Density")
The following figure illustrates the given dataset along with the probability density function.
The data from each cluster are illustrated with a different color (black and red).
E-step:
In this step, we want to calculate the probability that a given data point originates from a certain
normal distribution. This probability is given by the following equation:

$$P(D_j/x_i, \theta_j) = \frac{\lambda_j \cdot P(x_i/\theta_j)}{\lambda_1 \cdot P(x_i/\theta_1) + \lambda_2 \cdot P(x_i/\theta_2)}$$
At this point, we will compute the above probability for the given point $x_i = 2$:

$$P(x_i/\theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$$
$$P(2/\theta_1) = \frac{1}{\sqrt{2\pi \cdot 1}} \, e^{-\frac{(2-0)^2}{2 \cdot 1}} = \frac{1}{\sqrt{2\pi}} \, e^{-2} = 0.054$$

$$P(2/\theta_2) = \frac{1}{\sqrt{2\pi \cdot 1}} \, e^{-\frac{(2-1)^2}{2 \cdot 1}} = \frac{1}{\sqrt{2\pi}} \, e^{-1/2} = 0.242$$
The probabilities of the given point to originate from distributions 1 and 2 are given by the
following two equations:

$$P(D_1/2, \theta_1) = \frac{0.5 \cdot 0.054}{0.5 \cdot 0.054 + 0.5 \cdot 0.242} = 0.182$$

$$P(D_2/2, \theta_2) = \frac{0.5 \cdot 0.242}{0.5 \cdot 0.054 + 0.5 \cdot 0.242} = 0.818$$
Given the aforementioned results, the point $x_i = 2$ originates from distribution 1 with probability
0.182 and from distribution 2 with probability 0.818. Once having computed the aforementioned
probabilities for all 1,000 given data points, we will proceed to the maximization step of the EM
algorithm.
M-step:
In this step, we will calculate the parameters $\mu_1, \mu_2$ that maximize the probabilities $P(D_1/x_i, \theta_1)$ and $P(D_2/x_i, \theta_2)$.
In order to compute the parameters $\mu_1, \mu_2$, we will use the following equations:

$$\mu_1 = \frac{\sum_{i=1}^{1000} x_i \cdot P(D_1/x_i, \theta_1)}{\sum_{i=1}^{1000} P(D_1/x_i, \theta_1)} \qquad \mu_2 = \frac{\sum_{i=1}^{1000} x_i \cdot P(D_2/x_i, \theta_2)}{\sum_{i=1}^{1000} P(D_2/x_i, \theta_2)}$$
The above equations calculate the weighted average of all data points, where the weights are
actually the probabilities for each data point to originate from one of the two distributions.
The mixing coefficients are given by the following equations:
$$\lambda_1 = \frac{\sum_{i=1}^{1000} P(D_1/x_i, \theta_1)}{1000} \qquad \lambda_2 = \frac{\sum_{i=1}^{1000} P(D_2/x_i, \theta_2)}{1000}$$
After computing the values of the parameters $\mu_1, \mu_2, \lambda_1, \lambda_2$, the log-likelihood of the given dataset
is given by the following equation:

$$\ln(P(X/\theta)) = \sum_{i=1}^{1000} \ln\left( \sum_{j=1}^{2} \lambda_j \cdot P(x_i/\theta_j) \right) = \sum_{i=1}^{1000} \ln\left( \lambda_1 \cdot P(x_i/\theta_1) + \lambda_2 \cdot P(x_i/\theta_2) \right)$$
Once having computed the log-likelihood of the dataset, we compare it with the value of the
previous iteration. If the value has not changed by more than a given threshold (epsilon value),
then the algorithm converges. Otherwise, we return to the E-step to re-calculate the new
parameter values and continue the execution.
# Initialize means
mu = c(0, 1)
# Initialize lambdas
lambda = c(0.5, 0.5)
# x is assumed to hold the vector of the 1,000 data values
loglik_old = -Inf
repeat {
  # ------ Expectation step ------
  # Find distributions given mu, lambda (and sigma)
  T1 <- dnorm(x, mean = mu[1], sd = 1)
  T2 <- dnorm(x, mean = mu[2], sd = 1)
  P1 <- lambda[1] * T1 / (lambda[1] * T1 + lambda[2] * T2)
  P2 <- lambda[2] * T2 / (lambda[1] * T1 + lambda[2] * T2)
  # ------ Maximization step ------
  # Update the means (weighted averages) and the mixing coefficients
  mu <- c(sum(P1 * x) / sum(P1), sum(P2 * x) / sum(P2))
  lambda <- c(mean(P1), mean(P2))
  # Check convergence of the log-likelihood (threshold value is an assumption)
  loglik <- sum(log(lambda[1] * dnorm(x, mu[1], 1) + lambda[2] * dnorm(x, mu[2], 1)))
  if (abs(loglik - loglik_old) < 1e-6) break
  loglik_old <- loglik
}
library(mixtools)
We can use the expectation maximization (EM) algorithm and get an overview of the final calculated
parameters using the following commands:
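A sketch using the normalmixEM function (x is assumed to hold the data vector; two mixture components as in the manual example):

# Fit a Gaussian mixture with 2 components using EM
model = normalmixEM(x, k = 2)
# Overview of the estimated parameters (lambda, mu, sigma)
summary(model)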
We can plot both the original and the estimated probability density functions (into the same plot)
using the following commands:
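One possible sketch, which uses a kernel density estimate as a stand-in for the actual density:

# Actual density (dashed line)
plot(density(x), lty = 2, main = "Actual vs. estimated density")
# Estimated mixture density (solid line)
curve(model$lambda[1] * dnorm(x, model$mu[1], model$sigma[1]) +
        model$lambda[2] * dnorm(x, model$mu[2], model$sigma[2]), add = TRUE)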
The following figure illustrates the probability density functions. The dashed line refers to the
actual density, while the solid line refers to the estimated one. As we can see, the estimation lies
very close to the actual density.
Then, we will use the given dataset in order to answer the following questions:
(a) Plot the given dataset using a different color for each cluster.
(b) Construct a GMM model in order to cluster the provided dataset into 3 clusters. Set the epsilon
value to 0.1.
(c) Show how the EM algorithm converges and plot the data using the estimated probability
density function
(d) Assign each data point to the cluster with the highest probability and plot the formulated
clusters along with their centroids.
(e) Calculate and plot silhouette coefficient
(f) Show the heatmap of the data after sorting them based on the above clustering results
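At first, we read the dataset (the file name is an assumption, following the naming of the other exercises):

# Read data
gsdata = read.csv("gsdata.txt")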
# The third column contains the cluster for each data point
target = gsdata[, 3]
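A sketch of the plot command for question (a) (assuming the first two columns hold the coordinates):

# Plot data colored by the actual cluster
plot(gsdata[, 1:2], col = target)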
The output is the following figure, where the data points for each cluster are illustrated with
different colors (red, black, green):
# Construct model
model = mvnormalmixEM(gsdata, k = 3 , epsilon = 0.1)
Once we have constructed the model, we can use the following command to demonstrate how
the algorithm converges:
plot(model, which = 1)
As shown in the graph, it is worth noticing that the EM algorithm converges very fast; from the
2nd iteration onwards, the log-likelihood remains almost the same.
Finally, we can plot the given dataset along with the probability density function for each cluster
using the following command:
plot(model, which = 2)
Using the model$posterior command, we can see the probability for each data point to originate
from each one of the three clusters (soft-assignments). However, in order to answer this question,
we have made hard-assignments using the following commands:
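A sketch of the hard-assignment step (each point is assigned to the component with the highest posterior probability):

# Hard-assignments based on the posterior probabilities
clusters = apply(model$posterior, 1, which.max)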
# Calculate centers
centers = matrix(unlist(model$mu), byrow = TRUE, ncol = 2)
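The clusters and their centroids can then be plotted, for instance, as follows (again assuming the first two columns of gsdata hold the coordinates):

# Plot clusters along with their centroids
plot(gsdata[, 1:2], col = clusters)
points(centers, col = 1:3, pch = "+", cex = 2)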
Using the above hard-assignments, we can compute the silhouette coefficient and construct the
heatmap.
# Calculate silhouette
model_silhouette = silhouette(clusters, dist(gsdata))
# Plot silhouette
plot(model_silhouette)
# Order the data samples by cluster
gsdata_ord = gsdata[order(clusters),]
# Construct heatmap
heatmap(as.matrix(dist(gsdata_ord)), Rowv = NA, Colv = NA,
        col = heat.colors(256), revC = TRUE)
X
Min. -2.2239
1st Qu. 0.6396
Median 5.1139
Mean 5.0336
3rd Qu. 9.1177
Max. 12.0908
We will use the given dataset in order to answer the following questions:
(a) Plot the data along with the probability density function
(b) Perform clustering using the EM algorithm setting the number of clusters to 2, 3, 4 and 5. For
each case, plot the probability density function and calculate the information criteria AIC and
BIC.
(c) Plot the AIC and BIC values for the different number of clusters and select the optimal number
of clusters
# Read data
icdata = read.csv("icdata.txt")
# The first column contains the data points
x = icdata[, 1]
# The second column contains the cluster for each data point
y = icdata[, 2]
After that, we can plot the dataset along with the probability density function using the following
command:
# Plot data
plot(data.frame(x, 0), ylim = c(-0.01, 0.1))
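The density itself can be overlaid, for instance, with a kernel density estimate (a sketch):

# Overlay a density estimate of the data
lines(density(x))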
Assuming a model that has t parameters, is trained on N data samples and has log-likelihood L,
the AIC and BIC criteria are given by the following equations:

$$AIC = 2 \cdot t - 2 \cdot \ln(L)$$

$$BIC = t \cdot \ln(N) - 2 \cdot \ln(L)$$
We can cluster the data using 2, 3, 4 and 5 distributions using the following code:
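A sketch of this step (assumptions: normalmixEM from mixtools, and t = 3k - 1 free parameters for a mixture of k univariate Gaussians):

# Fit mixtures with 2 to 5 components and compute AIC and BIC
aic = c()
bic = c()
for (k in 2:5) {
  m = normalmixEM(x, k = k)
  t = 3 * k - 1
  aic = c(aic, 2 * t - 2 * m$loglik)
  bic = c(bic, t * log(length(x)) - 2 * m$loglik)
}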
Finally, we can plot AIC and BIC values using the following commands:
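A plotting sketch consistent with the loop above:

# Plot AIC and BIC against the number of clusters
plot(2:5, aic, type = "b", xlab = "Number of clusters", ylab = "Criterion value")
lines(2:5, bic, type = "b", lty = 2)
legend("topright", legend = c("AIC", "BIC"), lty = c(1, 2))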
The AIC and the BIC values in relation to the number of clusters are shown in the following
figure:
Part V - Extended Topics
Chapter 13 - Association Rules
The association rules extraction algorithm is included in the arules library. The supermarket
transaction data to be used for executing market basket analysis can be found in the Grocery
Shopping datasets page of ACM RecSys. More specifically, we will use the Belgium retail market
dataset.
First let's load the data, read it with the read.transactions function and inspect the first 10
transactions:
library(arules)
fileURL <- "http://fimi.ua.ac.be/data/retail.dat.gz"
download.file(fileURL, destfile="retail.data.gz", method="curl")
# Read the data in basket format
trans = read.transactions("retail.data.gz", format = "basket", sep=" ");
inspect(trans[1:10])
## items
## [1] {0,
## 1,
## 10,
## 11,
## 12,
## 13,
## 14,
## 15,
## 16,
## 17,
## 18,
## 19,
## 2,
## 20,
## 21,
## 22,
## 23,
## 24,
## 25,
## 26,
## 27,
## 28,
## 29,
## 3,
## 4,
## 5,
## 6,
## 7,
## 8,
## 9}
## [2] {30,
## 31,
## 32}
## [3] {33,
## 34,
## 35}
## [4] {36,
## 37,
## 38,
## 39,
## 40,
## 41,
## 42,
## 43,
## 44,
## 45,
## 46}
## [5] {38,
## 39,
## 47,
## 48}
## [6] {38,
## 39,
## 48,
## 49,
## 50,
## 51,
## 52,
## 53,
## 54,
## 55,
## 56,
## 57,
## 58}
## [7] {32,
## 41,
## 59,
## 60,
## 61,
## 62}
## [8] {3,
## 39,
## 48}
## [9] {63,
## 64,
## 65,
## 66,
## 67,
## 68}
## [10] {32,
## 69}
One can see that each transaction contains a list of item IDs. The function summary, given a dataset
read by the read.transactions function, will provide a specialized summary for transactional
data:
summary(trans)
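The rules themselves can then be extracted with the apriori function; a sketch consistent with the parameter specification shown in the output below (minimum support 1%, minimum confidence 60%):

# Extract association rules
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.6))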
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 881
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.22s].
## sorting and recoding items ... [70 item(s)] done [0.01s].
## set of 84 rules
inspect(rules)
For example, the first rule {37} => {38} says that if someone buys the item with code 37, he or she
will probably also buy the item with code 38, with support 1.2%, confidence 97.4% and a lift of 5.5.
For a more detailed analysis of the arules library check the Introduction to arules14 vignette.
14 https://mran.revolutionanalytics.com/web/packages/arules/vignettes/arules.pdf