PLS Tutorial
Partial Least Squares: A tutorial
Lutgarde Buydens

Outline:
• Multivariate regression
• Multiple Linear Regression (MLR)
• Principal Component Regression (PCR)
• Partial Least Squares (PLS)
• Validation
• Preprocessing
Raw data

[Figure: raw spectra, plotted against wavenumber (cm-1), 2000-14000. The data form a predictor matrix X (n samples × p variables) and a response matrix Y (n samples × k responses).]
From univariate to Multiple Linear Regression (MLR)

Univariate regression fits a single predictor:

    y = b0 + b1 x1 + ε
12/9/2013
MLR extends this to p predictors and k responses:

    Y(n×k) = X(n×p) B(p×k) + E(n×k)

With a column of ones prepended to X for the intercept b0 (p+1 columns), the least-squares solution is

    b = (X^T X)^-1 X^T y = (b0, b1, ..., bp)^T

Example with two correlated predictors, y = b1 x1 + b2 x2 + ε: two MLR fits can give very different coefficients (b1 = 10.3, b2 = -6.92 versus b1 = 2.96, b2 = 0.28) while both reach R² = 0.98.

Disadvantages: (X^T X)^-1 does not exist, or is numerically unstable, when the variables are collinear or when p > n.
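The instability of the normal-equations solution under collinearity can be demonstrated with a short NumPy sketch (synthetic data; all names are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors: x2 is x1 plus a little noise.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])   # column of ones for b0
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# MLR via the normal equations: b = (X^T X)^-1 X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)

# A tiny perturbation of y gives noticeably different coefficients ...
y2 = y + 0.01 * rng.normal(size=n)
b2 = np.linalg.solve(X.T @ X, X.T @ y2)

# ... yet both fits explain the data almost equally well.
def r2(X, y, b):
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(b[1:], r2(X, y, b))
print(b2[1:], r2(X, y2, b2))
```

Both fits report a high R², while the individual coefficients of the collinear pair are unreliable.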
PCR: Principal Component Regression

PCR handles collinearity by combining PCA and MLR:
Step 1: PCA compresses X (n rows × p cols) into a score matrix T (n rows × a cols), with a << p.
Step 2: MLR regresses y on T, giving the a-coefficients a1 ... aa.
Step 3: Calculate the b-coefficients (b0, b1 ... bp) from the a-coefficients, so the model can be applied directly to X.

This is dimension reduction via latent variables (PCR, PLS), as opposed to variable selection.
PCR algorithm:

Step 0: Mean-center X (and y).
Step 1: Perform PCA: X = T P^T; keeping a components gives the reconstruction X* = (T P^T)*.
Step 2: Perform MLR on the scores: Y = T A, with solution A = (T^T T)^-1 T^T Y.
Step 3: Calculate B. MLR on the reconstructed X* = (T P^T)* means Y = X* B = (T P^T) B, so A = P^T B; since the columns of P are orthonormal, this inverts to B = P A (the dimension reduction step). Finally, calculate the b0's from the mean of y and the column means of X.

PCR uses the scores (projections) on latent variables that explain maximal variance in X.

The number of components a is chosen by cross-validation, with the root-mean-square error of cross-validation as criterion:

    RMSECV = sqrt( Σi (yi - ŷi)² / n )
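The PCR steps can be sketched in NumPy as follows (synthetic low-rank data; the variable names are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, a = 40, 10, 3                     # n samples, p variables, a retained PCs

# Low-rank synthetic data: y depends on the same latent structure as X.
scores_true = rng.normal(size=(n, a))
X = scores_true @ rng.normal(size=(a, p)) + 0.01 * rng.normal(size=(n, p))
y = scores_true @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

# Step 0: mean-center X and y
Xm, ym = X.mean(axis=0), y.mean()
Xc, yc = X - Xm, y - ym

# Step 1: PCA via SVD: Xc = U S V^T; scores T = U S, loadings P = V
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :a] * S[:a]                    # n x a score matrix
P = Vt[:a].T                            # p x a loading matrix

# Step 2: MLR on the scores: A = (T^T T)^-1 T^T y
A = np.linalg.solve(T.T @ T, T.T @ yc)

# Step 3: back-transform to the original variables: B = P A; b0 from the means
B = P @ A
b0 = ym - Xm @ B

rmse = np.sqrt(np.mean((y - (X @ B + b0)) ** 2))
print(rmse)
```

The back-transformation B = P A lets new samples be predicted directly from X, without recomputing scores.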
PLS: Partial Least Squares Regression

Like PCR, PLS works in two phases: Phase 1 compresses X (n rows × p cols) into a score matrix T (n rows × a cols); Phase 2 applies MLR to regress y on T, after which the b-coefficients are computed as before.

Phase 1: Calculate new independent variables (T).
Sequential algorithm: latent variables and their scores are calculated sequentially.
• Step 1: Calculate LV1 = w1 that maximizes the covariance of X and Y: SVD on X^T Y.
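Step 1 of the sequential algorithm can be sketched as follows: the first weight vector w1 is the first left singular vector of X^T Y (synthetic data; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 8
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, 2)) + 0.1 * rng.normal(size=(n, 2))

# PLS assumes mean-centered data.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# SVD on X^T Y: the first left singular vector maximizes cov(X w, Y q).
U, S, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
w1 = U[:, 0]        # first PLS weight vector, unit length
t1 = Xc @ w1        # scores on LV1

print(np.linalg.norm(w1), t1.shape)
```

Subsequent latent variables are obtained by deflating X (and optionally Y) and repeating the same SVD step.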
VALIDATION

Common measure for prediction error: RMSEP (root-mean-square error of prediction).

Basic principle: test how well your model works with new data it has not seen yet. The prediction error of the samples the model was built on is biased, because those samples were also used to build the model. Therefore, split the data into a training set and a test set.
Training and test sets
• Split the full data set into a training set and a test set.
• Build the model (b0 ... bp) on the training set.
• Apply the model to the test set to obtain ŷ and RMSEP.
• The test set should be representative of the training set.
• Random choice is often the best; check for extremely unlucky divisions.
• Apply the whole procedure to the test and validation sets.
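A minimal sketch of the split-and-evaluate procedure, assuming synthetic data and an MLR model (the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=n)

# Random split: 2/3 training, 1/3 test (random choice is often the best).
idx = rng.permutation(n)
train, test = idx[:40], idx[40:]

# Fit MLR on the training set only (intercept via a column of ones).
Xtr = np.column_stack([np.ones(len(train)), X[train]])
b = np.linalg.lstsq(Xtr, y[train], rcond=None)[0]

# Apply the model to the unseen test set and compute RMSEP.
Xte = np.column_stack([np.ones(len(test)), X[test]])
rmsep = np.sqrt(np.mean((y[test] - Xte @ b) ** 2))
print(rmsep)
```

Note that the model (including any preprocessing parameters) is fitted on the training samples only; the test samples are used solely for RMSEP.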
Cross-validation
• Split the data into a training set and a validation set.
• Split the data again into a different training set and validation set, and repeat:
  – until all samples have been in the validation set once.
  – Common choice: Leave-One-Out (LOO) cross-validation.
• Choose the model (e.g. the number of latent variables) with the lowest RMSECV.

Double cross-validation
• An outer loop splits off a test set, on which ŷ and RMSEP are computed.
• An inner cross-validation loop on the remaining samples selects the model with the lowest RMSECV.
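Leave-one-out cross-validation for choosing the number of latent variables can be sketched as follows, here for a PCR model on synthetic low-rank data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 12
scores = rng.normal(size=(n, 2))
X = scores @ rng.normal(size=(2, p)) + 0.01 * rng.normal(size=(n, p))
y = scores @ np.array([1.0, -1.5]) + 0.01 * rng.normal(size=n)

def pcr_predict(Xtr, ytr, Xte, a):
    """Fit an a-component PCR model on (Xtr, ytr), predict for Xte."""
    Xm, ym = Xtr.mean(axis=0), ytr.mean()
    U, S, Vt = np.linalg.svd(Xtr - Xm, full_matrices=False)
    T, P = U[:, :a] * S[:a], Vt[:a].T
    A = np.linalg.solve(T.T @ T, T.T @ (ytr - ym))
    return (Xte - Xm) @ (P @ A) + ym

def rmsecv(X, y, a):
    """Leave-one-out RMSECV for an a-component model."""
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        yhat = pcr_predict(X[mask], y[mask], X[i:i + 1], a)
        press += (y[i] - yhat[0]) ** 2
    return np.sqrt(press / len(y))

errors = [rmsecv(X, y, a) for a in range(1, 6)]
best = 1 + int(np.argmin(errors))   # number of LVs with the lowest RMSECV
print(best, errors)
```

Each left-out sample is predicted by a model that never saw it; the component count minimizing RMSECV is selected.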
12/9/2013
1.5 0.2
Absorbance (a.u.)
model has not seen before 1 0.1
0.05
0.5 0
-0.05
0 -0.1
-0.15
Remark: for final model use whole data set.
-0.5 -0.2
2000 4000 6000 8000 10000 12000 14000 2000 4000 6000 8000 10000 12000 14000
Wavenumber (cm-1) Wavenumber (cm-1)
1.5
Absorbance (a.u.)
0.6 0.5
0
0.5
-0.5
3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000
0.4
RMSECV
Wavenumber (cm-1)
0.3
10
8
Regression coefficient
0.2
6
0.1 4
2
0
1 2 3 4 5 6 7 8 9 10
Number of LVs 0
-2
3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000
Wavenumber (cm-1)
10
12/9/2013
PREPROCESSING
• Alignment
• Scatter correction
• Noise removal
• Scaling, Normalisation
• Transformation
• …..

Other:
• Missing values
• Outliers

[Figure: raw spectra, intensity (a.u.) vs. wavelength (cm-1), 500-2500; predicted vs. true NaOH concentration, 8-20.]
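Two of the listed steps, scatter correction and scaling, can be sketched with Standard Normal Variate (SNV) correction followed by mean-centering. SNV is one common scatter-correction choice, used here as an assumed example:

```python
import numpy as np

rng = np.random.default_rng(5)
spectra = rng.normal(loc=1.0, size=(5, 100))   # 5 spectra x 100 wavelengths

# SNV: subtract each spectrum's own mean and divide by its own standard
# deviation, removing additive (offset) and multiplicative scatter effects.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) \
      / spectra.std(axis=1, keepdims=True)

# Mean-centering: subtract the per-wavelength mean over all samples,
# as done before PCA/PCR/PLS.
centered = snv - snv.mean(axis=0)

print(snv.mean(axis=1).round(12))   # each row now has mean ~0
```

SNV operates row-wise (per sample), while mean-centering operates column-wise (per variable); the order shown here (SNV first, then centering) is the usual one.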
[Figure: simulated scatter effects on a spectrum (original; offset; offset + slope; multiplicative; offset + slope + multiplicative), intensity (a.u.) vs. wavelength (a.u.), 0-1600; and the effect of a log transformation on classification accuracy (%). From J. Engel et al., TrAC 2013.]
SOFTWARE