PLS Tutorial

This document provides an overview of multivariate regression techniques, including multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS). It discusses how MLR extends simple linear regression to handle multiple independent variables. PCR and PLS are introduced as alternatives to MLR that address issues like collinearity. The tutorial also covers preprocessing data and validating regression models.


Partial Least Squares: A Tutorial
Lutgarde Buydens (12/9/2013)

Outline:
• Multivariate regression
• Multiple Linear Regression (MLR)
• Principal Component Regression (PCR)
• Partial Least Squares (PLS)
• Validation
• Preprocessing

Multivariate Regression

[Figure: raw data, NIR spectra plotted as absorbance (a.u.) vs. wavenumber (cm⁻¹), 2000 to 14000 cm⁻¹]

The data are arranged in an n × p matrix X and an n × k matrix Y.
• Rows: cases, observations, … For example: analytical observations of different samples, experimental runs, persons, …
• Columns: variables, classes, tags. For example: p spectral variables (analytical measurements) in X; k columns of class information, concentrations, … in Y.
• X: independent variables (will always be available)
• Y: dependent variables (to be predicted later from X)

Goal: Y = f(X), i.e. predict Y from X. Three methods are covered:
• MLR: Multiple Linear Regression
• PCR: Principal Component Regression
• PLS: Partial Least Squares

From Univariate to Multiple Linear Regression (MLR)

Univariate least-squares regression:

  y = b0 + b1x1 + ε

• b0: intercept
• b1: slope
• ε: residual error

Multiple Linear Regression extends this to p independent variables:

  y = b0 + b1x1 + b2x2 + … + bpxp + ε

In matrix form Y = Ŷ + E, where Ŷ is the fitted part and E the residuals. The least-squares fit maximizes the correlation r(y, ŷ); geometrically, it fits a (hyper)plane in the space spanned by x1, x2, …, xp.


MLR: Multiple Linear Regression

  y = b0 + b1x1 + b2x2 + … + bpxp + ε

In matrix notation (one y-variable; k y-variables):

  y(n×1) = X(n×(p+1)) b((p+1)×1) + e(n×1)
  Y(n×k) = X B + E(n×k)

Here X is augmented with a leading column of ones so that b0 appears as the first element of b = (b0, b1, …, bp)ᵀ. The least-squares solution is:

  b = (XᵀX)⁻¹Xᵀy

Disadvantages, both caused by the inverse (XᵀX)⁻¹:
• Uncorrelated X-variables are required: if r(x1, x2) ≈ 1, XᵀX becomes (nearly) singular
• n ≥ p + 1: more samples than variables are needed
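As a minimal numpy sketch of this least-squares solution (illustrative data and variable names, not from the tutorial):

    import numpy as np

    # Illustrative data: n = 6 samples, p = 2 X-variables.
    X = np.array([[0.1, 1.2], [0.9, 2.1], [1.8, 2.9],
                  [3.1, 4.2], [4.0, 5.1], [5.2, 5.8]])
    y = np.array([1.0, 2.2, 3.1, 4.9, 6.2, 7.8])

    # Augment X with a column of ones so that b[0] is the intercept b0.
    Xa = np.column_stack([np.ones(len(X)), X])

    # Normal equations: b = (X^T X)^(-1) X^T y.
    b = np.linalg.inv(Xa.T @ Xa) @ Xa.T @ y

    # Numerically safer equivalent that avoids the explicit inverse:
    b_stable, *_ = np.linalg.lstsq(Xa, y, rcond=None)

The explicit inverse is written out only to mirror the formula; lstsq (or a QR-based solver) is the better choice in practice, precisely because (XᵀX)⁻¹ is fragile, as the next slides show.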

MLR: Multiple Linear Regression

Disadvantage of (XᵀX)⁻¹: uncorrelated X-variables required.

When r(x1, x2) ≈ 1, the observations lie (almost) on a line in the (x1, x2) plane, and least squares must fit a plane through a line! The tilt of the plane around that line is undetermined, so the coefficients become unstable.

Example: Sets A and B are identical except for a single value of x2 (0.21 vs. 0.23 in the fourth sample), yet the MLR coefficients change completely.

  y = b1x1 + b2x2 + ε

         Set A           Set B
      x1      x2      x1      x2        y
    -1.01   -0.99   -1.01   -0.99    -1.89
     3.23    3.25    3.23    3.25    10.33
     5.49    5.55    5.49    5.55    19.09
     0.23    0.21    0.23    0.23     2.19
    -2.87   -2.91   -2.87   -2.91    -8.09
     3.67    3.76    3.67    3.76    11.29

            Set A           Set B
          b1      b2      b1      b2
   MLR   10.3   -6.92    2.96    0.28

R² = 0.98 for both fits.
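This instability is easy to reproduce. A short Python sketch using the table's own numbers (np.linalg.lstsq stands in for the (XᵀX)⁻¹Xᵀy solution):

    import numpy as np

    # Sets A and B from the table: x2 differs only in the fourth sample.
    x1   = np.array([-1.01, 3.23, 5.49, 0.23, -2.87, 3.67])
    x2_A = np.array([-0.99, 3.25, 5.55, 0.21, -2.91, 3.76])
    x2_B = x2_A.copy()
    x2_B[3] = 0.23
    y = np.array([-1.89, 10.33, 19.09, 2.19, -8.09, 11.29])

    for name, x2 in [("A", x2_A), ("B", x2_B)]:
        X = np.column_stack([x1, x2])        # model without intercept
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(name, b)  # b1, b2 swing wildly because X^T X is near-singular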

PCR: Principal Component Regression

PCR addresses both MLR disadvantages (the inverse (XᵀX)⁻¹ requires uncorrelated X-variables and n ≥ p + 1) through dimension reduction: either variable selection or latent variables (PCR, PLS).

Step 1: Perform PCA on the original X(n×p), giving a score matrix T(n×a) with a ≤ p orthogonal columns.
Step 2: Use the orthogonal PC scores as independent variables in an MLR model for y, giving coefficients a1, …, aa.
Step 3: Calculate the b-coefficients (b0, b1, …, bp, in terms of the original variables) from the a-coefficients.


PCR: Principal Component Regression

Step 0: Mean-center X.
Step 1: Perform PCA: X = TPᵀ; with a components this is the truncated reconstruction X* = (TPᵀ)*.
Step 2: Perform MLR on the scores: Y = TA, so A = (TᵀT)⁻¹TᵀY.
Step 3: Calculate B in terms of the original variables. From Y = X*B = (TPᵀ)B and Y = TA it follows that A = PᵀB, and since P has orthonormal columns (PᵀP = I), B = PA.
Finally, calculate the intercepts from the means: b0 = ȳ − x̄ᵀB.

Dimension reduction: PCR uses scores (projections) on latent variables that explain maximal variance in X.
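A minimal Python sketch of Steps 0 to 3 for a single y-variable, using numpy's SVD for the PCA step (function and variable names are illustrative):

    import numpy as np

    def pcr_fit(X, y, a):
        # Step 0: mean-center X (and center y for convenience).
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc = X - x_mean
        # Step 1: PCA via SVD; columns of P are the loadings.
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        P = Vt[:a].T                       # p x a loading matrix
        T = Xc @ P                         # n x a orthogonal scores
        # Step 2: MLR on the scores: A = (T^T T)^(-1) T^T y.
        A = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
        # Step 3: back to the original variables: B = P A.
        B = P @ A
        b0 = y_mean - x_mean @ B           # intercept from the means
        return b0, B

Because the scores are orthogonal, TᵀT is diagonal and the solve is trivial; this is exactly what makes PCR immune to the collinearity problem of MLR.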

PCR: Optimal Number of PCs

Calculate the cross-validation RMSE for different numbers of PCs and choose the number that minimizes it:

  RMSECV = √( Σi (yi − ŷi)² / n )

PLS: Partial Least Squares Regression

PLS follows the same scheme as PCR, but computes the latent variables from X and Y jointly:
Phase 1: Decompose X(n×p) into scores T(n×a) on a latent variables (PLS step).
Phase 2: MLR of y on the scores, giving coefficients a1, …, aa.
Phase 3: Calculate b0, b1, …, bp from the a-coefficients.

PLS: Partial Least Squares Regression

PLS = Projection to Latent Structures.

PCR vs. PLS:
• PCR uses PCs: each PC maximizes the variance in X (Y plays no role in choosing the directions).
• PLS uses LVs: each LV (weight vector w) maximizes the covariance of X and y; since cov(Xw, y) combines the variance of Xw, the variance of y and cor(Xw, y), PLS picks directions that are both high-variance in X and predictive of y.

Phase 1: Calculate new independent variables (T).
Sequential algorithm: the latent variables and their scores are calculated one at a time.
• Step 0: Mean-center X.
• Step 1: Calculate w. The first latent variable LV1 = w1 maximizes the covariance of X and Y; it is obtained by SVD of XᵀY:
  (XᵀY)(p×k) = W D Zᵀ, with w1 = 1st column of W.


PLS: Partial Least Squares Regression

Phase 1 (continued):
• Step 1: Calculate LV1 = w1 that maximizes the covariance of X and Y, via SVD of XᵀY: (XᵀY)(p×k) = W D Zᵀ, w1 = 1st column of W.
• Step 2: Calculate t1, the scores (projections) of X on w1:

  t1(n×1) = X(n×p) w1(p×1)

Phases 2 and 3 proceed as in PCR: MLR of y on the scores T (coefficients a1, …, aa), then calculate b0, b1, …, bp from the a-coefficients.
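A minimal Python sketch of Phase 1 for a single y-variable; in that case XᵀY is a p × 1 matrix, so its SVD reduces to normalizing Xᵀy (names are illustrative):

    import numpy as np

    def pls_first_lv(X, y):
        # Step 0: mean-center.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # Step 1: w1 from the SVD of X^T y; for one y-column this is
        # simply the normalized covariance vector X^T y.
        w1 = Xc.T @ yc
        w1 /= np.linalg.norm(w1)
        # Step 2: scores = projections of X onto w1.
        t1 = Xc @ w1
        return w1, t1

Subsequent LVs are obtained sequentially by deflating X (subtracting the part explained by t1) and repeating the same two steps.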

PLS: Optimal Number of LVs

As for PCR, calculate the cross-validation RMSE for different numbers of LVs and choose the minimum:

  RMSECV = √( Σi (yi − ŷi)² / n )

MLR, PCR, PLS Compared

On the collinear Sets A and B from before (y = b1x1 + b2x2 + ε), the latent-variable methods give stable coefficients where MLR does not:

            Set A           Set B
          b1      b2      b1      b2
   MLR   10.3   -6.92    2.96    0.28
   PCR   1.60    1.62    1.60    1.62
   PLS   1.60    1.62    1.60    1.62
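A hedged scikit-learn sketch of the PCR and PLS rows of this table, with one component each; the exact values depend on implementation details such as internal scaling, but the coefficients should come out nearly equal (b1 ≈ b2) and essentially identical for Sets A and B:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    x1 = np.array([-1.01, 3.23, 5.49, 0.23, -2.87, 3.67])
    x2 = np.array([-0.99, 3.25, 5.55, 0.21, -2.91, 3.76])  # Set A
    y  = np.array([-1.89, 10.33, 19.09, 2.19, -8.09, 11.29])
    X = np.column_stack([x1, x2])

    # PCR with 1 PC: regress y on the first score, map back via B = P A.
    pca = PCA(n_components=1)
    T = pca.fit_transform(X)
    a = LinearRegression().fit(T, y)
    print("PCR:", (pca.components_.T @ a.coef_).ravel())

    # PLS with 1 LV.
    pls = PLSRegression(n_components=1).fit(X, y)
    print("PLS:", pls.coef_.ravel())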

VALIDATION

Validation estimates the prediction error; RMSE-type statistics (RMSECV, RMSEP) are the common measure of prediction error.

Basic principle: test how well your model works with new data it has not seen yet!


A Biased Approach

Computing the prediction error on the samples the model was built on gives a biased error: those samples were also used to build the model, so the model is biased towards accurate prediction of these specific samples.

Validation: Basic Principle

Test how well your model works with new data it has not seen yet. Split the data into a training set and a test set. Several ways:
• One large test set
• Leave one out and repeat: LOO
• Leave n objects out and repeat: LnO
• …
Apply the entire modelling procedure on the test set.

Validation: Training and Test Sets

Split the full data set into a training set and a test set. Build the model (b0, …, bp) on the training set; predict ŷ for the test set and compute RMSEP.

• The test set should be representative of the training set
• Random choice is often the best
• Check for extremely unlucky divisions
• Apply the whole procedure on the test and validation sets

Remark: for the final model, use the whole data set.
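A minimal train/test sketch with scikit-learn; the random matrices are placeholders for real spectra and concentrations:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 50))   # placeholder X (e.g. spectra)
    y = rng.normal(size=30)         # placeholder y (e.g. concentration)

    # Random split: often the best way to get a representative test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    model = PLSRegression(n_components=3).fit(X_tr, y_tr)
    rmsep = np.sqrt(np.mean((y_te - model.predict(X_te).ravel()) ** 2))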

Cross-Validation

• Simplest case: Leave-One-Out (LOO, segment = 1 sample). Normally 10 to 20% is left out per segment (LnO).
• Remark: for the final model, use the whole data set.


Cross-Validation: An Example

• Split the data into a training set and a validation set
• Build a model on the training set and predict the validation samples
• Split the data again into a (new) training set and validation set, and repeat
  – Until all samples have been in the validation set once
  – Common choice: Leave-One-Out (LOO)
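A minimal LOO cross-validation sketch with scikit-learn (placeholder data); cross_val_predict guarantees that each sample is predicted by a model that never saw it:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_predict

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 40))   # placeholder data
    y = rng.normal(size=20)

    y_cv = cross_val_predict(PLSRegression(n_components=2), X, y,
                             cv=LeaveOneOut())
    rmsecv = np.sqrt(np.mean((y - y_cv.ravel()) ** 2))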

Cross-Validation: A Warning

The data: 13 × 5 = 65 NIR spectra (1102 wavelengths)
• 13 samples: different compositions of NaOH, NaOCl and Na2CO3
• 5 temperatures: each sample measured at 5 temperatures

  Composition   NaOH (wt%)   NaOCl (wt%)   Na2CO3 (wt%)   Temperature (°C)
  1             18.99        0             0              15, 21, 27, 34, 40
  2              9.15        9.99          0.15           15, 21, 27, 34, 40
  3             15.01        0             4.01           15, 21, 27, 34, 40
  4              9.34        5.96          3.97           15, 21, 27, 34, 40
  …             …            …             …              …
  13            16.02        2.01          1.00           15, 21, 27, 34, 40

The X-matrix is 65 × 1102. Because each sample appears five times (once per temperature), leaving out a single spectrum is not enough: the other four spectra of the same sample stay in the training set. Leave the whole SAMPLE out (all 5 spectra at once).
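A sketch of leave-sample-out cross-validation for this layout, assuming scikit-learn's GroupKFold and random placeholder spectra; the point is the grouping, which keeps all five temperature replicates of a sample in the same fold:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import GroupKFold, cross_val_predict

    rng = np.random.default_rng(2)
    X = rng.normal(size=(65, 1102))         # placeholder for 65 NIR spectra
    y = rng.normal(size=65)                 # placeholder, e.g. NaOH (wt%)
    groups = np.repeat(np.arange(13), 5)    # 13 samples x 5 temperatures

    # Leave SAMPLE out: the 5 spectra of a sample leave the training
    # set together, so no replicate can leak information.
    y_cv = cross_val_predict(PLSRegression(n_components=3), X, y,
                             cv=GroupKFold(n_splits=13), groups=groups)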


Selection of the Number of LVs

Through validation: choose the number of LVs that results in the model with the lowest prediction error.

The test set used to assess the final model cannot be used for this! Instead, divide the training set:
• Split off an internal "test′" set: 1) determine the number of LVs with the test′ set; 2) build the model (b0, …, bp) on the remaining training data
• Or use cross-validation within the training set

The held-out test set then yields RMSEP for the final model.

Remark: for the final model, use the whole data set.
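A sketch of this selection step with scikit-learn, scanning the number of LVs with cross-validation inside the training data only (placeholder data):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold, cross_val_predict

    rng = np.random.default_rng(3)
    X_train = rng.normal(size=(40, 100))    # training data only!
    y_train = rng.normal(size=40)

    for a in range(1, 11):
        y_cv = cross_val_predict(PLSRegression(n_components=a),
                                 X_train, y_train,
                                 cv=KFold(5, shuffle=True, random_state=0))
        print(a, np.sqrt(np.mean((y_train - y_cv.ravel()) ** 2)))
    # Pick the number of LVs with the lowest RMSECV.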

Double Cross-Validation

1) Determine the number of LVs in an inner cross-validation loop (CV2)
2) Build the model and estimate RMSEP in an outer cross-validation loop (CV1)

The outer loop splits the full data set into a training set and a validation set; the validation set is used later to assess model performance. The inner loop works within the training set only.

Remark: for the final model, use the whole data set.

Double Cross-Validation: Choosing the Number of LVs

• Apply cross-validation on the rest: split the training set into a (new) training set and a test set
• Fit models with 1, 2, 3, … LVs on each inner split and compute RMSECV for each
• Choose the number of LVs with the lowest RMSECV


• Repeat the procedure until all samples have been in the validation set once

Double Cross-Validation: Summary

In this way:
• The number of LVs is determined using samples not used to build the model
• The prediction error is also determined using samples the model has not seen before

Remark: for the final model, use the whole data set.

PLS: An Example

Raw and mean-centered data:

[Figure: left, raw NIR spectra, absorbance (a.u.) vs. wavenumber (cm⁻¹), 2000 to 14000 cm⁻¹; right, the same spectra after mean-centering]
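A nested (double) cross-validation sketch with scikit-learn, where GridSearchCV plays the inner loop (choosing the number of LVs) and cross_val_score the outer loop (estimating the prediction error); data and fold counts are placeholders:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    rng = np.random.default_rng(4)
    X = rng.normal(size=(40, 100))          # placeholder data
    y = rng.normal(size=40)

    inner = GridSearchCV(PLSRegression(),
                         {"n_components": range(1, 11)},
                         cv=KFold(5, shuffle=True, random_state=0),
                         scoring="neg_root_mean_squared_error")
    scores = cross_val_score(inner, X, y,
                             cv=KFold(5, shuffle=True, random_state=1),
                             scoring="neg_root_mean_squared_error")
    rmsep_estimate = -scores.mean()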

RMSECV vs. Number of LVs

[Figure: RMSECV values for the prediction of NaOH plotted against the number of LVs (1 to 10); the curve drops steeply and then levels off, indicating the optimal model complexity]

Regression Coefficients

[Figure: top, raw NIR spectra (absorbance (a.u.) vs. wavenumber, 3000 to 13000 cm⁻¹); bottom, the PLS regression coefficients across the same wavenumber range]


True vs. Predicted

[Figure: predicted vs. true NaOH values (roughly 8 to 20); the points fall close to the diagonal]

Why Pre-Processing?

Data artefacts such as offset, slope and scatter distort the spectra and must be corrected before modelling:

[Figure: an original spectrum and the same spectrum with offset, slope and scatter artefacts, intensity (a.u.) vs. wavelength (cm⁻¹), 500 to 2500]

Typical pre-processing steps:
• Baseline correction
• Alignment
• Scatter correction
• Noise removal
• Scaling, normalisation
• Transformation
• …

Other issues:
• Missing values
• Outliers

[Figure: a spectrum shown with offset, offset + slope, multiplicative, and offset + slope + multiplicative artefacts, intensity (a.u.) vs. wavelength (a.u.), 0 to 1600]

Pre-Processing Methods

4914 combinations, all reasonable, built from four sequential steps:

STEP 1: BASELINE (7×)
• No baseline correction
• Detrending (3×): polynomial order 2, 3, 4
• Derivatisation (2×): 1st, 2nd
• AsLS

STEP 2: SCATTER (10×)
• No scatter correction
• Scaling (4×): mean, median, max, L2 norm
• SNV
• RNV (3×): 15, 25, 35%
• MSC

STEP 3: NOISE (10×)
• No noise removal
• S-G smoothing (9×): window 5, 9, 11 pt; polynomial order 2, 3, 4

STEP 4: SCALING & TRANSFORMATION (7×)
• Mean-centering
• Autoscaling
• Range scaling
• Pareto scaling
• Poisson scaling
• Level scaling
• Log transformation

Supervised pre-processing methods (OSC, DOSC) were also included.

Pre-Processing Results

Each combination was evaluated on:
• Complexity of the model (number of LVs)
• Classification accuracy (%)

[Figure: classification accuracy (%) vs. model complexity (number of LVs) for all combinations, with raw data and the different scaling options highlighted]

J. Engel et al., TrAC 2013
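As a minimal Python sketch of two common entries from this table, SNV scatter correction followed by mean-centering (the full study compares far more combinations):

    import numpy as np

    def snv(X):
        # Standard Normal Variate: center and scale each spectrum (row).
        X = np.asarray(X, dtype=float)
        return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    def mean_center(X):
        # Column-wise mean-centering, the usual final step before PLS.
        return X - X.mean(axis=0)

    # Typical chain: per-spectrum scatter correction, then mean-center.
    # X_pp = mean_center(snv(X_raw))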


SOFTWARE

• PLS Toolbox (Eigenvector Inc.)
– www.eigenvector.com
– For use in MATLAB (or standalone!)
• XLSTAT-PLS (XLSTAT)
– www.xlstat.com
– For use in Microsoft Excel
• Package pls for R
– Free software
– http://cran.r-project.org
