Chapter 1 Linear Regression Notes (as FS2)
Note: some of this content appears in AS Statistics, and A2 Statistics – we explore it more deeply here in FS2.
GCSE Recap
• The explanatory/independent variable is
always placed on the 𝑥-axis.
The least squares regression line is the line of best fit that minimises the sum of the squares of the residuals.
You can think of a residual as the size of the error – the difference between the data point,
and the predicted value from the line of best fit. Naturally, these could be positive or
negative, so the squaring deals with this.
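$\varepsilon$ (epsilon) = residual
$\sum \varepsilon_i^2 = \varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2 + \dots + \varepsilon_n^2$
$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$
$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$
$y = a + bx$, where $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$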
$\sum x = 300$, $\sum x^2 = 22{,}000$, $\bar{x} = 60$
$\sum y = 288.6$, $\sum y^2 = 16{,}879.14$, $\bar{y} = 57.72$
$\sum xy = 18{,}238$
a) $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 22000 - \dfrac{300^2}{5} = 4000$
$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 18238 - \dfrac{300 \times 288.6}{5} = 922$
b) $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{922}{4000} = 0.2305$
$a = \bar{y} - b\bar{x} = 57.72 - 0.2305 \times 60 = 43.89$
$y = 43.89 + 0.2305x$
c) When $x = 58$: $y = 43.89 + 0.2305 \times 58 = 57.259 = 57.3$ cm (3sf)
When $x = 130$: $y = 43.89 + 0.2305 \times 130 = 73.9$ cm (3sf)
d) The prediction for 58 g is reliable: it is interpolation, as it is inside the range of the data.
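The working above can be checked with a few lines of Python. This is only a sketch: the function name `regression_from_sums` is made up for illustration, and $n = 5$ is an assumption inferred from $\sum x = 300$ and $\bar{x} = 60$.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (s_xx, s_xy, a, b) for the least squares line y = a + b*x."""
    s_xx = sum_x2 - sum_x ** 2 / n        # S_xx = sum(x^2) - (sum x)^2 / n
    s_xy = sum_xy - sum_x * sum_y / n     # S_xy = sum(xy) - (sum x)(sum y) / n
    b = s_xy / s_xx                       # gradient
    a = sum_y / n - b * sum_x / n         # intercept: a = y_bar - b * x_bar
    return s_xx, s_xy, a, b


# Summary statistics from the worked example (n = 5 is inferred, not stated).
s_xx, s_xy, a, b = regression_from_sums(
    n=5, sum_x=300, sum_y=288.6, sum_x2=22000, sum_xy=18238
)

print(s_xx, s_xy)              # S_xx and S_xy
print(round(b, 4), round(a, 2))  # gradient and intercept
print(round(a + b * 58, 1))    # prediction at x = 58
```

Working from the summary statistics, rather than the raw data, mirrors how these questions are usually set.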
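$\sum h = 45.8$, $\bar{h} = 6.543$, $\bar{f} = 8$, $\sum f^2 = 560$, $\sum fh = 422.6$
$S_{fh} = \sum fh - \dfrac{\sum f \sum h}{n} = 422.6 - \dfrac{56 \times 45.8}{7} = 56.2$
$S_{ff} = \sum f^2 - \dfrac{(\sum f)^2}{n} = 560 - \dfrac{56^2}{7} = 112$
$b = \dfrac{S_{fh}}{S_{ff}} = \dfrac{56.2}{112} = 0.502$ (3sf)
$a = \bar{h} - b\bar{f} = 6.543 - 0.50178\ldots \times 8 = 2.53$ (3sf)
$h = 2.53 + 0.502f$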
supplement given
Your Turn
A repair workshop finds it is having a problem with a pressure gauge it uses. It decides
to have it checked by a specialist firm. The following data were obtained.
a) Show that 𝑆𝑥𝑦 = 6.776 and find 𝑆𝑥𝑥.
It is thought that a linear relationship of the form 𝑦 = 𝑎 + 𝑏𝑥 could be used to describe
these data.
b) Use linear regression to find the values of 𝑎 and 𝑏 giving your answers to 3sf.
c) The gauge shows a reading of 2 bars. Use the regression equation to work out what
the correct reading should be.
Gauge reading, 𝑥 (bars) 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8
Correct reading, 𝑦 (bars) 0.96 1.33 1.75 2.14 2.58 2.97 3.38 3.75
$\sum x = 19.2$, $\sum x^2 = 52.8$, $\sum y = 18.86$, $\sum y^2 = 51.30$, $\sum xy = 52.04$
a) $S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 52.04 - \dfrac{19.2 \times 18.86}{8} = 6.776$
$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 52.8 - \dfrac{19.2^2}{8} = 6.72$
b) $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{6.776}{6.72} = 1.008$ (3dp)
$a = \bar{y} - b\bar{x} = \dfrac{18.86}{8} - 1.00833\ldots \times \dfrac{19.2}{8} = -0.0625$
$y = -0.0625 + 1.008x$
c) When $x = 2$: $y = -0.0625 + 1.008 \times 2 = 1.9535 = 1.95$ bars (3sf)
Ex 1A
Coding
Sometimes the data is coded to make it easier to manage – you can just use substitution
to return to the original variables
Eight samples of carbon steel were produced with different percentages, 𝑐 %, of carbon
in them. Each sample was heated in a furnace until it melted and the temperature, 𝑚 ℃,
at which it melted was recorded.
The results were coded such that $x = 10c$ and $y = \dfrac{m - 700}{5}$.
The coded results are shown in the table below.
a) Calculate 𝑆𝑥𝑦 and 𝑆𝑥𝑥.
b) Find the regression line of 𝑦 on 𝑥, and hence the regression line of 𝑚 on 𝑐.
c) Estimate the melting point of carbon steel which contains 0.25% carbon.
𝑦 35 28 24 16 15 12 8 6
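$\sum x = 36$, $\sum x^2 = 204$, $\sum y = 144$, $\sum xy = 478$
a) $S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 478 - \dfrac{36 \times 144}{8} = -170$
$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 204 - \dfrac{36^2}{8} = 42$
b) $y = a + bx$
$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{-170}{42} = -4.0476\ldots$
$a = \bar{y} - b\bar{x} = \dfrac{144}{8} - (-4.047619\ldots) \times \dfrac{36}{8} = 36.2142857\ldots$
$y = 36.2142857\ldots - 4.047619\ldots x$, i.e. $y = \dfrac{507}{14} - \dfrac{85}{21}x$
Substitute $x = 10c$ and $y = \dfrac{m - 700}{5}$:
$\dfrac{m - 700}{5} = \dfrac{507}{14} - \dfrac{85}{21} \times 10c$
$m - 700 = \dfrac{2535}{14} - \dfrac{4250}{21}c$
$m = 881 - 202c$ (3sf)
c) When $c = 0.25$: $m = 881 - 202 \times 0.25 = 830.5\,^\circ\mathrm{C}$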
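The decoding step in the carbon steel example can be sketched in Python. The helper `decode_line` is a made-up name for illustration; it simply rearranges $\frac{m - 700}{5} = a + b(10c)$ into $m = (700 + 5a) + 50bc$. The value $n = 8$ comes from the "eight samples" in the question.

```python
def decode_line(a, b):
    """Turn the coded line y = a + b*x into m = p + q*c,
    using the coding x = 10*c and y = (m - 700) / 5."""
    p = 700 + 5 * a   # intercept of m on c
    q = 5 * b * 10    # gradient of m on c
    return p, q


n = 8
sum_x, sum_x2, sum_y, sum_xy = 36, 204, 144, 478

s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
b = s_xy / s_xx                     # gradient of y on x
a = sum_y / n - b * sum_x / n       # intercept of y on x

p, q = decode_line(a, b)
m_estimate = p + q * 0.25           # melting point estimate at c = 0.25

print(round(p, 2), round(q, 2), round(m_estimate, 1))
```

Keeping full precision for $a$ and $b$ until the final step avoids rounding drift in the decoded line.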
Ex 1B
Residuals
We mentioned residuals earlier on, and will now examine them more carefully – a residual
can be thought of as the error or difference between the data observed, and the data
predicted.
We can have positive residuals or negative residuals – positive for points that are above
the line, and negative for below.
For the least squares regression line, the sum of the residuals is always 0 – a very
important fact!
Intuitively, this is because the line always goes through $(\bar{x}, \bar{y})$, so naturally the differences
between the data points and the average will sum to zero.
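The same fact can be shown algebraically. A short derivation, assuming the least squares line $y = a + bx$ with $a = \bar{y} - b\bar{x}$ and residuals $\varepsilon_i = y_i - (a + bx_i)$:

```latex
\begin{align*}
\sum_{i=1}^{n} \varepsilon_i
  &= \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr) \\
  &= \sum y_i - na - b \sum x_i \\
  &= n\bar{y} - na - nb\bar{x} \\
  &= n(\bar{y} - a - b\bar{x}) = 0
     \quad \text{since } a = \bar{y} - b\bar{x}.
\end{align*}
```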
$x$    $y$    $y = 0.2 + 0.8x$    $\varepsilon$
0      0.2    0.2                 0
1      1.2    1.0                 0.2
2      1.7    1.8                 −0.1
4      3.1    3.4                 −0.3
6      5.2    5.0                 0.2
7      5.8    5.8                 0
$\sum \varepsilon_i = 0 + 0.2 - 0.1 - 0.3 + 0.2 + 0 = 0$
$t$    $y$     predicted $y$    $\varepsilon$
2.1    19.2    18.8982          0.3018
3.7    27.3    25.8605          1.4395
4.8    26.9    30.6470          −3.7470
The point (4.8, 26.9) does not follow the pattern of the data, so it is valid to remove it.
c) $y = 10.669 + 4.357t$ (3dp)
d) $t = 4.8$: $y = \ldots$ (1dp)
Your Turn
The table below shows the relationship between the temperature, 𝑡 ℃, and the sales of
ice cream, 𝑠, on five days in June.
The equation of the regression line of 𝑠 on 𝑡 is 𝑠 = −17.154 + 1.9693𝑡
a) Calculate the residuals for the given regression line and hence find the value of 𝑝.
Temp, $t$ (℃)    Sales, $s$ (100s)    $-17.154 + 1.9693t$    $\varepsilon$
15               12                   12.3855                −0.3855
16               15                   14.3548                0.6452
18               17.5                 18.2934                −0.7934
19               $p$                  20.2627                $p - 20.2627$
21               24                   24.2013                −0.2013
The sum of the residuals is 0:
$-0.3855 + 0.6452 - 0.7934 + p - 20.2627 - 0.2013 = 0$
$p = 20.9977 = 21.0$ (3sf)
The residuals appear randomly scattered about zero, so a linear regression model is suitable.
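Finding $p$ relies only on the residuals summing to zero. A minimal Python sketch of that step, using the four known residuals from the working above (the variable names are illustrative only):

```python
# Residuals already found at t = 15, 16, 18 and 21.
known_residuals = [-0.3855, 0.6452, -0.7934, -0.2013]

# Value predicted by s = -17.154 + 1.9693t at t = 19.
predicted_at_19 = -17.154 + 1.9693 * 19

# All five residuals sum to zero, so the missing one balances the rest.
residual_at_19 = -sum(known_residuals)

# residual = observed - predicted, so observed = predicted + residual.
p = predicted_at_19 + residual_at_19

print(round(predicted_at_19, 4), round(residual_at_19, 4), round(p, 4))
```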
Residual Sum of Squares (RSS)
We want to measure how well a given set of data fits a linear regression model – naturally,
we’d want the residuals to be small for a good fit, and we want to take all the residuals into
account…
… but as the sum of the residuals is always 0, adding them up to get a total isn’t very useful
– instead, we square them first and then add them to deal with the mix of positives and
negatives. This is called the Residual Sum of Squares, RSS.
The units of the RSS are the same as the units for $y^2$, as it is the sum of the $\varepsilon^2$, so you should
only compare things measured in the same units. The lower the RSS, the better the data
fits a linear regression model.
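To see the definition in action, here is a sketch that computes the RSS directly from raw data by squaring and adding the residuals. It reuses the pressure gauge readings from the earlier Your Turn; the variable names are illustrative.

```python
# Pressure gauge data from the earlier Your Turn (bars).
x = [1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8]
y = [0.96, 1.33, 1.75, 2.14, 2.58, 2.97, 3.38, 3.75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Equivalent forms of S_xx and S_xy, written as sums of deviations.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx            # gradient
a = y_bar - b * x_bar      # intercept

# Residual = observed y minus predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

rss = sum(e ** 2 for e in residuals)   # residual sum of squares

print(sum(residuals))   # very close to 0, up to floating-point error
print(rss)              # small, so the linear model fits these data well
```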
Time saver: $\text{RSS} = S_{yy} - \dfrac{(S_{xy})^2}{S_{xx}}$
$\sum x = 53.7$, $\sum x^2 = 589.09$, $\sum y = 42.9$, $S_{xx} = 12.352$, $S_{yy} = 3.668$
$\text{RSS} = 3.668 - \dfrac{6.704^2}{12.352} = 0.0294$ (3sf)
This suggests that this August is more likely to have a linear fit between the
number of hours of sunshine and the sales of ice lollies.
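For reference, the time-saver formula follows from the definition of the RSS. A sketch of the derivation, assuming the equivalent forms $S_{xx} = \sum (x_i - \bar{x})^2$, $S_{yy} = \sum (y_i - \bar{y})^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$; the last line uses the correlation coefficient $r$, which is Ch2 material:

```latex
\begin{align*}
\text{RSS} &= \sum (y_i - a - bx_i)^2 \\
  &= \sum \bigl((y_i - \bar{y}) - b(x_i - \bar{x})\bigr)^2
     && \text{using } a = \bar{y} - b\bar{x} \\
  &= S_{yy} - 2bS_{xy} + b^2 S_{xx} \\
  &= S_{yy} - \frac{(S_{xy})^2}{S_{xx}}
     && \text{using } b = \frac{S_{xy}}{S_{xx}} \\
  &= S_{yy}(1 - r^2)
     && \text{where } r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}
\end{align*}
```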
Ex 1C
Exam Questions AS 2020
You can skip (d) until Ch2 has been completed, or just use the formula for RSS from before
b) $a = -17.72$, $b = 21.54$
$w = -17.72 + 21.54d$
d) $\text{RSS} = S_{yy}(1 - r^2)$, with $S_{ww} = \sum w^2 - \dfrac{(\sum w)^2}{n}$
$S_{ww} = 45178.68 - \dfrac{643.6^2}{18} = 22166.40444$
$\text{RSS} = S_{ww}(1 - 0.987^2) = 572.58\ldots$
e) The weight will correspond to the wire's thickness/area, and area is proportional to $d^2$, not $d$.
f) $b = \dfrac{S_{uw}}{S_{uu}} = \dfrac{5721.625}{1482.619} = 3.859$ (3dp)
$a = \bar{w} - b\bar{u} = 35.7555\ldots - 3.859\ldots \times \dfrac{157.57}{18} = 1.973$ (3dp)
$w = 1.973 + 3.859u$
g) $\text{RSS} = S_{ww} - \dfrac{(S_{uw})^2}{S_{uu}} = 22166.404 - \dfrac{5721.625^2}{1482.619} = 85.88\ldots = 85.9\ \text{g}^2$
h) Robert's is better as the RSS is much lower.
i) $d = 3$, $u = 3^2 = 9$
$w = 1.973 + 3.859 \times 9 = 36.7$ g (3sf)
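Parts (f) to (i) can be checked numerically. This sketch takes the summary values as read from the working above ($S_{uw} = 5721.625$, $S_{uu} = 1482.619$, $S_{ww} = 22166.40444$, $\sum w = 643.6$, $\sum u = 157.57$, $n = 18$); these readings are assumptions, and the variable names are illustrative.

```python
n = 18
sum_w, sum_u = 643.6, 157.57          # totals read from the working
s_uw, s_uu, s_ww = 5721.625, 1482.619, 22166.40444

# Regression of w on u, where u = d^2.
b = s_uw / s_uu
a = sum_w / n - b * sum_u / n

# Time-saver formula for the residual sum of squares.
rss = s_ww - s_uw ** 2 / s_uu

# Prediction for a wire of diameter d = 3, so u = 3^2 = 9.
d = 3
w_pred = a + b * d ** 2

print(round(b, 3), round(a, 3))   # gradient and intercept
print(round(rss, 1))              # RSS in g^2
print(round(w_pred, 1))           # predicted weight in g
```

The RSS here is far smaller than the value from the model in $d$, which is the comparison made in part (h).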
A2 2020
i) Feasible, as the residuals appear to sum to zero.
ii) The residuals appear non-random; this suggests it is probably not linear.
i) Not feasible, as all the residuals are positive, so they do not sum to zero.
i) Feasible.
ii) Linear, as the points appear randomly scattered.
A2 2022
a) $y = 45.5t + 2080$
When $t = 20$: $y = 45.5 \times 20 + 2080 = 2990$ kg/hectare
b) The difference between the observed $y$ value and the $y$ value predicted by the regression line.
You can skip (c) until Ch2 has been completed
c) $\text{RSS} = S_{yy}(1 - r^2)$
$1666567 = 1774155(1 - r^2)$
$1 - r^2 = 0.93935\ldots$
$r^2 = 1 - 0.9393\ldots$
$r = 0.246$ (3sf)
d i) Because $r$ is close to 0, suggesting weak correlation.