Chapter 1 Linear Regression Notes (as FS2)

This document covers the concept of least squares linear regression, including how to derive the regression line equation and the importance of residuals. It provides examples of calculating regression lines from experimental data, as well as the implications of interpolation and extrapolation. Additionally, it discusses coding data for easier management and the significance of residuals in regression analysis.

Uploaded by

isfakfx
Copyright
© © All Rights Reserved

FS2: Chapter 1, Linear Regression

Ex 1A Least Squares Linear Regression


Ex 1B Coding
Ex 1C Residuals
Exam Questions

At GCSE, your teachers told you to draw a line of best fit ‘by eye’.

They probably mentioned that you should aim to have the same number of points above and below the line, and maybe they told you that it should go through (x̄, ȳ).

In this chapter we’ll come up with the equation for the ‘least squares linear regression line’ of 𝑦 on 𝑥.

Note: some of this content appears in AS Statistics, and A2 Statistics – we explore it more deeply here in FS2.
GCSE Recap
• The explanatory/independent variable is always placed on the 𝑥-axis.

• The response/dependent variable is always placed on the 𝑦-axis.

• We can make reliable predictions for the dependent variable from data that is inside the range for 𝑥 – this is called interpolation.

• Making predictions outside the given range of data is unreliable – why? This is called extrapolation.

The regression line would be 𝑦 on 𝑥 and would be of the form 𝒚 = 𝒂 + 𝒃𝒙


Least Squares Regression Line
There are other types of regression line, but we study the least squares regression line.

This is where we try to minimise the sum of the squares of the residuals.

You can think of a residual as the size of the error – the difference between the data point,
and the predicted value from the line of best fit. Naturally, these could be positive or
negative, so the squaring deals with this.
𝜀 (epsilon) denotes a residual.
Σ𝜀ᵢ² = 𝜀₁² + 𝜀₂² + 𝜀₃² + 𝜀₄² + … + 𝜀ₙ²

If we minimise this sum, then we have found a line that is as close as possible to all the points. This is called the regression line.

The word regression is used because the data ‘regress’ (meaning go back) towards the mean.
Luckily, there’s a formula that does
all this for us! I recommend using
the formula booklet for this chapter.

Sxx = Σx² − (Σx)²/n
Sxy = Σxy − (Σx)(Σy)/n

y = a + bx, where b = Sxy/Sxx and a = ȳ − bx̄

REGRESSION LINES - LEAST SQUARES PROOF (youtube.com)


Note: this proof requires Year 2 differentiation. Not compulsory.
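As a rough numerical illustration of what ‘least squares’ means, the sketch below fits a line with the booklet formulas and then checks that nudging a or b only makes the sum of squared residuals bigger. The four data points and the function names are made up for illustration; they are not from these notes.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y = a + bx using b = Sxy/Sxx and a = y_bar - b*x_bar."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx
    a = sum_y / n - b * sum_x / n
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of (observed y - predicted y)^2 for the line y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data only (an assumption, not taken from the notes)
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Try every small nudge of the intercept and gradient away from the fitted line
nudged = [
    sum_squared_residuals(xs, ys, a + da, b + db)
    for da in (-0.2, 0, 0.2)
    for db in (-0.1, 0, 0.1)
    if (da, db) != (0, 0)
]

print("a =", round(a, 3), "b =", round(b, 3), "sum of squares =", round(best, 3))
print("fitted line beats every nudged line:", all(best < other for other in nudged))
```

This is only a spot check on a handful of nearby lines; the general proof is the one referred to in the video link.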
Examples
The results from an experiment in which different masses were placed on a spring and the resulting length of the spring measured, are shown below.
a) Calculate 𝑆𝑥𝑥 and 𝑆𝑥𝑦.
b) Calculate the regression line of 𝑦 on 𝑥 (find b, then a).
c) Use your equation to predict the length of the spring when the applied mass is 58 kg and when the applied mass is 130 kg (find y for x = 58 and for x = 130).
d) Comment on the reliability of your predictions.

Mass, 𝑥 (kg)      20    40    60    80    100
Length, 𝑦 (cm)    48    55.1  56.3  61.2  68

If you’ve studied Stats Year 2, Regression and Correlation, you’ll have already used your calculator on raw data to calculate the PMCC, 𝑟, and 𝑎 and 𝑏. Here we will focus on using the summary statistics, rather than the calculator.
Σx = 300     Σx² = 22,000      x̄ = 60
Σy = 288.6   Σy² = 16,879.14   ȳ = 57.72
Σxy = 18,238

a) Sxx = Σx² − (Σx)²/n = 22000 − 300²/5 = 4000
   Sxy = Σxy − (Σx)(Σy)/n = 18238 − (300 × 288.6)/5 = 922

b) b = Sxy/Sxx = 922/4000 = 0.2305
   a = ȳ − bx̄ = 57.72 − 0.2305 × 60 = 43.89
   y = 43.89 + 0.2305x

c) When x = 58: y = 43.89 + 0.2305 × 58 = 57.259 = 57.3 cm (3sf)
   When x = 130: y = 43.89 + 0.2305 × 130 = 73.855 = 73.9 cm (3sf)

d) The prediction for 58 kg is reliable, as it is inside the range of data (interpolation), but the prediction for 130 kg is unreliable, as it is outside the range of data (extrapolation).
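The spring calculation can be checked from the summary statistics alone. The sketch below is one possible way to do it; the function names and the interpolation/extrapolation label are illustrative choices, with the data range 20 to 100 kg taken from the table.

```python
def line_from_summaries(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (Sxx, Sxy, a, b) from summary statistics, for the line y = a + bx."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_xx
    a = sum_y / n - b * (sum_x / n)
    return s_xx, s_xy, a, b


# Summary statistics for the spring data (n = 5 masses)
s_xx, s_xy, a, b = line_from_summaries(5, 300, 288.6, 22000, 18238)


def predict(x, x_min=20, x_max=100):
    """Predicted length, plus a label saying whether x lies inside the data range."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind


print("Sxx =", round(s_xx, 3), "Sxy =", round(s_xy, 3))
print("y =", round(a, 2), "+", round(b, 4), "x")
for mass in (58, 130):
    length, kind = predict(mass)
    print(f"x = {mass} kg: y = {length:.1f} cm ({kind})")
```

The label is only a reminder of part (d): a prediction flagged as extrapolation should not be trusted just because the formula produces a number.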
A scientist working in agricultural research believes that there is a linear relationship between the amount of a certain food supplement given to hens and the hardness of the shells of the eggs they lay. As an experiment, controlled quantities of the supplement were added to the hens’ normal diet for a period of two weeks and the hardness of the shells at the end of this period was then measured on a scale from 1 to 10, with the below results.
a) Find the equation of the regression line of ℎ on 𝑓 (this is ‘y on x’, with ℎ in place of y and 𝑓 in place of x).
b) Interpret what the values of 𝑎 and 𝑏 tell you.

Food supplement, 𝑓 (g/day)   2     4     6     8     10    12    14
Hardness of shells, ℎ         3.2   5.2   5.5   6.4   7.2   8.5   9.8
Σf = 56     Σf² = 560     f̄ = 8
Σh = 45.8   Σfh = 422.6   h̄ = 6.543

a) h = a + bf. Find b, then a.
   Sfh = Σfh − (Σf)(Σh)/n = 422.6 − (56 × 45.8)/7 = 56.2
   Sff = Σf² − (Σf)²/n = 560 − 56²/7 = 112
   b = Sfh/Sff = 56.2/112 = 0.50178… = 0.502 (3sf)
   a = h̄ − bf̄ = 6.543… − 0.50178… × 8 = 2.53 (3sf)
   h = 2.53 + 0.502f

b) a tells us the shell hardness with no food supplement given.
   b tells us how much the shell hardness increases by for an extra 1 g of food supplement given per day.
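The summary statistics quoted for the hens example can themselves be rebuilt from the table. The sketch below does that and then applies the same formulas; the variable names are illustrative.

```python
# Raw data from the table: supplement f (g/day) and shell hardness h
f = [2, 4, 6, 8, 10, 12, 14]
h = [3.2, 5.2, 5.5, 6.4, 7.2, 8.5, 9.8]
n = len(f)

# Summary statistics
sum_f = sum(f)
sum_h = sum(h)
sum_f2 = sum(v * v for v in f)
sum_fh = sum(x * y for x, y in zip(f, h))

# Regression of h on f: h = a + b*f
s_ff = sum_f2 - sum_f ** 2 / n
s_fh = sum_fh - sum_f * sum_h / n
b = s_fh / s_ff
a = sum_h / n - b * sum_f / n

print("sums:", sum_f, round(sum_h, 1), sum_f2, round(sum_fh, 1))
print("Sff =", round(s_ff, 3), "Sfh =", round(s_fh, 3))
print(f"h = {a:.3g} + {b:.3g} f")
```

Keeping b unrounded until the final line mirrors the written working, where 0.50178… rather than 0.502 is used to find a.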
Your Turn
A repair workshop finds it is having a problem with a pressure gauge it uses. It decides to have it checked by a specialist firm. The following data were obtained.
a) Show that 𝑆𝑥𝑦 = 6.776 and find 𝑆𝑥𝑥.
It is thought that a linear relationship of the form 𝑦 = 𝑎 + 𝑏𝑥 could be used to describe these data.
b) Use linear regression to find the values of 𝑎 and 𝑏 giving your answers to 3sf.
c) The gauge shows a reading of 2 bars. Use the regression equation to work out what the correct reading should be (x = 2).

Gauge reading, 𝑥 (bars)     1.0    1.4    1.8    2.2    2.6    3.0    3.4    3.8
Correct reading, 𝑦 (bars)   0.96   1.33   1.75   2.14   2.58   2.97   3.38   3.75

Σx = 19.2    Σx² = 52.8
Σy = 18.86   Σy² = 51.30
Σxy = 52.04

a) Sxy = Σxy − (Σx)(Σy)/n = 52.04 − (19.2 × 18.86)/8 = 6.776
   Sxx = Σx² − (Σx)²/n = 52.8 − 19.2²/8 = 6.72

b) b = Sxy/Sxx = 6.776/6.72 = 1.00833… = 1.008 (3dp)
   a = ȳ − bx̄ = Σy/n − b(Σx/n) = 18.86/8 − 1.00833… × 19.2/8 = −0.0625
   y = −0.0625 + 1.008x

c) x = 2: y = −0.0625 + 1.008 × 2 = 1.9535 = 1.95 bars (3sf)

Ex 1A
Coding
Sometimes the data is coded to make it easier to manage. You can just use substitution to return to the original variables.

Eight samples of carbon steel were produced with different percentages, 𝑐 %, of carbon in them. Each sample was heated in a furnace until it melted and the temperature, 𝑚 ℃, at which it melted was recorded.
The results were coded such that 𝑥 = 10𝑐 and 𝑦 = (𝑚 − 700)/5.
The coded results are shown in the table below.
a) Calculate 𝑆𝑥𝑦 and 𝑆𝑥𝑥.
b) Find the regression line of 𝑦 on 𝑥, and hence the regression line of 𝑚 on 𝑐.
c) Estimate the melting point of carbon steel which contains 0.25% carbon.

𝑥    1    2    3    4    5    6    7    8
𝑦    35   28   24   16   15   12   8    6

Σx = 36    Σx² = 204
Σy = 144   Σxy = 478

a) Sxy = Σxy − (Σx)(Σy)/n = 478 − (36 × 144)/8 = −170
   Sxx = Σx² − (Σx)²/n = 204 − 36²/8 = 42

b) y = a + bx
   b = Sxy/Sxx = −170/42 = −85/21 = −4.047619…
   a = ȳ − bx̄ = 144/8 − (−4.047619…) × 36/8 = 507/14 = 36.2142857…
   y = 36.2142857… − 4.047619…x

   Substitute x = 10c and y = (m − 700)/5:
   (m − 700)/5 = 507/14 − (85/21)(10c)
   m − 700 = 2535/14 − (4250/21)c
   m = 881 − 202c (3sf)

c) c = 0.25, so x = 10 × 0.25 = 2.5
   Using the coded line: y = 507/14 − (85/21) × 2.5 = 548/21, so (m − 700)/5 = 548/21 and m = 830.476… = 830.5 °C
   Or directly: m = 881 − 202 × 0.25 = 830.5 °C
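The decoding step is just substitution, so it can be written as two lines of arithmetic. The sketch below fits the coded line and then converts it to a line for m on c using x = 10c and y = (m − 700)/5; the variable and function names are illustrative.

```python
# Coded data from the table
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [35, 28, 24, 16, 15, 12, 8, 6]
n = len(xs)

# Regression of the coded variables: y = a + b*x
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
b = s_xy / s_xx
a = sum(ys) / n - b * sum(xs) / n

# Undo the coding: (m - 700)/5 = a + b*(10c)  =>  m = (700 + 5a) + (50b)c
m_intercept = 700 + 5 * a
m_slope = 50 * b


def melting_point(c):
    """Estimated melting point in degrees C for carbon content c (per cent)."""
    return m_intercept + m_slope * c


print("Sxy =", s_xy, "Sxx =", s_xx)
print(f"m = {m_intercept:.1f} + ({m_slope:.1f})c")
print(f"c = 0.25: m = {melting_point(0.25):.1f}")
```

Working with the unrounded a and b and rounding only at the end gives the same 830.5 °C as the written solution.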
Ex 1B
Residuals
We mentioned residuals earlier on, and will now examine them more carefully – a residual
can be thought of as the error or difference between the data observed, and the data
predicted.

We can have positive residuals or negative residuals – positive for points that are above
the line, and negative for below.

For the least squares regression line, the sum of the residuals is always 0 – a very
important fact!

Intuitively, this is because the line always goes through (x̄, ȳ), so naturally the differences between the data points and the average will sum to zero.

For a point (𝑥ᵢ, 𝑦ᵢ) the residual is:

𝜀 = 𝑦ᵢ − (𝑎 + 𝑏𝑥ᵢ)

Here 𝑦ᵢ is the observed y, and 𝑎 + 𝑏𝑥ᵢ is the predicted y from the line y = a + bx.
𝑥    1     2     4     6     7
𝑦    1.2   1.7   3.1   5.2   5.8

The regression line for this data was found to be 𝑦 = 0.2 + 0.8𝑥

𝒙    𝒚 (observed value)    𝒚 = 0.2 + 0.8𝒙 (predicted value)    𝜺
1    1.2                   1.0                                  0.2
2    1.7                   1.8                                  −0.1
4    3.1                   3.4                                  −0.3
6    5.2                   5.0                                  0.2
7    5.8                   5.8                                  0

Σ𝜀ᵢ = 0.2 − 0.1 − 0.3 + 0.2 + 0 = 0
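The residual table can be reproduced in a few lines. The sketch below uses the data and the line y = 0.2 + 0.8x quoted above; rounding is only there to hide tiny floating-point errors.

```python
xs = [1, 2, 4, 6, 7]
ys = [1.2, 1.7, 3.1, 5.2, 5.8]
a, b = 0.2, 0.8  # regression line y = 0.2 + 0.8x from the notes

predicted = [a + b * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]  # observed minus predicted

for x, y, p, e in zip(xs, ys, predicted, residuals):
    print(f"x = {x}  observed = {y}  predicted = {p:.1f}  residual = {e:+.1f}")

# For a least squares line the residuals sum to zero (up to rounding error)
total = sum(residuals)
print("sum of residuals is zero:", abs(total) < 1e-9)
```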

Notice how the residual plot shows the residuals randomly scattered about 0.
This is because we expect some to be above the line, and some to be below.

If we saw a trend in the residual plot, we should question whether a linear model is appropriate. This could look like an increasing, decreasing, or curved pattern.
Example
The table below shows the time taken, 𝑡 minutes, to produce 𝑦 litres of paint in a factory.
The equation of the regression line of 𝑦 on 𝑡 is given as 𝑦 = 9.7603 + 4.3514𝑡.
One of the 𝑦 values was incorrectly recorded.
a) Calculate the residuals and write down the outlier.
b) Comment on the validity of ignoring this outlier in your analysis.
c) Ignoring the outlier, produce a new model.
d) Use the new model to estimate the amount of paint that is produced in 4.8 minutes.

𝑡    2.1    3.7    4.8    6.1    7.2
𝑦    19.2   27.3   26.9   38.5   40.9

a)
𝑡      𝑦       predicted 𝑦 = 9.7603 + 4.3514𝑡    𝜀
2.1    19.2    18.8982                           0.3018
3.7    27.3    25.8605                           1.4395
4.8    26.9    30.6470                           −3.7470
6.1    38.5    36.3038                           2.1962
7.2    40.9    41.0904                           −0.1904

The value 26.9 is the outlier, as its residual is the largest in size.

b) The residual shows it doesn’t follow the pattern of the data, so it is valid to remove it.

c) y = 10.669 + 4.357t (3dp)

d) t = 4.8: y = 10.669 + 4.357 × 4.8 = 31.6 litres (1dp)
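Parts (a), (c) and (d) of the paint example can be sketched in code as follows. Picking the outlier as ‘the point with the largest residual in size’ is a simple rule that matches this example; it is an assumption of the sketch, not a general test for outliers, and the helper name `fit` is illustrative.

```python
def fit(ts, ys):
    """Least squares line y = a + b*t via b = Sty/Stt and a = y_bar - b*t_bar."""
    n = len(ts)
    s_tt = sum(t * t for t in ts) - sum(ts) ** 2 / n
    s_ty = sum(t * y for t, y in zip(ts, ys)) - sum(ts) * sum(ys) / n
    b = s_ty / s_tt
    a = sum(ys) / n - b * sum(ts) / n
    return a, b


ts = [2.1, 3.7, 4.8, 6.1, 7.2]
ys = [19.2, 27.3, 26.9, 38.5, 40.9]

# (a) residuals for the given line y = 9.7603 + 4.3514t
a0, b0 = 9.7603, 4.3514
residuals = [y - (a0 + b0 * t) for t, y in zip(ts, ys)]
worst = max(range(len(ts)), key=lambda i: abs(residuals[i]))
print("residuals:", [round(e, 4) for e in residuals])
print("outlier:", (ts[worst], ys[worst]))

# (c) refit without the outlier
kept_t = [t for i, t in enumerate(ts) if i != worst]
kept_y = [y for i, y in enumerate(ys) if i != worst]
a1, b1 = fit(kept_t, kept_y)
print(f"new model: y = {a1:.3f} + {b1:.3f}t")

# (d) estimate for t = 4.8 minutes
estimate = a1 + b1 * 4.8
print(f"estimate at t = 4.8: {estimate:.1f} litres")
```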
Your Turn

The table below shows the relationship between the temperature, 𝑡 ℃, and the sales of ice cream, 𝑠, on five days in June.
The equation of the regression line of 𝑠 on 𝑡 is 𝑠 = −17.154 + 1.9693𝑡
a) Calculate the residuals for the given regression line and hence find the value of 𝑝.
b) By considering the residuals, comment on whether a linear regression model is suitable for these data.

Temp, 𝑡 (℃)        15     16     18     19    21
Sales, 𝑠 (100s)    12.0   15.0   17.5   𝑝     24.0

a)
Temp, 𝑡 (℃)   Sales, 𝑠 (100s)   predicted 𝑠 = −17.154 + 1.9693𝑡   𝜀
15             12                12.3855                           −0.3855
16             15                14.3548                           0.6452
18             17.5              18.2934                           −0.7934
19             𝑝                 20.2627                           𝑝 − 20.2627
21             24                24.2013                           −0.2013

The sum of the residuals is 0:
−0.3855 + 0.6452 − 0.7934 + (p − 20.2627) − 0.2013 = 0
p = 20.9977 = 21.0 (3sf)

b) The residuals appear randomly scattered about 0, so a linear regression model is suitable.
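Finding p relies only on the fact that the residuals of a least squares line sum to zero. A minimal sketch of that step, using `None` as a stand-in for the unknown sales figure (an illustrative choice):

```python
a, b = -17.154, 1.9693  # regression line s = a + b*t from the question
temps = [15, 16, 18, 19, 21]
sales = [12.0, 15.0, 17.5, None, 24.0]  # None marks the unknown value p

predicted = [a + b * t for t in temps]

# Sum of the residuals we can already calculate
known_total = sum(s - pred for s, pred in zip(sales, predicted) if s is not None)

# All five residuals sum to zero, so (p - predicted) + known_total = 0
missing = sales.index(None)
p = predicted[missing] - known_total

print(f"known residuals sum to {known_total:.4f}")
print(f"p = {p:.4f}, i.e. {p:.1f} to 3sf")
```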
Residual Sum of Squares (RSS)
We want to measure how well a given set of data fits a linear regression model – naturally,
we’d want the residuals to be small for a good fit, and we want to take all the residuals into
account…

… but as the sum of the residuals is always 0, adding them up to get a total isn’t very useful
– instead, we square them first and then add them to deal with the mix of positives and
negatives. This is called the Residual Sum of Squares, RSS.

The units of the RSS are the same as the units for 𝑦², as it is the sum of 𝜀², so you should only compare things measured in the same units. The lower the RSS, the better the data fits a linear regression model.

There’s a shortcut using a formula, also in the formula booklet!

RSS = Syy − (Sxy)²/Sxx = Syy(1 − r²)

More on the second part later in FS2!


Example
The data below shows the sales, in 100s, 𝑦, of ice lollies at a riverside café and the number of hours of sunshine, 𝑥, on five random days during August.
a) Calculate the residual sum of squares (RSS).
The RSS for five random days in December is 0.0562.
b) State, with a reason, which month is more likely to have a linear fit between the number of hours of sunshine and the sales of ice lollies.

𝑥    8     10    11.5   12    12.2
𝑦    7.1   8.2   8.9    9.2   9.5

Σx = 53.7   Σx² = 589.09
Σy = 42.9   Σy² = 371.75
Σxy = 467.45

Time saver: 𝑆𝑦𝑦 = 3.668, 𝑆𝑥𝑥 = 12.352, 𝑆𝑥𝑦 = 6.704

a) RSS = Syy − (Sxy)²/Sxx = 3.668 − 6.704²/12.352 = 0.0294 (3sf)

b) As 0.0294 < 0.0562, this suggests that August is more likely to have a linear fit between the number of hours of sunshine and the sales of ice lollies.
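The shortcut formula can be checked against the definition of RSS as the sum of the squared residuals. The sketch below computes it both ways for the August data; the variable names are illustrative.

```python
xs = [8, 10, 11.5, 12, 12.2]
ys = [7.1, 8.2, 8.9, 9.2, 9.5]
n = len(xs)

s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

# Shortcut: RSS = Syy - (Sxy)^2 / Sxx
rss_formula = s_yy - s_xy ** 2 / s_xx

# Definition: fit the line, then add up the squared residuals
b = s_xy / s_xx
a = sum(ys) / n - b * sum(xs) / n
rss_direct = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print("Sxx =", round(s_xx, 3), "Syy =", round(s_yy, 3), "Sxy =", round(s_xy, 3))
print(f"RSS by formula = {rss_formula:.4f}, RSS from residuals = {rss_direct:.4f}")
print("lower than December's 0.0562:", rss_formula < 0.0562)
```

The comparison with December is fair here because both RSS values are in the same units (sales in 100s, squared).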
Ex 1C
Exam Questions AS 2020

You can skip (d) until Ch2 has been completed, or just use the formula for RSS from before.

b) a = −17.72, b = 21.54
   w = −17.72 + 21.54d

c) Not appropriate, as the points appear on a curved line, so our line overestimates the values in the middle.

d) Here d takes the place of x and w takes the place of y.
   RSS = Syy(1 − r²)
   Sww = Σw² − (Σw)²/n = 45178.68 − 643.6²/18 = 22166.40444
   RSS = Sww(1 − 0.987²) = 572.58…

e) The weight will correspond to the wire’s thickness/area, and area is proportional to d², not d.

f) b = Suw/Suu and a = w̄ − bū
   w̄ = 643.6/18 = 35.7555…
   b = Suw/Suu = 5721.625/1482.619 = 3.859 (3dp)
   a = 35.7555… − 3.859… × 157.57/18 = 1.973 (3dp)
   w = 1.973 + 3.859u

g) RSS = Sww − (Suw)²/Suu = 22166.404… − 5721.625²/1482.619 = 85.88… = 85.9 g²

h) Robert’s is better as the RSS is much lower.

i) d = 3, u = 3² = 9
   w = 1.973 + 3.859 × 9 = 36.7 g (3sf)
A2 2020

i) Feasible, as the residuals appear to sum to zero (some above, some below).
ii) The residuals appear non-random; this suggests it is probably not linear.

i) Not feasible, as all residuals are positive so do not sum to zero.

i) Feasible, as some above, some below.
ii) Linear, as points appear randomly scattered.
A2 2022

You can skip (c) until Ch2 has been completed.

a) y = 45.5t + 2080 (b = 45.5, a = 2080)
   t = 20: y = 45.5 × 20 + 2080 = 2990 kg/hectare

b) The difference between the observed y value and the predicted y value.

c) RSS = Syy(1 − r²)
   1666567 = 1774155(1 − r²)
   1 − r² = 0.93935…
   r² = 1 − 0.9393…
   r = 0.246 (3sf)

d) i) Because r is close to 0, suggesting weak correlation.
   ii) After around 20%, the residuals do not appear to be randomly scattered.

e) No, because the units of the two RSS calculations are different, so we cannot compare them in this way.