2018 Math1024
1 Introduction to Statistics 9
1.1 Lecture 1: What is statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.1 Early and modern definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.2 Uncertainty: the main obstacle to decision making . . . . . . . . . . . . . . . 10
1.1.3 Statistics tames uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.4 Why should I study statistics as part of my degree? . . . . . . . . . . . . . . 10
1.1.5 What’s in this module? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.6 Take home points: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Lecture 2: Guessing ages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2 Experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Revealing the truth and error calculation . . . . . . . . . . . . . . . . . . . . 12
1.2.4 Protecting your privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.5 An example data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.6 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Lecture 3: Basic statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.2 How do I obtain data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.3 Summarising data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.4 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Lecture 4: Data visualisation with R . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Get into R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.3 Keeping and saving commands in a script file . . . . . . . . . . . . . . . . . . 19
1.4.4 How do I get my data into R? . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.5 Working with data in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.6 Summary statistics from R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.7 Graphical exploration using R . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.8 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Lecture 5: Help with report writing . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.2 Guidance on data inspection, analysis and conclusions . . . . . . . . . . . . . 22
1.5.3 Some advice on writing your report . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.4 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Introduction to Probability 25
2.1 Lecture 6: Definitions of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 Why should we study probability? . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.2 Two types of probabilities: subjective and objective . . . . . . . . . . . . . . 25
2.1.3 Union, intersection, mutually exclusive and complementary events . . . . . . 26
2.1.4 Axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.5 Application to an experiment with equally likely outcomes . . . . . . . . . . . 29
2.1.6 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Lecture 7: Using combinatorics to find probability . . . . . . . . . . . . . . . . . . . 29
2.2.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2 Multiplication rule of counting . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Calculation of probabilities of events under sampling ‘at random’ . . . . . . . 31
2.2.4 A general ‘urn problem’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.5 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Lecture 8: Conditional probability and the Bayes Theorem . . . . . . . . . . . . . . 33
2.3.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Definition of conditional probability . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.3 Multiplication rule of conditional probability . . . . . . . . . . . . . . . . . . 34
2.3.4 Total probability formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.5 The Bayes theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.6 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Lecture 9: Independent events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.3 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Lecture 10: Fun probability calculation for independent events . . . . . . . . . . . . 41
2.5.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2 System reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.3 The randomised response technique . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.4 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Statistical Inference 83
4.1 Lecture 21: Foundations of statistical inference . . . . . . . . . . . . . . . . . . . . . 83
4.1.1 Statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.2 A fully specified model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.3 A parametric statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.1.4 A nonparametric statistical model . . . . . . . . . . . . . . . . . . . . . . . . 85
4.1.5 Should we prefer parametric or nonparametric and why? . . . . . . . . . . . . 86
Introduction to Statistics
• With the rapid industrialization of Europe in the first half of the 19th century, statistics
became established as a discipline. This led to the formation of the Royal Statistical Society,
the premier professional association of statisticians in the UK and also world-wide, in 1834.
• During this 19th century growth period, statistics acquired a new meaning as the interpretation of data, or methods of extracting information from data for decision making. Thus statistics has its modern meaning: the methods for collecting, analysing and interpreting data in order to make decisions.
• Indeed, the Oxford English Dictionary defines statistics as: “The practice or science of col-
lecting and analysing numerical data in large quantities, especially for the purpose of inferring
proportions in a whole from those in a representative sample.”
• Note that the word ‘state’ has gone from its definition. Instead, statistical methods are now
essential for everyone wanting to answer questions using data.
For example, will it rain tomorrow? Does eating red meat make us live longer? Is smoking
harmful during pregnancy? Is the new shampoo better than the old? Will the UK economy get
better after Brexit? At a more personal level: What degree classification will I get at graduation?
How long will I live for? What prospects do I have in the career I have chosen? How do I invest
my money to maximise the return? Will the stock market crash tomorrow?
– For this we will use the R statistical package. R is freely available to download. Search
download R or go to: https://cran.r-project.org/. We will use it as a calculator
and also as a graphics package to explore data, perform statistical analysis, illustrate
theorems and calculate probabilities. You do not need to learn any program-
ming language. You will be instructed to learn basic commands like: 2+2; mean(x);
plot(x,y).
– In this module we will demonstrate using the R package itself. A nicer experience is provided by RStudio, a free desktop front-end to R developed by a commercial company. Please feel free to use that.
• Chapter 3: Random variables. We will learn that the results of different random experi-
ments lead to different random variables following distributions such as the binomial, normal,
etc. We will learn their basic properties, e.g. mean and variance.
• This module will provide a very gentle introduction to statistics and probability together with
the software package R for data analysis.
• Statistical knowledge is essential for any scientific career in academia, industry and govern-
ment.
• Read the New York Times article For Today’s Graduate, Just One Word: Statistics
(search on Google).
• Watch the YouTube video Joy of Statistics before attending the next lecture.
1. All students are asked to be seated promptly and to form a group of four students each
starting from the left most person in a row and then continuing.
2. Students left over at the end of a row (which could be at most three) are asked to move
elsewhere so that a group of four can be formed. It does not matter if you are seated next to
your friend as it is only a scientific data collection exercise.
3. In the extreme cases groups of size three will be allowed to ease the problem of accommodating
everyone. It is compulsory that every student must be a member of a group - no students
should be left out.
4. Each group should nominate an anchor person who will fill in the form and communicate the
guesses.
5. At this point classroom assistants will distribute identical photo-cards and a form to fill in.
6. Each group will have up to 15 minutes to guess the ages of the people appearing in the
photographs. The anchor person must then note down the guesses and fill in the rest of the
form.
2. The anchor person is then asked to note down the errors and calculate average absolute
errors.
3. Once this has been done the anchor person should hand over the form to one of the classroom
assistants. You are allowed to take a picture of the filled-in form before handing it in.
4. If time permits we will enter the absolute errors only in a live displayed spreadsheet so that
the groups can see the variation of the errors and see for themselves how they fared in the
guessing game. The individual group errors will not be revealed in the class to avoid possible
embarrassment! Below is how we will protect your privacy.
3. In the interest of science and curiosity, the form will collect only aggregate information about the group, such as the group size and the sex composition of the group.
♥ Example 1 Errors in age guessing Table 1.1 provides an example data set taken from the
book Teaching Statistics: A bag of tricks by Andrew Gelman and Deborah Nolan, publisher Oxford
University Press.
Table 1.1: Errors (estimated age minus actual age) for 10 photographed individuals (numbered 1
to 10) by 10 groups of students, A-J. The group sizes and sexes of the groups are given in the last
two columns of the table.
Photograph number Average # in
Group 1 2 3 4 5 6 7 8 9 10 abs error group Sex
A +14 –6 +5 +19 0 –5 –9 –1 –7 –7 7.3 3 F
B +8 0 +5 0 +5 –1 –8 –1 –8 0 3.7 4 F
C +6 –5 +6 +3 +1 –8 –18 +1 –9 –6 6.1 4 M
D +10 –7 +3 +3 +2 –2 –13 +6 –7 –7 6.0 2 M/F
E +11 –3 +4 +2 –1 0 –17 0 –14 +3 5.5 2 F
F +13 –3 +3 +5 –2 –8 –9 –1 –7 0 5.1 3 F
G +9 –4 +3 0 +4 –13 –15 +6 –7 +5 6.6 4 M
H +11 0 +2 +8 +3 +3 –15 +1 –7 0 5.0 4 M
I +6 –2 +2 +8 +3 –8 –7 –1 +1 –2 4.0 4 F
J +11 +2 +3 +11 +1 –8 –14 –2 –1 0 5.3 4 F
Truth 29 30 14 42 37 57 73 24 82 45
As well as randomness, we need to pay attention to the design of the study. In a designed
experiment the investigator controls the values of certain experimental variables and then measures
a corresponding output or response variable. In designed surveys an investigator collects data on a
randomly selected sample of a well-defined population. Designed studies can often be more effective
at providing reliable conclusions, but are frequently not possible because of difficulties in the study.
We will return to the topics of survey methods and designed surveys later in Lecture 30. Until
then we assume that we have data from n randomly selected sampling units, which we will conve-
niently denote by x1 , x2 , . . . , xn . We will assume that these values are numeric, either discrete like
counts, e.g. number of road accidents, or continuous, e.g. heights of 4-year-olds, marks obtained
in an examination. We will consider the following example:
♥ Example 2 Fast food service time The service times (in seconds) of customers at a fast-food
restaurant. The first row is for customers who were served from 9–10AM and the second row is for
customers who were served from 2–3PM on the same day.
AM 38 100 64 43 63 59 107 52 86 77
PM 45 62 52 72 81 88 64 75 59 70
Measures of location
• We are seeking a representative value for the data x1, x2, . . . , xn which should be a function of the data. If a is that representative value, how much error is associated with it? Two natural measures of the total error are the sum of squares of the errors, SSE = Σ_{i=1}^{n} (xi − a)², and the sum of the absolute errors, SSA = Σ_{i=1}^{n} |xi − a|.
• What value of a will minimise the SSE or the SSA? For SSE the answer is the sample mean and for SSA the answer is the sample median (a short numerical check in R is sketched at the end of this discussion).
• How can we prove the above assertion? Use the derivative method: set ∂SSE/∂a = 0 and solve for a. Check the second derivative condition, namely that ∂²SSE/∂a² is positive at the solution for a, so that the solution is a minimum. Try this at home.
– Alternatively, expand about the sample mean x̄:
Σ_{i=1}^{n} (xi − a)² = Σ_{i=1}^{n} (xi − x̄)² + n(x̄ − a)²,
since the cross-product term vanishes.
– Now note that: the first term is free of a; the second term is non-negative for any value of a. Hence the minimum occurs when the second term is zero, i.e. when a = x̄.
– This establishes the fact that
the sum of (or mean of) squares of the deviations from any number a is minimised when a is the mean.
– In the proof we also noted that Σ_{i=1}^{n} (xi − x̄) = 0. This is stated as: the sum of the deviations of the observations from their sample mean is always zero.
• The above justifies why we often use the mean as a representative value. For the service time
data, the mean time in AM is 68.9 seconds and for PM the mean is 66.8 seconds.
• Write the ordered observations as x(1) ≤ x(2) ≤ · · · ≤ x(n) and pair up the terms of the SSA from the outside in.
• It is easy to argue that |x(1) − a| + |x(n) − a| is minimised when a is such that x(1) ≤ a ≤ x(n).
• Similarly, |x(2) − a| + |x(n−1) − a| is minimised when a is such that x(2) ≤ a ≤ x(n−1).
• Finally, when n is odd, the last term |x((n+1)/2) − a| is minimised when a = x((n+1)/2), the middle value in the ordered list.
• If, however, n is even, the last pair of terms will be |x(n/2) − a| + |x(n/2+1) − a|. This will be minimised when a is any value between x(n/2) and x(n/2+1). For convenience, we often take the mean of these as the middle value.
• Hence the middle value, popularly known as the median, minimises the SSA. Hence the
median is also often used as a representative value or a measure of central tendency. This
establishes the fact that:
the sum of (or mean) of the absolute deviations from any number a is
minimised when a is the median.
• To recap: the median is defined as the observation ranked ½(n + 1) in the ordered list if n is odd. If n is even, the median is any value between the (n/2)th and (n/2 + 1)th observations in the ordered list. For example, for the AM service times, n = 10 and 38 < 43 < 52 < 59 < 63 < 64 < 77 < 86 < 100 < 107. So the median is any value between 63 and 64. For convenience, we often take the mean of these. So the median is 63.5 seconds. Note that we use the unit of the observations when reporting any measure of location.
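A quick numerical sketch in R of both minimisation results for the AM service times (the object names am, sse and saa are our own choices):
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)   # AM service times
a <- seq(60, 80, by = 0.1)                           # a grid of candidate representative values
sse <- sapply(a, function(aa) sum((am - aa)^2))      # sum of squared deviations for each a
saa <- sapply(a, function(aa) sum(abs(am - aa)))     # sum of absolute deviations for each a
a[which.min(sse)]   # 68.9, the sample mean
a[which.min(saa)]   # a value in the interval [63, 64]; any such value minimises the SSA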
The sample mode minimises the average of a 0-1 error function.
The mode or the most frequent (or the most likely) value in the data is taken as the most represen-
tative value if we consider a 0-1 error function instead of the SSA or SSE above. Here, one assumes
that the error is 0 if our guess a is the correct answer and 1 if it is not. It can then be proved that
(proof not required) the best guess a will be the mode of the data.
Which of the three (mean, median and mode) would you prefer?
The mean gets more affected by extreme observations while the median does not. For example for
the AM service times, suppose the next observation is 190. The median will be 64 instead of 63.5
but the mean will shoot up to 79.9.
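This sensitivity is easy to check in R; a quick sketch using the AM service times:
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
mean(am); median(am)      # 68.9 and 63.5 seconds
am2 <- c(am, 190)         # add one extreme observation
mean(am2); median(am2)    # the mean shoots up to about 79.9, the median only moves to 64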
Measures of spread
• A quick measure of the spread is the range, which is defined as the difference between the
maximum and minimum observations. For the AM service times the range is 69 (107 − 38)
seconds.
• Standard deviation: the square root of the variance, where the variance is (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)².
• A useful identity when computing the variance by hand:
Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi² − 2 xi x̄ + x̄²) = Σ_{i=1}^{n} xi² − 2x̄(n x̄) + n x̄² = Σ_{i=1}^{n} xi² − n x̄².
• Sometimes the variance is defined with the divisor n instead of n − 1. We have chosen n − 1
since this is the default in R. We will return to this in Chapter 4.
• The standard deviation (sd) for the AM service times is 23.2 seconds. Note that it has the same unit as the observations.
• The interquartile range (IQR) is the difference between the third quartile, Q3, and the first quartile, Q1, which are respectively the observations ranked ¼(3n + 1) and ¼(n + 3) in the ordered list. Note that the median is the second quartile, Q2. When n is even, definitions of Q3 and Q1 are similar to that of the median, Q2. The IQR for the AM service times is 83.75 − 53.75 = 30 seconds.
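These measures of spread are all available in R; a sketch for the AM service times (note that R's default quantile rule happens to agree with the ranks used above for these data):
am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
max(am) - min(am)              # range: 69 seconds
var(am)                        # sample variance, using divisor n - 1
sd(am)                         # standard deviation: about 23.2 seconds
quantile(am, c(0.25, 0.75))    # Q1 = 53.75 and Q3 = 83.75
IQR(am)                        # interquartile range: 30 seconds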
Our mission in this lecture is to get started with R. We will learn the basic R commands (mean,
var, summary, table, hist and boxplot) to explore data sets.
♥ Example 3 Computer failures
Weekly failures of a university computer system over a period of two years: 4, 0, 0, 0, . . . , 4, 2, 13.
4 0 0 0 3 2 0 0 6 7
6 2 1 11 6 1 2 1 1 2
0 2 2 1 0 12 8 4 5 0
5 4 1 0 8 2 5 2 1 12
8 9 10 17 2 3 4 8 1 2
5 1 2 2 3 1 2 0 2 1
6 3 3 6 11 10 4 3 0 2
4 2 1 5 3 3 2 5 3 4
1 3 6 4 4 5 2 10 4 1
5 6 9 7 3 1 3 0 2 2
1 4 2 13
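As a sketch of the commands we will use, the weekly failure counts above can be typed into a vector (called cfail here, a name of our choosing) and explored:
cfail <- c(4, 0, 0, 0, 3, 2, 0, 0, 6, 7, 6, 2, 1, 11, 6, 1, 2, 1, 1, 2,
           0, 2, 2, 1, 0, 12, 8, 4, 5, 0, 5, 4, 1, 0, 8, 2, 5, 2, 1, 12,
           8, 9, 10, 17, 2, 3, 4, 8, 1, 2, 5, 1, 2, 2, 3, 1, 2, 0, 2, 1,
           6, 3, 3, 6, 11, 10, 4, 3, 0, 2, 4, 2, 1, 5, 3, 3, 2, 5, 3, 4,
           1, 3, 6, 4, 4, 5, 2, 10, 4, 1, 5, 6, 9, 7, 3, 1, 3, 0, 2, 2,
           1, 4, 2, 13)
summary(cfail)    # minimum, quartiles, mean and maximum of the weekly counts
table(cfail)      # frequency table: how many weeks had 0, 1, 2, ... failures
hist(cfail)       # histogram of the counts
boxplot(cfail)    # boxplot of the counts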
• A small window (the R console) will appear, and you can type commands at the prompt >
• Example: mean(c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)) and hitting the Enter
button computes the mean of the numbers entered.
• c() is itself a command (function) that concatenates (collects) the input numbers into a vector.
• Even when an R function has no arguments we still need to use the brackets, such as in ls()
which gives a list of objects in the current workspace.
• Example: x <- 2+2 means that x is assigned the value of the expression 2+2.
• However, we almost never type the long R commands at the R prompt > as we are prone to
making mistakes and we may need to modify the commands for improved functionality.
• That is why we prefer to simply write down the commands one after another in a script file and save them for future use.
• You can either execute the entire script or only parts by highlighting the respective commands
and then clicking the Run button or Ctrl + R to execute.
• To read a tab-delimited text file of data with the first row giving the column headers, the
command is: read.table("filename.txt", head=TRUE).
• For comma-separated files (such as the ones exported by EXCEL), the command is
read.table("filename.csv", head=TRUE, sep=",") or simply
read.csv("filename.csv", head=TRUE).
• The option head=TRUE tells that the first row of the data file contains the column headers.
• For these commands to work we need to tell R where in the computer these data files are
stored.
• Assume that our data directory for MATH1024 will be the data sub-folder within the math1024
folder in the home folder H:/. Hence the directory path is: H:/math1024/data
• To get (e.g. to print) the current working directory, simply type getwd() and press Enter.
• You are reminded that the following data reading commands will fail if you have not set the
working directory correctly.
• Assuming that you have set the working directory to where your data files are saved, simply type and enter the appropriate read.table or read.csv command (a full example is sketched at the end of this section).
• That’s all the hard work needed to get the data into R!
• Just type ffood and hit the Enter button or the Run icon. See what happens.
• A convenient way to see the data is to see either the head or the tail of the data. For example,
type head(ffood) and Enter or tail(ffood) and Enter.
• To know the dimension (how many rows and columns) issue dim(ffood).
• To access elements of a data frame we can use square brackets, e.g. ffood[1, 2] gives the
first row second column element, ffood[1, ] gives everything in the first row and ffood[,
1] gives everything in the first column.
• The named columns in a data frame are often accessed by using the $ operator. For example,
ffood$AM prints the column whose header is AM.
• There are many R functions with intuitive names, e.g. mean, median, var, min, max,
sum, prod, summary, seq, rep etc. We will explain them as we need them.
• To calculate variance, try var(ffood$AM). What does the command var(c(ffood$AM, ffood$PM)) give?
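Putting these commands together, a minimal sketch of a complete session is given below; the file name servicetime.csv is only an assumed example, so replace it (and the folder path) with whatever your own data file is called:
setwd("H:/math1024/data")                        # tell R where the data files live
ffood <- read.csv("servicetime.csv", head=TRUE)  # read the comma-separated file into a data frame
head(ffood)                                      # first few rows
dim(ffood)                                       # number of rows and columns
mean(ffood$AM); mean(ffood$PM)                   # 68.9 and 66.8 seconds
summary(ffood)                                   # summary statistics for each column
var(c(ffood$AM, ffood$PM))                       # variance of all 20 service times pooled together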
2. Obtain simple summaries of the whole data and also by different groups as shown in the
sample report.
4. Provide histograms of the error values and also the absolute errors. How do the shapes change
and why?
5. Provide side-by-side boxplots separately for each classifier of the data, e.g. gender, true age, race.
6. Other investigations you may undertake: (i) ranking of the groups according to average absolute accuracy, (ii) identifying patterns in accuracy, for example, is it easier to guess the ages of older people? (iii) which of the variables (e.g. true age, race, sex) could be the best predictor of the guessing errors?
7. You are also free to investigate the data further, to discover any other patterns which may be lurking there.
8. Based on your experience of the experiment, try to suggest an aspect of the experimental
procedure which might be changed to improve the experiment in the future, or any additional
information which might usefully be included in future experiments to improve the data
collection and analysis.
2. Including code and output. You are asked to include R code, suitably edited, in your submission, but reports which include excessive and unnecessary R output will be penalised – you should extract exactly what you need from R, and nothing else, to incorporate in your report.
3. The project involves the modelling of real data. There is not necessarily a single ‘correct’
form of analysis. You may explore your own ideas in analysis which you think are plausible.
Submissions which demonstrate a good appreciation of statistical data analysis, together with
correct application of appropriate methods will receive high marks. Careful explanation and
clear presentation are also important.
Introduction to Probability
Chapter mission
Why should we study probability? What are probabilities? How do you find them? What are the
main laws of probabilities? How about some fun examples where probabilities are used to solve
real-life problems?
in a statistical framework called Bayesian inference. Such methods allow one to combine expert
opinion and evidence from data to make the best possible inferences and prediction. Unfortunately
discussion of Bayesian inference methods is beyond the scope of this module, although we will talk
about it when possible.
The second definition of probability comes from the long-term relative frequency of a result of
a random experiment (e.g. coin tossing) which can be repeated an infinite number of times under
essentially similar conditions. First we give some essential definitions.
Random experiments. The experiment is random because in advance we do not know exactly
what outcome the experiment will give, even though we can write down all the possible outcomes
which together are called the sample space (S). For example, in a coin tossing experiment, S
= {head, tail}. If we toss two coins together, S = {HH, HT, TH, TT} where H and T denote
respectively the outcome head and tail from the toss of a single coin.
Event. An event is defined as a particular result of the random experiment. For example, HH
(two heads) is an event when we toss two coins together. Similarly, at least one head e.g. {HH, HT,
TH} is an event as well. Events are denoted by capital letters A, B, C, . . . or A1 , B1 , A2 etc., and
a single outcome is called an elementary event, e.g. HH. An event which is a group of elementary
events is called a composite event, e.g. at least one head. How to determine the probability of a
given event A, P {A}, is the focus of probability theory.
Probability as relative frequency. Imagine we are able to repeat a random experiment
under identical conditions and count how many of those repetitions result in the event A. The
relative frequency of A, i.e. the ratio
(number of repetitions resulting in A) / (total number of repetitions),
approaches a fixed limit value as the number of repetitions increases. This limit value is defined as P{A}.
As a simple example, in the experiment of tossing a particular coin, suppose we are interested
in the event A of getting a ‘head’. We can toss the coin 1000 times (i.e. do 1000 replications of
the experiment) and record the number of heads out of the 1000 replications. Then the relative
frequency of A out of the 1000 replications is the proportion of heads observed.
Sometimes, however, it is much easier to find P{A} by using some ‘common knowledge’ about probability. For example, if the coin in the example above is fair (i.e. P{‘head’} = P{‘tail’}), then this information and the common knowledge that P{‘head’} + P{‘tail’} = 1 immediately imply that P{‘head’} = 0.5 and P{‘tail’} = 0.5. Next, the essential ‘common knowledge’ about probability will be formalized as the axioms of probability, which form the foundation of probability theory.
But before that, we need to learn a bit more about the event space (collection of all events).
♥ Example 5 Die throw Roll a six-faced die and observe the score on the uppermost face.
Here S = {1, 2, 3, 4, 5, 6}, which is composed of six elementary events.
The union of two given events A and B, denoted as (A or B) or A ∪ B, consists of the outcomes
that are either in A or B or both. ‘Event A ∪ B occurs’ means ‘either A or B occurs or both occur’.
For example, in Example 5, suppose A is the event that an even number is observed. This
event consists of the set of outcomes 2, 4 and 6, i.e. A = {an even number} = {2, 4, 6}. Sup-
pose B is the event that a number larger than 3 is observed. This event consists of the out-
comes 4, 5 and 6, i.e. B = {a number larger than 3} = {4, 5, 6}. Hence the event A ∪ B =
{an even number or a number larger than 3} = {2, 4, 5, 6}. Clearly, when a 6 is observed, both A
and B have occurred.
The intersection of two given events A and B, denoted as (A and B) or A ∩ B, consists of the
outcomes that are common to both A and B. ‘Event A∩B occurs’ means ‘both A and B occur’. For
example, in Example 5, A∩B = {4, 6}. Additionally, if C = {a number less than 6} = {1, 2, 3, 4, 5},
the intersection of events A and C is the event A ∩ C = {an even number less than 6} = {2, 4}.
The union and intersection of two events can be generalized in an obvious way to the union and
intersection of more than two events.
Two events A and D are said to be mutually exclusive if A ∩ D = ∅, where ∅ denotes the empty
set, i.e. A and D have no outcomes in common. Intuitively, ‘A and D are mutually exclusive’
means ‘A and D cannot occur simultaneously in the experiment’.
Figure 2.1: In the left plot A and B are mutually exclusive; the right plot shows A ∪ B and A ∩ B.
In Example 5, if D = {an odd number} = {1, 3, 5}, then A ∩ D = ∅ and so A and D are
mutually exclusive. As expected, A and D cannot occur simultaneously in the experiment.
For a given event A, the complement of A is the event that consists of all the outcomes not in
A and is denoted by A′. Note that A ∪ A′ = S and A ∩ A′ = ∅.
Thus, we can see the parallels between Set theory and Probability theory:
Set theory Probability theory
(1) Space Sample space
(2) Element or point Elementary event
(3) Set Event
The axioms of probability are:
A1 P{S} = 1,
A2 0 ≤ P{A} ≤ 1 for any event A,
A3 P{A ∪ B} = P{A} + P{B} provided that A and B are mutually exclusive events.
As a consequence of the axioms we obtain the general addition rule: for any two events A and B,
P{A ∪ B} = P{A} + P{B} − P{A ∩ B}.
Proof: We can write A ∪ B = (A ∩ B′) ∪ (A ∩ B) ∪ (A′ ∩ B). All three of these are mutually exclusive events. Hence, by A3,
P{A ∪ B} = P{A ∩ B′} + P{A ∩ B} + P{A′ ∩ B}
         = P{A} − P{A ∩ B} + P{A ∩ B} + P{B} − P{A ∩ B}
         = P{A} + P{B} − P{A ∩ B}.
(6) The sum of the probabilities of all the outcomes in the sample space S is 1.
For any event A, we find P{A} by adding up 1/N for each of the outcomes in event A:
P{A} = (number of outcomes in A) / (total number of possible outcomes of the experiment).
Return to Example 5 where a six-faced die is rolled. Suppose that one wins a bet if a 6 is
rolled. Then the probability of winning the bet is 1/6 as there are six possible outcomes in the
sample space and exactly one of those, 6, wins the bet. Suppose A denotes the event that an
even-numbered face is rolled. Then P {A} = 3/6 = 1/2 as we can expect.
♥ Example 6 Dice throw Roll 2 distinguishable dice and observe the scores. Here S =
{(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), . . . , (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)} which consists of 36
possible outcomes or elementary events, A1 , . . . , A36 . What is the probability of the outcome 6 in
both the dice? The required probability is 1/36. What is the probability that the sum of the two
dice is greater than 6? How about the probability that the sum is less than any number, e.g. 8?
Hint: Write down the sum for each of the 36 outcomes and then find the probabilities asked just
by inspection. Remember, each of the 36 outcomes has equal probability 1/36.
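For equally likely outcomes, R can simply enumerate the sample space and count; a quick sketch for the two dice:
S <- expand.grid(die1 = 1:6, die2 = 1:6)   # all 36 equally likely outcomes
total <- S$die1 + S$die2                   # the sum of the two scores
mean(S$die1 == 6 & S$die2 == 6)            # P(6 on both dice) = 1/36
mean(total > 6)                            # P(sum greater than 6) = 21/36
mean(total < 8)                            # P(sum less than 8) = 21/36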
The next lecture will continue to find probabilities using specialist counting techniques called
permutation and combination. This will allow us to find probabilities in a number of practical
situations.
The UK National Lottery selects 6 numbers at random from 1 to 49. I bought one ticket - what
is the probability that I will win the jackpot?
♥ Example 7 Counting Suppose there are 7 routes to London from Southampton and then there are 5 routes to Cambridge out of London. How many ways can I travel to Cambridge from Southampton via London? By the multiplication rule the answer is obviously 7 × 5 = 35.
The task is to select k(≥ 1) from the n (n ≥ k) available people and sit the k selected people in k
(different) chairs. By considering the i-th sub-task as selecting a person to sit in the i-th chair (i =
1, . . . , k), it follows directly from the multiplication rule above that there are n(n−1) · · · (n−[k −1])
ways to complete the task. The number n(n−1) · · · (n−[k−1]) is called the number of permutations
of k from n and denoted by
nPk = n(n − 1) · · · (n − [k − 1]).
In particular, when k = n we have n Pn = n(n − 1) · · · 1, which is called ‘n factorial’ and denoted
as n!. Note that 0! is defined to be 1. It is clear that
nPk = n(n − 1) · · · (n − [k − 1]) = [n(n − 1) · · · (n − [k − 1]) × (n − k)!] / (n − k)! = n! / (n − k)!.
♥ Example 8 Football How many possible rankings are there for the 20 football teams in
the premier league at the end of a season? This number is given by 20 P20 = 20!, which is a huge
number! How many possible permutations are there for the top 4 positions who will qualify to play
in Europe in the next season? This number is given by 20 P4 = 20 × 19 × 18 × 17.
The number of combinations of k from n: nCk or ‘n choose k’
The task is to select k (≥ 1) from the n (n ≥ k) available people. Note that this task does NOT involve sitting the k selected people in k (different) chairs. We want to find the number of possible ways to complete this task, which is denoted as nCk (read ‘n choose k’).
For this, let us reconsider the task of “selecting k(≥ 1) from the n (n ≥ k) available people and
sitting the k selected people in k (different) chairs”, which we already know from the discussion
above has n Pk ways to complete.
Alternatively, to complete this task, one has to complete two sub-tasks sequentially. The first
sub-task is to select k(≥ 1) from the n (n ≥ k) available people, which has n Ck ways. The second
sub-task is to sit the k selected people in k (different) chairs, which has k! ways. It follows directly
from the multiplication rule that there are nCk × k! ways to complete the task. Hence we have
nPk = nCk × k!,   i.e.   nCk = nPk / k! = n! / [(n − k)! k!].
♥ Example 9 Football How many possible ways are there to choose 3 teams for the bottom
positions of the premier league table at the end of a season? This number is given by 20 C3 =
20 × 19 × 18/3!, which does not take into consideration the rankings of the three bottom teams!
♥ Example 10 Microchip A box contains 12 microchips of which 4 are faulty. A sample of size 3 is drawn from the box without replacement.
More examples and details regarding the combinations are provided in Section A.2. You are
strongly recommended to read that section now.
A sample of size n is drawn at random without replacement from a box of N items containing
a proportion p of defective items.
• How many defective items are in the box? N p. How many good items are there? N (1 − p).
Assume these to be integers.
• What is the probability that the sample contains exactly x defective items? Counting the equally likely samples gives
P(X = x) = (Np choose x) × (N(1 − p) choose n − x) / (N choose n).
• Which values of x (in terms of N, n and p) make this expression well defined?
We’ll see later that these values of x and the corresponding probabilities make up what is called
the hyper-geometric distribution.
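R has built-in functions for these probabilities; as a sketch for the microchip box of Example 10, with 4 faulty and 8 good chips and a sample of size 3:
dhyper(0:3, m = 4, n = 8, k = 3)             # P(0, 1, 2 or 3 faulty chips in the sample)
sum(dhyper(0:3, m = 4, n = 8, k = 3))        # the probabilities add up to 1
choose(4, 2) * choose(8, 1) / choose(12, 3)  # P(exactly 2 faulty) computed directly from the counting formula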
♥ Example 13 The National Lottery In Lotto, a winning ticket has six numbers from 1 to 49
matching those on the balls drawn on a Wednesday or Saturday evening. The ‘experiment’ consists
of drawing the balls from a box containing 49 balls. The ‘randomness’, the equal chance of any set
of six numbers being drawn, is ensured by the spinning machine, which rotates the balls during the
selection process. What is the probability of winning the jackpot? There are 49C6 = 13,983,816 equally likely choices of six numbers, only one of which matches the winning selection, so the probability is 1/49C6, about 7.2 × 10^−8.
There is one other way of winning by using the bonus ball – matching 5 of the selected 6 balls plus matching the bonus ball. The probability of this is given by
P{5 matches + bonus} = 6 / 49C6 = 4.29 × 10^−7.
Adding all these probabilities of winning some kind of prize together gives the overall chance of winning something with a single ticket. So a player buying one ticket each week would expect to win a prize (most likely a £10 prize for matching three numbers) about once a year.
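The counting function choose() in R makes these lottery calculations immediate; for example:
choose(49, 6)        # number of possible selections of six numbers: 13,983,816
1 / choose(49, 6)    # probability of the jackpot, about 7.2e-08
6 / choose(49, 6)    # probability of matching 5 numbers plus the bonus ball, about 4.29e-07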
Applications of conditional probability occur naturally in actuarial science and medical stud-
ies, where conditional probabilities such as “what is the probability that a person will survive for
another 20 years given that they are still alive at the age of 40?” are calculated.
In many real problems, one has to determine the probability of an event A when one already
has some partial knowledge of the outcome of an experiment, i.e. another event B has already
occurred. For this, one needs to find the conditional probability.
♥ Example 14 Dice throw continued Return to the rolling of a fair die (Example 5). Let A = {a number larger than 3} = {4, 5, 6} and B = {an even number} = {2, 4, 6}.
It is clear that P{B} = 3/6 = 1/2. This is the unconditional probability of the event B. It is sometimes called the prior probability of B.
However, suppose that we are told that the event A has already occurred. What is the proba-
bility of B now given that A has already happened?
The sample space of the experiment is S = {1, 2, 3, 4, 5, 6}, which contains n = 6 equally likely
outcomes.
Given the partial knowledge that event A has occurred, only the nA = 3 outcomes in A =
{4, 5, 6} could have occurred. However, only some of the outcomes in B among these nA outcomes
in A will make event B occur; the number of such outcomes is given by the number of outcomes
nA∩B in both A and B, i.e., A ∩ B, and equal to 2. Hence the probability of B, given the partial
knowledge that event A has occurred, is equal to
2/3 = nA∩B / nA = (nA∩B / n) / (nA / n) = P{A ∩ B} / P{A}.
Hence we say that P{B|A} = 2/3, which is often interpreted as the posterior probability of B given A. The additional knowledge that A has already occurred has helped us to revise the prior probability of 1/2 to 2/3.
This simple example leads to the following general definition of conditional probability: for two events A and B with P{A} > 0, the conditional probability of B given A is
P{B|A} = P{A ∩ B} / P{A}.
Hence the multiplication rule of conditional probability for two events is:
P{A ∩ B} = P{A} P{B|A}, and similarly P{A ∩ B} = P{B} P{A|B} when P{B} > 0.
♥ Example 17 Phones Suppose that in our world there are only three phone manufacturing
companies: A Pale, B Sung and C Windows, and their market shares are respectively 30, 40 and 30
percent. Suppose also that respectively 5, 8, and 10 percent of their phones become faulty within
one year. If I buy a phone randomly (ignoring the manufacturer), what is the probability that my
phone will develop a fault within one year? After finding the probability, suppose that my phone
developed a fault in the first year - what is the probability that it was made by A Pale?
To answer this type of question, we derive two of the most useful results in probability theory:
the total probability formula and the Bayes theorem. First, let us derive the total probability
formula.
Suppose B1, B2, . . . , Bk are mutually exclusive and exhaustive events, i.e.
Bi ∩ Bj = ∅ for all 1 ≤ i ≠ j ≤ k,   and   B1 ∪ B2 ∪ . . . ∪ Bk = S.
For any event A we can write
A = A ∩ S = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bk),
and since the events A ∩ B1, . . . , A ∩ Bk are mutually exclusive,
P{A} = P{A ∩ B1} + P{A ∩ B2} + . . . + P{A ∩ Bk}
     = P{B1}P{A|B1} + P{B2}P{A|B2} + . . . + P{Bk}P{A|Bk};
this last expression is called the total probability formula for P{A}.
Figure 2.2: The left figure shows the mutually exclusive and exhaustive events B1 , . . . , B6 (they
form a partition of the sample space); the right figure shows a possible event A.
♥ Example 18 Phones continued We can now find the probability of the event, say A, that a randomly selected phone develops a fault within one year. Let B1, B2, B3 be the events that the phone is manufactured respectively by companies A Pale, B Sung and C Windows. Then we have P{B1} = 0.30, P{B2} = 0.40, P{B3} = 0.30 and P{A|B1} = 0.05, P{A|B2} = 0.08, P{A|B3} = 0.10, so the total probability formula gives
P{A} = 0.30 × 0.05 + 0.40 × 0.08 + 0.30 × 0.10 = 0.015 + 0.032 + 0.030 = 0.077.
Now suppose that my phone has developed a fault within one year. What is the probability that it was manufactured by A Pale? To answer this we need to introduce the Bayes Theorem.
The Bayes theorem states that, for each i = 1, . . . , k,
P{Bi|A} = P{Bi} P{A|Bi} / [ Σ_{j=1}^{k} P{Bj} P{A|Bj} ].
The Bayes theorem follows directly by substituting P {A} by the total probability formula.
The probability, P {Bi |A} is called the posterior probability of Bi and P {Bi } is called the prior
probability. The Bayes theorem is the rule that converts the prior probability into the poste-
rior probability by using the additional information that some other event, A above, has already
occurred.
♥ Example 19 Phones continued The probability that my faulty phone was manufactured by A Pale is
P{B1|A} = P{B1} P{A|B1} / P{A} = (0.30 × 0.05) / 0.077 = 0.1948.
Similarly, the probability that the faulty phone was manufactured by B Sung is 0.4156, and the
probability that it was manufactured by C Windows is 1-0.1948-0.4156 = 0.3896.
The worked examples section contains further illustrations of the Bayes theorem. Note that Σ_{i=1}^{k} P{Bi|A} = 1. Why? Nowadays the Bayes theorem is used to make statistical inference as well.
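The phone calculation of Examples 17 to 19 takes only a few lines in R; a sketch (the vector names prior and lik are our own):
prior <- c(APale = 0.30, BSung = 0.40, CWindows = 0.30)   # market shares P{Bi}
lik <- c(0.05, 0.08, 0.10)                                # fault probabilities P{A|Bi}
pA <- sum(prior * lik)                                    # total probability formula: 0.077
posterior <- prior * lik / pA                             # Bayes theorem
posterior                                                 # 0.1948, 0.4156 and 0.3896
sum(posterior)                                            # the posterior probabilities add up to 1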
2.4.2 Definition
We have seen examples where prior knowledge that an event A has occurred has changed the prob-
ability that event B occurs. There are many situations where this does not happen. The events
are then said to be independent.
Intuitively, events A and B are independent if the occurrence of one event does not affect the
probability that the other event occurs.
P {B|A} = P {B}, where P {A} > 0, and P {A|B} = P {A}, where P {B} > 0.
Formally, A and B are independent if P{A ∩ B} = P{A} × P{B}. As an example, return to the die throw of Example 5 with A = {an even number} = {2, 4, 6} and B = {a number larger than 3} = {4, 5, 6}. For this, we have P{A ∩ B} = P{either a 4 or 6 thrown} = 1/3, but P{A} = 1/2 and P{B} = 1/2, so that P{A}P{B} = 1/4 ≠ 1/3 = P{A ∩ B}. Therefore A and B are not independent events.
Note that independence is not the same as the mutually exclusive property. When two events, A
and B, are mutually exclusive, the probability of their intersection, A∩B, is zero, i.e. P {A∩B} = 0.
But if the two events are independent then P {A ∩ B} = P {A} × P {B}.
Independence is often assumed on physical grounds, although sometimes incorrectly. There can be serious consequences of wrongly assuming independence, e.g. the financial crisis in 2008. However, when the events are independent, the simpler product formula for the joint probability can be used.
♥ Example 21 Two fair dice when shaken together are assumed to behave independently. Hence
the probability of two sixes is 1/6 × 1/6 = 1/36.
♥ Example 22 Assessing risk in legal cases In recent years there have been some disastrous
miscarriages of justice as a result of incorrect assumption of independence. Please read “Incorrect
use of independence – Sally Clark Case” on Blackboard.
If A and B are independent events, then so are the complementary events A′ and B′:
P{A′ ∩ B′} = 1 − P{A ∪ B}
           = 1 − [P{A} + P{B} − P{A ∩ B}]
           = 1 − [P{A} + P{B} − P{A}P{B}]
           = [1 − P{A}] − P{B}[1 − P{A}]
           = [1 − P{A}][1 − P{B}] = P{A′}P{B′}.
The ideas of conditional probability and independence can be extended to more than two events.
Three events A, B and C are said to be independent if they are pairwise independent, i.e.
P{A ∩ B} = P{A}P{B}, P{A ∩ C} = P{A}P{C}, P{B ∩ C} = P{B}P{C},   (2.1)
and, in addition,
P{A ∩ B ∩ C} = P{A}P{B}P{C}.   (2.2)
Note that (2.1) does NOT imply (2.2), as shown by the next example. Hence, to show the independence of A, B and C, it is necessary to show that both (2.1) and (2.2) hold.
♥ Example 23 A box contains eight tickets, each labelled with a binary number. Two are
labelled with the binary number 111, two are labelled with 100, two with 010 and two with 001.
An experiment consists of drawing one ticket at random from the box.
Let A be the event “the first digit is 1”, B the event “the second digit is 1” and C be the event
“the third digit is 1”. It is clear that P {A} = P {B} = P {C} = 4/8 = 1/2 and P {A ∩ B} =
P {A ∩ C} = P {B ∩ C} = 1/4, so the events are pairwise independent, i.e. (2.1) holds. However
P{A ∩ B ∩ C} = 2/8 ≠ P{A}P{B}P{C} = 1/8. So (2.2) does not hold and A, B and C are not independent.
Bernoulli trials The notion of independent events naturally leads to a set of independent trials (or random experiments, e.g. repeated coin tossing). A set of independent trials, where each trial has only two possible outcomes, conveniently called success (S) and failure (F), and the probability of success is the same in each trial, is called a set of Bernoulli trials. There are lots of fun examples involving Bernoulli trials.
♥ Example 24 Feller’s road crossing example The flow of traffic at a certain street crossing
is such that the probability of a car passing during any given second is p and cars arrive randomly,
i.e. there is no interaction between the passing of cars at different seconds. Treating seconds as
indivisible time units, and supposing that a pedestrian can cross the street only if no car is to
pass during the next three seconds, find the probability that the pedestrian has to wait for exactly
k = 0, 1, 2, 3, 4 seconds.
Let Ci denote the event that a car comes in the ith second and let Ni denote the event that no
car arrives in the ith second.
1. Consider k = 0. The pedestrian does not have to wait if and only if there are no cars in the next three seconds, i.e. the event N1 N2 N3. Now the arrivals of the cars in successive seconds are independent and the probability of no car coming in any second is q = 1 − p. Hence the answer is P{N1 N2 N3} = q · q · q = q³.
2. Consider k = 1. The person has to wait for one second if there is a car in the first second and none in the next three, i.e. the event C1 N2 N3 N4. Hence the probability of that is pq³.
3. Consider k = 2. The person has to wait two seconds if and only if there is a car in the 2nd second but none in the next three. It does not matter whether there is a car in the first second. Hence:
P{wait 2 seconds} = P{C1 C2 N3 N4 N5} + P{N1 C2 N3 N4 N5} = p · p · q³ + q · p · q³ = pq³.
4. Consider k = 3. The person has to wait for three seconds if and only if a car passes in the 3rd second but none in the next three, C3 N4 N5 N6. Anything can happen in the first two seconds, i.e. C1 C2, C1 N2, N1 C2, N1 N2 – all these four cases are mutually exclusive. Hence,
P{wait 3 seconds} = P{C1 C2 C3 N4 N5 N6} + P{N1 C2 C3 N4 N5 N6} + P{C1 N2 C3 N4 N5 N6} + P{N1 N2 C3 N4 N5 N6}
                  = p · p · p · q³ + p · q · p · q³ + q · p · p · q³ + q · q · p · q³
                  = pq³.
5. Consider k = 4. This is more complicated because the person has to wait exactly 4 seconds if and only if a car passes in at least one of the first 3 seconds, one passes at the 4th but none pass in the next 3 seconds. The probability that at least one car passes in the first three seconds is 1 minus the probability that there is none in the first 3 seconds, i.e. 1 − q³. Hence the answer is (1 − q³) p q³.
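A quick numerical sketch of these waiting probabilities, taking for illustration the arbitrary value p = 0.3:
p <- 0.3; q <- 1 - p
c(wait0 = q^3,                 # no car in the next three seconds
  wait1 = p * q^3,             # a car in second 1, then three clear seconds
  wait2 = p * q^3,             # a car in second 2 (anything in second 1), then three clear seconds
  wait3 = p * q^3,             # a car in second 3 (anything before), then three clear seconds
  wait4 = (1 - q^3) * p * q^3) # at least one car in seconds 1-3, a car in second 4, then three clear seconds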
The reliability gets lower when components are included in series. For n components in series, P{system works} = p1 p2 · · · pn. When pi = p for all i, the reliability of a series of n components is P{system works} = p^n.
For two components in parallel the system works if at least one of them works, so
P{system works} = p1 + p2 − p1 p2 = 1 − (1 − p1)(1 − p2).
This is greater than either p1 or p2, so that the inclusion of a (redundant) component in parallel increases the reliability of the system. Another way of arriving at this result uses complementary events: the system fails only if both components fail, which has probability (1 − p1)(1 − p2), so P{system works} = 1 − (1 − p1)(1 − p2).
A general system
The ideas above can be combined to evaluate the reliability of more complex systems.
♥ Example 25 Switches Six switches make up the circuit shown in the graph.
Each has the probability pi = P{Di} of closing correctly; the mechanisms are independent; all are operated by the same impulse. The reliability of the whole circuit can then be found by combining the series and parallel rules above.
There are some additional examples of reliability applications given in the “Reliability Exam-
ples” document available on Blackboard. You are advised to read through and understand these
additional examples/applications.
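As a sketch of the series and parallel formulas in R, with two made-up component reliabilities 0.9 and 0.8:
p <- c(0.9, 0.8)        # assumed component reliabilities
prod(p)                 # two components in series: 0.72
1 - prod(1 - p)         # two components in parallel: 0.98, higher than either component alone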
It is unlikely that truthful answers will be given in an open questionnaire, even if it is stressed
that the responses would be treated with anonymity. Some years ago a randomised response
technique was introduced to overcome this difficulty. This is a simple application of conditional
probability. It ensures that the interviewee can answer truthfully without the interviewer (or any-
one else) knowing the answer to the sensitive question. How? Consider two alternative questions,
for example:
Question 1 should not be contentious and should not be such that the interviewer could find
out the true answer.
The respondent answers only 1 of the two questions. Which question is answered by the respon-
dent is determined by a randomisation device, the result of which is known only to the respondent.
The interviewer records only whether the answer given was Yes or No (and he/she does not know
which question has been answered). The proportion of Yes answers to the question of interest can
be estimated from the total proportion of Yes answers obtained. Carry out this simple experiment:
Toss a coin - do not reveal the result of the coin toss!
If heads - answer Question 1: Was your mother born in January?
If tails - answer Question 2: Have you ever taken illegal substances in the last 12 months?
We need to record the following information for the outcome of the experiment:
Total number in sample = n;
Total answering Yes = r, so that an estimate of P {Yes} is r/n.
This information can be used to estimate the proportion of Yes answers to the main question
of interest, Question 2.
Suppose that Q1 denotes the event that the coin toss directs the respondent to answer Question 1, and Q2 the event that it directs them to answer Question 2.
Then, assuming that the coin was unbiased, P {Q1 } = 0.5 and P {Q2 } = 0.5. Also, assuming that
birthdays of mothers are evenly distributed over the months, we have that the probability that the
interviewee will answer Yes to Q1 is 1/12. Let Y be the event that a ‘Yes’ answer is given. Then
the total probability formula gives
P{Y} = P{Q1} P{Y|Q1} + P{Q2} P{Y|Q2},
which leads to
r/n ≈ 1/2 × 1/12 + 1/2 × P{Y|Q2}.
Hence
P{Y|Q2} ≈ 2 · (r/n) − 1/12.
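A sketch of the resulting estimate with made-up survey numbers, say n = 200 respondents of whom r = 36 answered Yes:
n <- 200; r <- 36
r / n                      # overall proportion of Yes answers
2 * (r / n) - 1 / 12       # estimated P{Y|Q2}, the proportion answering Yes to the sensitive question (about 0.277 here)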
We, however, will move on to the next chapter on random variables, which formalises the
concepts of probabilities in structured practical cases. The concept of random variables allows us
to calculate probabilities of random events much more easily in structured ways.
Random Variables and Their Probability Distributions
Chapter mission
Last chapter’s combinatorial probabilities are difficult to find and very problem-specific. Instead,
in this chapter we shall find easier ways to calculate probability in structured cases. The outcomes
of random experiments will be represented as values of a variable which will be random since the
outcomes are random (or un-predictable with certainty). In so doing, we will make our life a lot
easier in calculating probabilities in many stylised situations which represent reality. For example,
we shall learn to calculate the probability that a computer will make fewer than 10 errors while making 10^15 computations when it has a very tiny chance, 10^−14, of making an erroneous computation.
3.1.2 Introduction
A random variable defines a mapping of the sample space, consisting of all possible outcomes of a random experiment, to the set of real numbers. For example, I toss a coin. Assuming the coin is fair, there are two possible equally likely outcomes: head or tail. These two outcomes must be mapped to real numbers. For convenience, I may define the mapping which assigns the value 1 if head turns up and 0 otherwise. Hence, we have the mapping:
Head → 1, Tail → 0.
We can conveniently denote the random variable by X which is the number of heads obtained by
tossing a single coin. Obviously, all possible values of X are 0 and 1.
You will say that this is a trivial example. Indeed it is. But it is very easy to generalise the
concept of random variables. Simply define a mapping of the outcomes of a random experiment
to the real number space. For example, I toss the coin n times and count the number of heads
and denote that to be X. Obviously, X can take any real positive integer value between 0 and
n. Among other examples, suppose I select a University of Southampton student at random and
measure their height. The outcome in metres will be a number between one metre and two metres
for sure. But I can’t exactly tell which value it will be since I do not know which student will be
selected in the first place. However, when a student has been selected I can measure their height
and get a value such as 1.432 metres.
We now introduce two notations: X (or in general the capital letters Y , Z etc.) to denote the
random variable, e.g. height of a randomly selected student, and the corresponding lower case letter
x (y, z) to denote a particular value, e.g. 1.432 metres. We will follow this convention throughout.
For a random variable, say X, we will also adopt the notation P (X ∈ A), read probability that X
belongs to A, instead of the previous P {A} for any event A.
When the random variable can take any value on the real line it is called a continuous random
variable. For example, the height of a randomly selected student. A random variable can also take
a mixture of discrete and continuous values, e.g. volume of precipitation collected in a day; some
days it could be zero, on other days it could be a continuous measurement, e.g. 1.234 mm.
A single coin toss, where P(X = 1) = p and P(X = 0) = 1 − p, gives an example of the Bernoulli distribution with parameter p, perhaps the simplest discrete distribution.
♥ Example 26 Suppose we consider tossing the coin twice and again defining the random variable
X to be the number of heads obtained. The values that X can take are 0, 1 and 2 with probabilities
(1 − p)2 , 2p(1 − p) and p2 , respectively. Here the distribution is:
Value(x) P (X = x)
0 (1 − p)2
1 2p(1 − p)
2 p2
Total prob 1.
This is a particular case of the Binomial distribution. We will learn about it soon.
In general, for a discrete random variable we define a function f(x) to denote P(X = x) (or f(y) to denote P(Y = y)) and call the function f(x) the probability function (pf) or probability mass function (pmf) of the random variable X. Arbitrary functions cannot be a pmf since the total probability must be 1 and all probabilities are non-negative. Hence, for f(x) to be the pmf of a random variable X, we require:
(i) f(x) ≥ 0 for all possible values x of X, and (ii) Σ_x f(x) = 1, summing over all possible values x.
Note that f(x) = 0 for any other value of x and thus f(x) is a discrete function of x.
For a continuous random variable, P (X = x) is defined to be zero since we assume that the
measurements are continuous and there is zero probability of observing a particular value, e.g. 1.2.
The argument goes that a finer measuring instrument will give us an even more precise measure-
ment than 1.2 and so on. Thus for a continuous random variable we adopt the convention that
P(X = x) = 0 for any particular value x on the real line. But we define probabilities for positive length intervals, e.g. P(1.2 < X < 1.9).
For a continuous random variable X we define its probability by using a continuous function f(x) which we call its probability density function, abbreviated as its pdf. With the pdf we define probabilities as integrals, e.g.
P(a < X < b) = ∫_a^b f(u) du,
which is naturally interpreted as the area under the curve f(x) inside the interval (a, b). Recall that we do not use f(x) = P(X = x) for any x as by convention we set P(X = x) = 0.
Figure 3.1: The shaded area is P (a < X < b) if the pdf of X is the drawn curve.
Since we are dealing with probabilities, which are always between 0 and 1, not just any arbitrary function f(x) can be a pdf of some random variable. For f(x) to be a pdf, as in the discrete case, we must have: (i) the probabilities are non-negative, i.e. f(x) ≥ 0 for all x, and (ii) the total probability must be 1, i.e. ∫_{−∞}^{∞} f(x) dx = 1.
♥ Example 27 Let X be the number of heads in the experiment of tossing two fair coins. Then the probability function is f(0) = 1/4, f(1) = 1/2, f(2) = 1/4 (put p = 1/2 in Example 26).
Note that the cdf for a discrete random variable is a step function. The jump-points are the
possible values of the random variable (r.v.), and the height of a jump gives the probability of
the random variable taking that value. It is clear that the probability mass function is uniquely
determined by the cdf.
f(x) = dF(x)/dx;
that is, for a continuous random variable the pdf is the derivative of the cdf. Also, for any random variable X, P(c < X < d) = F(d) − F(c). Let us consider an example.
♥ Example 28 Uniform distribution Suppose
f(x) = 1/(b − a) if a < x < b, and f(x) = 0 otherwise.
We now have the cdf F(x) = ∫_a^x du/(b − a) = (x − a)/(b − a) for a < x < b. A quick check confirms that F′(x) = f(x). If a = 0 and b = 1, then P(0.5 < X < 0.75) = F(0.75) − F(0.5) = 0.25. We shall see many more examples later.
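The same probability can be checked with R's built-in uniform distribution functions; a sketch with a = 0 and b = 1:
punif(0.75, min = 0, max = 1) - punif(0.5, min = 0, max = 1)   # P(0.5 < X < 0.75) = 0.25
curve(dunif(x, 0, 1), from = -0.5, to = 1.5)                   # the flat pdf of the uniform distribution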
♥ Example 30 Continuous Consider the uniform distribution which has the pdf f(x) = 1/(b − a), a < x < b. Then
E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_a^b x/(b − a) dx = (b² − a²)/(2(b − a)) = (b + a)/2,
which is the mid-point of the interval (a, b).
If Y = g(X) for any function g(·), then Y is a random variable as well. To find E(Y ) we simply
use the value times probability rule, i.e. the expected value of Y is either sum or integral of its
value, g(x) times probability f (x).
E(Y) = E(g(X)) = Σ_{all x} g(x) f(x) if X is discrete, and E(Y) = E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx if X is continuous.
For example, if X is continuous, then E(X²) = ∫_{−∞}^{∞} x² f(x) dx. We prove one important property of expectation, namely that expectation is a linear operator: for constants a and b, E(aX + b) = a E(X) + b. For a continuous X,
E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx = a E(X) + b,
using the value times probability definition of the expectation and the total probability is 1 property (∫_{−∞}^{∞} f(x) dx = 1) in the last integral. This is very convenient, e.g. suppose E(X) = 5 and Y = −2X + 549; then E(Y) = −2 × 5 + 549 = 539.
The variance of a random variable X is defined as
Var(X) = Σ_{all x} (x − µ)² f(x) if X is discrete, and Var(X) = ∫_{−∞}^{∞} (x − µ)² f(x) dx if X is continuous,
where µ = E(X), and when the sum or integral exists. They can't always be assumed to exist! When the variance exists, it is the expectation of (X − µ)², where µ is the mean of X. We now derive an easy formula to calculate the variance:
Var(X) = E[(X − µ)²] = E(X² − 2µX + µ²) = E(X²) − 2µ E(X) + µ² = E(X²) − µ².
We usually denote the variance by σ². The square is there to emphasise that the variance of any random variable is always non-negative. When can the variance be zero? When there is no variation at all in the random variable, i.e. it takes only a single value µ with probability 1. Hence, there is nothing random about the random variable – we can predict its outcome with certainty.
The square root of the variance is called the standard deviation of the random variable.
♥ Example 31 Uniform Consider the uniform distribution which has the pdf f(x) = 1/(b − a), a < x < b. Then
E(X²) = ∫_a^b x²/(b − a) dx = (b³ − a³)/(3(b − a)) = (b² + ab + a²)/3.
Hence
Var(X) = (b² + ab + a²)/3 − ((b + a)/2)² = (b − a)²/12,
after simplification.
If Y = aX + b for constants a and b, then, writing µ = E(X),
Var(Y) = E[(Y − E(Y))²] = ∫_{−∞}^{∞} (ax + b − aµ − b)² f(x) dx = a² ∫_{−∞}^{∞} (x − µ)² f(x) dx = a² Var(X).
This is a very useful result, e.g. suppose Var(X) = 25 and Y = −X + 5,000,000; then Var(Y) = Var(X) = 25 and the standard deviation σ = 5. In words, a location shift, b, does not change the variance, but a multiplicative constant, a say, gets squared in the variance, a².
An outcome of the experiment (of carrying out n such independent trials) is represented by a
sequence of S’s and F ’s (such as SS...F S...SF ) that comprises x S’s, and (n − x) F ’s.
For this sequence, X = x, but there are many other sequences which will also give X = x. In
fact there are \binom{n}{x} such sequences. Hence
P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, 1, . . . , n.
This is the pmf of the Binomial Distribution with parameters n and p, often written as Bin(n, p).
How can we guarantee that Σ_{x=0}^{n} P(X = x) = 1? This guarantee is provided by the binomial
theorem:
(a + b)^n = b^n + \binom{n}{1} a b^{n−1} + · · · + \binom{n}{x} a^x b^{n−x} + · · · + a^n.
To prove Σ_{x=0}^{n} P(X = x) = 1, i.e. to prove Σ_{x=0}^{n} \binom{n}{x} p^x (1 − p)^{n−x} = 1, choose a = p and b = 1 − p.
♥ Example 32 Suppose that widgets are manufactured in a mass production process with 1%
defective. The widgets are packaged in bags of 10 with a money-back guarantee if more than 1
widget per bag is defective. For what proportion of bags would the company have to provide a
refund?
Firstly, we want to find the probability that a randomly selected bag has at most 1 defective
widget. Note that the number of defective widgets in a bag, X, satisfies X ∼ Bin(n = 10, p = 0.01). So this
probability is equal to
P(X ≤ 1) = P(X = 0) + P(X = 1) = (0.99)^{10} + 10 × (0.01) × (0.99)^9 = 0.9957.
Hence the probability that a refund is required is 1 − 0.9957 = 0.0043, i.e. only just over 4 in 1000
bags will incur the refund on average.
In R, the command dbinom calculates the binomial pmf. For example, suppose X ∼ Bin(n = 5, p = 0.34) and we want P(X = 3), i.e.
x = 3. That is, the command dbinom(x=3, size=5, prob=0.34) will return the value P(X =
3) = \binom{5}{3} (0.34)^3 (1 − 0.34)^{5−3}. The command pbinom returns the cdf, or the probability up to
and including the argument. Thus pbinom(q=3, size=5, prob=0.34) will return the value of
P(X ≤ 3) when X ∼ Bin(n = 5, p = 0.34). As a check, in the above example the command is
pbinom(q=1, size=10, prob=0.01), which returns 0.9957338.
♥ Example 33 A binomial random variable can also be described using the urn model. Suppose
we have an urn (population) containing N individuals, a proportion p of which are of type S and a
proportion 1 − p of type F . If we select a sample of n individuals at random with replacement,
then the number, X, of type S individuals in the sample follows the binomial distribution with
parameters n and p.
Mean of the Binomial distribution
Let X ∼ Bin(n, p). We have
E(X) = Σ_{x=0}^{n} x P(X = x) = Σ_{x=0}^{n} x \binom{n}{x} p^x (1 − p)^{n−x}.
Below we prove that E(X) = np. Recall that k! = k(k − 1)! for any k > 0.
E(X) = Σ_{x=0}^{n} x \binom{n}{x} p^x (1 − p)^{n−x}
= Σ_{x=1}^{n} x [n!/(x!(n−x)!)] p^x (1 − p)^{n−x}
= Σ_{x=1}^{n} [n!/((x−1)!(n−x)!)] p^x (1 − p)^{n−x}
= np Σ_{x=1}^{n} [(n−1)!/((x−1)!(n−1−x+1)!)] p^{x−1} (1 − p)^{n−1−x+1}
= np Σ_{y=0}^{n−1} [(n−1)!/(y!(n−1−y)!)] p^y (1 − p)^{n−1−y}
= np (p + 1 − p)^{n−1} = np,
where we used the substitution y = x − 1 and then the binomial theorem to conclude that the last
sum is equal to 1.
It is illuminating to see these direct proofs. Later on we shall apply statistical theory to directly
prove these! Notice that the binomial theorem is used repeatedly to prove the results.
Now consider a sequence of independent Bernoulli trials, each with success probability p, and let X be the number of trials needed to obtain the first success (S). The possible outcomes are:
S       X = 1,  P(X = 1) = p
FS      X = 2,  P(X = 2) = (1 − p)p
FFS     X = 3,  P(X = 3) = (1 − p)^2 p
FFFS    X = 4,  P(X = 4) = (1 − p)^3 p
and so on. In general we have
P(X = x) = (1 − p)^{x−1} p,   x = 1, 2, . . .
This is called the geometric distribution, and it has a (countably) infinite domain starting at 1, not
0. We write X ∼ Geo(p).
Let us check that the probability function has the required property:
Σ_{x=1}^{∞} P(X = x) = Σ_{x=1}^{∞} (1 − p)^{x−1} p
= p Σ_{y=0}^{∞} (1 − p)^y   [substitute y = x − 1]
= p · 1/(1 − (1 − p))   [see Section A.4]
= 1.
We can also find the probability that X > k for some given natural number k:
P(X > k) = Σ_{x=k+1}^{∞} P(X = x) = Σ_{x=k+1}^{∞} (1 − p)^{x−1} p = (1 − p)^k Σ_{y=0}^{∞} (1 − p)^y p = (1 − p)^k,
substituting y = x − (k + 1) and using the geometric series again.
The proof is given below. In practice this means that the random variable does not remember its
age (denoted by k) when determining how much longer (denoted by s) it will survive! The proof below
uses the definition of conditional probability,
P(B | A) = P(A ∩ B) / P(A).
Now the proof:
P(X > s + k | X > k) = P(X > s + k, X > k) / P(X > k)
= P(X > s + k) / P(X > k)
= (1 − p)^{s+k} / (1 − p)^k
= (1 − p)^s,
which does not depend on k. Note that the event {X > s + k and X > k} implies and is implied by
{X > s + k}, since s > 0.
For n > 0 and |x| < 1, the negative binomial series is given by:
(1 − x)^{−n} = 1 + nx + [n(n+1)/2] x^2 + [n(n+1)(n+2)/6] x^3 + · · · + [n(n+1)(n+2)···(n+k−1)/k!] x^k + · · ·
With n = 2 and x = 1 − p the general term is given by:
n(n+1)(n+2)···(n+k−1)/k! = [2 × 3 × 4 × · · · × (2 + k − 1)]/k! = k + 1.
Hence Σ_{x=1}^{∞} x(1 − p)^{x−1} = Σ_{k=0}^{∞} (k + 1)(1 − p)^k = (1 − (1 − p))^{−2}, and thus
E(X) = p(1 − 1 + p)^{−2} = 1/p. It can be shown that Var(X) = (1 − p)/p^2 using the negative binomial
series. But this is more complicated and is not required. The second-year module MATH2011
will provide an alternative proof.
The proofs of the above results use complicated finite summation and so are omitted. But note
that when N → ∞ the variance converges to the variance of the binomial distribution. Indeed, the
hypergeometric distribution is a finite population analogue of the binomial distribution.
♥ Example 34 In a board game that uses a single fair die, a player cannot start until they have
rolled a six. Let X be the number of rolls needed until they get a six. Then X is a Geometric
random variable with success probability p = 1/6.
♥ Example 35 A man plays roulette, betting on red each time. He decides to keep playing until
he achieves his second win. The success probability for each game is 18/37 and the results of games
are independent. Let X be the number of games played until he gets his second win. Then X is
a Negative Binomial random variable with r = 2 and p = 18/37. What is the probability he plays
more than 3 games? i.e. find P (X > 3).
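As a quick check in R: dnbinom and pnbinom parameterise the negative binomial by the number of failures before the r-th success, so the number of games X corresponds to X − r failures; under that convention the answer is roughly 0.52. The values below follow the example (r = 2, p = 18/37):
p <- 18/37; r <- 2
1 - pnbinom(3 - r, size = r, prob = p)    # P(X > 3), approximately 0.52
# the same value from the pmf P(X = x) = choose(x - 1, r - 1) * p^r * (1 - p)^(x - r):
1 - sum(choose((2:3) - 1, r - 1) * p^r * (1 - p)^((2:3) - r))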
Derivation of the mean and variance of the negative binomial distribution involves compli-
cated negative binomial series and will be skipped for now, but will be proved in Lecture 19. For
completeness we note down the mean and variance:
E(X) = r/p,   Var(X) = r(1 − p)/p^2.
Thus when r = 1, the mean and variance of the negative binomial distribution are equal to those
of the geometric distribution.
Consider X ∼ Bin(n, p) with p = λ/n for a fixed λ > 0. Then, as n → ∞,
P(X = x) → e^{−λ} λ^x / x!
for any fixed value of x in the range 0, 1, 2, . . .. Note that we have used the exponential
limit:
e^{−λ} = lim_{n→∞} (1 − λ/n)^n,
and
lim_{n→∞} (1 − λ/n)^{−x} = 1,
and
lim_{n→∞} (n/n) ((n − 1)/n) · · · ((n − x + 1)/n) = 1.
A random variable X has the Poisson distribution with parameter λ if it has the pmf:
P(X = x) = e^{−λ} λ^x / x!,   x = 0, 1, 2, . . .
We write X ∼ Poisson(λ). It is trivial to show that Σ_{x=0}^{∞} P(X = x) = 1, i.e. Σ_{x=0}^{∞} e^{−λ} λ^x/x! = 1. The
identity you need is simply the expansion of e^{λ}.
Hence, the mean and variance are the same for the Poisson distribution.
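A quick numerical illustration in R, using value times probability sums over a range that contains essentially all of the probability mass (λ = 2.5 is an arbitrary illustrative value):
lambda <- 2.5                                  # illustrative value only
x <- 0:200                                     # probabilities beyond 200 are negligible here
m <- sum(x * dpois(x, lambda))                 # mean, approximately 2.5
v <- sum((x - m)^2 * dpois(x, lambda))         # variance, also approximately 2.5
c(m, v)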
The Poisson distribution can be derived from another consideration when we are waiting for
events to occur, e.g. waiting for a bus to arrive or to be served at a supermarket till. The number
of occurrences in a given time interval can sometimes be modelled by the Poisson distribution.
Here the assumption is that the probability of an event (arrival) is proportional to the length of the
waiting time for small time intervals. Such a process is called a Poisson process, and it can be shown
that the waiting time between successive events can be modelled by the exponential distribution
which is discussed in the next lecture.
For a > 0, the integral
Γ(a) = ∫_0^{∞} x^{a−1} e^{−x} dx
is defined to be the gamma function, and it has a finite real value. Moreover, we have the following
facts:
Γ(1/2) = √π;   Γ(1) = 1;   Γ(a) = (a − 1)Γ(a − 1) if a > 1.
The last two facts imply that Γ(k) = (k − 1)! when k is a positive integer. Exercise: find Γ(3/2).
and so Var(X) = E(X^2) − [E(X)]^2 = 2/θ^2 − 1/θ^2 = 1/θ^2. Note that for this random variable the
mean is equal to the standard deviation.
We have F (0) = 0 and F (x) → 1 when x → ∞ and F (x) is non-decreasing in x. The cdf can be
used to solve many problems. A few examples follow.
Using R to calculate probabilities
For the exponential distribution the command dexp(x=3, rate=1/2) calculates the pdf at
x = 3. The rate parameter to be supplied is the θ parameter here. The command pexp returns the
cdf or the probability up to and including the argument. Thus pexp(q=3, rate=1/2) will return
the value of P (X ≤ 3) when X ∼ Exponential(θ = 0.5).
♥ Example 36 Mobile phone Suppose that the lifetime of a phone (e.g. the time until the
phone does not function even after repairs), denoted by X, manufactured by the company A Pale,
is exponentially distributed with mean 550 days.
1. Find the probability that a randomly selected phone will still function after two years, i.e.
X > 730? [Assume there is no leap year in the two years].
2. What are the times by which 25%, 50%, 75% and 90% of the manufactured phones will have
failed?
Here the mean 1/θ = 550. Hence θ = 1/550 is the rate parameter. The solution to the first
problem is
P(X > 730) = e^{−θ × 730} = e^{−730/550} ≈ 0.265,
which can be obtained in R as 1 - pexp(q=730, rate=1/550).
For the second problem we are given the probabilities of failure (0.25, 0.50 etc.). We will have
to invert the probabilities to find the value of the random variable. In other words, we will have
to find a q such that F (q) = p, where p is the given probability. For example, what value of q will
give us F (q) = 0.25, so that 25% of the phones will have failed by time q?
For a given 0 < p < 1, the pth quantile (or 100p percentile) of the random variable
X with cdf F (x) is defined to be the value q for which F (q) = p.
The 50th percentile is called the median. The 25th and 75th percentiles are called
the quartiles.
♥ Example 37 Uniform distribution Consider the uniform distribution U(a, b) in the interval
(a, b). Here F(x) = (x − a)/(b − a). So for a given p, F(q) = p implies q = a + p(b − a).
For the uniform U(a, b) distribution the median is (a + b)/2, and the quartiles are (3a + b)/4 and (a + 3b)/4.
Returning to the exponential distribution example, we have p = F(q) = 1 − e^{−θq}. Find q when
p is given:
p = 1 − e^{−θq}
⇒ e^{−θq} = 1 − p
⇒ −θq = log(1 − p)
⇒ q = −log(1 − p)/θ
⇒ q = −550 × log(1 − p).
Review the rules of log in Section A.5. Now we have the following table:
p q = −550 × log(1 − p)
0.25 158.22
0.50 381.23
0.75 762.46
0.90 1266.422
In R you can find these values by qexp(p=0.25, rate=1/550), qexp(p=0.50, rate=1/550), etc.
For fun, you can find qexp(p=0.99, rate=1/550) = 6 years and 343 days! The function qexp(p,
rate) calculates the 100p percentile of the exponential distribution with parameter rate.
Assuming the mean survival time to be 100 days for a fatal, late-detected cancer, we can expect
that half of the patients survive 69.3 days after chemo, since qexp(0.50, rate=1/100) = 69.3.
You will learn more about this in a third-year module, Math3085: Survival models, which is important
in actuarial science.
♥ Example 39 Memoryless property Like the geometric distribution, the exponential distri-
bution also has the memoryless property. In simple terms, it means that the probability that the
system will survive an additional period s > 0 given that it has survived up to time t is the same
as the probability that the system survives the period s to begin with. That is, it forgets that it
has survived up to a particular time when it is thinking of its future remaining life time.
The proof is exactly as in the case of the geometric distribution, reproduced below. Recall the
definition of conditional probability:
P(B | A) = P(A ∩ B) / P(A).
Now the proof:
P(X > s + t | X > t) = P(X > s + t, X > t) / P(X > t)
= P(X > s + t) / P(X > t)
= e^{−θ(s+t)} / e^{−θt}
= e^{−θs}
= P(X > s).
Note that the event {X > s + t and X > t} implies and is implied by {X > s + t}, since s > 0.
♥ Example 40 The time T between any two successive arrivals in a hospital emergency depart-
ment has probability density function:
f(t) = λ e^{−λt} if t ≥ 0, and 0 otherwise.
Historically, on average the mean of these inter-arrival times is 5 minutes. Calculate (i) P (0 < T <
5), (ii) P (T < 10|T > 5).
Here the mean inter-arrival time is 1/λ = 5 minutes, so λ = 1/5. (i) P(0 < T < 5) = 1 − e^{−5/5} = 1 − e^{−1} ≈ 0.632. (ii) By the memoryless property, P(T < 10 | T > 5) = P(T < 5) = 1 − e^{−1} ≈ 0.632.
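These two answers can be checked quickly with pexp (assuming, as above, the exponential model with mean 5 minutes):
rate <- 1/5                                                    # mean inter-arrival time of 5 minutes
pexp(5, rate = rate)                                           # (i)  P(0 < T < 5), approximately 0.632
(pexp(10, rate = rate) - pexp(5, rate = rate)) / (1 - pexp(5, rate = rate))   # (ii) P(T < 10 | T > 5), also approximately 0.632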
Suppose all of these hold (since they are proved below). Then it is easy to remember the pdf
of the normal distribution:
f(variable) = (1/√(2π variance)) exp{ −(variable − mean)^2 / (2 variance) },
where variable denotes the random variable. The density (pdf) is much easier to remember and
work with when the mean µ = 0 and the variance σ^2 = 1. In this case, we simply write:
f(x) = (1/√(2π)) exp{ −x^2/2 }   or   f(variable) = (1/√(2π)) exp{ −variable^2/2 }.
Now let us prove the 3 assertions, R1, R2 and R3. R1 is proved as follows:
∫_{−∞}^{∞} f(x) dx = ∫_{−∞}^{∞} (1/√(2πσ^2)) exp{ −(x − µ)^2/(2σ^2) } dx
= (1/√(2π)) ∫_{−∞}^{∞} exp{ −z^2/2 } dz   [substitute z = (x − µ)/σ so that dx = σ dz]
= (1/√(2π)) · 2 ∫_0^{∞} exp{ −z^2/2 } dz   [since the integrand is an even function]
= (1/√(2π)) · 2 ∫_0^{∞} exp{ −u } du/√(2u)   [substitute u = z^2/2 so that z = √(2u) and dz = du/√(2u)]
= (1/(2√π)) · 2 ∫_0^{∞} u^{1/2 − 1} exp{ −u } du   [rearrange the terms]
= (1/√π) Γ(1/2)   [recall the definition of the Gamma function]
= (1/√π) √π = 1   [as Γ(1/2) = √π].
(i) X ∼ N(µ, σ^2)  ⟷  Z ≡ (X − µ)/σ ∼ N(0, 1).   (3.2)
Then by the linearity of expectations, i.e. if X = µ + σZ for constants µ and σ then E(X) =
µ + σE(Z) = µ, the result follows. To prove (3.2), we first calculate the cdf, given by:
Φ(z) = P(Z ≤ z) = P((X − µ)/σ ≤ z) = P(X ≤ µ + zσ)
= ∫_{−∞}^{µ+zσ} (1/√(2πσ^2)) exp{ −(x − µ)^2/(2σ^2) } dx
= ∫_{−∞}^{z} (1/√(2π)) exp{ −u^2/2 } du,   [u = (x − µ)/σ]
so that
dΦ(z)/dz = (1/√(2π)) exp{ −z^2/2 }   for −∞ < z < ∞,
by the fundamental theorem of calculus. This proves that Z ∼ N (0, 1). The converse is proved just
by reversing the steps. Thus we have proved (i) above. We use the Φ(·) notation to denote the cdf
of the standard normal distribution. Now:
E(Z) = ∫_{−∞}^{∞} z f(z) dz = ∫_{−∞}^{∞} z (1/√(2π)) exp{ −z^2/2 } dz = (1/√(2π)) × 0 = 0,
since the integrand g(z) = z exp{ −z^2/2 } is an odd function, i.e. g(−z) = −g(z); for an odd function
g(z), ∫_{−a}^{a} g(z) dz = 0 for any a. Therefore we have also proved (3.3) and hence R2.
To prove R3, i.e. Var(X) = σ^2, we show that Var(Z) = 1 where Z = (X − µ)/σ, and then claim
that Var(X) = σ^2 Var(Z) = σ^2 from our earlier result. Since E(Z) = 0, Var(Z) = E(Z^2), which is
calculated below:
E(Z^2) = ∫_{−∞}^{∞} z^2 f(z) dz = ∫_{−∞}^{∞} z^2 (1/√(2π)) exp{ −z^2/2 } dz
= (2/√(2π)) ∫_0^{∞} z^2 exp{ −z^2/2 } dz   [since the integrand is an even function]
= (2/√(2π)) ∫_0^{∞} 2u exp{ −u } du/√(2u)   [substituting u = z^2/2 so that z = √(2u) and dz = du/√(2u)]
= (4/(2√π)) ∫_0^{∞} u^{1/2} exp{ −u } du
= (2/√π) ∫_0^{∞} u^{3/2 − 1} exp{ −u } du
= (2/√π) Γ(3/2)   [definition of the gamma function]
= (2/√π) (3/2 − 1) Γ(3/2 − 1)   [reduction property of the gamma function]
= (2/√π) (1/2) √π   [since Γ(1/2) = √π]
= 1,
The N(0, 1) distribution is called the standard normal distribution because of the following reasons. Suppose X ∼ N(µ, σ^2) and we are interested in finding
P(a ≤ X ≤ b) for two constants a and b. Then
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
= ∫_a^b (1/√(2πσ^2)) exp{ −(x − µ)^2/(2σ^2) } dx
= ∫_{(a−µ)/σ}^{(b−µ)/σ} (1/√(2π)) exp{ −z^2/2 } dz   [substituting z = (x − µ)/σ so that dx = σ dz]
= ∫_{−∞}^{(b−µ)/σ} (1/√(2π)) exp{ −z^2/2 } dz − ∫_{−∞}^{(a−µ)/σ} (1/√(2π)) exp{ −z^2/2 } dz
= P(Z ≤ (b − µ)/σ) − P(Z ≤ (a − µ)/σ)
= Φ((b − µ)/σ) − Φ((a − µ)/σ).
This result allows us to find the probabilities about a normal random variable X of any mean µ and
variance σ 2 through the probabilities of the standard normal random variable Z. For this reason,
only Φ(z) is tabulated. Furthermore, due to the symmetry of the pdf of Z, Φ(z) is tabulated only
for positive z values. Suppose a > 0; then
Φ(−a) = P(Z ≤ −a) = P(Z ≥ a) = 1 − Φ(a).
In R, we use the function pnorm to calculate the probabilities. The general function is: pnorm(q,
mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE). So, we use the command pnorm(1)
to calculate Φ(1) = P (Z ≤ 1). We can also use the command pnorm(15, mean=10, sd=2) to
calculate P (X ≤ 15) when X ∼ N (µ = 10, σ 2 = 4) directly.
1. P (−1 < Z < 1) = Φ(1) − Φ(−1) = 0.6827. This means that 68.27% of the probability lies
within 1 standard deviation of the mean.
2. P (−2 < Z < 2) = Φ(2) − Φ(−2) = 0.9545. This means that 95.45% of the probability lies
within 2 standard deviations of the mean.
3. P (−3 < Z < 3) = Φ(3) − Φ(−3) = 0.9973. This means that 99.73% of the probability lies
within 3 standard deviations of the mean.
We are often interested in the quantiles (the inverse cdf of a probability, Φ^{−1}(·)) of the normal distribution
for various reasons. We find the pth quantile by issuing the R command qnorm(p).
1. qnorm(0.95) = Φ−1 (0.95) = 1.645. This means that the 95th percentile of the standard
normal distribution is 1.645. This also means that P (−1.645 < Z < 1.645) = Φ(1.645) −
Φ(−1.645) = 0.90.
2. qnorm(0.975) = Φ−1 (0.975) = 1.96. This means that the 97.5th percentile of the stan-
dard normal distribution is 1.96. This also means that P (−1.96 < Z < 1.96) = Φ(1.96) −
Φ(−1.96) = 0.95.
♥ Example 41 Historically, the marks in MATH1024 follow the normal distribution with mean
58 and standard deviation 32.25.
1. What percentage of students will fail (i.e. score less than 40) in MATH1024? Answer:
pnorm(40, mean=58, sd=32.25) = 28.84%.
2. What percentage of students will get an A result (score greater than 70)? Answer: 1-
pnorm(70, mean=58, sd=32.25) = 35.49%.
3. What is the probability that a randomly selected student will score more than 90? Answer:
1- pnorm(90, mean=58, sd=32.25) = 0.1605.
4. What is the probability that a randomly selected student will score less than 25? Answer:
pnorm(25, mean=58, sd=32.25) = 0.1531. Ouch!
5. What is the probability that a randomly selected student scores a 2:1, (i.e. a mark between
60 and 70)? Left as an exercise.
♥ Example 42 A lecturer set and marked an examination and found that the distribution
of marks was N (42, 142 ). The school’s policy is to present scaled marks whose distribution is
N (50, 152 ). What linear transformation should the lecturer apply to the raw marks to accomplish
this and what would the raw mark of 40 be transformed to?
Suppose X ∼ N(µ_x = 42, σ_x^2 = 14^2) and Y ∼ N(µ_y = 50, σ_y^2 = 15^2). Hence, we should have
Z = (X − µ_x)/σ_x = (Y − µ_y)/σ_y,
giving us:
Y = µ_y + (σ_y/σ_x)(X − µ_x) = 50 + (15/14)(X − 42).
Now at raw mark X = 40, the transformed mark would be:
Y = 50 + (15/14)(40 − 42) = 47.86.
Suppose X ∼ N(µ, σ^2) and let Y = exp(X). Then
E(Y) = E[exp(X)] = ∫_{−∞}^{∞} exp(x) (1/(σ√(2π))) exp{ −(x − µ)^2/(2σ^2) } dx
= exp{ −(µ^2 − (µ + σ^2)^2)/(2σ^2) } ∫_{−∞}^{∞} (1/(σ√(2π))) exp{ −(x^2 − 2(µ + σ^2)x + (µ + σ^2)^2)/(2σ^2) } dx
= exp{ −(µ^2 − (µ + σ^2)^2)/(2σ^2) }   [integrating a N(µ + σ^2, σ^2) pdf over its domain]
= exp{ µ + σ^2/2 }.
Similarly,
E(Y^2) = E[exp(2X)] = ∫_{−∞}^{∞} exp(2x) (1/(σ√(2π))) exp{ −(x − µ)^2/(2σ^2) } dx
= · · ·
= exp{ 2µ + 2σ^2 }.
A function f(x, y) is the joint pmf of a pair of discrete random variables (X, Y) if (i) f(x, y) ≥ 0 for all x and y, and (ii) Σ_x Σ_y f(x, y) = 1.
The marginal probability mass functions (marginal pmf's) of X and Y are respectively
f_X(x) = Σ_y f(x, y),   f_Y(y) = Σ_x f(x, y).
Use the identity Σ_x Σ_y f(x, y) = 1 to prove that f_X(x) and f_Y(y) are really pmf's.
♥ Example 44 Suppose that two fair dice are tossed independently one after the other. Let
X = −1 if the result from die 1 is larger, 0 if the results are equal, and 1 if the result from die 1 is smaller.
Let Y = |difference between the two dice|. There are 36 possible outcomes. Each of them gives
a pair of values of X and Y . Y can take any of the values 0, 1, 2, 3, 4, 5. Construct the joint
probability table for X and Y .
Each pair of results above (and hence pair of values of X and Y ) has the same probability 1/36.
Hence the joint probability table is given in Table 3.1
The marginal probability distributions are just the row totals or column totals depending on
whether you want the marginal distribution of X or Y . For example, the marginal distribution of
X is given in Table 3.2.
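As an aside, the joint table can be generated in R by enumerating the 36 equally likely outcomes; this is only an illustrative check:
outcomes <- expand.grid(d1 = 1:6, d2 = 1:6)    # all 36 equally likely results of the two dice
X <- with(outcomes, ifelse(d1 > d2, -1, ifelse(d1 == d2, 0, 1)))
Y <- with(outcomes, abs(d1 - d2))
table(X, Y) / 36                               # the joint probability table of X and Y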
Exercises: Write down the marginal distribution of Y and hence find the mean and variance of
Y.
How can we show that the above is a pdf? It is non-negative for all x and y values. But does it
integrate to 1? We are going to use the following rule.
Result Suppose that a real-valued function f(x, y) is continuous in a region D where a < x < b
and c < y < d. Then
∫∫_D f(x, y) dx dy = ∫_c^d ( ∫_a^b f(x, y) dx ) dy.
• Rewrite the region A as an intersection of two one-dimensional intervals. The first interval is
obtained by treating one variable as constant.
♥ Example 46 Continued
∫_0^1 ∫_0^1 f(x, y) dx dy = ∫_0^1 ∫_0^1 6xy^2 dx dy
= 6 ∫_0^1 y^2 dy ∫_0^1 x dx
= 3 ∫_0^1 y^2 dy   [as ∫_0^1 x dx = 1/2]
= 1.   [as ∫_0^1 y^2 dy = 1/3]
The probability of any event in the two-dimensional space can be found by integration and again
more details will be provided in a second-year module. You will come across multivariate integrals
in a second semester module. You will not be asked to do bivariate integration in this
module.
We will not consider any continuous examples as the second-year module MATH2011 will study
them in detail.
Suppose that two random variables X and Y have joint pmf or pdf f(x, y), and let E(X) = µ_x
and E(Y) = µ_y. The covariance between X and Y is defined by
Cov(X, Y) = E[(X − µ_x)(Y − µ_y)] = E(XY) − µ_x µ_y.
Let σx2 = Var(X) = E(X 2 ) − µ2x and σy2 = Var(Y ) = E(Y 2 ) − µ2y . The correlation coefficient
between X and Y is defined by:
Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = (E(XY) − µ_x µ_y) / (σ_x σ_y).
It can be proved that for any two random variables, −1 ≤ Corr(X, Y ) ≤ 1. The correlation
Corr(X, Y ) is a measure of linear dependency between two random variables X and Y , and it is
free of the measuring units of X and Y as the units cancel in the ratio.
3.8.4 Independence
Independence is an important concept. Recall that we say two events A and B are independent if
P (A ∩ B) = P (A) × P (B). We use the same idea here. Two random variables X and Y having the
joint pdf or pmf f(x, y) are said to be independent if and only if
f(x, y) = f_X(x) f_Y(y) for all possible values of x and y.
♥ Example 48 Discrete Case X and Y are independent if each cell probability, f (x, y), is the
product of the corresponding row and column totals. In our very first dice example (Example 44)
X and Y are not independent. Verify that in the following example X and Y are independent. We
need to check all 9 cells.
                  y
            1      2      3     Total
x     0    1/6    1/12   1/12    1/3
      1    1/4    1/8    1/8     1/2
      2    1/12   1/24   1/24    1/6
Total      1/2    1/4    1/4     1
♥ Example 49 Let f (x, y) = 6xy 2 , 0 < x < 1, 0 < y < 1. Check that X and Y are independent.
♥ Example 51 Deceptive
The joint pdf may look like something you can factorise. But X and Y may not be independent
because they may be related in the domain.
1. f(x, y) = (21/4) x^2 y,  x^2 ≤ y ≤ 1. Not independent!
Consequences of Independence
P (X ∈ A, Y ∈ B) = P (X ∈ A) × P (Y ∈ B)
for any events A and B. That is, the joint probability can be obtained as the product of the
marginal probabilities. We will use this result in the next lecture. For example, suppose Jack
and Jess are two randomly selected students. Let X denote the height of Jack and Y denote
the height of Jess. Then we have, for example,
P(X > 182, Y < 165) = P(X > 182) × P(Y < 165).
Obviously this has to be true for any numbers other than the example numbers 182 and 165,
and for any inequalities.
• Further, let g(x) be a function of x only and h(y) be a function of y only. Then, if X and Y
are independent, it is easy to prove that E[g(X) h(Y)] = E[g(X)] E[h(Y)].
In this lecture we study sums of independent random variables, since many statistics of interest are of this form: the sample total is the sum of the sample values in
question, and the sample mean is proportional to the sum of the sample values. By doing this, in
the next lecture we will introduce the widely-used central limit theorem, the normal approximation
to the binomial distribution and so on. In this lecture we will also use this theory to reproduce
some of the results we obtained before, e.g. finding the mean and variance of the binomial and
negative binomial distributions.
3.9.2 Introduction
Suppose we have obtained a random sample from a distribution with pmf or pdf f (x), so that
X can either be a discrete or a continuous random variable. We will learn more about random
sampling in the next chapter. Let X1 , . . . , Xn denote the random sample of size n where n is a
positive integer. We use upper case letters since each member of the random sample is a random
variable. For example, I toss a fair coin n times and let Xi take the value 1 if a head appears in the
ith trial and 0 otherwise. Now I have a random sample X1 , . . . , Xn from the Bernoulli distribution
with probability of success equal to 0.5 since the coin is assumed to be fair.
We can get a random sample from a continuous random variable as well. Suppose it is known
that the distribution of the heights of first-year students is normal with mean 175 centimetres and
standard deviation 8 centimetres. I can randomly select a number of first-year students and record
each student’s height.
Suppose X1 , . . . , Xn is a random sample from a population with distribution f (x). Then it can
be shown that the random variables X1 , . . . , Xn are mutually independent, i.e.
P(X_1 ∈ A_1, X_2 ∈ A_2, . . . , X_n ∈ A_n) = P(X_1 ∈ A_1) × P(X_2 ∈ A_2) × · · · × P(X_n ∈ A_n)
for any set of events, A1 , A2 , . . . An . That is, the joint probability can be obtained as the product
of individual probabilities. An example of this for n = 2 was given in the previous lecture; see the
discussion just below the paragraph Consequences of independence.
♥ Example 52 Distribution of the sum of independent binomial random variables
Suppose X ∼ Bin(m, p) and Y ∼ Bin(n, p) independently. Note that p is the same in both distri-
butions. Using the above fact that joint probability is the multiplication of individual probabilities,
we can conclude that Z = X + Y has the binomial distribution. It is intuitively clear that this
should happen since X comes from m Bernoulli trials and Y comes from n Bernoulli trials indepen-
dently, so Z comes from m + n Bernoulli trials with common success probability p. We can prove
the result mathematically as well, by finding the probability mass function of Z = X + Y directly
and observing that it is of the appropriate form. First, note that
P(Z = z) = Σ P(X = x, Y = y)
subject to the constraint that x + y = z, 0 ≤ x ≤ m, 0 ≤ y ≤ n. Thus,
P(Z = z) = Σ_{x+y=z} P(X = x, Y = y)
= Σ_{x+y=z} \binom{m}{x} p^x (1 − p)^{m−x} \binom{n}{y} p^y (1 − p)^{n−y}
= Σ_{x+y=z} \binom{m}{x} \binom{n}{y} p^z (1 − p)^{m+n−z}
= p^z (1 − p)^{m+n−z} Σ_{x+y=z} \binom{m}{x} \binom{n}{y}
= \binom{m+n}{z} p^z (1 − p)^{m+n−z},
using a result stated in Section A.3. Thus, we have proved that the sum of independent binomial
random variables with common probability is binomial as well. This is called the reproductive
property of random variables. You are asked to prove this for the Poisson distribution in an
exercise sheet.
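This reproductive property can also be checked numerically in R by convolving the two pmfs; the values m = 4, n = 6 and p = 0.3 below are arbitrary illustrative choices:
m <- 4; n <- 6; p <- 0.3                       # illustrative values only
pz <- sapply(0:(m + n), function(z) {
  x <- max(0, z - n):min(m, z)                 # admissible values of X when X + Y = z
  sum(dbinom(x, m, p) * dbinom(z - x, n, p))
})
all.equal(pz, dbinom(0:(m + n), m + n, p))     # TRUE: Z = X + Y is Bin(m + n, p)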
Now we will state two main results without proof. The proofs will presented in the second-
year distribution theory module MATH2011. Suppose that X1 , . . . , Xn is a random sample from
a population distribution with finite variance, and suppose that E(Xi ) = µi and Var(Xi ) = σi2 .
Define a new random variable
Y = a1 X1 + a2 X2 + · · · + an Xn
for known constants a_1, . . . , a_n. Then:
1. E(Y) = a_1 µ_1 + a_2 µ_2 + · · · + a_n µ_n.
2. If, in addition, X_1, . . . , X_n are independent (as they are for a random sample), then Var(Y) = a_1^2 σ_1^2 + a_2^2 σ_2^2 + · · · + a_n^2 σ_n^2.
For example, if ai = 1 for all i = 1, . . . , n, the two results above imply that:
The expectation of the sum of independent random variables is the sum of the expectations
of the individual random variables
and
the variance of the sum of independent random variables is the sum of the variances of
the individual random variables.
The second result is only true for independent random variables, e.g. random samples. Now we
will consider many examples.
First, consider the binomial distribution. A Bin(n, p) random variable can be written as Y = X_1 + X_2 + · · · + X_n,
where each Xi is an independent Bernoulli trial with success probability p. We have shown before
that, E(Xi ) = p and Var(Xi ) = p(1 − p) by direct calculation. Now the above two results imply
that:
E(Y) = E( Σ_{i=1}^{n} X_i ) = p + p + · · · + p = np,
and similarly
Var(Y) = Var(X_1) + · · · + Var(X_n) = p(1 − p) + · · · + p(1 − p) = np(1 − p).
Next, consider the negative binomial distribution. Let Y be the number of trials needed to obtain the r-th success in a sequence of independent Bernoulli trials, each with success probability p. Let X_i
be the number of trials needed after the (i − 1)-th success to obtain the i-th success. It is easy to
see that each Xi is a geometric random variable and Y = X1 + · · · + Xr . Hence,
E(Y) = E(X_1) + · · · + E(X_r) = 1/p + · · · + 1/p = r/p,
and
Var(Y ) = Var(X1 ) + · · · + Var(Xr ) = (1 − p)/p2 + · · · + (1 − p)/p2 = r(1 − p)/p2 .
As a consequence of the stated result we can easily see the following. Suppose X1 and X2 are
independent N (µ, σ 2 ) random variables. Then 2X1 ∼ N (2µ, 4σ 2 ), X1 + X2 ∼ N (2µ, 2σ 2 ), and
X1 − X2 ∼ N (0, 2σ 2 ). Note that 2X1 and X1 + X2 have different distributions.
X_1 + · · · + X_n ∼ N(nµ, nσ^2),
and consequently,
X̄ = (1/n)(X_1 + · · · + X_n) ∼ N(µ, σ^2/n).
This also implies that X̄ = (1/n)Y follows the normal distribution approximately, as the sample
size n → ∞. In particular, if µ_i = µ and σ_i^2 = σ^2, i.e. all means are equal and all variances are
equal, then the CLT states that, as n → ∞,
X̄ ∼ N(µ, σ^2/n)   approximately.
Equivalently,
√n (X̄ − µ)/σ ∼ N(0, 1)
as n → ∞. The notion of convergence is explained by the convergence of the distribution of X̄ to that
of the normal distribution with the appropriate mean and variance. It means that the cdf of the
left-hand side, √n (X̄ − µ)/σ, converges to the cdf of the standard normal random variable, Φ(·). In
other words,
lim_{n→∞} P( √n (X̄ − µ)/σ ≤ z ) = Φ(z),   −∞ < z < ∞.
So for “large samples”, we can use N (0, 1) as an approximation to the sampling distribution of
√
n(X̄ − µ)/σ. This result is ‘exact’, i.e. no approximation is required, if the distribution of the
Xi ’s are normal in the first place – this was discussed in the previous lecture.
How large does n have to be before this approximation becomes usable? There is no definitive
answer to this, as it depends on how “close to normal” the distribution of X is. However, it is
often a pretty good approximation for sample sizes as small as 20, or even smaller. It also depends
on the skewness of the distribution of X; if the X-variables are highly skewed, then n will usually
need to be larger than for corresponding symmetric X-variables for the approximation to be good.
We will investigate this numerically using R.
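One possible sketch of such a numerical investigation is given below; the choice of Exponential(1) samples (a skewed distribution with µ = σ = 1), the sample sizes and the number of replications are ours, purely for illustration:
set.seed(1)
for (n in c(1, 5, 20, 100)) {
  xbar <- replicate(10000, mean(rexp(n, rate = 1)))   # sample means of Exponential(1) data
  z <- sqrt(n) * (xbar - 1) / 1                       # normalised means
  hist(z, breaks = 50, freq = FALSE, main = paste("n =", n))
  curve(dnorm(x), add = TRUE)                         # standard normal pdf for comparison
}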
Figure 3.3: Distribution of normalised sample means for samples of different sizes. Initially very
skew (original distribution, n = 1) becoming rapidly closer to standard normal (dashed line) with
increasing n.
We know that a binomial random variable Y with parameters n and p is the number of successes
in a set of n independent Bernoulli trials, each with success probability p. We have also learnt that
Y = X1 + X2 + · · · + Xn ,
where X1 , . . . , Xn are independent Bernoulli random variables with success probability p. It fol-
lows from the CLT that, for a sufficiently large n, Y is approximately normally distributed with
expectation E(Y ) = np and variance Var(Y ) = np(1 − p).
Hence, for given integers y_1 and y_2 between 0 and n and a suitably large n, we have
P(y_1 ≤ Y ≤ y_2) = P( (y_1 − np)/√(np(1−p)) ≤ (Y − np)/√(np(1−p)) ≤ (y_2 − np)/√(np(1−p)) )
≈ P( (y_1 − np)/√(np(1−p)) ≤ Z ≤ (y_2 − np)/√(np(1−p)) ),
where Z ∼ N(0, 1).
We should take account of the fact that the binomial random variable Y is integer-valued, and
so P(y_1 ≤ Y ≤ y_2) = P(y_1 − f_1 ≤ Y ≤ y_2 + f_2) for any two fractions 0 < f_1, f_2 < 1. This is called
the continuity correction; the usual choice is f_1 = f_2 = 0.5, as in the example below.
Figure 3.4: Histograms of normalised sample means for Bernoulli (p = 0.8) samples of different
sizes. – converging to standard normal.
♥ Example 56 A producer of natural yoghurt believed that the market share of their brand
was 10%. To investigate this, a survey of 2500 yoghurt consumers was carried out. It was observed
that only 205 of the people surveyed expressed a preference for their brand. Should the producer
be concerned that they might be losing market share?
Assume that the conjecture about market share is true. Then the number of people Y who
prefer this product follows a binomial distribution with p = 0.1 and n = 2500. So the mean is
np = 250, the variance is np(1 − p) = 225, and the standard deviation is 15. The exact probability
of observing (Y ≤ 205) is given by the sum of the binomial probabilities up to and including 205,
which is difficult to compute by hand. However, this can be approximated by using the CLT:
P(Y ≤ 205) = P(Y ≤ 205.5)   [continuity correction]
= P( (Y − np)/√(np(1−p)) ≤ (205.5 − np)/√(np(1−p)) )
≈ P( Z ≤ (205.5 − np)/√(np(1−p)) )
= P( Z ≤ (205.5 − 250)/15 )
= Φ(−2.967) = 0.0015.
This probability is so small that it casts doubt on the validity of the assumption that the market
share is 10%.
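For reference, R computes the exact binomial probability directly, and it can be compared with the normal approximation used above (same n = 2500 and p = 0.1 as in the example):
pbinom(205, size = 2500, prob = 0.1)     # exact P(Y <= 205)
pnorm(205.5, mean = 250, sd = 15)        # CLT approximation with continuity correction, about 0.0015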
Statistical Inference
Chapter mission
In the last chapter we learned the probability distributions of common random variables that we use
in practice. We learned how to calculate the probabilities based on our assumption of a probability
distribution with known parameter values. Statistical inference is the process by which we try to
learn about those probability distributions using only random observations. Hence, if our aim is
to learn about some typical characteristics of the population of Southampton students, we simply
randomly select a few students, observe their characteristics and then try to generalise, as discussed
in Lecture 1. For example, suppose we are interested in learning what proportion of Southampton
students are of Indian origin. We may then select a number of students at random and observe
the sample proportion of Indian origin students. We will then claim that the sample proportion is
really our guess for the population proportion. But obviously we may be making grave errors since
we are inferring about some unknown based on only a tiny fraction of total information. Statistical
inference methods formalise these aspects. We will learn some of these methods here.
• The form of the assumed model helps us to understand the real-world process by which the
data were generated.
• If the model explains the observed data well, then it should also inform us about future
(or unobserved) data, and hence help us to make predictions (and decisions contingent on
unobserved data).
• The use of statistical models, together with a carefully constructed methodology for their
analysis, also allows us to quantify the uncertainty associated with any conclusions, predic-
tions or decisions we make.
Assumption 1 depends on the sampling mechanism and is very common in practice. If we are
to make this assumption for the Southampton student sampling experiment, we need to select
randomly among all possible students. We should not get the sample from an event in the Indian
or Chinese Student Association as that will give us a biased result. The assumption will be vio-
lated when samples are correlated either in time or in space, e.g. the daily air pollution level in
Southampton for the last year or the air pollution levels in two nearby locations in Southampton.
In this module we will only consider data sets where Assumption 1 is valid. Assumption 2 is not
always appropriate, but is often reasonable when we are modelling a single variable. In the fast
food waiting time example, if we assume that there are no differences between the AM and PM
waiting times, then we can say that X1 , . . . , X20 are independent and identically distributed (or
i.i.d. for short).
If the model is fully specified, i.e. it has no unknown parameters, then we do not need the data
to make any inference about any unknown quantities, although we may use the data to judge the
plausibility of the model.
However, a fully specified model would be appropriate when for example, there is some external
(to the data) theory as to why the model (in particular the values of µ and σ 2 ) was appropriate.
Fully specified models such as this are uncommon as we rarely have external theory which allows
us to specify a model so precisely.
When a parametric statistical model is assumed with some unknown parameters, statistical
inference methods use data to estimate the unknown parameters, e.g. λ, µ, σ 2 . Estimation will be
discussed in more detail in the following lectures.
2. Parametric. Suppose we assume that X follows the Poisson distribution with parameter λ.
which involves the unknown parameter λ. For the Poisson distribution we know that E(X) =
λ. Hence we could use the sample mean X̄ to estimate E(X) = λ. Thus our estimate
λ̂ = x̄ = 3.75. This type of estimator is called a moment estimator. Now our answer is
52 × (1 − e^{−3.75}), computed in R as 52 * (1 - exp(-3.75)) = 50.78 ≈ 51, which is very different
from our answer of 46 from the nonparametric approach.
The nonparametric approach should be preferred if the model cannot be justified for the data,
as in this case the parametric approach will provide incorrect answers.
• The observations x = (x1 , . . . , xn ) are called the sample, and quantities derived from the
sample are sample quantities. For example, as in Chapter 1, we call
x̄ = (1/n) Σ_{i=1}^{n} x_i
the sample mean.
• The probability distribution for X specified in our model represents all possible observations
which might have been observed in our sample, and is therefore sometimes referred to as the
population. Quantities derived from this distribution are population quantities.
For example, if our model is that X1 , . . . , Xn are i.i.d., following the common distribution of
a random variable X, then we call E(X) the population mean.
The probability distribution of any estimator θ̃(X) is called its sampling distribution. The
estimate θ̃(x) is an observed value (a number), and is a single observation from the sampling dis-
tribution of θ̃(X).
♥ Example 58 Suppose that we have a random sample X1 , . . . , Xn from the uniform distribu-
tion on the interval [0, θ] where θ > 0 is unknown. Suppose that n = 5 and we have the sample
observations x1 = 2.3, x2 = 3.6, x3 = 20.2, x4 = 0.9, x5 = 17.2. Our objective is to estimate θ. How
can we proceed?
Here the pdf is f(x) = 1/θ for 0 ≤ x ≤ θ and 0 otherwise. Hence E(X) = ∫_0^θ (x/θ) dx = θ/2. There are
many possible estimators for θ, e.g. θ̂1 (X) = 2 X̄, which is motivated by the method of moments
because θ = 2E(X). A second estimator is θ̂2 (X) = max{X1 , X2 , . . . , Xn }, which is intuitive since
θ must be greater than or equal to all observed values and thus the maximum of the sample value
will be closest to θ. This is also the maximum likelihood estimate of θ, which you will learn in
MATH3044.
How could we choose between the two estimators θ̂1 and θ̂2 ? This is where we need to learn the
sampling distribution of an estimator to determine which estimator will be unbiased, i.e. correct on
average, and which will have minimum variability. We will formally define these in a minute, but
first let us derive the sampling distribution, i.e. the pdf, of θ̂2 . Note that θ̂2 is a random variable
since the sample X1 , . . . , Xn is random. We will first find its cdf and then differentiate the cdf to
get the pdf. For ease of notation, suppose Y = θ̂2 (X) = max{X1 , X2 , . . . , Xn }. For any 0 < y < θ,
the cdf of Y, F(y), is given by:
P(Y ≤ y) = P(max{X_1, X_2, . . . , X_n} ≤ y)
= P(X_1 ≤ y, X_2 ≤ y, . . . , X_n ≤ y)   [max ≤ y if and only if each ≤ y]
= P(X_1 ≤ y) P(X_2 ≤ y) · · · P(X_n ≤ y)   [since the X's are independent]
= (y/θ)(y/θ) · · · (y/θ) = (y/θ)^n.
Now the pdf of Y is f(y) = dF(y)/dy = n y^{n−1}/θ^n for 0 ≤ y ≤ θ. We can plot this as a function of y to
see the pdf. Now E(θ̂_2) = E(Y) = nθ/(n+1) and Var(θ̂_2) = nθ^2/((n+2)(n+1)^2). You can prove this by easy
integration.
The bias of an estimator θ̃ of θ is defined as bias(θ̃) = E(θ̃) − θ.
So an estimator is unbiased if the expectation of its sampling distribution is equal to the quantity
we are trying to estimate. Unbiased means “getting it right on average”, i.e. under repeated sam-
pling (relative frequency interpretation of probability).
Thus for the uniform distribution example, θ̂_2 is a biased estimator of θ and
bias(θ̂_2) = E(θ̂_2) − θ = nθ/(n+1) − θ = −θ/(n+1),
which goes to zero as n → ∞. However, θ̂_1 = 2X̄ is unbiased since E(θ̂_1) = 2E(X̄) = 2(θ/2) = θ.
Unbiased estimators are “correct on average”, but that does not mean that they are guaranteed
to provide estimates which are close to the estimand θ. A better measure of the quality of an
estimator than bias is the mean squared error (or m.s.e.), defined as
m.s.e.(θ̃) = E[(θ̃ − θ)^2].
Therefore, if θ̃ is unbiased for θ, i.e. if E(θ̃) = θ, then m.s.e.(θ̃) = Var(θ̃). In general, we have the
following result:
m.s.e.(θ̃) = Var(θ̃) + bias(θ̃)^2.
The proof is similar to the one we did in Lecture 3.
m.s.e.(θ̃) = E[(θ̃ − θ)^2]
= E[ { (θ̃ − E(θ̃)) + (E(θ̃) − θ) }^2 ]
= E[ (θ̃ − E(θ̃))^2 ] + (E(θ̃) − θ)^2 + 2 (E(θ̃) − θ) E[ θ̃ − E(θ̃) ]
= Var(θ̃) + bias(θ̃)^2 + 2 (E(θ̃) − θ) × 0
= Var(θ̃) + bias(θ̃)^2,
since E[ θ̃ − E(θ̃) ] = 0.
Hence, the mean squared error incorporates both the bias and the variability (sampling variance)
of θ̃. We are then faced with the bias-variance trade-off when selecting an optimal estimator. We
may allow the estimator to have a little bit of bias if we can ensure that the variance of the biased
estimator will be much smaller than that of any unbiased estimator.
♥ Example 59 Uniform distribution Continuing with the uniform distribution U[0, θ] example,
we have seen that θ̂_1 = 2X̄ is unbiased for θ but bias(θ̂_2) = −θ/(n+1). How do these estimators
compare with respect to the m.s.e? Since θ̂1 is unbiased, its m.s.e is its variance. In the next
lecture, we will prove that for random sampling from any population
Var(X̄) = Var(X)/n,
where Var(X) is the variance of the population sampled from. Returning to our example, we know
that if X ∼ U[0, θ] then Var(X) = θ^2/12. Therefore we have:
m.s.e.(θ̂_1) = Var(θ̂_1) = Var(2X̄) = 4 Var(X̄) = 4 θ^2/(12n) = θ^2/(3n).
For θ̂_2 = max{X_1, . . . , X_n} we found above that:
1. Var(θ̂_2) = nθ^2/((n+2)(n+1)^2);
2. bias(θ̂_2) = −θ/(n+1).
Now
m.s.e.(θ̂_2) = Var(θ̂_2) + bias(θ̂_2)^2
= nθ^2/((n+2)(n+1)^2) + θ^2/(n+1)^2
= [θ^2/(n+1)^2] [ n/(n+2) + 1 ]
= [θ^2/(n+1)^2] (2n+2)/(n+2).
Clearly, the m.s.e of θ̂2 is an order of magnitude (of order n2 rather than n) smaller than the m.s.e
of θ̂1 , providing justification for the preference of θ̂2 = max{X1 , X2 , . . . , Xn } as an estimator of θ.
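These m.s.e. formulae can be checked by simulation; the sketch below uses illustrative values θ = 10, n = 20 and 10,000 replications, which are our own choices:
set.seed(1)
theta <- 10; n <- 20                                  # illustrative values only
est <- replicate(10000, {
  x <- runif(n, min = 0, max = theta)
  c(2 * mean(x), max(x))                              # theta-hat-1 and theta-hat-2
})
rowMeans((est - theta)^2)                             # estimated m.s.e. of the two estimators
c(theta^2 / (3 * n), theta^2 * (2 * n + 2) / ((n + 1)^2 * (n + 2)))   # theoretical values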
For a random sample X_1, . . . , X_n from a population with mean µ and variance σ^2 we have the following results:
R1  E(X̄) = µ,
R2  Var(X̄) = σ^2/n,
R3  E(S^2) = σ^2, where S^2 is the sample variance with divisor n − 1.
We prove R1 as follows.
E(X̄) = (1/n) Σ_{i=1}^{n} E(X_i) = (1/n) Σ_{i=1}^{n} E(X) = E(X),
We prove R2 using the result that for independent random variables the variance of the sum is
the sum of the variances from Lecture 19. Thus,
Var(X̄) = (1/n^2) Σ_{i=1}^{n} Var(X_i) = (1/n^2) Σ_{i=1}^{n} Var(X) = (1/n) Var(X) = σ^2/n,
so the m.s.e. of X̄ is Var(X)/n. This proves the following assertion we made earlier:
Variance of the sample mean = Population Variance divided by the sample size.
We now want to prove R3, i.e. show that the sample variance with divisor n − 1 is an unbiased
estimator of the population variance σ 2 , i.e. E(S 2 ) = σ 2 . We have
S^2 = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)^2 = (1/(n−1)) [ Σ_{i=1}^{n} X_i^2 − nX̄^2 ].
To evaluate the expectation of the above, we need E(X_i^2) and E(X̄^2). In general, we know for any
random variable that E(X^2) = Var(X) + (E(X))^2. Thus, we have
E(X_i^2) = Var(X_i) + (E(X_i))^2 = σ^2 + µ^2,
and
E(X̄^2) = Var(X̄) + (E(X̄))^2 = σ^2/n + µ^2,
so that
E(S^2) = (1/(n−1)) [ Σ_{i=1}^{n} E(X_i^2) − nE(X̄^2) ]
= (1/(n−1)) [ Σ_{i=1}^{n} (σ^2 + µ^2) − n(σ^2/n + µ^2) ]
= (1/(n−1)) (nσ^2 + nµ^2 − σ^2 − nµ^2)
= σ^2 ≡ Var(X).
For an unbiased estimator θ̃ we have m.s.e.(θ̃) = Var(θ̃),
and therefore the sampling variance of the estimator is an important summary of its quality.
We usually prefer to focus on the standard deviation of the sampling distribution of θ̃,
s.d.(θ̃) = √Var(θ̃).
In practice we will not know s.d.(θ̃), as it will typically depend on unknown features of the
distribution of X1 , . . . , Xn . However, we may be able to estimate s.d.(θ̃) using the observed sample
x1 , . . . , xn . We define the standard error, s.e.(θ̃), of an estimator θ̃ to be an estimate of the standard
deviation of its sampling distribution, s.d.(θ̃).
We proved that
Var(X̄) = σ^2/n  ⇒  s.d.(X̄) = σ/√n.
As σ is unknown, we cannot calculate this standard deviation. However, we know that E(S 2 ) = σ 2 ,
i.e. that the sample variance is an unbiased estimator of the population variance. Hence S 2 /n is
an unbiased estimator for Var(X̄). Therefore we obtain the standard error of the mean, s.e.(X̄),
by plugging in the estimate
s = ( (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)^2 )^{1/2}
of σ, giving s.e.(X̄) = s/√n.
Therefore, for the computer failure data, our estimate, x̄ = 3.75, for the population mean is
associated with a standard error
s.e.(X̄) = 3.381/√104 = 0.332.
Note that this is ‘a’ standard error, so other standard errors may be available. Indeed, for parametric
inference, where we make assumptions about f(x), alternative standard errors are available. For
example, suppose X_1, . . . , X_n are i.i.d. Poisson(λ) random variables. Then E(X) = λ, so X̄ is an unbiased
estimator of λ. Also Var(X) = λ, so another standard error is s.e.(X̄) = √(λ̂/n) = √(x̄/n). In the computer failure data
example, this is √(3.75/104) = 0.19.
4.4.2 Basics
An estimate θ̃ of a parameter θ is sometimes referred to as a point estimate. The usefulness of a
point estimate is enhanced if some kind of measure of its precision can also be provided. Usually,
for an unbiased estimator, this will be a standard error, an estimate of the standard deviation of the
associated estimator, as we have discussed previously. An alternative summary of the information
provided by the observed data about the location of a parameter θ and the associated precision is
an interval estimate or confidence interval.
a random variable T (X, θ) whose distribution does not depend on θ and is therefore known. This
random variable T(X, θ) is called a pivot for θ. Hence we can find numbers h_1 and h_2 such that
P[h_1 ≤ T(X, θ) ≤ h_2] = 1 − α,   (1)
where 1 − α is any specified probability. If (1) can be ‘inverted’ (or manipulated), we can write it
as
P [g1 (X) ≤ θ ≤ g2 (X)] = 1 − α. (2)
Hence with probability 1 − α, the parameter θ will lie between the random variables g1 (X) and
g2 (X). Alternatively, the random interval [g1 (X), g2 (X)] includes θ with probability 1 − α. Now,
when we observe x1 , . . . , xn , we observe a single observation of the random interval [g1 (X), g2 (X)],
which can be evaluated as [g1 (x), g2 (x)]. We do not know if θ lies inside or outside this interval,
but we do know that if we observed repeated samples, then 100(1 − α)% of the resulting intervals
would contain θ. Hence, if 1 − α is high, we can be reasonably confident that our observed interval
contains θ. We call the observed interval [g1 (x), g2 (x)] a 100(1 − α)% confidence interval for θ.
It is common to present intervals with high confidence levels, usually 90%, 95% or 99%, so that
α = 0.1, 0.05 or 0.01 respectively.
It is common practice to make the interval symmetric, so that the two unshaded areas are equal
(to α/2), in which case
−h_1 = h_2 ≡ h   and   Φ(h) = 1 − α/2.
The most common choice of confidence level is 1 − α = 0.95, in which case h = 1.96 =
qnorm(0.975). You may also occasionally see 90% (h = 1.645 = qnorm(0.95)) or 99% (h =
2.58 = qnorm(0.995)) intervals. We discussed these values in Lecture 17. We generally use the 95%
intervals for a reasonably high level of confidence without making the interval unnecessarily wide.
Therefore we have
P( −1.96 ≤ √n (X̄ − µ)/σ ≤ 1.96 ) = 0.95
⇒ P( X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n ) = 0.95.
Hence, X̄ − 1.96 σ/√n and X̄ + 1.96 σ/√n are the endpoints of a random interval which includes µ with
probability 0.95. The observed value of this interval, x̄ ± 1.96 σ/√n, is called a 95% confidence
interval for µ.
♥ Example 60 For the fast food waiting time data, we have n = 20 data points combined from
the morning and afternoon data sets. We have x̄ = 67.85 and n = 20. Hence, under the normal
model assuming (just for the sake of illustration) σ = 18, a 95% confidence interval for µ is
67.85 − 1.96 (18/√20) ≤ µ ≤ 67.85 + 1.96 (18/√20)
⇒ 59.96 ≤ µ ≤ 75.74.
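In R, this interval can be reproduced in one line (σ = 18 is the value assumed in the example):
xbar <- 67.85; n <- 20; sigma <- 18
xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # gives (59.96, 75.74)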
2. Confidence intervals are frequently used, but also frequently misinterpreted. A 100(1 − α)%
confidence interval for θ is a single observation of a random interval which, under repeated
sampling, would include θ 100(1 − α)% of the time.
The following example from the National Lottery in the UK clarifies the interpretation. We
collected 6 chosen lottery numbers (sampled at random from 1 to 49) for 20 weeks and then
constructed 95% confidence intervals for the population mean µ = 25 and plotted the intervals
along with the observed sample means in the following figure. It can be seen that exactly
one out of 20 (5%) of the intervals does not contain the true population mean 25. Although
this is a coincidence, it explains the main point that if we construct the random intervals
with 100(1 − α)% confidence levels again and again for hypothetical repetition of the data,
on average 100(1 − α)% of them will contain the true parameter.
3. A confidence interval is not a probability interval. You should avoid making statements like
P (1.3 < θ < 2.2) = 0.95. In the classical approach to statistics you can only make probability
statements about random variables, and θ is assumed to be a constant.
In this lecture we have learned to obtain confidence intervals by using an appropriate statistic in the
pivoting technique. The main task is then to invert the inequality so that the unknown parameter
is in the middle by itself and the two end points are functions of the sample observations. The most
difficult task is to correctly interpret confidence intervals, which are not probability intervals but
have long-run properties. That is, the interval will contain the true parameter with the stipulated
confidence level only under infinitely repeated sampling.
√n (X̄ − µ)/σ ∼ N(0, 1) approximately, as n → ∞.
So a general confidence interval for µ can be constructed, just as before in Section 4.4.3. Thus a
95% confidence interval (CI) for µ is given by x̄ ± 1.96 σ/√n. But note that σ is unknown, so this
CI cannot be used unless we can estimate σ, i.e. replace the unknown s.d. of X̄ by its estimated
standard error. In this case, we get the CI in the familiar form: Estimate ± 1.96 × Standard error.
Suppose that we do not assume any distribution for the sampled random variable X but assume
only that X1 , . . . , Xn are i.i.d, following the distribution of X where E(X) = µ and Var(X) = σ 2 .
√
We know that the standard error of X̄ is s/ n where s is the sample standard deviation with
divisor n − 1. Then the following provides a 95% CI for µ:
x̄ ± 1.96 s/√n.
♥ Example 61 For the computer failure data, x̄ = 3.75, s = 3.381 and n = 104. Under the
model that the data are observations of i.i.d. random variables with population mean µ (but no
other assumptions about the underlying distribution), we compute a 95% confidence interval for µ
to be
( 3.75 − 1.96 × 3.381/√104,  3.75 + 1.96 × 3.381/√104 ) = (3.10, 4.40).
If we can assume a distribution for X, i.e. a parametric model for X, then we can do slightly
better in estimating the standard error of X̄ and as a result we can improve upon the previously
obtained 95% CI. Two examples follow.
♥ Example 62 Poisson If X_1, . . . , X_n are modelled as i.i.d. Poisson(λ) random variables, then
µ = λ and σ^2 = λ. We know Var(X̄) = σ^2/n = λ/n. Hence a standard error is √(λ̂/n) = √(x̄/n).
For the computer failure data, x̄ = 3.75, s = 3.381 and n = 104. Under the model that the data
are observations of i.i.d. random variables following a Poisson distribution with population mean
λ, we compute a 95% confidence interval for λ as
x̄ ± 1.96 √(x̄/n) = 3.75 ± 1.96 √(3.75/104) = (3.38, 4.12).
We see that this interval is narrower (0.74 = 4.12 − 3.38) than the earlier interval (3.10,4.40),
which has a length of 1.3. We prefer narrower confidence intervals as they facilitate more accurate
inference regarding the unknown parameter.
Now consider estimating a Bernoulli success probability p from a small sample, say n = 10 trials with sample proportion x̄ = 0.2. Plugging the estimate into the standard error gives the interval x̄ ± 1.96 √(x̄(1 − x̄)/n) = (−0.048, 0.448), which includes impossible negative values for p. This is wrong as n is too small for the large-sample approximation to be accurate. Hence we need
to look for other alternatives which may work better.
P( −1.96 ≤ √n (X̄ − p)/√(p(1−p)) ≤ 1.96 ) = 0.95
⇔ P( −1.96 √(p(1−p)) ≤ √n (X̄ − p) ≤ 1.96 √(p(1−p)) ) = 0.95
⇔ P( −1.96 √(p(1−p)/n) ≤ X̄ − p ≤ 1.96 √(p(1−p)/n) ) = 0.95
⇔ P( p − 1.96 √(p(1−p)/n) ≤ X̄ ≤ p + 1.96 √(p(1−p)/n) ) = 0.95
⇔ P( L(p) ≤ X̄ ≤ R(p) ) = 0.95,
where L(p) = p − h √(p(1−p)/n), R(p) = p + h √(p(1−p)/n), and h = 1.96. Now, consider the inverse
mappings L−1 (x) and R−1 (x) so that:
P( L(p) ≤ X̄ ≤ R(p) ) = 0.95  ⇔  P( R^{−1}(X̄) ≤ p ≤ L^{−1}(X̄) ) = 0.95,
which now defines our confidence interval (R−1 (X̄), L−1 (X̄)) for p. We can obtain R−1 (x̄) and
L−1 (x̄) by solving the equations R(p) = x̄ and L(p) = x̄ for p, treating n and x̄ as known quantities.
Thus we have
R(p) = x̄,  L(p) = x̄
⇔ (x̄ − p)^2 = h^2 p(1 − p)/n,   where h = 1.96
⇔ p^2 (1 + h^2/n) − p(2x̄ + h^2/n) + x̄^2 = 0.
The endpoints of the confidence interval are the roots of the quadratic. Hence, the endpoints
of the 95% confidence interval for p are:
[ 2x̄ + h^2/n ± { (2x̄ + h^2/n)^2 − 4x̄^2 (1 + h^2/n) }^{1/2} ] / [ 2(1 + h^2/n) ]
= [ x̄ + h^2/(2n) ± { (x̄ + h^2/(2n))^2 − x̄^2 (1 + h^2/n) }^{1/2} ] / (1 + h^2/n)
= [ x̄ + h^2/(2n) ± (h/√n) { h^2/(4n) + x̄(1 − x̄) }^{1/2} ] / (1 + h^2/n).
This is sometimes called the Wilson Score Interval. The following R code calculates this for
given n, x̄ and confidence level α which determines the value of h. Returning to the previous
example, n = 10 and x̄ = 0.2, the 95% CI obtained from this method is (0.057, 0.510) compared
to the previous illegitimate one (−0.048, 0.448). In fact you can see that the intervals obtained by
quadratic inversion are more symmetric and narrower as n increases, and are also more symmetric
for x̄ closer to 0.5. See the table below:
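A minimal sketch of such a calculation is given below; the function name wilson.ci and its arguments are our own choices:
wilson.ci <- function(xbar, n, alpha = 0.05) {
  h <- qnorm(1 - alpha / 2)                                        # critical value for the confidence level
  centre <- xbar + h^2 / (2 * n)
  halfwidth <- (h / sqrt(n)) * sqrt(h^2 / (4 * n) + xbar * (1 - xbar))
  (centre + c(-1, 1) * halfwidth) / (1 + h^2 / n)
}
wilson.ci(xbar = 0.2, n = 10)            # approximately (0.057, 0.510)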
For smaller n and x̄ closer to 0 (or 1), the approximation required for the plug-in estimate of the
standard error is insufficiently reliable. However, for larger n it is adequate.
Now the confidence interval for λ is found by solving the (quadratic) equality for λ, treating n, x̄
and h as known:
n (x̄ − λ)^2/λ = h^2,   where h = 1.96
⇒ x̄^2 − 2λx̄ + λ^2 = h^2 λ/n
⇒ λ^2 − λ(2x̄ + h^2/n) + x̄^2 = 0.
The endpoints of the confidence interval are the roots of this quadratic:
[ 2x̄ + h^2/n ± { (2x̄ + h^2/n)^2 − 4x̄^2 }^{1/2} ] / 2
= x̄ + h^2/(2n) ± (h/n^{1/2}) { h^2/(4n) + x̄ }^{1/2}.
♥ Example 64 For the computer failure data, x̄ = 3.75 and n = 104. For a 95% confidence
interval (CI), h = 1.96. Hence, we calculate the above CI using the R commands:
x <- scan("compfail.txt")
n <- length(x)
h <- qnorm(0.975)
mean(x) + (h*h)/(2*n) + c(-1, 1) * h/sqrt(n) * sqrt(h*h/(4*n) + mean(x))
The result is (3.40, 4.14), which compares well with the earlier interval (3.38, 4.12).
4.6 Lecture 26: Exact confidence interval for the normal mean
4.6.1 Lecture mission
Recall that we can obtain better quality inferences if we can justify a precise model for the data.
This saying is analogous to the claim that a person can better predict and infer in a situation
when there are established rules and regulations, i.e. the analogue of a statistical model. In this
lecture, we will discuss a procedure for finding confidence intervals based on the statistical modelling
assumption that the data are from a normal distribution. This assumption will enable us to find
an exact confidence interval for the mean rather than an approximate one using the central limit
theorem.
P(−h ≤ T ≤ h) = 1 − α,
i.e.  P( −h ≤ √n (X̄ − µ)/S ≤ h ) = 0.95
⇒ P( X̄ − h S/√n ≤ µ ≤ X̄ + h S/√n ) = 0.95.
The observed value of this interval, x̄ ± h s/√n, is the 95% confidence interval for µ. Remarkably,
this is also of the general form, Estimate ± Critical value × Standard error, where the critical value
is h and the standard error of the sample mean is s/√n. Now, how do we find the critical value h for
a given confidence level 1 − α? We use the following result.
Let X_1, . . . , X_n be i.i.d. N(µ, σ^2) random variables. Define X̄ = (1/n) Σ_{i=1}^{n} X_i and
S^2 = (1/(n−1)) ( Σ_{i=1}^{n} X_i^2 − nX̄^2 ).
Then
√n (X̄ − µ)/S ∼ t_{n−1},
where tn−1 denotes the standard t distribution with n − 1 degrees of freedom. The standard t dis-
tribution is a family of distributions which depend on one parameter called the degrees-of-freedom
(df) which is n − 1 here. The concept of degrees of freedom is that it is usually the number of
independent random samples, n here, minus the number of linear parameters estimated, 1 here for
µ. Hence the df is n − 1.
The probability density function of the tk distribution is similar to a standard normal, in that
it is symmetric around zero and ‘bell-shaped’, but the t-distribution is more heavy-tailed, giving
greater probability to observations further away from zero. The figure below illustrates the tk
density function for k = 1, 2, 5, 20 together with the standard normal pdf (solid line).
The values of h for a given 1 − α have been tabulated using the standard t-distribution and can
be obtained using the R command qt (abbreviation for quantile of t). For example, if we want to
find h for 1 − α = 0.95 and n = 20 then we issue the command: qt(0.975, df=19) = 2.093. Note
that it should be 0.975 so that we are splitting 0.05 probability between the two tails equally and
the df should be n − 1 = 19. Indeed, using the above command repeatedly, we obtain the following
critical values for the 95% interval for different values of the sample size n.
n 2 5 10 15 20 30 50 100 ∞
h 12.71 2.78 2.26 2.14 2.09 2.05 2.01 1.98 1.96
Note that the critical value approaches 1.96 (which is the critical value for the normal distribution)
as n → ∞, since the t-distribution itself approaches the normal distribution for large values of its
df parameter.
If you can justify that the underlying distribution is normal then you
can use the t-distribution-based confidence interval.
♥ Example 65 Fast food waiting time revisited We would like to find a confidence interval for
the true mean waiting time. If X denotes the waiting time in seconds, we have n = 20, x̄ = 67.85,
s = 18.36. Hence, recalling that the critical value h = 2.093, from the command qt(0.975,
df=19), a 95% confidence interval for µ is
$$67.85 - 2.093 \times 18.36/\sqrt{20} \;\le\; \mu \;\le\; 67.85 + 2.093 \times 18.36/\sqrt{20}
\;\Rightarrow\; 59.26 \le \mu \le 76.44.$$
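The same interval can be computed in R. A minimal sketch, assuming the 20 waiting times are stored in a vector called wait (this vector name is an assumption, not part of the original data files):
n <- length(wait)                                           # 20
mean(wait) + c(-1, 1) * qt(0.975, df = n - 1) * sd(wait) / sqrt(n)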
♥ Example 66 Weight gain revisited We would like to find a confidence interval for the true
average weight gain (final weight – initial weight). Here n = 68, x̄ = 0.8672 and s = 0.9653. Hence,
a 95% confidence interval for µ is
$$0.8672 - 1.996 \times 0.9653/\sqrt{68} \;\le\; \mu \;\le\; 0.8672 + 1.996 \times 0.9653/\sqrt{68}
\;\Rightarrow\; 0.6335 \le \mu \le 1.1008.$$
[In R, we obtain the critical value 1.996 by qt(0.975, df=67) or -qt(0.025, df=67)]
In R the command is: mean(x) + c(-1, 1) * qt(0.975, df=67) * sqrt(var(x)/68) if the
vector x contains the 68 weight gain differences. You may obtain this by issuing the commands:
wgain <- read.table("wtgain.txt", head=T)
x <- wgain$final - wgain$initial
Note that the interval here does not include the value 0, so it is very likely that the weight gain is significantly positive, which we will justify using what is called hypothesis testing.
Recall that the t-distribution converges to the normal distribution when its only parameter, called the degrees of freedom, becomes very
large. If the assumption of the normal distribution for the data can be justified, then the method
of inference based on the t-distribution is best when the variance parameter, sometimes called the
nuisance parameter, is unknown.
4.7.2 Introduction
In statistical inference, we use observations x1 , . . . , xn of univariate random variables X1 , . . . , Xn in
order to draw inferences about the probability distribution f (x) of the underlying random variable
X. So far, we have mainly been concerned with estimating features (usually unknown parameters)
of f (x). It is often of interest to compare alternative specifications for f (x). If we have a set of
competing probability models which might have generated the observed data, we may want to de-
termine which of the models is most appropriate. A proposed (hypothesised) model for X1 , . . . , Xn
is then referred to as a hypothesis, and pairs of models are compared using hypothesis tests.
For example, we may have two competing alternatives, f (0) (x) (model H0 ) and f (1) (x) (model
H1 ) for f (x), both of which completely specify the joint distribution of the sample X1 , . . . , Xn .
Completely specified statistical models are called simple hypotheses. Usually, H0 and H1 both take
the same parametric form f(x; θ), but with different values θ(0) and θ(1) of θ. Thus the joint distribution of the sample is completely specified apart from the value of the unknown parameter θ, and θ(0) ≠ θ(1) are the specified alternative values.
More generally, competing hypotheses often do not completely specify the joint distribution of
X1 , . . . , Xn . For example, a hypothesis may state that X1 , . . . , Xn is a random sample from the
probability distribution f (x; θ) where θ < 0. This is not a completely specified hypothesis, since it
is not possible to calculate probabilities such as P (X1 < 2) when the hypothesis is true, as we do
not know the exact value of θ. Such an hypothesis is called a composite hypothesis.
Examples of hypotheses:
X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ = 0, σ 2 = 2.
X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ = 0, σ 2 ∈ R+ .
X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ 6= 0, σ 2 ∈ R+ .
X1 , . . . , Xn ∼ Bernoulli(p) with p = 1/2.
X1 , . . . , Xn ∼ Bernoulli(p) with p ≠ 1/2.
X1 , . . . , Xn ∼ Bernoulli(p) with p > 1/2.
X1 , . . . , Xn ∼ Poisson(λ) with λ = 1.
X1 , . . . , Xn ∼ Poisson(θ) with θ > 1.
Hence, the fact that a hypothesis test does not reject H0 should not be taken as evidence that
H0 is true and H1 is not, or that H0 is better-supported by the data than H1 , merely that the data
does not provide significant evidence to reject H0 in favour of H1 .
A hypothesis test is defined by its critical region or rejection region, which we shall denote by
C. C is a subset of Rn and is the set of possible observed values of X which, if observed, would
lead to rejection of H0 in favour of H1 , i.e.
If x ∈ C, H0 is rejected in favour of H1; if x ∉ C, H0 is not rejected.
As X is a random variable, there remains the possibility that a hypothesis test will give an erroneous
result. We define two types of error:
                     H0 true            H0 false
Reject H0            Type I error       Correct decision
Do not reject H0     Correct decision   Type II error
♥ Example 67 Uniform Suppose that we have one observation from the uniform distribution
on the range (0, θ). In this case, f (x) = 1/θ if 0 < x < θ and P (X ≤ x) = xθ for 0 < x < θ. We
want to test H0 : θ = 1 against the alternative H1 : θ = 2. Suppose we decide arbitrarily that we
will reject H0 if X > 0.75. Then
$$\alpha = P(X > 0.75 \,|\, \theta = 1) = 1 - 0.75 = \tfrac{1}{4}, \qquad
\beta = P(X < 0.75 \,|\, \theta = 2) = 0.75/2 = \tfrac{3}{8}.$$
Here the notation | means given that.
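These two error probabilities can also be obtained from the uniform cdf in R; a minimal sketch:
alpha <- 1 - punif(0.75, min = 0, max = 1)   # P(X > 0.75 | theta = 1) = 0.25
beta  <- punif(0.75, min = 0, max = 2)       # P(X < 0.75 | theta = 2) = 0.375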
Sometimes α is called the size (or significance level) of the test and ω ≡ 1 − β is called the
power of the test. Ideally, we would like to avoid error so we would like to make both α and β as
small as possible. In other words, a good test will have small size, but large power. However, it is
not possible to make α and β both arbitrarily small. For example if C = ∅ then α = 0, but β = 1.
On the other hand, if C = S = R^n then β = 0, but α = 1.
The general hypothesis testing procedure is to fix α to be some small value (often 0.05), so that
the probability of a Type I error is limited. In doing this, we are giving H0 precedence over H1 ,
and acknowledging that Type I error is potentially more serious than Type II error. (Note that for
discrete random variables, it may be difficult to find C so that the test has exactly the required
size). Given our specified α, we try to choose a test, defined by its rejection region C, to make β
as small as possible, i.e. we try to find the most powerful test of a specified size. Where H0 and
H1 are simple hypotheses this can be achieved easily.
Note that tests are usually based on a one-dimensional test statistic T (X) whose sample space
is some subset of R. The rejection region is then a set of possible values for T (X), so we also think
of C as a subset of R. In order to be able to ensure the test has size α, the distribution of the test
statistic under H0 should be known.
We now consider testing a null hypothesis about the mean of a normal distribution, H0 : µ = µ0, using the statistic √n (X̄ − µ0)/S.
♥ Example 69 Fast food waiting time revisited Suppose the manager of the fast food outlet
claims that the average waiting time is only 60 seconds. So, we want to test H0 : µ = 60. We have
n = 20, x̄ = 67.85, s = 18.36. Hence our test statistic for the null hypothesis H0 : µ = µ0 = 60 is
$$\sqrt{n}\,\frac{\bar{x} - \mu_0}{s} = \sqrt{20}\,\frac{67.85 - 60}{18.36} = 1.91.$$
Whether the observed value of 1.91 is unusual may be judged from the graph below, which plots the density of the t-distribution with 19 degrees of freedom with a vertical line drawn at the observed value of 1.91. This value is a bit out in the tail but we are not sure, unlike in the
previous weight gain example. So how can we decide whether to reject the null hypothesis?
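The observed statistic and the plot just described can be produced in R. A minimal sketch, again assuming the 20 waiting times are in a vector wait (a hypothetical name):
tstat <- sqrt(20) * (mean(wait) - 60) / sd(wait)             # about 1.91 for these data
curve(dt(x, df = 19), from = -4, to = 4, ylab = "density")   # t density with 19 df
abline(v = tstat)                                            # vertical line at the observed value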
The value of h depends on the sample size n and can be found by issuing the qt command.
Here are a few examples obtained from qt(0.975, df=c(1, 4, 9, 14, 19, 29, 49, 99)):
n 2 5 10 15 20 30 50 100 ∞
h 12.71 2.78 2.26 2.14 2.09 2.05 2.01 1.98 1.96
Note that we need to put n − 1 in the df argument of qt and the last value for n = ∞ is obtained
from the normal distribution.
However, if the alternative hypothesis is one-sided, e.g. H1 : µ > µ0 , then the critical region
will only be in the right tail. Consequently, we need to leave an area α on the right and as a result
the critical values will be from a command such as:
qt(0.95, df=c(1, 4, 9, 14, 19, 29, 49, 99))
n 2 5 10 15 20 30 50 100 ∞
h 6.31 2.13 1.83 1.76 1.73 1.70 1.68 1.66 1.64
3. If your computed t lies in the rejection region, i.e. |t| > h, you report that H0 is rejected in
favour of H1 at the chosen level of significance. If t does not lie in the rejection region, you
report that H0 is not rejected. [Never refer to ‘accepting’ a hypothesis.]
♥ Example 70 Fast food waiting time We would like to test H0 : µ = 60 against the alternative
H1 : µ > 60, as this alternative will refute the claim of the store manager that customers only wait
for a maximum of one minute. We calculated the observed value to be 1.91. This is a one-sided test
and for a 5% level of significance, the critical value h will come from qt(0.95, df=19)=1.73. Thus
the observed value is higher than the critical value, so we reject the null hypothesis, disputing the manager's claim.
For the weight gain example (Example 66), we test H0 : µ = 0 against the two-sided alternative H1 : µ ≠ 0 using the statistic t = √n x̄/s:
• Under H0 this is an observation from a t67 distribution. For significance level α = 0.05 the
rejection region is |t| > 1.996.
• Our computed test statistic lies in the rejection region, i.e. |t| > 1.996, so H0 is rejected in
favour of H1 at the 5% level of significance.
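Both tests can be run directly with t.test. A minimal sketch, assuming wait holds the 20 waiting times (a hypothetical name) and x holds the 68 weight gain differences constructed earlier:
t.test(wait, mu = 60, alternative = "greater")   # one-sided test for the waiting time example
t.test(x, mu = 0)                                # two-sided test for the weight gain example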
4.8.5 p-values
The result of a test is most commonly summarised by rejection or non-rejection of H0 at the
stated level of significance. An alternative, which you may see in practice, is the computation of
a p-value. This is the probability that the reference distribution would have generated the actual
observed value of the statistic or something more extreme. A small p-value is evidence against
the null hypothesis, as it indicates that the observed data were unlikely to have been generated
by the reference distribution. In many examples a threshold of 0.05 is used, below which the null
hypothesis is rejected as being insufficiently well-supported by the observed data. Hence for the
t-test with a two-sided alternative, the p-value is given by:
$$p = P(|T| > |t_{obs}|),$$
where T has a t_{n−1} distribution and t_obs is the observed sample value.
However, if the alternative is one-sided and to the right then the p-value is given by:
p = P (T > tobs ),
where T has a tn−1 distribution and tobs is the observed sample value.
A small p-value corresponds to an observation of T that is improbable (since it is far out in the
low probability tail area) under H0 and hence provides evidence against H0 . The p-value should not
be misinterpreted as the probability that H0 is true. H0 is not a random event (under our models)
and so cannot be assigned a probability. The null hypothesis is rejected at significance level α if
the p-value for the test is less than α.
When the alternative hypothesis is two-sided the p-value has to be calculated from P (|T | > tobs ),
where tobs is the observed value and T follows the t-distribution with n − 1 df. For the weight gain
example, because the alternative is two-sided, the p-value is given by P(|T| > |t_obs|), where T follows the t-distribution with 67 df. This p-value turns out to be very small, indicating very strong evidence against the null hypothesis of no weight gain in the first year of university.
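In R, p-values for a t-test are obtained from the pt function; a minimal sketch using the observed statistic 1.91 of the waiting time example:
1 - pt(1.91, df = 19)                 # one-sided p-value for the waiting time example
# two-sided p-value in general: 2 * (1 - pt(abs(tobs), df = n - 1))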
and therefore
$$\bar{X} - \bar{Y} \sim N\!\left(\mu_X - \mu_Y,\; \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right).$$
Hence, under H0,
$$\bar{X} - \bar{Y} \sim N\!\left(0,\; \sigma^2\Big(\frac{1}{n} + \frac{1}{m}\Big)\right)
\;\Rightarrow\; \sqrt{\frac{nm}{n+m}}\,\frac{\bar{X} - \bar{Y}}{\sigma} \sim N(0, 1).$$
The involvement of the (unknown) σ above means that this is not a pivotal test statistic. It will be proved in MATH2011 that if σ is replaced by its estimator S, where
$$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{i=1}^{m}(Y_i - \bar{Y})^2}{n+m-2}$$
is the pooled two-sample estimator of the common variance, then
$$\sqrt{\frac{nm}{n+m}}\,\frac{\bar{X} - \bar{Y}}{S} \sim t_{n+m-2}.$$
Hence
$$t = \sqrt{\frac{nm}{n+m}}\,\frac{\bar{x} - \bar{y}}{s}$$
is a test statistic for this test. The rejection region is |t| > h, where −h is the α/2 (usually 0.025) percentile of t_{n+m−2}.
$$s^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} = 354.8, \qquad
t_{obs} = \sqrt{\frac{nm}{n+m}}\,\frac{\bar{x} - \bar{y}}{s} = 0.25.$$
This is not significant as the critical value h = qt(0.975,18)= 2.10 is larger in absolute value than
0.25. This can also be obtained by calling the R function t.test as follows:
y <- read.csv("servicetime.csv", head=T)
t.test(y$AM, y$PM)
It automatically calculates the test statistic as 0.249 and a p-value of 0.8067. It also obtains the 95% CI given by (–15.94, 20.14). (By default t.test uses the Welch approximation, which does not assume equal variances; adding var.equal=TRUE reproduces the pooled equal-variance test described above.)
$$z_i = x_i - y_i, \quad i = 1, \ldots, n,$$
and modelling these differences as observations of i.i.d. $N(\mu_Z, \sigma_Z^2)$ variables Z1, . . . , Zn. Then, a
test of the hypothesis µX = µY is achieved by testing µZ = 0, which is just a standard (one sample)
t-test, as described previously.
♥ Example 73 Paired t-test Water-quality researchers wish to measure the biomass to chloro-
phyll ratio for phytoplankton (in milligrams per litre of water). There are two possible tests, one
less expensive than the other. To see whether the two tests give the same results, ten water samples
were taken and each was measured both ways. The results are as follows:
Test 1 (x) 45.9 57.6 54.9 38.7 35.7 39.2 45.9 43.2 45.4 54.8
Test 2 (y) 48.2 64.2 56.8 47.2 43.7 45.7 53.0 52.0 45.1 57.5
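The paired analysis can be reproduced in R from the data above; a minimal sketch:
test1 <- c(45.9, 57.6, 54.9, 38.7, 35.7, 39.2, 45.9, 43.2, 45.4, 54.8)
test2 <- c(48.2, 64.2, 56.8, 47.2, 43.7, 45.7, 53.0, 52.0, 45.1, 57.5)
t.test(test1, test2, paired = TRUE)   # equivalent to t.test(test1 - test2)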
Interpretation: The values of the second test are significantly higher than the ones of the first
test, and so the second test cannot be considered as a replacement for the first.
Optimal design of the data collection experiment is necessary to make valid statistical inferences.
This lecture will discuss several methods for statistical data collection.
Suppose there are N individuals in the population and we are drawing a sample of n individuals.
In SRSWR, the same unit of the population may occur more than once in the sample; there are
N n possible samples (using multiplication rules of counting), and each of these samples has equal
chance of 1/N n to materialise. In the case of SRSWOR, at the rth drawing (r = 1, . . . , n) there
are N − r + 1 individuals in the population to sample from. All of these individuals are given equal
probability of inclusion in the sample. Here no member of the population can occur more than
once in the sample. There are $\binom{N}{n}$ possible samples and each has equal probability of inclusion $1/\binom{N}{n}$. This is also justified because at the rth stage one chooses, from the N − r + 1 individuals not yet selected in earlier drawings, one of the n − r + 1 individuals still to be included in the sample. In this case too, the probability that any specified individual, say the ith, is selected at any drawing, say the kth drawing, is
$$\frac{N-1}{N} \times \frac{N-2}{N-1} \times \cdots \times \frac{N-k+1}{N-k+2} \times \frac{1}{N-k+1} = \frac{1}{N},$$
as in the case of the SRSWR. It is obvious that if one takes n individuals all at a time from the population, giving equal probability to each of the $\binom{N}{n}$ combinations of n members out of the N members in the population, one will still have SRSWOR.
In R, a sample of 50 from a population of 200 labelled units can be drawn with sample(200, size=50) for SRSWOR and sample(200, size=50, replace=T) for SRSWR.
There are a huge number of considerations and concepts to design good surveys avoiding bias.
There may be response bias, observational bias, biases from non-response, interviewer bias, bias
due to defective sampling technique, bias due to substitution, bias due to faulty differentiation of
sampling units and so on. However, discussion of such topics is beyond the scope and syllabus of
this module.
Treatment. The different procedures under comparison in an experiment are the different treat-
ments. For example, in a chemical engineering experiment different factors such as Temperature
(T), Concentration (C) and Catalyst (K) may affect the yield value from the experiment.
Experimental unit. An experimental unit is the material to which the treatment is applied and
on which the variable under study is measured. In a human experiment in which the treatment
affects the individual, the individual will be the experimental unit.
1. Randomisation. This is necessary to draw valid conclusions and minimise bias. In an ex-
periment to compare two pain-relief tablets we should allocate the tablets randomly among
participants – not one tablet to the boys and the other to the girls.
2. Replication. A treatment is repeated a number of times in order to obtain a more reliable
estimate than a single observation. In an experiment to compare two diets for children, we
can plan the experiment so that no particular diet is favoured in the experiment, i.e. each
diet is applied approximately equally among all types of children (boys, girls, their ethnicity
etc.).
The most effective way to increase the precision of an experiment is to increase the number of replications. Remember that Var(X̄) = σ²/n, so the standard deviation of the sample mean is σ/√n and decreases as the square root of the number of replications increases. However, replication beyond a limit may be impractical due to cost and other considerations.
3. Local control. In the simplest case of local control, the experimental units are divided into
homogeneous groups or blocks. The variation among these blocks is eliminated from the error
and thereby efficiency is increased. These considerations lead to the topic of construction of
block designs, where random allocation of treatments to the experimental units may be re-
stricted in different ways in order to control experimental error. Another means of controlling
error is through the use of confounded designs where the number of treatment combinations
is very large, e.g. in factorial experiments.
Factorial experiment A thorough discussion of construction of block designs and factorial ex-
periments is beyond the scope of this module. However, these topics are studied in the third-year
module MATH3014: Design of Experiments. In the remainder of this lecture, we simply discuss an
example of a factorial experiment and how to estimate different effects.
To investigate how factors jointly influence the response, they should be investigated in an
experiment in which they are all varied. Even when there are no factors that interact, a factorial
experiment gives greater accuracy. Hence they are widely used in science, agriculture and industry.
Here we will consider factorial experiments in which each factor is used at only two levels. This is
a very common form of experiment, especially when many factors are to be investigated. We will
code the levels of each factor as 0 (low) and 1 (high). Each of the 8 combinations of the factor
levels was used in the experiment. Thus the treatments in standard order were:
000, 001, 010, 011, 100, 101, 110, 111.
Each treatment was used in the manufacture of one batch of the chemical and the yield (amount
in grams of chemical produced) was recorded. Before the experiment was run, a decision had to
be made on the order in which the treatments would be run. To avoid any unknown feature that
changes with time being confounded with the effects of interest, a random ordering was used; see
below. The response data are also shown in the table.
Questions of interest
1. How much is the response changed when the level of one factor is changed from high to low?
For simplicity, we first consider the case of two factors only and call them factors A and B, each
having two levels, ‘low’ (0) and ‘high’ (1). The four treatments in the experiment are then 00, 01,
10, 11, and suppose that we have just one response measured for each treatment combination. We
denote the four response values by yield00 , yield01 , yield10 and yield11 .
[Figure 4.1: Illustration of the yields in a two-factor experiment, showing the average yields at the high and low levels of factors A and B.]
Main effects
For this particular experiment, we can answer the first question by measuring the difference between
the average yields at the two levels of A:
The average yield at the high level of A is $\frac{1}{2}(\text{yield}_{11} + \text{yield}_{10})$.
The average yield at the low level of A is $\frac{1}{2}(\text{yield}_{01} + \text{yield}_{00})$.
These are represented by the open stars in Figure 4.1. The main effect of A is defined as the difference between these two averages, that is
$$A = \frac{1}{2}(\text{yield}_{11} + \text{yield}_{10}) - \frac{1}{2}(\text{yield}_{01} + \text{yield}_{00})
= \frac{1}{2}(\text{yield}_{11} + \text{yield}_{10} - \text{yield}_{01} - \text{yield}_{00}),$$
which is represented by the difference between the two open stars in Figure 4.1. Notice that A is used to denote the main effect of a factor as well as its name; this is a common practice. This quantity measures how much the response changes when factor A is changed from its low to its high level, averaged over the levels of factor B.
[Figure 4.2: The effect of factor B at a given level of A, showing the effect of changing B at high A and at low A.]
When the effect of factor B (the difference between the two black stars) is different from the corresponding difference at the other level of A, the two factors A and B are said to interact with each other; the response lines are then not parallel, as in Figure 4.3. The interaction of A and B is defined as half the difference between the effect of B at high A and the effect of B at low A, that is
$$AB = \frac{1}{2}\big[(\text{yield}_{11} - \text{yield}_{10}) - (\text{yield}_{01} - \text{yield}_{00})\big].$$
[Figure 4.3: When A and B interact, the effect of changing B at high A differs from the effect of changing B at low A, and the response lines are not parallel.]
• If we interchange the roles of A and B in this expression we obtain the same formula.
• Definition: The main effects and interactions are known collectively as the factorial effects.
• Important note: When there is a large interaction between two factors, the two main effects
cannot be interpreted separately.
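As an illustration, the factorial effects of a two-factor experiment can be computed directly from the four yields. A minimal sketch in R with made-up yield values (the numbers are hypothetical, purely for illustration):
y00 <- 60; y01 <- 72; y10 <- 65; y11 <- 90     # hypothetical yields for treatments 00, 01, 10, 11
A  <- (y11 + y10 - y01 - y00) / 2              # main effect of A
B  <- (y11 + y01 - y10 - y00) / 2              # main effect of B
AB <- ((y11 - y10) - (y01 - y00)) / 2          # interaction of A and B
c(A = A, B = B, AB = AB)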
During the first week’s workshop and problem class you are asked to go through this. Please try
the proofs/exercises as well and verify that your solutions are correct by talking to a workshop
assistant. Solutions to some of the exercises are given at the end of this chapter and some others
are discussed in lectures.
Prove that
$$\binom{n}{k} = \frac{n \times (n-1) \times \cdots \times (n-k+1)}{1 \times 2 \times 3 \times \cdots \times k}.$$
Proof We have:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
= \frac{1}{k!}\cdot\frac{1 \times 2 \times \cdots \times (n-k) \times (n-k+1) \times \cdots \times (n-1) \times n}{1 \times 2 \times \cdots \times (n-k)}
= \frac{1}{k!}\big[(n-k+1) \times \cdots \times (n-1) \times n\big].$$
Hence the proof is complete. This enables us to calculate $\binom{6}{2} = \frac{6 \times 5}{1 \times 2} = 15$. In general, for calculating $\binom{n}{k}$:
the numerator is the multiplication of k terms starting with n and counting down,
and the denominator is the multiplication of the first k positive integers.
1. $\binom{n}{k} = \binom{n}{n-k}$. [This means the number of ways of choosing k items out of n items is the same as the number of ways of choosing n − k items out of n items. Why is this meaningful?]
2. $\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}$.
3. For each of (1) and (2), state the meaning of these equalities in terms of the numbers of
selections of k items without replacement.
The binomial theorem states that
$$(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} a^x b^{n-x}$$
for any numbers a and b and a positive integer n. This can be used to prove the following:
$$(1-p)^n + \binom{n}{1} p (1-p)^{n-1} + \cdots + \binom{n}{x} p^x (1-p)^{n-x} + \cdots + p^n = (p + 1 - p)^n = 1.$$
Thus
$$\sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 1,$$
for n > 1.
Exercise 1. (Hard) Show that
$$\sum_{x+y=z} \binom{m}{x}\binom{n}{y} = \binom{m+n}{z},$$
where the sum is over all possible integer values of x and y with x + y = z, 0 ≤ x ≤ m and 0 ≤ y ≤ n.
Hint: Consider the identity
$$(1+t)^m (1+t)^n = (1+t)^{m+n}$$
and compare the coefficients of $t^z$ on both sides. If this is hard, please try small values of m and n, e.g. 2 and 3, and see what happens.
1. For r ≠ 1, $\sum_{x=0}^{k} r^x = \frac{1 - r^{k+1}}{1 - r}$. The power of r, k + 1, in the formula for the sum is the number of terms.
2. When k → ∞ we can evaluate the sum only when |r| < 1. In that case
$$\sum_{x=0}^{\infty} r^x = \frac{1}{1-r} \qquad [\,r^{k+1} \to 0 \text{ as } k \to \infty \text{ for } |r| < 1\,].$$
3. For a positive n and |x| < 1, the negative binomial series is given by:
$$(1-x)^{-n} = 1 + nx + \frac{1}{2}n(n+1)x^2 + \frac{1}{6}n(n+1)(n+2)x^3 + \cdots + \frac{n(n+1)(n+2)\cdots(n+k-1)}{k!}\,x^k + \cdots$$
With n = 2 (writing q for x) this gives
$$1 + 2q + 3q^2 + 4q^3 + \cdots = (1-q)^{-2}.$$
1. log(ab) = log(a) + log(b) [Log of the product is the sum of the logs]
2. log(a/b) = log(a) − log(b) [Log of the ratio is the difference of the logs]
There is no simple formula for log(a + b) or log(a − b). Now try the following exercises:
1. Show that $\log\!\left(x\,e^{ax^3 + 3x + b}\right) = \log(x) + ax^3 + 3x + b$.
2. Satisfy yourself that $\sum_{x=0}^{\infty} e^{-\lambda}\,\frac{\lambda^x}{x!} = 1$.
A.6 Integration
A.6.1 Fundamental theorem of calculus
We need to remember (but not prove) the fundamental theorem of calculus:
$$F(x) = \int_{-\infty}^{x} f(u)\,du \quad\text{implies}\quad f(x) = \frac{dF(x)}{dx}.$$
A function f(x) is said to be an even function if
$$f(x) = f(-x)$$
for all possible values of x. For example, $f(x) = e^{-x^2/2}$ is an even function for real x.
A function f(x) is said to be an odd function if
$$f(x) = -f(-x)$$
for all possible values of x. For example, $f(x) = x\,e^{-x^2/2}$ is an odd function for real x. It can be proved that the integral of an odd function over a range symmetric about zero is zero, whenever the integral exists.
The gamma function is defined, for α > 0, by $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx$. It can be shown that Γ(α) exists and is finite for all real values of α > 0. Obviously it is non-negative since it is the integral of a non-negative function. It is easy to see that
$$\Gamma(1) = 1,$$
since $\int_0^{\infty} e^{-x}\,dx = 1$. The argument α enters only through the power of the dummy variable x. Remember this, as we will have to recognise many gamma integrals. Important points are:
1. It is an integral from 0 to ∞.
2. The integrand must be of the form dummy (x) to the power of the parameter (α) minus one ($x^{\alpha-1}$), multiplied by e to the power of the negative dummy ($e^{-x}$).
The reduction formula $\Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1)$ holds provided α > 1. The condition α > 1 is required to ensure that Γ(α − 1) exists. The
proof of this is not required, but can be proved easily by integration by parts by integrating the
function e−x and differentiating the function xα−1 . For an integer n, by repeatedly applying the
reduction formula and Γ(1) = 1, show that
Γ(n) = (n − 1)!.
Thus Γ(5) = 4! = 24. You can guess how rapidly the gamma function increases! The last formula
we need to remember for our purposes is:
$$\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}.$$
Proof of this is complicated and not required for this module. Using this we can calculate
$$\Gamma\!\left(\tfrac{3}{2}\right) = \left(\tfrac{3}{2} - 1\right)\Gamma\!\left(\tfrac{1}{2}\right) = \frac{\sqrt{\pi}}{2}.$$
Now we can easily tackle integrals such as the following:
1. For fun, try to evaluate $\int_0^{\infty} x^{\alpha-1} e^{-\beta x}\,dx$ for α > 0 and β > 0.
2. Prove that $\int_0^{\infty} x\,e^{-\lambda x}\,dx = \frac{1}{\lambda^2}$.
3. Prove that $\int_0^{\infty} x^2 e^{-\lambda x}\,dx = \frac{2}{\lambda^3}$.
2.
$$\text{RHS} = \frac{n!}{k!\,(n-k)!} + \frac{n!}{(k-1)!\,(n-[k-1])!}
= \frac{n!}{k!\,(n-[k-1])!}\,\big[n - (k-1) + k\big]
= \frac{n!\,[n+1]}{k!\,(n-[k-1])!}
= \frac{(n+1)!}{k!\,(n+1-k)!} = \text{LHS}.$$
3. (a) Number of selections (without replacement) of k objects from n is exactly the same as
the number of selections of (n − k) objects from n.
(b) The number of selections of k items from (n + 1) consists of:
• the number of selections that include the (n + 1)th item — there are $\binom{n}{k-1}$ of these;
• the number of selections that exclude the (n + 1)th item — there are $\binom{n}{k}$ of these.
Appendix B
Worked Examples
Let B1, B2, B3 be events with P{B1} = 0.3, P{B2} = 0.5 and P{B3} = 0.2, and let A be an event with P{A|B1} = 0.1, P{A|B2} = 0.3 and P{A|B3} = 0.5.
Note that B1 , B2 , B3 are mutually exclusive and exhaustive. Find P {A} and P {B1 |A}.
79. [Independent events] The probability that Jane can solve a certain problem is 0.4 and that
Alice can solve it is 0.3. If they both try independently, what is the probability that it is
solved?
80. [Random variable] A fair die is tossed twice. Let X equal the first score plus the second
score. Determine
81. [Random variable] A coin is tossed three times. If X denotes the number of heads minus the number of tails, find the probability function of X and draw a graph of its cumulative distribution function when (a) the coin is fair, and (b) the probability of a head is 3/5.
82. [Expectation and variance] The random variable X has probability function
$$p_x = \begin{cases} \frac{1}{14}(1+x) & \text{if } x = 1, 2, 3, 4 \\ 0 & \text{otherwise.} \end{cases}$$
Find E(X) and Var(X).
83. [Expectation and variance] Let X denote the score when a fair die is thrown. Determine the
probability function of X and find its mean and variance.
84. [Expectation and variance] Two fair dice are tossed and X equals the larger of the two scores
obtained. Find the probability function of X and determine E(X).
85. [Expectation and variance] The random variable X is uniformly distributed on the integers
0, ±1, ±2, . . . , ±n, i.e.
$$p_x = \begin{cases} \frac{1}{2n+1} & \text{if } x = 0, \pm 1, \pm 2, \ldots, \pm n \\ 0 & \text{otherwise.} \end{cases}$$
Obtain expressions for the mean and variance in terms of n. Given that the variance is 10,
find n.
86. [Poisson distribution] The number of incoming calls at a switchboard in one hour is Poisson
distributed with mean λ = 8. The numbers arriving in non-overlapping time intervals are
statistically independent. Find the probability that in 10 non-overlapping one hour periods
at least two of the periods have at least 15 calls.
87. [Continuous distribution] The random variable X has probability density function
$$f(x) = \begin{cases} kx^2(1-x) & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
88. [Exponential distribution] The random variable X has probability density function
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
89. [Cauchy distribution] A random variable X is said to have a Cauchy distribution if its
probability density function is given by
$$f(x) = \frac{1}{\pi(1+x^2)}, \qquad -\infty < x < +\infty.$$
(a) Verify that it is a valid probability density function and sketch its graph.
(b) Find the cumulative distribution function F (x).
(c) Find P (−1 ≤ X ≤ 1).
90. [Continuous distribution] The probability density function of X has the form
$$f(x) = \begin{cases} a + bx + cx^2 & \text{if } 0 \le x \le 4 \\ 0 & \text{otherwise.} \end{cases}$$
If E(X) = 2 and Var(X) = 12/5, determine the values of a, b and c.
$$P\{GS|S\} = \frac{P\{S|GS\}\,P\{GS\}}{P\{S|GG\}\,P\{GG\} + P\{S|GS\}\,P\{GS\} + P\{S|SS\}\,P\{SS\}}
= \frac{\frac{1}{2}\times\frac{1}{3}}{0\times\frac{1}{3} + \frac{1}{2}\times\frac{1}{3} + 1\times\frac{1}{3}} = \frac{1}{3}.$$
Hence
$$P\{A\} = P\{B_1\}P\{A|B_1\} + P\{B_2\}P\{A|B_2\} + P\{B_3\}P\{A|B_3\}
= 0.3 \times 0.1 + 0.5 \times 0.3 + 0.2 \times 0.5 = 0.31.$$
Now, by the Bayes theorem,
$$P\{B_1|A\} = \frac{P\{B_1\}P\{A|B_1\}}{P\{A\}} = \frac{0.3 \times 0.1}{0.31} = \frac{3}{31}.$$
79. P(problem solved) = 1 − P(neither solves it) = 1 − (1 − 0.4)(1 − 0.3) = 1 − 0.6 × 0.7 = 0.58.
80.
(a) Working along the cross-diagonals we find by enumeration that X has the following
probability function
x    2     3     4     5     6     7     8     9     10    11    12
px   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
More concisely,
$$p_x = \begin{cases} \frac{6 - |x-7|}{36} & \text{if } x = 2, \ldots, 12 \\ 0 & \text{otherwise.} \end{cases}$$
(b)
$$F(x) = \begin{cases} 0 & \text{if } x < 2 \\ \frac{1}{36} & \text{if } 2 \le x < 3 \\ \frac{3}{36} & \text{if } 3 \le x < 4, \text{ etc.} \end{cases}$$
81. The probability functions are
(a)  x: −3, −1, 1, 3 with px = 1/8, 3/8, 3/8, 1/8;
(b)  x: −3, −1, 1, 3 with px = 8/125, 36/125, 54/125, 27/125.
The corresponding cumulative distribution functions are
$$\text{(a)}\quad F(x) = \begin{cases} 0 & \text{if } x < -3 \\ \tfrac{1}{8} & \text{if } -3 \le x < -1 \\ \tfrac{1}{2} & \text{if } -1 \le x < 1 \\ \tfrac{7}{8} & \text{if } 1 \le x < 3 \\ 1 & \text{if } x \ge 3, \end{cases}
\qquad
\text{(b)}\quad F(x) = \begin{cases} 0 & \text{if } x < -3 \\ \tfrac{8}{125} & \text{if } -3 \le x < -1 \\ \tfrac{44}{125} & \text{if } -1 \le x < 1 \\ \tfrac{98}{125} & \text{if } 1 \le x < 3 \\ 1 & \text{if } x \ge 3. \end{cases}$$
82. $p_1 = \tfrac{2}{14}$, $p_2 = \tfrac{3}{14}$, $p_3 = \tfrac{4}{14}$, $p_4 = \tfrac{5}{14}$.
$$E(X) = \sum_x x\,p_x = 1 \times \tfrac{2}{14} + 2 \times \tfrac{3}{14} + 3 \times \tfrac{4}{14} + 4 \times \tfrac{5}{14} = \tfrac{20}{7},$$
$$E(X^2) = 1 \times \tfrac{2}{14} + 4 \times \tfrac{3}{14} + 9 \times \tfrac{4}{14} + 16 \times \tfrac{5}{14} = \tfrac{65}{7}.$$
Therefore
$$\text{Var}(X) = E(X^2) - [E(X)]^2 = \tfrac{65}{7} - \tfrac{400}{49} = \tfrac{55}{49}.$$
83. The probability function of X is
x    1    2    3    4    5    6
px   1/6  1/6  1/6  1/6  1/6  1/6
$$E(X) = \sum_x x\,p_x = \tfrac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \tfrac{7}{2},$$
$$\text{Var}(X) = E(X^2) - [E(X)]^2 = \tfrac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) - \left(\tfrac{7}{2}\right)^2 = \tfrac{35}{12}.$$
84. Using the sample space for Question 5, we find that X has probability function
x    1     2     3     4     5     6
px   1/36  3/36  5/36  7/36  9/36  11/36
$$E(X) = \tfrac{1}{36}\times 1 + \tfrac{3}{36}\times 2 + \tfrac{5}{36}\times 3 + \tfrac{7}{36}\times 4 + \tfrac{9}{36}\times 5 + \tfrac{11}{36}\times 6 = \tfrac{161}{36}.$$
85.
$$E(X) = \sum_x x\,p_x = \frac{1}{2n+1}\sum_{x=-n}^{n} x = 0.$$
$$E(X^2) = \sum_x x^2 p_x = \frac{1}{2n+1}\sum_{x=-n}^{n} x^2 = \frac{2}{2n+1}\cdot\frac{n(n+1)(2n+1)}{6} = \frac{n(n+1)}{3}.$$
Therefore
$$\text{Var}(X) = E(X^2) - [E(X)]^2 = \frac{n(n+1)}{3}.$$
If Var(X) = 10, then n(n + 1)/3 = 10, i.e. n² + n − 30 = 0. Therefore n = 5 (rejecting −6).
86. Let X be the number of calls arriving in an hour and let P (X ≥ 15) = p.
Then Y , the number of times out of 10 that X ≥ 15, is B(n, p) with n = 10 and p =
1 − 0.98274 = 0.01726.
Therefore
$$P(Y \ge 2) = 1 - P(Y \le 1) = 1 - \big((0.98274)^{10} + 10\,(0.01726)\,(0.98274)^{9}\big) = 0.01223.$$
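The probabilities in this solution can be checked in R:
p <- 1 - ppois(14, lambda = 8)       # P(X >= 15) for a Poisson(8) count, about 0.01726
1 - pbinom(1, size = 10, prob = p)   # P(Y >= 2), about 0.0122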
87. (a) $k\int_0^1 x^2(1-x)\,dx = 1$, which implies that k = 12.
(b) $P(0 < X < \tfrac{1}{2}) = 12\int_0^{1/2} x^2(1-x)\,dx = \tfrac{5}{16}$.
(c) The number of observations lying in the interval (0, ½) is binomially distributed with parameters n = 100 and p = 5/16, so that the mean number of observations in (0, ½) is np = 31.25.
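The integral in part (b) can be checked numerically in R:
integrate(function(x) 12 * x^2 * (1 - x), lower = 0, upper = 0.5)   # 0.3125 = 5/16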
88. (a)
$$E(X) = \lambda\int_0^{\infty} x\,e^{-\lambda x}\,dx
= \big[-x e^{-\lambda x}\big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx \;(\text{integrating by parts})
= 0 - \frac{1}{\lambda}\big[e^{-\lambda x}\big]_0^{\infty} = \frac{1}{\lambda}.$$
(b)
$$E(X^2) = \lambda\int_0^{\infty} x^2 e^{-\lambda x}\,dx
= \big[-x^2 e^{-\lambda x}\big]_0^{\infty} + 2\int_0^{\infty} x\,e^{-\lambda x}\,dx
= 0 + \frac{2}{\lambda^2} \;(\text{using the first result}).$$
Therefore $\text{Var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$ and $\sigma = \sqrt{\text{Var}(X)} = \frac{1}{\lambda}$.
(c) The mode is x = 0 since this value maximises f(x).
(d) The median m is given by
$$\int_0^{m} \lambda e^{-\lambda x}\,dx = \frac{1}{2}, \quad\text{which implies that } m = \frac{1}{\lambda}\log(2).$$
(b) $F(x) = \frac{1}{\pi}\int_{-\infty}^{x} \frac{dy}{1+y^2} = \frac{1}{\pi}\big[\tan^{-1}(y)\big]_{-\infty}^{x} = \frac{1}{\pi}\left(\tan^{-1}(x) + \frac{\pi}{2}\right)$.
(c) $P(-1 \le X \le 1) = F(1) - F(-1) = \frac{1}{\pi}\left(\tan^{-1}(1) + \frac{\pi}{2}\right) - \frac{1}{\pi}\left(\tan^{-1}(-1) + \frac{\pi}{2}\right) = \frac{1}{2}$.
Find
(a) unbiased estimates of µb and σb2 , the mean and variance of the weights of the boys;
(b) unbiased estimates of µg and σg2 , the mean and variance of the weights of the girls;
(c) an unbiased estimate of µb − µg .
Assuming that σb2 = σg2 = σ 2 , calculate an unbiased estimate of σ 2 using both sets of weights.
92. [Estimation] The time that a customer has to wait for service in a restaurant has the prob-
ability density function
$$f(x) = \begin{cases} \dfrac{3\theta^3}{(x+\theta)^4} & \text{if } x \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$
94. [Confidence interval] At the end of a severe winter a certain insurance company found that
of 972 policy holders living in a large city who had insured their homes with the company,
357 had suffered more than £500-worth of snow and frost damage. Calculate an approximate
95% confidence interval for the proportion of all homeowners in the city who suffered more
than £500-worth of damage. State any assumptions that you make.
95. [Confidence interval] The heights of n randomly selected seven-year-old children were mea-
sured. The sample mean and standard deviation were found to be 121 cm and 5 cm re-
spectively. Assuming that height is normally distributed, calculate the following confidence
intervals for the mean height of seven-year-old children:
96. [Confidence interval] A random variable is known to be normally distributed, but its mean µ
and variance σ 2 are unknown. A 95% confidence interval for µ based on 9 observations was
found to be [22.4, 25.6]. Calculate unbiased estimates of µ and σ 2 .
97. [Confidence interval] The wavelength of radiation from a certain source is 1.372 microns. The
following 10 independent measurements of the wavelength were obtained using a measuring
device:
1.359, 1.368, 1.360, 1.374, 1.375, 1.372, 1.362, 1.372, 1.363, 1.371.
Assuming that the measurements are normally distributed, calculate 95% confidence limits
for the mean error in measurements obtained with this device and comment on your result.
98. [Confidence interval] In five independent attempts, a girl completed a Rubik’s cube in 135.4,
152.1, 146.7, 143.5 and 146.0 seconds. In five further attempts, made two weeks later, she
completed the cube in 133.1, 126.9, 129.0, 139.6 and 144.0 seconds. Find a 90% confidence
interval for the change in the mean time taken to complete the cube. State your assumptions.
99. [Confidence interval] In an experiment to study the effect of a certain concentration of insulin
on blood glucose levels in rats, each member of a random sample of 10 rats was treated with
insulin. The blood glucose level of each rat was measured both before and after treatment.
The results, in suitable units, were as follows.
Rat 1 2 3 4 5 6 7 8 9 10
Level before 2.30 2.01 1.92 1.89 2.15 1.93 2.32 1.98 2.21 1.78
Level after 1.98 1.85 2.10 1.78 1.93 1.93 1.85 1.67 1.72 1.90
Let µ1 and µ2 denote respectively the mean blood glucose levels of a randomly selected rat
before and after treatment with insulin. By considering the differences of the measurements
on each rat and assuming that they are normally distributed, find a 95% confidence interval
for µ1 − µ2 .
100. [Confidence interval] The heights (in metres) of 10 fifteen-year-old boys were as follows:
1.59, 1.67, 1.55, 1.63, 1.69, 1.58, 1.66, 1.62, 1.64, 1.61.
Assuming that heights are normally distributed, find a 99% confidence interval for the mean
height of fifteen-year-old boys.
If you were told that the true mean height of boys of this age was 1.67 m, what would you
conclude?
91. (a) An unbiased estimate of µ_b is the sample mean
$$\hat{\mu}_b = \tfrac{1}{10}(77 + 67 + \cdots + 81) = 67.3,$$
and an unbiased estimate of σ_b² is the sample variance of the weights of the boys,
$$\hat{\sigma}_b^2 = \big((77^2 + 67^2 + \cdots + 81^2) - 10\hat{\mu}_b^2\big)/9 = 52.6\dot{7}.$$
(b) Similarly,
$$\hat{\mu}_g = \tfrac{1}{10}(42 + 57 + \cdots + 59) = 52.4, \qquad
\hat{\sigma}_g^2 = \big((42^2 + 57^2 + \cdots + 59^2) - 10\hat{\mu}_g^2\big)/9 = 56.7\dot{1}.$$
(c) An unbiased estimate of µ_b − µ_g is $\hat{\mu}_b - \hat{\mu}_g = 67.3 - 52.4 = 14.9$.
Assuming σ_b² = σ_g² = σ², an unbiased estimate of σ² using both sets of weights is
$$\tfrac{1}{2}(\hat{\sigma}_b^2 + \hat{\sigma}_g^2) = \tfrac{1}{2}(52.6\dot{7} + 56.7\dot{1}) = 54.69\dot{4}.$$
where z_γ is the 100γ percentile of the standard normal distribution. The width of the CI is 0.8 z_γ. The width of the quoted confidence interval is 1.316. Therefore, assuming that the quoted interval is symmetric, 0.8 z_γ = 1.316, giving z_γ = 1.645, the 95th percentile of the standard normal distribution. This implies that α = 0.1 and hence 100(1 − α) = 90, i.e. the confidence level is 90%.
94. Assuming that although the 972 homeowners are all insured within the same company they
constitute a random sample from the population of all homeowners in the city, the 95%
interval is given approximately by
$$\left[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\;\; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\,\right],$$
where $\hat{p} = 357/972$ and n = 972.
95. The 100(1 − α)% confidence interval is, in the usual notation,
$$\bar{x} \;\pm\; \text{critical value} \times \frac{s}{\sqrt{n}},$$
where the critical value is the 100(1−α/2)th percentile of the t-distribution with n−1 degrees
of freedom. Here x̄ = 121 and s = 5.
(a) For the 90% CI, the critical value is 1.753 (qt(0.95, df=15) in R) and the interval is
A 95% confidence interval for the mean error is obtained by subtracting the true wavelength
of 1.372 from each endpoint. This gives [−0.0087, −0.0001]. As this contains negative values
only, we conclude that the device tends to underestimate the true value.
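This interval can be reproduced in R from the measurements given in Question 97:
w <- c(1.359, 1.368, 1.360, 1.374, 1.375, 1.372, 1.362, 1.372, 1.363, 1.371)
t.test(w - 1.372)$conf.int   # 95% CI for the mean error, approximately (-0.0087, -0.0001)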
98. Using x to refer to the early attempts and y to refer to the later ones, we find from the data
that
$$\sum x_i = 723.7, \quad \sum x_i^2 = 104896.71, \qquad \sum y_i = 672.6, \quad \sum y_i^2 = 90684.38.$$
This gives
$$\bar{x} = 144.74, \quad s_x^2 = 37.093, \qquad \bar{y} = 134.52, \quad s_y^2 = 51.557.$$
Confidence limits for the change in mean time are
$$\bar{y} - \bar{x} \;\pm\; \text{critical value} \times \sqrt{\frac{4s_x^2 + 4s_y^2}{8}\left(\frac{1}{5} + \frac{1}{5}\right)},$$
leading to the interval [−18.05, −2.39], as the critical value is 1.860 (qt(0.95, df=8) in R).
As it contains only negative values, this suggests that there is a real decrease in the mean
time taken to complete the cube. We have assumed that the two samples are independent
random samples from normal distributions of equal variance.
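In R, with the attempt times from Question 98 (x for the early attempts, y for the later ones):
x <- c(135.4, 152.1, 146.7, 143.5, 146.0)
y <- c(133.1, 126.9, 129.0, 139.6, 144.0)
t.test(y, x, var.equal = TRUE, conf.level = 0.90)$conf.int   # approximately (-18.05, -2.39)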
99. Let d1 , d2 , . . . , d10 denote the differences in levels before and after treatment. Their values
are
0.32, 0.16, −0.18, 0.11, 0.22, 0.00, 0.47, 0.31, 0.49, −0.12.
Then $\sum_{i=1}^{10} d_i = 1.78$ and $\sum_{i=1}^{10} d_i^2 = 0.7924$, so that $\bar{d} = 0.178$ and $s_d = 0.2299$.
Note that the two samples are not independent. Thus the standard method of finding a
confidence interval for µ1 − µ2 , as used in Question 9 for example, would be inappropriate.
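In R, using the measurements from Question 99:
before <- c(2.30, 2.01, 1.92, 1.89, 2.15, 1.93, 2.32, 1.98, 2.21, 1.78)
after  <- c(1.98, 1.85, 2.10, 1.78, 1.93, 1.93, 1.85, 1.67, 1.72, 1.90)
t.test(before - after)$conf.int   # 95% CI for mu1 - mu2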
100. The mean of the heights is 1.624 and the standard deviation is 0.04326. A 99% confidence
interval for the mean height is therefore
$$\left[\,1.624 - 3.250 \times \frac{0.04326}{\sqrt{10}},\;\; 1.624 + 3.250 \times \frac{0.04326}{\sqrt{10}}\,\right],$$
i.e. approximately [1.580, 1.668].
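In R, with the heights from Question 100:
h <- c(1.59, 1.67, 1.55, 1.63, 1.69, 1.58, 1.66, 1.62, 1.64, 1.61)
t.test(h, conf.level = 0.99)$conf.int   # 99% CI for the mean height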
If we were told that the true mean height was 1.67 m then, discounting the possibility that
this information is false, we would conclude that our sample is not a random sample from
the population of all fifteen-year-old boys or that we have such a sample but an event with
probability 0.01 has occurred, namely that the 99% confidence interval does not contain the
true mean height.
Appendix C
Homework Sheets
Homework Sheet 1
1. Suppose we have the data: x1 = 1, x2 = 2, . . . , xn = n. Find the mean and the variance. For
variance use the divisor n instead of n − 1.
2. Suppose y_i = a x_i + b, where a and b are real numbers. Show that: (i) $\bar{y} \equiv \frac{1}{n}\sum_{i=1}^{n} y_i = a\bar{x} + b$ and (ii) Var(y) = a² Var(x), where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and for the variance you may use either the divisor n, i.e. $\text{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, or n − 1, i.e. $\text{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. The divisor does not matter as the results hold regardless.
Show that
$$x_1^2 + x_2^2 + \cdots + x_n^2 \;\ge\; \frac{(x_1 + x_2 + \cdots + x_n)^2}{n}.$$
You may start by assuming $\sum_{i=1}^{n}(x_i - \bar{x})^2 \ge 0$. This is a version of the famous and very important Cauchy-Schwarz inequality in mathematics.
5. Read the computer failure data by issuing the command cfail <- scan("compfail.txt").
[Before issuing the command, make sure that you have set the correct working directory so
that R can find this data file.]
(a) Describe the data using the commands summary, var and table. Also obtain a histogram
(hist) plot and a boxplot (boxplot) of the data and comment.
(b) Use plot(cfail) to produce a plot of the failure numbers (y-axis) against the order in
which the data were collected (x-axis). You may modify the command to plot(cfail,
type="l") to perhaps get a better looking plot. The argument is lower case letter ’l’
not the number 1. Do you detect any trend?
(c) The object cfail is a vector of length 104. The first 52 values are for the weeks in
the first year and the last 52 values are for the second year. How can we find the two
annual means and variances? For this we need to create an index corresponding to each
of the 104 values. The index should take the value 1 for the first 52 values and 2 for
the remaining. Use the rep (for replicate) command to generate this. Issue year <-
rep(c(1, 2), each=52). This will replicate 1, 52 times followed by 2, 52 times and
put the resulting vector in year. Type cbind(year, cfail) and see what happens.
Now you can calculate the year-wise means by issuing the command: tapply(X=cfail,
INDEX=year, FUN=mean). The function tapply applies the supplied FUN argument to
the X argument separately for each unique index supplied by the INDEX argument. That
is, the command returns the mean value for each group of observations corresponding
to distinct values of the INDEX variable year.
6. Fortune magazine publishes a list of the world’s billionaires each year. The 1992 list includes
225 individuals. Their wealth, age, and geographic location (Asia, Europe, Middle East,
United States, and Other) are reported. Variables are: wealth: Wealth of family or individual
in billions of dollars; age: Age in years (for families it is the maximum age of family members);
region: Region of the World (Asia, Europe, Middle East, United States and Other). The head
and tail values of the data set are given below.
wealth   age   region
37.0     50    M
24.0     88    U
...      ...   ...
1        9     M
1        59    E
Read the data by issuing the following command.
bill <- read.table("billionaires.txt", head=T).
(a) Obtain the summary of the data set and comment on each of the three columns of data.
(b) Obtain the mean and variance of the wealth for each of the 5 regions of the world and
comment. You can use the tapply(X=bill$wealth, INDEX=bill$region, FUN=mean)
command to do this.
(c) Produce a set of boxplots to demonstrate the difference in distribution of wealth accord-
ing to different parts of the world. For example, you may need to issue the command:
boxplot(wealth ~ region, data = bill)
(d) Produce a scatter plot of wealth against age and comment on the relationship between
them. Do you see any outlying observations? Do you think a linear relationship between
age and wealth will be sensible?
Try to ensure all your plots have informative titles and axis labelling – the function title and the function arguments xlab and ylab, for example ylab="wealth in billions of US dollars", are helpful for this.
Homework Sheet 2
1. I select two cards from a pack of 52 cards and observe the colour of each. Which of the
following is an appropriate sample space S for the possible outcomes?
2. A builder requires the services of both a plumber and an electrician. If there are 12 plumbers
and 9 electricians available in the area, the number of choices of the pair of contractors is
(a) 96
(b) 88
(c) 108
(d) none of the above
3. In a particular game, a fair die is tossed. If the number of spots showing is either 4 or 5 you
win £1, if the number of spots showing is 6 you win £4, and if the number of spots showing
is 1, 2 or 3 you win nothing. If it costs you £1 to play the game, the probability that you
win more than the cost of playing is
(a) 0
(b) 1/6
(c) 1/3
(d) 2/3
4. A rental car company has 10 Japanese and 15 European cars waiting to be serviced on a
particular Saturday morning. Because there are so few mechanics available, only 6 cars can
be serviced. Calculate
(a) the number of outcomes in the sample space (i.e. the number of possible selections of 6
cars).
(b) the number of outcomes in the event “3 of the selected cars are Japanese and the other
3 are European”.
(c) the probability (to 3 decimal places) that the event in (b) occurs, when the 6 cars are
chosen at random.
5. Shortly after being put into service, some buses manufactured by a certain company have
developed cracks on the underside of the main frame. Suppose a particular city has 25 of
these buses, and cracks have actually appeared in 8 of them.
(a) How many ways are there to select a sample of 5 buses from the 25 for a thorough
inspection?
(b) How many of these samples of 5 buses contain exactly 4 with visible cracks?
(c) If a sample of 5 buses is chosen at random, what is the probability that exactly 4 of the
5 will have visible cracks (to 3 dp)?
(d) If buses are selected as in part (c), what is the probability that at least 4 of those selected
will have visible cracks (to 3 dp)?
6. The probability that a man reads The Sun is 0.6, while the probability that he reads both The
Sun and The Guardian is 0.1 and the probability that he reads neither paper is 0.2. What is
the probability that he reads The Guardian? Draw a Venn diagram to illustrate your answer.
7. Write down all 27 ways in which three distinguishable balls A, B and C can be distributed
among three boxes. If each ball were to be placed into a box selected at random, what
probabilities would you assign to each way?
Show that, if the balls are made indistinguishable, the number of distinct ways reduces to 10.
If again each ball were to be placed into a box selected at random, what probabilities would
you assign to the 10 ways?
Homework Sheet 3
1. Event A occurs with probability 0.8. The conditional probability that event B occurs given
that A occurs is 0.5. The probability that both A and B occur is
(a) 0.3,
(b) 0.4,
(c) 0.8,
(d) cannot be determined from the information given
2. An event A occurs with probability 0.5. An event B occurs with probability 0.6. The proba-
bility that both A and B occurs is 0.1. The conditional probability of A given B is
(a) 0.3,
(b) 0.2,
(c) 1/6,
(d) cannot be determined from the information given
You only need give the letters corresponding to the correct answers in this question.
3. In a multiple choice examination paper, n answers are given for each question (n = 4 in the above two questions). Suppose a candidate knows the answer to a question with probability p. If he does not know it he guesses, choosing one of the answers at random. Show that if his answer is correct, the probability that he knew the answer is $\frac{np}{1 + (n-1)p}$.
4. Two events A and B are such that the probability of B occurring given that A has occurred
is 4 times the probability of A, and the probability that A occurs given that B has occurred
is 9 times the probability that B occurs. If the probability that at least one of the events
occurs is 7/48, find the probability of event A occurring.
6. A water supply system has three pumps A, B and C arranged as below, where the pumps
operate independently.
The system functions if either A and B operate or C operates, or all three operate. If A and
B have reliabilities 90% and C has reliability 95%, what is the reliability of the water supply
system?
7. (a) A twin-engine aircraft can fly with at least one engine working. If each engine has a
probability of failure during flight of q and the engines operate independently, what is
the probability of a successful flight?
(b) A four-engine aircraft can fly with at least two of its engines working. If all the engines
operate independently and each has a probability of failure of q, what is the probability
that a flight will be successful?
(c) For what range of values of q would you prefer to fly in a twin-engine aircraft rather
than a four-engine aircraft?
(d) Discuss briefly (in no more than two sentences) whether the assumption of independent
operation of the engines is reasonable.
Homework Sheet 4
(a) seven people will die from this disease this year,
(b) 10 or more people will die from this disease this year,
(c) there will be no deaths from this disease this year?
5. The performance of a computer is studied over a period of time, during which it makes 10^15 computations. If the probability that a particular computation is incorrect equals 10^{−14}, independent of other computations, write down an expression for the probability that fewer than 10 errors occur and calculate an approximation to its value using R.
6. The random variable X has probability density function
$$f(x) = \begin{cases} kx^2(1-x) & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}$$
(a) Find the value of k.
(b) Find the probability that X lies in the range (0, 21 ).
(c) In 100 independent observations of X, how many on average will fall in the range (0, 12 )?
for α > 0. Hint: In the integral first substitute y = βx and then use the definition of
the gamma function.
(b) Show that the expected value of X is E(X) = α/β.
(c) Show that the variance of X is Var(X) = α/β².
This is called the gamma distribution and it is plain to see that if α = 1, it is the exponential distribution covered in lectures. When α = n/2 and β = 1/2 it is called the χ² distribution with n degrees of freedom, which has mean n and variance 2n.
Homework Sheet 5
1. Show that if X ∼ Poisson(λ), Y ∼ Poisson(µ) and X and Y are independent random variables
then
X + Y ∼ Poisson(λ + µ).
2. If X has the distribution N (4, 16), find
(a) the number a such that P (|X − 4| ≤ a) = 0.95,
(b) the median,
(c) the 90th percentile,
(d) the interquartile range.
3. The fractional parts of 100 numbers are distributed uniformly between 0 and 1. The numbers
are first rounded to the nearest integer and then added. Using the CLT, find an approximation
for the probability that the error in the sum due to rounding lies between –0.5 and 0.5. Hint:
First argue that the individual errors are uniformly distributed and find the mean and variance
of the unifor distribution.
4. A man buys a new die and throws it 600 times. He does not yet know if the die is fair!
(a) Show that the probability that he obtains between 90 and 100 ’sixes’ if the die is fair is
0.3968 approximately.
(b) From the first part we see that P (90 ≤ X ≤ 100) ≈ 0.3968 where X denotes the number
of ‘sixes’ obtained from 600 throws. If the die is fair we can expect about 100 ‘sixes’
from 600 throws. Between what two limits symmetrically placed about 100 would the
number sixes obtained lie with probability 0.95 if the die is fair? That is, find N such
that P(100 − N ≤ X ≤ 100 + N) = 0.95. You may use the continuity correction of 1/2 on each of the limits inside the probability statement.
(c) What might he conclude if he obtained 120 sixes?
5. The random variables X, Y have the following joint probability table.
            Y
        1      2      3
   1   1/16   1/8    1/16
X  2   1/8    1/8    1/8
   3   1/16   1/4    1/16
6. Two random variables X, Y each take only the values 0 and 1, and their joint probability
table is as follows.
        Y
        0    1
   0    a    b
X
   1    c    d
Homework Sheet 6
1. Show that if X1, X2, . . . , Xn is a random sample from a distribution having mean µ and variance σ², then
$$Y = \sum_{i=1}^{n} a_i X_i$$
is an unbiased estimator for µ provided $\sum_{i=1}^{n} a_i = 1$, and find an expression for Var(Y).
In the case n = 4, determine which of the following estimators are unbiased for µ:
2. In an experiment, 100 observations were taken from a normal distribution with variance
16. The experimenter quoted [1.545, 2.861] as the confidence interval for µ. What level of
confidence was used?
3. The heights of n randomly selected seven-year-old children were measured. The sample mean
and standard deviation were found to be 121 cm and 5 cm respectively. Assuming that height
is normally distributed, calculate the following confidence intervals for the mean height of
seven-year-old children:
(a) Suppose that the observed sample size is n = 30, of which 18 are 0s and 12 are 1s. Use
two different methods described in lectures to obtain a 95% confidence interval for p.
(b) Now suppose that the observed sample size is n = 30, of which 29 are 0s and 1 is 1.
Again, use two different methods to obtain a 95% confidence interval for p.
(c) Comment on your results in (a) and (b).
Rat 1 2 3 4 5 6 7 8 9 10
Level before 2.30 2.01 1.92 1.89 2.15 1.93 2.32 1.98 2.21 1.78
Level after 1.98 1.85 2.10 1.78 1.93 1.93 1.85 1.67 1.72 1.90
Let µ1 and µ2 denote respectively the mean blood glucose levels of a randomly selected rat
before and after treatment with insulin. By considering the differences of the measurements
on each rat and assuming that they are normally distributed, find a 95% confidence interval
for µ1 − µ2 .
Homework Sheet 7
1. The random variable X has a normal distribution with standard deviation 3.5 but unknown
mean µ. The hypothesis H0 : µ = 3 is to be tested against H1 : µ = 4 by taking a random
sample of size 50 from the distribution of X and rejecting H0 if the sample mean exceeds 3.4.
Calculate the Type I and Type II error probabilities of this procedure.
2. The daily demand for a product has a Poisson distribution with mean λ, the demands on
different days being statistically independent. It is desired to test the hypotheses H0 : λ =
0.7, H1 : λ = 0.3. The null hypothesis is to be accepted if in 20 days the number of days with
no demand is less than 15. Calculate the Type I and Type II error probabilities.
3. A wholesale greengrocer decides to buy a field of winter cabbages if he can convince himself
that their mean weight exceeds 1.2 kg. Accordingly he cuts 12 cabbages at random and
weighs them with the following results (in kg):
1.26, 1.19, 1.17, 1.24, 1.23, 1.25, 1.20, 1.18, 1.23, 1.21, 1.19, 1.17.
Should the greengrocer buy the cabbages? Use a 10% level of significance.
4. A market gardener, wishing to compare the effectiveness of two fertilizers, used one fertilizer
throughout the growing season on half of his plants and used the second one on the other
half. The yields of the surviving plants are shown below, measured in kilograms.
Stating all assumptions made, test at a 10% significance level the hypothesis that the fertilizers
are equally effective against the alternative that they are not.
5. Eight young English county cricket batsmen were awarded scholarships which enabled them
to spend the winter in Australia playing club cricket. Their first-class batting averages in the
preceding and following seasons were as follows.
Batsman 1 2 3 4 5 6 7 8
Average before 29.43 21.21 31.23 36.27 22.28 30.06 27.60 43.19
Average after 31.26 24.95 29.74 33.43 28.50 30.35 29.16 47.24
Is there a significant improvement in their batting averages between seasons? Could any
change be attributed to the winter practice?
6. A sociologist wishes to estimate the proportion of wives in a city who are happy with their
marriage. To overcome the difficulty that a wife, if asked directly, may say that her marriage
is happy even when it is not, the following procedure is adopted. Each member of a random
sample of 500 married women is asked to toss a fair coin but not to disclose the result. If
the coin lands ‘heads’, the question to be answered is ‘Does your family own a car?’. If it
lands ‘tails’, the question to be answered is ‘Is your marriage a happy one?’. In either case the
response is to be either ‘Yes’ or ‘No’. The respondent knows that the sociologist has no means
of knowing which question has been answered. Suppose that of the 500 responses, 350 are
‘Yes’ and 150 are ‘No’. Assuming that every response is truthful and given that 75% of families
own a car, estimate the proportion of women who are happy with their marriage and obtain
an approximate 90% confidence interval for this proportion. Based on this confidence interval
would you reject the null hypothesis that 80% of women are happy with their marriage?
Prof Sujit Sahu studied statistics and mathematics at the University of Calcutta and the Indian Statistical Institute and went on to obtain his PhD at the University of Connecticut (USA) in 1994. He joined the University of Southampton in 1999. His research area is Bayesian statistical modelling and computation.
Acknowledgements
These notes are taken largely from the MATH1024 notes developed over the years by many
Southampton colleagues, most recently by Prof Dankmar Böhning and Dr Vesna Perisic.
Miss Joanne Ellison helped enormously in typesetting and proofreading these notes.
Many worked examples were taken from a series of books written by F.D.J. Dunstan, A.B.J.
Nix, J.F. Reynolds and R.J. Rowlands, published by R.N.D. Publications.