STA1000S Finished Notes
Table of Contents
Week 1
Probability Versus Odds
Statistical Distributions
Excel Formulas
Fair Game
Win Percentage
House Advantage
Expected Gain/Loss
Week 2
Counting Rules
Permutations
Combinations
Counting Rules
Counting Rule 1:
Counting Rule 2:
Counting Rule 3:
Counting Rule 4:
Conditional Probability
Bayes’ Theorem
If A and B are Unrelated/Independent:
The Table Method for Bayes’ Theorem
Week 3
Qualitative data
Quantitative data:
Ordinal data:
Excel
Exploratory Data Analysis
Visually Displaying Data
Skewness
Five-Number Data Summaries
o Median
o Lower-quartile:
o Upper quartile
Constructing Box and Whisker Plots:
Five-Number Summary in Excel
Summary Statistics
Standard Deviation in Excel
Formulas for Mean and Variance
Measures of Location and Spread
Location
Spread
For more notes, videos and explanations:
Week 4
Random Variable
Probability Mass Functions/Discrete Random Variables
Probability Density Functions/Continuous Random Variables
Expected Values of PDFs and PMFs
Variance of Random Variable X
Probability Mass Function (Discrete):
Probability Density Function (Continuous):
Coefficient of Variation
Expected Winnings/Loss
Combining Random Variables
Week 5
Probability Distribution
Uniform Distribution
• Expected Value
• Variance
• Graph of Uniform Distribution
Binomial Distribution
• Expected Value
• Variance
Week 6
Probability Distributions
Poisson Distribution
• Expected Value
• Variance
• Graph
Exponential Distribution
• Expected Value
• Variance
Central Limit Theorem
The Normal Distribution
Calculating Probability in Normal Distributions
Things to Remember
Subtracting / Adding / Multiplying Normal Distributions
Lower/Upper Quartiles with Normal Distributions
Example of Normal Distribution Question
Week 7
Sample v Population
Percentage Point Notation
Confidence Intervals
Point Estimate
Interval
Confidence Interval Formula:
Width of Confidence Interval:
Determining Sample Size
General Sample Size Formula
Sample Size Formula When Trying to Achieve ‘L’ Accuracy
Some Common Z Values:
Some Things to Remember:
Week 8
The Hypothesis Test
2-Sided Test
Rejection Region in a 2-Sided Test
Which Level of Significance (α) to Use?
Some Things to Remember:
Comparing 2 Sample Means
Subtracting Distributions
Finding the Z Value for Calculating the Test Statistic
The Modified Approach
The P Value Explained
• Example of P-Value Question
Week 9
Unknown Population Variances
Finding the t-value:
Comparing to Z-Table
Confidence Interval Without Knowing Population Variance
Testing the Mean:
Two-Sided Test with Same Example
The Modified Approach
P-Value with One-Sided
P-Value with Two-Sided
• P-Value Example
What are Degrees of Freedom?
The Degree of Freedom “Rule”
Some Things to Remember
Comparing Two Means with the T-Distribution (6 Step Approach)
Finding a T-Value when Dealing with Two Means
Comparing Two Means with the T-Distribution (Modified Approach)
Finding the P-Value
Week 10
Comparing two Means in Paired Data Sets (6 Step Approach)
Comparing two Means in Paired Data Sets (Modified Approach)
Calculate P-Value
P-Value and Rejecting or Accepting H0
Excel and the T-Distribution
- Right-Tailed Test ( > )
- Left-Tailed Test ( < )
- Two-Tailed Test ( < > )
Confidence Intervals Under the T-Test
Some Things to Remember
Example of a Question
Goodness-of-Fit-Test: Whether Data Fits Various Distributions
Goodness-of-Fit-Test Under the 6-Step Approach
Chi-squared Distribution
Getting Correct Degrees of Freedom for Chi-Squared
What Does the Critical Value Mean?
Test Statistic Formula for Chi-Squared
Goodness-of-Fit Test Under the Modified Approach
- Finding the P-Value
A Note on Both the Modified and 6-Step Approach
Some Things to Remember
Week 11
Testing for an Association Between Two Categorical Variables
Testing for Association Using the 6-Step Approach
Getting Correct Degrees of Freedom Under Two Variable Association Test
Degrees of Freedom for Tests of Association
Testing for Association Using the Modified Approach
Finding the P-Value
Excel and the Chi-Squared Test of Association
- P-Value
- Critical Value
- Test Statistic
Testing for a Linear Relationship Between Two Variables
Is There a Linear Relationship Between X and Y?
The Coefficient of Determination: R²
Using X to Predict Y
Straight Line Formula with True Parameters
Straight Line Formula with Estimated Parameters
Hypothesis Test About β Using the 6-Step Approach
Week 1
$\Pr(\text{Outcome}) = \frac{\text{no. of equally likely ways of getting outcome}}{\text{no. of equally likely outcomes}}$
Probability reflects the number of ways of getting a specific outcome relative to the
total number of ways of conducting the experiment.
Odds reflect the number of ways that give you the event of interest relative to the
number of ways that don’t give you the event of interest.
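The difference between the two definitions above can be sketched in Python. This is only an illustration: the function names and the die example (rolling a six) are assumptions made here, not part of the notes.

```python
from fractions import Fraction

def probability(favourable: int, total: int) -> Fraction:
    """Ways of getting the outcome relative to ALL equally likely outcomes."""
    return Fraction(favourable, total)

def odds_in_favour(favourable: int, total: int) -> Fraction:
    """Ways of getting the outcome relative to the ways of NOT getting it."""
    return Fraction(favourable, total - favourable)

# Hypothetical example: rolling a six with one fair die.
p_six = probability(1, 6)        # 1 way out of 6 outcomes
odds_six = odds_in_favour(1, 6)  # 1 way for, 5 ways against
print(p_six, odds_six)
```

The same event therefore has probability 1/6 but odds of 1 to 5.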
Statistical Distributions
Excel Formulas
= RAND()
→ Generates a random number between 0 and 1. Pressing F9 recalculates the
sheet, so the random numbers are re-generated.
= COUNTIF(A1:A20;1)
→ “A1:A20” refers to the range of data.
→ “1” refers to what you’re looking for.
Pressing F4 while editing a cell reference makes it absolute (e.g. $A$1:$A$20), so
the range stays locked when the formula is copied to other cells.
Fair Game
Nobody is expected to win and nobody is expected to lose in the long run.
(Example):
I run a stall where people bet on the number that comes up when a die is thrown.
Each bet costs R1. If Amy bets R1 on each of the 6 numbers, she pays R6 in total and
exactly one of her bets wins. For the game to be fair, that winning bet should pay
out R6: what she pays = what she wins.
Win Percentage
House Advantage
Expected Gain/Loss
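$\text{Initial Bet} = R2$
$\Pr(\text{Winning}) = \left(\frac{1}{12}\right) \times 2$
$\text{Total Payout} = 11 \times R2 = R22$
Therefore:
$E(\text{G or L}) = 2 - \left(\left(\frac{1}{12}\right) \times 22\right)$
$E(\text{G or L}) \approx 0.167$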
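The calculation above can be checked with a few lines of Python. This sketch simply reproduces the final formula, taking the R2 bet, the 1/12 factor and the R22 payout as given; the variable names are assumptions made here.

```python
from fractions import Fraction

bet = 2                        # initial bet in rand
win_factor = Fraction(1, 12)   # the 1/12 factor used in the notes' formula
payout = 11 * bet              # total payout of R22

# Bet minus the expected payout, exactly as in the formula above.
expected = bet - win_factor * payout
print(expected, float(expected))
```

Working with exact fractions shows that the result is 1/6, which rounds to 0.167.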
Permutations
When a set of items can be arranged in different orders (like 6 people all changing
positions), we count the number of possible arrangements by working from the first
position to the last.
o The first position has 6 options, then the second has 5 (because there
is now somebody sitting in position one) and so on. We then multiply
the numbers: 6 × 5 × 4 × 3 × 2 × 1 = 720.
Combinations
Counting Rules
Counting Rule 1:
Arrangement of n objects without repetition:
= 𝑛!
Counting Rule 2:
Number of ways of ordering (order matters) n items chosen r at a time, without
repetition:
$\frac{n!}{(n - r)!}$
Counting Rule 3:
Number of ways of selecting (order doesn’t matter) r objects from a total of n
objects, without repetition:
$\frac{n!}{r!\,(n - r)!}$
Counting Rule 4:
Number of arrangements of n taken r at a time, with repetition:
$= n^r$
Always ask:
Does order matter?
Is repetition allowed?
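The four counting rules map directly onto functions in Python's standard `math` module. The values n = 6 and r = 2 below are hypothetical, chosen only to show the rules side by side.

```python
import math

n, r = 6, 2  # hypothetical example: 6 objects, taken 2 at a time

rule1 = math.factorial(n)  # Rule 1: arrange all n objects, no repetition
rule2 = math.perm(n, r)    # Rule 2: order matters, no repetition
rule3 = math.comb(n, r)    # Rule 3: order doesn't matter, no repetition
rule4 = n ** r             # Rule 4: order matters, repetition allowed

print(rule1, rule2, rule3, rule4)
```

Answering "does order matter?" and "is repetition allowed?" picks out which of the four lines applies.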
Conditional Probability
Bayes’ Theorem
$\Pr(A \mid B) = \frac{\Pr(B \mid A) \cdot \Pr(A)}{\Pr(B \mid A) \cdot \Pr(A) + \Pr(B \mid \bar{A}) \cdot \Pr(\bar{A})}$
We know that:
• $\Pr(Z) = 0,02$
• $\Pr(P \mid \bar{Z}) = 0,07$
• $\Pr(\bar{P} \mid Z) = 0,01$
1. Draw up a table.
2. Fill in the blocks that you can.
3. Perform calculations. Remember that the blocks inside the table all represent
intersections! Therefore, we’ll be multiplying.
o (Ex. for $\bar{P}$):
▪ We know that $\Pr(\bar{P} \mid Z) = 0,01$
• And we know that $\Pr(Z) = 0,02$
o Therefore, multiply the two numbers: $\Pr(\bar{P} \cap Z) = 0,01 \times 0,02 = 0,0002$
o Then for $P$, we go $\Pr(P \cap Z) = 0,02 - 0,0002 = 0,0198$
o Then we know that $\Pr(P \mid \bar{Z}) = 0,07$
▪ And we know that $\Pr(Z) = 0,02$ and therefore $\Pr(\bar{Z}) = 0,98$
• Therefore, $\Pr(P \cap \bar{Z}) = 0,07 \times 0,98 = 0,0686$
• Then, $\Pr(\bar{P} \cap \bar{Z}) = 0,98 - 0,0686 = 0,9114$

             $Z$       $\bar{Z}$   Total
$P$          0,0198    0,0686      0,0884
$\bar{P}$    0,0002    0,9114      0,9116
Total        0,02      0,98        1
de de gh
f f i
= ig
de
= 0.053
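The table method above can be reproduced in Python as a rough check of the cell values. The variable names are assumptions, and the final conditional probability Pr(Z | P) is an illustrative query chosen here, not a question taken from the notes.

```python
# Known probabilities from the table example (decimal commas written as points).
p_z = 0.02               # Pr(Z)
p_p_given_not_z = 0.07   # Pr(P | not Z)
p_not_p_given_z = 0.01   # Pr(not P | Z)

p_not_z = 1 - p_z        # Pr(not Z) = 0.98

# Every cell inside the table is an intersection: conditional x marginal.
cell_not_p_and_z = p_not_p_given_z * p_z             # 0.0002
cell_p_and_z = p_z - cell_not_p_and_z                # 0.0198
cell_p_and_not_z = p_p_given_not_z * p_not_z         # 0.0686
cell_not_p_and_not_z = p_not_z - cell_p_and_not_z    # 0.9114

# Row totals.
row_p = cell_p_and_z + cell_p_and_not_z              # 0.0884
row_not_p = cell_not_p_and_z + cell_not_p_and_not_z  # 0.9116

# Illustrative query (an assumption, not the notes' own question):
# a conditional probability is a cell divided by its row or column total.
p_z_given_p = cell_p_and_z / row_p
print(round(p_z_given_p, 4))
```

Once the table is complete, any conditional probability is one division away, which is the main appeal of the method.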
Week 3
Qualitative data: (Categorical/Nominal Data)
• No numbers
• More than two categories but without intrinsic order
Quantitative data: (Fully Numeric Data)
• Can be ranked
• Always has numbers
Ordinal data: falls between the two (semi-numerical).
o Categories that can be placed in order, but are still categorical
o The gaps between successive levels do not have to be the same size
▪ E.g. levels of satisfaction/education
Excel
= FREQUENCY(J2:J974; M20:M29)
Data array (J2:J974): the data from which you want to retrieve your answers.
Bins array (M20:M29): bins = categories. Therefore, if you want marks from 0 to
100% going like “0, 10, 20…”, these are your bins.
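For readers working outside Excel, the behaviour of FREQUENCY can be sketched in Python. The `frequency` helper and the sample marks are hypothetical. It follows Excel's convention that each value is counted in the first bin it does not exceed, with one extra slot at the end for values above the largest bin.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin: a value v lands in the first bin b with v <= b.

    Returns len(bins) + 1 counts; the last one holds values above every bin.
    """
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)
    for value in data:
        # bisect_left gives the index of the first edge that is >= value.
        counts[bisect_left(edges, value)] += 1
    return counts

marks = [5, 10, 12, 20, 35, 101]  # hypothetical data array
bins = [10, 20, 30]               # hypothetical bins array
counts = frequency(marks, bins)
print(counts)
```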
R² tells us the proportion of the variation in ‘y’ which can be explained by the
variation in ‘x’.
Exploratory Data Analysis
Source: ww.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf
In a numerical dataset of size ‘n’, sorted from smallest to largest, the smallest
number has rank 1.
o $x_{(r)}$ denotes the number with rank r.
o $x_{(r+1/2)}$ denotes the number half-way between the numbers with rank r and
rank r+1.
o Median: $x_{(m)}$ is the number with rank m = (n+1)/2
▪ Divides the data into 2 equal halves
• If ‘n’ is even, the sample median is the average of the
two middle observations.
• If ‘n’ is odd, the sample median is the middlemost
observation.
• Is “robust”: not sensitive to outliers.
o Lower quartile: $x_{(l)}$ is the number with rank l = ([m]+1)/2
▪ Where ‘m’ = rank of the median. “[m]” means that if ‘m’ =
something and a half, we drop the half.
▪ LQ (if it were representing a mark for a test) is the mark
below which the lowest 25% of students scored.
o Upper quartile: $x_{(u)}$ is the number with rank u = n − l + 1
▪ UQ (if it were representing a mark for a test) is the mark
below which 75% of the students’ marks lie.
Five-number summary:
$x_{(1)},\ x_{(l)},\ x_{(m)},\ x_{(u)},\ x_{(n)}$
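The rank rules above translate into a short Python sketch. The function names and the ten sample values are assumptions made for illustration. The code follows the notes' rank definitions rather than Excel's QUARTILE interpolation, so the two may give slightly different quartiles.

```python
import math

def value_at_rank(sorted_data, rank):
    """Return x(rank); a rank ending in .5 lies half-way between two neighbours."""
    lower = math.floor(rank)
    if rank == lower:
        return sorted_data[lower - 1]            # ranks start at 1
    return (sorted_data[lower - 1] + sorted_data[lower]) / 2

def five_number_summary(data):
    xs = sorted(data)
    n = len(xs)
    m = (n + 1) / 2              # rank of the median
    l = (math.floor(m) + 1) / 2  # rank of the lower quartile, [m] drops the half
    u = n - l + 1                # rank of the upper quartile
    return tuple(value_at_rank(xs, r) for r in (1, l, m, u, n))

summary = five_number_summary([7, 1, 9, 3, 5, 8, 2, 6, 4, 10])
print(summary)
```

With n = 10 the ranks are m = 5.5, l = 3 and u = 8, so the median is the average of the 5th and 6th values.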
Box and whisker plots are useful when we want to compare two or more sets
of data; this is done by constructing the plots side-by-side. (Use same vertical
scale for all plots which are being compared)
Five-Number Summary in Excel
= 𝑄𝑈𝐴𝑅𝑇𝐼𝐿𝐸(𝑎𝑟𝑟𝑎𝑦; 𝑞𝑢𝑎𝑟𝑡)
For ‘quart’, put in 0-4 depending on which value you would like: 0 = minimum,
1 = lower quartile, 2 = median, 3 = upper quartile, 4 = maximum.
5-number summaries can identify unusually small/large values.
These are referred to as ‘strays’, or as ‘outliers’ if more extreme.
• Strays:
o On the lower side¹: values below Median − 3 × (Median − LQ)
o On the bigger side: values above Median + 3 × (UQ − Median)
• Outliers:
o On the lower side: values below Median − 6 × (Median − LQ)
o On the bigger side: values above Median + 6 × (UQ − Median)
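The stray and outlier thresholds can be applied mechanically, as in the sketch below. It assumes that a value counts as a stray when it lies more than 3 quartile-distances from the median, and as an outlier when it lies more than 6. The `classify` helper and the sample numbers are hypothetical.

```python
def classify(value, lq, median, uq):
    """Label a value as 'ordinary', 'stray' or 'outlier' using the 3x and 6x rules."""
    if value < median:
        distance = median - value
        unit = median - lq      # lower side uses Median - LQ
    else:
        distance = value - median
        unit = uq - median      # bigger side uses UQ - Median
    if distance > 6 * unit:
        return "outlier"
    if distance > 3 * unit:
        return "stray"
    return "ordinary"

# Hypothetical summary: LQ = 3, Median = 5.5, UQ = 8.
labels = [classify(v, 3, 5.5, 8) for v in (6, 14, 21, -3)]
print(labels)
```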
Summary Statistics
Numerical rather than graphical.
o Statistic: any quantity calculated from the data values of a sample
Measure of location describes any statistic which purports to locate the middle of
the data set. Here are the two measures of location:
1. The sample median
2. The sample mean (most important measure of location)
• Most useful with symmetric distribution in datasets.
• Predominant measure of location.
¹ Lower side: numbers smaller than the median.
Spread
Measure of spread gives insight into variability of the dataset. Three measures of
spread:
1. Range $R = x_{(n)} - x_{(1)}$
→ Unreliable measure of spread; depends only on the smallest and largest
values in the sample. Thus, it is the most sensitive to outliers.
o Non-robust
2. Interquartile Range $I = x_{(u)} - x_{(l)}$
→ Length of the interval covering the central half of the dataset. Therefore
not sensitive to outliers.
o Robust
3. Sample variance is the most-used measure of spread.
→ Easily manipulated algebraically.
http://www.stat.berkeley.edu/~stark/SticiGui/Text/location.htm
Week 4
Random Variable
$\Pr(X = x) = \Pr(x)$
Expected value of X.
Created through observing patterns.
o Weighted sum of all possible values of X
• Expected value of X also acts as the mean for probability density functions
and probability mass functions!
3C, 4C on p146
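Probability Mass Function (Discrete):
A: $E(X^2) = \sum \left[ x^2 \times p(x) \right]$
B: $E(X) = \sum \left[ x \times p(x) \right]$
$Var(X) = A - (B)^2$
Probability Density Function (Continuous):
A: $E(X^2) = \int_{-\infty}^{\infty} \left[ x^2 \times f(x) \right] dx$
B: $E(X) = \int_{-\infty}^{\infty} \left[ x \times f(x) \right] dx$
$Var(X) = A - (B)^2$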
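For the discrete case, the "A minus B squared" recipe is easy to check in Python. The fair six-sided die used here is a hypothetical example and not one taken from the notes.

```python
from fractions import Fraction

# Hypothetical PMF: a fair six-sided die, p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

a = sum(x**2 * p for x, p in pmf.items())  # A = E(X^2)
b = sum(x * p for x, p in pmf.items())     # B = E(X)
variance = a - b**2                        # Var(X) = A - B^2

print(b, a, variance)
```

Exact fractions avoid rounding, giving E(X) = 7/2 and Var(X) = 35/12 for the die.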
Coefficient of Variation
$C.V = \frac{\sqrt{Var(X)}}{E(X)}$
Expected Winnings/Loss
(Example): Buy one $10 raffle ticket for a new car valued at $15,000. Two
thousand tickets are sold. What is the expected value of your gain?

                     Win        Lose
Gain (x)             14990      −10
Probability P(x)     1/2000     1999/2000
P(x)·x               7.495      −9.995

Expected gain = 7.495 − 9.995 = −2.50, i.e. an expected loss of $2.50 per ticket.
$E(A + B) = E(A) + E(B)$
$E(A - B) = E(A) - E(B)$
$E(cA) = c\,E(A)$
$Var(cA) = c^2\,Var(A)$
Week 5
Probability Distribution
1) Uniform
2) Binomial
Uniform Distribution
Where ‘a’ is the lower bound and ‘b’ is the upper bound.
• Expected Value
$E(X) = \frac{b + a}{2}$
• Variance
$Var(X) = \frac{(b - a)^2}{12}$
• Graph of Uniform Distribution
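As a quick numerical check of the two uniform-distribution formulas, here is a minimal sketch. The bounds a = 2 and b = 8 are hypothetical values picked for easy arithmetic.

```python
a, b = 2, 8  # hypothetical lower and upper bounds

expected_value = (b + a) / 2   # E(X) = (b + a) / 2
variance = (b - a) ** 2 / 12   # Var(X) = (b - a)^2 / 12

print(expected_value, variance)
```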
Binomial Distribution
$\Pr(\text{Success}) = p$
$\Pr(\text{Failure}) = 1 - p$
• Random variable which records the number of successes in ‘n’ trials, with
probability ‘p’ of success, where ‘p’ remains constant throughout.
(Each trial is independent of the previous one; they don’t influence one
another.) The probability associated with ‘X’ is:
$\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$ for $x = 0, 1, \dots, n$
$X \sim B(n, p)$: “X is distributed according to the binomial distribution with
parameters ‘n’ and ‘p’.”
$\Pr(X \ge 3) = 1 - \Pr(X < 3)$
or
$\Pr(A) = 1 - \Pr(\bar{A})$
∴ $\Pr(X \ge 3) = 1 - \Pr(X = 2) - \Pr(X = 1) - \Pr(X = 0)$
• Expected Value
$E(X) = n \times p$
• Variance
$Var(X) = (n \times p) \times (1 - p)$
Where:
‘n’ = number of repetitions
‘p’ = Pr(Success)
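The complement trick for Pr(X ≥ 3) can be sketched in Python using the standard binomial probability formula. The helper name `binomial_pmf` and the values n = 5, p = 0.5 are assumptions chosen for illustration.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Pr(X = x) for X ~ B(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.5  # hypothetical: 5 independent trials, success probability 0.5

# Pr(X >= 3) = 1 - Pr(X = 2) - Pr(X = 1) - Pr(X = 0)
p_at_least_3 = 1 - sum(binomial_pmf(x, n, p) for x in range(3))

expected_value = n * p          # E(X) = n * p
variance = n * p * (1 - p)      # Var(X) = n * p * (1 - p)

print(p_at_least_3, expected_value, variance)
```

Using the complement means three terms are added instead of the probabilities for X = 3, 4 and 5; the saving grows quickly as n gets larger.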
Week 6
Probability Distributions
Poisson Distribution
This distribution describes events which occur randomly at a constant average
rate of occurrence.
• Events which occur randomly at an average rate of occurrence are said to occur
according to a ‘Poisson Process’.
• Expected Value
$E(X) = \lambda$
• Variance
$Var(X) = \lambda$
• Graph
Exponential Distribution
- Ensure lambda uses the same units as the question being asked! Convert lambda
first, and then use the converted lambda when answering the question (e.g. a rate
of 6 per hour is a rate of 0.1 per minute).
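The unit warning above is easiest to see with numbers. The sketch below assumes the standard exponential formula Pr(X ≤ t) = 1 − e^(−λt), which is not written out in this part of the notes. The rate of 6 events per hour and the 5-minute question are hypothetical.

```python
import math

rate_per_hour = 6                     # hypothetical average rate
rate_per_minute = rate_per_hour / 60  # convert lambda to the question's units

t_minutes = 5  # question: probability the next event occurs within 5 minutes

# Standard exponential CDF (an assumption here): Pr(X <= t) = 1 - exp(-lambda * t)
p_within_5_min = 1 - math.exp(-rate_per_minute * t_minutes)

# Mixing units (lambda per hour, t in minutes) gives a very different, wrong answer.
p_wrong_units = 1 - math.exp(-rate_per_hour * t_minutes)

print(round(p_within_5_min, 4), round(p_wrong_units, 4))
```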
Central Limit Theorem
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \times e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ for $-\infty < x < \infty$
$X \sim N(\mu, \sigma^2)$: random variable X has the normal distribution with the
parameters μ and σ².
σ = Sigma = Standard deviation
• μ tells us where the graph is located
Even though it’s a probability density function, the probabilities cannot be found
by integrating with a formula, because the integral has no closed form. Instead,
they are read from a Z table.
The probability of lying within a given number of standard deviations from
the mean is the same for all normal distributions, regardless of their
parameters.
Hence, area A = area B.
Example
When finding a probability from the Z table, go down the first column to the row
for your z-value (to one decimal place), then along the top row to the column for
the second decimal place; the probability is where that row and column meet.
Things to Remember
$\Pr(Z < -\tfrac{1}{3}) = \Pr(Z > \tfrac{1}{3})$
$\Pr(Z < -a) = \Pr(Z > a)$
Subtracting:
Adding:
$aX_1 \sim N(a\mu, a^2\sigma^2)$
Lower Quartile
$-0,67 = \dfrac{X - \mu}{\sqrt{\sigma^2}}$
Upper Quartile
$0,67 = \dfrac{X - \mu}{\sqrt{\sigma^2}}$
Example of Normal Distribution Question
What is the probability of X lying between 4 and 14, i.e. $\Pr(4 < X < 14)$, with
$X \sim N(10, 2)$?
Because we essentially have two ‘X’ values, we need to find the z-values for both:
                Lower Bound                               Upper Bound
z-value         $z = \dfrac{4 - 10}{\sqrt{2}} = -4.24$    $z = \dfrac{14 - 10}{\sqrt{2}} = 2.83$
Z-table area    0.49998                                   0.49767
$\therefore \Pr(4 < X < 14) = 0.49998 + 0.49767 = 0.99765$
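The same question can be checked without a Z table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; it assumes that the 2 in N(10, 2) is the variance, so the standard deviation is √2.

```python
from math import sqrt
from statistics import NormalDist

mu, variance = 10, 2
X = NormalDist(mu=mu, sigma=sqrt(variance))

# z-values for the two bounds, as in the manual calculation.
z_lower = (4 - mu) / sqrt(variance)
z_upper = (14 - mu) / sqrt(variance)

# Pr(4 < X < 14) is the area under the curve between the two bounds.
prob = X.cdf(14) - X.cdf(4)
print(round(z_lower, 2), round(z_upper, 2))  # -4.24 2.83
```

The value of `prob` should agree with the Z-table answer to about three decimal places; small differences come from the rounding in the table.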
$\bar{X}$ is a random variable: the sample mean varies from sample to sample.
The probability distribution of a statistic is called a sampling distribution.
The distribution of the sum of ‘n’ normally distributed random variables is given by:
$\sum X_i \sim N(n\mu, n\sigma^2)$
Where ‘mu’ is whatever the true population mean is.
And we know that the variance of the sum of the X’s is equal to $n\sigma^2$.
Hence,
$Var(\bar{X}) = \dfrac{1}{n^2} \times (n\sigma^2) = \dfrac{\sigma^2}{n}$
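The result that the variance of the sample mean is σ²/n can be illustrated with a small simulation. This is only a sketch: the population values (µ = 10, σ = 2), the sample size n = 25, the number of repeated samples and the random seed are all arbitrary choices for illustration.

```python
import random
from statistics import variance

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma, n = 10.0, 2.0, 25  # hypothetical population and sample size
num_samples = 5000            # number of repeated samples to draw

# Draw many samples of size n and record the mean of each one.
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(num_samples)
]

observed_var = variance(sample_means)  # variance of the sample means
theoretical_var = sigma**2 / n         # sigma^2 / n = 0.16
print(round(observed_var, 3), theoretical_var)
```

The observed variance of the sample means should be close to 0.16, and it moves closer as the number of repeated samples increases.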
Videos to help:
http://www.statisticshowto.com/central-limit-theorem-examples/
Percentage Point Notation
$z_{0.1}$: 10% of the distribution lies to the right of the $z_{0.1}$ value. Hence, 0.5 − 0.1 = 0.4
Therefore the corresponding $z_{0.1}$ value can be found by looking in the Z table for
the entry which is as close to 0.4 as possible
• Closest entry = 0.3997
• Z-score which corresponds to this is 1.28.
o Therefore, $z_{0.1} = 1.28$
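The table lookup for a percentage point can be reproduced with the inverse of the standard normal cumulative distribution. This is a minimal sketch using `statistics.NormalDist`; the function name `upper_percentage_point` is illustrative.

```python
from statistics import NormalDist

def upper_percentage_point(alpha):
    """Return z such that a proportion alpha of N(0, 1) lies to the right of z."""
    return NormalDist().inv_cdf(1 - alpha)

z_010 = upper_percentage_point(0.10)
print(round(z_010, 2))  # 1.28
```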
Confidence Intervals
A confidence interval reports an estimate as an ‘interval’ rather than as a single
‘point estimate’.
Point Estimate
• Just a number
o No information with regards to how uncertain the estimate is.
Interval
• Range of values
o Shows how much certainty is in the estimate.
A wide interval shows that we are not very sure what the true value is: it could
lie anywhere in a large range. A narrow interval shows that we are fairly sure.
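$\text{Standard Error} = \sqrt{\dfrac{\sigma^2}{n}}$
$\text{Confidence interval} = \left( \bar{X} - ZValue^{\alpha/2} \times \sqrt{\dfrac{\sigma^2}{n}} \; ; \; \bar{X} + ZValue^{\alpha/2} \times \sqrt{\dfrac{\sigma^2}{n}} \right)$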
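A confidence interval for the mean with known population variance can be sketched as follows. The sample mean (50), variance (16) and sample size (64) are hypothetical values, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, variance, n, confidence=0.95):
    """Return (lower, upper) bounds: x_bar -/+ z * sqrt(variance / n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z-value with alpha/2 in each tail
    standard_error = sqrt(variance / n)
    return x_bar - z * standard_error, x_bar + z * standard_error

# Hypothetical sample: mean 50, known variance 16, n = 64 observations.
lower, upper = z_confidence_interval(x_bar=50, variance=16, n=64)
print(round(lower, 2), round(upper, 2))  # 49.02 50.98
```

Increasing n shrinks the standard error, which narrows the interval.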
• We want our estimate to lie within $Z \times \sqrt{\sigma^2 / n}$ units of the true mean.
• How confident do we want to be of our interval? (Z)
• How variable is the population? (σ)
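$L = Z \times \sqrt{\dfrac{\sigma^2}{n}}$
$n = \left( \dfrac{Z \times \sigma}{L} \right)^2$
$n = \dfrac{Z^2 \times \sigma^2}{L^2}$
Where ‘L’ is how close to the true mean you need your estimate to be.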
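The sample size formula can be sketched in Python as below. The inputs (Z = 1.96, σ = 10, L = 2) are hypothetical, and rounding up to the next whole observation is an assumption of this sketch (it guarantees the estimate is within L).

```python
from math import ceil

def required_sample_size(z, sigma, L):
    """Smallest whole n satisfying n >= (z * sigma / L)^2."""
    return ceil((z * sigma / L) ** 2)

# Hypothetical: 95% confidence (Z = 1.96), sigma = 10, within L = 2 of the mean.
n = required_sample_size(z=1.96, sigma=10, L=2)
print(n)  # 97
```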
Step 1:
Null hypothesis [H0] will always say that the true parameter = some
hypothesized value.
o H0 generally assumes no effect.
$H_0: \mu = \mu_0 = \text{some number}$
Step 2:
If we suspected that the true mean was greater than a particular number,
we would say:
$H_1: \mu > \text{some number}$ = 1-sided test
$H_1: \mu \neq \text{some number}$ = 2-sided test
Step 3:
Step 4:
If H0 is true then:
Now we take our observed sample mean and transform it using the Z
formula. This shows us how many standard errors above or below the
hypothesized mean the sample mean lies.
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
We compare the test statistic to the critical value which bounds the rejection region.
Step 6:
Conclusion:
2-Sided Test
$H_1: \mu \neq \text{some number}$
Type 1 Errors:
When we reject H0 erroneously (H0 is actually true).
o We can control this through α:
§ If α is small then it’s more difficult to reject H0, as the rejection
region is small. So the chance of making a type 1 error is small
when α is small.
Type 2 Errors:
When we fail to reject H0 erroneously (H0 is actually false).
Do they come from the same population with the same underlying true mean?
Subtracting Distributions
Step 1:
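$\therefore \text{Test Stat} = \dfrac{30 - 31}{\sqrt{\dfrac{4}{50} + \dfrac{9}{40}}}$
$\therefore \text{Test Stat} \approx -1.81$
Look 1.81 up in the Z table (by symmetry, the area for −1.81 is the same):
= 0.4649
$\therefore \text{P-Val} = (0.5 - 0.4649) \times 2 = 0.0702$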
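The two-sample calculation above can be reproduced in Python. In this sketch the numbers are read as sample means of 30 and 31, variances of 4 and 9, and sample sizes of 50 and 40; that reading is an interpretation of the fraction, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, var1, n1, mean2, var2, n2):
    """Test statistic for the difference between two means (known variances)."""
    return (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)

z = two_sample_z(mean1=30, var1=4, n1=50, mean2=31, var2=9, n2=40)

# Two-sided p-value: double the area in the tail beyond |z|.
p_value = 2 * NormalDist().cdf(-abs(z))
print(round(z, 2))  # -1.81
```

The computed p-value should be close to the table-based 0.0702; any small difference comes from rounding z to two decimal places before using the table.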
Week 9
The T-Distribution
Unknown Population Variances
If we have a sample of size “n” and we estimate $\sigma^2$ using $s^2$, then the test
statistic has a t-distribution.
- Estimating the variance makes the test statistic more variable.
o As the sample size increases, the distribution “peaks” and looks more
like a normal distribution; the shape of the distribution is determined by “n”.
“S” = sample standard deviation
Comparing to Z-Table
Step 1:
Null hypothesis [H0] will always say that the true parameter = some
hypothesized value.
o H0 generally assumes no effect.
If we suspected that the true mean was greater than a particular number,
we would say:
$H_1: \mu > \text{some number}$ (in the example, 3,5) = 1-sided test
$H_1: \mu \neq \text{some number}$ = 2-sided test
Step 3:
α = significance level
If H0 is true then:
Now we take our observed sample mean and transform it using the T
formula. This shows us how many standard errors above or below the
hypothesized mean the sample mean lies.
$T = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$
We compare the test statistic to the critical value which bounds the rejection region.
Conclusion.
If we didn’t want to choose a significance level/we weren’t given one, and we just
wanted to report the observed significance level/the p-value:
In order to do this we need to find the probability of our test statistic being
greater than 3,14. Hence, look in the t-table (in the row for the correct degrees
of freedom) for the largest value which the test statistic exceeds; this shows us
where the test statistic lies along the axis of the distribution.
- Our test statistic is 3,14 and in the table, 3,038 is the largest value which
the test statistic exceeds.
- This means that the probability of getting a test statistic bigger than 3,14
is less than 0,0025. Therefore the p-value is < 0,0025.
P-Value with Two-Sided
We would need to double the probability, since in our example we would be
looking for the probability of observing a test statistic greater than 3,14 or less than
−3,14.
- Probability/p-value would be < 0,005
o Because the p-value is small, it provides strong evidence
to reject H0.
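The table-lookup procedure for bounding a p-value can be sketched in Python. The row of critical values below is an assumed excerpt for illustration only: just the pair 3,038 and 0,0025 comes from these notes, so use the row from your own t-table for the correct degrees of freedom. The function name is hypothetical.

```python
def p_value_bound(test_stat, row, two_sided=False):
    """Upper bound on the p-value from one row of a t-table.

    row maps upper-tail probabilities (top row of the table) to critical values.
    Returns None if the test statistic exceeds none of the critical values.
    """
    exceeded = [prob for prob, crit in row.items() if abs(test_stat) > crit]
    if not exceeded:
        return None
    bound = min(exceeded)  # belongs to the largest critical value exceeded
    return 2 * bound if two_sided else bound

# Assumed t-table row (illustrative); only 0.0025 -> 3.038 is from the notes.
row = {0.05: 1.699, 0.025: 2.045, 0.01: 2.462, 0.005: 2.756,
       0.0025: 3.038, 0.001: 3.396, 0.0005: 3.659}

one_sided = p_value_bound(3.14, row)                  # p-value < 0.0025
two_sided = p_value_bound(3.14, row, two_sided=True)  # p-value < 0.005
print(one_sided, two_sided)
```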
• P-Value Example
For example, suppose we are told that $\sum X = 36$ and there are $n = 6$ terms, and we are
told that the first 4 terms are 4, 10, 9 and 2, which add up to 25.
Hence, even though the last two terms are not given to us, we can conclude
that the last two terms must sum to 11, and they can be: 6&5, 7&4, 10&1 etc.
But the degrees of freedom here is 5: the first 5 terms can be various numbers,
but the last term must be the number which, when added to the first 5
terms, gives the required total of 36. Hence, there are 5 degrees of freedom.
For every parameter which we estimate before evaluating the current parameter
of interest, we lose one degree of freedom.
Some Things to Remember
Therefore, $s_1^2$ and $s_2^2$ can be viewed as estimates of the same true
variance; hence, we can combine the two sample variances to form $s_p^2$,
a pooled estimate of the true population variance.
$s_p^2$ is a weighted average of the two sample variances.
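The pooled estimate can be sketched using the standard pooled-variance formula, in which each sample variance is weighted by its degrees of freedom (n − 1). The formula itself is not written out in these notes, and the sample values below are hypothetical.

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Weighted average of two sample variances, weights (n1 - 1) and (n2 - 1)."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Hypothetical samples: variance 4 from 10 observations, variance 9 from 15.
sp_sq = pooled_variance(s1_sq=4.0, n1=10, s2_sq=9.0, n2=15)
print(round(sp_sq, 4))  # 7.0435
```

The pooled value always lies between the two sample variances, and sits closer to the variance from the larger sample.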
Comparing Two Means with the T-Distribution (Modified Approach)
o Go to the t-table, use the correct degrees of freedom and then find
the greatest number which the test statistic exceeds.
§ We then multiply that probability (seen in the top row) by 2 (if
two-sided test).
Week 10
Comparing Two Means in Paired Data Sets (6 Step Approach)
When two sets of data are dependent, we conduct a test using a single
sample of differences; dependent measures are known as repeated
measures.
All steps are the same up until step 5, just bear in mind that with the degrees of
freedom, “n” now refers to the number of pairs of data.
- Step 6: Conclude
If test statistic is more extreme than critical value, then we have enough
evidence to reject H0.
Since it is a modified test, we are just asking for the probability of getting a
test statistic as small as our test statistic or smaller.
Calculate P-Value
This is done by looking in the left-hand column for the correct degrees of
freedom, then you look for a critical value which the test statistic only just
exceeds. If this critical value was 3,601 for instance (at a 34 degrees of freedom
level), the p-value would be < or > 0,0005, depending on the question.
Use the “T.INV” formula. This returns the left-tailed t-distribution’s critical
value.
Use the “T.INV.2T” formula. This returns the two-tailed t-distribution’s critical
value.
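$\text{Confidence interval} = \left( \bar{d} - TValue_{n-1}^{\alpha/2} \times \sqrt{\dfrac{\sigma^2}{n}} \; ; \; \bar{d} + TValue_{n-1}^{\alpha/2} \times \sqrt{\dfrac{\sigma^2}{n}} \right)$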
Some Things to Remember
Example of a Question
$\bar{d} - TValue_{n-1}^{\alpha/2} \times \sqrt{\dfrac{\sigma^2}{n}}$
And to find the $TValue_{n-1}^{\alpha/2}$, we do the following:
- Since it is a 99% confidence interval, we are going to look in the top row
for 1% (they have the same t-values), but since it is a confidence
interval, we will look for 1%/2 = 0,005 due to the fact that the
interval has an upper and lower bound, over which the 1% needs to be
spread.
- Therefore, at 45 degrees of freedom, the t-value is 2,690 and we have
the following:
$9,2 - 2,690 \times \dfrac{14,2}{\sqrt{46}} = 3,568$
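The lower bound in this example can be verified with a short Python sketch. It reads 9,2 as the mean difference, 14,2 as the standard deviation of the differences and 46 as the number of pairs; the upper bound is not stated in the notes and is computed here only for completeness. The function name is illustrative.

```python
from math import sqrt

def paired_confidence_interval(d_bar, s_d, n, t_value):
    """Return (lower, upper) = d_bar -/+ t_value * s_d / sqrt(n)."""
    margin = t_value * s_d / sqrt(n)
    return d_bar - margin, d_bar + margin

# Values from the example: 46 pairs, so 45 degrees of freedom, t-value 2.690.
lower, upper = paired_confidence_interval(d_bar=9.2, s_d=14.2, n=46, t_value=2.690)
print(round(lower, 3))  # 3.568
```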
Goodness-of-Fit-Test: Whether Data Fits Various Distributions
We can use the 6-step approach to test whether some data fits various
distributions:
Step 1:
Step 2:
Step 3:
Step 4:
Chi-squared Distribution
- Similar to t-distribution; has degrees of freedom which influence the
shape of the distribution.
- However, the $\chi^2$ distribution is skewed to the right and is always
positive.
o Chi-squared distribution has its own table too.
Distribution of $\chi^2$ changing according to degrees of freedom.
(Example)
Step 5:
Because the probability of each outcome 1, 2, ..., 6 on a die is $\tfrac{1}{6}$, we can say that “If
H0 is true, we can expect $\tfrac{1}{6}$ of the total number of tosses to produce each
outcome.” (i.e. if the total number of tosses was 60: 10 x 1’s, 10 x 2’s etc.)
$D^2$ has approximately a $\chi^2$ distribution, provided that all of the expected
frequencies exceed 5.
Measure of discrepancy between what you have observed and what you
would’ve expected under H0.
- If test statistic is large then it means you observed something very
different to what you expect. Hence, a large test statistic provides
good evidence to reject H0.
- If test statistic is small then it means you observed something very
similar to what you expect. Hence, a small test statistic does not
provide good enough evidence to reject H0.
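The discrepancy measure can be sketched in Python using the standard chi-squared statistic, $D^2 = \sum (O - E)^2 / E$, which is not written out in these notes. The observed counts for 60 tosses below are hypothetical.

```python
def chi_squared_statistic(observed, expected):
    """D^2 = sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

total_tosses = 60
expected = [total_tosses / 6] * 6  # under H0 (fair die): 10 of each outcome
observed = [5, 8, 9, 8, 10, 20]    # hypothetical counts of 1's, 2's, ..., 6's

d2 = chi_squared_statistic(observed, expected)
print(round(d2, 1))  # 13.4
```

Most of this statistic comes from the single outcome with 20 observations, which is far from the expected 10.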
Step 6:
Conclude
Our test statistic (in the example) of 14,8 is bigger than the critical value 12,832
but it is also less than 15,086, the next critical value. Hence, we
have observed something which will occur with a probability
of less than 0,025 (but more than 0,01), assuming H0 is true.
When you are comparing the observed to the expected, in order to find the
expected, you need to use the H0 hypothesized distribution:
Some Things to Remember
Step 1:
Step 2:
Step 3:
Step 4:
Our test statistic ($D^2$) will follow a chi-squared distribution and will
compare observed and expected values.
$D^2$ has approximately a $\chi^2$ distribution, provided that all of the expected
frequencies exceed 5.
Step 5:
(Example) We are testing for an association between owning a die and the
outcome of a die.
(Example)
So now, we have the expected and observed values. Therefore our test
statistic (D2) can be calculated:
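For a test of association, one way to sketch the calculation in Python is shown here. It assumes the standard rule that each expected frequency is (row total × column total) / grand total, which is not written out in these notes; the 2 × 3 table of observed counts is hypothetical.

```python
def expected_frequencies(table):
    """Expected counts under H0 (no association) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_squared_statistic(observed, expected):
    """D^2 = sum over all cells of (observed - expected)^2 / expected."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[10, 20, 30],
            [20, 20, 20]]  # hypothetical observed counts

expected = expected_frequencies(observed)
d2 = chi_squared_statistic(observed, expected)
print(expected)      # [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]
print(round(d2, 3))  # 5.333
```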
Step 6:
Conclude
We would say something like: ‘in the case of the test statistic being bigger
than the critical value (i.e. lying in the rejection region); at the 5% significance
level, we have enough evidence to reject H0 and we can conclude that the outcome
of ____ does depend on ____.’
With a test statistic of 11,03 we look in our chi-squared distribution table and
can see, in the row for 5 degrees of freedom, that the largest critical value
which is still smaller than our test statistic is 9,236. This [9,236] corresponds to a
probability of 0,1.
- Because our test statistic is bigger than 9,236 but smaller than
11,070 (which corresponds to 0,05), our p-value is smaller than 0,1 but bigger than 0,05.
- P-Value
This formula will give you the p-value immediately; the test statistic is
calculated behind the scenes. All you need to have are both the expected
and observed values/tables.
- Critical Value
- Test Statistic
o Is calculated from the p-value.
(Example)
$R^2$ is often expressed as a percentage, e.g. 74% of the variation in y is
explained by x.
- Therefore:
o The higher the value of r/r², the stronger the relationship between
x and y.
o The lower the value of r/r², the weaker the relationship between x
and y.
Using X to Predict Y
They are statistics since they are computed from a sample. They also
follow a normal distribution.
The constants ‘a’ and ‘b’ are chosen in order to minimize the sum of
squared residuals (i.e. the sum of the squared differences between
the actual and predicted values of y).
A residual [e] is defined as the difference between the actual y and the
predicted value of y (the predicted value of y is denoted $\hat{y}$).
- $e = y - \hat{y}$
Sum of squared residuals $= \sum_{i=1}^{n} e_i^2$
We do not need to know the formula for calculating the regression coefficients; we rely
on Excel.
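Although the course relies on Excel for the coefficients, a small Python sketch can show what "minimizing the sum of squared residuals" means. It uses the standard least-squares formulas for ‘a’ and ‘b’ (not given in these notes), and the five data pairs are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x minimizing the squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of e^2, where each residual is e = y - (a + b*x)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # hypothetical x values
ys = [2, 4, 5, 4, 5]  # hypothetical y values

a, b = least_squares(xs, ys)
sse = sum_squared_residuals(xs, ys, a, b)
print(round(a, 2), round(b, 2), round(sse, 2))  # 2.2 0.6 2.4
```

Any other choice of intercept or slope gives a larger sum of squared residuals for the same data.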
$H_0: \beta = 0$
Step 2:
$H_1: \beta < 0$
$H_1: \beta > 0$
$H_1: \beta \neq 0$
Step 3:
Step 4:
Step 5:
Step 6:
Conclude:
X = the ‘predictor’ (independent)
Intercept = a value in ‘y = a + bx’
- Value you would expect to receive for your final mark if your year mark was 0.
Year Mark = b value in ‘y = a + bx’
- Intercept is 3,06 and slope is 0,82.
o You can expect your final mark (Y) to be 82% of your year mark, plus 3,06.
P-Value
- 1E-277 = $1 \times 10^{-277}$
- Calculated by:
o = T.DIST.2T(x; deg_freedom)
o x = t-stat
o deg_freedom = (number of pairs – 2)
Residuals:
- Difference between observed and expected for each observed pair of data.
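Using the fitted line from this example (intercept 3,06 and slope 0,82), a prediction and a residual can be sketched in Python. The year mark of 50 and the observed final mark of 48 are hypothetical values for illustration.

```python
def predict_final_mark(year_mark, intercept=3.06, slope=0.82):
    """Predicted final mark from the fitted line y = a + b*x."""
    return intercept + slope * year_mark

year_mark = 50       # hypothetical year mark
observed_final = 48  # hypothetical observed final mark

predicted = predict_final_mark(year_mark)
residual = observed_final - predicted  # observed minus expected
print(round(predicted, 2), round(residual, 2))  # 44.06 3.94
```

A year mark of 0 returns the intercept, matching the interpretation of ‘a’ given above.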
Hypothesis Test About 𝛽 Using the Modified Approach
Step 1:
$H_0: \beta = 0$
Step 2:
$H_1: \beta \neq 0$
Step 3:
Step 4:
Step 5:
Conclude
H0 said that there was no relationship between x and y.
Regression and Correlation
Correlation:
- Correlation Coefficient
Predicting values for one variable, given particular values for another variable.
- Regression analysis is only effective if there is a relationship
between the dependent and independent variables.
(Example)
Y = death rate
X = number of cigarettes (‘gwaais’) smoked per person
On Vula, read the slides in week 11 titled ‘Regression and Correlation Oct2015’.