Chapter 12
[1] A scatter plot can be a helpful tool in determining the strength of the relationship between two variables.

Linear model: y = β0 + β1·x + ε. The random error ε is used in the model because other, unspecified variables may also affect Y and there may be measurement error in Y.

We measure a model's fit by comparing the variance we can explain relative to the variance we cannot explain:
    Var(y) = Var(β0 + β1·x + ε) = Var(β1·x) + Var(ε)

Sample correlation coefficient (also called the Pearson correlation coefficient):
    r = Σ(xi − x̄)(yi − ȳ) / [ √Σ(xi − x̄)² · √Σ(yi − ȳ)² ]
If r = 0 then there is seemingly no linear correlation between the two variables x and y.

Simple regression equation: E(Y|x) = β0 + β1·x.  Estimated regression model: ŷ = b0 + b1·x.

Evaluation metrics for regression:
    Maximum error:          E∞ = max(1 ≤ i ≤ n) |ŷi − yi|
    Average error:          E1 = (1/n)·Σ|ŷi − yi|
    Root-mean-square error: E2 = √[(1/n)·Σ(yi − ŷi)²] = √[(1/n)·Σ(yi − b0 − b1·xi)²]

Ordinary least squares method (OLS) [Carl F. Gauss, 1809]: E2 is minimized when
    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   and   b0 = ȳ − b1·x̄
where x̄ = (Σ xi)/n and ȳ = (Σ yi)/n.
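The OLS formulas above can be checked with a short sketch; the x/y data below are invented purely for illustration:

```python
# Hedged sketch: compute r, b1, b0 directly from the formulas above.
# The data values are made up; nothing here comes from the text.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r = sxy / (sqrt(sxx) * sqrt(syy))   # Pearson correlation coefficient
b1 = sxy / sxx                      # OLS slope
b0 = ybar - b1 * xbar               # OLS intercept

# Root-mean-square error E2 of the fitted line
e2 = sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n)
```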
[2] Test for significant correlation using Student's t
Hypotheses: H0: ρ = 0; H1: ρ ≠ 0
Test statistic: t_calc = r·√[(n − 2)/(1 − r²)], compared with t_crit using d.f. = n − 2.

Sources of variation in y: In a regression, we seek to explain the variation in the dependent variable around its mean.
    Total variation about the mean:     Σ(yi − ȳ)² ≡ SST
    Variation explained by regression:  Σ(ŷi − ȳ)² ≡ SSR
    Unexplained (error) variation:      Σ(yi − ŷi)² ≡ SSE
    SST = SSR + SSE

Another evaluation metric for regression is the coefficient of determination:
    R² = SSR/SST = 1 − SSE/SST
This number represents the percent of variation explained.
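A minimal sketch of the t test for correlation; the correlation and sample size below are assumed values, not taken from the text:

```python
# Hedged sketch: t statistic for H0: rho = 0, using an assumed r and n.
from math import sqrt

r = 0.50   # assumed sample correlation (invented)
n = 30     # assumed sample size (invented)

t_calc = r * sqrt((n - 2) / (1 - r ** 2))  # d.f. = n - 2
# Compare t_calc with t_crit from Table D at d.f. = 28.
```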
[3] Critical value for the correlation coefficient
    r_crit = t_crit / √(t_crit² + n − 2)   with d.f. = n − 2
Compare r to r_crit. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use a line for prediction.

Test for significance: If the fit is good, SSE is relatively small compared to SST. A measure of fit is the standard error s_e = √[SSE/(n − 2)].

Slope:     H0: β1 = 0 vs H1: β1 ≠ 0. Test statistic t_calc = (estimated slope − 0)/(standard error of slope) = b1/s_b1, where s_b1 = s_e / √Σ(xi − x̄)². Confidence interval: b1 ± t_{α/2}·s_b1.
Intercept: H0: β0 = 0 vs H1: β0 ≠ 0. Test statistic t_calc = (estimated intercept − 0)/(standard error of intercept) = b0/s_b0, where s_b0 = s_e·√[1/n + x̄²/Σ(xi − x̄)²]. Confidence interval: b0 ± t_{α/2}·s_b0.

ANOVA note: In a simple regression, the F test always yields the same p-value as a two-tailed t test for zero slope, which in turn always gives the same p-value as a two-tailed test for zero correlation. The relationship between the test statistics is F_calc = t_calc².
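The identity F_calc = t_calc² can be verified numerically; the data are invented for illustration:

```python
# Hedged sketch verifying F_calc = t_calc^2 in simple regression.
# The x/y values are made up.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 4.5, 4.0, 6.5, 7.0, 8.5]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                  # OLS slope
b0 = ybar - b1 * xbar           # OLS intercept
yhat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # unexplained
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # explained

se = sqrt(sse / (n - 2))        # standard error
sb1 = se / sqrt(sxx)            # standard error of the slope
t_calc = b1 / sb1               # t test for zero slope, d.f. = n - 2
f_calc = (ssr / 1) / (sse / (n - 2))   # ANOVA F, d.f. = (1, n - 2)

assert abs(f_calc - t_calc ** 2) < 1e-9  # same test in two guises
```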
Caveat: In large samples, small correlations may be significant even if the scatter plot shows little evidence of linearity. Thus, a significant correlation may lack practical importance.

A few assumptions about the random error term ε are made when we use linear regression to fit a line to data: (1) the errors are normally distributed; (2) the errors have constant variance; (3) the errors are independent. Because we cannot observe the errors, we must rely on the residuals e1 = y1 − ŷ1, e2 = y2 − ŷ2, …, en = yn − ŷn from the estimated regression for clues about possible violations of these assumptions. While formal tests exist for identifying assumption violations, many analysts rely on simple visual tools to help them determine when an assumption has not been met and how serious the violation is. For more details, please consult the textbook.
Chapter 11: Analysis of Variance (ANOVA) Nguyen Thi Thu Van - November 25, 2023
One-way analysis of variance: One-way ANOVA is a technique used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.
For example, to allocate resources and fixed costs correctly, hospital management needs to test whether a
patient’s length of stay (LOS) depends on the diagnostic-related group (DRG) code. Consider the case of a
bone fracture. LOS is a numerical response variable (measured in hours). The hospital organizes the data by
using five diagnostic codes for type of fracture (facial, radius or ulna, hip or femur, other lower extremity, all
other). Type of fracture is a categorical variable.
Group means:    ȳ_j = (Σ_{i=1..n_j} y_ij) / n_j
Overall mean:   ȳ = (1/n)·Σ_{j=1..c} Σ_{i=1..n_j} y_ij = (1/n)·Σ_{j=1..c} n_j·ȳ_j
Sum of squares: Σ_j Σ_i (y_ij − ȳ)² = Σ_j n_j·(ȳ_j − ȳ)² + Σ_j Σ_i (y_ij − ȳ_j)²
                (total = sum of squares between groups + sum of squares within groups)

Data layout:
Treatments   T1               T2               …   Tc
             y11              y12              …   y1c
             y21              y22              …   y2c
             …
Group size   n1 observations  n2 observations  …   nc observations
Mean         ȳ1               ȳ2               …   ȳc
If the treatment means do not differ greatly from the grand mean, SSB will be relatively small and SSE will be relatively large (and conversely). The sums SSB and SSE may be used to test the hypothesis that the treatment means differ from the grand mean. However, to adjust for group sizes, we first divide each sum of squares by its degrees of freedom.

ANOVA assumptions:
• Observations on Y are independent.
• Populations being sampled are normal.
• Populations being sampled have equal variances.
In general, ANOVAs are considered fairly robust against violations of the equal-variances assumption as long as each group has the same sample size. However, if the sample sizes are not the same and this assumption is severely violated, you could instead run a Kruskal-Wallis test, which is the non-parametric version of the one-way ANOVA.

F-test method [Ronald A. Fisher in the 1930s]
Step 1: State the hypotheses: H0: μ1 = μ2 = ⋯ = μc vs H1: not all the means are equal.
Step 2: Specify the decision rule:
• Numerator: df1 = c − 1
• Denominator: df2 = n − c
• Find the critical value F_crit = F.INV.RT(α, df1, df2)
Step 3: Calculate F = MSB/MSE (≡ MSA/MSE), use the ANOVA table of interest, and make the decision. Otherwise, find p-value = F.DIST.RT(F, df1, df2).
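The steps above can be sketched from scratch; the three groups below are invented data, not the hospital example:

```python
# Hedged sketch of the one-way ANOVA F statistic, computed from the
# SSB/SSE decomposition above. The three groups are invented.
groups = [
    [60.0, 62.0, 65.0, 59.0],   # hypothetical group 1
    [70.0, 72.0, 68.0, 71.0],   # hypothetical group 2
    [64.0, 66.0, 63.0, 67.0],   # hypothetical group 3
]
c = len(groups)
n = sum(len(g) for g in groups)

grand_mean = sum(sum(g) for g in groups) / n
group_means = [sum(g) / len(g) for g in groups]

# sum of squares between and within groups
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

msb = ssb / (c - 1)   # df1 = c - 1
mse = sse / (n - c)   # df2 = n - c
f_calc = msb / mse    # compare with F.INV.RT(alpha, c - 1, n - c)
```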
Post hoc tests with ANOVA: To determine exactly which group means are different, we can perform a Tukey post hoc test [John W. Tukey, 1950s]. It is similar to a two-sample t test except that it pools the variances for all c samples: H0: μj = μk vs H1: μj ≠ μk, with d.f. = c for the numerator and n − c for the denominator. We reject H0 if T_calc > T_crit (Table T).

Test for homogeneity of variances: ANOVA assumes that observations on the response variable are from normally distributed populations that have the same variance. However, few populations meet these requirements perfectly, and unless the sample is quite large, a test for normality is impractical. But we can easily test the assumption of homogeneous (equal) variances with Hartley's test for c groups: H0: σ1² = σ2² = ⋯ = σc², with d.f. = c for the numerator and n/c − 1 for the denominator (approximated to the next lower integer if it is not an integer). We reject H0 if H_calc > H_crit (Table H).
Two-way analysis of variance: Two-way ANOVA is a technique used to determine the effect of two nominal predictor variables on a continuous outcome variable.
For example, a numerical response variable (paint viscosity) may vary both by temperature (Factor A) and by
paint supplier (Factor B). Three different temperature settings (A1, A2, A3) were tested on shipments from three
different suppliers (B1, B2, B3).
Chapter 10. Two-sample hypothesis tests Nguyen Thi Thu Van, November 26, 2022
Two-sample tests compare two sample estimates with each other, whereas one-sample tests compare a sample estimate with a non-sample benchmark or target (a claim or prior belief about a population parameter).
For example. A new bumper is installed on selected vehicles in a corporate fleet. During a 1-year test period, 12 vehicles with the new bumper were involved in accidents, incurring mean damage of $1,101 with a standard deviation of $696. During the same year, 9 vehicles
with the old bumpers were involved in accidents, incurring mean damage of $1,766 with a standard deviation of $838. Did the new bumper significantly reduce damage? Did it significantly reduce variation?
Basis of Two-Sample Tests. The logic of two-sample tests is based on the fact that two samples drawn from the same population may yield different
estimates of a parameter due to chance. For example, exhaust emission tests could yield different results for two vehicles of the same type. Only if the
two sample statistics differ by more than the amount attributable to chance can we conclude that the samples came from populations with different
parameter values, as illustrated in the adjacent picture.
Two-sample tests are especially useful because they possess a built-in point of comparison. You can think of many situations where two groups are to
be compared: Before versus after; Old versus new; or Experimental versus control. Sometimes we don’t really care about the actual value of the
population parameter, but only whether the parameter is the same for both populations.
Comparing two means

Variances known (σ1², σ2²):
    z_calc = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
    To find z_crit, look up in Table C.

Variances unknown, assumed equal:
    t_calc = (x̄1 − x̄2) / √(s_p²/n1 + s_p²/n2)   with d.f. = n1 + n2 − 2
    where s_p² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) is the pooled variance.
    Confidence interval: (x̄1 − x̄2) ± t_{α/2}·√[s_p²·(1/n1 + 1/n2)]

Variances unknown, assumed unequal:
    If n1, n2 ≥ 30: z_calc = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
    If n1, n2 < 30: t_calc = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
    with d.f. = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
    Confidence interval: (x̄1 − x̄2) ± t_{α/2}·√(s1²/n1 + s2²/n2)
    To find t_crit, look up in Table D or use an Excel function.

Comparing two proportions
    p1 = x1/n1, p2 = x2/n2
    z_calc = (p1 − p2) / √[ p_c·(1 − p_c)·(1/n1 + 1/n2) ]   where p_c = (x1 + x2)/(n1 + n2) is the pooled proportion
    Confidence interval: (p1 − p2) ± z_{α/2}·√[ p1(1 − p1)/n1 + p2(1 − p2)/n2 ]
    To find z_crit, look up in Table C.

Comparing two variances
    Assuming the populations are normal, the test statistic follows the F distribution, named for Ronald A. Fisher in the 1930s:
    F_calc = s1²/s2²   with df1 = n1 − 1, df2 = n2 − 1
    Critical values: F_R = F_{df1,df2} ≡ F.INV.RT(α/2, df1, df2);  F_L = 1/F_{df2,df1} ≡ 1/F.INV.RT(α/2, df2, df1)
    To find F_crit, look up in Table F or use an Excel function.
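A sketch of the unequal-variances test using the bumper example from the text (new bumper: n1 = 12, mean $1,101, s = $696; old bumper: n2 = 9, mean $1,766, s = $838):

```python
# Sketch of the unequal-variances (Welch) t test with the bumper data
# quoted in the text above.
from math import sqrt

n1, x1, s1 = 12, 1101.0, 696.0   # new bumper
n2, x2, s2 = 9, 1766.0, 838.0    # old bumper

v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
t_calc = (x1 - x2) / sqrt(v1 + v2)

# Welch degrees of freedom from the d.f. formula above
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# F test for equal variances (larger variance in the numerator)
f_calc = s2 ** 2 / s1 ** 2   # df1 = n2 - 1 = 8, df2 = n1 - 1 = 11
```

Compare t_calc with t_crit at the (rounded-down) Welch d.f., and f_calc with F.INV.RT(α/2, 8, 11).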
Paired t test [Student's t distribution]: When sample data consist of n matched pairs, a different approach is required. If the same individuals are observed twice but under different circumstances, we have a paired comparison. For example, weekly sales of Snapple at 12 Walmart stores are compared before and after installing a new eye-catching display. Did the new display increase sales?
    d̄ = (Σ d_i)/n;   s_d = √[ Σ(d_i − d̄)² / (n − 1) ];   t = (d̄ − μ_d) / (s_d/√n)   with d.f. = n − 1
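The paired statistics above can be sketched as follows; the before/after figures are invented, not the Snapple data:

```python
# Hedged sketch of the paired t test; the figures are made up.
from math import sqrt

before = [52.0, 48.0, 60.0, 55.0, 49.0, 58.0]
after  = [55.0, 51.0, 63.0, 54.0, 53.0, 62.0]

d = [a - b for a, b in zip(after, before)]   # paired differences
n = len(d)

dbar = sum(d) / n
sd = sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))
t_calc = (dbar - 0) / (sd / sqrt(n))   # H0: mu_d = 0, d.f. = n - 1
```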
It is worth bearing in mind three questions when you are comparing two samples:
(1) Are the populations skewed? Are there outliers? This question refers to the assumption of normal populations, upon which the tests are based.
(2) Are the sample sizes large (n ≥ 30)? Thanks to the Central Limit Theorem, the t test is robust to non-normality as long as the samples are not too small and the populations are not too skewed.
(3) Is the difference important as well as significant? A small difference in means or proportions could be statistically significant if the sample size is large, because the standard error gets smaller as the sample size gets larger.
Chapter 9. One-sample hypothesis tests Nguyen Thi Thu Van - August 15, 2023
Data are used in business every day to support marketing claims, help managers make decisions, and measure business improvement. Has a ski resort decreased its average response time to accidents? Did the proportion of satisfied car-repair customers increase after providing more training for the employees? Savvy businesspeople use data and many statistical tools to answer these types of questions. Apart from the tools that you have learned so far, hypothesis testing is one of the most widely used statistical tools. It is used in science and business to test assumptions and theories and to guide managers when facing decisions.

Hypothesis = a premise or claim that we want to test.

Steps in hypothesis testing.
Step 1: State the hypotheses to be tested. H0: null hypothesis = a currently accepted value for a parameter, constructed to be that of the status quo, versus H1: alternative hypothesis = the research hypothesis, which involves the claim to be tested. The two statements are hypotheses because the truth is unknown. Efforts will be made to reject the null hypothesis.

                   H0 is true          H0 is false
Reject H0          Type I error        Correct decision
Fail to reject H0  Correct decision    Type II error

Key terms       What is it?                          Symbol   Definition                          Also called
Type I error    Reject a true hypothesis             α        P(reject H0 | H0 is true)           False positive
Type II error   Fail to reject a false hypothesis    β        P(fail to reject H0 | H0 is false)  False negative
Power           Correctly reject a false hypothesis  1 − β    P(reject H0 | H0 is false)          Sensitivity

Step 2: Specify what level of inconsistency with the data will lead to rejection of the hypothesis. This is called a decision rule [Jerzy Neyman in the 1930s]. To specify our decision rule, we define an "extreme" outcome: the critical value is the boundary between the two regions (reject H0 and do not reject H0). The area under the sampling-distribution curve that defines an extreme outcome is called the rejection region.

Step 3: Collect data and calculate the necessary statistics to test the hypothesis, then make a decision. Should the hypothesis be rejected or not? The test statistic measures the difference between the sample statistic and the hypothesized parameter. A test statistic that falls in the rejection (shaded) region will cause rejection of H0; otherwise we fail to reject H0. In addition, we can use the p-value method [Ronald A. Fisher in the 1930s]: a p-value that is less than α will cause rejection of H0. In this case, we say the conclusion is statistically significant, i.e., it supports the theory being investigated and is not due to chance. The area of the non-rejection region (white area) is 1 − α.

Step 4: Take action based on the decision. This last step, taking action, requires experience and expertise on the part of the decision maker. Because resources are always scarce, we cannot do everything, and a decision therefore requires understanding not only statistical significance but also the practical importance of the potential improvement: the magnitude of the effect and its implications for product durability, customer satisfaction, budgets, cash flow, and staffing.
Methods used in one-sample hypothesis tests

One-sample tests for a mean:
    Population variance known:             z_calc = (x̄ − μ0) / (σ/√n)
    Population variance unknown, n ≥ 30:   z_calc = (x̄ − μ0) / (s/√n)   [or also t_calc = (x̄ − μ0)/(s/√n)]
    Population variance unknown, n < 30:   t_calc = (x̄ − μ0) / (s/√n)   with d.f. = n − 1

One-sample test for a proportion (normal approximation, valid when nπ0 ≥ 10 and n(1 − π0) ≥ 10):
    z_calc = (p − π0)/σ_p = (p − π0) / √[ π0(1 − π0)/n ]
    Confidence interval: p ± z_{α/2}·√[ π0(1 − π0)/n ]

Two-tailed test (H0: μ = μ0 vs H1: μ ≠ μ0):
    Critical values: z_{α/2} ≡ ±NORM.S.INV(α/2) or Table C;  t_{α/2} ≡ ±T.INV.2T(α, df) ≡ ±T.INV(α/2, df) or Table D with df = n − 1
    p-value: 2·P(Z > |z_calc|) = 2·[1 − NORM.S.DIST(|z_calc|, 1)] = 2·NORM.S.DIST(−|z_calc|, 1);  for t: 2·T.DIST(−|t_calc|, df)

Left-tailed test (H0: μ ≥ μ0 vs H1: μ < μ0):
    Critical values: z_α ≡ z_crit = −|NORM.S.INV(α)|;  t_α ≡ −|T.INV(α, df)|
    p-value: P(Z < −|z_calc|) = NORM.S.DIST(−|z_calc|, 1);  for t: T.DIST(−|t_calc|, df)

Right-tailed test (H0: μ ≤ μ0 vs H1: μ > μ0):
    Critical values: z_α ≡ z_crit = |NORM.S.INV(α)|;  t_α ≡ |T.INV(α, df)|
    p-value: P(Z > |z_calc|) = 1 − NORM.S.DIST(|z_calc|, 1);  for t: 1 − T.DIST(|t_calc|, df)

The same critical values and p-value formulas apply to the proportion test, using its z_calc.
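A sketch of the two-tailed proportion test above; the sample figures (56 satisfied customers out of 80, testing H0: π = 0.60) are invented:

```python
# Hedged sketch of a two-tailed one-sample z test for a proportion.
# The counts and the hypothesized proportion are made up.
from math import sqrt, erf

n, x, pi0 = 80, 56, 0.60
p = x / n

# check the normal-approximation conditions stated above
assert n * pi0 >= 10 and n * (1 - pi0) >= 10

z_calc = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

def phi(z):
    # standard normal CDF via erf (equivalent to NORM.S.DIST(z, 1))
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

p_value = 2 * (1 - phi(abs(z_calc)))   # two-tailed p-value
```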
Cautions: The critical value method described above requires that you specify your rejection criterion in terms of the test statistic before you take a sample. The p-value method is a more flexible approach that is often preferred by statisticians over the critical value method. It requires that you express the strength of your evidence (i.e., your sample) against the null hypothesis in terms of a probability. The p-value is a direct measure of the likelihood of the observed sample under H0. It answers the following question: if the null hypothesis is true, what is the probability that we would observe our particular sample mean (or one even farther away from μ0)? However, be cautious in usage: the p-value is not the probability of the null hypothesis being true! If we get this wrong, p-values by themselves can lead to misleading conclusions.
Topic: Bias and Random Error Nguyễn Thị Thu Vân, September 24, 2023
An estimator X̄ is a function θ of the sample used to estimate the population parameter μ.

Example: a population of about N = 230,000 workers in various manufacturers in HCMC in 2022. What was their yearly average savings μ? We draw a sample of n = 100 workers, x1, …, x100, and could use, for example:
    1) x̄ = θ(x1, …, x100) ≡ (Σ_{i=1..100} x_i)/n
    2) x̄ = θ(x1, …, x100) ≡ Σ_{i=1..100} w_i·x_i  such that Σ w_i = 1 and w_i ≥ 0
    3) and so on.
The value of θ corresponding to a particular sample is called an estimate for μ.

In statistics, there are many sources of error in collecting and sampling data. Error can be described as random or systematic.
Random error, or sampling error, is the difference between an estimate and the corresponding population parameter: x̄ − μ. Sampling error is an inevitable risk in statistical sampling, but it is a random error: the impact of random error/imprecision can be minimized with large sample sizes.

Systematic error, or bias, is the difference between the expected value of the estimator and the population parameter: E(X̄) − μ. The simplest example occurs with a measuring device that is improperly calibrated so that it consistently overestimates (or underestimates) the measurements by some amount.

Precision and accuracy are two ways that scientists think about error. The diagrams below attempt to differentiate between imprecision and inaccuracy.
Fig 1. Accuracy and precision   Fig 2. Accuracy and imprecision   Fig 3. Inaccuracy and precision   Fig 4. Inaccuracy and imprecision
As you can see, accuracy and precision are independent of each other. Error due to imprecision can be minimized with large sample sizes, but bias cannot be reduced even with a large sample size.
Properties of “good” estimators
- Unbiased estimators: the expected value equals the population parameter. For example, if x1, x2, …, xn is a random sample from a normal population N(μ, σ² = 1), then X̄ = (1/n)·Σ x_i² is an unbiased estimator of μ² + 1.
- Efficient estimators: the more efficient estimator has lower sampling variability than a competing estimator. Among all unbiased estimators, we prefer the minimum-variance estimator.
- Consistent estimators: with more observations, we get closer to the population parameter.

In conclusion, the bottom line so far is that you must be able to distinguish between systematic error and random error. Furthermore, as you have seen, each sampling is likely to come out with a different estimate. It is very useful to know a range in which the population parameter/true value will lie with high probability. This range will be called the confidence interval.
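The contrast between random error and bias can be simulated; all numbers below (true mean 50, bias +2, noise sd 10) are invented:

```python
# Hedged simulation: a larger sample shrinks random error (the spread
# of the estimates) but leaves bias untouched. All figures are made up.
import random
from statistics import mean, stdev

random.seed(42)
TRUE_MU, BIAS, NOISE_SD = 50.0, 2.0, 10.0

def one_estimate(n):
    # each measurement is systematically off by BIAS (miscalibration)
    return mean(TRUE_MU + BIAS + random.gauss(0, NOISE_SD) for _ in range(n))

small = [one_estimate(10) for _ in range(2000)]
large = [one_estimate(1000) for _ in range(2000)]

# spread (random error) shrinks roughly by sqrt(100) = 10x ...
spread_small, spread_large = stdev(small), stdev(large)
# ... but both sets of estimates stay centered near 52, not 50 (bias)
center_small, center_large = mean(small), mean(large)
```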
Chapter 7. Continuous probability distribution Nguyen Thi Thu Van - July 30, 2023
P(a ≤ X ≤ b) is the area of the region bounded by the probability density function f(x) over [a, b]. That is,
    P(a ≤ X ≤ b) = ∫_a^b f(x) dx
For a continuous random variable, the PDF is an equation that shows the height of the curve f(x) at each possible value of X.
    Expected value:      E(X) ≡ μ = ∫_{−∞}^{∞} x·f(x) dx
    Variance:            σ² = ∫_{−∞}^{∞} (x − μ)²·f(x) dx = E(X²) − [E(X)]²
    Standard deviation:  σ = √σ²
Normal distribution: data values near the mean are more frequent in occurrence than data values far from the mean. The normal distribution can approximate a binomial if nπ ≥ 10 and n(1 − π) ≥ 10 (continuity correction: use half-way cutoff points).

Exponential distribution: consider the process of customers arriving at a restaurant. If the count of customer arrivals in a randomly selected minute has a Poisson distribution, the distribution of the time between two customer arrivals will have an exponential distribution.

Examples
- Uniform: Assume the weight of a randomly chosen American passenger car is a uniformly distributed random variable ranging from 2,500 pounds to 4,500 pounds. What is the probability that a vehicle will weigh less than 3,000 pounds? X = the weight of a randomly chosen car, so X can take on fractional values on [2,500, 4,500].
- Normal: Assume that the number of calories in a McDonald's Egg McMuffin is a normally distributed random variable with a mean of 290 calories and a standard deviation of 14 calories. What is the probability that a particular serving contains fewer than 300 calories? X = the number of calories of a randomly chosen McMuffin, so X can take on fractional values around the mean.
- Normal approximation to the binomial: In a certain store, there is a .03 probability that the scanned price in the bar-code scanner will not match the advertised price. The cashier scans 800 items. What is the probability of at least 20 mismatches?
- Normal approximation to the Poisson: On average, 28 patients per hour arrive in the Foxboro 24-Hour Walk-in Clinic on Friday between 6 p.m. and midnight. What is the approximate probability of more than 35 arrivals?
- Exponential: Between 2 p.m. and 4 p.m., patient insurance inquiries arrive at Blue Choice insurance at a mean rate of 2.2 calls per minute. What is the probability of waiting less than 30 seconds for the next call? Let X be the waiting time for the next call; X ∈ [0, ∞).

Parameters and PDFs
- Uniform: a = lower limit, b = upper limit;  f(x) = 1/(b − a) over [a, b]
- Normal: μ, σ;  f(x) = (1/(σ√(2π)))·e^{−(1/2)·((x−μ)/σ)²};  standardized: z = (x − μ)/σ, f(z) = (1/√(2π))·e^{−z²/2}
- Exponential: λ = mean arrival rate per unit of time or space;  f(x) = λ·e^{−λx} over [0, ∞)
A discrete probability distribution assigns a probability to each value of a discrete variable and can be described either by
- a probability density/mass function (PDF/PMF) that shows the probability of each X-value, or
- a cumulative distribution function (CDF) that shows the cumulative sum of probabilities, adding from the smallest to the largest X-value.

Expected value of a discrete random variable:  E(X) ≡ μ = Σ_{i=1..N} x_i·P(x_i)
Variance:  Var(X) ≡ σ² = Σ_{i=1..N} (x_i − μ)²·P(x_i) = E(X²) − [E(X)]²
Standard deviation:  σ = √σ²
Uniform (discrete) distribution
- Description: describes a random variable with a finite number of equally likely consecutive integer values from a to b.
- Example: the daily 3-digit lottery has a uniform distribution with 1,000 equally likely outcomes ranging from 000 through 999: X = {000, …, 999}.
- Parameters: a = lower limit, b = upper limit.
- PMF: P(X = x) = 1/(b − a + 1);  CDF: P(X ≤ x) = (x − a + 1)/(b − a + 1), a ≤ x ≤ b.
- Mean: μ = (a + b)/2;  standard deviation: σ = √{ [ (b − a + 1)² − 1 ] / 12 }.
- Shape: symmetric with no mode.
- Characteristics: the uniform model is useful as a benchmark, to generate random integers for sampling, or in simulation models. For example, lotteries are frequently studied to make sure they are truly random.

Binomial distribution
- Description: describes the number of successes in n independent trials in which each trial is a Bernoulli experiment [Jakob Bernoulli (1654–1705)], i.e., the experiment has only two outcomes: success or failure. "Success" merely means the event of interest.
- Example: on average, 20 percent of the emergency-room patients at Greenwood General Hospital lack health insurance. In a random sample of four patients, what is the probability that two will be uninsured? X = the number of uninsured patients, X = {0, 1, 2, 3, 4}.
- Parameters: n = number of trials; π = probability of success.
- PMF: P(X = x) = [n!/(x!(n − x)!)]·π^x·(1 − π)^{n−x} ≡ BINOM.DIST(x, n, π, 0) or Table A;  CDF: BINOM.DIST(x, n, π, 1).
- Mean: μ = nπ;  standard deviation: σ = √(nπ(1 − π)).
- Shape: skewed right if π < 0.5, skewed left if π > 0.5, symmetric if π = 0.5.
- Characteristics: the trials are independent of each other; there are only two outcomes for each trial, success or failure; the probability of success π remains constant from trial to trial.

Poisson distribution
- Description: [Siméon-Denis Poisson (1781–1840)] describes the number of events occurring in a fixed interval of time or space. For the Poisson distribution to apply, the events must occur randomly and independently over a continuum of time or space.
- Example: at an outpatient mental-health clinic, appointment cancellations occur at a mean rate of 1.5 per day on a typical Wednesday. What is the probability that no cancellations will occur on a particular Wednesday? X = the number of cancellations on a particular Wednesday, X = {0, 1, 2, …}.
- Parameter: λ = mean arrivals per unit of time or space.
- PMF: P(X = x) = λ^x·e^{−λ}/x! ≡ POISSON.DIST(x, λ, 0) or Table B;  CDF: POISSON.DIST(x, λ, 1).
- Mean: μ = λ;  standard deviation: σ = √λ.
- Shape: always right-skewed, but less so for larger λ.
- Characteristics: an event of interest occurs randomly over time or space; the average arrival rate λ remains constant; the arrivals are independent of each other.

Hypergeometric distribution
- Description: similar to the binomial except that sampling is without replacement.
- Example: a statistics textbook chapter contains 60 exercises, 6 of which are essay questions. A student is assigned 10 problems. What is the probability that none of the questions are essay questions? X = the number of essay questions the student receives, X = {0, 1, 2, 3, 4, 5, 6}.
- Parameters: N = population size; n = sample size; s = number of successes in the population; x = number of successes in the sample.
- PMF: P(X = x) = (sCx · N−sCn−x) / NCn ≡ HYPGEOM.DIST(x, n, s, N, 0);  CDF: HYPGEOM.DIST(x, n, s, N, 1).
- Mean: μ = nπ where π = s/N;  standard deviation: σ = √(nπ(1 − π))·√[(N − n)/(N − 1)].
- Shape: symmetric if s/N = 0.5.
- Characteristics: the trials are not independent; the probability of success is not constant from trial to trial.

Geometric distribution
- Description: related to the binomial; describes the number of Bernoulli trials until the first success is observed.
- Example: at Faber University, 15 percent of the alumni (the historical percentage) make a donation or pledge during the annual telefund. What is the probability that the first donation will come within the first 5 calls? X = the number of calls made until the first success is achieved, X = {1, 2, 3, 4, …}.
- Parameter: π = probability of success.
- PMF: P(X = x) = π·(1 − π)^{x−1};  CDF: P(X ≤ x) = 1 − (1 − π)^x, x ≥ 1.
- Mean: μ = 1/π;  standard deviation: σ = √[(1 − π)/π²].
- Shape: highly skewed.
- Characteristics: there is at least one trial to obtain the first success, but the number of trials is not fixed.
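The four example probabilities above follow directly from the PMF/CDF formulas (equivalent to the Excel functions listed):

```python
# Sketch computing the example probabilities from the text above
# directly from the PMF/CDF formulas.
from math import comb, exp, factorial

# Binomial: n = 4 patients, pi = 0.20 uninsured; P(X = 2)
p_binom = comb(4, 2) * 0.20 ** 2 * 0.80 ** 2

# Poisson: lambda = 1.5 cancellations/day; P(X = 0)
p_poisson = 1.5 ** 0 * exp(-1.5) / factorial(0)

# Hypergeometric: N = 60 exercises, s = 6 essays, n = 10 assigned; P(X = 0)
p_hyper = comb(6, 0) * comb(54, 10) / comb(60, 10)

# Geometric: pi = 0.15; P(first donation within 5 calls) = P(X <= 5)
p_geom = 1 - (1 - 0.15) ** 5
```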
Chapter 5. Probability Nguyen Thi Thu Van - October 22, 2022
Descriptive statistics is used to summarize and describe data collected from, for example, scientific experiments or business processes. Inferential statistics uses the theory of probability
- to generalize the information of a sample to the wider population the sample comes from;
- to test hypotheses;
- to make predictions about the future.

A random experiment is an observational process whose results cannot be known in advance. For example, when a customer enters a Lexus dealership, will the customer buy a car or not? How much will the customer spend?

The set of all possible outcomes, denoted S, is the sample space for the experiment. An event is any subset of outcomes in the sample space. A simple event, or elementary event, is a single outcome. A compound event consists of two or more simple events.

The probability of an event is a number that measures the relative likelihood that the event will occur. The probability of an event A, denoted P(A), must lie within the interval from 0 to 1: 0 ≤ P(A) ≤ 1.

A discrete sample space S consists of all the simple events, denoted E1, E2, …, En: S = {E1, …, En}. Then
    P(S) = P(E1) + P(E2) + ⋯ + P(En) = 1,   0 ≤ P(Ei) ≤ 1
If the outcome of the experiment is a continuous measurement, the sample space cannot be listed, but it can be described by a rule. For example, the sample space for the length of a randomly chosen cell-phone call would be S = {X | X ≥ 0}, with total probability ∫ f(x) dx = 1 over S.

Three distinct approaches to assigning probability:
- An empirical approach: the probability is based on relative frequencies through observations or experiments: P(an event) = (number of occurrences of the event)/(number of observations). For example, there is a 2 percent chance of twins in a randomly chosen birth. The Law of Large Numbers is an important probability theorem, which says that as the number of trials increases, any empirical probability approaches its theoretical limit.
- A classical (a priori) approach: the probability is based on logic or theory, not on experience; such calculations are rarely possible in business situations. For example, there is a 50 percent chance of heads on a coin flip.
- A subjective approach: the probability is based on personal judgment or expert opinion. However, such a judgment is not random, because it is based on experience with similar events and knowledge of the underlying causal process. Thus, subjective probabilities have something in common with empirical probabilities. For example, there is an 80 percent chance that Vietnam will bid for the 2024 Winter Olympics.
Key terms, descriptions, and formulas

Rules of probability
- Complement of an event A, denoted A′, is every outcome except the event:  P(A) = 1 − P(A′).
- Union of two events A and B, A ∪ B, is all outcomes in either or both. General law of addition: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
- Intersection of two events A and B, A ∩ B, is only those outcomes in both. The probability of the intersection of two events is called the joint probability. General law of multiplication: P(A ∩ B) = P(A|B)·P(B).
- A and B are mutually exclusive (or disjoint) if their intersection is the empty set, i.e., one event precludes the other from occurring: P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B).
- Events are collectively exhaustive if their union is the entire sample space S (i.e., all the events that can possibly occur). Two mutually exclusive, collectively exhaustive events are binary (or dichotomous) events. For example, a car repair is either covered by the warranty or not covered by the warranty. Remember that there can be more than two mutually exclusive, collectively exhaustive events. For example, a Walmart customer can pay by credit card, debit card, check, or cash.
- Odds is the ratio of an event's probability to the probability of its complement. Odds for event A: P(A)/(1 − P(A)). Odds against event A: (1 − P(A))/P(A). If the odds against event A are quoted as b to a, then P(A) = a/(a + b).
Conditional probability
The probability of event A given that event B has occurred is a conditional probability, denoted P(A|B), which is read "the probability of A given B":
    P(A|B) = P(A ∩ B) / P(B)
Event A is independent of event B if and only if P(A) = P(A|B). Two events are independent when knowing that one event has occurred does not affect the probability that the other event will occur; otherwise they are called dependent. If A and B are two independent events, then P(A ∩ B) = P(A)·P(B). In general, if A1, …, An are independent, then P(A1 ∩ … ∩ An) = P(A1) × … × P(An). This law can be applied to system reliability. For example, suppose that a website has two independent file servers; if each has 99 percent reliability, what is the total reliability? The principle of redundancy: when individual components have low reliability, high reliability can still be achieved with massive redundancy.
Contingency table: A contingency table is a cross-tabulation of frequencies into r rows and c columns and is called an r × c table. The intersection of each row and column is a cell that shows a frequency. A contingency table is like a frequency distribution for a single variable, except that it has two variables (rows and columns). Contingency tables often are used to report the results of a survey. The marginal probability of an event is a relative frequency found by dividing a row or column total by the total sample size. Each cell of the r × c table is used to calculate a joint probability representing the intersection of two events. Conditional probabilities may be found by restricting ourselves to a single row or column.
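To make the marginal, joint, and conditional distinctions concrete, here is a minimal sketch using a hypothetical 2 × 2 table (the categories and frequencies are invented for illustration):

```python
# Hypothetical 2x2 contingency table: rows = gender, columns = payment type.
table = {
    ("male",   "card"): 30, ("male",   "cash"): 20,
    ("female", "card"): 40, ("female", "cash"): 10,
}
n = sum(table.values())  # total sample size = 100

# Joint probability: one cell divided by the total sample size.
p_male_card = table[("male", "card")] / n  # 0.3

# Marginal probability: a row (or column) total divided by the total.
male_total = sum(v for (row, col), v in table.items() if row == "male")
p_male = male_total / n  # 0.5

# Conditional probability: restrict attention to a single row.
p_card_given_male = table[("male", "card")] / male_total  # 0.6

print(p_male_card, p_male, p_card_given_male)  # 0.3 0.5 0.6
```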
Tree diagrams: Events and probabilities can be displayed in the form of a tree diagram or decision tree to help visualize all possible outcomes without using complicated formulas. A probability tree has two main parts: branches and ends (also called leaves). The junction points between branches and leaves are called nodes. The probability of each branch is generally written on the branch, while the outcomes are written on the ends. Drawing a tree diagram is a common business planning activity.
Bayes’ theorem: Bayes’ theorem [Thomas Bayes (1702–1761)] provides a method of revising probabilities to reflect new information. The prior (unconditional) probability of an event B is revised after event A has occurred to yield a posterior (conditional) probability. The theorem is a simple mathematical formula used for calculating conditional probabilities:
P(B | A) = P(A | B) P(B) / [P(A ∩ B) + P(A ∩ B′)] = P(A | B) P(B) / [P(A | B) P(B) + P(A | B′) P(B′)]
where B, B′ are mutually exclusive and collectively exhaustive.
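A hedged sketch of the formula in code. The disease-screening numbers below are a standard illustrative choice, not taken from the text:

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """Posterior P(B|A) via Bayes' theorem, where B and B' are
    mutually exclusive and collectively exhaustive."""
    p_not_b = 1 - p_b
    numerator = p_a_given_b * p_b
    return numerator / (numerator + p_a_given_not_b * p_not_b)

# Screening test: P(positive | disease) = 0.99, P(disease) = 0.01,
# P(positive | no disease) = 0.05.
posterior = bayes(0.99, 0.01, 0.05)
print(round(posterior, 4))  # 0.1667
```

Even with a very accurate test, the posterior probability of disease given a positive result is only about 17 percent, because the prior P(B) is so small; this is exactly the kind of revision Bayes’ theorem captures.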
Counting rules: A permutation is the number of arrangements of sampled items drawn from a population when order is important: nPr = n! / (n − r)!. A combination is the number of arrangements of sampled items drawn from a population when order does not matter: nCr = n! / ((n − r)! × r!).
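Python’s standard library implements both counting rules directly; a quick sketch (the values 5 and 3 are arbitrary):

```python
from math import perm, comb, factorial

# Permutations: order matters. nPr = n! / (n - r)!
print(perm(5, 3))  # 60

# Combinations: order does not matter. nCr = n! / ((n - r)! * r!)
print(comb(5, 3))  # 10

# The same values obtained from the factorial definitions:
assert perm(5, 3) == factorial(5) // factorial(5 - 3)
assert comb(5, 3) == factorial(5) // (factorial(5 - 3) * factorial(3))
```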
Chapter 3 and 4: Descriptive statistics Nguyen Thi Thu Van - June 6, 2023
Apart from the type of data and the sampling method, these are the characteristics we can ask about the data:
Center: Where are the data values concentrated? What seem to be typical or middle data values?
Central tendency is a statistical measure to determine a single score that defines the center of a distribution that is most typical or most
representative of the entire group.
Variability: How much dispersion is there in the data? How spread out are the data values? Are there unusual values?
Variability provides a quantitative measure of the differences between scores in a distribution and describes the degree to which the scores are
spread out or clustered together. The smaller the standard deviation, the more closely the data cluster about the mean.
Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal? Technically, the shape of a distribution is defined by
an equation that prescribes the exact relationship between each X and Y value on the graph. However, we will rely on a few less-precise terms that
serve to describe the shape of most distributions. Nearly all distributions can be classified as being either symmetrical or skewed.
Correlation: Is there an association between two variables?
Descriptive statistics refers to the collection, organization, presentation, and summary of data, either using charts and graphs (visual descriptions) or using a numerical summary (numerical descriptions).
Visual descriptions
The type of graph we use to display our data depends on the type of data we have. Some charts are better suited for numerical data (e.g., stem-and-leaf displays, dot plots, histograms), while others are better for displaying categorical data (e.g., bar charts, pie charts, Pareto charts).
Stem-and-leaf displays: A stem-and-leaf display is basically a frequency tally, except that we use digits instead of tally marks: each data value is split into a leaf (the last digit) and a stem (the other digits). A stem-and-leaf display can reveal central tendency (say, at the stem that has the biggest frequency), dispersion (say, the data range), as well as the shape of the distribution. Its disadvantage is that it works well for small samples of integer data with a limited range but becomes awkward when you have decimal data (e.g., $60.39) or multidigit data (e.g., $3,857). In such cases it is necessary to round the data to make the display work.
Dot plots: A dot plot is a graphical display of n individual values of numerical data on a horizontal axis. The basic steps in making a dot plot: (1) sort the data in ascending order; (2) make a scale that covers the data range; (3) mark axis demarcations and label them; (4) plot each data value as a dot above the scale at its approximate location. If more than one data value lies at approximately the same X-axis location, the dots are piled up vertically. A stacked dot plot can be used to compare two or more groups of data. A dot plot can reveal the central tendency (say, where the data values tend to cluster and where the midpoint is), dispersion, and the shape of the distribution when the dataset is large enough. The disadvantage is that dot plots don’t reveal very much about the dataset’s shape when the sample is small, and they become awkward when the sample is large.
Histograms: Histograms are used to describe datasets of any size visually. The basic steps to construct a histogram: (1) sort the data in ascending order; (2) choose the number of bins [the number of bins is a matter of judgement; if there is no other specific requirement, we can use Sturges’ rule]; (3) set the bin limits; (4) put the data values into the appropriate bins; (5) create the table; (6) sketch a bar chart whose Y-axis shows the number of data values (or a percentage) within each bin and whose X-axis ticks show the end points of each bin.
Effective Excel charts: Line charts, column charts, bar charts, pie charts, scatter plots, Pareto charts, tables, and pivot tables. A Pareto chart is commonly used in quality management to display the frequency of defects or errors of different types. Most quality problems can usually be traced to only a few sources or causes, so sorting the categories in descending order helps managers focus on the vital few causes of problems rather than the trivial many. A scatter plot displays pairs of observations on the xy-plane to show relationships between variables (say, income and age). A pivot table is a powerful tool to calculate, summarize, and analyze data.
Numerical descriptions
We can assess a dataset in a general way from a dot plot or histogram. However, to describe datasets more precisely and convincingly, we need numerical statistics.
Center: 6 measures of center: mean, median, mode, midrange, geometric mean, trimmed mean. The empirical relationship between mean, median, and mode: mean − mode = 3 (mean − median).
Variability: 5 measures of variability: range, sample variance, sample standard deviation, coefficient of variation, mean absolute deviation.
Shape: Instead of relying on histograms, we can use statistical measures like skewness and kurtosis to gain more precise inferences about the shape of the population being sampled. Skewness: skewed left (skewness < 0), skewed right (skewness > 0), and symmetric (skewness = 0). Kurtosis (Ku): mesokurtic (Ku = 0), platykurtic (Ku < 0), leptokurtic (Ku > 0).
Correlation/Covariance: Instead of relying on scatter plots, we can use a statistical measure called the correlation coefficient to assess the degree of linear correlation between two variables (−1 ≤ r < 0 negative correlation, 0 < r ≤ 1 positive correlation, r = 0 no correlation). The covariance measures the degree to which two variables move together (if σXY > 0, X and Y move in the same direction; if σXY < 0, in opposite directions; if σXY = 0, they are unrelated).
z-score: A z-score or standardized score is used to (1) identify and describe the exact location of each score, that is, how far from the mean the score lies in a distribution, by using the mean and the standard deviation, which are the important measures for describing an entire distribution of data. For example, assume that you received a score of 76 on a statistics exam. How did you do if μ = 70? Obviously, your score is 6 points above the mean, but you still do not know exactly where it is located: you may have one of the highest scores in the class or may not, so 6 points may be a relatively big distance or a small one. At this stage you need the standard deviation to identify the relative distance from the score to the mean. (2) standardize an entire distribution. For example, there are many tests for measuring IQ, and the tests are usually standardized so that they have a mean of 100 and a standard deviation of 15. This helps us understand and compare IQ scores even though they come from different tests.
Empirical rule / Chebyshev’s theorem: If you have a mean and standard deviation, you might need to know the proportion of values that lie within, say, plus and minus two standard deviations of the mean. If your data follow the normal distribution, that is easy using the Empirical Rule. However, if you don’t know the distribution of your dataset, or know that it doesn’t follow the normal distribution, use Chebyshev’s theorem.
Percentiles: To indicate the percentage of data values that fall below a particular value, we can use percentiles. A dataset can be divided into 100 groups (percentiles), 10 groups (deciles), 5 groups (quintiles), or 4 groups (quartiles). Percentiles and percentile ranks can be determined directly from the frequency distribution table; intermediate values not reported in the table can be found by a procedure called interpolation.
Grouped data: Grouping plays a significant role when we have to deal with large data. Data formed by arranging individual observations of a variable into groups, so that a frequency distribution table of these groups provides a convenient way of summarizing or analyzing the data, is termed grouped data.
Boxplot: A boxplot is a useful tool in exploratory data analysis for graphically depicting groups of numerical data through their quartiles and for detecting outliers (say, data values that differ greatly from the majority).
Chapter 2: Data collection Nguyen Thi Thu Van - September 19, 2022
Data terminology
An observation is a single member of a collection of items that we want to study, such as a person, firm, or region. A variable is a
characteristic of the subject or individual, such as name, age, income. A dataset consists of all the values of the observations we have
chosen to observe. Data usually are entered into a spreadsheet or database as an 𝑛 × 𝑚 matrix: univariate datasets (one variable);
bivariate datasets (two variables); and multivariate datasets (more than two variables). The questions that can be explored and the
data analytical techniques that can be used will depend upon the data type and the number of variables.
Data type
Categorical / Qualitative data: Categorical data have values that are described by words rather than numbers (e.g., gender, eye color, hair color). Because categorical variables have non-numerical values, on occasion the values of a categorical variable may be represented using numbers; this is called coding. But coding a category as a number does not make the data numerical, and the number does not typically imply a rank.
Numerical / Quantitative data: Numerical data arise from counting, measuring something, or some kind of mathematical operation. Numerical data can be broken down into two types: discrete (i.e., variables with a countable number of values, like the number of credits or the number of passengers in a flight …) and continuous (i.e., variables with values within an interval, like height, weight, time, income …).
Time-series data is a sequence of data points collected over time intervals, allowing us to track changes over milliseconds, days, or even years. For time-series data, we are interested in the trends, or the pattern over time.
Cross-sectional data is collected from many units (people, companies, countries, etc.) in a single time period. For cross-sectional data, we are interested in variation among observations or in relationships.
Sampling methods
Two main categories of sampling methods: random sampling (e.g., simple random sample, systematic sample, stratified sample, cluster sample) and non-random sampling (e.g., judgement sample, convenience sample, focus group). Sampling without replacement means that once an item has been selected to be included in the sample, it cannot be considered for the sample again. Sampling with replacement means that the same random number could show up more than once. Sampling with replacement does not lead to bias in our sample results. Note that when the population is finite and the sample size is close to the population size, we should not use sampling without replacement. When the sample is less than 5% of the population, the population is effectively infinite. A census is an examination of all items of the population, while a sample involves looking only at some items selected from the population.
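The with/without-replacement distinction maps directly onto two standard-library functions; a sketch on an invented population of 100 numbered items:

```python
import random

random.seed(42)  # fixed seed so this sketch is reproducible
population = list(range(1, 101))  # 100 numbered items

# Sampling WITHOUT replacement: no item can appear twice.
without = random.sample(population, k=10)
assert len(set(without)) == 10  # all 10 draws are distinct

# Sampling WITH replacement: the same item may show up more than once.
with_repl = random.choices(population, k=10)

# A systematic sample: every 10th item after a random starting point.
start = random.randrange(10)
systematic = population[start::10]

print(without)
print(with_repl)
print(systematic)
```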
Survey
Most survey research follows the same basic steps (note that these steps may overlap in time):
Step 1: State the goals of the research.
Step 2: Develop the budget (time, money, staff …).
Step 3: Create a research design (target population, frame, sample size).
Step 4: Choose a survey type and method of administration.
Step 5: Design a data collection instrument (questionnaire).
Step 6: Pretest the survey instrument and revise as needed.
Step 7: Administer the survey (follow up if needed).
Step 8: Code the data and analyze it.
Chapter 1 – Statistics Nguyen Thi Thu Van – September 19, 2022
Managers need reliable and timely information in order to analyze market trends and adjust to changing market conditions: the company’s internal operations (e.g., sales, production, inventory levels, warranty claims) and competitive position (e.g., market share, customer satisfaction, repeat sales). Companies increasingly are using business analytics to support decision making, to recognize anomalies that require tactical action, or to gain strategic insight to align business processes with business objectives. Statistics and statistical analysis permit data-based decision making and reduce managers’ need to rely on guesswork. Businesses that combine managerial judgment with statistical analysis tend to be more successful.
What’s statistics?
Statistics helps convert unstructured raw data that have been collected into useful information. It encompasses all the technologies for collecting, storing, accessing, and analyzing data on the company’s operations to make better business decisions. Some experts prefer to call statistics data science, a trilogy of tasks involving:
- Data modeling
- Analysis
- Decision making
Why study statistics?
Knowing statistics will make you a better consumer of other people’s data and data analyses. You should know enough to handle everyday data problems, to feel confident that others cannot deceive you with spurious arguments, and to know when you’ve reached the limits of your expertise. Here are some reasons for anyone to study statistics:
- Communication
- Computer skills
- Information management
- Technical literacy
- Process improvement
Statistics in Business
Some of the ways statistics are used in business:
- Auditing
- Marketing
- Health care
- Quality improvement
- Purchasing
- Medicine
- Operations management
- Product warranty
Statistical challenges
Common challenges facing business professionals using statistics:
- Imperfect data and practical constraints
- Business ethics
- Upholding ethical standards
- Using consultants
Critical thinking
Some common logical pitfalls abound in both the data process and the reasoning process: conclusions from small samples, conclusions from nonrandom samples, conclusions from rare events, and poor survey methods. Therefore, statistics is an essential part of critical thinking. It allows us to test an idea against empirical evidence. We use statistical tools to compare our prior ideas with empirical data (data collected through observations and experiments). If the data do not support our theory, we can reject or revise our theory.
There are two primary kinds of statistics:
• Descriptive statistics refers to the collection, organization, presentation, and summary of data (either using charts and graphs or using a numerical summary).
• Inferential statistics refers to generalizing from a sample to a population, estimating unknown population parameters, drawing conclusions, and making decisions.