Statistic Lecture2023
Table of Contents
Ch1: Fundamental Elements of Statistics
1.1 What are Statistics?
1.2 Why Study Statistics?
1.3 Who Uses Statistics?
1.4 Origin and Growth of Statistics
1.5 Four stages of statistical process
1.6 Functions of Statistics
1.7 Types of statistics
1.8 Types of Variables
1.9 Collecting Data and Obtaining Data
Ch2: Presentation of statistical data
2.1 some statistical terminology
2.2 Presentation of ungrouped data
2.3 Presentation of grouped data
Ch6: Probability
6.1 Introduction to Probability
6.2 Laws of Probability
6.3 Empirical Probability
The Addition Rules for Probability
The Multiplication Rules and Conditional
Probability
Conditional Probability
Chapter One
Fundamental Elements of Statistics
What are Statistics?
Statistics is the science of collecting, organizing, presenting, analyzing,
and interpreting numerical data to assist in making more effective decisions.
Functions of Statistics
There are many functions of statistics. Let us consider the following five
important functions:
1) Condensation:
2) Comparison:
3) Forecasting:
4) Estimation:
5) Tests of Hypothesis:
3) Qualitative or quantitative.
4) A sample is a portion, or part, of the population of interest.
A measurement is a number or attribute computed for each member of a population or of a
sample. The measurements of sample elements are collectively called the sample data.
Types of statistics
There are two main branches of statistics: descriptive and inferential. Descriptive
statistics is used to summarize and describe only the set of information that has been collected.
Inferential statistics is used to make predictions or comparisons about a larger group (a
population) using information gathered about a small part of that population. Thus, inferential
statistics involves generalizing beyond the data, something that descriptive statistics does not do.
1) Descriptive statistics: methods of organizing, summarizing, and presenting data in
an informative way.
2) EXAMPLE 1: The United States government reports the population of the United States
was 179,323,000 in 1960; 203,302,000 in 1970; 226,542,000 in 1980; 248,709,000 in 1990,
and 265,000,000 in 2000.
Types of Variables
Quantitative data are numerical measurements recorded on a naturally occurring
numerical scale. Quantitative data are further classified as either discrete or continuous.
Discrete data are numeric data that take a finite (or countable) number of possible values.
A classic example of discrete data is a finite subset of the counting numbers (1, 2, 3,
4, 5, 6, 7, 8), perhaps corresponding to a rating scale (Strongly Disagree ... Strongly Agree).
Continuous data can take infinitely many values within a range: 1.4, 1.41, 1.414, 1.4142, 1.41421…
The real numbers are continuous, with no gaps or interruptions. Physically measurable
quantities such as length, volume, time, and mass are continuous.
Qualitative data are measurements that cannot be measured on a natural numerical
scale; they can only be classified into one of a group of categories. They consist of
attributes, labels, or other nonnumeric characteristics.
Data analysis is a process of gathering, modeling, and transforming data with the
goal of highlighting useful information, suggesting conclusions, and supporting
decision making. Data analysis has multiple facets and approaches, encompassing
diverse techniques under a variety of names, in different business, science, and social
science domains.
Samples
A representative sample exhibits characteristics typical of those possessed by
the population of interest.
A random sample of n experimental units is a sample selected from the
population in such a way that every different sample of size n has an equal
chance of selection.
\[ \mu = \frac{\sum x}{N} \ \text{(for a population)} \qquad \bar{x} = \frac{\sum x}{n} \ \text{(for a sample)} \]
\[ \bar{x} = \frac{10 + 4 + 7 + 5 + 7 + 8 + 9}{7} = \frac{50}{7} = 7.14 \]
The two middle values are 4 and 5. The median is the average of these two values, or 4.5.
\[ \frac{19 + 13 + 15 + 25 + 18}{5} = \frac{90}{5} = 18 \]
When the mean is known and you must find a missing value, some simple rules of algebra must
be applied.
Example: Ali has received the following grades this term: 75, 87, 90, 88, and 79. If he wishes to
earn an 85 average, what must he score on his final test?
Set up the problem like this:
\[ \frac{75 + 87 + 90 + 88 + 79 + s}{6} = 85 \]
To solve:
1. Add the known values.
\[ \frac{419 + s}{6} = 85 \]
2. Next, we want to try to isolate the unknown (s) on one side of the equation. To do this we
must use inverse operations to eliminate the numbers on the side of the equation with the
unknown (this means we do the opposite of what is being done).
Start with the 6. Since we are dividing the expression 419 + s by the 6, we must now multiply it
by 6.
NOTE: Whatever you do to one side of the equation, you must do to the other side of the
equation as well. Therefore, I will multiply the 85 by 6 too.
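The remaining algebra (multiply both sides by 6, then subtract 419) can be checked numerically. The sketch below is only an illustration; the function name `required_score` is a hypothetical helper, not part of the lecture.

```python
def required_score(known_scores, target_mean):
    """Score needed on one more test so that the overall mean equals target_mean."""
    n_total = len(known_scores) + 1           # the known tests plus the final test
    required_total = target_mean * n_total    # multiply both sides by the count (85 * 6)
    return required_total - sum(known_scores) # subtract the known sum (510 - 419)


grades = [75, 87, 90, 88, 79]
s = required_score(grades, 85)
print(s)  # 91
```

With s = 91, the six scores sum to 510 and 510 / 6 = 85, as required.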
Notation:
\[ \bar{X} = \frac{\sum x}{n} \] is the mean of a set of sample values.
\[ \bar{X} = \frac{\sum (f x)}{\sum f} \] is the mean from a frequency distribution.
Here,
a = assumed mean
= 25 + (-10/ 110)
= 25 -( 1/11)
= (275-1)/11
15-25 2
25-35 4
35-45 4
45-55 7
55-65 11
65-75 6
Find the mean percentage of female employees by the assumed mean method.
Solution:
Percentage of female   Number of          Class mark   di = xi – a   fidi
employees (CI)         departments (fi)   (xi)
5-15                   1                  10           -30           -30
15-25                  2                  20           -20           -40
25-35                  4                  30           -10           -40
35-45                  4                  40 = a       0             0
45-55                  7                  50           10            70
55-65                  11                 60           20            220
65-75                  6                  70           30            180
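As a cross-check of the assumed mean method on the table above, here is a minimal Python sketch. It assumes the usual form of the method, mean = a + Σ(f·d) / Σf with d = x − a; the function name `assumed_mean` is illustrative only.

```python
def assumed_mean(values, freqs, a):
    """Arithmetic mean by the assumed mean method: a + sum(f * d) / sum(f), d = x - a."""
    total_f = sum(freqs)
    total_fd = sum(f * (x - a) for x, f in zip(values, freqs))
    return a + total_fd / total_f


class_marks = [10, 20, 30, 40, 50, 60, 70]   # class marks xi from the table
departments = [1, 2, 4, 4, 7, 11, 6]         # frequencies fi from the table
mean_pct = assumed_mean(class_marks, departments, a=40)
print(round(mean_pct, 2))  # 50.29
```

Here Σf = 35 and Σ(f·d) = 360, so the mean is 40 + 360/35. Any other choice of a gives the same mean; a convenient central class mark simply keeps the deviations small.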
x F d = x-A fd
5 4 -10 -40
10 5 -5 -25
15 7 0 0
20 4 5 20
25 3 10 30
30 2 15 30
Solution: Now we have to use the formula given above to find the arithmetic
mean. Take the assumed mean A = 80
X       F        d = x – A    fd
65      6        -15          -90
70      11       -10          -110
75      3        -5           -15
80      5        0            0
85      4        5            20
90      7        10           70
95      10       15           150
100     4        20           80
Total   N = 50                ∑fd = 105
Example 3: The following data give the number of boys of each age
in a class of 40 students. Calculate the mean age of the students.
Age (in years) Number of students
13 3
14 8
15 9
16 11
17 6
18 3
Solution: Now we have to use the formula given above to find the
arithmetic mean. Take the assumed mean A = 16
X F d = x – A fd
13 3 -3 -9
14 8 -2 -16
15 9 -1 -9
16 11 0 0
17 6 1 6
18 3 2 6
Solution: Now we have to use the formula given above to find the arithmetic mean.
Take the assumed mean A = 45
Class (x)   Frequency (F)   d = x – A   Fd
15          12              -30         -360
25          20              -20         -400
35          15              -10         -150
45          14              0           0
55          16              10          160
65          11              20          220
75          7               30          210
85          8               40          320
Mean = 45 + (0/103) = 45
Median – the number in the middle when the data is arranged in ascending or descending order.
The median is a measure of central tendency more resistant to the effects of extreme values. The
median is the value that occupies the middle position of data when data are put in rank order by
magnitude.
Let n be the number of cases in your data.
If n is odd, the median is the middle number of the data values sorted by magnitude. It occupies
the \( \frac{n + 1}{2} \)th position.
If n is even, the median is the average of the middle two numbers of the data sorted by magnitude.
It is the average of the numbers in the \( \frac{n}{2} \)th and \( \frac{n + 2}{2} \)th positions.
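Σf = 68
N = sum of frequencies
c.f. = cumulative frequency of the class just preceding the median class
Formula:
\[ \text{Median} = L_M + \left( \frac{\frac{N}{2} - cf_M}{F_M} \right) \times i \]
\( \frac{N}{2} = \frac{68}{2} = 34 \), C = 125-145, L = 125, Cf = 22, F = 20, I = 20
\[ \text{Median} = 125 + \left( \frac{\frac{68}{2} - 22}{20} \right) \times 20 = 125 + \left( \frac{34 - 22}{20} \right) \times 20 = 125 + \left( \frac{12}{20} \right) \times 20 = 125 + 12 \]
Median = 137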
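The grouped-data median formula can be sketched in a few lines of Python. The function and parameter names below are illustrative assumptions; they stand for L, N, Cf, F and i in the formula.

```python
def grouped_median(lower, n, cum_freq_before, class_freq, width):
    """Median for grouped data: L + ((N/2 - cf) / f) * i."""
    return lower + ((n / 2 - cum_freq_before) / class_freq) * width


# Values from the worked example: L = 125, N = 68, Cf = 22, F = 20, I = 20
median_value = grouped_median(125, 68, 22, 20, 20)
print(round(median_value, 2))  # 137.0
```

The median class is always the first class whose cumulative frequency reaches N/2; the function assumes that class has already been identified.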
Answer
Marks (x) No. of students (f) C.F.
0-10 15 15
10-20 20 35
20-30 25 60
30-40 24 84
40-50 12 96
50-60 31 127
60-70 71 198
70-80 52 250
N = 250
As N = 250, \( \frac{N}{2} = \frac{250}{2} = 125 \).
Since 127 is the first cumulative frequency greater than 125, the median class is 50-60.
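Principle of Statistics Collected by: Eng Ali Sidow Osman Page 24
Course Name: Principle of Statistics
\[ \text{Median} = L + \left( \frac{\frac{N}{2} - Cf}{f} \right) \times h \]
\[ \therefore \text{Median} = 50 + \left( \frac{\frac{250}{2} - 96}{31} \right) \times 10 = 50 + \left( \frac{125 - 96}{31} \right) \times 10 = 50 + \left( \frac{29}{31} \right) \times 10 \approx 59.35 \]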
The mode is the most commonly observed value in a set of data. For the normal distribution, the mode is
also the same value as the mean and median. In many cases, the modal value will differ from the average
value in the data.
In other words, the mode is the value in a data set that has the highest frequency; it is also called
the modal value. It is one of the three measures of central tendency, along with the mean and median.
For example, the mode of the set {3, 7, 8, 8, 9} is 8. For a finite number of observations, we can
easily find the mode. A set of values may have one mode, more than one mode, or no mode at all.
Number of students 7 10 13 8 2
Solution:
The maximum class frequency is 12 and the class interval corresponding to this
frequency is 20 – 30. Thus, the modal class is 20 – 30.
Lower limit of the modal class (l) = 20
Size of the class interval (h) = 10
Frequency of the modal class (f1) = 12
Frequency of the class preceding the modal class (f0) = 5
Frequency of the class succeeding the modal class (f2) = 8
Substituting these values in the formula we get
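\[ \text{Mode} = l + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h = 20 + \left( \frac{12 - 5}{2 \times 12 - 5 - 8} \right) \times 10 = 20 + \left( \frac{7}{24 - 5 - 8} \right) \times 10 \]
\[ = 20 + \left( \frac{7}{11} \right) \times 10 = 20 + (0.6363) \times 10 \approx 20 + 6.36 = 26.36 \]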
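The substitution above can be reproduced with a short Python sketch. The function name `grouped_mode` is a hypothetical helper; its parameters mirror l, h, f1, f0 and f2 from the solution.

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode for grouped data: l + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    return l + ((f1 - f0) / (2 * f1 - f0 - f2)) * h


# Values from the solution: l = 20, h = 10, f1 = 12, f0 = 5, f2 = 8
mode_value = grouped_mode(20, 10, 12, 5, 8)
print(round(mode_value, 2))  # 26.36
```

The formula interpolates inside the modal class: the larger f1 − f0 is relative to f1 − f2, the closer the mode lies to the upper end of the class.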
The following data give the observed lifetimes (in hours) of 225 electrical
components.
Lifetime (in hours)   0-20   20-40   40-60   60-80   80-100   100-120
Frequency (f)         10     35      52      61      38       29
Determine the modal lifetime (in hours) of the components.
In statistics, the range is the spread of your data from the lowest to the highest value in the
distribution. It is a commonly used measure of variability.
Along with measures of central tendency, measures of variability give you descriptive statistics for
summarizing your data set.
The range is calculated by subtracting the lowest value from the highest value. While a large range
means high variability, a small range means low variability in a distribution.
R = range
H = highest value
L = lowest value
The range is the easiest measure of variability to calculate. To find the range, follow these steps:
Age 37 19 31 29 21 26 33 36
First, order the values from low to high to identify the lowest value (L) and the highest value (H).
Age 19 21 26 29 31 33 36 37
R = H – L = 37 – 19 = 18
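The same range calculation can be written as a small Python sketch; `value_range` is an illustrative name, and the ages are the eight values listed above.

```python
def value_range(data):
    """Range R = H - L: highest value minus lowest value."""
    return max(data) - min(data)


ages = [37, 19, 31, 29, 21, 26, 33, 36]
print(sorted(ages))       # [19, 21, 26, 29, 31, 33, 36, 37]
print(value_range(ages))  # 18
```

Because the range uses only the two extreme values, a single outlier changes it completely.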
Using the same calculation, we get a very different result this time:
Here,
Σ represents the addition of values
Example: Find mean, variance and standard deviation for the following data
below: 7, 15,12,17,20,14,9.
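\[ \bar{x} = \frac{\sum x}{n} \ \text{(for a sample)} \qquad \bar{x} = \frac{7 + 15 + 12 + 17 + 20 + 14 + 9}{7} = \frac{94}{7} = 13.4 \]
\[ \text{Variance} = \frac{\sum (x - \bar{x})^2}{n} = \frac{(7 - 13.4)^2 + (15 - 13.4)^2 + (12 - 13.4)^2 + (17 - 13.4)^2 + (20 - 13.4)^2 + (14 - 13.4)^2 + (9 - 13.4)^2}{7} \]
\[ = \frac{40.96 + 2.56 + 1.96 + 12.96 + 43.56 + 0.36 + 19.36}{7} = \frac{121.72}{7} = 17.4 \]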
\[ SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} \]
\[ SD = \sqrt{\frac{234.04}{5}} = 6.8 \]
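For the data set 7, 15, 12, 17, 20, 14, 9, the mean, variance and standard deviation can be checked with Python's standard `statistics` module. This sketch assumes the divide-by-n (population) form used in the SD formula above; `statistics.variance` and `statistics.stdev` would give the divide-by-(n − 1) versions instead.

```python
import statistics

data = [7, 15, 12, 17, 20, 14, 9]

mean_value = statistics.mean(data)         # sum(x) / n
pop_variance = statistics.pvariance(data)  # sum((x - mean)^2) / n
pop_sd = statistics.pstdev(data)           # square root of the variance

print(round(mean_value, 1))    # 13.4
print(round(pop_variance, 1))  # 17.4
print(round(pop_sd, 2))        # 4.17
```

The hand calculation rounds the mean to 13.4 before squaring, which is why its intermediate sum (121.72) differs slightly from the unrounded value; both agree at 17.4 to one decimal place.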
Mean and standard deviation of ungrouped data: recovery times from shoulder injuries.
Time in weeks (x)   Frequency (f)   fx        x²f
1                   5               5         5
2                   8               16        32
3                   12              36        108
4                   19              76        304
5                   7               35        175
6                   4               24        144
7                   3               21        147
8                   2               16        128
                    Σf = 60         Σfx = 229 Σfx² = 1043
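The column totals in the recovery-time table are enough to obtain the mean and standard deviation. The sketch below assumes the sample (n − 1) shortcut formula s² = (Σfx² − (Σfx)²/n) / (n − 1); if the population form is intended, divide by n instead.

```python
from math import sqrt

weeks = [1, 2, 3, 4, 5, 6, 7, 8]
freq = [5, 8, 12, 19, 7, 4, 3, 2]

n = sum(freq)                                        # 60
sum_fx = sum(f * x for x, f in zip(weeks, freq))     # 229
sum_fx2 = sum(f * x * x for x, f in zip(weeks, freq))  # 1043

mean_weeks = sum_fx / n
sample_var = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)   # assumption: n - 1 in the denominator
sample_sd = sqrt(sample_var)

print(n, sum_fx, sum_fx2)    # 60 229 1043
print(round(mean_weeks, 2))  # 3.82
print(round(sample_sd, 2))   # 1.69
```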
Solution
No. of orders   Frequency (f)   Midpoint (x)   fx         x²f
10-12           4               11             44         484
13-15           12              14             168        2352
16-18           20              17             340        5780
19-21           14              20             280        5600
                N = 50                         Σfx = 832  Σfx² = 14216
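\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} \]
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum f (m - \bar{X})^2}{n - 1}} \]
\[ s = \sqrt{\frac{5443.73}{37 - 1}} = \sqrt{\frac{5443.73}{36}} = 12.3 \]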
Step two:
\[ \text{Mean Deviation} = \frac{\sum f\,|x - \bar{x}|}{n} = \frac{287.36}{50} = 5.7472 \]
\[ \text{Variance} = \frac{\sum f (x - \bar{x})^2}{n} = \frac{2560.32}{50} = 51.2064 \]
Regression Analysis
Regression analysis includes several variations, such as linear, multiple linear, and
nonlinear. The most common models are simple linear and multiple linear. Nonlinear
regression analysis is commonly used for more complicated data sets in which the
dependent and independent variables show a nonlinear relationship.
Regression analysis offers numerous applications in various disciplines,
including finance.
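The simple linear model is expressed as:
Y = a + bX + ϵ
Where:
1) Y – Dependent variable
2) X – Independent (explanatory) variable
3) a – Intercept
4) b – Slope
5) ϵ – Residual (error)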
Multiple linear regression analysis is essentially similar to the simple linear model,
with the exception that multiple independent variables are used in the model. The
mathematical representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
1) Y – Dependent variable
2) X1, X2, X3 – Independent (explanatory) variables
3) a – Intercept
4) b, c, d – Slopes
5) ϵ – Residual (error)
Multiple linear regression follows the same conditions as the simple linear model.
However, since there are several independent variables in multiple linear analysis,
there is another mandatory condition for the model:
Non-collinearity: Independent variables should show a minimum of correlation with
each other. If the independent variables are highly correlated with each other, it will
be difficult to assess the true relationships between the dependent and independent
variables.
Solution:
Y = 0.929X–3.716+11
= 0.929X+7.284
Example 2: Calculate the two regression equations of X on Y and Y on X from the data
given below, taking deviations from the actual means of X and Y.
Solution:
= –5+44.25
= 39.25 (when the price is 20 Somali Shillings, the likely demand is 39.25)
Example3: Obtain regression equation of Y on X and estimate Y when X=55 from the
following
Solution:
Y – 51.57 = 0.942(X – 48.29)
Y = 0.942X – 45.49 + 51.57
Y = 0.942X + 6.08
When X = 55, Y = 0.942(55) + 6.08 = 57.89
Solution:
X Y XY X2 Y2
2 4
1 3
3 4
2 3
4 6
Example 2: The table below shows the time in hours spent studying (x) by 6 grade 11
students and their scores on a test (y). Compute Pearson's product-moment correlation
coefficient.
X 1 2 3 4 5 6
Y 5 10 15 15 24 35
Solution
X Y XY X2 Y2
1 5
2 10
3 15
4 15
5 24
6 35
Σ=21 Σ =104 Σ= Σ= Σ=
Use the following correlation coefficient formula.
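A minimal Python sketch of the computational formula for Pearson's r, r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)], is given below. It uses the X and Y values as listed in the solution table (the Y column that sums to 104); the function name `pearson_r` is illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation from paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


hours = [1, 2, 3, 4, 5, 6]
scores = [5, 10, 15, 15, 24, 35]   # Y values as listed in the solution table
r = pearson_r(hours, scores)
print(round(r, 3))  # 0.958
```

For these values ΣXY = 460, ΣX² = 91 and ΣY² = 2376, which are the column totals the solution table asks for.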
Correlation coefficients are used to measure how strong a relationship is between two variables.
There are several types of correlation coefficient, but the most popular is Pearson’s. Pearson’s
correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear
regression. If you’re starting out in statistics, you’ll probably learn about Pearson’s R first. In fact,
when anyone refers to the correlation coefficient, they are usually talking about Pearson’s.
Correlation coefficient formulas are used to find how strong a relationship is between data. The
formulas return a value between -1 and 1, where:
Meaning
1) A correlation coefficient of 1 means that for every increase in one variable, there is
an increase of a fixed proportion in the other. For example, shoe sizes go up in
(almost) perfect correlation with foot length.
2) A correlation coefficient of -1 means that for every increase in one variable, there is
a decrease of a fixed proportion in the other. For example, the amount of gas in a
tank decreases in (almost) perfect correlation with speed.
3) Zero means that an increase in one variable is not accompanied by a consistent increase or
decrease in the other. The two have no linear relationship.
The absolute value of the correlation coefficient gives us the relationship strength. The larger the
number, the stronger the relationship. For example, |-.75| = .75, which indicates a stronger relationship
than .65.
Two other formulas are commonly used: the sample correlation coefficient and the
population correlation coefficient.
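Sample correlation coefficient
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
s_x and s_y are the sample standard deviations, and s_xy is the sample covariance.
Population correlation coefficient
\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
σ_x and σ_y are the population standard deviations, and σ_xy is the population covariance.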
Σx = 247
Σy = 486
Σxy = 20,485
Σx2 = 11,409
Σy2 = 40,022
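n is the sample size, in our case = 6
The correlation coefficient =
\[ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}} = \frac{6(20{,}485) - (247)(486)}{\sqrt{[6(11{,}409) - 247^2][6(40{,}022) - 486^2]}} = \frac{2868}{\sqrt{(7445)(3936)}} \approx 0.5298 \]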
So,
Consider equation (ii), regression equation of X on Y
3y - 2x = 10
2x = 3y - 10
So,
r = 0.866
Example5: Find the means of X and Y variables and the coefficient of correlation
between them from the following two regression equations:
4X–5Y+33 = 0
20X–9Y–107 = 0
Solution:
To get mean values we must solve the given lines.
4X – 5Y = -33 … (1)
20X – 9Y = 107 … (2)
(1) × 5 ⇒ 20X – 25Y = -165 … (3)
20X – 9Y = 107 … (2)
Subtracting (2) from (3), -16Y = -272
Y = 272/16 = 17, i.e., Ȳ = 17
Using Y = 17 in (1), we get
4X – 85 = -33
4X = 52
X = 13, i.e., X̄ = 13
Mean values are X̄ = 13, Ȳ = 17.
Let the regression line of Y on X be
4X – 5Y + 33 = 0
5Y = 4X + 33
Y = (1/5)(4X + 33)
Y = (4/5)X + 33/5 = 0.8X + 6.6
∴ byx = 0.8
Let the regression line of X on Y be
20X – 9Y – 107 = 0
20X = 9Y + 107
X = (1/20)(9Y + 107) = 0.45Y + 5.35
∴ bxy = 0.45
Coefficient of correlation: r = √(byx × bxy) = √(0.8 × 0.45) = √0.36 = 0.6
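The whole of Example 5 can be verified with a short Python sketch. It assumes, as the solution does, that 4X − 5Y + 33 = 0 is the line of Y on X and 20X − 9Y − 107 = 0 is the line of X on Y, and it uses the standard relation r = ±√(byx · bxy) with the sign of the regression coefficients. The function name and the (a, b, c) coefficient layout are illustrative choices.

```python
from fractions import Fraction
from math import copysign, sqrt


def means_and_r(y_on_x, x_on_y):
    """Each line is (a, b, c) for aX + bY + c = 0. Returns (x_bar, y_bar, byx, bxy, r)."""
    a1, b1, c1 = (Fraction(v) for v in y_on_x)
    a2, b2, c2 = (Fraction(v) for v in x_on_y)
    det = a1 * b2 - a2 * b1                 # Cramer's rule on the two lines
    x_bar = (-c1 * b2 + c2 * b1) / det      # the lines intersect at (x_bar, y_bar)
    y_bar = (-a1 * c2 + a2 * c1) / det
    byx = -a1 / b1                          # slope of Y on X: Y = (-a1/b1)X - c1/b1
    bxy = -b2 / a2                          # slope of X on Y: X = (-b2/a2)Y - c2/a2
    r = copysign(sqrt(float(byx * bxy)), float(byx))
    return x_bar, y_bar, byx, bxy, r


x_bar, y_bar, byx, bxy, r = means_and_r((4, -5, 33), (20, -9, -107))
print(x_bar, y_bar)   # 13 17
print(byx, bxy)       # 4/5 9/20
print(round(r, 2))    # 0.6
```

A quick sanity check on the choice of lines is that byx · bxy must not exceed 1; here 0.8 × 0.45 = 0.36, so the assignment is consistent.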
Coefficient of correlation r= 0.9. Estimate the likely sales for a proposed advertisement
expenditure of Sh. Somali. 10 crores.
Solution:
When advertisement expenditure is 10 crores i.e., Y=10 then sales X=6(10) +4=64
which implies sales is 64.
Example7
There are two series of index numbers P for price index and S for stock of the
commodity. The mean and standard deviation of P are 100 and 8 and of S are 103 and
Example8
For 5 pairs of observations the following results are obtained ∑X=15, ∑Y=25, ∑X2
=55, ∑Y2 =135, ∑XY=83 Find the equation of the lines of regression and estimate the
value of X on the first line when Y=12 and value of Y on the second line if X=8.
Solution:
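One way to work Example 8 from the given sums is sketched below in Python. It assumes the usual formulas byx = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²) and bxy = (nΣXY − ΣXΣY) / (nΣY² − (ΣY)²), with each line passing through the point of means; the function name is a hypothetical helper.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (x_bar, y_bar, byx, bxy) for the two lines of regression."""
    x_bar, y_bar = sum_x / n, sum_y / n
    cross = n * sum_xy - sum_x * sum_y
    byx = cross / (n * sum_x2 - sum_x ** 2)   # slope of Y on X
    bxy = cross / (n * sum_y2 - sum_y ** 2)   # slope of X on Y
    return x_bar, y_bar, byx, bxy


x_bar, y_bar, byx, bxy = regression_from_sums(5, 15, 25, 55, 135, 83)

# X on Y: X - x_bar = bxy (Y - y_bar);  Y on X: Y - y_bar = byx (X - x_bar)
x_when_y_12 = x_bar + bxy * (12 - y_bar)
y_when_x_8 = y_bar + byx * (8 - x_bar)

print(x_bar, y_bar, byx, bxy)   # 3.0 5.0 0.8 0.8
print(round(x_when_y_12, 2))    # 8.6
print(round(y_when_x_8, 2))     # 9.0
```

Under these assumptions the two lines are X = 0.8Y − 1 and Y = 0.8X + 2.6.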
Example9
Solution:
Solving the two regression equations we get mean values of X and Y
Example10
Solution:
3X–2Y = 5
3X = 2Y+5
Coefficient of correlation
Since the two regression coefficients are positive then the correlation coefficient is also
positive and it is given by
Exercise1
Find (a) The two regression equations, (b) The coefficient of correlation between
marks in Economics and Statistics, (c) The most likely marks in Statistics when the
marks in Economics are 30.
2. The heights (in cm.) of a group of fathers and sons are given below
Find the lines of regression and estimate the height of son when the height of the father
is 164 cm.
4. Obtain the two regression lines from the following data N=20, ∑X=80, ∑Y=40,
∑X2=1680, ∑Y2=320 and ∑XY=480
5. Given the following data, what will be the possible yield when the rainfall is 29"?
6. The following data relate to advertisement expenditure (in lakhs of rupees) and their
corresponding sales (in crores of rupees)
If the Correlation coefficient between X and Y is 0.66, then find (i) the two regression
coefficients, (ii) the most likely value of Y when X=10
8. Find the equation of the regression line of Y on X, if the observations ( Xi, Yi) are the
following (1,4) (2,8) (3,2) ( 4,12) ( 5, 10) ( 6, 14) ( 7, 16) ( 8, 6) (9, 18)
Write down the regression equation and estimate the expenditure on Food and
Entertainment, if the expenditure on accommodation is Rs. 200.
10. For 5 observations of pairs of (X, Y) of variables X and Y the following results are
obtained. ∑X=15, ∑Y=25, ∑X2=55, ∑Y2=135, ∑XY=83. Find the equation of the lines
of regression and estimate the values of X and Y if Y=8; X=12.
12. The equations of two lines of regression obtained in a correlation analysis are the
following 2X=8–3Y and 2Y=5–X. Obtain the value of the regression coefficients and
correlation coefficient.
Solution
\[ P(0) = \frac{f}{n} = \]
\[ P(A \text{ or } B) = \frac{22}{50} + \frac{5}{50} = \frac{27}{50} \]
Solution
\[ P(5) = \frac{56}{127} \]
\[ P(\text{less than 6 days}) = \frac{15}{127} + \frac{32}{127} + \frac{56}{127} = \frac{103}{127} \]
\[ P(\text{at most 4 days}) = \frac{15}{127} + \frac{32}{127} = \frac{47}{127} \]
\[ P(\text{at least 5 days}) = \frac{56}{127} + \frac{19}{127} + \frac{5}{127} = \frac{80}{127} \]
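These empirical probabilities are relative frequencies, P(E) = f / n. The sketch below reproduces them with exact fractions. The frequencies 15, 32, 56, 19 and 5 come from the solution; the day labels 3 to 7 are an assumption made for illustration, since the frequency table itself is not shown here.

```python
from fractions import Fraction

# Assumed labels: days 3..7 (hypothetical); frequencies are those used in the solution.
stay_freq = {3: 15, 4: 32, 5: 56, 6: 19, 7: 5}
n = sum(stay_freq.values())   # 127


def prob(condition):
    """Empirical probability: frequency of days satisfying the condition, divided by n."""
    favourable = sum(f for days, f in stay_freq.items() if condition(days))
    return Fraction(favourable, n)


print(prob(lambda d: d == 5))   # 56/127
print(prob(lambda d: d < 6))    # 103/127
print(prob(lambda d: d <= 4))   # 47/127
print(prob(lambda d: d >= 5))   # 80/127
```

Note that "at most 4 days" and "at least 5 days" are complementary events, so their probabilities add to 1.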
The Addition Rules for Probability
Two events are mutually exclusive events if they cannot occur at the same time (i.e., they
have no outcomes in common).
Example:
A single card is drawn at random from an ordinary deck of cards. Find the probability
that it is either an ace or a black card.
Example:
In a hospital unit there are 8 nurses and 5 physicians; 7 nurses and 3 physicians are females.
If a staff person is selected, find the probability that the subject is a nurse or a male.
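Neither pair of events in these two examples is mutually exclusive (a card can be both an ace and black; a staff member can be both a nurse and male), so the general addition rule P(A or B) = P(A) + P(B) − P(A and B) applies. The Python sketch below works both examples with exact fractions; it assumes a standard 52-card deck with 4 aces, 26 black cards and 2 black aces, and the helper name is illustrative.

```python
from fractions import Fraction


def p_a_or_b(p_a, p_b, p_a_and_b):
    """General addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


# Ace or black card: 4 aces, 26 black cards, 2 black aces out of 52 cards
ace_or_black = p_a_or_b(Fraction(4, 52), Fraction(26, 52), Fraction(2, 52))

# Nurse or male: 13 staff; 8 nurses; males = 1 nurse + 2 physicians = 3; male nurses = 1
nurse_or_male = p_a_or_b(Fraction(8, 13), Fraction(3, 13), Fraction(1, 13))

print(ace_or_black)    # 7/13
print(nurse_or_male)   # 10/13
```

For mutually exclusive events P(A and B) = 0, and the rule reduces to P(A or B) = P(A) + P(B).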
Sample mean
Population mean
Standard z value
Original x value
PROBABILITY FORMULAS
Probability of an event A
P(not A) = 1 - P(A)
Permutation rule
Combination rule
CONFIDENCE INTERVALS
Confidence interval for a mean (large samples)
SAMPLE SIZE
Coefficient of determination
r²