Chapter 1 & 2
Introduction
The word “statistics” could be singular or plural. The definition given in the second place above
might be taken as the singular form of “statistics”.
Statistics, in its singular sense, is a subject area or field of study. It is defined as the science that deals
with the collection, processing, analysis, interpretation and presentation of numerical facts.
The subject of statistics is not a new discipline; it is as old as human society
itself. In earlier times, however, its sphere of utility was very restricted.
The word “statistics” is derived from the Latin for “state” indicating the historical importance of
governmental data gathering, which related to demographic information (military recruitment and tax
collecting). Thus, the scope of statistics in the ancient times was primarily limited to the collection of
demographic, property and wealth data of a country by governments for framing military and fiscal
policies.
Nowadays, statistics is used in almost every field of study, such as natural science, social science,
engineering, medicine and agriculture.
Classification: Statistics is broadly divided into two categories based on how the collected data are
used.
1. Descriptive Statistics
- deals with describing data without attempting to infer anything that goes beyond the given set of
data;
- consists of collection, organization, summarization and presentation of data.
Example 1: The mean blood pressure of a group of patients and the success rate of a surgical
procedure can be considered as descriptive statistics.
2. Inferential Statistics
Statistical inference is drawing conclusions about an entire population based on
data in a sample drawn from that population. From both frequentist and Bayesian
perspectives, there are three main goals of inference: estimation, hypothesis testing,
and prediction. Estimation and hypothesis testing deal with drawing conclusions
about unknown and unobservable population parameters.
Prediction is estimating the values of potentially observable but currently unobserved quantities. For
example, we might want to predict the number of “yesses” in
a future survey of 50 UI students. Prediction in statistical inference isn’t restricted to
predicting future observations, however. It may refer to estimating values that have
already occurred but were not measured. For example, we may want to use values
of acid rain deposition measured from rain gauges at specific sites to predict acid
rain deposition at other locations that have no rain gauges.
In summary, inferential statistics
- deals with making inferences and/or drawing conclusions about a population based on data obtained from a
limited sample of observations;
- consists of performing hypothesis testing, determining relationships among variables and making
predictions.
Example 2: The mean blood pressure of all Americans and the expected success rate of a
surgical procedure in patients who have not yet undergone the operation are quantities that
must be inferred from sample data.
b) Sample: A part of a population taken so that some generalization about the population can be
made. A sample should be representative of the population. Example: If you want to study
the mean age of primary school teachers in Sodo town, all primary school teachers in Sodo
town constitute the population as mentioned above, but if you study only some of the
teachers, the selected ones constitute your sample.
We have defined statistics, in its singular sense, as a science that deals with the collection,
organization (classification), presentation, analysis, and interpretation of numerical facts. So we
consider the following stages of statistical investigation:
Data Collection: This is the stage where we gather information for our purpose.
Data Organization: This is the stage where we edit our data. A large mass of figures
collected from surveys frequently needs organization. The collected data may contain irrelevant
figures, incorrect facts, omissions and mistakes.
Data Presentation: The organized data can now be presented in the form of tables, charts,
diagrams and graphs. At this stage, large data sets are presented in a summarized and
condensed manner.
Data Analysis: This is the stage where we critically study the data. The purpose of data
analysis is to extract information useful for decision making.
Data Interpretation: This is the stage where we draw valid conclusions from the results
obtained through data analysis. If the data that have been analyzed are not properly interpreted,
the whole purpose of the investigation may be defeated and misleading conclusions may be
drawn.
Uses of statistics
The science of statistics is essential for research and decision making in all aspects of
human life. The following are some of the purposes for which statistical analysis is required:
To represent facts in the form of numerical data.
To summarize a mass of data into a few presentable, understandable and precise
figures.
To predict or forecast future trends.
To help select a course of action among a number of alternatives.
To help in formulating policies.
a) It does not study qualitative characteristics directly. Examples: beauty, honesty, poverty, and
standard of living.
b) It does not study a single individual but deals with aggregates of facts. Example: The
population size of a country for a single given year does not help us in comparative studies.
c) Statistical results are true only on the average. Examples: The probability of getting a head in
tossing a coin is 1/2. The germination percentage of a given variety of seed is 80%.
d) It is open to misuse. Example: The number of car accidents caused in a city in a
particular year by women drivers is 10, while the number caused by men drivers is 40. Concluding
from these counts alone that women drivers are safe drivers is a misuse, because the numbers of
women and men drivers on the road are not taken into account.
a) Quantitative variables: variables that can be quantified or can have numerical values.
Examples: height, area, income, temperature, etc.
b) Qualitative variables: variables that cannot be quantified directly. Examples: colour,
beauty, sex, location. Qualitative variables are also called categorical variables. Hence we
have two types of data: quantitative and qualitative data.
Interval scale data convey better information than nominal and ordinal scale data.
There is a constant interval size between any adjacent units on the measurement scale.
There exists a zero point on the measurement scale, and there is a physical significance to
this zero point.
We can say that one value is different from, larger than or smaller than another by a certain
amount, and also that it is so many times the other.
(+, -, *, / are possible on this scale.)
This measurement scale provides better information than the interval scale of measurement.
1.6 Sources of data and methods of data collection
Not every aggregate of numbers can be called statistical data. We say an aggregate of numbers is
statistical data when the numbers are
Comparable,
Meaningful, and
Collected for a well-defined objective.
Raw data: collected data that have not been organized numerically.
Example: 25, 10, 32, 18, 6, 93, 4.
An array: an arrangement of raw numerical data in ascending or descending order of magnitude.
It enables us to find the range of the data set easily, and it also gives us some idea about the
general characteristics of the distribution.
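The definitions above can be illustrated with a minimal Python sketch. It arranges the raw data from the example into an array and reads the range off the ordered values. The variable names are illustrative only and are not part of the original notes.

```python
# Raw data taken from the example above.
raw_data = [25, 10, 32, 18, 6, 93, 4]

# An array is the raw data arranged in order of magnitude.
ascending_array = sorted(raw_data)
descending_array = sorted(raw_data, reverse=True)

# With the data in ascending order, the range is the last value minus the first.
data_range = ascending_array[-1] - ascending_array[0]

print("Ascending array: ", ascending_array)
print("Descending array:", descending_array)
print("Range:", data_range)
```

Once the values are ordered, the smallest and largest observations sit at the two ends of the array, which is why the range is easy to read off.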
Any scientific investigation requires data related to the study. The required data can be obtained from
either a primary source or a secondary source.
Primary source: a source of data that supplies first-hand information for the immediate
purpose.
Primary data: data originally collected for the immediate purpose.
- Primary data are usually more expensive to obtain than secondary data.
Secondary source: individuals or agencies that supply data originally collected for other
purposes by them or by others.
- Usually they are published or unpublished materials, records, reports, etc.
Secondary data: data collected from a secondary source.
I. Observation or measurement
In this method, data are obtained through direct observation or measurement.
- It requires training of the persons who measure in order to ensure the use of standard procedures.
- It provides accurate information, but it is expensive and inconvenient.
II. Interviews and Questionnaires
Questionnaire: a written document that instructs the reader or listener to answer the questions
written on it.
There are three ways of collecting information under this method:
a) Face-to-face interviews (questionnaires in the charge of interviewers)
b) Telephone interviews
c) Mailed questionnaires (self-administered questionnaires returned by mail)
III. The use of documentary sources
This is the extraction of information from existing sources (e.g. hospital records).
Exercise
1. How does statistics help in your profession?
2. Differentiate between descriptive and inferential statistics.
3. Mention some limitations of statistics (discuss using examples).
4. Explain the difference between the following statistical terms by giving examples.
- Qualitative and quantitative variables
- Nominal and ordinal
- Parameter and statistic
- Secondary and primary data
5. Explain various methods of collecting primary and secondary data.
6. What is a questionnaire?
Classification: the process of arranging items/data into classes or categories according to their
similarities and/or differences.
Classification eliminates inconsistency and also brings out the points of similarity and/or dissimilarity
of the collected items/data.
Classification is necessary because it would not be possible to draw inferences and conclusions
directly from a large set of collected [raw] data.
A frequency distribution is a table that presents data according to some criteria with the
corresponding number of items falling in each class (i.e. with the corresponding frequencies.)
Example: A frequency distribution presenting the number of males and females in a class:
Sex       Frequency
Male      57
Female    39
Generally, there are two basic types of frequency distributions: Ungrouped and Grouped frequency
distributions.
An ungrouped frequency distribution is a table of all the raw score values that could possibly
occur in the data, along with their corresponding frequencies. An ungrouped frequency distribution is
often constructed for a small set of data or a discrete variable.
Example: The following data are the ages in years of 20 women who attended health education last year:
30, 41, 39, 41, 32, 29, 35, 31, 30, 36, 33, 36, 32, 42, 30, 35, 37, 32, 30 and 41.
Construct a frequency distribution for these data.
STEP 1. Find the range of the data:
$\text{Range} = \text{Maximum observation} - \text{Minimum observation} = 42 - 29 = 13$
STEP 2. Construct a table, tally the data and complete the frequency column. The frequency
distribution becomes as follows.
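As a hedged illustration of the two steps, the short Python sketch below finds the range and tallies the 20 ages with `collections.Counter` from the standard library. It lists only the ages that actually occur in the data; the variable names and the table layout are illustrative choices, not part of the original notes.

```python
from collections import Counter

# Ages in years of the 20 women, copied from the example above.
ages = [30, 41, 39, 41, 32, 29, 35, 31, 30, 36,
        33, 36, 32, 42, 30, 35, 37, 32, 30, 41]

# STEP 1: range = maximum observation - minimum observation.
age_range = max(ages) - min(ages)

# STEP 2: count how many times each distinct age occurs.
frequency = Counter(ages)

print(f"Range = {max(ages)} - {min(ages)} = {age_range}")
print(f"{'Age':>5}  {'Tally':<6}{'Frequency':>10}")
for age in sorted(frequency):
    count = frequency[age]
    # One stroke per observation stands in for the hand-drawn tally marks.
    print(f"{age:>5}  {'/' * count:<6}{count:>10}")
print(f"{'Total':>5}  {'':<6}{sum(frequency.values()):>10}")
```

The frequencies should add up to the number of observations (20 here), which is a quick check that no value was missed while tallying.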
When the range of the data is large, the data must be grouped into classes. A grouped frequency
distribution is a frequency distribution in which several data values are grouped into one class.
– Class width (W): the difference between the upper and lower boundaries of any class or the lower
limits of two consecutive classes, or the upper limits of two consecutive classes.
N.B. Class width is not equal to the difference between UCL and LCL of the same class.
– Class mark (M): the midpoint of a class interval,
i.e. for the i-th class, $M_i = \frac{UCB_i + LCB_i}{2}$
– Unit of measurement (U): the smallest difference between any two values of the variable being
measured.
– Cumulative frequency (Cf) less than type: the total frequency of all values (observations) less than
or equal to the upper class boundary for the given class.
– Cumulative frequency (Cf) more than type: The total frequency of all values (observations)
greater than or equal to the lower class boundary for the given class.
A tabular arrangement of class intervals together with their corresponding cumulative frequency
(either less than or more than type; as defined above) is called a cumulative frequency distribution.
– Relative frequency: the frequency of a class divided by the total frequency (i.e. the sum of all
frequencies); if multiplied by 100, it gives the percentage of values falling in that class.
$\text{Relative frequency of a class} = \frac{\text{Frequency of that class}}{\text{Total frequency}}$
Note:
The relative frequency shows what fractional part or proportion of the total frequency belongs
to the corresponding class.
The sum of all the relative frequencies in the frequency distribution is always 1.
– Relative cumulative frequency (less than type/more than type): the total of the relative frequencies
above/below a class inclusively, or equivalently the cumulative frequency (less than type/more than type)
divided by the total frequency. Multiplied by 100, this gives the percentage of values which are less than/more than the
upper/lower class boundary.
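The definitions above can be tried out on the weight classes of Table 2.2. The following Python sketch is only an illustration under two assumptions that are not stated in the surviving text: the unit of measurement is U = 1 lb, and each class boundary lies half a unit beyond the corresponding class limit (which agrees with the boundaries printed in Table 2.2). All variable names are hypothetical.

```python
# Class limits and frequencies copied from Table 2.2 (weight in lb).
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49),
                (50, 59), (60, 69), (70, 79)]
frequencies = [5, 19, 10, 13, 4, 4, 2]

unit = 1  # assumed unit of measurement U (1 lb)
total = sum(frequencies)

rows = []
for (lcl, ucl), freq in zip(class_limits, frequencies):
    # Assumption: boundaries lie half a unit outside the class limits.
    lcb = lcl - unit / 2
    ucb = ucl + unit / 2
    rows.append({
        "limits": (lcl, ucl),
        "boundaries": (lcb, ucb),
        "width": ucb - lcb,               # W = UCB - LCB
        "mark": (ucb + lcb) / 2,          # M = (UCB + LCB) / 2
        "relative_frequency": freq / total,
    })

print(f"{'Limits':<10}{'Boundaries':<14}{'W':>6}{'M':>7}{'Rel. freq.':>12}")
for row in rows:
    limits = f"{row['limits'][0]}-{row['limits'][1]}"
    bounds = f"{row['boundaries'][0]}-{row['boundaries'][1]}"
    print(f"{limits:<10}{bounds:<14}{row['width']:>6.1f}"
          f"{row['mark']:>7.1f}{row['relative_frequency']:>12.3f}")
```

For the first class the width is 19.5 - 9.5 = 10, whereas UCL - LCL = 19 - 10 = 9, which illustrates the N.B. above. The relative frequencies add up to 1.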
For the subsequent class boundaries, add the class width to both the lower and upper class boundaries.
7. Tally the data and find the frequencies by counting the number of observations belonging to each
class.
8. Calculate the cumulative frequencies (optional). The resulting frequency distribution
looks like:
Table 2.2
Weight (lb),    Class          Tally                  Frequency   Cumulative frequency   Cumulative frequency
class limits    boundaries                                        (less than type)       (more than type)
10 – 19         9.5 – 19.5     ////                   5           5                      57
20 – 29         19.5 – 29.5    //// //// //// ////    19          24                     52
30 – 39         29.5 – 39.5    //// ////              10          34                     33
40 – 49         39.5 – 49.5    //// //// ///          13          47                     23
50 – 59         49.5 – 59.5    ////                   4           51                     10
60 – 69         59.5 – 69.5    ////                   4           55                     6
70 – 79         69.5 – 79.5    //                     2           57                     2
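The two cumulative frequency columns of Table 2.2 can be reproduced from the frequency column alone. The sketch below is one possible way to do this in Python, using `itertools.accumulate` from the standard library; the variable names are illustrative.

```python
from itertools import accumulate

# Frequency column of Table 2.2.
frequencies = [5, 19, 10, 13, 4, 4, 2]
total = sum(frequencies)

# Less than type: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# More than type: running total from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

# Relative cumulative frequencies, expressed as percentages.
less_than_pct = [100 * cf / total for cf in less_than]
more_than_pct = [100 * cf / total for cf in more_than]

print(f"{'f':>4}{'Cf (<)':>8}{'Cf (>)':>8}{'% (<)':>8}{'% (>)':>8}")
for f, lt, mt, lp, mp in zip(frequencies, less_than, more_than,
                             less_than_pct, more_than_pct):
    print(f"{f:>4}{lt:>8}{mt:>8}{lp:>8.1f}{mp:>8.1f}")
```

The last entry of the less than column and the first entry of the more than column both equal the total frequency, 57.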
The data that is presented by a frequency distribution can also be displayed diagrammatically
or graphically.
Diagrams and graphs:
are techniques for presenting data in visual displays using geometric figures;
are visual aids which give a bird’s eye view about a given set of numerical data;
have greater attraction than mere figures (numbers);
facilitate comparison of data;
are easily understood by anyone, even someone with no statistical background.
Usually diagrams are appropriate for presenting discrete data, whereas graphs are appropriate
for presenting continuous types of data.
There are three common diagrammatic presentations of data: bar-diagrams/charts, pie-charts
and pictograms, as well as three common graphic presentations of data: histogram, frequency
polygon, and cumulative frequency polygon (ogive).
A bar-diagram is a series of equally spaced bars of equal width, with the height of each
bar representing the magnitude or frequency of observations in each group.
Bar-diagrams are usually used to represent a one-way or simple frequency distribution.
Bar-diagrams can be drawn either horizontally or vertically. Usually horizontal bar-diagrams
are used for qualitatively classified data, whereas vertical bar-diagrams are used
for quantitatively classified data.
Example: The mean serum cholesterol level of males and females
2.4.2. Pie-charts
A pie-chart is a circle that is divided into sections or wedges according to the percentages of
frequencies in each category of the distribution. The angle of the sector of a class is obtained
by multiplying the ratio of the frequency of the class to the total frequency by 360°,
i.e. $\text{sector angle of a class} = \frac{\text{frequency of the class}}{\text{total frequency}} \times 360^\circ$
$\text{percentage of a class} = \frac{\text{frequency of the class}}{\text{total frequency}} \times 100\%$
Note that pie-charts are usually used for depicting nominal level data.
Example: 57 medical doctors graduated from black line generalized hospital and were
distributed to seven selected hospitals, say A, B, C, D, E, F and G.
Table 2.3
Selected     No. of doctors    Angle       Percentage
hospitals    assigned          (degree)    (%)
A            5                 31.8        8.8
B            19                119.9       33.3
C            10                63.0        17.5
D            13                82.2        22.8
E            4                 25.2        7.0
F            4                 25.2        7.0
G            2                 12.7        3.5
[Figure: pie-chart of the number of doctors assigned to hospitals A to G]
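The sector angle and percentage formulas can be applied to the doctors example with a few lines of Python. This is a minimal sketch with illustrative names; it works from the unrounded ratios, so the last digit of some values may differ slightly from those printed in Table 2.3, which appear to have been rounded or adjusted by hand.

```python
# Number of doctors assigned to each hospital, copied from Table 2.3.
doctors = {"A": 5, "B": 19, "C": 10, "D": 13, "E": 4, "F": 4, "G": 2}
total = sum(doctors.values())

# sector angle = (frequency of the class / total frequency) * 360 degrees
sector_angle = {h: n / total * 360 for h, n in doctors.items()}

# percentage = (frequency of the class / total frequency) * 100 %
percentage = {h: n / total * 100 for h, n in doctors.items()}

print(f"{'Hospital':<10}{'Doctors':>8}{'Angle':>9}{'Percent':>9}")
for h, n in doctors.items():
    print(f"{h:<10}{n:>8}{sector_angle[h]:>9.1f}{percentage[h]:>9.1f}")
```

Before rounding, the angles add up to 360° and the percentages to 100%, which is a useful check when preparing a pie-chart by hand.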
2.4.3. Pictograms
In pictograms, we represent the data by means of some picture symbols. Here we decide a
suitable picture to represent a definite number of units in which the variable is measured.
Example: Draw a pictorial diagram to present the following data (number of patients in a
certain hospital for four years.)
Year               1992    1993    1994    1995
No. of patients    2000    3000    5000    7000
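A text-only sketch of such a pictogram is given below. It assumes, purely for illustration, that one symbol stands for 1000 patients and uses an asterisk in place of a picture; neither choice comes from the original notes.

```python
# Number of patients per year, copied from the example above.
patients_per_year = {1992: 2000, 1993: 3000, 1994: 5000, 1995: 7000}

UNITS_PER_SYMBOL = 1000  # assumption: one symbol represents 1000 patients
SYMBOL = "*"             # stand-in for a picture symbol

# Each year gets one symbol per 1000 patients.
pictogram = {
    year: SYMBOL * (count // UNITS_PER_SYMBOL)
    for year, count in patients_per_year.items()
}

print(f"Key: one {SYMBOL} = {UNITS_PER_SYMBOL} patients")
for year, symbols in pictogram.items():
    print(f"{year}  {symbols}")
```

All the counts here are exact multiples of 1000, so no partial symbols are needed; with other data a fraction of a symbol would have to be drawn.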
A histogram is another way of presenting data, and it is more suitable for frequency
distributions with continuous classes. In drawing a histogram, we put the class boundaries of
each class on the horizontal axis and its respective frequency on the vertical axis.
Example: Draw a histogram for the above grouped frequency distribution (weight of children).
Table 2.4
Class          Class mark   Frequency   Cumulative frequency   Cumulative frequency
boundaries                              (less than type)       (more than type)
9.5 – 19.5     14.5         5           5                      57
19.5 – 29.5    24.5         19          24                     52
29.5 – 39.5    34.5         10          34                     33
39.5 – 49.5    44.5         13          47                     23
49.5 – 59.5    54.5         4           51                     10
59.5 – 69.5    64.5         4           55                     6
69.5 – 79.5    74.5         2           57                     2
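A rough text version of the histogram for Table 2.4 can be produced with plain Python, as sketched below. Note that this sketch draws the bars sideways (one row per class) rather than with the class boundaries on the horizontal axis, simply because that is easier to print. The function name `text_histogram` is hypothetical.

```python
# Class boundaries and frequencies copied from Table 2.4.
class_boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5),
                    (49.5, 59.5), (59.5, 69.5), (69.5, 79.5)]
frequencies = [5, 19, 10, 13, 4, 4, 2]


def text_histogram(boundaries, freqs, symbol="#"):
    """Return one text row per class; the bar length equals the frequency."""
    lines = []
    for (lower, upper), freq in zip(boundaries, freqs):
        lines.append(f"{lower:>5.1f} - {upper:<5.1f} | {symbol * freq} ({freq})")
    return lines


histogram_lines = text_histogram(class_boundaries, frequencies)
for line in histogram_lines:
    print(line)
```

Because each class ends at the boundary where the next one begins, the bars of a drawn histogram touch each other, unlike the separated bars of a bar-diagram.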
A frequency polygon is a line graph drawn by taking the frequencies of the classes along the
vertical axis and their respective class marks along the horizontal axis. The plotted
points are then joined by straight line segments.
Example: Present the data in the previous example (weight of children) using a frequency
polygon.
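The points to be plotted for this frequency polygon can be listed with a short Python sketch. It adds one class mark with zero frequency at each end so that the polygon closes on the horizontal axis; this is a common convention that is assumed here and not stated in the notes. The variable names are illustrative.

```python
# Class marks and frequencies copied from Table 2.4.
class_marks = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5]
frequencies = [5, 19, 10, 13, 4, 4, 2]

# All classes have the same width, so consecutive class marks are W apart.
class_width = class_marks[1] - class_marks[0]

# Assumption: add a zero-frequency point one class width before the first
# class mark and one after the last, so the polygon meets the axis.
polygon_points = (
    [(class_marks[0] - class_width, 0)]
    + list(zip(class_marks, frequencies))
    + [(class_marks[-1] + class_width, 0)]
)

for mark, freq in polygon_points:
    print(f"class mark = {mark:5.1f}, frequency = {freq}")
```

Joining these points in order gives the polygon; the points can be passed to any plotting tool.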
A cumulative frequency polygon (ogive) can be drawn on a less than or a more than cumulative frequency
basis. Place the class boundaries along the horizontal axis and the corresponding cumulative
frequencies (either less than or more than cumulative frequencies) along the vertical axis.
Then join the plotted points by a free hand curve.
Example: The data in the previous example can be presented using either a less than or a
more than cumulative frequency polygon, as given below in (i) and (ii) respectively.
(i) Less than type cumulative frequency polygon
Figure 2.4: Less than cumulative frequency polygon of weight of children
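The coordinates behind both types of cumulative frequency polygon can be listed as in the Python sketch below. Following the definitions of the two cumulative frequencies, it pairs less than cumulative frequencies with upper class boundaries and more than cumulative frequencies with lower class boundaries. The extra starting and ending points with cumulative frequency 0 are an assumption made so that each curve meets the horizontal axis.

```python
# Class boundaries and cumulative frequencies copied from Table 2.4.
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
upper_boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
less_than_cf = [5, 24, 34, 47, 51, 55, 57]
more_than_cf = [57, 52, 33, 23, 10, 6, 2]

# (i) Less than type: no observation lies below the first lower boundary
# (assumed starting point), then one point per upper class boundary.
less_than_points = (
    [(lower_boundaries[0], 0)] + list(zip(upper_boundaries, less_than_cf))
)

# (ii) More than type: one point per lower class boundary, then no
# observation lies above the last upper boundary (assumed end point).
more_than_points = (
    list(zip(lower_boundaries, more_than_cf)) + [(upper_boundaries[-1], 0)]
)

print("Less than type:", less_than_points)
print("More than type:", more_than_points)
```

The less than curve rises from 0 to the total frequency of 57, while the more than curve falls from 57 to 0.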