Statistical Methods Ecourse (ICAR)
1.1. Introduction
The foundation of statistics is probability theory, which aims to systematize the laws of
chance and to discover regularities in the pattern in which chance events repeat
themselves. Probability had its beginning in the 17th century with games of chance, such as
tossing a coin, rolling a die or drawing a card. It was only in the 19th century that Gregor
Mendel, while studying the genetic laws in peas, showed that it can also be applied to biological
investigations. Since then it has been applied very successfully to various problems in
biology.
Sub-events
Let A and B be two events such that event A occurs whenever event B occurs. Then event B is a
sub-event of event A.
While throwing a die, let A = {2, 4, 6} and B = {2}. Here B is a subset (sub-event) of event A, which is
denoted by B ⊂ A.
Union of events
The union of two or more events is the event of occurrence of at least one of these events. Thus the union of
two events A and B is the event of occurrence of at least one of them. The union of A and B is denoted
by A ∪ B, A + B, or "A or B".
Example 1
While tossing two coins simultaneously, let A = {HH} and B = {TT} be two events.
Then their union is A ∪ B = {HH, TT}.
Here A is the event of occurrence of two heads and B is the event of occurrence of two tails; their union
A ∪ B is the event of occurrence of two heads or two tails.
Example 2
While throwing a die, let A = {2, 4, 6}, B = {3, 6} and C = {4, 5, 6} be three events.
Then their union is A ∪ B ∪ C = {2, 3, 4, 5, 6}.
Intersection of events
The intersection of two or more events is the event of simultaneous occurrence of all these events. Thus the
intersection of two events A and B is the event of occurrence of both of them. The intersection of A and
B is denoted by A ∩ B, AB, or "A and B".
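The set operations above can be tried out with Python's built-in set type. This is only an illustrative sketch using the die-throwing events from Example 2; the variable names are chosen here and are not part of the course text.

```python
# Events from the die-throwing example, written as Python sets.
A = {2, 4, 6}
B = {3, 6}
C = {4, 5, 6}

union_abc = A | B | C          # at least one of A, B, C occurs
intersection_abc = A & B & C   # A, B and C all occur together
is_sub_event = {2} <= A        # {2} is a sub-event (subset) of A

print(sorted(union_abc))
print(sorted(intersection_abc))
print(is_sub_event)
```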
Factorials
For any given positive integer n, the product of all the whole numbers from n down through 1 is
called n factorial and is written as n!
For example,
5! = 5 × 4 × 3 × 2 × 1 = 120
8! = 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 40320
And in general,
n! = n(n-1)(n-2) ... 2 × 1
By definition, 0! = 1.
Further,
n! = n(n-1)!
= n(n-1)(n-2)!, and so on.
Factorials are useful in finding the number of ways objects can be arranged in a line. For
example, suppose that there are 3 containers of culture media, each of which is inoculated with a
different organism. These culture media can be placed in a line on a platform in 3! = 6 ways. If
the 3 media are designated as a, b and c, then 6 arrangements are abc, bca, cab, cba, bac, acb.
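As a quick check of the culture media example, the arrangements can be listed with the standard library. This is a sketch only; the list name media is an assumption made for illustration.

```python
from itertools import permutations
from math import factorial

media = ["a", "b", "c"]  # the three culture media from the example

# Every ordering of the three media in a line.
arrangements = ["".join(p) for p in permutations(media)]

print(factorial(len(media)))  # number of arrangements, 3!
print(arrangements)
```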
Combinations
Defined
Permutations
Defined
Note 1
The probability of an event is a non-negative number which lies between 0 and 1.
Symbolically, 0 ≤ p ≤ 1.
Note 2
If the event E can happen in m ways out of a total of n ways, then the number of ways in which the
event E will not happen is n - m. Hence, the probability that the event E will not happen (denoted by q) is
given by,
q = (n - m)/n = 1 - m/n = 1 - p
So that p + q = 1,
i.e., the sum of the probabilities of occurrence and non-occurrence of an event is equal to 1.
Example 1
What is the probability of getting a head when an unbiased coin is tossed?
Answer:
Total number of equally likely outcomes = 2, i.e. {H, T}
Number of favorable outcomes = 1, i.e. {H}
Therefore, the probability of getting a head is P(H) = 1/2.
Example 2
What is the probability of getting an even number when an unbiased die is thrown?
Answer
Total number of equally likely outcomes = 6
Number of favorable outcomes = 3, i.e. {2, 4, 6}
P(Even number) = 3/6 = 0.5
Example 3
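In a pond containing 100 fishes, 20 are marked. If one fish is subsequently caught, what is the probability
of it being (i) marked, (ii) unmarked?
Answer
Total number of fishes = 100
Number of marked fishes = 20
(i) The number of favorable cases for a marked fish is 20. Hence, P(marked) = 20/100 = 0.20.
(ii) The number of favorable cases for an unmarked fish is 100 - 20 = 80. Hence, P(unmarked) = 80/100 = 0.80.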
Example 4
In a composite fish culture experiment, fingerlings of 6 species of fish, namely rohu, catla, mrigal,
common carp, silver carp and grass carp, were stocked in the ratio of 1 : 1 : 1 : 2.5 : 3 : 1.5 respectively. If a
fingerling is subsequently drawn at random, what is the probability that it is a catla?
Answer
Fingerlings of rohu, catla, mrigal, common carp, silver carp and grass carp are stocked in the ratio of
1 : 1 : 1 : 2.5 : 3 : 1.5 respectively. The ratio terms add up to 10, so out of every 10 fingerlings 1 is a catla.
Hence, the probability that the fingerling drawn is a catla = 1/10 = 0.10
It is to be noted that as the number of trials (frequency) increases the estimate of probability of an
event stabilizes around a particular value.
Example 5
The frequency distribution of lengths of 1000 randomly selected fishes of a particular species is given
below. What is the probability that a fish chosen at random will have length between 35-45 cm?
Answer
Example 6
One thousand fertilized eggs of a major carp were kept under observation to find out the number
of individuals reaching different stages in the life history. Observed data are given below:
Answer
Out of 1000 fertilized eggs, only 200 reached the fingerling stage. Therefore, the
probability of a fertilized egg reaching the fingerling stage is 200/1000 = 0.2
Out of 700 hatchlings, only 210 reached the fry stage. Therefore, the probability of a
hatchling reaching the fry stage is 210/700 = 0.30
Out of 210 fry, 196 reached the adult stage. Therefore, the probability of a fry reaching
the adult stage = 196/210 = 0.933
1.7. Theorem on probability
After understanding the basic definition of probability, it is necessary to know the laws of probability in order to
compute probabilities when more than one trial is conducted or when probabilities of different types of
events are required. There are two important laws of probability. They are
I. Addition Theorem
Let A and B be two events with respective probabilities P(A) and P(B). The probability of occurrence of at
least one of these two events, denoted by P(A+B) or P(A ∪ B), is given by
P(A+B) = P(A) + P(B) - P(AB), or equivalently P(A ∪ B) = P(A) + P(B) - P(A ∩ B),
where P(AB) = P(A ∩ B) is the probability of simultaneous occurrence of A and B.
Example 7
In a certain district, 25% of the fish farmers practice composite fish culture of rohu, catla and mrigal, 15% of
the fish farmers follow monoculture of rohu only and 10% of the farmers follow composite fish culture as well as
monoculture of rohu in their farms. Find the probability that a randomly selected fish farmer follows at
least one of the practices.
Answer
Let events A and B be,
A : The farmer follows composite fish culture
B : The farmer follows monoculture of rohu
Then, P(A) = 0.25, P(B) = 0.15 and P(A ∩ B) = 0.10
The probability that the farmer follows at least one of the practices is denoted by P(A+B) or P(A ∪ B) and is given
by,
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.25 + 0.15 - 0.10 = 0.30
Example 8
A pond contains 150 fishes of rohu, 225 fishes of catla and 125 fishes of mrigal. Find the probability that
a fish randomly selected is a rohu or a catla.
Answer
Let events A and B be,
A: Selected fish is rohu
B: Selected fish is catla
Events A and B are mutually exclusive, as a fish selected cannot be both rohu and catla. Hence, with a total of
150 + 225 + 125 = 500 fishes,
P(A ∪ B) = P(A) + P(B) = 150/500 + 225/500 = 375/500 = 0.75
Independent events
The events A and B are said to be independent, if the occurrence of one does not depend on the
occurrence or non-occurrence of the other.
For example,
when a coin is tossed two times, the result of the second toss does not depend on the result of
the first toss.
Conditional probability
Let A and B be two events in a sample space. Then P(A/B) denotes the probability of
event A happening given that event B has already occurred.
Example 9
In a pond containing 100 fishes, 35 are marked. If two are caught one after another and without
replacement, what is the probability that both the fishes caught are marked?
Answer
Let A denote the event of catching marked fish in the first draw and B denote the event of
catching marked fish in the 2nd draw, then
P(A) =35/100
The probability of drawing marked fish in the 2nd draw, given that the first fish caught was
marked is,
P(B/A) =34/99
Hence, P(both the fish caught are marked) = P(AB) = P(A) · P(B/A) = (35/100)(34/99) = 1190/9900 = 0.12
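The arithmetic in Example 9 can be verified with exact fractions. The sketch below assumes only the numbers stated in the example; the variable names are illustrative.

```python
from fractions import Fraction

marked, total = 35, 100

p_first = Fraction(marked, total)                        # P(A)
p_second_given_first = Fraction(marked - 1, total - 1)   # P(B/A), one marked fish already removed
p_both = p_first * p_second_given_first                  # P(AB) = P(A) * P(B/A)

print(p_both, round(float(p_both), 2))
```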
Example 10
A pond contains 200 fishes of which 40 are marked. A second pond contains 300 fishes of which
50 are marked. One fish is drawn from each of the ponds. What is the probability that the fishes
drawn are both marked?
Answer
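Let A denote the event of catching a marked fish from the 1st pond and B denote the event of
catching a marked fish from the 2nd pond. Then P(A) = 40/200 and P(B) = 50/300, and the two draws are independent.
Hence, P(AB) = P(A) · P(B) = (40/200)(50/300) = 2000/60000 = 0.033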
Example 11
An urn contains 7 white and 8 black pomfrets. A second urn contains 5 white and 9 black pomfrets. One
pomfret is taken out at random from the first urn and put into the second urn without noticing its
colour. A fish is then drawn at random from the second urn. What is the probability that it is a white
pomfret?
Answer
Two cases arise here.
Case (i) The pomfret taken from the first urn is white
Let A denote the event of drawing a white pomfret from the first urn and let B denote the event of drawing
a white pomfret from the second urn.
Here, P(A) = 7/15, P(B/A) = 6/15
Hence, P(AB) = P(A) · P(B/A) = (7/15)(6/15) = 42/225 = 0.1867
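Only case (i) of Example 11 appears in the text above. The sketch below works through both cases under the assumption that case (ii) is the transfer of a black pomfret, and adds the two mutually exclusive cases; it is an illustration of the reasoning, not the course's own worked answer.

```python
from fractions import Fraction

# First urn: 7 white, 8 black. Second urn: 5 white, 9 black.
p_white_moved = Fraction(7, 15)
p_black_moved = Fraction(8, 15)

# After the transfer the second urn holds 15 fish.
p_white_given_white_moved = Fraction(6, 15)  # 5 + 1 white out of 15
p_white_given_black_moved = Fraction(5, 15)  # still 5 white out of 15

case_i = p_white_moved * p_white_given_white_moved    # white moved, then white drawn
case_ii = p_black_moved * p_white_given_black_moved   # black moved, then white drawn (assumed case ii)
p_white = case_i + case_ii                            # the two cases are mutually exclusive

print(case_i, case_ii, p_white, round(float(p_white), 4))
```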
1.8. Exercises
a) A die is rolled; find the probability that the number obtained is greater than 4.
b) Two coins are tossed; find the probability that exactly one head is obtained.
c) Two dice are rolled; find the probability that the sum is equal to 5.
d) A card is drawn at random from a deck of cards. Find the probability of getting the king of hearts.
e) A fish is drawn at random from an aquarium containing 6 gold and 4 black mollies. Find the
probability of getting a gold fish.
2.1. Introduction
In real-life situations we usually infer about a population on the basis of a sample study. For a
given frequency distribution of a variable in the sample under study, we can obtain relative
frequencies, which estimate the probabilities of occurrence of different values of the random variable.
A probability distribution is analogous to a relative frequency distribution with probabilities
replacing relative frequencies. Thus, probability distributions can be regarded as theoretical or
limiting forms of relative frequency distributions when the number of observations made is very
large. Hence, probability distributions can be considered as distributions of populations, whereas
relative frequency distributions are distributions of samples drawn from these populations.
Frequency distributions which arise in samples can be approximated by well-known theoretical
probability distributions, which serve as useful tools in making inferences and decisions under
conditions of uncertainty on the basis of limited data or theoretical considerations.
Thus, a random variable is a function that assigns a real number to each outcome in the sample space of a
random experiment.
As a discrete random variable can take only a finite number of values or a countably infinite
number of values, it is possible to list all the values with the corresponding probabilities. The
probability distribution of a discrete random variable is called a probability mass function.
In the case of a continuous random variable, it is no longer meaningful to list all the values with
the corresponding probabilities; instead, the probability of the random variable falling in a given
interval is given. A histogram can be drawn taking probability on the Y axis with a large number of
small intervals of random variable on X axis. A smooth curve passing through upper sides of the
rectangles of the histogram can be drawn. In many cases it is possible to determine a function
f(x) that approximates the curve. This function is called a probability density function. Here also
the two basic conditions should be satisfied (i) f(x) ≥ 0 and (ii) the total area under the curve f(x)
and the x-axis is equal to 1.
Example 1
Find the probability distribution of the outcome of throwing an unbiased die.
Answer
Let X denote the outcome of the experiment. Then the probability distribution of X is given by,
X      1    2    3    4    5    6
P(X)   1/6  1/6  1/6  1/6  1/6  1/6
A random variable X is said to follow binomial distribution if its probability mass function is
given by (1). If p and n are known, this distribution can be completely determined. Hence, p and
n are called parameters of the binomial distribution. In this distribution, it is assumed that each
trial results in one of two possible mutually exclusive outcomes namely ‘success’ or ‘failure’.
Further, p is assumed to be constant from observation to observation and outcomes of
observations are independent.
2.4.1. Examples
Example 2
Find the probability of getting only one catla in a sample of 10 fishes drawn one by one, if the probability
of a catla being drawn in any draw is 0.2.
Answer
In a sample of size n, the probability of getting x catla is given by,
P(X = x) = nCx p^x q^(n-x)
for x = 0, 1, 2, ..., 10.
In this example p is the probability of a catla being drawn in any draw, which is given to be 0.2,
i.e., p = 0.2
q = 1-p = 1-0.2 = 0.8
Sample size n = 10, x = 1, i.e., getting one catla
Probability of getting one catla is given by,
P(X = 1) = 10C1 (0.2)^1 (0.8)^9 = 0.2684
Example 3
What is the probability of finding 2 males in a sample of 5 fishes drawn one by one? (Assume the sex ratio is
1:1.)
Answer
The probability of finding a male = 0.5 = p (say).
The probability of not finding a male (i.e. finding a female) = 0.5 = q (say). Further, n = 5 and x = 2.
Hence, the required probability using the binomial distribution is given by,
P(X = 2) = 5C2 (0.5)^2 (0.5)^3 = 10/32 = 0.3125
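The binomial probabilities asked for in Examples 2 and 3 can be checked numerically. The helper binomial_pmf below is a hypothetical name introduced for this sketch; it implements the usual binomial probability mass function with parameters n and p.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_one_catla = binomial_pmf(1, 10, 0.2)   # Example 2: one catla in 10 draws
p_two_males = binomial_pmf(2, 5, 0.5)    # Example 3: two males in 5 draws

print(round(p_one_catla, 4), round(p_two_males, 4))
```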
P(X = x) = e^(-m) m^x / x!, for x = 0, 1, 2, ...
where e is the base of the natural logarithm, having a value of 2.7183, and m is the mean of the
distribution. If m is known, this distribution is completely determined; m is therefore called the
parameter of the distribution. An important characteristic of the Poisson distribution is that its
variance is equal to the mean of the distribution. The Poisson distribution is positively skewed.
However, as m (= np, when n is large) increases, it tends to the normal distribution. In the Poisson
distribution, it is assumed that rare events occur randomly and independently. Some examples of
Poisson variables are the number of ships arriving in a harbour per hour, the number of animals of a
plankton species per square of a counting cell, and the number of machines breaking down daily in a fish processing
plant. As the variance equals the mean in the case of the Poisson distribution, the ratio of the former to the latter
(i.e. s²/m) can be used to determine whether the variable under study is randomly distributed or
over-dispersed. Theoretically, if this ratio is greater than 1, the population is over-dispersed and the
Poisson distribution will not be suitable to describe this population.
2.6.1. Examples
Example 5
In the study of a certain fish species, a large number of samples were taken from a pond, and the
number of fish in each sample was counted. The average number of fish in sample was found to
be 2. Assuming that the number of fish follows a Poisson distribution find the probability that (i)
there are exactly 3 fishes, (ii) there are more than 4 fishes.
Answer
In this example m = 2
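The two probabilities asked for in Example 5 can be computed directly from the Poisson probability function with m = 2. This is a minimal sketch; poisson_pmf is a helper name assumed here for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """P(X = x) for a Poisson variable with mean m."""
    return exp(-m) * m**x / factorial(x)

m = 2
p_exactly_3 = poisson_pmf(3, m)
# "More than 4" is the complement of 0, 1, 2, 3 or 4 fishes.
p_more_than_4 = 1 - sum(poisson_pmf(x, m) for x in range(5))

print(round(p_exactly_3, 4), round(p_more_than_4, 4))
```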
Example 6
The data given below refer to the number of animals per square of a particular species of
plankton counted in a plankton counting cell. Compute the Poisson probabilities and the
expected frequencies.
Answer
To compute Poisson probabilities, arithmetic mean “m” of the distribution is required.
In the given example,
The Poisson probabilities for different values of x are computed using the Poisson distribution,
f(x) = [1/(σ√(2π))] e^(-(x-µ)²/(2σ²)), for -∞ < x < +∞
where µ and σ are respectively the mean and standard deviation of the distribution, and π and e are
constants whose values are approximately 3.1416 and 2.7183 respectively. The graph of f(x) is the famous
“bell-shaped” curve.
The normal distribution is completely identified if the mean (µ) and standard deviation (σ) are
known. The distribution varies with the values of µ and σ, as shown in Fig
(a) and Fig (b). It is a continuous distribution and can theoretically assume any value from -∞
to +∞. However, for all practical purposes the values lie in the range of plus or minus three
standard deviations from the mean.
Fig. a: Distributions with the same standard deviation but different means
Fig. b: Distribution with the same mean but different standard deviations
The central position of the curve is described by the mean and the spread of the
curve by the standard deviation.
The coefficient of skewness is zero and the coefficient of kurtosis is 3.
The areas under the normal curve are as follows:
(i) Mean plus or minus one standard deviation (µ ± σ) includes about 68 percent (68.27% to be
more precise) of the total frequency or total area under the curve.
Fig. a: Area between µ ± 1σ
(ii) Mean plus or minus 1.96 standard deviations (µ ± 1.96σ) includes 95% of the total frequency.
Example 7
Weight of a particular species of fish was found to be distributed normally with the mean 400
grams and standard deviation 50 grams. Find the standard normal variate of fishes with weights
(i) 300, (ii) 450 and (iii) 430.
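The standard normal variate is Z = (X - µ)/σ. A short sketch for the three weights in Example 7 follows; the variable names are illustrative only.

```python
mean, sd = 400, 50            # grams, as stated in Example 7
weights = [300, 450, 430]

# Z = (X - mean) / sd for each weight.
z_values = [(x - mean) / sd for x in weights]

print(z_values)
```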
(ii) In Spiegel (1981) area of the normal curve is tabulated from 0 to any positive value of Z.
Note:
Before referring to these tables, it is therefore necessary to know the manner in which areas are
presented.
In the present manual, area tables as presented in Spiegel (1981) are referred to.
Example 8
The mean length of a one-year-old brood of catla is 30 cm and standard deviation 2 cm. A fish is
caught at random, find the probability that its length is,
(i) (a) Between 30 and 32cm
(b) Between 28 and 33 cm
(ii) Suppose it was decided to transfer all those having length greater than 31cm, what percent of
fish is required to be transferred? Assume lengths are normally distributed.
Answer
(i) (a) Compute the corresponding standard normal variates Z for X1 = 30 and X2 = 32.
They are,
Z1 = (30 - 30)/2 = 0 and Z2 = (32 - 30)/2 = 1
The probability that the length of the fish caught is between 30 and 32 cm, in terms of Z, will be
P(0 ≤ Z ≤ 1) = Area between (Z = 0 and Z = 1) = 0.3413
The area is obtained by referring to the area table of the normal distribution.
(b) Length between 28 and 33 cm, i.e. X1 = 28, X2 = 33; the corresponding standard normal variates
are,
Z1 = (28 - 30)/2 = -1 and Z2 = (33 - 30)/2 = 1.5
The probability that the length of the fish caught is between 28 and 33 cm, in terms of Z, will be
P(-1 ≤ Z ≤ 1.5) = 0.3413 + 0.4332 = 0.7745
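The three parts of Example 8 can be cross-checked without area tables. The sketch below uses Python's statistics.NormalDist in place of the Spiegel (1981) tables referred to in the text, so small rounding differences from table values are to be expected.

```python
from statistics import NormalDist

# Length of one-year-old catla: mean 30 cm, standard deviation 2 cm.
length = NormalDist(mu=30, sigma=2)

p_30_32 = length.cdf(32) - length.cdf(30)     # (i)(a) between 30 and 32 cm
p_28_33 = length.cdf(33) - length.cdf(28)     # (i)(b) between 28 and 33 cm
pct_over_31 = 100 * (1 - length.cdf(31))      # (ii) percent longer than 31 cm

print(round(p_30_32, 4), round(p_28_33, 4), round(pct_over_31, 2))
```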
Example 10
Five fish of a particular species, which is thought to be near extinction in a certain region,
have been caught, tagged and released to mix into the population. After they have
completely mixed with the rest of the population, a random sample of 10 of these fishes
is selected. If there are 25 fish of this species in the region, what is the probability that in
the second sample
(i) there are exactly 2 tagged fish,
(ii) there are at most 2 tagged fish?
Answer
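The worked answer is not present here. As a sketch, and assuming the sample is drawn without replacement so that the number of tagged fish follows a hypergeometric distribution (an assumption, since the surviving text does not name the model), the two probabilities can be computed as follows. The function name hypergeom_pmf is hypothetical.

```python
from math import comb

def hypergeom_pmf(x, population, tagged, sample):
    """P(x tagged fish in the sample) when sampling without replacement."""
    return (comb(tagged, x) * comb(population - tagged, sample - x)
            / comb(population, sample))

# 25 fish in the region, 5 tagged, sample of 10.
p_two = hypergeom_pmf(2, 25, 5, 10)
p_at_most_two = sum(hypergeom_pmf(x, 25, 5, 10) for x in range(3))

print(round(p_two, 4), round(p_at_most_two, 4))
```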
Example 11
In order to estimate number of fish in a pond 200 fish were caught and tagged and
released into the pond. After the tagged fish thoroughly mixed with rest of the population,
a random sample of 50 fish was selected and 12 were found to be tagged. Estimate the
number of fish in the pond.
Answer
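The worked answer is not present here. One common approach, assumed in this sketch, is to equate the tagged proportion in the sample with the tagged proportion in the pond (the mark-recapture or Petersen estimate); the variable names are illustrative.

```python
tagged_released = 200    # fish tagged and released into the pond
sample_size = 50         # fish in the second sample
tagged_in_sample = 12    # tagged fish found in the second sample

# Assume tagged_in_sample / sample_size estimates tagged_released / N.
estimated_population = tagged_released * sample_size / tagged_in_sample

print(round(estimated_population))
```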
Unit 3 - Estimation
3.1. Introduction
Statistical inference is that branch of statistics which deals with the theory and techniques of
making decisions regarding the statistical nature of the population using samples drawn from the
population.
Statistical inference has two branches.
They are: (i) estimation and (ii) testing of hypothesis.
Standard error is used to decide the efficiency and consistency of a statistic as an
estimator.
In interval estimation, standard error is used to write down the confidence intervals.
In testing of hypothesis, the standard error of the test statistic is used to standardize the
distribution of the test statistic.
3.5. Unbiased Estimator
An estimator is said to be unbiased if the average of all values taken by the estimator is equal to
the population parameter.
For example, the sample mean is an unbiased estimator of the population mean because the average of all
means of samples of the same size taken from a population is equal to the population mean.
Similarly, the sample proportion is an unbiased estimator of the population proportion and the sample
variance is an unbiased estimator of the population variance.
Note:
Sample variance is given by,
s² = Σ(xi - x̄)² / (n-1)
3.6. Interval Estimators
3.6.1. Case 1 - 100(1-α)% interval estimate for the population mean when σ is known is given by,
x̄ ± zα/2 · σ/√n
Note:
If σ is unknown, it is to be replaced by the sample standard deviation ‘s’.
Note:
The larger the sample size (n), the lower is the error (E), where E = zα/2 · σ/√n.
The sample size can be determined for a known confidence level by using the
formula:
n = (zα/2 · σ / E)²
3.6.1.1.Examples
Example1
From a random sample of 25 fishes taken from a pond, the average length is found to be
5.2 cm. Assume that the length follows a normal distribution with an unknown mean and a standard
deviation of 0.5 cm.
1. Calculate the 95% confidence interval for the mean length of fishes in the pond.
For a confidence level of 95%, the critical value is zα/2 = 1.96.
2. Indicate the sample size needed to estimate the average length with an error of ± 0.5 cm and a
confidence level of 95%.
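The worked solution is not present here. A minimal sketch of both parts, assuming the usual known-σ interval (mean ± z · σ/√n) and the sample size rule n = (z · σ / E)² rounded up to a whole fish, is given below.

```python
from math import sqrt, ceil

n, mean, sigma, z = 25, 5.2, 0.5, 1.96   # values stated in Example 1

# Part 1: 95% confidence interval for the mean length.
margin = z * sigma / sqrt(n)
lower, upper = mean - margin, mean + margin

# Part 2: smallest sample size giving an error of at most 0.5 cm.
max_error = 0.5
required_n = ceil((z * sigma / max_error) ** 2)

print(round(lower, 3), round(upper, 3), required_n)
```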
Example 2
A sample of sixteen fishes taken from a pond gave the following length
measurements in mm.
95, 108, 97, 112, 99, 106, 105, 100, 99, 98, 104, 110, 107, 111, 103, 110.
Assuming that the lengths of the fishes follow a normal distribution with a variance of 25 mm² and an unknown
mean:
1. What is the distribution of the sample mean?
Example 2
In a sample of 400 selected at random, a sample mean of 50 was obtained. Determine the
confidence interval for the population mean with a confidence level of 97%.
Solution:
Example 3
With the same confidence level, what minimum sample size is required so that the
width of the interval is at most 1?
Solution
4.1. Introduction
Many situations call for verification of statements on the basis of available information.
For example, a researcher may be interested in verifying the statement 'Fish species A
grows faster than fish species B' using information on the average growth of the two
species collected from, say, 40 farms. Verifying a statement
concerning a population by examining a sample from that population is called testing of
hypothesis.
Estimation and testing of hypothesis are not as different as they appear. For example,
confidence intervals may be used to arrive at the same conclusions that are reached by
using the testing of hypothesis procedure.
4.2. Terminology
Before carrying out any statistical test, it is necessary to understand the following terminology
involved in testing of hypothesis.
Statistical hypothesis
Statistical hypothesis is a statement about the population under study. It is usually a statement
about one or more parameters of the population. Such statement may or may not be true.
Examples of hypothesis are:
Null hypothesis
The hypothesis to be tested is commonly designated as the ‘null hypothesis’ and is usually denoted
by H0.
For example:
To decide whether one procedure of fish processing is better than the other in terms of
shelf life,
the null hypothesis can be formulated as below:
H0: There is no difference in the shelf life of the two procedures.
Alternative hypothesis
Any admissible hypothesis that differs from the null hypothesis is called an alternative hypothesis
and is denoted by H1.
For example:
In an experiment to compare the efficiency of 4 feeds,
the hypotheses are:
Test statistic
It is a function of sample values. It extracts the information about the population parameter
contained in the sample. The observed value of the test statistic serves as a guide in rejecting or
not rejecting the null hypothesis.
For example,
In testing the null hypothesis that the value of the population mean is µ0, i.e. H0 : µ = µ0, the
statistic used is,
Z = (x̄ - µ0) / (σ/√n)
Rejection region
After the test statistic to be used is selected, the set of possible values of the statistic is
divided into two mutually exclusive regions, viz. the rejection region (critical region) and the
acceptance region (region of non-rejection). If the observed value of the test statistic falls
in the rejection region, H0 is rejected. If it falls in the acceptance region, it is not rejected.
It may be noted that if the observed value falls in the acceptance region, it does not prove
the hypothesis; it simply fails to disprove it.
For a fixed sample size, it is difficult to minimize both α and β, as an attempt to decrease one
may lead to an increase in the other. It is customary to fix α at a predetermined level and
choose a test procedure that minimizes β, i.e., α is prefixed in a test and β is minimized. Thus, we
run the risk of rejecting a true H0 but reduce β, the probability of accepting a false H0, to a minimum. The test
criterion is developed on these principles.
4.4. Level of significance
In testing a given hypothesis, the maximum probability with which we would be willing to risk a
type I error is called the level of significance of the test, denoted by α.
In other words, it is a way of quantifying the amount of risk one wants to take in rejecting a true
hypothesis.
Usually 5% or 1% levels of significance are chosen. The choice, however, depends on the gravity
of the risk involved in the decision making.
4.5. Degrees of freedom
The number of independent observations available from the data for estimation of a particular
parameter or quantity is called the ‘degrees of freedom’.
It can be calculated by deducting from the number of observations the number of constants that
are calculated from the data. For instance, the estimate of population variance based on a sample
of n observations is given by,
s² = Σ(xi - x̄)² / (n-1)
In this case the constant (parameter), population mean, is estimated by the sample mean
x̄. Hence, deduct 1 from the total number of observations n to get the degrees of
freedom, i.e., the degrees of freedom of s² is (n-1).
In the case of an (r x c) contingency table, the degrees of freedom is equal to (r-1)(c-1), where r
is the number of rows and c is the number of columns. For example, for a contingency
table with 3 rows and 4 columns the degrees of freedom is (3-1)(4-1), i.e. 6.
standard deviation
Standard Error: It is the standard deviation of the sampling distribution of sample
means.
Central limit theorem: If random samples of n measurements are repeatedly drawn from
a population with a finite mean µ and standard deviation σ, then the relative frequency
histogram drawn for the (repeated) sample means will tend to be distributed normally.
The approximation becomes more valid as n increases.
If a sample of size n is drawn from a normal population with mean µ and standard
deviation σ, then the sample mean is also distributed as normal with mean µ and
standard deviation σ/√n. This proposition holds good even if the population from which the
sample is drawn is not normal, provided the sample size is large (from the central limit
theorem).
In tests involving the normal distribution, the set of values of Z outside the range -1.96
to 1.96 constitutes the region of rejection or critical region (Fig. 1).
In the above discussion the 5% level of significance was used. As mentioned earlier, any
level of significance can be used. If the 1% level of significance is used, the region of
rejection will be outside the range -2.58 to 2.58 (Fig. 2).
In such cases the critical region will be to one side of the distribution as shown in Fig. 3. Tests
applied to such situations are called ‘one tailed’ tests. It is to be noted that the critical values of Z
at the 5% and 1% levels of significance for a one tailed test are 1.645 and 2.33, whereas these values
are 1.96 and 2.58 for two tailed tests. Discussions in the following sections will be restricted to
two-tailed tests, but the same procedure will hold good for one tailed tests also.
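The critical values of Z quoted above can be reproduced from the inverse of the standard normal distribution function. The sketch below uses statistics.NormalDist from the Python standard library; critical_z is a helper name assumed for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def critical_z(alpha, two_tailed=True):
    """Critical value of Z at level of significance alpha."""
    tail = alpha / 2 if two_tailed else alpha
    return std_normal.inv_cdf(1 - tail)

for alpha in (0.05, 0.01):
    print(alpha,
          round(critical_z(alpha), 3),                    # two tailed
          round(critical_z(alpha, two_tailed=False), 3))  # one tailed
```

The printed values agree with 1.96, 2.58, 1.645 and 2.33 once rounded to the number of decimals used in the text.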
Test for mean of single sample
Let x1, x2, ..., xn be the values of a variable X in a large random sample of size n from a
population with mean µ and variance σ². On the basis of this sample, the hypothesis regarding
the value of µ is tested. Testing of hypothesis consists of the following steps.
5.3.Test for equality of two population means
5.4.Testing single population proportion
6.1. Introduction
When the size of the sample is small, the distributions of various statistics are far from normal
and hence tests of hypothesis based on the normal variate cannot be applied. In such cases tests of
hypothesis based on the exact sampling distributions of ‘t’ and ‘F’ are applied. When applying these
tests it is assumed that the population from which the sample is drawn is normal.
The t-distribution, popularly known as Student’s t distribution, is a sampling distribution
derived from the parent normal distribution. This distribution is symmetrical about the mean but
is slightly flatter than the normal distribution. Unlike the normal distribution, it is different
for different sample sizes n, i.e. for different degrees of freedom (n-1). When the sample is
small (n < 30), the t-distribution differs markedly from the normal distribution, but as n increases
the t-distribution resembles the normal distribution more and more (Fig. 1). The t distribution has
mean zero and variance ν/(ν-2) for ν > 2, where ν is the degrees of freedom. The variable t ranges
theoretically from -∞ to +∞. The values of ‘t’ have been tabulated for different degrees of freedom
at different levels of significance (Fisher and Yates, 1963).
Example 1: A sample of 25 fingerlings drawn from a rearing tank showed a mean length of
75.8 mm and standard deviation of 10 mm. Is the data consistent with the claimed mean size
of 80 mm?
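The worked answer is not present here. As a sketch, the test statistic for Example 1 can be computed as t = (x̄ - µ0)/(s/√n) with n-1 degrees of freedom, assuming the quoted standard deviation was calculated with divisor n-1 (the text does not say).

```python
from math import sqrt

n, sample_mean, s, claimed_mean = 25, 75.8, 10, 80   # values from Example 1

# Assumption: s is the sample standard deviation with divisor n - 1.
t_stat = (sample_mean - claimed_mean) / (s / sqrt(n))
df = n - 1

print(round(t_stat, 2), df)
```

The computed |t| is close to the two-tailed 5% table value of t for 24 degrees of freedom (about 2.064 in standard t tables), so the conclusion is sensitive to how s was calculated.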
Example 2 : Weight was recorded separately for male and female of one year old fish
of species A. The mean weights of males and females are:
Answer : Let µ1 and µ2 denote the population means of male and female fishes
respectively.
Example 3 : The following table gives the marks obtained by 9 students in two tests, one held at the
beginning of a year and the other at the end of the year after intensive coaching. Do the data indicate
that the students have benefited by coaching?
Student   1   2   3   4   5   6   7   8   9
Test 1   55  60  65  75  49  25  35  18  61
Test 2   63  70  70  81  54  29  32  21  70
Answer:
(i) Hypotheses
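The remaining steps of the answer are not present here. Since the same students took both tests, one natural approach is a paired t test on the differences; the sketch below assumes that approach and uses only the marks given in the table.

```python
from math import sqrt
from statistics import mean, stdev

test1 = [55, 60, 65, 75, 49, 25, 35, 18, 61]
test2 = [63, 70, 70, 81, 54, 29, 32, 21, 70]

# Gain of each student (Test 2 minus Test 1).
diffs = [after - before for before, after in zip(test1, test2)]

d_bar = mean(diffs)
s_d = stdev(diffs)                          # sample SD, divisor n - 1
t_stat = d_bar / (s_d / sqrt(len(diffs)))   # paired t statistic
df = len(diffs) - 1

print(diffs, round(d_bar, 2), round(t_stat, 2), df)
```

The computed t is to be compared with the table value of t for 8 degrees of freedom at the chosen level of significance.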
Example 4: Following data refer to catch (in tons) per haul of one hour duration in a trawl
survey off a certain coast
1.2, 2.5, 1.0, 4.0, 3.0, 2.8, 0.6, 3.4, 2.5, 2.0
Compute mean catch per hour and also 95% confidence limits for catch per hour for the coast
(population) under survey.
Answer:
95% confidence limits are given by:
x̄ ± t(0.05, n-1) · s/√n
where t(0.05, n-1) is the two-tailed 5% table value of t for (n-1) degrees of freedom.
To calculate these confidence limits, the following computations are to be made:
Haul No.         1     2     3     4     5     6     7     8      9     10    Total
Catch/Hour (x)   1.2   2.5   1.0   4.0   3.0   2.8   0.6   3.4    2.0   2.5   23.0
x²               1.44  6.25  1.0   16.0  9.0   7.84  0.36  11.56  4.0   6.25  63.7
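The computations for Example 4 can be sketched as follows. The table value 2.262 (two-tailed 5% point of t for 9 degrees of freedom) is taken from standard t tables and entered by hand, since the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

catch = [1.2, 2.5, 1.0, 4.0, 3.0, 2.8, 0.6, 3.4, 2.5, 2.0]  # tons per haul

n = len(catch)
x_bar = mean(catch)
s = stdev(catch)            # sample SD, divisor n - 1
t_table = 2.262             # two-tailed 5% table value of t for n - 1 = 9 df

half_width = t_table * s / sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width

print(round(x_bar, 2), round(lower, 3), round(upper, 3))
```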
Exercise for practice:
I. Samples of eleven and fifteen animals were fed on different diets A and B respectively.
The gain in weight for the individual animals for the same period was as follows:
II. Length measurements of sampled mackerel made on the same day in two landing
centres are given below:
Landing centre A: 24 21 20 19 17 18 15 13 20
Landing centre B: 21 22 14 18 16 19 20 22 21
Do the data support the hypothesis that the mean length of mackerel is the same in both landing centres?
The shape of the χ² distribution depends on the degrees of freedom, which is also its mean
(Fig. 1). When n is small, the distribution is markedly different from the normal distribution,
but as n increases the shape of the curve becomes more and more symmetrical and for
n > 30 it can be approximated by a normal distribution. The values of χ² have been
tabulated for different degrees of freedom at different levels of probability (Fisher and
Yates, 1963). The χ² is always greater than or equal to zero, i.e. χ² ≥ 0.
Most data from biological investigations can be classified either as quantitative or
qualitative (attribute) data. The statistical procedures discussed so far apply
mostly to quantitative data. There are many instances in fisheries research
wherein attribute data describe the phenomenon under investigation more
adequately than quantitative data. The chi-square test based on the χ² distribution is
commonly used for analysis of attribute data.
Where n is the total number of observations and k is the number of classes. The χ² in (1) has k-1
degrees of freedom. In this test the expected frequency of each class should be more than 5. If
any such frequency is small, adjacent classes may be grouped so that the expected frequency is
more than 5.
If the calculated value of χ² is greater than the table value of χ² with (k-1) df at the specified level of
significance, the null hypothesis of the specified ratio is rejected.
Example: A sample of 500 fish, observed for determining the sex ratio, indicated that
230 were male and 270 female. Do the observed data fit the expected ratio of 1:1?
(i) Hypothesis
H0: The observed data fit the ratio 1:1.
H1: The observed data do not fit the ratio 1:1.
(ii) Test statistic
On the basis of this hypothesis of a 1:1 ratio, 250 fish are expected in each of the male and female
classes. $\chi^2$ is calculated as follows:

Sex       Observed ($O_i$)   Expected ($E_i$)   $O_i^2$    $O_i^2/E_i$
Male      230                250                52,900     211.60
Female    270                250                72,900     291.60
Total     500                500                           503.20
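The table can be turned into a test statistic in a few lines. This is a sketch only, assuming the statistic is obtained as the total of $O_i^2/E_i$ minus n, which equals the sum of $(O_i - E_i)^2/E_i$ whenever the expected frequencies add up to n. The 5% table value 3.841 for 1 df is taken from standard chi-square tables and is not quoted in the text:

```python
def chi_square_gof(observed, expected):
    """Chi-square computed as sum(O^2 / E) - n for matched class frequencies."""
    n = sum(observed)
    return sum(o * o / e for o, e in zip(observed, expected)) - n

observed = [230, 270]   # males and females in the sample of 500
expected = [250, 250]   # frequencies under the 1:1 hypothesis
chi2 = chi_square_gof(observed, expected)

CHI2_1DF_5PCT = 3.841   # assumed table value for k - 1 = 1 df at the 5% level
decision = "reject H0" if chi2 > CHI2_1DF_5PCT else "do not reject H0"
print(f"chi-square = {chi2:.2f}: {decision}")
```

With these frequencies the statistic is 503.20 - 500 = 3.20, which is below the assumed table value, so the 1:1 ratio would not be rejected at the 5% level.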
Example: Test whether the data on the number of animals per square of a particular species of
plankton, given in example 5 of chapter 6, follow a Poisson distribution.
Answer:
(i) Hypotheses
H0: The number of animals per square of a particular species of plankton follows a Poisson
distribution.
H1: The number of animals per square of a particular species of plankton does not follow a Poisson
distribution.
(ii) Test statistic
Expected frequencies using the Poisson probability distribution have already been computed in
example 5 of chapter 6. Hence $\chi^2$ based on observed and expected frequencies can be computed
as outlined below:

x        $O_i$    $E_i$     $O_i^2/E_i$
0        30       33.29     27.04
1        42       36.62     48.17
2        18       20.14     16.09
3        8        7.38      8.67
4        2        2.03      1.97
Total    100                101.94
B1 a b a+b
B2 c d c+d
This correction is suitable when the expected frequency of classes is less than 5, but
estimation with correction can do no harm even when the frequencies are large. Hence
it is always better to use the correction as a matter of routine.
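The corrected statistic itself is not reproduced in the text above. The sketch below assumes the standard form of Yates' continuity correction for a 2 x 2 table with cells a, b (row B1) and c, d (row B2), namely N(|ad - bc| - N/2)^2 divided by the product of the four marginal totals. The cell counts are hypothetical and serve only as an illustration:

```python
def yates_chi_square(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]] with Yates' correction."""
    n = a + b + c + d
    # Reduce |ad - bc| by n/2, but never let the corrected difference go below zero
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff ** 2 / margins

# Hypothetical cell frequencies, for illustration only
chi2 = yates_chi_square(12, 5, 6, 9)
print(f"corrected chi-square = {chi2:.3f} with 1 df")
```

The value is compared with the table value of chi-square for 1 degree of freedom, as for any 2 x 2 table.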
...
Ai      Oi1    Oi2    ...    Oij    ...    Oic    Ai
...
Ar      Or1    Or2    ...    Orj    ...    Orc    Ar
Total   B1     B2     ...    Bj     ...    Bc     N
As the table consists of ‘r’ rows and ‘c’ columns, there will be (r x c) observed
frequencies, one in each cell. Corresponding to each observed frequency, there is an
expected frequency, computed on the basis of a certain hypothesis. Under the null hypothesis
of no relationship, or of independence between the attributes, the expected frequency of
each cell is computed by multiplying the totals of the row and column to which the cell
belongs and dividing by the total number of observations. For instance, the expected
frequency of the cell in the 1st row and 2nd column is obtained by multiplying the 1st row
total (A1) by the 2nd column total (B2) and then dividing by the total number of
observations, N. After calculating the expected frequencies for each cell, $\chi^2$ is
computed using the formula,
$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$
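The rule for expected frequencies (row total times column total divided by N) can be sketched in Python as follows. The statistic is taken here as the sum of (O - E)^2 / E over all cells with (r - 1)(c - 1) degrees of freedom, which is the standard form and an assumption about the formula intended above. The observed table is hypothetical:

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi-square, degrees of freedom) for an r x c table."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical 2 x 3 table of observed frequencies, for illustration only
observed = [[20, 30, 50],
            [30, 30, 40]]
chi2, df = chi_square_independence(observed)
print(expected_frequencies(observed))
print(f"chi-square = {chi2:.3f} with {df} df")
```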
Example: In a fish tagging experiment, the length frequencies of tagged fish and of
recoveries were as under. Test whether the two length distributions can be accepted as
the same.
Sex       Number of people
          Eating fish    Not eating fish
Male      370            150
Female    230            250
Total     600            400
II. A survey of 162 children having one parent with blood group M and the other with
blood group N revealed that 28.4% of the children have blood group M, 42% have blood group
MN and the remaining have blood group N. Test the validity of the genetic law that
the proportion of M : MN : N is 1:2:1.
Unit 8 - F-Distribution
8.1. Introduction
In the t-tests on paired and unpaired samples we had to assume that the two samples came from
the same normal distribution. We were testing whether the means were the same, but still had to
assume that the standard deviations were the same. We can test whether this assumption is correct by
using the F-test, which is due to Snedecor.
If the standard deviations are the same, then we would expect the sample standard deviations to be
similar, i.e. $s_1 \approx s_2$, or more precisely, the population variances to be equal: $\sigma_1^2 = \sigma_2^2$.
This can be written as $\sigma_1^2/\sigma_2^2 = 1$.
That is, we can test for the closeness of this ratio to 1.
The variance ratio or F-test formalises these ideas:
The ratio of variances ($s_1^2/s_2^2$) of independent samples taken from a normal population follows a
distribution called the F-distribution, with two degrees of freedom, one for the numerator and the other
for the denominator.
8.2. F-distribution
If $s_1^2$ is the variance of a sample of size $n_1$ and $s_2^2$ is the variance of another
independent sample of size $n_2$ taken from the same normal population, then the ratio of the variances of
these samples, i.e. $s_1^2/s_2^2$, follows a distribution called the F-distribution with ($n_1-1$) and ($n_2-1$)
degrees of freedom.
Note:
F has two sets of degrees of freedom-one for numerator and another for
denominator.
F is a positively skewed distribution.
Shape of F distribution depends upon the two degrees of freedom.
Degrees of freedom are the parameters of F distribution.
$F = s_1^2/s_2^2 = 4.5^2/3.8^2$
F(cal) = 1.40
F(tab) at ($n_1-1$, $n_2-1$) = (34, 24) df at 5% = 1.79 (approximately)
$F = s_1^2/s_2^2 = 40/25$
F(cal) = 1.60
F(tab) at ($n_1-1$, $n_2-1$) = (39, 24) df at 5% = 1.79 (approximately)
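The first of these calculations can be retraced with a short Python sketch. It takes the two sample standard deviations as 4.5 and 3.8 and the sample sizes as 35 and 25; the sizes are inferred from the 34 and 24 degrees of freedom quoted above, and the table value 1.79 is used as quoted in the text:

```python
def variance_ratio_test(s1, n1, s2, n2, f_table):
    """Return (F, degrees of freedom, significant?) for F = s1^2 / s2^2."""
    f_cal = s1 ** 2 / s2 ** 2
    df = (n1 - 1, n2 - 1)
    return f_cal, df, f_cal > f_table

# Sample sizes 35 and 25 are inferred from the (34, 24) df quoted in the text
f_cal, df, significant = variance_ratio_test(4.5, 35, 3.8, 25, f_table=1.79)
print(f"F = {f_cal:.2f} with {df} df, significant: {significant}")
```

Because the calculated F is smaller than the table value, the hypothesis of equal variances would not be rejected in this case.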
If data on two variables, say X and Y, are recorded on the same individual, the resulting data are said to come from a bivariate
population. In this bivariate population, for every value of X, there is a corresponding value of Y.
By treating the variables X and Y separately, measures of central tendency, dispersion, etc., can
be worked out. In addition to these measures, it may be of interest to study the strength of the
relationship existing between the variables and the nature of that relationship. The study of the
former aspect is referred to as ‘correlation’ and the latter as ‘regression analysis’.
In the above expression, X and Y denote the measurements on variables X and Y and n is the
number of pairs of observations i.e. the sample size.
The variables are said to be positively correlated if ‘r’ is positive and negatively correlated if ‘r’
is negative. Positive correlation indicates that two variables are moving in the same direction,
i.e., as one increases the other increases or if one decreases the other decreases. Negative
correlation indicates that the two variables are moving in opposite direction i.e., as one increases
the other decreases.
Examples
Length and weight of juveniles of fish, and income and expenditure, are some examples
of positive correlation.
Rate of infection and yield, and demand and supply, are some examples of negative
correlation.
Growth and demand for fish, and size of shoe and number of intelligent boys/girls, are
some examples of no correlation.
When r = +1 there exists a strict linear relationship and the correlation between the variables is
said to be perfectly positive.
When r = -1 the relationship is linear and correlation between the variables is perfectly negative.
The correlation coefficient equal to one (either positive or negative) indicates perfect correlation
between the variables. Perfect correlation rarely occurs in biological data though values as high
as 0.99 have been obtained in some cases. The closer the value of the coefficient to one, the
greater is the intensity or the degree of association between the variables. Values of r near zero
may arise when there is no relationship or when there is a real relationship but it is not linear.
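The limiting cases r = +1 and r = -1 can be checked numerically. The expression for r is not reproduced above, so this sketch assumes the usual product-moment form, the corrected sum of products divided by the square root of the product of the corrected sums of squares. The data are hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient from paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements, for illustration only
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises exactly linearly with x
y_down = [10, 8, 6, 4, 2]   # falls exactly linearly with x

print(pearson_r(x, y_up), pearson_r(x, y_down))
```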
9.3.1 Example 1
The total length and standard length of 15 fishes of a particular species were measured. Work
out the coefficient of correlation for the data given below:
Answer:
9.4.1 Example 2
The correlation between length and weight of a particular fish species is observed to be
0.7 from a sample of 18 specimens. Is it significant?
Answer: Let $\rho$ denote the population correlation coefficient between length and weight
of fish.
(i) Hypothesis
H0: $\rho = 0$; H1: $\rho \ne 0$
(ii) Test statistic
$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.7\sqrt{16}}{\sqrt{1-0.49}} = 3.92$
Table values of t are $t_{16}$ (5%) = 2.12, $t_{16}$ (1%) = 2.92
(iii) Statistical decision:
Since the calculated value of t is greater than the table value of t at both the 5% and 1% levels of
significance, reject H0. Hence, the correlation coefficient is significant.
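The calculated t of this example can be verified with a small Python sketch, assuming the test statistic t = r√(n-2)/√(1-r²) with n - 2 degrees of freedom and using the table values quoted in the example:

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.7, 18
t_cal = t_for_correlation(r, n)

# Table values of t for 16 df, as quoted in the example
t_5pct, t_1pct = 2.12, 2.92
print(f"t = {t_cal:.2f}, exceeds 5% value: {t_cal > t_5pct}, exceeds 1% value: {t_cal > t_1pct}")
```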
Note: It is, however, not necessary to carry out the t-test described above for testing the
significance of the correlation coefficient, as a ready-made table of critical values of r for
different degrees of freedom at 5% and 1% levels of significance is available (Fisher
and Yates, 1963). Compare the calculated value of r with the critical value of r from the
table. If the calculated value of r is higher than the critical value, then the correlation is
significant.
9.5 Introduction to Simple Linear Regression
If two variables are found to be highly correlated, then a more useful approach is to study
the nature of their relationship. Regression analysis achieves this by formulating statistical
models which best describe these relationships. These models enable prediction of the
value of one variable, called the dependent variable, from the known values of the other
variable(s). Regression differs from correlation in that regression estimates the nature of the relationship,
whereas the correlation coefficient estimates the degree or intensity of the relationship. Further, it is
necessary to designate one of the variables as dependent and the other as independent in
regression analysis, which is not necessary in correlation analysis.
Simple linear regression deals with the study of linear relationships involving two variables,
whereas relationships among more than two variables are studied by multiple regression
techniques.
Fitting a linear relationship of the form (1) is equivalent to estimating the constants a and b from the
observed data. The method most commonly used for estimation of ‘a’ and ‘b’ is the method of ‘least squares’.
Put simply, a line is found for which the sum of squares of the vertical distances of the
observed points from the line is minimum, i.e. the sum of $e^2$ is minimum. In other words, the values of a and b which
minimize $\sum e_i^2 = \sum (Y_i - a - bX_i)^2$ are taken as the estimates.
The estimated values of these constants are substituted in the equation Y = a+bX to get the regression
equation. From this equation the value of Y can be estimated for a given value of X.
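The least squares estimates can be sketched in Python as follows. The sketch assumes the standard solution of the minimization: b equals the corrected sum of products divided by the corrected sum of squares of X, and a equals the mean of Y minus b times the mean of X. The data are hypothetical:

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope (regression coefficient)
    a = mean_y - b * mean_x     # Y intercept
    return a, b

# Hypothetical observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = fit_line(x, y)

# Estimate Y for a given X from the fitted equation
y_at_6 = a + b * 6
print(a, b, y_at_6)
```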
9.6.1 Special names of the parameters ‘a’ and ‘b’ and expressions for their variances
There are special names for the parameters ‘a’ and ‘b’. The parameter ‘a’ is called the Y
intercept. It is the value Y assumes when X = 0. The parameter ‘b’ is called the regression
coefficient and gives the slope of the regression line, i.e., it shows how steep the line is. The
regression coefficient indicates the rate of change in the dependent variable(Y) per unit change
in the independent variable(X).
The variances of the estimates b and a are respectively given by:
where $s_y$ and $s_x$ are the standard deviations of y and x respectively.
9.6.2 Variance about the regression line (Deviation from regression)
The assumption behind standard linear regression is that each $Y_i$ is normally distributed with
mean value $a + bX_i$ and with a constant variance $\sigma^2$ which does not depend on the value of $X_i$.
The formula for the estimate of this variance is given by
This forms the basis for an estimate of error in fitting the line. However, a convenient formula to
work out this variance is given by,
4. Test statistic is t =
In the above expression Y = ln W and X = ln L, and $\bar{Y}$ and $\bar{X}$ denote the arithmetic means of the Y and X values
respectively. The B value gives an estimate of b, whereas conventionally ‘a’ is estimated as exp(A). This
method, however, gives a biased estimate of ‘a’. To compensate for the bias, the ‘a’ value obtained is
multiplied by the correction factor exp($S^2$/2), where $S^2$ is an estimate of the variance of deviations from
regression. Hence, corrected a = exp(A + $S^2$/2).
Note: If logarithms to the base 10 (common logarithms) are used, then a = antilog(A + $S^2$/2).
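The whole procedure (log transformation, fitting, bias correction) can be sketched in Python as below. The sketch uses natural logarithms and assumes that $S^2$ is the residual sum of squares divided by n - 2, the usual estimate, since the formula itself is not reproduced above. The lengths and weights are hypothetical:

```python
import math

def fit_length_weight(lengths, weights):
    """Fit W = a * L**b by regressing Y = ln W on X = ln L."""
    xs = [math.log(length) for length in lengths]
    ys = [math.log(weight) for weight in weights]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    A = mean_y - b * mean_x
    # Variance of deviations from regression (assumed divisor: n - 2)
    s2 = sum((y - A - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    a_corrected = math.exp(A + s2 / 2)   # bias-corrected estimate of a
    return a_corrected, b, s2

# Hypothetical lengths (cm) and weights (g), for illustration only
lengths = [10.0, 12.0, 15.0, 18.0, 20.0, 24.0]
weights = [11.0, 18.5, 35.0, 62.0, 83.0, 145.0]
a, b, s2 = fit_length_weight(lengths, weights)
print(f"W = {a:.5f} * L^{b:.4f}, S2 = {s2:.5f}")
```

When the fit is close, $S^2$ is small and the correction changes ‘a’ only slightly.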
9.9 Applications of length-weight relationship of fishes
As the length of a fish can be measured more easily and accurately than its weight at landing centres as
well as on board vessels at sea, weight can readily be estimated from the
predetermined length-weight relationship.
In order to compare weight and length in a particular sample or individual, condition factors are
employed. Fulton’s condition factor (K) is calculated as,
where W and L are the observed total weight and length of a fish. It is the value of ‘a’ in the
length-weight relationship, $W = aL^b$, when b = 3. The heavier the fish at a given length, the larger is the
factor K, implying that the fish is in better condition. K greater than 1 indicates that the general well-being of
the fish is good. Fulton’s condition factor is suitable for comparing differences related to sex,
season or place of capture. Even when b differs from 3, Fulton’s condition factor may be used
if the fish are approximately of the same length. If the length range is large, the following formula is
used:
Alternatively, the condition factor is computed as the ratio of observed weight to estimated weight,
$K = W/\hat{W}$
where $\hat{W}$ is the estimated weight obtained from the length-weight relationship, $W = aL^b$.
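The ratio form of the condition factor is easy to compute once a and b are known. A minimal Python sketch follows; the parameter values and the fish measurements are hypothetical and chosen only to make the arithmetic easy to follow:

```python
def relative_condition(weight, length, a, b):
    """Condition factor as observed weight / weight estimated from W = a * L**b."""
    estimated_weight = a * length ** b
    return weight / estimated_weight

# Hypothetical parameters and fish, for illustration only
a, b = 0.01, 3.0
k = relative_condition(weight=88.0, length=20.0, a=a, b=b)
print(f"K = {k:.2f}")
```

A value above 1 means the fish is heavier than the length-weight relationship predicts for its length.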
9.10 Fitting of length-weight relationship
Example: Total length (cm) and weight (g) recorded on a sample of 12 fish are given
below:
(i) Fit a length-weight relationship of the type $W = aL^b$, where W is weight and L is length
of fish.
(ii) Test whether b differs significantly from 3.
Solution:
$\bar{X}$ = 1.2898, $\bar{Y}$ = 1.5310
Then $A = \bar{Y} - b\bar{X}$ = 1.5310 - 4.8917 = -3.3607
$S^2$ = 0.0011
a = Antilog (A + $S^2$/2)
= Antilog [-3.3607 + (0.0011)/2]
= Antilog (-3.36015) = 0.0004
The length-weight relationship is therefore given by
$W = 0.0004 L^{3.7944}$
(ii) To test whether the sample regression coefficient (b = 3.7944) comes from a
population with the regression coefficient β = 3, the following null hypothesis is set up.
Hypothesis
H0: β = 3; H1: β ≠ 3
The test statistic used is
$t = (b - \beta)/S_b$
$S_b^2$ = 0.0122
Hence $S_b = \sqrt{0.0122}$ = 0.1105
Hence $t = (3.7944 - 3)/0.1105 = 7.2$
(iii) Statistical decision
The table values of t at the 5% and 1% levels of significance are 2.228 and 3.169
respectively. As the calculated t of 7.2 is greater than the table values of t at both the 5%
and 1% levels of significance, the null hypothesis is rejected.
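The test of b against 3 can be retraced numerically. A short Python sketch, assuming the statistic t = (b - β)/S_b with n - 2 = 10 degrees of freedom and using the values quoted in the example:

```python
import math

b = 3.7944          # sample regression coefficient from the example
beta_0 = 3.0        # hypothesised population value
sb_squared = 0.0122 # variance of b, as quoted in the example
n = 12              # number of fish in the sample

sb = math.sqrt(sb_squared)
t_cal = (b - beta_0) / sb
df = n - 2

# Table values of t as quoted in the example
t_5pct, t_1pct = 2.228, 3.169
reject = t_cal > t_5pct and t_cal > t_1pct
print(f"Sb = {sb:.4f}, t = {t_cal:.2f} with {df} df, reject H0: {reject}")
```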
10.1. Introduction
India has a long coastline of about 8118 km and there are about 1400 landing centres
scattered along the coast. The sampling design developed and practised by the Central
Marine Fisheries Research Institute (CMFRI), Kochi provides the estimate of total
marine fish landings in India. The sampling design adopted for this purpose is ‘stratified
multistage random sampling’, with the stratification being done over space and time.
Each maritime state is divided into several zones on the basis of geographical
considerations and fishing practices.
Realizing the need for developing a uniform and standard methodology for estimation of inland fisheries
resources and catch, a pilot investigation was launched as early as 1955-56 by ICAR in two districts of the
erstwhile Hyderabad state. Later on, the Government transferred the work from ICAR to the Directorate
of National Sample Survey (NSSO) in April, 1956. In September, 1958 the Directorate of National
Sample Survey took up the survey work in Orissa to evolve suitable sampling techniques for estimation
of fish production.
By the end of 1958, basic information on various inland fisheries resources, their relative importance,
fishing practices, etc., was available, which later formed the basis of the pilot survey undertaken by the
NSSO in 1962-63 in 3 districts of Orissa, viz. Cuttack, Sambalpur and Mayurbhanj. The Indian Statistical
Institute, Calcutta made an attempt at evolving a sampling methodology for inland fisheries during 1960-61.
Field problems came to light during the study and no estimation was attempted.
The Central Inland Fisheries Research Institute (CIFRI), Barrackpore made an attempt to estimate the area and catch
from ponds in the district of Hooghly, West Bengal during 1962-63, which was not successful due to some
administrative difficulties. The NSSO conducted a survey in 1973-75 covering 3 districts, one each in
West Bengal, Tamil Nadu and Andhra Pradesh, for estimating the catch from impounded water as well as
riverine resources by the enquiry method. The estimates worked out were not satisfactory, particularly for
riverine resources.
The Indian Agricultural Statistics Research Institute (IASRI), New Delhi and CIFRI, Barrackpore carried out a
pilot survey during 1978-81 in one district of West Bengal. The data were collected both by enquiry and
by physical observation. The study covered only ponds in the district of 24 Parganas in West Bengal.
The catch estimation for other important resources, viz. estuaries, rivers, brackish water impoundments,
beels, etc., could not be carried out due to limited manpower.
As a scientifically designed method for collection and estimation of inland fisheries statistics did not
emerge in spite of all these attempts, a centrally sponsored scheme was launched in 1984 in 8 states to
evolve a standardized methodology for collection of inland fisheries statistics in the country, and its
implementation was entrusted to CIFRI, Barrackpore. The resource assessment survey work and catch
assessment survey work had been completed in 158 and 56 districts respectively by 1998-99 under this
scheme. The scheme has enabled preparation of a uniform and sound data collection methodology on
the basis of sample surveys conducted in various states. Estimation procedures have also been
formulated for different ecological environments in inland fisheries. CIFRI, Barrackpore brought
out Bulletin No. 58 (revised) in 1991 on the methodology for collection and estimation of inland
fisheries statistics in India, to provide guidelines on collection of data and estimation procedures with
an associated degree of reliability at the national level. In spite of all these efforts, standardization of
methodologies for estimation of catch from diverse inland aquatic resources and the establishment of a
mechanism for regular collection and dissemination of data by the states and union territories are yet to
take place.