Mca4020 SLM Unit 08
8.1 Introduction
So far, in the previous units, we studied various everyday problems involving a single variable. However, a large number of problems involve two or more variables. If two quantities vary in such a way that movements in one variable affect the movement of the other, we say that the two variables are correlated. Examples of such pairs of variables are height and weight, rainfall and yield, price and demand, income and expenditure, and production and employment. Regression measures the average relationship between any two or more closely related variables.
In this unit, we will discuss the techniques of correlation and regression, which are used for investigating the relationship between two or more variables.
Objectives:
At the end of this unit, the student should be able to:
calculate the coefficients of partial and multiple correlation
8.2 Correlation
Correlation is a statistical tool used to study the relationship between two or more variables. Two variables are said to be correlated if a change in one variable brings about a change in the other variable. On the other hand, if a change in one variable does not bring about any change in the other variable, then we say that the two variables are not correlated with each other.
According to Simpson and Kafka, "Correlation analysis deals with the association between two or more variables."
In the words of A.M. Tuttle, "Correlation is an analysis of the covariation between two or more variables."
8.2.1 Types of correlation
There are four types of correlation:
1. Simple, Partial and Multiple correlation
2. Positive and negative correlation
3. Perfect and Imperfect correlation
4. Linear and non-linear correlation.
1. Simple, Partial and Multiple correlation:
Simple correlation is the relationship between any two variables. Partial correlation is the study of the relationship between any two out of three or more variables, ignoring the effect of the other variables. For example, suppose we have three variables: X1 = marks in Mathematics, X2 = marks in Science and X3 = marks in English. If we study the relationship between X1 and X2 ignoring the effect of the remaining variable, i.e., X3, then it is a partial correlation. Multiple correlation is the study of the simultaneous relationship between one variable and a group of other variables. For example, if we study X1, X2 and X3 simultaneously, then the correlation between X1 and (X2, X3) is a multiple correlation. Multiple correlation is not commonly used.
2. Positive and Negative correlation: Two variables are said to be positively correlated when both variables move in the same direction, i.e., if one variable increases the other also increases, and if one variable decreases the other also decreases. Variables are said to be negatively correlated if an increase in one variable leads to a decrease in the other, and vice versa; that is, the variables move in opposite directions. For positive correlation the graph of the data is an upward-sloping curve, whereas for negative correlation the graph is a downward-sloping curve.
3. Perfect and Imperfect correlation: When both variables change at a constant rate, irrespective of the direction of change, the correlation is called perfect. When the variables change at different rates, the correlation is called imperfect. The value of a perfect correlation is 1 or -1, while the value of an imperfect correlation lies between -1 and 1.
4. Linear and Non-linear correlation: Correlation is linear when the graph of the correlated data is a straight line, that is, when the variables change in a constant ratio. Linear correlation is positive or negative according as the straight line slopes upward or downward. On the other hand, correlation is non-linear (curvilinear) when the graph of the variables is a curve of any direction. Like linear correlation, non-linear correlation can be either positive or negative, depending on the upward or downward direction of the curve.
[Figure: scatter diagram illustrating no correlation]
Karl Pearson's coefficient of correlation between two variables X and Y is defined as

$r = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$   (8.2)

which, in terms of the observations, may be written as

$r = \dfrac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{N\sum X^2 - (\sum X)^2}\;\sqrt{N\sum Y^2 - (\sum Y)^2}}$   (8.3)
Note:
1. The value of the correlation coefficient cannot exceed unity numerically; it always lies between -1 and +1, that is, $-1 \le r \le 1$. If r = +1, the correlation is perfect and positive, and if r = -1, the correlation is perfect and negative.
2. It is not affected by a change of origin or a change of scale.
3. It is a relative measure; it does not have any unit attached to it.
Example: Calculate the coefficient of correlation by Karl Pearson's method based on the following values.
T1:  75   60   45   30   15
T2: 150  175  200  225  250
Solution: Taking X = T1/15 and Y = T2/25, we get the following table.

T1      X = T1/15   X²     T2     Y = T2/25   Y²      XY
75      5           25     150    6           36      30
60      4           16     175    7           49      28
45      3            9     200    8           64      24
30      2            4     225    9           81      18
15      1            1     250   10          100      10
Total   ΣX = 15   ΣX² = 55        ΣY = 40   ΣY² = 330  ΣXY = 110

Here N = 5, so $\bar{X} = 15/5 = 3$ and $\bar{Y} = 40/5 = 8$.

$\mathrm{Cov}(X, Y) = \dfrac{\sum XY}{N} - \bar{X}\bar{Y} = \dfrac{110}{5} - 3 \times 8 = -2$

$\sigma_X = \sqrt{\dfrac{\sum X^2}{N} - \bar{X}^2} = \sqrt{11 - 9} = \sqrt{2}$  and  $\sigma_Y = \sqrt{\dfrac{\sum Y^2}{N} - \bar{Y}^2} = \sqrt{66 - 64} = \sqrt{2}$

$r = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y} = \dfrac{-2}{\sqrt{2} \times \sqrt{2}} = -1$

Hence T1 and T2 are perfectly negatively correlated; note that the change of scale (dividing by 15 and 25) does not affect r.
Example: Calculate Karl Pearson's coefficient of correlation between the sales (X) and expenses (Y) of the ten firms given below.
Solution: Writing $x = X - \bar{X}$ and $y = Y - \bar{Y}$:

Firm   Sale X   x    x²   Expenses Y   y    y²   xy
1      50      -8    64   11          -3    9    24
2      50      -8    64   13          -1    1     8
3      55      -3     9   14           0    0     0
4      60       2     4   16           2    4     4
5      65       7    49   16           2    4    14
6      65       7    49   15           1    1     7
7      65       7    49   15           1    1     7
8      60       2     4   14           0    0     0
9      60       2     4   13          -1    1    -2
10     55      -3     9   13          -1    1     3
N = 10
Applying the formula for r and substituting the respective values from the table, we get:

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{N\,\sigma_x\,\sigma_y} = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x\,\sigma_y}$

With Cov(x, y) = -17.5, $\sigma_x = \sqrt{49} = 7$ and $\sigma_y = \sqrt{9} = 3$,

$r = \dfrac{-17.5}{7 \times 3} = -0.833$
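The arithmetic of formula (8.3) is easy to check with a short program. The following is a minimal Python sketch, not part of the original text (the function name pearson_r is ours); applied to the raw T1/T2 values of the first example it returns -1, confirming both the result and the fact that the change of scale X = T1/15, Y = T2/25 does not affect r.

```python
# Minimal sketch: Karl Pearson's coefficient of correlation computed
# directly from formula (8.3), using the T1/T2 example data.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # r = [N*Sxy - Sx*Sy] / sqrt([N*Sxx - Sx^2] * [N*Syy - Sy^2])
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

t1 = [75, 60, 45, 30, 15]
t2 = [150, 175, 200, 225, 250]
print(pearson_r(t1, t2))   # -1.0 : perfect negative correlation
```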
SAQ 1: Ten observations on weight (x) and height (y) of a particular age group gave the following data:
Σx = 56, Σy = 138, Σx² = 1357, Σy² = 2136, Σxy = 836. Find r.
When we do not know the shape of the population distribution, or when the data is of a qualitative type, Spearman's rank correlation coefficient is used to measure the relationship.
Spearman's rank correlation coefficient is defined as:

$\rho = 1 - \dfrac{6\sum D^2}{N^3 - N}$
Type i: Ranks are assigned: When ranks are already assigned, take the difference between the ranks of the two variables and denote it by D. The rank correlation is then computed using the formula

$\rho = 1 - \dfrac{6\sum D^2}{N^3 - N}$
Example: In a singing competition, two judges assigned the following ranks to seven competitors. Find Spearman's rank correlation coefficient.
Competitor 1 2 3 4 5 6 7
Judge I 5 6 4 3 2 7 1
Judge II 6 4 5 1 2 7 3
Solution:

Competitor   R1 (Judge I)   R2 (Judge II)   D = R1 - R2   D²
1            5              6               -1            1
2            6              4                2            4
3            4              5               -1            1
4            3              1                2            4
5            2              2                0            0
6            7              7                0            0
7            1              3               -2            4
Total                                                     ΣD² = 14
$\rho = 1 - \dfrac{6\sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 14}{7(7^2 - 1)} = 1 - \dfrac{84}{336} = 0.75$
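The rank-correlation formula is equally easy to verify mechanically. Below is a minimal Python sketch, not from the original text (spearman_rho is a name we introduce), which applies ρ = 1 - 6ΣD²/(N³ - N) to the two judges' ranks and reproduces the value 0.75.

```python
# Minimal sketch: Spearman's rank correlation when the ranks are already
# assigned; data taken from the singing-competition example above.
def spearman_rho(ranks1, ranks2):
    n = len(ranks1)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))
    return 1 - (6 * d2) / (n ** 3 - n)

judge1 = [5, 6, 4, 3, 2, 7, 1]
judge2 = [6, 4, 5, 1, 2, 7, 3]
print(spearman_rho(judge1, judge2))   # 0.75
```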
Example: Find the rank-difference coefficient of correlation for the data displayed in the table below.

Student   Score on Test I (X)   Score on Test II (Y)   Rank on Test I (R1)   Rank on Test II (R2)
A         16                    8                      2                     5
B         14                    14                     3                     3
C         18                    12                     1                     4
D         10                    16                     4                     2
E         2                     20                     5                     1
Solution:

Student   X    Y    R1   R2   D = R1 - R2   D²
A         16    8    2    5   -3             9
B         14   14    3    3    0             0
C         18   12    1    4   -3             9
D         10   16    4    2    2             4
E          2   20    5    1    4            16
N = 5                                        ΣD² = 38
$\rho = 1 - \dfrac{6\sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 38}{5^3 - 5} = 1 - \dfrac{228}{120} = 1 - 1.9 = -0.9$
Example: The following table gives the ranks of six sales representatives according to their sales in two different localities. Find whether there is a relationship between the buying habits of the people in the two localities.
Representative 1 2 3 4 5 6
Locality I 2 5 3 1 4 6
Locality II 4 5 3 1 2 6
Solution:

Representative   Rank in Locality I (R1)   Rank in Locality II (R2)   D = R1 - R2   D²
1                2                         4                          -2            4
2                5                         5                           0            0
3                3                         3                           0            0
4                1                         1                           0            0
5                4                         2                           2            4
6                6                         6                           0            0
Total                                                                  ΣD² = 8
$\rho = 1 - \dfrac{6\sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 8}{6(6^2 - 1)} = 1 - \dfrac{48}{210} = 0.7714$

The high positive rank correlation indicates that the buying habits of the people in the two localities are quite similar.
Type ii: Ranks are not assigned: When ranks are not given, we have to assign ranks to the variables, either in ascending order or in descending order. The ranks can be assigned by taking either the highest value or the lowest value as rank 1. The same formula is then used to compute the rank correlation.
Example: Calculate the rank correlation for the following data of marks in two tests given to candidates for a clerical job.
Preliminary test: 92 89 87 86 83 77 71 63 53 50
Final test : 86 83 91 77 68 85 52 82 37 57
Solution:

Preliminary test (X)   R1   Final test (Y)   R2   D² = (R1 - R2)²
92                     10   86                9   1
89                      9   83                7   4
87                      8   91               10   4
86                      7   77                5   4
83                      6   68                4   4
77                      5   85                8   9
71                      4   52                2   4
63                      3   82                6   9
53                      2   37                1   1
50                      1   57                3   4
N = 10                                            ΣD² = 44
$\rho = 1 - \dfrac{6\sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 44}{10(10^2 - 1)} = 1 - \dfrac{264}{990} = 0.733$
Equal or repeated ranks: When two or more items have the same value, each of them is given the average of the ranks they would otherwise have occupied. For example, if two individuals are ranked equal at the 6th place, then each is given the rank (6 + 7)/2 = 6.5. Thus, if two or more individuals are to be ranked equal, the rank assigned to each of these individuals is the average of the ranks.
When equal ranks are assigned to some entries, an adjustment of $\frac{1}{12}(m^3 - m)$ is added to $\sum D^2$, where m stands for the number of items whose ranks are common. If there is more than one such group of items with a common rank, this value is added as many times as the number of such groups. The formula can thus be written as:

$\rho = 1 - \dfrac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right]}{N^3 - N}$
Example: Find rank correlation coefficient for the data given in table.
Student A B C D E F G H I J
Score on Test I 20 30 22 28 32 40 20 16 14 18
Score on Test II 32 32 48 36 44 48 28 20 24 28
Solution: Ranking the scores on each test (highest score = rank 1, tied scores sharing the average of their ranks):

Student   R1 (Test I)   R2 (Test II)   D = R1 - R2   D²
A         6.5           5.5             1             1
B         3             5.5            -2.5           6.25
C         5             1.5             3.5          12.25
D         4             4               0             0
E         2             3              -1             1
F         1             1.5            -0.5           0.25
G         6.5           7.5            -1             1
H         9            10              -1             1
I        10             9               1             1
J         8             7.5             0.5           0.25

ΣD² = 24. There is one pair of tied scores on Test I and there are three pairs of tied scores on Test II, so the correction $\frac{1}{12}(2^3 - 2) = 0.5$ is added four times:

$\rho = 1 - \dfrac{6\left[24 + 0.5 + 0.5 + 0.5 + 0.5\right]}{10(10^2 - 1)} = 1 - \dfrac{156}{990} \approx 0.842$
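Assigning average ranks to ties and adding the (m³ - m)/12 correction for every tied group is mechanical, so it is worth checking with a short program. The following Python sketch is illustrative and not part of the original text (average_ranks and spearman_with_ties are our own names); applied to the Test I/Test II scores above it reproduces the value obtained there.

```python
# Minimal sketch: Spearman's rho with average ranks for tied values and the
# (m^3 - m)/12 correction added once for every group of m tied values.
from collections import Counter

def average_ranks(values):
    # Rank 1 = largest value; tied values share the average of their ranks.
    order = sorted(values, reverse=True)
    first = {}                               # value -> first (1-based) position
    for pos, v in enumerate(order, start=1):
        first.setdefault(v, pos)
    counts = Counter(values)
    return [first[v] + (counts[v] - 1) / 2 for v in values]

def spearman_with_ties(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum((m ** 3 - m) / 12
                     for series in (x, y)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (d2 + correction) / (n ** 3 - n)

test1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]
print(round(spearman_with_ties(test1, test2), 3))   # ~0.842
```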
The partial correlation coefficient between variables 1 and 2, keeping the 3rd variable constant, is given by

$r_{12.3} = \dfrac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$

where,
r12.3 = partial correlation between variables 1 and 2 keeping the 3rd constant,
r12, r13, r23 = simple correlation coefficients between the pairs of variables indicated by the subscripts.
Similarly,

$r_{13.2} = \dfrac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}}$  and  $r_{23.1} = \dfrac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}}$

Example: Given r12 = 0.8, r13 = 0.5 and r23 = 0.4, calculate all the partial correlation coefficients.

Solution: (i) The correlation between variables 1 and 2 keeping the 3rd constant is given by

$r_{12.3} = \dfrac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \dfrac{0.8 - 0.5 \times 0.4}{\sqrt{1 - 0.5^2}\,\sqrt{1 - 0.4^2}} = \dfrac{0.6}{0.794} = 0.756$

(ii) The correlation between variables 1 and 3 keeping the 2nd constant is given by

$r_{13.2} = \dfrac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \dfrac{0.5 - 0.8 \times 0.4}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.4^2}} = \dfrac{0.18}{0.55} = 0.33$

(iii) The correlation between variables 2 and 3 keeping the 1st constant is given by

$r_{23.1} = \dfrac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} = \dfrac{0.4 - 0.8 \times 0.5}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.5^2}} = 0$
SAQ 6: Given r12 = 0.7, r13 = 0.61 and r23 = 0.4, calculate the partial correlation coefficients r12.3 and r13.2.
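All three first-order partial correlations use the same pattern of arguments, which a small helper makes explicit. The Python sketch below is illustrative, not from the original text (partial_r is our own name); it reproduces the three values of the worked example and can be reused to check SAQ 6.

```python
# Minimal sketch: first-order partial correlation
# r12.3 = (r12 - r13*r23) / sqrt((1 - r13^2) * (1 - r23^2)).
from math import sqrt

def partial_r(r_ab, r_ac, r_bc):
    """Correlation between a and b, keeping the third variable c constant."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

r12, r13, r23 = 0.8, 0.5, 0.4                 # values of the worked example
print(round(partial_r(r12, r13, r23), 3))     # r12.3 ~ 0.756
print(round(partial_r(r13, r12, r23), 3))     # r13.2 ~ 0.327
print(round(partial_r(r23, r12, r13), 3))     # r23.1 =  0.0
```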
The coefficients of multiple correlation R1.23, R2.13 and R3.12 can be expressed as:

$R_{1.23} = \sqrt{\dfrac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}}$

$R_{2.13} = \sqrt{\dfrac{r_{12}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{13}^2}}$

$R_{3.12} = \sqrt{\dfrac{r_{13}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{12}^2}}$

Example: Substituting the given values of r12, r13 and r23 into the first formula,

$R_{1.23} = \sqrt{\dfrac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}} = 0.986$
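The multiple correlation coefficient can be evaluated in the same way. The Python sketch below is illustrative, not from the original text (multiple_R is our own name); as a check it is applied to the values of Terminal Question 6 (r12 = 0.69, r13 = 0.45, r23 = 0.58), for which the answer given at the end of the unit is R3.12 = 0.584.

```python
# Minimal sketch: coefficient of multiple correlation of variable a on
# variables b and c, R = sqrt((r_ab^2 + r_ac^2 - 2*r_ab*r_ac*r_bc)/(1 - r_bc^2)).
from math import sqrt

def multiple_R(r_ab, r_ac, r_bc):
    return sqrt((r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc ** 2))

# R3.12 treats variable 3 as dependent, so the arguments are (r13, r23, r12).
r12, r13, r23 = 0.69, 0.45, 0.58
print(round(multiple_R(r13, r23, r12), 3))    # R3.12 ~ 0.584
```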
8.5 Regression
Regression is defined as "the measure of the average relationship between two or more variables in terms of the original units of the data".
The line of regression is the line which gives the best estimate of the value of one variable for any specified value of the other variable. The line of regression is the line of best fit and is obtained by the principle of least squares.
For a set of paired observations there exist two regression lines. The line drawn in such a way that the sum of the vertical deviations is zero and the sum of their squares is minimum is called the regression line of y on x; it is used to estimate y values for given x values. The line drawn in such a way that the sum of the horizontal deviations is zero and the sum of their squares is minimum is called the regression line of x on y; it is used to estimate x values for given y values. The smaller the angle between these lines, the higher is the correlation between the variables. The regression lines always intersect at $(\bar{x}, \bar{y})$.
The regression equation of x on y is

$x - \bar{x} = b_{xy}\,(y - \bar{y})$

where

$b_{xy} = \dfrac{N\sum xy - (\sum x)(\sum y)}{N\sum y^2 - (\sum y)^2} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = r\,\dfrac{\sigma_x}{\sigma_y} = \dfrac{\mathrm{cov}(x, y)}{\sigma_y^2}$

Similarly, the regression equation of y on x is

$y - \bar{y} = b_{yx}\,(x - \bar{x})$

where

$b_{yx} = \dfrac{N\sum xy - (\sum x)(\sum y)}{N\sum x^2 - (\sum x)^2} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = r\,\dfrac{\sigma_y}{\sigma_x} = \dfrac{\mathrm{cov}(x, y)}{\sigma_x^2}$

Here byx and bxy are called the regression coefficients and r is the correlation coefficient. Note that

$b_{yx}\,b_{xy} = r^2$, so that $r = \pm\sqrt{b_{yx}\,b_{xy}}$ (r takes the common sign of the two regression coefficients).
Properties:
1. The product of the regression coefficients cannot exceed 1, that is, $b_{yx}\,b_{xy} \le 1$.
2. Regression coefficients are independent of the change of origin but not of the change of scale.
3. A regression coefficient is an absolute measure (it is expressed in the units of the data).
Correlation Coefficient                                   Regression Coefficient
The correlation coefficient is symmetric: rxy = ryx.      The regression coefficients are not symmetric: byx ≠ bxy.
r lies between -1 and 1.                                  byx can be greater than one, in which case bxy must be less than one so that byx · bxy ≤ 1.
It has no units attached to it.                           It has units attached to it.
It is not based on a cause-and-effect relationship.       It is based on a cause-and-effect relationship.
It indirectly helps in estimation.                        It is meant for estimation.
For a data set with N = 10, $\sum X = 225$ and $\sum Y = 190$:

$\bar{X} = \dfrac{225}{10} = 22.5$,  $\bar{Y} = \dfrac{190}{10} = 19$

The regression equation of Y on X is given by:

$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$
$Y - 19 = 0.521\,(X - 22.5)$
$Y = 0.521X + 7.2775$

The regression equation of X on Y is:

$b_{xy} = \dfrac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dy^2 - (\sum dy)^2} = \dfrac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \dfrac{430}{240} = 1.792$

$X - 22.5 = 1.792\,(Y - 19)$
$X = 1.792Y - 11.548$

$r = \sqrt{b_{yx}\,b_{xy}} = \sqrt{0.521 \times 1.792} = 0.966$

Hence, the correlation coefficient r is 0.966.
Example: Given $\bar{X} = 65$, $\bar{Y} = 67$, $\sigma_x = 2.5$, $\sigma_y = 3.5$ and r = 0.8, obtain the two regression equations.

Solution: The regression equation of Y on X is

$Y - \bar{Y} = r\,\dfrac{\sigma_y}{\sigma_x}\,(X - \bar{X})$
$Y - 67 = 0.8 \times \dfrac{3.5}{2.5}\,(X - 65)$
$Y = 1.12X - 5.8$

The regression equation of X on Y is

$X - \bar{X} = r\,\dfrac{\sigma_x}{\sigma_y}\,(Y - \bar{Y})$
$X - 65 = 0.8 \times \dfrac{2.5}{3.5}\,(Y - 67)$
$X = 0.57Y + 26.72$

Hence, the two regression equations are Y = 1.12X - 5.8 and X = 0.57Y + 26.72.
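When only the means, standard deviations and r are known, both regression lines follow from byx = r·σy/σx and bxy = r·σx/σy. The Python sketch below is illustrative, not from the original text (regression_lines is our own name); it uses the figures of the example above and returns the slope and intercept of each line.

```python
# Minimal sketch: regression lines from summary statistics.
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    byx = r * sd_y / sd_x                    # slope of Y on X
    bxy = r * sd_x / sd_y                    # slope of X on Y
    # Y = byx*X + (mean_y - byx*mean_x)  and  X = bxy*Y + (mean_x - bxy*mean_y)
    return (byx, mean_y - byx * mean_x), (bxy, mean_x - bxy * mean_y)

y_on_x, x_on_y = regression_lines(65, 67, 2.5, 3.5, 0.8)
print(y_on_x)   # (1.12, -5.8)       ->  Y = 1.12X - 5.8
print(x_on_y)   # (~0.571, ~26.71)   ->  X = 0.57Y + 26.72 (after rounding)
```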
Example: The following results were worked out from the scores in Statistics and Mathematics in a certain examination.

                      Scores in Statistics (X)   Scores in Mathematics (Y)
Mean                  40                         48
Standard Deviation    10                         15

Therefore, when y = 30, x = 35.518 (using equation (3)), and when x = 50, y = 54.3 (using equation (4)).
Example: For the data shown in the table below, obtain the two regression equations. Estimate Y for X = 15 and estimate X for Y = 20.
X 12 4 20 8 16
Y 18 22 10 16 14
Solution: The table below displays the values required for obtaining the regression equations.

X    Y    X - X̄   Y - Ȳ   (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)
12   18    0        2        0           4           0
4    22   -8        6       64          36         -48
20   10    8       -6       64          36         -48
8    16   -4        0       16           0           0
16   14    4       -2       16           4          -8
Total                       160          80        -104

$\bar{X} = (12 + 4 + 20 + 8 + 16)/5 = 12$  and  $\bar{Y} = (18 + 22 + 10 + 16 + 14)/5 = 16$

$b_{yx} = \dfrac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2} = \dfrac{-104}{160} = -0.65$

and

$b_{xy} = \dfrac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(Y - \bar{Y})^2} = \dfrac{-104}{80} = -1.3$

The regression equation of X on Y is

$X - \bar{X} = b_{xy}\,(Y - \bar{Y})$
$X - 12 = -1.3\,(Y - 16)$
$X = 32.8 - 1.3Y$

When Y = 20, X = 6.8.

The regression equation of Y on X is

$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$
$Y - 16 = -0.65\,(X - 12)$
$Y = 23.8 - 0.65X$

When X = 15, Y = 14.05.
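The same coefficients can be computed from raw paired data. The Python sketch below is illustrative, not from the original text (regression_coefficients is our own name); it reproduces byx, bxy and the two estimates of the example above.

```python
# Minimal sketch: regression coefficients from raw data,
# byx = S_xy/S_xx and bxy = S_xy/S_yy, then point estimates from each line.
def regression_coefficients(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    s_yy = sum((y - my) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / s_yy          # (byx, bxy)

X = [12, 4, 20, 8, 16]
Y = [18, 22, 10, 16, 14]
byx, bxy = regression_coefficients(X, Y)
mx, my = sum(X) / len(X), sum(Y) / len(Y)
print(byx, bxy)                   # -0.65 -1.3
print(my + byx * (15 - mx))       # Y estimated at X = 15 -> 14.05
print(mx + bxy * (20 - my))       # X estimated at Y = 20 -> 6.8
```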
SAQ 7: From the data given below, compute the two regression coefficients and formulate the two regression equations: ΣX = 510, ΣY = 7140, ΣX² = 4150, ΣXY = 54900, ΣY² = 740200, N = 102. Also determine the value of Y when X = 7.
8.7 Summary
In this unit we studied the concept of correlation with the help of Karl Pearson's correlation coefficient and Spearman's rank correlation coefficient, with a suitable number of examples. Partial and multiple correlation were also studied. The concepts of regression analysis and multiple regression were introduced with the help of examples.
8.8 Terminal Questions
Using the rank correlation method, find which pair of judges has the nearest approach to common likings in music.
4. Calculate Karl Pearson's coefficient of correlation between age and playing habits of the following students:
Age : 15 16 17 18 19 20
No. of students : 250 200 150 120 100 80
Regular players : 200 150 90 48 30 12
5. Given r12 = 0.6, r13 = 0.5 and r23 = 0.2, calculate r12.3
6. For a large group of students x1 = score in economics, x2 = score in
maths, x3 = score in Stats., r12 = 0.69, r13 = 0.45, r23 = 0.58. Determine
the multiple correlation R3.12.
7. Obtain the equations of the two lines of regression for the following data:
X Y X Y
43 29 45 27
44 31 42 29
46 19 38 41
40 18 40 30
44 19 42 26
42 27 57 10
8.9 Answers
3. The table below shows the sums required for the calculation of Karl Pearson's correlation coefficient.

Year   Index of Production X   x = X - X̄   x²    No. of unemployed Y   y = Y - Ȳ   y²    xy
1985   100                     -4           16    15                     0           0      0
1986   102                     -2            4    12                    -3           9      6
1987   104                      0            0    13                    -2           4      0
1988   107                      3            9    11                    -4          16    -12
1989   105                      1            1    12                    -3           9     -3
1990   112                      8           64    12                    -3           9    -24
1991   103                     -1            1    19                     4          16     -4
1992    99                     -5           25    26                    11         121    -55
Total  ΣX = 832             Σx = 0     Σx² = 120  ΣY = 120           Σy = 0   Σy² = 184  Σxy = -92

$\bar{X} = 104$, $\bar{Y} = 15$

$r = \dfrac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \dfrac{-92}{\sqrt{120 \times 184}} = -0.619$
4. $\rho = 1 - \dfrac{6\sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 136}{16(16^2 - 1)} = 1 - \dfrac{816}{4080} = 0.8$
5.

Applicant   Marks in Accountancy (X)   Rank assigned R1   Marks in Stats. (Y)   Rank assigned R2   (R1 - R2)² = D²
A           15                         2                  40                    6                  16
B           20                         3.5                30                    4                  0.25
C           28                         5                  50                    7                  4
D           12                         1                  30                    4                  9
E           40                         6                  20                    2                  16
F           60                         7                  10                    1                  36
G           20                         3.5                30                    4                  0.25
H           80                         8                  60                    8                  0
N = 8                                                                                              ΣD² = 81.5
$\rho = 1 - \dfrac{6\left[81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right]}{8(8^2 - 1)} = 1 - \dfrac{6\,[81.5 + 0.5 + 2]}{8 \times 63} = 1 - \dfrac{504}{504} = 0$
6. Solution: (i) The correlation between variables 1 and 2 keeping the 3rd constant is given by

$r_{12.3} = \dfrac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \dfrac{0.7 - 0.61 \times 0.4}{\sqrt{1 - 0.61^2}\,\sqrt{1 - 0.4^2}} = \dfrac{0.456}{0.794 \times 0.916} = 0.629$
(ii) The correlation between variables 1 and 3 keeping the 2nd constant is given by

$r_{13.2} = \dfrac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \dfrac{0.61 - 0.7 \times 0.4}{\sqrt{1 - 0.7^2}\,\sqrt{1 - 0.4^2}} = \dfrac{0.33}{0.654} = 0.504$

7. $b_{yx} = 12$, $b_{xy} = 0.08$; regression equation of Y on X: Y = 12X + 10; regression equation of X on Y: X = 0.08Y - 0.6. When X = 7, Y = 94.
8.

X   Y   dx = X - 3   dy = Y - 2   dx²   dy²   dx·dy
1   6   -2            4            4     16    -8
5   1    2           -1            4      1    -2
3   0    0           -2            0      4     0
2   0   -1           -2            1      4     2
1   1   -2           -1            4      1     2
1   2   -2            0            4      0     0
7   1    4           -1           16      1    -4
3   5    0            3            0      9     0
N = 8,  Σdx = -1,  Σdy = 0,  Σdx² = 33,  Σdy² = 36,  Σdx·dy = -10

a) X on Y:

$b_{xy} = \dfrac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dy^2 - (\sum dy)^2} = \dfrac{8(-10) - (-1)(0)}{8(36) - (0)^2} = \dfrac{-80}{288} = -0.28$
where $\bar{X} = 3 + \dfrac{\sum dx}{N} = 2.875$ and $\bar{Y} = 2 + \dfrac{\sum dy}{N} = 2$, so the regression equation of X on Y is $X - 2.875 = -0.28\,(Y - 2)$.

b) Y on X:

$b_{yx} = \dfrac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dx^2 - (\sum dx)^2} = \dfrac{-80}{8(33) - (-1)^2} = \dfrac{-80}{263} = -0.30$, so $Y - 2 = -0.30\,(X - 2.875)$.

Coefficient of correlation: $r = -\sqrt{b_{xy}\,b_{yx}} = -\sqrt{0.28 \times 0.30} = -0.29$
Terminal Questions
1. The pair of judges A and C has the nearest approach to common likings in music.
2. 0.915; there is a high degree of positive correlation between the ranks assigned by the two managers.
3. R = 0.429
4. R = -0.991
5. 0.589
6. 0.584
7. bxy = -0.44, byx = -1.22; regression equation of X on Y: X = 54.80 - 0.44Y;
regression equation of Y on X: Y = 78.67 - 1.22X