Correlation and Regression Analysis
Introduction
In the previous chapter we studied the characteristics of only one variable,
for example marks, weights, heights, rainfall, prices, ages, or sales. This type of
analysis is called univariate analysis.
Sometimes we are interested in finding out whether there is a relationship between
two variables under study, for example, the price of a commodity and its sales,
the height of a father and the height of his son, price and demand, yield and
rainfall, or height and weight.
The association between two variables is known as correlation.
Positive and negative correlation depends upon the direction of change of the
variables.
• Positive Correlation
• If two variables tend to move together in the same direction that is, an increase
in the value of one variable is accompanied by an increase in the value of the
other variable; or a decrease in the value of one variable is accompanied by a
decrease in the value of the other variable, then the correlation is called
positive or direct correlation.
Example
(i) The heights and weights of individuals
(i) If the plotted points show an upward trend, the correlation will be positive.
(ii) If the plotted points show a downward trend, the correlation will be
negative.
(iii) If the plotted points show no trend the variables are said to be uncorrelated.
Karl Pearson’s Correlation Coefficient
• Karl Pearson, a great biometrician and statistician, suggested a
mathematical method for measuring the magnitude of linear relationship
between two variables say X and Y.
• Karl Pearson’s method is the most widely used method in practice and is
known as Pearsonian Coefficient of Correlation.
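• It is denoted by the symbol ‘r’ and defined as
$$ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} $$
Hence, the formula to compute Karl Pearson’s correlation coefficient is
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$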
Interpretation of Correlation coefficient:
The coefficient of correlation lies between –1 and +1. Symbolically, –1 ≤ r ≤ +1.
When r = 0, there is no linear relationship between the variables; that is, the
variables are uncorrelated.
This method is to be applied only when the deviations of items are taken from
actual means.
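The deviation method described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the height and weight figures are hypothetical, and the function assumes neither series is constant (otherwise the denominator is zero).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations taken about the actual means."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations of X from its mean
    dy = [y - mean_y for y in ys]          # deviations of Y from its mean
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / sqrt(sum_dx2 * sum_dy2)

# Hypothetical data, for illustration only
heights = [60, 62, 64, 66, 68]
weights = [50, 54, 57, 63, 66]
r = pearson_r(heights, weights)
print(round(r, 4))
```

A value of r close to +1 indicates a strong positive linear relationship, a value close to –1 a strong negative one.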
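(ii) When actual values are taken (without deviation)
$$ r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} $$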
Example 1
Calculate Karl Pearson’s coefficient of correlation from the following data:
Solution:
Example 2
Calculate coefficient of correlation from the following data
Solution:
In both series the number of items is small. Therefore, the correlation coefficient can
also be calculated without taking deviations from the actual means or an assumed mean.
(iii) When deviations are taken from an Assumed mean
When the actual means are fractional, say the actual means of the X and Y series are 20.167
and 29.23, calculating the correlation by the method discussed above involves lengthy
arithmetic and takes a lot of time. In such cases we use the assumed mean method.
When deviations are taken from an assumed mean, the following formula is applicable:
$$ r = \frac{N\sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[N\sum dx^2 - (\sum dx)^2\right]\left[N\sum dy^2 - (\sum dy)^2\right]}} $$
where dx = X − A and dy = Y − B are the deviations of X and Y from their assumed means A and B.
NOTE
When applying the assumed mean method, any value can be taken as the assumed mean
and the answer will be the same. However, the nearer the assumed mean is to the actual
mean, the simpler the calculations.
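The note above can be checked numerically. The sketch below assumes the standard assumed-mean formula, r = (NΣdxdy − ΣdxΣdy) / √[(NΣdx² − (Σdx)²)(NΣdy² − (Σdy)²)], and computes r twice with two different pairs of assumed means. The function name `r_assumed_mean`, the data, and the chosen assumed means are all hypothetical.

```python
from math import sqrt

def r_assumed_mean(xs, ys, a_x, a_y):
    """Correlation coefficient using deviations from assumed means a_x and a_y."""
    n = len(xs)
    dx = [x - a_x for x in xs]             # deviations from the assumed mean of X
    dy = [y - a_y for y in ys]             # deviations from the assumed mean of Y
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(p * q for p, q in zip(dx, dy))
    s_dx2 = sum(p * p for p in dx)
    s_dy2 = sum(q * q for q in dy)
    num = n * s_dxdy - s_dx * s_dy
    den = sqrt((n * s_dx2 - s_dx ** 2) * (n * s_dy2 - s_dy ** 2))
    return num / den

# Hypothetical data, for illustration only
xs = [65, 66, 67, 67, 68, 69, 70, 72]
ys = [67, 68, 65, 68, 72, 72, 69, 71]

r_a = r_assumed_mean(xs, ys, 67, 68)   # assumed means close to the actual means
r_b = r_assumed_mean(xs, ys, 0, 100)   # deliberately poor assumed means
print(round(r_a, 4), round(r_b, 4))    # both choices give the same r
```

The second call works with much larger intermediate sums, which is why a nearby assumed mean is preferred for hand calculation.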
Example
Find out the coefficient of correlation in the following case and interpret.
Solution:
Let X denote the height of the father (in inches) and Y the height of the
son (in inches).
Example
Calculate the correlation coefficient from the following data
Solution:
Example
From the following data calculate the correlation coefficient: Σdxdy = 120, Σdx² = 90, Σdy² = 640
Solution:
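A possible working, assuming the three given quantities are the sums Σdxdy, Σdx² and Σdy² of deviations taken from the actual means, so that the deviation formula applies directly:

$$ r = \frac{\sum dx\,dy}{\sqrt{\sum dx^2 \, \sum dy^2}} = \frac{120}{\sqrt{90 \times 640}} = \frac{120}{\sqrt{57600}} = \frac{120}{240} = 0.5 $$

Under that assumption there is a moderate positive correlation between the two variables.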
Rank correlation
Spearman’s Rank Correlation Coefficient
• In 1904, Charles Edward Spearman, a British psychologist, devised a
method of finding the coefficient of correlation from ranks. This
measure is useful in dealing with qualitative characteristics such as
intelligence, beauty, morality, and character, which cannot be measured
quantitatively as required for Pearson’s coefficient of correlation.
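When no ranks are tied, the rank correlation is usually computed as ρ = 1 − 6Σd² / (n(n² − 1)), where d is the difference between the two ranks of an item and n is the number of items. The sketch below applies that formula; the function name `spearman_rho` and the two judges' rankings are hypothetical.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two rankings without ties."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two equal-length rankings with at least 2 items")
    # d = difference between the two ranks of the same item
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of five items by two judges
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
rho = spearman_rho(judge_a, judge_b)
print(round(rho, 4))
```

Identical rankings give ρ = +1 and exactly reversed rankings give ρ = −1.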
Example: The following are the ranks obtained by 10 students in Statistics and
Mathematics. Find the rank correlation coefficient.
Solution:
Let RX denote the ranks in Statistics and RY the ranks in Mathematics.
Example: Ten competitors in a beauty contest are ranked by three judges in the
following order.
Use the method of rank correlation to determine which pair of judges
has the nearest approach to common taste in beauty.
Solution:
Let RX, RY, RZ denote the ranks given by the first, second, and third judge
respectively.
Since the rank correlation coefficient between the second and third judges, ρYZ,
is positive and the highest of the three coefficients, the second and third
judges have the nearest approach to common taste in beauty.
When ranks are not given
Example: Calculate the rank correlation coefficient of the following data.
Solution:
Let X denote the marks in Subject 1 and Y the marks in Subject 2.
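Example:
Find the rank correlation coefficient and interpret the value.
X : 43 44 46 40 44 45 46 52
Y : 29 31 29 32 48 29 32 46
Solution:
X     Y     R1     R2     d = R1 - R2     d²
43    29    7      7       0               0
44    31    5.5    5       0.5             0.25
46    29    2.5    7      -4.5            20.25
40    32    8      3.5     4.5            20.25
44    48    5.5    1       4.5            20.25
45    29    4      7      -3               9
46    32    2.5    3.5    -1               1
52    46    1      2      -1               1
                                    Σd² = 72
Tied values receive the average of the ranks they occupy, and a correction factor (C.F.)
is added to Σd²:
$$ \text{C.F.} = \sum \frac{m(m^2 - 1)}{12} $$
where m = number of times each value repeats.
In X, 44 and 46 each repeat twice; in Y, 32 repeats twice and 29 repeats three times, so
C.F. = 0.5 + 0.5 + 0.5 + 2 = 3.5.
$$ \rho = 1 - \frac{6\left(\sum d^2 + \text{C.F.}\right)}{n(n^2 - 1)} = 1 - \frac{6(72 + 3.5)}{8(8^2 - 1)} = 1 - \frac{453}{504} \approx 0.101 $$
There is a weak positive correlation.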
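The tied-rank procedure can be sketched as follows, using the eight (X, Y) pairs from the table rows of this example. The helper names `average_ranks`, `correction_factor` and `spearman_tied` are hypothetical, and the sketch assumes the usual convention that the largest value gets rank 1, tied values share the average of the ranks they occupy, and the correction factor Σ m(m² − 1)/12 is added to Σd².

```python
def average_ranks(values):
    """Rank in descending order (largest = 1); ties get the mean of their positions."""
    order = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = order.index(v) + 1         # first 1-based position held by v
        m = order.count(v)                 # number of times v repeats
        ranks[v] = first + (m - 1) / 2     # mean of positions first .. first+m-1
    return [ranks[v] for v in values]

def correction_factor(values):
    """Sum of m(m^2 - 1)/12 over the distinct values; untied values contribute 0."""
    return sum(values.count(v) * (values.count(v) ** 2 - 1) / 12
               for v in set(values))

def spearman_tied(xs, ys):
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n ** 2 - 1))

# The eight pairs from the table rows of the example
x = [43, 44, 46, 40, 44, 45, 46, 52]
y = [29, 31, 29, 32, 48, 29, 32, 46]
print(average_ranks(x))
print(average_ranks(y))
print(round(spearman_tied(x, y), 4))
```

Printing the two rank lists makes it easy to compare them against the R1 and R2 columns of the table before trusting the final coefficient.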
Problems:
1. Find the rank correlation coefficient and interpret the answer.
Aptitude Index(X) : 60 62 60 70 65 58 50
Productivity Index(Y) : 68 60 62 80 60 60 62
3. A random sample of recent repair jobs was selected, and the estimated cost and actual
cost were recorded. Calculate the value of Spearman’s correlation coefficient.
Regression Analysis
Introduction:
Definition
• Regression analysis is a mathematical measure of the average relationship between
two or more variables in terms of the original units of the data.
• Regression helps us to estimate the value of one variable, provided the value of the
other variable is given. The statistical method which helps us to estimate the
unknown value of one variable from the known value of the related variable is called
Regression.
• Dependent and independent variables
• In regression analysis there are two types of variables. The variable whose
value is to be predicted is called dependent variable and the variable
which is used for prediction is called independent variable.
• Regression Equations
• Regression equations are algebraic expressions of the regression lines.
Since there are two regression lines, there are two regression equations.
• The regression equation of X on Y is used to describe the variation in the
values of X for given changes in Y and the regression equation of Y on X is
used to describe the variation in the values of Y for given changes in X.
• Regression equations of (i) X on Y (ii) Y on X and their coefficients in
different cases are described as follows.
When the actual values are taken
• When we deal with actual values of X and Y variables the two regression
equations and their respective coefficients are written as follows
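1. Regression equation of Y on X:
$$ Y - \bar{Y} = b_{yx}(X - \bar{X}), \qquad b_{yx} = r\,\frac{\sigma_Y}{\sigma_X} = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2} $$
2. Regression equation of X on Y:
$$ X - \bar{X} = b_{xy}(Y - \bar{Y}), \qquad b_{xy} = r\,\frac{\sigma_X}{\sigma_Y} = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - (\sum Y)^2} $$
The regression coefficients have the following properties:
(i) The correlation coefficient is the geometric mean of the two regression
coefficients, taking their common sign:
$$ r = \pm\sqrt{b_{xy} \, b_{yx}} $$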
(ii) If one of the regression coefficients is greater than unity, the other must be
less than unity.
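The two regression coefficients and their link to r can be illustrated with a short sketch. It assumes the deviation forms b_yx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and b_xy = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)²; the function name `regression_lines` and the data are hypothetical.

```python
from math import sqrt

def regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for the two regression lines."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = s_xy / s_xx      # regression coefficient of Y on X
    b_xy = s_xy / s_yy      # regression coefficient of X on Y
    return mean_x, mean_y, b_yx, b_xy

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mean_x, mean_y, b_yx, b_xy = regression_lines(xs, ys)

# r is the geometric mean of the two coefficients, with their common sign
r = sqrt(b_yx * b_xy)
if b_yx < 0:
    r = -r

# Y on X: Y - mean_y = b_yx * (X - mean_x); estimate Y when X = 6
y_at_6 = mean_y + b_yx * (6 - mean_x)
# X on Y: X - mean_x = b_xy * (Y - mean_y); estimate X when Y = 3
x_at_3 = mean_x + b_xy * (3 - mean_y)
print(b_yx, b_xy, round(r, 4), y_at_6, x_at_3)
```

Because b_yx × b_xy = r² cannot exceed 1, both coefficients cannot be greater than unity at the same time, which is property (ii).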
Solution:
Regression coefficient of X on Y
Regression equation of X on Y
Regression coefficient of Y on X
Regression equation of Y on X
Exercises
1. The following table gives the aptitude test scores and the productivity indices of 7 workers selected at
random.
Aptitude Index(X) : 60 62 65 70 72 48 50
Productivity Index(Y) : 68 60 62 80 85 40 54
a) What are dependent and independent variables?
b) Fit regression of Y on X
c) Estimate the average productivity of a worker whose test score is 82
2. The following data give the work experience of machine operators in a factory and the number of units of
production turned out per day.
Machine Operator : 1 2 3 4 5 6 7 8
Work experience (years): 6 8 9 7 5 2 1 3
Units of Production : 50 60 62 54 47 25 20 41
a) Calculate the regression lines and estimate the probable units of production of a machine
operator with an experience of 12 years.
b) Estimate the probable years of experience of a machine operator whose daily production is 85.
3. Calculate the two regression equations of X on Y and Y on X from the data given
below, taking deviations from the actual means of X and Y.
4. Obtain the two regression lines from the following data and interpret the values.
Advertising expenditure (Rs. in lakhs) : 10 12 13 23 27 30
Sales turnover (Rs. in crores) : 40 42 40 45 48 50
Problem: The following table shows the sales and advertisement expenditure of a firm.
The coefficient of correlation is r = 0.9. Estimate the likely sales for a proposed
advertisement expenditure of Rs. 10 crores.
Solution:
• Let
Where is regression used?
• Finance: capital asset pricing model (CAPM), non-performing
assets, probability of default, chance of bankruptcy, credit risk.
• Marketing: sales, market share, customer satisfaction,
customer churn, customer retention, customer lifetime value.
• Operations: Inventory, productivity, efficiency.
• HR: Job satisfaction, attrition.
• The main purpose of regression is to predict the value of
dependent variable given the value(s) of independent
variable(s).
Potential uses of regression analysis in business include the
following:
1. How do wages of employees depend on years of
experience, years of education, and gender?
2. How does the current price of a stock depend on its own
past values, as well as the current and past values of a
market index?
3. How does a company’s current sales level depend on its
current and past advertising levels, the advertising levels
of its competitors, the company’s own past sales levels,
and the general level of the market?
4. How does the total cost of producing a batch of items
depend on the total quantity of items that have been
produced?
5. How does the selling price of a house depend on such
factors as the appraised value of the house, the square
footage of the house, the number of bedrooms in the
house, and perhaps others?
6. How does work exhaustion (WE) among working women
depend on perceived workload, work-family conflict,
turnover intention, and job autonomy in the
organization?
Note:
• A regression model establishes the existence of an
association between two variables, but not causation.
• In general form, the linear regression for population data can be written as
$$ Y = \beta_0 + \beta_1 X + \varepsilon $$
where ε is an error or disturbance term.
• β0 and β1 are the regression coefficients. Their estimates are b0 and b1:
$$ b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} $$
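The Linear Regression Model
$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$
[Figure: population regression line of Y against X, with intercept β0 and slope β1. For a
given Xi, the random error εi is the gap between the observed value of Y and the
predicted value of Y on the line.]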
Estimated Linear Regression Equation
The simple linear regression equation provides an estimate
of the population regression line:
$$ \hat{Y}_i = b_0 + b_1 X_i $$
where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the estimate of
the regression intercept, b1 is the estimate of the regression slope, and Xi is the value
of X for observation i. b0 and b1 are the estimators of β0 and β1.
• b0 and b1 are obtained by minimizing the sum of the
squared differences (errors) between Y and Ŷ:
$$ \text{SSE} = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0 - b_1 X_i)^2 $$
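The least-squares idea can be sketched numerically: compute b0 and b1 from the closed-form expressions b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1·X̄, then check that nudging either coefficient away from these values does not reduce the sum of squared errors. The function names `fit_line` and `sse`, the data, and the nudge sizes are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line Y-hat = b0 + b1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared differences between observed Y and predicted Y-hat."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sse(xs, ys, b0, b1)

# Move each coefficient slightly in both directions and recompute the SSE
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
perturbed = [sse(xs, ys, b0 + d0, b1 + d1) for d0, d1 in nudges]

print(round(b0, 4), round(b1, 4), round(best, 4))
print([round(p, 4) for p in perturbed])   # each value is larger than best
```

This check only illustrates the minimizing property on one data set; it says nothing about whether the slope is statistically significant.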
[Figure: modeling workflow. Specify the model Y = β0 + β1X + ε; estimate it by minimizing
SSE; validate the model (is X really related to Y? is β1 statistically significant?);
use the model.]