Correlation and Regression Analysis

This document discusses correlation and regression analysis. It defines correlation as the statistical analysis that measures the degree to which two variables fluctuate in relation to each other. Correlation can be positive, negative, or zero. Positive correlation means the variables increase together, while negative means they change in opposite directions. Regression analysis allows estimating or predicting the value of one variable based on the other. The key methods discussed are scatter plots, Pearson's correlation coefficient, and Spearman's rank correlation coefficient.

Uploaded by

manasaavvaru
Copyright
© © All Rights Reserved

Correlation Analysis

Introduction
In the previous chapter we studied the characteristics of a single variable,
for example marks, weights, heights, rainfall, prices, ages or sales. This type of
analysis is called univariate analysis.
Sometimes we are interested in whether there is any relationship between
two variables under study, for example the price of a commodity and its sales,
the height of a father and the height of his son, price and demand, yield and
rainfall, or height and weight.
The association between any two such variables is known as correlation.

Correlation is the statistical analysis which measures and analyses
the degree or extent to which two variables fluctuate with reference
to each other.
1. Meaning of Correlation
The term correlation refers to the degree of relationship between two or more
variables. If a change in one variable is accompanied by a change in the other
variable, the variables are said to be correlated.
2. Types of correlation
Correlation is classified into many types, but the important ones are:
(i) Positive
(ii) Negative

Whether a correlation is positive or negative depends upon the direction of
change of the variables.
• Positive Correlation
• If two variables tend to move together in the same direction, that is, an increase
in the value of one variable is accompanied by an increase in the value of the
other variable, or a decrease in the value of one variable is accompanied by a
decrease in the value of the other variable, then the correlation is called
positive or direct correlation.

Example
(i) The heights and weights of individuals

(ii) Price and Supply

(iii) Rainfall and Yield of crops

(iv) The income and expenditure


• Negative Correlation
• If two variables tend to move in opposite directions, so that an increase
in the value of one variable is accompanied by a decrease in the value of the
other variable (or a decrease by an increase), then the correlation is called
negative or inverse correlation.
Example
(i) Price and demand
(ii) Repayment period and EMI
(iii) Yield of crops and price
(iv) Speed of a vehicle and time taken
• No Correlation
• Two variables are said to be uncorrelated if the change in the value of one
variable has no connection with the change in the value of the other variable.
• For example
• We should expect zero correlation (no correlation) between weight of a person
and the colour of his hair or the height of a person and the colour of his hair.
• Simple and multiple correlation
The correlation between two variables is called simple correlation.
The correlation in the case of more than two variables is called
multiple correlation.
The following are methods of studying correlation:
• (i) Scatter diagram
• (ii) Karl Pearson’s Coefficient of Correlation
Scatter Diagram
Let (X1, Y1), (X2, Y2), …, (Xn, Yn) be the n pairs of observations of the variables X and Y. If we
plot the values of X along the x-axis and the corresponding values of Y along the y-axis, the
diagram so obtained is called a scatter diagram. It gives us an idea of the relationship between
X and Y. The types of scatter diagram under simple linear correlation are described below.

(i) If the plotted points show an upward trend, the correlation will be positive.
(ii) If the plotted points show a downward trend, the correlation will be
negative.
(iii) If the plotted points show no trend the variables are said to be uncorrelated.
Karl Pearson’s Correlation Coefficient
• Karl Pearson, a great biometrician and statistician, suggested a
mathematical method for measuring the magnitude of linear relationship
between two variables say X and Y.
• Karl Pearson’s method is the most widely used method in practice and is
known as Pearsonian Coefficient of Correlation.
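• It is denoted by the symbol ‘r’ and defined as
$$ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \, \sigma_y} $$
Hence, the formula to compute Karl Pearson’s correlation coefficient is
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \; \sum (Y - \bar{Y})^2}} $$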
Interpretation of Correlation coefficient:
The coefficient of correlation lies between –1 and +1. Symbolically, –1 ≤ r ≤ +1

When r = +1, there is perfect positive correlation between the variables.

When r = –1, there is perfect negative correlation between the variables.

When r = 0, there is no linear relationship between the variables, that is, the
variables are uncorrelated.

Thus, the coefficient of correlation describes the magnitude and direction of
correlation.
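The definition and interpretation above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the small data series are assumptions made up for the example, and the function assumes neither series is constant (otherwise the denominator is zero).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the actual means."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations of X from its mean
    dy = [y - mean_y for y in ys]          # deviations of Y from its mean
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / sqrt(sum_dx2 * sum_dy2)


# Made-up illustrative data (not from the slides)
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # Y rises with X: perfect positive correlation
print(pearson_r(x, [10, 8, 6, 4, 2]))   # Y falls as X rises: perfect negative correlation
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength.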
Methods of computing Correlation Coefficient
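(i) When deviations are taken from Mean
Of all the mathematical methods of measuring correlation, Karl
Pearson’s method, popularly known as the Pearsonian coefficient of correlation, is
the most widely used in practice.
$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \; \sum y^2}}, \qquad x = X - \bar{X}, \quad y = Y - \bar{Y} $$
This formula is to be applied only when the deviations of items are taken from
the actual means.
(ii) When actual values are taken (without deviation)
$$ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}} $$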
Example 1
Calculate Karl Pearson’s coefficient of correlation from the following data:

Solution:
Example 2
Calculate coefficient of correlation from the following data
Solution:
In both the series items are in small number. Therefore, correlation coefficient can
also be calculated without taking deviations from actual means or assumed mean.
(iii) When deviations are taken from an Assumed mean
When the actual means are fractions, say the actual means of the X and Y series are 20.167
and 29.23, computing the correlation by the methods discussed above involves
many tedious calculations. In such cases we use the
assumed mean method. When deviations are taken from
assumed means A and B, the following formula is applicable:
$$ r = \frac{N \sum d_x d_y - \sum d_x \sum d_y}{\sqrt{\left[N \sum d_x^2 - (\sum d_x)^2\right]\left[N \sum d_y^2 - (\sum d_y)^2\right]}}, \qquad d_x = X - A, \quad d_y = Y - B $$

NOTE
When applying the assumed mean method, any value can be taken as the assumed mean
and the answer will be the same. However, the nearer the assumed mean is to the actual
mean, the less calculation is required.
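The claim in the note, that the choice of assumed mean does not change the answer, can be checked with a short Python sketch. The function name `pearson_r_assumed_mean` and the two data series are hypothetical and chosen only for illustration. The formula used is the standard assumed-mean form, with deviations dx = X - A and dy = Y - B.

```python
from math import sqrt


def pearson_r_assumed_mean(xs, ys, a, b):
    """Pearson's r using deviations from assumed means a (for X) and b (for Y)."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(p * q for p, q in zip(dx, dy))
    s_dx2 = sum(p * p for p in dx)
    s_dy2 = sum(q * q for q in dy)
    numerator = n * s_dxdy - s_dx * s_dy
    denominator = sqrt((n * s_dx2 - s_dx ** 2) * (n * s_dy2 - s_dy ** 2))
    return numerator / denominator


# Illustrative (made-up) data; any paired series would do
xs = [65, 66, 67, 67, 68, 69, 70, 72]
ys = [67, 68, 65, 68, 72, 72, 69, 71]

r_near = pearson_r_assumed_mean(xs, ys, 67, 68)   # assumed means close to the data
r_far = pearson_r_assumed_mean(xs, ys, 0, 0)      # assumed means of zero (raw values)
print(r_near, r_far)                              # the two values agree up to rounding
```

Taking A = B = 0 reduces the formula to the one for actual values, which is why the two methods agree.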
Example
Find out the coefficient of correlation in the following case and interpret.

Solution:
Let X denote the height of the father (in inches) and Y the height of the
son (in inches).
Example
Calculate the correlation coefficient from the following data

Solution:

Example
From the following data calculate the correlation coefficient: Σdxdy = 120, Σdx² = 90, Σdy² = 640
Solution:
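A minimal Python sketch of the computation, assuming the three given quantities are the sum of products of deviations and the sums of squared deviations, all taken from the actual means, so that r = Σdxdy / √(Σdx² · Σdy²) applies. The variable names are illustrative.

```python
from math import sqrt

sum_dxdy = 120   # sum of products of deviations
sum_dx2 = 90     # sum of squared deviations of X
sum_dy2 = 640    # sum of squared deviations of Y

# r = 120 / sqrt(90 * 640) = 120 / sqrt(57600) = 120 / 240
r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)
print(r)  # 0.5
```

Under that assumption r = 0.5, a moderate positive correlation.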
Rank correlation
Spearman’s Rank Correlation Coefficient
• In 1904, Charles Edward Spearman, a British psychologist, devised a
method of ascertaining the coefficient of correlation by ranks. This
measure is useful in dealing with
qualitative characteristics, such as intelligence, beauty, morality and
character, which cannot be measured quantitatively as required for
Pearson’s coefficient of correlation.

• Rank correlation is applicable only to individual observations. The
result we get from this method is only an approximate one, because
under the ranking method the original values are not taken into account.
The formula for Spearman’s rank correlation, which is denoted by ρ (pronounced rho), is
$$ \rho = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} $$
where d = the difference of two ranks = RX - RY and
N = number of paired observations.

The rank correlation coefficient lies between –1 and +1. Symbolically,
–1≤ρ≤+1
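For the case where ranks are given and no rank is repeated, the formula ρ = 1 - 6Σd² / (N(N² - 1)) can be sketched in Python as follows. The function name `spearman_rho` and the rank lists are made up for illustration.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two lists of ranks (no tied ranks assumed)."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two equally long rank lists with at least 2 items")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))  # sum of squared rank differences
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Made-up ranks of five items under two criteria
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8: rankings largely agree
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0: rankings are exactly reversed
```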
When working with Spearman’s rank correlation, we may come across two types of
problem:

(i) When ranks are given

(ii) When ranks are not given

Example :The following are the ranks obtained by 10 students in Statistics and
Mathematics. Find the rank correlation coefficient.
Solution:
Let RX denote the ranks in Statistics and RY the ranks in
Mathematics.
Example: Ten competitors in a beauty contest are ranked by three judges in the
following order.

Use the rank correlation coefficient to determine which pair of judges
has the nearest approach to common taste in beauty.

Solution:

Let RX, RY and RZ denote the ranks given by the first, second and third judges
respectively.
Since the rank correlation coefficient between the second and third judges, ρYZ,
is positive and the highest of the three coefficients, the second and third
judges have the nearest approach to common taste in beauty.
When ranks are not given
Example: Calculate rank correlation coefficient of the following data

Solution:
Let X denote Subject 1 and Y denote Subject 2.
Example:
Find the rank correlation coefficient and interpret the value.
X : 43 44 46 40 44 45 46 52
Y : 29 31 29 32 48 29 32 46
Solution:
X     Y     R1     R2     d = R1 - R2     d²
43    29    7      7       0              0
44    31    5.5    5       0.5            0.25
46    29    2.5    7      -4.5            20.25
40    32    8      3.5     4.5            20.25
44    48    5.5    1       4.5            20.25
45    29    4      7      -3              9
46    32    2.5    3.5    -1              1
52    46    1      2      -1              1
Σd² = 72
Because some values are repeated, a correction factor (C.F.) is added to Σd²:
$$ \text{C.F.} = \sum \frac{m(m^2 - 1)}{12} $$
where m = number of times each value repeats.

In X, m = 2 (46 repeated 2 times)
m = 2 (44 repeated 2 times)
In Y, m = 2 (32 repeated 2 times)
m = 3 (29 repeated 3 times)
C.F. = 2(2² - 1)/12 + 2(2² - 1)/12 + 2(2² - 1)/12 + 3(3² - 1)/12
= 3.5

$$ \rho = 1 - \frac{6\left(\sum d^2 + \text{C.F.}\right)}{N(N^2 - 1)} = 1 - \frac{6(72 + 3.5)}{8(8^2 - 1)} = 1 - \frac{453}{504} \approx 0.101 $$
There is a weak positive correlation.
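The tied-rank procedure can be sketched in Python: rank each series with the largest value ranked 1, give tied values the average of the positions they occupy, add the correction factor Σ m(m² - 1)/12 for every repeated value, and evaluate ρ = 1 - 6(Σd² + C.F.) / (N(N² - 1)). The function names are hypothetical. The data are the eight (X, Y) pairs that appear as rows of the solution table in this example.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def correction_factor(values):
    """Sum of m(m^2 - 1)/12 over every value that occurs m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)


def spearman_rho_ties(xs, ys):
    """Spearman's rho with the tie correction added to the sum of d^2."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n * n - 1))


x = [43, 44, 46, 40, 44, 45, 46, 52]
y = [29, 31, 29, 32, 48, 29, 32, 46]
print(average_ranks(x))                   # ranks R1 of the X series
print(average_ranks(y))                   # ranks R2 of the Y series
print(round(spearman_rho_ties(x, y), 3))  # 0.101
```

With N = 8, Σd² = 72 and C.F. = 3.5, the formula evaluates to 1 - 453/504, which is about 0.101.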
Problems:
1. Find the rank correlation coefficient and interpret the answer.
Aptitude Index(X) : 60 62 60 70 65 58 50
Productivity Index(Y) : 68 60 62 80 60 60 62

3. A random sample of recent repair jobs was selected and the estimated cost and actual
cost were recorded. Calculate the value of Spearman’s correlation coefficient.
Regression Analysis
Introduction:

• So far we have studied correlation analysis, which measures the
direction and strength of the relationship between two variables.
• With regression we can estimate or predict the value of one variable from
the given value of the other variable.
• For instance, price and supply are correlated. We can find the
expected amount of supply for a given price, or the price
level required to attain a given amount of supply.
• The term “regression” literally means “stepping back towards the average”. It was
first used by the British biometrician Sir Francis Galton (1822-1911), in connection with
the inheritance of stature.
• Galton found that the offspring of abnormally tall or short parents tend to “regress”
or “step back” to the average population height.
• But the term “regression” as now used in Statistics is only a convenient term without
any reference to biometry.

Definition
• Regression analysis is a mathematical measure of the average relationship between
two or more variables in terms of the original units of the data.

• Regression helps us to estimate the value of one variable, provided the value of the
other variable is given. The statistical method which helps us to estimate the
unknown value of one variable from the known value of the related variable is called
Regression.
• Dependent and independent variables
• In regression analysis there are two types of variables. The variable whose
value is to be predicted is called dependent variable and the variable
which is used for prediction is called independent variable.
• Regression Equations
• Regression equations are algebraic expressions of the regression lines.
Since there are two regression lines, there are two regression equations.
• The regression equation of X on Y is used to describe the variation in the
values of X for given changes in Y and the regression equation of Y on X is
used to describe the variation in the values of Y for given changes in X.
• Regression equations of (i) X on Y (ii) Y on X and their coefficients in
different cases are described as follows.
When the actual values are taken
• When we deal with actual values of X and Y variables the two regression
equations and their respective coefficients are written as follows
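1. Regression Equation of Y on X:
$$ Y - \bar{Y} = b_{yx}\,(X - \bar{X}), \qquad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} $$
Here byx is known as the regression coefficient of Y on X, r is the correlation
coefficient between X and Y, and σx and σy are the standard deviations of X and Y
respectively.
2. Regression Equation of X on Y:
$$ X - \bar{X} = b_{xy}\,(Y - \bar{Y}), \qquad b_{xy} = r\,\frac{\sigma_x}{\sigma_y} $$
Here bxy is known as the regression coefficient of X on Y, r is the correlation
coefficient between X and Y, and σx and σy are the standard deviations of X and Y
respectively.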
Properties of Regression Coefficients
(i) The correlation coefficient is the geometric mean of the regression
coefficients:
$$ r = \pm\sqrt{b_{xy} \cdot b_{yx}} $$
where r takes the common sign of the two coefficients.

(ii) If one of the regression coefficients is greater than unity, the other must be
less than unity.

(iii) Both the regression coefficients are of the same sign.
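These properties can be illustrated with a short Python sketch. It assumes the standard definitions byx = r·σy/σx and bxy = r·σx/σy. The function names are hypothetical, and the inputs (r = 0.4, σx = 8, σy = 4) are borrowed from the price and stock index example later in this document.

```python
from math import copysign, sqrt


def regression_coefficients(r, sd_x, sd_y):
    """Return (b_yx, b_xy) from r and the two standard deviations."""
    b_yx = r * sd_y / sd_x   # regression coefficient of Y on X
    b_xy = r * sd_x / sd_y   # regression coefficient of X on Y
    return b_yx, b_xy


def r_from_coefficients(b_yx, b_xy):
    """Property (i): r is the geometric mean, carrying the coefficients' common sign."""
    if b_yx * b_xy < 0:
        raise ValueError("regression coefficients must have the same sign")
    return copysign(sqrt(b_yx * b_xy), b_yx)


b_yx, b_xy = regression_coefficients(0.4, 8, 4)
print(b_yx, b_xy)                       # both positive, as property (iii) requires
print(r_from_coefficients(b_yx, b_xy))  # recovers r = 0.4 up to rounding
print(b_yx * b_xy <= 1)                 # True: the product equals r squared
```

Property (ii) follows from the last line: since the product of the coefficients is r² and cannot exceed 1, both cannot be greater than unity.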


Example
Calculate the regression coefficient and obtain the lines of regression for the
following data

Solution:
Regression coefficient of X on Y

Regression equation of X on Y
Regression coefficient of Y on X

Regression equation of Y on X
Exercises
1. The following table gives the aptitude test scores and the productivity indices of 7 workers selected at
random.
Aptitude Index(X) : 60 62 65 70 72 48 50
Productivity Index(Y) : 68 60 62 80 85 40 54
a) What are dependent and independent variables?
b) Fit regression of Y on X
c) Estimate the average productivity of a worker whose test score is 82

2. The following data give the work experience of machine operators in a factory and the number of units of
production turned out per day.
Machine Operator : 1 2 3 4 5 6 7 8
Work experience (years): 6 8 9 7 5 2 1 3
Units of Production : 50 60 62 54 47 25 20 41

a) Calculate the regression lines and estimate the probable units of production of a machine
operator with an experience of 12 years.
b) Estimate the probable years of experience of a machine operator whose daily production is 85.
3. Calculate the two regression equations of X on Y and Y on X from the data given
below, taking deviations from the actual means of X and Y.

• Estimate the likely demand when the price is Rs. 20.

4. Obtain the two regression lines from the following data and interpret the values.
Advertising expenditure (Rs. in lakhs) : 10 12 13 23 27 30
Sales turnover (Rs. in crores) : 40 42 40 45 48 50
Problem: The following table shows the sales and advertisement expenditure of a firm

Coefficient of correlation r= 0.9. Estimate the likely sales for a proposed advertisement
expenditure of Rs. 10 crores.
Solution:

When the advertisement expenditure is Rs. 10 crores, i.e. Y = 10,
the estimated sales are X = 6(10) + 4 = 64.
Example: There are two series of index numbers, P for price index and S for stock of the
commodity. The mean and standard deviation of P are 100 and 8, and of S are 103 and 4,
respectively. The correlation coefficient between the two series is 0.4. With these data obtain
the regression lines of P on S and S on P.
Solution:
Let X denote the price P and Y the stock S. Then the mean and SD of P are X̄ = 100
and σx = 8, and the mean and SD of S are Ȳ = 103 and
σy = 4. The correlation coefficient between the series is r(X, Y) = 0.4.
Let the regression line of X on Y be
$$ X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y}) $$
Then X - 100 = 0.4 (8/4)(Y - 103) = 0.8 (Y - 103), that is, X = 0.8Y + 17.6.
Similarly, the regression line of Y on X is
$$ Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X}) $$
so Y - 103 = 0.4 (4/8)(X - 100) = 0.2 (X - 100), that is, Y = 0.2X + 83.
Example: For 5 pairs of observations the following results are obtained: ∑X = 15, ∑Y = 25, ∑X² = 55,
∑Y² = 135, ∑XY = 83. Find the equations of the lines of regression and estimate the value of X on the
first line when Y = 12 and the value of Y on the second line if X = 8.
Solution:
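One way to work this example from the given sums is sketched below in Python. The function name `regression_from_sums` is hypothetical. The sketch uses the standard formulas byx = (NΣXY - ΣXΣY) / (NΣX² - (ΣX)²) and bxy = (NΣXY - ΣXΣY) / (NΣY² - (ΣY)²), and it assumes that the "first line" is the regression of X on Y and the "second line" is the regression of Y on X.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (mean_x, mean_y, b_yx, b_xy) from the raw sums of n pairs."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    cross = n * sum_xy - sum_x * sum_y           # common numerator
    b_yx = cross / (n * sum_x2 - sum_x ** 2)     # slope of Y on X
    b_xy = cross / (n * sum_y2 - sum_y ** 2)     # slope of X on Y
    return mean_x, mean_y, b_yx, b_xy


mean_x, mean_y, b_yx, b_xy = regression_from_sums(5, 15, 25, 55, 135, 83)

# Regression of X on Y:  X - mean_x = b_xy (Y - mean_y)
x_when_y_is_12 = mean_x + b_xy * (12 - mean_y)
# Regression of Y on X:  Y - mean_y = b_yx (X - mean_x)
y_when_x_is_8 = mean_y + b_yx * (8 - mean_x)

print(mean_x, mean_y, b_yx, b_xy)      # means 3 and 5; both slopes 0.8
print(round(x_when_y_is_12, 2))        # 8.6
print(round(y_when_x_is_8, 2))         # 9.0
```

Under these assumptions the lines are X = 0.8Y - 1 and Y = 0.8X + 2.6, giving X = 8.6 when Y = 12 and Y = 9 when X = 8.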
Example
In a laboratory research study on correlation, the equations of the two
regression lines were found to be 2X–Y+1=0 and 3X–2Y+7=0. Find the means of X
and Y. Also work out the values of the regression coefficients and the correlation
coefficient between the two variables X and Y.
Solution:
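Solving the two regression equations gives the mean values of X and Y, because both
lines pass through the point of means. From 2X–Y+1=0 we get Y = 2X + 1; substituting
into 3X–2Y+7=0 gives 3X – 2(2X + 1) + 7 = 0, so the mean of X is 5 and the mean of Y is 11.
• Let 2X–Y+1=0 be the regression line of X on Y. Then X = (1/2)Y – 1/2, so bxy = 1/2.

• Let 3X–2Y+7=0 be the regression line of Y on X. Then Y = (3/2)X + 7/2, so byx = 3/2.
With this choice bxy · byx = 3/4, which does not exceed 1 (the opposite choice would give
4/3, which is impossible). Both coefficients are positive, so r = +√(3/4) ≈ 0.866.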
Where is regression used?
• Finance: capital asset pricing model (CAPM), non-performing
assets, probability of default, chance of
bankruptcy, credit risk.
• Marketing: sales, market share, customer satisfaction,
customer churn, customer retention,
customer lifetime value.
• Operations: inventory, productivity, efficiency.
• HR: job satisfaction, attrition.
• The main purpose of regression is to predict the value of the
dependent variable given the value(s) of the independent
variable(s).
Potential uses of regression analysis in business include the
following:
1. How do wages of employees depend on years of
experience, years of education, and gender?
2. How does the current price of a stock depend on its own
past values, as well as the current and past values of a
market index?
3. How does a company’s current sales level depend on its
current and past advertising levels, the advertising levels
of its competitors, the company’s own past sales levels,
and the general level of the market?
4. How does the total cost of producing a batch of items
depend on the total quantity of items that have been
produced?
5. How does the selling price of a house depend on such
factors as the appraised value of the house, the square
footage of the house, the number of bedrooms in the
house, and perhaps others?
6. How does work exhaustion (WE) among working
women depend on perceived workload, work-family
conflict, turnover intention, and job autonomy in the
organization?
7. What percentage of loans are likely to result in a loss?
8. How can the most profitable customers be identified?


• Regression analysis is classified into two types:
1. Simple regression:
A simple regression analysis includes a single explanatory
variable.
2. Multiple regression:
A multiple regression analysis can include any number of
explanatory variables.

Note:
• A regression model establishes the existence of an
association between two variables, but not causation.
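• In general form the linear regression for population data can be written as
Y = β0 + β1X + ε
where ε is an error or disturbance term.
• β0 and β1 are the regression coefficients. Their estimates are b0 and b1:
$$ b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} $$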
The Linear Regression Model
Yi = β0 + β1Xi + εi
[Figure: regression line of Y against X with intercept β0 and slope β1, showing for a given Xi the observed value of Y, the predicted value of Y, and the random error εi between them.]
Estimated Linear Regression Equation
• The simple linear regression equation provides an estimate
of the population regression line:
Ŷi = b0 + b1Xi
where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the estimate of the
regression intercept, b1 is the estimate of the regression slope, and Xi is the value of X for
observation i.
• b0 and b1 are the estimators of β0 and β1.



The Least Squares Method

• b0 and b1 are obtained by minimizing the sum of the
squared differences (errors) between Y and Ŷ:
$$ \min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2 $$
Interpretation of the Intercept and the Slope
• b0 is the estimated mean value of Y when the
value of X is zero.
• b1 is the slope of the line: the estimated change in the mean value of Y when
X increases by one unit.
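The least squares estimates and their interpretation can be sketched in Python. The function names `fit_least_squares` and `sse` are hypothetical. The data are the aptitude and productivity indices from Exercise 1 above, with X treated as the independent variable, and the estimates use the standard closed form b1 = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)² and b0 = Ȳ - b1·X̄.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx              # slope: change in predicted Y per unit of X
    b0 = mean_y - b1 * mean_x     # intercept: predicted Y when X = 0
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared differences between observed Y and predicted Y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


aptitude = [60, 62, 65, 70, 72, 48, 50]        # X, from Exercise 1
productivity = [68, 60, 62, 80, 85, 40, 54]    # Y, from Exercise 1

b0, b1 = fit_least_squares(aptitude, productivity)
print(f"Y-hat = {b0:.3f} + {b1:.3f} X")
print(f"Predicted productivity at X = 82: {b0 + b1 * 82:.2f}")

# Any other line gives a larger sum of squared errors than the fitted one
best = sse(aptitude, productivity, b0, b1)
print(best < sse(aptitude, productivity, b0 + 1, b1))     # True
print(best < sse(aptitude, productivity, b0, b1 + 0.05))  # True
```

Note that X = 82 lies outside the observed range of aptitude scores (48 to 72), so the prediction is an extrapolation and should be read with caution.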
Simple Linear Regression
• Assumption of the world: variables x and y have a linear
relationship, Y = β0 + β1X + ε
• Fitting a model: minimize SSE
• Validating the model: is x really related to y? Is β1 statistically
significant?
• Using a model: predict y for a given x.
