Lesson_10_Relationship_Between_Variables

Uploaded by

Anupam Swain
Copyright
© All Rights Reserved

Statistics Essentials for Data Science

Relation Between Variables


Learning Objectives

By the end of this lesson, you will be able to:

Discuss the concepts of correlation and causation

Examine the types of correlation coefficients, such as Karl Pearson’s and Spearman’s Rank Correlation

Explain the coefficient of determination


Business Scenario

ABC, an organization that stores a large amount of data, aims to analyze the data
and extract meaningful insights by determining the relationship between
variables.

To accomplish this goal, the organization must learn how to determine correlation and causation and analyze various types of correlation coefficients, such as Karl Pearson's and Spearman's rank correlation.
Discussion: Relationship Between Variables

Duration: 15 minutes
• What does correlation mean?
• How do you determine the relationship between variables?
Correlation

Correlation is a statistical measure that quantifies the extent to which two variables are linearly related.

A scatter diagram helps visually illustrate the relationship between variables, providing a clear
understanding of their interdependence.
Relationship Between Variables with Scatter Diagram

The nature of the scatter plot provides insights into the relationship between variables.

The absence of a band in the scatter plot suggests a lack of relationship between the variables.

The presence of a pattern in the shape of a band within the scatter plot indicates the existence of a relationship between the variables.
Relationship Between Variables with Scatter Diagram

A scatter diagram serves as a powerful visual aid for understanding correlation.


It can provide the following insights:

Upward and downward sloping

Linear and curvilinear relationships

Quantifying relationships
Upward-Sloping

The bands in Fig (a) and (b) are upward-sloping.

[Fig (a) and Fig (b): scatter plots with upward-sloping bands]

This indicates that as one variable increases, the other variable also increases.
Downward-Sloping

The bands in Fig (c) and (d) are downward-sloping.

[Fig (c) and Fig (d): scatter plots with downward-sloping bands]

This indicates that as one variable increases, the other variable decreases.
Width of Bands

The width of the bands in Fig (e) and (f) is narrow.

[Fig (e) and Fig (f): scatter plots with narrow bands]

A narrower band indicates a stronger relationship between the variables.


Width of Bands

The bands in Fig (g) and (h) are relatively broad.

[Fig (g) and Fig (h): scatter plots with broad bands]

A broader band indicates greater scatter around the trend and hence a weaker relationship between the variables.


Linear and Curvilinear Relationships

When the band is almost a line or a rectangle, the relationship is linear.

If it is a curve, the relationship is curvilinear.


Quantifying Relationships

A quantitative measure is used to quantify the degree of the relationship.


Discussion: Relationship Between Variables

Duration: 15 minutes
• What does correlation mean?
Answer: Correlation refers to the statistical measure of the strength and
direction of the association between two variables. It indicates how closely
the variables are related to each other.

• How do you determine the relationship between variables?


Answer: A scatter diagram serves as a powerful visual aid for
understanding correlation. The absence of a band indicates a lack of
relationship between the variables. The presence of a band, on the other
hand, is indicative of a relationship.
Discussion
Discussion: Correlation and Covariance

Duration: 15 minutes
• What are the types of correlation?

• What does covariance mean?


Types of Correlation Coefficients

There are two types of correlation coefficients:

• Karl Pearson’s Coefficient of Correlation
• Spearman’s Rank Correlation Coefficient
Karl Pearson’s Coefficient of Correlation

It is a linear correlation coefficient that falls in the value range of -1 to +1.

In a data set comprising n pairs of observations, the xj values correspond to one characteristic (X), while the yj values correspond to another characteristic (Y).

X: x1, ..., xj, ..., xn, with mean x̄
Y: y1, ..., yj, ..., yn, with mean ȳ
Karl Pearson’s Coefficient of Correlation

Let x̄ and ȳ be the means of X and Y, respectively. Then:

Population covariance of X and Y:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{n}$$

Sample covariance of X and Y:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{n - 1}$$

In both cases the summation extends from 1 to n; only the divisor differs (n for the population, n - 1 for the sample).
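As an illustration, the two covariance formulas can be sketched in Python. The data values and the function name `covariance` are assumptions made up for this example; they are not part of the lesson.

```python
def covariance(xs, ys, sample=False):
    """Covariance of paired observations.

    Divides by n (population) or by n - 1 (sample). In both cases the
    sum runs over all n pairs.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

pop_cov = covariance(xs, ys)                  # divides by n = 5
sample_cov = covariance(xs, ys, sample=True)  # divides by n - 1 = 4
print(pop_cov, sample_cov)
```

For these values the sum of products is 6, so the population covariance is 6/5 and the sample covariance is 6/4.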


Correlation Coefficient

Karl Pearson’s Correlation Coefficient is:

$$r = \frac{\mathrm{Cov}(X,Y)}{s_x \, s_y}$$

where
sx = standard deviation of X
sy = standard deviation of Y
Correlation Coefficient

The correlation coefficient, often represented by the symbol r, is a statistical measure that calculates
the strength and direction of the linear relationship between two variables.

It is important to be cautious and guard against false correlations.


Algebraic Formula of Correlation

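The correlation simplifies as:

$$r = \frac{n\sum x_j y_j - \left(\sum x_j\right)\left(\sum y_j\right)}{\sqrt{\left[n\sum x_j^2 - \left(\sum x_j\right)^2\right]\left[n\sum y_j^2 - \left(\sum y_j\right)^2\right]}}$$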
Where,
n = total number of paired data points of x and y
Σ = sum of the values
Σx = sum of all x-values
Σy = sum of all y-values
Σxy = sum of the product of paired x and y values
Σx² and Σy² = sums of the squares of all x-values and y-values
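A minimal sketch, assuming made-up data, that computes r both from the definition (covariance divided by the product of standard deviations) and from the algebraic formula with raw sums. The function names are hypothetical; the point is only that the two routes agree.

```python
import math


def pearson_definition(xs, ys):
    """r = Cov(X, Y) / (s_x * s_y), using the divisor n throughout."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)


def pearson_algebraic(xs, ys):
    """The same coefficient computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_def = pearson_definition(xs, ys)
r_alg = pearson_algebraic(xs, ys)
print(round(r_def, 4), round(r_alg, 4))
```

The algebraic form avoids computing the means first, which is convenient for hand calculation from a table of sums.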
Discussion: Correlation and Covariance

Duration: 15 minutes
• What are the types of correlation?
Answer: The two types of correlation coefficients are Karl Pearson's
Coefficient of Correlation and Spearman's Rank Correlation Coefficient.

• What does covariance mean?


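Answer: Covariance is a measure of how two variables vary together. It is positive when the variables tend to move in the same direction and negative when they tend to move in opposite directions.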
Karl Pearson’s Correlation Coefficient: Use Cases
Practical Uses of Karl Pearson’s Coefficient of Correlation

Example 1: Selection of job applicants

The test scores are used to identify candidates for specific posts in an organization.
Practical Uses of Karl Pearson’s Coefficient of Correlation

Assume that individuals' test scores are highly correlated with their scores during later performance appraisals.

[Figure: scatter plot of performance appraisal scores against test scores]

Well-designed tests prove to be highly effective in personnel selection.

If there is a lack of correlation or a low correlation, it indicates the need to redesign the test.
Practical Uses of Karl Pearson’s Coefficient of Correlation

Example 2: The sales in units for certain products are correlated with the
demand for spare parts for these products.

[Diagram: sales in units and demand for spares are correlated]


Practical Uses of Karl Pearson’s Coefficient of Correlation

Knowledge of correlation is useful:

• Manufacturing: To make plans for spare parts production in the future
• Banking: To find the correlation between income and credit card delinquency rate
• Research: To find the correlation between cigarette smoking and longevity
Properties of Pearson’s Correlation Coefficient

The measure is dimensionless.

For instance, the correlation between temperature and another variable remains the same irrespective of the unit chosen to measure temperature, such as Celsius or Fahrenheit.
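One way to see the dimensionless property is to change the unit of one variable and recompute r. The sketch below assumes made-up temperature and sales figures (both hypothetical) and a helper named `pearson_r` written for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of squared and cross deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: daily temperature and units sold.
temps_celsius = [18.0, 21.0, 24.0, 27.0, 30.0, 33.0]
units_sold = [120, 135, 160, 158, 190, 210]

# The same temperatures expressed in Fahrenheit.
temps_fahrenheit = [c * 9 / 5 + 32 for c in temps_celsius]

r_celsius = pearson_r(temps_celsius, units_sold)
r_fahrenheit = pearson_r(temps_fahrenheit, units_sold)
print(r_celsius, r_fahrenheit)
```

The two printed values agree up to floating-point rounding, because a change of unit rescales the covariance and the standard deviation by the same factor.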
Properties of Pearson’s Correlation Coefficient

The correlation coefficient always lies in the range -1 ≤ r ≤ 1.

When r assumes either of the extreme values, -1 or +1, it indicates a perfect linear relationship between the two variables.


Spearman’s Rank Correlation Coefficient

It is a nonparametric measure of rank correlation that evaluates the correlation between the rankings of two variables.

It helps determine how well the relationship between the variables can be described using a monotonic function.
Spearman’s Rank Correlation Coefficient

To determine the correlation coefficient, one must examine a data set of two
variables representing student scores:

             Examiner X   Examiner Y
Student 1         0            0
Student 2         4            2
Student 3         7            3
Student 4        10           10

The scores of four students in a test are based on the independent assessments
conducted by examiners X and Y.
Spearman’s Rank Correlation Coefficient

Such scores could arise from variations in grading strictness.

Examiner X Examiner Y
Spearman’s Rank Correlation Coefficient

For instance, examiner Y applies stricter penalties for incorrect answers. As a result, the correlation between the two examiners' scores is less than 1, with a calculated value of 0.9.

This indicates a high level of agreement between the examiners regarding the relative
performance of the candidates.
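The value of 0.9 can be checked with a short sketch that applies Pearson's formula to the four pairs of scores. The helper name `pearson_r` is an assumption made for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of squared and cross deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Scores given to Students 1 to 4 by each examiner.
examiner_x = [0, 4, 7, 10]
examiner_y = [0, 2, 3, 10]

r_scores = pearson_r(examiner_x, examiner_y)
print(round(r_scores, 2))
```

The result is below 1 even though both examiners place the four students in the same order, which is the gap that a rank-based coefficient is meant to address.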
Spearman’s Rank Correlation Coefficient

To address scenarios like the previous example, Spearman introduced a measure known as the rank
correlation coefficient.
Spearman’s Rank Correlation Coefficient

Example 1: Assign ranks to students incorporating the hierarchy in the scores independently
for both examiners.

             Graded by X   Rank   Graded by Y   Rank
Student 1         0          1         0          1
Student 2         4          2         2          2
Student 3         7          3         3          3
Student 4        10          4        10          4
Spearman’s Rank Correlation Coefficient

Use either ascending or descending order, applied consistently in both cases.

             X    Rank    Y    Rank
Student 1    0     4      0     4
Student 2    4     3      2     3
Student 3    7     2      3     2
Student 4   10     1     10     1

The correlation coefficient of the two sets of ranks is referred to as the rank correlation coefficient.
This value can be calculated using the same formula as Karl Pearson's coefficient, applied to the ranks instead of the scores.
Formula for Rank Correlation Coefficient

The rank correlation coefficient (r) can be calculated using the following formula:

$$r = 1 - \frac{6\sum d_j^2}{n\,(n^2 - 1)}$$

n = number of candidates or pairs of observations

dj = difference in ranks of the jth candidate
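The rank formula can be sketched as follows, using the two examiners' scores from the earlier example. The helper names `ranks` and `spearman_r` are assumptions made for this illustration, and the simple ranking used here assumes there are no tied scores.

```python
def ranks(values):
    """Rank values in ascending order (1 = smallest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(xs, ys):
    """r = 1 - 6 * sum(d_j^2) / (n * (n^2 - 1)), valid when there are no ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Scores given to Students 1 to 4 by each examiner.
examiner_x = [0, 4, 7, 10]
examiner_y = [0, 2, 3, 10]

r_rank = spearman_r(examiner_x, examiner_y)
print(r_rank)
```

Both examiners rank the students identically, so every rank difference is zero and the coefficient comes out as 1, even though the raw scores differ.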


Calculating Rank Correlation Coefficient

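Example 2: Calculating the difference in ranks when the coefficient is 1

$$1 = 1 - \frac{6\sum d_j^2}{n\,(n^2 - 1)} \;\Rightarrow\; \frac{6\sum d_j^2}{n\,(n^2 - 1)} = 0 \;\Rightarrow\; \sum d_j^2 = 0$$

Every rank difference is zero, so the ranks are identical.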


Calculating Rank Correlation Coefficient

When the rankings are diametrically opposite, the rank correlation is -1.

Rank 1   Rank 2
  1        4
  2        3
  3        2
  4        1

Rank correlation = -1
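As a check, substituting the four pairs of opposite ranks into the rank correlation formula, with n = 4, gives:

```latex
\[
\sum d_j^2 = (1-4)^2 + (2-3)^2 + (3-2)^2 + (4-1)^2 = 9 + 1 + 1 + 9 = 20
\]
\[
r = 1 - \frac{6 \times 20}{4\,(4^2 - 1)} = 1 - \frac{120}{60} = -1
\]
```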
Discussion
Discussion: Spurious Correlation

Duration: 15 minutes
• What does spurious correlation mean?

• How is the coefficient of determination utilized?


Causation
Cause and Effect

In a relationship between two variables, one variable serves as the cause, and the other variable as the effect.

• The cause or independent variable is denoted by X.
• The effect or dependent variable is denoted by Y.
Cause and Effect Variables

Example 1: Cutting speed is a cause, and its impact on tool life is the effect.

Correlation Coefficient

Cause Effect
Correlation and Causation

Correlation may not always indicate a cause-effect relationship between variables.


Correlation and Causation

Example 2: The total number of students enrolled in schools across different cities may display a
correlation when observed over multiple years.

City 1 City 2

However, this does not necessarily imply a cause-effect relationship.


Spurious Correlation

An increase in the number of students in City 1 and City 2 typically stems from a rise in population in both the cities.

[Diagram: increasing population drives increasing student numbers in both cities]

Such correlations, which may seem related but are not directly causal, are known as spurious correlations.
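A small sketch of how such a correlation can arise. All figures below are hypothetical: each city's student count is assumed to be a fixed share of its own growing population, and neither city influences the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of squared and cross deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


years = range(10)

# Hypothetical populations, both growing over time for unrelated reasons.
population_city1 = [100_000 + 2_000 * t for t in years]
population_city2 = [60_000 + 1_500 * t + 300 * (t % 3) for t in years]

# Assumed share of each population that is enrolled in school.
students_city1 = [0.20 * p for p in population_city1]
students_city2 = [0.18 * p for p in population_city2]

r_students = pearson_r(students_city1, students_city2)
print(round(r_students, 3))
```

The coefficient is close to 1 although the code contains no link from one city's enrolment to the other's; the shared upward trend in population accounts for it.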
Interpretation of Correlation

Interpreting correlations requires careful consideration.

The action wherein one variable directly impacts another, creating an effect, is known as causation.
Regression

If one needs to study a cause-effect relationship, a predictive tool known as a regression equation is often utilized.

In this context, only linear relationships are considered.


Regression

A data set is composed of n pairs of observations on two variables.

Xi: independent variable
Yi: dependent variable

To predict Y for any given value of X, use a linear equation known as the regression of Y on X.
Example of Regression

Consider the given data set and the scatter diagram, as shown:

X    2    3    4    5    6    7    8    9    10
Y   31   53   53   75   83   94   92  100   124

[Figure: scatter diagram of Y against X. The red points correspond to the data in the table.]
Example of Regression

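Then, the computed values are:

Mean: $\bar{x} = \frac{\sum X}{n}$, giving x̄ = 6 and ȳ = 78.3

Standard deviation: $s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}}$, with reported values Sx = 2.58 and Sy = 26.97

Correlation: $r = \frac{\sum (X - \bar{x})(Y - \bar{y})}{n\,\sigma_x \sigma_y}$, with σx and σy computed using the divisor n, giving r = 0.97

The reported Sx and Sy match the divisor n rather than n - 1, which would give about 2.74 and 28.60. The choice affects neither r nor the ratio Sy/Sx, provided the same divisor is used throughout.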
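The summary values can be recomputed from the nine data pairs with a short sketch. It assumes the standard deviations use the divisor n, since that convention comes closest to the reported 2.58 and 26.97; the variable names are choices made for this example.

```python
import math

xs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [31, 53, 53, 75, 83, 94, 92, 100, 124]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard deviations with divisor n (assumed convention, see above).
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)

# Covariance with the same divisor, then r = Cov(X, Y) / (s_x * s_y).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
r = cov_xy / (s_x * s_y)

print(x_bar, y_bar, s_x, s_y, r)
```

Rounded, these agree with the values listed above to within the last reported digit.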
Prediction Value

To determine the prediction value, consider the following:

Est (Yj): the predicted value of Y for a given value Xj.

Est (Yj) = a + (b * Xj): the regression of Y on X.

Yj - Est (Yj) = ej: the error in estimation, that is, the difference between the actual value and the predicted value.

Hence, Yj = a + (b * Xj) + ej.
Regression Line

A regression line, also known as a line of best fit, is a straight line that is used to visualize and quantify
the correlation between two variables in statistical analysis.

[Figure: scatter diagram of the data with the regression line shown in blue]
Prediction Value

The prediction value is calculated as:

Est (Y) - ȳ = r * (sy/sx) * (X - x̄)

where
r = correlation
sy/sx = ratio of the standard deviation of Y to that of X
X - x̄ = difference between X and x̄

Thus, the predicted value of Y can be obtained for any value of X.


Calculating Prediction Value

Consider the same data set:

X 2 3 4 5 6 7 8 9 10

Y 31 53 53 75 83 94 92 100 124

Recalling the calculations previously performed:

x̄ = 6, ȳ = 78.3, Sx = 2.58, Sy = 26.97, r = 0.97


Calculating Prediction Value

The required regression equation is as shown:

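Est (Y) - 78.3 = (0.97 * 26.97/2.58) * (X - 6) ≈ 10.14 * (X - 6)

For X = 7: Est (Y) - 78.3 = 10.14

Predicted value of Y = 88.44

With unrounded values of r, Sx, and Sy, the slope is 10.17 and the predicted value is 88.5.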
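The regression of Y on X for this data set can be sketched as below. The intercept a and slope b follow from b = r * (sy/sx) and a = ȳ - b * x̄; the function name `estimate` is an assumption made for this example. Because the sketch keeps full precision, its prediction at X = 7 differs slightly from a hand calculation with rounded inputs.

```python
import math

xs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [31, 53, 53, 75, 83, 94, 92, 100, 124]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)

# Slope b = r * (s_y / s_x); the divisor (n or n - 1) cancels in the ratio.
b = r * math.sqrt(s_yy / s_xx)
# Intercept chosen so that the line passes through (x_bar, y_bar).
a = y_bar - b * x_bar


def estimate(x):
    """Predicted value Est(Y) = a + b * x."""
    return a + b * x


print(round(b, 4), round(a, 4), round(estimate(7), 2))
```

At full precision the slope is 610/60, so the estimate at X = 7 is 88.5.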


Coefficient of Determination

It is a statistical measure that indicates how well changes in one variable can account for variations in another.

After obtaining the regression equation, an index for estimating its quality becomes necessary. The coefficient of determination fulfills this role as a quality estimation index.

Source: Investopedia
Coefficient of Determination: Example

Example: Effect of cutting speed (X) on tool life (Y).

Cutting speed X Tool life Y

Values of tool life exhibit variation.


Coefficient of Determination: Example

Y - ȳ is the deviation, or variation, of an observed value from the average. It splits into two parts:

Y - ȳ = [Y - Est (Y)] + [Est (Y) - ȳ]

• Y - Est (Y): the error term, which is the difference between the observed value and the predicted value.
• Est (Y) - ȳ: the deviation of the estimated value from the average.

As cutting speeds change, predicted values will differ from the average tool life.
Quantifying Quality

When the quality of prediction is good, the error terms are small.

Y - ȳ ≈ Est (Y) - ȳ

Hence, the above two values will tend to be close to each other.
Quantifying Quality

The following measure, R², is utilized to quantify the quality of the regression equation.

$$R^2 = \frac{\sum \left(\mathrm{Est}(y_j) - \bar{y}\right)^2}{\sum \left(y_j - \bar{y}\right)^2}$$

R² is the coefficient of determination.

• It quantifies the degree to which variations in the dependent variable are explained by the regression equation.
• If R² is low, the explanatory power of the regression equation might be low.
Variations in Prediction Value

Y may likely be influenced by other factors, in addition to X.

For instance, tool life could be impacted by factors other than cutting speed, such as improper handling.

Total variation is calculated as follows:


Total variation = Explained variation + Unexplained variation
Coefficient of Determination and Variations

When variables are highly correlated and the regression equation delivers a high-quality prediction, the
unexplained variation tends to be low.

Quality of prediction is measured by:

Coefficient of determination = Explained variation / Total variation

[Figure: scatter diagram of Y against X]
Coefficient of Determination and Variations

The value of the coefficient of determination is non-negative and does not exceed unity.

When r = ±1:

100% of the variation is explained, that is, the regression equation accounts for all changes in the dependent variable.

When r = ±0.9:

81% of the variation is explained by the regression equation; the remaining 19% is unexplained variation.
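These percentages follow from the fact that, for a linear regression of Y on a single variable X, the coefficient of determination equals the square of the correlation coefficient:

```latex
\[
R^2 = r^2, \qquad
r = \pm 1 \;\Rightarrow\; R^2 = 1 \;(100\%), \qquad
r = \pm 0.9 \;\Rightarrow\; R^2 = 0.81 \;(81\%)
\]
```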
TSS and RSS

Difference between RSS and TSS:

Total sum of squares (TSS)
• Measures variation in the observed data
• $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Residual sum of squares (RSS)
• Measures variation in the error between the observed data and modeled values
• $\mathrm{RSS} = (y_1 - y_{\mathrm{pred},1})^2 + (y_2 - y_{\mathrm{pred},2})^2 + \dots + (y_n - y_{\mathrm{pred},n})^2$
Coefficient of Determination: Example

The table below presents the calculated coefficient of determination for the given data:

X    Y     Est (Y)    Y - ȳ      Est (Y) - ȳ   TSS: (Y - ȳ)²   RSS: (Est (Y) - ȳ)²
2    31    37.6665    -47.3333   -40.6668      2240.444        1653.791
3    53    47.8332    -25.3333   -30.5001      641.7776        930.2579
4    53    57.9999    -25.3333   -20.3334      641.7776        413.4484
5    75    68.1666    -3.33333   -10.1667      11.11109        103.3624
6    83    78.3333    4.66667    -3.00e-05     21.77781        9.00E-10
7    94    88.5       15.66667   10.16667      245.4445        103.3612
8    92    98.6667    13.66667   20.33337      186.7779        413.4459
9    100   108.8334   21.66667   30.50007      469.4446        930.2543
10   124   119.0001   45.66667   40.66677      2085.445        1653.786
Coefficient of Determination: Example

From the dataset in the previous slide:

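Sum of TSS = 6544

Sum of RSS = 6201.707

The RSS column in the table holds the squared deviations of the predicted values from the mean, (Est (Y) - ȳ)², so its sum is the explained variation rather than the residual (unexplained) variation. The coefficient of determination is therefore the ratio of the two sums or, equivalently, 1 minus the unexplained share.
Coefficient of Determination: Example

Coefficient of determination = 6201.707/6544 = 0.9477

Unexplained variation = 6544 - 6201.707 = 342.293, so 1 - (342.293/6544) = 1 - 0.0523 = 0.9477

This indicates that about 95% of the variations in Y are explained by the regression equation.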
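A sketch that recomputes the total, explained, and residual sums of squares for this data set and derives the coefficient of determination from them. The variable names are choices made for this example, and the fitted line is computed at full precision, so the explained sum differs slightly from a table built with a rounded slope.

```python
xs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [31, 53, 53, 75, 83, 94, 92, 100, 124]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares line of Y on X.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = s_xy / s_xx
a = y_bar - b * x_bar
predicted = [a + b * x for x in xs]

# Total variation: observed values around the mean.
tss = sum((y - y_bar) ** 2 for y in ys)
# Explained variation: predicted values around the mean.
explained = sum((p - y_bar) ** 2 for p in predicted)
# Unexplained (residual) variation: observed minus predicted.
residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))

r_squared = explained / tss
print(round(tss, 1), round(explained, 1), round(residual, 1), round(r_squared, 4))
```

The explained and residual parts add up to the total, and the ratio of explained to total variation is about 0.948, consistent with roughly 95% of the variation in Y being explained.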
Discussion: Spurious Correlation

Duration: 15 minutes
• What does spurious correlation mean?
Answer: A spurious correlation is a term used in statistics to describe a
situation where two variables seem to be related to each other, but in
reality, there is no causal relationship between them.

• How is the coefficient of determination utilized?


Answer: The coefficient of determination is used as an index to estimate
the quality of the regression equation.
Key Takeaways

When two variables are related, the relationship can be studied under a concept called correlation.

The correlation coefficient quantifies the extent to which variations in one variable are linearly associated with variations in the other.

Causation describes a relationship between two variables in which one is the cause and the other is the effect.
Knowledge Check
Knowledge Check 1
____________ is used to analyze the correlation between two variables.

A. Scatter plot

B. Bar graph

C. Pie chart

D. Bubble chart

The correct answer is A

A scatter plot is used to analyze the correlation between two variables.


Knowledge Check 2
The formula to find Karl Pearson’s Correlation Coefficient is ____________.

A. r = Cov(X, Y)/ (sx* sy)2

B. r = Cov(X, Y)/ sx∗ sy

C. r = Cov(X, Y)/ sx2* sy2

D. r = Cov(X, Y)/ sx* sy



The correct answer is D

The formula to find Karl Pearson’s Correlation Coefficient is r = Cov(X, Y)/(sx * sy).
Knowledge Check 3
In Spearman’s Rank Correlation Coefficient, if rankings are diametrically opposite, the correlation between the rankings is ____________.

A. 0

B. -1

C. 0.5

D. +1

The correct answer is B

In Spearman’s Rank Correlation Coefficient, if rankings are diametrically opposite, the correlation
between the rankings is -1.
Thank You
