F_Regression

The document discusses Simple Linear Regression, including its model, methods, and significance testing. It highlights the relationship between dependent and independent variables, the use of regression in various fields, and the importance of the least squares method. Additionally, it covers testing for significance using t and F tests, along with the assumptions about the error term.


Simple Linear Regression

• Simple Linear Regression Model


• Least Squares Method
• Coefficient of Determination
• Model Assumptions
• Testing for Significance
A regression model establishes the existence of an association between two variables, but not causation.
BCCI Bans Girlfriends and Wives

Girlfriends and wives create such a "distraction" that Indian batsmen can't make runs, bowlers fail to take wickets and fielders drop simple catches.

"Dependent" and "independent" are just terms used; they do not imply causation!

Married Men Earn More Money!

Is marriage leading to more money, or more money leading to marriage?

Regression helps to validate hypotheses.
Interesting Hypotheses
• Married people are happier than singles!!!
• Vegetarians miss fewer flights.
• Black cars are more likely than white cars to be involved in an accident in moonlight.
• Women use camera phones more than men.
• Left handed men earn more money!
• Smokers are better sales people.
• Those who whistle at workplace are more efficient.
Regression Nomenclature

Dependent Variable      Independent Variable
Explained Variable      Explanatory Variable
Regressand              Regressor
Predictand              Predictor
Endogenous Variable     Exogenous Variable
Controlled Variable     Control Variable
Target Variable         Stimulus Variable
Response Variable       Feature
Outcome Variable
Regression Vs Correlation

• Regression is the study of the existence of a relationship between two variables. The main objective is to estimate the change in the mean value of the dependent variable for a unit change in the independent variable.

• Correlation is the study of the strength of the relationship between two variables.
Regression History

• Francis Galton was the first to apply regression.

• He claimed that the heights of children of tall parents "regress towards the mean of that generation".

• Modern regression analysis was developed by R. A. Fisher.

[Photo: Francis Galton]

Ref: F. Galton, "Regression towards mediocrity in hereditary stature", Journal of the Anthropological Institute, Vol. 15, 246-263, 1886
Where is it used?

✓ Every functional area of management uses regression.
✓ Finance: CAPM, non-performing assets, probability of default, chance of bankruptcy, credit risk.
✓ Marketing: sales, market share, customer satisfaction, customer churn, customer retention, customer lifetime value.
✓ Operations: inventory, productivity, efficiency.
✓ HR: job satisfaction, attrition.
Importance of Regression

• In 1980, the Supreme Court of the USA recognized regression as a valid method of identifying discrimination.

• The U.S. Food and Drug Administration (FDA) uses regression as an approved tool for validating food and drug products.
Types of Regression

Regression models are classified by the number of independent variables:
• Simple Regression (one independent variable): linear or non-linear
• Multiple Regression (more than one independent variable): linear or non-linear
Types of Regression

• Simple linear regression – refers to a regression model between two variables.

Y = β₀ + β₁X₁ + ε

• Multiple linear regression – refers to a regression model with more than one independent variable.

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

• Nonlinear regression – for example:

Y = β₀ + 1/(β₁ + β₂X₁) + β₃X₂ + ε
Linear Regression

• Linear regression stands for a function that is linear in the regression coefficients.

• The following equation is treated as linear as far as regression is concerned:

Y = β₀ + β₁X₁ + β₂X₁X₂ + β₃X₂²

The null and alternative hypotheses for the SLR model can be
stated as follows:
H0: There is no relationship between X and Y
HA: There is a relationship between X and Y
Simple Linear Regression
• Managerial decisions often are based on the relationship between two or
more variables.
• Regression analysis can be used to develop an equation showing how the
variables are related.
• The variable being predicted is called the dependent variable and is denoted
by y.
• The variables being used to predict the value of the dependent variable are
called the independent variables and are denoted by x.
Simple Linear Regression
• Simple linear regression involves one independent variable and one
dependent variable.
• The relationship between the two variables is approximated by a
straight line.
• Regression analysis involving two or more independent variables is
called multiple regression.
Simple Linear Regression Model
• The equation that describes how y is related to x and an error term is called
the regression model.
• The simple linear regression model is:

y = 0 + 1x + 

where:
0 and 1 are called parameters of the model,
 is a random variable called the error term.
Simple Linear Regression Equation
• The simple linear regression equation is:
E(y) = 0 + 1x
• Graph of the regression equation is a straight line.
• 0 is the y intercept of the regression line.
• 1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value.
Simple Linear Regression Equation
• Positive Linear Relationship

[Graph: E(y) vs x; regression line with intercept β₀ and positive slope β₁]

Simple Linear Regression Equation
• Negative Linear Relationship

[Graph: E(y) vs x; regression line with intercept β₀ and negative slope β₁]

Simple Linear Regression Equation
• No Relationship

[Graph: E(y) vs x; horizontal regression line at height β₀; slope β₁ is 0]
Estimated Simple Linear Regression Equation
• The estimated simple linear regression equation:
ŷ = b₀ + b₁x
• The graph is called the estimated regression line.
• b₀ is the y-intercept of the line.
• b₁ is the slope of the line.
• ŷ is the estimated value of y for a given x value.
Estimation Process
Regression Model: y = β₀ + β₁x + ε
Regression Equation: E(y) = β₀ + β₁x
Unknown Parameters: β₀, β₁

Sample Data: (x₁, y₁), ..., (xₙ, yₙ)

Sample Statistics: b₀, b₁
b₀ and b₁ provide estimates of β₀ and β₁, giving the Estimated Regression Equation ŷ = b₀ + b₁x
Least Squares Method
• Least Squares Criterion
min Σ(yᵢ − ŷᵢ)²

where:
yᵢ = observed value of the dependent variable for the i-th observation
ŷᵢ = estimated value of the dependent variable for the i-th observation
Least Squares Method
• Slope for the Estimated Regression Equation

b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

where:
xᵢ = value of the independent variable for the i-th observation
yᵢ = value of the dependent variable for the i-th observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
Least Squares Method
• y-Intercept for the Estimated Regression Equation

b₀ = ȳ − b₁x̄
Simple Linear Regression
• Example: Reed Auto Sales
Reed Auto periodically has a special week-long sale. As part of
the advertising campaign Reed runs one or more television
commercials during the weekend preceding the sale. Data from a
sample of 5 previous sales are shown on the next slide.
Simple Linear Regression
• Example: Reed Auto Sales

Number of Number of
TV Ads (x) Cars Sold (y)
1 14
3 24
2 18
1 17
3 27
Σx = 10        Σy = 100
x̄ = 2          ȳ = 20
Estimated Regression Equation
• Slope for the Estimated Regression Equation

b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 20/4 = 5

• y-Intercept for the Estimated Regression Equation

b₀ = ȳ − b₁x̄ = 20 − 5(2) = 10

• Estimated Regression Equation

ŷ = 10 + 5x
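As a sketch (Python, not part of the original slides), the Reed Auto estimates can be computed directly from the least squares formulas:

```python
# Least squares fit for the Reed Auto example:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
x = [1, 3, 2, 1, 3]       # number of TV ads
y = [14, 24, 18, 17, 27]  # number of cars sold

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

print(b0, b1)  # 10.0 5.0, i.e. yhat = 10 + 5x
```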
Using Excel's Chart Tools for Scatter Diagram & Estimated Regression Equation

[Chart: Reed Auto Sales scatter diagram with estimated regression line]
Coefficient of Determination
• Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
• The coefficient of determination is:

r2 = SSR/SST

where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination
r2 = SSR/SST = 100/114 = .8772

The regression relationship is very strong; 87.72% of the variability


in the number of cars sold can be explained by the linear relationship
between the number of TV ads and the number of cars sold.
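The decomposition SST = SSR + SSE and the resulting r² can be verified numerically; a minimal Python sketch (variable names are mine):

```python
# Decompose total variation for the Reed Auto fit yhat = 10 + 5x.
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0                                     # least squares estimates

ybar = sum(y) / len(y)
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                # total
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # unexplained
r2 = ssr / sst

print(sst, ssr, sse, round(r2, 4))  # 114.0 100.0 14.0 0.8772
```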
Using Excel to Compute the Coefficient of Determination
• Adding the r² Value to the Scatter Diagram

[Chart: Reed Auto Sales estimated regression line; y = 5x + 10, R² = 0.8772]
Sample Correlation Coefficient

r_xy = (sign of b₁)·√(coefficient of determination)

r_xy = (sign of b₁)·√r²

where:
b₁ = the slope of the estimated regression equation ŷ = b₀ + b₁x

Sample Correlation Coefficient

r_xy = (sign of b₁)·√r²

The sign of b₁ in the equation ŷ = 10 + 5x is "+".

r_xy = +√.8772 = +.9366
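A short Python sketch of the rule above, assuming the Reed Auto values b₁ = 5 and r² = 100/114:

```python
import math

# Sample correlation from the coefficient of determination:
# r_xy = (sign of b1) * sqrt(r^2)
b1 = 5.0
r2 = 100 / 114
r_xy = math.copysign(math.sqrt(r2), b1)  # carries the sign of the slope
print(round(r_xy, 4))  # 0.9366
```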
Assumptions About the Error Term ε
1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for all values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.
Testing for Significance
• To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β₁ is zero.
• Two tests are commonly used:

t Test and F Test

• Both the t test and F test require an estimate of σ², the variance of ε in the regression model.
Testing for Significance
• An Estimate of σ²
The mean square error (MSE) provides the estimate of σ², and the notation s² is also used.

s² = MSE = SSE/(n − 2)

where:
SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − b₀ − b₁xᵢ)²


Testing for Significance
• An Estimate of σ
• To estimate σ, we take the square root of s².
• The resulting s is called the standard error of the estimate.

s = √MSE = √(SSE/(n − 2))
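A quick numerical check of s for the Reed Auto data (Python sketch; SSE = 14 and n = 5 come from the example above):

```python
import math

# Standard error of the estimate: s = sqrt(SSE / (n - 2)).
sse, n = 14.0, 5
mse = sse / (n - 2)   # s^2 = MSE
s = math.sqrt(mse)
print(round(mse, 3), round(s, 4))  # 4.667 2.1602
```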
Testing for Significance: t Test
• Hypotheses

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

• Test Statistic

t = b₁ / s_b₁   where   s_b₁ = s / √Σ(xᵢ − x̄)²
Testing for Significance: t Test
• Rejection Rule

Reject H₀ if p-value < α
or t < −t_{α/2} or t > t_{α/2}

where:
t_{α/2} is based on a t distribution with n − 2 degrees of freedom
Testing for Significance: t Test
1. Determine the hypotheses.  H₀: β₁ = 0;  Hₐ: β₁ ≠ 0

2. Specify the level of significance.  α = .05

3. Select the test statistic.  t = b₁ / s_b₁

4. State the rejection rule.  Reject H₀ if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)
Testing for Significance: t Test
5. Compute the value of the test statistic.

t = b₁ / s_b₁ = 5/1.08 = 4.63

6. Determine whether to reject H₀.  t = 4.541 provides an area of .01 in the upper tail. Hence, the p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H₀.
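The test statistic above can be reproduced in a few lines (Python sketch using the Reed Auto data):

```python
import math

# t statistic for H0: beta_1 = 0 on the Reed Auto data:
# s_b1 = s / sqrt(sum((x - xbar)^2)),  t = b1 / s_b1.
x = [1, 3, 2, 1, 3]
b1 = 5.0
s = math.sqrt(14.0 / 3)   # standard error of the estimate, sqrt(SSE/(n-2))

xbar = sum(x) / len(x)
s_b1 = s / math.sqrt(sum((xi - xbar) ** 2 for xi in x))
t = b1 / s_b1
print(round(s_b1, 2), round(t, 2))  # 1.08 4.63
```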
Confidence Interval for β₁
• We can use a 95% confidence interval for β₁ to test the hypotheses just used in the t test.
• H₀ is rejected if the hypothesized value of β₁ is not included in the confidence interval for β₁.
Confidence Interval for β₁
• The form of a confidence interval for β₁ is:

b₁ ± t_{α/2}·s_b₁

where
b₁ is the point estimator,
t_{α/2}·s_b₁ is the margin of error, and
t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom
Confidence Interval for β₁
• Rejection Rule

Reject H₀ if 0 is not included in the confidence interval for β₁.

• 95% Confidence Interval for β₁

b₁ ± t_{α/2}·s_b₁ = 5 ± 3.182(1.08) = 5 ± 3.44, or 1.56 to 8.44

• Conclusion
0 is not included in the confidence interval. Reject H₀.
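A Python sketch of the interval computation (the critical value 3.182 is taken from a t table, as in the slide):

```python
import math

# 95% confidence interval for beta_1: b1 +/- t_{alpha/2} * s_b1.
# t_{.025} with 3 degrees of freedom is 3.182.
b1 = 5.0
s_b1 = math.sqrt(14.0 / 3) / 2   # s / sqrt(sum((x - xbar)^2)) for Reed Auto
t_crit = 3.182

margin = t_crit * s_b1
lo, hi = b1 - margin, b1 + margin
print(round(lo, 2), round(hi, 2))  # 1.56 8.44
```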
Testing for Significance: F Test
• Hypotheses

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

• Test Statistic

F = MSR/MSE
Testing for Significance: F Test
• Rejection Rule

Reject H₀ if p-value < α or F > F_α

where:
F_α is based on an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses.  H₀: β₁ = 0;  Hₐ: β₁ ≠ 0

2. Specify the level of significance.  α = .05

3. Select the test statistic.  F = MSR/MSE

4. State the rejection rule.  Reject H₀ if p-value < .05 or F > 10.13 (with 1 d.f. in numerator and 3 d.f. in denominator)
Testing for Significance: F Test
5. Compute the value of the test statistic.

F = MSR/MSE = 100/4.667 = 21.43

6. Determine whether to reject H0.

F = 17.44 provides an area of .025 in the upper tail. Thus, the p-value
corresponding to F = 21.43 is less than .025. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant
relationship between the number of TV ads aired and the number of cars
sold.
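The F computation can be checked directly (Python sketch with the Reed Auto sums of squares):

```python
# F statistic for the Reed Auto regression: F = MSR / MSE,
# where MSR = SSR / 1 and MSE = SSE / (n - 2).
ssr, sse, n = 100.0, 14.0, 5
msr = ssr / 1
mse = sse / (n - 2)
F = msr / mse
print(round(F, 2))  # 21.43
```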
Some Cautions about the
Interpretation of Significance Tests
• Rejecting H₀: β₁ = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.

• Just because we are able to reject H₀: β₁ = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.
Interpretation of Simple Linear Regression Coefficients

• Interpretation of regression coefficients is important for understanding the relationship between the response variable and the explanatory variable, and the impact of a change in the values of explanatory variables on the response variable.

• The interpretation will depend on the functional form of the relationship between the response and the explanatory variables.

• Interpretation of β₀ and β₁ in Y = β₀ + β₁X

When the functional form is Y = β₀ + β₁X, the value of β₀ = E(Y|X = 0).

β₁ = ∂Y/∂X, that is, β₁ is the change in the value of Y for a unit change in the value of X, where ∂Y/∂X is the partial derivative of Y with respect to X.
Validation of the Simple Linear Regression Model

It is important to validate the regression model to ensure its validity and goodness of fit before it can be used for practical applications. The following measures are used to validate simple linear regression models:

• Coefficient of determination (R-square).
• Hypothesis test for the regression coefficient.
• Analysis of Variance for overall model validity (more relevant for multiple linear regression).
• Residual analysis to validate the regression model assumptions.
• Outlier analysis.

The above measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R2)

• The coefficient of determination (R-square or R²) measures the percentage of variation in Y explained by the model (β₀ + β₁X).
• The simple linear regression model Yᵢ = β₀ + β₁Xᵢ + εᵢ can be broken into variation explained by the model and variation not explained by the model.

In the absence of a predictive model for Yᵢ, users will use the mean value of Y. Thus, the total variation is measured as the difference between Yᵢ and the mean value of Y (i.e., Yᵢ − Ȳ).
Description of total variation, explained variation and unexplained variation

Variation Type                      Measure      Description
Total variation (SST)               (Yᵢ − Ȳ)     Difference between the actual value and the mean value.
Variation explained by the model    (Ŷᵢ − Ȳ)     Difference between the estimated value of Yᵢ and the mean value of Y.
Variation not explained by model    (Yᵢ − Ŷᵢ)    Difference between the actual value and the predicted value of Yᵢ (error in prediction).
The relationship between the total variation, explained variation and the unexplained variation is given as follows:

Yᵢ − Ȳ  =  (Ŷᵢ − Ȳ)  +  (Yᵢ − Ŷᵢ)
(total variation in Y = variation explained by the model + variation not explained by the model)

It can be proved mathematically that the sum of squares of total variation is equal to the sum of squares of explained variation plus the sum of squares of unexplained variation:

Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²  =  Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)²  +  Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²
      SST                SSR                 SSE

where SST is the sum of squares of total variation, SSR is the sum of squares of variation explained by the regression model and SSE is the sum of squares of errors or unexplained variation.
Coefficient of Determination or R-Square
The coefficient of determination (R²) is given by

R² = Explained variation / Total variation = SSR/SST = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²

Since SSR = SST − SSE, the above equation can be written as

R² = 1 − SSE/SST = 1 − Σ(Yᵢ − Ŷᵢ)² / Σ(Yᵢ − Ȳ)²
Coefficient of Determination or R-Square
Thus, R2 is the proportion of variation in response variable Y explained
by the regression model. Coefficient of determination (R2) has the
following properties:

• The value of R² lies between 0 and 1.
• A higher value of R² implies a better fit, but one should be aware of spurious regression.
• Mathematically, the square of the correlation coefficient is equal to the coefficient of determination (i.e., r² = R²).
• We do not put any minimum threshold on R²; a higher value of R² implies a better fit. However, a minimum value of R² for a given significance level α can be derived using the relationship between the F-statistic and R².
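The last bullet's relationship can be sketched numerically. For simple linear regression, rearranging F = MSR/MSE gives F = R²(n − 2)/(1 − R²), and inverting gives the minimum R² that clears a critical value. Using the Reed Auto numbers (R² = 100/114, n = 5, F.05 = 10.13):

```python
# F statistic written in terms of R^2 for simple linear regression:
# F = (R^2 / 1) / ((1 - R^2) / (n - 2))
r2, n = 100 / 114, 5
F = r2 / ((1 - r2) / (n - 2))
print(round(F, 2))  # 21.43, the same value as MSR/MSE

# Inverting: the minimum R^2 that exceeds a critical value F_alpha is
# R^2_min = F_alpha / (F_alpha + n - 2).
r2_min = 10.13 / (10.13 + n - 2)
print(round(r2_min, 3))  # 0.772
```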
Spurious Regression

The number of Facebook users and the number of people who died of helium poisoning in the UK:

Year    Facebook users in millions (X)    Deaths from helium poisoning in UK (Y)
2004    1                                 2
2005    6                                 2
2006    12                                2
2007    58                                2
2008    145                               11
2009    360                               21
2010    608                               31
2011    845                               40
2012    1056                              51
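Fitting a simple linear regression to this table shows how misleading a high R² can be; a Python sketch:

```python
# Spurious regression check: regress Y (helium-poisoning deaths) on
# X (Facebook users) and compute R^2. The fit is excellent even though
# there is no plausible causal link between the two series.
X = [1, 6, 12, 58, 145, 360, 608, 845, 1056]
Y = [2, 2, 2, 2, 11, 21, 31, 40, 51]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
sxx = sum((xi - xbar) ** 2 for xi in X)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(X, Y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

sst = sum((yi - ybar) ** 2 for yi in Y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(X, Y))
r2 = 1 - sse / sst
print(round(r2, 3))  # 0.993
```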
Hypothesis Test for Regression Co-efficient (t-Test)

• The regression coefficient (β₁) captures the existence of a linear relationship between the response variable and the explanatory variable.
• If β₁ = 0, we can conclude that there is no statistically significant linear relationship between the two variables.

The estimate of β₁ using OLS is given by

β̂₁ = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ(Xᵢ − X̄)² = Σᵢ(Xᵢ − X̄)Yᵢ / Σᵢ(Xᵢ − X̄)²

(the term Σᵢ(Xᵢ − X̄)Ȳ vanishes because Σᵢ(Xᵢ − X̄) = 0).

The above equation can be written as

β̂₁ = Σᵢ KᵢYᵢ / Σᵢ Kᵢ²   where Kᵢ = (Xᵢ − X̄)

That is, the value of β̂₁ is a function of the Yᵢ (Kᵢ is a constant, since Xᵢ is assumed to be non-stochastic).
The standard error of β̂₁ is given by

Se(β̂₁) = Se / √Σ(Xᵢ − X̄)²

In the above equation, Se is the standard error of estimate (or standard error of the residuals) that measures the accuracy of prediction and is given by

Se = √[Σᵢ(Yᵢ − Ŷᵢ)² / (n − 2)] = √[Σᵢ εᵢ² / (n − 2)]

The denominator in the above equation is (n − 2) since β₀ and β₁ are estimated from the sample in estimating Yᵢ, and thus two degrees of freedom are lost. The standard error of β̂₁ can be written as

Se(β̂₁) = √[Σᵢ(Yᵢ − Ŷᵢ)² / (n − 2)] / √Σ(Xᵢ − X̄)²
The null and alternative hypotheses for the SLR model can be
stated as follows:
H0: There is no relationship between X and Y
HA: There is a relationship between X and Y
• 1 = 0 would imply that there is no linear relationship between
the response variable Y and the explanatory variable X. Thus, the
null and alternative hypotheses can be restated as follows:

H0: 1 = 0
HA: 1  0
• The corresponding t-statistic is given as

  
 1  1  1 0  1
t   
  
Se ( 1) Se ( 1) Se ( 1)
Test for Overall Model: Analysis of Variance (F-test)

The null and alternative hypothesis for F-test is given by


H 0 : There is no statistically significant relationship
between Y and any of the explanatory variables (i.e., all
regression coefficients are zero).
HA: Not all regression coefficients are zero
• Alternatively:
H0: All regression coefficients are equal to zero
HA: Not all regression coefficients are equal to zero

• The F-statistic is given by


MSR MSR / 1
F 
MSE MSE / n  2
