4 Regression Issues
Normality of Residuals
Probability Plots
X     Cum. Freq.   Cum. Prob. (Expected)   Cum. Prob. (Observed)
0        1            0.006738                0.005
1        3            0.040428                0.015
2       13            0.124652                0.065
3       42            0.265026                0.21
4       76            0.440493                0.38
5      123            0.615961                0.615
6      150            0.762183                0.75
7      165            0.866628                0.825
8      186            0.931906                0.93
9      196            0.968172                0.98
10     200            0.986305                1
>10    200            1                       1
(n = 200; the observed cumulative probability is Cum. Freq. divided by 200.)
Probability Plots
[Figure: Probability plots. Left panel: observed and expected cumulative probabilities plotted against X (0 to >10). Right panel: P-P plot of observed vs. expected cumulative probabilities, both axes from 0 to 1.]
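Below is a minimal sketch, in Python, of how such a probability (P-P) plot can be constructed for regression residuals. The data here are simulated stand-ins, not the slide's example, and all variable names are illustrative.

```python
# Minimal P-P plot sketch: observed vs. expected cumulative
# probabilities for (simulated) regression residuals.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
resid = rng.normal(size=200)              # stand-in for OLS residuals

resid_sorted = np.sort(resid)
obs_cum = (np.arange(1, 201) - 0.5) / 200           # observed cum. proportions
exp_cum = stats.norm.cdf(resid_sorted, loc=resid.mean(), scale=resid.std())

plt.plot(exp_cum, obs_cum, "o", markersize=3)
plt.plot([0, 1], [0, 1], "r--")           # 45-degree reference line
plt.xlabel("Expected cumulative probability")
plt.ylabel("Observed cumulative probability")
plt.title("Normal P-P plot of residuals")
plt.show()
```

Points lying close to the 45-degree reference line indicate approximately normal residuals.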
Heteroscedasticity
Other Tests
Breusch-Pagan (BP) Test
White’s Test of Heteroscedasticity
Other tests, such as the Glejser, Spearman's rank correlation, and Goldfeld-Quandt tests of heteroscedasticity
Breusch-Pagan (BP) Test
• Estimate the OLS regression and obtain the squared residuals.
• Regress the squared residuals on the k regressors included in the model.
• Other regressors can also be used if they have some bearing on the error variance.
• Set up H0 that the error variance is homoscedastic, i.e., that all the slope coefficients in this auxiliary regression are simultaneously equal to zero.
• Use the F statistic from this auxiliary regression to test H0.
• Reject H0 if the F statistic is significant; this indicates heteroscedasticity (see the sketch below).
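A minimal sketch of the BP test with statsmodels, using simulated heteroscedastic data; het_breuschpagan regresses the squared residuals on the supplied regressors and returns both LM and F versions of the test. All names and data are illustrative.

```python
# Breusch-Pagan test sketch on simulated data whose error variance
# depends on the first regressor.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=200) * (1 + np.abs(X[:, 1]))

res = sm.OLS(y, X).fit()
lm, lm_p, fval, f_p = het_breuschpagan(res.resid, res.model.exog)
print(f"F = {fval:.2f}, p = {f_p:.4f}")   # small p => reject homoscedasticity
```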
White’s Test
• Obtain the squared residuals from the OLS regression.
• Regress the squared residuals on the regressors, the squared terms of these regressors, and the pair-wise cross-product terms of the regressors.
• Multiply the R2 value of this auxiliary regression by n.
• Under the null hypothesis of homoscedasticity, this product follows the chi-square distribution with df equal to the number of regressors in the auxiliary regression (excluding the intercept).
• The White test is more general and more flexible than the BP test (see the sketch below).
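A minimal sketch of White's test with statsmodels on illustrative data; het_white adds the squares and cross-products of the regressors automatically and computes the n·R² statistic.

```python
# White's test sketch on the same kind of simulated heteroscedastic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=200) * (1 + np.abs(X[:, 1]))
res = sm.OLS(y, X).fit()

lm, lm_p, fval, f_p = het_white(res.resid, res.model.exog)
print(f"n*R^2 = {lm:.2f}, p = {lm_p:.4f}")   # small p => heteroscedasticity
```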
Possible Reasons
• The presence of outliers in the data
• Incorrect functional form of the regression model
• Incorrect transformation of data
• Mixing observations with different measures of scale (such as mixing
observations from different groups with different variances)
Consequences
Example
When Collinearity Is Not Perfect, But High
The OLS estimators are still BLUE, but one or more regression coefficients have large standard errors relative to the values of the coefficients, and the t ratios are small, leading to the conclusion that the true values of these coefficients are not different from zero.
The R2 value may be very high, yet some regression coefficients are statistically not significant.
The regression coefficients become very sensitive to small changes in the data, especially when the sample is relatively small.
Impact of Multicollinearity
• The variances of the regression coefficient estimators are inflated.
• The magnitudes of the regression coefficient estimates may differ considerably from sample to sample.
• Adding or removing variables produces large changes in the coefficient estimates.
• Removing a data point causes a large change in the coefficient estimates.
• In some cases, the F ratio is significant, but none of the t ratios are.
Variance Inflation Factor
Consider the three-variable regression model $Y_i = b_1 + b_2 X_{2i} + b_3 X_{3i} + u_i$. Then:

$$\operatorname{var}(b_2) = \frac{\sigma^2}{\sum x_{2i}^2 \,(1 - r_{23}^2)} = \frac{\sigma^2}{\sum x_{2i}^2}\, VIF$$

$$\operatorname{var}(b_3) = \frac{\sigma^2}{\sum x_{3i}^2 \,(1 - r_{23}^2)} = \frac{\sigma^2}{\sum x_{3i}^2}\, VIF$$

• where σ2 is the variance of the error term ui, and r23 is the coefficient of correlation between X2 and X3.
Variance Inflation Factor
$$VIF = \frac{1}{1 - r_{23}^2}$$
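A minimal sketch of computing VIFs with statsmodels on illustrative two-regressor data; in this two-variable case, variance_inflation_factor reproduces 1/(1 − r23²).

```python
# VIF sketch: two deliberately collinear regressors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x2 = rng.normal(size=200)
x3 = 0.9 * x2 + 0.1 * rng.normal(size=200)   # highly correlated with x2
X = sm.add_constant(np.column_stack([x2, x3]))

for i, name in [(1, "X2"), (2, "X3")]:        # skip the constant column
    print(name, variance_inflation_factor(X, i))
```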
Autocorrelation
Durbin-Watson Decision Rules
• 0 ≤ d < dL: reject H0; evidence of positive autocorrelation
• dL ≤ d ≤ dU: inconclusive
• dU < d < 4 − dU: do not reject H0; no autocorrelation
• 4 − dU ≤ d ≤ 4 − dL: inconclusive
• 4 − dL < d ≤ 4: reject H0; evidence of negative autocorrelation
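A minimal sketch of the Durbin-Watson statistic with statsmodels on simulated data with AR(1) errors; d near 2 suggests no first-order autocorrelation, and the computed d is compared against the dL/dU bounds from DW tables. All names and data are illustrative.

```python
# Durbin-Watson sketch: OLS on data with AR(1) errors (rho = 0.7).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()     # AR(1) error process
y = 1.0 + 2.0 * x + u
res = sm.OLS(y, sm.add_constant(x)).fit()

print(f"Durbin-Watson d = {durbin_watson(res.resid):.3f}")  # well below 2 here
```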
Breusch-Godfrey (BG) Test
Suppose the error term follows the pth-order autoregressive scheme
$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + v_t$
The null hypothesis of no autocorrelation is
$H_0: \rho_1 = \rho_2 = \cdots = \rho_p = 0$
• The BG test involves the following steps:
• Regress et, the residuals from the main regression, on the regressors in the model and the p autoregressive terms in the equation above, and obtain R2 from this auxiliary regression.
• If the sample size is large, Breusch and Godfrey have shown that $(n - p)R^2 \sim \chi^2_p$.
• Test H0 using this chi-square statistic. Alternatively:
• Use the F value obtained from the auxiliary regression.
• This F value has (p, n − k − p) degrees of freedom in the numerator and denominator, respectively, where k represents the number of parameters in the auxiliary regression (including the intercept term). A sketch of the test follows below.
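A minimal sketch of the BG test with statsmodels; nlags is the order p of the autoregressive terms in the auxiliary regression. The data are illustrative, with the same AR(1) errors as in the Durbin-Watson sketch.

```python
# Breusch-Godfrey test sketch on data with AR(1) errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()     # AR(1) error process
y = 1.0 + 2.0 * x + u
res = sm.OLS(y, sm.add_constant(x)).fit()

lm, lm_p, fval, f_p = acorr_breusch_godfrey(res, nlags=2)
print(f"chi-square = {lm:.2f}, p = {lm_p:.4f}")  # small p => autocorrelation
```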
Remedial Measures
First-Difference Transformation
If the autocorrelation is of AR(1) type, then
$u_t - \rho u_{t-1} = v_t$
Assume ρ = 1 and run the first-difference model, taking the first difference of the dependent variable and of all regressors.
Generalized Transformation
Estimate ρ by regressing the residuals on their lagged values, then use the estimated value to run the transformed (quasi-differenced) regression, as sketched below.
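A minimal sketch of the generalized (quasi-difference) transformation on illustrative data with AR(1) errors: ρ is estimated by regressing the residuals on their own lag, then y and the regressors are quasi-differenced and the model is re-estimated by OLS.

```python
# Generalized transformation sketch (Cochrane-Orcutt style, one pass).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()     # AR(1) error process
y = 1.0 + 2.0 * x + u
X = sm.add_constant(x)

e = sm.OLS(y, X).fit().resid
rho = sm.OLS(e[1:], e[:-1]).fit().params[0]  # e_t = rho * e_{t-1} + v_t
y_star = y[1:] - rho * y[:-1]                # quasi-differenced y
X_star = X[1:] - rho * X[:-1]                # quasi-differenced regressors
res_gls = sm.OLS(y_star, X_star).fit()
print(f"estimated rho = {rho:.3f}")
```

In practice the estimate-and-transform step can be iterated until ρ converges.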
Model Evaluation
Outliers
• Mahalanobis Distance: A measure of how much an observation's values on the independent variables differ from the average of all observations.
• Cook's Distance: A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients.
• Leverage Values: Measure the influence of a point on the fit of the regression. Leverage values range from 0 to (n − 1)/n.
• DFFIT and DFBETA values
Mahalanobis Distance
• Used to identify influential observations.
• It is the distance between a specific observation and the centroid of all observations of the explanatory variables:
$D_i^2 = (X_i - \mu)^{T} S^{-1} (X_i - \mu)$
• where $D_i^2$ is the squared Mahalanobis distance of point Xi, μ is the mean vector of the explanatory variables, and $S^{-1}$ is the inverse of their covariance matrix.
• $D_i^2$ should be less than the critical chi-square value with degrees of freedom equal to the number of independent variables in the model.
• A simple rule of thumb is to treat any observation with $D_i^2$ larger than 10 as an influential observation.
Mahalanobis Distance
• The Mahalanobis distance is used to check whether a sample point is an outlier.
• If the Mahalanobis distance is greater than the critical value, the sample point is flagged as an outlier.
• As a rule of thumb, the maximum Mahalanobis distance should not exceed the critical chi-square value with degrees of freedom equal to the number of predictors and α = .001, or else outliers may be a problem in the data (see the sketch below).
• The minimum, maximum, and mean Mahalanobis distances are displayed by SPSS
in the "Residuals Statistics" table when "Casewise diagnostics" is checked under
the Statistics button in the Regression dialog.
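A minimal numpy/scipy sketch of the squared Mahalanobis distances and the chi-square cutoff at α = .001; the predictor matrix is simulated and all names are illustrative.

```python
# Mahalanobis distance sketch for a matrix of predictors (no constant).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
Xp = rng.normal(size=(200, 3))                    # illustrative predictors

mu = Xp.mean(axis=0)                              # centroid
S_inv = np.linalg.inv(np.cov(Xp, rowvar=False))   # inverse covariance matrix
diff = Xp - mu
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)  # squared distances D_i^2

crit = stats.chi2.ppf(0.999, df=Xp.shape[1])      # alpha = .001 cutoff
print("max D^2 =", d2.max(), "critical value =", crit)
```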
Cook’s Distance
• Measures the change in the regression parameters.
• Measures how much the predicted values of the dependent variable change for all observations in the sample when a particular observation is excluded from the sample:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{(k+1)\, s^2}$$

• where $D_i$ is Cook's distance measure for the ith observation, $\hat{Y}_j$ is the predicted value of the jth observation with the ith observation included in the model, and $\hat{Y}_{j(i)}$ is the predicted value of the jth observation with the ith observation excluded from the model; k is the number of independent variables in the model and $s^2$ is the mean squared error of the regression.
• A Cook's distance value of more than 1 indicates an influential observation. Sometimes 4/(n − k − 1) is also used as a cutoff value (see the sketch below).
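A minimal sketch of Cook's distances via statsmodels influence diagnostics on illustrative data; the 4/(n − k − 1) cutoff is the slide's alternative rule of thumb.

```python
# Cook's distance sketch using statsmodels' built-in influence measures.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=200)
res = sm.OLS(y, X).fit()

cooks_d, _ = res.get_influence().cooks_distance
n, k = X.shape[0], X.shape[1] - 1             # n observations, k regressors
cutoff = 4 / (n - k - 1)
print("observations above cutoff:", (cooks_d > cutoff).sum())
```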
Leverage
• If a point has a large leverage, then the slope of the regression line follows
more closely the slope of the line between that point and the mean point.
• Points with small leverage may not have much influence on the regression
coefficients.
Leverage Value
The leverage of the ith observation is the ith diagonal element of the hat matrix $H = X(X^{T}X)^{-1}X^{T}$:
$h_{ii} = x_i^{T}(X^{T}X)^{-1}x_i$
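A minimal sketch of leverage values as the diagonal of the hat matrix, on illustrative data; statsmodels returns the same values via hat_matrix_diag.

```python
# Leverage sketch: compute h_ii directly and via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=200)
res = sm.OLS(y, X).fit()

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X(X'X)^{-1}X'
leverage = np.diag(H)                         # h_ii for each observation
assert np.allclose(leverage, res.get_influence().hat_matrix_diag)
print("mean leverage =", leverage.mean())     # equals (k+1)/n = 3/200
```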