Multiple Regression SPECIALISTICA
Outline
- Introduction
- The multiple linear regression model
- Underlying assumptions
- Parameter estimation and hypothesis testing
- Residual diagnostics
- Goodness of fit and model selection
- Examples in R
The multiple linear regression model links a response variable $y$ to $p$ explanatory variables via
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n,$$
where the residual or error terms $\varepsilon_i$, $i = 1, \dots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$. As a consequence, the distribution of the random response variable is also normal, $y \sim N(\mu, \sigma^2)$, with expected value $\mu$ given by
$$\mu = E(y \mid x_1, x_2, \dots, x_p) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,$$
and variance $\sigma^2$.
The parameters $\beta_k$, $k = 1, 2, \dots, p$, are known as regression coefficients and give the amount of change in the response variable associated with a unit change in the corresponding explanatory variable, conditional on the other explanatory variables in the model remaining unchanged.
Note
The term linear in multiple linear regression refers to the regression parameters, not to the response or explanatory variables. Consequently, models in which, for example, the logarithm of a response variable is modeled in terms of quadratic functions of some of the explanatory variables are included in this class of models. An example of a nonlinear model is
$$y_i = \beta_1 e^{\beta_2 x_{i1}} + \beta_3 e^{\beta_4 x_{i2}} + \varepsilon_i.$$
In matrix notation, the model can be written as
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)'$ is the parameter vector and $\boldsymbol{\varepsilon}$ is the vector of error terms. Each row in $\mathbf{X}$ (sometimes known as the design matrix) represents the values of the explanatory variables for one of the individuals in the sample, with the addition of unity to take into account the parameter $\beta_0$.
Properties of LS estimators
Gauss-Markov theorem
In a linear model in which the errors have expectation zero, are uncorrelated and have equal variances, the Best Linear Unbiased Estimators (BLUE) of the coefficients are the Least-Squares (LS) estimators. More generally, the BLUE estimator of any linear combination of the coefficients is its least-squares estimator. It is noteworthy that the errors are not assumed to be normally distributed, nor are they assumed to be independent (only uncorrelated, a weaker condition), nor are they assumed to be identically distributed (only homoscedastic, a weaker condition).
Parameter estimation
The Least-Squares (LS) procedure is used to estimate the parameters in the multiple regression model. Assuming that $\mathbf{X}'\mathbf{X}$ is nonsingular, hence invertible, the LS estimator of the parameter vector is
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
This estimator has the following properties: $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $\mathrm{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. The diagonal elements of the matrix $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ give the variances of the $\hat{\beta}_j$, whereas the off-diagonal elements give the covariances between pairs $\hat{\beta}_j$, $\hat{\beta}_k$. The square roots of the diagonal elements of the matrix are thus the standard errors of the $\hat{\beta}_j$.
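As a quick check of these formulas, a minimal sketch in R (simulated data, hypothetical variable names) computes $\hat{\boldsymbol{\beta}}$ and its standard errors from the matrix expressions and compares them with lm():

## Sketch: LS estimates and standard errors via the matrix formulas
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with column of ones
betahat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y

res  <- y - X %*% betahat
s2   <- sum(res^2) / (n - ncol(X))         # estimate of sigma^2
covb <- s2 * solve(t(X) %*% X)             # estimated cov(betahat)
se   <- sqrt(diag(covb))                   # standard errors of the betahat_j

cbind(betahat, se)
coef(summary(lm(y ~ x1 + x2)))             # should agree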
In detail
The LS estimator minimizes the residual sum of squares
$$G(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
Minimization of this function results in a set of $p + 1$ normal equations, which are solved to yield the parameter estimators. The minimum is found by setting the gradient to zero:
$$\mathbf{0} = \nabla G(\hat{\boldsymbol{\beta}}) = -2\mathbf{X}'\mathbf{y} + 2(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} \;\Longrightarrow\; \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
Variance table
The regression analysis can be assessed using the following analysis of variance (ANOVA) table.
Table: ANOVA table

Source of Variation | Sum of Squares (SS)                          | Degrees of Freedom (df) | Mean Square
Regression          | $SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ | $p$                     | $MSR = SSR/p$
Residual            | $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$     | $n - p - 1$             | $MSE = SSE/(n - p - 1)$
Total               | $SST = \sum_{i=1}^n (y_i - \bar{y})^2$       | $n - 1$                 |
where $\hat{y}_i$ is the predicted value of the response variable for the $i$th individual and $\bar{y}$ is the mean value of the response variable.
The hypothesis that all regression coefficients are zero, $H_0\colon \beta_1 = \beta_2 = \dots = \beta_p = 0$, is tested with the mean square ratio
$$F = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n - p - 1)}.$$
Under $H_0$, the mean square ratio has an $F$-distribution with $p$, $n - p - 1$ degrees of freedom. On the basis of the ANOVA table, we may calculate the multiple correlation coefficient $R^2$, related to the $F$-test, that gives the proportion of variance of the response variable accounted for by the explanatory variables. We will discuss it in more detail later on.
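A minimal sketch, continuing the simulated example above, reconstructs the overall $F$ test from the ANOVA quantities and checks it against summary():

## Sketch: overall F test computed from SSR and SSE
fit <- lm(y ~ x1 + x2)
SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SSR <- SST - SSE
p   <- 2                                   # number of explanatory variables
Fstat <- (SSR / p) / (SSE / (n - p - 1))
pf(Fstat, p, n - p - 1, lower.tail = FALSE)  # p-value under H0
summary(fit)$fstatistic                      # should agree with Fstat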
Two models are nested if both contain the same terms and one has at least one additional term. For example, the model
(a) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
is nested within model
(b) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon.$
Model (a) is the reduced model and model (b) is the full model. In order to decide whether the full model is better than the reduced one (i.e., does it contribute additional information about the association between $y$ and the predictors?), we test the hypothesis $H_0\colon \beta_4 = \beta_5 = 0$ against the alternative that at least one of the additional terms is $\neq 0$.
To compare the nested models, the statistic
$$F = \frac{(SSE_{\text{reduced}} - SSE_{\text{full}})/q}{SSE_{\text{full}}/(n - (k + q + 1))}$$
is used, where $k$ is the number of terms in the reduced model (besides the intercept) and $q$ is the number of additional terms in the full model. Once an appropriate level $\alpha$ is chosen, if $F \geq F_{\alpha,\nu_1,\nu_2}$, with $\nu_1 = q$ and $\nu_2 = n - (k + q + 1)$, then $H_0$ is rejected.
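In R the partial $F$ test is produced by anova() applied to the two fitted models; a sketch assuming a hypothetical data frame dat with columns y, x1, x2:

## Sketch: partial F test for the nested models (a) and (b)
reduced <- lm(y ~ x1 + x2 + x1:x2, data = dat)
full    <- lm(y ~ x1 + x2 + x1:x2 + I(x1^2) + I(x2^2), data = dat)
anova(reduced, full)   # F test of H0: beta4 = beta5 = 0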
Cook's distance is defined as
$$D_k = \frac{1}{(p + 1)s^2} \sum_{i=1}^n \left(\hat{y}_{i(k)} - \hat{y}_i\right)^2,$$
where $\hat{y}_{i(k)}$ is the fitted value of the $i$th observation when the $k$th observation is omitted from the model and $s^2$ is the estimated error variance. The values of $D_k$ measure the influence of the $k$th observation on the estimated regression coefficients (cause for concern). Values of $D_k$ greater than one suggest that the corresponding observation has undue influence on the estimated regression coefficients.
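A short sketch (hypothetical data frame dat) of how these distances are obtained and screened in R:

## Sketch: Cook's distances; values > 1 flag potentially influential points
fit <- lm(y ~ x1 + x2, data = dat)
D <- cooks.distance(fit)
which(D > 1)                               # observations with undue influence
plot(D, type = "h", ylab = "Cook's distance")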
Principle of parsimony
In science, parsimony is the preference for the least complex explanation for an observation: one should always choose the simplest explanation of a phenomenon, the one that requires the fewest leaps of logic (Burnham and Anderson, 2002). William of Occam suggested in the 14th century that one shave away all that is unnecessary, an aphorism known as Occam's razor. Albert Einstein is supposed to have said: "Everything should be made as simple as possible, but no simpler."

According to Box and Jenkins (1970), the principle of parsimony should lead to a model with the smallest number of parameters that adequately represents the data. Statisticians view the principle of parsimony as a bias versus variance tradeoff: usually, the bias of the parameter estimates decreases and their variance increases as the dimension (number of parameters) of the model increases. All model selection methods are based on the principle of parsimony.
Multiple correlation coefficient $R^2$

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}.$$
It is the proportion of variability in a data set that is accounted for by the statistical model. It gives a measure of the strength of the association between the independent (explanatory) variables and the one dependent variable. It can take any value from 0 to +1: the closer $R^2$ is to one, the stronger the linear association is; if it is equal to zero, then there is no linear association between the dependent variable and the independent variables. Another formulation of the statistic $F$ for the entire model is
$$F = \frac{R^2/p}{(1 - R^2)/(n - p - 1)}.$$
Adjusted $R^2$

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$
Indeed, since $R^2 = SSR/SST = 1 - SSE/SST$,
$$R^2_{adj} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)} = 1 - \frac{(1 - R^2)\,SST/(n - p - 1)}{SST/(n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$
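A minimal sketch (hypothetical data frame dat) verifying both definitions against the values reported by summary():

## Sketch: R^2 and adjusted R^2 from their definitions
fit <- lm(y ~ x1 + x2, data = dat)
n <- nrow(dat); p <- 2
SST <- sum((dat$y - mean(dat$y))^2)
SSE <- sum(residuals(fit)^2)
R2    <- 1 - SSE / SST
R2adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
c(R2,    summary(fit)$r.squared)       # should agree
c(R2adj, summary(fit)$adj.r.squared)   # should agree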
The log-likelihood for the normal distribution, evaluated at the ML estimates, is given by
$$\ell = -\frac{n}{2}\left(\log(2\pi) + \log(SSE/n) + 1\right).$$
If the object under study is the multiple linear regression model with $p$ explanatory variables and $n$ units, then
$$AIC = -2\ell + 2K \quad \text{and} \quad BIC = -2\ell + K\log(n),$$
where $K = p + 2$ counts the $p + 1$ regression coefficients and $\sigma^2$.
A common mistake when computing AIC is to take the estimate of $\sigma^2$ from the computer output instead of using the ML estimate $\hat{\sigma}^2 = SSE/n$ appearing above. Moreover, $K$ is the total number of estimated parameters, including the intercept and $\sigma^2$.
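A sketch (hypothetical data frame dat) of the manual computation, checked against R's built-in AIC():

## Sketch: AIC from the normal log-likelihood with the ML variance SSE/n
fit <- lm(y ~ x1 + x2, data = dat)
n   <- nrow(dat)
SSE <- sum(residuals(fit)^2)
ll  <- -n * (log(2 * pi) + log(SSE / n) + 1) / 2
K   <- length(coef(fit)) + 1            # regression coefficients plus sigma^2
c(manual = -2 * ll + 2 * K, builtin = AIC(fit))   # should agree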
Mallows' $C_p$ statistic

Mallows' $C_p$ statistic is defined as
$$C_p = \frac{SSR_p}{s^2} - (n - 2p),$$
where $SSR_p$ is the residual sum of squares from a regression model with a certain set of $p - 1$ of the explanatory variables plus an intercept, and $s^2$ is the estimate of $\sigma^2$ from the model that includes all explanatory variables under consideration.
- $C_p$ is an unbiased estimator of the mean squared prediction error.
- If $C_p$ is plotted against $p$, the subsets of the variables ensuring a parsimonious model are those lying close to the line $C_p = p$.
- In this plot, the value $p$ is (roughly) the contribution to $C_p$ from the variance of the estimated parameters, whereas the remaining $C_p - p$ is (roughly) the contribution from the bias of the model.
- The $C_p$ plot is a useful device for evaluating the $C_p$ values of a range of models (Mallows, 1973, 1995; Burman, 1996).
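A sketch (hypothetical data frame dat with predictors x1, x2, x3) of the hand computation for one candidate subset:

## Sketch: Mallows' Cp for a subset model, sigma^2 taken from the full model
full <- lm(y ~ x1 + x2 + x3, data = dat)
sub  <- lm(y ~ x1, data = dat)             # intercept plus one predictor
s2   <- summary(full)$sigma^2
SSRp <- sum(residuals(sub)^2)              # residual SS of the subset model
p    <- length(coef(sub))                  # parameters in the subset model
SSRp / s2 - (nrow(dat) - 2 * p)            # close to p for an adequate subset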
To summarize
Rules of thumb
There are a number of measures based on the entire estimated equation that can be used to compare two or more alternative specifications and select the best one based on that specific criterion.
- $R^2_{adj}$: higher is better.
- AIC: lower is better.
- BIC: lower is better.
- $C_p$: low values indicate the best models to consider.
This data set concerns air pollution in the United States. For 41 cities, the following variables were recorded:

SO2: Sulphur dioxide content of air in micrograms per cubic meter
Temp: Average annual temperature in °F
Manuf: Number of manufacturing enterprises employing 20 or more workers
Pop: Population size (1970 census) in thousands
Wind: Average annual wind speed in miles per hour
Precip: Average annual precipitation in inches
Days: Average number of days with precipitation per year

Air Pollution in the U.S. Cities. From Biometry, 2/E, R. R. Sokal and F. J. Rohlf. Copyright © 1969, 1981 by W. H. Freeman and Company.
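A sketch of how model1 was presumably fitted (the file name is hypothetical; variable names follow the output below, with Neg.temp the negated average temperature):

## Sketch: fitting the full model for the usair data
usair <- read.table("usair.dat", header = TRUE)   # hypothetical file name
usair$Neg.temp <- -usair$Temp
model1 <- lm(SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days,
             data = usair)
summary(model1)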
#R^2 for model1
#Residual standard error: 14.64 on 34 degrees of freedom
#Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
#F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07
> model1$coefficients
 (Intercept)     Neg.temp        Manuf          Pop         Wind       Precip         Days
111.72848064   1.26794109   0.06491817  -0.03927674  -3.18136579   0.51235896  -0.05205019
> anova(model1)
> xtable(anova(model1))
           Df  Sum Sq Mean Sq F value Pr(>F)
Neg.temp    1 4143.33 4143.33   19.34 0.0001
Manuf       1 7230.76 7230.76   33.75 0.0000
Pop         1 2125.16 2125.16    9.92 0.0034
Wind        1  447.90  447.90    2.09 0.1573
Precip      1  785.38  785.38    3.67 0.0640
Days        1   22.11   22.11    0.10 0.7500
Residuals  34 7283.27  214.21
[Output fragment: of the coefficients of the stepwise-selected model, only Precip 0.41947 survives extraction.]
> step(model1, direction="backward")
> step(model1, direction="forward")
> step(model1, direction="both")
#R^2 for the selected model by the stepwise procedure
#Residual standard error: 14.45 on 35 degrees of freedom
#Multiple R-squared: 0.6685, Adjusted R-squared: 0.6212
#F-statistic: 14.12 on 5 and 35 DF, p-value: 1.409e-07
[Best-subsets table (fragment): subset {1,2,3,4} with criterion value 0.592; subset {1,2,3,5,6} with 0.588.]
[Figure: $C_p$ plot for the usair data — $C_p$ (about 5.0 to 7.5) against $p$ (3 to 7); subset labels include 23, 236, 1235, 1236, 2346, 123456 and 12345, with 12345 lying lowest.]
Note
The $C_p$ plot shows that the minimum value of the $C_p$ index corresponds to the combination #12345; hence the variables selected for a parsimonious model are Neg.temp, Manuf, Pop, Wind and Precip.
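A sketch of how such a plot can be produced with the leaps package (assuming the usair data frame from the earlier sketch):

## Sketch: Cp plot with leaps()
library(leaps)
X <- as.matrix(usair[, c("Neg.temp", "Manuf", "Pop", "Wind", "Precip", "Days")])
out <- leaps(X, usair$SO2, method = "Cp")
plot(out$size, out$Cp, type = "n", xlab = "p", ylab = "Cp")
labs <- apply(out$which, 1, function(w) paste(which(w), collapse = ""))
text(out$size, out$Cp, labs)               # label each subset, e.g. "12345"
abline(0, 1)                               # the line Cp = p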
[Figure: normal Q-Q plot of the model1 residuals — sample quantiles (about -20 to 40) against theoretical quantiles.]
[Figure: histogram of model1$res — frequency of residuals between about -20 and 40.]
[Figure: index plot of the standardized residuals (rstand) of model1 for the 41 cities.]
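A sketch reproducing the diagnostic plots summarized above:

## Sketch: residual diagnostics for model1
qqnorm(residuals(model1)); qqline(residuals(model1))   # normality check
hist(model1$res, main = "Histogram of model1$res")     # residual histogram
rstand <- rstandard(model1)
plot(rstand, xlab = "Index")                           # index plot
abline(h = c(-2, 2), lty = 2)                          # rough outlier bounds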
[Figure: normal Q-Q plot and index plot of standardized residuals for the model selected by the stepwise procedure.]
A data frame containing estimates of the percentage of body fat determined by underwater weighing, together with various body circumference measurements, for 252 males aged 21 to 81 (Johnson, 1996). The response variable is $y = 1/\text{density}$, as in Burnham and Anderson (2002). The 13 potential predictors are age, weight, height, and 10 body circumference measurements. We select the best model using the AIC, Mallows' $C_p$ and adjusted $R^2$ criteria through stepwise regression.
density: Density from underwater weighing (gm/cm³)
age: Age (years)
weight: Weight (lbs)
height: Height (inches)
neck: Neck circumference (cm)
chest: Chest circumference (cm)
abdomen: Abdomen circumference (cm)
hip: Hip circumference (cm)
thigh: Thigh circumference (cm)
knee: Knee circumference (cm)
ankle: Ankle circumference (cm)
biceps: Biceps (extended) circumference (cm)
forearm: Forearm circumference (cm)
wrist: Wrist circumference (cm)
Full model
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.8748     0.0359   24.35   0.0000
age            0.0001     0.0001    1.56   0.1198
weight        -0.0002     0.0001   -1.91   0.0580
height        -0.0002     0.0002   -0.79   0.4314
neck          -0.0009     0.0005   -1.97   0.0504
chest         -0.0001     0.0002   -0.53   0.5993
abdomen        0.0020     0.0002   11.43   0.0000
hip           -0.0005     0.0003   -1.54   0.1261
thigh          0.0005     0.0003    1.71   0.0892
knee           0.0000     0.0005    0.01   0.9893
ankle          0.0006     0.0005    1.24   0.2144
biceps         0.0005     0.0004    1.41   0.1606
forearm        0.0009     0.0004    2.22   0.0276
wrist         -0.0036     0.0011   -3.23   0.0014

> model11$r.squared
[1] 0.7424321
> model11$adj.r.squared
[1] 0.7283632
[Figure: diagnostic plots for the full model — normal Q-Q plot of residuals(model1) (sample quantiles about -0.02 to 0.02) and index plot of the standardized residuals (rstand) over the 252 observations.]
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.8644     0.0244   35.49   0.0000
age            0.0001     0.0001    1.73   0.0848
weight        -0.0002     0.0001   -2.56   0.0111
neck          -0.0010     0.0005   -2.04   0.0421
abdomen        0.0020     0.0001   13.30   0.0000
hip           -0.0004     0.0003   -1.51   0.1320
thigh          0.0007     0.0003    2.56   0.0109
forearm        0.0011     0.0004    2.76   0.0063
wrist         -0.0033     0.0011   -3.09   0.0023
#Residual standard error: 0.008903 on 243 degrees of freedom # Multiple R-squared: 0.7377, Adjusted R-squared: 0.7291 #F-statistic: 85.44 on 8 and 243 DF, p-value: < 2.2e-16
[Figure: $C_p$ plot for the body fat data — $C_p$ against $p$ (about 6 to 14) for a large number of candidate subsets of the 13 predictors.]
Using the AIC criterion, the best model is
y ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist
with AIC = -2370.71, as we can find in Burnham and Anderson (2002). Using the adjusted $R^2$ criterion, the best model has 10 covariates:
y ~ age + weight + neck + abdomen + hip + thigh + ankle + biceps + forearm + wrist
Finally, the model selected by Mallows' $C_p$ criterion has 8 covariates:
y ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist
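A sketch of how the three criteria might be compared in R (hypothetical data frame fat with y = 1/density and the 13 predictors):

## Sketch: model selection under AIC, adjusted R^2 and Cp
fullfat <- lm(y ~ ., data = fat)
step(fullfat, direction = "both")              # AIC-based stepwise search
library(leaps)
ss <- summary(regsubsets(y ~ ., data = fat, nvmax = 13))
which.max(ss$adjr2)                            # best size by adjusted R^2
which.min(ss$cp)                               # best size by Mallows' Cp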
In this data set, CO2 emissions (metric tons per capita) measured in 116 countries are related to other variables like:
1. Energy use (kg of oil equivalent per capita)
2. Export of goods and services (% of GDP)
3. Gross Domestic Product (GDP) growth (annual %)
4. Population growth (annual %)
5. Annual deforestation (% of change)
6. Gross National Income (GNI), Atlas method (current US$)
Full model
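A minimal sketch of how the full model might be fitted (the data frame co2data and the variable names are hypothetical):

## Sketch: full model for the CO2 data
mod1 <- lm(CO2 ~ energy + exports + gdp.growth + pop.growth +
                 deforestation + GNI, data = co2data)
summary(mod1)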
[Figure: residual diagnostics for mod1 — normal Q-Q plot (sample quantiles about -15 to 5), plot of mod1$res, and index plot of the standardized residuals (rstand, about -8 to 2) over the 116 countries.]
> extractAIC(mod1)
[1]   7.0000 213.2029
> extractAIC(mod2)
[1]   4.0000 207.6525
The same model is selected using the adjusted $R^2$ and Mallows' $C_p$ criteria, as we can find in Ricci (2006).
[Best-subsets table (fragment): subset {1,2,3,6} 0.798; subset {1,2,3,5,6} 0.797; subset {1,2,4,5,6} 0.797.]
[Figure: $C_p$ plot for the CO2 data — $C_p$ against $p$ (2 to 7); subset labels include 126, 124, 1235, 1245, 12346 and 123456.]
Bibliography
AUSTIN, P. C. and TU, J. V. (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. Journal of Clinical Epidemiology 57, 1138-1146.
BROWN, P. J. (1994). Measurement, Regression and Calibration. Oxford: Clarendon.
BURNHAM, K. P. and ANDERSON, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.
BURMAN, P. (1996). Model fitting via testing. Statistica Sinica 6, 589-601.
CHATTERJEE, S. and HADI, A. S. (2006). Regression Analysis by Example, 4th edition. Hoboken, New Jersey: Wiley & Sons.
DER, G. and EVERITT, B. S. (2006). Statistical Analysis of Medical Data Using SAS. Boca Raton, Florida: Chapman & Hall/CRC.
DIZNEY, H. and GROMAN, L. (1967). Predictive validity and differential achievement in three MLA comparative foreign language tests. Educational and Psychological Measurement 27, 1127-1130.
EVERITT, B. S. (2005). An R and S-PLUS Companion to Multivariate Analysis. London: Springer-Verlag.
FINOS, L., BROMBIN, C. and SALMASO, L. (2009). Adjusting stepwise p-values in generalized linear models. Accepted for publication in Communications in Statistics: Theory and Methods.
FREEDMAN, L. S., PEE, D. and MIDTHUNE, D. N. (1992). The problem of underestimating the residual error variance in forward stepwise regression. The Statistician 41, 405-412.
GABRIEL, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453-467.
HARSHMAN, R. A. and LUNDY, M. E. (2006). A randomization method of obtaining valid p-values for model changes selected post hoc. Poster presented at the Seventy-first Annual Meeting of the Psychometric Society, Montreal, Canada, June 2006. Available at http://publish.uwo.ca/~harshman/imps2006.pdf.
JOHNSON, R. W. (1996). Fitting percentage of body fat to simple body measurements. Journal of Statistics Education 4 (e-journal); see http://www.amstat.org/publications/jse/toc.html.
JOLLIFFE, I. T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C 31, 300-303.
JOLLIFFE, I. T. (1986). Principal Component Analysis. New York: Springer.
MALLOWS, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675.
MALLOWS, C. L. (1995). More comments on Cp. Technometrics 37, 362-372.
MORRISON, D. F. (1967). Multivariate Statistical Methods. New York: McGraw-Hill.
RICCI, V. (2006). Principali tecniche di regressione con R [Main regression techniques with R]. See cran.r-project.org/doc/contrib/Ricci-regression-it.pdf.