Multiple Regression SPECIALISTICA
Outline
- Introduction
- The multiple linear regression model
- Underlying assumptions
- Parameter estimation and hypothesis testing
- Residual diagnostics
- Goodness of fit and model selection
- Examples in R
The multiple linear regression model links a response variable $y$ to $p$ explanatory variables via
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n,$$
where the residual or error terms $\varepsilon_i$, $i = 1, \dots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$. As a consequence, the distribution of the random response variable is also normal, $y \sim N(\mu, \sigma^2)$, with expected value $\mu$ given by
$$\mu = E(y \mid x_1, x_2, \dots, x_p) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,$$
and variance $\sigma^2$.
The parameters $\beta_k$, $k = 1, 2, \dots, p$, are known as regression coefficients and give the amount of change in the response variable associated with a unit change in the corresponding explanatory variable, conditional on the other explanatory variables in the model remaining unchanged.
Note
The term linear in multiple linear regression refers to the regression parameters, not to the response or explanatory variables. Consequently, models in which, for example, the logarithm of a response variable is modeled in terms of quadratic functions of some of the explanatory variables are included in this class of models. An example of a nonlinear model is
$$y_i = \beta_1 e^{\beta_2 x_{i1}} + \beta_3 e^{\beta_4 x_{i2}} + \varepsilon_i.$$
In matrix notation, the model can be written as
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)'$ is the parameter vector and $\boldsymbol{\varepsilon}$ is the vector of error terms. Each row in $\mathbf{X}$ (sometimes known as the design matrix) represents the values of the explanatory variables for one of the individuals in the sample, with the addition of unity to take into account the parameter $\beta_0$.
Properties of LS estimators
Gauss-Markov theorem
In a linear model in which the errors have expectation zero, are uncorrelated and have equal variances, the Best Linear Unbiased Estimators (BLUE) of the coefficients are the Least-Squares (LS) estimators. More generally, the BLUE estimator of any linear combination of the coefficients is its least-squares estimator. It is noteworthy that the errors are not assumed to be normally distributed, nor are they assumed to be independent (only uncorrelated, a weaker condition), nor are they assumed to be identically distributed (only homoscedastic, a weaker condition).
Parameter estimation
The Least-Squares (LS) procedure is used to estimate the parameters in the multiple regression model. Assuming that $\mathbf{X}'\mathbf{X}$ is nonsingular, hence invertible, the LS estimator of the parameter vector is
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
This estimator has the following properties: $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $\mathrm{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. The diagonal elements of the matrix $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ give the variances of the $\hat{\beta}_j$, whereas the off-diagonal elements give the covariances between pairs $\hat{\beta}_j$, $\hat{\beta}_k$. The square roots of the diagonal elements of the matrix are thus the standard errors of the $\hat{\beta}_j$.
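As a quick check of these formulas, a minimal sketch in R (simulated data, hypothetical variable names) computes $\hat{\boldsymbol{\beta}}$ and its standard errors from the matrix expressions and compares them with lm():

## Sketch: LS estimates and standard errors via the matrix formulas
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with column of ones
betahat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y

res  <- y - X %*% betahat
s2   <- sum(res^2) / (n - ncol(X))         # estimate of sigma^2
covb <- s2 * solve(t(X) %*% X)             # estimated cov(betahat)
se   <- sqrt(diag(covb))                   # standard errors of the betahat_j

cbind(betahat, se)
coef(summary(lm(y ~ x1 + x2)))             # should agree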
In detail
The LS estimator minimizes the residual sum of squares
$$G(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
Minimization of this function results in a set of $p + 1$ normal equations, which are solved to yield the parameter estimators. The minimum is found by setting the gradient to zero:
$$\mathbf{0} = \nabla G(\hat{\boldsymbol{\beta}}) = -2\mathbf{X}'\mathbf{y} + 2(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} \;\Longrightarrow\; \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
Variance table
The regression analysis can be assessed using the following analysis of variance (ANOVA) table.
Table: ANOVA table

Source of Variation | Sum of Squares (SS)                          | Degrees of Freedom (df) | Mean Square
Regression          | $SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ | $p$                     | $MSR = SSR/p$
Residual            | $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$     | $n - p - 1$             | $MSE = SSE/(n - p - 1)$
Total               | $SST = \sum_{i=1}^n (y_i - \bar{y})^2$       | $n - 1$                 |
where $\hat{y}_i$ is the predicted value of the response variable for the $i$th individual and $\bar{y}$ is the mean value of the response variable.
The hypothesis that all regression coefficients are zero, $H_0\colon \beta_1 = \beta_2 = \dots = \beta_p = 0$, is tested with the mean square ratio
$$F = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n - p - 1)}.$$
Under $H_0$, the mean square ratio has an $F$-distribution with $p$, $n - p - 1$ degrees of freedom. On the basis of the ANOVA table, we may calculate the multiple correlation coefficient $R^2$, related to the $F$-test, that gives the proportion of variance of the response variable accounted for by the explanatory variables. We will discuss it in more detail later on.
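A minimal sketch, continuing the simulated example above, reconstructs the overall $F$ test from the ANOVA quantities and checks it against summary():

## Sketch: overall F test computed from SSR and SSE
fit <- lm(y ~ x1 + x2)
SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SSR <- SST - SSE
p   <- 2                                   # number of explanatory variables
Fstat <- (SSR / p) / (SSE / (n - p - 1))
pf(Fstat, p, n - p - 1, lower.tail = FALSE)  # p-value under H0
summary(fit)$fstatistic                      # should agree with Fstat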
Two models are nested if both contain the same terms and one has at least one additional term. For example, the model
(a) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
is nested within model
(b) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon.$
Model (a) is the reduced model and model (b) is the full model. In order to decide whether the full model is better than the reduced one (i.e., does it contribute additional information about the association between $y$ and the predictors?), we test the hypothesis $H_0\colon \beta_4 = \beta_5 = 0$ against the alternative that at least one of the additional terms is $\neq 0$.
To compare the nested models, the statistic
$$F = \frac{(SSE_{\text{reduced}} - SSE_{\text{full}})/q}{SSE_{\text{full}}/(n - (k + q + 1))}$$
is used, where $k$ is the number of terms in the reduced model (besides the intercept) and $q$ is the number of additional terms in the full model. Once an appropriate level $\alpha$ is chosen, if $F \geq F_{\alpha,\nu_1,\nu_2}$, with $\nu_1 = q$ and $\nu_2 = n - (k + q + 1)$, then $H_0$ is rejected.
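In R the partial $F$ test is produced by anova() applied to the two fitted models; a sketch assuming a hypothetical data frame dat with columns y, x1, x2:

## Sketch: partial F test for the nested models (a) and (b)
reduced <- lm(y ~ x1 + x2 + x1:x2, data = dat)
full    <- lm(y ~ x1 + x2 + x1:x2 + I(x1^2) + I(x2^2), data = dat)
anova(reduced, full)   # F test of H0: beta4 = beta5 = 0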
Cook's distance is defined as
$$D_k = \frac{1}{(p + 1)s^2} \sum_{i=1}^n \left(\hat{y}_{i(k)} - \hat{y}_i\right)^2,$$
where $\hat{y}_{i(k)}$ is the fitted value of the $i$th observation when the $k$th observation is omitted from the model and $s^2$ is the estimated error variance. The values of $D_k$ measure the influence of the $k$th observation on the estimated regression coefficients (cause for concern). Values of $D_k$ greater than one suggest that the corresponding observation has undue influence on the estimated regression coefficients.
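A short sketch (hypothetical data frame dat) of how these distances are obtained and screened in R:

## Sketch: Cook's distances; values > 1 flag potentially influential points
fit <- lm(y ~ x1 + x2, data = dat)
D <- cooks.distance(fit)
which(D > 1)                               # observations with undue influence
plot(D, type = "h", ylab = "Cook's distance")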
Principle of parsimony
In science, parsimony is the preference for the least complex explanation for an observation: one should always choose the simplest explanation of a phenomenon, the one that requires the fewest leaps of logic (Burnham and Anderson, 2002). William of Occam suggested in the 14th century that one shave away all that is unnecessary, an aphorism known as Occam's razor. Albert Einstein is supposed to have said: "Everything should be made as simple as possible, but no simpler."

According to Box and Jenkins (1970), the principle of parsimony should lead to a model with the smallest number of parameters that adequately represents the data. Statisticians view the principle of parsimony as a bias versus variance tradeoff: usually, the bias of the parameter estimates decreases and their variance increases as the dimension (number of parameters) of the model increases. All model selection methods are based on the principle of parsimony.
Multiple correlation coefficient $R^2$

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}.$$
It is the proportion of variability in a data set that is accounted for by the statistical model. It gives a measure of the strength of the association between the independent (explanatory) variables and the one dependent variable. It can take any value from 0 to +1: the closer $R^2$ is to one, the stronger the linear association is; if it is equal to zero, then there is no linear association between the dependent variable and the independent variables. Another formulation of the statistic $F$ for the entire model is
$$F = \frac{R^2/p}{(1 - R^2)/(n - p - 1)}.$$
Adjusted $R^2$

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$
Indeed, since $R^2 = SSR/SST = 1 - SSE/SST$,
$$R^2_{adj} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)} = 1 - \frac{(1 - R^2)\,SST/(n - p - 1)}{SST/(n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$
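A minimal sketch (hypothetical data frame dat) verifying both definitions against the values reported by summary():

## Sketch: R^2 and adjusted R^2 from their definitions
fit <- lm(y ~ x1 + x2, data = dat)
n <- nrow(dat); p <- 2
SST <- sum((dat$y - mean(dat$y))^2)
SSE <- sum(residuals(fit)^2)
R2    <- 1 - SSE / SST
R2adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
c(R2,    summary(fit)$r.squared)       # should agree
c(R2adj, summary(fit)$adj.r.squared)   # should agree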
The log-likelihood for the normal distribution, evaluated at the ML estimates, is given by
$$\ell = -\frac{n}{2}\left(\log(2\pi) + \log(SSE/n) + 1\right).$$
If the object under study is the multiple linear regression model with $p$ explanatory variables and $n$ units, then
$$AIC = -2\ell + 2K \quad \text{and} \quad BIC = -2\ell + K\log(n),$$
where $K = p + 2$ counts the $p + 1$ regression coefficients and $\sigma^2$.
A common mistake when computing AIC is to take the estimate of $\sigma^2$ from the computer output instead of using the ML estimate $\hat{\sigma}^2 = SSE/n$ appearing above. Moreover, $K$ is the total number of estimated parameters, including the intercept and $\sigma^2$.
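A sketch (hypothetical data frame dat) of the manual computation, checked against R's built-in AIC():

## Sketch: AIC from the normal log-likelihood with the ML variance SSE/n
fit <- lm(y ~ x1 + x2, data = dat)
n   <- nrow(dat)
SSE <- sum(residuals(fit)^2)
ll  <- -n * (log(2 * pi) + log(SSE / n) + 1) / 2
K   <- length(coef(fit)) + 1            # regression coefficients plus sigma^2
c(manual = -2 * ll + 2 * K, builtin = AIC(fit))   # should agree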
Mallows' $C_p$ statistic

Mallows' $C_p$ statistic is defined as
$$C_p = \frac{SSR_p}{s^2} - (n - 2p),$$
where $SSR_p$ is the residual sum of squares from a regression model with a certain set of $p - 1$ of the explanatory variables plus an intercept, and $s^2$ is the estimate of $\sigma^2$ from the model that includes all explanatory variables under consideration.
- $C_p$ is an unbiased estimator of the mean squared prediction error.
- If $C_p$ is plotted against $p$, the subsets of the variables ensuring a parsimonious model are those lying close to the line $C_p = p$.
- In this plot, the value $p$ is (roughly) the contribution to $C_p$ from the variance of the estimated parameters, whereas the remaining $C_p - p$ is (roughly) the contribution from the bias of the model.
- The $C_p$ plot is a useful device for evaluating the $C_p$ values of a range of models (Mallows, 1973, 1995; Burman, 1996).
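A sketch (hypothetical data frame dat with predictors x1, x2, x3) of the hand computation for one candidate subset:

## Sketch: Mallows' Cp for a subset model, sigma^2 taken from the full model
full <- lm(y ~ x1 + x2 + x3, data = dat)
sub  <- lm(y ~ x1, data = dat)             # intercept plus one predictor
s2   <- summary(full)$sigma^2
SSRp <- sum(residuals(sub)^2)              # residual SS of the subset model
p    <- length(coef(sub))                  # parameters in the subset model
SSRp / s2 - (nrow(dat) - 2 * p)            # close to p for an adequate subset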
To summarize
Rules of thumb
There are a number of measures based on the entire estimated equation that can be used to compare two or more alternative specifications and select the best one based on that specific criterion.
- $R^2_{adj}$: higher is better.
- AIC: lower is better.
- BIC: lower is better.
- $C_p$: low values indicate the best models to consider.
This data set concerns air pollution in the United States. For 41 cities, the following variables were recorded:

SO2: Sulphur dioxide content of air in micrograms per cubic meter
Temp: Average annual temperature in °F
Manuf: Number of manufacturing enterprises employing 20 or more workers
Pop: Population size (1970 census) in thousands
Wind: Average annual wind speed in miles per hour
Precip: Average annual precipitation in inches
Days: Average number of days with precipitation per year

Air Pollution in the U.S. Cities. From Biometry, 2/E, R. R. Sokal and F. J. Rohlf. Copyright © 1969, 1981 by W. H. Freeman and Company.
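A sketch of how model1 was presumably fitted (the file name is hypothetical; variable names follow the output below, with Neg.temp the negated average temperature):

## Sketch: fitting the full model for the usair data
usair <- read.table("usair.dat", header = TRUE)   # hypothetical file name
usair$Neg.temp <- -usair$Temp
model1 <- lm(SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days,
             data = usair)
summary(model1)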
#R^2 for model1
#Residual standard error: 14.64 on 34 degrees of freedom
#Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
#F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07
> model1$coefficients
 (Intercept)     Neg.temp        Manuf          Pop         Wind       Precip         Days
111.72848064   1.26794109   0.06491817  -0.03927674  -3.18136579   0.51235896  -0.05205019
> anova(model1)
> xtable(anova(model1))
           Df  Sum Sq Mean Sq F value Pr(>F)
Neg.temp    1 4143.33 4143.33   19.34 0.0001
Manuf       1 7230.76 7230.76   33.75 0.0000
Pop         1 2125.16 2125.16    9.92 0.0034
Wind        1  447.90  447.90    2.09 0.1573
Precip      1  785.38  785.38    3.67 0.0640
Days        1   22.11   22.11    0.10 0.7500
Residuals  34 7283.27  214.21
[Output fragment: of the coefficients of the stepwise-selected model, only Precip 0.41947 survives extraction.]
> step(model1, direction="backward")
> step(model1, direction="forward")
> step(model1, direction="both")
#R^2 for the selected model by the stepwise procedure
#Residual standard error: 14.45 on 35 degrees of freedom
#Multiple R-squared: 0.6685, Adjusted R-squared: 0.6212
#F-statistic: 14.12 on 5 and 35 DF, p-value: 1.409e-07
[Best-subsets table (fragment): subset {1,2,3,4} with criterion value 0.592; subset {1,2,3,5,6} with 0.588.]
[Figure: $C_p$ plot for the usair data — $C_p$ (about 5.0 to 7.5) against $p$ (3 to 7); subset labels include 23, 236, 1235, 1236, 2346, 123456 and 12345, with 12345 lying lowest.]
Note
The $C_p$ plot shows that the minimum value of the $C_p$ index corresponds to the combination #12345; hence the variables selected for a parsimonious model are Neg.temp, Manuf, Pop, Wind and Precip.
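A sketch of how such a plot can be produced with the leaps package (assuming the usair data frame from the earlier sketch):

## Sketch: Cp plot with leaps()
library(leaps)
X <- as.matrix(usair[, c("Neg.temp", "Manuf", "Pop", "Wind", "Precip", "Days")])
out <- leaps(X, usair$SO2, method = "Cp")
plot(out$size, out$Cp, type = "n", xlab = "p", ylab = "Cp")
labs <- apply(out$which, 1, function(w) paste(which(w), collapse = ""))
text(out$size, out$Cp, labs)               # label each subset, e.g. "12345"
abline(0, 1)                               # the line Cp = p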
[Figure: normal Q-Q plot of the model1 residuals — sample quantiles (about -20 to 40) against theoretical quantiles.]
[Figure: histogram of model1$res — frequency of residuals between about -20 and 40.]
[Figure: index plot of the standardized residuals (rstand) of model1 for the 41 cities.]
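A sketch reproducing the diagnostic plots summarized above:

## Sketch: residual diagnostics for model1
qqnorm(residuals(model1)); qqline(residuals(model1))   # normality check
hist(model1$res, main = "Histogram of model1$res")     # residual histogram
rstand <- rstandard(model1)
plot(rstand, xlab = "Index")                           # index plot
abline(h = c(-2, 2), lty = 2)                          # rough outlier bounds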
[Figure: normal Q-Q plot and index plot of standardized residuals for the model selected by the stepwise procedure.]
A data frame containing estimates of the percentage of body fat determined by underwater weighing, together with various body circumference measurements, for 252 males aged 21 to 81 (Johnson, 1996). The response variable is $y = 1/\text{density}$, as in Burnham and Anderson (2002). The 13 potential predictors are age, weight, height, and 10 body circumference measurements. We select the best model using the AIC, Mallows' $C_p$ and adjusted $R^2$ criteria through stepwise regression.
density: Density from underwater weighing (gm/cm³)
age: Age (years)
weight: Weight (lbs)
height: Height (inches)
neck: Neck circumference (cm)
chest: Chest circumference (cm)
abdomen: Abdomen circumference (cm)
hip: Hip circumference (cm)
thigh: Thigh circumference (cm)
knee: Knee circumference (cm)
ankle: Ankle circumference (cm)
biceps: Biceps (extended) circumference (cm)
forearm: Forearm circumference (cm)
wrist: Wrist circumference (cm)
Full model
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.8748     0.0359   24.35   0.0000
age            0.0001     0.0001    1.56   0.1198
weight        -0.0002     0.0001   -1.91   0.0580
height        -0.0002     0.0002   -0.79   0.4314
neck          -0.0009     0.0005   -1.97   0.0504
chest         -0.0001     0.0002   -0.53   0.5993
abdomen        0.0020     0.0002   11.43   0.0000
hip           -0.0005     0.0003   -1.54   0.1261
thigh          0.0005     0.0003    1.71   0.0892
knee           0.0000     0.0005    0.01   0.9893
ankle          0.0006     0.0005    1.24   0.2144
biceps         0.0005     0.0004    1.41   0.1606
forearm        0.0009     0.0004    2.22   0.0276
wrist         -0.0036     0.0011   -3.23   0.0014

> model11$r.squared
[1] 0.7424321
> model11$adj.r.squared
[1] 0.7283632
[Figure: diagnostic plots for the full model — normal Q-Q plot of residuals(model1) (sample quantiles about -0.02 to 0.02) and index plot of the standardized residuals (rstand) over the 252 observations.]
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.8644     0.0244   35.49   0.0000
age            0.0001     0.0001    1.73   0.0848
weight        -0.0002     0.0001   -2.56   0.0111
neck          -0.0010     0.0005   -2.04   0.0421
abdomen        0.0020     0.0001   13.30   0.0000
hip           -0.0004     0.0003   -1.51   0.1320
thigh          0.0007     0.0003    2.56   0.0109
forearm        0.0011     0.0004    2.76   0.0063
wrist         -0.0033     0.0011   -3.09   0.0023
#Residual standard error: 0.008903 on 243 degrees of freedom # Multiple R-squared: 0.7377, Adjusted R-squared: 0.7291 #F-statistic: 85.44 on 8 and 243 DF, p-value: < 2.2e-16
[Figure: $C_p$ plot for the body fat data — $C_p$ against $p$ (about 6 to 14) for a large number of candidate subsets of the 13 predictors.]
Using the AIC criterion, the best model is
y ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist
with AIC = -2370.71, as we can find in Burnham and Anderson (2002). Using the adjusted $R^2$ criterion, the best model has 10 covariates:
y ~ age + weight + neck + abdomen + hip + thigh + ankle + biceps + forearm + wrist
Finally, the model selected by Mallows' $C_p$ criterion has 8 covariates:
y ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist
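A sketch of how the three criteria might be compared in R (hypothetical data frame fat with y = 1/density and the 13 predictors):

## Sketch: model selection under AIC, adjusted R^2 and Cp
fullfat <- lm(y ~ ., data = fat)
step(fullfat, direction = "both")              # AIC-based stepwise search
library(leaps)
ss <- summary(regsubsets(y ~ ., data = fat, nvmax = 13))
which.max(ss$adjr2)                            # best size by adjusted R^2
which.min(ss$cp)                               # best size by Mallows' Cp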
In this data set, CO2 emissions (metric tons per capita) measured in 116 countries are related to other variables like:
1. Energy use (kg of oil equivalent per capita)
2. Export of goods and services (% of GDP)
3. Gross Domestic Product (GDP) growth (annual %)
4. Population growth (annual %)
5. Annual deforestation (% of change)
6. Gross National Income (GNI), Atlas method (current US$)
Full model
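A minimal sketch of how the full model might be fitted (the data frame co2data and the variable names are hypothetical):

## Sketch: full model for the CO2 data
mod1 <- lm(CO2 ~ energy + exports + gdp.growth + pop.growth +
                 deforestation + GNI, data = co2data)
summary(mod1)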
[Figure: residual diagnostics for mod1 — normal Q-Q plot (sample quantiles about -15 to 5), plot of mod1$res, and index plot of the standardized residuals (rstand, about -8 to 2) over the 116 countries.]
> extractAIC(mod1)
[1]   7.0000 213.2029
> extractAIC(mod2)
[1]   4.0000 207.6525
The same model is selected using the adjusted $R^2$ and Mallows' $C_p$ criteria, as we can find in Ricci (2006).
[Best-subsets table (fragment): subset {1,2,3,6} 0.798; subset {1,2,3,5,6} 0.797; subset {1,2,4,5,6} 0.797.]
[Figure: $C_p$ plot for the CO2 data — $C_p$ against $p$ (2 to 7); subset labels include 126, 124, 1235, 1245, 12346 and 123456.]
Bibliography
AUSTIN, P. C. and TU, J. V. (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. Journal of Clinical Epidemiology 57, 1138-1146.
BROWN, P. J. (1994). Measurement, Regression and Calibration. Oxford: Clarendon.
BURNHAM, K. P. and ANDERSON, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.
BURMAN, P. (1996). Model fitting via testing. Statistica Sinica 6, 589-601.
CHATTERJEE, S. and HADI, A. S. (2006). Regression Analysis by Example, 4th edition. Hoboken, New Jersey: Wiley & Sons.
DER, G. and EVERITT, B. S. (2006). Statistical Analysis of Medical Data Using SAS. Boca Raton, Florida: Chapman & Hall/CRC.
DIZNEY, H. and GROMAN, L. (1967). Predictive validity and differential achievement in three MLA comparative foreign language tests. Educational and Psychological Measurement 27, 1127-1130.
EVERITT, B. S. (2005). An R and S-PLUS Companion to Multivariate Analysis. London: Springer-Verlag.
FINOS, L., BROMBIN, C. and SALMASO, L. (2009). Adjusting stepwise p-values in generalized linear models. Accepted for publication in Communications in Statistics: Theory and Methods.
FREEDMAN, L. S., PEE, D. and MIDTHUNE, D. N. (1992). The problem of underestimating the residual error variance in forward stepwise regression. The Statistician 41, 405-412.
GABRIEL, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453-467.
HARSHMAN, R. A. and LUNDY, M. E. (2006). A randomization method of obtaining valid p-values for model changes selected post hoc. Poster presented at the Seventy-first Annual Meeting of the Psychometric Society, Montreal, Canada, June 2006. Available at http://publish.uwo.ca/~harshman/imps2006.pdf.
JOHNSON, R. W. (1996). Fitting percentage of body fat to simple body measurements. Journal of Statistics Education 4 (e-journal); see http://www.amstat.org/publications/jse/toc.html.
JOLLIFFE, I. T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C 31, 300-303.
JOLLIFFE, I. T. (1986). Principal Component Analysis. New York: Springer.
MALLOWS, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675.
MALLOWS, C. L. (1995). More comments on Cp. Technometrics 37, 362-372.
MORRISON, D. F. (1967). Multivariate Statistical Methods. New York: McGraw-Hill.
RICCI, V. (2006). Principali tecniche di regressione con R [Main regression techniques with R]. See cran.r-project.org/doc/contrib/Ricci-regression-it.pdf.