
1. LINEAR REGRESSION UNDER IDEAL CONDITIONS

[1] What is a “Regression Model”?

Example:
• Suppose you are interested in the average relationship between income (y)
and education (x).
• For the people with 12 years of schooling (x =12), what is the average
income (E(y|x=12))?
• For the people with x years of schooling, what is the average income
(E(y|x))?
• Regression model:
y = E(y|x) + ε,
where ε is a disturbance (error) term with E(ε|x) = 0.
• Regression analysis aims to estimate E(y|x).

Linear Regressions under Ideal Conditions-1


Digression to Probability Theory

(1) Bivariate Distributions


• Consider two random variables (RV), X and Y with a joint probability
density function (pdf): f(x, y) = Pr(X=x, Y=y).

• Marginal (unconditional) pdf:


fx(x) = Σyf(x,y) = Pr(X = x) regardless of Y;
fy(y) = Σx f(x,y) = Pr(Y = y) regardless of X.

• Conditional pdf:
f(y|x) = Pr(Y = y, given X = x) = f(x,y)/fx(x).

• Stochastic independence:
• X and Y are stochastically independent iff f(x,y) = fx(x)fy(y), for all x,y.
• Under this condition, f(y|x) = f(x,y)/fx(x) = [fx(x)fy(y)]/fx(x) = fy(y).

Linear Regressions under Ideal Conditions-2


EX:
• Toss two coins, A and B.
• X = 1 if head from A; = 0 if tail from A.
Y = 1 if head from B; = 0 if tail from B.
f(x,y) = 1/4 for any x,y = 0, 1. (4 possible cases)

• Marginal pdf of x:
fx(0) = Pr(X=0) regardless of y = f(0,1) + f(0,0) = 1/4 + 1/4 = 1/2.
fx(1) = Pr(X=1) regardless of y = f(1,1) + f(1,0) = 1/4 + 1/4 = 1/2.

fx(x) = 1/2, for x = 0, 1.


Similarly, fy(y) = 1/2, for y = 0, 1.

• Conditional pdf:
f(y = 1| x = 1) = f(1,1)/fx(1) = (1/4)/(1/2) = 1/2;
f(y = 0| x = 1) = f(1,0)/fx(1) = 1/2.
→ f(y| x=1) = 1/2, for y = 0, 1.

• Find f(y|x=0) by yourself.

• Stochastic independence:
fx(x) = fy(y) = 1/2; fx(x)fy(y) = 1/4 = f(x,y), for any x and y.
Thus, x and y are stochastically independent.

Linear Regressions under Ideal Conditions-3


Expectation:
E[g(x,y)] = ΣxΣy g(x,y)f(x,y) [or ∫∫Ω g(x,y)f(x,y) dx dy].

Means:
μx = E(x) = ΣxΣyxf(x,y) = Σxxfx(x).
μy = E(y) = ΣxΣyyf(x,y) = Σyyfy(y).

Variances:
σx² = E[(x − μx)²] = ΣxΣy(x − μx)²f(x,y) = Σx(x − μx)²fx(x)
    = E(x²) − [E(x)]² = Σx x²fx(x) − μx².

σy² = ΣxΣy(y − μy)²f(x,y) = Σy(y − μy)²fy(y)
    = E(y²) − [E(y)]² = Σy y²fy(y) − μy².

Covariance:
σxy = cov(x,y) = E[(x − μx)(y − μy)] = ΣxΣy(x − μx)(y − μy)f(x,y)
    = E(xy) − μxμy = ΣxΣy xy f(x,y) − μxμy.

Note: σxy > 0 → positively linearly related;


σxy < 0 → negatively linearly related;
σxy = 0 → no linear relation.

Linear Regressions under Ideal Conditions-4


EX: x, y = 1, 0, with f(x,y) = 1/4.
E(xy) = ΣxΣyxyf(x,y)
= 0×0×(1/4) + 0×1×(1/4)+ 1×0×(1/4) + 1×1×(1/4) = 1/4.
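Note (a quick worked check): in this two-coin example, μx = μy = 1/2 and E(xy) = 1/4, so σxy = E(xy) − μxμy = 1/4 − 1/4 = 0; the two coin outcomes are uncorrelated, as expected since they are independent.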

Correlation Coefficient:
The correlation coefficient between x and y is defined by:
ρxy = σxy/(σxσy).

Theorem:
-1 ≤ ρxy ≤ 1.

Note: ρxy → 1: highly positively linearly related;


ρxy → -1: highly negatively linearly related;
ρxy → 0: no linear relation.

Theorem:
If X and Y are stochastically independent, then σxy = 0. But not vice versa.
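EX (an added standard counterexample for “not vice versa”): let x take the values −1, 0, 1 with probability 1/3 each and let y = x². Then E(x) = 0 and σxy = E(xy) − E(x)E(y) = E(x³) − 0 = 0, yet y is completely determined by x, so x and y are not independent.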

Linear Regressions under Ideal Conditions-5


Conditioning in a Bivariate Distribution:
• X,Y: RVs with f(x,y). (e.g., Y = income, X = education)
• Population of billions and billions: {(x(1),y(1)), .... (x(b),y(b))}.
• Average of y(j) = E(y).
• For the people earning a specific education level x, what is the average of y?

Conditional Mean and Variance:


• E(y|x) = E(y|X = x) = Σy y f(y|x).

• var(y|x) = E[(y − E(y|x))² | x] = Σy (y − E(y|x))² f(y|x).

Regression model:
• Let ε = y - E(y|x) (deviation from conditional mean).
• y = E(y|x) + y - E(y| x) = E(y|x) + ε (regression model).
• E(y|x) = explained part of y by x.
ε = unexplained part of y (called disturbance term).
E(ε|x) = 0 and var(ε| x) = var(y|x).

Note:
• E(y|x) may vary with x, i.e., E(y|x) is a function of x.
• Thus, we can define Ex[E(y|x)], where Ex(•) is the expectation over x =
Σx•fx(x) or ∫Ω•fx(x)dx.

Linear Regressions under Ideal Conditions-6


Theorem: (Law of Iterative Expectations)
E(y) [unconditional mean] = Ex[E(y|x)] .
Proof:
E(y) = ΣxΣy y f(x,y) = ΣxΣy y f(y|x)fx(x) = Σx[Σy y f(y|x)]fx(x) = Σx E(y|x)fx(x) = Ex[E(y|x)].

Note:
For discrete RV, X with x = x1, ...,
E(y) = ΣxE(y|x)fx(x) = E(y|x=x1)fx(x1) + E(y|x=x2)fx(x2) + ... .

Implication:
If you know the conditional mean of y and the marginal distribution of x, you
can also find the unconditional mean of y.

EX 1: Suppose E(y|x) = 0, for all x. E(y) = Ex[E(y|x)] = Ex(0) = 0.


EX 2: E(y|x) = β1 + β2x. → E(y) = Ex(E(y|x)) = Ex(β1+ β2x) = β1+β2E(x).

Question: When can E(y|x) be linear? Answered later.

Definition:
We say that y is homoskedastic if var(y|x) is constant.

EX: y = E(y|x) + ε with var(ε|x) = σ2 for all x(constant).


→ var(y|x) = var[E(y|x)+ε|x] = var(ε|x) = σ2, for all x.
→ y is homoskedastic.

Linear Regressions under Ideal Conditions-7


Graphical Interpretation of Conditional Means and Variances
• Consider the following population:

[Figure: scatter of the population with the regression line E(y|x) = β1 + β2x; two groups of observations at x = x1 and x = x2.]

• E(y|x=x1) measures the average value of y for the group of x = x1.


• var(y|x=x1) measures the dispersion of y given x = x1.
• If var(y|x=x1) = var(y|x=x2) = ..., we say that y is homoskedastic.
• Law of iterative expectation:
E(y) = ΣxE(y|x)fx(x) = E(y|x=x1)Pr(x=x1) + E(y|x=x2)Pr(x=x2) + ... .

Question: Is it worth finding E(y|x)?

Linear Regressions under Ideal Conditions-8


Theorem: (Decomposition of Variance)
var(y) = varx[E(y|x)] + Ex[var(y|x)].

Note:
• varx[E(y|x)] ≤ var(y), since Ex[var(y|x)] ≥ 0.
• var(y) = E[(y-E(y))2]
= total variation of y.
varx[E(y|x)] = Ex[(E(y|x)-E(y))2]
= a part of variation in y due to variation in E(y|x)
= variation in y explained by E(y|x).

Coefficient of Determination:
R2 = varx[E(y|x)]/var(y).
→ Measure of worthiness of knowing E(y|x).
→ 0 ≤ R2 ≤ 1.
Note:
• R2 = variation in y explained by E(y|x)/total variation of y.
• Wish R2 close to 1.

Linear Regressions under Ideal Conditions-9


Summarizing Exercise:
• A population with X (income, in $10,000) and Y (consumption, in $10,000).
• Joint pdf:
Y\X 4 8
1 1/2 0
2 1/4 1/4

• Graph for this population:


[Figure: the population points (x = 4, 8; y = 1, 2) with the regression line through the conditional means E(y|x=4) = 4/3 and E(y|x=8) = 2.]

• Marginal pdf:
Y\X 4 8 fy(y)
1 1/2 0 1/2
2 1/4 1/4 1/2
fx(x) 3/4 1/4

Linear Regressions under Ideal Conditions-10


• Means of X and Y:
• E(x) = µx = Σxxfx(x) = 4×fx(4) + 8×fx(8) = 4×(3/4) + 8×(1/4) = 5.
• E(y) = µy = Σyyfy(y) = 1.5.
• Variances of X and Y:
• var(x) = σx2 = Σx(x-µx)2fx(x)
= (4-5)2fx(4) + (8-5)2fx(8) = 1×(3/4) + 9×(1/4) = 3.
• var(y) = σy2 = 1/4.
• Covariance between X and Y:
• σxy = E[(x-µx)(y-µy)] = E(xy) - µxµy = ΣxΣyxyf(x,y) - µxµy
= 4×1×f(4,1)+4×2×f(4,2)+8×1×f(8,1)+8×2×f(8,2)-5×1.5 = 0.5.
• ρxy = σxy/(σxσy) = 0.5/(√3 × √(1/4)) ≅ 0.58.

• Conditional Probabilities
Y\X 4 8 fy(y)
1 1/2 0 1/2
2 1/4 1/4 1/2
fx(x) 3/4 1/4
• f(y|x):
Y\X 4 8
1 2/3 0
2 1/3 1

Linear Regressions under Ideal Conditions-11


• Conditional mean:
• E(y|x=4) = Σyyf(y|x=4) = 1×f(y=1|x=4) + 2×f(y=2|x=4)
= 1×(2/3) + 2×(1/3) = 4/3.
• E(y|x=8) = 2.
[Figure: the conditional means E(y|x=4) = 4/3 and E(y|x=8) = 2 plotted against x = 4, 8.]

• Conditional variance of Y:
• var(y|x=4) = Σy[y-E(y|x=4)]2f(y|x=4) = 6/27.
• var(y|x=8) = 0.

• Law of iterative expectation:


• Ex[E(y|x)] = ΣxE(y|x)fx(x) = E(y|x=4)fx(4) + E(y|x=8)fx(8)
= (4/3)×(3/4) + 2×(1/4) = 1.5 = E(y)!!!
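A further check (added arithmetic): the decomposition-of-variance theorem also holds for this population:
varx[E(y|x)] = (4/3 − 1.5)²×(3/4) + (2 − 1.5)²×(1/4) = 1/48 + 1/16 = 1/12;
Ex[var(y|x)] = (6/27)×(3/4) + 0×(1/4) = 1/6;
→ varx[E(y|x)] + Ex[var(y|x)] = 1/12 + 2/12 = 1/4 = var(y),
and R² = varx[E(y|x)]/var(y) = (1/12)/(1/4) = 1/3 for this population.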

Linear Regressions under Ideal Conditions-12


(2) Bivariate Normal Distribution
Definition:

(x, y)′ ~ N( (μx, μy)′ , [ σx²  ρxyσxσy ; ρxyσxσy  σy² ] ).

f(x,y) = [1/(2πσxσy√(1 − ρxy²))]
         × exp( −[1/(2(1 − ρxy²))]{ (x − μx)²/σx² − 2ρxy(x − μx)(y − μy)/(σxσy) + (y − μy)²/σy² } ),
where x, y ∈ ℜ.

Facts:
• fx(x) ~ N(μx, σx²) and fy(y) ~ N(μy, σy²).

• E(y|x) = β1 + β2x and var(y|x) is constant (see Greene).


→ E(y|x) is linear in x and y is homoskedastic.
• If σxy = 0 (or ρxy = 0), x and y are stochastically independent.
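For reference, the standard expressions behind the second fact (see Greene for the derivation) are:
E(y|x) = μy + ρxy(σy/σx)(x − μx), so that β2 = σxy/σx² and β1 = μy − β2μx;
var(y|x) = σy²(1 − ρxy²), which does not depend on x.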

Linear Regressions under Ideal Conditions-13


(3) Multivariate Distributions

Definition: (Mean vector and covariance matrix)


X1, ... , Xn : random variables.
Let x = [x1, .... , xn]′ (n×1 vector). Then,
E(x) = ( E(x1), E(x2), ..., E(xn) )′ ;
Cov(x) = [ var(x1)  cov(x1,x2)  ...  cov(x1,xn) ; cov(x2,x1)  var(x2)  ...  cov(x2,xn) ; ... ; cov(xn,x1)  cov(xn,x2)  ...  var(xn) ].

→ Cov(x) is symmetric.
EX: If x is scalar, Cov(x) = E[(x-µ)2] = var(x).
EX: x = [x1,x2]′ ; E(x) = µ = [µ1, µ2]′
x - µ = [x1-µ1, x2-µ2]′
→ (x-μ)(x-μ)′ = ( x1 − μ1 ; x2 − μ2 )( x1 − μ1  x2 − μ2 )
              = [ (x1 − μ1)²  (x1 − μ1)(x2 − μ2) ; (x2 − μ2)(x1 − μ1)  (x2 − μ2)² ].

→ E[(x-µ)(x-µ)′] = Cov(x).

Theorem: Cov(x) = E[(x-µ)(x-µ)′] = E(xx′) - µµ′.


Proof: See Greene.

Note: In Greene, Cov(x) is denoted by Var(x).

Linear Regressions under Ideal Conditions-14


Definition: Covariance Matrix between Two Random Vectors
X = ( X 1 , X 2 ,..., X n )′ and Y = (Y1 , Y2 ,..., Ym )′ are random vectors. Then,

Cov(x, y) = [ cov(x1,y1)  cov(x1,y2)  ...  cov(x1,ym) ; cov(x2,y1)  cov(x2,y2)  ...  cov(x2,ym) ; ... ; cov(xn,y1)  cov(xn,y2)  ...  cov(xn,ym) ].

Definition: (Expectation of random matrix)


Suppose that the Bij are RVs and let B = [Bij]p×q. Then E(B) = [E(Bij)]p×q; that is, the expectation of a random matrix is taken element by element.

(4) Multivariate Normal distribution


Definition:
X = [X1, ... , Xn]′ is a normal vector, i.e., each of the xj's is normal.
Let E(x) = µ = [µ1, ... , µn]′ and Cov(x) = Σ = [Σij]n×n. Then,
x ~ N(µ,Σ).

Pdf of x:
f(x) = f(x1, ... , xn) = (2π)-n/2|Σ|-1/2exp[-(1/2)(x-µ)′Σ-1(x-µ)],
where |Σ| = det(Σ).

Linear Regressions under Ideal Conditions-15


EX:
Let X be a single RV with N(µx,σx2). Then,
f(x) = (2π)-1/2(σx2)-1/2exp[-(1/2)(x-µx)(σx2)-1(x-µx)]
     = [1/(√(2π)σx)] exp[ −(x − μx)²/(2σx²) ].

EX:
Assume that all the Xi (i = 1, …, n) are iid with N ( μ x , σ x2 ) . Then,

(1) µ = E(x) = [µx, ... , µx]′ ;


(2) Σ = Cov(x) = diag (σ x2 , σ x2 ,..., σ x2 ) = σ x2 I n .

Using (1) and (2), we can show that f(x) = f(x1, ... , xn) = ∏i=1,...,n f(xi),
where f(xi) = [1/(√(2π)σx)] exp[ −(xi − μx)²/(2σx²) ].

Theorem: Conditional normal distribution


[y, x2, ... , xk]′ is a normal vector. Then,
E(y|x2,...,xk) = β1 + β2x2 + ... + βkxk = x*′β and var(y|x*) = σ²,
where x* = (1, x2, ... , xk)′ and β = (β1, ... , βk)′. That is, the regression of y on
x2, ... , xk is linear and homoskedastic.
Proof: See Greene.

Linear Regressions under Ideal Conditions-16


(5) Properties of the Covariance Matrix of a Random Vector
Definition:
Let X = [X1, ... , Xn]′ be a random vector and let c = [c1, ... , cn]′ be a n×1 vector
of fixed constants. Then,
c′x = x′c = c1x1 + ... + cnxn = Σjcjxj (scalar).

Theorem:
(1) E(c′x) = c′E(x);
(2) var(c′x) = c′Cov(x)c.
Proof:
(1) E(c′x) = E(Σjcjxj) = E(c1x1 + ... + cnxn)
= c1E(x1) + ... + cnE(xn) = ΣjcjE(xj) = c′E(x).
(2) var(c′x) = E[(c′x - E(c′x))2] = E[{c′x - c′E(x)}2]
= E[{c′(x-E(x))}2] = E[{c′(x-E(x))}{c′(x-E(x))}]
= E[{c′(x-E(x))}{(x-E(x))′c}]
= E[c′(x-E(x))(x-E(x))′c] = c′E[(x-E(x))(x-E(x))′]c = c′Cov(x)c.

Remark:
(2) implies that Cov(x) is always positive semidefinite.
→ c′Cov(x)c ≥ 0, for any nonzero vector c.
Proof:
For any nonzero vector c, c′Cov(x)c = var(c′x) ≥ 0.

Linear Regressions under Ideal Conditions-17


Remark:
• Cov(x) is symmetric and positive semidefinite (what does it mean?).
• Usually, Cov(x) is positive definite, that is, c′Cov(x)c > 0, for any nonzero
vector c.

Definition:
Let B = [bij]n x n be a symmetric matrix, and c = [c1, ... , cn]′. Then, a scalar c′Bc
is called a quadratic form of B.

Definition:
• If c′Bc > (<) 0 for any nonzero vector c, B is called positive (negative)
definite.
• If c′Bc ≥ (≤ ) 0 for any nonzero c, B is called positive (negative)
semidefinite.

Linear Regressions under Ideal Conditions-18


Theorem:
Let B be a symmetric and square matrix given by:
B = [ b11 b12 ... b1n ; b21 b22 ... b2n ; ... ; bn1 bn2 ... bnn ].
Define the principal minors by:
|B1| = b11 ; |B2| = det[ b11 b12 ; b21 b22 ] ; |B3| = det[ b11 b12 b13 ; b21 b22 b23 ; b31 b32 b33 ] ; ... .

B is positive definite iff |B1|, |B2|, ..., |Bn| are all positive. B is negative definite iff |B1| < 0, |B2| > 0, |B3| < 0, ... .

EX:
Show that B = [ 2 1 ; 1 2 ] is positive definite.
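(A quick check using the theorem above: |B1| = 2 > 0 and |B2| = 2×2 − 1×1 = 3 > 0, so B is positive definite.)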
End of Digression

Linear Regressions under Ideal Conditions-19


[2] Classical Linear Regression (CLR) Model

Example:
• Wish to find important determinants of individuals’ earnings and estimate
the size of the effect of each determinant.
• Data: (WAGE2.WF1 or WAGE2.TXT)

# of observations (T): 935


1. wage monthly earnings
2. hours average weekly hours
3. IQ IQ score
4. KWW knowledge of world work score
5. educ years of education
6. exper years of work experience
7. tenure years with current employer
8. age age in years
9. married =1 if married
10. black =1 if black
11. south =1 if live in south
12. urban =1 if live in SMSA
13. sibs number of siblings
14. brthord birth order
15. meduc mother's education
16. feduc father's education
17. lwage natural log of wage

What variables would be important determinants of log(wage)?


From now on, we use both “log” and “ln” to refer to natural log.

Linear Regressions under Ideal Conditions-20


Mincerian Wage Equation:
• Set y (dependent variable) = log(wage).
• Set x• (vector of independent variables) = [1, educ, exper, exper²]′.
• xi = vector of independent variables (or explanatory variables, or
regressors).
• Use subscript “o” for “true value”.
• Assume E(y|x•) = β1,o + β2,o educ + β3,o exper + β4,o exper².

• y = E(y|x•) + ε = β1,o + β2,o educ + β3,o exper + β4,o exper² + ε.

• y = x•′βo + ε, where βo = (β1,o, β2,o, β3,o, β4,o)′.

• Here,
• β2,o × 100 = %Δ in wage by one more year of education.
• (β3,o+2β4,oexper) × 100 = %Δ by one more year of exper.
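(These expressions follow from differentiating the assumed conditional mean: since y = log(wage), ∂E(y|x•)/∂educ = β2,o and ∂E(y|x•)/∂exper = β3,o + 2β4,o exper, and a change of Δ in log(wage) corresponds approximately to a 100×Δ percent change in wage.)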

• Issues:
• How to estimate βo’s?
• Estimated β’s would not be equal to the true values of β (βo). How close
would our estimates be to the true values?

Linear Regressions under Ideal Conditions-21


Basic Assumptions for CLR
(I call these assumptions Strong Ideal Conditions (SIC).)

To understand SIC better, imagine a population of T-groups with the following
properties.
• For each group t = 1, 2,..., T, yt denotes the dependent variable and xt• =
(xt1,xt2,...,xtk)′ denotes the vector of regressors.
• The T-groups are assumed to be independent.
• Your sample consists of T observations, each of which comes from each
different group.

As you may find, the above assumptions are unrealistic. But under these
assumptions, the statistical properties of OLS can be discussed more
intuitively. The statistical properties of OLS discussed later still hold even
under more realistic assumptions.

Notation:
• E ( xt1 ) is the group population mean of x1 for group t, while E(x1) is the
population mean of x1 for the whole population.

Linear Regressions under Ideal Conditions-22


We now discuss each of SIC in detail:

(SIC.1) The conditional mean of yt (dependent variable) given xt• (vector of


explanatory variables) is linear:
yt = E ( yt | xt • ) + ε t = xt′• β o + ε t = β1,o xt1 + β 2,o xt 2 + ... + β k ,o xtk + ε t , (1)

where xt • = ( xt1 , xt 2 ,..., xtk )′ and β o = ( β1,o ,..., β k ,o )′ .

Comment:
• Usually, xt1 = 1 for all t. That is, β1 is an overall intercept term.
• E (ε t | xt i ) = 0 .
• E ( xt iε t ) = E xt i [ E ( xt iε t | xt i )] = E xt i [ xt i E (ε t | xt i )] = E xt i (0) = 0 .

(SIC.2) β o = ( β1,o ,..., β k ,o )′ is unique.

Comment:
• No other β* such that E ( yt | xt i ) = xt′i βo = xt′i β* for all t.
• The uniqueness assumption of βo is called “identification” condition.
• Rules out perfect multicollinearity (perfect linear relationship among the
regressors):
• Suppose β = (β1, β2, β3)′ and xt3 = xt1 + xt2 for all t.
• Set β1,* = β1,o + a; β2,* = β2,o + a; β3,* = β3,o − a for an arbitrary a ∈ ℜ.
• Then, xt•′β* = xt1β1,* + xt2β2,* + xt3β3,* = xt1β1,o + xt2β2,o + xt3β3,o + a(xt1 + xt2 − xt3) = xt•′βo.
• (SIC.2) rules out this possibility.

Linear Regressions under Ideal Conditions-23


(SIC.3) The variables, yt, xt1, … , xtk, have finite moments up to fourth order.

Comment:
• Moments such as E(yt²xt2²) and E(xt3⁴), etc., exist.
• Rules out extreme outliers.
• We need this assumption for consistency and asymptotic normality of the
OLS estimator.
• SIC implies the Weak Ideal Conditions (WIC) that will be discussed
later.
• Violated if xt2 = t (a time trend) or xt2 = x(t−1)2 + vt2 (a random walk).

(SIC.4) A random sample {(yt, xt1, xt2, ..., xtk)′}t=1,...,T is available and T ≥ k.

Comment:
• (yt, xt1, xt2, ..., xtk)′ are iid (independently and identically distributed):
• The T groups are iid with E( (yt, xt•′)′ ) = E( (y, x•′)′ ) and Cov( (yt, xt•′)′ ) = Cov( (y, x•′)′ ).
• One observation is drawn from each of the T groups.
• Could be appropriate for cross-section data.
• Violated if time series data are used. That is why we add “strong” for the
name of the conditions.
• If T < k , there are infinitely many β* such that xt′i βo = xt′i β* for all t. For
this case, the sample cannot identify β.
• Implies no autocorrelation: cov(ε t , ε s ) = 0 for all t ≠ s .

Linear Regressions under Ideal Conditions-24


(SIC.5) var(ε t | xt • ) = σ o2 , for all xt• (Homoskedasticity Assumption).
Comment:
• Often violated when cross-section data are used.
• Consider the two different populations:
o Population 1 (homoskedastic population):
  homy = 1 + 2x2 + ε, where var(ε|x2) = 9.
o Population 2 (heteroskedastic population):
  hety = 1 + 2x2 + ε, where var(ε|x2) = x2².
o x2 = 1, 2, 3, 4, or 5, for both populations.

[Figures: scatter plots of HOMY vs. X2 and HETY vs. X2 (x2 = 1, ..., 5); the spread of homy is the same at every x2, while the spread of hety increases with x2.]

(SIC.6) The errors εt are normally distributed conditional on xt • .

(SIC.7) xt1 = 1, for all t = 1, ... , T.

Comment:
• Optional. Not critical.
• This condition implies that β1,o is an overall intercept term.
• Need this assumption for convenient interpretation of empirical R2.

Linear Regressions under Ideal Conditions-25


• Link between βo and covariances:
• Consider a simple regression model, yt = β1,o + β 2,o xt 2 + ε t = xt′i β o + ε t .

• Assume (SIC.1) – (SIC.4) and (SIC.7).


• E(xt•yt) = E(xt•(xt•′βo + εt)) = E(xt•xt•′)βo

→ E(x•y) = E(x•x•′)βo

→ βo = [E(x•x•′)]−1E(x•y),

where,
E(x•x•′) = E[ (1, x2)′(1, x2) ] = [ 1  E(x2) ; E(x2)  E(x2²) ] ;
E(x•y) = E[ (1, x2)′y ] = ( E(y), E(x2y) )′.

→ β2,o = cov(x2, y)/var(x2) ; β1,o = E(y) − β2,oE(x2).
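As an illustration (added arithmetic, using the income-consumption population from the summarizing exercise, where E(x) = 5, E(y) = 1.5, var(x) = 3 and cov(x,y) = 0.5):
β2,o = 0.5/3 = 1/6 and β1,o = 1.5 − (1/6)×5 = 2/3,
so E(y|x) = 2/3 + x/6, which gives E(y|x=4) = 4/3 and E(y|x=8) = 2, exactly the conditional means computed earlier.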

Linear Regressions under Ideal Conditions-26


Theorem:
Let yt = xt•′βo + εt ≡ β1,o + wt•′βw,o + εt, where xt•′ = (1, wt•′), βo = (β1,o, βw,o′)′,
wt• = (xt2, ..., xtk)′ and βw,o = (β2,o, ..., βk,o)′. Suppose that this model satisfies
(SIC.1)-(SIC.4) and (SIC.7). Then,
βo = ( E(xt•xt•′) )−1E(xt•yt) = ( E(x•x•′) )−1E(x•y).
And,
βw,o = ( E(w•w•′) − E(w•)E(w•′) )−1( E(w•y) − E(w•)E(y) ) = ( Cov(w•) )−1Cov(w•, y).

Hint for proof (partitioned inverse; the 0's below are zero matrices):
[ A11 A12 ; A21 A22 ]−1 = [ A11−1 0 ; 0 0 ] + ( A11−1A12 ; −I )(A22 − A21A11−1A12)−1( A21A11−1  −I )
                        = [ 0 0 ; 0 A22−1 ] + ( −I ; A22−1A21 )(A11 − A12A22−1A21)−1( −I  A12A22−1 ).

Linear Regressions under Ideal Conditions-27


Implications:
• The slopes, β2,o, ..., βk,o, measure the correlations between the regressors and
the dependent variable.
• β 2,o ≠ 0 means non-zero correlation between yt and xt2. It does not mean

that xt2 causes yt. β 2,o ≠ 0 could mean that yt causes xt2.

• SIC do not talk about causality. SIC may hold even if yt determines xt • :
It can be the case that E ( edut | waget ) = β1,o + β 2,o waget .

• But, the regression model (1) is not meaningful if the x variables are not
causal variables. We would like to know by how much the hourly wage rate
increases with one more year of education. We would not be interested in how
many more years of education an individual could have obtained if his/her
current wage rate increased now by $1!

Linear Regressions under Ideal Conditions-28


[3] Ordinary Least Squares (OLS)

Definition:

For a given sample {(yt, xt1, ..., xtk)′}t=1,...,T without perfect multicollinearity
among the regressors xt1, ..., xtk, the OLS estimator βˆ = (βˆ1, βˆ2, ..., βˆk)′
minimizes:
ST(β) ≡ Σt(yt − xt1β1 − ... − xtkβk)² = Σt(yt − xt•′β)² = (y − Xβ)′(y − Xβ),

where Σt = Σt=1,...,T, and
y = (y1, y2, ..., yT)′ ; X = (x1•, x2•, ..., xT•)′, i.e., the t'th row of X is xt•′ = (xt1, xt2, ..., xtk).

Comment on the assumption of no perfect multicollinearity.


• rank ( X ′X ) = rank ( X ) = k . So, X ′X = Σt xt • xt′• is invertible.

• If perfect multicollinearity exists, rank ( X ′X ) = rank ( X ) < k . So,


X ′X = Σt xt i xt′i is not invertible.
• If T < k , rank ( X ′X ) = rank ( X ) ≤ min(T , k ) < k . So, X ′X = Σt xt • xt′• is not
invertible. T < k is a case of perfect multicollinearity.

Linear Regressions under Ideal Conditions-29


EX: Simple Regression Model
• Wish to estimate yt = β1,o xt1 + β 2,o xt 2 + ε t :

ST ( β1 , β 2 ) = Σt ( yt − xt1β1 − xt 2 β 2 ) 2 .
• The first order condition for minimization:
∂ST / ∂β1 = Σt2(yt-xt1β1-xt2β2)(-xt1) = 0 → Σt(xt1yt - xt12β1-xt1xt2β2) = 0
∂ST / ∂β 2 = Σt2(yt-xt1β1-xt2β2)(-xt2) = 0 → Σt(xt2yt - xt1xt2β1-xt22β2) = 0
→ Σtxt1yt = (Σtxt12)β1 + (Σtxt1xt2)β2
Σtxt2yt = (Σtxt1xt2)β1 + (Σtxt22)β2
→ ( Σtxt1yt , Σtxt2yt )′ = [ Σtxt1²  Σtxt1xt2 ; Σtxt2xt1  Σtxt2² ] ( βˆ1 , βˆ2 )′ .

→ But, this equation is equivalent to X ′y = X ′X βˆ .

→ βˆ = (X′X)-1X′y.

Derivation of the OLS estimator for general cases:


• ST(β) = (y′ - β′X′)(y - Xβ) = y′y - β′X′y - y′Xβ + β′X′Xβ .
• Since y′Xβ is a scalar, y′Xβ = (y′Xβ)′ = β′X′y .
• Thus, ST(β) = y′y - 2β′X′y + β′X′Xβ .
• FOC for minimization of ST(β): ∂ST(β)/∂β ≡ ( ∂ST(β)/∂β1, ∂ST(β)/∂β2, ..., ∂ST(β)/∂βk )′ = 0k×1.

Linear Regressions under Ideal Conditions-30


But,
∂(β′X′y)/∂β = X′y;
∂(β′X′Xβ)/∂β = 2X′Xβ.
[In fact, for any k×1 vector d, ∂(β′d)/∂β = d; and, for any k×k symmetric
matrix A, ∂(β′Aβ)/∂β = 2Aβ.]
Thus, FOC implies

∂ST(β)/∂β = −2X′y + 2X′Xβ = 0k×1

→ X′y − X′Xβ = 0k×1 (2)

→ Solving (2), we have


βˆ = ( X ′X ) −1 X ′y .

SOC (second order condition) for minimization:

∂²ST(β)/∂β∂β′ = [ ∂²ST(β)/∂βi∂βj ]k×k = 2X′X,
which is a positive definite matrix for any value of β. That is, the function
ST(β) is globally convex. This indicates that βˆ indeed minimizes ST(β).
[Here, we use the fact that ∂(β′Aβ)/∂β∂β′ = 2A for any symmetric matrix A.]

Linear Regressions under Ideal Conditions-31


Theorem: βˆ = ( X ′X ) −1 X ′y .

Definition:
• t'th residual: et = yt − xt′i βˆ (can be viewed as an estimate of εt).
• Vector of residuals: e = ( e1 ,..., eT )′ = y − X βˆ .

Theorem: X ′e = 0k ×1
Proof:
From the proof of the previous theorem,
X ′y − X ′X βˆ = 0k ×1 → X ′( y − X βˆ ) = 0k ×1 → X ′e = 0 .

Corollary:
If (SIC.7) holds ( xt1 = 1 for all t: β1 is the intercept), Σtet = 0.
Proof:
X′e is the k×1 vector whose j'th element is Σtxtjet; that is,
X′e = ( Σtxt1et, Σtxt2et, ..., Σtxtket )′ = ( 0, 0, ..., 0 )′ (k×1).

→ Σtxt1et = 0 → Σtet = 0 (by SIC.7).

Linear Regressions under Ideal Conditions-32


Question:
Consider the following two models:
(A) yt = xt1β1 + xt2β2 + xt3β3 + εt;
(B) yt = xt1β1 + xt2β2 + εt.
Are the OLS estimates of β1 and β2 from (A) the same as those from (B)?

Digression to Matrix Algebra


Definition: Let A be a T×p matrix.
P(A) = A(A′A)-1A′ (T×T matrix called “projection matrix”);
M(A) = IT - P(A) = IT - A(A′A)-1A′ (T×T matrix called “residual maker”).

Facts:
1) P(A) and M(A) are both symmetric and idempotent:
P(A)′ = P(A), M(A)′ = M(A), P(A)P(A) = P(A), M(A)M(A) = M(A).
2) P(A) and M(A) are psd (positive semi-definite).
3) P(A)M(A) = 0T×T (orthogonal).
4) P(A)A = [A(A′A)-1A′]A = A.
5) M(A)A = [IT-P(A)]A = A - P(A)A = A - A = 0T×T.
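For instance, 1) and 3) can be verified directly: P(A)P(A) = A(A′A)-1A′A(A′A)-1A′ = A(A′A)-1A′ = P(A), and P(A)M(A) = P(A)[IT − P(A)] = P(A) − P(A)P(A) = P(A) − P(A) = 0T×T.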
End of Digression

Theorem: e = M(X)y.
<Proof> e = y − Xβˆ = ITy − X(X′X)-1X′y = [IT − X(X′X)-1X′]y = M(X)y.

Linear Regressions under Ideal Conditions-33


Frisch-Waugh Theorem:
Partition X into [XA, XB] and β = (βA′, βB′)′. Let βˆA be the OLS estimate of βA
from a regression of the model y = Xβ + ε = XAβA + XBβB + ε. Then,
βˆA = [XA′M(XB)XA]−1XA′M(XB)y.
That is, βˆA is obtained by regressing M(XB)y on M(XB)XA.

Comment:
βˆA is different from the OLS estimate of βA from a regression of y on XA alone.

Theorem:
Consider the following models:
(A) yt = β1 + β2xt2 + β3xt3 + error
(B) yt = α1 + α2xt2 + error
(C) xt3 = δ1 + δ2xt2 + error

Then, αˆ2 = βˆ2 + δˆ2βˆ3, where the hats denote the OLS estimates from (B), (A) and (C), respectively.

Linear Regressions under Ideal Conditions-34


Theorem:
Consider the following two models:
(A) yt = β1 + β2xt2 + ... + βkxtk + εt;
(B) yt − ȳ = β2(xt2 − x̄2) + ... + βk(xtk − x̄k) + error.
Then, the OLS estimates of β2, ... , βk from the regression of (A) are the same as
the OLS estimates of β2, ..., βk from the regression of (B).
Proof :
Model (A) can be written as
y = Xβ + ε = 1Tβ1 + X*β* + ε,
where 1T is the T×1 vector of ones and β* = (β2,...,βk)′. Then,
βˆ* = ( X*′M(1T)X* )−1X*′M(1T)y.

Observe that:

M(1T)y = ( y1 − ȳ, y2 − ȳ, ..., yT − ȳ )′.
Now, complete the proof by yourself.

Linear Regressions under Ideal Conditions-35


[4] Goodness of Fit
Question: How well does your regression explain yt?

Example:
• A simple regression model: yt = β1 + β2xt2 + εt, with β1,o = β2,o = 1.
• For population A, σo2 = 1. For population B, σo2 = 10.
[Figures: scatter plots of YA vs. X2 (population A) and YB vs. X2 (population B); the YA points lie close to the regression line, while the YB points are widely dispersed around it.]

• Clearly, the regression line E ( yt | xt i ) explains Population A better.

• How can we measure the goodness of fit of E ( yt | xt i ) ?

Definition:
• "Fitted value" of yt: yˆ t = xt′i βˆ (an estimate of E ( yt | xt i ) ).

• Vector of fitted values: ŷ = X βˆ .

Linear Regressions under Ideal Conditions-36


Definition:
SSE = e′e = ( y − X βˆ )′( y − X βˆ ) = ( y − yˆ )′( y − yˆ ) = Σ t ( yt − yˆ t ) 2 .
(Unexplained sum of squares)
→ Measures unexplained variation of yt.
→ SSE/T is an estimate of Ex[var(y|x)].
SSR = Σt(ŷt − ȳ)², where ȳ = T−1Σtyt (Explained sum of squares).

→ Measures variation of yt explained by regression.


→ SSR/T is an estimate of varx[E(y|x)].
SST = Σt(yt − ȳ)² (Total sum of squares)

→ SST/T measures total variation of yt.

Theorem: SSE = Σtet2 = y ′y − βˆ ′X ′y .


Proof:
SSE = ( y − Xβˆ ) ′( y − Xβˆ ) = y ′y − 2 βˆ ′X ′y + βˆ ′X ′Xβˆ

= y ' y − 2 βˆ ′X ′y + βˆ ′X ′X ( X ′X ) −1 X ′y = y ′y − βˆ ′X ′y .

Theorem:
SST = Σt(yt − ȳ)² = Σtyt² − Tȳ²,

SSR = βˆ′X′y − Tȳ² [if (SIC.7) holds].


Proof: For SSR, see Schmidt.

Linear Regressions under Ideal Conditions-37


Theorem:
Suppose that xt1 = 1, for all t (that is, (SIC.7) holds). Then, SST = SSE + SSR.
Proof: Obvious.

Implication:
Total variation of yt equals sum of explained and unexplained variations of yt.

Definition: [Measure of goodness of fit]


R2 = 1 - (SSE/SST) = (SST-SSE)/SST.

Theorem:
Suppose that xt1 = 1, for all t (SIC.7). Then, R2 = SSR/SST and 0 ≤ R2 ≤ 1.

Note:
1) If (SIC.7) holds, then, R2 = 1 - (SSE/SST) = SSR/SST.
2) If (SIC.7) does not hold, then, 1 - (SSE/SST) ≠ SSR/SST.
3) 1 - (SSE/SST) can never be greater than 1, but it could be negative.
SSR/SST can never be negative, but it could be greater than 1.

Linear Regressions under Ideal Conditions-38


Definition:
Ru2 (uncentered R2) = yˆ ′yˆ / y ′y = Σ t yˆ t2 / Σ t y t2 .

Note:
• Some people use Ru2, when the model has no intercept term.
• 0 ≤ Ru2 ≤ 1, since e′e + yˆ ′yˆ = y′y. [Why? Try it at home.]
→ This holds even if (SIC.7) does not hold.
• If ȳ = 0, then Ru² = R².

Definition:
An estimator of the covariance between yt and ŷt (which can be viewed as an estimate
of E(yt|xt•)) is defined by:
e cov(yt, ŷt) = [1/(T − 1)]Σt(yt − ȳ)(ŷt − ỹ),
where ỹ = T−1Σtŷt. Similarly, the estimators of var(yt) and var(ŷt) are defined
by:
e var(yt) = [1/(T − 1)]Σt(yt − ȳ)² ; e var(ŷt) = [1/(T − 1)]Σt(ŷt − ỹ)².
Then, the estimated correlation coefficient between yt and ŷt is defined by:
ρˆ = e cov(yt, ŷt)/√( e var(yt) e var(ŷt) ).

Linear Regressions under Ideal Conditions-39


Note:
1) 0 ≤ ρ̂ 2 ≤ 1, whether (SIC. 7) holds or not.
2) If (SIC.7) holds, ỹ = ȳ.
3) If (SIC.7) holds, 1-(SSE/SST) = SSR/SST = ρ̂ 2 .

Remark for the case where (SIC.7) holds:


1) If R2 = 1, yt and ŷt are perfectly correlated (perfect fit).
2) If R2 = 0, yt and ŷt have no correlation.

→ The regression may not be very useful.


3) Does a high R2 always mean that your regression is good?
[Answer]
No. If you use more regressors, then, you will get higher R2. In particular, if
k = T, R2 = 1.
4) R2 tends to exaggerate goodness of fit when T is small.

Definition: [Adjusted R2, Theil (1971)]


R̄² = 1 − [SSE/(T − k)]/[SST/(T − 1)].
Comment:
• R̄² ≤ R², with strict inequality whenever k > 1 and R² < 1.
• R̄² could be negative.

Linear Regressions under Ideal Conditions-40


[Proof for the fact that R2 increases with k]
Theorem: Let A = [A1,A2]. Then,
M(A)Aj = 0; P(A)Aj = Aj, j = 1, 2; P(A) = P(A1) + P[M(A1)A2].

Theorem: ŷ = P(X)y and e = M(X)y.

Proof: Because ŷ = X βˆ = X(X′X)-1X′y = P(X)y. And e = y – P(X)y = M(X)y.

Lemma: SSE = y′M(X)y = y′y - y′P(X)y.


Proof: SSE = e′e = [ M ( X ) y ]′M ( X ) y = y ' M ( X )′M ( X ) y = y ′M ( X ) y .

Theorem:
When k increases, SSE never increases.
Proof:
Compare:
Model 1: y = Xβ + ε
Model 2: y = Xβ + Zγ + υ = Wξ + υ,
where W = [X,Z] and ξ = [β′,γ′]′.
SSE1 = SSE from M1 = y′M(X)y = y′y - y′P(X)y
SSE2 = SSE from M2 = y′M(W)y = y′y - y′P(W)y
= y′y - y′[P(X)+P{M(X)Z}]y
= y′y - y′P(X)y - y′P{M(X)Z}y
SSE1 - SSE2 = y′P{M(X)Z}y ≥ 0.

Linear Regressions under Ideal Conditions-41


[5] Statistical Properties of the OLS estimator

(1) Random Sample:


• A population (of billions and billions)

x(1) , ... , x(b)

Here, the x(j) are the members of the population.


• θ: An unknown parameter of interest (e.g., population mean or population
variance.)
o If we knew the pdf of this population, we could easily compute θ. But
what if we do not know the pdf?
• Need to estimate θ, using a random sample {x1, ... , xT} of size T from the
population.

Linear Regressions under Ideal Conditions-42


• What do we mean by “random sample”?
• A sample that represents the population well.
• Divide the population into T groups such that the groups are
stochastically independent and the pdf of each group is the same as the
pdf of the whole population. Then, draw one from each group: Then, the
x1, ... , xT should be iid (independently and identically distributed).
• “Random sample” means a sample obtained by this sampling strategy.
• An example of nonrandom sampling:
• Suppose you wish to estimate the % of supporters of the Republican
Party in the Phoenix metropolitan area.
• t is a zip-code area. Choose a person living in a street corner from
each t.
• If you do, your sample is not random. Because rich people are likely
to live in corner houses! Republicans are over-sampled!

• Let θˆ be an estimator of θ. What properties should θˆ have?

(2) Criteria for “good” estimators


1) Unbiasedness.
2) Small variance.
3) Distributed following a known form of pdf (e.g., normal, or χ2).

Linear Regressions under Ideal Conditions-43


Definition: (Unbiasedness)
If E (θˆ) = θo , then we say that θˆ is an unbiased estimator of θ.
Comment:
• Consider the set of all possible random samples of size T:
Estimate
Sample 1: {x1[1], x2[1], ... , xT[1]} → θˆ[1]
Sample 2: {x1[2], x2[2], ... , xT[2]} → θˆ[ 2 ]
Sample 3: {x1[3], x2[3], ... , xT[3]} → θˆ[ 3]
:
Sample b′: {x1[b′], x2[b′], ... , xT[b′]} → θˆ[ b′] .
• Consider the population of Sθ ≡{θˆ[1] , ... , θˆ[ b′] }.
• Unbiasedness of θˆ means that E(θˆ ) = population average of Sθ = θo.

Definition: (Relative Efficiency)


Let θˆ and θ̃ be unbiased estimators of θ. If var(θˆ) < var(θ̃), we say that θˆ
is more efficient than θ̃.

Comment:
If θˆ is more efficient than θ̃, it means that the value of θˆ that I can obtain
from a particular sample would generally be closer to the true value of θ (θo)
than the value of θ̃.

Linear Regressions under Ideal Conditions-44


Example:
• A population is normally distributed with N(μ,σ2), where μo = 0 and σo2 = 9.
• {x1,x2, ... , xT} is a random sample (T = 100):
• Two possible unbiased estimators of μ: x̄ = (1/T)Σtxt and x̃ = x1.
• E(x̄) = E[(1/T)Σtxt] = (1/T)ΣtE(xt) = (1/T)Σtμo = μo ; E(x̃) = E(x1) = μo.
• Which estimator is more efficient?
• var(x̄) = var[(1/T)Σtxt] = (1/T)²Σtvar(xt) = (1/T)²Σtσo² = σo²/T;
• var(x̃) = var(x1) = σo².
• Thus, var(x̄) = σo²/T < σo² = var(x̃), if T > 1.

Gauss Exercise:
• From N(0,9), draw 1,000 random samples of size equal to T = 100.
• For each sample, compute x̄ and x̃ (= x1).
• Draw a histogram for each estimator.
• Gauss program name: mmonte.prg.

Linear Regressions under Ideal Conditions-45


/*
** Monte Carlo Program for sample mean
*/

seed = 1;
tt = 100; @ # of observations @
iter = 1000; @ # of sets of different data @

storem = zeros(iter,1) ;
stores = zeros(iter,1) ;

i = 1; do while i <= iter;

@ compute sample mean for each sample @

x = 3*rndns(tt,1,seed);   @ T draws from N(0,9): 3 times standard normal @
m = meanc(x);
storem[i,1] = m;
stores[i,1] = x[1,1];

i = i + 1; endo;

@ Reporting Monte Carlo results @

output file = mmonte.out reset;

format /rd 12,3;

"Monte Carlo results";


"-----------";
"Mean of x bar =" meanc(storem);
"mean of x rou =" meanc(stores);
library pgraph;
graphset;

v = seqa(-10, .2, 100);


{a1,a2,a3}=hist(storem,v);
@ {b1,b2,b3}=hist(stores,v); @

output off ;

Linear Regressions under Ideal Conditions-46


[Figure: histogram of x̄ across the 1,000 Monte Carlo samples.]

Linear Regressions under Ideal Conditions-47


Extension to the Cases with Multiple Parameters:
• θ = (θ1, θ2, ... , θp)′ is an unknown parameter vector.

Definition: (Unbiasedness)
θˆ is unbiased iff E (θˆ) = θo :
E(θˆ) = ( E(θˆ1), E(θˆ2), ..., E(θˆp) )′ = ( θ1,o, θ2,o, ..., θp,o )′ = θo.

Definition: (Relative Efficiency)


Suppose that θˆ and θ̃ are unbiased estimators. Let c = (c1, c2, ..., cp)′ be a
nonzero vector. θˆ is said to be more efficient than θ̃ iff var(c′θ̃) ≥ var(c′θˆ)
for any nonzero vector c.

Remark:
var(c′θ̃) ≥ var(c′θˆ) for any nonzero c
↔ c′Cov(θ̃)c − c′Cov(θˆ)c ≥ 0, for any nonzero c
↔ c′[Cov(θ̃) − Cov(θˆ)]c ≥ 0, for any nonzero c
↔ Cov(θ̃) − Cov(θˆ) is positive semi-definite.

Linear Regressions under Ideal Conditions-48


Comment:
• Let θ = (θ1, θ2)′ and c = (c1, c2)′.
• Suppose you wish to estimate c′θ = c1θ1 + c2θ2.
• If, for any nonzero c, var(c′θ̃) = var(c1θ̃1 + c2θ̃2) ≥ var(c1θˆ1 + c2θˆ2) = var(c′θˆ), we say that θˆ is more efficient than θ̃.

Example:
• Let θ = (θ1, θ2)′. Suppose Cov(θˆ) = [ 1 0 ; 0 1 ] and Cov(θ̃) = [ 1.5 1 ; 1 1.5 ].
• Note that:
var(θˆ1) = 1 < 1.5 = var(θ̃1) ; var(θˆ2) = 1 < 1.5 = var(θ̃2).
• But,
Cov(θ̃) − Cov(θˆ) = [ 0.5 1 ; 1 0.5 ] ≡ A → |A1| = 0.5 > 0; |A2| = −0.75 < 0.
• A is neither positive nor negative semi-definite.
• θˆ is not necessarily more efficient than θ̃.
• For example, suppose you wish to estimate θ1 − θ2 = c′θ (where c = (1, −1)′):
• var(c′θˆ) = c′Cov(θˆ)c = 2 ; var(c′θ̃) = c′Cov(θ̃)c = 1.
• That is, for the given c = (1, −1)′, c′θ̃ is more efficient than c′θˆ.
• This example is a case where the relative efficiency of estimators depends on
c. For such cases, we can’t claim that one estimator is superior to the others.

Linear Regressions under Ideal Conditions-49


Theorem:
If θˆ is more efficient than θ̃, then var(θˆj) ≤ var(θ̃j) for all j = 1, ..., p. But not
vice versa.
Proof:
Choose c = (1, 0, ..., 0)′. Then, you can show var(θˆ1) ≤ var(θ̃1). Now, choose c
= (0, 1, 0, ..., 0)′. Then, we can show var(θˆ2) ≤ var(θ̃2). Keep doing this until j
= p.

Linear Regressions under Ideal Conditions-50


(3) Population Projection
• Suppose you have data from all population members (say, t = 1, …., B).
• Assume that E(x•x•′) = (1/B)Σt=1,...,B xt•xt•′ is pd, where xt1 = 1 for all t.
• Let βp = (β1,p, ..., βk,p)′ be the OLS estimator obtained using the entire population:

Notice that β p is a population parameter vector. Denote

Pr oj ( yt | xt • ) = xt′• β p .

• Let e p ,t = yt − xt′i β p , where t = 1, … , B.

• Population projection model:


yt = Pr oj ( yt | xt i ) + ε t = xt′i β p + e p ,t .

• By definition, β p always exists. Notice that (SIC.1) assumes that the

conditional mean of yt is linear in xt i : E ( yt | xt i ) = xt′i βo . In contrast, the


population projection of yt is always linear.

Theorem:
E ( x j e p ) = 0 for all j = 1, … , k. That is, E ( xi e p ) = 0k ×1 .

Proof:
Recall X′e = 0k×1 → Σt=1,...,T xt•et = 0k×1. That is, E(x•ep) = (1/B)Σt=1,...,B xt•ep,t = 0.

Comment:
• E(x•ep) = 0 does not imply E(ep|x•) = 0, although the latter implies the former.

Linear Regressions under Ideal Conditions-51


Theorem:
βp = ( E(x•x•′) )−1E(x•y).
Proof:
βp = ( Σt=1,...,B xt•xt•′ )−1Σt=1,...,B xt•yt = ( (1/B)Σt=1,...,B xt•xt•′ )−1(1/B)Σt=1,...,B xt•yt = ( E(x•x•′) )−1E(x•y).

Comment:
• Intuitively, the OLS estimator is a consistent estimator of β p .

• Notice that under (SIC.1)-(SIC.4), β o = β p !

• Under (SIC.1)-(SIC.4), E ( yt | xt • ) = Pr oj ( yt | xt • ) = xt′• βo .


• Thus, under (SIC.1)-(SIC.4), the OLS estimator is a consistent estimator of
βo .

Linear Regressions under Ideal Conditions-52


(4) The Stochastic Properties of the OLS Estimator.

(SIC.8) The regressor xt1, … , xtk ( xt • ) are nonstochastic.

Comment:
• The whole population consists of T groups, and each group has fixed xt•.
We draw yt from each group. The value of yt would change over different
trials, but the value of xt• remains the same.
• Can be replaced by the assumption that E (ε t | x1• ,..., xT • ) = 0 for all t
(assumption of strictly exogenous regressors). This assumption holds as
long as (SIC.1) - (SIC.4) hold. If you do not use (SIC.8), the distributions
of βˆ and s 2 obtained below the conditional ones conditional on
x1i , x2i ,..., xT i .

Theorem:
Assume (SIC.1)-(SIC.6) and (SIC.8). Then,
• E ( βˆ ) = β o (unbiased)

• Cov ( βˆ ) = σ o2 ( X ′X ) −1

• E ( s 2 ) = σ o2 , where s 2 = SSE /(T − k ) = Σt et2 /(T − k ) = e′e /(T − k )


[even if the εt are not normal, that is, (SIC.6) does not hold]
• βˆ ~ N ( β o , σ o2 ( X ′X ) −1 ) .

• βˆ and SSE (so s2) are stochastically independent.


• SSE/σo² ~ χ²(T − k) [if (SIC.6) holds.]

Linear Regressions under Ideal Conditions-53


Comment:
• As discussed later, we need to estimate Cov ( βˆ ) = σ o2 ( X ′X ) −1 .

• We can use s2 to estimate Cov ( βˆ ) .

Numerical Exercise:
• yt = β1 + β2xt2 + β3xt3 + εt, T = 5:
y = ( 0, 0, 1, 1, 3 )′ ; X = [ 1 −2 4 ; 1 −1 1 ; 1 0 0 ; 1 1 1 ; 1 2 4 ] (rows are observations).

• Then,
X′X = [ 5 0 10 ; 0 10 0 ; 10 0 34 ] ; X′y = ( 5, 7, 13 )′ ; y′y = 11 ; ȳ = 1.

1) Compute βˆ:
(X′X)-1 = [ 17/35 0 −1/7 ; 0 1/10 0 ; −1/7 0 1/14 ].
βˆ = (βˆ1, βˆ2, βˆ3)′ = (X′X)-1X′y = (X′X)-1(5, 7, 13)′ = ( 0.571, 0.7, 0.214 )′.

Linear Regressions under Ideal Conditions-54


2) Compute s2:
SSE = y′y - y′X βˆ = 0.46
→ s2 = SSE/(T-k) = 0.46/(5-3) = 0.23

3) Estimate Cov ( βˆ ) :

s²(X′X)-1 = 0.23 × [ 17/35 0 −1/7 ; 0 1/10 0 ; −1/7 0 1/14 ] = [ 0.112 0 −0.032 ; 0 0.023 0 ; −0.032 0 0.016 ].

4) Compute SSE, SSR and SST:


• SST = y′y − Tȳ² = 11 − 5×(1)² = 6;
• SSE = y′y − βˆ′X′y = 11 − (0.571, 0.7, 0.214)(5, 7, 13)′ = 0.46;
• SSR = SST − SSE = 5.54.

5) Compute R² and R̄².
• R² = SSR/SST = 5.54/6 = 0.923.
• R̄² = 1 − [(T − 1)/(T − k)](1 − R²) = 1 − [(5 − 1)/(5 − 3)](1 − 0.923) = 0.846.

Linear Regressions under Ideal Conditions-55


[Proofs of the General Results under SIC]
1) Some useful results:
a) Let ε = (ε1 ,..., ε T )′ . Then, the model yt = xt′• βo + ε t (t = 1, … , T) can be
written as y = X βo + ε . [Be careful that ε is a vector from now on!]
b) E (ε ) = 0T ×1 , because E (ε t ) = E xt• [ E (ε t | xt • )] = E xt• (0) = 0 for all t.

c) E (εε ′) = E (εε ′) − E (ε ) E (ε ′) = Cov(ε ) = σ o2 I T , because cov(ε t , ε s ) = 0 by

(SIC.4) and var(ε t ) = σ o2 by (SIC.5).


d) Under (SIC.8), E ( X ′ε ) = X ′E (ε ) = 0k ×1 .

2) Show that E ( βˆ ) = β o and Cov ( βˆ ) = σ o2 ( X ′X ) −1 .

Lemma D.1:
βˆ = β o + ( X ′X ) −1 X ′ε .
Proof:
y = X βo + ε .

βˆ = (X′X)-1X′y = (X′X)-1X′(Xβo + ε) = βo + (X′X)-1X′ε.

Theorem: (Unbiasedness)
E ( βˆ ) = β o .
Proof:
E ( βˆ ) = E [ β o + ( X ′X ) −1 X ′ε ] = β o + ( X ′X ) −1 X ′E (ε ) = β o .

Linear Regressions under Ideal Conditions-56


Theorem:
Cov ( βˆ ) = σ o2 ( X ′X ) −1 .
Proof:
Cov( βˆ ) = Cov[ β o + ( X ′X ) −1 X ′ε ]

= Cov[( X ′X ) −1 X ′ε ] = ( X ′X ) −1 X ′Cov (ε )[( X ′X ) −1 X ′]′

= ( X ′X ) −1 X ′(σ o2 I T ) X ( X ′X ) −1 = σ o2 ( X ′X ) −1 X ′I T X ( X ′X ) −1

= σ o2 ( X ′X ) −1 X ′X ( X ′X ) −1 = σ o2 ( X ′X ) −1 .

3) Show E ( s 2 ) = σ o2 .

Lemma D.2:
SSE = e′e = y ′M ( X ) y = ε ′M ( X )ε .
Proof:
SSE = y′M(X)y = (Xβ+ε)′M(X)(Xβ + ε) = (β′X′+ε′)M(X)ε = ε′M(X)ε.

Theorem:
E ( SSE ) = (T − k )σ o2 .

Linear Regressions under Ideal Conditions-57


Digression to Matrix Algebra:
Definition: (trace of a matrix)
B = [bij]n x n → tr(B) = Σni=1bii = sum of diagonals.

Lemma D.3:
For Am×n and Bn×m, tr(AB) = tr(BA).
Lemma D.4:
If B is an idempotent n×n matrix,
rank(B) = tr(B).

[Comment]
• For Lemma D.4, many econometrics books assume B to be also symmetric.
But the matrix B does not have to be.
• An idempotent matrix does not have to be symmetric: For example,
[ 1/2 1 ; 1/4 1/2 ] ; [ 1 a ; 0 0 ].
• Theorem DA.1:
The eigenvalues of an idempotent matrix, say B, are ones or zeros.
<Proof> λξ = Bξ = B 2ξ = Bλξ = λ 2ξ .

Linear Regressions under Ideal Conditions-58


• Theorem DA.2:
tr(B) = sum of the eigenvalues of B, where B is n×n.
<Proof> det(λ I − B ) = (λ − λ1 )...(λ − λn )

→ (b11 + b22 + ... + bnn )λ n −1 = (λ1 + ... + λn )λ n −1 .

• Theorem DA.3:
rank (B) = # of non-zero eigenvalues of B [See Greene.]

• Lemma D.4 is implied by Theorems DA.1-3.

Example:
Let A be T×k (T > k). Show that rank[IT-A(A′A)-1A′] = T - k.
[Solution]
rank[IT-A(A′A)-1A′]
= tr(IT - A(A′A)-1A′)
= tr(IT) - tr[A(A′A)-1A′] = T - tr[(A′A)-1A′A]
= T - tr(Ik) = T - k.
End of Digression.

Linear Regressions under Ideal Conditions-59


3) Show E ( s 2 ) = σ o2 :
E ( SSE ) = E (ε ′M ( X )ε ) = E[tr{ε ′M ( X )ε }] = E[tr{M ( X )εε ′}]
= tr[ M ( X ) E (εε ′)] = tr[ M ( X )σ o2 I T ] = σ o2tr[ M ( X )]
= σ o2tr[ I T − X ( X ′X ) −1 X ′] = σ o2 (T − k )

→ E ( s 2 ) = E ( SSE /(T − k )) = E ( SSE ) /(T − k ) = [σ o2 (T − k )]/(T − k ) = σ o2 .

4) Show the normality of βˆ .

Lemma D. 5:
Let zT×1 ~ N(μT×1, ΩT×T). Suppose that A is a k×T nonstochastic matrix. Then,
b + Az ~ N(b + Aμ, AΩA′).

Theorem: βˆ ~ N ( β o , σ o2 ( X ′X ) −1 )
Proof:
βˆ = β o + ( X ′X ) −1 X ′ε

→ βˆ ~ N(βo+(X′X)-1X′E(ε), (X′X)-1X′Cov(ε)X(X′X)-1)

= N ( β o , σ o2 ( X ′X ) −1 ) .

Linear Regressions under Ideal Conditions-60


5) Show that βˆ and SSE are stochastically independent.

Lemma D.6:
Let Q be a T×T (nonstochastic) symmetric and idempotent matrix. Suppose
ε ~ N (0T ×1 , σ o2 I T ) . Then,
ε′Qε/σo² ~ χ²(r), r = tr(Q).

Proof: See Schmidt.

Lemma D.7:
Suppose that Q is a T×T (nonstochastic) symmetric and idempotent and B is a
m×T nonstochastic matrix. If ε ~ N (0T ×1 , σ o2 I T ) , Bε and ε′Qε are
stochastically independent iff BQ = 0mxT.
Proof: See Schmidt.

Theorem:
(T − k)s²/σo² = SSE/σo² ~ χ²(T − k).

And, βˆ and s2 are stochastically independent.

Linear Regressions under Ideal Conditions-61


Proof:
1) Note that (T − k ) s 2 / σ o2 = SSE / σ o2 = ε ′M ( X )ε / σ o2 .
Since M(X) is idempotent and symmetric and tr(M(X)) = T − k, by Lemma D.6,
ε′M(X)ε/σo² ~ χ²(T − k).

2) Note that βˆ − β o = ( X ′X ) −1 X ′ε (by Lemma D.1); (T-k)s2 = SSE = ε′M(X)ε.


Note that (X′X)-1X′M(X) = 0kxT . Therefore, Lemma D.7 applies, i.e., SSE and
βˆ are stochastically independent. So are s2 and βˆ .

Theorem: var( s 2 ) = 2σ o4 /(T − k ) .


Proof:
Since (T − k ) s 2 / σ o2 ~ χ 2 (T − k ) , var[(T − k ) s 2 / σ o2 ] = 2(T − k ) (since

var(χ2(r)) = 2r), and [(T − k ) / σ o2 ]2 var( s 2 ) = 2(T − k ) implies

var( s 2 ) = 2σ o4 /(T − k ) .

Remark:
Let θ = (β′, σ²)′ and θˆ = (βˆ′, s²)′. Then,
Cov(θˆ) = [ σo²(X′X)-1  0k×1 ; 01×k  2σo⁴/(T − k) ].

Linear Regressions under Ideal Conditions-62


[6] Efficiency of βˆ and s2
Question:
Are the OLS estimators, βˆ and s2, the best estimators among the unbiased
estimators of β and σ2?

Theorem: (Gauss-Markov)
Under (SIC.1) – (SIC.5) (ε may not be normal) and (SIC.8), βˆ is the best
linear unbiased estimator (BLUE) of β.

Comment:
Suppose that β̃ is an estimator which is linear in y; that is, there exists a T×k
matrix C such that β̃ = C′y. Let us assume that E(β̃) = βo. Then, the above
theorem means that Cov(β̃) − Cov(βˆ) is psd, for any such β̃.

Linear Regressions under Ideal Conditions-63


Proof of Gauss-Markov (A Sketch):
Let β̃ be an unbiased estimator linear in y: That is, there exists a T×k matrix C
such that β̃ = C′y. Let C′ = (X′X)-1X′ + D′. Then,
E(β̃) = E[(X′X)-1X′y + D′y] = E(βˆ + D′y) = βo + E(D′y).

Since β̃ is unbiased, it must be that:
E(D′y) = 0 → E[D′(Xβo + ε)] = 0 → D′Xβo + D′E(ε) = 0
→ D′Xβo = 0.
Since this result must hold whatever βo is, D′X = 0k×k. Then,
β̃ = C′y = [(X′X)-1X′ + D′]y = [(X′X)-1X′ + D′](Xβo + ε)
  = βo + [(X′X)-1X′ + D′]ε.
After some algebra, you can show that (do this by yourself):
Cov(β̃) = Cov(βˆ) + σo²D′D [using the fact that D′X = 0].
Then, you can show:
Cov(β̃) − Cov(βˆ) = σo²D′D is psd (by the theorem below).

Digression to Matrix Theory


Theorem:
Suppose A is p×q nonzero matrix. Then, A′A is psd. If rank(A) = q, then, A′A
is pd.
End of Digression

Linear Regressions under Ideal Conditions-64


Theorem:
Under (SIC.1) – (SIC.6) (ε should be normal) and (SIC.8), βˆ and s2 are the
most efficient estimators of β and σ2. [(SIC.7) does not have to hold.]

Digression to Mathematical Statistics


(1) Cases in which θ (unknown parameter) is scalar.

Definition: (Likelihood function)


• Let {x1, ... , xT} be a sample from a population.
• It does not have to be a random sample.
• xt is a scalar.
• Let f(x1,x2, ... , xT,θo) be the joint density function of x1, ... , xT.
• The functional form of f is known, but not θo.
• Then, LT(θ) ≡ f(x1, ... , xT, θ) is called “likelihood function”.
• LT(θ) is a function of θ given x1, ... , xT.

Definition: (log-likelihood function)


lT(θ) = ln[f(x1, ... , xT,θ)].

Linear Regressions under Ideal Conditions-65


Example:
• {x1, ... , xT}: a random sample from a population distributed with f(x,θo).

• f(x1, ... , xT, θo) = ∏t=1,...,T f(xt, θo).
→ LT(θ) = f(x1, ... , xT, θ) = ∏t=1,...,T f(xt, θ).
→ lT(θ) = ln[ ∏t=1,...,T f(xt, θ) ] = Σt ln f(xt, θ).

Definition: (Maximum Likelihood Estimator (MLE))


MLE θˆMLE maximizes lT(θ) given data points x1, ... , xT.

Theorem: (Minimum Variance Unbiased Estimator)


If E(θˆMLE ) = θo, then θˆMLE is the MVUE. If E(θˆMLE ) ≠ θo, but if there exists a

function g(θˆMLE ) such that E[g(θˆMLE )] = θo, then, g(θˆMLE ) is the MVUE.

Example:
• {x1, ... , xT} is a random sample from a population following a Poisson
distribution [i.e., f(x,θ) = e-θθx/x! (suppressing subscript “o” from θ)].
• Note that E(x) = var(x) = θo for Poisson distribution.
• lT(θ) = Σtln[f(xt,θ)] = -θT + (ln(θ))Σtxt - Σtln(xt!)
• FOC of maximization: ∂lT/∂θ = −T + (1/θ)Σtxt = 0.
• Solving this, θˆMLE = Σtxt/T = x̄.

Linear Regressions under Ideal Conditions-66


(2) Extension to the Cases with Multiple Parameters.
Definition:
• θ = [θ1,θ2, ... , θp]′.
• LT(θ) = f(x1, ..., xT,θ) = f(x1, ... , xT, θ1, ... , θp).
• lT(θ) = ln[f(x1, ... , xT,θ) = ln[f(x1, ... , xT, θ1, ... , θp)].
• xt could be a vector.
• If {x1, ... , xT} is a random sample from a population with f(x,θo),

( )
lT(θ) = ln ∏t =1 f ( xt ,θ ) = Σt ln f ( xt ,θ ) .
T

Definition: (MLE)
MLE θˆMLE maximizes lT(θ) given data (vector) points x1, ... , xT. That is, θˆMLE
solves
∂lT(θ)/∂θ = ( ∂lT(θ)/∂θ1, ∂lT(θ)/∂θ2, ..., ∂lT(θ)/∂θp )′ = 0p×1.

Theorem: (Minimum Variance Unbiased Estimator)


If E(θˆMLE ) = θo, then θˆMLE is the MVUE. If E(θˆMLE ) ≠ θo, but if there exists a

function g(θˆMLE ) such that E[g(θˆMLE )] = θo, then, g(θˆMLE ) is the MVUE.

Linear Regressions under Ideal Conditions-67


Comment:
Let θ̃ be any unbiased estimator of θo. The above theorem implies that
[ Cov(θ̃) − Cov(θˆMLE) ] is psd.

Example:
• Let {x1, ... , xT} be a random sample from N ( μo , σ o2 ) .

• Since {x1, ... , xT} is a random sample, E ( xt ) = μo and var( xt ) = σ o2 .

• Let θ = (μ,v)′, where v = σ2.


• f(xt, θ) = [1/√(2πv)] exp[ −(xt − μ)²/(2v) ] = (2π)-1/2(v)-1/2 exp[ −(xt − μ)²/(2v) ].
• ln[f(xt, θ)] = −(1/2)ln(2π) − (1/2)ln(v) − (xt − μ)²/(2v).
• lT(θ) = −(T/2)ln(2π) − (T/2)ln(v) − Σt(xt − μ)²/(2v).
• MLE solves the FOC:
(1) ∂lT(θ)/∂μ = −(1/(2v))Σt2(xt − μ)(−1) = Σt(xt − μ)/v = 0;
(2) ∂lT(θ)/∂v = −T/(2v) + Σt(xt − μ)²/(2v²) = 0.
• From (1):
(3) Σt(xt − μ) = 0 → Σtxt − Tμ = 0 → μˆMLE = Σtxt/T = x̄.

Linear Regressions under Ideal Conditions-68


• Substituting (3) into (2):
(4) −Tv + Σt(xt − μˆMLE)² = 0 → vˆMLE = (1/T)Σt(xt − x̄)².
• Thus,
θˆMLE = ( μˆMLE, vˆMLE )′ = ( x̄, (1/T)Σt(xt − x̄)² )′.

• Note that:
• E(μˆMLE) = E(x̄) = E[(1/T)Σtxt] = (1/T)ΣtE(xt) = (1/T)Σtμo = μo.
• E(vˆMLE) = [(T − 1)/T]σo² (by the fact that E[ (1/(T − 1))Σt(xt − x̄)² ] = σo²).
→ Let g(vˆMLE) = [T/(T − 1)]vˆMLE.
→ Clearly, E[g(vˆMLE)] = E[ (1/(T − 1))Σt(xt − x̄)² ] = σo².
→ Thus, g(vˆMLE) is the MVUE of σ².

Linear Regressions under Ideal Conditions-69


(3) Extension to Conditional density
Definition:
• Conditional density of yt: f ( yt ,θo | xt i ) , θ = [θ1,θ2, ... , θp]′.

• LT (θ ) = Π Tt =1 f ( yt ,θ | xt i ) .

• lT(θ) = ln LT(θ) = Σt=1,...,T ln f(yt, θ | xt•).

Example:
• Assume that (yt, xt•′)′ are iid and that yt | xt• ~ N(xt•′βo, vo).
• f(yt, β, v | xt•) = [1/√(2πv)] exp[ −(yt − xt•′β)²/(2v) ].
• lT(β, v) = Σt ln f(yt, β, v | xt•)
           = −(T/2)ln(2π) − (T/2)ln v − (1/2v)Σt(yt − xt•′β)²
           = −(T/2)ln(2π) − (T/2)ln v − (1/2v)(y − Xβ)′(y − Xβ).

End of Digression

Linear Regressions under Ideal Conditions-70


Return to Efficiency of OLS estimator

Proof:
We already know that E ( βˆ ) = β o and E ( s 2 ) = σ o2 . Thus, it is sufficient to

show that βˆ and s2 are MLE or some functions of MLE. Under (SIC.1) –
(SIC.6) and (SIC.8),
ε ~ N(0T×1, voIT) → y ~ N(Xβo,voIT), where vo = σ o2 .
Therefore, we have the following likelihood function of y,
LT(β,v) = [1/((2π)T/2|vIT|1/2)] exp[ −(1/2)(y − Xβ)′(vIT)-1(y − Xβ) ]
        = [1/((2π)T/2vT/2)] exp[ −(1/(2v))(y − Xβ)′(y − Xβ) ].

Then,
lT(β,v) = -(T/2)ln(2π) -(T/2)ln(v) - (y-Xβ)′(y-Xβ)/(2v)
= -(T/2)ln(2π) -(T/2)ln(v) - (1/2v)[y′y-2β′X′y+β′X′Xβ].
→ FOC: ∂lT(β,v)/∂β = -(1/2v)[-2X′y + 2X′Xβ] = 0k×1 (i)
∂lT(β,v)/∂v = -(T/2v) + (1/2v2)(y-Xβ)′(y-Xβ) = 0 (ii)
→ From (i), X′y - X′Xβ = 0k×1 → βˆ MLE = (X′X)-1X′y = βˆ .
→ From (ii), v̂MLE = SSE/T → s2 is a function of v̂MLE .
[s2 = [T/(T-k)] v̂MLE ]

Linear Regressions under Ideal Conditions-71


[7] Testing Linear Hypotheses
(1) Testing a single restriction on β:
• Ho: Rβo - r = 0, where R is 1×k and r is a scalar.

Example: yt = xt1β1 + xt2β2 + xt3β3 + εt.


• We would like to test Ho: β3,o = 0.
• Define R = [0 0 1] and r = 0.
• Then, Rβo - r = 0 → β3,o = 0.
• Ho: β2,o - β3,o = 0 (or β2,o = β3,o).
• Define R = [0 1 -1] and r = 0.
• Rβo - r = 0 → β2,o - β3,o = 0
• Ho: 2β2,o + 3β3,o = 3.
• R = [0 2 3] and r = 3.
• Rβ - r = 0 → Ho.

Theorem: (T-Statistics Theorem)


Assume that (SIC.1)-(SIC.6) and (SIC.8) hold. Under Ho: Rβo - r = 0,
t = (Rβˆ − r)/sR ~ t(T − k),

where sR = √( R[s²(X′X)-1]R′ ).

Linear Regressions under Ideal Conditions-72


Corollary:
Let se( βˆ j ) = square root of the j’th diagonal of s2(X′X)-1. Then, under Ho: βj =

βj*,

t = (βˆj − βj*)/se(βˆj) ~ t(T − k).
Proof:
Let R = [0 0 ... 1 ... 0]; that is, only the j′th entry of R equals 1. Let r = βj*. Then,
t = (Rβˆ − r)/sR = (βˆj − βj*)/√( Rs²(X′X)-1R′ ) = (βˆj − βj*)/se(βˆj).

Comment:
• T-Statistics Theorem implies the following:
• Imagine that you collect billions and billions (b) of different samples.
• For each sample, compute the t statistic for the same hypothesis Ho.
Denote the population of these t statistics by {t[1], t[2], ..., t[b]}.
• The above theorem indicates that the population of t-statistics is
distributed as t(T-k).

Linear Regressions under Ideal Conditions-73


How to reject or accept Ho
<Case 1> Ho: Rβo = r and Ha: Rβo ≠ r.
• For simplicity, consider a case with T-k = 25.
• Ho: βj,o = 0 and Ha: βj,o ≠ 0.

• If you choose α = 5% (significance level), the probability that your


t-statistic computed with a sample lies between –2.06 and 2.06 is 95%
(confidence level). Call 2.06 “critical value” (c).
• So, if the value of your t-statistic is outside of (-2.06, 2.06) [(-c, c)], you
could say, “My t-value is quite an unlikely number I can obtain, if Ho is
indeed correct”. In this sense, you reject Ho.
• If the value of your t-statistic is inside of (-2.06,2.06), you can say, “My
t-value is a possible number I can get if Ho is correct.” In this sense, you
accept (do not reject) Ho.

Linear Regressions under Ideal Conditions-74


• Another way to determine acceptance/rejection (P-value):
• Suppose you have t = 1.85 and T-k = 40
• Find the probability that a t-random variable is outside of (-1.85, 1.85).

• This probability is called the p-value. This value is the minimum α value
with which you can reject Ho. Thus, if your choice of α > p-value, reject
Ho. If your choice of α < p-value, do not reject Ho.
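In GAUSS, for example, the p-value can be computed directly (a small sketch; it uses the built-in function cdftc(x, df), which returns the upper-tail probability Pr(T > x) of the t distribution):

tstat = 1.85; df = 40;
pval = 2*cdftc(abs(tstat), df);   @ two-sided p-value @
print pval;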

Linear Regressions under Ideal Conditions-75


<Case 2> Ho: Rβo = r and Ha: Rβo > r.
• T-k = 28, Ho: βj,o = 0 and Ha: βj,o > 0.

• Here, you strongly believe that βj,o cannot be negative. If so, you would
regard negative t-statistics as evidence for Ho. So, your
acceptance/rejection decision depends on how positively large the value of
your t-statistic is.
• Choose a critical value (c = 1.701) as in the above graph at 5% significance
level. Then, reject Ho in favor of Ha, if t > c (=1.701). Do not reject Ho, if t
< c.

Linear Regressions under Ideal Conditions-76


<Case 3> Ho: Rβo = r and Ha: Rβo < r.
• T-k = 18, Ho: βj,o = 0 and Ha: βj,o < 0.

• Here, you strongly believe that βj,o cannot be positive. If so, you would
regard a positive value of a t-statistic as evidence favoring Ho. So, your
acceptance/rejection decision depends on how negatively large the value of
your t-statistic is.
• Choose a critical value (-c = -1.734) as in the above graph at a given
significance level. Then, reject Ho in favor of Ha, if t < -c (= -1.734). Do not
reject Ho, if t > -c.

Linear Regressions under Ideal Conditions-77


Numerical Example:
• Use 95% of confidence level.
• y = β1 + β2x2t + β3x3t + εt.
• s²(X′X)-1 = [ 1.45 0 0 ; 0 72.57 −101.60 ; 0 −101.60 145.14 ] ; βˆ = ( 1.2, −1, 2 )′ ; T = 10.

• Ho: β2,o = β3,o against Ha: β2,o ≠ β3,o


→ Ho: β2,o - β3,o = 0.
→ Ho: 1•β2,o + (-1)• β3,o = 0.
→ R = (0,1,-1) and r = 0.
→ t = -0.14
→ df = 10 – 3 = 7 → c = 2.365
→ Since –2.365 (-c) < t < 2.365 (c), do not reject Ho.

• Ho: β2,o + β3,o = 1 ; Ha: β2,o + β3,o ≠ 1


→ t = 0, c = 2.365.
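A check of the two t-values above (added arithmetic): for Ho: β2,o − β3,o = 0, R = (0, 1, −1), so
sR² = Rs²(X′X)-1R′ = 72.57 + 145.14 − 2×(−101.60) = 420.91 and
t = (βˆ2 − βˆ3)/sR = (−1 − 2)/√420.91 ≅ −0.15, the −0.14 reported above up to rounding.
For Ho: β2,o + β3,o = 1, R = (0, 1, 1) and r = 1, so Rβˆ − r = (−1 + 2) − 1 = 0 and hence t = 0 regardless of sR.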

Linear Regressions under Ideal Conditions-78


[Proof of T-Statistics Theorem]
Digression to Probability Theory
1) Standard Normal Distribution: (z ~ N(0,1))
• Pdf: φ(z) = (1/√(2π)) exp(−z²/2),  −∞ < z < ∞.
2) χ2 (Chi-Square) Distribution
• Let z1, ... , zk be random variables iid with N(0,1).
• Then, y = Σ_{i=1}^k zi² ~ χ²(k).

• Here, y > 0, k = degrees of freedom.


• E(y) = k and var(y) = 2k.

3) Student t Distribution
• Let z ~ N(0,1) and y ~ χ2(k). Assume that z and y are stochastically
independent.
• Then, t = z/√(y/k) ~ t(k).

• E(t) = 0, k > 1; var(t) = k/(k-2), k > 2.


• As k → ∞, var(t) → 1. In fact, t → z.
• The pdf of t is similar to that of z, but t has thicker tails.
• f(t) is symmetric around t = 0.

Linear Regressions under Ideal Conditions-79


4) F Distribution
• Let y1 ~ χ2(k1) and y2 ~ χ2(k2) be stochastically independent.
• Then, f = (y1/k1)/(y2/k2) ~ f(k1, k2).

• If t ~ t(k2), then t² ~ f(1, k2).
• If f ~ f(k1, k2), then k1·f →d χ²(k1) as k2 → ∞.

Gauss Exercise:
• z ~ N(0,1); t ~ t(4); y ~ χ2(2); f ~ f(2,10).
• Gauss program name: dismonte.prg
/*
** Monte Carlo Program for z, chi-square, t and f distributions
*/

@ Data generation under Classical Linear Regression Assumptions @


new;
seed = 1;
iter = 10000; @ # of sets of different data points @

z = zeros(iter,1);
t = zeros(iter,1);
x = zeros(iter,1);
f = zeros(iter,1);

i = 1; do while i <= iter;


z[i,1] = rndns(1,1,seed);
t[i,1] = rndns(1,1,seed)./sqrt( sumc(rndns(4,1,seed)^2)/4 );
x[i,1] = sumc(rndns(2,1,seed)^2);
f[i,1] = ( sumc( rndns(2,1,seed)^2 )/2 )./ (sumc( rndns(10,1,seed)^2 )/10) ;
i = i + 1; endo ;

@ Histograms @

library pgraph;
graphset;
ytics(0,6,0.1,0) ;
v = seqa(-8,0.1,220);
@ {a1,a2,a3}=histp(z,v); @
@ {b1,b2,b3}=histp(t,v); @

library pgraph;
graphset;
ytics(0,10,0.1,0);
w = seqa(0, 0.1, 330);

Linear Regressions under Ideal Conditions-80


@ {c1,c2,c3} = histp(x,w); @
{d1,d2,d3} = histp(f,w);

[Histograms from dismonte.prg: z ~ N(0,1) and t ~ t(4).]

Linear Regressions under Ideal Conditions-81

[Histograms from dismonte.prg: y ~ χ²(2) and f ~ f(2,10).]
End of Digression

Linear Regressions under Ideal Conditions-82


Lemma T.1:
Under (SIC.1)-(SIC.6) and (SIC.8), βˆ and s2 are stochastically independent.
(See Schmidt.)
Lemma T.2:
Under (SIC.1)-(SIC.6) and (SIC.8),
R(βˆ − β)/sR ~ t(T − k).
Proof:

Define σR = √[σo²R(X′X)⁻¹R′]. Note that:

E[R(βˆ − β)/σR] = 0;  var[R(βˆ − β)/σR] = 1.

[Why?] Furthermore, since βˆ is normal, so is R(βˆ − β)/σR. That is,

q1 ≡ R(βˆ − β)/σR ~ N(0,1).
Note that

q2 ≡ sR/σR = √[Rs²(X′X)⁻¹R′ / (Rσo²(X′X)⁻¹R′)] = √[s²/σo²] = √[(T − k)s² / ((T − k)σo²)] = √[χ²(T − k)/(T − k)].

Note that q1 and q2 are stochastically independent because βˆ and s2 are


stochastically independent by Lemma T.1. Therefore, we have:
R(βˆ − β)/sR = q1/q2 = N(0,1)/√[χ²(T − k)/(T − k)] ~ t(T − k).

Linear Regressions under Ideal Conditions-83


Proof of T-Statistics Theorem:
Under Ho,
t = (Rβˆ − r)/sR = (Rβˆ − Rβo)/sR = R(βˆ − βo)/sR ~ t(T − k).
Then, the result immediately follows from Lemma T.2.

(2) Testing several restrictions


Assume that R is m×k and r is m×1 vector, and Ho: Rβo = r.

Example:
• A model is given: yt = xt1β1,o + xt2β2,o + xt3β3,o + εt.
• Wish to test for Ho: β1,o = 0 and β2,o + β3,o = 1.
• Define:
R = [1, 0, 0; 0, 1, 1];  r = [0; 1].
Then, Ho → Rβo = r.

Theorem: (F-Statistics Theorem)


Assume that all of SIC holds. Under Ho: Rβo = r,
F ≡ (Rβˆ − r)′[Rs²(X′X)⁻¹R′]⁻¹(Rβˆ − r)/m ~ f(m, T − k).
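
A Gauss sketch of this computation (not in the notes): the two restrictions of the earlier numerical example (β2,o = β3,o and β2,o + β3,o = 1) can be tested jointly, so that m = 2 and T − k = 7:

/* joint F test, reusing s^2(X'X)^(-1) and bhat from the numerical example */
s2xxi = { 1.45 0 0, 0 72.57 -101.60, 0 -101.60 145.14 };
bhat  = { 1.2, -1, 2 };
R = { 0 1 -1, 0 1 1 };   r = { 0, 1 };   m = 2;
F = (R*bhat - r)'*invpd(R*s2xxi*R')*(R*bhat - r)/m;
print F;        @ about 0.08, well below c = 4.74 from f(2,7) at 5%; do not reject Ho @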

Linear Regressions under Ideal Conditions-84


Comment:
F = (Rβˆ − r)′[Rs²(X′X)⁻¹R′]⁻¹(Rβˆ − r)/m
  = {(Rβˆ − r)′[R(X′X)⁻¹R′]⁻¹(Rβˆ − r)/m} / [SSE/(T − k)].

Comment:
F-Statistics Theorem implies the following:
• Imagine that you collect billions and billions (b) of different samples.
• For each sample, compute the F statistic for the same hypothesis Ho. Denote
the population of these F statistics as {F[1], F[2], ... , F[b]}.
• The above theorem indicates that the population of the F-statistics is
distributed as f(m,T-k).

How to reject or accept Ho


• When you use the F-test, it is important to note that the hypothesis you
actually test is not Ho: Rβo = r. It is rather (with some exaggerations) the
hypothesis that:
Ho′: (Rβo-r)′[R(X′X)-1R′]-1(Rβo-r) = 0.
If so, your alternative hypothesis should be that
Ha′: (Rβo-r)′[R(X′X)-1R′]-1(Rβo-r) > 0,
because R(X′X)-1R′ is pd. So, the F-test is one-tailed by nature.

Linear Regressions under Ideal Conditions-85


• Suppose m = 3 and T-k = 60.

• If you choose α = 5% (significance level), the probability that your


F-statistic computed with a sample is greater than 2.76 is 5%.
Call 2.76 “critical value” (c).
• So, if the value of your F-statistic is greater (smaller) than c, reject (do not
reject) Ho.

Linear Regressions under Ideal Conditions-86


An Alternative Representation of F-Statistic
Definition: (Restricted OLS)
Restricted OLS estimators β̃ and σ̃² are defined as follows: β̃ minimizes ST(β) = (y − Xβ)′(y − Xβ) subject to the restriction Rβ = r. Given β̃, σ̃² is computed by (y − Xβ̃)′(y − Xβ̃)/(T − k + m).

Theorem:
β̃ = βˆ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβˆ − r).
Proof: See Greene.

Theorem:
Under Ho: Rβo - r = 0,
E(β̃) = βo.

Cov(β̃) = Cov(βˆ) − σo²(X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹.


Proof:
Show it by yourself. Use the fact that for any pd matrix A, BAB′ is a psd matrix for any nonzero conformable matrix B.

Theorem:
Assume that (SIC.1)-(SIC.6) and (SIC.8) hold (whether (SIC.7) holds or not).
If Ho is correct, then β̃ is more efficient than βˆ.
Proof: Show it by yourself.

Linear Regressions under Ideal Conditions-87


Theorem
Let SSE = (y − Xβˆ)′(y − Xβˆ);  SSEr = (y − Xβ̃)′(y − Xβ̃). Then,
F = [(SSEr − SSE)/m]/s² = [(SSEr − SSE)/m]/[SSE/(T − k)].
Proof: See Greene.

Remark:
• Consider a model: yt = xt1β1 + xt2β2 + xt3β3 + xt4β4 + εt.
• Wish to test for Ho: β3,o = β4,o = 0.
• To find β̃, do OLS on:
(*) yt = xt1β1 + xt2β2 + εt .
• Denote the OLS estimates by β̃1 and β̃2. Then, the restricted OLS
estimate of β is given by β̃ = [β̃1, β̃2, 0, 0]′.
• Also, set SSE from (*) as SSEr.
• Test Ho: β2,o + β3,o = 1 and β4,o = 0.
• yt = xt1β1 + xt2β2 + xt3β3 + xt4β4 + εt.
→ yt = xt1β1 + xt2β2 + xt3(1-β2) + εt.
→ yt - xt3 = xt1β1 + (xt2-xt3)β2 + εt . (**)
• Do OLS on (**) and get β̃1 and β̃2. Set β̃3 = 1 − β̃2 and β̃4 = 0. Set
SSEr = SSE from OLS on (**).
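
A small Gauss sketch of this substitution approach (the data and coefficient values below are hypothetical, generated inside the program purely for illustration):

/* restricted OLS for Ho: b2 + b3 = 1 and b4 = 0 via substitution, as in (**) */
new;
seed = 1;  t = 50;
x1 = ones(t,1);  x2 = rndns(t,1,seed);  x3 = rndns(t,1,seed);  x4 = rndns(t,1,seed);
y  = 1 + 0.4*x2 + 0.6*x3 + 0.3*rndns(t,1,seed);   @ hypothetical data satisfying Ho @
ystar = y - x3;                                   @ move x3 to the left-hand side @
xstar = x1~(x2 - x3);                             @ regressors of the transformed model @
bstar = invpd(xstar'*xstar)*(xstar'*ystar);
b2t = bstar[2,1];  b3t = 1 - b2t;  b4t = 0;       @ restricted estimates of b2, b3, b4 @
e = ystar - xstar*bstar;
sser = e'*e;                                      @ SSEr to be used in the F statistic @
print sser;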

Linear Regressions under Ideal Conditions-88


Theorem
Let β̃1 be the OLS estimator of β1 for the model yt = β1 + εt. Then, β̃1 = ȳ.
Proof: Do this by yourself.

Theorem: (Overall Significance F Test)


The model is given:
yt = xt1β1 + xt2β2 + ... + xtkβk + εt. (*)
Assume that this model satisfies all of SIC (including SIC.7). Consider Ho: β2,o
= ... = βk,o = 0. The F-statistic for this hypothesis is given by
F = [(T − k)/(k − 1)]·[R²/(1 − R²)] ~ f(k-1, T-k),
where R2 is from the original model (*).

Linear Regressions under Ideal Conditions-89


Example:
• Consider WAGE2.WF1
• Data: (WAGE2.WF1 or WAGE2.TXT – from Wooldridge’s website)
# of observations (T): 935
1. wage monthly earnings
2. hours average weekly hours
3. IQ IQ score
4. KWW knowledge of world work score
5. educ years of education
6. exper years of work experience
7. tenure years with current employer
8. age age in years
9. married =1 if married
10. black =1 if black
11. south =1 if live in south
12. urban =1 if live in SMSA
13. sibs number of siblings
14. brthord birth order
15. meduc mother's education
16. feduc father's education
17. lwage natural log of wage

• Estimate the Mincerian wage equation:


log(wage) = β1 + β2Educ + β3Exper + β4Exper2 + ε

Linear Regressions under Ideal Conditions-90


Estimation Results by Eviews:

Dependent Variable: LWAGE


Method: Least Squares
Sample: 1 935
Included observations: 935

Variable Coefficient Std. Error t-Statistic Prob.

C 5.517432 0.124819 44.20360 0.0000


EDUC 0.077987 0.006624 11.77291 0.0000
EXPER 0.016256 0.013540 1.200595 0.2302
EXPER^2 0.000152 0.000567 0.268133 0.7887

R-squared 0.130926 Mean dependent var 6.779004


Adjusted R-squared 0.128126 S.D. dependent var 0.421144
S.E. of regression 0.393240 Akaike info criterion 0.975474
Sum squared resid 143.9675 Schwarz criterion 0.996183
Log likelihood -452.0343 F-statistic 46.75188
Durbin-Watson stat 1.788764 Prob(F-statistic) 0.000000
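
• As a check on the Overall Significance F Test above: F = [(T − k)/(k − 1)]·R²/(1 − R²) = (931/3)·(0.130926/0.869074) ≈ 46.75, which matches the reported F-statistic 46.75188 up to rounding.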

• Ho: Education does not improve individuals’ productivity.


• Ha: Education matters, but its effect could be either positive or negative.
→ Ho: β2,o = 0 Vs. Ha: β2,o ≠ 0.

→ t = (βˆ2 − 0)/se(βˆ2) = 11.77291;  c = 1.96 at 5% significance level.

→ Since t ∉ (-1.96, 1.96), reject Ho!


→ P-value for this t statistic = 0.0000 < α = 0.05, so reject Ho.

Linear Regressions under Ideal Conditions-91


• Ho: Education does not improve individuals’ productivity.
Ha: Education improves individuals’ productivity.
→ Ho: β2,o = 0 Vs. Ha: β2,o > 0.
→ t = (βˆ2 − 0)/se(βˆ2) = 11.77291;  c = 1.645 at 5% significance level.
Since c < t, reject Ho in favor of Ha.

• Ho: Work experience does not improve individuals’ productivity.


→ Ho: β3,o = β4,o = 0.
Ha: Work experience matters.
→ Ha: β3,o ≠ 0 and/or β4,o ≠ 0.

Wald Test:
Equation: Untitled

Null Hypothesis: C(3)=0, C(4)=0

F-statistic 17.94867 Probability 0.000000


Chi-square 35.89734 Probability 0.000000

→ F = 17.94867; c from f(2,931) ≈ 3.00 (at α = 5%).


→ Reject Ho.
→ Or, p-val of F = 0.0000 < 0.05 = α. So, reject Ho.
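→ Note that the reported Chi-square statistic is simply m times the F-statistic: 2 × 17.94867 = 35.89734. (The relation WT = mF is discussed later under the weak ideal conditions.)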

Linear Regressions under Ideal Conditions-92


Example: (Cobb-Douglas production function)
• Setup: L = labor; K = capital; Q = output.
• The Cobb-Douglas production function is given:
Qt = ALt β2 Kt β3 eε t ,
where A is constant. Taking log for both sides, we have:
(*) log(Qt ) = β1 + β 2 log( Lt ) + β 3 log( Kt ) + ε t ,
where β1 = ln(A).
• Estimation: Do OLS on (*), and estimate β’s.
• Interpretation of β’s:
β2 = ∂ log(Qt ) / ∂ log( Lt ) = Elasticity of output with respect to labor.
β3 = ∂ log(Qt ) / ∂ log( Kt ) = Elasticity of output with respect to capital.
β2 + β3 = degree of returns to scale (r)
[increasing returns to scale if r > 1]
• Using F- or t-test methods, we can test Ho: β2,o + β3,o = 1.
• A drawback of Cobb-Douglas
• When you use the Cobb-Douglas production function, you are assuming
that the elasticities are constant over different levels of L and K. In
reality, elasticities might change over different L and K.

Linear Regressions under Ideal Conditions-93


Example: (Translog Production Function)
• Setup:
log(Qt) = β1 + β2log(Lt) + β3log(Kt) + β4(log(Lt))²/2 + β5(log(Kt))²/2
          + β6(log(Lt))(log(Kt)) + εt.

• Testing Cobb-Douglas:
• Do a F-test for Ho: β 4,o = β 5,o = β 6,o = 0 .

• Estimating elasticities:
• Let log ( L) and log( K ) be chosen values of log(Lt) and log(Kt).
[You may choose sample means.]
• Observe that ηQL = ∂ log(Q ) / ∂ log( L) = β2 + β4log(L) + β6log(K).
• Thus, a natural estimate of ηQL is given:
ηˆQL = βˆ2 + βˆ4 log( L) + βˆ6 log( K ) = R βˆ ,

where R = (0,1,0, log( L) ,0, log( K ) ).

• var(ηˆQL) = var(Rβˆ) = RCov(βˆ)R′.

Thus, se(ηˆQL) = √[RCov(βˆ)R′], with Cov(βˆ) estimated by s²(X′X)⁻¹.
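
A minimal Gauss sketch of this calculation (all numbers below are hypothetical placeholders for the translog OLS output):

/* elasticity of output w.r.t. labor and its standard error at chosen log(L), log(K) */
bhat = { 0.10, 0.30, 0.50, -0.02, -0.01, 0.04 };   @ hypothetical translog estimates @
covb = 0.01*eye(6);                                @ hypothetical estimate of Cov(bhat) @
logl = 4.0;  logk = 3.0;                           @ chosen values, e.g., sample means @
R = 0~1~0~logl~0~logk;                             @ R = (0, 1, 0, log(L), 0, log(K)) @
etahat = R*bhat;                                   @ point estimate of eta_QL @
se_eta = sqrt(R*covb*R');                          @ se(eta_QL) @
print etahat~se_eta;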

Linear Regressions under Ideal Conditions-94


[Proofs of the theorems related with F-statistic]
Theorem:
Under Ho: Rβo = r,
F = {(Rβˆ − r)′[R(X′X)⁻¹R′]⁻¹(Rβˆ − r)/m} / [SSE/(T − k)] ~ f(m, T − k).
Proof:
Let g = ( R βˆ − r )′[ R ( X ′X ) −1 R′]−1 ( R βˆ − r ) / σ o2 ; and let h = SSE / σ o2 =

(T − k ) s 2 / σ o2 . Note that F = (g/m)/[h/(T-k)]. We already know that h ~


χ2(T-k). Therefore, we can complete the proof by showing that (i) g is χ2(m),
and that (ii) g and h are stochastically independent.
(i) Note that under Ho,
R βˆ − r = R βˆ − R β = R( βˆ − β ) = R( X ′X ) −1 X ′ε .
Therefore, we have:
g = ε′X(X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹X′ε / σo² ≡ ε′Qε/σo².
We can see that Q is symmetric and idempotent with Rank(Q) = m. Since ε ~
N (0T ×1 , σ o2 I T ) , g ~ χ2(m). [See Schmidt.]

(ii) h = SSE/σo² = ε′M(X)ε/σo² ~ χ²(T-k). Note that M(X)Q = 0. Therefore, g


and h are stochastically independent. [See Schmidt.]

Linear Regressions under Ideal Conditions-95


Theorem:
Under Ho: Rβo - r = 0,
E(β̃) = βo;

Cov(β̃) = Cov(βˆ) − σo²(X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹.


Proof:
(i) β̃ = βˆ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβˆ − r)
       = βˆ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβo + R(X′X)⁻¹X′ε − r)
       = βo + (X′X)⁻¹X′ε − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹X′ε
       = βo + [(X′X)⁻¹X′ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹X′]ε

→ E(β̃) = βo.

(ii) Derive Cov(β̃) by yourself.

Theorem (Overall Significance Test)


The model is given:
(*) yt = xt1β1 + xt2β2 + ... + xtkβk + εt.
The null hypothesis is given by Ho: β2,o = ... = βk,o = 0. Assume that
(SIC.1)-(SIC.8) (including (SIC.7)) hold. Then, the F-statistic for Ho is given
by:
F = [(T − k)/(k − 1)]·[R²/(1 − R²)] ~ f(k-1, T-k),

where R2 is from the above-unrestricted model (*).

Linear Regressions under Ideal Conditions-96


Proof:
The restricted model is given by: yt = β1 + εt. Since β̃1 = ȳ,

SSEr = (y − Xβ̃)′(y − Xβ̃) = Σt(yt − xt•′β̃)²

     = Σt(yt − β̃1 − xt2β̃2 − ... − xtkβ̃k)²

     = Σt(yt − β̃1)² = Σt(yt − ȳ)² = SST.
Observe that:
F = [(SSEr − SSEu)/(k − 1)]/[SSEu/(T − k)] = [(T − k)/(k − 1)]·[(SST − SSE)/SSE]
  = [(T − k)/(k − 1)]·[(1 − SSE/SST)/(SSE/SST)] = [(T − k)/(k − 1)]·[R²/(1 − R²)].

Linear Regressions under Ideal Conditions-97


[8] Tests of Structural Changes
(1) Motivation:
Relationships among economic variables may change over time or across
different genders (Ch. 7.4 in Greene)

Example 1:
Oil shocks during 70’s may have changed firms’ production functions
permanently.
Example 2:
Effects of schooling on wages may be different over different regions. [Why?
Perhaps because of different industries across different regions.]

• Data: (WAGE2.WF1 or WAGE2.TXT – from Wooldridge’s website)


# of observations (T): 935
1. wage monthly earnings
2. hours average weekly hours
3. IQ IQ score
4. KWW knowledge of world work score
5. educ years of education
6. exper years of work experience
7. tenure years with current employer
8. age age in years
9. married =1 if married
10. black =1 if black
11. south =1 if live in south
12. urban =1 if live in SMSA
13. sibs number of siblings
14. brthord birth order
15. meduc mother's education
16. feduc father's education
17. lwage natural log of wage

Linear Regressions under Ideal Conditions-98


• Mincerian wage equation for people living in South (A):
Dependent Variable: LWAGE
Sample(adjusted): 28 935 IF SOUTH = 1
Included observations: 319 after adjusting endpoints

Variable Coefficient Std. Error t-Statistic Prob.


C 4.860469 0.233695 20.79831 0.0000
EDUC 0.101053 0.012594 8.024086 0.0000
EXPER 0.053960 0.024386 2.212751 0.0276
EXPER^2 -0.001007 0.001009 -0.997829 0.3191

R-squared 0.179628 Mean dependent var 6.665056


Adjusted R-squared 0.171815 S.D. dependent var 0.450349
S.E. of regression 0.409838 Akaike info criterion 1.066352
Sum squared resid 52.90976 Schwarz criterion 1.113565
Log likelihood -166.0832 F-statistic 22.99070
Durbin-Watson stat 1.755004 Prob(F-statistic) 0.000000

• Mincerian wage equation for people living in Non-South (B):


Dependent Variable: LWAGE
Sample(adjusted): 1 910 IF SOUTH = 0
Included observations: 616 after adjusting endpoints

Variable Coefficient Std. Error t-Statistic Prob.


C 5.893468 0.143314 41.12270 0.0000
EDUC 0.063453 0.007563 8.389865 0.0000
EXPER -0.002798 0.015758 -0.177542 0.8591
EXPER^2 0.000744 0.000664 1.120953 0.2627

R-squared 0.103200 Mean dependent var 6.838013


Adjusted R-squared 0.098804 S.D. dependent var 0.392769
S.E. of regression 0.372861 Akaike info criterion 0.871250
Sum squared resid 85.08351 Schwarz criterion 0.899973
Log likelihood -264.3451 F-statistic 23.47553
Durbin-Watson stat 1.852473 Prob(F-statistic) 0.000000

Linear Regressions under Ideal Conditions-99


• Question:
• βA1,o = βB1,o, βA2,o = βB2,o, βA3,o = βB3,o and βA4,o = βB4,o?
• If so, we can pool all observations to estimate:
(C) lwaget = β1 + β2educt + β3expert + β4expert2 + εt, t = 1, ... , T.

Dependent Variable: LWAGE


Method: Least Squares
Date: 02/05/02 Time: 13:57
Sample: 1 935
Included observations: 935

Variable Coefficient Std. Error t-Statistic Prob.

C 5.517432 0.124819 44.20360 0.0000


EDUC 0.077987 0.006624 11.77291 0.0000
EXPER 0.016256 0.013540 1.200595 0.2302
EXPER^2 0.000152 0.000567 0.268133 0.7887

R-squared 0.130926 Mean dependent var 6.779004


Adjusted R-squared 0.128126 S.D. dependent var 0.421144
S.E. of regression 0.393240 Akaike info criterion 0.975474
Sum squared resid 143.9675 Schwarz criterion 0.996183
Log likelihood -452.0343 F-statistic 46.75188
Durbin-Watson stat 1.788764 Prob(F-statistic) 0.000000

• Question:
How can we test Ho: βA1,o = βB1,o, βA2,o = βB2,o, βA3,o = βB3,o and βA4,o = βB4,o?

Linear Regressions under Ideal Conditions-100


(2) General Framework
Model For Group A:
(A) yAt = βA1 + βA2xAt2 + ... + βAkxAtk + εAt , t = 1, ... , TA.
Model For Group B:
(B) yBt = βB1 + βB2xBt2 + ... + βBkxBtk + εBt, t = 1, ... , TB.
Under Ho: βAj,o = βBj,o for any j = 1, ... , k (k restrictions),
we can pool the data to estimate
(C) yt = β1 + β2xt2 + ... + βkxtk + εt, t = 1, ... , T ( = TA+TB).

Assume that var(εAt) = var(εBt) = σ o2 .

(3) Chow-Test Procedure.


STEP 1: Do OLS on (C) and get SSEC.
STEP 2: Do OLS on (A) and (B); then get SSEA and SSEB.
STEP 3: Compute the Chow-Test statistic.
Under Ho,
FCHOW = [(SSEC − SSEA − SSEB)/k] / [(SSEA + SSEB)/(TA + TB − 2k)] ~ f(k, TA + TB − 2k).

Linear Regressions under Ideal Conditions-101


Example: Back to the Mincerian wage equation.
STEP 1: OLS results from all (SSEC = 143.9675; TA+TB = 935).
STEP 2: OLS results from South (SSEA = 52.90976; TA = 319).
OLS results from Non-South (SSEB = 85.08351; TB = 616).
STEP 3: Compute the Chow statistic:
FCHOW = [(SSEC − SSEA − SSEB)/k] / [(SSEA + SSEB)/(TA + TB − 2k)]
      = [(143.9675 − 85.08351 − 52.90976)/4] / [(85.08351 + 52.90976)/(935 − 8)]
      = 10.033299
c from f(4,927) = 2.37 at 5% significance level. Since F > c, we
reject Ho. There is a structural difference between South and
Non-South.
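
A quick Gauss check of this arithmetic (numbers taken from the steps above):

/* Chow test for South vs. Non-South */
ssec = 143.9675;  ssea = 52.90976;  sseb = 85.08351;
ta = 319;  tb = 616;  k = 4;
fchow = ((ssec - ssea - sseb)/k)/((ssea + sseb)/(ta + tb - 2*k));
print fchow;      @ = 10.03 > c = 2.37 from f(4,927), so reject Ho @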

Linear Regressions under Ideal Conditions-102


[Proof for Chow test]
• Assume εAt and εBt are iid N (0, σ o2 ) .

• Unrestricted Model: Merge Models (A) and (B):


Model A: yA = XAβA + εA
Model B: yB = XBβB + εB

→ (*)  [yA; yB] = [XA, 0_{TA×k}; 0_{TB×k}, XB][βA; βB] + [εA; εB]   →   y = X*β* + ε* .

(# of obs (T) = TA+TB; # of regressors = 2k)


→ OLS on (*): βˆ* = (X*′X*)⁻¹X*′y = [βˆA; βˆB].
→ SSE from this regression = SSE* = SSEA + SSEB [Why?].

• Restricted model:
βA,o = βB,o (let us denote them by β): k restrictions.
→ Merge model (A) and (B) with the restriction (Model C):
(**)  [yA; yB] = [XA; XB]β + [εA; εB]   →   y = Xβ + ε

→ OLS on this model (restricted OLS): βˆ = (X′X)-1X′y.


→ SSEr = SSEC.

Linear Regressions under Ideal Conditions-103


• F-test for βA,o = βB,o:
F = [(SSEr-SSEu)/k]/[SSEu/(T-2k)]
= [(SSEC-SSEA-SSEB)/k]/[(SSEA+SSEB)/(T-2k)].

• Chow test when var(εAt) ≠ var(εBt).


Under Ho: βA,o = βB,o,

WT (wald test) = ( βˆ A − βˆB )′[ s A2 ( X A′ X A ) −1 + sB 2 ( X B′ X B ) −1 ]−1 ( βˆ A − βˆB )

→ χ2(k).

• Alternative form of Chow test [Assuming var(εAt) = var(εBt).]


• Define a dummy variable:
dt = 1 if t ∈ A ;  dt = 0 if t ∈ B .

• Using all T observations, build up a model:


(*) yt = xt1β1 + ... + xtkβk + (dtxt1)βk+1 + ... + (dtxtk)β2k + εt.
• Note that
yt = xt1(β1+βk+1) + ... + xtk(βk+β2k) + εt, for t ∈ A ,
yt = xt1β1 + ... + xtkβk + εt, for t ∈ B .
• If no difference between A and B, βk+1 = ... = β2k = 0.
F test for Ho: βk+1,o = ... = β2k,o = 0 using OLS on (*) = Chow test!!!

Linear Regressions under Ideal Conditions-104


Example: Return to South V.S. Non-South
Dependent Variable: LWAGE
Method: Least Squares
Date: 02/05/02 Time: 16:01
Sample: 1 935
Included observations: 935

Variable Coefficient Std. Error t-Statistic Prob.

C 5.893468 0.148297 39.74107 0.0000


EDUC 0.063453 0.007826 8.107984 0.0000
EXPER -0.002798 0.016306 -0.171577 0.8638
EXPER^2 0.000744 0.000687 1.083291 0.2790
SOUTH -1.032999 0.265316 -3.893462 0.0001
SOUTH*EDUC 0.037600 0.014206 2.646802 0.0083
SOUTH*EXPER 0.056757 0.028159 2.015637 0.0441
SOUTH*EXPER^2 -0.001751 0.001172 -1.493727 0.1356

R-squared 0.166990 Mean dependent var 6.779004


Adjusted R-squared 0.160700 S.D. dependent var 0.421144
S.E. of regression 0.385824 Akaike info criterion 0.941648
Sum squared resid 137.9933 Schwarz criterion 0.983064
Log likelihood -432.2203 F-statistic 26.54749
Durbin-Watson stat 1.825679 Prob(F-statistic) 0.000000

Wald Test:
Equation: Untitled
Null Hypo.: C(5)=0
C(6)=0
C(7)=0
C(8)=0
F-statistic 10.03332 Probability 0.000000
Chi-square 40.13328 Probability 0.000000
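
→ Note that this F-statistic (10.03332) coincides with the Chow statistic computed directly from the SSEs above, and the reported Chi-square statistic is kF = 4 × 10.03332 = 40.13328.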

Linear Regressions under Ideal Conditions-105


(4) What if TB < k?
• Can’t estimate β for Group B.
• Alternative test procedure (Chow predictive test):
STEP 1: Do OLS on (C) and get SSEC.
STEP 2: Do OLS on (A); then get SSEA.
STEP 3: Compute an alternative Chow-test statistic. Under Ho,
FACHOW = [(SSEC − SSEA)/TB] / [SSEA/(TA − k)] ~ f(TB, TA − k).

• What is this?
yA = XAβ + εA                      for Group A;

yB = XBβ + I_{TB}γ + εB           for Group B,

where γ = (γ1, ..., γ_{TB})′.

•  [yB,1; yB,2; ...; yB,TB] = [xB,1•′, 1, 0, ..., 0; xB,2•′, 0, 1, ..., 0; ... ; xB,TB•′, 0, 0, ..., 1]·[β; γ1; ...; γTB] + [εB,1; εB,2; ...; εB,TB].

•  [yA; yB] = [XA (TA×k), 0 (TA×TB); XB (TB×k), I (TB×TB)]·[β; γ] + [εA; εB].
• SSEA = SSE from regression of the above model.
• FACHOW = F for Ho: γ 1 = ... = γ TB = 0 .

Linear Regressions under Ideal Conditions-106


[9] Forecasting

• Model: yt = β1xt1 + β2xt2 + ... + βkxtk + εt.


• Wish to predict y0 given x01, x02, ... , x0k.
• y0 = x0′ β + ε 0 , x0′ = ( x01 ,..., x0 k ) .

• ŷ0 = x0′ βˆ (point forecast of y0).

Theorem:
Under (SIC.1)-(SIC.6) and (SIC.8), ( y0 − yˆ 0 ) ~ N (0, σ o2 [1 + x0′ ( X ′X ) −1 x0 ]) .
Proof:
ŷ0 = x0′ βˆ = x0′ [ β o + ( X ′X ) −1 X ′ε ] = x0′ β o + x0′ ( X ′X ) −1 X ′ε .
y0 = x0′ βo + ε 0 .

→ y0 − yˆ 0 = ε 0 − x0′ ( X ′X ) −1 X ′ε .

→ Since ε0 and ε are normal, so is ( y0 − yˆ 0 ) .

→ E ( y0 − yˆ 0 ) = 0 .

→ var ( y0 − yˆ 0 ) = var(ε 0 − x0′ ( X ′X ) −1 X ′ε )

= var(ε 0 ) + var[ x0′ ( X ′X ) −1 X ′ε ]

= σ o2 + x0′ ( X ′X ) −1 X ′Cov(ε )[ x0′ ( X ′X ) −1 X ′]′

= σ o2 + σ o2 x0′ ( X ′X ) −1 x0 .

Linear Regressions under Ideal Conditions-107


Theorem:
Under (SIC.1)-(SIC.6) and (SIC.8), (y0 − ŷ0)/√[s²(1 + x0′(X′X)⁻¹x0)] ~ t(T − k).

Implication:
Let c be a critical value for two-tail t-test given a significance level (say, 5%):
Pr(−c < (y0 − ŷ0)/se(y0 − ŷ0) < c) = 0.95,

where se( y0 − yˆ 0 ) = s 2 (1 + x0′ ( X ′X ) −1 x0 ) . This implies that:

Pr( yˆ 0 − c × se < y0 < yˆ 0 + c × se) = 0.95 .

Forecasting Procedure:
STEP 1: Let x0′ = ( x01 , x02 ,..., x0k ) .

STEP 2: Compute ŷ0 = x0′ βˆ .

STEP 3: Compute se( y0 − yˆ 0 ) = s 2 (1 + x0′ ( X ′X ) −1 x0 ) .

STEP 4: From given df = T-k and confidence level, find c.


STEP 5: Form the 95% forecast interval: Pr(ŷ0 − c×se < y0 < ŷ0 + c×se) = 0.95.

Linear Regressions under Ideal Conditions-108


Numerical Example:
(X′X)⁻¹ = [0.1, 0, 0; 0, 5, −7; 0, −7, 10];  βˆ = [1.2; −1; 2];  T = 10;  s² = 14.514.
And x02 = 1 and x03 = 1.

STEP 1: Let x0′ = (1, x02 , x03 ) = (1,1,1) .

STEP 2: Compute ŷ0 = x0′βˆ = (1, 1, 1)·[1.2; −1; 2] = 2.2.

STEP 3: Compute se = √[s²(1 + x0′(X′X)⁻¹x0)] = √[14.514 × (1 + 1.1)] = 5.52.

STEP 4: From given df = 10-3 = 7 and α = 5%, c = 2.365.


STEP 5: ŷ0 - c×se = 2.2 - 2.365×5.52 = -10.855.

ŷ0 +c×se = 2.2 + 2.365×5.52 = 15.255.


Pr(-10.855 < yo < 15.255) = 0.95.
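
A minimal Gauss check of Steps 1–5 (numbers as given above):

/* 95% forecast interval for y0 at x0 = (1,1,1)' */
xxi  = { 0.1 0 0, 0 5 -7, 0 -7 10 };         @ (X'X)^(-1) @
bhat = { 1.2, -1, 2 };
s2 = 14.514;  c = 2.365;                     @ c from t(7) at the 5% level @
x0 = { 1, 1, 1 };
yhat0 = x0'*bhat;                            @ = 2.2 @
se = sqrt(s2*(1 + x0'*xxi*x0));              @ = 5.52 @
print (yhat0 - c*se)~(yhat0 + c*se);         @ about -10.86 and 15.26 @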

Linear Regressions under Ideal Conditions-109


“Dynamic” and “Static” Forecasts in Eviews
• For the analysis of cross-section data, they are the same.
• For the analysis of time-series data, they could be different.
• When a regression model uses lagged dependent variables as regressors, it is
called a dynamic model.
• Consider a simple dynamic model yt = β1 + β2yt-1 + εt.
• “Dynamic” Forecast [Multiple Period Forecast]: Suppose you estimate
β’s using observations up to t = 100. Using the estimates, you would like
to forecast y101 and y102. For this case, if you use “dynamic forecast”,
Eviews will compute point forecasts of y101 and y102 by
ŷ101 = βˆ1 + βˆ2 y100 ; yˆ102 = βˆ1 + βˆ2 yˆ101 .

• “Static” Forecast [One Period Forecast]: If you choose “static forecast”,


Eviews will compute point forecasts of y101 and y102 by
ŷ101 = βˆ1 + βˆ2 y100 ; ŷ102 = βˆ1 + βˆ2 y101 .
Observe that “static forecast” uses y101 instead of ŷ101 to forecast y102.

• If you have data points up to t = 100, and if you would like to forecast y at
t = 101 and t = 102, you had better use “dynamic forecast.”
• The formula of forecasting standard errors taught in the class can be used
for static forecasts. But the standard errors for dynamic forecasts are
much more complicated.
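
A tiny Gauss illustration of the two recursions above (all numbers hypothetical):

/* dynamic vs. static point forecasts for y_t = b1 + b2*y_{t-1} + e_t */
b1 = 0.5;  b2 = 0.8;                 @ hypothetical OLS estimates from t = 1,...,100 @
y100 = 2.0;  y101 = 2.3;             @ y100 is observed; y101 is observed one period later @
yhat101 = b1 + b2*y100;              @ both methods @
yhat102dyn = b1 + b2*yhat101;        @ dynamic: plugs in the forecast yhat101 @
yhat102sta = b1 + b2*y101;           @ static: plugs in the realized y101 @
print yhat101~yhat102dyn~yhat102sta;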

Linear Regressions under Ideal Conditions-110


[Exercise for Static Forecast]
• Use ECN2002.wf1 (data from 1959:1 to 1995:12).
• For the definitions of the variables, see ECN2002.XLS.
• Forecasting ldpi = log(DPI) using regression results from 1959:1 to
1995:12.

Dependent Variable: LDPI


Method: Least Squares
Date: 02/07/02 Time: 11:31
Sample(adjusted): 1959:07 1995:12
Included observations: 438 after adjusting endpoints

Variable Coefficient Std. Error t-Statistic Prob.

C 0.008851 0.003062 2.890236 0.0040


LDPI(-1) 0.802184 0.047680 16.82446 0.0000
LDPI(-2) 0.130495 0.061254 2.130386 0.0337
LDPI(-3) 0.086545 0.061535 1.406419 0.1603
LDPI(-4) 0.045344 0.061534 0.736894 0.4616
LDPI(-5) 0.078119 0.061248 1.275461 0.2028
LDPI(-6) -0.143010 0.047695 -2.998423 0.0029

R-squared 0.999933 Mean dependent var 7.280527


Adjusted R-squared 0.999932 S.D. dependent var 0.889422
S.E. of regression 0.007340 Akaike info criterion -6.97510
Sum squared resid 0.023220 Schwarz criterion -6.90986
Log likelihood 1534.547 F-statistic 1069361.
Durbin-Watson stat 2.014603 Prob(F-statistic) 0.000000

Linear Regressions under Ideal Conditions-111


[Eviews forecast graph and evaluation for the static forecast LDPIFS (actual: LDPI; forecast sample 1996:01–2001:12; 71 included observations):]
Root Mean Squared Error        0.005262
Mean Absolute Error            0.003230
Mean Abs. Percent Error        0.036666
Theil Inequality Coefficient   0.000300
  Bias Proportion              0.106970
  Variance Proportion          0.005447
  Covariance Proportion        0.887582

[Plot of LDPI and LDPIFS with forecast bands UPPERBS and LOWERBS, 1996–2001.]

Linear Regressions under Ideal Conditions-112


[Exercise for Dynamic Forecast]

[Eviews forecast graph and evaluation for the dynamic forecast LDPIFD (actual: LDPI; forecast sample 1996:01–2001:12; 71 included observations):]
Root Mean Squared Error        0.057155
Mean Absolute Error            0.049899
Mean Abs. Percent Error        0.565546
Theil Inequality Coefficient   0.003247
  Bias Proportion              0.762216
  Variance Proportion          0.221227
  Covariance Proportion        0.016557

[Plot of LDPI and LDPIFD with forecast bands UPPERBD and LOWERBD, 1996–2001.]

Linear Regressions under Ideal Conditions-113


[10] Nonnormal ε and Stochastic Regressors

(1) Motivation
• If the regressors xt• are stochastic, all t and F tests are wrong (bad news).
• The t and F tests require the OLS estimator βˆ to be unbiased.

• Recall how we have shown the unbiasedness of βˆ under (SIC.8):

βˆ = βo + (X′X)⁻¹X′ε
→ E(βˆ) = βo + E[(X′X)⁻¹X′ε] =(?) βo + (X′X)⁻¹X′E(ε),
where the step marked (?) is valid only if X can be treated as nonstochastic.

• Unbiasedness of βˆ does not require nonstochastic regressors. It only


requires:
E ( ε t | x1• ,..., xT • ) = 0 , for all t. (*)

Or E (ε | X ) = 0T ×1 .
Under this assumption,

E(βˆ) = E_X[E(βˆ|X)] = E_X[E(βo + (X′X)⁻¹X′ε | X)]
      = E_X[βo + (X′X)⁻¹X′E(ε|X)] = E_X(βo) = βo.

• But, for some cases, condition (*) does not hold. For example, xt• = yt-1. In
this case, E(εt-1|yt-1) ≠ 0. For this case, we can no longer say that βˆ is an
unbiased estimator.
• An example for models with lagged dependent variables as regressors:
yt = β1xt1 + β2xt2 + β3yt-1 + εt . → β2/(1-β3) = long-run effect of xt2.

Linear Regressions under Ideal Conditions-114


• If the εt are not normally distributed, all t and F tests are wrong (bad news).
• Can we use them if T is large?
• Recall that the t and F statistics follow t and f distributions, respectively,
only if βˆ is normally distributed. But if the εt are not normally distributed,

βˆ is no longer normal.

Digression to Mathematical Statistics


Large-Sample Theories
1. Motivation:
• θˆT : An estimator from a sample of size T, {x1, ... , xT}.
I use subscript “T” to emphasize the fact that an estimator is a function of
sample size T.
• What would be the statistical properties of θˆT when T is infinitely large?

• What do we wish?
[We wish the distribution of θˆT would become more condensed around
θo as T increases.]

Linear Regressions under Ideal Conditions-115


2. Main Points:
Rough Definition of Consistency
• Suppose that the distribution of θˆT becomes more and more condensed

around θo as T increases. Then, we say that θˆT is a consistent estimator.


And we use the following notation:
plimT→∞θˆT = θo (or θˆT →p θo).

• The law of large numbers (LLN) says that the sample mean x̄T (the mean of a
sample of size T) is a consistent estimator of μo. What does it mean?

• Gauss Exercise:
• A population with N(1,9).
• 1000 different random samples of T = 10 to compute x10 .

• 1000 different random samples of T = 100 to compute x100 .

• 1000 different random samples of T =5000 to compute x5000 .

Linear Regressions under Ideal Conditions-116


• conmonte.prg
/*
** Monte Carlo Program to Demonstrate Consistency of the Sample Mean
*/

@ Data generation from N(1,9) @

seed = 1;
tt1 = 10; @ # of observations @
tt2 = 100; @ # of observations @
tt3 = 1500; @ # of observations @
iter = 1000; @ # of sets of different data @

storx10 = zeros(iter,1) ;
storx100 = zeros(iter,1) ;
storx5000 = zeros(iter,1);

i = 1; do while i <= iter;

@ compute sample mean for each sample @

x10 = 1 + 3*rndns(tt1,1,seed);
x100 = 1 + 3*rndns(tt2,1,seed);
x5000 = 1 + 3*rndns(tt3,1,seed);
storx10[i,1] = meanc(x10);
storx100[i,1] = meanc(x100);
storx5000[i,1] = meanc(x5000);

i = i + 1; endo;

@ Reporting Monte Carlo results @

library pgraph;
graphset;

v = seqa(-2, .05, 120);


ytics(0,25,0.1,0);
@ {a1,a2,a3}=histp(storx10,v); @
@ {b1,b2,b3}=histp(storx100,v); @
{b1,b2,b3}=histp(storx5000,v);

Linear Regressions under Ideal Conditions-117


[Histograms of the 1,000 sample means x10, x100, and x5000.]

Linear Regressions under Ideal Conditions-118


• Relation between unbiasedness and consistency:
• Biased estimators could be consistent.
Example: Suppose that θT is unbiased and consistent.

Define θˆT = θT + 1/T.

Clearly, E(θˆT ) = θo + 1/T ≠ θo (biased).

But, plimT→∞θˆT = plimT→∞θT = θo (consistent).

• An unbiased estimator θˆT is consistent if var(θˆT) → 0 as T → ∞.

Example: Suppose that {x1, ..., xT} is a random sample from N ( μo , σ o2 ) .


E( xT ) = μo.

var( xT ) = σo2/T → 0 as T → ∞.
Thus, xT is a consistent estimator of μo.

Linear Regressions under Ideal Conditions-119


Law of Large Numbers (LLN)

Case of scalar random variables


• Kolmogorov's Strong LLN:
Suppose that {x1, ... , xT} is a random sample from a population with
finite µ and σ2. Then, plimT→∞ xT = μo.

• Generalized Weak LLN (GWLLN):


• {x1, ... , xT} is a sample (not necessarily a random sample)
• Define E(x1) = μ1,o, ... , E(xT) = μT,o.
• The variances of the xt (t = 1, …, T) are finite and may be different over
different t.
• Then, under suitable assumptions, plimT→∞ x̄T = limT→∞ (1/T)Σt μt,o.

Case of Vector Random Variables


• GWLLN
• xt: p×1 random vector.
• {x1, ... , xT} is a sample.
• Let E(x1) = μ1,o (p×1), ... , E(xT) = μT,o.
• Assume that the Cov(xt) are well-defined and finite.
• Then, under suitable assumptions, plimT→∞ x̄T = limT→∞ (1/T)Σt μt,o.

Linear Regressions under Ideal Conditions-120


Central Limit Theorems (CLT) –Asymptotic Normality

Case of scalar random variables


• Motivation:
• Suppose that {x1, ... , xT} is a random sample from a population with
finite μ and σ2.
• We know xT → μo as T → ∞. But we can never have an infinitely large
sample!!!
• For finite T, xT is still a random variable. What statistical distribution
could approximate the true distribution of xT ?
• Lindberg-Levy CLT:
• Suppose that {x1, ... , xT} is a random sample from a population with
finite μ and σ2.

• Then, √T(x̄T − μo) →d N(0, σo²) and √T(x̄T − μo)/σo →d N(0,1).
• Implication of CLT:
• √T(x̄T − μo) ≈ N(0, σo²), if T is large.

• E[√T(x̄T − μo)] = √T[E(x̄T) − μo] ≈ 0 → E(x̄T) ≈ μo.

• var[√T(x̄T − μo)] = T·var(x̄T − μo) = T·var(x̄T) ≈ σo²
→ var(x̄T) ≈ σo²/T.

• x̄T ≈ N(μo, σo²/T), if T is large.
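• For instance, in the earlier Monte Carlo exercise the population was N(1, 9); for T = 100 the CLT approximation gives x̄100 ≈ N(1, 9/100), i.e., a standard deviation of about 0.3 around μo = 1.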

Linear Regressions under Ideal Conditions-121


Case of random vectors
• GCLT
• {y1, ... , yT}: a sequence of p×1 random vectors.
• For any t, E(yt) = 0p×1 and Cov(yt) is well defined and finite.
• Under some suitable conditions (acceptable for Econometrics I, II),
(1/√T)Σt yt →d N(0_{p×1}, limT→∞ (1/T)Cov(Σt yt))
• Note:
• Cov(yt) [var(yt) if yt is a scalar] could differ across different t.
• The yt could be correlated as long as limn→∞cov(yt,yt+n) = 0 (ergodic).
• If E ( yt | yt −1 , yt −2 ,...) = 0 (Martingale Difference Sequence), the yt’s
are linearly uncorrelated. Then,
(1/√T)Σt yt →d N(0_{p×1}, limT→∞ (1/T)Σt Cov(yt)).

End of Digression

Linear Regressions under Ideal Conditions-122


(2) Weak Ideal Conditions (WIC)
Consider the following linear regression model:
yt = xt′• β + ε t = xt1β1 + xt 2 β 2 + ... + xtk β k + ε t .

(WIC.1) The conditional mean of yt (dependent variable) given x1•, x2•, ... , xt•,
ε1, ... , and εt-1 is linear in xt•:
yt = E ( yt | x1• ,..., xt • , ε1 ,..., ε t −1 ) + ε t = xt′• βo + ε t .

Comment:
• Implies E (ε t | x1• , x2• ,..., xt • , ε1 , ε 2 ,..., ε t −1 ) = 0 .
• No autocorrelation in the εt: cov(ε t , ε s ) = 0 for all t ≠ s.
• Regressors are weakly exogenous and need not be strictly exogenous.
• E(xs•εt) = 0_{k×1} for all t ≥ s, but it could be that E(xs•εt) ≠ 0 for some s > t.

(WIC.2) βo is unique.

(WIC.3) The series {xt•} are covariance-stationary and ergodic.

Comment:
• (WIC.2)-(WIC.3) implies that
p limT →∞ T −1 X ′X = p limT →∞ T −1Σt xt • xt′• ≡ Qo is finite and pd.

• Qo = limT→∞ T⁻¹Σt E(xt•xt′•) [By GWLLN].


• Rules out perfect multicollinearity among regressors.

Linear Regressions under Ideal Conditions-123


(WIC.4) The data need not be a random sample.

(WIC.5) var(ε t | x1• , x2• ,..., xt • , ε1 ,..., ε t −1 ) = σ o2 for all t.


(No-Heteroskedasticity Assumption).

(WIC.6) The error terms εt are normally distributed conditionally on x1•, … , xt•,
ε1, … , εt-1.

(WIC.7) xt1 = 1, for all t = 1, ... , T.

Comment:
SIC → WIC.

Linear Regressions under Ideal Conditions-124


(3) Statistical Properties of the OLS estimator under WIC:
Theorem (Consistency/Asymptotic Normality Theorem):
Under (WIC.1)-(WIC.5),
p limT →∞ βˆ = β o (consistent).

p limT →∞ s 2 = σ o2 (consistent).

√T(βˆ − βo) →d N(0_{k×1}, σo²Qo⁻¹).

Implication:
βˆ ≈ N ( β o , σ o 2 (TQo ) −1 ) → βˆ ≈ N ( β o , s 2 ( X ′X ) −1 ) ,

if T is reasonably large.

Implication:
1) t test for Ho: Rβo - r = 0 (R: 1×k, r: scalar) is valid if T is large.
Use z-table to find critical value.
2) For Ho: Rβo - r = 0 (R: m×k, r: m×1),
use WT = mF which is asymptotically χ2(m) distributed. [Why?]
• WT = (Rβˆ − r)′[RCov(βˆ)R′]⁻¹(Rβˆ − r) = (Rβˆ − r)′[Rs²(X′X)⁻¹R′]⁻¹(Rβˆ − r) = mF.

Theorem (Efficiency Theorem):


Under (WIC.1)-(WIC.6), the OLS estimators are efficient asymptotically.

Linear Regressions under Ideal Conditions-125


(4) Testing Nonlinear restrictions:
General form of hypotheses:
• Let w(θ) = [w1(θ),w2(θ), ... , wm(θ)]′, where wj(θ) = wj(θ1, θ2, ... , θp) = a
function of θ1, ... , θp.
• Ho: The true θ (θo) satisfies the m restrictions, w(θ) = 0m×1 (m ≤ p).

Examples:
1) θ: a scalar
Ho: θo = 2 → Ho: θo - 2 = 0 → Ho: w(θ) = 0, where w(θ) = θ - 2.
2) θ = (θ1, θ2, θ3)′.
Ho: θ1,o2 = θ2,o + 2 and θ3,o = θ1,o + θ2,o.
→ Ho: θ1,o2-θ2,o-2 = 0 and θ3,o-θ1,o-θ2,o = 0.
→ Ho: w(θ) = [w1(θ); w2(θ)] = [θ1² − θ2 − 2; θ3 − θ1 − θ2] = [0; 0].

3) linear restrictions
θ = [θ1, θ2, θ3]′.
Ho: θ1,o = θ2,o + 2 and θ3,o = θ1,o + θ2,o
→ Ho: w(θo) = [w1(θo); w2(θo)] = [θ1,o − θ2,o − 2; θ3,o − θ1,o − θ2,o] = [0; 0].
→ Ho: w(θo) = [1, −1, 0; −1, −1, 1][θ1,o; θ2,o; θ3,o] − [2; 0] = Rθo − r.

Linear Regressions under Ideal Conditions-126


Remark:
If all restrictions are linear in θ, Ho takes the following form:
Ho: Rθo - r = 0m×1,
where R and r are known m×p and m×1 matrices, respectively.

Definition:
W(θ) ≡ ∂w(θ)/∂θ′ = [ ∂w1(θ)/∂θ1, ∂w1(θ)/∂θ2, ..., ∂w1(θ)/∂θp;
                     ∂w2(θ)/∂θ1, ∂w2(θ)/∂θ2, ..., ∂w2(θ)/∂θp;
                     ... ;
                     ∂wm(θ)/∂θ1, ∂wm(θ)/∂θ2, ..., ∂wm(θ)/∂θp ]   (m×p).

Example: (Nonlinear restrictions)


Let θ = [θ1,θ2,θ3]′.
Ho: θ1,o2 - θ2,o = 0 and θ1,o - θ2,o - θ3,o2 = 0.
→ w(θ) = [θ1² − θ2; θ1 − θ2 − θ3²];  W(θ) = [2θ1, −1, 0; 1, −1, −2θ3].

Linear Regressions under Ideal Conditions-127


Example: (Linear restrictions)
θ = [θ1,θ2,θ3]′
Ho: θ1,o = 0 and θ2,o + θ3,o = 1.
→ w(θ) = [θ1; θ2 + θ3] − [0; 1] = [0; 0] → w(θ) = [1, 0, 0; 0, 1, 1][θ1; θ2; θ3] − [0; 1] = [0; 0],
which is of form w(θ ) = Rθ − r .

Theorem:
Under (WIC.1)-(WIC.5),

√T(w(βˆ) − w(βo)) →d N(0_{m×1}, W(βo)σo²Qo⁻¹W(βo)′).

Proof:
Taylor’s expansion around βo:

w(βˆ) = w(βo) + W(β̄)(βˆ − βo),

where β̄ lies between βˆ and βo. Since βˆ is consistent, so is β̄. Thus,

√T(w(βˆ) − w(βo)) ≈ W(βo)·√T(βˆ − βo)

→d N(0_{m×1}, W(βo)σo²Qo⁻¹W(βo)′).

Implication:

(w(βˆ) − w(βo)) ≈ N(0_{m×1}, W(βˆ)s²(X′X)⁻¹W(βˆ)′).

Linear Regressions under Ideal Conditions-128


Theorem:
Under (WIC.1)-(WIC.5) and Ho: w(βo) = 0,
WT = w(βˆ)′[W(βˆ)Cov(βˆ)W(βˆ)′]⁻¹w(βˆ) ⇒ χ²(m).
Proof:
Under Ho: w(βo) = 0,

w(βˆ) ≈ N(0_{m×1}, W(βˆ)Cov(βˆ)W(βˆ)′).

For a normal random vector hm×1 ~ N(0m×1,Ωm×m), h′Ω-1h ~ χ2(m). Thus, we


obtain the desired result.
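
As an illustration (not worked out in the notes): in the lagged-dependent-variable model mentioned earlier, yt = β1xt1 + β2xt2 + β3yt−1 + εt, the long-run effect of xt2 is β2/(1 − β3). To test Ho: β2,o/(1 − β3,o) = 1, set w(β) = β2/(1 − β3) − 1, so that W(β) = (0, 1/(1 − β3), β2/(1 − β3)²) (1×3). Then WT = w(βˆ)²/[W(βˆ)s²(X′X)⁻¹W(βˆ)′] is asymptotically χ²(1) under Ho.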

Question: What does “Wald test” mean?


A test based on the unrestricted estimator only.

Linear Regressions under Ideal Conditions-129


(5) When the WIC are violated:
CASE 1: Simple dynamic model, yt = βyt-1 + εt.
• SIC is violated. But WIC hold, if the εt are i.i.d. N(0, σo²) and -1 < βo < 1.
• If βo = 1, WIC is also violated. For this case, the OLS is consistent, but not
normally distributed.
• For simplicity, set y0 = 0.
• yt = Σ_{s=1}^t εs → var(yt) = E(yt²) = tσo².
• plim (1/T)Σt xt•xt•′ = plim (1/T)Σt yt−1² = lim (1/T)Σt E(yt−1²) (by GWLLN)
  = lim (1/T)Σt(t−1)σo² = lim (1/T)[T(T−1)/2]σo²
  = lim [(T−1)/2]σo² → ∞. (WIC.3 violated.)

CASE 2: Deterministic trend model, yt = βt + εt.


• plim (1/T)Σt xt•xt•′ = plim (1/T)Σt t² = lim (1/T)·[T(T+1)(2T+1)/6] → ∞.
• WIC.3 is violated. But OLS estimator is consistent and asymptotically
normal.

CASE 3: Simultaneous Equations models.


• (a) ct = β1,o + β2,oyt + εt ; (b) ct + it = yt
• (a) → (b): yt = β1,o + β2,oyt + εt + it.
• yt = [β1,o/(1-β2,o)] + it[1/(1-β2,o)] + εt/(1-β2,o).
• yt is correlated with εt in (a).
• OLS is inconsistent.

Linear Regressions under Ideal Conditions-130


CASE 4: Measurement errors:
• yt = βoxt* + εt (true model).
• But we can observe xt = xt* + vt (vt: measurement error).
• If we use xt for xt*,
yt = xtβo + [εt-βovt] (model we estimate).
• xt and (εt-βovt) correlated.
• OLS is inconsistent.

• yt* = βoxt + εt (true model).


• But we can observe yt = yt* + vt.
• If we use yt for yt*,
yt = xtβo + [εt+vt] (model we estimate)
xt and (εt+vt) uncorrelated.
• OLS is consistent.

Linear Regressions under Ideal Conditions-131


[Proofs of Consistency and Asymptotic Normality Theorems]
(1) Show p lim βˆ = β o .

βˆ = βo + (X′X)⁻¹X′ε = βo + (T⁻¹Σt xt•xt′•)⁻¹ T⁻¹Σt xt•εt.

p limT→∞ T⁻¹Σt xt•xt′• = Qo (by WIC.3)

p limT→∞ T⁻¹Σt xt•εt = limT→∞ T⁻¹Σt E(xt•εt) [by GWLLN]

= lim T⁻¹Σt 0 [by WIC.1] = 0.
→ p limT→∞ βˆ = βo + (Qo)⁻¹·0 = βo.

(2) Show plim s2 = σo2.


plim s2 = plim SSE/T.
SSE/T = ε′M(X)ε/T = ε ′ε / T − ε ′ X ( X ′X ) −1 X ′ε / T

= T −1Σtε t2 − (T −1ε ′X )(T −1 X ′X ) −1 (T −1 X ′ε )

= T⁻¹Σtεt² − (T⁻¹Σtεtxt′•)(T⁻¹Σtxt•xt′•)⁻¹(T⁻¹Σtxt•εt).

p limT →∞ T −1Σtε t2 = limT →∞ T −1Σt E (ε t2 ) = limT →∞ T −1Σtσ o2 = σ o2 .

p limT →∞ T −1Σt xt •ε t = 0 .

→ p limT →∞ s 2 = σ o2 − 0′(Qo ) −1 0 = σ o2 .

Linear Regressions under Ideal Conditions-132


(3) Show √T(βˆ − βo) →d N(0_{k×1}, σo²Qo⁻¹).

βˆ = βo + (T⁻¹Σt xt•xt′•)⁻¹ T⁻¹Σt xt•εt

→ (βˆ − βo) = (T⁻¹Σt xt•xt′•)⁻¹ T⁻¹Σt xt•εt

→ √T(βˆ − βo) = [T⁻¹Σt xt•xt′•]⁻¹ [(1/√T)Σt xt•εt].
→ By GCLT with martingale difference,
(1/√T)Σt xt•εt →d N(0, lim T⁻¹Σt Cov(xt•εt))
Cov( xt •ε t ) = E ( xt •ε tε t xt′• ) = E (ε t2 xt • xt′• )

= E xt• [ E (ε t 2 xt • xt′• | xt • )] (by LIE)

= E xt• [ E (ε t 2 | xt • ) xt • xt′• ] = E xt (σ o2 xt • xt′• ) = σ o2 E ( xt • xt′• ) .

limT→∞ T⁻¹Σt Cov(xt•εt) = σo² limT→∞ T⁻¹Σt E(xt•xt′•) = σo²Qo.

→ (1/√T)Σt xt•εt →d N(0_{k×1}, σo²Qo).
→ √T(βˆ − βo) →d N((Qo)⁻¹0_{k×1}, (Qo)⁻¹σo²Qo(Qo)⁻¹) = N(0_{k×1}, σo²(Qo)⁻¹)

Linear Regressions under Ideal Conditions-133
