CN OLS
Example:
• Suppose you are interested in the average relationship between income (y)
and education (x).
• For the people with 12 years of schooling (x =12), what is the average
income (E(y|x=12))?
• For the people with x years of schooling, what is the average income
(E(y|x))?
• Regression model:
y = E(y|x) + ε,
where ε is a disturbance (error) term with E(ε|x) = 0.
• Regression analysis aims to estimate E(y|x).
• Conditional pdf:
f(y|x) = Pr(Y = y, given X = x) = f(x,y)/fx(x).
• Stochastic independence:
• X and Y are stochastically independent iff f(x,y) = fx(x)fy(y), for all x,y.
• Under this condition, f(y|x) = f(x,y)/fx(x) = [fx(x)fy(y)]/fx(x) = fy(y).
• Marginal pdf of x (in this example, f(x,y) = 1/4 for each (x,y) with x, y ∈ {0,1}):
fx(0) = Pr(X = 0, regardless of y) = f(0,1) + f(0,0) = 1/4 + 1/4 = 1/2.
fx(1) = Pr(X = 1, regardless of y) = f(1,1) + f(1,0) = 1/4 + 1/4 = 1/2.
• Conditional pdf:
f(y = 1| x = 1) = f(1,1)/fx(1) = (1/4)/(1/2) = 1/2;
f(y = 0| x = 1) = f(1,0)/fx(1) = 1/2.
→ f(y| x=1) = 1/2, for y = 0, 1.
• Stochastic independence:
fx(x) = fy(y) = 1/2; fx(x)fy(y) = 1/4 = f(x,y), for any x and y.
Thus, x and y are stochastically independent.
Means:
μx = E(x) = ΣxΣyxf(x,y) = Σxxfx(x).
μy = E(y) = ΣxΣyyf(x,y) = Σyyfy(y).
Variances:
σx² = E[(x − μx)²] = ΣxΣy(x − μx)²f(x,y) = Σx(x − μx)²fx(x)
    = E(x²) − [E(x)]² = Σx x²fx(x) − μx².
σy² = ΣxΣy(y − μy)²f(x,y) = Σy(y − μy)²fy(y)
    = E(y²) − [E(y)]² = Σy y²fy(y) − μy².
Covariance:
σxy = cov(x, y) = E[(x − μx)(y − μy)] = ΣxΣy(x − μx)(y − μy)f(x,y)
    = E(xy) − μxμy = ΣxΣy xy f(x,y) − μxμy.
Correlation Coefficient:
The correlation coefficient between x and y is defined by:
ρxy = σxy/(σxσy).
Theorem:
-1 ≤ ρxy ≤ 1.
Theorem:
If X and Y are stochastically independent, then σxy = 0, but not vice versa.
• var(y|x) = E[(y − E(y|x))²|x] = Σy(y − E(y|x))²f(y|x).
Regression model:
• Let ε = y - E(y|x) (deviation from conditional mean).
• y = E(y|x) + y - E(y| x) = E(y|x) + ε (regression model).
• E(y|x) = explained part of y by x.
ε = unexplained part of y (called disturbance term).
E(ε|x) = 0 and var(ε| x) = var(y|x).
Note:
• E(y|x) may vary with x, i.e., E(y|x) is a function of x.
• Thus, we can define Ex[E(y|x)], where Ex(•) is the expectation over x =
Σx•fx(x) or ∫Ω•fx(x)dx.
Note:
For discrete RV, X with x = x1, ...,
E(y) = ΣxE(y|x)fx(x) = E(y|x=x1)fx(x1) + E(y|x=x2)fx(x2) + ... .
Implication:
If you know the conditional mean of y and the marginal distribution of x, you can also find the unconditional mean of y.
Definition:
We say that y is homoskedastic if var(y|x) is constant.
[Figure: the conditional mean line E(y|x) = β1 + β2x plotted against x, with the conditional distributions of y shown at x = x1 and x = x2.]
Note:
• varx[E(y|x)] ≤ var(y), since Ex[var(y|x)] ≥ 0.
• var(y) = E[(y-E(y))2]
= total variation of y.
varx[E(y|x)] = Ex[(E(y|x)-E(y))2]
= a part of variation in y due to variation in E(y|x)
= variation in y explained by E(y|x).
Coefficient of Determination:
R2 = varx[E(y|x)]/var(y).
→ Measure of worthiness of knowing E(y|x).
→ 0 ≤ R2 ≤ 1.
Note:
• R2 = variation in y explained by E(y|x)/total variation of y.
• Wish R2 close to 1.
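Both facts above follow from the standard variance decomposition (law of total variance):
var(y) = Ex[var(y|x)] + varx[E(y|x)],
so varx[E(y|x)] = var(y) − Ex[var(y|x)] ≤ var(y), and hence 0 ≤ R² = varx[E(y|x)]/var(y) ≤ 1.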
[Figure: the regression line for the example below, with E(y|x = 4) = 4/3 and E(y|x = 8) = 2.]
• Joint and marginal pdfs:
Y\X      4      8      fy(y)
1        1/2    0      1/2
2        1/4    1/4    1/2
fx(x)    3/4    1/4
• Conditional pdf f(y|x) = f(x,y)/fx(x):
Y\X      4      8
1        2/3    0
2        1/3    1
• Conditional variance of Y:
• var(y|x=4) = Σy[y-E(y|x=4)]2f(y|x=4) = 6/27.
• var(y|x=8) = 0.
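As a check on the tables above:
E(y|x=4) = 1(2/3) + 2(1/3) = 4/3; E(y|x=8) = 1(0) + 2(1) = 2;
var(y|x=4) = (1 − 4/3)²(2/3) + (2 − 4/3)²(1/3) = 2/27 + 4/27 = 6/27;
Ex[E(y|x)] = (4/3)(3/4) + 2(1/4) = 3/2, which matches E(y) = Σy y fy(y) = 1(1/2) + 2(1/2) = 3/2.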
• Bivariate normal distribution:
(x; y) ~ N([μx; μy], [σx², ρxyσxσy; ρxyσxσy, σy²]).
f(x, y) = 1/[2πσxσy√(1 − ρxy²)]
× exp(−{1/[2(1 − ρxy²)]}{(x − μx)²/σx² − 2ρxy(x − μx)(y − μy)/(σxσy) + (y − μy)²/σy²}),
where x, y ∈ ℜ.
Facts:
• f x ( x ) ~ N ( μ x , σ x2 ) and f y ( y ) ~ N ( μ y , σ y2 ) .
→ Cov(x) is symmetric.
EX: If x is scalar, Cov(x) = E[(x-µ)2] = var(x).
EX: x = [x1,x2]′ ; E(x) = µ = [µ1, µ2]′
x - µ = [x1-µ1, x2-µ2]′
→ (x − µ)(x − µ)′ = [x1 − µ1; x2 − µ2](x1 − µ1, x2 − µ2)
= [(x1 − µ1)², (x1 − µ1)(x2 − µ2); (x2 − µ2)(x1 − µ1), (x2 − µ2)²].
→ E[(x-µ)(x-µ)′] = Cov(x).
For a random matrix B = [Bij]p×q, the expectation is taken element by element:
E(B) = [E(B11), E(B12), ..., E(B1q); :, :, :; E(Bp1), E(Bp2), ..., E(Bpq)].
Pdf of x:
f(x) = f(x1, ..., xn) = (2π)^(−n/2)|Σ|^(−1/2) exp[−(1/2)(x − µ)′Σ⁻¹(x − µ)],
where |Σ| = det(Σ).
EX: For n = 1 (x scalar, Σ = σx²), this reduces to
f(x) = [1/(√(2π)σx)] exp[−(x − μx)²/(2σx²)].
EX:
Assume that all the Xi (i = 1, ..., n) are iid with N(μx, σx²). Then, using (1) and (2), we can show that f(x) = f(x1, ..., xn) = ∏_{i=1}^n f(xi),
where f(xi) = [1/(√(2π)σx)] exp[−(xi − μx)²/(2σx²)].
Theorem:
(1) E(c′x) = c′E(x);
(2) var(c′x) = c′Cov(x)c.
Proof:
(1) E(c′x) = E(Σjcjxj) = E(c1x1 + ... + cnxn)
= c1E(x1) + ... + cnE(xn) = ΣjcjE(xj) = c′E(x).
(2) var(c′x) = E[(c′x - E(c′x))2] = E[{c′x - c′E(x)}2]
= E[{c′(x-E(x))}2] = E[{c′(x-E(x))}{c′(x-E(x))}]
= E[{c′(x-E(x))}{(x-E(x))′c}]
= E[c′(x-E(x))(x-E(x))′c] = c′E[(x-E(x))(x-E(x))′]c = c′Cov(x)c.
Remark:
(2) implies that Cov(x) is always positive semidefinite.
→ c′Cov(x)c ≥ 0, for any nonzero vector c.
Proof:
For any nonzero vector c, c′Cov(x)c = var(c′x) ≥ 0.
Definition:
Let B = [bij]n x n be a symmetric matrix, and c = [c1, ... , cn]′. Then, a scalar c′Bc
is called a quadratic form of B.
Definition:
• If c′Bc > (<) 0 for any nonzero vector c, B is called positive (negative)
definite.
• If c′Bc ≥ (≤ ) 0 for any nonzero c, B is called positive (negative)
semidefinite.
EX:
Show that B = [2, 1; 1, 2] is positive definite.
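One way to check this:
c′Bc = 2c1² + 2c1c2 + 2c2² = c1² + c2² + (c1 + c2)² > 0 for any nonzero c = (c1, c2)′, so B is positive definite.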
End of Digression
Example:
• Wish to find important determinants of individuals’ earnings and estimate
the size of the effect of each determinant.
• Data: (WAGE2.WF1 or WAGE2.TXT)
• Here,
• β2,o × 100 = %Δ in wage by one more year of education.
• (β3,o+2β4,oexper) × 100 = %Δ by one more year of exper.
• Issues:
• How to estimate βo’s?
• Estimated β's would not be equal to the true values of β (βo). How close would our estimates be to the true values?
As you may find, the above assumptions are unrealistic. But under the
assumptions, more intuitive discussions about the statistical properties of OLS
can be made. The statistical properties of OLS discussed later still hold even
under more realistic assumptions.
Notation:
• E ( xt1 ) is the group population mean of x1 for group t, while E(x1) is the
population mean of x1 for the whole population.
Comment:
• Usually, xt1 = 1 for all t. That is, β1 is an overall intercept term.
• E(εt | xt•) = 0.
• E(xt•εt) = Ext•[E(xt•εt | xt•)] = Ext•[xt•E(εt | xt•)] = Ext•(0) = 0.
Comment:
• No other β* such that E ( yt | xt i ) = xt′i βo = xt′i β* for all t.
• The uniqueness assumption of βo is called “identification” condition.
• Rules out perfect multicollinearity (perfect linear relationship among the
regressors):
• Suppose β = (β1, β2, β3)′ and xt3 = xt1 + xt2 for all t.
• Set β1,* = β1,o + a; β2,* = β2,o + a; β3,* = β3,o − a for an arbitrary a ∈ ℜ.
• Then,
xt•′β* = xt1β1,* + xt2β2,* + xt3β3,*
= xt1β1,o + xt2β2,o + xt3β3,o + a(xt1 + xt2 − xt3)
= xt•′βo.
• (SIC.2) rules out this possibility.
Comment:
• Moments such as E(yt²xt2²), E(xt3xt4³), E(xt4³), etc., exist.
• Rules out extreme outliers.
• We need this assumption for consistency and asymptotic normality of the
OLS estimator.
• SIC implies the Weak Ideal Conditions (WIC) that will be discussed
later.
• Violated if xt2 = t or xt2 = xt−1,2 + vt2.
Comment:
• ( yt , xt1 , xt 2 ,..., xtk )′ are iid (independently and identically distributed):
• T groups which are iid with
E[(yt; xt•)] = E[(y; x)] and Cov[(yt; xt•)] = Cov[(y; x)].
• One observation is drawn from each of the T group.
• Could be appropriate for cross-section data.
• Violated if time series data are used. That is why we add “strong” to the name of the conditions.
• If T < k , there are infinitely many β* such that xt′i βo = xt′i β* for all t. For
this case, the sample cannot identify β.
• Implies no autocorrelation: cov(ε t , ε s ) = 0 for all t ≠ s .
[Figure: scatter plots of HETY vs. X2 (heteroskedastic errors) and HOMY vs. X2 (homoskedastic errors).]
Comment:
• Optional. Not critical.
• This condition implies that β1,o is an overall intercept term.
• Need this assumption for convenient interpretation of empirical R2.
→ E(x•y) = E(x•x•′)βo
→ βo = [E(x•x•′)]⁻¹E(x•y),
where
E(x•x•′) = E[(1; x2)(1, x2)] = E[1, x2; x2, x2²] = [1, E(x2); E(x2), E(x2²)];
E(x•y) = E[(1; x2)y] = E[(y; x2y)] = [E(y); E(x2y)].
→ β2,o = cov(x2, y)/var(x2); β1,o = E(y) − β2,oE(x2).
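The last line uses the inverse of the 2×2 moment matrix:
[1, E(x2); E(x2), E(x2²)]⁻¹ = [1/var(x2)]·[E(x2²), −E(x2); −E(x2), 1],
so the second element of βo = [E(x•x•′)]⁻¹E(x•y) is β2,o = [E(x2y) − E(x2)E(y)]/var(x2) = cov(x2, y)/var(x2), and the first element simplifies to β1,o = E(y) − β2,oE(x2).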
Write the model as yt = xt1β1,o + wt•′βw,o + εt, where wt• = (xt2, ..., xtk)′ and βw,o = (β2,o, ..., βk,o)′. Suppose that this model satisfies (SIC.1)-(SIC.4) and (SIC.7). Then,
βo = [E(xt•xt•′)]⁻¹E(xt•yt) = [E(x•x•′)]⁻¹E(x•y).
And,
βw,o = [Cov(w•)]⁻¹Cov(w•, y).
• β2,o ≠ 0 means non-zero correlation between yt and xt2 (the regressor and the dependent variable). It does not mean that xt2 causes yt; β2,o ≠ 0 could instead mean that yt causes xt2.
• SIC do not talk about causality. SIC may hold even if yt determines xt • :
It can be the case that E ( edut | waget ) = β1,o + β 2,o waget .
• But, the regression model (1) is not meaningful if the x variables are not causal variables. We would like to know by how much the hourly wage rate increases with one more year of education. We would not be interested in how many more years of education an individual could have obtained if his/her current wage rate increased now by $1!
Definition:
Given observations on yt and the regressors xt1, ..., xtk, the OLS estimator β̂ = (β̂1, β̂2, ..., β̂k)′ minimizes:
ST(β) ≡ Σt(yt − xt1β1 − ... − xtkβk)² = Σt(yt − xt•′β)² = (y − Xβ)′(y − Xβ),
where y = (y1, y2, ..., yT)′ and X is the T×k matrix whose t'th row is xt•′ = (xt1, xt2, ..., xtk).
ST ( β1 , β 2 ) = Σt ( yt − xt1β1 − xt 2 β 2 ) 2 .
• The first order condition for minimization:
∂ST / ∂β1 = Σt2(yt-xt1β1-xt2β2)(-xt1) = 0 → Σt(xt1yt - xt12β1-xt1xt2β2) = 0
∂ST / ∂β 2 = Σt2(yt-xt1β1-xt2β2)(-xt2) = 0 → Σt(xt2yt - xt1xt2β1-xt22β2) = 0
→ Σtxt1yt = (Σtxt12)β1 + (Σtxt1xt2)β2
Σtxt2yt = (Σtxt1xt2)β1 + (Σtxt22)β2
→ (Σt xt1yt; Σt xt2yt) = [Σt xt1², Σt xt1xt2; Σt xt2xt1, Σt xt2²](β̂1; β̂2).
→ βˆ = (X′X)-1X′y.
∂ST(β)/∂β = −2X′y + 2X′Xβ = 0k×1
→ X′y − X′Xβ = 0k×1 (2)
∂²ST(β)/∂β∂β′ = [∂²ST(β)/∂βi∂βj]k×k = 2X′X,
which is a positive definite matrix for any value of β. That is, the function
ST(β) is globally convex. This indicates that βˆ indeed minimizes ST(β).
[Here, we use the fact that ∂²(β′Aβ)/∂β∂β′ = 2A for any symmetric matrix A.]
Definition:
• t'th residual: et = yt − xt′i βˆ (can be viewed as an estimate of εt).
• Vector of residuals: e = ( e1 ,..., eT )′ = y − X βˆ .
Theorem: X ′e = 0k ×1
Proof:
From the proof of the previous theorem,
X ′y − X ′X βˆ = 0k ×1 → X ′( y − X βˆ ) = 0k ×1 → X ′e = 0 .
Corollary:
If (SIC.7) holds ( xt1 = 1 for all t: β1 is the intercept), Σtet = 0.
Proof:
X′e = [x11, x21, ..., xT1; x12, x22, ..., xT2; :, :, :; x1k, x2k, ..., xTk][e1; e2; :; eT] = [Σt xt1et; Σt xt2et; :; Σt xtket] = [0; 0; :; 0]k×1.
With xt1 = 1 for all t, the first element gives Σt xt1et = Σt et = 0.
Facts:
1) P(A) and M(A) are both symmetric and idempotent:
P(A)′ = P(A), M(A)′ = M(A), P(A)P(A) = P(A), M(A)M(A) = M(A).
2) P(A) and M(A) are psd (positive semi-definite).
3) P(A)M(A) = 0T×T (orthogonal).
4) P(A)A = [A(A′A)-1A′]A = A.
5) M(A)A = [IT-P(A)]A = A - P(A)A = A - A = 0T×T.
End of Digression
Theorem: e = M(X)y.
<Proof> e = y − Xβ̂ = y − X(X′X)⁻¹X′y = [IT − X(X′X)⁻¹X′]y = M(X)y.
Comment:
β A is different from the OLS estimate of βA from a regression of y on XA.
Theorem:
Consider the following models:
(A) yt = β1 + β2xt2 + β3xt3 + error
(B) yt = α1 + α2xt2 + error
(C) xt3 = δ1 + δ2xt2 + error
Then, α 2 = β 2 + δ 2 β 3 .
β* = [X*′M(1T)X*]⁻¹X*′M(1T)y.
Observe that:
M (1T ) y = ( y1 − y y2 − y ... yT − y )′ .
Now, complete the proof by yourself.
Example:
• A simple regression model: yt = β1 + β2xt2 + εt, with β1,o = β2,o = 1.
• For population A, σo2 = 1. For population B, σo2 = 10.
[Figure: scatter plots of YA vs. X2 (population A, σo² = 1) and YB vs. X2 (population B, σo² = 10).]
Definition:
• "Fitted value" of yt: yˆ t = xt′i βˆ (an estimate of E ( yt | xt i ) ).
• SSE = e′e = (y − Xβ̂)′(y − Xβ̂) = y′y − 2β̂′X′y + β̂′X′X(X′X)⁻¹X′y = y′y − β̂′X′y.
Theorem:
Suppose that xt1 = 1 for all t (SIC.7). Then SST = SSR + SSE, where
SST = Σt(yt − ȳ)² = Σt yt² − Tȳ², SSR = Σt(ŷt − ȳ)², and SSE = Σt et² = e′e.
Implication:
Total variation of yt equals sum of explained and unexplained variations of yt.
Theorem:
Suppose that xt1 = 1, for all t (SIC.7). Then, R2 = SSR/SST and 0 ≤ R2 ≤ 1.
Note:
1) If (SIC.7) holds, then, R2 = 1 - (SSE/SST) = SSR/SST.
2) If (SIC.7) does not hold, then, 1 - (SSE/SST) ≠ SSR/SST.
3) 1 - (SSE/SST) can never be greater than 1, but it could be negative.
SSR/SST can never be negative, but it could be greater than 1.
Note:
• Some people use Ru² ≡ ŷ′ŷ/(y′y) = 1 − e′e/(y′y) when the model has no intercept term.
• 0 ≤ Ru² ≤ 1, since e′e + ŷ′ŷ = y′y. [Why? Try it at home.]
→ This holds even if (SIC.7) does not hold.
• If ȳ = 0, then Ru² = R².
Definition:
An estimator of the covariance between yt and ŷt (which can be viewed as an estimate of E(yt | xt•)) is defined by:
ecov(yt, ŷt) = [1/(T − 1)]Σt(yt − ȳ)(ŷt − ỹ),
where ỹ = T⁻¹Σt ŷt. Similarly, the estimators of var(yt) and var(ŷt) are defined by:
evar(yt) = [1/(T − 1)]Σt(yt − ȳ)²; evar(ŷt) = [1/(T − 1)]Σt(ŷt − ỹ)².
Then, the estimated correlation coefficient between yt and ŷt is defined by:
ρ̂ = ecov(yt, ŷt)/√[evar(yt)·evar(ŷt)].
Theorem:
When k increases, SSE never increases.
Proof:
Compare:
Model 1: y = Xβ + ε
Model 2: y = Xβ + Zγ + υ = Wξ + υ,
where W = [X,Z] and ξ = [β′,γ′]′.
SSE1 = SSE from M1 = y′M(X)y = y′y - y′P(X)y
SSE2 = SSE from M2 = y′M(W)y = y′y - y′P(W)y
= y′y - y′[P(X)+P{M(X)Z}]y
= y′y - y′P(X)y - y′P{M(X)Z}y
SSE1 - SSE2 = y′P{M(X)Z}y ≥ 0.
Comment:
If θ̂ is more efficient than θ̃, it means that the value of θ̂ obtained from a particular sample would generally be closer to the true value of θ (θo) than the value of θ̃.
• Consider x̃ = x1 (a single observation) as an alternative estimator of the mean: var(x̃) = var(x1) = σo².
• Thus, var(x̄) = σo²/T < σo² = var(x̃), if T > 1.
Gauss Exercise:
• From N(0,9), draw 1,000 random samples of size equal to T = 100.
• For each sample, compute x̄ (the sample mean) and x̃ (the first observation of the sample).
• Draw a histogram for each estimator.
• Gauss program name: mmonte.prg.
seed = 1;
tt = 100; @ # of observations @
iter = 1000; @ # of sets of different data @
storem = zeros(iter,1); @ stores the sample mean of each sample @
stores = zeros(iter,1); @ stores the first observation of each sample @
i = 1;
do while i <= iter; @ loop over the iter samples @
x = 3*rndns(tt,1,seed); @ T draws from N(0,9) @
m = meanc(x);
storem[i,1] = m;
stores[i,1] = x[1,1];
i = i + 1; endo;
output off ;
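Note: under this setup, var(x̄) = σo²/T = 9/100 = 0.09 while var(x̃) = var(x1) = 9, so the histogram of storem should be far more tightly concentrated around 0 than the histogram of stores.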
Definition: (Unbiasedness)
θˆ is unbiased iff E (θˆ) = θo :
E(θ̂) = [E(θ̂1); E(θ̂2); :; E(θ̂p)] = [θ1,o; θ2,o; :; θp,o] = θo.
Remark:
var(c′θ̃) ≥ var(c′θ̂).
Example:
• Let θ = (θ1, θ2)′. Suppose Cov(θ̂) = [1, 0; 0, 1]; Cov(θ̃) = [1.5, 1; 1, 1.5].
• Note that:
var(θ̂1) = 1 < 1.5 = var(θ̃1); var(θ̂2) = 1 < 1.5 = var(θ̃2).
• But,
Cov(θ̃) − Cov(θ̂) = [0.5, 1; 1, 0.5] ≡ A → |A1| = 0.5 > 0; |A2| = −0.75 < 0 (leading principal minors).
• A is neither positive nor negative semi-definite.
• θ̂ is not necessarily more efficient than θ̃.
• For example, suppose you wish to estimate θ1 − θ2 = c′θ (where c = (1,−1)′):
• var(c′θ̂) = c′Cov(θ̂)c = 2; var(c′θ̃) = c′Cov(θ̃)c = 1.
• That is, for the given c = (1,−1)′, c′θ̃ is more efficient than c′θ̂.
• This example is a case where relative efficiency of estimators depends on
c. For such cases, we can’t claim that one estimator is superior to others.
• Choosing c = (1, 0, ..., 0)′ shows var(θ̂1) ≤ var(θ̃1); choosing c = (0, 1, 0, ..., 0)′ shows var(θ̂2) ≤ var(θ̃2). Keep doing this until j = p.
Proj(yt | xt•) = xt•′βp.
Theorem:
E(xjep) = 0 for all j = 1, ..., k. That is, E(x•ep) = 0k×1.
Proof:
Recall X′e = 0k×1 → Σ_{t=1}^T xt•et = 0k×1. That is, E(x•ep) = (1/B)Σ_{t=1}^B xt•ep,t = 0.
Comment:
• E(x•ep) = 0 does not imply E(ep | x•) = 0, although the latter implies the former.
Proof:
βp = (Σ_{t=1}^B xt•xt•′)⁻¹ Σ_{t=1}^B xt•yt = [(1/B)Σ_{t=1}^B xt•xt•′]⁻¹ (1/B)Σ_{t=1}^B xt•yt.
Comment:
• Intuitively, the OLS estimator is a consistent estimator of β p .
Comment:
• The whole population consists of T groups, and each group has fixed xt•.
We draw yt from each group. The value of yt would change over different
trials, but the value of xt• remains the same.
• Can be replaced by the assumption that E(εt | x1•, ..., xT•) = 0 for all t (assumption of strictly exogenous regressors). This assumption holds as long as (SIC.1)-(SIC.4) hold. If you do not use (SIC.8), the distributions of β̂ and s² obtained below are conditional ones, conditional on x1•, x2•, ..., xT•.
Theorem:
Assume (SIC.1)-(SIC.6) and (SIC.8). Then,
• E ( βˆ ) = β o (unbiased)
• Cov ( βˆ ) = σ o2 ( X ′X ) −1
Numerical Exercise:
• yt = β1 + β2xt2 + β3xt3 + εt, T = 5:
y = (0, 0, 1, 1, 3)′; X = [1, −2, 4; 1, −1, 1; 1, 0, 0; 1, 1, 1; 1, 2, 4].
• Then,
X′X = [5, 0, 10; 0, 10, 0; 10, 0, 34]; X′y = (5, 7, 13)′; y′y = 11; ȳ = 1.
1) Compute β̂:
(X′X)⁻¹ = [17/35, 0, −1/7; 0, 1/10, 0; −1/7, 0, 1/14]
→ β̂ = (X′X)⁻¹X′y = (0.571, 0.7, 0.214)′.
3) Estimate Cov(β̂):
• SSE = y′y − β̂′X′y = 11 − (0.571, 0.7, 0.214)(5, 7, 13)′ = 0.46
• s² = SSE/(T − k) = 0.46/2 = 0.23, so the estimate of Cov(β̂) is s²(X′X)⁻¹ = 0.23(X′X)⁻¹.
• SST = y′y − Tȳ² = 11 − 5 = 6; SSR = SST − SSE = 5.54.
5) Compute R² and R̄²:
• R² = SSR/SST = 5.54/6 = 0.923
• R̄² = 1 − [(T − 1)/(T − k)](1 − R²) = 1 − [(5 − 1)/(5 − 3)](1 − 0.923) = 0.846.
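These numbers are easy to check by computer. Below is a minimal GAUSS sketch (a hypothetical check program, not one of the course's .prg files) that reproduces β̂, SSE, R², and R̄² for this data set.
new;
y = { 0, 0, 1, 1, 3 };                        @ 5x1 dependent variable @
x = { 1 -2 4, 1 -1 1, 1 0 0, 1 1 1, 1 2 4 };  @ 5x3 regressor matrix @
tt = rows(y); k = cols(x);
bhat = invpd(x'x)*(x'y);                      @ OLS: (X'X)^(-1)X'y @
e = y - x*bhat;                               @ residuals @
sse = e'e;
sst = y'y - tt*meanc(y)^2;
r2 = 1 - sse/sst;
r2bar = 1 - ((tt-1)/(tt-k))*(1 - r2);
print bhat; print sse; print r2; print r2bar;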
Lemma D.1:
βˆ = β o + ( X ′X ) −1 X ′ε .
Proof:
y = X βo + ε .
β̂ = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβo + ε) = βo + (X′X)⁻¹X′ε.
Theorem: (Unbiasedness)
E ( βˆ ) = β o .
Proof:
E ( βˆ ) = E [ β o + ( X ′X ) −1 X ′ε ] = β o + ( X ′X ) −1 X ′E (ε ) = β o .
2) Show Cov(β̂) = σo²(X′X)⁻¹:
Cov(β̂) = E[(β̂ − βo)(β̂ − βo)′] = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹
= (X′X)⁻¹X′(σo²IT)X(X′X)⁻¹ = σo²(X′X)⁻¹X′IT X(X′X)⁻¹
= σo²(X′X)⁻¹X′X(X′X)⁻¹ = σo²(X′X)⁻¹.
3) Show E ( s 2 ) = σ o2 .
Lemma D.2:
SSE = e′e = y ′M ( X ) y = ε ′M ( X )ε .
Proof:
SSE = y′M(X)y = (Xβ+ε)′M(X)(Xβ + ε) = (β′X′+ε′)M(X)ε = ε′M(X)ε.
Theorem:
E ( SSE ) = (T − k )σ o2 .
Lemma D.3:
For Am×n and Bn×m, tr(AB) = tr(BA).
Lemma D.4:
If B is an idempotent n×n matrix,
rank(B) = tr(B).
[Comment]
• For Lemma D.4, many econometrics books assume B to be also symmetric.
But the matrix B does not have to be.
• An idempotent matrix does not have to be symmetric: For example,
[1/2, 1; 1/4, 1/2] and [1, a; 0, 0].
• Theorem DA.1:
The eigenvalues of an idempotent matrix, say B, are ones or zeros.
<Proof> λξ = Bξ = B²ξ = Bλξ = λ²ξ → λ² = λ → λ = 0 or 1.
• Theorem DA.3:
rank (B) = # of non-zero eigenvalues of B [See Greene.]
Example:
Let A be T×k (T > k). Show that rank[IT-A(A′A)-1A′] = T - k.
[Solution]
rank[IT-A(A′A)-1A′]
= tr(IT - A(A′A)-1A′)
= tr(IT) - tr[A(A′A)-1A′] = T - tr[(A′A)-1A′A]
= T - tr(Ik) = T - k.
End of Digression.
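With these lemmas in hand, a short sketch of why E(SSE) = (T − k)σo², as stated in the theorem above:
E(SSE) = E[ε′M(X)ε] = E[tr(ε′M(X)ε)] = E[tr(M(X)εε′)] = tr[M(X)E(εε′)] = σo² tr[M(X)]
= σo²{tr(IT) − tr[X(X′X)⁻¹X′]} = σo²{T − tr[(X′X)⁻¹X′X]} = σo²(T − k).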
Lemma D. 5:
Let zT×1 ~ N(μT×1, ΩT×T). Suppose that A is a k×T nonstochastic matrix. Then,
b + Az ~ N(b + Aμ, AΩA′).
Theorem: βˆ ~ N ( β o , σ o2 ( X ′X ) −1 )
Proof:
βˆ = β o + ( X ′X ) −1 X ′ε
→ βˆ ~ N(βo+(X′X)-1X′E(ε), (X′X)-1X′Cov(ε)X(X′X)-1)
= N ( β o , σ o2 ( X ′X ) −1 ) .
Lemma D.6:
Let Q be a T×T (nonstochastic) symmetric and idempotent matrix. Suppose
ε ~ N (0T ×1 , σ o2 I T ) . Then,
ε′Qε/σo² ~ χ²(r), r = tr(Q).
Lemma D.7:
Suppose that Q is a T×T (nonstochastic) symmetric and idempotent and B is a
m×T nonstochastic matrix. If ε ~ N (0T ×1 , σ o2 I T ) , Bε and ε′Qε are
stochastically independent iff BQ = 0mxT.
Proof: See Schmidt.
Theorem:
(T − k)s²/σo² = SSE/σo² ~ χ²(T − k).
→ var(s²) = 2σo⁴/(T − k).
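The var(s²) line follows from var[χ²(T − k)] = 2(T − k):
var(s²) = var[(σo²/(T − k))·((T − k)s²/σo²)] = [σo²/(T − k)]²·2(T − k) = 2σo⁴/(T − k).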
Remark:
Let θ = (β′, σ²)′ and θ̂ = (β̂′, s²)′. Then,
Cov(θ̂) = [σo²(X′X)⁻¹, 0k×1; 01×k, 2σo⁴/(T − k)].
Theorem: (Gauss-Markov)
Under (SIC.1) – (SIC.5) (ε may not be normal) and (SIC.8), βˆ is the best
linear unbiased estimator (BLUE) of β.
Comment:
Suppose that β̃ is an estimator which is linear in y; that is, there exists a T×k matrix C such that β̃ = C′y. Let us assume that E(β̃) = βo. Then, the above theorem implies that Cov(β̃) − Cov(β̂) is positive semidefinite.
• f(x1, ..., xT, θo) = ∏_{t=1}^T f(xt, θo).
→ LT(θ) = f(x1, ..., xT, θ) = ∏_{t=1}^T f(xt, θ).
→ lT(θ) = ln[∏_{t=1}^T f(xt, θ)] = Σt ln f(xt, θ).
Theorem:
If there is a function g(θ̂MLE) such that E[g(θ̂MLE)] = θo, then g(θ̂MLE) is the MVUE (minimum variance unbiased estimator) of θo.
Example:
• {x1, ... , xT} is a random sample from a population following a Poisson
distribution [i.e., f(x,θ) = e-θθx/x! (suppressing subscript “o” from θ)].
• Note that E(x) = var(x) = θo for Poisson distribution.
• lT(θ) = Σt ln[f(xt, θ)] = −θT + [ln(θ)]Σt xt − Σt ln(xt!).
• FOC of maximization: ∂lT/∂θ = −T + (1/θ)Σt xt = 0.
• Solving this, θ̂MLE = Σt xt/T = x̄.
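Since E(θ̂MLE) = E(x̄) = E(x) = θo, the sample mean is an unbiased function of the MLE, so by the MVUE result above x̄ is the MVUE of θo.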
Definition: (MLE)
MLE θˆMLE maximizes lT(θ) given data (vector) points x1, ... , xT. That is, θˆMLE
solves
∂lT(θ)/∂θ = [∂lT(θ)/∂θ1; ∂lT(θ)/∂θ2; :; ∂lT(θ)/∂θp] = [0; 0; :; 0]p×1.
Example:
• Let {x1, ... , xT} be a random sample from N ( μo , σ o2 ) .
• f(xt, θ) = [1/√(2πv)] exp[−(xt − μ)²/(2v)], where θ = (μ, v)′.
• ln[f(xt, θ)] = −(1/2)ln(2π) − (1/2)ln(v) − (xt − μ)²/(2v).
• lT(θ) = −(T/2)ln(2π) − (T/2)ln(v) − Σt(xt − μ)²/(2v).
• MLE solves FOC:
(1) ∂lT(θ)/∂μ = −(1/(2v))Σt 2(xt − μ)(−1) = Σt(xt − μ)/v = 0;
(2) ∂lT(θ)/∂v = −T/(2v) + Σt(xt − μ)²/(2v²) = 0.
• From (1):
(3) Σt(xt − μ) = 0 → Σt xt − Tμ = 0 → μ̂MLE = Σt xt/T = x̄.
θ̂MLE = (μ̂MLE; v̂MLE)′ = (x̄; (1/T)Σt(xt − x̄)²)′.
• Note that:
• E(μ̂MLE) = E(x̄) = E[(1/T)Σt xt] = (1/T)Σt E(xt) = (1/T)Σt μo = μo.
• E(v̂MLE) = [(T − 1)/T]σo² (by the fact that E[(1/(T − 1))Σt(xt − x̄)²] = σo²).
→ Let g(v̂MLE) = [T/(T − 1)]v̂MLE.
→ Clearly, E[g(v̂MLE)] = E[(1/(T − 1))Σt(xt − x̄)²] = σo².
→ Thus, g(v̂MLE) is the MVUE of σ².
• LT(θ) = ∏_{t=1}^T f(yt, θ | xt•).
Example:
• Assume that (yt, xt•′)′ are iid and f(yt, βo, vo | xt•) ~ N(xt•′βo, vo).
• f(yt, β, v | xt•) = [1/√(2πv)] exp[−(yt − xt•′β)²/(2v)].
• lT(β, v) = Σt ln f(yt, β, v | xt•)
= −(T/2)ln(2π) − (T/2)ln(v) − (1/(2v))Σt(yt − xt•′β)²
= −(T/2)ln(2π) − (T/2)ln(v) − (1/(2v))(y − Xβ)′(y − Xβ).
End of Digression
Proof:
We already know that E ( βˆ ) = β o and E ( s 2 ) = σ o2 . Thus, it is sufficient to
show that βˆ and s2 are MLE or some functions of MLE. Under (SIC.1) –
(SIC.6) and (SIC.8),
ε ~ N(0T×1, voIT) → y ~ N(Xβo,voIT), where vo = σ o2 .
Therefore, we have the following likelihood function of y:
LT(β, v) = (2π)^(−T/2)|vIT|^(−1/2) exp[−(1/2)(y − Xβ)′(vIT)⁻¹(y − Xβ)]
= (2π)^(−T/2)v^(−T/2) exp[−(1/(2v))(y − Xβ)′(y − Xβ)].
Then,
lT(β,v) = -(T/2)ln(2π) -(T/2)ln(v) - (y-Xβ)′(y-Xβ)/(2v)
= -(T/2)ln(2π) -(T/2)ln(v) - (1/2v)[y′y-2β′X′y+β′X′Xβ].
→ FOC: ∂lT(β,v)/∂β = -(1/2v)[-2X′y + 2X′Xβ] = 0k×1 (i)
∂lT(β,v)/∂v = -(T/2v) + (1/2v2)(y-Xβ)′(y-Xβ) = 0 (ii)
→ From (i), X′y - X′Xβ = 0k×1 → βˆ MLE = (X′X)-1X′y = βˆ .
→ From (ii), v̂MLE = SSE/T → s2 is a function of v̂MLE .
[s2 = [T/(T-k)] v̂MLE ]
where sR = √(R[s²(X′X)⁻¹]R′).
Theorem:
Under Ho: βj,o = βj*,
t = (β̂j − βj*)/se(β̂j) ~ t(T − k).
Proof:
Let R = [0 0 ... 1 ... 0]; that is, only the j′th entry of R equals 1. Let r = βj*. Then,
t = (Rβ̂ − r)/sR = (β̂j − βj*)/√[Rs²(X′X)⁻¹R′] = (β̂j − βj*)/√(s²[(X′X)⁻¹]jj) = (β̂j − βj*)/se(β̂j).
Comment:
• T-Statistics Theorem implies the following:
• Imagine that you collect billions and billions (b) of different samples.
• For each sample, compute the t statistic for the same hypothesis Ho.
Denote the population of these t statistics by {t[1], t[2], ..., t[b]}.
• The above theorem indicates that the population of t-statistics is
distributed as t(T-k).
• Here, you strongly believe that βj,o cannot be negative. If so, you would
regard negative t-statistics as evidence for Ho. So, your
acceptance/rejection decision depends on how positively large the value of
your t-statistic is.
• Choose a critical value (c = 1.701) as in the above graph at 5% significance
level. Then, reject Ho in favor of Ha, if t > c (=1.701). Do not reject Ho, if t
< c.
• Here, you strongly believe that βj,o cannot be positive. If so, you would
regard a positive value of a t-statistic as evidence favoring Ho. So, your
acceptance/rejection decision depends on how negatively large the value of
your t-statistic is.
• Choose a critical value (-c = -1.734) as in the above graph at a given
significance level. Then, reject Ho in favor of Ha, if t < -c (= -1.734). Do not
reject Ho, if t > -c.
3) Student t Distribution
• Let z ~ N(0,1) and y ~ χ2(k). Assume that z and y are stochastically
independent.
z
• Then, t = ~ t(k).
y/k
• If t ~ t(k2), then t² ~ f(1, k2).
• If f ~ f(k1, k2), then k1·f →d χ²(k1) as k2 → ∞.
Gauss Exercise:
• z ~ N(0,1); t ~ t(4); y ~ χ2(2); f ~ f(2,10).
• Gauss program name: dismonte.prg
/*
** Monte Carlo Program for z, x-square, t and f distribution
*/
seed = 1;
iter = 1000; @ # of random draws for each distribution @
@ draws (reconstructed; the original draw lines are missing): z ~ N(0,1), x ~ chi-square(2), t ~ t(4), f ~ f(2,10) @
z = rndns(iter,1,seed);
x = sumc(rndns(2,iter,seed)^2);
t = rndns(iter,1,seed)./sqrt(sumc(rndns(4,iter,seed)^2)/4);
f = (sumc(rndns(2,iter,seed)^2)/2)./(sumc(rndns(10,iter,seed)^2)/10);
@ Histograms @
library pgraph;
graphset;
ytics(0,6,0.1,0) ;
v = seqa(-8,0.1,220);
@ {a1,a2,a3}=histp(z,v); @
@ {b1,b2,b3}=histp(t,v); @
library pgraph;
graphset;
ytics(0,10,0.1,0);
w = seqa(0, 0.1, 330);
[Figure: histograms of the simulated draws — z ~ N(0,1), t ~ t(4), f ~ f(2,10).]
End of Digression
E[R(β̂ − βo)/σR] = 0; var[R(β̂ − βo)/σR] = 1.
→ q1 ≡ R(β̂ − βo)/σR ~ N(0,1).
Note that
q2 ≡ sR/σR = √[Rs²(X′X)⁻¹R′/(Rσo²(X′X)⁻¹R′)] = √(s²/σo²) = √[((T − k)s²/σo²)/(T − k)], which is of the form √[χ²(T − k)/(T − k)].
→ Under Ho: Rβo = r, t = (Rβ̂ − r)/sR = q1/q2 ~ t(T − k), since q1 and q2 are stochastically independent (β̂ and s² are independent).
Example:
• A model is given: yt = xt1β1,o + xt2β2,o + xt3β3,o + εt.
• Wish to test for Ho: β1,o = 0 and β2,o + β3,o = 1.
• Define:
R = [1, 0, 0; 0, 1, 1]; r = (0, 1)′.
Then, Ho → Rβo = r.
Comment:
F-Statistics Theorem implies the following:
• Imagine that you collect billions and billions (b) of different samples.
• For each sample, compute the F statistic for the same hypothesis Ho. Denote
the population of these F statistics as {F[1], F[2], ... , F[b]}.
• The above theorem indicates that the population of the F-statistics is
distributed as f(m,T-k).
Theorem:
β̃ = β̂ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r).
Proof: See Greene.
Theorem:
Under Ho: Rβo - r = 0,
E(β̃) = βo.
Theorem:
Assume that (SIC.1)-(SIC.6) and (SIC.8) hold (whether (SIC.7) holds or not).
If Ho is correct, then β̃ is more efficient than β̂.
Proof: Show it by yourself.
Remark:
• Consider a model: yt = xt1β1 + xt2β2 + xt3β3 + xt4β4 + εt.
• Wish to test for Ho: β3,o = β4,o = 0.
• To find β̃, do OLS on:
(*) yt = xt1β1 + xt2β2 + εt.
• Denote the OLS estimates by β̃1 and β̃2. Then, the restricted OLS estimate of β is given by β̃ = [β̃1, β̃2, 0, 0]′.
• Also, set SSE from (*) as SSEr.
• Test Ho: β2,o + β3,o = 1 and β4,o = 0.
• yt = xt1β1 + xt2β2 + xt3β3 + xt4β4 + εt.
→ yt = xt1β1 + xt2β2 + xt3(1-β2) + εt.
→ yt - xt3 = xt1β1 + (xt2-xt3)β2 + εt . (**)
• Do OLS on (**) and get β̃1 and β̃2. Set β̃3 = 1 − β̃2 and β̃4 = 0. Set SSEr = SSE from OLS on (**).
→ t = (β̂2 − 0)/se(β̂2) = 11.77291; c = 1.96 at the 5% significance level.
Wald Test:
Equation: Untitled
Null Hypothesis: C(3)=0, C(4)=0
• Estimating elasticities:
• Let log ( L) and log( K ) be chosen values of log(Lt) and log(Kt).
[You may choose sample means.]
• Observe that ηQL = ∂ log(Q ) / ∂ log( L) = β2 + β4log(L) + β6log(K).
• Thus, a natural estimate of ηQL is given:
ηˆQL = βˆ2 + βˆ4 log( L) + βˆ6 log( K ) = R βˆ ,
→ E ( β ) = βo .
• Under Ho: β2,o = ... = βk,o = 0 (with xt1 = 1), the restricted regression keeps only the intercept, so
SSEr = Σt(yt − β̃1 − xt2·0 − ... − xtk·0)² = Σt(yt − β̃1)² = Σt(yt − ȳ)² = SST.
Observe that:
F = [(SSEr − SSEu)/(k − 1)]/[SSEu/(T − k)] = [(T − k)/(k − 1)]·[(SST − SSE)/SSE]
= [(T − k)/(k − 1)]·[(1 − SSE/SST)/(SSE/SST)] = [(T − k)/(k − 1)]·[R²/(1 − R²)].
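For instance, with the numerical exercise above (T = 5, k = 3, R² = 0.923), the F statistic for Ho: β2,o = β3,o = 0 would be F = [(5 − 3)/(3 − 1)]·[0.923/(1 − 0.923)] ≈ 12.0.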
Example 1:
Oil shocks during 70’s may have changed firms’ production functions
permanently.
Example 2:
Effects of schooling on wages may be different over different regions. [Why?
Perhaps because of different industries across different regions.]
• Question:
How can we test Ho: βA1,o = βB1,o, βA2,o = βB2,o, βA3,o = βB3,o and βA4,o = βB4,o?
→ (*) (yA; yB) = [XA, 0TA×k; 0TB×k, XB](βA; βB) + (εA; εB) → y = X*β* + ε*.
• Restricted model:
βA,o = βB,o (let us denote them by β): k restrictions.
→ Merge model (A) and (B) with the restriction (Model C):
(**) (yA; yB) = (XA; XB)β + (εA; εB) → y = Xβ + ε
→ W = kF →d χ²(k) (cf. the Chi-square statistic in the output below).
Wald Test:
Equation: Untitled
Null Hypothesis: C(5)=0, C(6)=0, C(7)=0, C(8)=0
F-statistic 10.03332   Probability 0.000000
Chi-square 40.13328   Probability 0.000000
• What is this?
• yA = XAβ + εA for Group A;
yB = XBβ + ITBγ + εB for Group B,
where γ = (γ1, ..., γTB)′.
• Written out for Group B:
(yB,1; yB,2; :; yB,TB) = [xB,1•′, 1, 0, ..., 0; xB,2•′, 0, 1, ..., 0; :, :, :, :; xB,TB•′, 0, 0, ..., 1](β; γ1; :; γTB) + (εB,1; εB,2; :; εB,TB).
• Stacking the two groups:
(yA; yB) = [XTA×k, 0TA×TB; XTB×k, ITB×TB](β; γ1; :; γTB) + (εA; εB).
• SSEA = SSE from regression of the above model.
• FACHOW = F for Ho: γ 1 = ... = γ TB = 0 .
Theorem:
Under (SIC.1)-(SIC.6) and (SIC.8), ( y0 − yˆ 0 ) ~ N (0, σ o2 [1 + x0′ ( X ′X ) −1 x0 ]) .
Proof:
ŷ0 = x0′ βˆ = x0′ [ β o + ( X ′X ) −1 X ′ε ] = x0′ β o + x0′ ( X ′X ) −1 X ′ε .
y0 = x0′ βo + ε 0 .
→ y0 − yˆ 0 = ε 0 − x0′ ( X ′X ) −1 X ′ε .
→ E(y0 − ŷ0) = 0.
→ var(y0 − ŷ0) = var(ε0) + x0′(X′X)⁻¹X′Cov(ε)X(X′X)⁻¹x0 = σo² + σo²x0′(X′X)⁻¹x0 = σo²[1 + x0′(X′X)⁻¹x0].
Implication:
Let c be a critical value for two-tail t-test given a significance level (say, 5%):
Pr(−c < (y0 − ŷ0)/se(y0 − ŷ0) < c) = 0.95.
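Equivalently, a 95% forecast interval for y0 is ŷ0 ± c·se(y0 − ŷ0), where se(y0 − ŷ0) = √(s²[1 + x0′(X′X)⁻¹x0]) estimates the standard deviation given in the theorem above.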
Forecasting Procedure:
STEP 1: Let x0′ = ( x01 , x02 ,..., x0k ) .
STEP 2: Compute ŷ0 = x0′β̂ = (1, 1, 1)(1.2, −1, 2)′ = 2.2.
• If you have data points up to t = 100, and if you would like to forecast y at t = 101 and t = 102, you had better use a “dynamic forecast.”
• The formula of forecasting standard errors taught in the class can be used
for static forecasts. But the standard errors for dynamic forecasts are
much more complicated.
[Figure: static forecast LDPIFS of LDPI, 1996-2001.]
[Figure: LDPI and LDPIFS with the forecast bounds UPPERBS and LOWERBS, 1996-2001.]
Forecast evaluation (Forecast: LDPIFD; Actual: LDPI; Forecast sample: 1996:01 2001:12; Included observations: 71):
Root Mean Squared Error 0.057155
Mean Absolute Error 0.049899
Mean Abs. Percent Error 0.565546
Theil Inequality Coefficient 0.003247
  Bias Proportion 0.762216
  Variance Proportion 0.221227
  Covariance Proportion 0.016557
[Figure: dynamic forecast LDPIFD of LDPI with the forecast bounds UPPERBD and LOWERBD, 1996-2001.]
(1) Motivation
• If the regressors xt• are stochastic, all t and F tests are wrong (bad news).
• The t and F tests require the OLS estimator βˆ to be unbiased.
β̂ = βo + (X′X)⁻¹X′ε
→ E(β̂) = βo + E[(X′X)⁻¹X′ε] =? βo + (X′X)⁻¹X′E(ε).
The questioned equality holds if X is nonstochastic, or if (*) E(ε|X) = 0T×1.
Under this assumption,
E(β̂) = EX[E(β̂|X)] = EX[E(βo + (X′X)⁻¹X′ε|X)]
= EX[βo + (X′X)⁻¹X′E(ε|X)] = EX(βo) = βo.
• But, for some cases, condition (*) does not hold. For example, xt• = yt−1. In this case, E(εt−1|yt−1) ≠ 0. For this case, we can no longer say that β̂ is an unbiased estimator.
• An example of a model with a lagged dependent variable as a regressor:
yt = β1xt1 + β2xt2 + β3yt−1 + εt → β2/(1 − β3) = long-run effect of xt2.
• In addition, with stochastic regressors the exact (finite-sample) distribution of β̂ is no longer normal.
• What do we wish?
[We wish the distribution of θˆT would become more condensed around
θo as T increases.]
• The law of large numbers (LLN) says that a sample mean x̄T (the mean x̄ computed from a random sample of size T) converges in probability to the population mean as T → ∞.
• Gauss Exercise:
• A population with N(1,9).
• 1000 different random samples of T = 10 to compute x10 .
seed = 1;
tt1 = 10; @ # of observations @
tt2 = 100; @ # of observations @
tt3 = 1500; @ # of observations @
iter = 1000; @ # of sets of different data @
storx10 = zeros(iter,1) ;
storx100 = zeros(iter,1) ;
storx5000 = zeros(iter,1);
i = 1;
do while i <= iter; @ loop over the 1,000 replications @
x10 = 1 + 3*rndns(tt1,1,seed); @ sample of size tt1 from N(1,9) @
x100 = 1 + 3*rndns(tt2,1,seed);
x5000 = 1 + 3*rndns(tt3,1,seed);
storx10[i,1] = meanc(x10);
storx100[i,1] = meanc(x100);
storx5000[i,1] = meanc(x5000);
i = i + 1; endo;
library pgraph;
graphset;
[Figure: histograms of the stored sample means for the different sample sizes (x100, x5000).]
var( xT ) = σo2/T → 0 as T → ∞.
Thus, xT is a consistent estimator of μo.
• Then, √T(x̄T − μo) →d N(0, σo²) and √T(x̄T − μo)/σo →d N(0,1).
• Implication of CLT:
• √T(x̄T − μo) ≈ N(0, σo²), if T is large.
• E[√T(x̄T − μo)] = √T[E(x̄T) − μo] ≈ 0 → E(x̄T) ≈ μo.
• var[√T(x̄T − μo)] = T·var(x̄T) ≈ σo² → var(x̄T) ≈ σo²/T.
• x̄T ≈ N(μo, σo²/T), if T is large.
End of Digression
(WIC.1) The conditional mean of yt (dependent variable) given x1•, x2•, ... , xt•,
ε1, ... , and εt-1 is linear in xt•:
yt = E ( yt | x1• ,..., xt • , ε1 ,..., ε t −1 ) + ε t = xt′• βo + ε t .
Comment:
• Implies E (ε t | x1• , x2• ,..., xt • , ε1 , ε 2 ,..., ε t −1 ) = 0 .
• No autocorrelation in the εt: cov(ε t , ε s ) = 0 for all t ≠ s.
• Regressors are weakly exogenous and need not be strictly exogenous.
• E ( xs•ε t ) = 0k ×1 for all t ≥ s, but could be that E ( xs iε t ) ≠ 0 for some s > t.
(WIC.2) βo is unique.
Comment:
• (WIC.2)-(WIC.3) implies that
p limT →∞ T −1 X ′X = p limT →∞ T −1Σt xt • xt′• ≡ Qo is finite and pd.
(WIC.6) The error terms εt are normally distributed conditionally on x1•, … , xt•,
ε1, … , εt-1.
Comment:
SIC → WIC.
p limT→∞ s² = σo² (consistent).
√T(β̂ − βo) →d N(0k×1, σo²Qo⁻¹).
Implication:
βˆ ≈ N ( β o , σ o 2 (TQo ) −1 ) → βˆ ≈ N ( β o , s 2 ( X ′X ) −1 ) ,
if T is reasonably large.
Implication:
1) t test for Ho: Rβo - r = 0 (R: 1×k, r: scalar) is valid if T is large.
Use z-table to find critical value.
2) For Ho: Rβo - r = 0 (R: m×k, r: m×1),
use WT = mF which is asymptotically χ2(m) distributed. [Why?]
• WT = (Rβ̂ − r)′[RCov(β̂)R′]⁻¹(Rβ̂ − r)
= (Rβ̂ − r)′[Rs²(X′X)⁻¹R′]⁻¹(Rβ̂ − r) = mF.
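As an illustration, a minimal GAUSS sketch of WT computed on simulated data; the data-generating choices and the hypothesis (R, r) below are made up for the example.
new;
tt = 200; k = 3;
x = ones(tt,1)~rndn(tt,2);             @ regressors: intercept and two N(0,1) variables @
b0 = { 1, 0.5, 0 };                    @ true coefficients (assumed for the simulation) @
y = x*b0 + rndn(tt,1);
bhat = invpd(x'x)*(x'y);
e = y - x*bhat;
s2 = e'e/(tt-k);
covb = s2*invpd(x'x);                  @ estimated Cov(bhat) @
rr = { 0 1 0, 0 0 1 };                 @ Ho: beta2 = 0.5 and beta3 = 0 @
rvec = { 0.5, 0 };
wt = (rr*bhat - rvec)' * invpd(rr*covb*rr') * (rr*bhat - rvec);
print wt;                              @ compare with a chi-square(2) critical value @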
Examples:
1) θ: a scalar
Ho: θo = 2 → Ho: θo - 2 = 0 → Ho: w(θ) = 0, where w(θ) = θ - 2.
2) θ = (θ1, θ2, θ3)′.
Ho: θ1,o2 = θ2,o + 2 and θ3,o = θ1,o + θ2,o.
→ Ho: θ1,o2-θ2,o-2 = 0 and θ3,o-θ1,o-θ2,o = 0.
→ Ho: w(θ) = (w1(θ); w2(θ)) = (θ1² − θ2 − 2; θ3 − θ1 − θ2) = (0; 0).
3) linear restrictions
θ = [θ1, θ2, θ3]′.
Ho: θ1,o = θ2,o + 2 and θ3,o = θ1,o + θ2,o
→ Ho: w(θo) = (w1(θo); w2(θo)) = (θ1,o − θ2,o − 2; θ3,o − θ1,o − θ2,o) = (0; 0).
→ Ho: w(θo) = [1, −1, 0; −1, −1, 1](θ1,o; θ2,o; θ3,o) − (2; 0) = Rθo − r.
Definition:
W(θ) ≡ ∂w(θ)/∂θ′ = [∂w1(θ)/∂θ1, ∂w1(θ)/∂θ2, ..., ∂w1(θ)/∂θp; ∂w2(θ)/∂θ1, ∂w2(θ)/∂θ2, ..., ∂w2(θ)/∂θp; :, :, :; ∂wm(θ)/∂θ1, ∂wm(θ)/∂θ2, ..., ∂wm(θ)/∂θp]m×p.
Theorem:
Under (WIC.1)-(WIC.5),
√T(w(β̂) − w(βo)) →d N(0m×1, W(βo)σ²Qo⁻¹W(βo)′).
Proof:
Taylor's expansion around βo:
w(β̂) = w(βo) + W(β̄)(β̂ − βo), where β̄ lies between β̂ and βo,
→ √T(w(β̂) − w(βo)) ≈ W(βo)√T(β̂ − βo).
Implication:
w(β̂) − w(βo) ≈ N(0m×1, W(β̂)s²(X′X)⁻¹W(β̂)′).
→ Under Ho: w(βo) = 0m×1, w(β̂) ≈ N(0m×1, W(β̂)Cov(β̂)W(β̂)′).
p limT→∞ T⁻¹Σt xt•εt = 0.
→ p limT→∞ s² = σo² − 0′Qo⁻¹0 = σo².
→ √T(β̂ − βo) = [T⁻¹Σt xt•xt•′]⁻¹[(1/√T)Σt xt•εt].
→ By the GCLT with martingale differences,
(1/√T)Σt xt•εt →d N(0, lim T⁻¹Σt Cov(xt•εt)),
where Cov(xt•εt) = E(xt•εtεtxt•′) = E(εt²xt•xt•′).