Advanced Econometrics PDF
Arnan Viriyavejkul
Contents
1 Preliminaries
  1.1 Bivariate Statistics
  1.2 Linear Algebra
4 Instrumental Variables
  4.1 Estimator
    4.1.1 Case 1: L = K
    4.1.2 Case 2: L ≥ K
  4.2 Application: Partitioned Regression
    4.2.1 Consistency
    4.2.2 A Plot Twist
Chapter 1
Preliminaries
2. If $X \in \{0,1\}$, then $\dfrac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = E[Y \mid X=1] - E[Y \mid X=0]$.
\begin{align*}
P(X=1) &= p, \qquad P(X=0) = 1-p \\
E[X] &= E[X \mid X=1]\,p + E[X \mid X=0]\,(1-p) = p \\
E[X^2] &= E[X^2 \mid X=1]\,p + E[X^2 \mid X=0]\,(1-p) = p \\
E[Y] &= E[Y \mid X=1]\,p + E[Y \mid X=0]\,(1-p) = E\big[E[Y \mid X]\big] \\
E[XY] &= E\big[E[XY \mid X]\big] = E\big[X\,E[Y \mid X]\big] \\
&= p \cdot 1 \cdot E[Y \mid X=1] + (1-p) \cdot 0 \cdot E[Y \mid X=0] = p\,E[Y \mid X=1]
\end{align*}
Hence $\operatorname{Cov}(X,Y) = E[XY] - E[X]E[Y] = p(1-p)\big(E[Y \mid X=1] - E[Y \mid X=0]\big)$ and $\operatorname{Var}(X) = E[X^2] - E[X]^2 = p(1-p)$, so the ratio equals the difference in conditional means.
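A minimal numerical check of this identity, assuming numpy is available; the intercept 2.0 and the gap 1.5 are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100_000, 0.4
X = rng.binomial(1, p, size=N)            # binary regressor
Y = 2.0 + 1.5 * X + rng.normal(size=N)    # E[Y|X=1] - E[Y|X=0] = 1.5

slope = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)       # sample Cov(X,Y)/Var(X)
diff_in_means = Y[X == 1].mean() - Y[X == 0].mean()
print(slope, diff_in_means)                           # both approximately 1.5
```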
4. Let $\bar Y_N := \sum_{i=1}^N Y_i / N$, with $\mu_Y = E[Y_i]$ and $\sigma_Y^2 = \operatorname{Var}[Y_i] < \infty$. Derive $E[\bar Y_N]$ and $\operatorname{Var}[\bar Y_N]$.
\begin{align*}
E\big[\bar Y_N\big] &= E\left[\frac{\sum_{i=1}^N Y_i}{N}\right] = \frac{1}{N} E\left[\sum_{i=1}^N Y_i\right]
= \frac{1}{N} \sum_{i=1}^N E[Y_i]
= \frac{1}{N} \underbrace{(\mu_Y + \ldots + \mu_Y)}_{N \text{ times}}
= \frac{1}{N}(N \mu_Y) = \mu_Y
\end{align*}
\begin{align*}
\operatorname{Var}\big[\bar Y_N\big] &= \operatorname{Var}\left[\frac{\sum_{i=1}^N Y_i}{N}\right] = \frac{1}{N^2}\operatorname{Var}\left[\sum_{i=1}^N Y_i\right]
= \frac{1}{N^2}\operatorname{Var}[Y_1 + Y_2 + \ldots + Y_N] \\
&= \frac{1}{N^2}\underbrace{\big(\sigma_Y^2 + \ldots + \sigma_Y^2\big)}_{N \text{ times}}
= \frac{1}{N^2}\big(N \sigma_Y^2\big) = \frac{\sigma_Y^2}{N}
\end{align*}
For the standardized mean $Z_N := \dfrac{\bar Y_N - E[\bar Y_N]}{\sqrt{\operatorname{Var}[\bar Y_N]}}$,
\begin{align*}
E[Z_N] &= E\left[\frac{\bar Y_N - \mu_Y}{\sqrt{\sigma_Y^2 / N}}\right]
= \frac{\sqrt{N}}{\sigma_Y} E\big[\bar Y_N - \mu_Y\big]
= \frac{\sqrt{N}}{\sigma_Y}(\mu_Y - \mu_Y) = 0 \\
\operatorname{Var}[Z_N] &= \operatorname{Var}\left[\frac{\bar Y_N - \mu_Y}{\sqrt{\sigma_Y^2 / N}}\right]
= \frac{N}{\sigma_Y^2}\operatorname{Var}\big[\bar Y_N\big]
= \frac{N}{\sigma_Y^2} \cdot \frac{\sigma_Y^2}{N} = 1
\end{align*}
By the Cauchy--Schwarz inequality $|\langle X, Y\rangle|^2 \le \langle X, X\rangle \cdot \langle Y, Y\rangle$,
\begin{align*}
E|XY| &\le \sqrt{E[X^2]\,E[Y^2]} \\
|E[XY]|^2 &\le E[X^2]\,E[Y^2] \\
|E[X]|^2 &\le E[X^2]\,E[1^2]
\end{align*}
\begin{align*}
Y_i &= \beta_0 + \beta_{1i} X_i + u_i \\
X_i &= \pi_0 + \pi_{1i} Z_i + v_i
\end{align*}
The goal here is to express $\sigma_{ZY}$ and $\sigma_{ZX}$ in terms of moments of $\pi_{1i}$ and $\beta_{1i}$. In the derivations, assume random assignment of the binary instrument $Z_i$, so that
\begin{align*}
P(Z_i = 1) &= p, \qquad P(Z_i = 0) = 1-p \\
E[Z_i] &= E[Z_i \mid Z_i = 1]\,p + E[Z_i \mid Z_i = 0]\,(1-p) = p \\
E[Z_i^2] &= E[Z_i^2 \mid Z_i = 1]\,p + E[Z_i^2 \mid Z_i = 0]\,(1-p) = p \\
E[X_i] &= E[X_i \mid Z_i = 1]\,p + E[X_i \mid Z_i = 0]\,(1-p) = E\big[E[X_i \mid Z_i]\big] \\
E[Z_i X_i] &= E\big[E[Z_i X_i \mid Z_i]\big] = E\big[Z_i\,E[X_i \mid Z_i]\big] \\
&= p \cdot 1 \cdot E[X_i \mid Z_i = 1] + (1-p) \cdot 0 \cdot E[X_i \mid Z_i = 0] = p\,E[X_i \mid Z_i = 1]
\end{align*}
Since $E[\pi_{1i}] = E[X_i \mid Z_i = 1] - E[X_i \mid Z_i = 0]$ and, from the two-stage regression equations,
\begin{align*}
E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] &= E[\beta_0 + \beta_{1i}\pi_0 + \beta_{1i}\pi_{1i} + \beta_{1i} v_i + u_i] - E[\beta_0 + \beta_{1i}\pi_0 + \beta_{1i} v_i + u_i] \\
&= E[\beta_{1i}\pi_{1i}],
\end{align*}
putting terms together, one gets the local average treatment effect (LATE),
\[
\hat\beta^{IV} = \frac{S_{ZY}}{S_{ZX}} = \frac{\sigma_{ZY}}{\sigma_{ZX}} + o_p(1) = \frac{E[\beta_{1i}\pi_{1i}]}{E[\pi_{1i}]} + o_p(1)
\]
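A small simulation of this Wald/LATE ratio, assuming numpy; the heterogeneous coefficients $\pi_{1i}$ and $\beta_{1i}$ and the error correlation below are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200_000, 0.5
Z = rng.binomial(1, p, size=N)              # randomly assigned binary instrument
pi1 = rng.uniform(0.2, 0.8, size=N)         # heterogeneous first-stage effects
beta1 = rng.normal(1.0, 0.5, size=N)        # heterogeneous treatment effects
v = rng.normal(size=N)
u = 0.8 * v + rng.normal(size=N)            # endogeneity: Cov(u, v) != 0
X = 0.3 + pi1 * Z + v
Y = 1.0 + beta1 * X + u

wald = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]    # sample sigma_ZY / sigma_ZX
late = np.mean(beta1 * pi1) / np.mean(pi1)         # E[beta_1i pi_1i] / E[pi_1i]
print(wald, late)                                  # approximately equal
```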
\begin{align*}
P_X Y = \hat Y &= X\left(X'X\right)^{-1}X'Y \\
&= X\left(X'X\right)^{-1}X'\left(X\beta^\star + u\right) \\
&= X\left(X'X\right)^{-1}X'X\beta^\star + X\left(X'X\right)^{-1}X'u \\
&= X\beta^\star + X\left(X'X\right)^{-1}X'u \\
&= X\left(\beta^\star + \left(X'X\right)^{-1}X'u\right) = X\hat\beta^{OLS} = \hat Y
\end{align*}
\begin{align*}
M_X Y = \hat u &= \left(I_N - P_X\right)Y \\
&= \left(I_N - X\left(X'X\right)^{-1}X'\right)Y \\
&= Y - X\left(X'X\right)^{-1}X'Y \\
&= Y - X\hat\beta^{OLS} = \hat u
\end{align*}
\begin{align*}
M_X u &= \left(I_N - P_X\right)u \\
&= \left(I_N - X\left(X'X\right)^{-1}X'\right)u \\
&= I_N u - X\left(X'X\right)^{-1}X'u \\
&= u - X\left(\hat\beta^{OLS} - \beta^\star\right) \\
&= u + X\beta^\star - X\hat\beta^{OLS} \\
&= Y - X\hat\beta^{OLS} = \hat u
\end{align*}
\begin{align*}
M_X &= I_N - P_X \\
M_X' &= \left(I_N - X\left(X'X\right)^{-1}X'\right)' = I_N - X\left(X'X\right)^{-1}X' = I_N - P_X \\
M_X M_X &= \left(I_N - P_X\right)\left(I_N - P_X\right) = I_N I_N - I_N P_X - P_X I_N + P_X P_X = I_N - P_X - P_X + P_X = I_N - P_X \\
\operatorname{tr}\left(M_X\right) &= \operatorname{tr}\left(I_N - P_X\right) = \operatorname{tr}\left(I_N\right) - \operatorname{tr}\left(P_X\right) = N - K
\end{align*}
Remarks: since $P_X$ and $M_X$ are symmetric idempotent matrices, they have only 0 and 1 as eigenvalues, and the multiplicity of the eigenvalue 1 is precisely the rank. Using the spectral (eigen)decomposition $P_X = C\Lambda C'$ with $C'C = CC' = I$, one can also prove that
\begin{align*}
P_X P_X = P_X
&\;\Rightarrow\; C\Lambda C' \cdot C\Lambda C' = C\Lambda C' \\
&\;\Rightarrow\; C\Lambda\Lambda C' = C\Lambda C' \\
&\;\Rightarrow\; C'C\Lambda\Lambda C'C = C'C\Lambda C'C \\
&\;\Rightarrow\; \Lambda\Lambda = \Lambda,
\end{align*}
so each diagonal entry of $\Lambda$ satisfies $\lambda^2 = \lambda$, i.e. $\lambda \in \{0, 1\}$.
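A quick numerical illustration of these properties, assuming numpy; the dimensions N = 200 and K = 3 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 200, 3
X = rng.normal(size=(N, K))

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix P_X
M = np.eye(N) - P                         # annihilator (residual maker) M_X

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(M, M.T))                             # symmetric
print(np.isclose(np.trace(M), N - K))                  # tr(M_X) = N - K
eig = np.sort(np.linalg.eigvalsh(P))
print(np.allclose(eig[-K:], 1.0), np.allclose(eig[:-K], 0.0))  # eigenvalues are 0 or 1
```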
Chapter 2
Ordinary Least Squares
Estimator
1. $\tilde\beta := \operatorname{argmin}_{b \in \mathbb{R}^K} E\left[(Y - X'b)^2\right]$, where $\dim X = K \times 1$, $\dim a = K \times 1$, and $\dim A = K \times K$.
Useful matrix-derivative rules:
\begin{align*}
\frac{\partial}{\partial X}\left(A'X\right) &= \frac{\partial}{\partial X}\left(X'A\right) = A \\
\frac{\partial}{\partial X'}\left(AX\right) &= A \\
\frac{\partial}{\partial X}\left(X'AX\right) &= \left(A + A'\right)X \\
\frac{\partial^2}{\partial X \partial X'}\left(X'AX\right) &= A + A'
\end{align*}
The first-order condition for $s(b) := E\left[(Y - X'b)^2\right]$ is
\begin{align*}
\frac{\partial s(b)}{\partial b} &= -2E[XY] + 2E[XX']b \\
&= -2E[XY] + 2E[XX']\tilde\beta = 0 \\
\tilde\beta &= E[XX']^{-1}E[XY]
\end{align*}
\begin{align*}
e(b) &= Y - Xb \\
RSS(b) &= e'(b)e(b) = (Y - Xb)'(Y - Xb) \\
&= \left(Y' - b'X'\right)(Y - Xb) \\
&= Y'Y - Y'Xb - b'X'Y + b'X'Xb \\
&= Y'Y - 2b'X'Y + b'X'Xb \\
\frac{\partial RSS(b)}{\partial b} &= 0 - 2X'Y + \left(X'X + (X'X)'\right)b \\
&= -2X'Y + 2X'X\hat\beta^{OLS} = 0 \\
\hat\beta^{OLS} &= \left(X'X\right)^{-1}\left(X'Y\right) = \left(\sum_{i=1}^N X_i X_i'\right)^{-1}\left(\sum_{i=1}^N X_i Y_i\right)
\end{align*}
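A minimal sketch of the closed-form estimator in numpy; the design and the coefficient vector (1, 2) are illustrative. Solving the normal equations with solve() is numerically preferable to forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # constant + one regressor
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(size=N)

# beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # close to [1.0, 2.0]
```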
Asymptotic Distribution
1. Asymptotic distribution of $\sqrt{N}\left(\hat\beta^{OLS} - \beta^\star\right)$ under homoskedasticity, where $\beta^\star = E\left[X_i X_i'\right]^{-1} E\left[X_i Y_i\right]$.
\begin{align*}
\hat\beta^{OLS} &= \left(X'X\right)^{-1}X'Y = \left(X'X\right)^{-1}X'\left(X\beta^\star + u\right) \\
&= \beta^\star + \left(X'X\right)^{-1}X'u \\
&= \beta^\star + \left(N^{-1}\sum_{i=1}^N X_i X_i'\right)^{-1}\left(N^{-1}\sum_{i=1}^N X_i u_i\right) \\
\sqrt{N}\left(\hat\beta^{OLS} - \beta^\star\right) &= \left(N^{-1}\sum_{i=1}^N X_i X_i'\right)^{-1} N^{-\frac{1}{2}}\sum_{i=1}^N X_i u_i
\end{align*}
By the weak law of large numbers and the continuous mapping theorem,
\[
\left(N^{-1}\sum_{i=1}^N X_i X_i'\right)^{-1} \xrightarrow{p} E\left[X_i X_i'\right]^{-1},
\]
and, writing $Z_i := X_i u_i$,
\[
N^{-\frac{1}{2}}\sum_{i=1}^N X_i u_i = \sqrt{N}\,\bar Z_N \xrightarrow{d} Z \sim N\!\left(0, E\left[Z_i Z_i'\right]\right) = N\!\left(0, E\left[u_i^2 X_i X_i'\right]\right).
\]
Therefore,
\[
\sqrt{N}\left(\hat\beta^{OLS} - \beta^\star\right) = E\left[X_i X_i'\right]^{-1} N^{-1/2}\sum_{i=1}^N X_i u_i + o_p(1)
\xrightarrow{d} N\!\left(0,\; E\left[X_i X_i'\right]^{-1} E\left[u_i^2 X_i X_i'\right] E\left[X_i X_i'\right]^{-1}\right).
\]
Under homoskedasticity,
\[
E\left[u_i^2 X_i X_i'\right] = E\left[E\left[u_i^2 X_i X_i' \mid X\right]\right] = E\left[X_i X_i' E\left[u_i^2 \mid X\right]\right] = \sigma_u^2 E\left[X_i X_i'\right].
\]
Finally,
\[
\sqrt{N}\left(\hat\beta^{OLS} - \beta^\star\right) \xrightarrow{d} N\!\left(0, \sigma_u^2\, E\left[X_i X_i'\right]^{-1}\right).
\]
2. A consistent estimator for the asymptotic variance of $\sqrt{N}\left(\hat\beta^{OLS} - \beta^\star\right)$ under homoskedasticity:
\[
S_u^2 = \frac{1}{N-K}\sum_{i=1}^N \hat u_i^2 = \frac{1}{N-K}\sum_{i=1}^N \left(Y_i - X_i'\hat\beta^{OLS}\right)^2,
\]
where
\begin{align*}
\hat u_i &= Y_i - X_i'\hat\beta^{OLS} \\
&= u_i + X_i'\beta^\star - X_i'\hat\beta^{OLS} \\
&= u_i - X_i'\left(\hat\beta^{OLS} - \beta^\star\right).
\end{align*}
\begin{align*}
S_u^2 &= \frac{1}{N-K}\sum_{i=1}^N \left[u_i - X_i'\left(\hat\beta^{OLS} - \beta^\star\right)\right]^2 \\
&= \frac{N}{N-K}\left[N^{-1}\sum_{i=1}^N u_i^2\right]
+ \left(\hat\beta^{OLS} - \beta^\star\right)'\frac{\sum_{i=1}^N X_i X_i'}{N-K}\left(\hat\beta^{OLS} - \beta^\star\right)
- 2\left(\hat\beta^{OLS} - \beta^\star\right)'\frac{\sum_{i=1}^N X_i u_i}{N-K} \\
S_u^2 &\xrightarrow{p} 1 \cdot \sigma_u^2 + 0 \cdot E\left[X_i X_i'\right] \cdot 0 - 2 \cdot 0 \cdot 0 = \sigma_u^2
\end{align*}
The proof of consistency ends here. However, to take it a bit further, since $\hat u = M_X u$,
\begin{align*}
E\left[\hat u'\hat u \mid X\right] &= E\left[u' M_X u \mid X\right] \\
&= E\left[\operatorname{tr}\left(M_X u u'\right) \mid X\right] \\
&= \operatorname{tr}\left(M_X E\left[u u' \mid X\right]\right) \\
&= \operatorname{tr}\left(\sigma_u^2 I_N M_X\right) = \sigma_u^2 \operatorname{tr}\left(M_X\right) = \sigma_u^2 (N - K),
\end{align*}
so that
\[
E\left[s^2 \mid X\right] = \frac{E\left[\hat u'\hat u \mid X\right]}{N-K} = \frac{\sigma_u^2 (N-K)}{N-K} = \sigma_u^2.
\]
Moreover,
\begin{align*}
\hat\sigma_u^2 = \frac{1}{N-K}\hat u'\hat u &= \frac{1}{N-K} u' M_X u \\
&= \frac{1}{N-K}\left[u'u - u'X\left(X'X\right)^{-1}X'u\right] \\
&= \frac{N}{N-K}\left[\frac{u'u}{N} - \frac{u'X}{N}\left(\frac{X'X}{N}\right)^{-1}\frac{X'u}{N}\right],
\end{align*}
where
\[
\frac{X'u}{N} \xrightarrow{p} 0, \qquad \frac{u'u}{N} \xrightarrow{p} \sigma_u^2, \qquad \frac{N}{N-K} \to 1,
\]
and therefore
\[
\frac{\hat u'\hat u}{N-K}\left(\frac{X'X}{N}\right)^{-1} \xrightarrow{p} \sigma_u^2\, E\left[X_i X_i'\right]^{-1}.
\]
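A small numerical check of this variance estimator, assuming numpy; σ_u = 1.5 and the coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 10_000, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
sigma_u = 1.5
Y = X @ np.array([1.0, 2.0]) + sigma_u * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat
s2 = u_hat @ u_hat / (N - K)                    # degrees-of-freedom corrected estimator
avar_hat = s2 * np.linalg.inv(X.T @ X / N)      # estimates sigma_u^2 E[X_i X_i']^{-1}
print(s2, sigma_u**2)                            # close to each other
print(avar_hat)
```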
Chapter 3
Linear Regression Model
\begin{align*}
\hat\beta^{OLS} &= \left(X'X\right)^{-1}\left(X'Y\right) \\
&= \left(X'X\right)^{-1}X'\left(X\beta + e\right) \\
&= \left(X'X\right)^{-1}X'X\beta + \left(X'X\right)^{-1}X'e \\
&= \beta + \left(X'X\right)^{-1}X'e \\
&= \beta + O_p(1) \cdot o_p(1) = \beta + o_p(1)
\end{align*}
\[
\operatorname{plim} \hat\beta^{OLS} = \beta + \operatorname{plim}\left[\left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'e}{N}\right)\right],
\qquad
\operatorname{plim}\left(\frac{X'e}{N}\right) = 0,
\qquad
\operatorname{plim}\left(\frac{X'X}{N}\right)^{-1} = E\left[X_i X_i'\right]^{-1},
\]
so that $\hat\beta^{OLS} \xrightarrow{p} \beta$.
\begin{align*}
\operatorname{Var}\left(\hat\beta^{OLS} \mid X\right) &= E\left[\left(\hat\beta^{OLS} - E\left[\hat\beta^{OLS}\right]\right)\left(\hat\beta^{OLS} - E\left[\hat\beta^{OLS}\right]\right)' \Bigm| X\right] \\
&= E\left[\left(\left(X'X\right)^{-1}X'e\right)\left(\left(X'X\right)^{-1}X'e\right)' \Bigm| X\right] \\
&= E\left[\left(X'X\right)^{-1}X'ee'X\left(X'X\right)^{-1} \Bigm| X\right] \\
&= \left(X'X\right)^{-1}X'E\left[ee' \mid X\right]X\left(X'X\right)^{-1} \\
&= \left(X'X\right)^{-1}X'\Sigma X\left(X'X\right)^{-1}
\end{align*}
Since $\tilde q\, D\Sigma D'\, \tilde q' \ge 0$ for any conformable $\tilde q$, the matrix $D\Sigma D'$ is positive semi-definite.
With $\operatorname{Var}(e \mid X) = \Sigma = \sigma^2\Gamma$ and $\Gamma$ known, applying OLS to the transformed model $\Gamma^{-\frac{1}{2}}Y = \Gamma^{-\frac{1}{2}}X\beta + \Gamma^{-\frac{1}{2}}e$ gives the GLS estimator,
\begin{align*}
\hat\beta^{GLS} &= \left(X'\Gamma^{-\frac{1}{2}}\Gamma^{-\frac{1}{2}}X\right)^{-1}X'\Gamma^{-\frac{1}{2}}\Gamma^{-\frac{1}{2}}Y \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}Y \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}\left(X\beta + e\right) \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}X\beta + \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}e \\
&= \beta + \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}e
\end{align*}
\begin{align*}
E\left[\hat\beta^{GLS} \mid X\right] &= E\left[\left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}Y \Bigm| X\right] \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}E[Y \mid X] \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}X\beta \\
&= \beta
\end{align*}
\begin{align*}
\operatorname{Var}\left[\hat\beta^{GLS} \mid X\right] &= E\left[\left(\hat\beta^{GLS} - \beta\right)\left(\hat\beta^{GLS} - \beta\right)' \Bigm| X\right] \\
&= E\left[\left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}ee'\Gamma^{-1}X\left(X'\Gamma^{-1}X\right)^{-1} \Bigm| X\right] \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}E\left[ee' \mid X\right]\Gamma^{-1}X\left(X'\Gamma^{-1}X\right)^{-1} \\
&= \left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}\Sigma\Gamma^{-1}X\left(X'\Gamma^{-1}X\right)^{-1} \\
&= \sigma^2\left(X'\Gamma^{-1}X\right)^{-1}X'\Gamma^{-1}\Gamma\Gamma^{-1}X\left(X'\Gamma^{-1}X\right)^{-1} \\
&= \sigma^2\left(X'\Gamma^{-1}X\right)^{-1}\left(X'\Gamma^{-1}X\right)\left(X'\Gamma^{-1}X\right)^{-1} \\
&= \sigma^2\left(X'\Gamma^{-1}X\right)^{-1} \\
&= \left(X'(\sigma^2\Gamma)^{-1}X\right)^{-1} = \left(X'\Sigma^{-1}X\right)^{-1}
\end{align*}
It is worth noting that $\operatorname{Var}[\tilde\beta \mid X] \ge \operatorname{Var}\left[\hat\beta^{GLS} \mid X\right]$ for any other linear unbiased estimator $\tilde\beta$.
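A short GLS sketch in numpy under an assumed known variance pattern Var(e_i | x_i) ∝ x_i²; the pattern and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5_000
x = rng.uniform(1, 3, size=N)
X = np.column_stack([np.ones(N), x])
gamma = x**2                                  # diagonal of Gamma: Var(e_i|x_i) proportional to x_i^2
e = np.sqrt(gamma) * rng.normal(size=N)
Y = X @ np.array([1.0, 2.0]) + e

# GLS = OLS on the transformed model Gamma^{-1/2} Y = Gamma^{-1/2} X beta + Gamma^{-1/2} e
w = 1.0 / np.sqrt(gamma)
Xw, Yw = X * w[:, None], Y * w
beta_gls = np.linalg.solve(Xw.T @ Xw, Xw.T @ Yw)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_gls)   # both consistent; GLS is more precise under this design
print(beta_ols)
```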
Chapter 4
Instrumental Variables
Recall that for a general model $Y = X\beta + e$ (or $Y_i = X_i'\beta + e_i$), one of the most important assumptions of OLS theory is the exogeneity of the independent variables,
\[
E(e \mid X) = 0.
\]
When this assumption fails, the OLS estimator loses all its advantages. In particular, it is easy to verify that when $E[e_i X_i] \ne 0$,
\begin{align*}
E[\hat\beta] &= \beta + E\left[\left(X'X\right)^{-1}X'e\right] \\
&\approx \beta + E\left(X_i X_i'\right)^{-1}E\left(X_i e_i\right) \\
&\ne \beta.
\end{align*}
Consider, for example, a regression with a mismeasured regressor,
\begin{align*}
Y_i &= X_i'\beta + e_i \\
\tilde X_i &= X_i + r_i
\end{align*}
where the measurement error $r_i$ is uncorrelated with $X_i$ and $e_i$, and $E\left[\tilde X_i \tilde X_i'\right]$ is invertible. Regressing $Y_i$ on the mismeasured regressor $\tilde X_i$ yields an inconsistent (attenuated) estimator.
Proof. Recall from the least squares estimation, writing $Y_i = \tilde X_i'\beta + v_i$ with $v_i = e_i - r_i'\beta$,
\begin{align*}
\hat\beta &= \left(\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(\sum_{i=1}^N \tilde X_i Y_i\right) \\
&= \left(\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(\sum_{i=1}^N \tilde X_i\left(\tilde X_i'\beta + v_i\right)\right) \\
&= \beta + \left(\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(\sum_{i=1}^N \tilde X_i v_i\right) \\
&= \beta + \left(\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(\sum_{i=1}^N \tilde X_i\left(e_i - r_i\beta\right)\right) \\
&= \beta + \left(\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(\sum_{i=1}^N \left(X_i + r_i\right)\left(e_i - r_i\beta\right)\right) \\
&= \beta + \left(\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(\sum_{i=1}^N \left(X_i e_i - X_i r_i\beta + r_i e_i - r_i^2\beta\right)\right) \\
&= \beta + \left(N^{-1}\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1}\left(N^{-1}\sum_{i=1}^N X_i e_i - N^{-1}\sum_{i=1}^N X_i r_i\beta + N^{-1}\sum_{i=1}^N r_i e_i - N^{-1}\sum_{i=1}^N r_i^2\beta\right)
\end{align*}
By the weak law of large numbers,
\[
N^{-1}\sum_{i=1}^N X_i e_i \xrightarrow{p} 0, \qquad
N^{-1}\sum_{i=1}^N X_i r_i \beta \xrightarrow{p} 0, \qquad
N^{-1}\sum_{i=1}^N r_i e_i \xrightarrow{p} 0,
\]
\[
N^{-1}\sum_{i=1}^N \tilde X_i \tilde X_i' \xrightarrow{p} E\left[\tilde X_i \tilde X_i'\right], \qquad
N^{-1}\sum_{i=1}^N r_i^2 \beta \xrightarrow{p} E\left[r_i^2\right]\beta.
\]
Notice that,
\begin{align*}
\operatorname{plim}\left(\frac{1}{N}\sum_{i=1}^N \tilde X_i \tilde X_i'\right)^{-1} &= \left(E\left[X_i X_i'\right] + E\left[r_i r_i'\right]\right)^{-1} \\
\operatorname{plim}\left(\frac{1}{N}\sum_{i=1}^N \tilde X_i e_i\right) &= E\left[X_i e_i\right] + E\left[r_i e_i\right] = 0 \\
\operatorname{plim}\left(\frac{1}{N}\sum_{i=1}^N \tilde X_i r_i'\right) &= E\left[X_i r_i'\right] + E\left[r_i r_i'\right] = E\left[r_i r_i'\right]
\end{align*}
so that
\[
\operatorname{plim}\hat\beta = \beta + \left(E\left[X_i X_i'\right] + E\left[r_i r_i'\right]\right)^{-1}\left(-E\left[r_i r_i'\right]\beta\right),
\]
and in the scalar case,
\[
\operatorname{plim}\hat\beta = \beta + \left(E\left[X_i^2\right] + E\left[r_i^2\right]\right)^{-1}\left(-E\left[r_i^2\right]\beta\right)
= \beta\left(1 - \frac{E\left[r_i^2\right]}{E\left[X_i^2\right] + E\left[r_i^2\right]}\right).
\]
Here $E\left[\tilde X_i \tilde X_i'\right]$ is positive definite, with
\begin{align*}
E\left[\tilde X_i \tilde X_i'\right] &= E\left[\left(X_i + r_i\right)\left(X_i + r_i\right)'\right] \\
&= E\left[X_i X_i'\right] + E\left[X_i r_i'\right] + E\left[r_i X_i'\right] + E\left[r_i r_i'\right] \\
&= E\left[X_i X_i'\right] + E\left[r_i r_i'\right].
\end{align*}
Therefore,
\[
\hat\beta \xrightarrow{p} \beta\left(1 - \frac{E\left[r_i^2\right]}{E\left[\tilde X_i \tilde X_i'\right]}\right)
= \beta\left(1 - \frac{\sigma_r^2}{\sigma_x^2 + \sigma_r^2}\right).
\]
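A simulation of this attenuation factor, assuming numpy; β = 2, σ_x = 1 and σ_r = 0.5 are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200_000
beta, sigma_x, sigma_r = 2.0, 1.0, 0.5
X = sigma_x * rng.normal(size=N)
r = sigma_r * rng.normal(size=N)          # classical measurement error
X_tilde = X + r                            # observed, mismeasured regressor
Y = beta * X + rng.normal(size=N)

beta_hat = np.cov(X_tilde, Y)[0, 1] / np.var(X_tilde, ddof=1)
attenuation = 1 - sigma_r**2 / (sigma_x**2 + sigma_r**2)
print(beta_hat, beta * attenuation)        # both approximately 1.6
```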
4.1 Estimator
General theory: a consistent estimator of $\beta$ for the general model $y = X\beta + e$, when $E[X'e] \ne 0$, can be obtained if we can find a matrix of instruments $Z$ of order $N \times L$, with $L \ge K$ (at least as many instruments as regressors), such that:
1. Variables in $Z$ are correlated with those in $X$ and $Z'X/N \xrightarrow{p} \Sigma_{ZX}$, finite and of full rank (by column or row).
2. $Z'e/N \xrightarrow{p} 0$.
Idea 4.1.1. By projecting (regressing) $X$ on $Z$, hence creating $\hat X$, we take away the share of $X$ related to $e$, making $\hat\beta_{IV}$ consistent. Write the first-stage regression
\[
X_i = \pi' Z_i + v_i,
\]
such that $\pi = E\left(Z_i Z_i'\right)^{-1}E\left(Z_i X_i'\right)$, implying $E\left(Z_i v_i'\right) = 0$. The reduced form for $X_i$ can be plugged into the original regression:
\begin{align*}
Y_i &= X_i'\beta + e_i \\
&= \left(\pi' Z_i + v_i\right)'\beta + e_i \\
&= Z_i'\lambda + w_i, \qquad \text{where } \lambda = \pi\beta \text{ and } w_i = v_i'\beta + e_i.
\end{align*}
4.1.1 Case 1: L = K
Recall a useful theorem from Linear Algebra
Theorem 4.1.2. The linear system Ax = b has a solution if and only if its augmented
matrix and coefficient matrix have the same rank.
That means the augmented matrix $(\pi\;\;\lambda)$ has full rank $K$, so there exists a unique solution for $\beta$:
\begin{align*}
\beta = \pi^{-1}\lambda &= E\left(Z_i X_i'\right)^{-1}E\left(Z_i Z_i'\right)\,E\left(Z_i Z_i'\right)^{-1}E\left(Z_i Y_i\right) \\
&= E\left(Z_i X_i'\right)^{-1}E\left(Z_i Y_i\right)
\end{align*}
Applying the analogy principle delivers the estimator
\begin{align*}
\hat\beta^{IV} &= \left(\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N Z_i Y_i\right) \\
&= \left(\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N Z_i\left(X_i'\beta + e_i\right)\right) \\
&= \left(\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N Z_i X_i'\right)\beta + \left(\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N Z_i e_i\right) \\
&= \beta + \left(\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N Z_i e_i\right)
\end{align*}
General Form
To derive the IV estimator we start from the classic regression setup
\[
Y = X\beta + e
\]
and consider the projected model
\[
\hat Y = \hat X\beta + \hat e, \qquad \hat X = Z\left(Z'Z\right)^{-1}Z'X, \quad \hat Y = Z\left(Z'Z\right)^{-1}Z'Y.
\]
The instrumental variable estimator can then be obtained by applying OLS to this modified model:
\begin{align*}
\hat\beta_{IV} &= \left(\hat X'\hat X\right)^{-1}\hat X'\hat Y \\
&= \left[\left(Z\left(Z'Z\right)^{-1}Z'X\right)'\left(Z\left(Z'Z\right)^{-1}Z'X\right)\right]^{-1}\left(Z\left(Z'Z\right)^{-1}Z'X\right)'Z\left(Z'Z\right)^{-1}Z'Y \\
&= \left[X'Z\left(Z'Z\right)^{-1}Z'Z\left(Z'Z\right)^{-1}Z'X\right]^{-1}X'Z\left(Z'Z\right)^{-1}Z'Z\left(Z'Z\right)^{-1}Z'Y \\
&= \left[X'Z\left(Z'Z\right)^{-1}Z'X\right]^{-1}X'Z\left(Z'Z\right)^{-1}Z'Y
\end{align*}
This is how the IV estimator looks in general. If the number of instruments $L$ equals the number of regressors $K$ (i.e. $L = K$), then the product $Z'X$ is a square matrix of dimension $K \times K$ (or $L \times L$), which is non-singular (i.e. invertible). Therefore, we can rewrite the term in square brackets as
\begin{align*}
\hat\beta_{IV} &= \left[X'Z\left(Z'Z\right)^{-1}Z'X\right]^{-1}X'Z\left(Z'Z\right)^{-1}Z'Y \\
&= \left(Z'X\right)^{-1}\left(Z'Z\right)\left(X'Z\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'Y \\
&= \left(Z'X\right)^{-1}\left(Z'Z\right)\left(Z'Z\right)^{-1}Z'Y \\
&= \left(Z'X\right)^{-1}Z'Y
\end{align*}
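A numpy sketch of the just-identified case, using (Z′X)⁻¹Z′Y; the data-generating values (the 0.8 error correlation, the 0.7 first-stage coefficient, and β = (1, 2)) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
Z = np.column_stack([np.ones(N), rng.normal(size=N)])   # L = K = 2 (constant + one instrument)
v = rng.normal(size=N)
e = 0.8 * v + rng.normal(size=N)                        # endogeneity: Cov(e, v) != 0
X = np.column_stack([np.ones(N), 1.0 + 0.7 * Z[:, 1] + v])
Y = X @ np.array([1.0, 2.0]) + e

beta_iv = np.linalg.solve(Z.T @ X, Z.T @ Y)             # (Z'X)^{-1} Z'Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_iv)    # close to [1, 2]
print(beta_ols)   # slope biased upward by the endogeneity
```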
Consistency
The estimator is consistent, since
\[
\hat\beta^{IV} = \beta + \left(N^{-1}\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(N^{-1}\sum_{i=1}^N Z_i e_i\right),
\]
where $\left(N^{-1}\sum_i Z_i X_i'\right)^{-1} \xrightarrow{p} E\left(Z_i X_i'\right)^{-1}$ and $N^{-1}\sum_i Z_i e_i \xrightarrow{p} 0$.
Asymptotic Distribution of $\sqrt{N}\left(\hat\beta^{IV} - \beta\right)$
\begin{align*}
\hat\beta^{IV} &= \beta + \left(N^{-1}\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(N^{-1}\sum_{i=1}^N Z_i e_i\right) \\
\sqrt{N}\left(\hat\beta^{IV} - \beta\right) &= \left(N^{-1}\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(N^{-\frac{1}{2}}\sum_{i=1}^N Z_i e_i\right)
\end{align*}
Therefore,
\[
\sqrt{N}\left(\hat\beta^{IV} - \beta\right) \xrightarrow{d} N(0, \Omega),
\qquad \text{where } \Omega = E\left(Z_i X_i'\right)^{-1}E\left(e_i^2 Z_i Z_i'\right)E\left(X_i Z_i'\right)^{-1}.
\]
4.1.2 Case 2: L ≥ K
Recall from the general form that
\begin{align*}
\hat\beta^{2SLS} &= \left(\hat X'\hat X\right)^{-1}\hat X'\hat Y \\
&= \left[\left(Z\left(Z'Z\right)^{-1}Z'X\right)'\left(Z\left(Z'Z\right)^{-1}Z'X\right)\right]^{-1}\left(Z\left(Z'Z\right)^{-1}Z'X\right)'Z\left(Z'Z\right)^{-1}Z'Y \\
&= \left[X'Z\left(Z'Z\right)^{-1}Z'Z\left(Z'Z\right)^{-1}Z'X\right]^{-1}X'Z\left(Z'Z\right)^{-1}Z'Z\left(Z'Z\right)^{-1}Z'Y \\
&= \left[X'Z\left(Z'Z\right)^{-1}Z'X\right]^{-1}X'Z\left(Z'Z\right)^{-1}Z'Y
\end{align*}
Different choices for the matrix $P$ in the class of estimators $b_P := \left(\sum_i P Z_i X_i'\right)^{-1}\left(\sum_i P Z_i Y_i\right)$ result in different estimators. For example, the simple IV estimator for the exactly identified case simply sets $P = I$. It can be shown that another choice, namely $P = P^\ast := E\left(X_i Z_i'\right)E\left(Z_i Z_i'\right)^{-1}$, results in an estimator, $b_{P^\ast}$, with minimal asymptotic variance. Notice, however, that $b_{P^\ast}$ is an infeasible estimator because you do not observe $P^\ast$. Replace $P^\ast$ by $\hat P = P^\ast + o_p(1)$, resulting in the feasible estimator $b_{\hat P}$:
\begin{align*}
b_{\hat P} &= \left(\sum_{i=1}^N \hat P Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N \hat P Z_i Y_i\right) \\
&= \left(\sum_{i=1}^N \hat P Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N \hat P Z_i\left(X_i'\beta + e_i\right)\right) \\
&= \beta + \left(\sum_{i=1}^N \hat P Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N \hat P Z_i e_i\right) \\
&= \beta + \left(\frac{1}{N}\sum_{i=1}^N \hat P Z_i X_i'\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^N \hat P Z_i e_i\right)
\end{align*}
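A 2SLS sketch in numpy for the over-identified case (L = 3 > K = 2); first-stage fitted values are computed with lstsq rather than by forming the N×N projection matrix. All data-generating values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 100_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])    # L = 3 instruments
v = rng.normal(size=N)
e = 0.8 * v + rng.normal(size=N)
x = 0.5 * Z[:, 1] + 0.3 * Z[:, 2] + v
X = np.column_stack([np.ones(N), x])
Y = X @ np.array([1.0, 2.0]) + e

pi_hat = np.linalg.lstsq(Z, X, rcond=None)[0]     # first-stage coefficients (L x K)
X_hat = Z @ pi_hat                                 # fitted values P_Z X
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)   # (X'P_Z X)^{-1} X'P_Z Y
print(beta_2sls)                                   # close to [1, 2]
```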
Consistency and Asymptotic Distribution
\[
\sqrt{N}\left(b_{\hat P} - \beta\right) = \left(\frac{1}{N}\sum_{i=1}^N \hat P Z_i X_i'\right)^{-1}\left(\frac{\sqrt{N}}{N}\sum_{i=1}^N \hat P Z_i e_i\right),
\]
where, under homoskedasticity, $N^{-1/2}\sum_i Z_i e_i \xrightarrow{d} N\left(0, \sigma_e^2 C_{ZZ}^{-1}\right)$. To sum up,
\[
\sqrt{N}\left(b_{\hat P} - \beta\right) \xrightarrow{d} \left(C_{XZ}C_{ZZ}C_{ZX}\right)^{-1}C_{XZ}C_{ZZ}\; N\!\left(0, E\left[e_i^2 Z_i Z_i'\right]\right),
\]
where $C_{XZ} := E\left(X_i Z_i'\right)$, $C_{ZX} := E\left(Z_i X_i'\right)$, and $C_{ZZ} := E\left(Z_i Z_i'\right)^{-1}$. To simplify the notation,
\[
\sqrt{N}\left(b_{\hat P} - \beta\right) \xrightarrow{d} N\left(0, AVA'\right),
\]
where $A := \left(C_{XZ}C_{ZZ}C_{ZX}\right)^{-1}C_{XZ}C_{ZZ}$ and $V := E\left[e_i^2 Z_i Z_i'\right]$. The statement $\hat P = P^\ast + o_p(1)$ is equivalent to
\[
N^{-1}\sum_{i=1}^N X_i Z_i' = E\left[X_i Z_i'\right] + o_p(1), \qquad
\left(N^{-1}\sum_{i=1}^N Z_i Z_i'\right)^{-1} = E\left[Z_i Z_i'\right]^{-1} + o_p(1).
\]
4.2 Application: Partitioned Regression
The source of the endogeneity is correlation between the two error terms, so write
\[
e = v\rho + w.
\]
Combining, we obtain
\[
Y = X\beta + v\rho + w.
\]
Normal equations:
\[
\begin{pmatrix} X'X & X'v \\ v'X & v'v \end{pmatrix}
\begin{pmatrix} \hat\beta \\ \hat\rho \end{pmatrix}
=
\begin{pmatrix} X'Y \\ v'Y \end{pmatrix}
\]
We have
\begin{align*}
X'X\hat\beta + X'v\hat\rho &= X'Y \\
v'X\hat\beta + v'v\hat\rho &= v'Y,
\end{align*}
and by rearranging,
\begin{align*}
\hat\beta &= \left(X'X\right)^{-1}X'\left(Y - v\hat\rho\right) \\
\hat\rho &= \left(v'v\right)^{-1}v'\left(Y - X\hat\beta\right).
\end{align*}
4.2.1 Consistency
\begin{align*}
\hat\beta^{OLS} &= \left(X'M_v X\right)^{-1}X'M_v Y \\
&= \left(X'M_v X\right)^{-1}X'M_v\left(X\beta + v\rho + w\right) \\
&= \beta + 0 + \left(X'M_v X\right)^{-1}X'M_v w
\end{align*}
Notice that
\[
M_v v = \left(I - v\left(v'v\right)^{-1}v'\right)v = 0.
\]
Rewriting,
\[
\hat\beta^{OLS} - \beta = \left(X'M_v X\right)^{-1}X'M_v w
= \left(\frac{1}{N}X'M_v X\right)^{-1}\left(\frac{1}{N}X'M_v w\right).
\]
Consider each term:
\begin{align*}
\frac{1}{N}X'M_v X &= \frac{1}{N}X'\left(I - v\left(v'v\right)^{-1}v'\right)X
= \frac{1}{N}\left(X'X - X'v\left(v'v\right)^{-1}v'X\right) \\
&= \frac{X'X}{N} - \frac{X'v}{N}\left(\frac{v'v}{N}\right)^{-1}\frac{v'X}{N} \\
&= \left(\frac{1}{N}\sum_i X_i X_i'\right) - \left(\frac{1}{N}\sum_i X_i v_i\right)\left(\frac{1}{N}\sum_i v_i v_i\right)^{-1}\left(\frac{1}{N}\sum_i v_i X_i\right) \\
&= O_p(1) - O_p(1)\cdot O_p(1)\cdot O_p(1) = O_p(1)
\end{align*}
\begin{align*}
\frac{1}{N}X'M_v w &= \frac{1}{N}X'\left(I - v\left(v'v\right)^{-1}v'\right)w \\
&= \left(\frac{1}{N}X'w\right) - \left(\frac{1}{N}X'v\right)\left(\frac{1}{N}v'v\right)^{-1}\left(\frac{1}{N}v'w\right) \\
&= \left(\frac{1}{N}\sum_i X_i w_i\right) - \left(\frac{1}{N}\sum_i X_i v_i\right)\left(\frac{1}{N}\sum_i v_i v_i\right)^{-1}\left(\frac{1}{N}\sum_i v_i w_i\right) \\
&= o_p(1) - O_p(1)\cdot O_p(1)\cdot o_p(1) = o_p(1)
\end{align*}
Therefore,
\[
\hat\beta^{OLS} - \beta = O_p(1)\cdot o_p(1) = o_p(1), \qquad \hat\beta^{OLS} \xrightarrow{p} \beta.
\]
Recall that
\[
\hat P_v = \hat v\left(\hat v'\hat v\right)^{-1}\hat v'
\]
with the first stage $X = Z\pi + v$ and
\begin{align*}
\hat v &= X - Z\hat\pi \\
&= X - Z\left(Z'Z\right)^{-1}Z'X \\
&= X - P_Z X = \left(I - P_Z\right)X = M_Z X,
\end{align*}
so that
\begin{align*}
\hat P_v &= M_Z X\left(X'M_Z'M_Z X\right)^{-1}X'M_Z \\
&= M_Z X\left(X'M_Z X\right)^{-1}X'M_Z.
\end{align*}
Remark: the estimate derived in the former case is more precise, but since you cannot observe $v_i$ in reality, the estimate from the latter case is more practical.
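A feasible control-function sketch in numpy: regress Y on X and the first-stage residuals v̂ = M_Z X. The numbers (ρ = 0.8, first-stage slope 0.7, β = (1, 2)) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 50_000
Z = np.column_stack([np.ones(N), rng.normal(size=N)])
v = rng.normal(size=N)
w = rng.normal(size=N)
rho = 0.8
e = v * rho + w                                    # e = v*rho + w: source of endogeneity
x = 1.0 + 0.7 * Z[:, 1] + v
X = np.column_stack([np.ones(N), x])
Y = X @ np.array([1.0, 2.0]) + e

pi_hat = np.linalg.lstsq(Z, x, rcond=None)[0]      # first stage
v_hat = x - Z @ pi_hat                              # residuals M_Z x
W = np.column_stack([X, v_hat])                     # regress Y on (X, v_hat)
coef = np.linalg.solve(W.T @ W, W.T @ Y)
print(coef[:2])   # beta_hat, close to [1, 2]
print(coef[2])    # rho_hat, close to 0.8
```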
Chapter 5
Maximum Likelihood Estimation
Let $Y_1, \ldots, Y_N$ be an iid sample from the $N(\mu, 1)$ distribution, so that each density is $f_Y(y \mid \mu) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y-\mu)^2}$. The log likelihood function is
\begin{align*}
\mathcal{L}(\mu \mid y) &= \ln\left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y_1 - \mu)^2}\right) + \ldots + \ln\left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y_N - \mu)^2}\right) \\
&= \left(\ln\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(y_1 - \mu)^2\right) + \ldots + \left(\ln\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(y_N - \mu)^2\right) \\
&= \sum_{i=1}^N\left[\ln\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(y_i - \mu)^2\right] \\
&= N\ln\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}\sum_{i=1}^N (y_i - \mu)^2
\end{align*}
5.1.1 Expectation
\begin{align*}
E\left[\hat\mu^{ML}\right] &= E\left[\frac{1}{N}\left(y_1 + \ldots + y_N\right)\right] \\
&= \frac{1}{N}\left(E[y_1] + \ldots + E[y_N]\right) \\
&= \frac{1}{N}\cdot N\mu = \mu
\end{align*}
5.1.2 Variance
\begin{align*}
\operatorname{var}\left(\hat\mu^{ML}\right) &= \operatorname{var}\left(\frac{1}{N}\sum_{i=1}^N y_i\right) \\
&= \frac{1}{N^2}\left(\operatorname{var}(y_1) + \ldots + \operatorname{var}(y_N)\right) \\
&= \frac{1}{N^2}\cdot N = \frac{1}{N}
\end{align*}
\begin{align*}
S(y \mid \mu) &= \frac{\partial \ln f_Y(y \mid \mu)}{\partial\mu} = y - \mu \\
\frac{\partial S(y \mid \mu)}{\partial\mu} &= -1 \\
-E\left[\frac{\partial S(y \mid \mu)}{\partial\mu}\right] &= 1 = I(\mu)
\end{align*}
The information equality holds. Since $E\left[\hat\mu^{ML}\right] = \mu$ and $\operatorname{var}\left(\hat\mu^{ML}\right) = \frac{1}{N}$, the ML estimator is unbiased and attains the Cramér--Rao bound.
5.1.5 Decomposition
\begin{align*}
\frac{1}{N}\sum_{i=1}^N S(y_i \mid \mu) &= a(\mu)\cdot\left(T(Y_1, \ldots, Y_N) - \mu\right) \\
&= \frac{1}{N}\left[(y_1 - \mu) + \ldots + (y_N - \mu)\right] \\
&= \frac{1}{N}\sum_{i=1}^N y_i - \frac{1}{N}N\mu \\
&= \frac{1}{N}\sum_{i=1}^N y_i - \mu \\
&= 1\cdot\left(\frac{1}{N}\sum_{i=1}^N y_i - \mu\right)
\end{align*}
\[
\sqrt{N}\left(\hat\mu^{ML} - \mu\right) \xrightarrow{d} N(0, 1),
\qquad
\hat\mu^{ML} \overset{a}{\sim} N\!\left(\mu, \frac{1}{N}\right)
\]
\begin{align*}
Y_i &= X_i'\beta + e_i \\
e_i \mid X_i &\sim N\left(0, \sigma_e^2\right)
\end{align*}
The novelty here is that the errors are assumed to have a normal distribution. The unknown parameters are $\beta \in \mathbb{R}^K$ and $\sigma_e^2$. Notice that the above normal regression model can be regarded, equivalently, as a statement about the density of $Y_i$ given $X_i$. That conditional density is
\[
f_Y\left(y \mid x, \beta, \sigma_e^2\right) = \frac{1}{\sqrt{2\pi\sigma_e^2}}\exp\left(-\frac{1}{2\sigma_e^2}\left(y - x'\beta\right)^2\right).
\]
You have available a random sample $(X_i, Y_i)$, where the $Y_i$ are iid with pdf $f_Y(y \mid x)$. The log likelihood is
\begin{align*}
\mathcal{L}\left(\beta, \sigma_e^2 \mid x, y\right) &= -N\log(\sigma_e) - \frac{N}{2}\log(2\pi) - \frac{1}{2\sigma_e^2}\sum_{i=1}^N\left(y_i - x_i'\beta\right)^2 \\
&= -\frac{N}{2}\log\left(\sigma_e^2\right) - \frac{N}{2}\log(2\pi) - \frac{1}{2\sigma_e^2}\sum_{i=1}^N\left(y_i - x_i'\beta\right)^2
\end{align*}
5.2.1 Estimators
\begin{align*}
\frac{\partial \mathcal{L}}{\partial\sigma_e^2} &= -\frac{N}{2\sigma_e^2} + \frac{1}{2\sigma_e^4}\sum_{i=1}^N\left(y_i - x_i'\beta\right)^2 = 0 \\
\hat\sigma_e^{2\,ML} &= \frac{1}{N}\sum_{i=1}^N\left(y_i - x_i'\hat\beta^{ML}\right)^2 = \frac{1}{N}\sum_{i=1}^N \hat e_i^2
\end{align*}
\begin{align*}
\frac{\partial \mathcal{L}}{\partial\beta} &= -\frac{2}{2\sigma_e^2}\sum_{i=1}^N\left(y_i - x_i'\beta\right)(-x_i) = 0 \\
&\Rightarrow \sum_{i=1}^N x_i y_i - \sum_{i=1}^N x_i x_i'\hat\beta^{ML} = 0 \\
\hat\beta^{ML} &= \left(\sum_{i=1}^N x_i x_i'\right)^{-1}\left(\sum_{i=1}^N x_i y_i\right)
\end{align*}
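A quick check, assuming numpy and scipy, that maximizing this likelihood numerically reproduces the OLS coefficients, while the ML variance divides by N rather than N − K. The data-generating values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
N = 2_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=N)

def neg_loglik(theta):
    beta, log_s2 = theta[:2], theta[2]     # parameterize sigma_e^2 = exp(log_s2) > 0
    s2 = np.exp(log_s2)
    resid = Y - X @ beta
    return 0.5 * N * np.log(s2) + 0.5 * N * np.log(2 * np.pi) + resid @ resid / (2 * s2)

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
beta_ml, s2_ml = res.x[:2], np.exp(res.x[2])

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_ols
print(beta_ml, beta_ols)            # identical up to optimizer tolerance
print(s2_ml, resid @ resid / N)     # ML variance estimate divides by N
```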
\begin{align*}
\frac{\partial \ln f_Y}{\partial\beta}\left(y \mid x, \beta, \sigma_e^2\right) &= \frac{1}{\sigma_e^2}x_i\left(y_i - x_i'\beta\right) \\
\frac{\partial \ln f_Y}{\partial\sigma_e^2}\left(y \mid x, \beta, \sigma_e^2\right) &= -\frac{1}{2\sigma_e^2} + \frac{1}{2\sigma_e^4}\left(y_i - x_i'\beta\right)^2 \\
S\left(y \mid x, \beta, \sigma_e^2\right) &:= \begin{pmatrix} \frac{1}{\sigma_e^2}x_i\left(y_i - x_i'\beta\right) \\[4pt] -\frac{1}{2\sigma_e^2} + \frac{1}{2\sigma_e^4}\left(y_i - x_i'\beta\right)^2 \end{pmatrix}
\end{align*}
\begin{align*}
\frac{\partial^2 \ln f_Y}{\partial\beta\partial\beta'}\left(y \mid x, \beta, \sigma_e^2\right) &= -\frac{1}{\sigma_e^2}x_i x_i' \\
\frac{\partial^2 \ln f_Y}{\partial\left(\sigma_e^2\right)^2}\left(y \mid x, \beta, \sigma_e^2\right) &= \frac{1}{2\sigma_e^4} - \frac{1}{\sigma_e^6}\left(y_i - x_i'\beta\right)^2 \\
\frac{\partial^2 \ln f_Y}{\partial\sigma_e^2\partial\beta'}\left(y \mid x, \beta, \sigma_e^2\right) &= -\frac{1}{\sigma_e^4}\left(y_i - x_i'\beta\right)x_i' \\
\frac{\partial^2 \ln f_Y}{\partial\beta\partial\sigma_e^2}\left(y \mid x, \beta, \sigma_e^2\right) &= -\frac{1}{\sigma_e^4}x_i\left(y_i - x_i'\beta\right)
\end{align*}
\[
H(x, y) = \begin{pmatrix} -\frac{1}{\sigma_e^2}x_i x_i' & -\frac{1}{\sigma_e^4}x_i e_i \\[4pt] -\frac{1}{\sigma_e^4}e_i x_i' & \frac{1}{2\sigma_e^4} - \frac{1}{\sigma_e^6}e_i^2 \end{pmatrix}
\]
Notice that $\frac{\partial^2 \ln f_Y}{\partial\beta\partial\sigma_e^2} = \left(\frac{\partial^2 \ln f_Y}{\partial\sigma_e^2\partial\beta'}\right)'$.
Appendix A
Probability Theory
A.1 Moments
Definition A.1.1. (Expectation) Let X be a continuous random variable with density f(x). Then, the expected value of X, denoted by E[X], is defined to be
\[
E[X] = \int_{-\infty}^{\infty} x f(x)\,dx
\]
if the integral is absolutely convergent. The expected value does not exist when both of the following hold:
\[
\int_{-\infty}^{0} x f(x)\,dx = -\infty, \qquad \int_{0}^{\infty} x f(x)\,dx = \infty.
\]
Definition A.1.2. (Variance) The variance of X measures the expected square of the deviation of X from its expected value,
\[
\operatorname{Var}(X) = E\left[(X - E[X])^2\right].
\]
Definition A.1.3. (Covariance) The covariance of any two random variables X and Y, denoted by Cov(X, Y), is defined by
\[
\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].
\]
Properties of Covariance
For any random variables X, Y, Z and constant c ∈ R,
1. Cov(X, X) = Var(X)
2. Cov(X, Y ) = Cov(Y, X)
3. Cov(cX, Y ) = c Cov(X, Y )
Definition A.2.2. Let us denote by E[X|Y ] that function of the random variable Y whose
value at Y = y is E[X|Y = y]. Note that E[X|Y ] is itself a random variable. An extremely
important property of conditional expectation is that for all random variables X and Y
E[X] = E[E[X|Y ]]
Proof.
\begin{align*}
\sum_y E[X \mid Y = y]P\{Y = y\} &= \sum_y \sum_x x\,P\{X = x \mid Y = y\}P\{Y = y\} \\
&= \sum_y \sum_x x\,\frac{P\{X = x, Y = y\}}{P\{Y = y\}}P\{Y = y\} \\
&= \sum_y \sum_x x\,P\{X = x, Y = y\} \\
&= \sum_x x \sum_y P\{X = x, Y = y\} \\
&= \sum_x x\,P\{X = x\} \\
&= E[X]
\end{align*}
One way to understand the proof is to interpret it as follows. It states that to calculate
E[X] we may take a weighted average of the conditional expected value of X given that
Y = y, each of the terms E[X|Y = y] being weighted by the probability of the event on
which it is conditioned.
(b) Suppose that X and Xn, n ∈ N, are all defined on the same probability space. We say that the sequence Xn converges to X in probability, and write Xn →p X, if Xn − X converges to zero in probability, i.e., if for every ε > 0,
\[
\lim_{n\to\infty} P\left(|X_n - X| > \varepsilon\right) = 0.
\]
When X in part (b) of the definition is deterministic, say equal to some constant c, then the two parts of the above definition are consistent with each other.
The intuitive content of the statement Xn →p c is that in the limit as n increases, almost all of the probability mass becomes concentrated in a small interval around c, no matter how small this interval is. On the other hand, for any fixed n, there can be a small probability mass outside this interval, with a slowly decaying tail. Such a tail can have a strong impact on expected values. For this reason, convergence in probability does not have any implications for expected values. For example, one can construct sequences with Xn →p X for which E[Xn] does not converge to E[X].
Convergence in distribution
Definition A.3.2. Let X and Xn, n ∈ N, be random variables with CDFs F and Fn, respectively. We say that the sequence Xn converges to X in distribution, and write Xn →d X, if Fn(x) → F(x) at every point x at which F is continuous.
For convenience, we say op for convergence and Op for boundedness in probability. So, let xn be a sequence of non-negative real-valued random variables.
1. xn = op(1) means xn →p 0 as n grows.
Furthermore,
1. xn = Op(1) means {xn} is bounded in probability; i.e. for any ε > 0, there exists bε > 0 such that supn P(xn > bε) ≤ ε.
2. xn = Op(bn) means xn/bn = Op(1).
3. xn = Op(yn) means xn/yn = Op(1).
Slutsky's Theorem
Let {Xn}, {Yn} be sequences of scalar/vector/matrix random elements. If Xn →d X and Yn →p c, then,
1. Xn + Yn →d X + c
2. Xn Yn →d cX
3. Xn / Yn →d X/c, provided c ≠ 0 (c invertible).
Appendix B
Linear Algebra
The dot product between (column) vectors $v = (v_1, v_2, \ldots, v_n)'$ and $w = (w_1, w_2, \ldots, w_n)'$, both lying in the Euclidean space $\mathbb{R}^n$, is
\[
v \cdot w = v_1 w_1 + v_2 w_2 + \cdots + v_n w_n = \sum_{i=1}^n v_i w_i = v^T w,
\]
i.e. the dot product equals the matrix product of the row vector $v^T$ and the column vector $w$.
The dot product is the cornerstone of Euclidean geometry. The key fact is that the dot product of a vector with itself,
\[
v \cdot v = v_1^2 + v_2^2 + \cdots + v_n^2,
\]
is the sum of the squares of its entries, and hence, by the classical Pythagorean Theorem, equals the square of its length. Consequently, the Euclidean norm or length of a vector is found by taking the square root:
\[
\|v\| = \sqrt{v \cdot v} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}.
\]
Note that every nonzero vector $v \ne 0$ has positive Euclidean norm, $\|v\| > 0$, while only the zero vector has zero norm: $\|v\| = 0$ if and only if $v = 0$. The elementary properties of dot product and Euclidean norm serve to inspire the abstract definition of more general inner products.
Definition B.1.1. An inner product on the real vector space V is a pairing that takes two vectors v, w ∈ V and produces a real number ⟨v, w⟩ ∈ R. The inner product is required to satisfy the following three axioms for all u, v, w ∈ V, and scalars c, d ∈ R.
(i) Bilinearity
⟨cu + dv, w⟩ = c⟨u, w⟩ + d⟨v, w⟩ and ⟨u, cv + dw⟩ = c⟨u, v⟩ + d⟨u, w⟩
(ii) Symmetry
⟨v, w⟩ = ⟨w, v⟩
(iii) Positivity
⟨v, v⟩ > 0 whenever v ≠ 0, while ⟨0, 0⟩ = 0
Given an inner product, the associated norm of a vector v ∈ V is defined as the positive square root of the inner product of the vector with itself,
\[
\|v\| = \sqrt{\langle v, v\rangle}.
\]
B.2 Orthogonality
Orthogonal and Orthonormal Bases
Definition B.2.1. A basis u1, ..., un of an n-dimensional inner product space V is called orthogonal if ⟨ui, uj⟩ = 0 for all i ≠ j. The basis is called orthonormal if, in addition, each vector has unit length: ‖ui‖ = 1, for all i = 1, ..., n.
Proposition B.2.2. Let v1 , . . . , vk ∈ V be nonzero, mutually orthogonal elements, so
vi ̸= 0 and ⟨vi , vj ⟩ = 0 for all i ̸= j. Then, v1 , . . . , vk are linearly independent.
Lemma B.2.3. If v1 , . . . , vn is an orthogonal basis of a vector space V , then the normalized
vectors ui = vi / ∥vi ∥ , i = 1, . . . , n, form an orthonormal basis
Theorem B.2.4. Let u1 , . . . , un be an orthonormal basis for an inner product space V .
Then one can write any element v ∈ V as a linear combination
v = c1 u1 + · · · + cn un
where its coordinates
ci = ⟨v, ui ⟩ , i = 1, . . . , n
are explicitly given as inner products. Moreover, its norm is given by the Pythagorean
formula
\[
\|v\| = \sqrt{c_1^2 + \cdots + c_n^2} = \sqrt{\sum_{i=1}^n \langle v, u_i\rangle^2},
\]
namely, the square root of the sum of the squares of its orthonormal basis coordinates.
QT Q = QQT = I
The orthogonality condition implies that one can easily invert an orthogonal matrix
Q−1 = QT
Definition 4.31. The orthogonal projection of v onto the subspace W is the element
w ∈ W that makes the difference z = v − w orthogonal to W .
⟨z, ui ⟩ = ⟨v − w, ui ⟩
= ⟨v − c1 u1 − · · · − cn un , ui ⟩
= ⟨v, ui ⟩ − c1 ⟨u1 , ui ⟩ − · · · − cn ⟨un , ui ⟩
= ⟨v, ui ⟩ − ci
=0
The coefficients ci = ⟨v, ui ⟩ of the orthogonal projection W are thus uniquely prescribed
by the orthogonality requirement, which thereby proves its uniqueness.
A sum of scalar multiples
\[
c_1 v_1 + c_2 v_2 + \cdots + c_k v_k = \sum_{i=1}^k c_i v_i,
\]
where the coefficients $c_1, c_2, \ldots, c_k$ are any scalars, is known as a linear combination of the elements $v_1, \ldots, v_k$. For instance, $3v_1 + v_2 - 2v_3$, $8v_1 - 13v_3 = 8v_1 + 0v_2 - 13v_3$, $v_2 = 0v_1 + 1v_2 + 0v_3$, and $0 = 0v_1 + 0v_2 + 0v_3$ are four different linear combinations of the three vector space elements $v_1, v_2, v_3 \in V$. Their span is the subset $W = \operatorname{span}\{v_1, \ldots, v_k\} \subset V$ consisting of all possible linear combinations with scalars $c_1, \ldots, c_k \in \mathbb{R}$. The key observation is that the span always forms a subspace.
Proposition B.3.2. The span $W = \operatorname{span}\{v_1, \ldots, v_k\}$ of any finite collection of vector space elements $v_1, \ldots, v_k \in V$ is a subspace of the underlying vector space V.
Proof. We need to show that if
\[
v = c_1 v_1 + \cdots + c_k v_k \quad \text{and} \quad \hat v = \hat c_1 v_1 + \cdots + \hat c_k v_k
\]
are any two linear combinations, then their sum is also a linear combination, since
\[
v + \hat v = (c_1 + \hat c_1) v_1 + \cdots + (c_k + \hat c_k) v_k = \tilde c_1 v_1 + \cdots + \tilde c_k v_k,
\]
where $\tilde c_i = c_i + \hat c_i$. Similarly, for any scalar multiple,
\[
a v = (a c_1) v_1 + \cdots + (a c_k) v_k = c_1^\ast v_1 + \cdots + c_k^\ast v_k,
\]
where $c_i^\ast = a c_i$, which completes the proof. Q.E.D.
The elements $v_1, \ldots, v_k$ are called linearly dependent if there exist scalars $c_1, \ldots, c_k$, not all zero, such that
\[
c_1 v_1 + \cdots + c_k v_k = 0.
\]
Elements that are not linearly dependent are called linearly independent. A basis of V is a collection of elements that
1. spans V
2. is linearly independent
Theorem B.3.6. Suppose the vector space V has a basis v1 , . . . , vn for some n ∈ N . Then
every other basis of V has the same number, n, of elements in it. This number is called
the dimension of V , and written dim V = n.
B.3.4 Kernel
Definition B.3.7. The image of an m × n matrix A is the subspace imgA ⊂ Rm spanned
by its columns. The kernel of A is the subspace ker A ⊂ Rn consisting of all vectors that
are annihilated by A,
ker A = {z ∈ Rn |Az = 0} ⊂ Rn
Variables Dimension
Y N ×1
X N ×K
β K ×1
e N ×1
r N ×K
Z N ×L
v N ×K
π L×K
λ L×1
w N ×1
Yi 1×1
Xi K ×1
ei 1×1
ri K ×1
Zi L×1
vi K ×1
wi 1×1
APPENDIX
10. Space L2 is the collection of all rvs X defined on (Ω, F , P ) such that E|X|2 < ∞ (finite variance).
15. Projection (orthogonal): $\mathcal{P}(Y) = \hat Y = \operatorname{argmin}_{Z \in sp(X_1)} E[Y - Z]^2$, i.e. the element $bX$ with $b$ attaining $\inf_{b\in\mathbb{R}} E(Y - bX)^2$; it is the orthogonal projection of Y onto S.
16. Projection (orthonormal): $\mathcal{P}(Y) = \hat Y = \sum_{k=1}^K E\left(\tilde X_k \cdot Y\right)\tilde X_k = X'\beta^\ast$, where $\tilde X_k$ is an orthonormal basis of $sp(X)$, X is a K × 1 vector, and $\beta^\ast := E\left(XX'\right)^{-1}E(XY)$.
Lecture 2: Ordinary Least Squares Estimation
1. Feature: Let Z ∈ L2 and P ∈ P where P is a class of distributions on Z. A feature of P is an
object of the form γ(P ) for some γ : P → S where S is often times R.
3. An estimator γ̂N is a statistic used to infer some feature γ(P ) of an unknown distribution P .
4. The empirical distribution PN of the sample {Z1 , . . . , ZN } is the discrete distribution that puts
equal probability 1/N on each sample point Zi , i = 1, . . . , N.
10. Weak law of large numbers: Let $Z_1, Z_2, \ldots$ be a sequence of iid rvs with $E[Z_i] = \mu_Z$. Define $\bar Z_N := \sum_{i=1}^N Z_i / N$. Then $\bar Z_N - \mu_Z \xrightarrow{p} 0$, or $\bar Z_N \xrightarrow{p} \mu_Z$, or write $\bar Z_N = \mu_Z + o_p(1)$.
11. Consistency of an estimator: $\hat\gamma_N \xrightarrow{p} \gamma$.
12. OLS: $\hat\beta^{OLS} := \operatorname{argmin}_{b\in\mathbb{R}^K} \sum_{i=1}^N (Y_i - X_i'b)^2 = \left(\frac{1}{N}\sum_{i=1}^N X_i X_i'\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^N X_i Y_i\right) = (X'X)^{-1}X'Y$.
13. Method of Moments: $\beta^\ast = E(XX')^{-1}E(XY) \xrightarrow{a.p.} \left(\frac{1}{N}\sum_{i=1}^N X_i X_i'\right)^{-1}\frac{1}{N}\sum_{i=1}^N X_i Y_i = \hat\beta^{MM}$ (by the analogy principle).
14. Representation: $\sum_i X_i X_i' = X'X$ and $\sum_i X_i Y_i = X'Y$.
4. Central limit theorem (CLT): Let $Z_1, Z_2, \ldots$ be a sequence of iid random vectors with $\mu_Z := E[Z_i]$ and $E\|Z_i\|^2 < \infty$. Define $\bar Z_N := \sum_{i=1}^N Z_i / N$. Then $\sqrt{N}\left(\bar Z_N - \mu_Z\right) \xrightarrow{d} N\left(0, E\left[(Z_i - \mu_Z)(Z_i - \mu_Z)'\right]\right)$.
5. Asymptotic distribution: $\sqrt{N}\left(\hat\beta^{OLS} - \beta^\ast\right) = \left(N^{-1}\sum_{i=1}^N X_i X_i'\right)^{-1} N^{-1/2}\sum_{i=1}^N X_i u_i \xrightarrow{d} N(0, \Omega)$.
6. Projection matrix: $P_X := X(X'X)^{-1}X'$.
7. Residual maker: $M_X := I_N - P_X$.
8. Trace: $\operatorname{tr} A := \sum_{k=1}^K a_{kk}$, with $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ and $\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$.
9. Since $\hat\sigma_u^2 := \sum_{i=1}^N \hat u_i^2 / N$ but $E\left(\hat\sigma_u^2 \mid X\right) < \sigma_u^2$, use $s_u^2 := \frac{N}{N-K}\hat\sigma_u^2$, so $\lim_{N\to\infty}\hat\sigma_u^2 = \lim_{N\to\infty}s_u^2$.
10. Heteroskedasticity robust variance estimator: $\hat\Omega = \left(\frac{1}{N}\sum_{i=1}^N X_i X_i'\right)^{-1}\left(\frac{1}{N-K}\sum_{i=1}^N \hat u_i^2 X_i X_i'\right)\left(\frac{1}{N}\sum_{i=1}^N X_i X_i'\right)^{-1}$.
√ ( )
11. Confidence interval construction: let N r′ β̂ OLS − r′ β ∗ ∼ N (0, r′ Ωr) where r is non-stochastic
K × 1 vector and set r = ek then,
( √ ′ √ ′ )
e Ωe e Ωe
P e′k β̂ OLS − 1.96 √k < e′k β ∗ < e′k β̂ OLS + 1.96 √k
k k
= 0.95
N N
√ √
( ) e′k Ω̂ek
( ) ( )
13. Standard error: se β̂kOLS = √
N
= e′k · se β̂ OLS where se β̂ OLS = diag Ω̂/N .
( ) β̂kOLS −βknull ( ) d
14. t-statistic: tOLS βknull := if β̂kOLS = βknull + op (1), then tOLS βknull → N(0, 1).
se(β̂kOLS )
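A numpy sketch of items 10–14: the robust variance, standard errors, 95% confidence intervals, and t-statistics. The heteroskedastic design is illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
N, K = 5_000, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])
u = (1 + 0.5 * np.abs(X[:, 1])) * rng.normal(size=N)     # heteroskedastic errors
Y = X @ beta_true + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat

Sxx = X.T @ X / N
meat = (X * (u_hat**2)[:, None]).T @ X / (N - K)
Omega_hat = np.linalg.inv(Sxx) @ meat @ np.linalg.inv(Sxx)   # robust variance of sqrt(N)(b - beta)

se = np.sqrt(np.diag(Omega_hat) / N)
ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
t_stat = (beta_hat - 0.0) / se          # t-statistic for H0: beta_k = 0
print(ci)
print(t_stat)
```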
8. Linear regression model: Yi = Xi′ β + ei with E (ei |Xi ) = 0, EYi2 < ∞, and E ∥Xi ∥2 < ∞
10. Homoskedasticity: $E(ee' \mid X) = \sigma_e^2 I_N$.
11. Variance decomposition: for $P, Q \in L^2$, $\operatorname{Var} P = E\operatorname{Var}(P \mid Q) + \operatorname{Var} E(P \mid Q)$.
16. Gauss Markov theorem: In the linear regression model with homoskedastic errors, $\hat\beta^{OLS}$ is the linear unbiased estimator with minimum variance (BLUE): $\operatorname{Var}(\tilde\beta \mid X) \ge \operatorname{Var}\left(\hat\beta^{OLS} \mid X\right)$.
17. Generalised least squares estimator $\hat\beta^{GLS}$ is the minimum variance unbiased estimator under heteroskedasticity by the Gauss Markov theorem.
18. GLS is an infeasible estimator because $E(ee' \mid X)$ cannot be observed. To make it feasible, run OLS, obtain $\hat e$ and $\hat\Sigma$, then compute $\hat\beta^{GLS}_{feas} := \left(X'\hat\Sigma^{-1}X\right)^{-1}X'\hat\Sigma^{-1}Y$; the feasible estimator doesn't satisfy the Gauss Markov theorem.
8. First stage regression: Xi = π ′ Zi + vi ⇒ π = E (Zi Zi′ )−1 E (Zi Xi′ ) where E (Zi vi ) = 0.
12. Identification: If a parameter can be written as an explicit function of population moments, then it is identifiable. For the exogenous variables $\dim Z_{i1} = K_1$ and the endogenous variables $\dim Z_{i2} = L_2$,
\[
Z_i := \begin{pmatrix} Z_{i1} \\ Z_{i2} \end{pmatrix} = \begin{pmatrix} X_{i1} \\ Z_{i2} \end{pmatrix}.
\]
Three cases: L = K (exactly identified), L > K (over identified), L < K (under identified). λ and π are explicit functions of population moments and thus identified.
13. Existence and uniqueness: if $\operatorname{Rank}(\pi) = \operatorname{Rank}(\pi\;\;\lambda)$ then the solution exists, and it is unique if $\dim\ker(\pi) = 0 \Rightarrow \operatorname{rank}(\pi) = K$. We need $\operatorname{rank} E\left(Z_i X_{i2}'\right) = K_2$.
14. Case 1: $L = K$, $\beta = \pi^{-1}\lambda = E\left(Z_i X_i'\right)^{-1}E\left(Z_i Y_i\right)$. Thus, $\hat\beta^{IV} = \left(\sum_{i=1}^N Z_i X_i'\right)^{-1}\left(\sum_{i=1}^N Z_i Y_i\right)$.
3. Individual treatment effect (ITE): Yi (1) − Yi (0), cannot be observed. To fix this, find any
identical j ̸= i such that Yi (1) − Yj (0) and not Yi (p) = Yj (p) for p ∈ {0, 1}.
4. The regression model: Yi := Yi (1) · Xi + Yi (0) · (1 − Xi ) = β0 + β1i Xi + ũi , don’t use OLS
unless β1i is constant. To take out i-subscript, manipulate it by ±E (Yi (1) − Yi (0)) · Xi where
E (Xi ui ) = 0.
Lecture 8: Instrumental Variables II
1. Case 2: $L > K$, $\pi'\pi\beta = \pi'\lambda \Rightarrow \beta = (\pi'\pi)^{-1}\pi'\lambda$, which motivates 2SLS.
2. Two stage least squares estimator: $\hat\beta^{2SLS} = \left(X'Z(Z'Z)^{-1}Z'X\right)^{-1}X'Z(Z'Z)^{-1}Z'Y$.
3. Representation: $(X'P_Z X)^{-1}X'P_Z Y = \left(\hat X'X\right)^{-1}\hat X'Y = \left(\hat X'\hat X\right)^{-1}\hat X'Y$.
8. Bias of 2SLS: Hahn and Kuersteiner (2002) uses the concentration parameter µ; when it is large enough, the bias approaches zero.
9. Invalid instrument: think of $\pi = E(Z_iZ_i')^{-1}E(Z_iX_i')$ when $E(Z_iX_i') = 0$. $\hat\beta^{IV}$ is not consistent and converges to a Cauchy distribution; the t-statistic doesn't converge to a normal distribution.
10. Generic t-statistic: $t\left(\beta^{null}\right) := \frac{\hat\beta - \beta^{null}}{se(\hat\beta)}$.
11. Degree of endogeneity (ρ): affects the asymptotic distribution of t. As $\rho \to 1$ (worst case), $\xi_1 \xrightarrow{p} \xi_2$, $\hat\sigma_e^2 \xrightarrow{p} 0$, $se\left(\hat\beta^{IV}\right) \to 0$, $S(\rho) \to \infty$, and $t \to \infty$, rejecting $H_0: \beta = 0$.
2. Asymptotic distribution of t: depends on ρ and τ (the strength of the instrument), both of which cannot be estimated. If $\rho = 1$, $\xi_1 = \xi_2$, and the t-statistic becomes $S(1, \tau) = \xi_1 + \frac{\xi_2}{\tau}$.
7. Dealing with weak instruments: Staiger and Stock (1997) uses first stage F statistic of X on Z.
The instrument is strong if F > 10 (safe to use β̂ IV and β̂ 2SLS ) and weak if F < 10.
9. Stock and Yogo: provides a table of F -statistic based on actual size. The more you can tolerate
with high α, the more likely you will reject the null and conclude that the instrument is strong.
Lecture 10: Maximum Likelihood Estimation
1. Requirements: fY (y|θ) is known given an iid random sample Y1 , . . . , YN .
2. Likelihood function: $L(\theta \mid y) := f_{Y_1,\ldots,Y_N}(y_1, \ldots, y_N \mid \theta) = \prod_{i=1}^N f_Y(y_i \mid \theta)$.
3. Log likelihood function: $\mathcal{L}(\theta \mid y) := \ln L(\theta \mid y) = \sum_{i=1}^N \ln f_Y(y_i \mid \theta)$.
7. Score function: $S(y \mid \theta) := \frac{\partial \ln f_Y}{\partial\theta}(y \mid \theta)$, with a random variable Y.
8. Fisher information: $I(\theta) := E\left(S(Y \mid \theta)^2\right) = \operatorname{Var} S(Y \mid \theta)$.
9. Information Equality: $E\left(\left(\frac{\partial \ln f_Y}{\partial\theta}(Y \mid \theta)\right)^2\right) = -E\left(\frac{\partial^2 \ln f_Y}{\partial\theta^2}(Y \mid \theta)\right)$.
3. Linear probability model (lpm): $E(Y_i \mid X_i) = \Pr(Y_i = 1 \mid X_i) = X_i'\beta$, where β is the effect of $X_i$ on the probability of success $\Pr(Y_i = 1 \mid X_i)$. Use OLS to estimate β.
8. In the binary outcome model, $f_Y(y \mid x, \beta) = \Pr(Y_i = y \mid X_i = x) = G(x'\beta)^y\left(1 - G(x'\beta)\right)^{1-y}$.
9. Let G be the standard normal or the logistic CDF. Then $\mathcal{L}(\beta \mid x, y)$ is globally concave.
10. Score: $S(y \mid x, \beta) = \frac{y - G(x'\beta)}{G(x'\beta)(1 - G(x'\beta))}\cdot g(x'\beta)\cdot x$; let the computer find β such that $\sum_{i=1}^N S(y_i \mid x_i, \beta) = 0$.
11. Asymptotic distribution of MLE: $\sqrt{N}\left(\hat\beta^{ML} - \beta\right) \xrightarrow{d} N\left(0, I(\beta)^{-1}\right)$.
12. Information matrix: $I(\beta) = E\left(S(Y_i \mid X_i, \beta)S(Y_i \mid X_i, \beta)'\right) = E\left(\frac{g(X_i'\beta)^2}{G(X_i'\beta)(1 - G(X_i'\beta))}\cdot X_i X_i'\right)$.
13. Causal effect: $\frac{\partial E(Y_i \mid X_i)}{\partial X_i} = \frac{\partial \Pr(Y_i = 1 \mid X_i)}{\partial X_i} = g(X_i'\beta)\beta \ne \beta$ by the chain rule. Take expectations, $\phi := E\left(\frac{\partial \Pr(Y_i = 1 \mid X_i)}{\partial X_i}\right) = E\left(g(X_i'\beta)\beta\right)$, and use the analogy principle, $\hat\phi = \frac{1}{N}\sum_{i=1}^N g\left(X_i'\hat\beta^{ML}\right)\hat\beta^{ML}$.
14. Delta method: Let $\sqrt{N}(\hat\theta - \theta) \xrightarrow{d} N(0, \Omega)$ with $\dim\theta = K$. Take a continuously differentiable function $C : \Theta \to \mathbb{R}^Q$ where $Q \le K$. Then $\sqrt{N}\left(C(\hat\theta) - C(\theta)\right) \xrightarrow{d} N\left(0, c(\theta)\,\Omega\,c(\theta)'\right)$, where $c(\theta) := \frac{\partial C}{\partial\theta'}(\theta)$.
15. Sample selection model: $Y_i^\ast = X_i'\beta + e_i$ and $D_i = 1\left(Z_i'\gamma + v_i > 0\right)$, where $Y_i = Y_i^\ast$ if $D_i = 1$ and unobserved if $D_i = 0$, with $(D_i, X_i, Y_i, Z_i)$ given.
16. Inverse Mills ratio: $E(v_i \mid v_i > -c) = \frac{\phi(c)}{\Phi(c)} =: \lambda(c)$.
18. Regression model (second stage): $Y_i = X_i'\beta + \rho\lambda(Z_i'\gamma) + r_i$; first derive $E(e_i \mid D_i = 1, X_i = x, Z_i = z) = \rho\lambda(z'\gamma)$, then $E(Y_i^\ast \mid D_i = 1, X_i, Z_i) = X_i'\beta + \rho\lambda(Z_i'\gamma)$. Use OLS.
20. Important note: β is only identified by imposing some functional form on the joint error distribution.
4. Regularity conditions: $Q_0(\theta)$ is continuous (by inspection) and $Q_N(W_i, \theta) \xrightarrow{u.p.} Q_0(\theta)$ (uniform convergence in probability).