Problem Set 5 With Solutions
Question 1
Look at the following code and output from an analysis performed in Stata
using the “Earnings and Height” dataset you have become familiar with in
the past few weeks. You can refer to the dataset documentation for a de-
scription of the variables.
First, we estimate the following model, explaining earnings as a linear function of height:
Next, we estimate the following model, where we add a set of dummy variables¹ for the US region of residence of the individual:
¹ To be explicit: Northeast is a variable taking value 1 if the individual resides in the northeastern part of the US, and 0 otherwise; Midwest is a variable taking value 1 if the individual resides in the Midwest of the US, and 0 otherwise; etc. Notice that Northeast, Midwest, West, and South represent a set of mutually exclusive and exhaustive dummy variables.
1. What is the fraction of the variance of Earnings captured by the
first and the second model, respectively? What is the sum of squared
residuals for the second model?
This question can be answered by looking at the R^2 of the two models in the output table: it is approximately 1.1% for model (1) and 2.2% for model (2).
To obtain the Sum of Squared Residuals (SSR) for the second model,
we recall the formula for the Standard Error of the Regression (SER):
SER = sqrt( SSR / (n - (k + 1)) )
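Since the problem asks for the SSR while the output table reports the SER, one can invert the formula above. A minimal sketch in Python (the numerical values below are placeholders for illustration, not the actual Stata output):

```python
import math

# Placeholder values -- read the actual SER, n, and k from your Stata output.
ser = 26400.0   # standard error of the regression (hypothetical)
n = 17870       # sample size (hypothetical)
k = 5           # number of slope coefficients (hypothetical)

# SER = sqrt(SSR / (n - (k + 1)))  =>  SSR = SER^2 * (n - (k + 1))
ssr = ser**2 * (n - (k + 1))
print(f"SSR = {ssr:.4g}")

# sanity check: recomputing the SER from the SSR returns the starting value
assert math.isclose(math.sqrt(ssr / (n - (k + 1))), ser)
```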
you agree with such a statement? Why?
The statement is wrong. The change in R^2 tells us how well the model fits the data, but it says nothing about whether we have an omitted variable problem. In other words, a high R^2 does not tell us anything about whether E(u|X) = 0 holds or not.
Since t < 2.33, the critical value for a one-sided test at the 1% significance level, we cannot reject the null hypothesis.
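As a quick check (not part of the original output), the 2.33 critical value can be reproduced from the standard normal quantile function:

```python
from statistics import NormalDist

# Critical value for a one-sided test at the 1% level:
# reject H0 only if the t-statistic exceeds the 99th percentile of N(0,1).
crit = NormalDist().inv_cdf(0.99)
print(round(crit, 2))  # 2.33
```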
4. Does the 90% confidence interval for the coefficient on South include
-3500?
The formula for the 90% confidence interval on the coefficient on
South is:
[β̂4 − 1.64 × SE(β̂4); β̂4 + 1.64 × SE(β̂4)] = [−5027 − 1.64 × 562.1; −5027 + 1.64 × 562.1]
= [−5948.8; −4105.2]
Hence the 90% confidence interval for β4 does not include -3500.
You could have spared yourself the calculation by noticing that the 95% confidence interval for β4, which Stata reports automatically, already excludes −3500. Since the 90% confidence interval is by construction narrower than the 95% one, it was immediately clear that −3500 could not be included in the 90% confidence interval either.
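The interval arithmetic can be verified with a few lines of Python, using the coefficient estimate and standard error from the Stata output quoted above:

```python
beta4_hat = -5027.0   # estimated coefficient on South
se_beta4 = 562.1      # its standard error

# 90% confidence interval, using the rounded critical value 1.64
lo = beta4_hat - 1.64 * se_beta4
hi = beta4_hat + 1.64 * se_beta4
print(round(lo, 1), round(hi, 1))   # -5948.8 -4105.2
print(lo <= -3500 <= hi)            # False: -3500 is outside the interval
```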
6. Write down the null hypothesis for testing the assertion that the dif-
ference in mean earnings between southern and western workers of
the same height is twice as large as the difference in mean earnings
between midwestern and western workers of the same height.
(a) Find a way to re-write the model so that you can test this null hypothesis by performing a t-test on a single parameter. Hint: use the restriction on the parameters in H0 to express one of the parameters in terms of the other(s) and use this expression in (2).
The difference between southern and western workers' earnings, keeping height constant, is β4; the difference between midwestern and western workers' earnings, keeping height constant, is β3. Hence the hypothesis we are asked to write is H0: β4 − 2β3 = 0. This is a single restriction involving two coefficients.
Add and subtract 2β3 Southi in the main model to obtain a regression of Earnings on Height, Northeast, (Midwest + 2South), and South, in which the coefficient on Southi is θ = β4 − 2β3, so that H0 can be tested with a t-test on θ alone.²
² Note that this does not create a perfect multicollinearity problem because the dummy variable Midwest no longer appears on its own in the transformed model.
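A small simulation (a sketch with made-up data, not part of the original problem set) confirms that the coefficient on South in the reparameterized regression equals β̂4 − 2β̂3 from the original specification:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
height = rng.normal(67, 4, n)
region = rng.integers(0, 4, n)            # 0=West (omitted), 1=NE, 2=MW, 3=South
ne = (region == 1).astype(float)
mw = (region == 2).astype(float)
so = (region == 3).astype(float)
# hypothetical population coefficients, chosen only for illustration
y = 40000 + 500 * height + 1000 * ne - 2000 * mw - 5000 * so + rng.normal(0, 5000, n)

def ols(y, *cols):
    # OLS with an intercept, via least squares
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = ols(y, height, ne, mw, so)            # original model (2)
g = ols(y, height, ne, mw + 2 * so, so)   # reparameterized model
# the coefficient on South in the transformed model is theta = beta4 - 2*beta3
assert np.isclose(g[4], b[4] - 2 * b[3])
```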
8. What can you say about the OLS estimates of the regression coefficients of
a richer model like the following?
Earningsi = δ0 + δ1 Heighti + δ2 Northeasti + δ3 Midwesti + δ4 Westi + δ5 Southi + ui
(4)
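Since Northeast, Midwest, West, and South are mutually exclusive and exhaustive, they sum to one for every observation, so model (4) suffers from perfect multicollinearity with the intercept (the dummy variable trap): the OLS coefficients are not separately identified, and Stata would automatically drop one of the regressors. A small numpy illustration with simulated data (hypothetical, for demonstration only) shows the resulting rank deficiency of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
height = rng.normal(67, 4, n)
region = rng.integers(0, 4, n)
# all four mutually exclusive and exhaustive region dummies
dummies = np.column_stack([(region == r) for r in range(4)]).astype(float)

# design matrix of model (4): intercept + Height + all four dummies
X = np.column_stack([np.ones(n), height, dummies])

# the four dummies sum to the constant column, so X is rank deficient
print(X.shape[1], np.linalg.matrix_rank(X))  # 6 5
```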
Question 2
Look at the following code and output from an analysis performed in Stata
using the California school district dataset we often used as an example in
class.
First, we estimate the following model to explain district test scores
1. Construct the 95% confidence interval for the effect on test scores
of increasing the average income of residents in the district by 5000
dollars.
The Stata output table tells us that the 95% confidence interval for α3
(i.e. the effect on test scores of raising income by 1,000 dollars) is
[1.30;1.67]. Hence the 95% confidence interval for the effect on test
scores of raising income by 5,000 dollars is [5*1.30;5*1.67]=[6.5;8.35].
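The rescaling is just multiplication of both endpoints by 5, since the coefficient measures the effect per 1,000 dollars:

```python
lo, hi = 1.30, 1.67        # 95% CI for the effect of a $1,000 income increase
scaled = (5 * lo, 5 * hi)  # effect of a $5,000 increase
print(round(scaled[0], 2), round(scaled[1], 2))  # 6.5 8.35
```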
5%. However, the significance of the coefficient does not have any implication for the consistency or the unbiasedness of the estimate. These hinge on whether the least squares assumptions #1-#4 hold, none of which involves the p-value or the significance of the estimate.
Question 3
Suppose that in the model
Y = β0 + β1 X + u (6)
we assume
E(u|X) = c
where c is a number, instead of the standard assumption E(u|X) = 0. All
other assumptions are satisfied.
Adding and subtracting c from the right hand side of (6) gives (7).
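Explicitly, defining β̃0 = β0 + c and ũ = u − c, equation (7) reads:

```latex
Y = \underbrace{(\beta_0 + c)}_{\tilde{\beta}_0} + \beta_1 X + \underbrace{(u - c)}_{\tilde{u}}
  = \tilde{\beta}_0 + \beta_1 X + \tilde{u} \qquad (7)
```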
2. Does model (7) satisfy the standard LS assumption, namely E(ũ|X) =
0?
Yes, it does, because E(ũ|X) = E(u − c|X) = E(u|X) − c = c − c = 0.
3. What are the OLS estimators of the intercept and slope estimating
when you regress Y on X?
Because (7) satisfies LSA #1, we know that OLS is an unbiased and
consistent estimator of the intercept and slope in model (7), namely
of β̃0 and β1 . Thus, the interpretation of the constant term in the
model is not straightforward since it includes the mean value of the
error term (c) as well. In fact, by assuming E(u|X) = 0 we are essentially allowing the intercept to absorb (include) the possibly non-zero mean of u (given X). This is why we say that "0" is a normalization that does not affect anything except the intercept term.
Question 4 (optional)
This question addresses the “partialling-out” interpretation of OLS. We use
the data set “wage1” which you can find in PS4’s folder along with a file
describing its contents (“WAGE1.txt”).
3. Report the correlation between v̂ and each of the regressors (exper, female, smsa). You can use the "correlate" command in Stata. Are they correlated? Interpret this residual: what is v̂ picking up?
The OLS residual is uncorrelated in the sample with each and every one of the regressors. This is true "by construction", i.e., it is due to the way we define the OLS estimator.
Since we can write the dependent variable, educ, as equal to the predicted value, educ-hat, plus the residual v̂, we can then view the residual as that part of the dependent variable uncorrelated with the regressors. In other words, v̂ is picking up that part of educ uncorrelated with the other regressors (exper, female, smsa).
Note that since the predicted value is a linear combination of the
regressors and each of them is uncorrelated with v̂, so is the predicted
value.
4. Regress log wage on this residual v̂ and report its estimated coefficient
and standard error. Compare them to the estimate of β2 and its
standard error in (1).
The coefficient estimate is identical to β̂2 but its standard error is not!
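This partialling-out (Frisch-Waugh) result can be illustrated with simulated data (a sketch with made-up variables that only mirror wage1's names, not the actual data set):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# hypothetical data mimicking the wage1 variables
exper = rng.normal(10, 5, n)
female = (rng.random(n) < 0.5).astype(float)
smsa = (rng.random(n) < 0.7).astype(float)
educ = 12 + 0.1 * exper - 0.3 * female + 0.5 * smsa + rng.normal(0, 2, n)
lwage = 0.5 + 0.08 * educ + 0.02 * exper - 0.2 * female + 0.15 * smsa + rng.normal(0, 0.3, n)

def ols(y, *cols):
    # OLS with an intercept; returns coefficients and the design matrix
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0], X

# Step 1: regress educ on the other regressors; keep the residual v_hat
b, X = ols(educ, exper, female, smsa)
v_hat = educ - X @ b

# v_hat is uncorrelated with each regressor by construction
assert np.allclose(X.T @ v_hat, 0, atol=1e-6)

# Step 2: regressing log wage on v_hat alone reproduces the coefficient
# on educ from the full regression of log wage on all the regressors
full, _ = ols(lwage, educ, exper, female, smsa)
short, _ = ols(lwage, v_hat)
assert np.isclose(short[1], full[1])
```

The coefficient equality is exact, but the standard errors differ, because the single-regressor regression has different residuals and degrees of freedom than the full regression.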