Problem Set 5 With Solutions
Applied Statistics and Econometrics

Solution to Problem Set 5

Note: The material underlying the problems included in this assignment is covered in Chapters 6 and 7 of the textbook. If you had problems answering the questions, or you answered them incorrectly and the solutions below do not help, you should review the material there and go see the TA to make sure you understand the concepts and how to apply them.

Question 1
Look at the following code and output from an analysis performed in Stata using the “Earnings and Height” dataset you have become familiar with in the past few weeks. You can refer to the dataset documentation for a description of the variables.
First, we estimate the following model explaining earnings as a linear function of height:

Earnings_i = α0 + α1 Height_i + u_i    (1)

Next, we estimate the following model, where we add a set of dummy variables (see footnote 1) for the US region of residence of the individual:

Earnings_i = β0 + β1 Height_i + β2 Northeast_i + β3 Midwest_i + β4 South_i + u_i    (2)

Footnote 1: To be explicit, Northeast is a variable taking value 1 if the individual resides in the northeastern part of the US, and 0 otherwise; Midwest is a variable taking value 1 if the individual resides in the Midwest of the US, and 0 otherwise; and so on. Notice that Northeast, Midwest, West and South represent a set of mutually exclusive and exhaustive dummy variables.

1. What is the fraction of the variance of Earnings captured by the
first and the second model, respectively? What is the sum of squared
residuals for the second model?
This question can be answered by looking at the R² of the two models in the output table: it is approximately 1.1% for model (1) and 2.2% for model (2).
To obtain the Sum of Squared Residuals (SSR) for the second model, we recall the formula for the Standard Error of the Regression (SER):

SER = √( SSR / (n − (k + 1)) )

where n is the number of observations and k + 1 is the number of parameters in our model (k independent variables plus a constant term). The SER shows up as RMSE in the Stata output. Hence:

√( SSR / (17870 − 5) ) = 26624

Solving, we obtain that SSR is equal to (in scientific notation) 1.2663E13.
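
As a quick check, the arithmetic above can be done in Stata itself; the numbers are the ones quoted from the output, and e(rss) denotes the residual sum of squares stored by regress:

    * back out the SSR from the reported RMSE: SSR = RMSE^2 * (n - k - 1)
    display 26624^2 * (17870 - 5)
    * after estimating model (2), the same quantity is stored directly:
    * display e(rss)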

2. A classmate of yours comments that since the R² of model (2) is higher than that of model (1), we solved the omitted variable problem. Do you agree with such a statement? Why?
The statement is wrong. The change in R² tells us how well the model fits the data, but it says nothing about whether we do or do not have an omitted variable problem. In other words, a high R² does not say anything about whether E(u|X) = 0 is true or not.

3. Compute the t-statistic for the test H0: β2 = 2500 vs H1: β2 < 2500. Can you reject the null hypothesis at 1% significance?
The t-statistic for the proposed test is

t = (β̂2 − 2500) / SE(β̂2) = (2652.3 − 2500) / 626.1 = 0.24

Since the alternative is one-sided (β2 < 2500), we reject only if t falls below the 1% left-tail critical value of −2.33. Here t = 0.24 > −2.33, so we cannot reject the null hypothesis.
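
A minimal Stata sketch of the same calculation (the variable names earnings, height, northeast, midwest and south are assumptions based on the dataset documentation; _b[] and _se[] are the coefficient and standard error stored by regress):

    * t-statistic for H0: beta2 = 2500 vs H1: beta2 < 2500
    quietly regress earnings height northeast midwest south
    display (_b[northeast] - 2500) / _se[northeast]

Because the alternative is one-sided, the result is compared with the left-tail critical value −2.33 and the null is rejected only if the t-statistic falls below it.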

4. Does the 90% confidence interval for the coefficient on South include −3500?
The formula for the 90% confidence interval for the coefficient on South is:

[β̂4 − 1.64 × SE(β̂4); β̂4 + 1.64 × SE(β̂4)] = [−5027 − 1.64 × 562.1; −5027 + 1.64 × 562.1]
= [−5948.8; −4105.2]

Hence the 90% confidence interval for β4 does not include −3500.
You could have spared yourself the calculation by noticing that the 95% confidence interval for β4, which Stata reports automatically, already did not include −3500. Since the 90% confidence interval is by construction narrower than the 95% one, −3500 could not be included in the 90% interval either.
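
If you prefer to let Stata do the work, the regression can be re-displayed with 90% confidence intervals instead of the default 95% (variable names as in the earlier sketches):

    * report 90% confidence intervals for model (2)
    regress earnings height northeast midwest south, level(90)

The interval reported next to south can then be compared with −3500 directly.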

5. What are the predicted average annual earnings of a worker of average height (66.96 inches) residing in the West?
First, realize that an individual residing in the West is one for whom all the regional dummies included in model (2) are equal to 0. Then, using the estimates:

Predicted Earnings | West = 1, Height = 66.96 = β̂0 + β̂1 × 66.96 + β̂2 × 0 + β̂3 × 0 + β̂4 × 0
= 151.5 + 724.5 × 66.96 = 48664
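
A minimal Stata sketch of the same prediction (variable names as above; _b[_cons] is the estimated intercept):

    * predicted earnings for a West resident of average height (66.96 inches)
    quietly regress earnings height northeast midwest south
    display _b[_cons] + _b[height]*66.96

Because West is the omitted category, only the intercept and the height term enter the prediction.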

6. Write down the null hypothesis for testing the assertion that the difference in mean earnings between southern and western workers of the same height is twice as large as the difference in mean earnings between midwestern and western workers of the same height.

(a) Find a way to re-write the model so that you can test this null hypothesis by performing a t-test on a single parameter. Hint: use the restriction on the parameters in H0 to express one of the parameters in terms of the other(s) and use this expression in (2).
The difference between southern and western workers’ earnings keeping height constant is β4; the difference between midwestern and western workers’ earnings keeping height constant is β3. Hence the hypothesis we are asked to write is H0: β4 − 2β3 = 0. This is a single restriction involving two coefficients.
Add and subtract 2β3 South_i in the main model to obtain

Earnings_i = β0 + β1 Height_i + β2 Northeast_i + β3 Midwest_i + 2β3 South_i + β4 South_i − 2β3 South_i + u_i
           = β0 + β1 Height_i + β2 Northeast_i + β3 (Midwest_i + 2 × South_i) + (β4 − 2β3) South_i + u_i
           = β0 + β1 Height_i + β2 Northeast_i + β3 MS_i + γ South_i + u_i

where MS_i is a new regressor constructed as Midwest_i + 2 × South_i (see footnote 2). Since the coefficient γ = β4 − 2β3, the null hypothesis of the required test is H0: γ = 0.
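
A minimal Stata sketch of this reparametrization (variable names as above; the new regressor name ms is of course arbitrary):

    * transformed model: the t-test on south is now a test of H0: gamma = 0
    generate ms = midwest + 2*south
    regress earnings height northeast ms south

    * equivalent route: test the single restriction directly in model (2)
    quietly regress earnings height northeast midwest south
    test _b[south] = 2*_b[midwest]

The second route reports an F-statistic which, for a single restriction, is the square of the corresponding t-statistic.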

7. Suppose you now estimate the following model:

Earnings_i = γ0 + γ1 Height_i + γ2 Northeast_i + γ3 Midwest_i + γ4 West_i + u_i    (3)

What are the estimates of γ1, γ2 and γ3 you will obtain?
Here we are simply estimating the same model, changing the excluded group. Therefore, the coefficient on Height will not be affected at all, i.e., γ̂1 = β̂1. γ2 is the gap in earnings between somebody living in the Northeast and somebody living in the South. This information was already conveyed in model (2) by the difference between β2 and β4, that is, γ̂2 = β̂2 − β̂4 = 2652.3 − (−5027) = 7679.3. Similarly, γ̂3 = β̂3 − β̂4 = −2724.1 − (−5027) = 2302.9.

Footnote 2: Note that this does not create a perfect multicollinearity problem, because the dummy variable Midwest no longer appears in the transformed model.
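
Returning to item 7, a minimal Stata sketch of estimating model (3), i.e. the same regression with the South as the omitted category (variable names as above):

    * same model as (2), different excluded group
    regress earnings height northeast midwest west

The coefficient on height is unchanged, while each regional coefficient now measures the gap relative to the South.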

8. What can you say about the OLS estimates of the regression coefficients of a richer model like the following?

Earnings_i = δ0 + δ1 Height_i + δ2 Northeast_i + δ3 Midwest_i + δ4 West_i + δ5 South_i + u_i    (4)

This model cannot be estimated. By including a full set of exhaustive and mutually exclusive dummy variables, we have run into the dummy variable trap and created perfect multicollinearity between the regressors (in particular, Northeast_i + Midwest_i + West_i + South_i is equal to the constant). Hence, assumption #4 of the multivariate OLS model no longer holds.
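
As an aside, and as a statement about Stata's usual behaviour rather than about the output in this problem set: if you do ask for model (4), Stata detects the perfect collinearity and drops one of the offending dummies rather than failing outright.

    * Stata will omit one of the collinear dummies and flag it in the output
    regress earnings height northeast midwest west south

The results it reports are then those of a model with one region omitted, i.e. a model like (2) or (3).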

Question 2
Look at the following code and output from an analysis performed in Stata using the California school district dataset we often used as an example in class.
First, we estimate the following model to explain district test scores

TestScore_i = α0 + α1 STR_i + α2 EL_PCT_i + α3 AVGINC_i + α4 COMPSTU_i + u_i    (5)

where the definition of the variables is as follows:
- TestScore = average test score in the district
- STR = average student-teacher ratio in the district
- EL_PCT = average fraction of students learning English
- AVGINC = average income in the district (in thousands of dollars)
- COMPSTU = average number of PCs per student in the district
The results are as follows:

1. Construct the 95% confidence interval for the effect on test scores
of increasing the average income of residents in the district by 5000
dollars.
The Stata output table tells us that the 95% confidence interval for α3 (i.e. the effect on test scores of raising income by 1,000 dollars) is [1.30; 1.67]. Hence the 95% confidence interval for the effect on test scores of raising income by 5,000 dollars is [5 × 1.30; 5 × 1.67] = [6.5; 8.35].
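
A minimal Stata sketch of the same interval (the variable names testscore, str, el_pct, avginc and compstu are assumptions following the notation above; lincom reports the estimate and confidence interval of a linear combination of coefficients after regress):

    * effect on test scores of a $5,000 rise in average income, with its 95% CI
    quietly regress testscore str el_pct avginc compstu
    lincom 5*avginc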

2. Suppose you want to test the hypothesis H0: α1 = α2 = α3 = α4 = 0. How many restrictions does this hypothesis involve? How is the F-statistic that must be used to test it distributed? What is the value of such an F-statistic? Do you reject the null hypothesis at 5%? And at 1%?
This is a test for joint significance of the coefficients and it involves four restrictions. The F-statistic is distributed F(4, 415), which is well approximated by an F(4, ∞). The value of the test statistic is reported in the top-right part of the Stata output and is 227.50. The P-value for the test is also reported: since it is 0.0000, we reject the null at both 5% and 1% significance.
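
The same joint test can be reproduced with Stata's test command (variable names as above):

    * joint test that all four slope coefficients are zero
    quietly regress testscore str el_pct avginc compstu
    test str el_pct avginc compstu

Because every regressor is listed, the resulting F(4, 415) statistic coincides with the overall F-statistic shown in the regression header.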

3. The government has the budget to finance one of these two policies: a subsidy to families raising the average income by 1000 dollars, or a transfer to schools that raises the number of PCs per student by 0.1. How would you state the null hypothesis of a test that the two interventions have the same effect on test scores? How many restrictions does it involve?
The effect of raising income by 1,000 dollars is exactly α3 (recall that income is expressed in thousands of dollars). The effect of raising the number of computers per student by 0.1 is 0.1α4. Hence the null hypothesis that the two interventions have the same effect can be stated as H0: α3 = 0.1α4. This is a single restriction involving two coefficients.
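
A minimal Stata sketch of this single-restriction test (variable names as above):

    * H0: effect of +$1,000 income equals effect of +0.1 PCs per student
    quietly regress testscore str el_pct avginc compstu
    lincom avginc - 0.1*compstu

lincom reports the estimate of α3 − 0.1α4 with its standard error, t-statistic and P-value, so the hypothesis can be tested directly.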

4. What is the P-value for the test H0: α1 = 0? Is the effect of STR on test scores significant at 5%? Is the estimator of this effect (α̂1) consistent? Is it unbiased?
The relevant P-value is provided directly in the Stata output and is 0.863. Since 0.863 > 0.05, we cannot reject the null hypothesis at 5%. However, the significance of the coefficient does not have any implication for the consistency or the unbiasedness of the estimator. These hinge on whether the linear model assumptions #1-#4 hold, none of which involves the P-value or the significance of the estimate.

5. We fear that an omitted variable in our model is the number of teachers who can teach a programming class. Learning programming helps develop logic skills, which can boost students’ test scores. Moreover, schools with more PCs per student are more likely to hire more programming teachers. Can you conjecture the likely direction of the bias caused by the exclusion of the number of programming teachers from our model?
The question refers to the bias in estimating the effect of computers, α4. In our notation, we have

plim α̂4 = α4 + α5 × ρ(COMPSTU, PROGRAMTEACH) × σ_PROGRAMTEACH / σ_COMPSTU

where the second term is the omitted variable bias (OVB), α5 is the direct effect of the number of programming teachers on test scores (which we are told is positive), and ρ(COMPSTU, PROGRAMTEACH) is the correlation between the number of programming teachers and the number of computers per student, which is also positive according to the information in the text. Therefore the sign of the omitted variable bias should be positive (the estimated effect is larger than the actual one).

Question 3
Suppose that in the model
Y = β0 + β1 X + u (6)
we assume
E(u|X) = c
where c is a number, instead of the standard assumption E(u|X) = 0. All
other assumptions are satisfied.

1. Show that you can rewrite the model as


Y = β̃0 + β1 X + ũ
where
β̃0 = β0 + c and ũ = u − c (7)

Adding and subtracting c from the right hand side of (6) gives (7).
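
Spelling the step out (this is just a restatement of the algebra):

Y = β0 + β1 X + u = (β0 + c) + β1 X + (u − c) = β̃0 + β1 X + ũ

which is exactly (7).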

2. Does model (7) satisfy the standard LS assumption, namely E(ũ|X) =
0?
Yes it does because

E(ũ|X) = E(u − c|X) = E(u|X) − c = c − c = 0

3. What are the OLS estimators of the intercept and slope estimating when you regress Y on X?
Because (7) satisfies LSA #1, we know that OLS is an unbiased and consistent estimator of the intercept and slope in model (7), namely of β̃0 and β1. Thus, the interpretation of the constant term in the model is not straightforward, since it includes the mean value of the error term (c) as well. In fact, by imposing E(ũ|X) = 0 we are essentially allowing the intercept to absorb (include) the possibly non-zero mean of u (given X). This is why we say that the “0” is a normalization that does not affect anything except the intercept term.

4. Can you conclude that when the conditional mean of u given X is a non-zero constant, the only thing being affected is the estimation of the intercept, i.e., that OLS estimates β0 + c instead of β0, so that the estimated constant is difficult to interpret in most economic applications?
Yes!

Question 4 (optional)
This question addresses the “partialling-out” interpretation of OLS. We use
the data set “wage1” which you can find in PS4’s folder along with a file
describing its contents (“WAGE1.txt”).

1. Estimate the model

log wage = β1 + β2 educ + β3 exper + β4 female + β5 smsa + u

and report the estimate for β2 and its standard error.

2. Estimate the following regression

educ = δ1 + δ2 exper + δ3 female + δ4 smsa + v

and compute the residual v̂.

3. Report the correlation between v̂ and each of the regressors (exper, female, smsa). You can use the “correlate” command in Stata. Are they correlated? Interpret this residual: what is v̂ picking up?
The OLS residual is uncorrelated in the sample with each and all of the regressors. This is true “by construction”, i.e., it is due to the way we define the OLS estimator.
Since we can write the dependent variable educ as its fitted value plus the residual v̂, we can view the residual as that part of the dependent variable uncorrelated with the regressors. In other words, v̂ is picking up that part of educ uncorrelated with the other regressors (exper, female, smsa).
Note that since the fitted value is a linear combination of the regressors and each of them is uncorrelated with v̂, so is the fitted value.

4. Regress log wage on this residual v̂ and report its estimated coefficient
and standard error. Compare them to the estimate of β2 and its
standard error in (1).

The coefficient estimate is identical to β̂2 but its standard error is not!

5. What do you learn from this exercise?
OLS uses only the part of each regressor that is uncorrelated with the other regressors, its “independent variation”, to estimate its effect on the dependent variable. In this way, OLS “controls for” or “keeps constant” the other regressors.
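
A minimal Stata do-file sketch of the whole exercise (it assumes the wage1 dataset contains variables named lwage, educ, exper, female and smsa, as the accompanying WAGE1.txt description suggests):

    * (1) full regression: the coefficient on educ is beta2
    use wage1, clear
    regress lwage educ exper female smsa

    * (2) regress educ on the other regressors and keep the residual
    regress educ exper female smsa
    predict vhat, residuals

    * (3) the residual is uncorrelated with the regressors by construction
    correlate vhat exper female smsa

    * (4) regressing lwage on vhat alone reproduces the coefficient on educ
    regress lwage vhat

The coefficient on vhat in the last regression should match the coefficient on educ from step (1), illustrating the partialling-out (Frisch-Waugh) interpretation of OLS; the standard errors differ, as noted above.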
