Im ch01
Exercises 1.3 and 1.4 are the first of the series of computer assignments fitting educational attainment functions and earnings functions. Students will need a tailored guide to the regression package being used, and some may require supervised operating sessions. Tailored guides for all the assignments for Stata and EViews will be found on the website. Most major regression applications are now user-friendly once one has got started, so if these first assignments are supported well, there should be little need for support for the remainder. Note that the exercises involve interpretations of regression equations but not statistical tests. Tests are encountered for the first time in Chapter 2.
1.1*
The table below shows the average rates of growth of GDP, g, and employment, e, for 25 OECD countries for the period 1988–1997. The regression output shows the result of regressing e on g. Provide an interpretation of the coefficients.
Average annual percentage rates of growth of employment and real GDP, 1988–1997

                  Employment     GDP
Australia            1.68        3.04
Austria              0.65        2.55
Belgium              0.34        2.16
Canada               1.17        2.03
Denmark              0.02        2.02
Finland             -1.06        1.78
France               0.28        2.08
Germany              0.08        2.71
Greece               0.87        2.08
Iceland             -0.13        1.54
Ireland              2.16        6.40
Italy               -0.30        1.68
Japan                1.06        2.81
Korea                2.57        7.73
Luxembourg           3.02        5.64
Netherlands          1.88        2.86
New Zealand          0.91        2.01
Norway               0.36        2.98
Portugal             0.33        2.79
Spain                0.89        2.60
Sweden              -0.94        1.17
Switzerland          0.79        1.15
Turkey               2.02        4.18
United Kingdom       0.66        1.97
United States        1.53        2.46
. reg e g if e < 4.5

      Source |       SS       df       MS              Number of obs =      25
-------------+------------------------------           F(  1,    23) =   33.10
       Model |  14.5753023     1  14.5753023           Prob > F      =  0.0000
    Residual |  10.1266731    23  .440290135           R-squared     =  0.5900
-------------+------------------------------           Adj R-squared =  0.5722
       Total |  24.7019754    24  1.02924898           Root MSE      =  .66354

------------------------------------------------------------------------------
           e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           g |    .489737   .0851184     5.75   0.000     .3136561    .6658179
       _cons |  -.5458912   .2740387    -1.99   0.058    -1.112784    .0210011
------------------------------------------------------------------------------
Answer: Start by discussing the data in the table, and get students to explain why the growth rate of employment should not be as large as that of GDP (technical progress). Literally the regression implies that a one percentage point increase in the growth rate of GDP is associated with a 0.49 percentage point increase in the rate of growth of employment. The intercept suggests that, if GDP is static, employment will have a negative growth rate of 0.55 percent per year, and indeed in some slow-growing countries employment growth has actually been negative.

The table provided as a slide includes the observation for Mexico, which provides an opportunity for a discussion of data integrity. This observation was dropped from the regression because the figures were distorted by special circumstances. With the NAFTA agreement, US firms started moving their manufacturing plants to low-wage Mexico, recruiting workers many of whom had previously been employed in the informal sector. The official employment statistics, collected by social security, measure only employment in the formal sector and therefore grossly overestimate the net increase in employment. In 1997 alone, employment increased by 13.3 percent according to the official figures, clearly a nonsense.
[Figure: scatter diagram of the employment growth rate, e (vertical axis), against the GDP growth rate, g (horizontal axis).]
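1.2
In Exercise 1.1, \(\bar{e} = 0.83\), \(\bar{g} = 2.82\), \(\sum (e_i - \bar{e})(g_i - \bar{g}) = 29.76\), and \(\sum (g_i - \bar{g})^2 = 60.77\). Calculate the regression coefficients and check that they are the same as in the regression output.

Answer:

\[ b_2 = \frac{\sum (e_i - \bar{e})(g_i - \bar{g})}{\sum (g_i - \bar{g})^2} = \frac{29.76}{60.77} = 0.49, \qquad b_1 = \bar{e} - b_2 \bar{g} = 0.83 - 0.49 \times 2.82 = -0.55. \]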
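The hand calculation in Exercise 1.2 can also be checked directly from the table in Exercise 1.1. The following is a minimal sketch in plain Python rather than Stata. The helper name `ols_simple` is hypothetical, and the negative employment growth rates entered for Finland, Iceland, Italy and Sweden are an assumption, adopted because with those signs the data reproduce the regression output.

```python
# (employment growth e, GDP growth g) for the 25 countries in Exercise 1.1.
# Assumption: e is negative for Finland, Iceland, Italy and Sweden.
data = [
    (1.68, 3.04), (0.65, 2.55), (0.34, 2.16), (1.17, 2.03), (0.02, 2.02),
    (-1.06, 1.78), (0.28, 2.08), (0.08, 2.71), (0.87, 2.08), (-0.13, 1.54),
    (2.16, 6.40), (-0.30, 1.68), (1.06, 2.81), (2.57, 7.73), (3.02, 5.64),
    (1.88, 2.86), (0.91, 2.01), (0.36, 2.98), (0.33, 2.79), (0.89, 2.60),
    (-0.94, 1.17), (0.79, 1.15), (2.02, 4.18), (0.66, 1.97), (1.53, 2.46),
]


def ols_simple(x, y):
    """Return (b1, b2, sxy, sxx) for the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx            # slope
    b1 = y_bar - b2 * x_bar   # intercept
    return b1, b2, sxy, sxx


e = [row[0] for row in data]
g = [row[1] for row in data]
b1, b2, sxy, sxx = ols_simple(g, e)
print(f"sxy = {sxy:.2f}, sxx = {sxx:.2f}, b2 = {b2:.4f}, b1 = {b1:.4f}")
```

The printed slope and intercept can be compared with the `g` and `_cons` rows of the Stata output.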
1.3
      Source |       SS       df       MS              Number of obs =
-------------+------------------------------           F(  1,   538) =
       Model |   963.81124     1   963.81124           Prob > F      =
    Residual |  2363.18691   538  4.39254072           R-squared     =
-------------+------------------------------           Adj R-squared =
       Total |  3326.99815   539  6.17253831           Root MSE      =

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1391464   .0093936    14.81   0.000     .1206937    .1575991
       _cons |   6.569317   .4898844    13.41   0.000     5.606996    7.531637
------------------------------------------------------------------------------
The regression indicates that years of education are indeed positively associated with the ASVABC score and that a one-point increase in the score is associated with an extra 0.14 years of schooling. The interpretation of the constant is less straightforward. It indicates that an individual with a zero score would receive 6.57 years of schooling, but a literal interpretation should be resisted. Given that the standard deviation of ASVABC is approximately ten, a score of zero would be five standard deviations below the mean. The lowest score in Data Set 22 is in fact 25. In any data set, the lowest score is 22.

Clearly there is a possibility of some simultaneity in the model, in that the score might be a positive function of the number of years of schooling. This possibility is investigated further in a later exercise. However, the test items are so simple that it is reasonable to assume otherwise, at least as a first approximation. \(R^2\) is low, indicating that the ASVABC score accounts for a fairly small proportion of the variance in schooling.
1.4
Do earnings depend on education? Using your EAEF data set, fit an earnings function parallel to that discussed in Section 1.6, regressing EARNINGS on S, and give an interpretation of the coefficients. Comment on the value of \(R^2\).
Answer: Of course earnings depend (partly) on education. The regression indicates that each additional year of education increases hourly earnings by $2.58. The constant literally implies that an individual with no years of schooling would earn -$15.49 per hour. There are two ways out of this nonsense. One is to say that the fitted relationship can claim to be valid only for the range of values of S in the data set (7–20 in this one) and nothing can be inferred for values outside this range. The other is to say that the relationship may be nonlinear and the negative intercept is an artefact caused by forcing a linear relationship on the observations. \(R^2\) is very low, indicating that education accounts for only a small proportion of the variance in earnings.
. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  121.05
       Model |  22135.7511     1  22135.7511           Prob > F      =  0.0000
    Residual |  98382.6768   538  182.867429           R-squared     =  0.1837
-------------+------------------------------           Adj R-squared =  0.1822
       Total |  120518.428   539  223.596341           Root MSE      =  13.523

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.579413   .2344455    11.00   0.000     2.118872    3.039954
       _cons |  -15.49334   3.264622    -4.75   0.000    -21.90631   -9.080375
------------------------------------------------------------------------------
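The point about the range of S can be made concrete by evaluating the fitted line at a few values. This is a small illustrative sketch only: the coefficients are copied from the output above, the function name `fitted_earnings` is hypothetical, and the sample range of S is taken to be 7 to 20 years.

```python
# Coefficients copied from the Stata output for Exercise 1.4.
b1 = -15.49334   # intercept (_cons)
b2 = 2.579413    # slope coefficient of S


def fitted_earnings(s):
    """Fitted hourly earnings in dollars for s years of schooling."""
    return b1 + b2 * s


at_zero = fitted_earnings(0)    # outside the sample range: negative
at_low = fitted_earnings(7)     # bottom of the assumed sample range
at_high = fitted_earnings(20)   # top of the assumed sample range
break_even = -b1 / b2           # schooling at which the fitted line crosses zero

print(f"S = 0: {at_zero:.2f}, S = 7: {at_low:.2f}, S = 20: {at_high:.2f}")
print(f"fitted earnings are zero at S = {break_even:.2f}")
```

On these figures the fitted values are positive throughout the sample range, and the nonsensical negative prediction arises only through extrapolation below it.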
1.5*
The output shows the result of regressing the weight of the respondent in 1985, measured in pounds, on his or her height, measured in inches, using EAEF Data Set 21. Provide an interpretation of the coefficients.
. reg WEIGHT85 HEIGHT

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  355.97
       Model |  261111.383     1  261111.383           Prob > F      =  0.0000
    Residual |  394632.365   538  733.517407           R-squared     =  0.3982
-------------+------------------------------           Adj R-squared =  0.3971
       Total |  655743.748   539  1216.59322           Root MSE      =  27.084

------------------------------------------------------------------------------
    WEIGHT85 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HEIGHT |   5.192973    .275238    18.87   0.000       4.6523    5.733646
       _cons |  -194.6815    18.6629   -10.43   0.000    -231.3426   -158.0204
------------------------------------------------------------------------------
Answer: Literally the regression implies that, for every extra inch of height, an individual tends to weigh an extra 5.2 pounds. The intercept, which literally suggests that an individual with no height would weigh -195 lbs, has no meaning.
1.6
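Two individuals fit earnings functions relating EARNINGS to S as defined in Section 1.6, using EAEF Data Set 21. The first individual does it correctly and obtains the result found in Section 1.6:

\[ \widehat{EARNINGS} = -13.93 + 2.46\,S. \]

The second individual makes a mistake and regresses S on EARNINGS, obtaining the following result:

\[ \hat{S} = 12.29 + 0.070\,EARNINGS. \]

From this result the second individual derives

\[ \widehat{EARNINGS} = -175.57 + 14.29\,S. \]

Explain why this equation is different from that fitted by the first individual.

Answer: The first individual calculates the slope coefficient as

\[ \frac{\sum (S_i - \bar{S})(EARNINGS_i - \overline{EARNINGS})}{\sum (S_i - \bar{S})^2}. \]

The second individual calculates the slope coefficient of the reverse regression,

\[ \frac{\sum (EARNINGS_i - \overline{EARNINGS})(S_i - \bar{S})}{\sum (EARNINGS_i - \overline{EARNINGS})^2}, \]

and then takes the reciprocal, so he or she is effectively using the expression

\[ \frac{\sum (EARNINGS_i - \overline{EARNINGS})^2}{\sum (EARNINGS_i - \overline{EARNINGS})(S_i - \bar{S})}. \]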
Obviously the two individuals are using different estimators and therefore in general will obtain different results. It is worth illustrating what is happening graphically, drawing one diagram for the first individual, with EARNINGS on the vertical axis and S on the horizontal, and another diagram for the second, with the axes reversed. A few points representing the same scatter diagram should be drawn in both figures. (Obviously the scatter should be rotated about the 45 degree axis in the second.) In both diagrams the individuals are minimizing the sum of the squares of the residuals measured in the vertical dimension, which measures EARNINGS in the first and S in the second. Clearly they are using different criteria, and this accounts for the difference in the mathematical expressions. If the second figure is rotated about the 45 degree line and superimposed on the first, it can be seen that effectively the second individual is minimizing the sum of the squares of the residuals measured horizontally in that diagram.

The class might be asked under what conditions the estimates will in fact turn out to be identical. The answer is when the points happen to lie on a straight line. Then there will be an exact fit and there will be no residuals in either case. Mathematically, the condition is
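\[ \frac{\sum (S_i - \bar{S})(EARNINGS_i - \overline{EARNINGS})}{\sum (S_i - \bar{S})^2} = \frac{\sum (EARNINGS_i - \overline{EARNINGS})^2}{\sum (EARNINGS_i - \overline{EARNINGS})(S_i - \bar{S})} \]

which is

\[ \frac{\left[ \sum (EARNINGS_i - \overline{EARNINGS})(S_i - \bar{S}) \right]^2}{\sum (EARNINGS_i - \overline{EARNINGS})^2 \sum (S_i - \bar{S})^2} = 1. \]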
In other words, that the correlation coefficient should be plus or minus one.
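The relationship between the two estimators in Exercise 1.6 can be illustrated numerically. The sketch below uses a small made-up sample (the values in `s` and `earnings` are hypothetical, not EAEF data) and a hypothetical helper `slope`. It shows that the slope from the direct regression divided by the slope implied by the reverse regression equals the squared correlation, so the two coincide only when the fit is exact.

```python
def slope(x, y):
    """OLS slope coefficient in the regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical sample: years of schooling and hourly earnings.
s = [8, 10, 12, 12, 14, 16, 18, 20]
earnings = [7.5, 11.0, 9.0, 16.0, 14.5, 25.0, 21.0, 33.0]

direct = slope(s, earnings)            # first individual: EARNINGS on S
implied = 1 / slope(earnings, s)       # second individual: reciprocal of S on EARNINGS
r_squared = slope(s, earnings) * slope(earnings, s)  # squared correlation

print(f"direct = {direct:.3f}, implied = {implied:.3f}, r^2 = {r_squared:.3f}")

# With points lying exactly on a line the two estimates coincide.
exact = [-13.93 + 2.46 * v for v in s]
direct_exact = slope(s, exact)
implied_exact = 1 / slope(exact, s)
```

Because the squared correlation is less than one in the scattered sample, the implied slope is steeper than the direct one, which is the pattern seen in the exercise (14.29 against 2.46).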
1.7*
Derive, with a proof, the slope coefficient that would have been obtained in Exercise 1.5 if weight and height had been measured in metric units. (Note: one pound is 454 grams, and one inch is 2.54 cm.)
Answer: Let the weight and height be W and H in imperial units and WM and HM in metric units. We will measure WM in kilos. Then WM = 0.454W and HM = 2.54H. The slope coefficient for the regression in metric units, \(b_2^M\), is given by

\[ b_2^M = \frac{\sum (HM_i - \overline{HM})(WM_i - \overline{WM})}{\sum (HM_i - \overline{HM})^2} = \frac{\sum 2.54 (H_i - \bar{H}) \, 0.454 (W_i - \bar{W})}{2.54^2 \sum (H_i - \bar{H})^2} = 0.179 \, \frac{\sum (H_i - \bar{H})(W_i - \bar{W})}{\sum (H_i - \bar{H})^2} = 0.179 \, b_2 = 0.929. \]
In other words, weight increases at the rate of nearly one kilo per centimetre. The regression output confirms that the calculations are correct (subject to rounding error in the last digit).
. g WM = 0.454*WEIGHT85
. g HM = 2.54*HEIGHT
. reg WM HM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  355.97
       Model |  53819.2324     1  53819.2324           Prob > F      =  0.0000
    Residual |   81340.044   538  151.189673           R-squared     =  0.3982
-------------+------------------------------           Adj R-squared =  0.3971
       Total |  135159.276   539  250.759325           Root MSE      =  12.296

------------------------------------------------------------------------------
          WM |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          HM |   .9281928   .0491961    18.87   0.000     .8315528    1.024833
       _cons |  -88.38539   8.472958   -10.43   0.000    -105.0295   -71.74125
------------------------------------------------------------------------------
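The conversion can be checked without rounding the factor 0.454/2.54 to 0.179. The following sketch simply rescales the imperial coefficients reported in Exercise 1.5. The variable names are hypothetical, and the intercept check is an addition that follows from the same algebra (the intercept is measured in units of weight only, so it scales by 0.454).

```python
# Coefficients from the imperial-unit regression in Exercise 1.5.
b2_imperial = 5.192973    # pounds per inch
b1_imperial = -194.6815   # pounds

KG_PER_POUND = 0.454
CM_PER_INCH = 2.54

# The slope is in weight units per height unit, so it scales by 0.454 / 2.54.
b2_metric = b2_imperial * KG_PER_POUND / CM_PER_INCH
# The intercept is in weight units only, so it scales by 0.454.
b1_metric = b1_imperial * KG_PER_POUND

print(f"b2 (kg per cm) = {b2_metric:.7f}")
print(f"b1 (kg)        = {b1_metric:.5f}")
```

Both values can be compared with the `HM` and `_cons` rows of the metric regression output.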
1.8*
A researcher has data on the aggregate expenditure on services, Y, and aggregate disposable personal income, X, both measured in $ billion at constant prices, for each of the US states and fits the equation

\[ Y_i = \beta_1 + \beta_2 X_i + u_i. \]

The researcher initially fits the equation using OLS regression analysis. However, suspecting that tax evasion causes both Y and X to be substantially underestimated, the researcher adopts two alternative methods of compensating for the under-reporting:
1. The researcher adds $90 billion to the data for Y in each state and $200 billion to the data for X.
2. The researcher increases the figures for both Y and X in each state by 10 percent.
Evaluate the impact of the adjustments on the regression results.

Answer: First adjustment: Let the adjusted data be Y* and X* where Y* = Y + 90 and X* = X + 200. Then \(\bar{Y}^* = \bar{Y} + 90\) and \(\bar{X}^* = \bar{X} + 200\), so the added constants cancel in the deviations from the means:

\[ b_2^* = \frac{\sum (X_i^* - \bar{X}^*)(Y_i^* - \bar{Y}^*)}{\sum (X_i^* - \bar{X}^*)^2} = \frac{\sum (X_i + 200 - \bar{X} - 200)(Y_i + 90 - \bar{Y} - 90)}{\sum (X_i + 200 - \bar{X} - 200)^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = b_2. \]

Hence the new slope coefficient will be identical to the original one. The constant will be affected since

\[ b_1^* = \bar{Y}^* - b_2^* \bar{X}^* = (\bar{Y} + 90) - b_2 (\bar{X} + 200) = \bar{Y} - b_2 \bar{X} + 90 - 200 b_2 = b_1 + 90 - 200 b_2. \]
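Second adjustment: Let the adjusted data again be Y* and X* where Y* = 1.1Y and X* = 1.1X. Then

\[ b_2^* = \frac{\sum (X_i^* - \bar{X}^*)(Y_i^* - \bar{Y}^*)}{\sum (X_i^* - \bar{X}^*)^2} = \frac{\sum (1.1 X_i - 1.1 \bar{X})(1.1 Y_i - 1.1 \bar{Y})}{\sum (1.1 X_i - 1.1 \bar{X})^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = b_2. \]

Hence the new slope coefficient will again be identical to the old one. The constant is affected since

\[ b_1^* = \bar{Y}^* - b_2^* \bar{X}^* = 1.1 \bar{Y} - b_2 (1.1 \bar{X}) = 1.1 b_1. \]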
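Both results in Exercise 1.8 can be confirmed numerically. The sketch below uses invented figures for X and Y (they are hypothetical, not state data) and a hypothetical helper `ols`, and compares the coefficients before and after each adjustment.

```python
def ols(x, y):
    """Return (b1, b2): intercept and slope from regressing y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b2 * x_bar, b2


# Hypothetical data in $ billion: income X and expenditure on services Y.
x = [120.0, 85.0, 240.0, 60.0, 310.0, 150.0]
y = [50.0, 38.0, 95.0, 30.0, 118.0, 66.0]

b1, b2 = ols(x, y)

# Adjustment 1: add 200 to every X and 90 to every Y.
b1_add, b2_add = ols([v + 200 for v in x], [v + 90 for v in y])

# Adjustment 2: increase every X and Y by 10 percent.
b1_scale, b2_scale = ols([1.1 * v for v in x], [1.1 * v for v in y])

print(f"original: b1 = {b1:.4f}, b2 = {b2:.4f}")
print(f"additive: b1 = {b1_add:.4f}, b2 = {b2_add:.4f}")
print(f"scaled:   b1 = {b1_scale:.4f}, b2 = {b2_scale:.4f}")
```

The slope is unchanged in both cases, while the intercept becomes b1 + 90 - 200 b2 under the first adjustment and 1.1 b1 under the second.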
1.9*
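A researcher has international cross-sectional data on aggregate wages, W, aggregate profits, P, and aggregate income, Y, for a sample of n countries. By definition, \(Y_i = W_i + P_i\). The regressions

\[ \hat{W}_i = a_1 + a_2 Y_i \]
\[ \hat{P}_i = b_1 + b_2 Y_i \]

are fitted using OLS regression analysis. Show that the regression coefficients will automatically satisfy the following equations:

\[ a_2 + b_2 = 1 \]
\[ a_1 + b_1 = 0. \]

Answer:

\[ a_2 + b_2 = \frac{\sum (Y_i - \bar{Y})(W_i - \bar{W})}{\sum (Y_i - \bar{Y})^2} + \frac{\sum (Y_i - \bar{Y})(P_i - \bar{P})}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (Y_i - \bar{Y})(W_i + P_i - \bar{W} - \bar{P})}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (Y_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 \]

\[ a_1 + b_1 = (\bar{W} - a_2 \bar{Y}) + (\bar{P} - b_2 \bar{Y}) = (\bar{W} + \bar{P}) - (a_2 + b_2) \bar{Y} = \bar{Y} - \bar{Y} = 0 \]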
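The adding-up property in Exercise 1.9 holds for any data that satisfy the identity Y = W + P, which a short numerical check makes plain. The figures below are invented for illustration and the helper `ols` is hypothetical.

```python
def ols(x, y):
    """Return (intercept, slope) from regressing y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Hypothetical aggregate wages and profits for six countries.
wages = [62.0, 140.0, 35.0, 210.0, 98.0, 77.0]
profits = [30.0, 55.0, 22.0, 120.0, 41.0, 50.0]
income = [w + p for w, p in zip(wages, profits)]  # Y = W + P by definition

a1, a2 = ols(income, wages)     # W regressed on Y
b1, b2 = ols(income, profits)   # P regressed on Y

print(f"a2 + b2 = {a2 + b2:.10f}")
print(f"a1 + b1 = {a1 + b1:.10f}")
```

Up to floating-point rounding, the slopes sum to one and the intercepts sum to zero whatever values are chosen for wages and profits.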
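1.10 Derive from first principles the least squares estimator of \(\beta_2\) in the model

\[ Y_i = \beta_2 X_i + u_i. \]

First define RSS, the sum of the squares of the residuals, and then differentiate.

Answer: Let the fitted line be \(\hat{Y}_i = b_2 X_i\). Then \(e_i = Y_i - b_2 X_i\) and

\[ RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - b_2 X_i)^2 = \sum_{i=1}^{n} Y_i^2 - 2 b_2 \sum_{i=1}^{n} X_i Y_i + b_2^2 \sum_{i=1}^{n} X_i^2. \]

The first-order condition is

\[ \frac{d\,RSS}{d b_2} = -2 \sum_{i=1}^{n} X_i Y_i + 2 b_2 \sum_{i=1}^{n} X_i^2 = 0 \]

and so

\[ b_2 = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}. \]

C. Dougherty 2001–2006. All rights reserved. Version of 09.10.06.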
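1.11 Derive from first principles the least squares estimator of \(\beta_1\) in the model

\[ Y_i = \beta_1 + u_i. \]

(In other words, Y consists simply of a constant plus a disturbance term. First define RSS and then differentiate.)

Answer: Let the fitted value be \(\hat{Y}_i = b_1\). Then

\[ RSS = \sum_{i=1}^{n} (Y_i - b_1)^2 = \sum_{i=1}^{n} (Y_i^2 - 2 b_1 Y_i + b_1^2) = \sum_{i=1}^{n} Y_i^2 + \sum_{i=1}^{n} (-2 b_1 Y_i) + n b_1^2 = \sum_{i=1}^{n} Y_i^2 - 2 b_1 \sum_{i=1}^{n} Y_i + n b_1^2. \]

The first-order condition is

\[ \frac{d\,RSS}{d b_1} = -2 \sum_{i=1}^{n} Y_i + 2 n b_1 = 0. \]

Hence

\[ b_1 = \bar{Y}. \]

The second derivative of RSS, 2n, is positive, confirming that we have found a minimum. Effectively, the model assumes that Y is a random variable with unknown population mean \(\beta_1\), and we have shown that the sample mean is the least squares estimator.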
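The two closed-form estimators derived in Exercises 1.10 and 1.11 can be checked by confirming that moving away from them in either direction raises RSS. This is only a sketch on made-up data. The function names and the values in `x` and `y` are hypothetical.

```python
def rss_no_intercept(b2, x, y):
    """RSS for the fitted line Y-hat = b2 * X (no intercept)."""
    return sum((yi - b2 * xi) ** 2 for xi, yi in zip(x, y))


def rss_constant_only(b1, y):
    """RSS for the fitted value Y-hat = b1 (constant only)."""
    return sum((yi - b1) ** 2 for yi in y)


# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Closed-form estimators from the derivations.
b2_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
b1_hat = sum(y) / len(y)  # the sample mean of Y

for step in (-0.1, 0.1):
    # Perturbing either estimate increases the residual sum of squares.
    print(step,
          rss_no_intercept(b2_hat + step, x, y) > rss_no_intercept(b2_hat, x, y),
          rss_constant_only(b1_hat + step, y) > rss_constant_only(b1_hat, y))
```

Since RSS is quadratic in the coefficient in both models, a rise in both directions is enough to confirm the minimum.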
1.12 Explain mathematically and intuitively what would happen if you tried to fit a regression equation when all the values of the explanatory variable in the sample are the same.

Answer: Intuitively: one cannot explain variations in Y with X if there is no variation in X. Mathematically: the numerator and the denominator of the slope coefficient would both be zero:

\[ b_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}. \]
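A tiny example shows the 0/0 problem in Exercise 1.12 directly. The data are hypothetical, and the behaviour shown is that of plain Python floating-point division, not of any particular regression package.

```python
# Hypothetical sample in which X takes the same value in every observation.
x = [5.0, 5.0, 5.0, 5.0]
y = [2.0, 7.0, 4.0, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
denominator = sum((xi - x_bar) ** 2 for xi in x)

try:
    b2 = numerator / denominator
    slope_defined = True
except ZeroDivisionError:
    # With no variation in X the slope estimator is 0/0 and cannot be computed.
    slope_defined = False

print(numerator, denominator, slope_defined)
```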
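1.13 Using the data in Table 1.5, calculate the correlation between \(Y\) and \(\hat{Y}\) and verify that its square is equal to the value of \(R^2\).

Answer:

Observation        Y       Y - mean of Y
1                  3         -1.6667
2                  5          0.3333
3                  6          1.3333
Total             14
Mean           4.6667

\[ r_{Y,\hat{Y}} = \frac{\sum (Y_i - \bar{Y})(\hat{Y}_i - \bar{\hat{Y}})}{\sqrt{\sum (Y_i - \bar{Y})^2 \sum (\hat{Y}_i - \bar{\hat{Y}})^2}} = \frac{4.5}{\sqrt{4.6667 \times 4.5}} = 0.9820. \]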
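The calculation in Exercise 1.13 can be reproduced in a few lines. The Y values are those shown above. The X values 1, 2, 3 are an assumption about Table 1.5 (they do not appear in this section) and are needed only to generate the fitted values. All names are hypothetical.

```python
import math

x = [1.0, 2.0, 3.0]   # assumed X values for Table 1.5
y = [3.0, 5.0, 6.0]   # Y values from the answer table

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Fit the regression of Y on X and form the fitted values.
b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b1 = y_bar - b2 * x_bar
y_hat = [b1 + b2 * xi for xi in x]
y_hat_bar = sum(y_hat) / n

# Correlation between actual and fitted values.
cov = sum((yi - y_bar) * (fi - y_hat_bar) for yi, fi in zip(y, y_hat))
tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((fi - y_hat_bar) ** 2 for fi in y_hat)
r = cov / math.sqrt(tss * ess)

# R-squared computed independently from the residuals.
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
r_squared = 1 - rss / tss

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, R^2 = {r_squared:.4f}")
```

Under the stated assumption about X, the sums of squares and cross-products match the 4.5 and 4.6667 used in the answer, and the square of the correlation agrees with \(R^2\).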
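1.14 What was the value of \(R^2\) in the educational attainment function fitted by you in Exercise 1.3? Comment on it.

Answer: \(R^2\) is low in every data set, indicating that the ASVABC score accounts for a fairly small proportion of the variance in schooling.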
1.15 What was the value of \(R^2\) in the earnings function fitted by you in Exercise 1.4? Comment on it.
Answer: \(R^2\) is very low in every data set, indicating that education accounts for only a small proportion of the variance in earnings.
1.16* The output shows the result of regressing weight in 2002 on height, using EAEF Data Set 21. In 2002 the respondents were aged 37–44. Explain why \(R^2\) is lower than in the regression reported in Exercise 1.5.
. reg WEIGHT02 HEIGHT

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  216.95
       Model |  311260.383     1  311260.383           Prob > F      =  0.0000
    Residual |  771880.527   538  1434.72217           R-squared     =  0.2874
-------------+------------------------------           Adj R-squared =  0.2860
       Total |  1083140.91   539  2009.53787           Root MSE      =  37.878

------------------------------------------------------------------------------
    WEIGHT02 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HEIGHT |   5.669766   .3849347    14.73   0.000     4.913606    6.425925
       _cons |  -199.6832   26.10105    -7.65   0.000    -250.9556   -148.4107
------------------------------------------------------------------------------
Answer: The explained sum of squares is actually higher than that in Exercise 1.5. The reason for the fall in \(R^2\) is the huge increase in the total sum of squares, no doubt caused by the cumulative effect of variations in eating habits.
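The comparison in Exercise 1.16 can be read straight off the two analysis-of-variance tables, since \(R^2\) is the explained sum of squares divided by the total sum of squares. The sketch below copies the figures from the Stata output for Exercises 1.5 and 1.16. The variable names are hypothetical.

```python
# Explained (Model) and total sums of squares from the two Stata outputs.
ess_1985, tss_1985 = 261111.383, 655743.748    # WEIGHT85 on HEIGHT
ess_2002, tss_2002 = 311260.383, 1083140.91    # WEIGHT02 on HEIGHT

r2_1985 = ess_1985 / tss_1985
r2_2002 = ess_2002 / tss_2002

ess_growth = ess_2002 / ess_1985   # proportional rise in explained sum of squares
tss_growth = tss_2002 / tss_1985   # proportional rise in total sum of squares

print(f"R^2 1985 = {r2_1985:.4f}, R^2 2002 = {r2_2002:.4f}")
print(f"ESS grew by a factor of {ess_growth:.2f}, TSS by {tss_growth:.2f}")
```

The explained sum of squares rises, but the total sum of squares rises proportionally more, which is why the ratio falls.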
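1.17* The useful results in Box 1.2 are in general no longer valid if the model does not contain an intercept.

With no intercept, the fitted line is \(\hat{Y}_i = b_2 X_i\) with \(b_2 = \sum X_j Y_j / \sum X_j^2\) (see Exercise 1.10), so the residual in observation i is

\[ e_i = Y_i - \frac{\sum X_j Y_j}{\sum X_j^2} X_i. \]

Hence

\[ \bar{e} = \bar{Y} - \frac{\sum X_j Y_j}{\sum X_j^2} \bar{X}. \]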
There is no reason why the right side of the equation should be zero and one can provide an empirical counterexample by running a regression. The following output shows the result of regressing earnings on schooling, using EAEF Data Set 21, with no intercept. The residuals are saved as e. The mean value of e is -0.43.
. reg EARNINGS S, nocon

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   539) = 1260.52
       Model |  224309.221     1  224309.221           Prob > F      =  0.0000
    Residual |  95914.8706   539  177.949667           R-squared     =  0.7005
-------------+------------------------------           Adj R-squared =  0.6999
       Total |  320224.092   540  593.007577           Root MSE      =   13.34

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   1.467575   .0413357    35.50   0.000     1.386376    1.548773
------------------------------------------------------------------------------
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
           E |     540   -.4287836   13.33287  -20.26634   96.70881
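The same point can be demonstrated on a toy sample without the EAEF data. The sketch below is illustrative only: the values in `x` and `y` are hypothetical. It fits a regression with and without an intercept and compares the mean residual in each case with the expression for the mean residual in the no-intercept model.

```python
# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [4.0, 4.5, 6.5, 6.0, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Regression without an intercept: b2 = sum(XY) / sum(X^2).
b2_nocon = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
mean_e_nocon = sum(yi - b2_nocon * xi for xi, yi in zip(x, y)) / n

# Regression with an intercept, for comparison.
b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b1 = y_bar - b2 * x_bar
mean_e_con = sum(yi - b1 - b2 * xi for xi, yi in zip(x, y)) / n

print(f"mean residual, no intercept:   {mean_e_nocon:.4f}")
print(f"mean of Y minus b2 * mean of X: {y_bar - b2_nocon * x_bar:.4f}")
print(f"mean residual, with intercept: {mean_e_con:.4f}")
```

Without the intercept the mean residual equals the mean of Y minus b2 times the mean of X, which is not zero here. With the intercept it is zero up to rounding.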