Logit Model For Binary Data
3.1
We start by introducing an example that will be used to illustrate the analysis of binary data. We then discuss the stochastic structure of the data in
terms of the Bernoulli and binomial distributions, and the systematic structure in terms of the logit transformation. The result is a generalized linear
model with binomial response and link logit.
3.1.1
Table 3.1, adapted from Little (1978), shows the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey of
1975, classified by current age, level of education, desire for more children,
and contraceptive use.
In our analysis of these data we will view current use of contraception as
the response or dependent variable of interest and age, education and desire
for more children as predictors. Note that the response has two categories:
use and non-use. In this example all predictors are treated as categorical.
G. Rodríguez. Revised September 2007
Table 3.1: Current Use of Contraception by Age, Education and Desire for More Children

                     Desires More     Contraceptive Use
Age      Education   Children?        No      Yes     Total
<25      Lower       Yes              53      6       59
                     No               10      4       14
         Upper       Yes              212     52      264
                     No               50      10      60
25–29    Lower       Yes              60      14      74
                     No               19      10      29
         Upper       Yes              155     54      209
                     No               65      27      92
30–39    Lower       Yes              112     33      145
                     No               77      80      157
         Upper       Yes              118     46      164
                     No               68      78      146
40–49    Lower       Yes              35      6       41
                     No               46      48      94
         Upper       Yes              8       8       16
                     No               12      31      43
Total                                 1100    507     1607
For models involving discrete factors we can obtain exactly the same
results working with grouped data or with individual data, but grouping is
convenient because it leads to smaller datasets. If we were to incorporate
continuous predictors into the model we would need to work with the original 1607 observations. Alternatively, it might be possible to group cases
with identical covariate patterns, but the resulting dataset may not be much
smaller than the original one.
The basic aim of our analysis will be to describe the way in which contraceptive use varies by age, education and desire for more children. An
example of the type of research question that we will consider is the extent
to which the association between education and contraceptive use is affected
by the fact that women with upper primary or higher education are younger
and tend to prefer smaller families than women with lower primary education
or less.
3.1.2
We consider first the case where the response yi is binary, assuming only two
values that for convenience we code as one or zero. For example, we could
define
$$y_i = \begin{cases} 1 & \text{if the $i$-th woman is using contraception} \\ 0 & \text{otherwise.} \end{cases}$$
We view $y_i$ as a realization of a random variable $Y_i$ that can take the values
one and zero with probabilities $\pi_i$ and $1 - \pi_i$, respectively. The distribution
of $Y_i$ is called a Bernoulli distribution with parameter $\pi_i$, and can be written
in compact form as
$$\Pr\{Y_i = y_i\} = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}, \tag{3.1}$$
for $y_i = 0, 1$.
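The mean and variance of $Y_i$ are $\pi_i$ and $\pi_i(1 - \pi_i)$, so both depend on the
underlying probability. A linear model that allows
the predictors to affect the mean but assumes that the variance is constant
will therefore not be adequate for the analysis of binary data.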
Suppose now that the units under study can be classified according to
the factors of interest into k groups in such a way that all individuals in a
group have identical values of all covariates. In our example, women may be
classified into 16 different groups in terms of their age, education and desire
for more children. Let ni denote the number of observations in group i, and
let yi denote the number of units who have the attribute of interest in group
i. For example, let
yi = number of women using contraception in group i.
We view $y_i$ as a realization of a random variable $Y_i$ that takes the values
$0, 1, \ldots, n_i$. If the $n_i$ observations in each group are independent, and they
all have the same probability $\pi_i$ of having the attribute of interest, then the
distribution of $Y_i$ is binomial with parameters $\pi_i$ and $n_i$, which we write
$$Y_i \sim B(n_i, \pi_i).$$
The probability distribution function of $Y_i$ is given by
$$\Pr\{Y_i = y_i\} = \binom{n_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i} \tag{3.3}$$
for $y_i = 0, 1, \ldots, n_i$.
The mean $n_i \pi_i$ and the variance $n_i \pi_i (1 - \pi_i)$ of $Y_i$ both depend
on the underlying probability $\pi_i$. Any factor that affects this probability will
affect both the mean and the variance of the observations.
From a mathematical point of view the grouped data formulation given
here is the most general one; it includes individual data as the special case
where we have n groups of size one, so k = n and ni = 1 for all i. It also
includes as a special case the other extreme where the underlying probability
is the same for all individuals and we have a single group, with k = 1 and
n1 = n. Thus, all we need to consider in terms of estimation and testing is
the binomial distribution.
From a practical point of view it is important to note that if the predictors are discrete factors and the outcomes are independent, we can use
the Bernoulli distribution for the individual zero-one data or the binomial
distribution for grouped data consisting of counts of successes in each group.
The two approaches are equivalent, in the sense that they lead to exactly
the same likelihood function and therefore the same estimates and standard
errors. Working with grouped data when it is possible has the additional
advantage that, depending on the size of the groups, it becomes possible to
test the goodness of fit of the model. In terms of our example we can work
with 16 groups of women (or fewer when we ignore some of the predictors)
and obtain exactly the same estimates as we would if we worked with the
1607 individuals.
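The equivalence of the two formulations can be checked numerically. The sketch below uses the overall counts from the example (507 users among 1607 women) and three trial probabilities chosen arbitrarily for illustration; the function names are hypothetical. It shows that the Bernoulli and binomial log-likelihoods differ only by a constant that does not involve the probability, so both are maximized at the same value.

```python
import math

def bernoulli_loglik(y, n, p):
    """Sum of n individual zero-one log-likelihood terms, y of them ones."""
    ones = sum(math.log(p) for _ in range(y))
    zeros = sum(math.log(1 - p) for _ in range(n - y))
    return ones + zeros

def binomial_loglik(y, n, p):
    """Log-likelihood of the grouped count y out of n."""
    return math.log(math.comb(n, y)) + y * math.log(p) + (n - y) * math.log(1 - p)

y, n = 507, 1607                       # users and total women in the example
trial_probabilities = (0.2, 0.3, 0.5)  # arbitrary values, for illustration only
gaps = [binomial_loglik(y, n, p) - bernoulli_loglik(y, n, p)
        for p in trial_probabilities]
print(gaps)  # the same constant each time: the log of the binomial coefficient
```

Because the gap is the same for every probability, the two log-likelihoods have identical derivatives and therefore yield the same estimates and standard errors.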
In Appendix B we show that the binomial distribution belongs to Nelder
and Wedderburn's (1972) exponential family, so it fits in our general theoretical framework.
3.1.3
The next step in defining a model for our data concerns the systematic
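structure. We would like to have the probabilities $\pi_i$ depend on a vector
of observed covariates $x_i$. The simplest idea would be to let $\pi_i$ be a linear
function of the covariates, say
$$\pi_i = x_i'\beta, \tag{3.5}$$
where $\beta$ is a vector of regression coefficients. One problem with this linear
probability model is that the probability on the left-hand side must lie between
zero and one, whereas the linear predictor $x_i'\beta$ on the right-hand side can take
any real value, so there is no guarantee that the
predicted values will be in the correct range unless complex restrictions are
imposed on the coefficients.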
A simple solution to this problem is to transform the probability to remove the range restrictions, and model the transformation as a linear function of the covariates. We do this in two steps.
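First, we move from the probability $\pi_i$ to the odds
$$\text{odds}_i = \frac{\pi_i}{1 - \pi_i},$$
which removes the ceiling restriction, because the odds can take any positive value.
Second, we take logarithms, calculating the logit or log-odds
$$\eta_i = \text{logit}(\pi_i) = \log \frac{\pi_i}{1 - \pi_i}, \tag{3.6}$$
which has the effect of removing the floor restriction. To see this point note
that as the probability goes down to zero the odds approach zero and the logit
approaches $-\infty$. At the other extreme, as the probability approaches one
the odds approach $+\infty$ and so does the logit. Thus, logits map probabilities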
from the range (0, 1) to the entire real line. Note that if the probability
is 1/2 the odds are even and the logit is zero. Negative logits represent
probabilities below one half and positive logits correspond to probabilities
above one half. Figure 3.1 illustrates the logit transformation.
Logits may also be defined in terms of the binomial mean $\mu_i = n_i \pi_i$ as
the log of the ratio of expected successes $\mu_i$ to expected failures $n_i - \mu_i$. The
result is exactly the same because the binomial denominator $n_i$ cancels out
when calculating the odds.
In the contraceptive use data there are 507 users of contraception among
1607 women, so we estimate the probability as 507/1607 = 0.316. The odds
are 507/1100 or 0.461 to one, so non-users outnumber users roughly two to
one. The logit is log(0.461) = -0.775.
The logit transformation is one-to-one. The inverse transformation is
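sometimes called the antilogit, and allows us to go back from logits to probabilities.
Writing $\eta_i$ for the logit of $\pi_i$ and solving for the probability gives
$$\pi_i = \text{logit}^{-1}(\eta_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}}. \tag{3.7}$$

[Figure 3.1: The logit transformation. Vertical axis: probability; horizontal axis: logit.]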
In the contraceptive use data the estimated logit was -0.775. Exponentiating this value we obtain odds of exp(-0.775) = 0.461 and from this we
obtain a probability of 0.461/(1 + 0.461) = 0.316.
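The round trip between probabilities, odds and logits is easy to reproduce. The following is a minimal sketch using only the counts quoted above; the function names `logit` and `antilogit` are chosen here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between zero and one."""
    return math.log(p / (1 - p))

def antilogit(eta):
    """Inverse of the logit: maps any real number back to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

users, women = 507, 1607
p = users / women            # estimated probability of using contraception
odds = p / (1 - p)           # same as 507/1100
eta = logit(p)               # the log of the odds
back = antilogit(eta)        # recovers the original probability
print(p, odds, eta, back)
```

Since the transformation is one-to-one, `back` agrees with `p` up to floating-point error.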
We are now in a position to define the logistic regression model, by
assuming that the logit of the probability $\pi_i$, rather than the probability
itself, follows a linear model.
3.1.4
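The model combines the two components introduced above. The stochastic structure is binomial,
$$Y_i \sim B(n_i, \pi_i), \tag{3.8}$$
and the systematic structure assumes that the logit of the underlying probability is a
linear function of the predictors,
$$\text{logit}(\pi_i) = x_i'\beta, \tag{3.9}$$
where $x_i$ is a vector of covariates and $\beta$ is a vector of regression coefficients.
Solving for the probability $\pi_i$ gives
$$\pi_i = \frac{\exp\{x_i'\beta\}}{1 + \exp\{x_i'\beta\}}. \tag{3.11}$$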
While the left-hand side is in the familiar probability scale, the right-hand side is a non-linear function of the predictors, and there is no simple way
to express the effect on the probability of increasing a predictor by one unit
while holding the other variables constant. We can obtain an approximate
answer by taking derivatives with respect to $x_j$, which of course makes sense
only for continuous predictors. Using the quotient rule we get
$$\frac{d\pi_i}{dx_{ij}} = \beta_j \pi_i (1 - \pi_i).$$
Thus, the effect of the $j$-th predictor on the probability $\pi_i$ depends on the
coefficient $\beta_j$ and the value of the probability. Analysts sometimes evaluate
this product setting $\pi_i$ to the sample mean (the proportion of cases with the
attribute of interest in the sample). The result approximates the effect of
the covariate near the mean of the response.
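As a quick check of the derivative, the sketch below compares the product $\beta_j \pi_i (1 - \pi_i)$ with a finite-difference approximation. The coefficient 0.061 and the probability 0.316 are illustrative values only, and the step size `h` is an arbitrary choice.

```python
import math

def antilogit(eta):
    return 1 / (1 + math.exp(-eta))

beta = 0.061   # illustrative logit coefficient for a continuous predictor
pi = 0.316     # probability at which the effect is evaluated

# Analytic marginal effect on the probability scale.
marginal = beta * pi * (1 - pi)

# Finite-difference check: raise the predictor by a small amount h,
# which raises the logit by beta * h.
eta = math.log(pi / (1 - pi))
h = 1e-6
numeric = (antilogit(eta + beta * h) - antilogit(eta)) / h
print(marginal, numeric)
```

With these values a one-unit increase in the predictor raises the probability by a little over one percentage point near the sample mean.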
In the examples that follow we will emphasize working directly in the
logit scale, but we will often translate effects into odds ratios to help in
interpretation.
Before we leave this topic it may be worth considering the linear probability model of Equation 3.5 one more time. In addition to the fact that the
linear predictor $x_i'\beta$ may yield values outside the (0, 1) range, one should
consider whether it is reasonable to assume linear effects on a probability
scale that is subject to floor and ceiling effects. An incentive, for example,
may increase the probability of taking an action by ten percentage points
when the probability is a half, but couldn't possibly have that effect if the
baseline probability was 0.95. This suggests that the assumption of a linear
effect across the board may not be reasonable.
In contrast, suppose the effect of the incentive is 0.4 in the logit scale,
which is equivalent to approximately a 50% increase in the odds of taking
the action. If the original probability is a half the logit is zero, and adding
0.4 to the logit gives a probability of 0.6, so the effect is ten percentage
points, just as before. If the original probability is 0.95, however, the logit
is almost three, and adding 0.4 in the logit scale gives a probability of 0.97,
an effect of just two percentage points. An effect that is constant in the
logit scale translates into varying effects on the probability scale, adjusting
automatically as one approaches the floor of zero or the ceiling of one. This
feature of the transformation is clearly seen from Figure 3.1.
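The incentive example can be verified directly. This short sketch applies the same shift of 0.4 logits to the two baseline probabilities mentioned above and reports the change on the probability scale.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def antilogit(eta):
    return 1 / (1 + math.exp(-eta))

effect = 0.4  # constant effect on the logit scale
shifts = {}
for baseline in (0.5, 0.95):
    new_probability = antilogit(logit(baseline) + effect)
    shifts[baseline] = new_probability - baseline
    print(baseline, round(new_probability, 3), round(shifts[baseline], 3))

odds_multiplier = math.exp(effect)  # factor by which the odds are multiplied
```

The multiplier is about 1.49, which is the "approximately 50% increase in the odds" referred to in the text.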
3.2
3.2.1
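Although you will probably use a statistical package to compute the estimates, here is a brief description of the underlying procedure. The likelihood
for independent binomial observations is a product of terms of the form given in Equation 3.3,
and it is usually maximized by iteratively re-weighted least squares. Given current estimates
$\hat\beta$, with linear predictor $\hat\eta_i = x_i'\hat\beta$ and fitted values
$\hat\mu_i = n_i \, \text{logit}^{-1}(\hat\eta_i)$, we calculate the working dependent variable
$$z_i = \hat\eta_i + \frac{y_i - \hat\mu_i}{\hat\mu_i (n_i - \hat\mu_i)} \, n_i,$$
and regress it on the covariates using the weights
$$w_i = \frac{\hat\mu_i (n_i - \hat\mu_i)}{n_i},$$
repeating the process until the estimates converge. Suitable starting values can be obtained from the empirical logits
$$z_i = \log \frac{y_i + 1/2}{n_i - y_i + 1/2},$$
where adding 1/2 avoids problems with counts of zero or $n_i$.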
3.2.2
Suppose we have just fitted a model and want to assess how well it fits the
data. A measure of discrepancy between observed and fitted values is the
deviance statistic, which is given by
$$D = 2 \sum \left\{ y_i \log\left(\frac{y_i}{\hat\mu_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - \hat\mu_i}\right) \right\}, \tag{3.13}$$
where $y_i$ is the observed and $\hat\mu_i$ is the fitted value for the $i$-th observation.
Note that this statistic is twice a sum of observed times log of observed over
expected, where the sum is over both successes and failures (i.e. we compare
both $y_i$ and $n_i - y_i$ with their expected values). In a perfect fit the ratio
observed over expected is one and its logarithm is zero, so the deviance is
zero.
In Appendix B we show that this statistic may be constructed as a likelihood ratio test that compares the model of interest with a saturated model
that has one parameter for each observation.
With grouped data, the distribution of the deviance statistic as the group
sizes $n_i \to \infty$ for all $i$ converges to a chi-squared distribution with $n - p$
d.f., where $n$ is the number of groups and $p$ is the number of parameters
in the model, including the constant. Thus, for reasonably large groups,
the deviance provides a goodness of fit test for the model. With individual
data the distribution of the deviance does not converge to a chi-squared
(or any other known) distribution, and cannot be used as a goodness of fit
test. We will, however, consider other diagnostic tools that can be used with
individual data.
An alternative measure of goodness of fit is Pearson's chi-squared statistic, which for binomial data can be written as
$$\chi^2_P = \sum_i \frac{n_i (y_i - \hat\mu_i)^2}{\hat\mu_i (n_i - \hat\mu_i)}. \tag{3.14}$$
Note that each term in the sum is the squared difference between observed
and fitted values $y_i$ and $\hat\mu_i$, divided by the variance of $y_i$, which is
$\mu_i (n_i - \mu_i)/n_i$, estimated using $\hat\mu_i$ for $\mu_i$.
3.2.3
Tests of Hypotheses
A $100(1 - \alpha)\%$ confidence interval for the coefficient $\beta_j$ is given by
$$\hat\beta_j \pm z_{1-\alpha/2} \sqrt{\widehat{\text{var}}(\hat\beta_j)},$$
where $z_{1-\alpha/2}$ is the normal critical value for a two-sided test of size $\alpha$. Confidence intervals for effects in the logit scale can be translated into confidence
intervals for odds ratios by exponentiating the boundaries.
The Wald test can be applied to test hypotheses concerning several
coefficients by calculating the usual quadratic form. This test can also be
inverted to obtain confidence regions for vector-valued parameters, but we
will not consider this extension.
For more general problems we consider the likelihood ratio test. The key
to constructing these tests is the deviance statistic introduced in the previous
subsection. In a nutshell, the likelihood ratio test to compare two nested
models is based on the difference between their deviances.
To fix ideas, consider partitioning the model matrix and the vector of
coefficients into two components
$$X = (X_1, X_2) \quad \text{and} \quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$
3.3
We start our applications of logit regression with the simplest possible example: a two by two table. We study a binary outcome in two groups, and
introduce the odds ratio and the logit analogue of the two-sample t test.
3.3.1
A 2-by-2 Table
We will use the contraceptive use data classified by desire for more children,
as summarized in Table 3.2.

Table 3.2: Contraceptive Use by Desire for More Children

Desires    Using    Not Using      All
$i$        $y_i$    $n_i - y_i$    $n_i$
Yes        219      753            972
No         288      347            635
All        507      1100           1607
3.3.2
Testing Homogeneity
There are only two possible models we can entertain for these data. The
first one is the null model. This model assumes homogeneity, so the two
groups have the same probability and therefore the same logit
$$\text{logit}(\pi_i) = \eta.$$
The m.l.e. of the common logit is -0.775, which happens to be the logit of the
sample proportion 507/1607 = 0.316. The standard error of the estimate is
0.054. This value can be used to obtain an approximate 95% confidence interval
for the logit with boundaries (-0.880, -0.669). Calculating the antilogit of
these values, we obtain a 95% confidence interval for the overall probability
of using contraception of (0.293, 0.339).
The deviance for the null model happens to be 91.7 on one d.f. (two
groups minus one parameter). This value is highly significant, indicating
that this model does not fit the data, i.e. the two groups classified by desire
for more children do not have the same probability of using contraception.
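The figures quoted for the null model can be reproduced from Table 3.2. The sketch below assumes the large-sample standard error of a logit, the square root of one over the successes plus one over the failures, and evaluates the deviance of Equation 3.13 and Pearson's statistic of Equation 3.14 with fitted values based on the pooled proportion. Variable names are illustrative.

```python
import math

using = [219, 288]    # users among women who desire / do not desire more children
totals = [972, 635]   # group sizes

def antilogit(eta):
    return 1 / (1 + math.exp(-eta))

# Null model: one common probability, estimated by the pooled proportion.
p = sum(using) / sum(totals)
eta = math.log(p / (1 - p))
se = math.sqrt(1 / sum(using) + 1 / (sum(totals) - sum(using)))  # assumed large-sample formula
lower, upper = eta - 1.96 * se, eta + 1.96 * se
interval = (antilogit(lower), antilogit(upper))  # interval on the probability scale

# Goodness of fit of the null model.
fitted = [n_i * p for n_i in totals]
deviance = 2 * sum(
    y_i * math.log(y_i / mu_i) + (n_i - y_i) * math.log((n_i - y_i) / (n_i - mu_i))
    for y_i, n_i, mu_i in zip(using, totals, fitted)
)
pearson = sum(
    n_i * (y_i - mu_i) ** 2 / (mu_i * (n_i - mu_i))
    for y_i, n_i, mu_i in zip(using, totals, fitted)
)
print(eta, se, interval, deviance, pearson)
```

Both statistics have one d.f. here, and both are far above the five percent critical value of 3.84.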
3.3.3
The other model that we can entertain for the two-by-two table is the one-factor model, where we write
$$\text{logit}(\pi_i) = \eta + \alpha_i,$$
where $\eta$ is an overall logit and $\alpha_i$ is the effect of group $i$ on the logit. Just as
in the one-way anova model, we need to introduce a restriction to identify
this model. We use the reference cell method, and set $\alpha_1 = 0$. The model
can then be written
$$\text{logit}(\pi_i) = \begin{cases} \eta & i = 1 \\ \eta + \alpha_2 & i = 2 \end{cases}$$
so that $\eta$ becomes the logit of the reference cell, and $\alpha_2$ is the effect of level
two of the factor compared to level one, or more simply the difference in
logits between level two and the reference cell. Table 3.3 shows parameter
estimates and standard errors for this model.
Table 3.3: Parameter Estimates for Logit Model of
Contraceptive Use by Desire for More Children

Parameter    Symbol        Estimate    Std. Error    z-ratio
Constant     $\eta$        -1.235      0.077         -16.09
Desire       $\alpha_2$    1.049       0.111         9.48
The estimate of $\eta$ is, as you might expect, the logit of the observed
proportion using contraception among women who desire more children,
logit(219/972) = -1.235. The estimate of $\alpha_2$ is the difference between the
logits of the two groups, logit(288/635) - logit(219/972) = 1.049.
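The standard errors in Table 3.3 follow from the large-sample variance of the logit of an
observed proportion,
$$\frac{1}{\mu_i} + \frac{1}{n_i - \mu_i},$$
which can be estimated by substituting $y_i$ for $\mu_i$.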
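The entries of Table 3.3 can be recomputed from the frequencies in Table 3.2. This sketch assumes the large-sample variance of a logit, one over the successes plus one over the failures, and treats the two groups as independent, so the variance of the difference in logits is the sum of the two variances. The helper name is hypothetical.

```python
import math

# (using, not using) for each category of desire for more children
wants_more = (219, 753)
wants_no_more = (288, 347)

def logit_and_variance(successes, failures):
    """Logit of an observed proportion and its assumed large-sample variance."""
    return math.log(successes / failures), 1 / successes + 1 / failures

eta, var_eta = logit_and_variance(*wants_more)         # reference cell
other, var_other = logit_and_variance(*wants_no_more)

alpha2 = other - eta                       # difference in logits
se_eta = math.sqrt(var_eta)
se_alpha2 = math.sqrt(var_eta + var_other) # independent groups: variances add
odds_ratio = math.exp(alpha2)
z_ratios = (eta / se_eta, alpha2 / se_alpha2)
print(eta, se_eta, alpha2, se_alpha2, odds_ratio, z_ratios)
```

The odds ratio equals the cross-product ratio of the two-by-two table, (288 × 753)/(347 × 219), which is about 2.85.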
3.3.4
It might be instructive to compare the results obtained here with the conventional analysis of this type of data, which focuses on the sample proportions
and their difference. In our example, the proportions using contraception are
0.225 among women who want another child and 0.453 among those who do
not. The difference of 0.228 has a standard error of 0.024 (calculated using
the pooled estimate of the proportion). The corresponding z-ratio is 9.62
and is equivalent to a chi-squared of 92.6 on one d.f.
Note that the result coincides with the Pearson chi-squared statistic testing the goodness of fit of the null model. In fact, Pearson's chi-squared and
the conventional test for equality of two proportions are one and the same.
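The conventional calculation takes only a few lines. The sketch below uses the counts from Table 3.2 and the pooled estimate of the proportion for the standard error, as described above.

```python
import math

using = [219, 288]
totals = [972, 635]

p_more = using[0] / totals[0]      # proportion using among women who want another child
p_no_more = using[1] / totals[1]   # proportion using among women who do not
pooled = sum(using) / sum(totals)

difference = p_no_more - p_more
se = math.sqrt(pooled * (1 - pooled) * (1 / totals[0] + 1 / totals[1]))
z = difference / se
chi_squared = z ** 2               # one d.f.
print(difference, se, z, chi_squared)
```

Squaring the z-ratio gives the chi-squared value on one d.f., which matches Pearson's statistic for the null model.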
In the case of two samples it is debatable whether the group effect is
best measured in terms of a difference in probabilities, the odds-ratio, or even
some other measures such as the relative difference proposed by Sheps (1961).
For arguments on all sides of this issue see Fleiss (1973).
3.4
Let us take a more general look at logistic regression models with a single
predictor by considering the comparison of k groups. This will help us
illustrate the logit analogues of one-way analysis of variance and simple
linear regression models.
3.4.1
A k-by-Two Table
Table 3.4: Contraceptive Use by Age

Age      Using    Not Using      Total
$i$      $y_i$    $n_i - y_i$    $n_i$
<25      72       325            397
25–29    105      299            404
30–39    237      375            612
40–49    93       101            194
Total    507      1100           1607

3.4.2
Consider now a one-factor model, where we allow each group or level of the
discrete factor to have its own logit. We write the model as
$$\text{logit}(\pi_i) = \eta + \alpha_i.$$
To avoid redundancy we adopt the reference cell method and set $\alpha_1 = 0$,
as before. Then $\eta$ is the logit of the reference group, and $\alpha_i$ measures the
difference in logits between level $i$ of the factor and the reference level. This
model is exactly analogous to an analysis of variance model. The model
matrix $X$ consists of a column of ones representing the constant and $k - 1$
columns of dummy variables representing levels two to $k$ of the factor.
Fitting this model to Table 3.4 leads to the parameter estimates and
standard errors in Table 3.5. The deviance for this model is of course zero
because the model is saturated: it uses four parameters to model four groups.
Table 3.5: Estimates and Standard Errors for Logit Model
of Contraceptive Use by Age in Groups

Parameter       Symbol        Estimate    Std. Error    z-ratio
Constant        $\eta$        -1.507      0.130         -11.57
Age   25–29     $\alpha_2$    0.461       0.173         2.67
      30–39     $\alpha_3$    1.048       0.154         6.79
      40–49     $\alpha_4$    1.425       0.194         7.35
The baseline logit of -1.51 for women under age 25 corresponds to odds
of 0.22. Exponentiating the age coefficients we obtain odds ratios of 1.59,
2.85 and 4.16. Thus, the odds of using contraception increase by 59% and
185% as we move to ages 25–29 and 30–39, and are quadrupled for ages
40–49, all compared to women under age 25.
All of these estimates can be obtained directly from the frequencies in
Table 3.4 in terms of the logits of the observed proportions. For example
the constant is logit(72/397) = -1.507, and the effect for women 25–29 is
logit(105/404) minus the constant.
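The calculation just described can be carried out for all four age groups at once. This sketch uses only the frequencies of contraceptive use by age in Table 3.4; the list names are illustrative.

```python
import math

ages = ["<25", "25-29", "30-39", "40-49"]
using = [72, 105, 237, 93]
not_using = [325, 299, 375, 101]

# Logit of the observed proportion in each age group.
logits = [math.log(y / f) for y, f in zip(using, not_using)]

constant = logits[0]                                  # reference group: under 25
effects = [value - constant for value in logits[1:]]  # differences in logits
odds_ratios = [math.exp(effect) for effect in effects]

for age, effect, ratio in zip(ages[1:], effects, odds_ratios):
    print(age, round(effect, 3), round(ratio, 2))
```

Because the model is saturated, these differences in observed logits are exactly the maximum likelihood estimates.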
To test the hypothesis of no age effects we can compare this model with
the null model. Since the present model is saturated, the difference in deviances is exactly the same as the deviance of the null model, which was 79.2
on three d.f. and is highly significant. An alternative test of
$$H_0: \alpha_2 = \alpha_3 = \alpha_4 = 0$$
is based on the estimates and their variance-covariance matrix. Let $\alpha =
(\alpha_2, \alpha_3, \alpha_4)'$. Then
$$\hat\alpha = \begin{pmatrix} 0.461 \\ 1.048 \\ 1.425 \end{pmatrix}.$$
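A Wald test of this hypothesis uses the quadratic form $\hat\alpha' \hat V^{-1} \hat\alpha$, where $\hat V$ is the estimated variance-covariance matrix of $\hat\alpha$. As a rough sketch, the code below rebuilds $\hat V$ from the frequencies in Table 3.4 under two assumptions: each observed logit has large-sample variance equal to one over the successes plus one over the failures, and the age groups are independent, so every pair of effects shares the variance of the reference group. The function `solve` is a hypothetical helper written for this example.

```python
import math

using = [72, 105, 237, 93]
not_using = [325, 299, 375, 101]

logits = [math.log(y / f) for y, f in zip(using, not_using)]
variances = [1 / y + 1 / f for y, f in zip(using, not_using)]  # assumed large-sample variances

alpha = [value - logits[0] for value in logits[1:]]
k = len(alpha)
# var(alpha_i) = v_1 + v_i, and cov(alpha_i, alpha_j) = v_1 for i != j.
V = [[variances[0] + (variances[i + 1] if i == j else 0.0) for j in range(k)]
     for i in range(k)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

wald = sum(a * s for a, s in zip(alpha, solve(V, alpha)))  # alpha' V^{-1} alpha
print(wald)
```

Under these assumptions the statistic comes out at about 74.4 on three d.f., close to the likelihood ratio statistic of 79.2.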
3.4.3
A One-Variate Model
Note that the estimated logits in Table 3.5 (and therefore the odds and
probabilities) increase monotonically with age. In fact, the logits seem to
increase by approximately the same amount as we move from one age group
to the next. This suggests that the effect of age may actually be linear in
the logit scale.
To explore this idea we treat age as a variate rather than a factor. A
thorough exploration would use the individual data with age in single years
(or equivalently, a 35 by two table of contraceptive use by age in single
years from 15 to 49). However, we can obtain a quick idea of whether the
model would be adequate by keeping age grouped into four categories but
representing these by the mid-points of the age groups. We therefore consider
a model analogous to simple linear regression, where
$$\text{logit}(\pi_i) = \alpha + \beta x_i,$$
where $x_i$ takes the values 20, 27.5, 35 and 45, respectively, for the four age
groups. This model fits into our general framework, and corresponds to
the special case where the model matrix X has two columns, a column of
ones representing the constant and a column with the mid-points of the age
groups, representing the linear effect of age.
Fitting this model gives a deviance of 2.40 on two d.f., which indicates
a very good fit. The parameter estimates and standard errors are shown in
Table 3.6. Incidentally, there is no explicit formula for the estimates of the
constant and slope in this model, so we must rely on iterative procedures to
obtain the estimates.
Table 3.6: Estimates and Standard Errors for Logit Model
of Contraceptive Use with a Linear Effect of Age

Parameter       Symbol      Estimate    Std. Error    z-ratio
Constant        $\alpha$    -2.673      0.233         -11.46
Age (linear)    $\beta$     0.061       0.007         8.54
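Since there is no explicit formula for these estimates, it may help to see one iterative scheme written out. The following is a minimal sketch of iteratively re-weighted least squares for this two-parameter model, assuming the working response and weights that are standard for a binomial model with logit link. The starting values of zero and the fixed number of iterations are simplifying choices of this sketch, not part of the source.

```python
import math

x = [20.0, 27.5, 35.0, 45.0]   # mid-points of the age groups
y = [72, 105, 237, 93]         # women using contraception (Table 3.4)
n = [397, 404, 612, 194]       # women in each age group

def fit_linear_logit(x, y, n, iterations=25):
    """IRLS for logit(pi_i) = alpha + beta * x_i with binomial counts."""
    alpha, beta = 0.0, 0.0     # starting values (an assumption of this sketch)
    for _ in range(iterations):
        eta = [alpha + beta * x_i for x_i in x]
        mu = [n_i / (1 + math.exp(-e)) for n_i, e in zip(n, eta)]
        w = [m * (n_i - m) / n_i for m, n_i in zip(mu, n)]              # weights
        z = [e + (y_i - m) / w_i for e, y_i, m, w_i in zip(eta, y, mu, w)]  # working response
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(w_i * x_i for w_i, x_i in zip(w, x))
        swxx = sum(w_i * x_i * x_i for w_i, x_i in zip(w, x))
        swz = sum(w_i * z_i for w_i, z_i in zip(w, z))
        swxz = sum(w_i * x_i * z_i for w_i, x_i, z_i in zip(w, x, z))
        det = sw * swxx - swx ** 2
        alpha = (swxx * swz - swx * swxz) / det
        beta = (sw * swxz - swx * swz) / det
    # Standard errors from the inverse of the weighted cross-product matrix.
    return alpha, beta, math.sqrt(swxx / det), math.sqrt(sw / det)

def deviance(alpha, beta):
    total = 0.0
    for x_i, y_i, n_i in zip(x, y, n):
        mu = n_i / (1 + math.exp(-(alpha + beta * x_i)))
        total += y_i * math.log(y_i / mu) + (n_i - y_i) * math.log((n_i - y_i) / (n_i - mu))
    return 2 * total

alpha, beta, se_alpha, se_beta = fit_linear_logit(x, y, n)
print(alpha, beta, se_alpha, se_beta, deviance(alpha, beta))
```

A production routine would normally stop when the change in the estimates or the deviance falls below a tolerance rather than running a fixed number of iterations.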
The slope indicates that the logit of the probability of using contraception
increases 0.061 for every year of age. Exponentiating this value we note that
the odds of using contraception are multiplied by 1.063 (that is, increase
6.3%) for every year of age. Note, by the way, that $e^\beta \approx 1 + \beta$ for small
$|\beta|$. Thus, when the logit coefficient is small in magnitude, $100\beta$ provides
a quick approximation to the percent change in the odds associated with
a unit change in the predictor. In this example the effect is 6.3% and the
approximation is 6.1%.
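The quality of the approximation depends on the size of the coefficient. The sketch below compares the exact and approximate percent changes in the odds for the age slope and, for contrast, for the much larger coefficient of 1.049 from Table 3.3; the function name is illustrative.

```python
import math

def percent_change_in_odds(coefficient):
    """Exact and approximate percent change in the odds per unit change."""
    exact = 100 * (math.exp(coefficient) - 1)
    approximate = 100 * coefficient
    return exact, approximate

small = percent_change_in_odds(0.061)   # slope of age in Table 3.6
large = percent_change_in_odds(1.049)   # effect of desire in Table 3.3
print(small, large)
```

For the large coefficient the approximation gives about 105% against an exact value near 185%, so the shortcut should be reserved for coefficients that are small in magnitude.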
To test the significance of the slope we can use the Wald test, which
gives a z statistic of 8.54 or equivalently a chi-squared of 73.9 on one d.f.
Alternatively, we can construct a likelihood ratio test by comparing this
model with the null model. The difference in deviances is 76.8 on one d.f.
Comparing these results with those in the previous subsection shows that
we have captured most of the age effect using a single degree of freedom.
Adding the estimated constant to the product of the slope by the midpoints of the age groups gives estimated logits at each age, and these may be
compared with the logits of the observed proportions using contraception.
The results of this exercise appear in Figure 3.2. The visual impression of the
graph confirms that the fit is quite good. In this example the assumption of
linear effects on the logit scale leads to a simple and parsimonious model. It
would probably be worthwhile to re-estimate this model using the individual
data with age in single years.

[Figure 3.2: Observed and fitted logits of contraceptive use by age. Vertical axis: logit; horizontal axis: age.]
3.5
We now consider models involving two predictors, and discuss the binary
data analogues of two-way analysis of variance, multiple regression with
dummy variables, and analysis of covariance models. An important element
of the discussion concerns the key concepts of main effects and interactions.
3.5.1
Consider the distribution of contraceptive use by age and desire for more
children, as summarized in Table 3.7. We have a total of eight groups,
which will be indexed by a pair of subscripts i, j, with i = 1, 2, 3, 4 referring
to the four age groups and j = 1, 2 denoting the two categories of desire for
more children. We let yij denote the number of women using contraception
and nij the total number of women in age group i and category j of desire
for more children.
We now analyze these data under the usual assumption of a binomial
error structure, so the yij are viewed as realizations of independent random
variables $Y_{ij} \sim B(n_{ij}, \pi_{ij})$.
Table 3.7: Contraceptive Use by Age and Desire for More Children

Age      Desires    Using       Not Using            All
$i$      $j$        $y_{ij}$    $n_{ij} - y_{ij}$    $n_{ij}$
<25      Yes        58          265                  323
         No         14          60                   74
25–29    Yes        68          215                  283
         No         37          84                   121
30–39    Yes        79          230                  309
         No         158         145                  303
40–49    Yes        14          43                   57
         No         79          58                   137
Total               507         1100                 1607

3.5.2
There are five basic models of interest for the systematic structure of these
data, ranging from the null to the saturated model. These models are listed
in Table 3.8, which includes the name of the model, a descriptive notation,
the formula for the linear predictor, the deviance or goodness of fit likelihood
ratio chi-squared statistic, and the degrees of freedom.
Note first that the null model does not fit the data: the deviance of 145.7
on 7 d.f. is much greater than 14.1, the 95-th percentile of the chi-squared
distribution with 7 d.f. This result is not surprising, since we already knew
that contraceptive use depends on desire for more children and varies by age.
Table 3.8: Deviance Table for Models of Contraceptive Use
by Age (Grouped) and Desire for More Children

Model        Notation    logit($\pi_{ij}$)                                      Deviance    d.f.
Null                     $\eta$                                                 145.7       7
Age          A           $\eta + \alpha_i$                                      66.5        4
Desire       D           $\eta + \beta_j$                                       54.0        6
Additive     A + D       $\eta + \alpha_i + \beta_j$                            16.8        3
Saturated    AD          $\eta + \alpha_i + \beta_j + (\alpha\beta)_{ij}$       0           0
Introducing age in the model reduces the deviance to 66.5 on four d.f.
The difference in deviances between the null model and the age model provides a test for the gross effect of age. The difference is 79.2 on three d.f.,
and is highly significant. This value is exactly the same that we obtained in
the previous section, when we tested for an age effect using the data classified by age only. Moreover, the estimated age effects based on fitting the
age model to the three-way classification in Table 3.7 would be exactly the
same as those estimated in the previous section, and have the property of
reproducing exactly the proportions using contraception in each age group.
This equivalence illustrates an important property of binomial models.
All information concerning the gross effect of age on contraceptive use is
contained in the marginal distribution of contraceptive use by age. We can
work with the data classified by age only, by age and desire for more children,
by age, education and desire for more children, or even with the individual
data. In all cases the estimated effects, standard errors, and likelihood ratio
tests based on differences between deviances will be the same.
The deviances themselves will vary, however, because they depend on
the context. In the previous section the deviance of the age model was
zero, because treating age as a factor reproduces exactly the proportions
using contraception by age. In this section the deviance of the age model is
66.5 on four d.f. and is highly significant, because the age model does not
reproduce well the table of contraceptive use by both age and preferences.
In both cases, however, the difference in deviances between the age model
and the null model is 79.2 on three d.f.
The next model in Table 3.8 is the model with a main effect of desire
for more children, and has a deviance of 54.0 on six d.f. Comparison of
this value with the deviance of the null model shows a gain of 91.7 at the
expense of one d.f., indicating a highly significant gross effect of desire for
more children. This is, of course, the same result that we obtained in Section
3.3, when we first looked at contraceptive use by desire for more children.
Note also that this model does not fit the data, as its own deviance is highly
significant.
The fact that the effect of desire for more children has a chi-squared
statistic of 91.7 with only one d.f., whereas age gives 79.2 on three d.f.,
suggests that desire for more children has a stronger effect on contraceptive
use than age does. Note, however, that the comparison is informal; the
models are not nested, and therefore we cannot construct a significance test
from their deviances.
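The comparisons in this subsection all reduce to differences between deviances referred to a chi-squared distribution. The sketch below takes the deviances and degrees of freedom from Table 3.8 and computes upper-tail probabilities with a closed-form expression that holds for integer degrees of freedom. The function `chi2_sf` is a hypothetical helper written for this example, not a library routine.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer d.f."""
    half = x / 2
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd d.f.: normal tail plus a finite sum of (x/2)^(k - 1/2) / Gamma(k + 1/2).
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

# (deviance, d.f.) from Table 3.8
models = {"null": (145.7, 7), "age": (66.5, 4), "desire": (54.0, 6), "additive": (16.8, 3)}

def compare(smaller, larger):
    """Likelihood ratio test of a model nested in a larger one."""
    (d0, df0), (d1, df1) = models[smaller], models[larger]
    statistic, df = d0 - d1, df0 - df1
    return statistic, df, chi2_sf(statistic, df)

gross_age = compare("null", "age")        # 79.2 on three d.f.
gross_desire = compare("null", "desire")  # 91.7 on one d.f.
fit_of_age_model = chi2_sf(*models["age"])
print(gross_age, gross_desire, fit_of_age_model)
```

The same helper can be applied to the deviance of any model in the table to judge its goodness of fit.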
3.5.3
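The additive model has a deviance of 16.8 on three d.f. (see Table 3.8). Since this
value exceeds 7.81, the five percent critical value of the chi-squared
distribution for three d.f., we conclude that the additive model fails to
fit the data.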
Table 3.9 shows parameter estimates for the additive model. We show
briefly how they would be interpreted, although we have evidence that the
additive model does not fit the data.
Table 3.9: Parameter Estimates for Additive Logit Model of
Contraceptive Use by Age (Grouped) and Desire for Children

Parameter        Symbol        Estimate    Std. Error    z-ratio
Constant         $\eta$        -1.694      0.135         -12.53
Age    25–29     $\alpha_2$    0.368       0.175         2.10
       30–39     $\alpha_3$    0.808       0.160         5.06
       40–49     $\alpha_4$    1.023       0.204         5.01
Desire No        $\beta_2$     0.824       0.117         7.04
The estimates of the $\alpha_i$'s show a monotonic effect of age on contraceptive use. Although there is evidence that this effect may vary depending on
whether women desire more children, on average the odds of using contraception among women age 40 or higher are nearly three times the corresponding
odds among women under age 25 in the same category of desire for another
child.
Similarly, the estimate of $\beta_2$ shows a strong effect of wanting no more
children. Although there is evidence that this effect may depend on the
woman's age, on average the odds of using contraception among women who
desire no more children are more than double the corresponding odds among
women in the same age group who desire another child.
3.5.4
We now consider a model which includes an interaction of age and desire for
more children, denoted AD in Table 3.8. The model is
$$\text{logit}(\pi_{ij}) = \eta + \alpha_i + \beta_j + (\alpha\beta)_{ij},$$
where $\eta$ is a constant, the $\alpha_i$ and $\beta_j$ are the main effects of age and desire,
and $(\alpha\beta)_{ij}$ is the interaction effect. To avoid redundancies we follow the
reference cell method and set to zero all parameters involving the first cell,
so that $\alpha_1 = \beta_1 = 0$, $(\alpha\beta)_{1j} = 0$ for all $j$ and $(\alpha\beta)_{i1} = 0$ for all $i$. The
remaining parameters may be interpreted as follows:
$\eta$ is the logit of the reference group: women under age 25 who desire more
children.
$\alpha_i$ for $i = 2, 3, 4$ are the effects of the age groups 25–29, 30–39 and 40–49,
compared to ages under 25, for women who want another child.
$\beta_2$ is the effect of desiring no more children, compared to wanting another
child, for women under age 25.
$(\alpha\beta)_{i2}$ for $i = 2, 3, 4$ is the additional effect of desiring no more children,
compared to wanting another child, for women in age group $i$ rather
than under age 25. (This parameter is also the additional effect of age
group $i$, compared to ages under 25, for women who desire no more
children rather than those who want more.)
One way to simplify the presentation of results involving interactions is
to combine the interaction terms with one of the main effects, and present
them as effects of one factor within categories or levels of the other. In our
example, we can combine the interactions $(\alpha\beta)_{i2}$ with the main effects of
desire $\beta_2$, so that
$\beta_2 + (\alpha\beta)_{i2}$ is the effect of desiring no more children, compared to wanting
another child, for women in age group $i$.
Of course, we could also combine the interactions with the main effects
of age, and speak of age effects which are specific to women in each category
of desire for more children. The two formulations are statistically equivalent,
but the one chosen here seems demographically more sensible.
To obtain estimates based on this parameterization of the model we have
to define the columns of the model matrix as follows. Let ai be a dummy
variable representing age group i, for i = 2, 3, 4, and let d take the value one
for women who want no more children and zero otherwise. Then the model
matrix X should have a column of ones to represent the constant or reference
cell, the age dummies $a_2$, $a_3$ and $a_4$ to represent the age effects for women
in the reference cell, and then the dummy $d$ and the products $a_2 d$, $a_3 d$ and
$a_4 d$, to represent the effect of wanting no more children at ages <25, 25–29,
30–39 and 40–49, respectively. The resulting estimates and standard errors
are shown in Table 3.10.
The results indicate that contraceptive use among women who desire
more children varies little by age, increasing up to ages 30–39 and then declining somewhat. On the other hand, the effect of wanting no more children
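Table 3.10: Parameter Estimates for Logit Model of Contraceptive Use
With an Interaction Between Age (Grouped) and Desire for More Children

Parameter                   Symbol                             Estimate    Std. Error    z-ratio
Constant                    $\eta$                             -1.519      0.145         -10.481
Age              25–29      $\alpha_2$                         0.368       0.201         1.832
                 30–39      $\alpha_3$                         0.451       0.195         2.311
                 40–49      $\alpha_4$                         0.397       0.340         1.168
Desires no more  <25        $\beta_2$                          0.064       0.330         0.194
  at age         25–29      $\beta_2 + (\alpha\beta)_{22}$     0.331       0.241         1.372
                 30–39      $\beta_2 + (\alpha\beta)_{32}$     1.154       0.174         6.640
                 40–49      $\beta_2 + (\alpha\beta)_{42}$     1.431       0.353         4.057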
increases dramatically with age, from no effect among women below age 25
to an odds ratio of 4.18 at ages 4049. Thus, in the older cohort the odds
of using contraception among women who want no more children are four
times the corresponding odds among women who desire more children. The
results can also be summarized by noting that contraceptive use for spacing
(i.e. among women who desire more children) does not vary much by age,
but contraceptive use for limiting fertility (i.e among women who want no
more children) increases sharply with age.
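Because the model with an age by desire interaction is saturated for the eight age by desire groups, the entries of Table 3.10 can be checked from empirical logits. The following is a minimal sketch, assuming the counts obtained by collapsing Table 3.1 over education; the names `counts`, `logit` and `se_logit` are illustrative and not part of the original notes.

```python
import math

# (users, non-users) by age group and desire for more children,
# obtained by adding the two education rows of Table 3.1 (an assumption of this sketch)
counts = {
    "<25":   {"more": (58, 265), "no_more": (14, 60)},
    "25-29": {"more": (68, 215), "no_more": (37, 84)},
    "30-39": {"more": (79, 230), "no_more": (158, 145)},
    "40-49": {"more": (14, 43),  "no_more": (79, 58)},
}

def logit(users, nonusers):
    """Empirical logit: log of the odds of using contraception."""
    return math.log(users / nonusers)

def se_logit(users, nonusers):
    """Large-sample standard error of an empirical logit."""
    return math.sqrt(1 / users + 1 / nonusers)

# Constant: logit in the reference cell (age <25, wants more children)
constant = logit(*counts["<25"]["more"])

# Age effects among women who want more children
age_effects = {age: logit(*c["more"]) - constant
               for age, c in counts.items() if age != "<25"}

# Effect of wanting no more children within each age group
desire_effects = {age: logit(*c["no_more"]) - logit(*c["more"])
                  for age, c in counts.items()}
desire_se = {age: math.hypot(se_logit(*c["no_more"]), se_logit(*c["more"]))
             for age, c in counts.items()}

print(f"constant {constant:.3f}")
for age, effect in age_effects.items():
    print(f"age {age}: {effect:.3f}")
for age, effect in desire_effects.items():
    print(f"no more at {age}: {effect:.3f} (s.e. {desire_se[age]:.3f}), "
          f"odds ratio {math.exp(effect):.2f}")
```

The differences of logits reproduce the estimates only because the model fits the eight groups exactly; for non-saturated models an iterative fit would be needed.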
3.5.5
Notation   logit($\pi_{ij}$)              Deviance   d.f.
X          $\alpha + \beta x_i$              68.88      6
X + D      $\alpha_j + \beta x_i$            18.99      5
XD         $\alpha_j + \beta_j x_i$           9.14      4
more children. The reduction in deviance of 49.9 on one d.f. indicates that
desire for no more children has a strong effect on contraceptive use after
controlling for a linear effect of age. However, the attained deviance of 19.0
on five d.f. is significant, indicating that the assumption of two parallel lines
is not consistent with the data.
The last model in the table, denoted XD, includes an interaction between
the linear effect of age and desire, and thus allows the effect of desire for
more children to vary by age. This variation is modelled by allowing each
category of desire for more children to have its own slope in addition to its
own constant, and results in two regression lines. The reduction in deviance
of 9.9 on one d.f. is a test of the hypothesis of parallelism or common slope
$H_0: \beta_1 = \beta_2$, which is rejected with a P-value of 0.002. The model deviance
of 9.14 on four d.f. is just below the five percent critical value of the chi-squared distribution with four d.f., which is 9.49. Thus, we have no evidence
against the assumption of two straight lines.
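The P-values quoted for these comparisons can be checked without statistical software. The sketch below uses the closed-form upper tail of the chi-squared distribution for integer degrees of freedom; the helper names `chi2_sf` and `lr_test` are hypothetical, and the deviances are those listed for models X, X + D and XD.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer d.f."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # even d.f.: exp(-h) * sum_{j < df/2} h^j / j!
        term = math.exp(-h)
        total = term
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    # odd d.f.: erfc(sqrt(h)) plus a finite sum of half-integer power terms
    total = math.erfc(math.sqrt(h))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-h)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return total

# (deviance, d.f.) for the three models with a linear effect of age
deviance = {"X": (68.88, 6), "X+D": (18.99, 5), "XD": (9.14, 4)}

def lr_test(reduced, full):
    """Likelihood ratio test comparing two nested models by their deviances."""
    d0, f0 = deviance[reduced]
    d1, f1 = deviance[full]
    chi2, df = d0 - d1, f0 - f1
    return chi2, df, chi2_sf(chi2, df)

for pair in [("X", "X+D"), ("X+D", "XD")]:
    chi2, df, p = lr_test(*pair)
    print(f"{pair[0]} vs {pair[1]}: chi2 = {chi2:.2f} on {df} d.f., P = {p:.4f}")

for name, (dev, df) in deviance.items():
    print(f"goodness of fit of {name}: P = {chi2_sf(dev, df):.4f}")
```

A statistical library would normally supply the tail probability; the explicit formula is used here only to keep the example self-contained.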
Before we present parameter estimates we need to discuss briefly the
choice of parameterization. Direct application of the reference cell method
leads us to use four variables: a dummy variable always equal to one, a
variable x with the mid-points of the age groups, a dummy variable d which
takes the value one for women who want no more children, and a variable dx
equal to the product of this dummy by the mid-points of the age groups. This
choice leads to parameters representing the constant and slope for women
who want another child, and parameters representing the difference in constants and slopes for women who want no more children.
An alternative is to simply report the constants and slopes for the two
groups defined by desire for more children. This parameterization can be
easily obtained by omitting the constant and using the following four variables: $d$ and $1 - d$ to represent the two constants and $dx$ and $(1 - d)x$ to
represent the two slopes. One could, of course, obtain the constant and slope
for women who want no more children from the previous parameterization
simply by adding the main effect and the interaction. The simplest way to
obtain the standard errors, however, is to change parameterization.
In both cases the constants represent effects at age zero and are not
very meaningful. To obtain parameters that are more directly interpretable,
we can center age around the sample mean, which is 30.6 years. Table
3.12 shows parameter estimates obtained under the two parameterizations
discussed above, using the mid-points of the age groups minus the mean.
Table 3.12: Parameter Estimates for Model of Contraceptive Use With an
Interaction Between Age (Linear) and Desire for More Children
Desire       Age        Symbol                   Estimate  Std. Error  z-ratio
More         Constant   $\alpha_1$                -1.1944      0.0786   -15.20
             Slope      $\beta_1$                  0.0218      0.0104     2.11
No More      Constant   $\alpha_2$                -0.4369      0.0931    -4.69
             Slope      $\beta_2$                  0.0698      0.0114     6.10
Difference   Constant   $\alpha_2 - \alpha_1$      0.7575      0.1218     6.22
             Slope      $\beta_2 - \beta_1$        0.0480      0.0154     3.11
Thus, we find that contraceptive use increases with age, but at a faster
rate among women who want no more children. The estimated slopes correspond to increases in the odds of two and seven percent per year of age for
women who want and do not want more children, respectively. The difference of the slopes is significant by a likelihood ratio test or by Wald's test,
with a z-ratio of 3.11.
Similarly, the effect of wanting no more children increases with age. The
odds ratio around age 30.6, which we obtain by exponentiating the difference in constants, is 2.13, so not wanting more children at this age is associated with a doubling of the odds of using contraception. The difference in
slopes of 0.048 indicates that this differential increases five percent per year
of age.
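The percentages and odds ratios quoted above follow from exponentiating the entries of Table 3.12. A short sketch of the arithmetic, with illustrative function names that do not come from the notes:

```python
import math

# Estimates from Table 3.12 (age centred at the sample mean of 30.6 years)
slope_more = 0.0218       # slope for women who want more children
slope_no_more = 0.0698    # slope for women who want no more children
diff_constant = 0.7575    # difference in constants (no more minus more)
diff_slope = 0.0480       # difference in slopes

def pct_change_in_odds(coef):
    """Percent change in the odds associated with a one-unit increase."""
    return 100 * (math.exp(coef) - 1)

def odds_ratio_no_more_vs_more(age, mean_age=30.6):
    """Odds ratio of use for women wanting no more children at a given age."""
    return math.exp(diff_constant + diff_slope * (age - mean_age))

print(f"wants more:    {pct_change_in_odds(slope_more):.1f}% per year")
print(f"wants no more: {pct_change_in_odds(slope_no_more):.1f}% per year")
print(f"odds ratio at age 30.6: {odds_ratio_no_more_vs_more(30.6):.2f}")
print(f"growth of the differential: {pct_change_in_odds(diff_slope):.1f}% per year")
```

The same function evaluated at other ages shows how the differential widens, although ages outside the observed range of the data would be an extrapolation.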
The parameter estimates in Table 3.12 may be used to produce fitted
logits for each age group and category of desire for more children. In turn,
these can be compared with the empirical logits for the original eight groups,
to obtain a visual impression of the nature of the relationships studied and
the quality of the fit. The comparison appears in Figure 3.3, with the solid
line representing the linear age effects (the dotted lines are discussed below).
The graph shows clearly how the effect of wanting no more children increases
with age (or, alternatively, how age has much stronger effects among limiters
than among spacers).
[Plot: logit (vertical axis) against age (horizontal axis)]
Figure 3.3: Observed and Fitted Logits for Models of Contraceptive Use
With Effects of Age (Linear and Quadratic), Desire for More Children
and a Linear Age by Desire Interaction.
The graph also shows that the assumption of linearity of age effects, while
providing a reasonably parsimonious description of the data, is somewhat
suspect, particularly at higher ages. We can improve the fit by adding higher-order terms on age. In particular,
• Introducing a quadratic term on age yields an excellent fit, with a
  deviance of 2.34 on three d.f. This model consists of two parabolas,
  one for each category of desire for more children, but with the same
  curvature.
• Adding a quadratic age by desire interaction further reduces the deviance to 1.74 on two d.f. This model allows for two separate parabolas
  tracing contraceptive use by age, one for each category of desire.
Although the linear model passes the goodness of fit test, the fact that we can
reduce the deviance by 6.79 at the expense of one d.f. indicates significant
curvature. The dotted line in Figure 3.3 shows the intermediate model,
where the curvature by age is the same for the two groups. While the fit is
much better, the overall substantive conclusions do not change.
3.6
Let us consider a full analysis of the contraceptive use data in Table 3.1,
including all three predictors: age, education and desire for more children.
We use three subscripts to reflect the structure of the data, so $\pi_{ijk}$ is the
probability of using contraception in the (i, j, k)-th group, where i = 1, 2, 3, 4
indexes the age groups, j = 1, 2 the levels of education and k = 1, 2 the
categories of desire for more children.
3.6.1
There are 19 basic models of interest for these data, which are listed for
completeness in Table 3.13. Not all of these models would be of interest in
any given analysis. The table shows the model in abbreviated notation, the
formula for the linear predictor, the deviance and its degrees of freedom.
Note first that the null model does not fit the data. The assumption of
a common probability of using contraception for all 16 groups of women is
clearly untenable.
Next in the table we find the three possible one-factor models. Comparison of these models with the null model provides evidence of significant
gross effects of age and desire for more children, but not of education. The
likelihood ratio chi-squared tests are 91.7 on one d.f. for desire, 79.2 on three
d.f. for age, and 0.7 on one d.f. for education.
Proceeding down the table we find the six possible two-factor models,
starting with the additive ones. Here we find evidence of significant net
effects of age and desire for more children after controlling for one other
factor. For example the test for an effect of desire net of age is a chi-squared
of 49.7 on one d.f., obtained by comparing the additive model A + D on
age and desire with the one-factor model A with age alone. Education has a
significant effect net of age, but not net of desire for more children. For
example the test for the net effect of education controlling for age is 6.2 on
one d.f., and follows from the comparison of the A + E model with A. None
of the additive models fits the data, but the closest one to a reasonable fit
is A + D.
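The gross and net tests quoted in the last two paragraphs are differences of deviances between nested models. A minimal sketch, using the deviances reported for the null, one-factor and additive two-factor models in Table 3.13; the dictionary keys and the helper `lr_chi2` are naming choices made for this example only.

```python
# (deviance, d.f.) for selected models, as reported in Table 3.13
models = {
    "Null": (165.77, 15),
    "A":    (86.58, 12),
    "E":    (165.07, 14),
    "D":    (74.10, 14),
    "A+E":  (80.42, 11),
    "A+D":  (36.89, 11),
    "E+D":  (73.87, 13),
}

def lr_chi2(reduced, full):
    """Likelihood ratio chi-squared and d.f. for two nested models."""
    dev0, df0 = models[reduced]
    dev1, df1 = models[full]
    return round(dev0 - dev1, 2), df0 - df1

# Gross effects: each one-factor model against the null model
for factor in ("D", "A", "E"):
    print("gross", factor, lr_chi2("Null", factor))

# Net effects: adding a second factor to a one-factor model
print("D net of A:", lr_chi2("A", "A+D"))
print("E net of A:", lr_chi2("A", "A+E"))
print("E net of D:", lr_chi2("D", "E+D"))
```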
Next come the models involving interactions between two factors. We
use the notation ED to denote the model with the main effects of E and D
as well as the E × D interaction. Comparing each of these models with the
corresponding additive model on the same two factors we obtain a test of the
interaction effect. For example, comparing the model ED with the additive
model E + D we can test whether the effect of desire for more children varies
with education.
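Table 3.13: Deviance Table for Logit Models of Contraceptive Use
by Age, Education and Desire for More Children

Model            logit($\pi_{ijk}$)                                                                         Dev.    d.f.
Null             $\eta$                                                                                     165.77   15

One Factor
Age              $\eta + \alpha_i$                                                                           86.58   12
Education        $\eta + \beta_j$                                                                           165.07   14
Desire           $\eta + \gamma_k$                                                                           74.10   14

Two Factors
A + E            $\eta + \alpha_i + \beta_j$                                                                 80.42   11
A + D            $\eta + \alpha_i + \gamma_k$                                                                36.89   11
E + D            $\eta + \beta_j + \gamma_k$                                                                 73.87   13
AE               $\eta + \alpha_i + \beta_j + (\alpha\beta)_{ij}$                                            73.03    8
AD               $\eta + \alpha_i + \gamma_k + (\alpha\gamma)_{ik}$                                          20.10    8
ED               $\eta + \beta_j + \gamma_k + (\beta\gamma)_{jk}$                                            67.64   12

Three Factors
A + E + D        $\eta + \alpha_i + \beta_j + \gamma_k$                                                      29.92   10
AE + D           $\eta + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij}$                                 23.15    7
AD + E           $\eta + \alpha_i + \beta_j + \gamma_k + (\alpha\gamma)_{ik}$                                12.63    7
A + ED           $\eta + \alpha_i + \beta_j + \gamma_k + (\beta\gamma)_{jk}$                                 23.02    9
AE + AD          $\eta + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik}$            5.80    4
AE + ED          $\eta + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\beta\gamma)_{jk}$            13.76    6
AD + ED          $\eta + \alpha_i + \beta_j + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$           10.82    6
AE + AD + ED     $\eta + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$   2.44    3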
3.6.2
The first entry is the additive model A+E +D, with a deviance of 29.9 on ten
d.f. This value represents a significant improvement over any of the additive
models on two factors. Thus, we have evidence that there are significant
net effects of age, education and desire for more children, considering each
factor after controlling the other two. For example the test for a net effect
of education controlling the other two variables compares the three-factor
additive model A + E + D with the model without education, namely A + D.
The difference of 6.97 on one d.f. is significant, with a P-value of 0.008.
However, the three-factor additive model does not fit the data.
The next step is to add one interaction between two of the factors. For
example the model AE + D includes the main effects of A, E and D and
the A × E interaction. The interactions of desire for more children with
age and with education produce significant gains over the additive model
($\chi^2 = 17.3$ on three d.f. and $\chi^2 = 6.90$ on one d.f., respectively), whereas
the interaction between age and education is not significant ($\chi^2 = 6.77$ with
three d.f.). These tests for interactions differ from those based on two-factor
models in that they take into account the third factor. The best of these
models is clearly the one with an interaction between age and desire for more
children, AD + E. This is also the first model in our list that actually passes
the goodness of fit test, with a deviance of 12.6 on seven d.f.
Does this mean that we can stop our search for an adequate model?
Unfortunately, it does not. The goodness of fit test is a joint test for all
terms omitted in the model. In this case we are testing for the AE, ED
and AED interactions simultaneously, a total of seven parameters. This
type of omnibus test lacks power against specific alternatives. It is possible
that one of the omitted terms (or perhaps some particular contrast) would
be significant by itself, but its effect may not stand out in the aggregate.
At issue is whether the remaining deviance of 12.6 is spread out uniformly
over the remaining d.f. or is concentrated in a few d.f. If you wanted to
be absolutely sure of not missing anything you might want to aim for a
deviance below 3.84, which is the five percent critical value for one d.f., but
this strategy would lead to over-fitting if followed blindly.
Let us consider the models involving two interactions between two factors, of which there are three. Since the AD interaction seemed important
we restrict attention to models that include this term, so we start from
AD + E, the best model so far. Adding the age by education interaction
AE to this model reduces the deviance by 6.83 at the expense of three d.f.
A formal test concludes that this interaction is not significant. If we add
instead the education by desire interaction ED we reduce the deviance by
only 1.81 at the expense of one d.f. This interaction is clearly not significant.
A model-building strategy based on forward selection of variables would stop
here and choose AD + E as the best model on grounds of parsimony and
goodness of fit.
An alternative approach is to start with the saturated model and impose
progressive simplification. Deleting the three-factor interaction yields the
model AE + AD + ED with three two-factor interactions, which fits the
data rather well, with a deviance of just 2.44 on three d.f. If we were to
delete the AD interaction the deviance would rise by 11.32 on three d.f.,
a significant loss. Similarly, removing the AE interaction would incur a
significant loss of 8.38 on 3 d.f. We can, however, drop the ED interaction
with a non-significant increase in deviance of 3.36 on one d.f. At this point
we can also eliminate the AE interaction, which is no longer significant, with
a further loss of 6.83 on three d.f. Thus, a backward elimination strategy
ends up choosing the same model as forward selection.
Although you may find these results reassuring, there is a fact that both
approaches overlook: the AE and ED interactions are jointly significant!
The change in deviance as we move from AD + E to the model with three two-factor interactions is 10.2 on four d.f., and exceeds (although not by much)
the five percent critical value of 9.5. This result indicates that we need to
consider the more complicated model with all three two-factor interactions.
Before we do that, however, we need to discuss parameter estimates for
selected models.
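The contrast between the one-at-a-time tests and the joint test can be made explicit with a few lines of arithmetic. This is a sketch only: the deviances come from Table 3.13, the critical values 3.84 and 9.49 are those quoted in the text, and 7.81 is the standard five percent critical value for three d.f., which is an addition of this example.

```python
# Five percent critical values of the chi-squared distribution by d.f.
CRITICAL_05 = {1: 3.84, 3: 7.81, 4: 9.49}

# (deviance, d.f.) from Table 3.13
deviance = {
    "AD+E":     (12.63, 7),
    "AE+AD":    (5.80, 4),
    "AD+ED":    (10.82, 6),
    "AE+AD+ED": (2.44, 3),
}

def gain(base, bigger):
    """Reduction in deviance from base to bigger, its d.f., and whether it is significant."""
    chi2 = deviance[base][0] - deviance[bigger][0]
    df = deviance[base][1] - deviance[bigger][1]
    return round(chi2, 2), df, chi2 > CRITICAL_05[df]

# Forward step from AD + E: add one interaction at a time
print("add AE:", gain("AD+E", "AE+AD"))
print("add ED:", gain("AD+E", "AD+ED"))

# Joint test: add both interactions at once
print("add AE and ED:", gain("AD+E", "AE+AD+ED"))
```

Neither single addition reaches its critical value, while the joint test does, which is the point made in the preceding paragraph.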
3.6.3
Table 3.14: Gross and Net Effects of Age, Education and Desire
for More Children on Current Use of Contraception

Variable and              Gross      Net
category                  effect     effect
Constant                    –        -1.966
Age            <25          –          –
               25–29       0.461      0.389
               30–39       1.048      0.909
               40–49       1.425      1.189
Education      Lower        –          –
               Upper      -0.093      0.325
Desires More   Yes          –          –
               No          1.049      0.833
education rather than lower primary or less appears to reduce the odds of
using contraception by almost 10%.
The net or adjusted effects are based on the three-factor additive model
A + E + D. This model assumes that the effect of each factor is the same for
all categories of the others. We know, however, that this is not the case,
particularly with desire for more children, which has an effect that varies by
age, so we have to interpret the results carefully. The net effect of desire
for more children shown in Table 3.14 represents an average effect across all
age groups and may not be representative of the effect at any particular age.
Having said that, we note that desire for no more children has an important
effect net of age and education: on the average, it is associated with an
increase in the odds of using contraception of 130%.
The result for education is particularly interesting. Having upper primary or higher education is associated with an increase in the odds of using
contraception of 38%, compared to having lower primary or less, after we
control for age and desire for more children. The gross effect was close to
zero. To understand this result bear in mind that contraceptive use in Fiji
occurs mostly among older women who want no more children. Education
has no effect when considered by itself because in Fiji more educated women
are likely to be younger than less educated women, and thus at a stage of
their lives when they are less likely to have reached their desired family size,
even though they may want fewer children. Once we adjust for their age,
calculating the net effect, we obtain the expected association. In this example age is said to act as a suppressor variable, masking the association
between education and contraceptive use.
We could easily add columns to Table 3.14 to trace the effects of one
factor after controlling for one or both of the other factors. We could, for
example, examine the effect of education adjusted for age, the effect adjusted
for desire for more children, and finally the effect adjusted for both factors.
This type of analysis can yield useful insights into the confounding influences
of other variables.
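The gross effects in Table 3.14 are differences of logits computed from the one-way margins of Table 3.1, so they can be reproduced by collapsing the table over the other two factors. The sketch below assumes the first column of counts in Table 3.1 refers to women not using contraception and the second to users; the function `gross_effects` is a hypothetical helper.

```python
import math

# Table 3.1 rows: (age, education, desires more children?, not using, using)
rows = [
    ("<25",   "Lower", "Yes",  53,  6), ("<25",   "Lower", "No", 10,  4),
    ("<25",   "Upper", "Yes", 212, 52), ("<25",   "Upper", "No", 50, 10),
    ("25-29", "Lower", "Yes",  60, 14), ("25-29", "Lower", "No", 19, 10),
    ("25-29", "Upper", "Yes", 155, 54), ("25-29", "Upper", "No", 65, 27),
    ("30-39", "Lower", "Yes", 112, 33), ("30-39", "Lower", "No", 77, 80),
    ("30-39", "Upper", "Yes", 118, 46), ("30-39", "Upper", "No", 68, 78),
    ("40-49", "Lower", "Yes",  35,  6), ("40-49", "Lower", "No", 46, 48),
    ("40-49", "Upper", "Yes",   8,  8), ("40-49", "Upper", "No", 12, 31),
]

def gross_effects(column):
    """Logit of use in each category minus the logit in the first category listed."""
    totals = {}
    for row in rows:
        not_using, using = totals.get(row[column], (0, 0))
        totals[row[column]] = (not_using + row[3], using + row[4])
    logits = {k: math.log(using / not_using)
              for k, (not_using, using) in totals.items()}
    reference = logits[rows[0][column]]
    return {k: v - reference for k, v in logits.items()}

age = gross_effects(0)        # reference: age <25
education = gross_effects(1)  # reference: lower education
desire = gross_effects(2)     # reference: wants more children ("Yes")

for name, effects in [("age", age), ("education", education), ("desire", desire)]:
    print(name, {k: round(v, 3) for k, v in effects.items()})
```

Net effects, by contrast, require fitting the additive model and cannot be obtained from margins alone.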
3.6.4
Let us now examine parameter estimates for the model with an age by desire
for more children interaction AD + E, where
$$\text{logit}(\pi_{ijk}) = \eta + \alpha_i + \beta_j + \gamma_k + (\alpha\gamma)_{ik}.$$
The parameter estimates depend on the restrictions used in estimation. We
use the reference cell method, so that $\alpha_1 = \beta_1 = \gamma_1 = 0$, and $(\alpha\gamma)_{ik} = 0$
when either $i = 1$ or $k = 1$.
In this model $\eta$ is the logit of the probability of using contraception
in the reference cell, that is, for women under 25 with lower primary or
less education who want another child. On the other hand $\beta_2$ is the effect
of upper primary or higher education, compared to lower primary or less,
for women in any age group or category of desire for another child. The
presence of an interaction makes interpretation of the estimates for age and
desire somewhat more involved:
• $\alpha_i$ represents the effect of age group $i$, compared to age <25, for women
  who want more children.
• $\gamma_2$ represents the effect of wanting no more children, compared to desiring
  more, for women under age 25.
• $(\alpha\gamma)_{i2}$, the interaction term, can be interpreted as the additional effect of
  wanting no more children among women in age group $i$, compared to
  women under age 25.
It is possible to simplify slightly the presentation of the results by combining the interactions with some of the main effects. In the present example, it
is convenient to present the estimates of $\alpha_i$ as the age effects for women who
want another child, and to present $\gamma_2 + (\alpha\gamma)_{i2}$ as the effect of not wanting
another child for women in age group $i$.
Calculation of the necessary dummy variables proceeds exactly as in
Section 3.5. This strategy leads to the parameter estimates in Table 3.15.

Table 3.15: Parameter estimates for the model AD + E

Variable     Category   Symbol                               Estimate  Std. Err  z-ratio
Constant                $\eta$                                 -1.803     0.180   -10.01
Age          25–29      $\alpha_2$                              0.395     0.201     1.96
             30–39      $\alpha_3$                              0.547     0.198     2.76
             40–49      $\alpha_4$                              0.580     0.347     1.67
Education    Upper      $\beta_2$                               0.341     0.126     2.71
Desires      <25        $\gamma_2$                              0.066     0.331     0.20
no more      25–29      $\gamma_2 + (\alpha\gamma)_{22}$        0.325     0.242     1.35
at age       30–39      $\gamma_2 + (\alpha\gamma)_{32}$        1.179     0.175     6.74
             40–49      $\gamma_2 + (\alpha\gamma)_{42}$        1.428     0.354     4.04
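As a check on the parameterization, the estimates in Table 3.15 can be turned into fitted logits for the 16 groups of Table 3.1 and the model deviance recomputed. The sketch below assumes the constant is negative and that the first count in each row of Table 3.1 refers to non-users; because the estimates are rounded to three decimals, the result should agree with the reported deviance of 12.63 only approximately.

```python
import math

# Estimates from Table 3.15 (reference cell: age <25, lower education, wants more)
eta = -1.803
age_effect = {"<25": 0.0, "25-29": 0.395, "30-39": 0.547, "40-49": 0.580}
upper_effect = 0.341
no_more_effect = {"<25": 0.066, "25-29": 0.325, "30-39": 1.179, "40-49": 1.428}

# Table 3.1 rows: (age, upper education?, wants no more?, non-users, users)
cells = [
    ("<25",   0, 0,  53,  6), ("<25",   0, 1, 10,  4),
    ("<25",   1, 0, 212, 52), ("<25",   1, 1, 50, 10),
    ("25-29", 0, 0,  60, 14), ("25-29", 0, 1, 19, 10),
    ("25-29", 1, 0, 155, 54), ("25-29", 1, 1, 65, 27),
    ("30-39", 0, 0, 112, 33), ("30-39", 0, 1, 77, 80),
    ("30-39", 1, 0, 118, 46), ("30-39", 1, 1, 68, 78),
    ("40-49", 0, 0,  35,  6), ("40-49", 0, 1, 46, 48),
    ("40-49", 1, 0,   8,  8), ("40-49", 1, 1, 12, 31),
]

def fitted_logit(age, upper, no_more):
    """Linear predictor: constant + age effect + education effect + age-specific desire effect."""
    return eta + age_effect[age] + upper_effect * upper + no_more_effect[age] * no_more

def model_deviance():
    """Binomial deviance comparing observed counts with fitted counts."""
    total = 0.0
    for age, upper, no_more, non_users, users in cells:
        n = users + non_users
        p = 1.0 / (1.0 + math.exp(-fitted_logit(age, upper, no_more)))
        mu = n * p
        total += 2.0 * (users * math.log(users / mu)
                        + non_users * math.log(non_users / (n - mu)))
    return total

print(f"deviance from rounded estimates: {model_deviance():.2f}")
print(f"odds ratio for no more children at 40-49: {math.exp(no_more_effect['40-49']):.2f}")
```

The deviance formula as written assumes no cell has zero users or zero non-users, which holds for these data.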
To aid in interpretation as well as model criticism, Figure 3.4 plots observed logits based on the original data in Table 3.1, and fitted logits based
on the model with an age by desire interaction.
The graph shows four curves tracing contraceptive use by age for groups
defined by education and desire for more children. The curves are labelled
using L and U for lower and upper education, and Y and N for desire for
more children. The lowest curve labelled LY corresponds to women with
lower primary education or less who want more children, and shows a slight
increase in contraceptive use up to age 30–39 and then a small decline. The
next curve labelled UY is for women with upper primary education or more
who also want more children. This curve is parallel to the previous one
because the effect of education is additive on age. The constant difference
between these two curves corresponds to a 41% increase in the odds as
we move from lower to upper primary education. The third curve, labelled
LN, is for women with lower primary education or less who want no more
children. The distance between this curve and the first one represents the
effect of wanting no more children at different ages. This effect increases
sharply with age, reaching an odds ratio of four by age 40–49. The fourth
curve, labelled UN, is for women with upper primary education or more who
want no more children. The distance between this curve and the previous
one is the effect of education, which is the same whether women want more
children or not, and is also the same at every age.
[Figure 3.4: observed and fitted logits by age for the groups LY, UY, LN and UN; vertical axis logit, horizontal axis age]
3.6.5
How can we improve the model of the last section? The most obvious solution
is to move to the model with all three two-factor interactions, AE + AD +
ED, which has a deviance of 2.44 on three d.f. and therefore fits the data
[Plot: logit (vertical axis) against age (horizontal axis), with curves labelled LY, UY, LN and UN]
between the UY and UN curves). On the other hand, the effect of education
is clearly more pronounced at ages 40–49 than at earlier ages, and also seems
slightly larger for women who want more children than for those who do not
(look at the distance between the LY and UY curves, and between the LN
and UN curves).
One can use this knowledge to propose improved models that fit the data
without having to use all three two-factor interactions. One approach would
note that all interactions with age involve contrasts between ages 40–49 and
the other age groups, so one could collapse age into only two categories for
purposes of modelling the interactions. A simplified version of this approach
is to start from the model AD + E and add one d.f. to model the larger
educational effect for ages 40–49. This can be done by adding a dummy
variable that takes the value one for women aged 40–49 who have upper
primary or more education. The resulting model has a deviance of 6.12 on
six d.f., indicating a good fit. Comparing this value with the deviance of 12.6
on seven d.f. for the AD + E model, we see that we reduced the deviance
by 6.5 at the expense of a single d.f. The model AD + AE includes all three
d.f. for the age by education interaction, and has a deviance of 5.8 on four
d.f. Thus, the total contribution of the AE interaction is 6.8 on three d.f.
Our one-d.f. improvement has captured roughly 95% of this interaction.
An alternative approach is to model the effects of education and desire
for no more children as smooth functions of age. The logit of the probability
of using contraception is very close to a linear function of age for women with
upper primary education who want no more children, who could serve as a
new reference cell. The effect of wanting more children could be modelled as
a linear function of age, and the effect of education could be modelled as a
quadratic function of age. Let $L_{ijk}$ take the value one for lower primary or
less education and zero otherwise, and let $M_{ijk}$ be a dummy variable that
takes the value one for women who want more children and zero otherwise.
Then the proposed model can be written as
$$\text{logit}(\pi_{ijk}) = \alpha + \beta x_{ijk} + (\alpha_E + \beta_E x_{ijk} + \gamma_E x_{ijk}^2) L_{ijk} + (\alpha_D + \beta_D x_{ijk}) M_{ijk}.$$
Fitting this model, which requires only seven parameters, gives a deviance
of 7.68 on nine d.f. The only weakness of the model is that it assumes equal
effects of education on use for limiting and use for spacing, but these effects
are not well-determined. Further exploration of these models is left as an
exercise.
3.7
All the models considered so far use the logit transformation of the probabilities, but other choices are possible. In fact, any transformation that
maps probabilities into the real line could be used to produce a generalized
linear model, as long as the transformation is one-to-one, continuous and
differentiable.
In particular, suppose $F(\cdot)$ is the cumulative distribution function (c.d.f.)
of a random variable defined on the real line, and write
$$\pi_i = F(\eta_i),$$
for $-\infty < \eta_i < \infty$. Then we could use the inverse transformation
$$\eta_i = F^{-1}(\pi_i),$$
for $0 < \pi_i < 1$ as the link function.
Popular choices of c.d.f.s in this context are the normal, logistic and extreme value distributions. In this section we motivate this general approach
by introducing models for binary data in terms of latent variables.
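To make the idea of a link as an inverse c.d.f. concrete, here is a small sketch of the three links associated with the normal, logistic and extreme value distributions, together with their inverses. The function names are illustrative; the complementary log-log form used for the extreme value case is the one discussed in Section 3.7.4.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal, mean 0 and s.d. 1

def probit(p):
    """Inverse of the standard normal c.d.f."""
    return _std_normal.inv_cdf(p)

def inv_probit(eta):
    return _std_normal.cdf(eta)

def logit(p):
    """Inverse of the standard logistic c.d.f."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def cloglog(p):
    """Complementary log-log transformation."""
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: probit {probit(p):+.3f}, logit {logit(p):+.3f}, "
          f"c-log-log {cloglog(p):+.3f}")
```

Each pair maps the interval (0, 1) onto the real line and back, which is the only property the general argument requires.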
3.7.1
[Figure: density of the latent variable, with regions labelled Y=0 and Y=1; vertical axis density, horizontal axis latent variable]
(3.17)
3.7.2
Probit Analysis
of age, and a linear age by desire interaction. Fitting this model gives a
deviance of 8.91 on four d.f. Estimates of the parameters and standard
errors appear in Table 3.16.

Table 3.16

Parameter       Symbol                   Estimate  Std. Error  z-ratio
Constant        $\alpha_1$                -0.7297      0.0460   -15.85
Age             $\beta_1$                  0.0129      0.0061     2.13
Desire          $\alpha_2 - \alpha_1$      0.4572      0.0731     6.26
Age × Desire    $\beta_2 - \beta_1$        0.0305      0.0092     3.32
To interpret these results we imagine a latent continuous variable representing the woman's motivation to use contraception (or the utility of using
contraception, compared to not using). At the average age of 30.6, not wanting more children increases the motivation to use contraception by almost
half a standard deviation. Each year of age is associated with an increase in
motivation of 0.01 standard deviations if she wants more children and 0.03
standard deviations more (for a total of 0.04) if she does not. In the next
section we compare these results with logit estimates.
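One quick way to see how close the probit and logit fits are is to convert the constants of each model into fitted probabilities at the mean age of 30.6, where the centred age term vanishes. The sketch assumes the constants in Tables 3.12 and 3.16 are negative, which is consistent with fewer than half of the women using contraception at that age; variable names are illustrative.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Probit estimates (Table 3.16): constant for women who want more children,
# and the difference in constants for women who want no more
probit_constant = -0.7297
probit_desire = 0.4572

# Logit constants (Table 3.12) for the same two groups
logit_constant_more = -1.1944
logit_constant_no_more = -0.4369

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

p_probit_more = std_normal.cdf(probit_constant)
p_probit_no_more = std_normal.cdf(probit_constant + probit_desire)
p_logit_more = inv_logit(logit_constant_more)
p_logit_no_more = inv_logit(logit_constant_no_more)

print(f"wants more:    probit {p_probit_more:.3f}, logit {p_logit_more:.3f}")
print(f"wants no more: probit {p_probit_no_more:.3f}, logit {p_logit_no_more:.3f}")
```

Under these assumptions the two models give nearly identical fitted probabilities at the mean age, even though the coefficients are on different scales.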
A slight disadvantage of using the normal distribution as a link for binary
response models is that the c.d.f. does not have a closed form, although excellent numerical approximations and computer algorithms are available for
computing both the normal probability integral and its inverse, the probit.
3.7.3
Logistic Regression
An alternative to the normal distribution is the standard logistic distribution, whose shape is remarkably similar to the normal distribution but has
the advantage of a closed form expression
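$$\pi_i = F(\eta_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}},$$
for $-\infty < \eta_i < \infty$, whose inverse is the logit transformation
$$\eta_i = F^{-1}(\pi_i) = \log \frac{\pi_i}{1 - \pi_i}.$$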
[Figure 3.7: probit, logit and c-log-log transformations; vertical axis probability, horizontal axis link]
3.7.4
$$F(\eta_i) = 1 - e^{-e^{\eta_i}}.$$
For small values of $\pi_i$ the complementary log-log transformation is close
to the logit. As the probability increases, the transformation approaches
infinity more slowly than either the probit or logit.
This particular choice of link function can also be obtained from our
general latent variable formulation if we assume that $-U_i$ (note the minus
sign) has a standard extreme value distribution, so the error term itself has
a reverse extreme value distribution, with c.d.f.
$$F(U_i) = e^{-e^{-U_i}}.$$
The reverse extreme value distribution is asymmetric, with a long tail to the
right. It has mean equal to Euler's constant 0.577 and variance $\pi^2/6 = 1.645$.
The median is $-\log \log 2 = 0.367$ and the quartiles are $-0.327$ and $1.246$.
Inverting the reverse extreme value c.d.f. and applying Equation 3.17,
which is valid for both symmetric and asymmetric distributions, we find
that the link corresponding to this error distribution is the complementary
log-log.
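The median and quartiles quoted for the reverse extreme value distribution follow from inverting its c.d.f. A brief numerical check, with hypothetical function names:

```python
import math

def reverse_ev_cdf(u):
    """C.d.f. of the reverse extreme value distribution: exp(-exp(-u))."""
    return math.exp(-math.exp(-u))

def reverse_ev_quantile(p):
    """Inverse c.d.f.: solves exp(-exp(-u)) = p for u."""
    return -math.log(-math.log(p))

median = reverse_ev_quantile(0.5)
lower_quartile = reverse_ev_quantile(0.25)
upper_quartile = reverse_ev_quantile(0.75)
variance = math.pi ** 2 / 6

print(f"median {median:.3f}, quartiles {lower_quartile:.3f} and {upper_quartile:.3f}")
print(f"variance {variance:.3f}")
```

The upper quartile lies much further from the median than the lower one, which is the long right tail mentioned in the text.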
Thus, coefficients in a generalized linear model with binary response and
a complementary log-log link can be interpreted as effects of the covariates
on a latent variable which follows a linear model with reverse extreme value
errors.
To compare these coefficients with estimates
based on a probit analysis
Figure 3.7 compares the c-log-log link with the probit and logit after
standardizing it to have mean zero and variance one. Although the c-log-log
link differs from the other two, one would need extremely large sample sizes
to be able to discriminate empirically between these links.
The complementary log-log transformation has a direct interpretation in
terms of hazard ratios, and thus has practical applications in terms of hazard
models, as we shall see later.
3.8
3.8.1
Pearson Residuals
$$p_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i (n_i - \hat{\mu}_i)/n_i}}, \tag{3.18}$$
where $\hat{\mu}_i$ is the fitted value and the denominator follows from the fact that
$\text{var}(y_i) = n_i \pi_i (1 - \pi_i)$.
The result is called the Pearson residual because the square of $p_i$ is the
contribution of the i-th observation to Pearson's chi-squared statistic, which
was introduced in Section 3.2.2, Equation 3.14.
With grouped data the Pearson residuals are approximately normally
distributed, but this is not the case with individual data. In both cases,
however, observations with a Pearson residual exceeding two in absolute
value may be worth a closer look.
3.8.2
Deviance Residuals
An alternative residual is based on the deviance or likelihood ratio chi-squared statistic. The deviance residual is defined as
$$d_i = \sqrt{2\left[y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - \hat{\mu}_i}\right)\right]}, \tag{3.19}$$
with the same sign as the raw residual $y_i - \hat{\mu}_i$. Squaring these residuals and
summing over all observations yields the deviance statistic. Observations
with a deviance residual in excess of two in absolute value may indicate lack of fit.
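Both residuals are simple functions of the observed count, the group size and the fitted count. The sketch below is one way to code Equations 3.18 and 3.19 for grouped data; the numerical example (6 users out of 59 with a fitted count of 8.348) is hypothetical and chosen only for illustration.

```python
import math

def pearson_residual(y, n, mu):
    """Raw residual divided by the estimated binomial standard deviation."""
    return (y - mu) / math.sqrt(mu * (n - mu) / n)

def deviance_residual(y, n, mu):
    """Signed square root of the observation's contribution to the deviance."""
    def term(observed, fitted):
        # the limit of x * log(x) as x tends to 0 is 0
        return observed * math.log(observed / fitted) if observed > 0 else 0.0
    d2 = 2.0 * (term(y, mu) + term(n - y, n - mu))
    return math.copysign(math.sqrt(max(d2, 0.0)), y - mu)

# Hypothetical group: 6 users among 59 women, fitted count 8.348
y, n, mu = 6, 59, 8.348
p_res = pearson_residual(y, n, mu)
d_res = deviance_residual(y, n, mu)
print(f"Pearson residual {p_res:.3f}, deviance residual {d_res:.3f}")
```

Summing the squares of either residual over all groups gives Pearson's chi-squared or the deviance, respectively.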
3.8.3
Studentized Residuals
The residuals defined so far are not fully standardized. They take into
account the fact that different observations have different variances, but
they make no allowance for additional variation arising from estimation of
the parameters, in the way studentized residuals in classical linear models
do.
Pregibon (1981) has extended to logit models some of the standard regression diagnostics. A key in this development is the weighted hat matrix
$$H = W^{1/2} X (X'WX)^{-1} X' W^{1/2},$$
where W is the diagonal matrix of iteration weights from Section 3.2.1, with
entries $w_{ii} = \mu_i (n_i - \mu_i)/n_i$, evaluated at the m.l.e.'s. Using this expression
it can be shown that the variance of the raw residual is, to a first-order
approximation,
$$\text{var}(y_i - \hat{\mu}_i) \approx (1 - h_{ii})\,\text{var}(y_i),$$
where $h_{ii}$ is the leverage or diagonal element of the weighted hat matrix.
Thus, an internally studentized residual can be obtained by dividing the Pearson
residual by the square root of $1 - h_{ii}$, to obtain
$$s_i = \frac{p_i}{\sqrt{1 - h_{ii}}} = \frac{y_i - \hat{\mu}_i}{\sqrt{(1 - h_{ii})\,\hat{\mu}_i (n_i - \hat{\mu}_i)/n_i}}.$$
be expensive. Suppose, however, that we start from the final estimates and
do only one iteration of the IRLS procedure. Since this step is a standard
weighted least squares calculation, we can apply the standard regression
updating formulas to obtain the new coefficients and thus the predictive
residuals. Thus, we can calculate a jack-knifed residual as a function of the
standardized residual using the same formula as in linear models
$$t_i = s_i \sqrt{\frac{n - p - 1}{n - p - s_i^2}}.$$
3.8.4
The diagonal elements of the hat matrix can be interpreted as leverages just
as in linear models. To measure actual rather than potential influence we
could calculate Cook's distance, comparing $\hat{\beta}$ with $\hat{\beta}_{(i)}$, the m.l.e.'s of the
coefficients with and without the i-th observation. Calculation of the latter
would be expensive if we iterated to convergence. Pregibon (1981), however,
has shown that we can use the standard linear models formula
$$D_i = s_i^2 \frac{h_{ii}}{(1 - h_{ii})\,p},$$
where $p$ is the number of parameters in the model.
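Given a Pearson residual and the corresponding leverage, the studentized residual, the jack-knifed residual and Cook's distance are one-line calculations. The functions below are a sketch of the formulas in this section; the inputs in the example are made-up values, not results from the contraceptive use data.

```python
import math

def studentized_residual(pearson, leverage):
    """Internally studentized residual: Pearson residual over sqrt(1 - h)."""
    return pearson / math.sqrt(1.0 - leverage)

def jackknifed_residual(s, n, p):
    """One-step approximation to the externally studentized residual."""
    return s * math.sqrt((n - p - 1) / (n - p - s ** 2))

def cooks_distance(s, leverage, p):
    """One-step approximation to Cook's distance."""
    return s ** 2 * leverage / ((1.0 - leverage) * p)

# Made-up values: Pearson residual 2.0, leverage 0.75, n = 16 groups, p = 9 parameters
s = studentized_residual(2.0, 0.75)
print(f"studentized residual {s:.2f}")
print(f"jack-knifed residual {jackknifed_residual(1.5, 16, 9):.2f}")
print(f"Cook's distance {cooks_distance(s, 0.75, 9):.2f}")
```

The jack-knifed residual is defined only when $s_i^2 < n - p$; the sketch does not guard against larger values.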
3.8.5
With grouped data we can assess goodness of fit by looking directly at the
deviance, which has approximately a chi-squared distribution for large ni . A
common rule of thumb is to require all expected frequencies (both expected
successes $\hat{\mu}_i$ and failures $n_i - \hat{\mu}_i$) to exceed one, and 80% of them to exceed
five.
With individual data this test is not available, but one can always group
the data according to their covariate patterns. If the number of possible
combinations of values of the covariates is not too large relative to the total
sample size, it may be possible to group the data and conduct a formal
goodness of fit test. Even when the number of covariate patterns is large, it is
possible that a few patterns will account for most of the observations. In this
case one could compare observed and fitted counts at least for these common
patterns, using either the deviance or Pearson's chi-squared statistic.
Hosmer and Lemeshow (1980, 1989) have proposed an alternative procedure that can be used with individual data even if there are no common
covariate patterns. The basic idea is to use predicted probabilities to create
groups. These authors recommend forming ten groups, with predicted probabilities of 0–0.1, 0.1–0.2, and so on, with the last group being 0.9–1. One
can then compute expected counts of successes (and failures) for each group
by summing the predicted values (and their complements), and compare
these with observed values using Pearson's chi-squared statistic. Simulation
studies show that in large samples the resulting statistic has approximately
the usual chi-squared distribution, with degrees of freedom equal to $g - 2$,
where $g$ is the number of groups, usually ten. It seems reasonable to assume that this result would also apply if one used the deviance rather than
Pearson's chi-squared.
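The grouping procedure just described is easy to sketch for individual data. The function below follows the fixed cut-point version (0–0.1, 0.1–0.2, and so on); treating empty groups by dropping them and counting only non-empty groups in $g$ is an assumption of this sketch, as is the toy dataset.

```python
def hosmer_lemeshow(y, p_hat, groups=10):
    """Pearson statistic over groups defined by fixed cut-points of predicted probability.

    y: observed 0/1 responses; p_hat: fitted probabilities strictly between 0 and 1.
    Returns the statistic and g - 2, where g counts the non-empty groups.
    """
    observed = [0.0] * groups
    expected = [0.0] * groups
    size = [0] * groups
    for yi, pi in zip(y, p_hat):
        g = min(int(pi * groups), groups - 1)  # a probability of 1 goes to the last group
        observed[g] += yi
        expected[g] += pi
        size[g] += 1
    statistic, used = 0.0, 0
    for o, e, n in zip(observed, expected, size):
        if n == 0:
            continue  # skip empty groups
        used += 1
        # contributions of successes and of failures
        statistic += (o - e) ** 2 / e + ((n - o) - (n - e)) ** 2 / (n - e)
    return statistic, used - 2

# Toy data: three groups of fitted probabilities
p_hat = [0.25] * 4 + [0.5] * 2 + [0.75] * 4
y = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0]
stat, df = hosmer_lemeshow(y, p_hat)
print(f"statistic {stat:.2f} on {df} d.f.")
```

In the toy data the first and last groups match their expected counts exactly, so the whole statistic comes from the middle group.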
Another measure that has been proposed in the literature is a pseudo-$R^2$, based on the proportion of deviance explained by a model. This is a
direct extension of the calculations based on RSS's for linear models. These
measures compare a given model with the null model, and as such do not
necessarily measure goodness of fit. A more direct measure of goodness of
fit would compare a given model with the saturated model, which brings us
back again to the deviance.
Yet another approach to assessing goodness of fit is based on prediction
errors. Suppose we were to use the fitted model to predict success if the
fitted probability exceeds 0.5 and failure otherwise. We could then cross-tabulate the observed and predicted responses, and calculate the proportion of
cases predicted correctly. While intuitively appealing, one problem with this
approach is that a model that fits the data may not necessarily predict well,
since this depends on how predictable the outcome is. If prediction were the
main objective of the analysis, however, the proportion classified correctly
would be an ideal criterion for model comparison.