An Introduction to Logistic Regression
John Whitehead
Department of Economics
East Carolina University
Outline
Introduction and Description
Some Potential Problems and Solutions
Writing Up the Results
Introduction and Description
Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model
Why use logistic regression?
There are many important research topics for which the dependent variable is "limited."
For example: voting, morbidity or mortality, and participation data are not continuous or normally distributed.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable, coded 0 (did not vote) or 1 (did vote).
The Linear Probability Model
In the OLS regression:
Y = α + βX + e; where Y = (0, 1)
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0
An Example: Hurricane Evacuations
Q: EVAC
                                 N      Minimum    Maximum   Mean       Std. Deviation
Unstandardized Predicted Value   1070   -.08498    .76027    .2429907   .16325
Valid N (listwise)               1070
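As a rough sketch (not the original SPSS analysis), the linear probability model can be fit by OLS and its fitted values checked for impossible probabilities; the DataFrame df and the column names evac and tenure are hypothetical stand-ins for the survey data:

    # Sketch: linear probability model by OLS (hypothetical data/variable names).
    import statsmodels.api as sm

    X = sm.add_constant(df[["tenure"]])          # df is an assumed pandas DataFrame
    lpm = sm.OLS(df["evac"], X).fit()            # evac is the 0/1 evacuation dummy
    p_hat = lpm.fittedvalues
    # Predicted "probabilities" can fall below 0 or above 1, as in the table above
    print(p_hat.min(), p_hat.max())
    print(((p_hat < 0) | (p_hat > 1)).sum(), "predictions outside [0, 1]")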
Heteroskedasticity
[Scatter plot: unstandardized residuals (about -20 to 10) plotted against TENURE (0 to 100)]
Park Test
Dependent Variable: LNESQ
              B        t-stat
(Constant)   -2.34     -15.99
LNTNSQ       -0.20      -6.19
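A Park test along these lines can be sketched by regressing the log of the squared LPM residuals on the log of the squared regressor (continuing the hypothetical lpm fit and tenure variable from the sketch above, and assuming tenure > 0):

    # Sketch: Park test for heteroskedasticity (hypothetical names, tenure > 0 assumed).
    import numpy as np
    import statsmodels.api as sm

    lnesq = np.log(lpm.resid ** 2)               # LNESQ: log of squared residuals
    lntnsq = np.log(df["tenure"] ** 2)           # LNTNSQ: log of squared regressor
    park = sm.OLS(lnesq, sm.add_constant(lntnsq)).fit()
    print(park.params)                           # a significant slope signals heteroskedasticity
    print(park.tvalues)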
The Logistic Regression Model
The "logit" model solves these problems:
ln[p/(1-p)] = α + βX + e
where p is the probability that the event occurs, p(Y = 1)
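A minimal sketch of estimating the logit model by maximum likelihood, again with the hypothetical df, evac, and tenure names:

    # Sketch: binary logistic regression (hypothetical data/variable names).
    import statsmodels.api as sm

    X = sm.add_constant(df[["tenure"]])
    logit = sm.Logit(df["evac"], X).fit()        # alpha and beta estimated by MLE
    print(logit.summary())
    # Coefficients are on the log-odds scale: ln[p/(1-p)] = alpha + beta*tenure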
Maximum Likelihood Estimation (MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (p1, p2, ..., pn) that occur
in the sample:
L = Prob(p1 × p2 × ... × pn)
The higher the L, the higher the probability of
observing the ps in the sample.
MLE involves finding the coefficients (α, β) that
make the log of the likelihood function (LL < 0)
as large as possible
Or, equivalently, finds the coefficients that make -2 times the
log of the likelihood function (-2LL) as small as
possible
The maximum likelihood estimates solve the
following condition:
Σ {Yi - p(Yi = 1)} Xi = 0, summed over all observations
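Written out directly, the log likelihood being maximized and the first-order condition above look like this (a generic numpy sketch, not SPSS's internal routine):

    # Sketch: logistic log likelihood and the score (first-order) condition.
    import numpy as np

    def log_likelihood(alpha, beta, x, y):
        p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))            # p(Y=1 | x)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # LL, always < 0

    def score(alpha, beta, x, y):
        p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
        return np.sum((y - p) * x)      # equals (approximately) 0 at the MLE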
ln[p/(1-p)] = α + βX + e
Wald = [β / s.e.(β)]²
which is distributed chi-square with 1
degree of freedom.
The "Partial R" (in SPSS output) is
R = {(Wald - 2) / [-2LL(α)]}^(1/2)
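Continuing the hypothetical logit fit above, the Wald statistic and partial R can be computed as below; carrying the sign of β and truncating at zero when Wald < 2 is one reading of the SPSS formula, not a documented routine:

    # Sketch: Wald statistic and "partial R" from the hypothetical logit fit above.
    import numpy as np
    from scipy import stats

    beta = logit.params["tenure"]
    se = logit.bse["tenure"]
    wald = (beta / se) ** 2
    p_value = stats.chi2.sf(wald, df=1)          # chi-square test with 1 df
    neg2ll_alpha = -2 * logit.llnull             # -2LL of the intercept-only model
    partial_r = np.sign(beta) * np.sqrt(max(wald - 2, 0) / neg2ll_alpha)
    print(wald, p_value, partial_r)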
An Example
McFadden's R2 = 1 - [LL(α,β) / LL(α)]
{= 1 - [-2LL(α,β) / -2LL(α)] (from SPSS printout)}
Beginning -2LL             687.36
Ending -2LL                641.84
Ending/Beginning           0.9338
McFadden's R2 = 1 - E/B    0.0662
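The arithmetic in this example can be reproduced directly from the two -2LL values:

    # McFadden's R2 from the example above.
    beginning_2ll = 687.36                       # -2LL, intercept-only model
    ending_2ll = 641.84                          # -2LL, model with predictors
    mcfadden_r2 = 1 - ending_2ll / beginning_2ll
    print(round(mcfadden_r2, 4))                 # 0.0662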
Some Potential Problems and Solutions
Omitted Variable Bias
Irrelevant Variable Bias
Functional Form
Multicollinearity
Structural Breaks
Omitted Variable Bias
Omitted variable(s) can result in bias in the coefficient
estimates. To test for omitted variables you can conduct a
likelihood ratio test:
Beginning -2 LL 687.36
Ending -2 LL 641.41
Constructing the LR Test
Ending -2LL, Partial Model   641.84
Ending -2LL, Full Model      641.41
Block Chi-Square               0.43
DF                                3
Critical Value               11.345
“Since the chi-squared value is less than the critical value, the
set of coefficients is not statistically significant. The full model
is not an improvement over the partial model.”
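The same comparison can be reproduced numerically; 11.345 is the chi-square critical value with 3 degrees of freedom at the 1% level:

    # Likelihood ratio (block chi-square) test from the values above.
    from scipy import stats

    partial_2ll = 641.84                         # ending -2LL, partial model
    full_2ll = 641.41                            # ending -2LL, full model
    block_chi_square = partial_2ll - full_2ll    # 0.43
    critical_value = stats.chi2.ppf(0.99, 3)     # about 11.345
    print(block_chi_square > critical_value)     # False: full model is no improvement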
Irrelevant Variable Bias
Since the chi-squared value is greater than the critical value, the
set of coefficients is statistically different. The pooled model is
inappropriate.
What should you do?
Try adding a dummy variable:
http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm
E-mail: WhiteheadJ@mail.ecu.edu