Fitting & Interpreting Linear Models in R
R makes it easy to fit a linear model to your data. The hard part is knowing whether the model
you've built is worth keeping and, if so, figuring out what to do next.
This is a post about linear models in R: how to fit them, how to interpret the results of lm, and rules of thumb to help side-step the most common mistakes.
lm comes with base R, so you don't have to install any packages or import anything special. The documentation for lm is very extensive, so if you have any questions about using it, just type ?lm into the R console.
Introduction to lm
For our example linear model, I'm going to use data from the original, or at least one of the earliest, linear regression models. The dataset consists of the heights of children and their parents. The term "regression" stems from Francis Galton's 19th-century observation that children's heights tended to "regress" toward the population mean relative to their parents' heights.
# The galton dataset ships with the UsingR package:
# install.packages("UsingR")
library(UsingR)
data(galton)
head(galton)
#  child parent
#1  61.7   70.5
#2  61.7   68.5
#3  61.7   65.5
#4  61.7   64.5
#5  61.7   64.0
#6  62.2   67.5
Fit the model to the data by creating a formula and passing it to the lm function. In our case we want to use the parent's height to predict the child's height, so our formula is (child ~ parent). In other words, we're modeling the children's heights (y) as a function of the parents' heights (X).
We then set data to galton so lm knows which data frame "child" and "parent" refer to.
fit <- lm(child ~ parent, data=galton)
fit
#Call:
#lm(formula = child ~ parent, data = galton)
#
#Coefficients:
#(Intercept)      parent
#     23.942       0.646
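You can turn those coefficients into predictions with base R's predict function. A quick sketch (the parent height of 70 inches below is just an arbitrary example value):
# Predicted child height for a 70-inch-tall parent:
# 23.942 + 0.646 * 70, or roughly 69.2 inches
predict(fit, newdata = data.frame(parent = 70))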
NOTE: Formulas in R take the form (y ~ x). To add more predictor variables, join them with a +, as in (y ~ x1 + x2).
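For instance, if galton also had a grandparent height column (hypothetical; the real dataset only has child and parent), a two-predictor fit would look like:
# grandparent is a made-up column, shown only to illustrate the formula syntax
fit2 <- lm(child ~ parent + grandparent, data = galton)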
summary(fit)
#Call:
#lm(formula = child ~ parent, data = galton)
#
#Residuals:
#    Min     1Q Median     3Q    Max
#-7.805 -1.366  0.049  1.634  5.926
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  23.9415     2.8109    8.52   <2e-16 ***
#parent        0.6463     0.0411   15.71   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 2.24 on 926 degrees of freedom
#Multiple R-squared: 0.21, Adjusted R-squared: 0.21
#F-statistic: 247 on 1 and 926 DF, p-value: <2e-16
So if you're like I was at first, your reaction was probably something like:
"Whoa this is cool...what does it mean?"
Let's break the summary output down piece by piece.

1. Residuals

The residuals are the differences between the actual values of the variable you're predicting and the values predicted by your regression: y - ŷ. For most regressions you want your residuals to look like a normal distribution when plotted. If the residuals are normally distributed, this indicates that the mean of the difference between our predictions and the actual values is close to 0 (good) and that when we miss, we're missing both short and long of the actual value, with the likelihood of a miss getting smaller as the distance from the actual value gets larger. Think of it like a dartboard. A good model is going to hit the bullseye some of the time (but not every time). When it doesn't hit the bullseye, it misses in all of the other buckets evenly (i.e. not just in the 16 bin), and it misses close to the bullseye more often than out at the edge of the board.
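To check this for our model, a quick sketch using base R's plotting functions:
# Histogram of the residuals -- should look roughly bell-shaped
hist(resid(fit))
# Normal Q-Q plot -- points should fall close to the reference line
qqnorm(resid(fit))
qqline(resid(fit))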
One very important thing to note is that one of your levels will not appear in the output. This is because when fitting a regression with a categorical variable, one level must be left out as the baseline; including dummy variables for every level would make them perfectly collinear with the intercept. This is often referred to as the dummy variable trap. In our model, Africa is left out of the summary, but it is still accounted for: its effect is absorbed into the intercept, and the other levels' coefficients are measured relative to it.
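That model isn't reproduced above, but the same behavior is easy to see with R's built-in iris data (a minimal sketch; Species is a factor with three levels, and the first level, setosa, becomes the baseline):
# setosa is absorbed into the intercept; the other two coefficients
# are offsets relative to setosa
fit_iris <- lm(Sepal.Length ~ Species, data = iris)
coef(fit_iris)
# (Intercept)  Speciesversicolor  Speciesvirginica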
library(reshape2)