Fitting & Interpreting Linear Models in R
R makes it easy to fit a linear model to your data. The hard part is knowing whether the model
you've built is worth keeping and, if so, figuring out what to do next.
This is a post about linear models in R: how to fit them, how to interpret the results of lm, and rules of thumb to help side-step the most common mistakes.
lm comes with base R, so you don't have to install any packages or import anything special. The documentation for lm is very extensive, so if you have any questions about using it, just type ?lm into the R console.
Introduction to lm
For our example linear model, I'm going to use data from the original, or at least one of the earliest, linear regression models. The dataset consists of the heights of children and their parents. The term "regression" stems from Francis Galton's 19th-century observation that children's heights tended to "regress" toward the population mean relative to their parents' heights.
# The galton dataset ships with the UsingR package:
# install.packages("UsingR")
library(UsingR)
data(galton)
head(galton)
#  child parent
#1  61.7   70.5
#2  61.7   68.5
#3  61.7   65.5
#4  61.7   64.5
#5  61.7   64.0
#6  62.2   67.5
Fit the model to the data by creating a formula and passing it to the lm function. In our case we want to use the parent's height to predict the child's height, so our formula is (child ~ parent). In other words, we're modeling the children's heights (y) as a function of the parents' heights (X).
We then set data to galton so lm knows which data frame "child" and "parent" refer to.
fit <- lm(child ~ parent, data=galton)
fit
#Call:
#lm(formula = child ~ parent, data = galton)
#
#Coefficients:
#(Intercept)      parent
#     23.942       0.646
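You can turn those coefficients into predictions with base R's predict function. A quick sketch (the parent height of 70 inches below is just an arbitrary example value):
# Predicted child height for a 70-inch-tall parent:
# 23.942 + 0.646 * 70, or roughly 69.2 inches
predict(fit, newdata = data.frame(parent = 70))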
NOTE: Formulas in R take the form (y ~ x). To add more predictor variables, join them with a +, as in (y ~ x1 + x2).
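For instance, if galton also had a grandparent height column (hypothetical; the real dataset only has child and parent), a two-predictor fit would look like:
# grandparent is a made-up column, shown only to illustrate the formula syntax
fit2 <- lm(child ~ parent + grandparent, data = galton)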
summary(fit)
#Call:
#lm(formula = child ~ parent, data = galton)
#
#Residuals:
#    Min     1Q Median     3Q    Max
#-7.805 -1.366  0.049  1.634  5.926
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  23.9415     2.8109    8.52   <2e-16 ***
#parent        0.6463     0.0411   15.71   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 2.24 on 926 degrees of freedom
#Multiple R-squared: 0.21, Adjusted R-squared: 0.21
#F-statistic: 247 on 1 and 926 DF, p-value: <2e-16
So if you're like I was at first, your reaction was probably something like:
"Whoa this is cool...what does it mean?"
Let's break the summary output down piece by piece.

1. Residuals

The residuals are the differences between the actual values of the variable you're predicting and the values predicted by your regression: y - ŷ. For most regressions you want your residuals to look like a normal distribution when plotted. If the residuals are normally distributed, this indicates that the mean of the difference between our predictions and the actual values is close to 0 (good) and that when we miss, we're missing both short and long of the actual value, with the likelihood of a miss getting smaller as the distance from the actual value gets larger. Think of it like a dartboard. A good model is going to hit the bullseye some of the time (but not every time). When it doesn't hit the bullseye, it misses in all of the other buckets evenly (i.e. not just in the 16 bin), and it misses close to the bullseye more often than out at the edge of the board.
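To check this for our model, a quick sketch using base R's plotting functions:
# Histogram of the residuals -- should look roughly bell-shaped
hist(resid(fit))
# Normal Q-Q plot -- points should fall close to the reference line
qqnorm(resid(fit))
qqline(resid(fit))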
One very important thing to note is that one of your levels will not appear in the output. This is because when fitting a regression with a categorical variable, one level must be left out as the baseline; including dummy variables for every level would make them perfectly collinear with the intercept. This is often referred to as the dummy variable trap. In our model, Africa is left out of the summary, but it is still accounted for: its effect is absorbed into the intercept, and the other levels' coefficients are measured relative to it.
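That model isn't reproduced above, but the same behavior is easy to see with R's built-in iris data (a minimal sketch; Species is a factor with three levels, and the first level, setosa, becomes the baseline):
# setosa is absorbed into the intercept; the other two coefficients
# are offsets relative to setosa
fit_iris <- lm(Sepal.Length ~ Species, data = iris)
coef(fit_iris)
# (Intercept)  Speciesversicolor  Speciesvirginica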
library(reshape2)