f(x) = \sum_{j=1}^{q} b_j(x)\beta_j
To better explain by example, if we had a cubic polynomial the basis is: b_1(x) = 1, b_2(x) = x, b_3(x) = x^2, b_4(x) = x^3, which, when combined, give us our f(x). For those familiar with the mechanics of linear modeling, this is just our model matrix, which we can save out with many model functions in R via the argument x=T or using the model.matrix function. With the graph at the right we can see the effects in action. It is based on the results extracted from running the model (see ?poly for how to fit a polynomial in R) and obtaining the coefficients (e.g. the first plot represents the intercept of 470.44, the second plot, our b_2 coefficient of 289.5 multiplied by Income, and so forth). The bottom plot shows the final fit f(x), i.e. the linear combination of the basis functions.
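For the curious, a rough sketch of how one might produce such a polynomial basis and fit in R (purely illustrative, not the exact code behind the figure; it assumes the d data frame with Overall and Income used throughout, and drops missing values first):

d_poly   = na.omit(d[, c('Overall', 'Income')])
mod_poly = lm(Overall ~ poly(Income, 3, raw = TRUE), data = d_poly)  # cubic polynomial fit
head(model.matrix(mod_poly))  # columns are the basis functions 1, x, x^2, x^3
coef(mod_poly)                # the weights on each basis function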
At this point we have done nothing we couldn't do in our regular regression approach, but the take home message is that as we move to GAMs we are going about things in much the same fashion; we are simply changing the nature of the basis, and have a great deal more flexibility in choosing the form.
In the next figure I show the fit using a by-hand cubic spline basis (see the appendix and Wood, 2006, p. 126-7).
[Figure: by-hand cubic spline fit of Overall (roughly 400-500) against Income (0.4-1.0); the inset shows ggplot2's geom_smooth GAM fit for comparison.]
A cubic spline is essentially a connection of multiple cubic polynomial regressions. We choose
points of the variable at which to create sections, and these points are
referred to as knots. Separate cubic polynomials are fit at each section,
and then joined at the knots to create a continuous curve. The graph
represents a cubic spline with 8 knots (ten including the endpoints) between the first and third quartiles. The inset graph uses the GAM functionality within ggplot2's geom_smooth as a point of comparison.
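Though not exactly how the figure above was produced, a rough sketch of such a by-hand fit, using the rk and spl.X functions defined in the appendix (the object names here are my own):

d_cs   = na.omit(d[, c('Overall', 'Income')])
knots  = seq(quantile(d_cs$Income, .25), quantile(d_cs$Income, .75), length = 8)
X_cs   = spl.X(d_cs$Income, knots)    # cubic spline model matrix
mod_cs = lm(d_cs$Overall ~ X_cs - 1)  # unpenalized fit on this spline basis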
Let's now fit an actual generalized additive model using the same cubic spline as our smoothing function. We again use the gam function as before for basic model fitting, but now we are using a function s within the formula to denote the smooth terms. Within that function we also specify the type of smooth, though a default is available. I chose cr, denoting cubic regression splines, to keep consistent with our previous example.

mod_gam1 <- gam(Overall ~ s(Income, bs = "cr"), data = d)
summary(mod_gam1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income, bs = "cr")
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   470.44       4.08     115   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value
## s(Income) 6.9   7.74 16.4   2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.7   Deviance explained = 73.9%
## GCV score = 1053.7   Scale est. = 899.67   n = 54
The first thing to note is that, aside from the smooth part, our model code is similar to what we're used to with core R functions such as lm and glm. In the summary, we first see the distribution assumed as well as the link function used, in this case normal and identity, respectively, which, to reiterate, would result in a standard regression model had we done no smoothing. After that we see that the output is separated into parametric and smooth, or nonparametric, parts. In this case, the only parametric component is the intercept, but it's good to remember that you are not bound to smooth every effect of interest, and indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be linear for the most part anyway. The smooth component of our model, regarding the relationship between a country's income and overall science score, is statistically significant, but there are a couple of things in the model summary that would be unfamiliar.
We'll start with the effective degrees of freedom, or edf. In typical OLS regression the model degrees of freedom is equivalent to the number of predictors/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure, something that will be discussed more later. (In this example there are actually 9 terms associated with this smooth, but they are each penalized to some extent and thus the edf does not equal 9.) In this situation, we are still trying to minimize the residual sums of squares, but also have a built-in penalty for wiggliness of the fit, where in general we try to strike a balance between an undersmoothed fit and an oversmoothed fit. The default p-value for the test is based on the effective degrees of freedom and the rank r of the covariance matrix for the coefficients of a particular smooth, so here conceptually it is the p-value associated with F(r, n − edf). However there are still other issues to be concerned about, and ?summary.gam will provide your first step down that particular rabbit hole. For hypothesis testing an alternate edf is actually used, which is the other one provided in the summary result. (Here it is noted Ref.df, but if, for example, the argument p.type = 5 is used, it will be labeled Est.Rank. Also, there are four p-value types one can choose from. The full story of edf, p-values and related is scattered throughout Wood's text. See also ?anova.gam.) At this point you might be thinking these p-values are a bit fuzzy, and you'd be right. The gist is, they aren't to be used for harsh cutoffs, say, at an arbitrary .05 level (but then, standard p-values shouldn't be used that way either), though if they are pretty low you can feel comfortable claiming statistical significance, which of course is the end all, be all, of the scientific endeavor.
The GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation estimation process. We estimate the model for all observations except i, then note the squared residual predicting observation i from the model. Then we do this for all observations. However, the GCV score is an efficient measure of this concept that doesn't actually require fitting all those models and overcomes other issues. In this initial model the GCV can be found as:

GCV = \frac{n \cdot \text{scale est.}}{n - edf - [\text{number of parametric terms}]}

It is this score that is minimized when determining the specific nature of the smooth. On its own it doesn't tell us much, but we can use it similar to AIC as a comparative measure to choose among different models, with lower being better.
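As a quick worked check against the summary above (counting the intercept as the single parametric term): GCV ≈ 54 × 899.67 / (54 − 6.9 − 1) ≈ 1054, in line with the reported GCV score of 1053.7.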
Graphical Display
One can get a sense of the form of the fit by simply plotting the model object as follows:

plot(mod_gam1)

[Figure: the fitted smooth s(Income, 6.9) plotted against Income; the intervals are Bayesian credible intervals.]
In this single predictor case one can also revisit the previous graph, where the inset was constructed directly from ggplot. You can examine the code in the appendix.
Model Comparison
Let us now compare our regular regression fit to the GAM model fit. The following shows how one can extract various measures of performance, and the subsequent table shows them gathered together.

AIC(mod_lm)

## [1] 550.2

summary(mod_lm)$sp.criterion

## [1] 1504

summary(mod_lm)$r.sq  # adjusted R squared

## [1] 0.5175
Do the same to extract those same elements from the GAM. The following display makes for easy comparison. (I just gathered the values into a data.frame object and used xtable from the eponymous package to produce the table.)

 aic_lm  aic_gam1   gcv_lm  gcv_gam1   rsq_lm  rsq_gam1
 550.24    529.81  1504.50   1053.73     0.52      0.70
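For reference, the GAM values in the table can be pulled the same way (a minimal sketch using the mod_gam1 object fit above; these return the 529.81, 1053.73, and 0.70 shown in the table):

AIC(mod_gam1)
summary(mod_gam1)$sp.criterion  # the GCV score
summary(mod_gam1)$r.sq          # adjusted R squared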
Comparing these various measures, it's safe to conclude that the GAM fits better. We can also perform the familiar statistical test via the anova function we apply to other R model objects. As with the previous p-value issue, we can't be too particular, and technically one could have a model with more terms but lower edf, where the test just wouldn't even make sense (see ?anova.gam for more information). As it would be best to be conservative, we'll proceed cautiously.
anova(mod_lm, mod_gam1, test = "Chisq")

## Analysis of Deviance Table
##
## Model 1: Overall ~ Income
## Model 2: Overall ~ s(Income, bs = "cr")
##   Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
## 1      52.0      75336
## 2      46.1      41479 5.9    33857  1.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It would appear the anova results tell us what we have probably come to believe already, that incorporating nonlinear effects has improved the model considerably.
Multiple Predictors
Let's now see what we can do with a more realistic case where we have added model complexity.

Linear Fit

We'll start with the typical linear model approach again, this time adding the Health and Education indices.
mod_lm2 <- gam(Overall ~ Income + Edu + Health, data = d)
summary(mod_lm2)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ Income + Edu + Health
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    121.2       79.0    1.53    0.131
## Income         182.3       85.3    2.14    0.038 *
## Edu            234.1       54.8    4.27  9.1e-05 ***
## Health          27.0      134.9    0.20    0.842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.616   Deviance explained = 63.9%
## GCV score = 1212.3   Scale est. = 1119   n = 52
It appears we have statistical effects for Income and Education, but not for Health, and the adjusted R-squared suggests a notable amount of the variance is accounted for. (Note that the difference in sample sizes does not make this directly comparable to the first model.) Let's see about nonlinear effects.
GAM
As far as the generalized additive model goes, we can approach things in a similar manner as before. We will ignore the results of the linear model for now and look for nonlinear effects for each covariate. (The default smoother for s() is bs = "tp", a thin plate regression spline.)

mod_gam2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data = d)
summary(mod_gam2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income) + s(Edu) + s(Health)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   471.15       2.77     170   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##            edf Ref.df    F p-value
## s(Income) 7.59   8.41 8.83 1.3e-07 ***
## s(Edu)    6.20   7.18 3.31  0.0073 **
## s(Health) 1.00   1.00 2.74  0.1066
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.863   Deviance explained = 90.3%
## GCV score = 573.83   Scale est. = 399.5   n = 52
There are again a couple of things to take note of. First, statistically speaking, we come to the same conclusion as the linear model regarding the individual effects. One should take particular note of the effect of the Health index. The effective degrees of freedom with value 1 suggests that it has essentially been reduced to a simple linear effect. The following will update the model to explicitly model the effect as such, but as one can see, the results are identical.
mod_gam2B = update(mod_gam2, . ~ . - s(Health) + Health)
summary(mod_gam2B)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income) + s(Edu) + Health
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      640        102    6.26  3.1e-07 ***
## Health          -190        115   -1.65     0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##            edf Ref.df    F p-value
## s(Income) 7.59   8.41 8.83 1.3e-07 ***
## s(Edu)    6.20   7.18 3.31  0.0073 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.863   Deviance explained = 90.3%
## GCV score = 573.83   Scale est. = 399.5   n = 52
We can also note that this model accounts for much of the variance
in Overall science scores, with an adjusted R-squared of .86. In short,
it looks like the living standards and educational resources of a country
are associated with overall science scores, even if we don't really need the Health index in the model.
Graphical Display
Now we examine the effects of interest visually. In the following code, aside from the basic plot we have changed the default options to put all the plots on one page, added the partial residuals (see e.g. John Fox's texts and chapters regarding regression diagnostics, and his crPlots function in the car package), and changed the symbol and its respective size, among other options.

plot(mod_gam2, pages = 1, residuals = T, pch = 19, cex = 0.25,
     scheme = 1, col = '#FF8000', shade = T, shade.col = 'gray90')
[Figure: fitted smooths with partial residuals for mod_gam2: s(Income, 7.59), s(Edu, 6.2), and s(Health, 1), each plotted against its covariate.]
Here we can see the effects of interest, and one might again note the penalized-to-linear effect of Health. We again see the tapering off of Income's effect at its highest level, and in addition, a kind of sweet spot for a positive effect of Education in the mid-range values, with a slight positive effect overall. Health, as noted, has been reduced to a linear, surprisingly negative effect, but again this is non-significant. One will also note the y-axis scale. The scale is that of the linear predictor, but due to identifiability constraints, the smooths must sum to zero, and thus are presented in a mean-centered fashion.

Previously, we had only one predictor and so we could let ggplot do the work to produce a plot on the response scale, although we could have added the intercept back in. Here we'll need to do a little more. The following will produce a plot for Income on the scale of the response, with the other predictors held at their mean.
# Note that mod_gam2$model is the data that was used in the modeling process,
# so it will have NAs removed.
testdata = data.frame(Income = seq(.4, 1, length = 100),
                      Edu    = mean(mod_gam2$model$Edu),
                      Health = mean(mod_gam2$model$Health))
fits = predict(mod_gam2, newdata = testdata, type = 'response', se = T)
predicts = data.frame(testdata, fits)

ggplot(aes(x = Income, y = fit), data = predicts) +
  geom_smooth(aes(ymin = fit - 1.96 * se.fit, ymax = fit + 1.96 * se.fit),
              fill = 'gray80', size = 1, stat = 'identity') +
  ggtheme

[Figure: predicted Overall score (roughly 300-500) against Income (0.4-0.8) on the response scale, with Edu and Health held at their means.]
This gives us a sense for one predictor, but let's take a gander at Income and Education at the same time. Previously, when we used the function plot, it was actually plot.gam, the plot method for a GAM class object, that was called. There is another plotting function, vis.gam, that will give us a bit more to play with. The following will produce a contour plot with Income on the x axis, Education on the y, with values on the response scale given by the contours, and with lighter color indicating higher values.

vis.gam(mod_gam2, type = "response", plot.type = "contour")

[Figure: contour plot of the predicted response over Income (x axis) and Edu (y axis).]
First and foremost, the figure reflects the individual plots (e.g. one can see the decrease in scores at the highest income levels), and we can see that middling on Education and high on Income generally produces the highest scores. Conversely, being low on both the Education and Income indices is associated with poor Overall science scores. However, while interesting, these respective smooths were created separately of one another, and there is another way we might examine how these two effects work together in predicting the response.
Let's take a look at another approach, continuing the focus on visual display. It may not be obvious at all, but one can utilize smooths of more than one variable, in effect, a smooth of the smooths of the variables that go into it. Let's create a new model to play around with this feature. After fitting the model, an example of the plot code is given, producing a perspective plot; I also provide a different perspective as well as the contour plot for comparison with the graph based on separate smooths.
[Figure: perspective plots of the predicted response surface over Income and Edu, shown from two angles.]
mod_gam3 <- gam(Overall ~ te(Income, Edu), data = d)
summary(mod_gam3)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ te(Income, Edu)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   471.15       3.35     141   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                 edf Ref.df    F p-value
## te(Income,Edu) 10.1   12.2 16.9  <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.8   Deviance explained = 84%
## GCV score = 741.42   Scale est. = 583.17   n = 52
vis.gam(mod_gam3, type = 'response', plot.type = 'persp',
        phi = 30, theta = 30, n.grid = 500, border = NA)
In the above we are using a type of smooth called a tensor product smooth, and by smoothing the marginal smooths of Income and Education, we see a bit clearer story. As we might suspect, wealthy countries with more of an apparent educational infrastructure are going to score higher on the Overall science score. However, wealth is not necessarily indicative of higher science scores (note the dark bottom right corner on the contour plot; this is likely due to Qatar, so refer again to the figure on page 9), though without at least moderate wealth hopes are fairly dim for a decent score.

One can also, for example, examine interactions between a smooth and a linear term, f(x)z, and in a similar vein look at smooths at different levels of a grouping factor. We'll leave that for some other time.
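Though we leave the details for another time, the mgcv syntax for those two ideas looks roughly like the following sketch (y, x, z, the factor grp, and dat are hypothetical placeholders, not part of the current data):

gam(y ~ s(x, by = z), data = dat)          # smooth of x whose effect varies linearly with z, i.e. f(x)z
gam(y ~ grp + s(x, by = grp), data = dat)  # a separate smooth of x for each level of grp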
Model Comparison
As before, we can examine indices such as GCV or perhaps adjusted
R-square, which both suggest our GAM performs considerably better.
Statistically we can compare the two models with the anova function
as before.
anova(mod_lm2, mod_gam2, test = "Chisq")

## Analysis of Deviance Table
##
## Model 1: Overall ~ Income + Edu + Health
## Model 2: Overall ~ s(Income) + s(Edu) + s(Health)
##   Resid. Df Resid. Dev   Df Deviance Pr(>Chi)
## 1      48.0      53713
## 2      36.2      14463 11.8    39250  9.8e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not that we couldn't have assumed as such already, but now we
have additional statistical evidence to suggest that incorporating non-
linear effects improves the model.
Other Issues
Estimation
As noted previously, estimation of GAMs in the mgcv package is conducted via a penalized likelihood approach. Conceptually this amounts to fitting the following model:
g(\mu) = X\beta + f(x_1) + f(x_2) + \dots + f(x_p)
But note that each smooth has its own model matrix made up of the bases. So for each smooth covariate we have:

f_j = X_j \beta_j
Given the coefficients \beta, penalty matrices S_j, and smoothing parameters \lambda_j, we can more formally note a penalized likelihood function:

l_p(\beta) = l(\beta) - \frac{1}{2}\sum_j \lambda_j \beta^\mathsf{T} S_j \beta

where l(\beta) is the usual (log-)likelihood.
In the normal-response case this amounts to penalized least squares, with solution \hat{\beta} = (X^\mathsf{T}X + S)^{-1}X^\mathsf{T}y, where S is the combined penalty \sum_j \lambda_j S_j, and the trace of the corresponding influence matrix F = (X^\mathsf{T}X + S)^{-1}X^\mathsf{T}X gives the effective degrees of freedom noted earlier.
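If you want to peek at these pieces for the model fit earlier, a small sketch (these components exist on fitted mgcv gam objects, though their exact structure can vary by version):

mod_gam2$sp                   # estimated smoothing parameters, lambda_j
mod_gam2$smooth[[1]]$S[[1]]   # penalty matrix S_j for the first smooth, s(Income)
head(model.matrix(mod_gam2))  # the combined model matrix of basis functions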
[Figure: gam.check diagnostic plots for mod_gam2: a QQ plot of deviance residuals against theoretical quantiles, residuals vs. linear predictor, a histogram of residuals, and response vs. fitted values.]
gam.check(mod_gam2, k.rep = 1000)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 21 iterations.
## The RMS GCV score gradiant at convergence was 2.499e-05 .
## The Hessian was positive definite.
## The estimated model rank was 28 (maximum possible: 28)
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Income) 9.000 7.593 1.258 0.97
## s(Edu) 9.000 6.204 1.008 0.46
## s(Health) 9.000 1.000 0.899 0.18
The plots are of the sort we're used to from a typical regression setting, though it's perhaps a bit difficult to make any grand conclusion based on such a small data set. (One can inspect the quantile-quantile plot directly with the qq.gam function.) The printed output, on the other hand, contains unfamiliar information, but is largely concerned with over-smoothing, and so has tests of whether the basis dimension for a smooth is too low. The p-values are based on simulation, so I bumped up the number with the additional argument. Guidelines are given in the output itself, and at least in this case it does not look like we have an issue. However, if there were a potential problem, it is suggested to double k (k can be set as an argument, e.g. s(var1, k = ?)) and refit, and if the effective degrees of freedom increases quite a bit you would probably want to go with the updated model. (I actually did this for Health, which is the only one of the predictors that could be questioned, and there was no change at all; it still reduced to a linear effect.) Given the penalization process, the exact choice of k isn't too big of a deal, but the defaults are arbitrary. You want to set it large enough to get at the true effect as best as possible, but in some cases computational efficiency will also be of concern. The help for the function choose.k provides another approach to examining k based on the residuals from the model under consideration, and provides other useful information.
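A rough sketch of that doubled-k check for Health (the default basis dimension for s() here corresponds to k = 10, so doubling gives k = 20; this mirrors the refit described above rather than reproducing it exactly):

mod_gam2_k = update(mod_gam2, . ~ . - s(Health) + s(Health, k = 20))
summary(mod_gam2_k)$edf              # compare effective degrees of freedom to the original fit
gam.check(mod_gam2_k, k.rep = 1000)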
Prediction
A previous example used the predict function on the data used to fit the model to obtain fitted values on the response scale. Typically we'd use this on new data. I do not cover it, because the functionality is the same as the predict.glm function in base R, and one can just refer to that. It is worth noting that there is an option, type = 'lpmatrix', which will return the actual model matrix by which the coefficients must be pre-multiplied to get the values of the linear predictor at the supplied covariate values. This can be particularly useful towards opening the black box as one learns the technique.
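As a small sketch of that idea, reusing the mod_gam2 and testdata objects from earlier:

lp = predict(mod_gam2, newdata = testdata, type = 'lpmatrix')
head(lp %*% coef(mod_gam2))   # reproduces the linear predictor 'by hand'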
Model Comparison Revisited
We have talked about automated smoothing parameter and term selection, and in general potential models are selected based on estimation of the smoothing parameter. Using an extra penalty to allow coefficients to tend toward zero with the argument select = TRUE is an automatic way to go about it, where some terms could effectively drop out. Otherwise we could compare models' GCV/AIC scores (GCV scores are not useful for comparing fits of different families; AIC is still ok though), and in general either of these would be viable approaches. Consider the following comparison:
mod_1d = gam(Overall ~ s(Income) + s(Edu), data = d)
mod_2d = gam(Overall ~ te(Income, Edu, bs = "tp"), data = d)
AIC(mod_1d, mod_2d)

##           df   AIC
## mod_1d 15.59 476.1
## mod_2d 13.25 489.7
In some cases we might prefer to be explicit in comparing models with and without particular terms, and we can go about comparing models as we would with a typical GLM analysis of deviance. We have demonstrated this previously using the anova.gam function, where we compared linear fits to a model with an additional smooth function. While we could construct a scenario that is identical to the GLM situation for a statistical comparison, it should be noted that in the usual situation the test is actually an approximation, though it should be close enough when it is appropriate in the first place. The following provides an example that would nest the main effects of Income and Education within the product smooth, i.e. sets their basis dimension and smoothing function to the defaults employed by the te smooth.
mod_A = gam(Overall ~ s(Income, bs = "cr", k = 5) + s(Edu, bs = "cr", k = 5),
            data = d)
mod_B = gam(Overall ~ s(Income, bs = "cr", k = 5) + s(Edu, bs = "cr", k = 5) +
            te(Income, Edu), data = d)
anova(mod_A, mod_B, test = "Chi")

## Analysis of Deviance Table
##
## Model 1: Overall ~ s(Income, bs = "cr", k = 5) + s(Edu, bs = "cr", k = 5)
## Model 2: Overall ~ s(Income, bs = "cr", k = 5) + s(Edu, bs = "cr", k = 5) +
##     te(Income, Edu)
##   Resid. Df Resid. Dev   Df Deviance Pr(>Chi)
## 1      46.3      36644
## 2      42.0      24964 4.26    11679  0.00075 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again though, we could have just used the summary output from the second model.

Instances where such an approach does not appear to be appropriate within the context of the mgcv package are when terms are able to be penalized to zero; in such a case p-values will be much too low. In addition, when comparing GAMs, sometimes the nesting of models would not be so clear when there are multiple smooths involved, and additional steps may need to be taken to make sure they are nested. We must make sure that each smooth term in the null model has no more effective degrees of freedom than the same term in the alternative, otherwise it's possible that the model with more terms can have lower effective degrees of freedom but better fit, rendering the test nonsensical. Wood (2006) suggests that if such model comparison is the ultimate goal, an unpenalized approach (achievable with the argument s(..., fx = T), although now one has to worry more about the k value used, as very high k will lead to low power) would be best in order to have much confidence in the p-values. One might also examine the gss package for an anova-based approach to generalized smoothing splines.
Other Approaches
This section will discuss some ways to relate the generalized additive models above to other forms of nonlinear modeling approaches, some familiar and others perhaps less so. In addition, I will note some extensions to GAMs to consider.
Relation to Other Nonlinear Modeling Approaches
Known Form
It should be noted that one can place generalized additive models under a general heading of nonlinear models, y = f(X, \beta) + \epsilon, whose focus may be on transformations of the outcome (as with generalized linear models), the predictor variables (polynomial regression and GAMs), or both (GAMs), in addition to those whose effects are nonlinear in the parameters (for example, various theoretically motivated models in economics and ecology). The difference between the current presentation and those latter nonlinear models, as distinguished in typical introductory statistical texts that might cover some nonlinear modeling, is that we simply don't know the form beforehand.
In cases where the form may be known, one can use an approach
such as nonlinear least squares, and there is inherent functionality
within a standard R installation, such as the nls function. As is the
usual case, such functionality is readily extendable to a great many
other analytic situations, e.g. the gnm package for generalized nonlinear models or nlme for nonlinear mixed effects models.
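As a minimal sketch of the known-form case with nls (entirely hypothetical data, model, and starting values, just to show the pattern):

dat_nl   = data.frame(x = 1:20)
dat_nl$y = 2 * exp(0.15 * dat_nl$x) + rnorm(20, sd = .3)
nls(y ~ a * exp(b * x), data = dat_nl, start = list(a = 1, b = 0.1))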
Response Transformation
It is common practice, perhaps too common, to manually transform the response and go about things with a typical linear model. While there might be specific reasons for doing so, the primary reason applied researchers seem to do so is to make the distribution more normal so that regular regression methods can be applied. As an example, a typical transformation is to take the log, particularly to tame outliers or deal with skewness.

While this was a convenience back in the day because we didn't have the software or computing power to deal with a lot of data situations aptly, that is definitely not the case now. In many situations it would be better to, for example, conduct a generalized linear model with a log link or perhaps assume a different distribution for the response directly (e.g. skew-normal), and many tools allow researchers to do this with ease. (A lot of outliers tend to magically go away with an appropriate choice of distribution for the data generating process.)
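For instance, a minimal sketch of the log-link alternative to log-transforming the response (y, x, and dat are hypothetical placeholders):

glm(y ~ x, family = gaussian(link = 'log'), data = dat)
# or, for a strictly positive, right-skewed response
glm(y ~ x, family = Gamma(link = 'log'), data = dat)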
There are still cases where one might focus on response transformation, just not so one can overcome some particular nuisance in trying to fit a linear regression. An example might be in some forms of functional data analysis, where we are concerned with some function of the response that has been measured on many occasions over time.
The Black Box
Venables and Ripley (2002, Section 11.5) make an interesting classification of nonlinear models into those that are less flexible but under full user control (fully parametric), and those that are black box techniques that are highly flexible and fully automatic: stuff goes in, stuff comes out, but we're not privy to the specifics. (One could probably make the case that most modeling is black box for a great many researchers.) For an excellent discussion of these different approaches to understanding data see Breiman (2001) and the associated commentary. For some general packages outside of R that incorporate a purely algorithmic approach to modeling, you might check out RapidMiner or Weka; note also that RapidMiner allows for R functionality via an add-on package, and the RWeka package brings Weka to the R environment.
Two examples of the latter that they provide are projection pursuit
and neural net models, though a great many would fall into such a
heading. Projection pursuit models are well suited to high dimensional
data where dimension reduction is a concern. One may think of an
example where one uses a technique such as principal components
analysis on the predictor set and then examines smooth functions of M
principal components.
In the case of neural net models, one can imagine a model where the input units (predictor variables) are weighted and summed to create hidden layer units, which are then essentially put through the same process to create outputs.

[Figure: a simple neural net diagram with an input layer, a hidden layer, and an output.]

One can see projection pursuit models as an example where a smooth function is taken of the components which make up the hidden layer. Neural networks are highly flexible in that there can be any number of inputs, hidden layers, and outputs. However, such models are very much in the black box vein.
Projection pursuit and neural net models are usually found among data mining/machine learning techniques, any number of which might be utilized in a number of disciplines. Other more algorithmic/black box approaches include k-nearest-neighbors, random forests, support vector machines, and various tweaks or variations thereof including boosting, bagging, bragging and other alliterative shenanigans (see Hastie et al., 2009, for an overview of such approaches).
As Venables and Ripley note, generalized additive models might be
thought of as falling somewhere in between the fully parametric and
interpretable models of linear regression and black box techniques.
Indeed, there are algorithmic approaches which utilize GAMs as part of
their approach.
Extensions
Other GAMs
Note that just as generalized additive models are an extension of the generalized linear model, there are generalizations of the basic GAM beyond the settings described. In particular, random effects can be dealt with in this context as they can with linear and generalized linear models, and there is an interesting connection between smooths and random effects in general (Wood, 2006, has a whole chapter devoted to the subject). Generalized additive models for location, scale, and shape (GAMLSS) allow for distributions beyond the exponential family (Rigby and Stasinopoulos, 2005). In addition there are boosted, ensemble and other machine learning approaches that apply GAMs as well (see the GAMens package for example). In short, there's plenty to continue to explore once one gets the hang of generalized additive models.
Gaussian Processes: a Probabilistic Approach
We can also approach modeling by using generalizations of the Gaussian distribution. Where the Gaussian distribution is over vectors and defined by a mean vector and covariance matrix, a Gaussian process is over functions. A function f is distributed as a Gaussian process defined by a mean function m and covariance function k:

f \sim \mathcal{GP}(m, k)
[Figure: Gaussian process example with y = sin(x) + noise. The left panel shows functions drawn from the prior distribution; the right panel shows the posterior mean function with a 95% confidence interval shaded, along with specific draws from the posterior predictive mean distribution.]
In the Bayesian context we can define a prior distribution over functions and make draws from a posterior predictive distribution of f. The reader is encouraged to consult Rasmussen and Williams (2006) for the necessary detail. The text is freely available for download, and Rasmussen provides a nice and brief introduction; I also have some R code for demonstration based on his Matlab code.
Suffice it to say that in this context, it turns out that generalized additive models with a tensor product or cubic spline smooth are maximum a posteriori (MAP) estimates of Gaussian processes with specific covariance functions and a zero mean function. In that sense one might segue nicely to those if familiar with additive models.
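As an aside, more recent versions of mgcv also provide a Gaussian process smooth basis directly; a purely illustrative sketch (this assumes an mgcv newer than the version used for this document):

mod_gp = gam(Overall ~ s(Income, bs = 'gp'), data = d)
summary(mod_gp)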
Conclusion
Generalized additive models are a conceptually straightforward tool that allows one to incorporate nonlinear predictor effects into otherwise linear models. In addition, they allow one to keep within the linear and generalized linear frameworks with which one is already familiar, while providing new avenues of model exploration and possibly
improved results. As was demonstrated, it is easy enough with just a
modicum of familiarity to pull them off within the R environment, and
as such it is hoped that this document provides a means to do so for
the uninitiated.
Appendix
R packages
The following is a non-exhaustive list of R packages which contain GAM functionality. Each is linked to the CRAN page for the package. Note also that several build upon the mgcv package used for this document.
amer: Fitting generalized additive mixed models based on the mixed model algorithm of lme4 (gamm4 now includes this approach).

CausalGAM: This package implements various estimators for average treatment effects.

COZIGAM: Constrained and Unconstrained Zero-Inflated Generalized Additive Models.

CoxBoost: This package provides routines for fitting Cox models. See also cph in the rms package for nonlinear approaches in the survival context.

gam: Functions for fitting and working with generalized additive models.

GAMBoost: This package provides routines for fitting generalized linear and generalized additive models by likelihood based boosting.

gamboostLSS: Boosting models for fitting generalized additive models for location, shape and scale (gamLSS models).

GAMens: This package implements the GAMbag, GAMrsm and GAMens ensemble classifiers for binary classification.

gamlss: Generalized additive models for location, shape, and scale.

gamm4: Fit generalized additive mixed models via a version of mgcv's gamm function.

gammSlice: Bayesian fitting and inference for generalized additive mixed models.

GMMBoost: Likelihood-based boosting for generalized mixed models.

gss: A comprehensive package for structural multivariate function estimation using smoothing splines.

mboost: Model-Based Boosting.

mgcv: Routines for GAMs and other generalized ridge regression with multiple smoothing parameter selection by GCV, REML or UBRE/AIC. Also GAMMs.

VGAM: Vector generalized linear and additive models, and associated models.
Miscellaneous Code
Here is any code that might be okay to play around with but would
have unnecessarily cluttered sections of the paper.
ggtheme
My ggplot2 default options: ggtheme. Note that by actually adding a theme line after the ggtheme section, you can then override or change certain parts of it on the fly while maintaining the gist. I have this in my Rprofile.site file so that it is always available.
# set default options in case we use ggplot later
ggtheme =
  theme(
    axis.text.x = element_text(colour = 'gray50'),
    axis.text.y = element_text(colour = 'gray50'),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_rect(colour = 'gray50'),
    strip.background = element_blank()
  )
mod_gam1 plot
ggplot(aes(x = Income, y = Overall), data = d) +
  geom_point(color = "#FF8000") +
  geom_smooth(se = F, method = 'gam', formula = y ~ s(x, bs = "cr")) +
  xlim(.4, 1) +
  ggtheme
Penalized Estimation Example
Initial data set up and functions.
############################
### Wood by-hand example ###
############################

size = c(1.42,1.58,1.78,1.99,1.99,1.99,2.13,2.13,2.13,
         2.32,2.32,2.32,2.32,2.32,2.43,2.43,2.78,2.98,2.98)
wear = c(4.0,4.2,2.5,2.6,2.8,2.4,3.2,2.4,2.6,4.8,2.9,
         3.8,3.0,2.7,3.1,3.3,3.0,2.8,1.7)
x = size - min(size); x = x / max(x)
d = data.frame(wear, x)

# cubic spline function
rk <- function(x, z) {
  ((z-0.5)^2 - 1/12) * ((x-0.5)^2 - 1/12)/4 -
    ((abs(x-z)-0.5)^4 - (abs(x-z)-0.5)^2/2 + 7/240) / 24
}
spl.X <- function(x, knots){
  q <- length(knots) + 2                # number of parameters
  n <- length(x)                        # number of observations
  X <- matrix(1, n, q)                  # initialized model matrix
  X[,2] <- x                            # set second column to x
  X[,3:q] <- outer(x, knots, FUN = rk)  # remaining columns to cubic spline basis
  X
}

spl.S <- function(knots) {
  q = length(knots) + 2
  S = matrix(0, q, q)                        # initialize matrix
  S[3:q,3:q] = outer(knots, knots, FUN = rk) # fill in non-zero part
  S
}

# matrix square root function
mat.sqrt <- function(S){
  d  = eigen(S, symmetric = T)
  rS = d$vectors %*% diag(d$values^.5) %*% t(d$vectors)
  rS
}

# the fitting function
prs.fit <- function(y, x, knots, lambda){
  q  = length(knots) + 2   # dimension of basis
  n  = length(x)           # number of observations
  Xa = rbind(spl.X(x, knots), mat.sqrt(spl.S(knots)) * sqrt(lambda))  # augmented model matrix
  y[(n+1):(n+q)] = 0       # augment the data vector
  lm(y ~ Xa - 1)           # fit and return penalized regression spline
}
Example 1.
[Figure: wear index plotted against scaled engine size, with the unpenalized cubic regression spline fit overlaid.]
knots = 1:4/5
X     = spl.X(x, knots)     # generate model matrix
mod.1 = lm(wear ~ X - 1)    # fit model
xp    <- 0:100/100          # x values for prediction
Xp    <- spl.X(xp, knots)   # prediction matrix

# Base R plot
plot(x, wear, xlab = 'Scaled Engine size', ylab = 'Wear Index', pch = 19,
     col = "#FF8000", cex = .75, col.axis = 'gray50')
lines(xp, Xp %*% coef(mod.1), col = '#2957FF')  # plot the fitted curve

# ggplot
library(ggplot2)
ggplot(aes(x = x, y = wear), data = data.frame(x, wear)) +
  geom_point(color = "#FF8000") +
  geom_line(aes(x = xp, y = Xp %*% coef(mod.1)), data = data.frame(xp, Xp), color = "#2957FF") +
  ggtheme
Example 2.
[Figure: penalized regression spline fits of wear against x, one panel per smoothing parameter value: lambda = 0.1, 0.01, 0.001, 1e-04, 1e-05, and 1e-06.]
knots = 1:7/8
d2 = data.frame(x = xp)

for (i in c(.1, .01, .001, .0001, .00001, .000001)){
  mod.2 = prs.fit(wear, x, knots, i)  # fit penalized regression spline for this lambda
  Xp = spl.X(xp, knots)               # matrix to map parameters to fitted values at xp
  d2[, paste('lambda = ', i, sep = "")] = Xp %*% coef(mod.2)
}

### ggplot
library(ggplot2); library(reshape)
d3 = melt(d2, id = 'x')

ggplot(aes(x = x, y = wear), data = d) +
  geom_point(col = '#FF8000') +
  geom_line(aes(x = x, y = value), col = "#2957FF", data = d3) +
  facet_wrap(~variable) +
  ggtheme

### Base R approach
par(mfrow = c(2,3))
for (i in c(.1, .01, .001, .0001, .00001, .000001)){
  mod.2 = prs.fit(wear, x, knots, i)
  Xp = spl.X(xp, knots)
  plot(x, wear, main = paste('lambda = ', i), pch = 19,
       col = "#FF8000", cex = .75, col.axis = 'gray50')
  lines(xp, Xp %*% coef(mod.2), col = '#2957FF')
}
R Session Info
R version 3.0.2 (2013-09-25)
Base packages: base, datasets, graphics, grDevices, grid, methods,
stats, utils
Other packages: psych:1.3.10.12 mgcv:1.7-27 nlme:3.1-113
gridExtra:0.9.1 reshape2:1.2.2 ggplot2:0.9.3.1 MASS:7.3-29
References
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199-231.

Fox, J. (2000a). Multiple and Generalized Nonparametric Regression. SAGE.

Fox, J. (2000b). Nonparametric Simple Regression: Smoothing Scatterplots. SAGE.

Hardin, J. W. and Hilbe, J. M. (2012). Generalized Linear Models and Extensions, Third Edition. Stata Press.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. CRC Press.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, Mass.

Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3).

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer.

Wood, S. N. (2006). Generalized Additive Models: An Introduction with R, volume 66. CRC Press.