A. Data Science Methods

As marketing data scientists, we must speak the language of business—accounting,


finance, marketing, and management. We need to know about information technology,
including data structures, algorithms, and object-oriented programming. We must
understand statistical modeling, machine learning, mathematical programming, and
simulation methods. These are the things that we do:
• Information search. We begin by learning what others have done before, learning
from the literature. We draw on the work of academics and practitioners in many fields
of study, contributors to predictive analytics and data science.
• Preparing text and data. Text is unstructured or partially structured data that must
be prepared for analysis. We extract features from text. We define measures.
Quantitative data are often messy or missing. They may require transformation prior to
analysis. Data preparation consumes much of a data scientist’s time.
• Looking at data. We do exploratory data analysis, data visualization for the purpose
of discovery. We look for groups in data. We find outliers. We identify common
dimensions, patterns, and trends.
• Predicting how much. We are often asked to predict how many units or dollars of
product will be sold, the price of financial securities or real estate. Regression
techniques are useful for making these predictions.
• Predicting yes or no. Many business problems are classification problems. We use
classification methods to predict whether or not a person will buy a product, default on a
loan, or access a web page.
• Testing it out. We examine models with diagnostic graphics. We see how well a
model developed on one data set works on other data sets. We employ a training-and-test
regimen with data partitioning, cross-validation, or bootstrap methods (a code sketch
follows this list).
• Playing what-if. We manipulate key variables to see what happens to our
predictions. We play what-if games in simulated marketplaces. We employ sensitivity or
stress testing of mathematical programming models. We see how values of input
variables affect outcomes, payoffs, and predictions. We assess uncertainty about
forecasts.
• Explaining it all. Data and models help us understand the world. We turn what we
have learned into an explanation that others can understand. We present project results
in a clear and concise manner.
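To make the training-and-test regimen mentioned above concrete, here is a minimal sketch in Python. The scikit-learn package and the simulated data are our choices for illustration; any comparable open-source tool would serve.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Simulated data: three explanatory variables and a numeric response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Data partitioning: fit on the training set, evaluate on the hold-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R-squared on the test set:", round(model.score(X_test, y_test), 3))

# Cross-validation: repeat the train-and-test regimen across five folds.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Mean R-squared across folds:", round(scores.mean(), 3))
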
Prediction is distinct from explanation. We may not know why models work, but we
need to know when they work and be able to show others how they work. We identify the
most critical components of models and focus on the things that make a difference.¹

¹ Statisticians distinguish between explanatory and predictive models. Explanatory models are
designed to test causal theories. Predictive models are designed to predict new or future
observations. See Geisser (1993), Breiman (2001), and Shmueli (2010).
Data scientists are methodological eclectics, drawing from many scientific disciplines
and translating the results of empirical research into words and pictures that
management can understand. These presentations benefit from well-constructed data
visualizations. In communicating with management, data scientists need to go beyond
formulas, numbers, definitions of terms, and the magic of algorithms. Data scientists
convert the results of predictive models into simple, straightforward language that
others can understand.
Data scientists are knowledge workers par excellence. They are communicators playing
a critical role in today’s data-intensive world. Data scientists turn data into models and
models into plans for action.
The approach we have taken in this and other books in the Modeling Techniques series
has been to employ both classical and Bayesian methods. And sometimes we dispense
with traditional statistics entirely and rely on machine learning algorithms.
Within the statistical literature, Seymour Geisser introduced an approach best described
as Bayesian predictive inference (Geisser 1993). In emphasizing the success of
predictions in marketing data science, we are in agreement with Geisser. But our
approach is purely empirical and in no way dependent on classical or Bayesian thinking.
We do what works, following a simple premise:

The value of a model lies in the quality of its predictions.

We learn from statistics that we should quantify our uncertainty. On the one hand, we
have confidence intervals, point estimates with associated standard errors, significance
tests, and p-values—that is the classical way. On the other hand, we have posterior
probability distributions, probability intervals, prediction intervals, Bayes factors, and
subjective (perhaps diffuse) priors—the path of Bayesian statistics.
The role of data science in business has been discussed by many (Davenport and Harris
2007; Laursen and Thorlund 2010; Davenport, Harris, and Morison 2010; Franks
2012; Siegel 2013; Maisel and Cokins 2014; Provost and Fawcett 2014). In-depth
reviews of methods include those of Izenman (2008), Hastie, Tibshirani, and Friedman
(2009), and Murphy (2012).
Doing data science means implementing flexible, scalable, extensible systems for data
preparation, analysis, visualization, and modeling. We are empowered by the growth of
open source. Whatever the modeling technique or application, there is likely a relevant
package, module, or library that someone has written or is thinking of writing. Doing
data science with open-source tools is discussed in Conway and White (2012), Putler
and Krider (2012), James et al. (2013), Kuhn and Johnson (2013), Lantz (2013),
and Ledoiter (2013). Additional discussion of data science, modeling techniques in
predictive analytics, and open-source tools is provided in other books in the Modeling
Techniques series (Miller 2015a, 2015b, and 2015c).
This appendix identifies classes of methods and reviews selected methods in databases
and data preparation, statistics, machine learning, data visualization, and text analytics.
We provide an overview of these methods and cite relevant sources for further reading.

A.1 DATABASE SYSTEMS AND DATA PREPARATION

There have always been more data than we have time to analyze. What is new
today is the ease of collecting data and the low cost of storing data. Data come
from many sources. There are unstructured text data from online systems. There
are pixels from sensors and cameras. There are data from mobile phones, tablets,
and computers worldwide, located in space and time. Flexible, scalable, distributed
systems are needed to accommodate these data.
Relational databases have a row-and-column table structure, similar to a spreadsheet.
We access and manipulate these data using structured query language (SQL). Because
they are transaction-oriented with enforced data integrity, relational databases provide
the foundation for sales order processing and financial accounting systems.
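As a minimal sketch of SQL at work, the following Python snippet uses the standard library's sqlite3 module. The table and column names are hypothetical; the point is simply row-and-column storage, selection, and aggregation.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("A101", "West", 120.00), ("A102", "East", 75.50), ("A103", "West", 210.00)],
)

# Structured query language: select, group, and aggregate rows in the table.
for region, total in con.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
):
    print(region, total)
con.close()
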
It is easy to understand why non-relational (NoSQL) databases have received so much
attention. Non-relational databases focus on availability and scalability. They may
employ key-value, column-oriented, document-oriented, or graph structures. Some are
designed for online or real-time applications, where fast response times are key. Others
are well suited for massive storage and off-line analysis, with map-reduce providing a
key data aggregation tool.
Many firms are moving away from internally owned, centralized computing systems and
toward distributed cloud-based services. Distributed hardware and software systems,
including database systems, can be expanded more easily as the data management
needs of organizations grow.
Doing data science means being able to gather data from the full range of database
systems, relational and non-relational, commercial and open source. We employ
database query and analysis tools, gathering information across distributed systems,
collating information, creating contingency tables, and computing indices of
relationship across variables of interest. We use information technology and database
systems as far as they can take us, and then we do more, applying what we know about
statistical inference and the modeling techniques of predictive analytics.
Regarding analytics, we acknowledge an unwritten code in data science. We do not
select only the data we prefer. We do not change data to conform to what we would like
to see or expect to see. A two of clubs that destroys the meld is part of the natural
variability in the game and must be played with the other cards. We play the hand that is
dealt. The hallmarks of science are an appreciation of variability, an understanding of
sources of error, and a respect for data. Data science is science.
Raw data are unstructured, messy, and sometimes missing. But to use data in models,
they must be organized, clean, and complete. We are often asked to make a model out of
a mess. Management needs answers, and the data are replete with miscoded and
missing observations, outliers and values of dubious origin. We use our best judgement
in preparing data for analysis, recognizing that many decisions we make are subjective
and difficult to justify.
Missing data present problems in applied research because many modeling algorithms
require complete data sets. With large numbers of explanatory variables, most cases
have missing data on at least one of the variables. Listwise deletion of cases with missing
data is not an option. Filling in missing data fields with a single value, such as the mean,
median, or mode, would distort the distribution of a variable, as well as its relationship
with other variables. Filling in missing data fields with values randomly selected from
the data adds noise, making it more difficult to discover relationships with other
variables. Multiple imputation, which fills each missing field with several plausible
values and combines the resulting analyses, is the approach preferred by statisticians.
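The sketch below, written in Python with pandas on simulated data, shows the distortion that single-value (mean) imputation introduces. It illustrates the problem rather than the preferred multiple-imputation solution.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Make roughly 30 percent of the values missing at random.
x_missing = x.mask(rng.random(1000) < 0.30)

# Filling with a single value (the mean) shrinks the spread of the variable.
mean_filled = x_missing.fillna(x_missing.mean())
print("standard deviation, complete data:", round(x.std(), 2))
print("standard deviation, mean-imputed: ", round(mean_filled.std(), 2))
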
Garcia-Molina, Ullman, and Widom (2009) and Connolly and Begg (2015) review
database systems with a focus on the relational model. Worsley and Drake
(2002) and Obe and Hsu (2012) review PostgreSQL. White (2011), Chodorow (2013),
and Robinson, Webber, and Eifrem (2013) review selected non-relational systems. For
MongoDB document database examples, see Copeland (2013) and Hoberman (2014).
For map-reduce operations, see Dean and Ghemawat (2004) and Rajaraman and
Ullman (2012).
Osborne (2013) provides an overview of data preparation issues, and the edited volume
by McCallum (2013) provides much needed advice about what to do with messy data.
Missing data methods are discussed in various sources (Rubin 1987; Little and Rubin
1987; Schafer 1997; Lumley 2010; Snijders and Bosker 2012), with methods
implemented in R packages from Gelman et al. (2014), Honaker, King, and Blackwell
(2014), and Lumley (2014).

A.2 CLASSICAL AND BAYESIAN STATISTICS


How shall we draw inferences from data? Formal scientific method suggests that
we construct theories and test those theories with sample data. The process
involves drawing statistical inferences as point estimates, interval estimates, or
tests of hypotheses about the population. Whatever the form of inference, we need
sample data relating to questions of interest. For valid use of statistical methods we
desire a random sample from the population.
Which statistics do we trust? Statistics are functions of sample data, and we have more
faith in statistics when samples are representative of the population. Large random
samples, small standard errors, and narrow confidence intervals are preferred.
Classical and Bayesian statistics represent alternative approaches to inference,
alternative ways of measuring uncertainty about the world. Classical hypothesis testing
involves making null hypotheses about population parameters and then rejecting or not
rejecting those hypotheses based on sample data. Typical null hypotheses (as the
word null would imply) state that there is no difference between proportions or group
means, or no relationship between variables. Null hypotheses may also refer to
parameters in models involving many variables.
To test a null hypothesis, we compute a special statistic called a test statistic along with
its associated p-value. Assuming that the null hypothesis is true, we can derive the
theoretical distribution of the test statistic. We obtain a p-value by referring the sample
test statistic to this theoretical distribution. The p-value, itself a sample statistic, gives
the probability of observing a test statistic at least as extreme as the one observed, under
the assumption that the null hypothesis is true.
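For concreteness, here is a minimal sketch of a classical test in Python with scipy on simulated data: a two-sample t test of the null hypothesis of no difference between group means. The package and the numbers are ours, for illustration only.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=100, scale=15, size=60)   # e.g., spending under offer A
group_b = rng.normal(loc=106, scale=15, size=60)   # e.g., spending under offer B

# Test statistic and its p-value under the null hypothesis of equal means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", round(t_stat, 3), " p-value:", round(p_value, 4))
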
Let us assume that the conditions for valid inference have been satisfied. Then, when we
observe a very low p-value (0.05, 0.01, or 0.001, for instance), we know that one of two
things must be true: either (1) an event of very low probability has occurred under the
assumption that the null hypothesis is true or (2) the null hypothesis is false. A low p-
value leads us to reject the null hypothesis, and we say the research results are
statistically significant. Some results are statistically significant and meaningful. Others
are statistically significant and picayune.
For applied research in the classical tradition, we look for statistics with low p-values.
We define null hypotheses as straw men with the intention of rejecting them. When
looking for differences between groups, we set up a null hypothesis that there are no
differences between groups. In studying relationships between variables, we create null
hypotheses of independence between variables and then collect data to reject those
hypotheses. When we collect sufficient data, testing procedures have statistical power.
Variability is both our enemy and our friend. It is our enemy when it arises from
unexplained sources or from sampling variability—the values of statistics vary from one
sample to the next. But variability is also our friend because, without variability, we
would be unable to see relationships between variables.²

² To see the importance of variability in the discovery of relationships, we can begin with a scatter
plot of two variables with a high correlation. Then we restrict the range of one of the variables. More
often than not, the resulting scatter plot within the window of the restricted range will exhibit a lower
correlation.
While the classical approach treats parameters as fixed, unknown quantities to be
estimated, the Bayesian approach treats parameters as random variables. In other
words, we can think of parameters as having probability distributions representing
our uncertainty about the world.
The Bayesian approach takes its name from Bayes’ theorem, a famous theorem in
statistics. In addition to making assumptions about population distributions, random
samples, and sampling distributions, we can make assumptions about population
parameters. In taking a Bayesian approach, our job is first to express our degree of
uncertainty about the world in the form of a probability distribution and then to reduce
that uncertainty by collecting relevant sample data.
How do we express our uncertainty about parameters? We specify prior probability
distributions for those parameters. Then we use sample data and Bayes’ theorem to
derive posterior probability distributions for those same parameters. The Bayesian
obtains conditional probability estimates from posterior distributions.
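A minimal sketch of this prior-to-posterior updating, assuming a conjugate beta-binomial model for a response rate μ and a hypothetical sample, is shown below in Python with scipy.

from scipy import stats

prior_a, prior_b = 1.0, 1.0        # diffuse (uniform) prior for the response rate
successes, trials = 27, 200        # hypothetical sample: 27 buyers among 200 prospects

# With a conjugate beta prior, Bayes' theorem gives a beta posterior.
posterior = stats.beta(prior_a + successes, prior_b + trials - successes)
print("posterior mean:", round(posterior.mean(), 3))
print("95 percent probability interval:",
      [round(q, 3) for q in posterior.interval(0.95)])
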
Many argue that Bayesian statistics provides a logically consistent approach to empirical
research. Forget the null hypothesis and focus on the research question of interest—the
scientific hypothesis. There is no need to talk about confidence intervals when we can
describe uncertainty with a probability interval. There is no need to make decisions
about null hypotheses when we can view all scientific and business problems from a
decision-theoretic point of view (Robert 2007). A Bayesian probabilistic perspective can
be applied to machine learning as well as traditional statistical models (Murphy 2012).
It may be a challenge to derive mathematical formulas for posterior probability
distributions. Indeed, for many research problems, it is impossible to derive formulas
for posterior distributions. This does not stop us from using Bayesian methods,
however, because computer programs can generate or estimate posterior distributions.
Markov chain Monte Carlo simulation is at the heart of Bayesian practice (Tanner
1996; Albert 2009; Robert and Casella 2009; Suess and Trumbo 2010).
Bayesian statistics is alive and well today because it helps us solve real-world problems
(McGrayne 2011; Flam 2014). In the popular press, Silver (2012) makes a strong
argument for taking a Bayesian approach to predictive models. As Efron (1986) points
out, however, there are good reasons why everyone is not a Bayesian.
There are many works from which to learn about classical inference (Fisher 1970; Fisher
1971; Snedecor and Cochran 1989; Hinkley, Reid, and Snell 1991; Stuart, Ord, and
Arnold 2010; O’Hagan 2010; Wasserman 2010). There are also many good sources for
learning about Bayesian methods (Geisser 1993; Gelman, Carlin, Stern, and Rubin
1995; Carlin and Louis 1996; Robert 2007).
When asked if the difference between two groups could have arisen by chance, we might
prefer a classical approach. We estimate a p-value as a conditional probability, given a
null hypothesis of no difference between the groups. But when asked to estimate the
probability that the share price of Apple stock will be above $100 at the beginning of the
next calendar year, we may prefer a Bayesian approach. Which is better, classical or
Bayesian? It does not matter. We need both. Which is better, Python or R? It does not
matter. We need both.

A.3 REGRESSION AND CLASSIFICATION


Much of the work of data science involves a search for meaningful relationships
between variables. We look for relationships between pairs of continuous variables
using scatter plots and correlation coefficients. We look for relationships between
categorical variables using contingency tables and the methods of categorical data
analysis. We use multivariate methods and multi-way contingency tables to
examine relationships among many variables. And we build predictive models.
There are two main types of predictive models: regression and classification.
Regression is prediction of a response of meaningful magnitude. Classification involves
prediction of a class or category. In the language of machine learning, these are methods
of supervised learning.
The most common form of regression is least-squares regression, also called ordinary
least-squares regression, linear regression, or multiple regression. When we use
ordinary least-squares regression, we estimate regression coefficients so that they
minimize the sum of the squared residuals, where residuals are differences between the
observed and predicted response values. For regression problems, we think of the
response as taking any value along the real number line, although in practice the
response may take a limited number of distinct values. The important thing for
regression is that the response values have meaningful magnitude.
Poisson regression is useful for counts. The response has meaningful magnitude but
takes discrete (whole number) values with a minimum value of zero. Log-linear models
for frequencies, grouped frequencies, and contingency tables for cross-classified
observations fall within this domain.
For models of events, duration, and survival, as in survival analysis, we must often
accommodate censoring, in which some observations are measured precisely and others
are not. With left censoring, all we know about imprecisely measured observations is
that they are less than some value. With right censoring, all we know about imprecisely
measured observations is that they are greater than some value.
A good example of a duration or survival model in marketing is customer lifetime
estimation. We know the lifetime or tenure of a customer only after that person stops
being our customer. For current customers, lifetime is imprecisely measured—it is right
censored.
Most traditional modeling techniques involve linear models or linear equations. The
response or transformed response is on the left-hand side of the linear model.
The linear predictor is on the right-hand side. The linear predictor involves explanatory
variables and is linear in its parameters. That is, it involves the addition of coefficients
or the multiplication of coefficients by the explanatory variables. The coefficients we fit
to linear models represent estimates of population parameters.
Generalized linear models, as their name would imply, are generalizations of the
classical linear regression model. They include models for choices and counts, including
logistic regression, multinomial logit models, log-linear models, ordinal logistic models,
Poisson regression, and survival data models. To introduce the theory behind these
important models, we begin by reviewing the classical linear regression model.
We can write the classical linear regression model in matrix notation as

y = Xb + e

y is an n × 1 vector of responses. X is an n × p matrix, with p being the number of
parameters being estimated. Often the first column of X is a column of ones for the
constant or intercept term in the model; additional columns are for parameters
associated with explanatory variables. b is a p × 1 vector of parameter estimates. That is
to say that Xb is the linear predictor in matrix notation. The error vector e represents
independent and identically distributed errors; it is common to assume a Gaussian or
normal distribution with mean zero.
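A minimal sketch of this model with simulated data, estimating b by ordinary least squares with NumPy, follows; the coefficient values and sample size are hypothetical.

import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n),            # column of ones for the intercept term
                     rng.normal(size=n),    # explanatory variable 1
                     rng.normal(size=n)])   # explanatory variable 2
b_true = np.array([2.0, 0.8, -1.2])
e = rng.normal(scale=0.5, size=n)           # iid Gaussian errors with mean zero
y = X @ b_true + e                          # y = Xb + e

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least-squares estimates
print("estimated b:", np.round(b_hat, 2))
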
The assumptions of the classical linear regression model give rise to classical methods of
statistical inference, including tests of regression parameters and analyses of variance.
These methods apply equally well to observational and experimental studies.
Parameters in the classical linear regression model are estimated by ordinary least
squares. There are many variations on the theme, including generalized least squares
and a variety of econometric models for time-series regression, panel (longitudinal) data
models, and hierarchical models in general. There are also Bayesian alternatives for
most classical models. For now, we focus on classical inference and the simplest error
structure—independent, identically distributed (iid) errors.
Let y be one element from the response vector, corresponding to one observation from
the sample, and let x be its corresponding row from the matrix X. Because the mean or
expected value of the errors is zero, we observe that

E[y] = μ = xb

That is, μ, the mean of the response, is equal to the linear predictor. Generalized linear
models build from this fact. If we were to write in functional notation g(μ) = xb, then,
for the Gaussian distribution of classical linear regression, g is the identity
function: g(μ) = μ.
Suppose g(μ) were the logit transformation. We would have the logistic regression
model:

g(μ) = log(μ / (1 - μ)) = xb

Knowing that the exponential function is the inverse of the natural logarithm, we solve
for μ as

μ = exp(xb) / (1 + exp(xb))
For every observation in the sample, the expected value of the binary response is a
proportion. It represents the probability that one of two events will occur. It has a value
between zero and one. In generalized linear model parlance, g is called the link function.
It links the mean of the response to the linear predictor.
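A minimal sketch of logistic regression as a generalized linear model with the logit link, fit with the statsmodels package on simulated binary responses (the package and the data are our choices for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)                         # column of ones plus one explanatory variable
mu = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))   # inverse logit of the linear predictor
y = rng.binomial(1, mu)                        # binary (Bernoulli) responses

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)                              # estimates of the intercept and slope
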
In most of the choice studies in this book, each observation is a binary response.
Customers choose to stay with their current telephone service or switch to another
service. Commuters choose to drive their cars or take the train. Probability theorists
think of binary responses as Bernoulli trials with the proportion or
probability μ representing the mean of the response. For n observations in a sample, we

have mean nμ and variance nμ(1 - μ). The distribution is binomial.³

³ In many statistical treatments of this subject, π, rather than μ, is used to represent the mean of the
response. We use μ here to provide a consistent symbol across the class of generalized linear models.
Table A.1 provides an overview of the most important generalized linear models for
work in business and economics. Classical linear regression has an identity link. Logistic
regression uses the logit link. Poisson regression and log-linear models use a log link.
We work with Gaussian (normal), binomial, and Poisson distributions, which are in the
exponential family of distributions. Generalized linear models have linear predictors of
the form Xb. This is what makes them linear models; they involve functions of
explanatory variables that are linear in their parameters.

Table A.1. Three Generalized Linear Models

Model                         Link Function    Response Distribution
Classical linear regression   identity         Gaussian (normal)
Logistic regression           logit            binomial
Poisson regression            log              Poisson

Generalized linear models help us model what are obvious nonlinear relationships
between explanatory variables and responses. Except for the special case of the
Gaussian or normal model, which has an identity link, the link function is nonlinear.
Also, unlike the normal model, there is often a relationship between the mean and
variance of the underlying distribution.
The binomial distribution builds on individual binary responses. Customers order or do
not order, respond to a direct marketing mailing or not. Customers choose to stay with
their current telephone service or switch to another service. This type of problem lends
itself to logistic regression and the use of the logit link. Note that the multinomial logit
model is a natural extension of logistic regression. Multinomial logit models are useful
in the analysis of multinomial response variables. A customer chooses Coke, Pepsi, or
RC Cola. A commuter drives, takes the train or bus, or walks to work.
When we record choices over a period of time or across a group of individuals, we get
counts or frequencies. Counts arranged in multi-way contingency tables comprise the
raw data of categorical data analysis. We also use the Poisson distribution and the log
link for categorical data analysis and log-linear modeling. As we have seen from our
discussion of the logit transformation, the log function converts a variable defined on
the domain of positive reals into a variable defined on the range of all real numbers.
This is why it works for counts.
The Poisson distribution is discrete on the domain of non-negative integers. It is used
for modeling counts, as in Poisson regression. The insurance company counts the
number of claims over the past year. A retailer counts the number of customers
responding to a sales promotion or the number of stock units sold. A nurse counts the
number of days a patient stays in the hospital. An auto dealer counts the number of days
a car stays on the lot before it sells.
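A minimal sketch of Poisson regression with the log link, again using statsmodels on simulated counts (say, units sold with and without a promotion; the data are hypothetical):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 400
promotion = rng.binomial(1, 0.5, size=n)     # promotion indicator for each store-week
X = sm.add_constant(promotion)
mu = np.exp(1.0 + 0.7 * promotion)           # log link: log(mu) equals the linear predictor
y = rng.poisson(mu)                          # observed counts of units sold

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                            # coefficients on the log scale
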
Linear regression is a special generalized linear model. It has normally distributed
responses and an identity link relating the expected value of responses to the linear
predictor. Linear regression coefficients may be estimated by ordinary least squares. For
other members of the family of generalized linear models we use maximum likelihood
estimation. With the classical linear model we have analysis of variance and F-tests.
With generalized linear models we have analysis of deviance and likelihood ratio tests,
which are asymptotic chi-square tests.
There are close connections among generalized linear models for the analysis of choices
and counts. Alternative formulations often yield comparable results. The multinomial
model looks at the distribution of counts across response categories with a fixed sum of
counts (the sample size). For the Poisson model, counts, being associated with
individual cases, are random variables, and the sum of counts is not known until
observed in the sample. But we can use the Poisson distribution for the analysis of
multinomial data. Log-linear models make no explicit distinction between response and
explanatory variables. Instead, frequency counts act as responses in the model. But, if
we focus on appropriate subsets of linear predictors, treating response variables as
distinct from explanatory variables, log-linear models yield results comparable to
logistic regression. The Poisson distribution and the log link are used for log-linear
models.
When communicating with managers, we often use R-squared or the coefficient of
determination as an index of goodness of fit. This is a quantity that is easy to explain to
management as the proportion of response variance accounted for by the model. An
alternative index that many statisticians prefer is the root-mean-square error (RMSE),
which is an index of badness or lack of fit. Other indices of badness of fit, such as the
percentage error in prediction, are sometimes preferred by managers.
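A minimal sketch computing both indices for a small set of observed and predicted responses (the numbers are illustrative):

import numpy as np

y_observed  = np.array([10.0, 12.5, 9.0, 14.0, 11.0])
y_predicted = np.array([10.5, 12.0, 9.5, 13.0, 11.5])

residuals = y_observed - y_predicted
rmse = np.sqrt(np.mean(residuals ** 2))                             # badness-of-fit index
r_squared = 1.0 - (np.sum(residuals ** 2) /
                   np.sum((y_observed - y_observed.mean()) ** 2))   # goodness of fit

print("R-squared:", round(r_squared, 3), " RMSE:", round(rmse, 3))
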
The method of logistic regression, although called “regression,” is actually a
classification method. It involves the prediction of a binary response. Ordinal and
multinomial logit models extend logistic regression to problems involving more than
two classes. Linear discriminant analysis is another classification method from the
domain of traditional statistics. The benchmark study of text classification in the
chapter on sentiment analysis employed logistic regression and a number of machine
learning algorithms for classification.
Evaluating classifier performance presents a challenge because many problems are low
base rate problems. Fewer than five percent of customers may respond to a direct mail
campaign. Disease rates, loan default, and fraud are often low base rate events. When
evaluating classifiers in the context of low base rates, we must look beyond the
percentage of events correctly predicted. Based on the four-fold table known as
the confusion matrix, figure A.1 provides an overview of various indices available for
evaluating binary classifiers.
Figure A.1. Evaluating the Predictive Accuracy of a Binary Classifier

Summary statistics such as Kappa (Cohen 1960) and the area under the receiver
operating characteristic (ROC) curve are sometimes used to evaluate classifiers. Kappa
depends on the probability cut-off used in classification. The area under the ROC curve
does not.
The area under the ROC curve is a preferred index of classification performance for low-
base-rate problems. The ROC curve is a plot of the true positive rate against the false
positive rate. It shows the tradeoff between sensitivity and specificity and measures how
well the model separates positive from negative cases.
The area under the ROC curve provides an index of predictive accuracy independent of
the probability cut-off that is being used to classify cases. Perfect prediction corresponds
to an area of 1.0 (curve that touches the top-left corner). An area of 0.5 depicts random
(null-model) predictive accuracy.
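A minimal sketch of these indices with scikit-learn, using hypothetical class labels and predicted probabilities for a low-base-rate problem:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])        # few positive cases
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.20, 0.35,
                    0.40, 0.55, 0.30, 0.80])               # predicted probabilities

# Area under the ROC curve: independent of any single probability cut-off.
print("area under the ROC curve:", round(roc_auc_score(y_true, y_score), 3))

# The curve itself: true positive rate plotted against false positive rate.
fpr, tpr, _ = roc_curve(y_true, y_score)
print(np.round(fpr, 2), np.round(tpr, 2))
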
Useful references for linear regression include Draper and Smith (1998), Harrell
(2001), Chatterjee and Hadi (2012), and Fox and Weisberg (2011). Data-adaptive
regression methods and machine learning algorithms are reviewed in Berk
(2008), Izenman (2008), and Hastie, Tibshirani, and Friedman (2009). For traditional
nonlinear models, see Bates and Watts (2007).
Of special concern to data scientists is the structure of the regression model. Under what
conditions should we transform the response or selected explanatory variables? Should
interaction effects be included in the model? Regression diagnostics are data
visualizations and indices we use to check on the adequacy of regression models and to
suggest variable transformations. Discussion may be found in Belsley, Kuh, and Welsch
(1980) and Cook (1998). The base R system provides many diagnostics, and Fox and
Weisberg (2011) provide additional diagnostics. Diagnostics may suggest that
transformations of the response or explanatory variables are needed in order to meet
model assumptions or improve predictive performance. A theory of power
transformations is provided in Box and Cox (1964) and reviewed by Fox and Weisberg
(2011).
When defining parametric models, we would like to include the right set of explanatory
variables in the right form. Having too few variables or omitting key explanatory
variables can result in biased predictions. Having too many variables, on the other hand,
may lead to over-fitting and high out-of-sample prediction error. This bias-variance
tradeoff, as it is sometimes called, is a statistical fact of life.
Shrinkage and regularized regression methods provide mechanisms for tuning,
smoothing, or adjusting model complexity (Tibshirani 1996; Hoerl and Kennard 2000).
Alternatively, we can select subsets of explanatory variables to go into predictive models.
Special methods are called into play when the number of parameters being estimated is
large, perhaps exceeding the number of observations (Bühlmann and van de Geer 2011).
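A minimal sketch of shrinkage in action with scikit-learn: ridge and lasso regression pull coefficient estimates toward zero relative to ordinary least squares. The data are simulated, and the tuning values of alpha are arbitrary choices for illustration.

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(6)
n, p = 50, 20                                        # few observations, many predictors
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)               # only the first variable matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)                  # alpha tunes the amount of shrinkage
lasso = Lasso(alpha=0.1).fit(X, y)                   # lasso can shrink coefficients to zero

print("largest OLS coefficient:   ", round(float(np.abs(ols.coef_).max()), 2))
print("largest ridge coefficient: ", round(float(np.abs(ridge.coef_).max()), 2))
print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
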
For additional discussion of the bias-variance tradeoff, regularized regression, and
subset selection, see Izenman (2008) and Hastie, Tibshirani, and Friedman
(2009). Graybill (1961, 2000) and Rencher and Schaalje (2008) review linear models.
Generalized linear models are discussed in McCullagh and Nelder (1989) and Firth
(1991). Kutner, Nachtsheim, Neter, and Li (2004) provide a comprehensive review of
linear and generalized linear models, including discussion of their application in
experimental design. R methods for the estimation of linear and generalized linear
models are reviewed in Chambers and Hastie (1992) and Venables and Ripley (2002).
The standard reference for generalized linear models is McCullagh and Nelder
(1989). Firth (1991) provides additional review of the underlying theory. Hastie
(1992) and Venables and Ripley (2002) give modeling examples with S/SPlus, most of
which are easily duplicated in R. Lindsey (1997) discusses a wide range of application
examples. See Christensen (1997), Le (1998), and Hosmer, Lemeshow, and Sturdivant
(2013) for discussion of logistic regression. Lloyd (1999) provides an overview of
categorical data analysis. See Fawcett (2003) and Sing et al. (2005) for further
discussion of the ROC curve. Discussion of alternative methods for evaluating classifiers
is provided in Hand (1997) and Kuhn and Johnson (2013).
For Poisson regression and the analysis of multi-way contingency tables, useful
references include Bishop, Fienberg, and Holland (1975), Cameron and Trivedi
(1998), Fienberg (2007), Tang, He, and Tu (2012), and Agresti (2013). Reviews of
survival data analysis have been provided by Andersen, Borgan, Gill, and Keiding
(1993), Le (1997), Therneau and Grambsch (2000), Harrell (2001), Nelson
(2003), Hosmer, Lemeshow, and May (2013), and Allison (2010), with programming
solutions provided by Therneau (2014) and Therneau and Crowson (2014). Wassertheil-
Smoller (1990) provides an elementary introduction to classification procedures and the
evaluation of binary classifiers. For a more advanced treatment, see Hand
(1997). Burnham and Anderson (2002) review model selection methods, particularly
those using the Akaike information criterion or AIC (Akaike 1973).
We sometimes consider robust regression methods when there are influential outliers or
extreme observations. Robust methods represent an active area of research using
statistical simulation tools (Fox 2002; Koller and Stahel 2011; Maronna, Martin, and
Yohai 2006; Maechler 2014b; Koller 2014). Huet et al. (2004) and Bates and Watts
(2007) review nonlinear regression, and Harrell (2001) discusses spline functions for
regression problems.

A.4 DATA MINING AND MACHINE LEARNING


Recommender systems, collaborative filtering, association rules, optimization
methods based on heuristics, as well as a myriad of methods for regression,
classification, and clustering fall under the rubric of machine learning.
We use the term “machine learning” to refer to the methods or algorithms that we use as
an alternative to traditional statistical methods. When we apply these methods in the
analysis of data, we use the term “data mining.” Rajaraman and Ullman (2012) describe
data mining as the “discovery of models for data.” Regarding the data themselves, we
are often referring to massive data sets.
With traditional statistics, we define the model specification prior to working with the
data. With traditional statistics, we often make assumptions about the population
distributions from which the data have been drawn. Machine learning, on the other
hand, is data-adaptive. The model specification is defined by applying algorithms to the
data. With machine learning, few assumptions are made about the underlying
distributions of the data.
Machine learning methods often perform better than traditional linear or logistic
regression methods, but explaining why they work is not easy. Machine learning models
are sometimes called black box models for a reason. The underlying algorithms can
yield thousands of formulas or nodal splits fit to the training data.
Extensive discussion of machine learning algorithms may be found in Duda, Hart, and
Stork (2001), Izenman (2008), Hastie, Tibshirani, and Friedman (2009), Kuhn and
Johnson (2013), Tan, Steinbach, and Kumar (2006), and Murphy (2012). Bacon
(2002) describes their application in marketing.
Hothorn et al. (2005) review principles of benchmark study design, and Schauerhuber
et al. (2008) show a benchmark study of classification methods. Alfons (2014a) provides
cross-validation tools for benchmark studies. Benchmark studies, also known as
statistical simulations or statistical experiments, may be conducted with programming
packages designed for this type of research (Alfons 2014b; Alfons, Templ, and Filzmoser
2014).
Duda, Hart, and Stork (2001), Tan, Steinbach, and Kumar (2006), Hastie, Tibshirani,
and Friedman (2009), and Rajaraman and Ullman (2012) introduce clustering from a
machine learning perspective. Everitt, Landau, Leese, and Stahl (2011) and Kaufman and
Rousseeuw (1990) review traditional clustering methods. Izenman (2008) provides a
review of traditional clustering, self-organizing maps, fuzzy clustering, model-based
clustering, and biclustering (block clustering).
Within the machine learning literature, cluster analysis is referred to as unsupervised
learning to distinguish it from classification, which is supervised learning, guided by
known, coded values of a response variable or class. Association rules modeling,
frequent itemsets, social network analysis, link analysis, recommender systems, and
many multivariate methods as we employ them in data science represent unsupervised
learning methods.
An important multivariate method, principal component analysis, draws on linear
algebra and gives us a way to reduce the number of measures or quantitative features we
use to describe domains of interest. Long a staple of measurement experts and a
prerequisite of factor analysis, principal component analysis has seen recent
applications in latent semantic analysis, a technology for identifying important topics
across a document corpus (Blei, Ng, and Jordan 2003; Murphy 2012; Ingersoll, Morton,
and Farris 2013).
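A minimal sketch of principal component analysis with scikit-learn, reducing four correlated measures to two components; the simulated data and the choice of package are ours.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
latent = rng.normal(size=(200, 1))                        # one underlying dimension
X = np.hstack([latent + 0.2 * rng.normal(size=(200, 1))   # four noisy measures of it
               for _ in range(4)])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                             # component scores per observation
print("variance explained by each component:",
      np.round(pca.explained_variance_ratio_, 3))
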
When some observations in the training set have coded responses and others do not, we
employ a semi-supervised learning approach. The set of coded observations for the
supervised component can be small relative to the set of uncoded observations for the
unsupervised component (Liu 2011).
Leisch and Gruen (2014) describe programming packages for various clustering
algorithms. Methods developed by Kaufman and Rousseeuw (1990) have been
implemented in R programs by Maechler (2014a), including silhouette modeling and
visualization techniques for determining the number of clusters. Silhouettes were
introduced by Rousseeuw (1987), with additional documentation and examples
provided in Kaufman and Rousseeuw (1990) and Izenman (2008).
Thinking more broadly about machine learning, we see it as a subfield of artificial
intelligence (Luger 2008; Russell and Norvig 2009). Machine learning encompasses
biologically-inspired methods, genetic algorithms, and heuristics, which may be used to
address complex optimization, scheduling, and systems design problems (Mitchell
1996; Engelbrecht 2007; Michalewicz and Fogel 2004; Brownlee 2011).

A.5 DATA VISUALIZATION


Data visualization is critical to the work of data science. Examples in this book
demonstrate the importance of data visualization in discovery, diagnostics, and
design. We employ tools of exploratory data analysis (discovery) and statistical
modeling (diagnostics). In communicating results to management, we use
presentation graphics (design).
Statistical summaries fail to tell the story of data. To understand data, we must look
beyond data tables, regression coefficients, and the results of statistical tests.
Visualization tools help us learn from data. We explore data, discover patterns in data,
identify groups of observations that go together and unusual observations or outliers.
We note relationships among variables, sometimes detecting underlying dimensions in
the data.
Graphics for exploratory data analysis are reviewed in classic references by Tukey
(1977) and Tukey and Mosteller (1977). Regression graphics are covered by Cook
(1998), Cook and Weisberg (1999), and Fox and Weisberg (2011). Statistical graphics
and data visualization are illustrated in the works of Tufte
(1990, 1997, 2004, 2006), Few (2009), and Yau (2011, 2013). Wilkinson
(2005) presents a review of human perception and graphics, as well as a conceptual
structure for understanding statistical graphics. Cairo (2013) provides a general review
of information graphics. Heer, Bostock, and Ogievetsky (2010) demonstrate
contemporary visualization techniques for web distribution. When working with very
large data sets, special methods may be needed, such as partial transparency and hexbin
plots (Unwin, Theus, and Hofmann 2006; Carr, Lewin-Koh, and Maechler 2014; Lewin-
Koh 2014).
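As a minimal sketch of one such method, the following Python snippet uses matplotlib's hexbin plot to summarize a large scatter of simulated points; the plotting package and the output file name are our choices for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(scale=0.8, size=100_000)

# Hexagonal binning shows point density where a plain scatter plot would overplot.
plt.hexbin(x, y, gridsize=40, cmap="Blues")
plt.colorbar(label="count per bin")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("hexbin-demo.png")   # hypothetical output file name
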
R is particularly strong in data visualization. An R graphics overview is provided
by Murrell (2011). R lattice graphics, discussed by Sarkar (2008, 2014), build on the
conceptual structure of an earlier system called S-Plus Trellis™ (Cleveland 1993; Becker
and Cleveland 1996). Wilkinson’s (2005) “grammar of graphics” approach has been
implemented in the Python ggplot package (Lamp 2014) and in the R ggplot2 package
(Wickham and Chang 2014), with R programming examples provided by Chang
(2013). Cairo (2013) and Zeileis, Hornik, and Murrell (2009, 2014) provide advice about
colors for statistical graphics. Ihaka et al. (2014) show how to specify colors in R by hue,
chroma, and luminance.

A.6 TEXT AND SENTIMENT ANALYSIS


Text analytics draws from a variety of disciplines, including linguistics,
communication and language arts, experimental psychology, political discourse
analysis, journalism, computer science, and statistics. And, given the amount of
text being gathered and stored by organizations, text analytics is an important and
growing area of predictive analytics.
We have discussed web crawling, scraping, and parsing. The output from these
processes is a document collection or text corpus. This document collection or corpus is
in the natural language. The two primary ways of analyzing a text corpus are the bag of
words approach and natural language processing. We parse the corpus further,
creating commonly formatted expressions, indices, keys, and matrices that are more
easily analyzed by computer. This additional parsing is sometimes referred to as text
annotation. We extract features from the text and then use those features in subsequent
analyses.
Natural language is what we speak and write every day. Natural language processing is
more than a matter of collecting individual words. Natural language conveys meaning.
Natural language documents contain paragraphs, paragraphs contain sentences, and
sentences contain words. There are grammatical rules, with many ways to convey the
same idea, along with exceptions to rules and rules about exceptions. Words used in
combination and the rules of grammar comprise the linguistic foundations of text
analytics as shown in figure A.2.
Figure A.2. Linguistic Foundations of Text Analytics
Source: Adapted from Pinker (1999).

Linguists study natural language, the words and the rules that we use to form
meaningful utterances. “Generative grammar” is a general term for the rules;
“morphology,” “syntax,” and “semantics” are more specific terms. Computer programs
for natural language processing use linguistic rules to mimic human communication and
convert natural language into structured text for further analysis.
Natural language processing is a broad area of academic study itself, and an important
area of computational linguistics. The location of words in sentences is a key to
understanding text. Words follow a sequence, with earlier words often more important
than later words, and with early sentences and paragraphs often more important than
later sentences and paragraphs.
Words in the title of a document are especially important to understanding the meaning
of a document. Some words occur with high frequency and help to define the meaning of
a document. Other words, such as the definite article “the” and the indefinite articles “a”
and “an,” as well as many prepositions and pronouns, occur with high frequency but
have little to do with the meaning of a document. These stop words are dropped from
the analysis.
The features or attributes of text are often associated with terms—collections of words
that mean something special. There are collections of words relating to the same
concept or word stem. The words “marketer,” “marketeer,” and “marketing” build on the
common word stem “market.” There are syntactic structures to consider, such as
adjectives followed by nouns and nouns followed by nouns. Most important to text
analytics are sequences of words that form terms. The words “New” and “York” have
special meaning when combined to form the term “New York.” The words “financial”
and “analysis” have special meaning when combined to form the term “financial
analysis.” We often employ stemming, which is the identification of word stems,
dropping suffixes (and sometimes prefixes) from words. More generally, we are parsing
natural language text to arrive at structured text.
In English, it is customary to place the subject before the verb and the object after the
verb. In English, verb tense is important. The sentence “Daniel carries the Apple
computer,” can have the same meaning as the sentence “The Apple computer is carried
by Daniel.” “Apple computer,” the object of the active verb “carry,” is the subject of the
passive verb “is carried.” Understanding that the two sentences mean the same thing is
an important part of building intelligent text applications.
A key step in text analysis is the creation of a terms-by-documents matrix (sometimes
called a lexical table). The rows of this data matrix correspond to words or word stems
from the document collection, and the columns correspond to documents in the
collection. The entry in each cell of a terms-by-documents matrix could be a binary
indicator for the presence or absence of a term in a document, a frequency count of the
number of times a term is used in a document, or a weighted frequency indicating the
importance of a term in a document.
Figure A.3 illustrates the process of creating a terms-by-documents matrix. The first
document comes from Steven Pinker’s Words and Rules (1999, p. 4), the second from
Richard K. Belew’s Finding Out About (2000, p. 73). Terms correspond to words or
word stems that appear in the documents. In this example, each matrix entry represents
the number of times a term appears in a document. We treat nouns, verbs, and
adjectives similarly in the definition of stems. The stem “combine” represents both the
verb “combine” and the noun “combination.” Likewise, “function” represents the verb,
noun, and adjective form “functional.” An alternative system might distinguish among
parts of speech, permitting more sophisticated syntactic searches across documents.
After being created, the terms-by-documents matrix is like an index, a mapping of
document identifiers to terms (keywords or stems) and vice versa. For information
retrieval systems or search engines we might also retain information regarding the
specific location of terms within documents.
Figure A.3. Creating a Terms-by-Documents Matrix
Source: Adapted from Miller (2005).

Typical text analytics applications have many more terms than documents, resulting in
sparse rectangular terms-by-documents matrices. To obtain meaningful results for text
analytics applications, analysts examine the distribution of terms across the document
collection. Very low frequency terms, those used in few documents, are dropped from
the terms-by-documents matrix, reducing the number of rows in the matrix.
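A minimal sketch of building such a matrix with scikit-learn's CountVectorizer on three short illustrative documents follows. The vectorizer returns a documents-by-terms matrix, so we transpose it; raising min_df would drop very low frequency terms, and common English stop words are removed.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "people combine words into bigger words and sentences",
    "finding out about documents means finding the right words",
    "words and rules form the foundations of text analytics",
]

vectorizer = CountVectorizer(stop_words="english", min_df=1)   # drop stop words
doc_term = vectorizer.fit_transform(corpus)    # sparse documents-by-terms counts
term_doc = doc_term.T                          # transpose: terms-by-documents matrix

for term, counts in zip(vectorizer.get_feature_names_out(), term_doc.toarray()):
    print(f"{term:12s}", counts)
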
Unsupervised text analytics problems are those for which there is no response or class to
be predicted. Rather, as we showed with the movie taglines, the task is to identify
common patterns or trends in the data. As part of the task, we may define text measures
describing the documents in the corpus.
For supervised text analytics problems there is a response or class of documents to be
predicted. We build a model on a training set and test it on a test set. Text classification
problems are common. Spam filtering has long been a subject of interest as a
classification problem, and many e-mail users have benefitted from the efficient
algorithms that have evolved in this area. In the context of information retrieval, search
engines classify documents as being relevant to the search or not. Useful modeling
techniques for text classification include logistic regression, linear discriminant function
analysis, classification trees, and support vector machines. Various ensemble or
committee methods may be employed.
Automatic text summarization is an area of research and development that can help
with information management. Imagine a text processing program with the ability to
read each document in a collection and summarize it in a sentence or two, perhaps
quoting from the document itself. Today’s search engines are providing partial analysis
of documents prior to their being displayed. They create automated summaries for fast
information retrieval. They recognize common text strings associated with user
requests. These applications of text analysis comprise tools of information search that
we take for granted as part of our daily lives.
Programs with syntactic processing capabilities, such as IBM’s Watson, provide a
glimpse of what intelligent agents for text analytics are becoming. These programs
perform grammatical parsing with an understanding of the roles of subject, verb, object,
and modifier. They know parts of speech (nouns, verbs, adjective, adverbs). And, using
identified entities representing people, places, things, and organizations, they perform
relationship searches.
Sentiment analysis is measurement-focused text analysis. Sometimes called opinion
mining, one approach to sentiment analysis is to draw on positive and negative word
sets (lexicons, dictionaries) that convey human emotion or feeling. These word sets are
specific to the language being spoken and the context of application. Another approach
to sentiment analysis is to work directly with text samples and human ratings of those
samples, developing text scoring methods specific to the task at hand.
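A minimal sketch of the word-set approach, with tiny illustrative positive and negative lexicons (real lexicons are far larger and specific to the language and context of application):

positive_words = {"good", "great", "excellent", "love", "wonderful"}
negative_words = {"bad", "poor", "terrible", "hate", "awful"}

def sentiment_score(text):
    # Score a text sample as (positive count - negative count) / total words.
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    positives = sum(token in positive_words for token in tokens)
    negatives = sum(token in negative_words for token in tokens)
    return (positives - negatives) / max(len(tokens), 1)

print(sentiment_score("The service was excellent and the staff wonderful."))
print(sentiment_score("Terrible wait times and poor support. I hate it."))
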
A semi-supervised machine learning regimen can be especially useful in sentiment
analysis. We work two sets of text samples. One sample, often a small sample (because it
is expensive and time-consuming to obtain human ratings of text), associates a rating or
score with each text document. Another much larger sample is unrated but comes from
the same content domain. We learn the direction of scoring from the first sample, and
we learn about the text domain (including term frequencies in context) from both
samples.
The objective of sentiment analysis is to score text for affect, feelings, attitudes, or
opinions.
Precursors to sentiment analysis may be found in content analysis, thematic, semantic,
and network text analysis (Roberts 1997; Popping 2000; West 2001; Leetaru
2011; Krippendorff 2012). These methods have seen a wide range of applications within
the social sciences, including analysis of political discourse. An early computer
implementation of content analysis is found in the General Inquirer program (Stone et
al. 1966; Stone 1997). Buvač and Stone (2001) describe a version of the program that
provides text measures based upon word counts across numerous semantic categories.
Text measures flow from a measurement model (algorithms for scoring) and a
dictionary, both defined by the researcher or analyst. A dictionary in this context is not a
traditional dictionary; it is not an alphabetized list of words and their definitions.
Rather, the dictionary used to construct text measures is a repository of word lists, such
as synonyms and antonyms, positive and negative words, strong and weak sounding
words, bipolar adjectives, parts of speech, and so on. The lists come from expert
judgments about the meaning of words. A text measure assigns numbers to documents
according to rules, with the rules being defined by the word lists, scoring algorithms,
and modeling techniques in predictive analytics.
Sentiment analysis and text measurement in general hold promise as technologies for
understanding consumer opinion and markets. Just as political researchers can learn
from the words of the public, press, and politicians, marketing data scientists can learn
from the words of customers and competitors.
For the marketing data scientist interested in understanding consumer opinions about
brands and products, there are substantial sources from which to draw samples. We
have customer service logs, telephone transcripts, and sales call reports, along with user
group, listserv, and blog postings. And we have ubiquitous social media from which to
build document collections for text and sentiment analysis.
The measurement story behind opinion and sentiment analysis is an important story
that needs to be told. Sentiment analysis, like all measurement, is the assignment of
numbers to attributes according to rules. But what do the numbers mean? To what
extent are text measures reliable or valid? To demonstrate content or face validity, we
show that the content of the text measure relates to the attribute being measured. We
examine word sets, and we try to gain agreement (among subject matter experts,
perhaps) that they measure a particular attribute or trait. Sentiment research often
involves the testing of word sets within specific contexts and, when possible, testing
against external criteria. To demonstrate predictive validity, we show that a text
measure can be used for prediction.
Regarding Twitter-based text measures, there have been various attempts to predict the
success of movies prior to their being distributed to theaters nationwide (Sharda and
Delen 2006; Delen, Sharda, and Kumar 2007). Most telling is work completed at HP
Labs that utilized chat on Twitter as a predictor of movie revenues (Asur and Huberman
2010). Bollen, Mao, and Zeng (2011) utilize Twitter sentiment analysis in predicting
stock market movements. Taddy’s (2013b, 2014) sentiment analysis work builds on the
inverse regression methods of Cook (1998, 2007). Taddy (2013a) uses Twitter data to
examine political sentiment.
Some have voiced concerns about unidimensional measures of sentiment. There have
been attempts to develop more extensive sentiment word sets, as well as
multidimensional measures (Turney 2002; Asur and Huberman 2010). Recent
developments in machine learning and quantitative linguistics point to sentiment
measurement methods that employ natural language processing rather than relying on
positive and negative word sets (Socher et al. 2011).
Among the more popular measurement schemes from the psychometric literature is
Charles Osgood’s semantic differential (Osgood, Suci, and Tannenbaum 1957; Osgood
1962). Exemplary bipolar dimensions include the positive–negative, strong–weak, and
active–passive dimensions. Schemes like Osgood’s set the stage for multidimensional
measures of sentiment.
We expect sentiment analysis to be an active area of research for many years. Reviews of
sentiment analysis methods have been provided by Liu (2010, 2011, 2012) and Feldman
(2013). Other books in the Modeling Techniques series provide examples of sentiment
analysis using document collections of movie reviews (Miller 2015a, 2015b, 2015c).
Ingersoll, Morton, and Farris (2013) provide an introduction to the domain of text
analytics for the working data scientist. Those interested in reading further can refer
to Feldman and Sanger (2007), Jurafsky and Martin (2009), Weiss, Indurkhya, and
Zhang (2010), and the edited volume by Srivastava and Sahami (2009). Reviews may be
found in Trybula (1999), Witten, Moffat, and Bell (1999), Meadow, Boyce, and Kraft
(2000), Sullivan (2001), Feldman (2002b), and Sebastiani (2002). Hausser (2001) gives
an account of generative grammar and computational linguistics. Statistical language
learning and natural language processing are discussed by Charniak (1993), Manning
and Schütze (1999), and Indurkhya and Damerau (2010).
The writings of Steven Pinker (1994, 1997, 1999) provide insight into grammar and
psycholinguistics. Maybury (1997) reviews data preparation for text analytics and the
related tasks of source detection, translation and conversion, information extraction,
and information exploitation. Detection relates to identifying relevant sources of
information; conversion and translation involve converting from one medium or coding
form to another.
Belew (2000), Meadow, Boyce, and Kraft (2000), and the edited volume by Baeza-Yates
and Ribeiro-Neto (1999) provide reviews of technologies for information retrieval,
which depend on text classification, among other technologies and algorithms.
Authorship identification, a problem addressed a number of years ago in the statistical
literature by Mosteller and Wallace (1984), continues to be an active area of research
(Juola 2008). Merkl (2002) provides discussion of clustering techniques, which explore
similarities between documents and the grouping of documents into classes. Dumais
(2004) reviews latent semantic analysis and statistical approaches to extracting
relationships among terms in a document collection.

A.7 TIME SERIES AND MARKET RESPONSE MODELS
Sales forecasts are a critical component of business planning and a first step in the
budgeting process. Models and methods that provide accurate forecasts can be of
great benefit to management. They help managers to understand the determinants
of sales, including promotions, pricing, advertising, and distribution. They reveal
competitive position and market share.
There are many approaches to forecasting. Some are judgmental, relying on expert
opinion or consensus. There are top-down and bottom-up forecasts, and various
techniques for combining the views of experts. Other approaches depend on the analysis
of past sales data.
Certain problems in business have a special structure, and if we pay attention to that
structure, we can find our way to a solution. Sales forecasting is one of those problems.
Sales forecasts can build on the special structure of sales data as they are found in
business. These are data organized by time and location, where location might refer to
geographical regions or sales territories, stores, departments within stores, or product
lines.
A sales forecasting model can and should be organized by time periods useful to
management. These may be days, weeks, months, or whatever intervals make sense for
the problem at hand. Time dependencies can be noted in the same manner as in
traditional time-series models. Autoregressive terms are useful in many contexts. Time-
construed covariates, such as day of the week or month of the year, can be added to
provide additional predictive power. And we can include promotion, pricing, and
advertising variables organized in time.
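As a minimal illustration of this structure, we sketch a regression in R with a lagged-sales (autoregressive) term, a calendar covariate, and price and promotion variables. The sales_data frame and its columns are simulated stand-ins for data organized by time period, not data from any actual study.

  set.seed(123)
  n <- 104                                           # two years of weekly periods (hypothetical)
  sales_data <- data.frame(
    week      = 1:n,
    month     = factor(rep(1:12, length.out = n)),   # crude calendar covariate
    price     = runif(n, 8, 12),
    promotion = rbinom(n, 1, 0.25)
  )
  sales_data$sales <- 100 - 5 * sales_data$price +
    20 * sales_data$promotion + rnorm(n, sd = 5)

  # autoregressive term: sales lagged one period
  sales_data$lag_sales <- c(NA, head(sales_data$sales, -1))

  fit <- lm(sales ~ lag_sales + month + price + promotion, data = sales_data)
  summary(fit)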
An analyst can work with time series data, using past sales to predict future sales, noting
overall trends and cyclical patterns in the data. Exponential smoothing, moving
averages, and various regression and econometric methods may be used with time-
series data.
Forecasting by location provides detail needed for management action. And organizing
data by location contributes to a model’s predictive power. Location may itself be used
as a factor in models. In addition, we can search for explanatory variables tied to
location. With geographic regions, for example, we might include consumer and
business demographic variables known to relate to sales.
Sales dollars per time period is the typical response variable of interest in sales
forecasting studies. Alternative response variables include sales volume and time-to-
sale. Related studies of market share require information about the sales of other firms
in the same product category.
Forecasting is a large area of application deserving of its own professional conferences
and journals. An overview of business forecasting methods is provided by Armstrong
(2001). Time-series, panel (longitudinal) data, financial, and econometric modeling
methods are especially relevant to this area of application (Judge et al. 1985; Hamilton
1994; Zivot and Wang 2003). Frees and Miller (2004) describe a sales forecasting
method that utilizes a mixed modeling approach, reflecting the special structure of sales
data.
Having gathered the data and looked at the plots, we turn to forecasting the future. For
these economic time series, we use autoregressive integrated moving average (ARIMA)
models or what are often called Box-Jenkins models (Box, Jenkins, and Reinsel 2008).
Drawing on software from Hyndman et al. (2014) and working with one measure at a
time, we use programs to search across large sets of candidate models, including
autoregressive, moving-average, and seasonal components. We try to select the very best
model in terms of the Akaike Information Criterion (AIC) or some other measure that
combines goodness-of-fit and parsimony. Then, having found the model that the
algorithm determines as the best for each measure, we use that model to forecast future
values of the economic measure. In particular, we ask for forecasts for a time horizon
that management requests. We obtain the forecast mean as well as a prediction interval
around that mean for each time period of the forecasting horizon.
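A minimal sketch of this workflow with the forecast package of Hyndman et al. appears below. The AirPassengers series and the 24-period horizon stand in for the economic measure and the horizon management requests; auto.arima searches candidate models, including seasonal components, and selects by a corrected AIC.

  library(forecast)                 # Hyndman et al. forecasting software

  fit <- auto.arima(AirPassengers)  # searches ARIMA candidates, selects by corrected AIC
  fit                               # review the chosen model

  h  <- 24                          # forecasting horizon (assumed request from management)
  fc <- forecast(fit, h = h, level = c(80, 95))
  fc$mean                           # forecast means
  fc$lower                          # lower prediction interval bounds
  fc$upper                          # upper prediction interval bounds
  plot(fc)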
Forecasting uncertainty is estimated around the forecasted values. There is often much
uncertainty about the future, and the further we look into the future, the greater our
uncertainty. The value of a model lies in the quality of its predictions, and sales
forecasting presents challenging problems for the marketing data scientist.
When working with multiple time series, we might fit a multivariate time series or
vector autoregressive (VAR) model to the four time series. Alternatively, we could
explore dynamic linear models, regressing one time series on another. We could utilize
ARIMA transfer function models or state space models with regression components.
The possibilities are as many as the modeling issues to be addressed.
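For the vector autoregressive option, the sketch below uses the vars package (our choice of software, not one prescribed here) and two simulated series standing in for the economic measures under study.

  library(vars)

  set.seed(42)
  n <- 200
  x <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
  y <- 0.5 * x + as.numeric(arima.sim(model = list(ar = 0.3), n = n))
  Y <- cbind(x = x, y = y)

  p_aic   <- VARselect(Y, lag.max = 8)$selection["AIC(n)"]  # choose the lag length
  var_fit <- VAR(Y, p = p_aic)
  summary(var_fit)
  predict(var_fit, n.ahead = 12)    # joint forecasts for both series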
There is a subtle but important distinction to be made here. The term time series
regression refers to regression analysis in which the organizing unit of analysis is time.
We look at relationships among economic measures organized in time. Much economic
analysis concerns time series regression. Special care must be taken to avoid what might
be called spurious relationships, as many economic time series are correlated with one
another because they depend upon underlying factors, such as population growth or
seasonality.
In time series regression, we use standard linear regression methods. We check the
residuals from our regression to ensure that they are not correlated in time. If they are
correlated in time (autocorrelated), then we use a method such as generalized least
squares as an alternative to ordinary least squares. That is, we incorporate a model for the
correlated errors as part of our modeling process. Longitudinal or panel data analysis is an
example of a mixed modeling method with a focus on data organized by cross-sectional
units and time.
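A minimal sketch of this check-and-correct sequence follows, using simulated data and generalized least squares from the nlme package with an AR(1) error model; the package and the AR(1) structure are our choices for illustration.

  library(nlme)

  set.seed(7)
  n    <- 120
  time <- 1:n
  x    <- rnorm(n)
  e    <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))   # AR(1) disturbances
  y    <- 2 + 1.5 * x + e
  dat  <- data.frame(y, x, time)

  ols_fit <- lm(y ~ x, data = dat)
  acf(residuals(ols_fit))           # inspect residual autocorrelation

  gls_fit <- gls(y ~ x, data = dat,
                 correlation = corAR1(form = ~ time))            # AR(1) error model
  summary(gls_fit)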
When we use the term time series analysis, however, we are not talking about time
series regression. We are talking about methods that start by focusing on one economic
measure at a time and its pattern across time. We look for trends, seasonality, and cycles
in that individual time series. Then, after working with that single time series, we look at
possible relationships with other time series. If we are concerned with forecasting or
predicting the future, as we often are in predictive analytics, then we use methods of
time series analysis. Recently, there has been considerable interest in state space models
for time series, which provide a convenient mechanism for incorporating regression
components into dynamic time series models (Commandeur and Koopman
2007; Hyndman, Koehler, Ord, and Snyder 2008; Durbin and Koopman 2012).
There are myriad applications of time series analysis in marketing, including marketing
mix models and advertising research models. Along with sales forecasting, these fall
under the general class of market response models, as reviewed by Hanssens, Parsons,
and Schultz (2001). Marketing mix models look at the effects of price, promotion, and
product placement in retail establishments. These are multiple time series problems.
Advertising research looks for cumulative effectiveness of advertising on brand and
product awareness, as well as sales. Exemplary reviews of advertising research methods
and findings have been provided by Berndt (1991) and Lodish et al. (1995). Much of this
research employs defined measures such as “advertising stock,” which attempt to
convert advertising impressions or rating points to a single measure in time. The
thinking is that messages are most influential immediately after being received and that
their influence declines with time, fading completely only many time periods later.
Viewers or listeners remember advertisements long after initial exposure to those
advertisements. Another way of saying this is to note that there is a carry-over effect
from one time period to the next. Needless to say, measurement and modeling on the
subject of advertising effectiveness presents many challenges for the marketing data
scientist.
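As an illustration of the carry-over idea, the sketch below computes a geometric adstock transformation in R. The carry-over rate and the advertising series are hypothetical; the transformed variable would then enter a market response regression.

  lambda   <- 0.5                                  # assumed carry-over rate per period
  ad_spend <- c(0, 100, 0, 0, 50, 0, 0, 0)         # illustrative impressions or rating points

  # recursive filter: adstock[t] = ad_spend[t] + lambda * adstock[t - 1]
  adstock <- as.numeric(stats::filter(ad_spend, filter = lambda, method = "recursive"))
  round(adstock, 1)   # 0.0 100.0 50.0 25.0 62.5 31.2 15.6 7.8

  # the adstock series would then enter a market response model, for example
  # lm(sales ~ adstock + price, data = ...) with sales for the same periods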
Similar to other data with which we work, sales and marketing data are organized by
observational unit, time, and space. The observational unit is typically an economic
agent (individual or firm) or a group of such agents as in an aggregate analysis. It is
common to use geographical areas as a basis for aggregation. Alternatively, space
(longitude and latitude) can be used directly in spatial data analyses. Time
considerations are especially important in macroeconomic analysis, which focuses upon
nationwide economic measures.
Baumohl (2008) provides a review of economic measures that are commonly thought of
as leading indicators. Kennedy (2008) provides an introduction to the terminology of
econometrics. Key references in the area of econometrics include Judge et al.
(1985), Berndt (1991), Enders (2010), and Greene (2012). Reviews of time series
modeling and forecasting methods are provided by Holden, Peel, and Thompson
(1990) and in the edited volume by Armstrong (2001).
More detailed discussion of time series methods is provided by Hamilton
(1994), Makridakis, Wheelwright, and Hyndman (2005), Box, Jenkins, and Reinsel
(2008), Hyndman et al. (2008), Durbin and Koopman (2012), and Hyndman and
Athanasopoulos (2014). Time-series, panel (longitudinal) data, financial, and
econometric modeling methods are especially relevant in demand and sales
forecasting. Frees and Miller (2004) present a longitudinal sales forecasting method,
reflecting the special structure of sales data in space and time. Hierarchical and grouped
time series methods are discussed by Athanasopoulos, Ahmed, and Hyndman
(2009) and Hyndman et al. (2011).
For gathering economic data with R, we can build on foundation code provided by Ryan
(2014). Useful for programming with dates are R functions provided by Grolemund and
Wickham (2011, 2014). Associated sources for econometric and time series
programming are Kleiber and Zeileis (2008), Hothorn et al. (2014), Cowpertwait and
Metcalfe (2009), Petris, Petrone, and Campagnoli (2009), and Tsay (2013). Most useful
for time series forecasting is code from Hyndman et al. (2014), Petris (2010), Petris and
Gilks (2014), and Szymanski (2014).
The Granger test of causality, a test of temporal ordering, was introduced in the classic
reference by Granger (1969). The interested reader should also check out a delightful
article that answers the perennial question, "Which came first, the chicken or the
egg?" (Thurman and Fisher 1988).
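A quick sketch of the Granger test with lmtest::grangertest follows; the simulated series are illustrative only, with y constructed to depend on lagged x.

  library(lmtest)

  set.seed(11)
  n   <- 200
  x   <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
  y   <- c(0, 0.8 * head(x, -1)) + rnorm(n, sd = 0.5)   # y depends on lagged x
  dat <- data.frame(x, y)

  grangertest(y ~ x, order = 2, data = dat)   # does lagged x help predict y?
  grangertest(x ~ y, order = 2, data = dat)   # and in the reverse direction?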
Applications of traditional methods and models in economics, business, and market
research are discussed by Leeflang et al. (2000), Franses and Paap (2001), Hanssens,
Parsons, and Schultz (2001), and Frees and Miller (2004). Lilien, Kotler, and Moorthy
(1992) and Lilien and Rangaswamy (2003) focus upon marketing models. For a review
of applications from the research practitioner’s point of view, see Chakrapani (2000).
