A. Data Science Methods
We learn from statistics that we should quantify our uncertainty. On the one hand, we
have confidence intervals, point estimates with associated standard errors, significance
tests, and p-values—that is the classical way. On the other hand, we have posterior
probability distributions, probability intervals, prediction intervals, Bayes factors, and
subjective (perhaps diffuse) priors—the path of Bayesian statistics.
The role of data science in business has been discussed by many (Davenport and Harris
2007; Laursen and Thorlund 2010; Davenport, Harris, and Morison 2010; Franks
2012; Siegel 2013; Maisel and Cokins 2014; Provost and Fawcett 2014). In-depth
reviews of methods include those of Izenman (2008), Hastie, Tibshirani, and Friedman
(2009), and Murphy (2012).
Doing data science means implementing flexible, scalable, extensible systems for data
preparation, analysis, visualization, and modeling. We are empowered by the growth of
open source. Whatever the modeling technique or application, there is likely a relevant
package, module, or library that someone has written or is thinking of writing. Doing
data science with open-source tools is discussed in Conway and White (2012), Putler
and Krider (2012), James et al. (2013), Kuhn and Johnson (2013), Lantz (2013),
and Ledolter (2013). Additional discussion of data science, modeling techniques in
predictive analytics, and open-source tools is provided in other books in the Modeling
Techniques series (Miller 2015a, 2015b, and 2015c).
This appendix identifies classes of methods and reviews selected methods in databases
and data preparation, statistics, machine learning, data visualization, and text analytics.
We provide an overview of these methods and cite relevant sources for further reading.
y = Xb + e
y is an n × 1 vector of responses. X is an n × p matrix, with p being the number of
parameters being estimated. Often the first column of X is a column of ones for the
constant or intercept term in the model; additional columns are for parameters
associated with explanatory variables. b is a p × 1 vector of parameter estimates. That is
to say that Xb is the linear predictor in matrix notation. The error vector e represents
independent and identically distributed errors; it is common to assume a Gaussian or
normal distribution with mean zero.
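As a brief illustration (not from the text), the following base R sketch simulates data with illustrative variable names, fits the model by ordinary least squares with lm(), and shows the X matrix with its leading column of ones.

# A minimal base R sketch (simulated data, illustrative names): fit y = Xb + e
# by ordinary least squares and inspect the model matrix X.
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)   # true intercept 1, slopes 2 and -0.5, iid normal errors
fit <- lm(y ~ x1 + x2)
summary(fit)                            # parameter estimates b with standard errors and t-tests
head(model.matrix(fit))                 # the X matrix; the first column is the column of ones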
The assumptions of the classical linear regression model give rise to classical methods of
statistical inference, including tests of regression parameters and analyses of variance.
These methods apply equally well to observational and experimental studies.
Parameters in the classical linear regression model are estimated by ordinary least
squares. There are many variations on the theme, including generalized least squares
and a variety of econometric models for time-series regression, panel (longitudinal) data
models, and hierarchical models in general. There are also Bayesian alternatives for
most classical models. For now, we focus on classical inference and the simplest error
structure—independent, identically distributed (iid) errors.
Let y be one element from the response vector, corresponding to one observation from
the sample, and let x be its corresponding row from the matrix X. Because the mean or
expected value of the errors is zero, we observe that
E[y] = μ = xb
That is, μ, the mean of the response, is equal to the linear predictor. Generalized linear
models build from this fact. If we were to write in functional notation g(μ) = xb, then,
for the Gaussian distribution of classical linear regression, g is the identity
function: g(μ) = μ.
Suppose g(μ) were the logit transformation. We would have the logistic regression
model:
g(μ) = log(μ / (1 − μ)) = xb
Knowing that the exponential function is the inverse of the natural logarithm, we solve
for μ as
μ = e^(xb) / (1 + e^(xb))
For every observation in the sample, the expected value of the binary response is a
proportion. It represents the probability that one of two events will occur. It has a value
between zero and one. In generalized linear model parlance, g is called the link function.
It links the mean of the response to the linear predictor.
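A minimal R sketch, with simulated data and illustrative names, fits this model using glm() with the binomial family and logit link; predict() with type = "response" returns fitted values on the μ (probability) scale.

# Logistic regression sketch: binomial family with the logit link (simulated data).
set.seed(456)
n <- 200
x <- rnorm(n)
mu <- exp(1 + 2 * x) / (1 + exp(1 + 2 * x))   # inverse logit of the linear predictor
y <- rbinom(n, size = 1, prob = mu)           # binary (Bernoulli) responses
fit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit)
head(predict(fit, type = "response"))         # fitted probabilities, values between zero and one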
In most of the choice studies in this book, each observation is a binary response.
Customers choose to stay with their current telephone service or switch to another
service. Commuters choose to drive their cars or take the train. Probability theorists
think of binary responses as Bernoulli trials with the proportion or
probability μ representing the mean of the response. For n observations in a sample, the
sum of the binary responses follows a binomial distribution with mean nμ. We use μ here
to provide a consistent symbol across the class of generalized linear models.
Table A.1 provides an overview of the most important generalized linear models for
work in business and economics. Classical linear regression has an identity link. Logistic
regression uses the logit link. Poisson regression and log-linear models use a log link.
We work with Gaussian (normal), binomial, and Poisson distributions, which are in the
exponential family of distributions. Generalized linear models have linear predictors of
the form Xb. This is what makes them linear models; they involve functions of
explanatory variables that are linear in their parameters.
Generalized linear models help us model nonlinear relationships
between explanatory variables and responses. Except for the special case of the
Gaussian or normal model, which has an identity link, the link function is nonlinear.
Also, unlike the normal model, there is often a relationship between the mean and
variance of the underlying distribution.
The binomial distribution builds on individual binary responses. Customers order or do
not order, respond to a direct marketing mailing or not. Customers choose to stay with
their current telephone service or switch to another service. This type of problem lends
itself to logistic regression and the use of the logit link. Note that the multinomial logit
model is a natural extension of logistic regression. Multinomial logit models are useful
in the analysis of multinomial response variables. A customer chooses Coke, Pepsi, or
RC Cola. A commuter drives, takes the train or bus, or walks to work.
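As a hypothetical illustration of the multinomial logit model, the sketch below assumes the nnet package (whose multinom function fits such models); the brand-choice data and coefficients are fabricated.

# Multinomial logit sketch for a three-category brand choice (fabricated data).
library(nnet)                                  # multinom() fits the multinomial logit model
set.seed(606)
n <- 300
price_diff <- rnorm(n)                         # hypothetical explanatory variable
u_coke  <- 0.5 + 1.0 * price_diff + rnorm(n)   # fabricated latent utilities for each brand
u_pepsi <- 0.2 - 0.5 * price_diff + rnorm(n)
u_rc    <- rnorm(n)
choice <- factor(c("Coke", "Pepsi", "RC")[max.col(cbind(u_coke, u_pepsi, u_rc))])
fit <- multinom(choice ~ price_diff)
summary(fit)                                   # coefficients relative to the first (base) category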
When we record choices over a period of time or across a group of individuals, we get
counts or frequencies. Counts arranged in multi-way contingency tables comprise the
raw data of categorical data analysis. We also use the Poisson distribution and the log
link for categorical data analysis and log-linear modeling. As we have seen from our
discussion of the logit transformation, the log function converts a variable defined on
the domain of positive reals into a variable defined on the range of all real numbers.
This is why it works for counts.
The Poisson distribution is discrete on the domain of non-negative integers. It is used
for modeling counts, as in Poisson regression. The insurance company counts the
number of claims over the past year. A retailer counts the number of customers
responding to a sales promotion or the number of stock units sold. A nurse counts the
number of days a patient stays in the hospital. An auto dealer counts the number of days
a car stays on the lot before it sells.
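A short R sketch with simulated counts (illustrative names, not data from the text) fits a Poisson regression with glm() and the log link.

# Poisson regression sketch: counts modeled with the log link (simulated data).
set.seed(789)
n <- 150
x <- runif(n)
mu <- exp(0.5 + 1.2 * x)                  # log link: log(mu) is linear in x
y <- rpois(n, lambda = mu)                # count responses
fit <- glm(y ~ x, family = poisson(link = "log"))
summary(fit)                              # coefficients on the log scale
exp(coef(fit))                            # multiplicative effects on expected counts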
Linear regression is a special generalized linear model. It has normally distributed
responses and an identity link relating the expected value of responses to the linear
predictor. Linear regression coefficients may be estimated by ordinary least squares. For
other members of the family of generalized linear models we use maximum likelihood
estimation. With the classical linear model we have analysis of variance and F-tests.
With generalized linear models we have analysis of deviance and likelihood ratio tests,
which are asymptotic chi-square tests.
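A brief sketch of an analysis of deviance in R, comparing nested generalized linear models with a likelihood ratio (asymptotic chi-square) test; the simulated data and variable names are illustrative.

# Analysis of deviance sketch: likelihood ratio test for nested Poisson models.
set.seed(101)
n <- 150
x1 <- runif(n)
x2 <- rnorm(n)                                 # x2 has no true effect
y <- rpois(n, lambda = exp(0.5 + 1.2 * x1))
fit0 <- glm(y ~ x1, family = poisson)
fit1 <- glm(y ~ x1 + x2, family = poisson)
anova(fit0, fit1, test = "Chisq")              # analysis of deviance, asymptotic chi-square test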
There are close connections among generalized linear models for the analysis of choices
and counts. Alternative formulations often yield comparable results. The multinomial
model looks at the distribution of counts across response categories with a fixed sum of
counts (the sample size). For the Poisson model, counts, being associated with
individual cases, are random variables, and the sum of counts is not known until
observed in the sample. But we can use the Poisson distribution for the analysis of
multinomial data. Log-linear models make no explicit distinction between response and
explanatory variables. Instead, frequency counts act as responses in the model. But, if
we focus on appropriate subsets of linear predictors, treating response variables as
distinct from explanatory variables, log-linear models yield results comparable to
logistic regression. The Poisson distribution and the log link are used for log-linear
models.
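As a hypothetical illustration, the following base R sketch fits a log-linear model to a small two-way contingency table, using the Poisson distribution and log link to test for association; the counts and factor names are fabricated.

# Log-linear modeling sketch for a two-way contingency table (fabricated counts).
counts  <- c(30, 10, 20, 40)
brand   <- factor(c("A", "A", "B", "B"))
segment <- factor(c("young", "old", "young", "old"))
fit_indep <- glm(counts ~ brand + segment, family = poisson)   # independence (main effects) model
fit_sat   <- glm(counts ~ brand * segment, family = poisson)   # saturated model
anova(fit_indep, fit_sat, test = "Chisq")      # tests the brand-by-segment association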
When communicating with managers, we often use R-squared or the coefficient of
determination as an index of goodness of fit. This is a quantity that is easy to explain to
management as the proportion of response variance accounted for by the model. An
alternative index that many statisticians prefer is the root-mean-square error (RMSE),
which is an index of badness or lack of fit. Other indices of badness of fit, such as the
percentage error in prediction, are sometimes preferred by managers.
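A base R sketch (simulated data) computes both indices from a fitted linear model.

# Goodness-of-fit sketch: R-squared and root-mean-square error from a linear model.
set.seed(202)
n <- 100
x <- rnorm(n)
y <- 3 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared          # proportion of response variance accounted for
rmse <- sqrt(mean(residuals(fit)^2))         # root-mean-square error, an index of lack of fit
c(R_squared = r_squared, RMSE = rmse)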
The method of logistic regression, although called “regression,” is actually a
classification method. It involves the prediction of a binary response. Ordinal and
multinomial logit models extend logistic regression to problems involving more than
two classes. Linear discriminant analysis is another classification method from the
domain of traditional statistics. The benchmark study of text classification in the
chapter on sentiment analysis employed logistic regression and a number of machine
learning algorithms for classification.
Evaluating classifier performance presents a challenge because many problems are low
base rate problems. Fewer than five percent of customers may respond to a direct mail
campaign. Disease rates, loan default, and fraud are often low base rate events. When
evaluating classifiers in the context of low base rates, we must look beyond the
percentage of events correctly predicted. Based on the four-fold table known as
the confusion matrix, figure A.1 provides an overview of various indices available for
evaluating binary classifiers.
Figure A.1. Evaluating the Predictive Accuracy of a Binary Classifier
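The indices summarized in figure A.1 follow directly from the counts in the confusion matrix; the base R sketch below uses hypothetical predictions, not data from the text.

# Confusion matrix sketch: selected indices for a binary classifier at a fixed cut-off.
set.seed(303)
actual    <- rbinom(200, 1, 0.10)               # low-base-rate events, about ten percent
predicted <- ifelse(runif(200) < 0.15, 1, 0)    # stand-in for a classifier's 0/1 predictions
table(predicted, actual)                        # the four-fold confusion matrix
tp <- sum(predicted == 1 & actual == 1); fp <- sum(predicted == 1 & actual == 0)
tn <- sum(predicted == 0 & actual == 0); fn <- sum(predicted == 0 & actual == 1)
c(accuracy    = (tp + tn) / length(actual),
  sensitivity = tp / (tp + fn),                 # true positive rate
  specificity = tn / (tn + fp),                 # true negative rate
  precision   = tp / (tp + fp))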
Summary statistics such as Kappa (Cohen 1960) and the area under the receiver
operating characteristic (ROC) curve are sometimes used to evaluate classifiers. Kappa
depends on the probability cut-off used in classification. The area under the ROC curve
does not.
The area under the ROC curve is a preferred index of classification performance for low-
base-rate problems. The ROC curve is a plot of the true positive rate against the false
positive rate. It shows the tradeoff between sensitivity and specificity and measures how
well the model separates positive from negative cases.
The area under the ROC curve provides an index of predictive accuracy independent of
the probability cut-off that is being used to classify cases. Perfect prediction corresponds
to an area of 1.0 (a curve that touches the top-left corner). An area of 0.5 corresponds to random
(null-model) predictive accuracy.
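A base R sketch computes the area under the ROC curve from predicted scores using the rank-sum (Mann-Whitney) formulation; the scores and outcomes are simulated for illustration.

# AUC sketch: area under the ROC curve, independent of any probability cut-off.
set.seed(404)
actual <- rbinom(500, 1, 0.05)                  # low-base-rate events
score  <- runif(500) + 0.3 * actual             # hypothetical classifier scores
auc <- function(score, actual) {                # Mann-Whitney (rank-sum) form of the AUC
  r  <- rank(score)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(score, actual)                              # 0.5 is random, 1.0 is perfect separation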
Useful references for linear regression include Draper and Smith (1998), Harrell
(2001), Chatterjee and Hadi (2012), and Fox and Weisberg (2011). Data-adaptive
regression methods and machine learning algorithms are reviewed in Berk
(2008), Izenman (2008), and Hastie, Tibshirani, and Friedman (2009). For traditional
nonlinear models, see Bates and Watts (2007).
Of special concern to data scientists is the structure of the regression model. Under what
conditions should we transform the response or selected explanatory variables? Should
interaction effects be included in the model? Regression diagnostics are data
visualizations and indices we use to check on the adequacy of regression models and to
suggest variable transformations. Discussion may be found in Belsley, Kuh, and Welsch
(1980) and Cook (1998). The base R system provides many diagnostics, and Fox and
Weisberg (2011) provide additional diagnostics. Diagnostics may suggest that
transformations of the response or explanatory variables are needed in order to meet
model assumptions or improve predictive performance. A theory of power
transformations is provided in Box and Cox (1964) and reviewed by Fox and Weisberg
(2011).
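A short R sketch (simulated data) shows base R diagnostic plots for a fitted linear model and a Box-Cox check for a power transformation of the response, assuming the MASS package for boxcox.

# Diagnostics sketch: residual plots and a Box-Cox power-transformation check.
library(MASS)                                   # boxcox() for the power-transformation check
set.seed(707)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.3))        # positive response that benefits from a log transform
fit <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(fit)                                       # residual, Q-Q, scale-location, and leverage plots
par(mfrow = c(1, 1))
boxcox(fit, lambda = seq(-1, 1, 0.1))           # profile likelihood; lambda near zero suggests log(y)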
When defining parametric models, we would like to include the right set of explanatory
variables in the right form. Having too few variables or omitting key explanatory
variables can result in biased predictions. Having too many variables, on the other hand,
may lead to over-fitting and high out-of-sample prediction error. This bias-variance
tradeoff, as it is sometimes called, is a statistical fact of life.
Shrinkage and regularized regression methods provide mechanisms for tuning,
smoothing, or adjusting model complexity (Tibshirani 1996; Hoerl and Kennard 2000).
Alternatively, we can select subsets of explanatory variables to go into predictive models.
Special methods are called into play when the number of parameters being estimated is
large, perhaps exceeding the number of observations (Bühlmann and van de Geer 2011).
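As a simple illustration of shrinkage, the base R sketch below computes ridge estimates directly from the penalized normal equations; the data are simulated and the penalty value is arbitrary.

# Ridge regression sketch: the penalty lambda shrinks coefficients toward zero.
set.seed(505)
n <- 50
p <- 10
X <- scale(matrix(rnorm(n * p), n, p))           # standardized explanatory variables
b <- c(2, -1.5, rep(0, p - 2))                   # only two variables truly matter
y <- as.vector(X %*% b + rnorm(n))
ridge <- function(X, y, lambda) {                # ridge estimate: (X'X + lambda I)^(-1) X'y
  drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y))
}
round(cbind(ols = ridge(X, y, 0), ridge = ridge(X, y, 10)), 3)   # shrinkage toward zero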
For additional discussion of the bias-variance tradeoff, regularized regression, and
subset selection, see Izenman (2008) and Hastie, Tibshirani, and Friedman
(2009). Graybill (1961, 2000) and Rencher and Schaalje (2008) review linear models.
Generalized linear models are discussed in McCullagh and Nelder (1989) and Firth
(1991). Kutner, Nachtsheim, Neter, and Li (2004) provide a comprehensive review of
linear and generalized linear models, including discussion of their application in
experimental design. R methods for the estimation of linear and generalized linear
models are reviewed in Chambers and Hastie (1992) and Venables and Ripley (2002).
The standard reference for generalized linear models is McCullagh and Nelder
(1989). Firth (1991) provides additional review of the underlying theory. Hastie
(1992) and Venables and Ripley (2002) give modeling examples with S/SPlus, most of
which are easily duplicated in R. Lindsey (1997) discusses a wide range of application
examples. See Christensen (1997), Le (1998), and Hosmer, Lemeshow, and Sturdivant
(2013) for discussion of logistic regression. Lloyd (1999) provides an overview of
categorical data analysis. See Fawcett (2003) and Sing et al. (2005) for further
discussion of the ROC curve. Discussion of alternative methods for evaluating classifiers
is provided in Hand (1997) and Kuhn and Johnson (2013).
For Poisson regression and the analysis of multi-way contingency tables, useful
references include Bishop, Fienberg, and Holland (1975), Cameron and Trivedi
(1998), Fienberg (2007), Tang, He, and Tu (2012), and Agresti (2013). Reviews of
survival data analysis have been provided by Andersen, Borgan, Gill, and Keiding
(1993), Le (1997), Therneau and Grambsch (2000), Harrell (2001), Nelson
(2003), Hosmer, Lemeshow, and May (2013), and Allison (2010), with programming
solutions provided by Therneau (2014) and Therneau and Crowson (2014). Wassertheil-
Smoller (1990) provides an elementary introduction to classification procedures and the
evaluation of binary classifiers. For a more advanced treatment, see Hand
(1997). Burnham and Anderson (2002) review model selection methods, particularly
those using the Akaike information criterion or AIC (Akaike 1973).
We sometimes consider robust regression methods when there are influential outliers or
extreme observations. Robust methods represent an active area of research using
statistical simulation tools (Fox 2002; Koller and Stahel 2011; Maronna, Martin, and
Yohai 2006; Maechler 2014b; Koller 2014). Huet et al. (2004) and Bates and Watts
(2007) review nonlinear regression, and Harrell (2001) discusses spline functions for
regression problems.
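A brief sketch of robust regression, assuming the MASS package for its rlm function; the simulated data include a few injected outliers so that robust and least-squares estimates can be compared.

# Robust regression sketch: M-estimation down-weights influential outliers.
library(MASS)                                    # rlm() for robust M-estimation
set.seed(808)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[1:5] <- y[1:5] + 20                            # inject a few extreme observations
cbind(ols = coef(lm(y ~ x)), robust = coef(rlm(y ~ x)))   # the robust fit resists the outliers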
Linguists study natural language, the words and the rules that we use to form
meaningful utterances. “Generative grammar” is a general term for the rules;
“morphology,” “syntax,” and “semantics” are more specific terms. Computer programs
for natural language processing use linguistic rules to mimic human communication and
convert natural language into structured text for further analysis.
Natural language processing is a broad area of academic study itself, and an important
area of computational linguistics. The location of words in sentences is a key to
understanding text. Words follow a sequence, with earlier words often more important
than later words, and with early sentences and paragraphs often more important than
later sentences and paragraphs.
Words in the title of a document are especially important to understanding the meaning
of a document. Some words occur with high frequency and help to define the meaning of
a document. Other words, such as the definite article “the” and the indefinite articles “a”
and “an,” as well as many prepositions and pronouns, occur with high frequency but
have little to do with the meaning of a document. These stop words are dropped from
the analysis.
The features or attributes of text are often associated with terms—collections of words
that mean something special. There are collections of words relating to the same
concept or word stem. The words “marketer,” “marketeer,” and “marketing” build on the
common word stem “market.” There are syntactic structures to consider, such as
adjectives followed by nouns and nouns followed by nouns. Most important to text
analytics are sequences of words that form terms. The words “New” and “York” have
special meaning when combined to form the term “New York.” The words “financial”
and “analysis” have special meaning when combined to form the term “financial
analysis.” We often employ stemming, which is the identification of word stems,
dropping suffixes (and sometimes prefixes) from words. More generally, we are parsing
natural language text to arrive at structured text.
In English, it is customary to place the subject before the verb and the object after the
verb. In English, verb tense is important. The sentence “Daniel carries the Apple
computer” can have the same meaning as the sentence “The Apple computer is carried
by Daniel.” “Apple computer,” the object of the active verb “carry,” is the subject of the
passive verb “is carried.” Understanding that the two sentences mean the same thing is
an important part of building intelligent text applications.
A key step in text analysis is the creation of a terms-by-documents matrix (sometimes
called a lexical table). The rows of this data matrix correspond to words or word stems
from the document collection, and the columns correspond to documents in the
collection. The entry in each cell of a terms-by-documents matrix could be a binary
indicator for the presence or absence of a term in a document, a frequency count of the
number of times a term is used in a document, or a weighted frequency indicating the
importance of a term in a document.
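A base R sketch (with two illustrative documents, not those of figure A.3) builds a small terms-by-documents matrix of frequency counts, dropping stop words along the way.

# Terms-by-documents matrix sketch: rows are terms, columns are documents.
docs <- c(doc1 = "customers choose the new service",
          doc2 = "customers keep the current service")
stop_words <- c("the", "a", "an")
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"),
                 function(w) w[w != "" & !(w %in% stop_words)])
terms <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(w) table(factor(w, levels = terms)))
tdm                                              # cells hold term frequency counts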
Figure A.3 illustrates the process of creating a terms-by-documents matrix. The first
document comes from Steven Pinker’s Words and Rules (1999, p. 4), the second from
Richard K. Belew’s Finding Out About (2000, p. 73). Terms correspond to words or
word stems that appear in the documents. In this example, each matrix entry represents
the number of times a term appears in a document. We treat nouns, verbs, and
adjectives similarly in the definition of stems. The stem “combine” represents both the
verb “combine” and the noun “combination.” Likewise, “function” represents the verb,
noun, and adjective form “functional.” An alternative system might distinguish among
parts of speech, permitting more sophisticated syntactic searches across documents.
After being created, the terms-by-documents matrix is like an index, a mapping of
document identifiers to terms (keywords or stems) and vice versa. For information
retrieval systems or search engines we might also retain information regarding the
specific location of terms within documents.
Figure A.3 source: Adapted from Miller (2005).
Typical text analytics applications have many more terms than documents, resulting in
sparse rectangular terms-by-documents matrices. To obtain meaningful results for text
analytics applications, analysts examine the distribution of terms across the document
collection. Very low frequency terms, those used in few documents, are dropped from
the terms-by-documents matrix, reducing the number of rows in the matrix.
Unsupervised text analytics problems are those for which there is no response or class to
be predicted. Rather, as we showed with the movie taglines, the task is to identify
common patterns or trends in the data. As part of the task, we may define text measures
describing the documents in the corpus.
For supervised text analytics problems there is a response or class of documents to be
predicted. We build a model on a training set and test it on a test set. Text classification
problems are common. Spam filtering has long been a subject of interest as a
classification problem, and many e-mail users have benefitted from the efficient
algorithms that have evolved in this area. In the context of information retrieval, search
engines classify documents as being relevant to the search or not. Useful modeling
techniques for text classification include logistic regression, linear discriminant function
analysis, classification trees, and support vector machines. Various ensemble or
committee methods may be employed.
Automatic text summarization is an area of research and development that can help
with information management. Imagine a text processing program with the ability to
read each document in a collection and summarize it in a sentence or two, perhaps
quoting from the document itself. Today’s search engines are providing partial analysis
of documents prior to their being displayed. They create automated summaries for fast
information retrieval. They recognize common text strings associated with user
requests. These applications of text analysis comprise tools of information search that
we take for granted as part of our daily lives.
Programs with syntactic processing capabilities, such as IBM’s Watson, provide a
glimpse of what intelligent agents for text analytics are becoming. These programs
perform grammatical parsing with an understanding of the roles of subject, verb, object,
and modifier. They know parts of speech (nouns, verbs, adjectives, adverbs). And, using
identified entities representing people, places, things, and organizations, they perform
relationship searches.
Sentiment analysis is measurement-focused text analysis. Sometimes called opinion
mining, one approach to sentiment analysis is to draw on positive and negative word
sets (lexicons, dictionaries) that convey human emotion or feeling. These word sets are
specific to the language being spoken and the context of application. Another approach
to sentiment analysis is to work directly with text samples and human ratings of those
samples, developing text scoring methods specific to the task at hand.
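As a minimal illustration of the lexicon-based approach, the R sketch below scores documents by counting matches against tiny positive and negative word lists; the word lists and reviews are illustrative, not a published lexicon.

# Lexicon-based sentiment scoring sketch (illustrative word lists and documents).
positive_words <- c("good", "great", "excellent", "love")
negative_words <- c("bad", "poor", "terrible", "hate")
score_sentiment <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}
reviews <- c("great service and excellent support", "terrible wait times and poor value")
sapply(reviews, score_sentiment)                 # positive score for the first, negative for the second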
A semi-supervised machine learning regimen can be especially useful in sentiment
analysis. We work two sets of text samples. One sample, often a small sample (because it
is expensive and time-consuming to obtain human ratings of text), associates a rating or
score with each text document. Another much larger sample is unrated but comes from
the same content domain. We learn the direction of scoring from the first sample, and
we learn about the text domain (including term frequencies in context) from both
samples.
The objective of sentiment analysis is to score text for affect, feelings, attitudes, or
opinions.
Precursors to sentiment analysis may be found in content analysis, thematic, semantic,
and network text analysis (Roberts 1997; Popping 2000; West 2001; Leetaru
2011; Krippendorff 2012). These methods have seen a wide range of applications within
the social sciences, including analysis of political discourse. An early computer
implementation of content analysis is found in the General Inquirer program (Stone et
al. 1966; Stone 1997). Buvač and Stone (2001) describe a version of the program that
provides text measures based upon word counts across numerous semantic categories.
Text measures flow from a measurement model (algorithms for scoring) and a
dictionary, both defined by the researcher or analyst. A dictionary in this context is not a
traditional dictionary; it is not an alphabetized list of words and their definitions.
Rather, the dictionary used to construct text measures is a repository of word lists, such
as synonyms and antonyms, positive and negative words, strong and weak sounding
words, bipolar adjectives, parts of speech, and so on. The lists come from expert
judgments about the meaning of words. A text measure assigns numbers to documents
according to rules, with the rules being defined by the word lists, scoring algorithms,
and modeling techniques in predictive analytics.
Sentiment analysis and text measurement in general hold promise as technologies for
understanding consumer opinion and markets. Just as political researchers can learn
from the words of the public, press, and politicians, marketing data scientists can learn
from the words of customers and competitors.
For the marketing data scientist interested in understanding consumer opinions about
brands and products, there are substantial sources from which to draw samples. We
have customer service logs, telephone transcripts, and sales call reports, along with user
group, listserv, and blog postings. And we have ubiquitous social media from which to
build document collections for text and sentiment analysis.
The measurement story behind opinion and sentiment analysis is an important story
that needs to be told. Sentiment analysis, like all measurement, is the assignment of
numbers to attributes according to rules. But what do the numbers mean? To what
extent are text measures reliable or valid? To demonstrate content or face validity, we
show that the content of the text measure relates to the attribute being measured. We
examine word sets, and we try to gain agreement (among subject matter experts,
perhaps) that they measure a particular attribute or trait. Sentiment research often
involves the testing of word sets within specific contexts and, when possible, testing
against external criteria. To demonstrate predictive validity, we show that a text
measure can be used for prediction.
Regarding Twitter-based text measures, there have been various attempts to predict the
success of movies prior to their being distributed to theaters nationwide (Sharda and
Delen 2006; Delen, Sharda, and Kumar 2007). Most telling is work completed at HP
Labs that utilized chat on Twitter as a predictor of movie revenues (Asur and Huberman
2010). Bollen, Mao, and Zeng (2011) utilize Twitter sentiment analysis in predicting
stock market movements. Taddy’s (2013b, 2014) sentiment analysis work builds on the
inverse regression methods of Cook (1998, 2007). Taddy (2013a) uses Twitter data to
examine political sentiment.
Some have voiced concerns about unidimensional measures of sentiment. There have
been attempts to develop more extensive sentiment word sets, as well as
multidimensional measures (Turney 2002; Asur and Huberman 2010). Recent
developments in machine learning and quantitative linguistics point to sentiment
measurement methods that employ natural language processing rather than relying on
positive and negative word sets (Socher et al. 2011).
Among the more popular measurement schemes from the psychometric literature is
Charles Osgood’s semantic differential (Osgood, Suci, and Tannenbaum 1957; Osgood
1962). Exemplary bipolar dimensions include the positive–negative, strong–weak, and
active–passive dimensions. Schemes like Osgood’s set the stage for multidimensional
measures of sentiment.
We expect sentiment analysis to be an active area of research for many years. Reviews of
sentiment analysis methods have been provided by Liu (2010, 2011, 2012) and Feldman
(2013). Other books in the Modeling Techniques series provide examples of sentiment
analysis using document collections of movie reviews (Miller 2015a, 2015b, 2015c).
Ingersoll, Morton, and Farris (2013) provide an introduction to the domain of text
analytics for the working data scientist. Those interested in reading further can refer
to Feldman and Sanger (2007), Jurafsky and Martin (2009), Weiss, Indurkhya, and
Zhang (2010), and the edited volume by Srivastava and Sahami (2009). Reviews may be
found in Trybula (1999), Witten, Moffat, and Bell (1999), Meadow, Boyce, and Kraft
(2000), Sullivan (2001), Feldman (2002b), and Sebastiani (2002). Hausser (2001) gives
an account of generative grammar and computational linguistics. Statistical language
learning and natural language processing are discussed by Charniak (1993), Manning
and Schütze (1999), and Indurkhya and Damerau (2010).
The writings of Steven Pinker (1994, 1997, 1999) provide insight into grammar and
psycholinguistics. Maybury (1997) reviews data preparation for text analytics and the
related tasks of source detection, translation and conversion, information extraction,
and information exploitation. Detection relates to identifying relevant sources of
information; conversion and translation involve converting from one medium or coding
form to another.
Belew (2000), Meadow, Boyce, and Kraft (2000) and the edited volume by Baeza-Yates
and Ribeiro-Neto (1999) provide reviews of technologies for information retrieval,
which depend on text classification, among other technologies and algorithms.
Authorship identification, a problem addressed a number of years ago in the statistical
literature by Mosteller and Wallace (1984), continues to be an active area of research
(Joula 2008). Merkl (2002) provides discussion of clustering techniques, which explore
similarities between documents and the grouping of documents into classes. Dumais
(2004) reviews latent semantic analysis and statistical approaches to extracting
relationships among terms in a document collection.