
International Series in

Operations Research & Management Science

Bhimasankaram Pochiraju
Sridhar Seshadri Editors

Essentials
of Business
Analytics
An Introduction to the Methodology
and its Applications
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Sridhar Seshadri

Part I Tools
2 Data Collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Sudhir Voleti
3 Data Management—Relational Database Systems (RDBMS) . . . . . . . . . 41
Hemanth Kumar Dasararaju and Peeyush Taori
4 Big Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Peeyush Taori and Hemanth Kumar Dasararaju
5 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
John F. Tripp
6 Statistical Methods: Basic Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Vishnuprasad Nagadevara
7 Statistical Methods: Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Bhimasankaram Pochiraju and Hema Sri Sai Kollipara
8 Advanced Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Vishnuprasad Nagadevara
9 Text Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Sudhir Voleti

Part II Modeling Methods


10 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Sumit Kunnumkal
11 Introduction to Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Milind G. Sohoni


12 Forecasting Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
Konstantinos I. Nikolopoulos and Dimitrios D. Thomakos
13 Count Data Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
Thriyambakam Krishnan
14 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Thriyambakam Krishnan
15 Machine Learning (Unsupervised) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Shailesh Kumar
16 Machine Learning (Supervised) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Shailesh Kumar
17 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
Manish Gupta

Part III Applications


18 Retail Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
Ramandeep S. Randhawa
19 Marketing Analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
S. Arunachalam and Amalesh Sharma
20 Financial Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
Krishnamurthy Vaidyanathan
21 Social Media and Web Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
Vishnuprasad Nagadevara
22 Healthcare Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
Maqbool (Mac) Dada and Chester Chambers
23 Pricing Analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
Kalyan Talluri and Sridhar Seshadri
24 Supply Chain Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
Yao Zhao
25 Case Study: Ideal Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
Deepak Agrawal and Soumithri Mamidipudi
26 Case Study: AAA Airline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
Deepak Agrawal, Hema Sri Sai Kollipara, and Soumithri Mamidipudi
27 Case Study: InfoMedia Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873
Deepak Agrawal, Soumithri Mamidipudi, and Sriram Padmanabhan
28 Introduction to R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
Peeyush Taori and Hemanth Kumar Dasararaju


29 Introduction to Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
Peeyush Taori and Hemanth Kumar Dasararaju
30 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 945
Peeyush Taori, Soumithri Mamidipudi, and Deepak Agrawal

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
Disclaimer

This book contains information obtained from authentic and highly regarded
sources. Reasonable efforts have been made to publish reliable data and information,
but the author and publisher cannot assume responsibility for the validity of
all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been
obtained. If any copyright material has not been acknowledged please write and let
us know so we may rectify in any future reprint.

Acknowledgements

This book is the outcome of a truly collaborative effort amongst many people who
have contributed in different ways. We are deeply thankful to all the contributing
authors for their ideas and support. The book belongs to them. This book would not
have been possible without the help of Deepak Agrawal. Deepak helped in every
way, from editorial work, solution support, programming help, to coordination with
authors and researchers, and many more things. Soumithri Mamidipudi provided
editorial support, helped with writing summaries of every chapter, and proof-edited
the probability and statistics appendix and cases. Padmavati Sridhar provided edi-
torial support for many chapters. Two associate alumni—Ramakrishna Vempati and
Suryanarayana Ambatipudi—of the Certificate Programme in Business Analytics
(CBA) at Indian School of Business (ISB) helped with locating contemporary
examples and references. They suggested examples for the Retail Analytics and
Supply Chain Analytics chapters. Ramakrishna also contributed to the draft of the
Big Data chapter. Several researchers in the Advanced Statistics and Computing
Lab (ASC Lab) at ISB helped in many ways. Hema Sri Sai Kollipara provided
support for the cases, exercises, and technical and statistics support for various
chapters. Aditya Taori helped with examples for the machine learning chapters
and exercises. Saurabh Jugalkishor contributed examples for the machine learning
chapters. The ASC Lab’s researchers and Hemanth Kumar provided technical
support in preparing solutions for various examples referred to in the chapters. Ashish
Khandelwal, Fellow Program student at ISB, helped with the chapter on Linear
Regression. Dr. Kumar Eswaran and Joy Mustafi provided additional thoughts for
the Unsupervised Learning chapter. The editorial team comprising Faith Su, Mathew
Amboy and series editor Camille Price gave immense support during the book
proposal stage, guidance during editing, production, etc. The ASC Lab provided
the research support for this project.
We thank our families for the constant support during the 2-year long project.
We thank each and every person associated with us during the beautiful journey of
writing this book.

Contributors

Deepak Agrawal Indian School of Business, Hyderabad, Telangana, India


S. Arunachalam Indian School of Business, Hyderabad, Telangana, India
Chester Chambers Carey Business School, Johns Hopkins University, Baltimore,
MD, USA
Maqbool (Mac) Dada Carey Business School, Johns Hopkins University, Balti-
more, MD, USA
Manish Gupta Microsoft Corporation, Hyderabad, India
Hema Sri Sai Kollipara Indian School of Business, Hyderabad, Telangana, India
Thriyambakam Krishnan Chennai Mathematical Institute, Chennai, India
Shailesh Kumar Reliance Jio, Navi Mumbai, Maharashtra, India
Hemanth Kumar Dasararaju Indian School of Business, Hyderabad, Telangana,
India
Sumit Kunnumkal Indian School of Business, Hyderabad, Telangana, India
Soumithri Mamidipudi Indian School of Business, Hyderabad, Telangana, India
Vishnuprasad Nagadevara IIM-Bangalore, Bengaluru, Karnataka, India
Konstantinos I. Nikolopoulos Bangor Business School, Bangor, Gwynedd, UK
Sriram Padmanabhan New York, NY, USA
Bhimasankaram Pochiraju Applied Statistics and Computing Lab, Indian School
of Business, Hyderabad, Telangana, India
Ramandeep S. Randhawa Marshall School of Business, University of Southern
California, Los Angeles, CA, USA
Sridhar Seshadri Gies College of Business, University of Illinois at Urbana
Champaign, Champaign, IL, USA


Amalesh Sharma Texas A&M University, College Station, TX, USA


Milind G. Sohoni Indian School of Business, Hyderabad, Telangana, India
Kalyan Talluri Imperial College Business School, South Kensington, London, UK
Peeyush Taori London Business School, London, UK
Dimitrios D. Thomakos University of Peloponnese, Tripoli, Greece
John F. Tripp Clemson University, Clemson, SC, USA
Krishnamurthy Vaidyanathan Indian School of Business, Hyderabad,
Telangana, India
Sudhir Voleti Indian School of Business, Hyderabad, Telangana, India
Yao Zhao Rutgers University, Newark, NJ, USA
Chapter 1
Introduction

Sridhar Seshadri

Business analytics is the science of posing and answering data questions related to
business. Business analytics has rapidly expanded in the last few years to include
tools drawn from statistics, data management, data visualization, and machine learn-
ing. There is increasing emphasis on big data handling to assimilate the advances
made in data sciences. As is often the case with applied methodologies, business
analytics has to be soundly grounded in applications in various disciplines and
business verticals to be valuable. The bridge between the tools and the applications
are the modeling methods used by managers and researchers in disciplines such as
finance, marketing, and operations. This book provides coverage of all three aspects:
tools, modeling methods, and applications.
The purpose of the book is threefold: to fill the void in graduate-level study materials for addressing business problems, to show how to pose data questions and obtain sound business solutions via analytics theory, and to ground those solutions in practice.
In order to make the material self-contained, we have endeavored to provide ample
use of cases and data sets for practice and testing of tools. Each chapter comes
with data, examples, and exercises showing students what questions to ask, how to
apply the techniques using open source software, and how to interpret the results. In
our approach, simple examples are followed with medium to large applications and
solutions. The book can also serve as a self-study guide to professionals who wish
to enhance their knowledge about the field.
The distinctive features of the book are as follows:
• The chapters are written by experts from universities and industry.
• The major software packages used are R, Python, MS Excel, and MySQL. These are all topical and widely used in industry.

S. Seshadri (✉)
Gies College of Business, University of Illinois at Urbana Champaign, Champaign, IL, USA
e-mail: sridhar@illinois.edu

© Springer Nature Switzerland AG 2019
B. Pochiraju, S. Seshadri (eds.), Essentials of Business Analytics, International
Series in Operations Research & Management Science 264,
https://doi.org/10.1007/978-3-319-68837-4_1


• Extreme care has been taken to ensure continuity from one chapter to the next.
The editors have attempted to make sure that the content and flow are similar in
every chapter.
• In Parts I and II of the book, the tools and modeling methodology are developed in detail. This methodology is then applied to solve business problems in various verticals in Part III, which also contains the larger case studies.
• The Appendices cover required material on Probability theory, R, and Python, as
these serve as prerequisites for the main text.
The structure of each chapter is as follows:
• Each chapter has a business orientation. It starts with business problems, which
are transformed into technological problems. Methodology is developed to solve
the technological problems. Data analysis is done using suitable software and the
output and results are clearly explained at each stage of development. Finally, the
technological solution is transformed back to a business solution. The chapters
conclude with suggestions for further reading and a list of references.
• Exercises (with real data sets when applicable) are at the end of each chapter and
on the Web to test and enhance the understanding of the concepts and application.
• Caselets are used to illustrate the concepts in several chapters.

1 Detailed Description of Chapters

Data Collection: This chapter introduces the concepts of data collection and
problem formulation. Firstly, it establishes the foundation upon which the fields
of data sciences and analytics are based, and defines core concepts that will be used
throughout the rest of the book. The chapter starts by discussing the types of data
that can be gathered, and the common pitfalls that can occur when data analytics
does not take into account the nature of the data being used. It distinguishes between
primary and secondary data sources using examples, and provides a detailed
explanation of the advantages and constraints of each type of data. Following this,
the chapter details the types of data that can be collected and sorted. It discusses the
difference between nominal-, ordinal-, interval-, and ratio-based data and the ways
in which they can be used to obtain insights into the subject being studied.
The chapter then discusses problem formulation and its importance. It explains
how and why formulating a problem will impact the data that is gathered, and
thus affect the conclusions at which a research project may arrive. It describes
a framework by which a messy real-world situation can be clarified so that a
mathematical toolkit can be used to identify solutions. The chapter explains the
idea of decision-problems, which can be used to understand the real world, and
research-objectives, which can be used to analyze decision-problems.

The chapter also details the challenges faced when collecting and collating data.
It discusses the importance of understanding what data to collect, how to collect it,
how to assess its quality, and finally the most appropriate way of collating it so that
it does not lose its value.
The chapter ends with an illustrative example of how the retailing industry might
use various sources of data in order to better serve their customers and understand
their preferences.
Data Management—Relational Database Management Systems: This chapter
introduces the idea of data management and storage. The focus of the chapter
is on relational database management systems, or RDBMS, the most commonly used data organization systems in enterprises. The chapter introduces and explains the ideas using MySQL, an open-source relational database management system built on SQL (Structured Query Language), which powers many of the largest data management systems in the world.
The chapter describes the basic functions of a MySQL server, such as creating
databases, examining data tables, and performing functions and various operations
on data sets. The first set of instructions the chapter discusses is about the rules,
definition, and creation of relational databases. Then, the chapter describes how to
create tables and add data to them using MySQL server commands. It explains how
to examine the data present in the tables using the SELECT command.
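To give a flavor of these operations, the following minimal sketch uses Python's built-in sqlite3 module as a lightweight stand-in for a MySQL server; the table and column names are invented for illustration, but the same CREATE, INSERT, and SELECT pattern carries over to the MySQL commands developed in the chapter.

```python
import sqlite3

# In-memory database as a stand-in for a connection to a MySQL server;
# the table and column names below are illustrative, not from the chapter.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE customers (
                   cust_id INTEGER PRIMARY KEY,
                   name    TEXT,
                   city    TEXT)""")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", "Hyderabad"), (2, "Ravi", "Chennai")])
conn.commit()

# Examine the stored data with SELECT, as the chapter does in MySQL
for row in cur.execute("SELECT name, city FROM customers WHERE city = 'Chennai'"):
    print(row)
```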
Data Management—Big Data: This chapter builds on some of the concepts
introduced in the previous chapter but focuses on big data tools. It describes what really constitutes big data and presents the basics of big data tools such as Hadoop, Spark, and the surrounding ecosystem.
The chapter begins by describing Hadoop’s uses and key features, as well as the
programs in its ecosystem that can also be used in conjunction with it. It also briefly
visits the concepts of distributed and parallel computing and big data cloud.
The chapter describes the architecture of the Hadoop runtime environment. It
starts by describing the cluster, which is the set of host machines, or nodes for
facilitating data access. It then moves on to the YARN infrastructure, which is
responsible for providing computational resources to the application. It describes
two main elements of the YARN infrastructure—the Resource Manager and the
Node Manager. It then details the HDFS Federation, which provides storage,
and also discusses other storage solutions. Lastly, it discusses the MapReduce
framework, which is the software layer.
The chapter then describes the functions of MapReduce in detail. MapReduce
divides tasks into subtasks, which it runs in parallel in order to increase efficiency. It
discusses the manner in which MapReduce takes lists of input data and transforms
them into lists of output data, by implementing a “map” process and a “reduce”
process, which it aggregates. It describes in detail the process steps that MapReduce
takes in order to produce the output, and describes how Python can be used to create
a MapReduce process for a word count program.
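As a rough, in-process illustration of the word count idea (not the chapter's own Hadoop code), the sketch below separates the map, shuffle, and reduce steps in plain Python; on Hadoop, the same mapper and reducer logic would be distributed across the nodes of a cluster.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in a line of input
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum the counts collected for a single word
    return word, sum(counts)

documents = ["big data tools include Hadoop and Spark",
             "Hadoop splits big jobs into small map and reduce tasks"]

# Map, shuffle (sort by key), then reduce; Hadoop performs these
# same steps in parallel across many machines.
mapped = sorted(kv for line in documents for kv in mapper(line))
word_counts = [reducer(word, (count for _, count in group))
               for word, group in groupby(mapped, key=itemgetter(0))]
print(word_counts)
```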
The chapter briefly describes Spark and an application using Spark. It concludes
with a discussion of cloud storage. The chapter makes use of the Cloudera virtual machine (VM) distribution to demonstrate the hands-on exercises.


Data Visualization: This chapter discusses how data is visualized and the way
that visualization can be used to aid in analysis. It starts by explaining that humans
use visuals to understand information, and that using visualizations incorrectly can
lead to mistaken conclusions. It discusses the importance of visualization as a
cognitive aid and the importance of working memory in the brain. It emphasizes
the role of data visualization in reducing the load on the reader.
The chapter details the six meta-rules of data visualization, which are as follows:
use the most appropriate chart, directly represent relationships between data, refrain
from asking the viewer to compare differences in area, never use color on top of
color, keep within the primal perceptions of the viewer, and chart with integrity.
Each rule is expanded upon in the chapter. The chapter discusses the kinds of
graphs and tables available to a visualizer, the advantages and disadvantages of 3D
visualization, and the best practices of color schemes.
Statistical Methods—Basic Inferences: This chapter introduces the fundamental
concepts of statistical inferences, such as population and sample parameters,
hypothesis testing, and analysis of variance. It begins by describing the differences
between population and sample means and variance and the methods to calculate
them. It explains the central limit theorem and its use in estimating the mean of a
population.
Confidence intervals are explained for samples in which variance is both
known and unknown. The concept of standard errors and the t- and Chi-squared
distributions are introduced. The chapter introduces hypothesis testing and the use
of statistical parameters to reject or fail to reject hypotheses. Type I and type II errors
are discussed.
Methods to compare two different samples are explained. Analysis of vari-
ance between two samples and within samples is also covered. The use of the
F-distribution in analyzing variance is explained. The chapter concludes with a discussion of when we need to compare the means of a number of populations, and explains how to use the technique of analysis of variance (ANOVA) instead of carrying out pairwise comparisons.
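For a taste of these inferences, the short sketch below runs a two-sample t-test and a one-way ANOVA on synthetic data using the scipy library; the group means, spreads, and sample sizes are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples from three populations (illustrative values only)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=58, scale=5, size=30)

# Two-sample t-test comparing the means of groups A and B
t_stat, p_two_sample = stats.ttest_ind(group_a, group_b)
print("t-test :", t_stat, p_two_sample)

# One-way ANOVA comparing all three means at once, instead of pairwise tests
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print("ANOVA  :", f_stat, p_anova)
```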
Statistical Methods—Linear Regression Analysis: This chapter explains the idea
of linear regression in detail. It begins with some examples, such as predicting
newspaper circulation. It uses the examples to discuss the methods by which
linear regression obtains results. It describes a linear regression as a functional
form that can be used to understand relationships between outcomes and input
variables and perform statistical inference. It discusses the importance of linear
regression and its popularity, and explains the basic assumptions underlying linear
regression.
The modeling section begins by discussing a model in which there is only a
single regressor. It explains why a scatter-plot can be useful in understanding single-
regressor models, and the importance of visual representation in statistical inference.
It explains the ordinary least squares method of estimating a parameter, and the use
of the sum of squares of residuals as a measure of the fit of a model. The chapter then
discusses the use of confidence intervals and hypothesis testing in a linear regression

model. These concepts are used to describe a linear regression model in which there
are multiple regressors, and the changes that are necessary to adjust a single linear
regression model to a multiple linear regression model.
The chapter then describes the ways in which the basic assumptions of the linear
regression model may be violated, and the need for further analysis and diagnostic
tools. It uses the famous Anscombe data sets in order to demonstrate the existence
of phenomena such as outliers and collinearity that necessitate further analysis. The
methods needed to deal with such problems are explained. The chapter considers
the ways in which the necessity for the use of such methods may be determined,
such as tools to determine whether some data points should be deleted or excluded
from the data set. The possible advantages and disadvantages of adding additional
regressors to a model are described. Dummy variables and their use are explained.
Examples are given for the case where there is only one category of dummy, and
then multiple categories.
The chapter then discusses assumptions regarding the error term. The effect of
the assumption that the error term is normally distributed is discussed, and the Q-Q
plot method of examining the truth of this assumption for the data set is explained.
The Box–Cox method of transforming the response variable in order to normalize
the error term is discussed. The chapter then discusses the idea that the error terms
may not have equal variance, that is, be homoscedastic. It explains possible reasons
for heteroscedasticity, and the ways to adapt the analysis to those situations.
The chapter considers the methods in which the regression model can be
validated. The root mean square error is introduced. Segmenting the data into
training and validation sets is explained. Finally, some frequently asked questions
are presented, along with exercises.
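A minimal sketch of fitting and inspecting a multiple linear regression, using synthetic data (the variables and coefficients are invented, not from the chapter) and the statsmodels library, might look as follows.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
# Synthetic data: circulation driven by two regressors plus noise (illustrative only)
n = 200
ad_spend = rng.uniform(10, 100, n)
price = rng.uniform(1, 5, n)
circulation = 300 + 4 * ad_spend - 30 * price + rng.normal(0, 20, n)

X = sm.add_constant(np.column_stack([ad_spend, price]))
model = sm.OLS(circulation, X).fit()     # ordinary least squares

print(model.summary())   # coefficients, t-tests, confidence intervals, R-squared
residuals = model.resid  # raw material for diagnostic checks such as Q-Q plots
```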
Statistical Methods—Advanced Regression: Three topics are covered in this
chapter. The main body of the chapter presents the tools for estimating the parameters of regression models when the response variable is binary or categorical.
The appendices to the chapter cover two other important techniques, namely,
maximum likelihood estimate (MLE) and how to deal with missing data.
The chapter begins with a description of logistic regression models. It continues with diagnostics of logistic regression, including likelihood ratio tests and Wald's and the Hosmer–Lemeshow tests. It then discusses different pseudo-R-squared measures, such as Cox and Snell, Nagelkerke, and McFadden. Then, it discusses how to choose the cutoff probability for classification, including discussion of discordant and concordant pairs, the ROC curve, and Youden's index. It concludes with a similar discussion of the multinomial logistic function and regression. The chapter contains
a self-contained introduction to the maximum likelihood method and methods for
treating missing data. The ideas introduced in this chapter are used in several
following chapters in the book.
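As a small illustration of estimating a binary-response model, the sketch below fits a logistic regression by maximum likelihood on synthetic data with statsmodels and classifies observations using a 0.5 cutoff probability; all numbers are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
# Synthetic binary-response data (illustrative only)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_logit = -0.5 + 1.2 * x1 - 0.8 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.Logit(y, X).fit()          # estimated by maximum likelihood
print(fit.summary())

# Classify using a 0.5 cutoff probability; the chapter discusses how to choose this cutoff
predicted = (fit.predict(X) > 0.5).astype(int)
print("in-sample accuracy:", (predicted == y).mean())
```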
Text Analytics: This is the first of several chapters that introduce specialized
analytics methods depending on the type of data and analysis. This chapter begins
by considering various motivating examples for text analysis. It explains the need
for a process by which unstructured text data can be analyzed, and the ways that
it can be used to improve business outcomes. It describes in detail the manner in


which Google used its text analytics software and its database of searches to identify
the spread of H1N1 flu. It lists the most common sources of text data, with social
media platforms and blogs producing the vast majority.
The second section of the chapter concerns the ways in which text can be
analyzed. It describes two approaches: a “bag-of-words” approach, in which the
structure of the language is not considered important, and a “natural-language”
approach, in which structure and phrases are also considered.
The example of a retail chain surveying responses to a potential ice-cream
product is used to introduce some terminology. It uses this example to describe
the problems of analyzing sentences due to the existence of grammatical rules, such
as the abundance of articles or the different tense forms of verbs. Various methods
of dealing with these problems are introduced. The term-document matrix (TDM)
is introduced along with its uses, such as generation of wordclouds.
The third and fourth sections of the chapter describe how to run text analysis
and some elementary applications. The text walks through a basic use of the
program R to analyze text. It looks at two ways that the TDM can be used to run
text analysis—using a text-base to cluster or segment documents, and elementary
sentiment analysis.
Clustering documents is a method by which similar customers are sorted into
the same group by analyzing their responses. Sentiment analysis is a method by
which attempts are made to make value judgments and extract qualitative responses.
The chapter describes the models for both processes in detail with regard to an
example.
The fifth section of the chapter then describes the more advanced technique
of latent topic mining. Latent topic mining aims to identify themes present in a
corpus, or a collection of documents. The chapter uses the example of the mission
statements of Fortune-1000 firms in order to identify some latent topics.
The sixth section of the chapter concerns natural-language processing (NLP).
NLP is a set of techniques that enables computers to understand nuances in human
languages. The method by which NLP programs detect data is discussed. The ideas
of this chapter are further explored in the chapter on Deep Learning. The chapter
ends with exercises for the student.
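To make the bag-of-words idea concrete, here is a minimal sketch that builds a document-term matrix (the transpose of the TDM) from three invented survey responses, assuming a recent version of the scikit-learn library is available.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy survey responses about a potential ice-cream product (invented text)
docs = ["loved the new ice cream flavour",
        "the ice cream was too sweet for me",
        "great flavour, will buy it again"]

# Bag-of-words representation; stop_words drops articles and similar filler words
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())   # rows are documents, columns are term counts
```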
Simulation: This chapter introduces the uses of simulation as a tool for analytics,
focusing on the example of a fashion retailer. It explains the use of Monte Carlo
simulation in the presence of uncertainty as an aid to making decisions that have
various trade-offs.
First, the chapter explains the purposes of simulation, and the ways it can be used
to design an optimal intervention. It differentiates between computer simulation,
which is the main aim of the chapter, and physical simulation. It discusses the
advantages and disadvantages of simulations, and mentions various applications of
simulation in real-world contexts.
The second part of the chapter discusses the steps that are followed in making a
simulation model. It explains how to identify dependent and independent variables,
and the manner in which the relationships between those variables can be modeled.
It describes the method by which input variables can be randomly generated,

and the output of the simulation can be interpreted. It illustrates these steps
using the example of a fashion retailer that needs to make a decision about
production.
The third part of the chapter describes decision-making under uncertainty and
the ways that simulation can be used. It describes how to set out a range of possible
interventions and how they can be modeled using a simulation. It discusses how to
use simulation processes in order to optimize decision-making under constraints, by
using the fashion retailer example in various contexts.
The chapter also contains a case study of a painting business deciding how much
to bid for a contract to paint a factory, and describes the solution to making this
decision. The concepts explained in this chapter are applied in different settings in
the following chapters.
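A minimal Monte Carlo sketch in the spirit of the fashion-retailer example, with invented prices, costs, and demand distribution, could look like this; each candidate order quantity is evaluated over many simulated demand draws.

```python
import numpy as np

rng = np.random.default_rng(3)

price, cost, salvage = 80.0, 30.0, 10.0      # illustrative economics, not from the chapter
order_quantities = [800, 1000, 1200]
n_trials = 10_000

# Uncertain demand for the fashion item, modeled here as normal for illustration
demand = rng.normal(loc=1000, scale=200, size=n_trials)

for q in order_quantities:
    sales = np.minimum(demand, q)
    leftover = q - sales
    profit = price * sales + salvage * leftover - cost * q
    print(f"order {q}: mean profit {profit.mean():9.0f}, "
          f"5th percentile {np.percentile(profit, 5):9.0f}")
```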
Optimization: Optimization techniques are used in almost every application
in this book. This chapter presents some of the core concepts of constrained
optimization. The basic ideas are illustrated using one broad class of optimization
problems called linear optimization. Linear optimization covers the most widely
used models in business. In addition, because linear models are easy to visualize in
two dimensions, it offers a visual introduction to the basic concepts in optimization.
Additionally, the chapter provides a brief introduction to other optimization models
and techniques such as integer/discrete optimization, nonlinear optimization, search
methods, and the use of optimization software.
The linear optimization part is conventionally developed by describing the deci-
sion variables, the objective function, constraints, and the assumptions underlying
the linear models. Using geometric arguments, it illustrates the concept of feasibility
and optimality. It then provides the basic theorems of linear programming. The
chapter then develops the idea of shadow prices, reduced costs, and sensitivity
analysis, which are the underpinnings of any post-optimality business analysis. The Solver add-in in Excel is used to illustrate these ideas. Then, the chapter
explains how these ideas extend to integer programming and provides an outline
of the branch and bound method with examples. The ideas are further extended
to nonlinear optimization via examples of models for linear regression, maximum
likelihood estimation, and logistic regression.
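To illustrate the mechanics outside Excel, the following sketch solves a small invented two-variable linear program with scipy's linprog routine; the objective and constraints are not taken from the chapter.

```python
from scipy.optimize import linprog

# Maximize 3x + 5y subject to x + 2y <= 14, 3x - y >= 0, x - y <= 2, x >= 0, y >= 0.
# linprog minimizes, so the objective is negated; ">=" rows are rewritten as "<=".
c = [-3, -5]
A_ub = [[1, 2],
        [-3, 1],
        [1, -1]]
b_ub = [14, 0, 2]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("optimal (x, y):", res.x)          # expected to be (6, 4) for this toy problem
print("optimal objective:", -res.fun)    # expected to be 38
```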
Forecasting Analytics: Forecasting is perhaps the most commonly used method
in business analytics. This chapter introduces the idea of using analytics to predict
the outcomes in the future, and focuses on applying analytics tools for business and
operations. The chapter begins by explaining the difficulty of predicting the future
with perfect accuracy, and the importance of accepting the uncertainty inherent in
any predictive analysis.
The chapter defines forecasting as estimating in unknown situations.
It describes data that can be used to make forecasts, but focuses on time-series
forecasting. It introduces the concepts of point-forecasts and prediction intervals,
which are used in time-series analysis as part of predictions of future outcomes. It
suggests reasons for the intervention of human judgment in the forecasts provided
by computers. It describes the core method of time-series forecasting—identifying
a model that forecasts the best.


The second part of the chapter describes quantitative approaches to forecasting.


It begins by describing the various kinds of data that can be used to make forecasts,
such as spoken accounts, written records, numbers, and so on. It explains some methods of dealing
with outliers in the data set, which can affect the fit of the forecast, such as trimming
and winsorizing.
The chapter discusses the effects of seasonal fluctuations on time-series data and
how to adjust for them. It introduces the autocorrelation function and its use. It also
explains the partial autocorrelation function.
A number of methods used in predictive forecasting are explained, including
the naïve method, the average and moving average methods, Holt exponential
smoothing, and the ARIMA framework. The chapter also discusses ways to predict
stochastic intermittent demand, such as Croston’s approach, and the Syntetos and
Boylan approximation.
The third section of the chapter describes the process of applied forecasting
analytics at the operational, tactical, and strategic levels. It propounds a seven-step
forecasting process for operational tasks, and explains each step in detail.
The fourth section of the chapter concerns evaluating the accuracy of forecasts.
It explains measures such as mean absolute error, mean squared error, and root
mean squared error, and how to calculate them. The use of both Excel and R software is explained.
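As a small illustration of these accuracy measures, the sketch below compares a naive forecast with a three-period moving average on an invented monthly series.

```python
import numpy as np

# Invented monthly demand series (illustrative only)
y = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118], dtype=float)

# Naive forecast: the next value equals the last observed value
naive = y[:-1]
# Three-period moving-average forecast
moving_avg = np.array([y[i - 3:i].mean() for i in range(3, len(y))])

def report(actual, forecast, label):
    err = actual - forecast
    mae = np.mean(np.abs(err))                 # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))          # root mean squared error
    print(f"{label:11s} MAE {mae:6.2f}  RMSE {rmse:6.2f}")

report(y[1:], naive, "naive")
report(y[3:], moving_avg, "moving avg")
```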
Advanced Statistical Methods: Count Data: The chapter begins by introducing
the idea of count variables and gives examples of where they are encountered, such
as insurance applications and the number of days taken off by persons who fall sick.
It first introduces the idea of the Poisson regression model, and explains why
ordinary least squares are not suited to some situations for which the Poisson model
is more appropriate. It illustrates the differences between the normal and Poisson
distributions using conditional distribution graphs.
It defines the Poisson distribution model and its general use, as well as an
example regarding insurance claims data. It walks through the interpretation of
the regression’s results, including the explanation of the regression coefficients,
deviance, dispersion, and so on.
It discusses some of the problems with the Poisson regression, and how
overdispersion can cause issues for the analysis. It introduces the negative binomial
distribution as a method to counteract overdispersion. Zero-inflation models are
discussed. The chapter ends with a case study on Canadian insurance data.
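A minimal sketch of fitting a Poisson regression, and a negative binomial alternative in case of overdispersion, on synthetic claim counts using statsmodels; the covariates and coefficient values are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
# Synthetic claim-count data (illustrative only): counts depend on age and vehicle value
n = 1000
age = rng.uniform(18, 70, n)
veh_value = rng.uniform(1, 10, n)
mu = np.exp(-1.0 + 0.01 * age + 0.05 * veh_value)
claims = rng.poisson(mu)

X = sm.add_constant(np.column_stack([age, veh_value]))
poisson_fit = sm.GLM(claims, X, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())   # coefficients are on the log scale; deviance aids diagnosis

# A negative binomial family can be tried if overdispersion is suspected
nb_fit = sm.GLM(claims, X, family=sm.families.NegativeBinomial()).fit()
print("Poisson AIC:", poisson_fit.aic, " Negative binomial AIC:", nb_fit.aic)
```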
Advanced Statistical Methods—Survival Analysis: Like the previous chapter, this
one deals with another specialized application. It involves techniques that analyze
time-to-event data. It defines time-to-event data and the contexts in which it can
be used, and provides a number of business situations in which survival analysis is
important.
The chapter explains the idea of censored data, which refers to survival times
in which the event in question has not yet occurred. It explains the differences
between survival models and other types of analysis, and the fields in which it can be
used. It defines the types of censoring: right-censoring, left-censoring, and interval-
censoring, and the method to incorporate them into the data set.

The chapter then defines the survival analysis functions: the survival function and
the hazard function. It describes some simple types of hazard functions. It describes
some parametric and nonparametric methods of analysis, and defines the cases in
which nonparametric methods must be used. It explains the Kaplan–Meier method
in detail, along with an example. Semiparametric models are introduced for cases
in which several covariate variables are believed to contribute to survival. Cox’s
proportional hazards model and its interpretation are discussed.
The chapter ends with a comparison between semiparametric and parametric
models, and a case study regarding churn data.
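For a flavor of these methods, the sketch below fits a Kaplan-Meier curve and a Cox proportional hazards model to synthetic, right-censored churn data. It assumes the third-party lifelines package is installed; the data and the single covariate are invented for illustration.

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

rng = np.random.default_rng(5)
# Synthetic churn data: customer tenure in months, with right-censoring
n = 300
true_time = rng.exponential(scale=24, size=n)
censor_time = rng.uniform(1, 36, size=n)
duration = np.minimum(true_time, censor_time)
observed = (true_time <= censor_time).astype(int)   # 1 = churn observed, 0 = censored
monthly_fee = rng.normal(50, 10, n)

# Nonparametric Kaplan-Meier estimate of the survival function
kmf = KaplanMeierFitter()
kmf.fit(duration, event_observed=observed)
print(kmf.survival_function_.head())

# Semiparametric Cox proportional hazards model with one covariate
df = pd.DataFrame({"duration": duration, "observed": observed, "monthly_fee": monthly_fee})
cph = CoxPHFitter().fit(df, duration_col="duration", event_col="observed")
cph.print_summary()
```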
Unsupervised Learning: The first of the three machine learning chapters sets
out the philosophy of machine learning. This chapter explains why unsupervised
learning—an important paradigm in machine learning—is akin to uncovering the
proverbial needle in the haystack, discovering the grammar of the process that
generated the data, and exaggerating the “signal” while ignoring the “noise” in it.
The chapter covers methods of projection, clustering, and density estimation—three
core unsupervised learning frameworks that help us perceive the data in different
ways. In addition, the chapter describes collaborative filtering and applications of
network analysis.
The chapter begins with drawing the distinction between supervised and unsuper-
vised learning. It then presents a common approach to solving unsupervised learning
problems by casting them into an optimization framework. In this framework, there
are four steps:
• Intuition: to develop an intuition about how to approach the problem as an
optimization problem
• Formulation: to write the precise mathematical objective function in terms of data
using intuition
• Modification: to modify the objective function into something simpler or “more
solvable”
• Optimization: to solve the final objective function using traditional optimization
approaches
The chapter discusses principal components analysis (PCA), self-organizing
maps (SOM), and multidimensional scaling (MDS) under projection algorithms.
In clustering, it describes partitional and hierarchical clustering. Under density
estimation, it describes nonparametric and parametric approaches. The chapter
concludes with illustrations of collaborative filtering and network analysis.
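A compact illustration of two of these frameworks, projection with PCA and partitional clustering with k-means, on synthetic data; it assumes scikit-learn is available, and the number of segments and dimensions are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
# Synthetic data: three latent customer segments living in five dimensions
centers = rng.normal(scale=4, size=(3, 5))
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])
X_std = StandardScaler().fit_transform(X)

# Projection: compress the five dimensions onto two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print("variance explained:", pca.explained_variance_ratio_)

# Clustering: partition the observations into three groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
print("cluster sizes:", np.bincount(kmeans.labels_))
```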
Supervised Learning: In supervised learning, the aim is to learn from previously
identified examples. The chapter covers the philosophical, theoretical, and practical
aspects of one of the most common machine learning paradigms—supervised
learning—that essentially learns to map from an observation (e.g., symptoms and
test results of a patient) to a prediction (e.g., disease or medical condition), which
in turn is used to make decisions (e.g., prescription). The chapter then explores the
process, science, and art of building supervised learning models.
The first part explains the different paradigms in supervised learning: classifi-
cation, regression, retrieval, recommendation, and how they differ by the nature


of their input and output. It then describes the process of learning, from features
description to feature engineering to models to algorithms that help make the
learning happen.
Among algorithms, the chapter describes rule-based classifiers, decision trees, k-
nearest neighbor, Parzen window, and Bayesian and naïve Bayes classifiers. Among
discriminant functions that partition a region using an algorithm, linear (LDA) and
quadratic discriminant analysis (QDA) are discussed. A section describes recom-
mendation engines. Neural networks are then introduced followed by a succinct
introduction to a key algorithm called support vector machines (SVM). The chapter
concludes with a description of ensemble techniques, including bagging, random
forest, boosting, mixture of experts, and hierarchical classifiers. The specialized
neural networks for Deep Learning are explained in the next chapter.
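As a quick illustration of fitting and comparing a few such classifiers, the sketch below uses a labeled data set bundled with scikit-learn and a simple train/test split; the particular models and settings are illustrative choices, not the chapter's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A bundled data set stands in for "previously identified examples"
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = [
    ("decision tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("k-nearest neighbor", KNeighborsClassifier(n_neighbors=5)),
    ("random forest (ensemble)", RandomForestClassifier(n_estimators=200, random_state=0)),
]
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name:26s} test accuracy {acc:.3f}")
```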
Deep Learning: This chapter introduces the idea of deep learning as a part of
machine learning. It aims to explain the idea of deep learning and various popular
deep learning architectures. It has four main parts:
• Understand what deep learning is.
• Understand various popular deep learning architectures, and know when to use which architecture for solving a business problem.
• Learn how to perform image analysis using deep learning.
• Learn how to perform text analysis using deep learning.
The chapter explains the origins of learning, from a single perceptron to mimic
the functioning of a neuron to the multilayered perceptron (MLP). It briefly recaps
the backpropagation algorithm and introduces the learning rate and error functions.
It then discusses the deep learning architectures applied to supervised, unsupervised,
and reinforcement learning. An example of using an artificial neural network for
recognizing handwritten digits (based on the MNIST data set) is presented.
The next section of the chapter describes Convolutional Neural Networks (CNN),
which are aimed at solving vision-related problems. The ImageNet data set is
introduced. The use of CNNs in the ImageNet Large Scale Visual Recognition
Challenge is explained, along with a brief history of the challenge. The biological
inspiration for CNNs is presented. Four layers of a typical CNN are introduced—
the convolution layer, the rectified linear units layer, the pooling layers, and the fully
connected layer. Each layer is explained, with examples. A unifying example using
the same MNIST data set is presented.
The third section of the chapter discusses recurrent neural networks (RNNs).
It begins by describing the motivation for sequence learning models, and their
use in understanding language. Traditional language models and their functions in
predicting words are explained. The chapter describes a basic RNN model with
three units, aimed at predicting the next word in a sentence. It explains the detailed
example by which an RNN can be built for next word prediction. It presents some
uses of RNNs, such as image captioning and machine translation.
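A minimal sketch of the MNIST digit-recognition example using a small multilayered perceptron; it assumes TensorFlow/Keras is installed, and the layer sizes and training settings are illustrative choices rather than the chapter's.

```python
import tensorflow as tf

# MNIST handwritten digits, as in the chapter's example
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small MLP: flatten the 28x28 image, one hidden layer, softmax output over 10 digits
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Backpropagation with the chosen learning rate rule happens inside fit()
model.fit(x_train, y_train, epochs=5, validation_split=0.1, verbose=2)
print(model.evaluate(x_test, y_test, verbose=0))
```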
The next seven chapters contain descriptions of analytics usage in different
domains and different contexts. These are described next.

Retail Analytics: The chapter begins by introducing the background and defini-
tion of retail analytics. It focuses on advanced analytics. It explains the use of analytics in four main categories of business decisions: consumer, product, human resources, and advertising. Several examples of retail analytics are presented, such as increasing
book recommendations during periods of cold weather. Complications in retail
analytics are discussed.
The second part of the chapter focuses on data collection in the retail sector. It
describes the traditional sources of retail data, such as point-of-sale devices, and
how they have been used in decision-making processes. It also discusses advances
in technology and the way that new means of data collection have changed the field.
These include the use of radio frequency identification technology, the Internet of
things, and Bluetooth beacons.
The third section describes methodologies, focusing on inventory, assortment,
and pricing decisions. It begins with modeling product-based demand in order
to make predictions. The L1-penalized regression method LASSO for retail demand forecasting is introduced. The use of regression trees and artificial neural networks
is discussed in the same context. The chapter then discusses the use of such forecasts
in decision-making. It presents evidence that machine learning approaches benefit
revenue and profit in both price-setting and inventory-choice contexts.
Demand models into which consumer choice is incorporated are introduced.
The multinomial logit, mixed multinomial logit, and nested logit models are
described. Nonparametric choice models are also introduced as an alternative to
logit models. Optimal assortment decisions using these models are presented.
Attempts at learning customer preferences while optimizing assortment choices are
described.
The fourth section of the chapter discusses business challenges and opportunities.
The benefits of omnichannel retail are discussed, along with the need for retail
analytics to change in order to fit an omnichannel shop. It also discusses some recent
start-ups in the retail analytics space and their focuses.
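To illustrate the LASSO idea for demand forecasting, the sketch below fits an L1-penalized regression on synthetic store-level data with scikit-learn; the candidate drivers and their coefficients are invented, and the penalty is chosen by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Synthetic store-week demand with many candidate drivers, only a few of them relevant
n, p = 500, 40
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:4] = [5.0, -3.0, 2.0, 1.5]       # say price, promotion, holiday, weather
demand = 100 + X @ true_beta + rng.normal(0, 3, n)

X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_std, demand)    # L1 penalty tuned by cross-validation

print("chosen penalty:", lasso.alpha_)
print("nonzero coefficients at columns:", np.flatnonzero(lasso.coef_))
```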
Marketing Analytics: Marketing is one of the most important, historically the
earliest, and fascinating areas for applying analytics to solve business problems.
Due to the vast array of applications, only the most important ones are surveyed
in this chapter. The chapter begins by explaining the importance of using marketing
analytics for firms. It defines the various levels that marketing analytics can apply to:
the firm, the brand or product, and the customer. It introduces a number of processes
and models that can be used in analyzing and making marketing decisions, including
statistical analysis, nonparametric tools, and customer analysis. The processes
and tools discussed in this chapter will help in various aspects of marketing
such as target marketing and segmentation, price and promotion, customer valua-
tion, resource allocation, response analysis, demand assessment, and new product
development.
The second section of the chapter explains the use of the interaction effect
in regression models. Building on earlier chapters on regression, it explains the
utility of a term that captures the effect of one or more interactions between other


variables. It explains how to interpret new variables and their significance. The use
of curvilinear relationships in order to identify the curvilinear effect is discussed.
Mediation analysis is introduced, along with an example.
The third section describes data envelopment analysis (DEA), which is aimed at
improving the performance of organizations. It describes the manner in which DEA
works to present targets to managers and can be used to answer key operational
questions in Marketing: sales force productivity, performance of sales regions, and
effectiveness of geomarketing.
The next topic covered is conjoint analysis. It explains how knowing customers’
preferences provides invaluable information about how customers think and make
their decisions before purchasing products. Thus, it helps firms devise their market-
ing strategies including advertising, promotion, and sales activities.
The fifth section of the chapter discusses customer analytics. Customer lifetime
value (CLV), a measure of the value provided to firms by customers, is introduced,
along with some other measures. A method to calculate CLV is presented, along
with its limitations. The chapter also discusses two more measures of customer
value: customer referral value and customer influence value, in detail. Additional
topics are covered in the chapters on retail analytics and social media analytics.
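As a simple illustration of one common CLV formulation (with invented margin, retention, and discount figures; the chapter presents the calculation and its limitations in full), consider the following sketch.

```python
def customer_lifetime_value(margin, retention, discount, horizon_years):
    """Present value of expected margins from one customer.

    margin        : contribution earned per year while the customer stays
    retention     : probability the customer is retained from one year to the next
    discount      : annual discount rate
    horizon_years : number of years over which value is projected
    """
    return sum(margin * retention ** t / (1 + discount) ** t
               for t in range(1, horizon_years + 1))

# Illustrative numbers, not from the chapter
clv = customer_lifetime_value(margin=2000, retention=0.8, discount=0.10, horizon_years=10)
print(f"CLV over 10 years: {clv:,.0f}")
```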
Financial Analytics: Financial analytics like Marketing has been a big consumer
of data. The topics chosen in this chapter provide one unified way of thinking
about analytics in this domain—valuation. This chapter focuses on the two main
branches of quantitative finance: the risk-neutral or “Q” world and the risk-averse
or “P” world. It describes the constraints and aims of analysts in each world, along
with their primary methodologies. It explains Q-quant theories such as the work of
Black and Scholes, and Harrison and Pliska. P-quant theories such as net present
value, capital asset pricing models, arbitrage pricing theory, and the efficient market
hypothesis are presented.
The methodology of financial data analytics is explained via a three-stage
process: asset price estimation, risk management, and portfolio analysis.
Asset price estimation is explained as a five-step process. It describes the use
of the random walk in identifying the variable to be analyzed. Several methods of
transforming the variable into one that is identical and independently distributed
are presented. A maximum likelihood estimation method to model variance is
explained. Monte Carlo simulations of projecting variables into the future are
discussed, along with pricing projected variables.
Risk management is discussed as a three-step process. The first step is risk
aggregation. Copula functions and their uses are explained. The second step,
portfolio assessment, is explained by using metrics such as Value at Risk. The third
step, attribution, is explained. Various types of capital at risk are listed.
Portfolio analysis is described as a two-stage process. Allocating risk for the
entire portfolio is discussed. Executing trades in order to move the portfolio to a
new risk/return level is explained.
A detailed example explaining each of the ten steps is presented, along with data
and code in MATLAB. This example also serves as a stand-alone case study on
financial analytics.
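The chapter's worked example uses MATLAB; for a flavor of the same ideas in Python, the sketch below simulates a geometric Brownian motion price (a random walk in log prices) over a short horizon and reads off a Value at Risk figure, with all parameters invented.

```python
import numpy as np

rng = np.random.default_rng(8)

# Geometric Brownian motion parameters (illustrative only)
s0, mu, sigma = 100.0, 0.08, 0.25          # spot price, annual drift, annual volatility
horizon, n_paths = 10 / 252, 50_000        # a ten-trading-day horizon

z = rng.standard_normal(n_paths)
s_t = s0 * np.exp((mu - 0.5 * sigma ** 2) * horizon + sigma * np.sqrt(horizon) * z)

pnl = s_t - s0                             # profit and loss per unit held
var_99 = -np.percentile(pnl, 1)            # 99% Value at Risk (a loss, hence the sign flip)
print(f"99% ten-day VaR per unit: {var_99:.2f}")
```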

Social Media Analytics: Social-media-based analytics has been growing in importance and value to businesses. This chapter discusses the various tools
available to gather and analyze data from social media and Internet-based sources,
focusing on the use of advertisements. It begins by describing Web-based analytical
tools and the information they can provide, such as cookies, sentiment analysis, and
mobile analytics.
It introduces real-time advertising on online platforms, and the wealth of
data generated by browsers visiting target websites. It lists the various kinds of
advertising possible, including video and audio ads, map-based ads, and banner
ads. It explains the various avenues in which these ads can be displayed, and
details the reach of social media sites such as Facebook and Twitter. The various
methods in which ads can be purchased are discussed. Programmatic advertising
and its components are introduced. Real-time bidding on online advertising spaces
is explained.
A/B experiments are defined and explained. The completely randomized design
(CRD) experiment is discussed. The regression model for the CRD and an example
are presented. The need for randomized complete block design experiments is
introduced, and an example for such an experiment is shown. Analytics of mul-
tivariate experiments and their advantages are discussed. Orthogonal designs and
their meanings are explained.
The chapter discusses the use of data-driven search engine advertising. The
use of data in order to help companies better reach consumers and identify
trends is discussed. The power of search engines in this regard is discussed. The
problem of attribution, or identifying the influence of various ads across various
platforms, is introduced, and a number of models that aim to solve this problem
are elucidated. Some models discussed are: the first click attribution model, the
last click attribution model, the linear attribution model, and algorithmic attribution
models.
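A small illustration of analyzing a completely randomized A/B experiment on conversion rates, using a two-sample proportions test from statsmodels; the click and impression counts are invented.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Invented results of an A/B test on two ad creatives
clicks = np.array([310, 355])            # conversions for versions A and B
impressions = np.array([10_000, 10_000])

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"conversion rates: A {clicks[0] / impressions[0]:.3%}, "
      f"B {clicks[1] / impressions[1]:.3%}")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```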
Healthcare Analytics: Healthcare is once again an area where data, experi-
ments, and research have coexisted within an analytical framework for hundreds
of years. This chapter discusses analytical approaches to healthcare. It begins
with an overview of the current field of healthcare analytics. It describes the
latest innovations in the use of data to refine healthcare, including telemedicine,
wearable technologies, and simulations of the human body. It describes some of the
challenges that data analysts can face when attempting to use analytics to understand
healthcare-related problems.
The main part of the chapter focuses on the use of analytics to improve
operations. The context is patient flow in outpatient clinics. It uses Academic
Medical Centers as an example to describe the processes that patients go through
when visiting clinics that are also teaching centers. It describes the effects of the
Affordable Care Act, an aging population, and changes in social healthcare systems
on the public health infrastructure in the USA.
A five-step process map of a representative clinic is presented, along with a
discrete event simulation of the clinic. The history of using operations research-
based methods to improve healthcare processes is discussed. The chapter introduces


a six-step process aimed at understanding complex systems, identifying potential improvements, and predicting the effects of changes, and describes each step in
detail.
Lastly, the chapter discusses the various results of this process on some goals
of the clinic, such as arrivals, processing times, and impact on teaching. Data
regarding each goal and its change are presented and analyzed. The chapter contains
a hands-on exercise based on the simulation models discussed. The chapter is a fine
application of simulation concepts and modeling methodologies used in Operations
Management to improve healthcare systems.
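To hint at what a discrete event simulation of patient flow involves, the sketch below simulates waiting times at a single-provider clinic using the Lindley recursion; the arrival and consultation rates are invented, and a full clinic model such as the chapter's would include several stages and resources.

```python
import numpy as np

rng = np.random.default_rng(9)

# Single-provider clinic, illustrative parameters: patients arrive every 12 minutes
# on average and consultations last 10 minutes on average (both exponential here)
n_patients = 10_000
interarrival = rng.exponential(scale=12.0, size=n_patients)
service = rng.exponential(scale=10.0, size=n_patients)

# Lindley recursion: each patient's wait depends on the previous patient's wait
# and service time, less the gap until the new patient arrives
wait = np.zeros(n_patients)
for i in range(1, n_patients):
    wait[i] = max(0.0, wait[i - 1] + service[i - 1] - interarrival[i])

print(f"average wait   : {wait.mean():5.1f} minutes")
print(f"90th pct. wait : {np.percentile(wait, 90):5.1f} minutes")
```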
Pricing Analytics: This chapter discusses the various mechanisms available
to companies in order to price their products. The topics pertain to revenue
management, which constitutes perhaps the most successful and visible area of
business analytics.
The chapter begins by defining two factors that affect pricing: the
nature of the product and its competition, and customers’ preferences and values.
It introduces the concept of a price optimization model, and the need to control
capacity constraints when estimating customer demand.
The first type of model introduced is the independent class model. The underlying
assumption behind the model is defined, as well as its implications for modeling
customer choice. The EMSR heuristic and its use are explained.
The issue of overbooking in many service-related industries is introduced. The
trade-off between an underutilized inventory and the risk of denying service to
customers is discussed. A model for deciding an overbooking limit, given the
physical capacity at the disposal of the company, is presented. Dynamic pricing
is presented as a method to better utilize inventory.
Three main types of dynamic pricing are discussed: surge pricing, repricing,
and markup/markdown pricing. Each type is comprehensively explained. Three
models of forecasting and estimating customer demand are presented: additive,
multiplicative, and choice.
A number of processes for capacity control, such as nested allocations, are
presented. Network revenue management systems are introduced. A backward
induction method of control is explained. The chapter ends with an example of a
hotel that is planning allocation of rooms based on a demand forecast.
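As a flavor of the independent class model, the sketch below applies Littlewood's two-fare rule, the basic building block behind EMSR-type heuristics, with invented fares and an invented demand forecast for the high fare.

```python
from scipy.stats import norm

# Two-fare independent class model (illustrative numbers, not from the chapter)
fare_high, fare_low = 400.0, 150.0
mu_high, sigma_high = 60.0, 20.0         # forecast of high-fare demand (normal)

# Littlewood's rule: protect y seats for the high fare where
# P(high-fare demand > y) = fare_low / fare_high
protection = norm.ppf(1 - fare_low / fare_high, loc=mu_high, scale=sigma_high)

capacity = 180
booking_limit_low = capacity - protection
print(f"protect about {protection:.0f} seats for the high fare")
print(f"booking limit for the low fare: about {booking_limit_low:.0f} seats")
```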
Supply Chain Analytics: This chapter discusses the use of data and analytical
tools to increase value in the supply chain. It begins by defining the processes
that constitute supply chains, and the goals of supply chain management. The
uncertainty inherent in supply chains is discussed. Four applications of supply chain
analytics are described: demand forecasting, inventory optimization, supply chain
disruption, and commodity procurement.
A case study of VASTA, one of the largest wireless services carriers in the USA,
is presented. The case study concerns the decision of whether the company should
change its current inventory strategy from a “push” strategy to a “pull” strategy.
The advantages and disadvantages of each strategy are discussed. A basic model
to evaluate both strategies is introduced. An analysis of the results is presented.
Following the analysis, a more advanced evaluation model is introduced. Customer
satisfaction and implementation costs are added to the model.

The last three chapters of the book contain case studies. Each of the cases comes
with a large data set upon which students can practice almost every technique and
modeling approach covered in the book. The InfoMedia case study explains the use of viewership data to design promotional campaigns. The problem presented is to determine a multichannel allocation of ad spots in order to maximize "reach" given
a budget and campaign guidelines. The approach uses simulation to compute the
viewership and then uses the simulated data to link promotional aspects to the total
reach of a campaign. Finally, the model can be used to optimize the allocation of
budgets across channels.
The AAA airline case study illustrates the use of choice models to design airline
offerings. The main task is to develop a demand forecasting model, which predicts
the passenger share for every origin–destination pair (O–D pair) given AAA, as
well as competitors’ offerings. The students are asked to explore different models
including the MNL and machine learning algorithms. Once a demand model has
been developed it can be used to diagnose the current performance and suggest
various remedies, such as adding, dropping, or changing itineraries in specific city
pairs. The third case study, Ideal Insurance, is on fraud detection. The problem faced
by the firm is the growing cost of servicing and settling claims in their healthcare
practice. The students learn about the industry and its intricate relationships with
various stakeholders. They also get an introduction to rule-based decision support
systems. The students are asked to create a system for detecting fraud, which should
be superior to the current “rule-based” system.

2 The Intended Audience

This book is the first of its kind in both breadth and depth of coverage and serves as a textbook for students of first-year graduate programs in analytics and of long-duration (1-year part-time) certificate programs in business analytics. It also serves as a guide for practitioners.
The content is based on the curriculum of the Certificate Programme in Business
Analytics (CBA), now renamed as Advanced Management Programme in Business
Analytics (AMPBA) of Indian School of Business (ISB). The original curriculum
was created by Galit Shmueli. The curriculum was further developed by the
coeditors, Bhimasankaram Pochiraju and Sridhar Seshadri, who were responsible
for starting and mentoring the CBA program in ISB. Bhimasankaram Pochiraju has
been the Faculty Director of CBA since its inception and was a member of the
Academic Board. Sridhar Seshadri managed the launch of the program and since
then has chaired the academic development efforts. Based on industry needs,
the curriculum continues to be modified by the Academic Board of the Applied
Statistics and Computing Lab (ASC Lab) at ISB.

Part I
Tools
Chapter 2
Data Collection

Sudhir Voleti

1 Introduction

Collecting data is the first step towards analyzing it. In order to understand and solve
business problems, data scientists must have a strong grasp of the characteristics of
the data in question. How do we collect data? What kinds of data exist? Where
is it coming from? Before beginning to analyze data, analysts must know how to
answer these questions. In doing so, we build the base upon which the rest of our
examination follows. This chapter aims to introduce and explain the nuances of data
collection, so that we understand the methods we can use to analyze it.

2 The Value of Data: A Motivating Example

In 2017, video-streaming company Netflix Inc. was worth more than $80 billion,
more than 100 times its value when it listed in 2002. The company’s current position
as the market leader in the online-streaming sector is a far cry from its humble
beginning as a DVD rental-by-mail service founded in 1997. So, what had driven
Netflix’s incredible success? What helped its shares, priced at $15 each on their
initial public offering in May 2002, rise to nearly $190 in July 2017? It is well
known that a firm’s [market] valuation is the sum total in today’s money, or the net
present value (NPV) of all the profits the firm will earn over its lifetime. So investors
reckon that Netflix is worth tens of billions of dollars in profits over its lifetime.
Why might this be the case? After all, companies had been creating television and
cinematic content for decades before Netflix came along, and Netflix did not start
its own online business until 2007. Why is Netflix different from traditional cable
companies that offer shows on their own channels?
Moreover, the vast majority of Netflix’s content is actually owned by its
competitors. Though the streaming company invests in original programming, the
lion’s share of the material available on Netflix is produced by cable companies
across the world. Yet Netflix has access to one key asset that helps it to predict
where its audience will go and understand their every quirk: data.
Netflix can track every action that a customer makes on its website—what they
watch, how long they watch it for, when they tune out, and most importantly, what
they might be looking for next. This data is invaluable to its business—it allows the
company to target specific niches of the market with unerring accuracy.
On February 1, 2013, Netflix debuted House of Cards—a political thriller starring
Kevin Spacey. The show was a hit, propelling Netflix’s viewership and proving
that its online strategy could work. A few months later, Spacey applauded Netflix’s
approach and cited its use of data for its ability to take a risk on a project that every
other major television studio and network had declined. Spacey said in Edinburgh, at the
Guardian Edinburgh International Television Festival1 on August 22: “Netflix was
the only company that said, ‘We believe in you. We have run our data, and it tells us
our audience would watch this series.’”
Netflix’s data-oriented approach is key not just to its ability to pick winning
television shows, but to its global reach and power. Though competitors are
springing up the world over, Netflix remains at the top of the pack, and so long
as it is able to exploit its knowledge of how its viewers behave and what they prefer
to watch, it will remain there.
Let us take another example. The technology “cab” company Uber has taken the
world by storm in the past 5 years. In 2014, Uber’s valuation was a mammoth 40
billion USD, which by 2015 jumped another 50% to reach 60 billion USD. This
fact begs the question: what makes Uber so special? What competitive advantage,
strategic asset, and/or enabling platform accounts for Uber’s valuation numbers?
The investors reckon that Uber is worth tens of billions of dollars in profits over
its lifetime. Why might this be the case? Uber is after all known as a ride-sharing
business—and there are other cab companies available in every city.
We know that Uber is “asset-light,” in the sense that it does not own the cab fleet
or have drivers of the cabs on its direct payroll as employees. It employs a franchise
model wherein drivers bring their own vehicles and sign up for Uber. Yet Uber
does have one key asset that it actually owns, one that lies at the heart of its profit
projections: data. Uber owns all rights to every bit of data from every passenger,
every driver, every ride and every route on its network. Curious as to how much
data we are talking about? Consider this. Uber took 6 years to reach one billion

1 Guardian Edinburgh International Television Festival, 2017 (https://www.ibtimes.com/kevin-spacey-speech-why-netflix-model-can-save-television-video-full-transcript-1401970), accessed on Sep 13, 2018.
rides (Dec 2015). Six months later, it had reached the two billion mark. That is one
billion rides in 180 days, or 5.5 million rides/day. How did having consumer data
play a factor in the exponential growth of a company such as Uber? Moreover, how
does data connect to analytics and, finally, to market value?
Data is a valuable asset that helps build sustainable competitive advantage. It
enables what economists would call “supernormal profits” and thereby plausibly
justifies some of those wonderful valuation numbers we saw earlier. Uber had help,
of course. The nature of demand for its product (contractual personal transporta-
tion), the ubiquity of its enabling platform (location-enabled mobile devices), and
the profile of its typical customers (the smartphone-owning, convenience-seeking
segment) have all contributed to its success. However, that does not take away from
the central point being motivated here—the value contained in data, and the need to
collect and corral this valuable resource into a strategic asset.

3 Data Collection Preliminaries

A well-known management adage goes, “We can only manage what we can mea-
sure.” But why is measurement considered so critical? Measurement is important
because it precedes analysis, which in turn precedes modeling. And more often than
not, it is modeling that enables prediction. Without prediction (determination of
the values an outcome or entity will take under specific conditions), there can be
no optimization. And without optimization, there is no management. The quantity
that gets measured is reflected in our records as “data.” The word data comes
from the Latin root datum, meaning “given.” Thus, data (the plural of datum) refers to facts which are given or known to be true. In what follows, we will explore some
preliminary conceptions about data, types of data, basic measurement scales, and
the implications therein.

3.1 Primary Versus Secondary Dichotomy

Data collection for research and analytics can broadly be divided into two major
types: primary data and secondary data. Consider a project or a business task that
requires certain data. Primary data would be data that is collected “at source” (hence,
primary in form) and specifically for the research at hand. The data source could
be individuals, groups, organizations, etc. and data from them would be actively
elicited or passively observed and collected. Thus, surveys, interviews, and focus
groups all fall under the ambit of primary data. The main advantage of primary data
is that it is tailored specifically to the questions posed by the research project. The
disadvantages are cost and time.
On the other hand, secondary data is that which has been previously collected
for a purpose that is not specific to the research at hand. For example, sales records,

industry reports, and interview transcripts from past research are data that would
continue to exist whether or not the project at hand had come to fruition. A good
example of a means to obtain secondary data that is rapidly expanding is the API
(Application Programming Interface)—an interface that is used by developers to
securely query external systems and obtain a myriad of information.
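As a simple illustration, the sketch below (in R, using the httr package) shows what querying such an API might look like; the endpoint, query parameters, and authorization header are purely hypothetical placeholders and do not refer to a real service.

library(httr)

# Hypothetical endpoint and parameters, shown only to illustrate the mechanics.
url  <- "https://api.example.com/v1/retail-sales"
resp <- GET(url,
            query = list(region = "south", year = 2017),
            add_headers(Authorization = "Bearer <your-api-key>"))

if (status_code(resp) == 200) {
  sales_data <- content(resp, as = "parsed")   # parse the JSON body into an R list
  str(sales_data)                              # inspect the structure of the response
} else {
  warning("API request failed with status ", status_code(resp))
}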
In this chapter, we concentrate on data available in published sources and
websites (often called secondary data sources) as these are the most commonly used
data sources in business today.

4 Data Collection Methods

In this section, we describe various methods of data collection based on sources, structure, type, etc. There are basically two methods of data collection: (1) data
generation through a designed experiment and (2) collecting data that already exists.
A brief description of these methods is given below.

4.1 Designed Experiment

Suppose an agricultural scientist wants to compare the effects of five different fertilizers, A, B, C, D, and E, on the yield of a crop. The yield depends not only on the fertilizer but also on the fertility of the soil. The scientist considers a few relevant types of soil, for example, clay, silt, and sandy soil. In order to compare
relevant types of soil, for example, clay, silt, and sandy soil. In order to compare
the fertilizer effect one has to control for the soil effect. For each soil type, the
experimenter may choose ten representative plots of equal size and assign the five
fertilizers to the ten plots at random in such a way that each fertilizer is assigned
to two plots. He then observes the yield in each plot. This is the design of the
experiment. Once the experiment is conducted as per this design, the yields in
different plots are observed. This is the data collection procedure. As we notice, the
data is not readily available to the scientist. He designs an experiment and generates
the data. This method of data collection is possible when we can control different
factors precisely while studying the effect of an important variable on the outcome.
This is quite common in the manufacturing industry (while studying the effect
of machines on output or various settings on the yield of a process), psychology,
agriculture, etc. For well-designed experiments, determination of the causal effects
is easy. However, in social sciences and business where human beings often are the
instruments or subjects, experimentation is not easy and in fact may not even be
feasible. Despite the limitations, there has been tremendous interest in behavioral
experiments in disciplines such as finance, economics, marketing, and operations
management. For a recent account on design of experiments, please refer to
Montgomery (2017).
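As an aside, the random assignment described above is easy to generate in software. The following minimal R sketch (the seed and the layout of the output table are our own choices for illustration) assigns the five fertilizers to ten plots within each soil type, two plots per fertilizer:

set.seed(42)                                  # for a reproducible randomization
fertilizers <- c("A", "B", "C", "D", "E")
soils       <- c("clay", "silt", "sandy")

design <- do.call(rbind, lapply(soils, function(s) {
  data.frame(soil       = s,
             plot       = 1:10,
             fertilizer = sample(rep(fertilizers, each = 2)))  # random order of 10 labels
}))

table(design$soil, design$fertilizer)  # check: each fertilizer occurs twice per soil type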

4.2 Collection of Data That Already Exists

Household income, expenditure, wealth, and demographic information are examples of data that already exists. Collection of such data is usually done in three possible
ways: (1) complete enumeration, (2) sample survey, and (3) through available
sources where the data was collected possibly for a different purpose and is available
in different published sources. Complete enumeration is collecting data on all
items/individuals/firms. Such data, say, on households, may be on consumption
of essential commodities, the family income, births and deaths, education of each
member of the household, etc. This data is already available with the households
but needs to be collected by the investigator. The census is an example of complete
enumeration. This method will give information on the whole population. It may
appear to be the best way but is expensive both in terms of time and money. Also,
it may involve several investigators and investigator bias can creep in (in ways that
may not be easy to account for). Such errors are known as non-sampling errors. So
often, a sample survey is employed. In a sample survey, the data is not collected on
the entire population, but on a representative sample. Based on the data collected
from the sample, inferences are drawn on the population. Since data is not collected
on the entire population, there is bound to be an error in the inferences drawn. This
error is known as the sampling error. The inferences through a sample survey can be
made precise with error bounds. It is commonly employed in market research, social
sciences, public administration, etc. A good account on sample surveys is available
in Blair and Blair (2015).
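To make the contrast concrete, here is a minimal R sketch (with a simulated household-income “population”, an assumption made purely for illustration) comparing complete enumeration with a sample survey and its error bound:

set.seed(1)
population <- rlnorm(100000, meanlog = 10, sdlog = 0.5)  # 100,000 simulated household incomes

mean(population)                  # complete enumeration: the exact population mean

smpl <- sample(population, 1000)  # simple random sample of 1,000 households
xbar <- mean(smpl)
se   <- sd(smpl) / sqrt(length(smpl))
c(estimate = xbar, lower = xbar - 1.96 * se, upper = xbar + 1.96 * se)
# The interval quantifies the sampling error attached to the survey-based estimate.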
Secondary data can be collected from two sources: internal or external. Internal
data is collected by the company or its agents on behalf of the company. The
defining characteristic of the internal data is its proprietary nature; the company
has control over the data collection process and also has exclusive access to the
data and thus the insights drawn on it. Although it is costlier than external data, the
exclusivity of access to the data can offer competitive advantage to the company.
The external data, on the other hand, can be collected by either third-party data
providers (such as IRI, AC Nielsen) or government agencies. In addition, recently
another source of external secondary data has come into existence in the form of
social media/blogs/review websites/search engines where users themselves generate
a lot of data through C2B or C2C interactions. Secondary data can also be classified
on the nature of the data along the dimension of structure. Broadly, there are
three types of data: structured, semi-structured (hybrid), and unstructured data.
Some examples of structured data are sales records, financial reports, customer
records such as purchase history, etc. A typical example of unstructured data is
in the form of free-flow text, images, audio, and videos, which are difficult to
store in a traditional database. Usually, in reality, data is somewhere in between
structured and unstructured and thus is called semi-structured or hybrid data. For
example, a product web page will have product details (structured) and user reviews
(unstructured).


The data and its analysis can also be classified on the basis of whether a single
unit is observed over multiple time points (time-series data), many units observed
once (cross-sectional data), or many units are observed over multiple time periods
(panel data). The insights that can be drawn from the data depend on the nature
of data, with the richest insights available from panel data. The panel could be
balanced (all units are observed over all time periods) or unbalanced (observations
on a few units are missing for a few time points either by design or by accident).
If the data is not missing excessively, it can be accounted for using the methods
described in Chap. 8.
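For concreteness, the minimal R sketch below constructs toy examples of the three layouts (the stores, months, and sales figures are made up for illustration):

# Time series: one unit (a single store) observed over several months.
ts_data <- data.frame(month = 1:6, sales = c(210, 205, 230, 250, 245, 260))

# Cross-section: many stores observed once.
cs_data <- data.frame(store = c("S1", "S2", "S3"), sales = c(230, 180, 310))

# Panel: many stores observed over several months (balanced if no store-month is missing).
set.seed(11)
panel_data <- expand.grid(store = c("S1", "S2", "S3"), month = 1:6)
panel_data$sales <- round(rnorm(nrow(panel_data), mean = 240, sd = 30))
head(panel_data)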

5 Data Types

In programming, we primarily classify the data into three types—numerals, alphabets, and special characters; the computer converts any data type into binary
code for further processing. However, the data collected through various sources
can be of types such as numbers, text, image, video, voice, and biometrics.
The data type helps the analyst evaluate which operations can be performed to analyze the data in a meaningful way. The data can limit or enhance the complexity and quality of analysis. Table 2.1 lists a few examples of data categorized by type, source, and uses. You can read more about them by following the links (all accessed on Aug 10, 2017).

5.1 Four Data Types and Primary Scales

Generally, there are four types of data associated with four primary scales, namely,
nominal, ordinal, interval, and ratio. Nominal scale is used to describe categories in
which there is no specific order while the ordinal scale is used to describe categories
in which there is an inherent order. For example, green, yellow, and red are three
colors that in general are not bound by an inherent order. In such a case, a nominal
scale is appropriate. However, if we are using the same colors in connection with
the traffic light signals there is clear order. In this case, these categories carry an
ordinal scale. Typical examples of the ordinal scale are (1) sick, recovering, healthy;
(2) lower income, middle income, higher income; (3) illiterate, primary school pass,
higher school pass, graduate or higher, and so on. In the ordinal scale, the differences
in the categories are not of the same magnitude (or even of measurable magnitude).
Interval scale is used to convey relative magnitude information such as temperature.
The term “Interval” comes about because rulers (and rating scales) have intervals
of uniform lengths. Example: “I rate A as a 7 and B as a 4 on a scale of 10.”
In this case, we not only know that A is preferred to B, but we also have some
idea of how much more A is preferred to B. Ratio scales convey information on
an absolute scale. Example: “I paid $11 for A and $12 for B.” The 11 and 12
here are termed “absolute” measures because the corresponding zero point ($0) is
understood in the same way by different people (i.e., the measure is independent of
subject).
Another set of examples for the four data types, this time from the world of
sports, could be as follows. The numbers assigned to runners are of nominal data
type, whereas the rank order of winners is of the ordinal data type. Note in the latter
case that while we may know who came first and who came second, we would not
know by how much based on the rank order alone. A performance rating on a 0–10

Table 2.1 A description of data and their types, sources, and examples
Category Examples Type Sourcesa
Internal data
Transaction Sales (POS/online) Numbers, text http://times.cs.uiuc.edu/
data transactions, stock ~wang296/Data/
market orders and https://www.quandl.com/
trades, customer IP https://www.nyse.com/
and geolocation data data/transactions-statistics-
data-library
https://www.sec.gov/
answers/shortsalevolume.
htm
Customer Website click stream, Numbers, text C:\Users\username\App
preference data cookies, shopping Data\Roaming\Microsoft
cart, wish list, \Windows\Cookies,
preorder Nearbuy.com (advance
coupon sold)
Experimental Simulation games, Text, number, image, https://www.
data clinical trials, live audio, video clinicaltrialsregister.eu/
experiments https://www.novctrd.com/
http://ctri.nic.in/
Customer Demographics, Text, number, image,
relationship purchase history, biometrics
data loyalty rewards data,
phone book
External data
Survey data Census, national Text, number, image, http://www.census.gov/
sample survey, audio, video data.html
annual survey of http://www.mospi.gov.in/
industries, http://www.csoisw.gov.in/
geographical survey, https://www.gsi.gov.in/
land registry http://
landrecords.mp.gov.in/
Biometric data Immigration data, Number, text, http://www.migration
(fingerprint, social security image, policy.org/programs/
retina, pupil, identity, Aadhar card biometric migration-data-hub
palm, face) (UID) https://www.dhs.gov/
immigration-statistics
(continued)

www.dbooks.org
26 S. Voleti

Table 2.1 (continued)


Category Examples Type Sourcesa
Third party RenTrak, A. C. All possible data types http://aws.amazon.com/
data Nielsen, IRI, MIDT datasets
(Market Information https://www.worldwildlife.
Data Tapes) in airline org/pages/conservation-
industry, people science-data-and-tools
finder, associations, http://www.whitepages.
NGOs, database com/
vendors, Google https://pipl.com/
Trends, Google https://www.bloomberg.
Public Data com/
https://in.reuters.com/
http://www.imdb.com/
http://datacatalogs.org/
http://www.google.com/
trends/explore
https://www.google.com/
publicdata/directory
Govt and quasi Federal All possible data types http://data.gov/
govt agencies governments, https://data.gov.in/
regulators— http://data.gov.uk/
Telecom, BFSI, etc., http://open-data.europa.eu/
World Bank, IMF, en/data/
credit reports, http://www.imf.org/en/
climate and weather Data
reports, agriculture https://www.rbi.org.in/
production, Scripts/Statistics.aspx
benchmark https://www.healthdata.
indicators—GDP, gov/
etc., electoral roll, https://www.cibil.com/
driver and vehicle http://eci.nic.in/
licenses, health http://data.worldbank.org/
statistics, judicial
records
Social sites Twitter, Facebook, All possible data types https://dev.twitter.com/
data, YouTube, Instagram, streaming/overview
user-generated Pinterest https://developers.
data Wikipedia, YouTube facebook.com/docs/graph-
videos, blogs, api
articles, reviews, https://en.wikipedia.org/
comments https://www.youtube.com/
https://snap.stanford.edu/
data/web-Amazon.html
http://www.cs.cornell.edu/
people/pabo/movie-
review-data/
a All the sources are last accessed on Aug 10, 2017
2 Data Collection 27

scale would be an example of an interval scale. We see this used in certain sports
ratings (i.e., gymnastics) wherein judges assign points based on certain metrics.
Finally, in track and field events, the time to finish in seconds is an example of ratio
data. The reference point of zero seconds is well understood by all observers.

5.2 Common Analysis Types with the Four Primary Scales

The reason why it matters what primary scale was used to collect data is that
downstream analysis is constrained by data type. For instance, with nominal data, all
we can compute are the mode, some frequencies and percentages. Nothing beyond
this is possible due to the nature of the data. With ordinal data, we can compute
the median and some rank order statistics in addition to whatever is possible with
nominal data. This is because ordinal data retains all the properties of the nominal
data type. When we proceed further to interval data and then on to ratio data,
we encounter a qualitative leap over what was possible before. Now, suddenly,
the arithmetic mean and the variance become meaningful. Hence, most statistical
analysis and parametric statistical tests (and associated inference procedures) all
become available. With ratio data, in addition to everything that is possible with
interval data, ratios of quantities also make sense.
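The following minimal R sketch (with made-up values drawn from the sports examples above) shows one way of encoding the four scales and the summaries that are meaningful for each:

runner_no   <- factor(c(101, 102, 103, 104))              # nominal: labels only
finish_rank <- factor(c("first", "second", "third"),
                      levels = c("first", "second", "third"),
                      ordered = TRUE)                      # ordinal: ordered categories
rating <- c(7, 4, 9, 6)                                    # interval: 0-10 judge scores
time_s <- c(9.81, 9.89, 10.02, 10.15)                      # ratio: finish times in seconds

table(runner_no)                     # nominal: frequencies, percentages, and the mode
median(as.integer(finish_rank))      # ordinal: median and other rank-order statistics
mean(rating); var(rating)            # interval: mean and variance become meaningful
time_s[2] / time_s[1]                # ratio: ratios of values are also meaningful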
The multiple-choice examples that follow are meant to concretize the understand-
ing of the four primary scales and corresponding data types.

6 Problem Formulation Preliminaries

Even before data collection can begin, the purpose for which the data collection
is being conducted must be clarified. Enter, problem formulation. The importance
of problem formulation cannot be overstated—it comes first in any research project,
ideally speaking. Moreover, even small deviations from the intended path at the very
beginning of a project’s trajectory can lead to a vastly different destination than was
intended. That said, problem formulation can often be a tricky issue to get right. To
see why, consider the musings of a decision-maker and country head for XYZ Inc.
Sales fell short last year. But sales would’ve approached target except for 6 territories in 2
regions where results were poor. Of course, we implemented a price increase across-the-
board last year, so our profit margin goals were just met, even though sales revenue fell
short. Yet, 2 of our competitors saw above-trend sales increases last year. Still, another
competitor seems to be struggling, and word on the street is that they have been slashing
prices to close deals. Of course, the economy was pretty uneven across our geographies last
year and the 2 regions in question, weak anyway, were particularly so last year. Then there
was that mess with the new salesforce compensation policy coming into effect last year. 1
of the 2 weak regions saw much salesforce turnover last year . . .

These are everyday musings in the lives of business executives and are far from
unusual. Depending on the identification of the problem, data collection strategies,

resources, and approaches will differ. The difficulty in being able to readily pinpoint any one cause or a combination of causes as the specific problem highlights the issues
that crop up in problem formulation. Four important points jump out from the above
example. First, that reality is messy. Unlike textbook examples of problems, wherein
irrelevant information is filtered out a priori and only that which is required to solve
“the” identified problem exactly is retained, life seldom simplifies issues in such a
clear-cut manner. Second, borrowing from a medical analogy, there are symptoms—
observable manifestations of an underlying problem or ailment—and then there is
the cause or ailment itself. Symptoms could be a fever or a cold and the causes
could be bacterial or viral agents. However, curing the symptoms may not cure
the ailment. Similarly, in the previous example from XYZ Inc., we see symptoms
(“sales are falling”) and hypothesize the existence of one or more underlying
problems or causes. Third, note the pattern of connections between symptom(s) and
potential causes. One symptom (falling sales) is assumed to be coming from one
or more potential causes (product line, salesforce compensation, weak economy,
competitors, etc.). This brings up the fourth point—How can we diagnose a problem
(or cause)? One strategy would be to narrow the field of “ailments” by ruling out
low-hanging fruits—ideally, as quickly and cheaply as feasible. It is not hard to see
that the data required for this problem depends on what potential ailments we have
shortlisted in the first place.

6.1 Towards a Problem Formulation Framework

For illustrative purposes, consider a list of three probable causes from the messy
reality of the problem statement given above, namely, (1) product line is obsolete;
(2) customer-connect is ineffective; and (3) product pricing is uncompetitive (say).
Then, from this messy reality we can formulate decision problems (D.P.s) that
correspond to the three identified probable causes:
• D.P. #1: “Should new product(s) be introduced?”
• D.P. #2: “Should advertising campaign be changed?”
• D.P. #3: “Should product prices be changed?”
Note what we are doing in mathematical terms—if messy reality is a large
multidimensional object, then these D.P.s are small-dimensional subsets of that
reality. This “reduces” a messy large-dimensional object to a relatively more
manageable small-dimensional one.
The D.P., even though it is of small dimension, may not contain sufficient detail
to map directly onto tools. Hence, another level of refinement called the research
objective (R.O.) may be needed. While the D.P. is a small-dimensional object,
the R.O. is (ideally) a one-dimensional object. Multiple R.O.s may be needed to
completely “cover” or address a single D.P. Furthermore, because each R.O. is
one-dimensional, it maps easily and directly onto one or more specific tools in
the analytics toolbox. A one-dimensional problem formulation component had better be well defined.

Fig. 2.1 A framework for problem formulation: messy reality (a large-dimensional object) is reduced to a decision problem (a relatively small-dimensional object), which is in turn refined into one-dimensional research objectives that map onto the analytics toolbox.

The R.O. has three essential parts that together lend necessary clarity to its definition: (a) an action verb, (b) an actionable object, and (c) brevity, in that an R.O. should typically fit within one handwritten line. For instance, the
active voice statement “Identify the real and perceived gaps in our product line vis-
à-vis that of our main competitors” is an R.O. because all three of its components, the action verb (“identify”), the actionable object (“real and perceived gaps”), and brevity, are satisfied.
Figure 2.1 depicts the problem formulation framework we just described in
pictorial form. It is clear from the figure that as we impose preliminary structure, we
effectively reduce problem dimensionality from large (messy reality) to somewhat
small (D.P.) to the concise and the precise (R.O.).

6.2 Problem Clarity and Research Type

A quotation attributed to former US defense secretary Donald Rumsfeld in the run-up to the Iraq war goes as follows: “There are known-knowns. These are things we
know that we know. There are known-unknowns. That is to say, there are things that
we know we don’t know. But there are also unknown-unknowns. There are things
we don’t know we don’t know.” This statement is useful in that it helps discern the
differing degrees of the awareness of our ignorance about the true state of affairs.
To understand why the above statement might be relevant for problem formula-
tion, consider that there are broadly three types of research that correspond to three
levels of clarity in problem definition. The first is exploratory research wherein the
problem is at best ambiguous. For instance, “Our sales are falling . . . . Why?” or
“Our ad campaign isn’t working. Don’t know why.” When identifying the problem
is itself a problem, owing to unknown-unknowns, we take an exploratory approach
to trace and list potential problem sources and then define what the problems

may be. The second type is descriptive research wherein the problem’s identity is
somewhat clear. For instance, “What kind of people buy our products?” or “Who is
perceived as competition to us?” These are examples of known-unknowns. The third
type is causal research wherein the problem is clearly defined. For instance, “Will
changing this particular promotional campaign raise sales?” is a clearly identified
known-unknown. Causal research (the cause in causal comes from the cause in
because) tries to uncover the “why” behind phenomena of interest and its most
powerful and practical tool is the experimentation method. It is not hard to see that
the level of clarity in problem definition vastly affects the choices available in terms
of data collection and downstream analysis.

7 Challenges in Data Collection

Data collection is about data and about collection. We have seen the value inherent
in the right data in Sect. 1. In Sect. 3, we have seen the importance of clarity in
problem formulation while determining what data to collect. Now it is time to turn
to the “collection” piece of data collection. What challenges might a data scientist
typically face in collecting data? There are various ways to list the challenges that
arise. The approach taken here follows a logical sequence.
The first challenge is in knowing what data to collect. This often requires
some familiarity with or knowledge of the problem domain. Second, after the data
scientist knows what data to collect, the hunt for data sources can proceed apace.
Third, having identified data sources (the next section features a lengthy listing of
data sources in one domain as part of an illustrative example), the actual process
of mining of raw data can follow. Fourth, once the raw data is mined, data quality
assessment follows. This includes various data cleaning/wrangling, imputation, and
other data “janitorial” work that consumes a major part of the typical data science
project’s time. Fifth, after assessing data quality, the data scientist must now judge
the relevance of the data to the problem at hand. While considering the above, at
each stage one has to take into consideration the cost and time constraints.
Consider a retailing context. What kinds of data would or could a grocery retail
store collect? Of course, there would be point-of-sale data on items purchased,
promotions availed, payment modes and prices paid in each market basket, captured
by UPC scanner machines. Apart from that, retailers would likely be interested in
(and can easily collect) data on a varied set of parameters. For example, that may
include store traffic and footfalls by time of the day and day of the week, basic
segmentation (e.g., demographic) of the store’s clientele, past purchase history of
customers (provided customers can be uniquely identified, that is, through a loyalty
or bonus program), routes taken by the average customer when navigating the
store, or time spent on an average by a customer in different aisles and product
departments. Clearly, in the retail sector, the wide variety of data sources and capture points leads to data that are typically large along the following three dimensions:

• Volume
• Variety (ranges from structured metric data on sales, inventory, and geo location
to unstructured data types such as text, images, and audiovisual files)
• Velocity (the speed at which data comes in and gets updated, e.g., sales or inventory data, social media monitoring data, clickstreams, RFIDs (radio-frequency identification), etc.)
These fulfill the three attribute criteria that are required for being labeled “Big Data” (Diebold 2012). The next subsection dives into the retail sector as an
illustrative example of data collection possibilities, opportunities, and challenges.

8 Data Collation, Validation, and Presentation

Collecting data from multiple sources will not result in rich insights unless the data
is collated to retain its integrity. Data validity may be compromised if proper care is
not taken during collation. One may face various challenges while trying to collate
the data. Below, we describe a few challenges along with the approaches to handle
them in the light of business problems. A short illustrative sketch in R follows the list.
• No common identifier: A challenge while collating data from multiple sources
arises due to the absence of common identifiers across different sources. The
analyst may seek a third identifier that can serve as a link between two data
sources.
• Missing data, data entry error: Missing data can either be ignored, deleted, or
imputed with relevant statistics (see Chap. 8).
• Different levels of granularity: The data could be aggregated at different levels.
For example, primary data is collected at the individual level, while secondary
data is usually available at the aggregate level. One can either aggregate the
data in order to bring all the observations to the same level of granularity or
can apportion the data using business logic.
• Change in data type over the period or across the samples: In financial and
economic data, many a time the base period or multipliers are changed, which
needs to be accounted for to achieve data consistency. Similarly, samples
collected from different populations such as India and the USA may suffer from
inconsistent definitions of time periods—the financial year in India is from April
to March and in the USA, it is from January to December. One may require
remapping of old versus new data types in order to bring the data to the same
level for analysis.
• Validation and reliability: As the secondary data is collected by another user, the
researcher may want to validate it to check the correctness and reliability of the
data to answer a particular research question.
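The minimal R sketch below (all tables and values are made up for illustration) shows two of these steps: linking two sources through a third “bridge” identifier, and aggregating transaction-level data to a common level of granularity before merging:

# (1) No common identifier: link sales and survey data through a bridge table.
sales   <- data.frame(cust_id = c(1, 2, 3), amount = c(500, 250, 400))
surveys <- data.frame(email = c("a@x.com", "b@x.com"), satisfaction = c(4, 5))
bridge  <- data.frame(cust_id = c(1, 2), email = c("a@x.com", "b@x.com"))

merged <- merge(merge(sales, bridge, by = "cust_id", all.x = TRUE),
                surveys, by = "email", all.x = TRUE)

# (2) Different granularity: roll transaction-level records up to the customer level
# so that they can be joined with customer-level (aggregate) data such as "merged".
transactions <- data.frame(cust_id = c(1, 1, 2, 3, 3), value = c(10, 20, 5, 7, 8))
cust_level   <- aggregate(value ~ cust_id, data = transactions, FUN = sum)
merge(merged, cust_level, by = "cust_id", all.x = TRUE)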
Data presentation is also very important to understand the issues in the data. The
basic presentation may include relevant charts such as scatter plots, histograms, and

pie charts or summary statistics such as the number of observations, mean, median,
variance, minimum, and maximum. You will read more about data visualization in
Chap. 5 and about basic inferences in Chap. 6.
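As a small illustration, a first-pass presentation of a single numeric variable in base R might look like the following; the variable and its values are simulated purely for this example.

set.seed(7)
basket_value <- rlnorm(500, meanlog = 6, sdlog = 0.4)   # simulated market-basket values

length(basket_value); var(basket_value)   # number of observations and variance
summary(basket_value)                     # min, quartiles, median, mean, max
hist(basket_value, main = "Distribution of basket value", xlab = "Value")
plot(seq_along(basket_value), basket_value, xlab = "Observation", ylab = "Value")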

9 Data Collection in the Retailing Industry: An Illustrative Example

Bradlow et al. (2017) provide a detailed framework to understand and classify the
various data sources becoming popular with retailers in the era of Big Data and
analytics. Figure 2.2, taken from Bradlow et al. (2017), “organizes (an admittedly
incomplete) set of eight broad retail data sources into three primary groups, namely,
(1) traditional enterprise data capture; (2) customer identity, characteristics, social
graph and profile data capture; and (3) location-based data capture.” The claim
is that insight and possibilities lie at the intersection of these groups of diverse,
contextual, and relevant data.
Fig. 2.2 Data sources in the modern retail sector: three groups of data capture (data capture from traditional enterprise systems such as UPC scanners and ERP; customer or household level data capture; and location based data capture) spanning nine sources: (1) sales and inventory data capture from enterprise systems; (2) loyalty or bonus card data for household identification; (3) customers’ web-presence data from the retailer’s site and/or syndicated sources; (4) customers’ social graph and profile information; (5) mobile and app based data (both the retailer’s own app and syndicated sources); (6) customers’ subconscious, habit based or subliminally influenced choices (RFID, eye-tracking, etc.); (7) relative product locations in the store layout and on shop shelves within an aisle; (8) environmental data such as weather conditions; (9) store location used for third party order fulfillment.

Traditional enterprise data capture (marked #1 in Fig. 2.2) from UPC scanners, combined with inventory data from ERP or SCM software and syndicated databases (such as those from IRI or Nielsen), enables a host of analyses, including the following:

• Cross-sectional analysis of market baskets—item co-occurrences, complements and substitutes, cross-category dependence, etc. (e.g., Blattberg et al. 2008; Russell and Petersen 2000)
• Analysis of aggregate sales and inventory movement patterns by stock-keeping
unit
• Computation of price or shelf-space elasticities at different levels of aggregation
such as category, brand, and SKU (see Bijmolt et al. (2005) for a review of this
literature)
• Assessment of aggregate effects of prices, promotions, and product attributes on
sales
In other words, traditional enterprise data capture in a retailing context enables
an overview of the four P’s of Marketing (product, price, promotion, and place at
the level of store, aisle, shelf, etc.).
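As one small illustration of the elasticity computation mentioned above, a price elasticity can be estimated from aggregate sales data with a log-log regression. The R sketch below uses simulated weekly SKU data; the data frame, its columns, and the built-in elasticity of about -1.8 are assumptions made only for illustration.

set.seed(3)
sku_week <- data.frame(price = runif(104, 1.5, 3.0),   # two years of weekly prices
                       promo = rbinom(104, 1, 0.3))    # promotion indicator
sku_week$units <- round(exp(6 - 1.8 * log(sku_week$price) +
                            0.4 * sku_week$promo + rnorm(104, sd = 0.2)))

fit <- lm(log(units) ~ log(price) + promo, data = sku_week)
coef(fit)["log(price)"]   # estimated price elasticity (close to -1.8 by construction)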
Customer identity, characteristics, social graph, and profile data capture identify
consumers and thereby make available a slew of consumer- or household-specific
information such as demographics, purchase history, preferences and promotional
response history, product returns history, and basic contacts such as email for email
marketing campaigns and personalized flyers and promotions. Bradlow et al. (2017,
p. 12) write:
Such data capture adds not just a slew of columns (consumer characteristics) to the most
detailed datasets retailers would have from previous data sources, but also rows in that
household-purchase occasion becomes the new unit of analysis. A common data source for
customer identification is loyalty or bonus card data (marked #2 in Fig. 2.2) that customers
sign up for in return for discounts and promotional offers from retailers. The advent of
household specific “panel” data enabled the estimation of household specific parameters
in traditional choice models (e.g., Rossi and Allenby 1993; Rossi et al. 1996) and their use
thereafter to better design household specific promotions, catalogs, email campaigns, flyers,
etc. The use of household- or customer identity requires that a single customer ID be used
as primary key to link together all relevant information about a customer across multiple
data sources. Within this data capture type, another data source of interest (marked #3 in
Fig. 2.2) is predicated on the retailer’s web-presence and is relevant even for purely brick-
and-mortar retailers. Any type of customer initiated online contact with the firm—think
of an email click-through, online browser behavior and cookies, complaints or feedback
via email, inquiries, etc. are captured and recorded, and linked to the customer’s primary
key. Data about customers’ online behavior purchased from syndicated sources are also
included here. This data source adds new data columns to retailer data on consumers’
online search, products viewed (consideration set) but not necessarily bought, purchase
and behavior patterns, which can be used to better infer consumer preferences, purchase
contexts, promotional response propensities, etc.

Marked #4 in Fig. 2.2 is another potential data source—consumers’ social


graph information. This could be obtained either from syndicated means or by
customers volunteering their social media identities to use as logins at various
websites. Mapping the consumer’s social graph opens the door to increased
opportunities in psychographic and behavior-based targeting, personalization and
hyper-segmentation, preference and latent need identification, selling, word of
mouth, social influence, recommendation systems, etc. While the famous AIDA

framework in marketing has four conventional stages, namely, awareness, interest, desire, and action, it is clear that the “social” component’s importance in data
collection, analysis, modeling, and prediction is rising. Finally, the third type of
data capture—location-based data capture—leverages customers’ locations to infer
customer preferences, purchase propensities, and design marketing interventions on
that basis. The biggest change in recent years in location-based data capture and
use has been enabled by customer’s smartphones (e.g., Ghose and Han 2011, 2014).
Figure 2.2 marks consumers’ mobiles as data source #5. Data capture here involves
mining location-based services data such as geo location, navigation, and usage data
from those consumers who have installed and use the retailer’s mobile shopping
apps on their smartphones. Consumers’ real-time locations within or around retail
stores potentially provide a lot of context that can be exploited to make marketing
messaging on deals, promotions, new offerings, etc. more relevant and impactful
to consumer attention (see, e.g., Luo et al. 2014) and hence to behavior (including
impulse behavior).
Another distinct data source, marked #6 in Fig. 2.2, draws upon habit patterns
and subconscious consumer behaviors that consumers are unaware of at a conscious
level and are hence unable to explain or articulate. Examples of such phenomena
include eye-movement when examining a product or web-page (eye-tracking studies
started with Wedel and Pieters 2000), the varied paths different shoppers take
inside physical stores which can be tracked using RFID chips inside shopping
carts (see, e.g., Larson et al. 2005) or inside virtual stores using clickstream data
(e.g., Montgomery et al. 2004), the distribution of first-cut emotional responses to
varied product and context stimuli which neuro-marketing researchers are trying to
understand using functional magnetic resonance imaging (fMRI) studies (see, e.g.,
Lee et al. (2007) for a survey of the literature), etc.
Data source #7 in Fig. 2.2 draws on how retailers optimize their physical store
spaces for meeting sales, share, or profit objectives. Different product arrangements
on store shelves lead to differential visibility and salience. This results in a height-
ened awareness, recall, and inter-product comparison and therefore differential
purchase propensity, sales, and share for any focal product. More generally, an
optimization of store layouts and other situational factors both offline (e.g., Park
et al. 1989) as well as online (e.g., Vrechopoulos et al. 2004) can be considered
given the physical store data sources that are now available. Data source #8
pertains to environmental data that retailers routinely draw upon to make assortment,
promotion, and/or inventory stocking decisions. For example, the fact that weather affects consumer spending propensities (e.g., Murray et al. 2010) and store sales has been known and studied for a long time (see, e.g., Steele 1951). Today, retailers can
access a well-oiled data collection, collation, and analysis ecosystem that regularly
takes in weather data feeds from weather monitoring system APIs and collates
it into a format wherein a rules engine can apply, and thereafter output either
recommendations or automatically trigger actions or interventions on the retailer’s
behalf.

Finally, data source #9 in Fig. 2.2 is pertinent largely to emerging markets and lets
small, unorganized sector retailers (mom-and-pop stores) leverage their physical
location and act as fulfillment center franchisees for large retailers (Forbes 2015).

10 Summary and Conclusion

This chapter was an introduction to the important task of data collection, a process
that precedes and heavily influences the success or failure of data science and
analytics projects in meeting their objectives. We started with why data is such a
big deal and used illustrative examples (Netflix and Uber) to see the value inherent in the
right kind of data. We followed up with some preliminaries on the four main types
of data, their corresponding four primary scales, and the implications for analysis
downstream. We then ventured into problem formulation, discussed why it is of
such critical importance in determining what data to collect, and built a simple
framework against which data scientists could check and validate their current
problem formulation tasks. Finally, we walked through an extensive example of the
various kinds of data sources available in just one business domain—retailing—and
the implications thereof.

Exercises

Ex. 2.1 Prepare the movie release dataset of all the movies released in the last 5 years
using IMDB.
(a) Find all movies that were released in the last 5 years.
(b) Generate a file containing URLs for the top 50 movies every year on IMDB.
(c) Read in the URL’s IMDB page and scrape the following information:
Producer(s), Director(s), Star(s), Taglines, Genres, (Partial) Storyline, Box
office budget, and Box office gross.
(d) Make a table out of these variables as columns with movie name being the first
variable.
(e) Analyze the movie-count for every Genre. See if you can come up with some
interesting hypotheses. For example, you could hypothesize that “Action Genres
occur significantly more often than Drama in the top-250 list.” or that “Action
movies gross higher than Romance movies in the top-250 list.”
(f) Write a markdown doc with your code and explanation. See if you can storify
your hypotheses.
Note: You can web-scrape with the rvest package in R or use any platform that
you are comfortable with.
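For orientation, a minimal rvest sketch of the scraping workflow might look like the following; the URL points at IMDB’s top chart, but the CSS selectors are indicative placeholders that should be checked against the live page before use.

library(rvest)

page   <- read_html("https://www.imdb.com/chart/top/")
titles <- page %>% html_nodes(".titleColumn a") %>% html_text()            # selector is illustrative
years  <- page %>% html_nodes(".titleColumn .secondaryInfo") %>% html_text()

movies <- data.frame(title = titles, year = years, stringsAsFactors = FALSE)
head(movies)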


Ex. 2.2 Download the movie reviews from IMDB for the list of movies.
(a) Go to www.imdb.com and examine the page.
(b) Scrape the page and tabulate the output into a data frame with columns “name,
url, movie type, votes.”
(c) Filter the data frame. Retain only those movies that got over 500 reviews. Let
us call this Table 1.
(d) Now for each of the remaining movies, go to the movie’s own web page on the
IMDB, and extract the following information:
Duration of the movie, Genre, Release date, Total views, Commercial
description from the top of the page.
(e) Add these fields to Table 1 in that movie’s row.
(f) Now build a separate table for each movie in Table 1 from that movie’s web
page on IMDB. Extract the first five pages of reviews of that movie and in each
review, scrape the following information:
Reviewer, Feedback, Likes, Overall, Review (text), Location (of the
reviewer), Date of the review.
(g) Store the output in a table. Let us call it Table 2.
(h) Create a list (List 1) with as many elements as there are rows in Table 1. For the
ith movie in Table 1, store Table 2 as the ith element of a second list, say, List 2.
Ex. 2.3 Download the Twitter data through APIs.
(a) Read up on how to use the Twitter API (https://dev.twitter.com/overview/api).
If required, make a twitter ID (if you do not already have one).
(b) There are three evaluation dimensions for a movie at IMDB, namely, Author,
Feedback, and Likes. More than the dictionary meanings of these words, it is
interesting how they are used in different contexts.
(c) Download 50 tweets each that contain these terms and 100 tweets for each
movie.
(d) Analyze these tweets and classify what movie categories they typically refer to.
Insights here could, for instance, be useful in designing promotional campaigns
for the movies.
P.S.: R has a dedicated package twitteR (note capital R in the end). For additional
functions, refer twitteR package manual.
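A minimal twitteR sketch of the download step might look like the following; the four credentials are placeholders obtained from your own Twitter developer account, and “Feedback” stands in for whichever search term you are pulling.

library(twitteR)

setup_twitter_oauth(consumer_key    = "<key>",
                    consumer_secret = "<secret>",
                    access_token    = "<token>",
                    access_secret   = "<token-secret>")

tweets    <- searchTwitter("Feedback", n = 50, lang = "en")
tweets_df <- twListToDF(tweets)       # convert the list of status objects to a data frame
head(tweets_df$text)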
Ex. 2.4 Prepare the beer dataset of all the beers that got over 500 reviews.
(a) Go to (https://www.ratebeer.com/beer/top-50/) and examine the page.
(b) Scrape the page and tabulate the output into a data frame with columns “name,
url, count, style.”
(c) Filter the data frame. Retain only those beers that got over 500 reviews. Let us
call this Table 1.
(d) Now for each of the remaining beers, go to the beer’s own web page on the
ratebeer site, and scrape the following information:

“Brewed by, Weighted Avg, Seasonal, Est.Calories, ABV, commercial description” from the top of the page.
Add these fields to Table 1 in that beer’s row.
(e) Now build a separate table for each beer in Table 1 from that beer’s ratebeer
web page. Scrape the first three pages of reviews of that beer and in each review,
scrape the following info:
“rating, aroma, appearance, taste, palate, overall, review (text), location (of
the reviewer), date of the review.”
(f) Store the output in a dataframe, let us call it Table 2.
(g) Create a list (let us call it List 1) with as many elements as there are rows in
Table 1. For the ith beer in Table 1, store Table 2 as the ith element of List 2.
Ex. 2.5 Download the Twitter data through APIs.
(a) Read up on how to use the twitter API here (https://dev.twitter.com/overview/
api). If required, make a twitter ID (if you do not already have one).
(b) Recall three evaluation dimensions for beer at ratebeer.com, viz., aroma,
taste, and palate. More than the dictionary meanings of these words, what is
interesting is how they are used in context.
So pull 50 tweets each containing these terms.
(c) Read through these tweets and note what product categories they typically
refer to. Insights here could, for instance, be useful in designing promotional
campaigns for the beers. We will do text analysis, etc. next visit.
P.S.: R has a dedicated package twitteR (note capital R in the end). For additional
functions, refer twitteR package manual.
Ex. 2.6 WhatsApp Data collection.
(a) Form a WhatsApp group with few friends/colleagues/relatives.
(b) Whenever you travel or visit different places as part of your everyday work,
share your location to the WhatsApp group.
For example, if you are visiting an ATM, your office, a grocery store, the
local mall, etc., then send the WhatsApp group a message saying: “ATM, [share
of location here].”
Ideally, you should share a handful of locations every day. Do this DC
exercise for a week. It is possible you may repeat-share certain locations.
P.S.: We assume you have a smartphone with google maps enabled on it to
share locations with.
(c) Once this exercise is completed export the WhatsApp chat history of DC group
to a text file. To do this, see below:
Go to WhatsApp > Settings > Chat history > Email Chat > Select the chat
you want to export.
(d) Your data file should look like this:
28/02/17, 7:17 pm—fname lname: location: https://maps.google.com/?q=17.
463869,78.367403
28/02/17, 7:17 pm—fname lname: ATM


(e) Now compile this data in a tabular format. Your data should have these columns:
• Sender name
• Time
• Latitude
• Longitude
• Type of place
(f) Extract your locations from the chat history table and plot it on google maps.
You can use the spatial DC code we used on this list of latitude and longitude co-ordinates or use the leaflet package in R to do the same. Remember to extract
and map only your own locations not those of other group members.
(g) Analyze your own movements over a week *AND* record your observations
about your travels as a story that connects these locations together.
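A minimal leaflet sketch for step (f) might look like the following; the coordinates and place types are made-up examples in the format produced by the chat export.

library(leaflet)

locations <- data.frame(lat  = c(17.4639, 17.4401),
                        lng  = c(78.3674, 78.3489),
                        type = c("ATM", "Office"))

leaflet(locations) %>%
  addTiles() %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~type)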

References

Bijmolt, T. H. A., van Heerde, H. J., & Pieters, R. G. M. (2005). New empirical generalizations on
the determinants of price elasticity. Journal of Marketing Research, 42(2), 141–156.
Blair, E., & Blair, C. (2015). Applied survey sampling. Los Angeles: Sage Publications.
Blattberg, R. C., Kim, B.-D., & Neslin, S. A. (2008). Market basket analysis. Database Marketing:
Analyzing and Managing Customers, 339–351.
Bradlow, E., Gangwar, M., Kopalle, P., & Voleti, S. (2017). The role of big data and predictive
analytics in retailing. Journal of Retailing, 93, 79–95.
Diebold, F. X. (2012). On the origin(s) and development of the term ‘Big Data’.
Forbes. (2015). From Dabbawallas to Kirana stores, five unique E-commerce delivery innovations
in India. Retrieved April 15, 2015, from http://tinyurl.com/j3eqb5f.
Ghose, A., & Han, S. P. (2011). An empirical analysis of user content generation and usage
behavior on the mobile Internet. Management Science, 57(9), 1671–1691.
Ghose, A., & Han, S. P. (2014). Estimating demand for mobile applications in the new economy.
Management Science, 60(6), 1470–1488.
Larson, J. S., Bradlow, E. T., & Fader, P. S. (2005). An exploratory look at supermarket shopping
paths. International Journal of Research in Marketing, 22(4), 395–414.
Lee, N., Broderick, A. J., & Chamberlain, L. (2007). What is ‘neuromarketing’? A discussion and
agenda for future research. International Journal of Psychophysiology, 63(2), 199–204.
Luo, X., Andrews, M., Fang, Z., & Phang, C. W. (2014). Mobile targeting. Management Science,
60(7), 1738–1756.
Montgomery, C. (2017). Design and analysis of experiments (9th ed.). New York: John Wiley and
Sons.
Montgomery, A. L., Li, S., Srinivasan, K., & Liechty, J. C. (2004). Modeling online browsing and
path analysis using clickstream data. Marketing Science, 23(4), 579–595.
Murray, K. B., Di Muro, F., Finn, A., & Leszczyc, P. P. (2010). The effect of weather on consumer
spending. Journal of Retailing and Consumer Services, 17(6), 512–520.
Park, C. W., Iyer, E. S., & Smith, D. C. (1989). The effects of situational factors on in-store grocery
shopping behavior: The role of store environment and time available for shopping. Journal of
Consumer Research, 15(4), 422–433.
Rossi, P. E., & Allenby, G. M. (1993). A Bayesian approach to estimating household parameters.
Journal of Marketing Research, 30, 171–182.

Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in
target marketing. Marketing Science, 15(4), 321–340.
Russell, G. J., & Petersen, A. (2000). Analysis of cross category dependence in market basket
selection. Journal of Retailing, 76(3), 367–392.
Steele, A. T. (1951). Weather’s effect on the sales of a department store. Journal of Marketing,
15(4), 436–443.
Vrechopoulos, A. P., O’Keefe, R. M., Doukidis, G. I., & Siomkos, G. J. (2004). Virtual store layout:
An experimental comparison in the context of grocery retail. Journal of Retailing, 80(1), 13–22.
Wedel, M., & Pieters, R. (2000). Eye fixations on advertisements and memory for brands: A model
and findings. Marketing Science, 19(4), 297–312.

Chapter 3
Data Management—Relational Database
Systems (RDBMS)

Hemanth Kumar Dasararaju and Peeyush Taori

1 Introduction

Storage and management of data is a key aspect of data science. Data, simply
speaking, is nothing but a collection of facts—a snapshot of the world—that can
be stored and processed by computers. In order to process and manipulate data
efficiently, it is very important that data is stored in an appropriate form. Data comes
in many shapes and forms, and some of the most commonly known forms of data
are numbers, text, images, and videos. Depending on the type of data, there exist
multiple ways of storage and processing. In this chapter, we focus on one of the
most commonly known and pervasive means of data storage—relational database
management systems. We provide an introduction using which a reader can perform
the essential operations. References for a deeper understanding are given at the end
of the chapter.

2 Motivating Example

Consider an online store that sells stationery to customers across a country. The
owner of this store would like to set up a system that keeps track of inventory, sales,
operations, and potential pitfalls. While she is currently able to do so on her own,
she knows that as her store scales up and starts to serve more and more people, she

will no longer have the capacity to manually record transactions and create records
for new occurrences. Therefore, she turns to relational database systems to run her
business more efficiently.
A database is an organized collection of data, stored in the form of rows, columns, tables, and indexes. In a database, even a small piece of information becomes data. We tend to aggregate related information together and store it under a single name called a table. For example, all student-related data (student ID, student name, date of birth, etc.) would be put in one table called the STUDENT table. Organizing data this way decreases the effort necessary to scan for a specific piece of information in an entire database. A database is also flexible: it grows as new data is added and shrinks when data is deleted from it.

3 Database Systems—What and Why?

As data grows in size, there arises a need for a means of storing it efficiently such
that it can be found and processed quickly. In the “olden days” (which was not
too far back), this was achieved via systematic filing systems where individual files
were catalogued and stored neatly according to a well-developed data cataloging
system (similar to the ones you will find in libraries or data storage facilities
in organizations). With the advent of computer systems, this role has now been
assumed by database systems. Plainly speaking, a database system is a digital
record-keeping system or an electronic filing cabinet. Database systems can be used
to store large amounts of data, and data can then be queried and manipulated later
using a querying mechanism/language. Some of the common operations that can
be performed in a database system are adding new files, updating old data files,
creating new databases, querying of data, deleting data files/individual records, and
adding more data to existing data files. Often, pre-processing and post-processing of data happen using database languages. For example, one can selectively read data, verify its correctness, and connect it to data structures within applications. Then, after processing, the data can be written back into the database for storage and further use.
With the advent of computers, the usage of database systems has become
ubiquitous in our personal and work lives. Whether we are storing information about
personal expenditures using an Excel file or making use of MySQL database to
store product catalogues for a retail organization, databases are pervasive and in use
everywhere. We also discuss the difference between the techniques discussed in this
chapter compared to methods for managing big data in the next chapter.

3.1 Database Management System

A database management system (DBMS) is the system software that enables users
to create, organize, and manage databases. As Fig. 3.1 illustrates, the DBMS serves as an interface between the database and the end user, guaranteeing that information is reliably organized and remains accessible.


Fig. 3.1 Relating databases to end users: Database Application → Database Management System (DBMS) → Database

The main objectives of a DBMS are mass storage; elimination of duplication—the DBMS makes sure that the same data is not stored more than once; support for multiple users—two or more users can work concurrently; data security and integrity—protecting the privacy of the data and preventing unauthorized access; data backup and recovery; independence from any particular platform; and so on. There are dozens of DBMS products available. Popular products include Microsoft Access, MySQL, Oracle from Oracle Corporation, SQL Server from Microsoft, and DB2 from IBM.

3.2 Relational Database Management System

A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model of data. While a plain DBMS stores data in tables, a relational DBMS also captures the relationships between different entities in the database. The two main principles of the RDBMS are entity integrity and referential integrity.
• Entity integrity: Here, all the data should be organized by having a unique value
(primary key), so it cannot accept null values.
• Referential integrity: Referential integrity must have constraints specified
between two relations and the relationship must always be consistent (e.g.,
foreign key column must be equal to the primary key column).
– Primary key: Primary key is a column in a table that uniquely identifies the
rows in that relation (table).
– Foreign key: Foreign keys are columns that point to the primary key columns of another table (a minimal sketch of both kinds of keys is given below).
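As a minimal, hedged sketch of how these two kinds of keys are declared in SQL (the table and column names here are hypothetical and are not part of this chapter's running example):

CREATE TABLE departments (
    deptID   INT UNSIGNED NOT NULL,
    deptname VARCHAR(30)  NOT NULL,
    PRIMARY KEY (deptID)                   -- entity integrity: deptID is unique and not null
);

CREATE TABLE employees (
    empID   INT UNSIGNED NOT NULL,
    empname VARCHAR(30)  NOT NULL,
    deptID  INT UNSIGNED NOT NULL,
    PRIMARY KEY (empID),
    FOREIGN KEY (deptID) REFERENCES departments (deptID)   -- referential integrity
);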
Normalization:
Normalization is the database design technique that is used to efficiently organize
the data, optimize the table structures, and remove duplicate data entries. It
separates the larger tables into smaller tables and links them using the relationships.
Normalization is used to improve the speed, for efficient usage of space, and to
increase the data integrity. The important normalizations that are used to organize
the database are as follows:

• First normal form (1NF): Every field in the table must contain only atomic values—a single value per field, with no repeating groups or multiple values packed into one field.
Example: Suppose a university wants to store the details of students who are finalists of a competition. Table 3.1 shows the data.
Three students (Jon, Robb, and Ken) have two different parents' numbers, so the university put both numbers in the same field, as you can see in Table 3.1. This table is not in 1NF because it breaks the rule "only atomic values in each field": there are multiple values in the Parents_number field. To bring the table into 1NF, we should store the information as shown in Table 3.2.
• Second normal form (2NF): The table must be in first normal form, and no non-key attribute may depend on a proper subset of any candidate key of the table.
Example: Assume a university needs to store information about instructors and the topics they teach. Since an instructor can teach more than one topic, they create a table that resembles the one given below (Table 3.3).

Table 3.1 Students in a university competition

Student_ID  Student_Name  Address      Parents_number
71121       Jon           New York     7543010, 5417540
71122       Janet         Chicago      1915417
71123       Robb          Boston       6364801, 4889636
71124       Zent          Los Angeles  7545413
71125       Ken           Atlanta      4013692, 4016371

Table 3.2 Students in a university competition sorted efficiently


Student_ID Student_Name Address Parents_number
71121 Jon New York 7543010
71121 Jon New York 5417540
71122 Janet Chicago 1915417
71123 Robb Boston 6364801
71123 Robb Boston 4889636
71124 Zent Los Angeles 7545413
71125 Ken Atlanta 4013692
71125 Ken Atlanta 4016371

Table 3.3 Instructors in a university

Instructor_ID  Topic           Instructor_Age
56121          Neural Network  37
56121          IoT             37
56132          Statistics      51
56133          Optimization    43
56133          Simulation      43


Table 3.4 Breaking tables into two in order to agree with 2NF

Instructor_ID  Instructor_Age
56121          37
56132          51
56133          43

Instructor_ID  Topic
56121          Neural Network
56121          IoT
56132          Statistics
56133          Optimization
56133          Simulation

Table 3.5 Students in a university competition


Student_ID Student_Name Student_ZIP Student_State Student_city Student_Area
71121 Jon 10001 New York New York Queens Manhattan
71122 Janet 60201 Illinois Chicago Evanston
71123 Robb 02238 Massachusetts Boston Cambridge
71124 Zent 90089 California Los Angeles Trousdale

Here Instructor_ID and Topic together form the candidate key, and Instructor_Age is a non-key attribute. The table is in 1NF but not in 2NF, because the non-key attribute Instructor_Age depends only on Instructor_ID, a proper subset of the candidate key. To make the table conform to 2NF, we can break it into the two tables given in Table 3.4.
• Third normal form (3NF): The table must be in second normal form, and no non-key attribute may be determined by another non-key attribute (i.e., there are no transitive dependencies on the key).
Example: Suppose the university wants to store the details of students who are
finalists of a competition. The table is shown in Table 3.5.
Here, Student_ID is the key attribute and all other attributes are non-key attributes. Student_State, Student_city, and Student_Area depend on Student_ZIP, and Student_ZIP in turn depends on Student_ID, which makes these non-key attributes transitively dependent on the key attribute. This violates the 3NF rule. To make the table conform to 3NF, we can break it into the two tables given in Table 3.6.
Table 3.6 Breaking tables into two in order to agree with 3NF
Student table:
Student_ID  Student_Name  Student_ZIP
71121       Jon           10001
71122       Janet         60201
71123       Robb          02238
71124       Zent          90089
Student_zip table:
Student_ZIP  Student_State  Student_city  Student_Area
10001        New York       New York      Queens Manhattan
60201        Illinois       Chicago       Evanston
02238        Massachusetts  Boston        Cambridge
90089        California     Los Angeles   Trousdale

3NF is the form that is practiced and advocated across most organizational environments, because tables in 3NF are immune to most of the anomalies associated with the insertion, update, and deletion of data. However, there could be specific instances when organizations might want to opt for alternate forms of table normalization such as 4NF and 5NF. While 2NF and 3NF normalization focus on functional dependencies, 4NF and 5NF are more concerned with addressing multivalued dependencies. A detailed discussion of 4NF and 5NF is beyond the scope of this chapter, but the interested reader can learn more online from various sources.1 It should be noted that in many organizational scenarios, the focus is mainly on achieving 3NF.
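As a hedged illustration (not part of the book's original example code), the 3NF decomposition shown in Table 3.6 could be expressed as the following MySQL table definitions; the column sizes are assumptions chosen for illustration, and Student_ZIP acts as a foreign key from the student table to the ZIP table:

-- Sketch only: column sizes are assumptions made for illustration.
CREATE TABLE student_zip (
    Student_ZIP   CHAR(5)     NOT NULL,
    Student_State VARCHAR(30) NOT NULL,
    Student_city  VARCHAR(30) NOT NULL,
    Student_Area  VARCHAR(30) NOT NULL,
    PRIMARY KEY (Student_ZIP)
);

CREATE TABLE student (
    Student_ID   INT UNSIGNED NOT NULL,
    Student_Name VARCHAR(30)  NOT NULL,
    Student_ZIP  CHAR(5)      NOT NULL,
    PRIMARY KEY (Student_ID),
    FOREIGN KEY (Student_ZIP) REFERENCES student_zip (Student_ZIP)
);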

3.3 Advantages of RDBMS over EXCEL

Most businesses today need to record and store information. Sometimes this may be
only for record keeping and sometimes data is stored for later use. We can store the
data in Microsoft Excel. But why is RDBMS the most widely used method to store
data?
Using Excel we can perform various functions such as adding data in rows and columns and sorting data by various metrics. But Excel is a two-dimensional spreadsheet, so it is extremely hard to make connections between information held in different spreadsheets. It is easy to view or find particular data in Excel when the amount of information is small, but it becomes very hard to work with once it crosses a certain size: locating a specific record may require scrolling through many pages.
Unlike Excel, in an RDBMS the information is stored independently of the user interface. This separation of storage and access makes the system considerably more scalable and versatile. In an RDBMS, data can easily be cross-referenced across multiple tables and databases using the relationships between them, whereas Excel offers no such option. An RDBMS uses centralized data storage, which makes backup and maintenance much easier. Database systems also tend to be significantly faster than Excel because they are built to store and manipulate large datasets.

1 http://www.bkent.net/Doc/simple5.htm (accessed on Feb 6, 2019).


4 Structured Query Language (SQL)

SQL (structured query language) is a computer language exclusive to a particular application domain, in contrast to a general-purpose language (GPL) such as C, Java, or Python that is broadly applicable across domains. SQL is text oriented and designed for managing (accessing and manipulating) data. SQL was standardized by ANSI (American National Standards Institute), with the widely adopted SQL-92 standard published in 1992. It is the standard language for relational database management systems. Some common relational database management systems that operate using SQL are Microsoft Access, MySQL, Oracle, SQL Server, and IBM DB2. Even though many database systems make use of SQL, they also have their own unique extensions that are specific to their systems.
SQL statements are used to select particular parts of the data, retrieve data from a database, and update data in the database using commands such as CREATE, SELECT, INSERT, UPDATE, DELETE, and DROP. SQL commands can be grouped into four categories: DDL (data definition language), which is used to define the database structures; DML (data manipulation language), which is used to access and modify database data; DCL (data control language); and TCL (transaction control language).

DDL (Data Definition Language):


DDL deals with the database schemas and structure. The following statements
are used to take care of the design and storage of database objects.
1. CREATE: Creates the database, table, index, views, store, procedure, functions,
and triggers.
2. ALTER: Alters the attributes, constraints, and structure of the existing database.
3. DROP: Deletes the objects (table, view, functions, etc.) from the database.
4. TRUNCATE: Removes all records from a table, including the space allocated to
the records.
5. COMMENT: Associates comments about the table or about any objects to the
data dictionary.
6. RENAME: Renames the objects.
DML (Data Manipulation Language):
DML deals with tasks like storing, modifying, retrieving, deleting, and updating
the data in/from the database.
1. SELECT: The only data retrieval statement in SQL, used to select the record(s)
from the database.
2. INSERT: Inserts a new data/observation into the database.
3. UPDATE: Modifies the existing data within the database.
4. DELETE: Removes one or more records from the table.

Note: There is an important difference between the DROP, TRUNCATE, and DELETE commands. A DELETE operation (data alone is deleted) can be rolled back (undone), while DROP (table structure + data are deleted) and TRUNCATE operations cannot be rolled back. The short sketch below contrasts the three.
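The following lines are a minimal sketch of the difference, using a hypothetical table named temp_orders that is not part of this chapter's example:

DELETE FROM temp_orders WHERE orderID = 10;  -- removes matching rows; can be rolled back within a transaction
TRUNCATE TABLE temp_orders;                  -- removes all rows and the space allocated to them; cannot be rolled back
DROP TABLE temp_orders;                      -- removes the data and the table structure itself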
DCL (Data Control Language):
Data control language statements are used to enforce database security in a multi-user environment. The database administrator (DBA) is responsible for granting and revoking privileges on database objects.
1. GRANT: Provides access or privileges on database objects to a group of users or a particular user.
2. REVOKE: Removes a user's access rights or privileges on database objects.
TCL (Transaction Control Language):
Transaction control language statements enable you to control and manage transactions so as to maintain the integrity of data within SQL statements.
1. BEGIN: Opens a transaction.
2. COMMIT: Saves the transaction on the database.
3. ROLLBACK: Rolls back (undoes the inserts, deletes, or updates of) the current transaction in case of any errors.
4. SAVEPOINT: Marks a point within a transaction to which a later rollback can return. Work done before the savepoint is kept, while everything after it is rolled back (see the sketch below).
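As a brief, hedged sketch of DCL and TCL statements in MySQL (the user 'analyst'@'localhost' is hypothetical and must already exist; the product_sales database and products table used here are created later in this chapter):

-- DCL: grant and then revoke privileges for a hypothetical, already-created user.
GRANT SELECT, INSERT ON product_sales.* TO 'analyst'@'localhost';
REVOKE INSERT ON product_sales.* FROM 'analyst'@'localhost';

-- TCL: group statements into a transaction and undo part of it.
START TRANSACTION;                          -- BEGIN also works in MySQL
UPDATE products SET quantity = quantity - 10 WHERE productID = 1;
SAVEPOINT after_first_update;
UPDATE products SET quantity = quantity - 99 WHERE productID = 2;
ROLLBACK TO SAVEPOINT after_first_update;   -- undoes only the second update
COMMIT;                                     -- makes the first update permanent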

4.1 Introduction to MySQL

In this section, we will walk through the basics of creating a database using MySQL2
and query the database using the MySQL querying language. As described earlier
in the chapter, a MySQL database server is capable of hosting many databases. In
database parlance, a database is often also called a schema. Thus, a MySQL server can contain a number of schemas. Each of those schemas (databases) is made up of a number of tables, and every table contains rows and columns. Each row represents an individual record or observation, and each column represents a particular attribute such as age or salary.
When you launch the MySQL command prompt, you see a command line like
the one below (Fig. 3.2).

2 MySQL Workbench (or the Windows installer) can be downloaded from https://dev.mysql.com/downloads/windows/ (accessed on Feb 15, 2018) for practice purposes.


Fig. 3.2 MySQL command prompt Interface

The command line starts with "mysql>", and you can run SQL statements by terminating each command with a semicolon (;).

4.2 How to Check the List of Databases Available in MySQL?

In order to get started, we will first check the databases that are already present on the MySQL server. To do so, type "show databases" at the command line. Once you run this command, it will list all the available databases in the MySQL server installation. This is the first SQL query we have run. Please note that keywords and commands are case-insensitive in MySQL, unlike in R and Python, where commands are case-sensitive.
mysql> SHOW DATABASES;

Output:
+-------------------+
| Database |
+-------------------+
| information_schema |
| mysql |
| performance_schema |
| test |
+-------------------+
4 rows in set (0.00 sec)

You would notice that there are already four schemas listed though we have
not yet created any one of them. Out of the four databases, “information_schema”,
“mysql”, and “performance_schema” are created by MySQL server for its internal
monitoring and performance optimization purposes and should not be used when we
are creating our own database. Another schema “test” is created by MySQL during
the installation phase and it is provided for testing purposes. You can remove the
“test” schema or can use it to create your own tables.

4.3 Creating and Deleting a Database

Now let us create our own database. The syntax for creating a database in MySQL is:
CREATE DATABASE databasename;

Let us create a simple inventory database. We shall create a number of tables holding products and their sales information, such as customers, products, orders, shipments, and employees. We will call the database "product_sales". In order to create the database, type the following SQL query:
mysql> CREATE DATABASE product_sales;

Output:
Query OK, 1 row affected (0.00 sec)

mysql> SHOW DATABASES;

Output:
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| product_sales |
| test |
+--------------------+
5 rows in set (0.00 sec)

In the above-mentioned query, we are creating a database called “product_sales.”


Once the query is executed, if you issue the “show databases” command again, then
it will now show five databases (with “product_sales” as the new database). As of
now, “product_sales” will be an empty database, meaning there would be no tables
inside it. We will start creating tables and populating them with data in a while.
In order to delete a database, you need to follow the following syntax:
DROP DATABASE databasename;

In our case, if we need to delete “product_sales”, we will issue the command:


mysql> DROP DATABASE product_sales;

Output:
Query OK, 0 rows affected (0.14 sec)

mysql> SHOW DATABASES;

Output:
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.00 sec)

Oftentimes, when you have to create a database, you might not be sure whether a database of the same name already exists in the system. In such cases, the conditions "IF EXISTS" and "IF NOT EXISTS" come in handy. When we execute such a query, the database is created only if no other database of the same name exists. This helps us avoid overwriting an existing database with a new one.
mysql> CREATE DATABASE IF NOT EXISTS product_sales;

Output:
Query OK, 1 row affected (0.00 sec)

One important point to keep in mind is to use SQL DROP commands with extreme care, because once you delete an entity or an entry, there is no way to recover the data.

4.4 Selecting a Database

There can be multiple databases available in the MySQL server. In order to work on
a specific database, we have to select the database first. The basic syntax to select a
database is:
USE databasename;

In our case, if we have to select “product_sales” database, we will issue the


command:
mysql> USE product_sales;

Output:
Database changed

When we run the above query, the default database becomes "product_sales". Whatever operations we now perform will be carried out on this database. This implies that if you have to use a specific table in the database, you can simply do so by referring to the table name. If at any point you want to check which database is currently selected, issue the command:
mysql> SELECT DATABASE();

Output:
+---------------+
| DATABASE() |
+---------------+
| product_sales |
+---------------+
1 row in set (0.00 sec)

If you want to check all tables in a database, then issue the following command:
mysql> SHOW TABLES;

Output:
Empty set (0.00 sec)

As of now it is empty since we have not yet created any table. Let us now go
ahead and create a table in the database.

4.5 Table Creation and Deletion

The syntax for creating a new table is:

CREATE TABLE [IF NOT EXISTS] tablename (column1 definition, column2 definition, ...);

The above command will create a table with the name specified by the user. You can also specify the optional IF NOT EXISTS condition, just as you can when creating a database. Since a table is nothing
but a collection of rows and columns, in addition to specifying the table name, you
would also want to specify the column names in the table and the type of data that
each column can contain. For example, let us go ahead and create a table named
“products.” We will then later inspect it in greater detail.
mysql> CREATE TABLE products (productID INT(10) UNSIGNED NOT NULL
AUTO_INCREMENT, code CHAR(6) NOT NULL DEFAULT '', productname
VARCHAR(30) NOT NULL DEFAULT '', quantity INT UNSIGNED NOT NULL
DEFAULT 0, price DECIMAL(5,2) NOT NULL DEFAULT 0.00, PRIMARY
KEY (productID) );

Output:
Query OK, 0 rows affected (0.41 sec)

In the above-mentioned command, we have created a table named “products.”


Along with table name, we have also specified the columns and the type of data
that each column contains within the parenthesis. For example, “products” table
contains five columns—productID, code, productname, quantity, and price. Each of
those columns can contain certain types of data. Let us look at them one by one:
• productID is INT(10) UNSIGNED. INT means integer, so productID accepts only integer values. The number 10 after INT is a display width used by MySQL when showing values; it does not restrict the range of integers that can be stored. The attribute UNSIGNED means nonnegative integers, so productID will accept only zero or positive integers; entering a non-integer or a negative value will throw an error. If you do not specify UNSIGNED, the column defaults to the SIGNED attribute, which accepts both positive and negative integers.
• code is CHAR(6)—CHAR(6) means a fixed-length alphanumeric string of exactly six characters; shorter values are padded with spaces to that length.


• productname is VARCHAR(30). Similar to CHAR, VARCHAR stands for a


variable length string that can contain a maximum of 30 characters. The contrast
between CHAR and VARCHAR is that whereas CHAR is a fixed length string,
VARCHAR can vary in length. In practice, it is always better to use VARCHAR
unless you suspect that the string in a column is always going to be of a fixed
length.
• quantity INT. This means that quantity column can contain integer values.
• price DECIMAL(5,2). The price column contains exact decimal numbers with up to five digits in total, of which at most two come after the decimal point. Whenever you are working with money or other values where rounding errors matter, it is advisable to use a DECIMAL field.
There are a number of additional points to be noted with regard to the above
statement.
For a number of columns such as productID, productname you would notice the
presence of NOT NULL. NOT NULL is an attribute that essentially tells MySQL
that the column cannot have null values. NULL in MySQL is not a string and is
instead a special character to signify absence of values in the field. Each column
also contains the attribute DEFAULT. This essentially implies that if no value is
provided by the user then use default value for the column. For example, default
value for column quantity will be 0 in case no values are provided when inputting
data to the table.
The column productID has an additional attribute called AUTO_INCREMENT, whose counter starts at 1. This implies that whenever a NULL value is supplied for this column, MySQL automatically inserts the next counter value instead. Thus, if the first two rows are inserted without productID values, they receive the values 1 and 2.
Finally, the last line of table creation statement query is PRIMARY KEY
(productID). Primary key for a table is a column or set of columns where each
observation in that column would have a unique value. Thus, if we have to look
up any observation in the table, then we can do so using the primary key for the
table. Although it is not mandatory to have primary keys for a table, it is a standard
practice to have one for every table. This also helps during indexing the table and
makes query execution faster.
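Creating additional indexes is not demonstrated elsewhere in this chapter, but as a small, hedged sketch (the index name here is made up), a secondary index on a frequently filtered column such as code could be added as follows:

-- Hypothetical index name; such an index speeds up queries that filter or group by code.
CREATE INDEX idx_products_code ON products (code);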
If you would now run the command SHOW TABLES, then the table would be
reflected in your database.
mysql> SHOW TABLES;

Output:
+-------------------------+
| Tables_in_product_sales |
+-------------------------+
| products |
+-------------------------+
1 row in set (0.00 sec)

You can always look up the schema of a table by issuing the “DESCRIBE”
command:
mysql> DESCRIBE products;

Output:
+------------+------------------+------+----+---------+---------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+----+---------+---------------+
| productID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| code | char(6) | NO | | | |
| productname | varchar(30) | NO | | | |
| quantity | int(10) unsigned | NO | | 0 | |
| price | decimal(5,2) | NO | | 0.00 | |
+------------+------------------+------+----+---------+---------------+
5 rows in set (0.01 sec)

4.6 Inserting the Data

Once we have created the table, it is time to insert data into it. For now we will look at how to insert data manually into the table; a brief sketch of importing data from an external file (such as a CSV or text file) is given at the end of this section. Let us now imagine that we have to insert data into the products table we just created.
To do so, we make use of the following command:
mysql> INSERT INTO products VALUES (1, 'IPH', 'Iphone 5S Gold',
300, 625);

Output:
Query OK, 1 row affected (0.13 sec)

When we issue the above command, it will insert a single row of data into the table "products." The parentheses after VALUES specify the actual values that are to be inserted. An important point to note is that the values must be specified in the same order as the columns were defined when we created the "products" table. All numeric data (integers and decimal values) are specified without quotes, whereas character data must be specified within quotes.
Now let us go ahead and insert some more data into the "products" table:
mysql> INSERT INTO products VALUES(NULL, 'IPH',
'Iphone 5S Black', 8000, 655.25),(NULL, 'IPH',
'Iphone 5S Blue', 2000, 625.50);

Output:
Query OK, 2 rows affected (0.13 sec)
Records: 2 Duplicates: 0 Warnings: 0

In the above case, we inserted multiple rows of data at the same time. Each row of data was specified within parentheses, and the rows were separated by commas (,). Another point to note is that we kept the productID fields as NULL when inserting the data. This demonstrates that even if we provide NULL values, MySQL will make use of the AUTO_INCREMENT attribute to assign values to each row.


Sometimes you may want to provide data only for some columns, or provide it in a different order from the one in which the columns were defined when the table was created. This can be done using the following command:
mysql> INSERT INTO products (code, productname, quantity, price)
VALUES ('SNY', 'Xperia Z1', 10000, 555.48),('SNY', 'Xperia S',
8000, 400.49);

Output:
Query OK, 2 rows affected (0.13 sec)
Records: 2 Duplicates: 0 Warnings: 0

Notice here that we did not specify the productID column at all, but instead explicitly listed the columns, in the order in which we want to insert the data. The productID column will be automatically populated using the AUTO_INCREMENT attribute.
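As mentioned at the start of this section, data can also be imported from an external file. The following is only a hedged sketch: the file name products.csv, its column order, and the presence of a header row are assumptions for illustration, and the server/client may need local-infile loading enabled.

-- Sketch only: assumes products.csv is in the client's working directory, has a
-- header row, and lists columns in the order code, productname, quantity, price.
LOAD DATA LOCAL INFILE 'products.csv'
INTO TABLE products
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(code, productname, quantity, price);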

4.7 Querying the Database

Now that we have inserted some values into the products table, let us go ahead and
see how we can query the data. If you want to see all observations in a database
table, then make use of the SELECT * FROM tablename query:
mysql> SELECT * FROM products;

Output:
+----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+----------+------+-----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
| 5 | SNY | Xperia S | 8000 | 400.49 |
+----------+------+-----------------+----------+--------+
5 rows in set (0.00 sec)

SELECT query is perhaps the most widely known query of SQL. It allows you to
query a database and get the observations matching your criteria. SELECT * is the
most generic query, which will simply return all observations in a table. The general
syntax of SELECT query is as follows:
SELECT column1Name, column2Name, ... FROM tableName

This will return the selected columns from the named table. Another variation of the SELECT query is the following:
SELECT column1Name, column2Name, ... FROM tableName WHERE somecondition;

In the above version, only those observations would be returned that match
the criteria specified by the user. Let us understand them with the help of a few
examples:
mysql> SELECT productname, quantity FROM products;

Output:
+-----------------+----------+
| productname | quantity |
+-----------------+----------+
| Iphone 5S Gold | 300 |
| Iphone 5S Black | 8000 |
| Iphone 5S Blue | 2000 |
| Xperia Z1 | 10000 |
| Xperia S | 8000 |
+-----------------+----------+
5 rows in set (0.00 sec)

mysql> SELECT productname, price FROM products WHERE price < 600;

Output:
+-------------+--------+
| productname | price |
+-------------+--------+
| Xperia Z1 | 555.48 |
| Xperia S | 400.49 |
+-------------+--------+
2 rows in set (0.00 sec)

The above query will only give name and price columns for those records whose
price <600.
mysql> SELECT productname, price FROM products
WHERE price >= 600;

Output:
+-----------------+--------+
| productname | price |
+-----------------+--------+
| Iphone 5S Gold | 625.00 |
| Iphone 5S Black | 655.25 |
| Iphone 5S Blue | 625.50 |
+-----------------+--------+
3 rows in set (0.00 sec)

The above query will only give name and price columns for those records whose
price >= 600.
In order to select observations based on string comparisons, enclose the string
within quotes. For example:


mysql> SELECT productname, price FROM products
WHERE code = 'IPH';

Output:
+-----------------+--------+
| productname | price |
+-----------------+--------+
| Iphone 5S Gold | 625.00 |
| Iphone 5S Black | 655.25 |
| Iphone 5S Blue | 625.50 |
+-----------------+--------+
3 rows in set (0.00 sec)

The above command gives you the name and price of the products whose code
is “IPH.”
In addition to this, you can also perform string pattern matching using wildcard characters. For example, you can make use of the operators LIKE and NOT LIKE to check whether a string contains a specific pattern. For wildcard matches, use the underscore character "_" for a single-character match, and the percentage sign "%" for a multiple-character match. Here are a few examples:
• “phone%” will match strings that start with phone and can contain any characters
after.
• “%phone” will match strings that end with phone and can contain any characters
before.
• “%phone%” will match strings that contain phone anywhere in the string.
• “c_a” will match strings that start with “c” and end with “a” and contain any
single character in-between.
mysql> SELECT productname, price FROM products WHERE productname
LIKE 'Iphone%';

Output:
+-----------------+--------+
| productname | price |
+-----------------+--------+
| Iphone 5S Gold | 625.00 |
| Iphone 5S Black | 655.25 |
| Iphone 5S Blue | 625.50 |
+-----------------+--------+
3 rows in set (0.00 sec)

mysql> SELECT productname, price FROM products WHERE productname
LIKE '%Blue%';

Output:
+----------------+--------+
| productname | price |
+----------------+--------+
| Iphone 5S Blue | 625.50 |
+----------------+--------+
1 row in set (0.00 sec)

Additionally, you can also make use of Boolean operators such as AND, OR in
SQL queries to create multiple conditions.
mysql> SELECT * FROM products WHERE quantity >= 5000 AND
productname LIKE 'Iphone%';

Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
+-----------+------+-----------------+----------+--------+
1 row in set (0.00 sec)

This gives you all the details of products whose quantity is >=5000 and whose name starts with 'Iphone'.
mysql> SELECT * FROM products WHERE quantity >= 5000 AND price >
650 AND productname LIKE 'Iphone%';

Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
+-----------+------+-----------------+----------+--------+
1 row in set (0.00 sec)

If you want to find whether the condition matches any elements from within a
set, then you can make use of IN operator. For example:
mysql> SELECT * FROM products WHERE productname IN ('Iphone 5S
Blue', 'Iphone 5S Black');

Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
+-----------+------+-----------------+----------+--------+
2 rows in set (0.00 sec)

This gives the product details for the names provided in the list specified in the
command (i.e., “Iphone 5S Blue”, “Iphone 5S Black”).
Similarly, if you want to check whether values fall within a specific range, you can make use of the BETWEEN operator. For example:


mysql> SELECT * FROM products WHERE (price BETWEEN 400 AND 600)
AND (quantity BETWEEN 5000 AND 10000);

Output:
+-----------+------+-------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-------------+----------+--------+
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
| 5 | SNY | Xperia S | 8000 | 400.49 |
+-----------+------+-------------+----------+--------+
2 rows in set (0.00 sec)

This command gives you the product details whose price is between 400 and 600
and quantity is between 5000 and 10000, both inclusive.

4.8 ORDER BY Clause

Often, when we retrieve a large number of results, we might want to sort them in a specific order. To do so, we make use of ORDER BY in SQL. The general syntax for this is:
SELECT ... FROM tableName
WHERE criteria
ORDER BY columnA ASC|DESC, columnB ASC|DESC

mysql> SELECT * FROM products WHERE productname LIKE 'Iphone%'
ORDER BY price DESC;

Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
+-----------+------+-----------------+----------+--------+
3 rows in set (0.00 sec)

If you are getting a large number of results but want the output to be limited
only to a specific number of observations, then you can make use of LIMIT clause.
LIMIT followed by a number will limit the number of output results that will be
displayed.
mysql> SELECT * FROM products ORDER BY price LIMIT 2;

Output:
+-----------+------+-------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-------------+----------+--------+
| 5 | SNY | Xperia S | 8000 | 400.49 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
+-----------+------+-------------+----------+--------+
2 rows in set (0.00 sec)

Oftentimes, we might want to display columns or tables under an intuitive name that is different from the original one. To do so, we make use of the AS keyword to create an alias.
mysql> SELECT productID AS ID, code AS productCode, productname
AS Description, price AS Unit_Price FROM products ORDER
BY ID;

Output:
+----+-------------+-----------------+------------+
| ID | productCode | Description | Unit_Price |
+----+-------------+-----------------+------------+
| 1 | IPH | Iphone 5S Gold | 625.00 |
| 2 | IPH | Iphone 5S Black | 655.25 |
| 3 | IPH | Iphone 5S Blue | 625.50 |
| 4 | SNY | Xperia Z1 | 555.48 |
| 5 | SNY | Xperia S | 400.49 |
+----+-------------+-----------------+------------+
5 rows in set (0.00 sec)

4.9 Producing Summary Reports

A key part of SQL queries is to be able to provide summary reports from large
amounts of data. This summarization process involves data manipulation and
grouping activities. In order to enable users to provide such summary reports, SQL
has a wide range of operators such as DISTINCT, GROUP BY that allow quick
summarization and production of data. Let us look at these operators one by one.

4.9.1 DISTINCT

A column may have duplicate values. We could use the keyword DISTINCT to
select only distinct values. We can also apply DISTINCT to several columns to
select distinct combinations of these columns. For example:
mysql> SELECT DISTINCT code FROM products;

Output:
+-----+
| Code |
+-----+
| IPH |
| SNY |
+-----+
2 rows in set (0.00 sec)

4.9.2 GROUP BY Clause

The GROUP BY clause allows you to collapse multiple records with a common
value into groups. For example,


mysql> SELECT * FROM products ORDER BY code, productID;

Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
| 5 | SNY | Xperia S | 8000 | 400.49 |
+-----------+------+-----------------+----------+--------+
5 rows in set (0.00 sec)

mysql> SELECT * FROM products GROUP BY code;
-- Only the first record in each group is shown

Output:
+-----------+------+----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
+-----------+------+----------------+----------+--------+
2 rows in set (0.00 sec)

We can apply the GROUP BY clause together with aggregate functions to produce a summary report for each group.
The function COUNT(*) returns the number of rows selected; COUNT(columnName) counts only the non-NULL values of the given column. For example,
mysql> SELECT COUNT(*) AS `Count` FROM products;

Output:
+-------+
| Count |
+-------+
| 5 |
+-------+
1 row in set (0.00 sec)

mysql> SELECT code, COUNT(*) FROM products GROUP BY code;

Output:
+------+----------+
| code | COUNT(*) |
+------+----------+
| IPH | 3 |
| SNY | 2 |
+------+----------+
2 rows in set (0.00 sec)

We got “IPH” count as 3 because we have three entries in our table with the
product code “IPH” and similarly two entries for the product code “SNY.” Besides
COUNT(), there are many other aggregate functions such as AVG(), MAX(), MIN(),
and SUM(). For example,
mysql> SELECT MAX(price), MIN(price), AVG(price), SUM(quantity)
FROM products;
Output:
+------------+------------+------------+---------------+
| MAX(price) | MIN(price) | AVG(price) | SUM(quantity) |
+------------+------------+------------+---------------+
| 655.25 | 400.49 | 572.344000 | 28300 |
+------------+------------+------------+---------------+
1 row in set (0.00 sec)

This gives you MAX price, MIN price, AVG price, and total quantities of all the
products available in our products table. Now let us use GROUP BY clause:
mysql> SELECT code, MAX(price) AS `Highest Price`, MIN(price) AS
`Lowest Price` FROM products GROUP BY code;

Output:
+------+---------------+--------------+
| code | Highest Price | Lowest Price |
+------+---------------+--------------+
| IPH | 655.25 | 625.00 |
| SNY | 555.48 | 400.49 |
+------+---------------+--------------+
2 rows in set (0.00 sec)

This means, the highest price of an IPhone available in our database is 655.25
and the lowest price is 625.00. Similarly, the highest price of a Sony is 555.48 and
the lowest price is 400.49.

4.10 Modifying Data

To modify existing data, use the UPDATE ... SET command, with the following syntax:
UPDATE tableName SET columnName = {value|NULL|DEFAULT}, ... WHERE criteria

mysql> UPDATE products SET quantity = quantity + 50,
price = 600.5 WHERE productname = 'Xperia Z1';

Output:
Query OK, 1 row affected (0.14 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Let us check the modification in the products table.


mysql> SELECT * FROM products WHERE productname = 'Xperia Z1';


Output:
+-----------+------+-------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-------------+----------+--------+
| 4 | SNY | Xperia Z1 | 10050 | 600.50 |
+-----------+------+-------------+----------+--------+
1 row in set (0.00 sec)

You can see that the quantity of Xperia Z1 is increased by 50.

4.11 Deleting Rows

Use the DELETE FROM command to delete row(s) from a table; the syntax is:
DELETE FROM tableName # to delete all rows from the table.
DELETE FROM tableName WHERE criteria # to delete only the row(s) that meet the criteria.
For example,
mysql> DELETE FROM products WHERE productname LIKE 'Xperia%';

Output:
Query OK, 2 rows affected (0.03 sec)

mysql> SELECT * FROM products;

Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
+-----------+------+-----------------+----------+--------+
3 rows in set (0.00 sec)

mysql> DELETE FROM products;

Output:
Query OK, 3 rows affected (0.14 sec)

mysql> SELECT * FROM products;

Output:
Empty set (0.00 sec)

Beware that "DELETE FROM tableName" without a WHERE clause deletes ALL records from the table. Even with a WHERE clause, you might delete some records unintentionally. It is always advisable to issue a SELECT command with the same WHERE clause to check the result set before issuing the DELETE (or UPDATE).

4.12 Create Relationship: One-To-Many


4.12.1 PRIMARY KEY

Suppose that each product has one supplier, and each supplier supplies one or more
products. We could create a table called “suppliers” to store suppliers’ data (e.g.,
name, address, and phone number). We create a column with unique values, called supplierID, to identify every supplier. We set supplierID as the primary key for the suppliers table (to ensure uniqueness and facilitate fast search).
In order to relate the suppliers table to the products table, we add a new column
into the “products” table—the supplierID.
We then set the supplierID column of the products table as a foreign key which
references the supplierID column of the “suppliers” table to ensure the so-called
referential integrity. We need to first create the “suppliers” table, because the
“products” table references the “suppliers” table.
mysql> CREATE TABLE suppliers (supplierID INT UNSIGNED NOT NULL
AUTO_INCREMENT, name VARCHAR(30) NOT NULL DEFAULT '', phone
CHAR(8) NOT NULL DEFAULT '', PRIMARY KEY (supplierID));

Output:
Query OK, 0 rows affected (0.33 sec)

mysql> DESCRIBE suppliers;

Output:
+------------+------------------+------+-----+--------+---------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+--------+---------------+
| supplierID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(30) | NO | | | |
| phone | char(8) | NO | | | |
+------------+------------------+------+-----+--------+---------------+
3 rows in set (0.01 sec)

Let us insert some data into the suppliers table.


mysql> INSERT INTO suppliers VALUE (501, 'ABC Traders',
'88881111'), (502, 'XYZ Company', '88882222'), (503, 'QQ Corp',
'88883333');

Output:
Query OK, 3 rows affected (0.13 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM suppliers;


Output:
+------------+-------------+----------+
| supplierID | name | phone |
+------------+-------------+----------+
| 501 | ABC Traders | 88881111 |
| 502 | XYZ Company | 88882222 |
| 503 | QQ Corp | 88883333 |
+------------+-------------+----------+
3 rows in set (0.00 sec)

4.12.2 ALTER TABLE

The syntax for ALTER TABLE is as follows:


ALTER TABLE tableName
{ADD [COLUMN] columnName columnDefinition}
{ALTER|MODIFY [COLUMN] columnName columnDefinition
{SET DEFAULT columnDefaultValue} | {DROP DEFAULT}}
{DROP [COLUMN] columnName [RESTRICT|CASCADE]}
{ADD tableConstraint}
{DROP tableConstraint [RESTRICT|CASCADE]}

Instead of deleting and re-creating the products table, we shall use the statement
“ALTER TABLE” to add a new column supplierID into the products table. As we
have deleted all the records from products in the last few queries, let us rerun the three INSERT queries from Sect. 4.6 before running "ALTER TABLE."
mysql> ALTER TABLE products ADD COLUMN supplierID INT UNSIGNED
NOT NULL;

Output:
Query OK, 0 rows affected (0.43 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> DESCRIBE products;

Output:
+-------------+-----------------+-----+-----+---------+---------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-----------------+-----+-----+---------+---------------+
| productID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| code | char(6) | NO | | | |
| productname | varchar(30) | NO | | | |
| quantity | int(10) unsigned | NO | | 0 | |
| price | decimal(5,2) | NO | | 0.00 | |
| supplierID | int(10) unsigned | NO | | NULL | |
+-------------+-----------------+-----+-----+---------+---------------+
6 rows in set (0.00 sec)

4.12.3 FOREIGN KEY

Now, we shall add a foreign key constraint on the supplierID column of the "products" child table, referencing the "suppliers" parent table, to ensure that every supplierID in the "products" table always refers to a valid supplierID in the "suppliers" table. This is called referential integrity.
Before we add the foreign key, we need to set the supplierID of the existing records in the "products" table to a valid supplierID in the "suppliers" table (say supplierID = 501). Since the newly re-inserted rows do not yet carry a valid supplierID, we can set one using the UPDATE command.
mysql> UPDATE products SET supplierID = 501;

Output:
Query OK, 5 rows affected (0.04 sec)
Rows matched: 5 Changed: 5 Warnings: 0

Let us add a foreign key constraint.


mysql> ALTER TABLE products ADD FOREIGN KEY (supplierID)
REFERENCES suppliers (supplierID);

Output:
Query OK, 0 rows affected (0.56 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> DESCRIBE products;

Output:
+------------+-----------------+------+-----+---------+---------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-----------------+------+-----+---------+---------------+
| productID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| code | char(6) | NO | | | |
| productname | varchar(30) | NO | | | |
| quantity | int(10) unsigned | NO | | 0 | |
| price | decimal(5,2) | NO | | 0.00 | |
| supplierID | int(10) unsigned | NO | MUL | NULL | |
+------------+-----------------+------+-----+---------+---------------+
6 rows in set (0.00 sec)

mysql> SELECT * FROM products;

Output:
+-----------+------+-----------------+----------+--------+------------+
| productID | code | productname | quantity | price | supplierID |
+-----------+------+-----------------+----------+--------+------------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 | 501 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 | 501 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 | 501 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 | 501 |
| 5 | SNY | Xperia S | 8000 | 400.49 | 501 |
+-----------+------+-----------------+----------+--------+------------+
5 rows in set (0.00 sec)


mysql> UPDATE products SET supplierID = 502 WHERE productID = 1;

Output:
Query OK, 1 row affected (0.13 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> SELECT * FROM products;

Output:
+-----------+------+-----------------+----------+--------+------------+
| productID | code | productname | quantity | price | supplierID |
+-----------+------+-----------------+----------+--------+------------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 | 502 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 | 501 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 | 501 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 | 501 |
| 5 | SNY | Xperia S | 8000 | 400.49 | 501 |
+-----------+------+-----------------+----------+--------+------------+
5 rows in set (0.00 sec)

4.13 SELECT with JOIN

SELECT command can be used to query and join data from two related tables.
For example, to list the product’s name (in products table) and supplier’s name (in
suppliers table), we could join the two tables using the two common supplierID
columns:
mysql> SELECT products.productname, price, suppliers.name FROM
products JOIN suppliers ON products.supplierID
= suppliers.supplierID WHERE price < 650;

Output:
+----------------+--------+-------------+
| productname | price | name |
+----------------+--------+-------------+
| Iphone 5S Gold | 625.00 | XYZ Company |
| Iphone 5S Blue | 625.50 | ABC Traders |
| Xperia Z1 | 555.48 | ABC Traders |
| Xperia S | 400.49 | ABC Traders |
+----------------+--------+-------------+
4 rows in set (0.00 sec)

Here we need to qualify the columns with their table names (products.productname and suppliers.name) to differentiate columns that would otherwise be ambiguous.
Joining tables using a WHERE clause (the legacy method shown below) is possible but not recommended.
mysql> SELECT products.productname, price, suppliers.name FROM
products, suppliers WHERE products.supplierID =
suppliers.supplierID AND price < 650;

Output:
+----------------+--------+-------------+
| productname | price | name |
+----------------+--------+-------------+
| Iphone 5S Gold | 625.00 | XYZ Company |
| Iphone 5S Blue | 625.50 | ABC Traders |
| Xperia Z1 | 555.48 | ABC Traders |
| Xperia S | 400.49 | ABC Traders |
+----------------+--------+-------------+
4 rows in set (0.00 sec)

In the above query result, the column headings are taken directly from the table definitions and are not very descriptive. We could create aliases for the headings. Let us use aliases for the column names in the display.
mysql> SELECT products.productname AS 'Product Name', price,
suppliers.name AS 'Supplier Name' FROM products JOIN suppliers
ON products.supplierID = suppliers.supplierID WHERE price < 650;

Output:
+----------------+--------+---------------+
| Product Name | price | Supplier Name |
+----------------+--------+---------------+
| Iphone 5S Gold | 625.00 | XYZ Company |
| Iphone 5S Blue | 625.50 | ABC Traders |
| Xperia Z1 | 555.48 | ABC Traders |
| Xperia S | 400.49 | ABC Traders |
+----------------+--------+---------------+
4 rows in set (0.00 sec)

5 Summary

The chapter describes the essential commands for creating, modifying, and querying an RDBMS. Detailed descriptions and examples can be found in the books and websites listed in the reference section (Elmasri and Navathe 2014; Hoffer et al. 2011; MySQL using R 2018; MySQL using Python 2018). You can also refer to websites such as w3schools.com/sql and sqlzoo.net (both accessed on Jan 15, 2019), which help you learn SQL through a gamified console. Such practice will help you learn to query large databases, which can otherwise be quite daunting.

Exercises

Ex. 3.1 Print list of all suppliers who do not keep stock for IPhone 5S Black.
Ex. 3.2 Find out the product that has the biggest inventory by value (i.e., the product
that has the highest value in terms of total inventory).
Ex. 3.3 Print the supplier name who maintains the largest inventory of products.


Ex. 3.4 Due to the launch of a newer model, prices of IPhones have gone down and
the inventory value has to be written down. Create a new column (new_price) where
price is marked down by 20% for all black- and gold-colored phones, whereas it has
to be marked down by 30% for the rest of the phones.
Ex. 3.5 Due to this recent markdown in prices (refer to Ex. 3.4), which supplier
takes the largest hit in terms of inventory value?

References

Elmasri, R., & Navathe, S. B. (2014). Database systems: Models, languages, design and
application. England: Pearson.
Hoffer, J. A., Venkataraman, R., & Topi, H. (2011). Modern database management. England:
Pearson.
MySQL using R. Retrieved February 2018, from https://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf.
MySQL using Python. Retrieved February 2018, from http://mysql-python.sourceforge.net/MySQLdb.html.
