Chapter 3 Exploratory Data Analysis

EXPLORATORY DATA ANALYSIS

CHAPTER 3

“Introduction to Data Science: Practical Approach with R and Python”


B. Uma Maheswari and R. Sujatha
Copyright © 2021 Wiley India Pvt. Ltd. All rights reserved.
LEARNING OBJECTIVES
Apply the steps in data pre-processing.
Understand the data by inspecting and visualizing it.
Learn the concept of outliers and how to deal with them.
Deal with missing values during data pre-processing.
Understand the concept of standardization.
Apply R and Python programming for data analysis.
DATA SCIENCE PROCESS MODEL
Objectives of EDA
To develop an understanding of the data
To identify trends and patterns
To understand the relationships between variables
To decide on the appropriate models to be executed on the data
To find answers to questions relating to the data
To test assumptions
STEPS IN DATA PRE-PROCESSING
DATASET DESCRIPTION
S.No.  Column Name  Description
1      phoneno      Phone number of the customer
2      age          Age group of the customer (1 -> 18-30, 2 -> 31-40, 3 -> 41-50, 4 -> Above 50)
3      gender       Gender of the customer (0 -> Male, 1 -> Female)
4      zipcode      Zip code of the area where the customer lives
5      calls        Number of calls made by the customer per month
6      sms          Number of SMS sent by the customer per month
7      mms          Number of MMS sent by the customer per month
8      charges      Monthly charges paid by the customer
9      coverage     Number of days out of coverage
10     complaint    Type of complaint (0 -> No problem, 1 -> Recharge issues, 2 -> Problems in the offer/package, 3 -> Network problem, 4 -> Call dropping)
11     sim          Single or dual SIM (0 -> Single SIM, 1 -> Dual SIM)
12     phone        Type of phone (0 -> Android, 1 -> iOS)
13     prepost      Prepaid or postpaid (0 -> Prepaid, 1 -> Postpaid)
14     churn        Customer churn (0 -> No churn, 1 -> Churn)
UNDERSTANDING THE DATA

Load the dataset
Dimensions of the data (dim, nrow, ncol, names)
Structure of the dataset
Summary of the dataset
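The steps above can be sketched in Python using only the standard library. The rows below are hypothetical values made up for illustration, and only a subset of the columns from the dataset description is used. The variable names (`n_rows`, `col_names`, and so on) are assumptions chosen to mirror the R functions `nrow`, `names`, `ncol` and `dim`.

```python
import csv
import io
import statistics

# Hypothetical sample rows; column names follow the dataset description.
SAMPLE_CSV = """phoneno,age,gender,calls,sms,charges,churn
9000000001,1,0,120,30,450.0,0
9000000002,2,1,80,12,300.5,0
9000000003,4,0,200,5,799.0,1
9000000004,3,1,60,45,250.0,0
"""

# Load the dataset: every row becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Dimensions of the data.
n_rows = len(rows)                # comparable to nrow
col_names = list(rows[0].keys())  # comparable to names
n_cols = len(col_names)           # comparable to ncol
dimensions = (n_rows, n_cols)     # comparable to dim

# Structure of the dataset: show the first value of every column.
for name in col_names:
    print(f"{name}: first value = {rows[0][name]}")

# Summary of one numeric column.
calls = [int(r["calls"]) for r in rows]
summary = {
    "min": min(calls),
    "median": statistics.median(calls),
    "mean": statistics.mean(calls),
    "max": max(calls),
}
print(dimensions, summary)
```

With pandas, the same steps are usually written as `pd.read_csv(...)`, `df.shape`, `df.info()` and `df.describe()`.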
CONTINUOUS AND CATEGORICAL VARIABLES
Continuous variables are quantitative variables that can be measured and can take
any of infinitely many values within a range. Mean, median and mode
can be calculated for continuous variables. Examples: height, weight,
speed of a vehicle.
Categorical variables are variables whose values fall into a finite number of
distinct groups. Examples: gender, pass/fail.
In simple words, if we can measure a variable it is continuous, and if we can
only count how many observations fall into each of its groups it is categorical.
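A minimal sketch of how the two kinds of variables are typically summarized: mean, median and mode for a continuous column, and frequency counts for a categorical one. The values are hypothetical, and the names `charges` and `gender` are borrowed from the dataset description only for illustration.

```python
import statistics
from collections import Counter

# Continuous variable: measured values (hypothetical monthly charges).
charges = [450.0, 300.5, 799.0, 250.0, 300.5]

# Categorical variable: coded groups (0 -> Male, 1 -> Female).
gender = [0, 1, 0, 1, 1]
GENDER_LABELS = {0: "Male", 1: "Female"}

# Mean, median and mode are meaningful for the continuous column.
charges_mean = statistics.mean(charges)
charges_median = statistics.median(charges)
charges_mode = statistics.mode(charges)

# For the categorical column we count observations per group instead.
gender_counts = Counter(gender)
labelled_counts = {GENDER_LABELS[code]: n for code, n in gender_counts.items()}

print(charges_mean, charges_median, charges_mode)
print(labelled_counts)
```

Averaging the gender codes would run without error, but the result would have no useful interpretation, which is why the codes are counted rather than averaged.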
NORMAL DISTRIBUTION

Line drawing to be drawn
RIGHT SKEWED AND LEFT SKEWED

Line drawing to be drawn


DATA VISUALIZATION
Histogram (continuous variables)
Barplot (categorical variables)
Boxplot (continuous variables)
BOXPLOT
A box plot gives a good representation of the distribution of quantitative data. It is also known as
a box-and-whisker plot. It is used in exploratory data analysis to draw inferences from the data.
A box plot divides the data into quartiles.
The first 25% of the data lies between the minimum value and the start of the box, which is the first
quartile (Q1). This portion is drawn as the lower whisker.
The second 25% of the data lies between the start of the box and the median, which is the second
quartile (Q2).
The third 25% of the data lies between the median and the end of the box, which is the third
quartile (Q3).
The last 25% of the data lies between the end of the box and the maximum value. This portion is drawn as
the upper whisker.
The lengths of the whiskers and the position of the median indicate the skewness of the data.
The plot shows the interquartile range (IQR), which is the difference between the 75th and the 25th
percentiles (Q3 - Q1).
A box plot also indicates the presence of outliers.
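The quantities a box plot is drawn from can be computed directly. The sketch below uses hypothetical data and `statistics.quantiles` with `method="inclusive"`. Quartile conventions differ between tools, so the choice of method is an assumption and other software may report slightly different values for the same data.

```python
import statistics

# Hypothetical, already sorted sample of a continuous variable.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# Three cut points split the data into four quarters: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

minimum = min(data)  # end of the lower whisker when there are no outliers
maximum = max(data)  # end of the upper whisker when there are no outliers
iqr = q3 - q1        # height of the box

print("min:", minimum, "Q1:", q1, "median:", q2, "Q3:", q3, "max:", maximum)
print("IQR:", iqr)
```

Here the upper whisker (from Q3 to the maximum) is longer than the lower one (from the minimum to Q1), which hints at a right-skewed sample.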
BOX PLOT AND OUTLIERS
[Figure: box plot labelled with the minimum value, 1st quartile, median (2nd quartile), 3rd quartile, maximum value, whiskers and outliers]
OUTLIER TREATMENT
[Figure labels: first 25% of the data, second 25% of the data, third 25% of the data, last 25% of the data]
DEALING WITH MISSING VALUES
STANDARDIZING DATA

This process is also called feature scaling.


This is usually done when the columns of a dataset have very different ranges of
values. It ensures that the variables are on the same scale.
This can be done in two ways: normalization and standardization.
Normalization rescales each value using the column's minimum and maximum, whereas
standardization rescales it using the column's mean and standard deviation.
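Both approaches can be sketched in a few lines. The function names `min_max_normalize` and `standardize` are hypothetical, the data is made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption: some tools divide by the sample standard deviation instead.

```python
import statistics

def min_max_normalize(values):
    """Rescale values to the range [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    # Note: a constant column (hi == lo) would need special handling.
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / sd for v in values]

# Hypothetical monthly charges.
charges = [200.0, 400.0, 600.0, 800.0]

normalized = min_max_normalize(charges)
standardized = standardize(charges)

print(normalized)
print(standardized)
```

After normalization the smallest value maps to 0 and the largest to 1. After standardization the column has mean 0 and standard deviation 1, so values are expressed as the number of standard deviations from the mean.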
MEAN
MEDIAN
MODE
VARIANCE AND STANDARD DEVIATION
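The IQR can also be used to identify suspected outliers.
In general, a suspected outlier lies in one of the following two ranges,
shown here for Q1 = 4 and Q3 = 15 (IQR = 11, 1.5 × IQR = 16.5):
Below Q1 - 1.5 × IQR = 4 - 16.5 = -12.5
Above Q3 + 1.5 × IQR = 15 + 16.5 = 31.5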
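The two limits can be reproduced with a short function. This sketch assumes the figures above come from Q1 = 4 and Q3 = 15, which gives IQR = 11 and 1.5 × IQR = 16.5. The function name `iqr_fences` and the list of values are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper limits beyond which a value is a suspected outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Assumed quartiles: Q1 = 4 and Q3 = 15.
lower, upper = iqr_fences(4, 15)

# Hypothetical observations to check against the limits.
values = [-14, 3, 4, 9, 15, 20, 40]
suspected = [v for v in values if v < lower or v > upper]

print(lower, upper)  # -12.5 31.5
print(suspected)     # [-14, 40]
```

Values flagged this way are only suspected outliers. Whether to remove, cap or keep them depends on the data and the analysis.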
Dependent Variables
Independent Variables

A sample dataset
