Session 2 - Data Pre-Processing

DATA PREPROCESSING

 Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.

 Data preprocessing is a technique used to convert raw data into a clean data set. In other words, data gathered from different sources is collected in a raw format that is not suitable for analysis.

NEED FOR DATA PREPROCESSING

 To achieve better results from the applied model in Machine Learning projects, the data has to be in a proper format.
 Some Machine Learning models need information in a specified format. For example, the Random Forest algorithm does not support null values, so null values have to be handled in the original raw data set before the algorithm is executed.
 The data set should also be formatted so that more than one Machine Learning or Deep Learning algorithm can be executed on the same data set, and the best of them chosen.

DATA PREPROCESSING TECHNIQUES

 Data Cleaning: to identify and handle missing or erroneous data.
 Data Transformation: converting data from one format or structure to another; this can improve the quality of the data and make it more suitable for analysis or modeling.
 Data Selection: selecting a subset of data from a larger dataset based on certain criteria; this can reduce the size of the dataset and focus the analysis or modeling on relevant data.
DATA CLEANING

 Removing duplicates:

Duplicates can skew the results of data analysis or machine learning models. Removing them improves the accuracy of results and reduces the risk of errors.
 Handling missing data:

Missing data can be handled with various techniques, such as deleting the affected rows, imputing the missing values, or replacing them with statistics such as the mean or median.
 Handling outliers:

Outliers can also be treated as missing data: a common technique is to replace outlier values with missing-value markers and then handle them with the same imputation methods.
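A minimal sketch of these three cleaning steps with pandas; the DataFrame, the column name and the outlier rule below are invented for illustration, not part of the slides:

import pandas as pd
import numpy as np

# Hypothetical raw data with a duplicate row, a missing value and an outlier.
df = pd.DataFrame({"price": [4.0, 4.0, np.nan, 9.0, 900.0]})

df = df.drop_duplicates()                                # removing duplicates
df.loc[df["price"] > 100, "price"] = np.nan              # treat the outlier as missing data
df["price"] = df["price"].fillna(df["price"].median())   # impute missing values with the median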

DATA CLEANING

Examples
 Noise and outliers
 Missing values
 Duplicate data

DATA CLEANING

Handling Noisy Data


 Binning: First, the data is sorted, and then the sorted values are distributed into bins. There are three methods for smoothing the data in each bin.
 Partition into (equi-depth) bins

 Smoothing by bin means: the values in the bin are replaced by the mean value of the bin.
 Smoothing by bin medians: the values in the bin are replaced by the median value of the bin.
 Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as boundaries, and each value in the bin is replaced by the closest boundary value.

DATA CLEANING
Handling Noisy Data
 Binning:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
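A minimal Python sketch of equi-depth binning and smoothing by bin means, reproducing the price example above (note that the slide rounds the exact bin means 22.75 and 29.25 to 23 and 29):

import pandas as pd

# Prices from the example above.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into three equal-frequency (equi-depth) bins.
bins = pd.qcut(prices, q=3, labels=["Bin 1", "Bin 2", "Bin 3"])

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())   # 9.0 for Bin 1, 22.75 for Bin 2, 29.25 for Bin 3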
DATA CLEANING

Handling Noisy Data


 Regression: Regression is used to smooth the data by fitting it to a function, and it helps to handle noise when irrelevant data is present. For analysis purposes, regression also helps to decide which variables are suitable for the analysis.
 Clustering: Clustering is used to find outliers and to group similar data together; points that fall outside the clusters can be treated as outliers. Clustering is generally used in unsupervised learning.
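A hedged sketch of the clustering idea: points that do not join any dense cluster can be flagged as outliers. The values and DBSCAN parameters below are illustrative assumptions:

import numpy as np
from sklearn.cluster import DBSCAN

# One extreme value added to the earlier price list.
X = np.array([[4], [8], [9], [15], [21], [21], [24], [25], [26], [28], [29], [200]])

# DBSCAN labels points that belong to no dense cluster with -1.
labels = DBSCAN(eps=10, min_samples=2).fit_predict(X)
print(labels)   # the value 200 gets the label -1, i.e. it is treated as noise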
DATA TRANSFORMATION
 Standardizing or normalizing data:

Standardizing or normalizing data involves scaling the data to a common scale. This is important for
some machine learning algorithms that are sensitive to the scale of features.
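A minimal sketch with scikit-learn, assuming a small made-up feature matrix: StandardScaler rescales each column to zero mean and unit variance, while MinMaxScaler rescales each column to the [0, 1] range:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_normalized = MinMaxScaler().fit_transform(X)       # values rescaled to [0, 1] per column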

DATA TRANSFORMATION

 Encoding categorical data:

Categorical data can be encoded into numerical data using techniques such as one-hot encoding or
label encoding. This is important for some machine learning algorithms that cannot handle
categorical data.
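A brief sketch of both techniques with pandas and scikit-learn; the "city" column is an invented example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Paris", "Tunis", "Paris", "Cairo"]})

one_hot = pd.get_dummies(df, columns=["city"])                # one-hot encoding: one binary column per category
df["city_label"] = LabelEncoder().fit_transform(df["city"])   # label encoding: one integer per category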

DATA TRANSFORMATION
 Aggregation:

Aggregation involves combining multiple data points into a single data point. This can be useful for
summarizing data and reducing dimensionality.
 Discretization:

Discretization involves converting continuous attribute values into a finite set of intervals with minimal loss of information, and associating each interval with a specific data value or conceptual label.
 Feature engineering:

Feature engineering involves creating new features from existing features. These techniques help to highlight the most important patterns and relationships in the data, which in turn helps the machine learning model learn from the data more effectively.
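A short illustrative sketch of all three transformations with pandas; the sales columns below are invented for the example:

import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "price": [4.0, 8.0, 25.0, 34.0],
    "quantity": [10, 5, 2, 1],
})

# Aggregation: summarize several rows into one value per store.
mean_price_per_store = df.groupby("store")["price"].mean()

# Discretization: map the continuous price onto labelled intervals.
df["price_band"] = pd.cut(df["price"], bins=[0, 10, 30, 100], labels=["low", "mid", "high"])

# Feature engineering: derive a new feature from existing ones.
df["revenue"] = df["price"] * df["quantity"]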

DATA SELECTION

 Random sampling:
Random sampling involves selecting a random subset
of data from the larger dataset. This is useful when
the dataset is too large to process as a whole and a
representative sample is needed.
 Stratified sampling:
Stratified sampling involves dividing the dataset into
subgroups based on a specific variable and then
selecting a random sample from each subgroup. This
is useful when the variable is important for analysis or
modeling.
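A minimal sketch of both sampling strategies; the toy dataset and its 80/20 label ratio are assumptions made for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

# Random sampling: take a 20% random subset of the rows.
random_subset = df.sample(frac=0.2, random_state=42)

# Stratified sampling: preserve the label proportions in both splits.
train, test = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)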

DATA SELECTION
 Feature selection:

Feature selection involves selecting relevant features for analysis or modeling. This is important for
reducing dimensionality and improving the performance of machine learning models.

Chi-square Test:
For categorical features, calculate Chi-square between each
feature and the target and select the desired number of features
with the best Chi-square scores.

Correlation Coefficient:
Good variables correlate highly with the target. Variables should
be correlated with the target but uncorrelated among
themselves.
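A hedged sketch of chi-square feature selection with scikit-learn, using the Iris dataset purely as a convenient non-negative example:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the best chi-square scores against the target.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the selected features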

IMPORTING LIBRARIES

Data analysis library
 import pandas as pd
Scientific and technical computing
 import scipy
Working with arrays
 import numpy as np
Statistical data visualization
 import seaborn as sns
Plotting and charting
 import matplotlib.pyplot as plt
To split our data into train and test sets
 from sklearn.model_selection import train_test_split
OVERVIEW OF THE PANDAS LIBRARY
 Pandas provides two primary data structures for storing and manipulating data: Series
and DataFrame.
 A Series is a one-dimensional array-like object that can hold any data type
 DataFrame is a two-dimensional table-like object consisting of rows and columns.
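A tiny sketch of both structures; the column names are made up:

import pandas as pd

# A Series is one-dimensional; a DataFrame is a two-dimensional table.
prices = pd.Series([4, 8, 9, 15], name="price")
df = pd.DataFrame({"price": [4, 8, 9, 15], "quantity": [10, 5, 2, 1]})
print(df.head())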

DATA CLEANING WITH PANDAS

 Handling missing data: Pandas provides methods for filling in missing data or dropping
missing data points. For example, the dropna() method drops any rows or columns that
contain missing data, while the fillna() method fills in missing data with a specified
value.
 Removing duplicates: Pandas provides a drop_duplicates() method that removes
duplicate rows from a DataFrame.
 Correcting errors: Pandas provides methods for replacing or removing incorrect values.
For example, the replace() method can be used to replace specific values with new
values.
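A short sketch combining these methods; the frame, the erroneous "Pariss" value and the chosen fill values are illustrative assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [4.0, np.nan, 9.0, 9.0],
    "city":  ["Paris", None, "Cairo", "Pariss"],
})

cleaned = df.dropna()                                   # drop rows that contain missing data
filled = df.fillna({"price": 0.0, "city": "Unknown"})   # or fill missing data with specified values
filled = filled.drop_duplicates()                       # remove duplicate rows
filled = filled.replace({"city": {"Pariss": "Paris"}})  # replace an incorrect value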

DATA TRANSFORMATION WITH PANDAS

 Filtering data: Pandas provides methods for selecting specific rows or columns based
on criteria such as a specific value, a range of values, or a boolean expression. For
example, the loc[] indexer selects rows and columns by label, while the iloc[] indexer
selects rows and columns by integer position.
 Sorting data: Pandas provides a sort_values() method for sorting a DataFrame by one
or more columns or indices.
 Grouping data: Pandas provides a groupby() method for grouping a DataFrame by one
or more variables and performing aggregate operations on each group, such as sum,
mean, and count.
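A brief sketch of these three operations on an invented frame:

import pandas as pd

df = pd.DataFrame({"store": ["A", "B", "A", "B"], "price": [4, 25, 9, 34]})

cheap = df.loc[df["price"] < 10, ["store", "price"]]   # filter rows with a boolean expression, columns by label
first_two = df.iloc[0:2, :]                            # select rows by integer position
sorted_df = df.sort_values("price", ascending=False)   # sort by one column
totals = df.groupby("store")["price"].sum()            # aggregate per group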

HOW TO PREPROCESS DATA IN PYTHON STEP-BY-STEP

 Load data in Pandas.
 Drop columns that aren't useful.
 Drop rows with missing values.
 Create dummy variables.
 Take care of missing data.
 Convert the data frame to NumPy.
 Divide the data set into training data and test data.
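A hedged end-to-end sketch of these steps; the file name "data.csv" and the column names ("id", "category", "target") are assumptions for illustration, not part of the original material:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                       # load data in Pandas
df = df.drop(columns=["id"])                       # drop a column that isn't useful
df = df.dropna(subset=["target"])                  # drop rows where the target is missing
df = pd.get_dummies(df, columns=["category"])      # create dummy variables
df = df.fillna(df.median(numeric_only=True))       # take care of remaining missing data
X = df.drop(columns=["target"]).to_numpy()         # convert the features to NumPy
y = df["target"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)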

