Session 2 - Data Pre-Processing

Uploaded by

Rittik Kumar Naskar

DATA PRE-PROCESSING

DATA PREPROCESSING

 Pre-processing refers to the transformations applied to data before it is fed to a learning algorithm.

 Data preprocessing converts raw data into a clean data set. Data gathered from different sources
arrives in a raw format that is rarely suitable for analysis as it stands.

NEED FOR DATA PREPROCESSING

 To get better results from a Machine Learning model, the data
has to be in a suitable format.
 Some Machine Learning models need data in a specific format.
For example, many Random Forest implementations do not
accept null values, so null values in the raw data set have to
be handled before the algorithm can be run.
 The data set should also be formatted so that more than one
Machine Learning or Deep Learning algorithm can be run on the
same data set, and the best of them chosen.

DATA PREPROCESSING TECHNIQUES

 Data cleaning: identifying and handling missing or erroneous data.
 Data transformation: converting data from one format or structure to another. It can help to
improve the quality of data and make it more suitable for analysis or modeling.
 Data selection: selecting a subset of data from a larger dataset based on certain criteria. It can
help to reduce the size of the dataset and focus on relevant data for analysis or modeling.

DATA CLEANING

 Removing duplicates:

Duplicates can skew the results of data analysis or machine learning models. Removing duplicates
can improve the accuracy of results and reduce the risk of errors.
 Handling missing data:

Missing data can be handled in several ways, such as deleting the affected rows or columns, or
imputing the missing values with a statistic such as the mean or median.
 Handling outliers:

Outliers can be treated in much the same way as missing data. For example, outlying values can be
replaced with missing values and then handled with the same techniques.
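
The outlier handling described above can be sketched in pandas. This is a minimal illustration, not a prescribed method: the 1.5 x IQR rule used to flag outliers is an assumption (the slides do not name a detection rule), and the sample values are made up.

```python
import pandas as pd

# Hypothetical measurements; 95.0 is an obvious outlier.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Assumption: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Replace outliers with missing values, then impute with the median.
masked = s.mask(is_outlier)
cleaned = masked.fillna(masked.median())
```

Imputing with the median rather than the mean keeps the fill value from being pulled toward any remaining extreme values.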

DATA CLEANING

Examples
 Noise and outliers
 Missing values
 Duplicate data

DATA CLEANING

Handling Noisy Data

 Binning: First, the data is sorted, and then the sorted values are partitioned into
equal-depth (equal-frequency) bins. There are three methods for smoothing the data in a bin:

 Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
 Smoothing by bin medians: each value in the bin is replaced by the median value of the bin.
 Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as the bin
boundaries, and each value is replaced by the closest boundary value.

DATA CLEANING
Handling Noisy Data
 Binning:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
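
The worked example above can be reproduced with a short NumPy sketch. The helper names `smooth_by_mean` and `smooth_by_boundaries` are illustrative, and sending ties to the lower boundary is an assumption (no tie occurs in this data).

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth partition: sort, then split into 3 bins of 4 values each.
bins = np.split(np.sort(prices), 3)

def smooth_by_mean(b):
    # Replace every value in the bin with the (unrounded) bin mean.
    return np.full(len(b), b.mean())

def smooth_by_boundaries(b):
    # Replace each value with the closer of the bin minimum and maximum.
    lo, hi = b.min(), b.max()
    return np.where(b - lo <= hi - b, lo, hi)

by_mean = [smooth_by_mean(b) for b in bins]
by_boundaries = [smooth_by_boundaries(b) for b in bins]
```

The exact bin means are 9, 22.75 and 29.25; the slide shows them rounded to 9, 23 and 29.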
DATA CLEANING

Handling Noisy Data

 Regression: Fitting the data to a function,
such as a regression line, smooths out
noise and helps when the data contains
irrelevant variation. Regression analysis
also helps to decide which variables are
suitable for our analysis.
 Clustering: This is used for grouping
similar data and for finding outliers,
which fall outside the groups.
Clustering is generally used in
unsupervised learning.
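
Smoothing by regression can be sketched as follows. This is a minimal example assuming a straight-line relationship; the data points are made up, and `np.polyfit` stands in for whichever regression method is actually used.

```python
import numpy as np

# Hypothetical noisy readings that roughly follow y = 2x + 1.
x = np.arange(6, dtype=float)
noise = np.array([0.3, -0.2, 0.1, -0.3, 0.2, -0.1])
y = 2 * x + 1 + noise

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, 1)

# The smoothed values are the points on the fitted line.
smoothed = slope * x + intercept
```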

DATA TRANSFORMATION
 Standardizing or normalizing data:

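Standardizing or normalizing data involves scaling the features to a common scale. Standardizing
typically rescales each feature to zero mean and unit variance, while normalizing typically rescales
it to a fixed range such as [0, 1]. This is important for some machine learning algorithms that are
sensitive to the scale of features.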
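
A minimal pandas sketch of both kinds of scaling, assuming "standardizing" means z-scores and "normalizing" means min-max scaling to [0, 1]. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000.0, 2000.0, 6000.0],
})

# Standardization (z-score): zero mean, unit variance per column.
# ddof=0 uses the population standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

# Normalization (min-max): rescale each column to the range [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())
```

After either transformation, `age` and `income` are on comparable scales even though their raw ranges differ by orders of magnitude.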

DATA TRANSFORMATION

 Encoding categorical data:

Categorical data can be encoded into numerical data using techniques such as one-hot encoding or
label encoding. This is important for some machine learning algorithms that cannot handle
categorical data.
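
Both encodings can be sketched in pandas. The `color` column is hypothetical; `pd.get_dummies` gives a one-hot encoding, and category codes are used here as one simple way to obtain a label encoding.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer per category (categories sorted alphabetically).
labels = df["color"].astype("category").cat.codes
```

Label encoding imposes an arbitrary order on the categories (here blue < green < red), so one-hot encoding is usually the safer choice for models that interpret numeric magnitude.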

DATA TRANSFORMATION
 Aggregation:

Aggregation involves combining multiple data points into a single data point. This can be useful for
summarizing data and reducing the volume of data to process.
 Discretization:

Discretization involves converting continuous data attribute values into a finite set of intervals with
minimal loss of information and associating with each interval some specific data value or
conceptual labels.
 Feature engineering:

Feature engineering involves creating new features from existing features. These techniques help to
highlight the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.
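
The three transformations above can be sketched together in pandas. The `sales` table, its column names, and the bin edges are all hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [3, 5, 2, 8],
    "price": [10.0, 12.0, 9.0, 11.0],
})

# Feature engineering: derive a new feature from two existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# Aggregation: combine the rows of each store into a single total.
revenue_per_store = sales.groupby("store")["revenue"].sum()

# Discretization: map the continuous unit counts onto labeled intervals
# (0, 3], (3, 6] and (6, 10].
sales["volume"] = pd.cut(
    sales["units"], bins=[0, 3, 6, 10], labels=["low", "medium", "high"]
)
```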

DATA SELECTION

 Random sampling:
Random sampling involves selecting a random subset
of data from the larger dataset. This is useful when
the dataset is too large to process as a whole and a
representative sample is needed.
 Stratified sampling:
Stratified sampling involves dividing the dataset into
subgroups based on a specific variable and then
selecting a random sample from each subgroup. This
is useful when the variable is important for analysis or
modeling.
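
A minimal pandas sketch of both sampling schemes. The `label` column plays the role of the stratification variable; the data and the 50% sampling fraction are assumptions for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data set: 8 rows of class "a", 2 of class "b".
df = pd.DataFrame({"label": ["a"] * 8 + ["b"] * 2, "value": range(10)})

# Random sampling: 4 rows drawn uniformly from the whole data set.
random_sample = df.sample(n=4, random_state=0)

# Stratified sampling: draw 50% of the rows within each label group,
# so the class proportions of the original data are preserved.
stratified_sample = df.groupby("label").sample(frac=0.5, random_state=0)
```

A purely random sample of this size could miss class "b" entirely, whereas the stratified sample always contains it.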

DATA SELECTION
 Feature selection:

Feature selection involves selecting relevant features for analysis or modeling. This is important for
reducing dimensionality and improving the performance of machine learning models.

Chi-square Test:
For categorical features, calculate Chi-square between each
feature and the target and select the desired number of features
with the best Chi-square scores.

Correlation Coefficient:
Good variables correlate highly with the target. Variables should
be correlated with the target but uncorrelated among
themselves.
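
Both scoring approaches can be sketched with pandas alone. The helper `chi_square_score` is illustrative: it computes the Pearson chi-square statistic from a contingency table without any continuity correction. The feature names and data are made up.

```python
import pandas as pd

X = pd.DataFrame({
    "f_signal": [0, 0, 0, 1, 1, 1, 0, 1],  # identical to the target
    "f_noise":  [1, 0, 1, 0, 1, 0, 1, 0],  # weakly related to the target
})
y = pd.Series([0, 0, 0, 1, 1, 1, 0, 1], name="target")

def chi_square_score(feature, target):
    # Observed counts for every (feature value, target value) pair.
    observed = pd.crosstab(feature, target).to_numpy()
    # Expected counts if feature and target were independent.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Chi-square: score each feature and keep the best one.
scores = {name: chi_square_score(X[name], y) for name in X.columns}
best_feature = max(scores, key=scores.get)

# Correlation coefficient of each feature with the target.
corr_with_target = X.corrwith(y)
```

Here `f_signal` gets the higher chi-square score and a correlation of 1 with the target, so it would be selected first.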

IMPORTING LIBRARIES

Data analysis library
 import pandas as pd
Scientific and technical computing
 import scipy
Working with arrays
 import numpy as np
Statistical data visualization
 import seaborn as sns
Plotting
 import matplotlib.pyplot as plt
Splitting the data into train and test sets
 from sklearn.model_selection import train_test_split

OVERVIEW OF THE PANDAS LIBRARY
 Pandas provides two primary data structures for storing and manipulating data: Series
and DataFrame.
 A Series is a one-dimensional array-like object that can hold any data type.
 A DataFrame is a two-dimensional table-like object consisting of rows and columns.
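
A short sketch of the two structures; the names and values are made up.

```python
import pandas as pd

# A Series: one-dimensional, labeled values.
ages = pd.Series([23, 35, 41], index=["ana", "ben", "cy"], name="age")

# A DataFrame: a two-dimensional table with labeled rows and columns.
people = pd.DataFrame(
    {"age": [23, 35, 41], "city": ["Pune", "Delhi", "Goa"]},
    index=["ana", "ben", "cy"],
)

# Selecting a single column of a DataFrame returns a Series.
age_column = people["age"]
```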

DATA CLEANING WITH PANDAS

 Handling missing data: Pandas provides methods for filling in missing data or dropping
missing data points. For example, the dropna() method drops any rows or columns that
contain missing data, while the fillna() method fills in missing data with a specified
value.
 Removing duplicates: Pandas provides a drop_duplicates() method that removes
duplicate rows from a DataFrame.
 Correcting errors: Pandas provides methods for replacing or removing incorrect values.
For example, the replace() method can be used to replace specific values with new
values.
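
The methods above can be sketched on a small, made-up DataFrame. Treating the string "N/A" as an incorrect value is an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "N/A"],
    "temp": [31.0, 31.0, np.nan, 28.0],
})

# Removing duplicates: the second "Pune" row is an exact duplicate.
deduped = df.drop_duplicates()

# Correcting errors: replace a placeholder string in the "city" column.
fixed = deduped.replace({"city": {"N/A": "Unknown"}})

# Handling missing data, option 1: fill with the column mean.
filled = fixed.fillna({"temp": fixed["temp"].mean()})

# Handling missing data, option 2: drop rows that contain missing values.
dropped = fixed.dropna()
```

Each call returns a new DataFrame and leaves the original untouched, which makes it easy to compare the effect of each cleaning step.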

DATA TRANSFORMATION WITH PANDAS

 Filtering data: Pandas provides ways to select specific rows or columns based
on criteria such as a specific value, a range of values, or a boolean expression. For
example, the loc[] indexer selects rows and columns by label (or boolean mask), while
the iloc[] indexer selects rows and columns by integer position.
 Sorting data: Pandas provides a sort_values() method for sorting a DataFrame by the
values in one or more columns.
 Grouping data: Pandas provides a groupby() method for grouping a DataFrame by one
or more variables and performing aggregate operations on each group, such as sum,
mean, and count.
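
Filtering, sorting and grouping can be sketched on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "score": [10, 40, 30, 20]},
    index=["a", "b", "c", "d"],
)

# Filtering: loc[] with a boolean expression and column labels.
high_scores = df.loc[df["score"] > 15, ["team", "score"]]

# Filtering: iloc[] by integer position (the first two rows).
first_two = df.iloc[:2]

# Sorting: order the rows by score, highest first.
ranked = df.sort_values("score", ascending=False)

# Grouping: aggregate the scores of each team.
summary = df.groupby("team")["score"].agg(["sum", "mean", "count"])
```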

HOW TO PREPROCESS DATA IN PYTHON STEP-BY-STEP

 Load data in Pandas.

 Drop columns that aren’t useful.
 Drop rows with missing values.
 Create dummy variables.
 Take care of missing data.
 Convert the data frame to NumPy.
 Divide the data set into training data and test data.
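
The steps above can be sketched end to end. This is one possible reading of the list, not a definitive pipeline: the CSV content and column names are made up, "drop rows with missing values" is interpreted as dropping rows whose target is missing, and the remaining gaps are filled with the median.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw data; an in-memory CSV stands in for a real file.
raw = io.StringIO(
    "id,colour,size,price,sold\n"
    "1,red,1.0,10.0,1\n"
    "2,blue,,12.0,0\n"
    "3,red,3.0,9.0,1\n"
    "4,green,2.0,11.0,\n"
    "5,blue,4.0,15.0,0\n"
    "6,green,2.0,8.0,1\n"
)

df = pd.read_csv(raw)                         # 1. load data in pandas
df = df.drop(columns=["id"])                  # 2. drop columns that aren't useful
df = df.dropna(subset=["sold"])               # 3. drop rows with a missing target
df = pd.get_dummies(df, columns=["colour"])   # 4. create dummy variables
df["size"] = df["size"].fillna(df["size"].median())  # 5. fill remaining gaps

# 6. convert the data frame to NumPy arrays
X = df.drop(columns=["sold"]).to_numpy(dtype=float)
y = df["sold"].astype(int).to_numpy()

# 7. divide the data set into training data and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same data.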
