
Unit 2

Uploaded by

rk73462002


UNIT 2 :

DATA PREPROCESSING, ANALYSIS AND VISUALIZATION

DATA SCIENCE AND

MACHINE LEARNING
AMAL YADAV
MPIT AMROHA
DATA PREPROCESSING
Data Preprocessing in ML

➢ Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
➢ It is the first and a crucial step in creating a machine learning model.
➢ Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.
➢ Preprocessing also increases the accuracy and efficiency of a machine learning model.
Data Preprocessing Techniques

1) Data Cleaning:
(a) Missing Data:
➢ Dropping rows/columns: drop rows or columns that contain NaN values.
➢ Checking for duplicates: keep only the first instance.
➢ Estimating missing values: fill with the feature's mean, median, or mode.
(b) Noisy Data:
➢ Binning Method: sort the data and divide it into equal-size bins; each value can then be replaced by its bin mean or the nearest bin boundary.
➢ Clustering: related data points are grouped into clusters. Outliers fall outside the clusters, although some may go unnoticed.
➢ Regression: fitting the data to a regression function smooths it out.
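The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the method used in the course notebook: the toy records, the use of None to mark a missing value, and all function names are assumptions made for the example. In practice a data-frame library would usually handle these steps.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 25, "income": 30000},   # exact duplicate of the first row
    {"age": 40, "income": None},
    {"age": 31, "income": 58000},
]

def drop_missing(records):
    """Keep only the rows that have no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_duplicates(records):
    """Keep the first instance of each distinct row."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def impute_mean(records, column):
    """Replace missing values in one column with that column's mean."""
    fill = mean(r[column] for r in records if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in records]

def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-size bins, and replace
    each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed

complete_rows = drop_missing(rows)
unique_rows = drop_duplicates(rows)
imputed_rows = impute_mean(rows, "age")
smoothed = smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], bin_size=3)
```

Dropping rows loses information, while imputing keeps every row at the cost of introducing estimated values, so the choice depends on how much data is missing.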
Data Preprocessing Techniques

2) Data Transformation: This stage converts the data into a format that can be used in the data analysis process.
a) Normalization:
➢ Scales the data values into a specified range (for example, -1.0 to 1.0 or 0.0 to 1.0).
b) Concept Hierarchy Generation:
➢ Attributes are converted from a lower level to a higher level in the hierarchy.
➢ For example, the attribute "city" can be converted to "country".
c) Smoothing:
➢ Techniques include binning, clustering, and regression.
d) Aggregation:
➢ The process of applying summary or aggregation operations to the data.
➢ Daily sales data, for example, might be combined to calculate monthly and annual totals.
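Normalization and aggregation can be illustrated with a short sketch. Min-max scaling is assumed here as the normalization formula (the slides name only the target ranges), and the sales figures, date format, and function names are hypothetical.

```python
from collections import defaultdict

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values into the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to `low`.
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

def monthly_totals(daily_sales):
    """Sum (date, amount) pairs by month; dates are 'YYYY-MM-DD' strings."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount   # 'YYYY-MM' prefix identifies the month
    return dict(totals)

scaled = min_max_scale([10, 20, 30, 50])
scaled_signed = min_max_scale([10, 20, 30, 50], low=-1.0, high=1.0)

# Hypothetical daily sales records.
sales = [("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 75.0)]
by_month = monthly_totals(sales)
```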
Data Preprocessing Techniques

3) Data Integration:
➢ Data integration combines data from multiple sources into a coherent data store.
➢ These sources may include multiple databases.

4) Data Reduction:
a) Dimensionality Reduction:
➢ A dataset can have hundreds of features, also known as dimensions; here the number of features is minimized.
b) Numerosity Reduction:
➢ Data is replaced or estimated using alternative, smaller data representations.
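As a rough sketch of integration and dimensionality reduction: the example below joins two hypothetical sources on a shared key, then drops feature columns that never vary. Removing zero-variance features is only one simple form of dimensionality reduction, chosen here for brevity; the slides do not prescribe a particular technique, and all names and values are assumptions.

```python
from statistics import pvariance

# Two hypothetical sources keyed by customer id.
crm = {1: {"name": "Asha"}, 2: {"name": "Ravi"}, 3: {"name": "Meera"}}
billing = {1: {"spend": 120.0}, 2: {"spend": 80.0}}

def integrate(left, right):
    """Combine records that share a key in both sources (an inner join)."""
    return {k: {**left[k], **right[k]} for k in left if k in right}

def drop_low_variance(rows, threshold=0.0):
    """Keep only the feature columns whose variance exceeds the threshold.
    Returns the reduced rows and the indices of the kept columns."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

merged = integrate(crm, billing)

# The first feature is constant, so it carries no information.
features = [[1.0, 5.0, 0.2], [1.0, 7.0, 0.4], [1.0, 9.0, 0.9]]
reduced, kept = drop_low_variance(features)
```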
Preprocessing Techniques in ML

• Mean removal:
➢ Subtracts the mean from each feature so that it is centered on zero. Mean removal helps remove bias from the features.
• Scaling:
➢ The values of different features in a data point can span very different ranges, so each feature is scaled to fit a specified range.
• Normalization:
➢ Adjusts the values in the feature vector so that they are measured on a common scale.
• Binarization:
➢ Converts a numerical feature vector into a Boolean vector.
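A minimal sketch of three of these techniques follows. The L1 norm is assumed as the normalization rule and the binarization threshold is arbitrary; both are illustrative choices, not taken from the slides.

```python
from statistics import mean

def remove_mean(column):
    """Center a feature column on zero by subtracting its mean."""
    m = mean(column)
    return [v - m for v in column]

def l1_normalize(vector):
    """Divide a feature vector by the sum of its absolute values,
    so the absolute values of the result sum to 1.
    Assumes the vector is not all zeros."""
    total = sum(abs(v) for v in vector)
    return [v / total for v in vector]

def binarize(vector, threshold):
    """Map each value to True if it exceeds the threshold, else False."""
    return [v > threshold for v in vector]

centered = remove_mean([2.0, 4.0, 6.0, 8.0])
normalized = l1_normalize([1.0, -1.0, 2.0])
flags = binarize([0.5, 1.5, 3.0], threshold=1.4)
```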
Preprocessing Techniques in ML

• One Hot Encoding:

➢ It may be necessary to deal with numerical values that are few and scattered, where the values themselves need not be stored. In such situations the One Hot Encoding technique can be used.
➢ If the number of distinct values is k, it transforms the feature into a k-dimensional vector in which only one value is 1 and all other values are 0.
• Label Encoding:
➢ Label encoding changes word labels into numbers so that the algorithms can work with them.
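Both encodings can be sketched in a few lines. Assigning integer codes in sorted label order is an assumption made so that the output is deterministic; the color labels are hypothetical.

```python
def label_encode(labels):
    """Map each distinct label to an integer, assigned in sorted order.
    Returns the encoded list and the label-to-code mapping."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

def one_hot_encode(labels):
    """Turn each label into a k-dimensional vector with a single 1,
    where k is the number of distinct labels."""
    codes, mapping = label_encode(labels)
    k = len(mapping)
    return [[1 if i == code else 0 for i in range(k)] for code in codes]

colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)
vectors = one_hot_encode(colors)
```

Label encoding implies an ordering (blue < green < red here) that may not exist in the data, which is one reason one-hot vectors are often preferred for unordered categories.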
Data Analysis

1) Loading the dataset

2) Summarizing the dataset

➢ See Jupyter Notebook for example
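The referenced notebook is not part of these slides, so the following is only a stand-in sketch of the two steps. The CSV content, column names, and function names are all hypothetical, and the data is read from an in-memory string rather than a file.

```python
import csv
import io
from statistics import mean

# Hypothetical CSV content; in practice this would come from a file on disk.
raw = """height,weight,label
150,50,a
160,60,b
170,70,a
180,80,b
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def summarize(rows, column):
    """Return the count, minimum, maximum, and mean of a numeric column."""
    values = [float(r[column]) for r in rows]
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}

rows = load_rows(raw)
summary = summarize(rows, "height")
```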

Data Visualization

1) Univariate Plots

2) Multivariate Plots
Data Visualization

Univariate Plots
• Univariate analysis explores each variable separately.
a) Histogram b) Bar Chart c) Pie Chart

Multivariate Plots
• Multivariate analysis is required when more than two variables have to be analyzed simultaneously.
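To make the histogram listed under univariate plots concrete, the sketch below counts how many values of a single variable fall into each bin and draws the result as text. A plotting library would normally draw the chart; the ages and the bin width are hypothetical.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of the given width.
    Keys are the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def render(counts, bin_width):
    """Draw one row of '#' marks per bin."""
    return "\n".join(
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * n}"
        for start, n in counts.items()
    )

ages = [21, 25, 29, 31, 34, 35, 38, 42, 47, 55]
counts = histogram_counts(ages, bin_width=10)
chart = render(counts, bin_width=10)
print(chart)
```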
Training vs Test Data

➢ Training data is the subset of the original data used to train the machine learning model, whereas testing data is used to check the accuracy of the model.
➢ The training dataset is generally larger than the testing dataset. Common train:test split ratios are 80:20, 70:30, and 90:10.
➢ Training data is well known to the model because the model is trained on it, whereas testing data is unseen, new data to the model.
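An 80:20 split can be sketched as follows. Shuffling before splitting and the fixed seed are assumptions added for the example, so that the split is random but reproducible; the function name is hypothetical.

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and split off a test portion.
    Returns (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data, test_ratio=0.2)
```

Every sample lands in exactly one of the two sets, so the model never sees a test sample during training.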
Training vs Test Data

Features | Training Data | Testing Data
Purpose | The machine-learning model is trained using training data. More representative training data generally lets a model make more accurate predictions. | Testing data is used to evaluate the model's performance.
Exposure | By using the training data, the model can gain knowledge and become more accurate in its predictions. | Until evaluation, the testing data is not exposed to the model. This ensures that the model cannot learn the testing data by heart and produce deceptively flawless forecasts.
Distribution | The training data distribution should be similar to the distribution of the actual data that the model will use. | The testing data distribution should likewise resemble the real-world data; otherwise the evaluation says little about real performance.
Use | Training data is used to fit the model. Training data alone cannot reveal overfitting. | By making predictions on the testing data and comparing them to the actual labels, the performance of the model is assessed. A large gap between training and testing performance indicates overfitting.
Size | Typically larger | Typically smaller
Attributes and its Types in Data Analytics
Performance Measures

➢ Performance metrics in machine learning are used to evaluate the performance of a machine learning model.
➢ To evaluate the performance of a classification model:
1) Accuracy
2) Confusion Matrix
3) Precision
4) Recall
5) F-Score
6) AUC-ROC (Area Under the ROC Curve)
Performance Measures

1) Accuracy
• The ratio of the number of correct predictions to the total number of predictions.

2) Precision
• The number of true positive instances divided by the sum of true positive and false positive instances.

3) Recall or Sensitivity
• The number of true positive instances divided by the sum of true positive and false negative instances.
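These three definitions translate directly into code. In the sketch below, tp, tn, fp, and fn stand for the counts of true positives, true negatives, false positives, and false negatives; the counts themselves are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """True positives divided by everything predicted positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positives divided by everything actually positive."""
    return tp / (tp + fn)

# Hypothetical counts from 100 predictions.
tp, tn, fp, fn = 40, 45, 5, 10
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```

Precision and recall can differ sharply on the same model: here 5 false positives lower precision, while 10 false negatives lower recall.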
Performance Measures

4) F1 Score:
• The F1 score is the harmonic mean of precision and recall. It is a balanced measure that takes both precision and recall into account.

5) ROC AUC Score:
• The ROC AUC (Receiver Operating Characteristic, Area Under the Curve) score measures the ability of a classifier to distinguish between positive and negative instances.

6) Confusion Matrix:
• A confusion matrix is a table that is used to evaluate the performance of a classification model.

1. True Positive (TP): the model predicts the positive class, and the actual class is positive.
2. True Negative (TN): the model predicts the negative class, and the actual class is negative.
3. False Positive (FP): the model predicts the positive class, but the actual class is negative.
4. False Negative (FN): the model predicts the negative class, but the actual class is positive.
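The four confusion-matrix cells, the F1 score, and ROC AUC can be computed from labels and scores as sketched below. The labels, predictions, and scores are hypothetical, with 1 as the positive class. The AUC function uses the pairwise-ranking formulation (the fraction of positive-negative pairs in which the positive instance gets the higher score), which equals the area under the ROC curve; it is chosen here for brevity, not because the slides specify it.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive
    instance has the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ground truth, hard predictions, and predicted scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
auc = roc_auc(y_true, scores)
```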
THANK YOU
