0% found this document useful (0 votes)

47 views19 pages

JAVA Advanced 3

Data pre-processing involves cleaning, transforming, and reducing raw data to prepare it for analysis. This includes [1] cleaning data by handling missing values, smoothing noisy data, and resolving inconsistencies, [2] reducing data through techniques like dimensionality reduction and data compression, and [3] transforming data through normalization, discretization, and other techniques to put it in a format suitable for mining algorithms. Pre-processing improves data quality and prepares it for accurate analysis.

Uploaded by

Lucky Mahanto

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views19 pages

JAVA Advanced 3

Uploaded by

Lucky Mahanto

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 19

Data Pre-Processing

What is Data
What is data?
Categorical (Qualitative)
• Nominal scales – number is just a symbol that identifies a
quality
– 0=male, 1=female
– 1=green, 2=blue, 3=red, 4=white
• Ordinal – rank order

Quantitative (continuous and discrete)

• Interval – units are of identical size (i.e. Years)
• Ratio – distance from an absolute zero (i.e. Age, reaction
time)
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view

– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustable the data are correct?
– Interpretability: how easily the data can be understood?

4
Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
5
A. Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument
faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

6
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– Use mean, median or mode
– the most probable value: inference-based such as Bayesian
formula or decision tree

7
How to Handle Noisy Data?

• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)

8
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheels)

9
B. Data Integration

• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id  B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric vs.
British units
10
10
Handling Redundancy in Data Integration

• Redundant data occur often when integration of multiple

databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
11
11
C. Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
12
D. Data Transformation
• A function that maps the entire set of values of a given attribute to a new set
of replacement values s.t. each old value can be identified with one of the new
values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
13
Some examples of algorithms where feature
scaling matters are:
• K-nearest neighbors (KNN) with a Euclidean distance
measure is sensitive to magnitudes and hence should be
scaled for all features to weigh in equally.
• K-Means uses the Euclidean distance measure here feature
scaling matters.
• Scaling is critical while performing Principal Component
Analysis(PCA). PCA tries to get the features with maximum
variance, and the variance is high for high magnitude features
and skews the PCA towards high magnitude features.
• We can speed up gradient descent by scaling because θ
descends quickly on small ranges and slowly on large ranges,
and oscillates inefficiently down to the optimum when the
variables are very uneven.
How to perform feature scaling?
Below are the few ways we can do feature scaling.
1) Min Max Scaler
2) Standard Scaler
3) Max Abs Scaler
4) Robust Scaler
5) Quantile Transformer Scaler
6) Power Transformer Scaler
7) Unit Vector Scaler

Ref: https://
towardsdatascience.com/all-about-feature-scaling-bcc0ad75c
b35
Normalization
• Min-max Scaler:

• Min-max normalization: to [new_minA, new_maxA]

v  minA
v'  (new _ maxA  new _ minA)  new _ minA
maxA  minA

– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].

Then $73,000 is mapped to
73,600  12,000
(1.0  0)  0  0.716
98,000  12,000 16
Normalization
• Z-score normalization / Standard Scaler (μ: mean, σ: standard deviation):
v  A
v' 
 A

73,600  54,000
– Ex. Let μ = 54,000, σ = 16,000. Then  1.225
16,000

• Normalization by decimal scaling

v
v ' j Where j is the smallest integer such that Max(|ν’|) < 1
10

17
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification

18
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-
up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., 2) analysis (unsupervised, bottom-up
merge)

Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
30 pages
Week 2
No ratings yet
Week 2
96 pages
Unit 3
No ratings yet
Unit 3
164 pages
DM_merged
No ratings yet
DM_merged
169 pages
Applications of Statistical Software For Data Analysis
80% (5)
Applications of Statistical Software For Data Analysis
5 pages
3. Data Preprocessing
No ratings yet
3. Data Preprocessing
120 pages
Data Pre Processing
No ratings yet
Data Pre Processing
63 pages
Unit 3.2
No ratings yet
Unit 3.2
45 pages
Chapter 3 - Tagged
No ratings yet
Chapter 3 - Tagged
63 pages
Data Preprocessing
No ratings yet
Data Preprocessing
63 pages
03Preprocessing
No ratings yet
03Preprocessing
59 pages
VIPDMTheoryChapter3
No ratings yet
VIPDMTheoryChapter3
87 pages
Module 5 03preprocessing
No ratings yet
Module 5 03preprocessing
63 pages
Wk6 Preprocessing
No ratings yet
Wk6 Preprocessing
64 pages
Data Proprocesing
No ratings yet
Data Proprocesing
18 pages
Pre Processing
No ratings yet
Pre Processing
68 pages
03Preprocessing
No ratings yet
03Preprocessing
65 pages
253777
No ratings yet
253777
66 pages
Unit2 Part2
No ratings yet
Unit2 Part2
67 pages
BIS 541 Ch03 20-21 S
No ratings yet
BIS 541 Ch03 20-21 S
86 pages
Data Science unit I(LN and QB)
No ratings yet
Data Science unit I(LN and QB)
44 pages
DWDM-LS3-Fall-24-25
No ratings yet
DWDM-LS3-Fall-24-25
50 pages
Data Preprocessing
No ratings yet
Data Preprocessing
48 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
62 pages
DM Data transformation techniques
No ratings yet
DM Data transformation techniques
25 pages
Data Mining and Knowledge Discovery
No ratings yet
Data Mining and Knowledge Discovery
65 pages
Mod1 DM Part2
No ratings yet
Mod1 DM Part2
34 pages
Lec7
No ratings yet
Lec7
45 pages
Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Machine Learning
No ratings yet
Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Machine Learning
35 pages
3-Preprocessing
No ratings yet
3-Preprocessing
27 pages
Preprocessing
No ratings yet
Preprocessing
52 pages
Preprocessing Techniques
No ratings yet
Preprocessing Techniques
63 pages
Unit I
No ratings yet
Unit I
57 pages
CH1-data Preprocessing
No ratings yet
CH1-data Preprocessing
49 pages
HIT391-week 3-New
No ratings yet
HIT391-week 3-New
43 pages
Preprocessing
No ratings yet
Preprocessing
50 pages
Data Preprocessing (Sagar)
No ratings yet
Data Preprocessing (Sagar)
31 pages
Data_Preprocessing-1-19
No ratings yet
Data_Preprocessing-1-19
19 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
ICS 2408 - Lecture 2 - Data Preprocessing
No ratings yet
ICS 2408 - Lecture 2 - Data Preprocessing
29 pages
Data Mining: Concepts and Techniques
No ratings yet
Data Mining: Concepts and Techniques
50 pages
Data Mining
No ratings yet
Data Mining
40 pages
Unit 2 Data Preprocessing
No ratings yet
Unit 2 Data Preprocessing
40 pages
Normalization
No ratings yet
Normalization
35 pages
Forest Management
No ratings yet
Forest Management
15 pages
WINSEM2023-24 - BECE352E - ETH - VL2023240504409 - 2024-02-03 - Reference-Material-I 2
No ratings yet
WINSEM2023-24 - BECE352E - ETH - VL2023240504409 - 2024-02-03 - Reference-Material-I 2
16 pages
AI351 Lecture 1
No ratings yet
AI351 Lecture 1
32 pages
Data Preprocessing
No ratings yet
Data Preprocessing
22 pages
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
No ratings yet
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
40 pages
3 Data Preprocessing
No ratings yet
3 Data Preprocessing
33 pages
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
No ratings yet
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
21 pages
03 Preprocessing
No ratings yet
03 Preprocessing
18 pages
Data Preparation
No ratings yet
Data Preparation
21 pages
Data Cleaning: Missing Values: - For Example in Attribute Income If
No ratings yet
Data Cleaning: Missing Values: - For Example in Attribute Income If
30 pages
Correlation
No ratings yet
Correlation
14 pages
ACC Letter - by - 159 - Cube Test Result
No ratings yet
ACC Letter - by - 159 - Cube Test Result
9 pages
633777800398832500ata Minig Presentation
No ratings yet
633777800398832500ata Minig Presentation
20 pages
DWDM 3
No ratings yet
DWDM 3
12 pages
Spatial and Temporal Data Mining
No ratings yet
Spatial and Temporal Data Mining
52 pages
Data Pre-Processing: Submitted By, R.Archana, 10ucs05 D.Gayathri, 10ucs11
No ratings yet
Data Pre-Processing: Submitted By, R.Archana, 10ucs05 D.Gayathri, 10ucs11
18 pages
(eBook PDF) Data Mining for Business Analytics: Concepts, Techniques, and Applications in R - Download the ebook now for an unlimited reading experience
100% (1)
(eBook PDF) Data Mining for Business Analytics: Concepts, Techniques, and Applications in R - Download the ebook now for an unlimited reading experience
57 pages
Estadistica (Solucionario) - Devore
No ratings yet
Estadistica (Solucionario) - Devore
483 pages
1730985268603_Sampling Theory - Course Guide Book
No ratings yet
1730985268603_Sampling Theory - Course Guide Book
3 pages
CH 09
No ratings yet
CH 09
172 pages
QTT PDF
No ratings yet
QTT PDF
43 pages
(Thomas A. McMahon, Russel G. Mein) Reservoir Capa (BookFi)
No ratings yet
(Thomas A. McMahon, Russel G. Mein) Reservoir Capa (BookFi)
227 pages
08 Frequentist
No ratings yet
08 Frequentist
19 pages
Imran Muneeb Zayan ISL CH4 (8-21-2024 1445)
No ratings yet
Imran Muneeb Zayan ISL CH4 (8-21-2024 1445)
19 pages
Sandeep Garg Economics Class 11 Solutions For Chapter 8 - Measures of Correlation
0% (1)
Sandeep Garg Economics Class 11 Solutions For Chapter 8 - Measures of Correlation
5 pages
Camarao FamilyStructures
No ratings yet
Camarao FamilyStructures
11 pages
Vilar (2020)
No ratings yet
Vilar (2020)
52 pages
FIA117V 3 EXTRA QUESTIONS
No ratings yet
FIA117V 3 EXTRA QUESTIONS
6 pages
IOT Domain
No ratings yet
IOT Domain
70 pages
Module 3
No ratings yet
Module 3
35 pages
Question 11 and 12 For Assignment
0% (1)
Question 11 and 12 For Assignment
58 pages
Data Science Formula - Very Imp
No ratings yet
Data Science Formula - Very Imp
6 pages
AP Stats Exercise
No ratings yet
AP Stats Exercise
4 pages
A Comparison of Time Series and Machine Learning Models For Inflation Forecasting Empirical Evidence From The USA
No ratings yet
A Comparison of Time Series and Machine Learning Models For Inflation Forecasting Empirical Evidence From The USA
9 pages
Applied Statistics MCQ
0% (2)
Applied Statistics MCQ
7 pages
JAVA Advanced 1
No ratings yet
JAVA Advanced 1
17 pages
JAVA Advanced
No ratings yet
JAVA Advanced
18 pages
Cox Proportional-Hazards Regression For Survival Data: John Fox 15 June 2008 (Small Corrections)
No ratings yet
Cox Proportional-Hazards Regression For Survival Data: John Fox 15 June 2008 (Small Corrections)
18 pages
Assignment 3
No ratings yet
Assignment 3
1 page
ANOVA 4 Regression
No ratings yet
ANOVA 4 Regression
2 pages
QMT 11 Syllabus
No ratings yet
QMT 11 Syllabus
5 pages
RIVERA ECE11 Enabling Assessment 3
No ratings yet
RIVERA ECE11 Enabling Assessment 3
4 pages
Formula Sheet
100% (1)
Formula Sheet
2 pages
Previous Exam Exercises On Classification: Exercise 4 2012: Classification With 2 Features
No ratings yet
Previous Exam Exercises On Classification: Exercise 4 2012: Classification With 2 Features
9 pages
Quiz Ch7 Statistics Questions and Answers
No ratings yet
Quiz Ch7 Statistics Questions and Answers
6 pages
Analisis Deskriptif A. Pretest B.: Descriptive Statistics
No ratings yet
Analisis Deskriptif A. Pretest B.: Descriptive Statistics
2 pages
Nota Statistik Asas To Build A Frequency Table
No ratings yet
Nota Statistik Asas To Build A Frequency Table
2 pages
SQL Mastery: From Novice Queries to Advanced Database Wizardry
From Everand
SQL Mastery: From Novice Queries to Advanced Database Wizardry
Scott Markham
No ratings yet
Introduction to Robotics
From Everand
Introduction to Robotics
Swarnalata Verma
No ratings yet
Illuminating Data: A hands on guide to data visualization in R
From Everand
Illuminating Data: A hands on guide to data visualization in R
Eman Ahmad
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

JAVA Advanced 3

Uploaded by

JAVA Advanced 3

Uploaded by

Data Pre-Processing

Quantitative (continuous and discrete)

• Measures for data quality: A multidimensional view

• Redundant data occur often when integration of multiple

• Min-max normalization: to [new_minA, new_maxA]

– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].

• Normalization by decimal scaling

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.