Unit 2
1) Data Cleaning:
(a) Missing Data:
Dropping rows/columns: drop rows or columns that contain NaN values
Checking for duplicates: keep only the first instance of each duplicated record
Estimating missing values: fill gaps with the feature’s mean, median, or mode
(b) Noisy Data:
Binning Method: sort the data and divide it into equal-size bins; each value
is then replaced by its bin mean or by the nearest bin boundary
Clustering: related data points are grouped into clusters. Outliers show up as
values that fall outside the clusters; otherwise they may go unnoticed.
Regression: by fitting the data to a regression function, the data can be
smoothed out.
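The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column layout, and all function names (drop_missing, drop_duplicates, impute_mean, equal_size_bins, smooth_by_means, smooth_by_boundaries) are hypothetical and not taken from any library. In practice a data-frame library would normally do this work.

```python
import math
from statistics import mean

# Hypothetical toy records (age, income); NaN marks a missing value.
rows = [(25.0, 50000.0), (32.0, float("nan")), (25.0, 50000.0), (41.0, 72000.0)]

def drop_missing(rows):
    """Keep only the rows in which no field is NaN."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]

def drop_duplicates(rows):
    """Keep the first instance of each row, preserving the original order."""
    seen, kept = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            kept.append(r)
    return kept

def impute_mean(rows, col):
    """Replace NaN in column `col` with the mean of the observed values."""
    fill = mean(r[col] for r in rows if not math.isnan(r[col]))
    return [
        tuple(fill if i == col and math.isnan(v) else v for i, v in enumerate(r))
        for r in rows
    ]

def equal_size_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum.

    Ties go to the lower boundary (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so the ends are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical noisy data
bins = equal_size_bins(prices, 4)
```

Calling smooth_by_means(bins) or smooth_by_boundaries(bins) then gives the two smoothed versions of the price list. Whether to drop or to impute missing values depends on how much data can be spared.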
Data Preprocessing Techniques
2) Data Transformation: This stage is used to convert the data into a format
that can be used in the data analysis process.
a) Normalization:
Scales the data values into a specified range (for example -1.0 to 1.0 or 0.0 to 1.0).
b) Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be converted to “country”.
c) Smoothing:
Removes noise from the data; techniques include binning, clustering, and regression.
d) Aggregation:
The process of applying summary or aggregation operations to data.
Daily sales data, for example, might be combined to calculate monthly and
annual totals.
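Three of the transformations above (normalization, concept hierarchy generation, and aggregation) can be illustrated with a minimal sketch. Min-max scaling is assumed as the normalization method, since the notes only give the target range. The city table, the sales records, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant feature: nothing to spread
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Hypothetical hierarchy table: lower level (city) -> higher level (country).
CITY_TO_COUNTRY = {"Mumbai": "India", "Pune": "India", "Paris": "France"}

def generalize(cities, hierarchy, unknown="Unknown"):
    """Replace each city with its country; unmapped cities become `unknown`."""
    return [hierarchy.get(city, unknown) for city in cities]

# Hypothetical daily sales records: (day, amount).
daily_sales = [
    (date(2023, 1, 5), 120.0),
    (date(2023, 1, 20), 80.0),
    (date(2023, 2, 3), 200.0),
    (date(2024, 1, 9), 50.0),
]

def monthly_totals(records):
    """Sum the amounts per (year, month)."""
    totals = defaultdict(float)
    for day, amount in records:
        totals[(day.year, day.month)] += amount
    return dict(totals)

def annual_totals(records):
    """Sum the amounts per year."""
    totals = defaultdict(float)
    for day, amount in records:
        totals[day.year] += amount
    return dict(totals)
```

For instance, min_max_scale([10, 20, 30], -1.0, 1.0) maps the smallest value to -1.0 and the largest to 1.0, and generalize(["Pune", "Paris"], CITY_TO_COUNTRY) moves both values one level up the hierarchy.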
b) Numerosity Reduction:
The data is replaced or estimated by an alternative, smaller data representation.
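One common smaller representation is a histogram: instead of storing every value, only a count per bucket is kept. The sketch below assumes equal-width buckets; the function name equal_width_histogram and the sample ages are hypothetical.

```python
from collections import Counter

def equal_width_histogram(values, bucket_width):
    """Count values per bucket, keyed by the lower edge of each bucket."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))

ages = [21, 22, 25, 31, 33, 34, 38, 45, 47, 52]  # hypothetical raw data
hist = equal_width_histogram(ages, 10)
```

Here ten raw values are reduced to four (bucket, count) pairs, at the cost of losing the exact values inside each bucket.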
Preprocessing Techniques in ML
• Mean removal:
It involves removing the mean from each feature so that it is centered on
zero. Mean removal helps in removing any bias from the features.
• Scaling:
The values of each feature in a data point can vary over arbitrary ranges,
so it is important to scale them so that they fall within a specified
range.
• Normalization:
Normalization involves adjusting the values in the feature vector so as to
measure them on a common scale.
• Binarization:
Binarization is used to convert a numerical feature vector into a Boolean
vector.
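The four techniques listed above can be sketched on plain Python lists. Several choices here are assumptions, because the notes do not fix them: scaling uses a min-max mapping to a target range, normalization uses the L1 norm (absolute values sum to 1), and binarization compares against a caller-supplied threshold. All function names are hypothetical.

```python
from statistics import mean

def remove_mean(column):
    """Center a feature column on zero by subtracting its mean."""
    mu = mean(column)
    return [v - mu for v in column]

def scale_to_range(column, lo=0.0, hi=1.0):
    """Rescale a feature column linearly so that it spans [lo, hi]."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        return [lo for _ in column]  # constant feature
    return [lo + (v - cmin) * (hi - lo) / (cmax - cmin) for v in column]

def normalize_l1(vector):
    """Divide a feature vector by its L1 norm so the absolute values sum to 1."""
    total = sum(abs(v) for v in vector)
    if total == 0:
        return list(vector)  # all-zero vector: leave unchanged
    return [v / total for v in vector]

def binarize(vector, threshold):
    """Turn a numerical vector into a Boolean one: True where value > threshold."""
    return [v > threshold for v in vector]
```

Mean removal and scaling work per feature (a column across data points), while normalization and binarization are shown here per feature vector (one data point).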
1) Univariate Plots
2) Multivariate Plots
Data Visualization