Session 2 - Data Pre-Processing
DATA PREPROCESSING
Preprocessing refers to the transformations applied to data before it is fed to an algorithm.
Data preprocessing converts raw data into a clean data set. Data gathered from different sources
arrives in a raw format that is not suitable for analysis.
NEED FOR DATA PREPROCESSING
DATA PREPROCESSING TECHNIQUES
Data cleaning: Identifying and handling missing or erroneous data.
Data transformation: Converting data from one format or structure to another. It can help to improve
the quality of data and make it more suitable for analysis or modeling.
Data selection: Selecting a subset of data from a larger dataset based on certain criteria. It can help
to reduce the size of the dataset and focus on relevant data for analysis or modeling.
DATA CLEANING
Removing duplicates:
Duplicates can skew the results of data analysis or machine learning models. Removing duplicates
can improve the accuracy of results and reduce the risk of errors.
Handling missing data:
Missing data can be handled using various techniques, such as deleting the rows or columns that
contain missing values, or imputing the missing values with a statistic such as the mean or median.
Handling outliers:
Outliers can be treated like missing data. For example, an outlier can be replaced with a missing
value and then handled with the missing-data techniques above.
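A minimal sketch of this idea in Python: flag outliers, replace them with a missing value, then impute. The 1.5 x IQR rule used to flag outliers, the function names, and the sample ages are assumptions chosen for illustration; the slides do not prescribe a specific detection rule.

```python
import statistics

def outliers_to_missing(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with None (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if low <= v <= high else None for v in values]

def impute_median(values):
    """Fill None entries with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Hypothetical sample: 250 is an obvious data-entry error.
ages = [22, 25, 27, 24, 26, 23, 250]
with_missing = outliers_to_missing(ages)
cleaned = impute_median(with_missing)
print(with_missing)
print(cleaned)
```

Whether to impute or simply drop the flagged rows depends on how much data can be spared and on why the values are extreme.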
DATA CLEANING
Examples
Noise and outliers
Missing values
Duplicate data
DATA CLEANING
Smoothing by bin mean: In this method, each value in the bin is replaced by the mean
value of the bin.
Smoothing by bin median: In this method, each value in the bin is replaced by the median
value of the bin.
Smoothing by bin boundary: In this method, the minimum and maximum values of the bin
are taken as the bin boundaries, and each value is replaced by the closest boundary value.
DATA CLEANING
Handling Noisy Data
Binning:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
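The worked example above can be reproduced with a short Python sketch. It also applies the median and boundary smoothing methods to the same bins. The function names are illustrative, and sending ties to the lower boundary is an assumption (no value in this data set is tied).

```python
import statistics

def equal_depth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_mean(bins):
    # Every value becomes the mean of its bin, rounded to the nearest integer.
    return [[round(statistics.mean(b))] * len(b) for b in bins]

def smooth_by_median(bins):
    # Every value becomes the median of its bin.
    return [[statistics.median(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    # Every value moves to the closer of the bin's minimum and maximum.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 4)

print(smooth_by_mean(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_median(bins))    # [[8.5, 8.5, 8.5, 8.5], [22.5, 22.5, 22.5, 22.5], [28.5, 28.5, 28.5, 28.5]]
print(smooth_by_boundary(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```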
DATA TRANSFORMATION
Standardizing or normalizing data:
Standardizing or normalizing data involves scaling the data to a common scale. This is important for
some machine learning algorithms that are sensitive to the scale of features.
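As a rough illustration, two common ways of bringing a feature to a common scale are min-max normalization (rescaling to the range 0 to 1) and z-score standardization (zero mean, unit standard deviation). The function names and the income values below are hypothetical, and the sketch assumes the feature is not constant (otherwise the denominators would be zero).

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

incomes = [20000, 35000, 50000, 80000]  # hypothetical feature values
normalized = min_max_normalize(incomes)
standardized = z_score_standardize(incomes)

print(normalized)  # [0.0, 0.25, 0.5, 1.0]
print(standardized)
```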
DATA TRANSFORMATION
Categorical data can be encoded into numerical data using techniques such as one-hot encoding or
label encoding. This is important for some machine learning algorithms that cannot handle
categorical data.
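The two encodings mentioned above can be sketched in plain Python as follows. The helper names and the color values are illustrative assumptions; categories are ordered alphabetically here only to make the output deterministic.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 indicator row per value, with one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature

codes, mapping = label_encode(colors)
rows, categories = one_hot_encode(colors)

print(codes)       # [2, 1, 0, 1]
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding is compact but implies an ordering among the categories. One-hot encoding avoids that at the cost of one extra column per category.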
DATA TRANSFORMATION
Aggregation:
Aggregation involves combining multiple data points into a single data point. This can be useful for
summarizing data and reducing dimensionality.
Discretization:
Discretization involves converting the values of a continuous attribute into a finite set of intervals with
minimal loss of information, and associating each interval with a specific data value or
conceptual label.
Feature engineering:
Feature engineering involves creating new features from existing features. These techniques help to
highlight the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.
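A small sketch of discretization followed by aggregation: ages are mapped to conceptual labels, then the labelled points are combined into one count per group. The interval edges, the labels, and the sample ages are assumptions made for illustration.

```python
from collections import Counter

def discretize(value, edges, labels):
    """Return the label of the interval [edges[i], edges[i+1]) containing value."""
    for left, right, label in zip(edges, edges[1:], labels):
        if left <= value < right:
            return label
    raise ValueError(f"{value} falls outside the defined intervals")

ages = [5, 17, 34, 70]               # hypothetical continuous attribute
edges = [0, 18, 65, 120]             # assumed interval boundaries
labels = ["child", "adult", "senior"]

# Discretization: continuous value -> conceptual label
age_groups = [discretize(a, edges, labels) for a in ages]
print(age_groups)  # ['child', 'child', 'adult', 'senior']

# Aggregation: many data points -> one count per group
counts = Counter(age_groups)
print(counts["child"], counts["adult"], counts["senior"])  # 2 1 1
```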
DATA SELECTION
Random sampling:
Random sampling involves selecting a random subset
of data from the larger dataset. This is useful when
the dataset is too large to process as a whole and a
representative sample is needed.
Stratified sampling:
Stratified sampling involves dividing the dataset into
subgroups based on a specific variable and then
selecting a random sample from each subgroup. This
is useful when the variable is important for analysis or
modeling.
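The difference between the two sampling schemes can be sketched with the standard library. The function names, the fixed seed, and the spam/ham data set are hypothetical. Taking the same fraction from each subgroup is one common choice, assumed here.

```python
import random
from collections import defaultdict

def random_sample(rows, k, seed=0):
    """Draw k rows uniformly at random, without replacement."""
    return random.Random(seed).sample(rows, k)

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from every subgroup defined by key(row)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data set: 20 "spam" rows and 80 "ham" rows.
rows = [{"id": i, "label": "spam" if i % 5 == 0 else "ham"} for i in range(100)]

simple = random_sample(rows, 10)
stratified = stratified_sample(rows, key=lambda r: r["label"], fraction=0.1)

# The stratified sample always keeps the 20/80 class proportions.
print(sum(r["label"] == "spam" for r in stratified))  # 2
print(sum(r["label"] == "ham" for r in stratified))   # 8
```

A simple random sample of 10 rows may contain any number of spam rows, whereas the stratified sample preserves the class proportions by construction.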
DATA SELECTION
Feature selection:
Feature selection involves selecting relevant features for analysis or modeling. This is important for
reducing dimensionality and improving the performance of machine learning models.
Chi-square Test:
For categorical features, calculate Chi-square between each
feature and the target and select the desired number of features
with the best Chi-square scores.
Correlation Coefficient:
Good variables correlate highly with the target. Variables should
be correlated with the target but uncorrelated among
themselves.
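Both scores can be computed by hand, as the sketch below suggests. It calculates the Pearson chi-square statistic from a contingency table of a categorical feature against the target, and the Pearson correlation coefficient between a numeric feature and the target. The counts and values are made up for illustration, and the sketch stops at the score itself (no p-value or ranking step).

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical counts: rows = feature value (A, B), columns = target class (0, 1)
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]
chi_dep = chi_square_statistic(dependent)    # 20.0: feature and target are related
chi_ind = chi_square_statistic(independent)  # 0.0: no relationship
print(chi_dep, chi_ind)

# Hypothetical numeric feature that is perfectly linear in the target
r = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
print(round(r, 6))  # 1.0
```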
# IMPORTING LIBRARIES
DATA CLEANING WITH PANDAS
Handling missing data: Pandas provides methods for filling in missing data or dropping
missing data points. For example, the dropna() method drops any rows or columns that
contain missing data, while the fillna() method fills in missing data with a specified
value.
Removing duplicates: Pandas provides a drop_duplicates() method that removes
duplicate rows from a DataFrame.
Correcting errors: Pandas provides methods for replacing or removing incorrect values.
For example, the replace() method can be used to replace specific values with new
values.
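The methods above might be combined as in the following sketch. The DataFrame contents are hypothetical, and filling missing ages with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a duplicate row, a missing age, and a placeholder city.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": [28, 35, 35, np.nan, 41],
    "city": ["NYC", "LA", "LA", "N/A", "Boston"],
})

# Removing duplicates: the second "Bob" row is an exact copy of the first.
deduped = df.drop_duplicates()

# Handling missing data, option 1: drop every row that contains a NaN.
dropped = deduped.dropna()

# Handling missing data, option 2: fill the missing age with the column mean.
filled = deduped.fillna({"age": deduped["age"].mean()})

# Correcting errors: replace the placeholder string in the "city" column.
corrected = filled.replace({"city": {"N/A": "Unknown"}})

print(len(df), len(deduped), len(dropped))  # 5 4 3
print(corrected)
```

Each call returns a new DataFrame and leaves the original untouched, which makes it easy to compare the effect of each step.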
DATA TRANSFORMATION WITH PANDAS
Filtering data: Pandas provides indexers for selecting specific rows or columns based
on criteria such as a specific value, a range of values, or a boolean expression. For
example, the loc[] indexer selects rows and columns by label or boolean mask, while the
iloc[] indexer selects rows and columns by integer position.
Sorting data: Pandas provides a sort_values() method for sorting a DataFrame by the values of one
or more columns, and a sort_index() method for sorting by the index.
Grouping data: Pandas provides a groupby() method for grouping a DataFrame by one
or more variables and performing aggregate operations on each group, such as sum,
mean, and count.
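A brief sketch of filtering, sorting, and grouping on a small DataFrame. The column names, values, and index labels are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data with string index labels.
df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West"],
        "units": [10, 3, 7, 12],
        "price": [2.5, 4.0, 2.5, 4.0],
    },
    index=["a", "b", "c", "d"],
)

# Filtering with loc[]: boolean mask for rows, labels for columns.
big_orders = df.loc[df["units"] > 5, ["region", "units"]]

# Filtering with iloc[]: the first two rows by integer position.
first_two = df.iloc[:2]

# Sorting: largest orders first.
by_units = df.sort_values("units", ascending=False)

# Grouping: aggregate units per region.
summary = df.groupby("region")["units"].agg(["sum", "mean", "count"])

print(list(big_orders.index))  # ['a', 'c', 'd']
print(list(by_units.index))    # ['d', 'a', 'c', 'b']
print(summary)
```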
HOW TO PREPROCESS DATA IN PYTHON STEP-BY-STEP