
Data Preprocessing

• Data preprocessing refers to the steps taken to clean, transform, and organize raw data into a structured format before analysis.

• It ensures that the data is ready for accurate and effective analysis.
1. Data Cleaning:

• Raw data can have many irrelevant and missing parts. Data cleaning is done to handle these.

• It involves handling missing data, noisy data, etc.


(a) Missing Data:
• This situation arises when some values are missing in the dataset. It can be handled in various ways.
• Some of them are:
1. Ignore the tuples:
• This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill the missing values:
• There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
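A minimal pandas sketch of both strategies; the column names and values below are illustrative, not from the slides:

```python
import pandas as pd

# Two strategies from the slides: drop tuples with missing values,
# or fill them with the attribute mean / most probable value.
df = pd.DataFrame({
    "age": [25, None, 31, 47, None],                   # numeric attribute
    "city": ["Pune", "Delhi", None, "Pune", "Delhi"],  # categorical attribute
})

dropped = df.dropna()                          # 1. ignore the tuples

filled = df.copy()                             # 2. fill the missing values
filled["age"] = filled["age"].fillna(filled["age"].mean())    # attribute mean
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most probable
print(dropped, filled, sep="\n\n")
```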
(b) Noisy Data:
• Noisy data is meaningless data that machines cannot interpret. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
Binning is the process of dividing continuous data into smaller, defined ranges or "bins."
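For instance, a sketch that smooths values by the mean of their bin; the prices and the bin count are illustrative:

```python
import pandas as pd

# Binning sketch: split values into three equal-width bins and
# replace each value with its bin mean (smoothing by bin means).
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.cut(prices, bins=3)                     # equal-width bins
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```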
2. Regression:
Here, data can be smoothed by fitting it to a regression function. The regression used may be linear (one independent variable) or multiple (several independent variables).
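A sketch of smoothing with a linear fit in NumPy; the noisy observations are generated for illustration:

```python
import numpy as np

# Smoothing-by-regression sketch: fit a straight line to noisy
# observations and replace them with the fitted values.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=10)  # noisy line

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit
y_smooth = slope * x + intercept            # smoothed values
print(np.round(y_smooth, 2))
```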
3. Clustering:

• This approach groups similar data into clusters. Outliers either fall outside the clusters or may go undetected.
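A k-means sketch with scikit-learn that flags points landing in singleton clusters as potential outliers; the values, the choice of k, and the singleton rule are illustrative, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster 1-D values; a point alone in its own cluster is treated
# as a potential outlier.
values = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [25.0]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
sizes = np.bincount(km.labels_)          # members per cluster
outliers = values[sizes[km.labels_] == 1]
print(outliers)                          # [[25.]]
```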
2. Data Transformation:
This step is taken to transform the data into forms appropriate for the mining process.
It involves the following ways:
1. Normalization:
• It is done to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
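A min-max normalization sketch that rescales illustrative values into the 0.0 to 1.0 range:

```python
import numpy as np

# Min-max normalization: (x - min) / (max - min) maps the values
# onto [0.0, 1.0].
x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
x_norm = (x - x.min()) / (x.max() - x.min())
print(np.round(x_norm, 3))   # [0.    0.125 0.25  0.5   1.   ]
```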
2. Attribute Selection:
• In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
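A sketch of constructing a new attribute from the given ones; the column names are hypothetical:

```python
import pandas as pd

# Attribute construction sketch: derive "area_m2" from the two
# given attributes so the mining step can use it directly.
df = pd.DataFrame({"height_m": [2.0, 3.0], "width_m": [4.0, 5.0]})
df["area_m2"] = df["height_m"] * df["width_m"]   # constructed attribute
print(df)
```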
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval labels or conceptual labels.
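A sketch that replaces raw ages with conceptual labels; the cut points and label names are illustrative:

```python
import pandas as pd

# Discretization sketch: map raw numeric ages onto conceptual labels.
ages = pd.Series([5, 13, 25, 40, 67])
labels = pd.cut(ages, bins=[0, 12, 19, 59, 120],
                labels=["child", "teen", "adult", "senior"])
print(labels.tolist())   # ['child', 'teen', 'adult', 'adult', 'senior']
```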

4. Concept Hierarchy Generation:

Here, attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
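A sketch that generalizes "city" values up to "country" using a hypothetical lookup table:

```python
# Concept hierarchy sketch: climb one level, city -> country
# (the lookup table is illustrative).
city_to_country = {"Pune": "India", "Delhi": "India", "Paris": "France"}
cities = ["Pune", "Paris", "Delhi"]
countries = [city_to_country[c] for c in cities]
print(countries)   # ['India', 'France', 'India']
```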
3. Data Reduction:
• Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows.
• Data reduction is used to deal with this. It aims to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:

1. Data Cube Aggregation:

Aggregation operations are applied to the data to construct a data cube.
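A sketch that aggregates raw sales records into a small year-by-branch cube; the data is illustrative:

```python
import pandas as pd

# Data cube aggregation sketch: roll individual sales up into a
# year x branch cube with summed amounts.
sales = pd.DataFrame({
    "year":   [2023, 2023, 2024, 2024],
    "branch": ["A", "B", "A", "B"],
    "amount": [100, 150, 120, 180],
})
cube = sales.pivot_table(values="amount", index="year",
                         columns="branch", aggfunc="sum")
print(cube)
```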

2. Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be discarded. For attribute selection, one can use a significance level and the p-value of each attribute: an attribute whose p-value is greater than the significance level can be discarded.
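A sketch of p-value based selection with scikit-learn's ANOVA F-test; the dataset and the 0.05 significance level are illustrative choices (on iris, all four attributes pass):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Keep attributes whose p-value is below the significance level.
X, y = load_iris(return_X_y=True)
_, p_values = f_classif(X, y)       # ANOVA F-test per attribute
keep = p_values < 0.05              # True for significant attributes
X_reduced = X[:, keep]
print(p_values, X_reduced.shape)
```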
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.
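A sketch of the regression-model case: the two fitted parameters are stored in place of all the raw points (the data is synthetic):

```python
import numpy as np

# Numerosity reduction sketch: keep only (slope, intercept) of a
# fitted line instead of all 1000 raw (x, y) pairs.
rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 7.0 + rng.normal(scale=1.0, size=1000)
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))   # ~3.0 ~7.0
```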

4. Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms, which can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
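A PCA sketch with scikit-learn that projects 4-dimensional data onto 2 principal components, a lossy reduction; the dataset choice is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Project the 4 iris attributes onto the 2 components that retain
# the most variance.
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)           # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.round(3))   # variance kept per component
```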
