UNIT-2
Data Pre-Processing
Real-world data is often incomplete and noisy. Incomplete data may come from human, hardware, or software problems; noisy data (incorrect values) may come from faulty data collection instruments, errors at data entry, or errors in data transmission.
The major tasks in data pre-processing are:
- Data cleaning
- Data integration
- Data transformation (normalization and aggregation)
- Data reduction
- Data discretization
Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data.
The various methods for handling the problem of missing values in data
tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification or description). This
method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of
missing values per attribute
varies considerably.
Binning (for smoothing noisy data): in this technique, the sorted data values are distributed into a number of bins and then smoothed by consulting the values around them. For example:

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into (equi-depth) bins of depth 4 (each bin contains four values):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
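As a rough illustration of the three smoothing variants described above, the short Python sketch below partitions the price list into equi-depth bins and applies each smoother; the function names (partition, smooth_by_means, and so on) are illustrative, not from any particular library.

```python
# A minimal sketch of binning-based smoothing on the price data above.
def partition(values, depth):
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_medians(bins):
    # uses the upper middle value as the median for even-sized bins
    return [[sorted(b)[len(b) // 2] for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by whichever bin boundary (min or max) is closer
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = partition(prices, depth=4)
print(smooth_by_means(bins))        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_medians(bins))      # [[9, 9, 9, 9], [24, 24, 24, 24], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```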
Suppose that the data for analysis include the attribute age. The age values
for the data tuples are (in
increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth
of 3. Illustrate your steps.
Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing
by bin means
with a bin
depth of 3.
• Step 1: Sort the data. (This step is not required here, as the data are already sorted.)
• Step 2: Partition the data into equi-depth bins of depth 3:
Bin 1: 13, 15, 16    Bin 2: 16, 19, 20    Bin 3: 20, 21, 22
Bin 4: 22, 25, 25    Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45    Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin:
Bin 1: 14.67, 14.67, 14.67    Bin 2: 18.33, 18.33, 18.33    Bin 3: 21, 21, 21
Bin 4: 24, 24, 24    Bin 5: 26.67, 26.67, 26.67    Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35    Bin 8: 40.33, 40.33, 40.33    Bin 9: 56, 56, 56
Comment: the technique smooths the data, since each age value is replaced by the mean of its bin; small fluctuations within a bin disappear while the overall distribution of ages is preserved.
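As a quick check of these bin means, the short Python snippet below recomputes them; only the sorted age list from the exercise is assumed.

```python
# Recompute the depth-3 bin means for the age data above.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
bins = [ages[i:i + 3] for i in range(0, len(ages), 3)]
print([round(sum(b) / 3, 2) for b in bins])
# [14.67, 18.33, 21.0, 24.0, 26.67, 33.67, 35.0, 40.33, 56.0]
```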
When detecting discrepancies during data cleaning, the data can also be examined with respect to several kinds of rules:
A unique rule says that each value of the given attribute must be different from all other values of that attribute.
A consecutive rule says that there can be no missing values between the lowest and highest values of the attribute, and that all values must also be unique.
A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
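As a small illustration, the sketch below checks a column of values against these three kinds of rules; the chosen null markers and the sample values are assumptions made for the example.

```python
# A minimal sketch of checking a column against the unique, consecutive and null rules.
NULL_MARKERS = {"", "?", "N/A", None}

def check_rules(values):
    non_null = [v for v in values if v not in NULL_MARKERS]
    unique_ok = len(non_null) == len(set(non_null))                  # unique rule
    consecutive_ok = unique_ok and set(non_null) == set(
        range(min(non_null), max(non_null) + 1))                     # consecutive rule
    nulls_found = len(values) - len(non_null)                        # null rule: count markers
    return unique_ok, consecutive_ok, nulls_found

print(check_rules([101, 102, 103, 104]))      # (True, True, 0)
print(check_rules([101, 103, 104, "?"]))      # (True, False, 1)
```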
Data integration combines data from multiple sources into a coherent data store. There are a number of issues to consider during data integration.
Issues:
1. Correlation analysis: redundancy between two numeric attributes A and B can be detected with the correlation coefficient, also called Pearson's product-moment coefficient:

r_A,B = Σ (a_i − mean_A)(b_i − mean_B) / (n · stand_dev_A · stand_dev_B)

where n is the number of tuples, mean_A and mean_B are the means of A and B, and stand_dev_A and stand_dev_B are their standard deviations.
If the result of the equation is > 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase; a higher value may indicate redundancy, so one of the attributes may be removed.
If the result of the equation is = 0, then A and B are independent and there is no correlation between them.
If the result is < 0, then A and B are negatively correlated: the values of one attribute increase as the values of the other decrease.
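A minimal Python sketch of this correlation check, applying the formula above to two illustrative attributes:

```python
# Pearson's product-moment coefficient for two numeric attributes A and B.
def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = (sum((x - mean_a) ** 2 for x in a) / n) ** 0.5
    std_b = (sum((y - mean_b) ** 2 for y in b) / n) ** 0.5
    return cov / (std_a * std_b)

A = [2, 4, 6, 8, 10]
B = [1, 2, 3, 4, 5]          # B rises in lockstep with A
print(pearson(A, B))         # ~1.0: strongly correlated, so one attribute is redundant
```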
Data transformation can involve the following:
Smoothing: works to remove noise from the data.
Aggregation: summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute weekly and annual totals.
Generalization of the data: low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies.
Normalization: the attribute data are scaled so as to fall within a small, specified range. This is useful for classification algorithms involving neural networks and for distance measurements such as nearest-neighbour classification and clustering.
There are 3 methods for data normalization. They are:
1) min-max normalization
2) z-score normalization
3) normalization by decimal scaling
In min-max normalization, a value v of attribute A is mapped to v' in the new range [new_min_A, new_max_A] by computing

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

where min_A and max_A are the minimum and maximum values of A, and v is the value to be normalized.
Example:
Normalize the following group of data: 1000, 2000, 3000, 9000, using min-max normalization with new_min = 0 and new_max = 1.
Solution: here min_A = 1000 and max_A = 9000.
Case 1: normalizing 1000 –
v = 1000; putting all values in the formula, we get
v' = (1000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 0
Case 2: normalizing 2000 –
v = 2000; putting all values in the formula, we get
v' = (2000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 0.125
Case 3: normalizing 3000 –
v = 3000; putting all values in the formula, we get
v' = (3000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 0.25
Case 4: normalizing 9000 –
v = 9000; putting all values in the formula, we get
v' = (9000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 1
Outcome:
Hence, the normalized values of 1000, 2000, 3000, 9000 are 0, 0.125, 0.25, 1.
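A minimal Python sketch of min-max normalization for the same values and target range:

```python
# Min-max normalization to the range [new_min, new_max].
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max([1000, 2000, 3000, 9000]))   # [0.0, 0.125, 0.25, 1.0]
```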
Z-score normalization (zero-mean normalization): the values of an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v − mean_A) / stand_dev_A

This method is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
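A minimal Python sketch of z-score normalization; the sample values are the ones used in the min-max example above.

```python
# Z-score (zero-mean) normalization: subtract the mean, divide by the standard deviation.
def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(z_score([1000, 2000, 3000, 9000]))   # the result has mean 0 and standard deviation 1
```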
Normalization by decimal scaling: normalizes by moving the decimal point of the values of attribute A. A value v is normalized to v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
Example 2: the maximum absolute value below is 400, so each value is divided by 1000 (j = 3):

Salary    Formula        Normalized value after decimal scaling
400       400 / 1000     0.4
310       310 / 1000     0.31
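A minimal Python sketch of decimal scaling for the values in Example 2; the small epsilon is only there so that values which are exact powers of ten still end up below 1 after scaling.

```python
# Decimal scaling: divide by the smallest power of ten that brings every |value| below 1.
import math

def decimal_scaling(values):
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1e-12))
    return [v / (10 ** j) for v in values], j

print(decimal_scaling([400, 310]))   # ([0.4, 0.31], 3)
```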
2.5.1 Data cube aggregation: Reduce the data to the concept level
needed in the analysis. Queries regarding aggregated information should be
answered using data cubes when possible. Data cubes store
multidimensional aggregated information. The following figure shows a data
cube for multidimensional analysis of sales data with respect to annual sales
per item type for each branch.
Each cell holds an aggregate data value, corresponding to a data point in multidimensional space.
Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
The following database consists of sales per quarter for the years 1997-
1999.
Suppose the analyst is interested in the annual sales rather than the sales per quarter. The above data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. The resulting data are smaller in volume, without loss of the information necessary for the analysis task.
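A minimal Python sketch of this quarterly-to-annual roll-up; since the original sales table is not reproduced here, the figures below are illustrative placeholders, not the values from the missing figure.

```python
# Roll quarterly sales up to annual totals (placeholder figures).
quarterly_sales = {
    (1997, "Q1"): 224.00, (1997, "Q2"): 408.00,
    (1997, "Q3"): 350.00, (1997, "Q4"): 586.00,
    (1998, "Q1"): 300.00, (1998, "Q2"): 410.00,
    (1998, "Q3"): 370.00, (1998, "Q4"): 600.00,
}

annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0.0) + amount   # aggregate to year level

print(annual_sales)   # {1997: 1568.0, 1998: 1680.0}
```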
2.5.2 Dimensionality Reduction
It is the process of reducing the number of random variables or attributes under consideration.
Feature selection is a must for any data mining product. That is because,
when you build a data mining model, the dataset frequently contains more
information than is needed to build the model. For example, a dataset may
contain 500 columns that describe characteristics of customers, but
perhaps only 50 of those columns are used to build a particular model. If
you keep the unneeded columns while building the model, more CPU and
memory are required during the training process, and more storage space
is required for the completed model.
If the mining algorithm itself is used to evaluate candidate attribute subsets, the method is called the wrapper approach; if the subset is selected independently of the mining algorithm, it is called the filter approach. The wrapper approach generally leads to greater accuracy, since it optimizes the evaluation measure of the algorithm while removing attributes, as the sketch below illustrates.
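The sketch below illustrates wrapper-style greedy forward selection. The scoring function, the 1-nearest-neighbour evaluator, and the toy data are all assumptions made for this example, not part of any specific product.

```python
# A minimal sketch of wrapper-style greedy forward selection.
# score_subset() is a stand-in evaluator: leave-one-out accuracy of a 1-nearest-neighbour
# classifier restricted to the chosen columns.

def score_subset(X, y, cols):
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        correct += (y[best_j] == y[i])
    return correct / len(X)

def forward_select(X, y, n_attrs):
    chosen, remaining, best_score = [], list(range(n_attrs)), 0.0
    while remaining:
        scored = [(score_subset(X, y, chosen + [c]), c) for c in remaining]
        s, c = max(scored)
        if s <= best_score:          # stop when no attribute improves the score
            break
        chosen.append(c)
        remaining.remove(c)
        best_score = s
    return chosen, best_score

# toy data: attribute 0 separates the classes, attribute 1 is noise
X = [[1, 7], [2, 3], [3, 9], [10, 2], [11, 8], [12, 4]]
y = [0, 0, 0, 1, 1, 1]
print(forward_select(X, y, n_attrs=2))   # expected to keep only attribute 0
```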
Data compression
Wavelet transforms
Principal components analysis.
Wavelet compression is a form of data compression well suited for image
compression. The discrete wavelet transform (DWT) is a linear signal
processing technique that, when applied to a data vector D, transforms it to
a numerically different vector, D', of wavelet coefficients.
1. The length, L, of the input data vector must be an integer power of two. This condition can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions: the first applies some data smoothing, such as a sum or weighted average; the second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of the desired length.
5. A selection of values from the data sets obtained in the above iterations is designated the wavelet coefficients of the transformed data.
Wavelet coefficients larger than some user-specified threshold are retained; the remaining coefficients are set to 0.
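A minimal sketch of a Haar wavelet transform followed by threshold-based truncation. This is an illustrative simplification (pairwise averages and differences), not the exact DWT variant used by any particular tool.

```python
# One-dimensional Haar wavelet compression on a vector whose length is a power of two.
def haar_dwt(data):
    """Recursively apply pairwise averages (smoothing) and differences (detail)."""
    coeffs = []
    current = list(data)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs      # keep detail coefficients from every level
        current = averages
    return current + coeffs            # overall average followed by the details

def compress(coeffs, threshold):
    """Keep coefficients whose magnitude exceeds the threshold; zero the rest."""
    return [c if abs(c) > threshold else 0 for c in coeffs]

data = [4, 8, 9, 15, 21, 21, 24, 25]   # length 8 = 2^3, already a power of two
print(compress(haar_dwt(data), threshold=1.0))
```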
Parametric and non-parametric methods
Parametric: assume the data fit some model, estimate the model parameters, and store only the parameters instead of the actual data (for example, a regression model).
Non-parametric: histograms, clustering, and sampling are used to store a reduced form of the data.
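As a small illustration of the parametric idea, the sketch below fits a straight line by ordinary least squares and keeps only the two parameters; the data points are made up for the example.

```python
# Parametric reduction: store only the slope and intercept of a fitted line
# instead of every data point.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept            # only these two numbers are stored

xs = [1, 2, 3, 4, 5, 6]
ys = [4, 8, 9, 15, 21, 24]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)                # approximate any y as slope * x + intercept
```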
Histograms: the following data are a list of prices of commonly sold items (the numbers have been sorted). A histogram partitions the values into disjoint buckets and stores, for each bucket, only a summary of the values that fall into it.
Sampling: allows a large data set to be represented by a much smaller random sample (or subset) of the data.
Advantages of sampling: the cost of obtaining a sample is proportional to the size of the sample rather than to the size of the full data set.
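A minimal sketch of simple random sampling without replacement, using Python's standard library; the price list and sample size are illustrative.

```python
# Simple random sample without replacement: each tuple has an equal chance of being drawn.
import random

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
sample = random.sample(prices, k=4)
print(sample)
```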
Discretization:
Discretization techniques can be used to reduce the number of values for a
given continuous attribute, by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual data values.
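A minimal sketch of discretization by equal-width intervals, replacing each age value with an interval label; the bin width of 20 is an arbitrary choice for the example.

```python
# Equal-width discretization: map each age to the label of the interval it falls into.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
width = 20
labels = [f"{(a // width) * width}-{(a // width) * width + width - 1}" for a in ages]
print(labels[:5])    # ['0-19', '0-19', '0-19', '0-19', '0-19']
```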
Concept Hierarchy
A concept hierarchy replaces low-level attribute values with higher-level concepts (for example, replacing numeric ages with labels such as youth, middle-aged, and senior).
Example of Discrete Attribute
Attribute      Value
Profession     Teacher, Business man, Peon, etc.
Postal Code    42200, 42300, etc.
Continuous Attributes
Continuous data technically have an infinite number of possible values and are usually represented as floating-point numbers; there can be many values between 1 and 2. These are quantitative attributes.
Example of Continuous Attribute
Attribute Value
Height     5.4…, 6.5…, etc.
Weight     50.09…, etc.
There are five methods for numeric concept hierarchy generation. These
include:
1. binning,
2. histogram analysis,
3. clustering analysis,
4. entropy-based discretization, and
5. data segmentation by "natural partitioning".
Segmentation by Natural Partitioning (the 3-4-5 rule)
Example: suppose that the values for the attribute profit for the year 1997 cover a wide range, from -$351,976.00 to $4,700,896.50. A user wishes to have a concept hierarchy for profit automatically generated. Suppose that the data within the 5%-tile and 95%-tile are between -$159,876 and $1,838,761. The results of applying the 3-4-5 rule are shown in the following figure.
Step 1: Based on the above information, the minimum and maximum values are MIN = -$351,976.00 and MAX = $4,700,896.50. The low (5%-tile) and high (95%-tile) values to be considered for the top or first level of segmentation are LOW = -$159,876 and HIGH = $1,838,761.
Step 2: Given LOW and HIGH, the most significant digit is at the million-dollar digit position (i.e., msd = 1,000,000). Rounding LOW down to the million-dollar digit, we get LOW' = -$1,000,000; rounding HIGH up to the million-dollar digit, we get HIGH' = +$2,000,000.
Step 3: Since this interval ranges over 3 distinct values at the most significant digit, i.e., (2,000,000 − (−1,000,000)) / 1,000,000 = 3, the segment is partitioned into 3 equi-width sub-segments according to the 3-4-5 rule: (-$1,000,000 … $0], ($0 … $1,000,000], and ($1,000,000 … $2,000,000]. This represents the top tier of the hierarchy.
Step 4: We now examine the MIN and MAX values to see how they "fit" into the first-level partitions. Since the first interval, (-$1,000,000 … $0], covers the MIN value, i.e., LOW' < MIN, we can adjust the left boundary of this interval to make the interval smaller. The most significant digit of MIN is at the hundred-thousand digit position. Rounding MIN down to this position, we get MIN' = -$400,000.
Therefore, the first interval is redefined as (-$400,000 … $0]. Since the last interval, ($1,000,000 … $2,000,000], does not cover the MAX value, i.e., MAX > HIGH', we need to create a new interval to cover it. Rounding up MAX at its most significant digit position, the new interval is ($2,000,000 … $5,000,000]. Hence, the topmost level of the hierarchy contains four partitions: (-$400,000 … $0], ($0 … $1,000,000], ($1,000,000 … $2,000,000], and ($2,000,000 … $5,000,000].
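A minimal sketch of the top-level split performed by the 3-4-5 rule, applied to the LOW and HIGH values from the example above. The function name is illustrative; the recursion into lower hierarchy levels and the special 2-3-2 grouping used for 7 distinct values are omitted for brevity.

```python
# Top-level split of the 3-4-5 rule: round LOW/HIGH at the most significant digit,
# count the distinct values at that digit, and cut the range into 3, 4 or 5 intervals.
import math

def top_level_345(low, high):
    msd = 10 ** int(math.floor(math.log10(max(abs(low), abs(high)))))
    low_r = math.floor(low / msd) * msd          # round LOW down at the msd
    high_r = math.ceil(high / msd) * msd         # round HIGH up at the msd
    distinct = round((high_r - low_r) / msd)     # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        parts = 3                                # (7 is simplified to equi-width here)
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                        # 1, 5 or 10 distinct values
        parts = 5
    width = (high_r - low_r) / parts
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]

print(top_level_345(-159876, 1838761))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]
```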