
UNIT –2

Data Pre-Processing

Data Pre-processing: An Overview, Data Cleaning, Data Integration, Data Reduction, Data Transformation and Data Discretization

Data pre-processing

Data pre-processing describes any type of processing performed on raw


data to prepare it for another processing procedure. Commonly used as a
preliminary data mining practice, data pre-processing transforms the data
into a format that will be more easily and effectively processed for the
purpose of the user.


2.1 Why Data Pre-processing?

Data in the real world is dirty: it can be incomplete, noisy, and inconsistent. Such data needs to be pre-processed in order to improve the quality of the data and, in turn, the quality of the mining results.

Without quality data there can be no quality mining results; quality decisions must be based on quality data.

If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult.
Incomplete data: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data. e.g., occupation=“ ”.

Noisy data: containing errors or outliers, e.g., Salary=“-10”.

Inconsistent data: containing discrepancies in codes or names, e.g., Age=“42” while Birthday=“03/07/1997”.

Incomplete data may come from
 “Not applicable” data values at the time of collection
 Different considerations between the time when the data was collected and when it is analyzed
 Human/hardware/software problems

Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission

Inconsistent data may come from
 Different data sources
 Functional dependency violations (e.g., modifying some linked data)
Major Tasks in Data Preprocessing

 Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

 Data integration: integration of multiple databases, data cubes, or files.

 Data transformation: normalization and aggregation.

 Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.

 Data discretization: part of data reduction, but of particular importance, especially for numerical data.

Forms of Data Preprocessing


2.3 Data Cleaning

Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data.

Various methods for handling this problem:

2.3.1 Missing Values

The various methods for handling the problem of missing values in data
tuples include:

(a)Ignoring the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification or description). This
method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of
missing values per attribute
varies considerably.

(b)Manually filling in the missing value: In general, this approach is


time-consuming and may not be a reasonable task for large data sets with
many missing values, especially when the value to be filled in is not easily
determined.
(c)Using a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
“Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.”
Hence, although this method is simple, it is not recommended.

(d)Using the attribute mean for quantitative (numeric) values or


attribute mode for categorical (nominal) values, for all samples
belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value
with the average income value for customers in the same credit risk category
as that of the given tuple.
(e)Using the most probable value to fill in the missing value: This may
be determined with regression, inference-based tools using Bayesian
formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.

2.3.2 Noisy data:

Noise is a random error or variance in a measured variable. Data smoothing techniques are used to remove such noise.

Several Data smoothing techniques:

1 Binning methods: Binning methods smooth a sorted data value by consulting its “neighbourhood”, that is, the values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.

In this technique:

1. The data are first sorted.
2. The sorted list is partitioned into equi-depth bins.
3. One can then smooth by bin means, by bin medians, or by bin boundaries:
   a. Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
   b. Smoothing by bin medians: each value in the bin is replaced by the bin median.
   c. Smoothing by bin boundaries: the minimum and maximum values of a bin are identified as the bin boundaries; each bin value is replaced by the closest boundary value.
 Example: Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into (equi-depth) bins of depth 4, since each bin contains four values:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
Suppose that the data for analysis include the attribute age. The age values
for the data tuples are (in
increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a)Use smoothing by bin means to smooth the above data, using a bin depth
of 3. Illustrate your steps.
Comment on the effect of this technique for the given data.

The following steps are required to smooth the above data using smoothing
by bin means
with a bin
depth of 3.

•Step 1: Sort the data. (This step is not required here as the data are already sorted.)

•Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 13, 15, 16    Bin 2: 16, 19, 20    Bin 3: 20, 21, 22
Bin 4: 22, 25, 25    Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45    Bin 9: 46, 52, 70

•Step 3: Calculate the arithmetic mean of each bin.

•Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14.67, 14.67, 14.67    Bin 2: 18.33, 18.33, 18.33    Bin 3: 21, 21, 21
Bin 4: 24, 24, 24    Bin 5: 26.67, 26.67, 26.67    Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35    Bin 8: 40.33, 40.33, 40.33    Bin 9: 56, 56, 56
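The same procedure can be scripted. The sketch below is illustrative (not part of the original notes) and reproduces the bin-depth-3 example, smoothing both by bin means and by bin boundaries.

from statistics import mean

# Equi-depth binning with a bin depth of 3; the data are already sorted.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
depth = 3

bins = [ages[i:i + depth] for i in range(0, len(ages), depth)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
smoothed_means = [[round(mean(b), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closest of min/max.
smoothed_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                   for b in bins]

for i, (m, bd) in enumerate(zip(smoothed_means, smoothed_bounds), start=1):
    print(f"Bin {i}: by means -> {m}, by boundaries -> {bd}")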

2 Clustering: Outliers in the data may be detected by clustering, where


similar values are organized into groups, or ‘clusters’. Values that fall
outside of the set of clusters may be considered outliers.
3 Regression: smooth by fitting the data to regression functions.

 Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.

 Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.

Finding Discrepancies in the data

Field overloading is a source of errors that typically occurs when developers compress new attribute definitions into unused portions of already defined attributes.

A unique rule says that each value of the given attribute must be different from all other values of that attribute.

A consecutive rule says that there can be no missing values between the lowest and highest values of the attribute, and that all values must also be unique.

A null rule specifies the use of blanks, question marks, special characters or other strings that may indicate the null condition, and how such values should be handled.

Commercial tools that can aid in the discrepancy detection step:

 Data scrubbing tools use simple domain knowledge (e.g., spell checking, parsing techniques) to detect errors and make corrections in the data.
 Data auditing tools analyze the data to discover rules and relationships, and detect data that violate such conditions.
 Data migration tools allow simple transformations.

2.4 Data Integration and Transformation

2.4.1 Data Integration

It combines data from multiple sources into a coherent store. There are a number of issues to consider during data integration.

Issues:

 Schema integration: refers to the integration of metadata from different sources.

 Entity identification problem: identifying whether an entity in one data source is the same as an entity in another. For example, customer_id in one database and customer_no in another database may refer to the same entity.

 Detecting and resolving data value conflicts: attribute values from different sources can differ due to different representations or different scales, e.g., metric vs. British units.

 Redundancy: another issue while performing data integration. Redundancy can occur for the following reasons:
   - Object identification: the same attribute may have different names in different databases.
   - Derived data: one attribute may be derived from another attribute.

Handling redundant data in data integration

1. Correlation analysis

For numeric data

Some redundancy can be identified by correlation analysis. The correlation between two numeric attributes A and B can be measured by Pearson's product-moment coefficient:

r(A, B) = Σ (a_i − mean_A)(b_i − mean_B) / (n · σ_A · σ_B)

where n is the number of tuples, mean_A and mean_B are the mean values of A and B, and σ_A and σ_B are their standard deviations.

 If the result is > 0, then A and B are positively correlated: the values of A increase as the values of B increase. The higher the value, the more likely the attributes are redundant, and one of them may be removed.
 If the result is = 0, then A and B are independent and there is no correlation between them.
 If the result is < 0, then A and B are negatively correlated: the values of one attribute increase as the values of the other decrease, meaning each attribute discourages the other.
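For illustration, the coefficient can be computed directly from this definition; the arrays A and B below hold made-up values, not data from the notes.

import numpy as np

# Two illustrative numeric attributes.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 7.0, 11.0])

n = len(A)
r = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())
print(r)                        # > 0: A and B are positively correlated
print(np.corrcoef(A, B)[0, 1])  # NumPy's built-in Pearson coefficient agrees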

For categorical (discrete) data, the correlation between two attributes can be assessed with a chi-square (χ²) test on a contingency table of their joint value counts.

 2.4.2 Data Transformation


 Data transformation can involve the following:


Smoothing: which works to remove noise from the data
Aggregation: where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute weekly and annual total scores.
  Generalization of the data: where low-level or “primitive” (raw)

data are replaced by higher-level concepts through the use of concept


hierarchies. For example, cat egorical attributes, like street, can be
generalized to higher-level concepts, like city or country.
Normalization: where the attribute data are scaled so as to fall
within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.

Attribute construction (feature construction): this is where


new attributes are constructed and added from the given set of
attributes to help the mining process.
 Discretization the raw values of a numeric attribute are
replaced by interval labels(0-10,11-20)
Normalization

Normalization scales attribute data to fall within a small, specified range. It is useful for classification algorithms involving neural networks and for distance-based methods such as nearest-neighbour classification and clustering. There are three methods for data normalization:

1) min-max normalization
2) z-score normalization
3) normalization by decimal scaling

Min-max normalization performs a linear transformation on the original data values. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing:

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

where
 v is the old value of an entry in the data, and v' is its normalized value;
 min_A and max_A are the minimum and maximum values of attribute A;
 new_min_A and new_max_A are the boundaries of the required range.
Eg

Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and Rs. 100,000, and we want to map profit to the range [0, 1]. Using min-max normalization, the value Rs. 20,000 for attribute profit is mapped to:

v' = ((20,000 − 10,000) / (100,000 − 10,000)) × (1 − 0) + 0 ≈ 0.11

Hence, we get the value of v' as 0.11.

Example: normalize the following group of data, 1000, 2000, 3000, 9000, using min-max normalization with min = 0 and max = 1.

Solution:

Here new_max(A) = 1 and new_min(A) = 0, as given in the question; max(A) = 9000 and min(A) = 1000, since these are the maximum and minimum values among 1000, 2000, 3000, 9000.

Case 1: normalizing v = 1000, putting all values into the formula, we get
v' = ((1000 − 1000) × (1 − 0)) / (9000 − 1000) + 0 = 0

Case 2: normalizing v = 2000:
v' = ((2000 − 1000) × (1 − 0)) / (9000 − 1000) + 0 = 0.125

Case 3: normalizing v = 3000:
v' = ((3000 − 1000) × (1 − 0)) / (9000 − 1000) + 0 = 0.25

Case 4: normalizing v = 9000:
v' = ((9000 − 1000) × (1 − 0)) / (9000 − 1000) + 0 = 1

Outcome: the normalized values of 1000, 2000, 3000, 9000 are 0, 0.125, 0.25, 1.
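The same min-max calculation as a short, illustrative Python sketch:

# Min-max normalization of the example data to the range [0, 1].
data = [1000, 2000, 3000, 9000]
new_min, new_max = 0.0, 1.0
min_a, max_a = min(data), max(data)

normalized = [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
              for v in data]
print(normalized)  # [0.0, 0.125, 0.25, 1.0]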
Z-score normalization (zero-mean normalization): the values of an attribute A are normalized based on the mean and standard deviation of A. It can be defined as:

v' = (v − mean_A) / stand_dev_A

This method is useful when the minimum and maximum values of attribute A are unknown, or when outliers dominate the min-max normalization.

Example: let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using z-score normalization, a value of 85,000 for P is transformed to:

v' = (85,000 − 60,000) / 10,000 = 2.5

Hence we get the value of v' to be 2.5.
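The corresponding calculation as a small, illustrative Python sketch:

# Z-score normalization of the example value (mean = 60,000, std = 10,000).
mean_p, std_p = 60_000, 10_000
v = 85_000
v_prime = (v - mean_p) / std_p
print(v_prime)  # 2.5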


Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

For example, suppose the values of an attribute P vary from −99 to 99. The maximum absolute value of P is 99, so we divide each value by 100 (i.e., j = 2), and the values become 0.99, 0.98, and so on.

Example 1:

CGPA    Formula    Normalized value after decimal scaling
2       2 / 10     0.2
3       3 / 10     0.3

We check the maximum value of the attribute CGPA. Here the maximum value is 3, so we convert each value to a decimal by dividing by 10. Why 10? We count the number of digits in the maximum value and write 1 followed by that many zeros.

Example 2:

Salary bonus    Formula       Normalized value after decimal scaling
400             400 / 1000    0.4
310             310 / 1000    0.31

Example 3:

Salary     Formula            Normalized value after decimal scaling
40,000     40,000 / 100,000   0.4
31,000     31,000 / 100,000   0.31
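A small sketch of decimal scaling that reproduces Examples 2 and 3; the helper function decimal_scale is illustrative, not a standard library routine.

import math

# Divide every value by 10**j, where j is the smallest integer such that
# max(|v / 10**j|) < 1.
def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** j for v in values], j

print(decimal_scale([400, 310]))        # ([0.4, 0.31], 3)
print(decimal_scale([40_000, 31_000]))  # ([0.4, 0.31], 5)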

2.5 Data Reduction techniques


These techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity
of the original data. Data reduction includes,

1.Data cube aggregation, where aggregation operations are applied to


the data in the construction of a data cube.
2.Attribute subset selection, where irrelevant, weakly relevant
or redundant attributes or dimensions may be detected and
removed.

3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size. Examples: wavelet transforms and principal components analysis.

4.Numerosity reduction, where the data are replaced or estimated by


alternative, smaller data representations such as parametric models
(which need store only the model parameters instead of the actual
data) or nonparametric methods such as clustering, sampling, and the
use of histograms.

5.Discretization and concept hierarchy generation, where raw data


values for attributes are replaced by ranges or higher conceptual levels.
Data Discretization is a
form of numerosity reduction that is very useful for the automatic
generation of concept hierarchies.

2.5.1 Data cube aggregation: Reduce the data to the concept level needed in the analysis. Queries regarding aggregated information should be answered using a data cube when possible. Data cubes store multidimensional aggregated information. The following figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each branch.

Each cell holds an aggregate data value, corresponding to a data point in multidimensional space. Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.

The cube created at the lowest level of abstraction is referred to as the base cuboid; the cube at the highest level of abstraction is the apex cuboid. Data cubes created for varying levels of abstraction are sometimes referred to as cuboids, so that a “data cube" may instead refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size.

The following database consists of sales per quarter for the years 1997-1999.

Suppose the analyst is interested in annual sales rather than sales per quarter. The above data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
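As an illustration, the quarterly-to-annual aggregation described above can be done with a simple group-by; the figures and column names below are made up, not taken from the notes.

import pandas as pd

# Illustrative quarterly sales for two years.
sales = pd.DataFrame({
    "year":    [1997, 1997, 1997, 1997, 1998, 1998, 1998, 1998],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 402, 390, 620],
})

# Aggregate quarterly sales up to annual totals.
annual = sales.groupby("year", as_index=False)["sales"].sum()
print(annual)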
2.5.2 Dimensionality Reduction

It is the process of reducing the number of random variables or attributes under consideration, for example by
1) wavelet transforms
2) principal component analysis
which transform or project the original data into a smaller space.

2.5.3 Attribute subset selection / Feature selection

Attribute subset selection reduces the data set size by detecting and removing irrelevant, weakly relevant, or redundant attributes (dimensions). Heuristic methods of attribute subset selection are explained below.

Feature selection is a must for any data mining product. That is because,
when you build a data mining model, the dataset frequently contains more
information than is needed to build the model. For example, a dataset may
contain 500 columns that describe characteristics of customers, but
perhaps only 50 of those columns are used to build a particular model. If
you keep the unneeded columns while building the model, more CPU and
memory are required during the training process, and more storage space
is required for the completed model.

The goal is to select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features.

Basic heuristic methods of attribute subset selection include the


following techniques, some of which are illustrated below:

1. Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set (a short code sketch of this procedure appears at the end of this subsection).
2.Step-wise backward elimination: The procedure starts with the full
set of attributes. At each step, it removes the worst attribute remaining
in the set.

3.Combination forward selection and backward elimination: The step-


wise forward selection and backward elimination methods can be combined,
where at each step one selects the best attribute and removes the worst from
among the remaining attributes.

4.Decision tree induction: Decision tree induction constructs a flow-


chart-like structure where each internal (non-leaf) node denotes a test on
an attribute, each branch corresponds to an outcome of the test, and each
external (leaf) node denotes a class prediction. At each node, the
algorithm chooses the “best" attribute to partition the data into individual
classes. When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data. All attributes that do
not appear in the tree are assumed to be irrelevant. The set of attributes
appearing in the tree form the reduced subset of attributes.
Wrapper approach / Filter approach:

If the mining algorithm itself is used to evaluate candidate attribute subsets, the method is called a wrapper approach; if subsets are evaluated with a measure that is independent of the mining algorithm, it is a filter approach. The wrapper approach generally leads to greater accuracy, since it optimizes the evaluation measure of the algorithm itself while removing attributes, at a higher computational cost.
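For concreteness, here is a hedged sketch of step-wise forward selection used in a wrapper style. It assumes a user-supplied scoring function score(X_subset, y), for example the cross-validated accuracy of the mining algorithm; the function name and data layout are illustrative, not from the notes.

import numpy as np

def forward_selection(X, y, score, max_features=None):
    """Greedy step-wise forward selection.

    X: (n_samples, n_features) array; y: class labels;
    score(X_subset, y): higher is better (e.g., cross-validated accuracy).
    """
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    best_overall = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate adding each remaining attribute and keep the best one.
        candidate_scores = [(score(X[:, selected + [f]], y), f) for f in remaining]
        best_score, best_f = max(candidate_scores)
        if best_score <= best_overall:
            break  # no remaining attribute improves the score
        best_overall = best_score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected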

Data compression

In data compression, data encoding or transformations are applied so as to


obtain a reduced or “compressed" representation of the original data. If the
original data can be reconstructed from the compressed data without any
loss of information, the data compression technique used is called lossless.
If, instead, we can reconstruct only an approximation of the original data,
then the data compression technique is called lossy. Effective methods of
lossy data compression:

 Wavelet transforms
 Principal components analysis.
Wavelet compression is a form of data compression well suited to image compression. The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it into a numerically different vector, D', of wavelet coefficients.

The general algorithm for a discrete wavelet transform is as follows.

1.The length, L, of the input data vector must be an integer power of two.
This condition can be met by padding the data vector with zeros, as
necessary.

2.Each transform involves applying two functions:


 data smoothing
 calculating weighted difference

3.The two functions are applied to pairs of the input data, resulting in two
sets of data of length L/2.
4.The two functions are recursively applied to the sets of data obtained in
the previous loop, until the resulting data sets obtained are of desired
length.

5. A selection of values from the data sets obtained in the above iterations is designated as the wavelet coefficients of the transformed data. Wavelet coefficients larger than some user-specified threshold are retained; the remaining coefficients are set to 0.

Haar-2 and Daubechies-4 are two popular wavelet transforms.


Principal Component Analysis (PCA)
- also called the Karhunen-Loeve (K-L) method
Procedure

•Given N data vectors from k-dimensions, find c <= k orthogonal


vectors that can be best used to represent data

– The original data set is reduced (projected) to one


consisting of N data vectors on c principal components
(reduced dimensions)
•Each data vector is a linear combination of the c principal component
vectors
•Works for ordered and unordered attributes
•Used when the number of dimensions is large

The principal components (new set of axes) give important information


about variance. Using the strongest components one can reconstruct a
good approximation of the original signal.
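A minimal NumPy sketch of the projection step, assuming the data are the rows of a matrix X; the random data and the function name are illustrative.

import numpy as np

# Project N data vectors from k dimensions onto the c strongest principal
# components (c <= k).
def pca_reduce(X, c):
    X_centered = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(X_centered, rowvar=False)           # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # c components with largest variance
    return X_centered @ top                          # N x c reduced representation

X = np.random.rand(100, 5)     # N = 100 vectors, k = 5 dimensions
print(pca_reduce(X, 2).shape)  # (100, 2)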
2.5.4 Numerosity Reduction

Data volume can be reduced by choosing alternative, smaller forms of data representation. These techniques may be:

 Parametric methods: assume the data fit some model, estimate the model parameters, and store only the parameters instead of the actual data.

 Non-parametric methods: store a reduced representation of the data using histograms, clustering, or sampling.

Numerosity reduction techniques:

1 Regression and log linear models:


 Can be used to approximate the given data
 In linear regression, the data are modeled to fit a
straight line using Y = α + β X, where α, β are
coefficients
•Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the above.
Log-linear model: The multi-way table of joint probabilities is
approximated by a product of lower-order tables.

Probability: p(a, b, c, d) = αab βac χad δbcd

2 Histogram
 Divide data into buckets and store the average (sum) for each bucket.
 A bucket represents an attribute-value/frequency pair.
 It can be constructed optimally in one dimension using dynamic programming.
 It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle (bucket) is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group.
 The buckets are displayed on a horizontal axis, while the height of a bucket represents the average frequency of the values.
Example:

The following data are a list of prices of commonly sold items. The
numbers have been sorted.

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15,


15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30

The buckets can be determined based on the following partitioning rules:

1. Equi-width: a histogram with buckets (bars) of the same width, i.e., equal value ranges.
2. Equi-depth: a histogram with buckets of the same height, i.e., each bucket holds roughly the same number of values.
3. V-Optimal: the histogram with the least variance (a weighted sum over buckets of count_b × value_b).
4. MaxDiff: bucket boundaries are placed at the largest differences between adjacent values, up to a user-specified threshold.

V-Optimal and MaxDiff histograms tend to be the most accurate and


practical. Histograms are highly effective at approximating both sparse and
dense data, as well as highly skewed, and uniform data.
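For instance, an equi-width histogram of the price list above can be computed with NumPy; the choice of three buckets is only for illustration.

import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15,
          15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
          20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30,
          30, 30]

counts, edges = np.histogram(prices, bins=3)   # equi-width buckets
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    # note: the last bucket is closed on the right
    print(f"[{lo:.2f}, {hi:.2f}): {int(count)} values")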

Clustering techniques consider data tuples as objects. They partition the


objects into groups or clusters, so that objects within a cluster are “similar"
to one another and “dissimilar" to objects in other clusters. Similarity is
commonly defined in terms of how “close" the objects are in space, based on
a distance function.
Quality of clusters measured by their diameter (max distance between any
two objects in the cluster) or centroid distance (avg. distance of each cluster
object from its centroid)

Sampling

Sampling can be used as a data reduction technique since it allows a large


data set to be represented by a much smaller random sample (or subset) of
the data. Suppose that a large data set, D, contains N tuples. Let's have a
look at some possible samples for D.

1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2.Simple random sample with replacement (SRSWR) of size n:
This is similar to SRSWOR, except that each time a tuple is drawn from
D, it is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.

3.Cluster sample: If the tuples in D are grouped into M mutually disjoint


“clusters", then a SRS of m clusters can be obtained, where m < M. For
example, tuples in a database are usually retrieved a page at a time, so that
each page can be considered a cluster. A reduced data representation can
be obtained by applying, say, SRSWOR to the pages, resulting in a cluster
sample of the tuples.

4. Stratified sample: If D is divided into mutually disjoint parts called “strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
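A minimal sketch of these sampling schemes on a toy data set; the tuple values, sample size, and strata below are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
D = np.arange(1000)          # a toy "data set" of N = 1000 tuples
n = 50

srswor = rng.choice(D, size=n, replace=False)  # SRS without replacement
srswr = rng.choice(D, size=n, replace=True)    # SRS with replacement

# Stratified sample: split D into 10 strata and draw an SRS from each stratum.
strata = np.array_split(D, 10)
stratified = np.concatenate([rng.choice(s, size=n // 10, replace=False)
                             for s in strata])
print(len(srswor), len(srswr), len(stratified))  # 50 50 50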

Advantages of sampling

1.An advantage of sampling for data reduction is that the cost of


obtaining a sample is proportional to the size of the sample, n, as
opposed to N, the data set size. Hence, sampling complexity is
potentially sub-linear to the size of the data.

2.When applied to data reduction, sampling is most commonly used to


estimate the answer to an aggregate query.

2.6 Discretization and concept hierarchies

Discretization:
Discretization techniques can be used to reduce the number of values for a
given continuous attribute, by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual data values.

Concept Hierarchy

A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
2.6.1 Discretization and concept hierarchy generation for numerical data

Attributes may be grouped as:
1. Qualitative attributes, such as nominal, ordinal, and binary attributes.
2. Quantitative attributes, such as discrete and continuous attributes.

Three types of attributes:

Nominal — values from an unordered set, e.g., color, profession.

Ordinal — values from an ordered set, e.g., military or academic rank. All values have a meaningful order. For example, grade A means the highest marks, B means marks less than A, C means marks less than grades A and B, and so on. Ordinal attributes are qualitative attributes.

Examples of ordinal attributes:
Attribute                Value
Grade                    A, B, C, D, F
BPS (basic pay scale)    16, 17, 18

Discrete attributes: discrete data have a finite set of values, which can be numerical or categorical. Discrete attributes are quantitative attributes.

Examples of discrete attributes:
Attribute      Value
Profession     Teacher, Businessman, Peon, etc.
Postal Code    42200, 42300, etc.

Continuous attributes: continuous data technically have an infinite number of possible values and are stored as real (floating-point) numbers; there can be many values between 1 and 2. Continuous attributes are quantitative attributes.

Examples of continuous attributes:
Attribute    Value
Height       5.4..., 6.5..., etc.
Weight       50.09..., etc.

There are five methods for numeric concept hierarchy generation. These
include:

1. binning,
2. histogram analysis,
3. clustering analysis,
4. entropy-based discretization, and
5. data segmentation by “natural partitioning".

An information-based measure called “entropy" can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization.
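A hedged sketch of the basic step, assuming class labels are available for the tuples (entropy-based discretization is supervised): it scans the candidate boundaries of a numeric attribute and returns the split that minimizes the weighted entropy of the two resulting partitions. The names and toy data are illustrative, not from the notes.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_entropy_split(values, labels):
    """One recursion step of entropy-based discretization: return the boundary
    on `values` that minimizes the weighted entropy of the two partitions."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    n = len(vals)
    best_split, best_e = None, float("inf")
    for i in range(1, n):
        if vals[i] == vals[i - 1]:
            continue                      # no boundary between equal values
        split = (vals[i] + vals[i - 1]) / 2
        e = (i * entropy(labs[:i]) + (n - i) * entropy(labs[i:])) / n
        if e < best_e:
            best_split, best_e = split, e
    return best_split, best_e

# Toy example: age values with a made-up class label per tuple.
ages = [23, 25, 30, 35, 40, 45, 52, 60]
risk = ["low", "low", "low", "high", "high", "high", "high", "high"]
print(best_entropy_split(ages, risk))   # best boundary is 32.5 (a pure split)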

Procedure: Segmentation by Natural Partitioning (the 3-4-5 rule)

Example: Suppose that profits at different branches of a company for the

year 1997
cover a wide range, from -$351,976.00 to $4,700,896.50. A user wishes to
have a concept hierarchy for profit automatically generated.

Suppose that the data within the 5%-tile and 95%-tile are between -
$159,876 and $1,838,761. The results of applying the 3-4-5 rule are
shown in following figure
Step 1: Based on the above information, the minimum and maximum values
are: MIN = - $351, 976.00, and MAX = $4, 700, 896.50. The low (5%-tile)
and high (95%-tile) values to be considered for the top or first level of
segmentation are: LOW = -$159, 876, and HIGH = $1, 838,761.

Step 2: Given LOW and HIGH, the most significant digit is at the million-dollar digit position (i.e., msd = 1,000,000). Rounding LOW down to the million-dollar digit, we get LOW' = -$1,000,000; and rounding HIGH up to the million-dollar digit, we get HIGH' = +$2,000,000.

Step 3: Since this interval ranges over 3 distinct values at the most significant digit, i.e., (2,000,000 − (−1,000,000)) / 1,000,000 = 3, the segment is partitioned into 3 equi-width sub-segments according to the 3-4-5 rule: (-$1,000,000 - $0], ($0 - $1,000,000], and ($1,000,000 - $2,000,000]. This represents the top tier of the hierarchy.

Step 4: We now examine the MIN and MAX values to see how they “fit" into the first-level partitions. Since the first interval, (-$1,000,000 - $0], covers the MIN value, i.e., LOW' < MIN, we can adjust the left boundary of this interval to make the interval smaller. The most significant digit of MIN is at the hundred-thousand digit position. Rounding MIN down to this position, we get MIN' = -$400,000. Therefore, the first interval is redefined as (-$400,000 - $0]. Since the last interval, ($1,000,000 - $2,000,000], does not cover the MAX value, i.e., MAX > HIGH', we need to create a new interval to cover it. Rounding MAX up at its most significant digit position, the new interval is ($2,000,000 - $5,000,000]. Hence, the topmost level of the hierarchy contains four partitions: (-$400,000 - $0], ($0 - $1,000,000], ($1,000,000 - $2,000,000], and ($2,000,000 - $5,000,000].

Step 5: Recursively, each interval can be further partitioned according to


the 3-4-5 rule to form the next lower level of the hierarchy:
- The first interval (-$400,000 - $0] is partitioned into 4 sub-intervals: (-
$400,000 - - $300,000], (-$300,000 - -$200,000], (-$200,000 - -
$100,000], and (-$100,000 - $0].

- The second interval, ($0- $1,000,000], is partitioned into 5 sub-


intervals: ($0 - $200,000], ($200,000 - $400,000], ($400,000 -
$600,000], ($600,000 - $800,000], and ($800,000 -$1,000,000].

- The third interval, ($1,000,000 - $2,000,000], is partitioned into 5 sub-


intervals: ($1,000,000 - $1,200,000], ($1,200,000 - $1,400,000],
($1,400,000 - $1,600,000], ($1,600,000 - $1,800,000], and ($1,800,000
- $2,000,000].
- The last interval, ($2,000,000 - $5,000,000], is partitioned into 3 sub-
intervals:
($2,000,000 - $3,000,000], ($3,000,000 - $4,000,000], and ($4,000,000
- $5,000,000].
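The top-level step of this example can also be scripted. The sketch below is an illustration, not the full 3-4-5 rule: it rounds LOW and HIGH at the most significant digit, counts the distinct msd values, and cuts the range into 3, 4, or 5 equi-width segments; the special 2-3-2 grouping used when there are 7 distinct values is omitted.

import math

def top_level_345(low, high):
    """Top-level 3-4-5 segmentation of [low, high] (the 5%-tile/95%-tile values)."""
    msd = 10 ** (len(str(int(max(abs(low), abs(high))))) - 1)  # most significant digit position
    low_r = math.floor(low / msd) * msd    # round LOW down to the msd
    high_r = math.ceil(high / msd) * msd   # round HIGH up to the msd
    n = round((high_r - low_r) / msd)      # distinct values at the msd
    if n in (3, 6, 7, 9):                  # (7 actually uses a 2-3-2 grouping)
        parts = 3
    elif n in (2, 4, 8):
        parts = 4
    else:                                  # 1, 5, 10
        parts = 5
    width = (high_r - low_r) / parts
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]

# Profit example from the notes: LOW = -$159,876, HIGH = $1,838,761
print(top_level_345(-159_876, 1_838_761))
# -> three equi-width segments: (-1,000,000, 0], (0, 1,000,000], (1,000,000, 2,000,000]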
2.6.2 Concept hierarchy generation for category data

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. It organizes the values of attributes or dimensions into gradual levels of abstraction. Concept hierarchies are useful for mining at multiple levels of abstraction.
