Week 2
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Cleaning
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing errors or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill it in automatically with
• a global constant, e.g., “unknown” (in effect a new class!)
• the attribute mean/median/mode
• the attribute mean/median/mode for all samples belonging
to the same class: smarter
• the most probable value: inference-based such as Bayesian
formula or decision tree
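A minimal sketch of these fill-in strategies with pandas; the DataFrame, column, and class-label names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],       # class label
    "salary": [50.0, None, 40.0, None, 60.0],  # attribute with missing values
})

# Global constant: replace missing values with a sentinel value
df["salary_const"] = df["salary"].fillna(-1)

# Attribute mean over all samples
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# Attribute mean for all samples belonging to the same class (smarter)
df["salary_cls_mean"] = df["salary"].fillna(
    df.groupby("cls")["salary"].transform("mean")
)
```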
How to Handle Noisy Data?
• Binning
• first sort data and partition into bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
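A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the values are illustrative:

```python
import numpy as np

# First sort the data, then partition into equal-frequency (equal-depth) bins
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(data, 3)

# Smooth by bin means: every value in a bin becomes the bin mean
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smooth by bin boundaries: snap each value to the nearer bin boundary
by_bounds = np.concatenate(
    [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]
)
```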
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• …
• Numerosity reduction (some simply call it: Data Reduction)
• Histograms, clustering, sampling
• Regression and Log-Linear Models
• …
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data analysis
• Allow easier visualization
• Dimensionality reduction techniques
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
• …
Geometric interpretation of PCA
• Which variable is the principal variable?
• Max variance
• The first PC is the direction of maximum variance from the
origin
• Subsequent PCs are orthogonal to the preceding PCs and
describe the maximum residual variance
PCA transformation
• Coefficients of the linear combination that transforms the
observations onto the PCs are formed by the eigenvectors of
the covariance matrix
• Covariance matrix (3 dimensions):

  Σ = | Var(x)    Cov(x,y)  Cov(x,z) |
      | Cov(y,x)  Var(y)    Cov(y,z) |
      | Cov(z,x)  Cov(z,y)  Var(z)   |
PCA Algorithm
• Input: Data Matrix
• Step 1: Normalize data matrix
• Step 2: Get Covariance Matrix
• Step 3: Calculate the eigenvectors and eigenvalues of
the covariance matrix
• Step 4: Sort the eigenvectors: take the
eigenvalues λ₁, λ₂, …, λp and sort them from largest
to smallest
• Step 5: Choose the first k eigenvectors and calculate
the new features
• Project the standardized points in the new feature
space
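The five steps in compact NumPy; the data matrix X and the choice of k are placeholders:

```python
import numpy as np

def pca(X, k):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # Step 1: normalize
    C = np.cov(Z, rowvar=False)               # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # Step 3: eigen-decomposition
    order = np.argsort(eigvals)[::-1]         # Step 4: sort largest-first
    eigvecs = eigvecs[:, order]
    return Z @ eigvecs[:, :k]                 # Step 5: project onto first k PCs

X = np.random.default_rng(0).normal(size=(100, 5))
scores = pca(X, k=2)   # 100 standardized points in the new 2-D feature space
```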
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in
one or more other attributes
• E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA
• Task: Supervised (Classification/Regression)
• Iterative process
• Subset generation
• Subset selection
• Termination condition
• Search
• Forward
• Backward
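One way to realize this iterative forward/backward search is scikit-learn's SequentialFeatureSelector; the estimator and the synthetic dataset here are stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Forward search: start empty, greedily add the attribute that most improves
# cross-validated performance; terminate at the requested subset size.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # direction="backward" starts full and removes
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of selected attributes
```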
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative,
smaller forms of data representation
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …
• Parametric methods
• Assume a model; store only the model parameters
instead of the actual data
Sampling
• Choose a representative subset of the data
• Sampling can be done with replacement or without
replacement
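A quick sketch of both variants with NumPy (the dataset and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100_000)                 # stand-in for a large dataset

without = rng.choice(data, size=10_000, replace=False)  # no duplicates
with_r  = rng.choice(data, size=10_000, replace=True)   # duplicates possible
```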
Parametric methods
• Assume the data fits some model, estimate the model
parameters, and store only the parameters (e.g.,
regression and log-linear models)
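As an illustration, assuming simple linear regression as the parametric model, the raw points are replaced by two fitted numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=10_000)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)

# Store two parameters instead of 10,000 (x, y) pairs
slope, intercept = np.polyfit(x, y, deg=1)
y_approx = slope * x + intercept        # reconstruct approximate values
```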
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
• Methods
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization
Normalization
• Min-max normalization: to [new_minA, new_maxA]
  v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization: v′ = (v − μA) / σA
• Ex. From the data, μ = 54,000, σ = 16,000. Then $73,600 is mapped to
(73,600 − 54,000) / 16,000 = 1.225
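Both normalizations in a few lines of NumPy; the income values follow the slide's example:

```python
import numpy as np

v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization with the slide's mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
v_zscore = (v - mu) / sigma

print(v_minmax[2], v_zscore[2])   # ≈ 0.716 and 1.225 for $73,600
```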
Summary about data preparation
• Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data transformation and data discretization
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
• Data visualization is part art and part science.
• Position scales determine where in a graphic
different data values are located.
• 2-D visualizations: two numbers are required to
uniquely specify a point, so we need two position
scales.
• 3-D: three position scales
• Cartesian coordinates
Example: two axes representing two
different units
Example: Same unit & change in unit
Note:
• Use equal grids for same unit
• Cartesian coordinate systems are invariant under linear
transformations
• What if we want to visualize highly skewed data?
• Nonlinear axes
• Even spacing in data units corresponds to uneven spacing in the
visualization
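For example, matplotlib's log scale gives even visual spacing to multiplicative steps; the data here is synthetic:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 10, 100, 1_000, 10_000])    # highly skewed values
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, np.arange(x.size), "o")  # linear axis: small values bunch up
ax2.plot(x, np.arange(x.size), "o")
ax2.set_xscale("log")                # log axis: equal spacing per factor of 10
plt.show()
```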
• Curved axes
• Polar coordinates: pole, radius, polar angle
• Geospatial data
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
Color as a tool to distinguish
• We frequently use color to distinguish discrete
items or groups that do not have an intrinsic order,
such as different countries on a map or different
manufacturers of a certain product.
• In this case, we use a qualitative color scale. Such a
scale contains a finite set of specific colors that are
chosen to look clearly distinct from each other
while also being equivalent to each other.
• The second condition requires that no one color
should stand out relative to the others.
Color to represent numerical
values
• Color can also be used to represent data values, such as
income, temperature, or speed (continuous values)
• In this case, we use a sequential color scale. Such a
scale contains a sequence of colors that clearly indicate
• (i) which values are larger or smaller than which other ones
and
• (ii) how distant two specific values are from each other.
• Sequential scales can be based on a single hue (e.g.,
from dark blue to light blue) or on multiple hues (e.g.,
from dark red to light yellow).
Color as a tool to highlight
• There may be specific categories or values in the
dataset that carry key information about the story
we want to tell, and we can strengthen the story by
emphasizing the relevant figure elements to the
reader.
• This effect can be achieved with accent color scales,
which are color scales that contain both a set of
subdued colors and a matching set of stronger,
darker, and/or more saturated colors.
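These uses map onto matplotlib's colormap families (matplotlib 3.5+); a brief sketch where the colormap choices and values are illustrative:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

values = np.array([5.0, 3.0, 4.0, 6.0])
fig, (ax1, ax2) = plt.subplots(1, 2)

# Qualitative scale: distinct but "equivalent" colors for unordered groups
ax1.bar(range(4), values, color=mpl.colormaps["tab10"].colors[:4])

# Sequential scale: color lightness tracks the magnitude of the value
norm = (values - values.min()) / (values.max() - values.min())
ax2.bar(range(4), values, color=mpl.colormaps["viridis"](norm))
plt.show()
```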
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
Bar Plot/Chart
• Amounts are commonly visualized with vertical bars.
• A bar plot/chart
• presents categorical data
• with rectangular bars
• the bars’ heights or lengths are proportional to the
values that they represent.
• One axis of the chart shows the specific categories being
compared
• the other axis represents a measured value.
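A minimal vertical bar chart in matplotlib; the categories and values are invented:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]   # the specific categories being compared
values = [23, 17, 35, 29]           # the measured value per category

fig, ax = plt.subplots()
ax.bar(categories, values)          # bar heights proportional to the values
ax.set_ylabel("Measured value")
plt.show()
```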
• Regardless of whether we place bars vertically or
horizontally, we need to pay attention to the order
in which the bars are arranged.
• Whenever there is a natural ordering (i.e., when
our categorical variable is an ordered factor) we
should retain that ordering in the visualization.
Grouped bars
• When we are interested in two categorical variables
at the same time, we can visualize this dataset with
a grouped bar plot.
• we first draw a group of bars at each position along
the x axis, determined by one categorical variable
• then we draw bars within each group according to the
other categorical variable
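A grouped-bar sketch: one categorical variable sets the position along the x axis, the other selects the bar within each group (data invented):

```python
import matplotlib.pyplot as plt
import numpy as np

groups = ["2019", "2020", "2021"]                    # first categorical variable
series = {"East": [10, 12, 9], "West": [8, 11, 13]}  # second categorical variable

x = np.arange(len(groups))
width = 0.35
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(series.items()):
    ax.bar(x + i * width, vals, width, label=name)   # bars within each group
ax.set_xticks(x + width / 2, groups)
ax.legend()
plt.show()
```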
Stacked Bars
• Instead of drawing groups of bars side-by-side, it is
sometimes preferable to stack bars on top of each
other.
• Stacking is useful when the sum of the amounts
represented by the individual stacked bars is in itself
a meaningful amount.
• Stacked bar charts are designed to help you
simultaneously compare totals and notice sharp
changes at the item level that are likely to have the
most influence on movements in category totals.
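Stacking in matplotlib via the bottom argument, again with invented data; the full bar height shows the (meaningful) category total:

```python
import matplotlib.pyplot as plt
import numpy as np

quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = np.array([10, 12, 9, 14])
product_b = np.array([5, 7, 11, 6])

fig, ax = plt.subplots()
ax.bar(quarters, product_a, label="Product A")
ax.bar(quarters, product_b, bottom=product_a, label="Product B")  # stacked
ax.legend()
plt.show()
```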
Dot plots and heatmaps
• Bars are not the only option for visualizing
amounts.
• One important limitation of bars is that they need
to start at zero, so that the bar length is
proportional to the amount shown.
• When starting at zero is impractical (e.g., all values
are far from zero), we can instead indicate amounts by
placing dots at the appropriate locations along
the x or y axis.
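A dot-plot sketch with one dot per category and an axis that need not start at zero; the temperatures are invented:

```python
import matplotlib.pyplot as plt

cities = ["Oslo", "Cairo", "Lima", "Tokyo"]
temps = [5.7, 21.4, 18.5, 15.4]     # invented average temperatures (°C)

fig, ax = plt.subplots()
ax.scatter(temps, cities)           # one dot per category along the x axis
ax.set_xlim(4, 23)                  # unlike bars, need not start at zero
ax.set_xlabel("Temperature (°C)")
plt.show()
```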
Heatmap
• As an alternative to mapping data values onto
positions via bars or dots, we can map data values
onto colors. Such a figure is called a heatmap.
• Heat maps make it easy to visualize complex data
and understand it at a glance.
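A heatmap maps a matrix of values onto colors, e.g. with imshow; the matrix here is random:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).random((5, 8))   # rows × columns of values

fig, ax = plt.subplots()
im = ax.imshow(data, cmap="viridis")     # map each value onto a color
fig.colorbar(im, ax=ax, label="value")   # legend for the color encoding
plt.show()
```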
Internet adoption over time, for select countries. Color represents the percent
of internet users for the respective country and year. Countries were ordered
by percent internet users in 2016. Data source: World Bank
Internet adoption over time, for select countries. Countries were ordered by
the year in which their internet usage first exceeded 20%. Data source: World
Bank
A click map of user clicks on web vs. mobile app
Figure: Stock price over time for four
major tech companies. The stock price
for each company has been normalized
to equal 100 in June 2012.
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
Visualizing distributions:
Histograms and density plots
• How is a particular variable distributed in a
dataset?
• The age distribution among the passengers can be shown by
grouping all passengers into bins with comparable
ages and then counting the number of passengers
in each bin
Histogram
• A histogram displays the shape and spread of
continuous sample data.
Bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years.
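Histograms in matplotlib, varying the bin width to show its effect; the ages are simulated, not the slide's dataset:

```python
import matplotlib.pyplot as plt
import numpy as np

ages = np.random.default_rng(0).gamma(shape=2.0, scale=15.0, size=1000)

fig, axes = plt.subplots(1, 3)
for ax, width in zip(axes, [1, 5, 15]):
    bins = np.arange(0, ages.max() + width, width)
    ax.hist(ages, bins=bins)             # count of observations per bin
    ax.set_title(f"bin width = {width}")
plt.show()
```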
Density plot
• Visualize the underlying probability distribution of
the data by drawing an appropriate continuous
curve
• Probability Density
• A random variable x has a probability distribution f(x).
• The relationship between the outcomes of a random
variable and their probability is referred to as the
probability density, or simply the “density.”
• Note that we have two requirements on f(x): it must be
nonnegative everywhere, and the total area under the curve
must equal one
• Density estimation
• All we have access to is a sample of observations
• We must assume a probability distribution
The height of the curve is scaled such that the area under the curve equals
one. The density estimate was performed with a Gaussian kernel and a
bandwidth of 2.
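Kernel density estimation with SciPy; note that gaussian_kde's bw_method is a scaling factor relative to the sample's standard deviation, not an absolute bandwidth:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

ages = np.random.default_rng(0).gamma(shape=2.0, scale=15.0, size=1000)

kde = gaussian_kde(ages, bw_method=0.3)  # smaller factor -> wigglier curve
grid = np.linspace(0, ages.max(), 200)
plt.plot(grid, kde(grid))                # smooth curve; area under it is 1
plt.show()
```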
(a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel, bandwidth = 2;
(c) Gaussian kernel, bandwidth = 5; (d) Rectangular kernel, bandwidth = 2.
• Be careful with the tails
• Histogram and density plots
• both are highly intuitive and visually appealing
• both share the limitation that the resulting figure
depends to a substantial degree on parameters the user
has to choose, such as the bin width for histograms and
the bandwidth for density plots.
• both have to be considered as an interpretation of the
data rather than a direct visualization of the data itself.
Visualizing distributions: Empirical
cumulative distribution functions
and q-q plots
• Aggregate methods that highlight properties of the
distribution rather than the individual data points
• Require no arbitrary parameter choices
• Show all of the data at once
• A little less intuitive
Empirical cumulative
distribution function (ECDF)
• An ECDF is an estimator of the cumulative
distribution function (CDF).
• If you have a set of ordered samples X₁ ≤ X₂ ≤ … ≤ Xₙ from
an observed random variable, then the ECDF at a value x is
the fraction of samples less than or equal to x:
  Fₙ(x) = (number of Xᵢ ≤ x) / n
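Computing and plotting an ECDF directly from a sample; the exam scores are simulated:

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.default_rng(0).integers(0, 101, size=50)  # 50 exam scores

x = np.sort(scores)                    # rank students in ascending order
y = np.arange(1, x.size + 1) / x.size  # normalized rank: fraction of samples ≤ x
plt.step(x, y, where="post")           # ECDF rises by 1/n at each sample
plt.xlabel("points")
plt.ylabel("fraction of students")
plt.show()
```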
• Assume our hypothetical class has 50 students, and
the students just completed an exam on which they
could score between 0 and 100 points.
• How can we best visualize the class performance,
for example to determine appropriate grade
boundaries?
• A different way of thinking about this visualization
is the following:
• We can rank all students by the number of points they
obtained, in ascending order (so the student with the
fewest points receives the lowest rank and the student
with the most points the highest)
• Then plot the rank versus the actual points obtained.
• ECDF (not normalized)
• ECDF (normalized)
Highly skewed distributions
• Many empirical
datasets display highly
skewed distributions,
in particular with
heavy tails to the right,
and these distributions
can be challenging to
visualize.
Log Transformation
• Plotting the logarithm of the data (or using a log-scale
axis) compresses the long right tail and spreads out the
bulk of the distribution
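A sketch of transforming the data itself, as opposed to only changing the axis scale; the skewed data is simulated:

```python
import numpy as np
import matplotlib.pyplot as plt

skewed = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(skewed, bins=30)              # heavy right tail dominates the plot
ax2.hist(np.log10(skewed), bins=30)    # log-transformed: roughly symmetric
plt.show()
```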
Quantile–quantile plots
• Quantile–quantile (q-q) plots are a useful
visualization when we want to determine to what
extent the observed data points do or do not follow
a given distribution.
• q-q plots are also based on ranking the data and
visualizing the relationship between ranks and
actual values
• The ranks are used to predict where a given data
point should fall if the data were distributed
according to a specified reference distribution.
Example:
• Assume the data values have a mean of 10 and a
standard deviation of 3
• Assuming a normal distribution, we would expect
• a data point ranked at the 50th percentile to lie at
position 10 (the mean)
• a data point at the 84th percentile to lie at position 13
(one standard deviation above the mean)
•…
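A q-q plot against a normal reference using SciPy's probplot; the sample is simulated to match the example's mean 10 and standard deviation 3:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.default_rng(0).normal(loc=10, scale=3, size=200)

# probplot predicts, from each rank, where the point should fall under a
# normal reference distribution, then plots predicted vs. observed values
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```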
Amounts
Distributions
Proportions
Relationships
Geospatial data
Uncertainty