Week 2
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Cleaning
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing errors or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill it in automatically with
• a global constant, e.g., “unknown” (in effect a new class!)
• the attribute mean/median/mode
• the attribute mean/median/mode for all samples belonging
to the same class: smarter
• the most probable value: inference-based such as Bayesian
formula or decision tree
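A minimal sketch of these fill-in strategies with pandas; the DataFrame, column, and class-label names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],       # class label
    "salary": [50.0, None, 40.0, None, 60.0],  # attribute with missing values
})

# Global constant: replace missing values with a sentinel value
df["salary_const"] = df["salary"].fillna(-1)

# Attribute mean over all samples
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# Attribute mean for all samples belonging to the same class (smarter)
df["salary_cls_mean"] = df["salary"].fillna(
    df.groupby("cls")["salary"].transform("mean")
)
```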
How to Handle Noisy Data?
• Binning
• first sort data and partition into bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
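A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the values are illustrative:

```python
import numpy as np

# First sort the data, then partition into equal-frequency (equal-depth) bins
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(data, 3)

# Smooth by bin means: every value in a bin becomes the bin mean
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smooth by bin boundaries: snap each value to the nearer bin boundary
by_bounds = np.concatenate(
    [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]
)
```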
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• …
• Numerosity reduction (some simply call it: Data Reduction)
• Histograms, clustering, sampling
• Regression and Log-Linear Models
• …
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data analysis
• Allow easier visualization
• Dimensionality reduction techniques
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
• …
Geometric interpretation of PCA
• Which variable is the principal variable?
• Max variance
• The first PC is the direction of maximum variance from the
origin
• Subsequent PCs are orthogonal to the preceding PCs and
describe the maximum residual variance
PCA transformation
• Coefficients of the linear combination that transforms the
observations onto the PCs are formed by the eigenvectors of
the covariance matrix
• Covariance matrix (3 dimensions):

  Σ = | Var(x)    Cov(x,y)  Cov(x,z) |
      | Cov(y,x)  Var(y)    Cov(y,z) |
      | Cov(z,x)  Cov(z,y)  Var(z)   |
PCA Algorithm
• Input: Data Matrix
• Step 1: Normalize data matrix
• Step 2: Get Covariance Matrix
• Step 3: Calculate the eigenvectors and eigenvalues of
the covariance matrix
• Step 4: Sort the eigenvectors: take the
eigenvalues λ₁, λ₂, …, λp and sort them from largest
to smallest
• Step 5: Choose the first k eigenvectors and calculate
the new features
• Project the standardized points in the new feature
space
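The five steps in compact NumPy; the data matrix X and the choice of k are placeholders:

```python
import numpy as np

def pca(X, k):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # Step 1: normalize
    C = np.cov(Z, rowvar=False)               # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # Step 3: eigen-decomposition
    order = np.argsort(eigvals)[::-1]         # Step 4: sort largest-first
    eigvecs = eigvecs[:, order]
    return Z @ eigvecs[:, :k]                 # Step 5: project onto first k PCs

X = np.random.default_rng(0).normal(size=(100, 5))
scores = pca(X, k=2)   # 100 standardized points in the new 2-D feature space
```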
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in
one or more other attributes
• E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA
• Task: Supervised (Classification/Regression)
• Iterative process
• Subset generation
• Subset selection
• Termination condition
• Search
• Forward
• Backward
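One way to realize this iterative forward/backward search is scikit-learn's SequentialFeatureSelector; the estimator and the synthetic dataset here are stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Forward search: start empty, greedily add the attribute that most improves
# cross-validated performance; terminate at the requested subset size.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # direction="backward" starts full and removes
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of selected attributes
```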
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative,
smaller forms of data representation
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …
• Parametric methods
• Assume a model; store only the model parameters
instead of the actual data
Sampling
• Choose a representative subset of the data
• Sampling can be done with replacement or without
replacement
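A quick sketch of both variants with NumPy (the dataset and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100_000)                 # stand-in for a large dataset

without = rng.choice(data, size=10_000, replace=False)  # no duplicates
with_r  = rng.choice(data, size=10_000, replace=True)   # duplicates possible
```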
Parametric methods
• Assume the data fits some model, estimate the model
parameters, and store only the parameters (e.g.,
regression and log-linear models)
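As an illustration, assuming simple linear regression as the parametric model, the raw points are replaced by two fitted numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=10_000)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)

# Store two parameters instead of 10,000 (x, y) pairs
slope, intercept = np.polyfit(x, y, deg=1)
y_approx = slope * x + intercept        # reconstruct approximate values
```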
Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
• Methods
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization
Normalization
• Min-max normalization: to [new_minA, new_maxA]
  v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization: v′ = (v − μA) / σA
• Ex. From the data, μ = 54,000, σ = 16,000. Then $73,600 is mapped to
(73,600 − 54,000) / 16,000 = 1.225
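Both normalizations in a few lines of NumPy; the income values follow the slide's example:

```python
import numpy as np

v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization with the slide's mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
v_zscore = (v - mu) / sigma

print(v_minmax[2], v_zscore[2])   # ≈ 0.716 and 1.225 for $73,600
```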
Summary about data preparation
• Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data transformation and data discretization
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
• Data visualization is part art and part science.
• Position scales determine where in a graphic
different data values are located.
• 2-D visualizations: two numbers are required to
uniquely specify a point, so we need two position
scales.
• 3-D: three position scales
• Cartesian coordinates
Example: two axes representing two
different units
Example: Same unit & change in unit
Note:
• Use equal grids for same unit
• Cartesian coordinate systems are invariant under linear
transformations
• What if we want to visualize highly skewed data?
• Nonlinear axes
• Even spacing in data units corresponds to uneven spacing in the
visualization
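For example, matplotlib's log scale gives even visual spacing to multiplicative steps; the data here is synthetic:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 10, 100, 1_000, 10_000])    # highly skewed values
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, np.arange(x.size), "o")  # linear axis: small values bunch up
ax2.plot(x, np.arange(x.size), "o")
ax2.set_xscale("log")                # log axis: equal spacing per factor of 10
plt.show()
```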
• Curved axes
• Polar coordinates: pole, radius, polar angle
• Geospatial data
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
Color as a tool to distinguish
• We frequently use color to distinguish discrete
items or groups that do not have an intrinsic order,
such as different countries on a map or different
manufacturers of a certain product.
• In this case, we use a qualitative color scale. Such a
scale contains a finite set of specific colors that are
chosen to look clearly distinct from each other
while also being equivalent to each other.
• The second condition requires that no one color
should stand out relative to the others.
Color to represent numerical
values
• Color can also be used to represent data values, such as
income, temperature, or speed (continuous values)
• In this case, we use a sequential color scale. Such a
scale contains a sequence of colors that clearly indicate
• (i) which values are larger or smaller than which other ones
and
• (ii) how distant two specific values are from each other.
• Sequential scales can be based on a single hue (e.g.,
from dark blue to light blue) or on multiple hues (e.g.,
from dark red to light yellow).
Color as a tool to highlight
• There may be specific categories or values in the
dataset that carry key information about the story
we want to tell, and we can strengthen the story by
emphasizing the relevant figure elements to the
reader.
• This effect can be achieved with accent color scales,
which are color scales that contain both a set of
subdued colors and a matching set of stronger,
darker, and/or more saturated colors.
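These uses map onto matplotlib's colormap families (matplotlib 3.5+); a brief sketch where the colormap choices and values are illustrative:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

values = np.array([5.0, 3.0, 4.0, 6.0])
fig, (ax1, ax2) = plt.subplots(1, 2)

# Qualitative scale: distinct but "equivalent" colors for unordered groups
ax1.bar(range(4), values, color=mpl.colormaps["tab10"].colors[:4])

# Sequential scale: color lightness tracks the magnitude of the value
norm = (values - values.min()) / (values.max() - values.min())
ax2.bar(range(4), values, color=mpl.colormaps["viridis"](norm))
plt.show()
```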
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
Bar Plot/Chart
• Amounts are commonly visualized with vertical bars.
• A bar plot/chart
• presents categorical data
• with rectangular bars
• the bars’ heights or lengths are proportional to the
values that they represent.
• One axis of the chart shows the specific categories being
compared
• the other axis represents a measured value.
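A minimal vertical bar chart in matplotlib; the categories and values are invented:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]   # the specific categories being compared
values = [23, 17, 35, 29]           # the measured value per category

fig, ax = plt.subplots()
ax.bar(categories, values)          # bar heights proportional to the values
ax.set_ylabel("Measured value")
plt.show()
```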
• Regardless of whether we place bars vertically or
horizontally, we need to pay attention to the order
in which the bars are arranged.
• Whenever there is a natural ordering (i.e., when
our categorical variable is an ordered factor) we
should retain that ordering in the visualization.
Grouped bars
• When we are interested in two categorical variables
at the same time, we can visualize this dataset with
a grouped bar plot.
• we first draw a group of bars at each position along
the x axis, determined by one categorical variable
• then we draw bars within each group according to the
other categorical variable
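A grouped-bar sketch: one categorical variable sets the position along the x axis, the other selects the bar within each group (data invented):

```python
import matplotlib.pyplot as plt
import numpy as np

groups = ["2019", "2020", "2021"]                    # first categorical variable
series = {"East": [10, 12, 9], "West": [8, 11, 13]}  # second categorical variable

x = np.arange(len(groups))
width = 0.35
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(series.items()):
    ax.bar(x + i * width, vals, width, label=name)   # bars within each group
ax.set_xticks(x + width / 2, groups)
ax.legend()
plt.show()
```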
Stacked Bars
• Instead of drawing groups of bars side-by-side, it is
sometimes preferable to stack bars on top of each
other.
• Stacking is useful when the sum of the amounts
represented by the individual stacked bars is in itself
a meaningful amount.
• Stacked bar charts are designed to help you
simultaneously compare totals and notice sharp
changes at the item level that are likely to have the
most influence on movements in category totals.
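Stacking in matplotlib via the bottom argument, again with invented data; the full bar height shows the (meaningful) category total:

```python
import matplotlib.pyplot as plt
import numpy as np

quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = np.array([10, 12, 9, 14])
product_b = np.array([5, 7, 11, 6])

fig, ax = plt.subplots()
ax.bar(quarters, product_a, label="Product A")
ax.bar(quarters, product_b, bottom=product_a, label="Product B")  # stacked
ax.legend()
plt.show()
```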
Dot plots and heatmaps
• Bars are not the only option for visualizing
amounts.
• One important limitation of bars is that they need
to start at zero, so that the bar length is
proportional to the amount shown.
• When starting at zero is impractical (e.g., all values
are far from zero), we can instead indicate amounts by
placing dots at the appropriate locations along
the x or y axis.
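A dot-plot sketch with one dot per category and an axis that need not start at zero; the temperatures are invented:

```python
import matplotlib.pyplot as plt

cities = ["Oslo", "Cairo", "Lima", "Tokyo"]
temps = [5.7, 21.4, 18.5, 15.4]     # invented average temperatures (°C)

fig, ax = plt.subplots()
ax.scatter(temps, cities)           # one dot per category along the x axis
ax.set_xlim(4, 23)                  # unlike bars, need not start at zero
ax.set_xlabel("Temperature (°C)")
plt.show()
```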
Heatmap
• As an alternative to mapping data values onto
positions via bars or dots, we can map data values
onto colors. Such a figure is called a heatmap.
• Heat maps make it easy to visualize complex data
and understand it at a glance.
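A heatmap maps a matrix of values onto colors, e.g. with imshow; the matrix here is random:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).random((5, 8))   # rows × columns of values

fig, ax = plt.subplots()
im = ax.imshow(data, cmap="viridis")     # map each value onto a color
fig.colorbar(im, ax=ax, label="value")   # legend for the color encoding
plt.show()
```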
Internet adoption over time, for select countries. Color represents the percent
of internet users for the respective country and year. Countries were ordered
by percent internet users in 2016. Data source: World Bank
Internet adoption over time, for select countries. Countries were ordered by
the year in which their internet usage first exceeded 20%. Data source: World
Bank
A click map of user clicks on web vs. mobile app
Figure: Stock price over time for four
major tech companies. The stock price
for each company has been normalized
to equal 100 in June 2012.
Data Visualization Outline
• Visualizing data: Mapping data onto aesthetics
Visualizing distributions:
Histograms and density plots
• How is a particular variable distributed in a
dataset?
• The age distribution among the passengers can be shown by
grouping all passengers into bins with comparable
ages and then counting the number of passengers
in each bin
Histogram
• A histogram displays the shape and spread of
continuous sample data.
Bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years.
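Histograms in matplotlib, varying the bin width to show its effect; the ages are simulated, not the slide's dataset:

```python
import matplotlib.pyplot as plt
import numpy as np

ages = np.random.default_rng(0).gamma(shape=2.0, scale=15.0, size=1000)

fig, axes = plt.subplots(1, 3)
for ax, width in zip(axes, [1, 5, 15]):
    bins = np.arange(0, ages.max() + width, width)
    ax.hist(ages, bins=bins)             # count of observations per bin
    ax.set_title(f"bin width = {width}")
plt.show()
```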
Density plot
• Visualize the underlying probability distribution of
the data by drawing an appropriate continuous
curve
• Probability Density
• A random variable x has a probability distribution f(x).
• The relationship between the outcomes of a random
variable and their probability is referred to as the
probability density, or simply the “density.”
• Note that we have two requirements on f(x): it must be
nonnegative everywhere, and the total area under the curve
must equal one
• Density estimation
• All we have access to is a sample of observations
• We must assume a probability distribution
The height of the curve is scaled such that the area under the curve equals
one. The density estimate was performed with a Gaussian kernel and a
bandwidth of 2.
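Kernel density estimation with SciPy; note that gaussian_kde's bw_method is a scaling factor relative to the sample's standard deviation, not an absolute bandwidth:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

ages = np.random.default_rng(0).gamma(shape=2.0, scale=15.0, size=1000)

kde = gaussian_kde(ages, bw_method=0.3)  # smaller factor -> wigglier curve
grid = np.linspace(0, ages.max(), 200)
plt.plot(grid, kde(grid))                # smooth curve; area under it is 1
plt.show()
```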
(a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel, bandwidth = 2;
(c) Gaussian kernel, bandwidth = 5; (d) Rectangular kernel, bandwidth = 2.
• Be careful with the tails
• Histogram and density plots
• both are highly intuitive and visually appealing
• both share the limitation that the resulting figure
depends to a substantial degree on parameters the user
has to choose, such as the bin width for histograms and
the bandwidth for density plots.
• both have to be considered as an interpretation of the
data rather than a direct visualization of the data itself.
Visualizing distributions: Empirical
cumulative distribution functions
and q-q plots
• Aggregate methods that highlight properties of the
distribution rather than the individual data points
• Require no arbitrary parameter choices
• Show all of the data at once
• A little less intuitive
Empirical cumulative
distribution function (ECDF)
• An ECDF is an estimator of the cumulative
distribution function (CDF).
• If you have a set of ordered samples X₁ ≤ X₂ ≤ … ≤ Xₙ from
an observed random variable, then the ECDF at a value x is
the fraction of samples less than or equal to x:
  Fₙ(x) = (number of Xᵢ ≤ x) / n
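Computing and plotting an ECDF directly from a sample; the exam scores are simulated:

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.default_rng(0).integers(0, 101, size=50)  # 50 exam scores

x = np.sort(scores)                    # rank students in ascending order
y = np.arange(1, x.size + 1) / x.size  # normalized rank: fraction of samples ≤ x
plt.step(x, y, where="post")           # ECDF rises by 1/n at each sample
plt.xlabel("points")
plt.ylabel("fraction of students")
plt.show()
```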
• Assume our hypothetical class has 50 students, and
the students just completed an exam on which they
could score between 0 and 100 points.
• How can we best visualize the class performance,
for example to determine appropriate grade
boundaries?
• A different way of thinking about this visualization
is the following:
• We can rank all students by the number of points they
obtained, in ascending order (so the student with the
fewest points receives the lowest rank and the student
with the most points the highest)
• Then plot the rank versus the actual points obtained.
• ECDF (not normalized)
• ECDF (normalized)
Highly skewed distributions
• Many empirical
datasets display highly
skewed distributions,
in particular with
heavy tails to the right,
and these distributions
can be challenging to
visualize.
Log Transformation
• Plotting the logarithm of the data (or using a log-scale
axis) compresses the long right tail and spreads out the
bulk of the distribution
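A sketch of transforming the data itself, as opposed to only changing the axis scale; the skewed data is simulated:

```python
import numpy as np
import matplotlib.pyplot as plt

skewed = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(skewed, bins=30)              # heavy right tail dominates the plot
ax2.hist(np.log10(skewed), bins=30)    # log-transformed: roughly symmetric
plt.show()
```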
Quantile–quantile plots
• Quantile–quantile (q-q) plots are a useful
visualization when we want to determine to what
extent the observed data points do or do not follow
a given distribution.
• q-q plots are also based on ranking the data and
visualizing the relationship between ranks and
actual values
• The ranks are used to predict where a given data
point should fall if the data were distributed
according to a specified reference distribution.
Example:
• Assume the data values have a mean of 10 and a
standard deviation of 3
• Assuming a normal distribution, we would expect
• a data point ranked at the 50th percentile to lie at
position 10 (the mean)
• a data point at the 84th percentile to lie at position 13
(one standard deviation above the mean)
•…
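A q-q plot against a normal reference using SciPy's probplot; the sample is simulated to match the example's mean 10 and standard deviation 3:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.default_rng(0).normal(loc=10, scale=3, size=200)

# probplot predicts, from each rank, where the point should fall under a
# normal reference distribution, then plots predicted vs. observed values
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```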
Amounts
Distributions
Proportions
Relationships
Geospatial data
Uncertainty