Univariate and Multivariate Data Exploration

Uploaded by

anugrahrk6

1. DATA SETS
● Probably the most popular dataset used to learn
data science is the Iris dataset, introduced by
Ronald Fisher in his seminal work on discriminant
analysis, “The use of multiple measurements in
taxonomic problems” (Fisher, 1936).
● The Iris dataset contains 150 observations of
three different species:
● Iris setosa, Iris virginica, and Iris versicolor, with 50
observations each.
● Each observation consists of four attributes: sepal
length, sepal width, petal length, and petal width.
● The fifth attribute, the label, is the name of the
species observed.
1.1 Types of Data
● Data come in different formats and types.
● For example, the temperature in weather data
can be expressed in any of the following
formats:
– Numeric centigrade (31 C, 33.3 C) or Fahrenheit (100
F, 101.45 F) or on the Kelvin scale
– Ordered labels as in hot, mild, or cold
– Number of days within a year below 0 C (10 days in a
year below freezing)
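As a rough illustration of the three formats above, the sketch below expresses the same temperature readings as numeric values on different scales, as ordered labels, and as a count of days below freezing. The readings and the label thresholds (`cold_below`, `hot_from`) are hypothetical values chosen for the example, not part of the original slides.

```python
# Hypothetical daily temperature readings in degrees Celsius (illustrative only).
readings_c = [31.0, 33.3, -2.5, 12.0, -0.5]


def to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def to_kelvin(celsius):
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


def to_label(celsius, cold_below=10.0, hot_from=25.0):
    """Map a reading to an ordered label; the thresholds are assumptions."""
    if celsius < cold_below:
        return "cold"
    if celsius >= hot_from:
        return "hot"
    return "mild"


# Format 1: numeric values on different scales.
fahrenheit = [to_fahrenheit(c) for c in readings_c]
kelvin = [to_kelvin(c) for c in readings_c]

# Format 2: ordered labels.
labels = [to_label(c) for c in readings_c]

# Format 3: number of days below 0 C.
days_below_freezing = sum(1 for c in readings_c if c < 0)

print(fahrenheit, kelvin, labels, days_below_freezing)
```

Each representation keeps a different amount of detail: the labels and the day count cannot be converted back into the original readings.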
Types of Data Contd...
● Numerical or Continuous
– Continuous values can be denoted by numbers and take
an infinite number of values between digits.
– An integer is a special form of the numeric data type that
has no decimals in its value or, more precisely, does
not have infinite values between consecutive numbers.
– If a zero point is defined, numeric data become a ratio or
real data type. Examples include temperature on the Kelvin
scale, bank account balance, and income.
Categorical or Nominal
● Categorical data types are attributes treated as
distinct symbols or just names. The color of the iris
of the human eye is a categorical data type because
it takes a value like black, green, blue, gray, etc.
● An ordered nominal data type is a special case of a
categorical data type where there is some kind of
order among the values.
● An example of an ordered data type is temperature
expressed as hot, mild, cold.
● Not all data science tasks can be performed on all data
types.
● For example, the neural network algorithm does not work
directly with categorical data.
● However, one data type can be converted to another
using a type conversion process, although this may be
accompanied by a loss of information.
● For example, credit scores expressed in poor, average,
good, and excellent categories can be converted to
1, 2, 3, and 4.
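The credit score conversion described above can be sketched as a simple lookup. The variable names and the sample list `scores` are hypothetical and only serve the illustration.

```python
# Ordered categories from the text, listed from lowest to highest.
CREDIT_ORDER = ["poor", "average", "good", "excellent"]

# Map each category to its rank: poor -> 1, ..., excellent -> 4.
CATEGORY_TO_RANK = {name: rank for rank, name in enumerate(CREDIT_ORDER, start=1)}

# Inverse mapping, used to turn ranks back into category names.
RANK_TO_CATEGORY = {rank: name for name, rank in CATEGORY_TO_RANK.items()}

# Hypothetical sample of categorical credit scores (illustrative only).
scores = ["good", "poor", "excellent", "average", "good"]

encoded = [CATEGORY_TO_RANK[s] for s in scores]
decoded = [RANK_TO_CATEGORY[r] for r in encoded]

print(encoded)
print(decoded)
```

Coding the categories as 1 to 4 keeps their order but also implies equal spacing between them, which the original labels do not state. Going the other way, from a detailed numeric score to four categories, would discard the differences between scores that fall in the same category.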
2. DESCRIPTIVE STATISTICS
● Some examples of descriptive statistics
include average annual income, median
home price in a neighborhood, range of
credit scores of a population, etc.
● Descriptive statistics can be broadly
classified into univariate and multivariate
exploration depending on the number of
attributes under analysis.
2.1 Univariate Exploration
● Univariate data exploration denotes
analysis of one attribute at a time
● Measure of Central Tendency
● The objective of finding the central location of an attribute is to quantify
the dataset with one central or most common number.
● Mean: The mean is the arithmetic average of all observations in the
dataset. It is calculated by summing all the data points and dividing by
the number of data points.
● Median: The median is the value of the central point in the
distribution. The median is calculated by sorting all the observations from
small to large and selecting the mid-point observation in the sorted list. If
the number of data points is even, then the average of the middle two
data points is used as the median.
● Mode: The mode is the most frequently occurring observation. In the
dataset, data points may repeat, and the most frequently repeated data point
is the mode of the dataset.
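The three measures above can be sketched with Python's standard `statistics` module. The list `values` is a hypothetical attribute with an even number of data points, chosen so that the median is the average of the two middle values.

```python
import statistics

# Hypothetical attribute values (illustrative only).
values = [2, 3, 3, 5, 7, 10]

# Mean: sum of all data points divided by the number of data points.
mean_value = statistics.mean(values)

# Median: middle of the sorted list; with six points it is the
# average of the two middle values, 3 and 5.
median_value = statistics.median(values)

# Mode: the most frequently occurring value.
mode_value = statistics.mode(values)

print(mean_value, median_value, mode_value)
```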
Measure of Spread
● Range: The range is the difference between the maximum value
and the minimum value of the attribute.
● The range is simple to calculate and articulate but has shortcomings,
as it is severely impacted by the presence of outliers and fails to
consider the distribution of all other data points in the attribute.
● Deviation: The variance and standard deviation measure the
spread by considering all the values of the attribute.
● Deviation is simply measured as the difference between any given
value (x_i) and the mean of the sample (μ). The variance is the sum
of the squared deviations of all data points divided by the number of
data points.
● Standard deviation is the square root of the variance. Since
the standard deviation is measured in the same units as the
attribute, it is easy to understand the magnitude of the metric.
● High standard deviation means the data points are spread
widely around the central point.
● Low standard deviation means data points are closer to the
central point.
● If the distribution of the data aligns with the normal
distribution, then 68% of the data points lie within one
standard deviation from the mean.
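A minimal sketch of the spread measures above, following the definition in the text (squared deviations divided by the number of data points, i.e. the population variance). The list `values` is hypothetical. Note that many tools divide by n - 1 instead to estimate the variance from a sample; that variant is `statistics.variance` in Python.

```python
import math
import statistics

# Hypothetical attribute values (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: maximum minus minimum.
value_range = max(values) - min(values)

# Variance as defined in the text: mean of the squared deviations.
mu = sum(values) / len(values)
variance = sum((x - mu) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance, in the attribute's units.
std_dev = math.sqrt(variance)

# The standard library gives the same results for the population versions.
assert math.isclose(variance, statistics.pvariance(values))
assert math.isclose(std_dev, statistics.pstdev(values))

print(value_range, variance, std_dev)
```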
2.2 Multivariate Exploration
● Multivariate exploration is the study of more
than one attribute in the dataset
simultaneously. This technique is critical to
understanding the relationship between the
attributes.
● Similar to univariate explorations, the
measure of central tendency and variance
in the data will be discussed.
2.2.1 Central Data Point
● In the Iris dataset, each data point, as a set of all four attributes, can be
expressed as observation i: {sepal length, sepal width, petal length, petal width}
● For example, observation one: {5.1, 3.5, 1.4, 0.2}.
● This observation point can also be expressed in four-dimensional Cartesian
coordinates and can be plotted in a graph (although plotting more than three
dimensions in a visual graph can be challenging).
● In this way, all 150 observations can be expressed in Cartesian coordinates. If
the objective is to find the most “typical” observation point, it would be a data
point made up of the mean of each attribute in the dataset independently. For
all 150 observations of the Iris dataset, the central mean point is approximately
{5.8, 3.1, 3.8, 1.2}. The frequently quoted point {5.006, 3.418, 1.464, 0.244} is
the mean of the 50 Iris setosa observations only.
● This data point may not be an actual observation. It will be a hypothetical data
point with the most typical attribute values.
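The central data point can be sketched by averaging each attribute independently. The example uses only three observations (the first one is the observation quoted above), so the result illustrates the method and is not the central point of the full dataset.

```python
from statistics import mean

# A few observations in the order
# (sepal length, sepal width, petal length, petal width).
# Only a tiny illustrative subset, not all 150 rows.
observations = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2),
]

# zip(*observations) regroups the values by attribute (column),
# so each attribute is averaged on its own.
central_point = tuple(mean(column) for column in zip(*observations))

# The central point is hypothetical: it need not match any real observation.
is_actual_observation = central_point in observations

print(central_point, is_actual_observation)
```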
2.2.2 Correlation
● Correlation measures the statistical
relationship between two attributes,
particularly dependence of one attribute on
another attribute. When two attributes are
highly correlated with each other, they both
vary at the same rate with each other either
in the same or in opposite directions.
● Pearson correlation coefficient
● Correlation between two attributes is
commonly measured by the Pearson
correlation coefficient (r), which measures
the strength of linear dependence.
● Correlation coefficients take a value in the range
-1 <= r <= 1
● A value closer to 1 or -1 indicates the two
attributes are highly correlated, with perfect
correlation at 1 or -1. A correlation value of 0
means there is no linear relationship
between the two attributes.
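A minimal sketch of the Pearson correlation coefficient, using the standard formula: the sum of the products of paired deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the sample lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each attribute.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical attribute values (illustrative only).
x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]      # rises with x
opposite_direction = [10, 8, 6, 4, 2]  # falls as x rises
no_linear_trend = [2, 1, 0, 1, 2]      # symmetric around the middle of x

print(pearson_r(x, same_direction))
print(pearson_r(x, opposite_direction))
print(pearson_r(x, no_linear_trend))
```

The last case shows why r = 0 only rules out a linear relationship: the values clearly depend on x, but not along a straight line.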
Scatter Plot
Figure: Scatter multiple plot of the Iris dataset
● Bubble Chart
● A bubble chart is a variation of a simple
scatterplot with the addition of one more
attribute, which is used to determine the size
of the data point.
● In the Iris dataset, petal length and petal width
are used for the x-axis and y-axis, respectively, and
sepal width is used for the size of the data
point. The color of the data point represents the
species class label.
Figure: Bubble chart of the Iris data
Density Chart
● Density charts are similar to the scatterplots, with
one more dimension included as a background
color. The data point can also be colored to
visualize one dimension, and hence, a total of four
dimensions can be visualized in a density chart.
● In the example figure, petal length is used for the
x-axis, sepal length for the y-axis, sepal width for
the background color, and class label for the data
point color.
Distribution Chart
● For continuous numeric attributes like petal
length, the normal distribution function of the
attribute can be visualized instead of the actual
data in the sample.
● If a dataset exhibits normal distribution, then
68.2% of data points will fall within one
standard deviation from the mean; 95.4% of
the points will fall within 2σ and 99.7% within
3σ of the mean.
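The percentages above follow from the normal distribution itself: the fraction of values within k standard deviations of the mean equals erf(k / sqrt(2)). The helper name `fraction_within` is hypothetical; `math.erf` is part of the Python standard library.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Print the coverage for 1, 2, and 3 standard deviations as percentages.
for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k) * 100:.1f}%")
```

These figures hold only when the attribute is close to normally distributed, which should be checked before relying on them for a real dataset.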