
Big Data Analysis

Lec. 5
Dr. Mona Abbass
Content
The data cycle
The data pipeline
Data engineering
Data preparation
Data cleansing
The data cycle
Acquiring and representing data
 A data analysis pipeline consists of four stages.

Two simple means of representing complex, structured data:
 The table
 The document
Representing structured data:
tables
 The table is a very common schema for representing structured data.
 According to the W3C draft Model for Tabular Data and Metadata on the Web (W3C, 2015):
 Metadata is data about the dataset itself.
 Tabular data is data that is structured into rows, each of which
contains information about some thing.
 Each row contains the same number of cells (although some of
these cells may be empty), which provide values of properties of
the thing described by the row.
Representing structured data:
tables
 In tabular data, cells within the same column provide values for the same property of the thing described by the particular row.
 This is what differentiates tabular data from other line-oriented formats.
 According to the W3C model, then, a table must contain at least one column and at least one row.
 Spreadsheets use worksheets of two-dimensional, cell-based tabular displays.
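As a hedged illustration of the W3C description above (not part of the original lecture), the short Python sketch below builds a tiny table in which each row describes one thing and each column holds one property of that thing; the names and values are invented for the example.

import pandas as pd  # assumes pandas is available

# Each row describes one thing; each column is a property of that thing.
people = pd.DataFrame([
    {"name": "John Smith", "dob": "1985-03-12", "postcode": "M1 1AE"},
    {"name": "J. Smith",   "dob": "1985-03-12", "postcode": None},   # a cell may be empty
])

print(people)                 # two rows, three columns
print(list(people.columns))   # every row shares the same set of properties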
Representing tables in web pages
Tabular data can be represented in two forms within a web page,
both of which allow the browser and its plug-ins to handle the
interaction between the logical and physical aspects of the data.
The two forms are:
 the HTML <table> element
 a JavaScript data object
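As a hedged sketch of the second form, a JavaScript data object is usually exchanged as JSON, which other languages can load back into rows and columns (the HTML <table> form can likewise be parsed, e.g. with pandas.read_html). The field names and values below are invented for illustration.

import json

# A JavaScript-style data object serialized as JSON (invented example values).
payload = '[{"name": "John Smith", "postcode": "M1 1AE"}, {"name": "J. Smith", "postcode": null}]'

rows = json.loads(payload)        # a list of dicts: one dict per row
for row in rows:
    print(row["name"], row["postcode"])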
Representing structured data:
documents
 Data scientists use the term ‘document’ to mean any file or representation that embodies a particular data record.
 Books are usually divided into chapters, or sections; they may contain illustrations, footnotes, endnotes, tables of contents, indexes and special headings. They may employ a single typeface, or a variety of them. We can regard all of these, first of all, as structural data, none of which could be captured in a simple sequence of Unicode characters.
 So, how can the structure of our data be captured?
 The most widespread way of capturing document structure is
markup.
 To preserve the semantics of each element we need a markup language such as XML.
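As a minimal, hedged sketch of markup capturing structure that a plain character sequence would lose, the following uses Python's standard xml.etree.ElementTree; the element names are invented for illustration.

import xml.etree.ElementTree as ET

doc = """
<book>
  <chapter title="Introduction">
    <heading>Why structure matters</heading>
    <paragraph>A plain sequence of characters cannot mark this line as a heading.</paragraph>
  </chapter>
</book>
"""

root = ET.fromstring(doc)
for chapter in root.findall("chapter"):
    # The markup preserves the semantics: headings are still distinguishable from paragraphs.
    print(chapter.get("title"), "->", chapter.find("heading").text)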
The data pipeline
 In practice, activities that appear neatly separated are actually combined or revisited.
 For example, after initial preparation and some early analysis
there may be a need to identify and acquire more data, which
will itself require preparation and analysis.
Data engineering
 ‘The multi-disciplinary practice of engineering computing systems, computer software, or extracting information, partly through the analysis of data’ (Buntine, 1997).
Data engineering
The tasks of Data engineers include:
 Collecting data over space and time
 Cleaning it of errors
 Anonymizing it (removing identifying information from it)
 Filtering it
 Representing it so that it can be exported from one system and
imported into others
 Sorting and storing it across distributed systems
 Shaping it into forms that allow it to be analyzed
 Visualizing it.
 Must respect legal and ethical concerns
Data preparation
 Purpose:
1. Convert acquired ‘raw’ datasets into valid, consistent data,
using structures and representations that will make analysis
straightforward.
 Initial Steps:
1. Explore the content, values and the overall shape of the data.
2. Determine the purpose for which the data will be used.
3. Determine the type and aims of the analysis to be applied to it.
Data preparation
 Possible discovered problems with real data:
1. Data is wrongly packaged
2. Some values may not make sense
3. Some values may be missing
4. The format doesn’t seem right
5. The data doesn’t have the right structure for the tools and
packages to be used with it, for example, it might be represented in
an XML schema, and a CSV format is required.
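Problem 5 is common enough to show concretely. The hedged sketch below converts a small XML dataset into the CSV a downstream tool expects; the element names, values and output file name are assumptions for illustration only.

import csv
import xml.etree.ElementTree as ET

xml_data = """
<people>
  <person><name>John Smith</name><postcode>M1 1AE</postcode></person>
  <person><name>J. Smith</name><postcode>M60 9HP</postcode></person>
</people>
"""

root = ET.fromstring(xml_data)
with open("people.csv", "w", newline="") as f:    # output file name is illustrative
    writer = csv.writer(f)
    writer.writerow(["name", "postcode"])         # header row
    for person in root.findall("person"):
        writer.writerow([person.findtext("name"), person.findtext("postcode")])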
Data preparation
 Activities:
1. Data cleansing: remove or repair obvious errors and inconsistencies in
the dataset
2. Data integration: combining datasets
3. Data transformation: shaping datasets

 Terms like data harmonization and data enhancement are also used.

 Note:
1. Some of the techniques used in data preparation – especially in
transformation and integration – are also used to manipulate data
during analysis
2. Conversely, some analysis techniques are also used in data preparation
Data cleansing
Is the process of:
 detecting and correcting errors in a dataset.
 It can even mean removing irrelevant parts of the data; we will look at this later in the section.
 Having found errors – incomplete, incorrect, inaccurate or
irrelevant data – a decision must be made about how to handle
them.
Data cleansing headaches
Errors can be introduced into data in many ways:
 user input mistakes
 transport errors
 conversion between representations
 disagreements about the meaning of data elements
Some error types:
 Incorrect formats
 Incorrect structures
 Inaccurate values – these can be the hardest to identify and correct without additional data or complex checking processes. (Is ‘Jean Smit’ the real name of a person in a survey?)
Data cleansing headaches
Most operational systems try to keep ‘dirty’ data out of the data
store, by:
 Input validation
 database constraints
 error checking
However, despite these efforts, errors will still occur.
Dirty data refers to data that is inaccurate, incomplete, inconsistent, or
otherwise flawed, making it unreliable for analysis or decision-making.

Characteristics of Dirty Data

• Inaccurate Data: Data that contains errors or is incorrect. For example, a misspelled name or an incorrect phone number.

• Incomplete Data: Missing values or fields that are not filled in. For instance, a record that lacks a crucial piece of information like an email address or a product price.

• Inconsistent Data: Data that does not match across different records or datasets. This could be due to variations in data entry (e.g., "NY" vs. "New York"); a detection sketch follows below.
Characteristics of Dirty Data

• Duplicated Data: Duplicate entries that represent the same entity multiple times, which can lead to inflated counts or incorrect analysis.

• Irrelevant Data: Data that is not applicable to the analysis or decision-making process, such as outdated information or data from unrelated sources.

• Outdated Data: Information that is no longer current or valid, which can happen in rapidly changing environments, such as customer contact information.
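A hedged pandas sketch of detecting two of the characteristics above, duplicated entries and inconsistent categorical values such as "NY" vs. "New York"; the data and the mapping table are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "customer": ["John Smith", "John Smith", "Jane Doe"],
    "state":    ["NY", "New York", "N.Y."],
})

# Duplicated data: rows that repeat the same entity.
print(df[df.duplicated(subset=["customer"], keep=False)])

# Inconsistent data: harmonize variant spellings to one canonical value.
state_map = {"NY": "New York", "N.Y.": "New York", "New York": "New York"}
df["state"] = df["state"].map(state_map)
print(df)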
Causes of Dirty Data

• Human Error: Mistakes made during data entry, such as typos or omissions.

• System Errors: Technical glitches or bugs in software that lead to incorrect data collection or storage.

• Lack of Standardization: Inconsistent formats or standards for data entry across different departments or systems.

• Data Migration Issues: Problems that arise when transferring data from one system to another, which can lead to loss or corruption of data.
Example 1
Identify possible errors and issues that might require further
attention in the table.
• 1. Missing Values: Check for any empty cells or fields that should contain data. Missing values can affect the overall analysis and interpretation of the data.
• 2. Inconsistent Formatting: Look for inconsistencies in how data is formatted, for example dates in different formats (e.g., MM/DD/YYYY vs. DD/MM/YYYY) or currency represented in different ways (e.g., "$100" vs. "100 USD").
• 3. Duplicate Entries: Identify any duplicate rows that represent the same record. This can lead to inflated counts or misleading analysis.
• 4. Outlier Detection: Check for values that significantly deviate from the rest of the data. Outliers might indicate data entry errors or unique cases that need special handling.
• 5. Data Type Mismatches: Ensure that the data types are consistent with what is expected. For example, numeric fields should not contain text values.
• 6. Incorrect Data Values: Review the data for incorrect values that don't make sense within the context, for example negative ages or impossible dates.
• 7. Inconsistent Categorical Values: Check for variations in categorical data (e.g., "NY", "New York", "N.Y."). These inconsistencies can lead to difficulties in data aggregation and analysis.
• 8. Validation Issues: Ensure that the data meets certain validation rules. For example, a field requiring an email address should not contain any non-email entries.
• 9. Misleading Aggregations: Look for aggregated values that may be misleading due to underlying data issues. For example, averages that include outliers may not represent the true central tendency.
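Several of these checks can be automated. The hedged pandas sketch below runs a few of them (missing values, duplicates, data types, incorrect values and a simple IQR outlier test) on an invented table; the column names and values are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({
    "age":    [34, 29, -5, 41, 34],             # -5 is an impossible value
    "income": [2500, 2600, 2550, 99999, 2500],  # 99999 looks like an outlier
})

print(df.isna().sum())        # 1. missing values per column
print(df.duplicated().sum())  # 3. fully duplicated rows
print(df.dtypes)              # 5. data type of each column
print(df[df["age"] < 0])      # 6. incorrect values (negative ages)

# 4. a simple outlier test using the interquartile range (IQR)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)])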
Classification of error types
 Validity
 Accuracy
 Completeness
 Consistency
 Uniformity
Validity
 Do the data values match any specified constraints, value limits,
and formats for the column in which they appear?
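As a hedged sketch of a validity check, the following tests each value against a format constraint for its column. The regular expression is a simplified, illustrative pattern for UK-style postcodes, not the official one, and the sample values are taken from or modelled on the slides.

import re

# Simplified, illustrative pattern: 1-2 letters, 1-2 digits, optional letter, space, digit, 2 letters.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$")

for value in ["M1 1AE", "M60 9HP", "not a postcode"]:
    print(value, "matches the format" if POSTCODE.match(value) else "fails the format constraint")

# Note: M60 9HP passes the format (validity) check yet may still be inaccurate –
# accuracy needs an external ‘gold standard’ such as a table of postcodes in use.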
Accuracy
 Checking correctness requires some external ‘gold standard’ to check values against (e.g. a table of valid postcodes would show that M60 9HP isn’t a postcode that is currently in use). Otherwise, hints based on spelling and capitalization are the best hope.
Completeness
 Are all the required values present? Everyone has a DOB and a postcode (assuming they are in the UK – if they live elsewhere they may not have a postcode), although they may not know the value; but can the dataset be considered complete with some of these missing? This will depend on the purpose of any future analysis.
Consistency
 If two values should be the same but are not, then there is an inconsistency. So, if the two rows with ‘John Smith’ and ‘J. Smith’ do indeed represent a single individual, John Smith, then the data for that individual’s monthly income is inconsistent.
Uniformity
 The DOB field contains date values drawn from two different calendars,
which would create problems in later processing. It would be necessary to
choose a base representation and translate all values to that form. A similar
issue appears in the income column.
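A hedged sketch of restoring uniformity: translate all DOB values to one base representation. It uses pandas' to_datetime with explicit formats; the input values and formats are invented for illustration (converting between actual calendars would need a dedicated library).

import pandas as pd

dob = pd.Series(["12/03/1985", "1985-07-22"])    # two different date representations

# Parse each representation with its own format, then keep whichever parse succeeded.
as_uk  = pd.to_datetime(dob, format="%d/%m/%Y", errors="coerce")
as_iso = pd.to_datetime(dob, format="%Y-%m-%d", errors="coerce")
uniform = as_uk.fillna(as_iso)                   # a single base representation
print(uniform)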
Combining data from multiple sources
 Harmonization is the data cleansing activity of creating a common
form for non-uniform data.
 Mixed forms more often occur when two or more data sources use
different base representations.
Example:
Imagine a company with two departments:
 One stores local phone numbers
 the other stores them in international format.
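A hedged sketch of harmonizing the two departments' phone numbers into a single international form; the country code, the prefix rule and the sample numbers are assumptions for illustration.

def to_international(number: str, country_code: str = "+44") -> str:
    """Harmonize a phone number to one international form (illustrative rule)."""
    digits = number.replace(" ", "")
    if digits.startswith("+"):      # already international
        return digits
    if digits.startswith("0"):      # local form: drop the leading 0, add the country code
        return country_code + digits[1:]
    return country_code + digits

for n in ["0161 496 0000", "+44 161 496 0000"]:   # invented example numbers
    print(to_international(n))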
Approaches to handling dirty data
Fix it –
 replace incorrect or missing values with the correct values
Remove it –
 remove the value, or a group of values (or rows of data or data elements), from the dataset
Replace it –
 substitute a default marker for the incorrect value, so that later processing can recognize it is dealing with inappropriate values
Leave it –
 simply note that it was identified and leave it, hoping that its impact on subsequent processing is minimal.
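These approaches can be made concrete. A hedged pandas sketch, on invented data, showing 'fix', 'remove' and 'replace' (leaving a value alone needs no code beyond documenting the decision):

import pandas as pd

df = pd.DataFrame({"name": ["John Smith", "J. Smith"], "income": [2500.0, -1.0]})

# Fix it: replace an incorrect value with the correct one (known from another source).
df.loc[df["name"] == "J. Smith", "name"] = "John Smith"

# Remove it: drop rows whose income is clearly invalid.
removed = df[df["income"] >= 0]

# Replace it: substitute a marker (here NaN) so later processing can recognize it.
replaced = df.copy()
replaced.loc[replaced["income"] < 0, "income"] = float("nan")

print(removed)
print(replaced)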
Documenting data cleansing
It is necessary to:
 Document how the dirty data was identified and handled, and
for what reason
 Maintain the data in both raw and ‘cleaned’ form
 If the data originally came from operational systems it might be
necessary to feed the findings back to the managers of these
systems
Benefits of Documenting data cleansing
 Allows others to consider the changes made and ensure they
were both valid and sensible.
 Helps to build a core of approaches and methods for the kinds of
datasets that are frequently used.
 Allows managers of operational systems where the data came from to adjust and improve their validation processes.
 Allows you, in time, to develop effective cleansing regimes for
specialized data assets.
Data laundering and data obfuscating
Two further data cleansing activities:
 Data laundering attempts to break the link between the dataset
and its (valid) provenance.
 Data obfuscating (aka data anonymization) is the process of
removing the link between sensitive data and the real-world
entities to which it applies, while at the same time retaining the
value and usefulness of that data.
Data laundering and data obfuscating
The key difference between these activities and data
cleansing itself is this:
 In data cleansing we are trying to document and maintain the
full provenance of our dataset;
 In laundering we want to lose its history, and
 In obfuscation we’re trying to produce anonymized but useful
data.
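A hedged sketch of one common obfuscation technique (not necessarily the one intended in the lecture): replacing each real-world identifier with a keyed hash, so records can still be linked and analysed without exposing who they refer to. The key and field names are invented.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"   # invented; keep it outside the dataset

def pseudonymize(value: str) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]

record = {"name": "John Smith", "monthly_income": 2500}
record["name"] = pseudonymize(record["name"])   # the link to the real person is removed
print(record)                                   # the income value stays useful for analysis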
Data integration and transformation
 A new dataset may be in the wrong shape
 For example, data held in a tree-like structure may be needed in
table form.
 Another reason for reshaping data is to choose a subset of a
dataset for some purpose
 Finally, reshaping may also mean combining multiple datasets.
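A hedged sketch of the first kind of reshaping mentioned above, tree-like (nested) data flattened into table form, using pandas.json_normalize; the nested structure and values are invented for illustration.

import pandas as pd

# Tree-like data: each person carries nested address details.
tree = [
    {"name": "John Smith", "address": {"city": "Manchester", "postcode": "M1 1AE"}},
    {"name": "Jane Doe",   "address": {"city": "Leeds",      "postcode": "LS1 4AP"}},
]

table = pd.json_normalize(tree)    # nested keys become columns such as "address.city"
print(table)

# Choosing a subset of the dataset is another reshaping step:
print(table[["name", "address.city"]])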
Thanks
Dr. Mona Abbass
E-mail mona_abbass12@hotmail.com
