Big Data Lec5
Lec. 5
Dr. Mona Abbass
Content
The data cycle
The data pipeline
Data engineering
Data preparation
Data cleansing
The data cycle
Acquiring and representing data
A data analysis pipeline consists of four stages.
Terms like data harmonization and data enhancement are also used.
Note:
1. Some of the techniques used in data preparation, especially in
transformation and integration, are also used to manipulate data
during analysis.
2. Conversely, some analysis techniques are also used in data preparation.
Data cleansing
Data cleansing is the process of detecting and correcting errors in a dataset.
It can even mean removing irrelevant parts of the data; we will
look at this later in the section.
Having found errors – incomplete, incorrect, inaccurate or
irrelevant data – a decision must be made about how to handle
them.
Data cleansing headaches
Errors can be introduced into data in many ways:
user input mistakes
transport errors
conversion between representations
disagreements about the meaning of data elements
Some error types:
Incorrect formats
Incorrect structures
Inaccurate values: these can be the hardest to identify and correct
without additional data or complex checking processes. (Is ‘Jean
Smit’ the real name of a person in a survey?)
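A minimal sketch of how the first two error types might be checked automatically. The field names, the email pattern and the date format are assumptions made for illustration, not part of the lecture. Note that a record such as the 'Jean Smit' one passes every check, which is why inaccurate values are the hard case.

```python
import re
from datetime import datetime

# Hypothetical survey schema: these field names are invented for illustration.
EXPECTED_FIELDS = {"name", "email", "birth_date"}
# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_errors(record):
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Incorrect structure: fields missing or unexpected.
    missing = EXPECTED_FIELDS - set(record)
    extra = set(record) - EXPECTED_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")

    # Incorrect format: value present but not in the expected shape.
    email = record.get("email")
    if email is not None and not EMAIL_PATTERN.match(email):
        problems.append(f"bad email format: {email!r}")

    birth_date = record.get("birth_date")
    if birth_date is not None:
        try:
            # Assumed convention: ISO dates such as 1990-04-12.
            datetime.strptime(birth_date, "%Y-%m-%d")
        except ValueError:
            problems.append(f"bad date format: {birth_date!r}")

    return problems
```

Calling `find_errors` on a well-formed record returns an empty list even if the name or date is untrue; catching that kind of error needs additional data to compare against.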
Data cleansing headaches
Most operational systems try to keep ‘dirty’ data out of the data
store by:
input validation
database constraints
error checking
However, despite these efforts, errors will occur.
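The three safeguards above can be sketched with Python's built-in sqlite3 module. The table, its columns and the `add_customer` helper are hypothetical, chosen only to show one validation step, a few constraints and the error check around the insert.

```python
import sqlite3

# Hypothetical schema: table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE,
        age   INTEGER CHECK (age BETWEEN 0 AND 130)
    )
    """
)


def add_customer(email, age):
    """Try to store one row; return True if stored, False if rejected."""
    # Input validation: reject obviously malformed input before the store sees it.
    if not isinstance(email, str) or "@" not in email:
        return False
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (email, age) VALUES (?, ?)", (email, age)
            )
        return True
    except sqlite3.IntegrityError:
        # Error checking: a database constraint (NOT NULL, UNIQUE or CHECK) failed.
        return False
```

A duplicate email or an age of 200 is rejected, but a plausible wrong age such as 30 instead of 03 is stored without complaint, which is one way errors still get through.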
Dirty data refers to data that is inaccurate, incomplete, inconsistent, or
otherwise flawed, making it unreliable for analysis or decision-making.
Characteristics of Dirty Data
• Incomplete Data: Missing values or fields that are not filled in. For instance, a record that lacks a crucial piece of information like an email address or a product pr
• Inconsistent Data: Data that does not match across different records or datasets. This could be due to variations in data entry (e.g., "NY" vs. "New York").
• Duplicated Data: Duplicate entries that represent the same entity multiple times, which can lead to inflated counts or incorrect analysis.
• Irrelevant Data: Data that is not applicable to the analysis or decision-making process, such as outdated information or data from unrelated sources.
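A small sketch of how three of these characteristics might be detected in practice. The records, field names and alias table are invented for illustration. Irrelevant data is left out because what counts as irrelevant depends on the analysis being done.

```python
# Hypothetical records: field names and values are invented for illustration.
records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.org", "state": "NY"},
    {"id": 2, "name": "Bob Ray", "email": None, "state": "New York"},
    {"id": 3, "name": "Ann Lee", "email": "ann@example.org", "state": "NY"},
]

# Assumed canonical form: the two-letter code.
STATE_ALIASES = {"New York": "NY"}


def incomplete(rows, required=("name", "email", "state")):
    """IDs of rows where a required field is missing or empty."""
    return [r["id"] for r in rows if any(not r.get(f) for f in required)]


def normalise_state(rows):
    """Return copies of the rows with state names mapped to one spelling."""
    return [{**r, "state": STATE_ALIASES.get(r["state"], r["state"])} for r in rows]


def duplicates(rows, key=("name", "email")):
    """IDs of rows whose key fields repeat an earlier row."""
    seen, dupes = set(), []
    for r in rows:
        k = tuple(r[f] for f in key)
        if k in seen:
            dupes.append(r["id"])
        seen.add(k)
    return dupes
```

Here `incomplete(records)` flags the record with no email, `normalise_state(records)` maps "New York" to "NY", and `duplicates(records)` flags the repeated Ann Lee entry. Which fields define a duplicate is a design choice; name plus email is only an assumption.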