
Big Data Analysis

Lec. 5
Dr. Mona Abbass
Content
The data cycle
The data pipeline
Data engineering
Data preparation
Data cleansing
The data cycle
Acquiring and representing data
 A data analysis pipeline consists of four stages.

Two simple means of representing complex, structured data:
 The table
 The document
Representing structured data:
tables
 The table is a very common schema for representing structured data.
 According to the W3C draft Model for Tabular Data and Metadata on the Web (W3C, 2015):
 Metadata is data about the dataset itself.
 Tabular data is data that is structured into rows, each of which
contains information about some thing.
 Each row contains the same number of cells (although some of
these cells may be empty), which provide values of properties of
the thing described by the row.
Representing structured data:
tables
 In tabular data, cells within the same column provide values for the same property of the thing described by the particular row.
 This is what differentiates tabular data from other line-oriented formats.
 According to the W3C model, then, a table must contain at least one column and at least one row.
 Spreadsheets use worksheets of two-dimensional, cell-based tabular displays.
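As a hedged illustration of the W3C description above (not part of the original lecture), the short Python sketch below builds a tiny table in which each row describes one thing and each column holds one property of that thing; the names and values are invented for the example.

import pandas as pd  # assumes pandas is available

# Each row describes one thing; each column is a property of that thing.
people = pd.DataFrame([
    {"name": "John Smith", "dob": "1985-03-12", "postcode": "M1 1AE"},
    {"name": "J. Smith",   "dob": "1985-03-12", "postcode": None},   # a cell may be empty
])

print(people)                 # two rows, three columns
print(list(people.columns))   # every row shares the same set of properties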
Representing tables in web pages
Tabular data can be represented in two forms within a web page,
both of which allow the browser and its plug-ins to handle the
interaction between the logical and physical aspects of the data.
The two forms are:
 the HTML <table> element
 a JavaScript data object
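As a hedged sketch of the second form, a JavaScript data object is usually exchanged as JSON, which other languages can load back into rows and columns (the HTML <table> form can likewise be parsed, e.g. with pandas.read_html). The field names and values below are invented for illustration.

import json

# A JavaScript-style data object serialized as JSON (invented example values).
payload = '[{"name": "John Smith", "postcode": "M1 1AE"}, {"name": "J. Smith", "postcode": null}]'

rows = json.loads(payload)        # a list of dicts: one dict per row
for row in rows:
    print(row["name"], row["postcode"])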
Representing structured data:
documents
 Data scientists use the term ‘document’ to mean any file or representation that embodies a particular data record.
 Books are usually divided into chapters, or sections; they may contain illustrations, footnotes, endnotes, tables of contents, indexes and special headings. They may employ a single typeface, or a variety of them. We can regard all of these, first of all, as structural data, none of which could be captured in a simple sequence of Unicode characters.
 So, how can the structure of our data be captured?
 The most widespread way of capturing document structure is
markup.
 To preserve the semantics of each element we need a markup language such as XML.
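As a minimal, hedged sketch of markup capturing structure that a plain character sequence would lose, the following uses Python's standard xml.etree.ElementTree; the element names are invented for illustration.

import xml.etree.ElementTree as ET

doc = """
<book>
  <chapter title="Introduction">
    <heading>Why structure matters</heading>
    <paragraph>A plain sequence of characters cannot mark this line as a heading.</paragraph>
  </chapter>
</book>
"""

root = ET.fromstring(doc)
for chapter in root.findall("chapter"):
    # The markup preserves the semantics: headings are still distinguishable from paragraphs.
    print(chapter.get("title"), "->", chapter.find("heading").text)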
The data pipeline
 In practice, activities that appear neatly separated are actually combined or revisited.
 For example, after initial preparation and some early analysis
there may be a need to identify and acquire more data, which
will itself require preparation and analysis.
Data engineering
 ‘The multi-disciplinary practice of engineering computing systems, computer software, or extracting information, partly through the analysis of data’ (Buntine, 1997).
Data engineering
The tasks of Data engineers include:
 Collecting data over space and time
 Cleaning it of errors
 Anonymizing it (removing identifying information from it)
 Filtering it
 Representing it so that it can be exported from one system and
imported into others
 Sorting and storing it across distributed systems
 Shaping it into forms that allow it to be analyzed
 Visualizing it.
 Must respect legal and ethical concerns
Data preparation
 Purpose:
1. Convert acquired ‘raw’ datasets into valid, consistent data,
using structures and representations that will make analysis
straightforward.
 Initial Steps:
1. Explore the content, values and the overall shape of the data.
2. Determine the purpose for which the data will be used.
3. Determine the type and aims of the analysis to be applied to it.
Data preparation
 Possible discovered problems with real data:
1. Data is wrongly packaged
2. Some values may not make sense
3. Some values may be missing
4. The format doesn’t seem right
5. The data doesn’t have the right structure for the tools and
packages to be used with it, for example, it might be represented in
an XML schema, and a CSV format is required.
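Problem 5 is common enough to show concretely. The hedged sketch below converts a small XML dataset into the CSV a downstream tool expects; the element names, values and output file name are assumptions for illustration only.

import csv
import xml.etree.ElementTree as ET

xml_data = """
<people>
  <person><name>John Smith</name><postcode>M1 1AE</postcode></person>
  <person><name>J. Smith</name><postcode>M60 9HP</postcode></person>
</people>
"""

root = ET.fromstring(xml_data)
with open("people.csv", "w", newline="") as f:    # output file name is illustrative
    writer = csv.writer(f)
    writer.writerow(["name", "postcode"])         # header row
    for person in root.findall("person"):
        writer.writerow([person.findtext("name"), person.findtext("postcode")])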
Data preparation
 Activities:
1. Data cleansing: remove or repair obvious errors and inconsistencies in
the dataset
2. Data integration: combining datasets
3. Data transformation: shaping datasets

 Terms like data harmonization and data enhancement are also used.

 Note:
1. Some of the techniques used in data preparation – especially in
transformation and integration – are also used to manipulate data
during analysis
2. Conversely, some analysis techniques are also used in data preparation
Data cleansing
Is the process of:
 detecting and correcting errors in a dataset.
 It can even mean removing irrelevant parts of the data; we will look at this later in the section.
 Having found errors – incomplete, incorrect, inaccurate or
irrelevant data – a decision must be made about how to handle
them.
Data cleansing headaches
Errors can be introduced into data in many ways:
 user input mistakes
 transport errors
 conversion between representations
 disagreements about the meaning of data elements
Some error types:
 Incorrect formats
 Incorrect structures
 Inaccurate values – these can be the hardest to identify and correct without additional data or complex checking processes. (Is ‘Jean Smit’ the real name of a person in a survey?)
Data cleansing headaches
Most operational systems try to keep ‘dirty’ data out of the data
store, by:
 Input validation
 database constraints
 error checking
However, despite these efforts, errors will still occur.
Dirty data refers to data that is inaccurate, incomplete, inconsistent, or
otherwise flawed, making it unreliable for analysis or decision-making.

Characteristics of Dirty Data

• Inaccurate Data: Data that contains errors or is incorrect. For example, a misspelled name or an incorrect phone number.

• Incomplete Data: Missing values or fields that are not filled in. For instance, a record that lacks a crucial piece of information like an email address or a product price.

• Inconsistent Data: Data that does not match across different records or datasets. This could be due to variations in data entry (e.g., "NY" vs. "New York"); a detection sketch follows below.
Characteristics of Dirty Data

• Duplicated Data: Duplicate entries that represent the same entity multiple times, which can lead to inflated counts or incorrect analysis.

• Irrelevant Data: Data that is not applicable to the analysis or decision-making process, such as outdated information or data from unrelated sources.

• Outdated Data: Information that is no longer current or valid, which can happen in rapidly changing environments, such as customer contact information.
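A hedged pandas sketch of detecting two of the characteristics above, duplicated entries and inconsistent categorical values such as "NY" vs. "New York"; the data and the mapping table are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "customer": ["John Smith", "John Smith", "Jane Doe"],
    "state":    ["NY", "New York", "N.Y."],
})

# Duplicated data: rows that repeat the same entity.
print(df[df.duplicated(subset=["customer"], keep=False)])

# Inconsistent data: harmonize variant spellings to one canonical value.
state_map = {"NY": "New York", "N.Y.": "New York", "New York": "New York"}
df["state"] = df["state"].map(state_map)
print(df)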
Causes of Dirty Data

• Human Error: Mistakes made during data entry, such as typos or omissions.

• System Errors: Technical glitches or bugs in software that lead to incorrect data collection or storage.

• Lack of Standardization: Inconsistent formats or standards for data entry across different departments or systems.

• Data Migration Issues: Problems that arise when transferring data from one system to another, which can lead to loss or corruption of data.
Example 1
Identify possible errors and issues that might require further
attention in the table.
• 1. Missing Values: Check for any empty cells or fields that should contain data. Missing values can affect the overall analysis and interpretation of the data.
• 2. Inconsistent Formatting: Look for inconsistencies in how data is formatted, for example dates in different formats (e.g., MM/DD/YYYY vs. DD/MM/YYYY) or currency represented in different ways (e.g., "$100" vs. "100 USD").
• 3. Duplicate Entries: Identify any duplicate rows that represent the same record. This can lead to inflated counts or misleading analysis.
• 4. Outlier Detection: Check for values that significantly deviate from the rest of the data. Outliers might indicate data entry errors or unique cases that need special handling.
• 5. Data Type Mismatches: Ensure that the data types are consistent with what is expected. For example, numeric fields should not contain text values.
• 6. Incorrect Data Values: Review the data for incorrect values that don't make sense within the context, for example negative ages or impossible dates.
• 7. Inconsistent Categorical Values: Check for variations in categorical data (e.g., "NY", "New York", "N.Y."). These inconsistencies can lead to difficulties in data aggregation and analysis.
• 8. Validation Issues: Ensure that the data meets certain validation rules. For example, a field requiring an email address should not contain any non-email entries.
• 9. Misleading Aggregations: Look for aggregated values that may be misleading due to underlying data issues. For example, averages that include outliers may not represent the true central tendency.
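Several of these checks can be automated. The hedged pandas sketch below runs a few of them (missing values, duplicates, data types, incorrect values and a simple IQR outlier test) on an invented table; the column names and values are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({
    "age":    [34, 29, -5, 41, 34],             # -5 is an impossible value
    "income": [2500, 2600, 2550, 99999, 2500],  # 99999 looks like an outlier
})

print(df.isna().sum())        # 1. missing values per column
print(df.duplicated().sum())  # 3. fully duplicated rows
print(df.dtypes)              # 5. data type of each column
print(df[df["age"] < 0])      # 6. incorrect values (negative ages)

# 4. a simple outlier test using the interquartile range (IQR)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)])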
Classification of error types
 Validity
 Accuracy
 Completeness
 Consistency
 Uniformity
Validity
 Do the data values match any specified constraints, value limits,
and formats for the column in which they appear?
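As a hedged sketch of a validity check, the following tests each value against a format constraint for its column. The regular expression is a simplified, illustrative pattern for UK-style postcodes, not the official one, and the sample values are taken from or modelled on the slides.

import re

# Simplified, illustrative pattern: 1-2 letters, 1-2 digits, optional letter, space, digit, 2 letters.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$")

for value in ["M1 1AE", "M60 9HP", "not a postcode"]:
    print(value, "matches the format" if POSTCODE.match(value) else "fails the format constraint")

# Note: M60 9HP passes the format (validity) check yet may still be inaccurate –
# accuracy needs an external ‘gold standard’ such as a table of postcodes in use.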
Accuracy
 Checking correctness requires some external ‘gold standard’ to check values against (e.g. a table of valid postcodes would show that M60 9HP isn’t a postcode that is currently in use). Otherwise, hints based on spelling and capitalization are the best hope.
Completeness
 Are all the required values present? Everyone has a DOB and a postcode (assuming they are in the UK – if they live elsewhere they may not have a postcode), although they may not know the value; but can the dataset be considered complete with some of these missing? This will depend on the purpose of any future analysis.
Consistency
 If two values should be the same but are not, then there is an inconsistency. So, if the two rows with ‘John Smith’ and ‘J. Smith’ do indeed represent a single individual, John Smith, then the data for that individual’s monthly income is inconsistent.
Uniformity
 The DOB field contains date values drawn from two different calendars,
which would create problems in later processing. It would be necessary to
choose a base representation and translate all values to that form. A similar
issue appears in the income column.
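A hedged sketch of restoring uniformity: translate all DOB values to one base representation. It uses pandas' to_datetime with explicit formats; the input values and formats are invented for illustration (converting between actual calendars would need a dedicated library).

import pandas as pd

dob = pd.Series(["12/03/1985", "1985-07-22"])    # two different date representations

# Parse each representation with its own format, then keep whichever parse succeeded.
as_uk  = pd.to_datetime(dob, format="%d/%m/%Y", errors="coerce")
as_iso = pd.to_datetime(dob, format="%Y-%m-%d", errors="coerce")
uniform = as_uk.fillna(as_iso)                   # a single base representation
print(uniform)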
Combining data from multiple sources
 Harmonization is the data cleansing activity of creating a common
form for non-uniform data.
 Mixed forms more often occur when two or more data sources use
different base representations.
Example:
Imagine a company with two departments:
 One stores local phone numbers
 the other stores them in international format.
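A hedged sketch of harmonizing the two departments' phone numbers into a single international form; the country code, the prefix rule and the sample numbers are assumptions for illustration.

def to_international(number: str, country_code: str = "+44") -> str:
    """Harmonize a phone number to one international form (illustrative rule)."""
    digits = number.replace(" ", "")
    if digits.startswith("+"):      # already international
        return digits
    if digits.startswith("0"):      # local form: drop the leading 0, add the country code
        return country_code + digits[1:]
    return country_code + digits

for n in ["0161 496 0000", "+44 161 496 0000"]:   # invented example numbers
    print(to_international(n))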
Approaches to handling dirty data
Fix it –
 replace incorrect or missing values with the correct values
Remove it –
 remove the value, or a group of values (or rows of data or data elements), from the dataset
Replace it –
 substitute a default marker for the incorrect value, so that later processing can recognize it is dealing with inappropriate values
Leave it –
 simply note that it was identified and leave it, hoping that its impact on subsequent processing is minimal.
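These approaches can be made concrete. A hedged pandas sketch, on invented data, showing 'fix', 'remove' and 'replace' (leaving a value alone needs no code beyond documenting the decision):

import pandas as pd

df = pd.DataFrame({"name": ["John Smith", "J. Smith"], "income": [2500.0, -1.0]})

# Fix it: replace an incorrect value with the correct one (known from another source).
df.loc[df["name"] == "J. Smith", "name"] = "John Smith"

# Remove it: drop rows whose income is clearly invalid.
removed = df[df["income"] >= 0]

# Replace it: substitute a marker (here NaN) so later processing can recognize it.
replaced = df.copy()
replaced.loc[replaced["income"] < 0, "income"] = float("nan")

print(removed)
print(replaced)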
Documenting data cleansing
It is necessary to:
 Document how the dirty data was identified and handled, and
for what reason
 Maintain the data in both raw and ‘cleaned’ form
 If the data originally came from operational systems it might be
necessary to feed the findings back to the managers of these
systems
Benefits of Documenting data cleansing
 Allows others to consider the changes made and ensure they
were both valid and sensible.
 Helps to build a core of approaches and methods for the kinds of
datasets that are frequently used.
 Allows managers of operational systems where the data came from to adjust and improve their validation processes.
 Allows you, in time, to develop effective cleansing regimes for
specialized data assets.
Data laundering and data obfuscating
Two further data cleansing activities:
 Data laundering attempts to break the link between the dataset
and its (valid) provenance.
 Data obfuscating (aka data anonymization) is the process of
removing the link between sensitive data and the real-world
entities to which it applies, while at the same time retaining the
value and usefulness of that data.
Data laundering and data obfuscating
The key difference between these activities and data
cleansing itself is this:
 In data cleansing we are trying to document and maintain the
full provenance of our dataset;
 In laundering we want to lose its history, and
 In obfuscation we’re trying to produce anonymized but useful
data.
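A hedged sketch of one common obfuscation technique (not necessarily the one intended in the lecture): replacing each real-world identifier with a keyed hash, so records can still be linked and analysed without exposing who they refer to. The key and field names are invented.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"   # invented; keep it outside the dataset

def pseudonymize(value: str) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]

record = {"name": "John Smith", "monthly_income": 2500}
record["name"] = pseudonymize(record["name"])   # the link to the real person is removed
print(record)                                   # the income value stays useful for analysis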
Data integration and transformation
 A new dataset may be in the wrong shape
 For example, data held in a tree-like structure may be needed in
table form.
 Another reason for reshaping data is to choose a subset of a
dataset for some purpose
 Finally, reshaping may also mean combining multiple datasets.
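A hedged sketch of the first kind of reshaping mentioned above, tree-like (nested) data flattened into table form, using pandas.json_normalize; the nested structure and values are invented for illustration.

import pandas as pd

# Tree-like data: each person carries nested address details.
tree = [
    {"name": "John Smith", "address": {"city": "Manchester", "postcode": "M1 1AE"}},
    {"name": "Jane Doe",   "address": {"city": "Leeds",      "postcode": "LS1 4AP"}},
]

table = pd.json_normalize(tree)    # nested keys become columns such as "address.city"
print(table)

# Choosing a subset of the dataset is another reshaping step:
print(table[["name", "address.city"]])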
Thanks
Dr. Mona Abbass
E-mail mona_abbass12@hotmail.com
