L1 - Introduction To Data Science

Uploaded by

bhattibaba118
ICT583 Data Science Applications

TOPIC 1: Introduction to Data Science


Outline
· What is data science and why now?
· The data science Venn diagram
· Types of data
· Asking interesting questions from data
From industry age to information age
Data explosion
What is Data Science?
Whenever we use the word "data," we refer to a
collection of information in either an organized or
unorganized format. These formats have the
following qualities:
Organized data: This refers to data that is
sorted into a row/column structure, where every
row represents a single observation and the
columns represent the characteristics of that
observation.
Unorganized data: This is the type of data that
is in a free form, usually text or raw audio/signals
that must be parsed further to become organized.
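As a minimal sketch of what "parsing unorganized data into organized data" can look like, the snippet below turns free-form log lines into rows with named columns. The log layout, the column names and the `parse_line` helper are assumptions made for illustration, and Python is used here only as an example language (the unit itself works in R).

```python
import re

# Hypothetical free-form log lines (invented for illustration).
raw_lines = [
    "2024-01-05 ERROR disk full on server-a",
    "2024-01-06 INFO backup finished on server-b",
]

# Assumed layout: a date, a level, then a free-text message.
LINE_PATTERN = re.compile(r"^(\S+) (\S+) (.+)$")


def parse_line(line):
    """Return one observation (row) as a dict of column -> value, or None."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout
    date, level, message = match.groups()
    return {"date": date, "level": level, "message": message}


# Each parsed line becomes one row; the dict keys play the role of columns.
rows = [row for row in (parse_line(line) for line in raw_lines) if row is not None]
for row in rows:
    print(row)
```

Each dictionary is one observation, and its keys are the characteristics of that observation, which is the row/column structure described above.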
What is Data Science?
Data science is all about how we take data, use it to
acquire knowledge, and then use that knowledge to do the
following:
Make decisions
Predict the future
Understand the past/present
Create new industries/products
This unit is all about the methods of data science,
including how to process data, gather insights, and use
those insights to make informed decisions and predictions.
Data science is about using data in order to gain new
insights that you would otherwise have missed.
What is Data Science?

Data science won't replace the human brain but will complement it, working
alongside it. Data science should not be thought of as an end-all solution; it is
merely an opinion: a very informed opinion, but an opinion nonetheless. It
deserves a seat at the table.
Why Data Science?
But why should that necessitate an entirely new set
of vocabulary? What was wrong with our previous
forms of analysis?
The volume of data makes it literally impossible for a
human to parse it in a reasonable time frame.
Data is collected in various forms and from different
sources, and often comes in a very unorganized format.
Data can be missing, incomplete, or just flat out wrong.
Data can be in very different scales, and that makes it tough
to compare it.
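One common way to deal with the last two points is sketched below: missing values are skipped, and each column is rescaled to z-scores so that columns on very different scales become comparable. The column names and values are invented for illustration, z-scoring is only one of several possible rescaling choices, and Python is used purely as an example language.

```python
from statistics import mean, pstdev

# Hypothetical columns on very different scales (values invented for illustration).
revenue_dollars = [120000.0, 95000.0, None, 150000.0]  # None marks a missing value
customers_per_day = [310.0, 250.0, 280.0, 400.0]


def standardize(values):
    """Rescale to z-scores (mean 0, standard deviation 1), leaving missing values as None.

    Assumes the column has at least two distinct non-missing values.
    """
    present = [v for v in values if v is not None]
    mu = mean(present)
    sigma = pstdev(present)
    return [None if v is None else (v - mu) / sigma for v in values]


revenue_z = standardize(revenue_dollars)
customers_z = standardize(customers_per_day)
print(revenue_z)
print(customers_z)
```

After rescaling, a value of 1.0 in either column means "one standard deviation above that column's average", regardless of the original units.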
The data science Venn diagram
The Venn diagram of data science
Skill Sets for Data Science
Data science is the intersection of three key areas: math and statistics, computer programming, and domain knowledge.
To gain knowledge from data, we must be able to utilize
computer programming to access the data, understand
the mathematics behind the models we derive, and above
all, understand our analyses' place in the domain we are
in. This includes the presentation of data.
If we are creating a model to predict heart attacks in
patients, is it better to create a PDF of information or an
app where you can type in numbers and get a quick
prediction? All these decisions must be made by the data
scientist.
Machine learning
The intersection of math and coding is machine learning.
It is important to note that without the explicit ability to generalize any
models or results to a domain, machine learning algorithms remain just that:
algorithms sitting on your computer.
The math
This unit will guide you through some basic math
needed for data science, specifically statistics and
probability. These subdomains of mathematics can
help you better understand some of the main concepts in
data science, including what are called models.
A data model refers to an organized and formal
relationship between elements of data, usually
meant to simulate a real-world phenomenon.

Essentially, we will use math in order to formalize
relationships between variables.
Example - Spawner-recruit models

In biology, a model known as the spawner-recruit model is used
to judge the biological health of a species. It is a basic relationship
between the number of healthy parental units of a species and the
number of new units in the group of animals.
Example - Spawner-recruit models

Essentially, models allow us to plug in one variable to get
the other.

Let's say we knew that a group of salmon had 1.15 (in
thousands) spawners. Then, plugging this value into the model
gives an estimate of the number of recruits.

This result can be very useful for estimating how the
health of a population is changing. If we can create these
models, we can visually observe how the relationship
between the two variables can change.
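The idea of plugging in one variable to get the other can be sketched as follows. The data points are made up, and a straight line fitted by least squares is an assumed model form chosen for simplicity; neither is taken from the slides or from real salmon counts. Python is used only as an example language.

```python
# Made-up (spawners, recruits) pairs in thousands, for illustration only;
# these are NOT real salmon counts and not the figures used in the unit.
spawners = [1.0, 2.0, 3.0, 4.0]
recruits = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(spawners, recruits)


def predict_recruits(spawner_count):
    """Plug one variable (spawners) into the fitted line to estimate the other (recruits)."""
    return slope * spawner_count + intercept


print(predict_recruits(1.15))
```

Refitting the line as new observations arrive is one way to watch how the relationship between the two variables changes over time.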
Computer programming

Computer languages are how we communicate with
the machine and tell it what to do. Just as a book can be
written in many languages, a computer understands many
languages, and data science can be done in many of them.
R is a preferred programming language for data science,
and it is a beginner-friendly language.
Why R?

R is a popular choice among data scientists. The following are some of the key
reasons why we will use R:
• R has been reliable and useful in academia for many years. Traditionally, R was
used for research purposes in academia because it provided various
statistical tools for analysis. With the advancements in data science and the
need for analyzing data, R became a popular choice in industry as well.
• R is an ideal tool when it comes to data wrangling. It offers
several ready-made packages that make data wrangling a lot easier.
• R provides the ggplot2 package, which is well known for its
visualizations. ggplot2 produces clean, attractive graphics for a wide range of
data. Combined with companion packages such as plotly, ggplot2 charts can
also be made interactive so that users can explore the data embedded in the
visualization more closely.
• R contains machine learning packages for various operations. Whether you are
boosting, building random forests, or performing regression and classification,
R provides a wide array of packages.
Domain knowledge

Domain knowledge means having knowledge of the
particular topic you are working on. For example, if
you are a financial analyst working on stock market
data, you need a lot of domain knowledge. If you
are a journalist looking at worldwide adoption rates,
you might benefit from consulting an expert in that
field.
Great data scientists can apply their skills to any
area, even if they aren't fluent in it. Data scientists
can adapt to the field and contribute meaningfully
when their analysis is complete.
Developing Curiosity

The good data scientist develops a curiosity about
the domain/application they are working in.
• They talk shop with the people whose data they
are working on.
• They read the newspaper every day, to get a
broader perspective on the world.
Some more terminology
Machine learning: This refers to giving computers the ability to
learn from data without explicit "rules" being given by a
programmer.
It combines the power of computers with intelligent learning
algorithms in order to automate the discovery of relationships in
data and the creation of powerful data models. Speaking of data
models, we will concern ourselves with the following two basic
types of data models:
Probabilistic model: This refers to using probability to find a
relationship between elements that includes a degree of
randomness.
Statistical model: This refers to taking advantage of
statistical theorems to formalize relationships between data
elements in a (usually) simple mathematical formula.
Some more terminology
Exploratory data analysis (EDA) refers to preparing
data in order to standardize results and gain quick
insights.
It is concerned with data visualization and preparation.
This is where we turn unorganized data into organized
data and also clean up missing/incorrect data points.
During EDA, we will create many types of plots and use
these plots to identify key features and relationships to
exploit in our data models.
Data mining is the process of finding relationships
between elements of data.
It is the part of data science where we try to find
relationships between variables (think spawner-recruit
model).
Data science case studies
Marketing dollars
A dataset shows the relationships between TV, radio, and newspaper
advertising spend and product sales. The goal is to analyze the
relationships between the three different marketing mediums
and how they affect the sale of a product.
The TV, radio, and newspaper categories are measured in "thousands
of dollars"; the sales are measured in "thousands of widgets sold."
Data science case studies
If we plot each variable against the sales, we get the following
graph:

Note that none of these variables forms a very strong line, and
therefore they might not work well in predicting sales on their
own. In this case, we will have to create a more complex model
than the one we used in the spawner-recruit model and combine
all three variables in order to model sales.
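One way to put a number on how strongly a single variable "forms a line" with sales is the Pearson correlation coefficient, sketched below. The spend and sales figures are invented for illustration and are not the dataset from the case study; the sketch covers only the one-variable-at-a-time check, not the combined model. Python is used only as an example language.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: close to +1 or -1 means a strong line, close to 0 a weak one."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented spend per medium (thousands of dollars) and sales (thousands of widgets).
budgets = {
    "TV": [10.0, 20.0, 30.0, 40.0, 50.0],
    "radio": [5.0, 25.0, 10.0, 30.0, 15.0],
    "newspaper": [12.0, 8.0, 20.0, 5.0, 15.0],
}
sales = [7.0, 9.0, 12.0, 13.0, 16.0]

# One correlation per medium, each computed against the same sales column.
strength = {medium: pearson_r(spend, sales) for medium, spend in budgets.items()}
for medium, r in strength.items():
    print(medium, round(r, 3))
```

A weak correlation for every medium on its own is the situation described above, where combining the variables in one model becomes necessary.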
Types of data

We will look at the following topics:
• Structured versus unstructured data
• Quantitative versus qualitative data

If you don't understand the type of data that you
are working with, then you might waste a lot of
time applying models that are known to be
ineffective with that specific type of data.
Structured versus unstructured data

Structured (organized) data: This is data that
can be thought of as observations and
characteristics. It is usually organized using a
table method (rows and columns).

Unstructured (unorganized) data: This data
exists as a free entity and does not follow any
standard organization hierarchy.
Examples

Most data that exists in text form, including server
logs and Facebook posts, is unstructured.

Scientific observations, as recorded by careful
scientists, are kept in a very neat and organized
(structured) format.

A genetic sequence of chemical nucleotides (for
example, ACGTATTGCA) is unstructured, even if the
order of the nucleotides matters, as we cannot form
descriptors of the sequence using a row/column
format without taking a further look.
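"Taking a further look" at the sequence can be as simple as computing a few descriptors that fit in one row of a table. The sketch below does this for the example sequence; the choice of descriptors (length, per-nucleotide counts, and the G/C fraction) is an assumption made for illustration, and Python is used only as an example language.

```python
from collections import Counter

sequence = "ACGTATTGCA"  # the example sequence from the text

counts = Counter(sequence)

# One structured row: each key is a column describing the sequence.
row = {
    "length": len(sequence),
    "count_A": counts["A"],
    "count_C": counts["C"],
    "count_G": counts["G"],
    "count_T": counts["T"],
    "gc_fraction": (counts["G"] + counts["C"]) / len(sequence),
}
print(row)
```

Doing the same for many sequences yields one row per sequence, which is the structured format that most models expect.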
Structured versus unstructured data

Most statistical and machine learning models were built with
structured data in mind and cannot work on the loose
interpretation of unstructured data.

So, why even talk about unstructured data? Because it is so
common!

Tweets, emails, literature, and server logs are generally
unstructured forms of data.

With most of our data existing in this free-form format, we must
turn to pre-analysis techniques, called pre-processing, in order to
apply structure to at least a part of the data for further analysis.
Quantitative versus qualitative data

Quantitative data: This data can be described
using numbers, and basic mathematical
procedures, including addition, are possible on
the set.

Qualitative data: This data cannot be
described using numbers and basic
mathematics. This data is generally thought of
as being described using natural categories
and language.
Example - Coffee shop data

Name of coffee shop: qualitative
Revenue (in thousands of dollars): quantitative
Zip code: qualitative
Average monthly customers: quantitative
Country of coffee origin: qualitative

Each of these characteristics can be classified as
either quantitative or qualitative.
Important things to note

Even though a zip code is being described
using numbers, it is not quantitative. This is
because you can't talk about the sum of all zip
codes or an average zip code. These are
nonsensical descriptions.
Pretty much whenever a word is used to
describe a characteristic, it is a qualitative
factor.
Types of questions you may ask...

For a quantitative column, you may ask
questions such as the following:
What is the average value?
Does this quantity increase or decrease
over time (if time is a factor)?
Is there a threshold where if this number
became too high or too low, it would signal
trouble for the company?
Types of questions you may ask...

For a qualitative column, none of the
preceding questions can be answered.
However, the following questions only apply to
qualitative values:
Which value occurs the most and the least?
How many unique values are there?
What are these unique values?
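The questions for both kinds of column can be sketched in a few lines. The coffee shop records, the column names and the revenue threshold below are all hypothetical values chosen for illustration, and Python is used only as an example language.

```python
from collections import Counter
from statistics import mean

# Hypothetical coffee shop records (invented for illustration).
shops = [
    {"name": "Shop A", "revenue": 120.0, "zip_code": "10001"},
    {"name": "Shop B", "revenue": 80.0, "zip_code": "10002"},
    {"name": "Shop C", "revenue": 100.0, "zip_code": "10001"},
    {"name": "Shop D", "revenue": 100.0, "zip_code": "10003"},
]

# Quantitative column: averages and thresholds make sense.
revenues = [shop["revenue"] for shop in shops]
average_revenue = mean(revenues)
TROUBLE_THRESHOLD = 90.0  # assumed cut-off, in thousands of dollars
in_trouble = [shop["name"] for shop in shops if shop["revenue"] < TROUBLE_THRESHOLD]

# Qualitative column: zip codes are stored as text and only counted, never averaged.
zip_counts = Counter(shop["zip_code"] for shop in shops)
most_common_zip, most_common_count = zip_counts.most_common(1)[0]
unique_zips = sorted(zip_counts)

print(average_revenue, in_trouble)
print(most_common_zip, most_common_count, len(unique_zips), unique_zips)
```

Storing the zip code as a string rather than a number is a small safeguard against accidentally summing or averaging it.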
Digging deeper
Quantitative data can be broken down one step further
into
Discrete data: This describes data that is counted. It can
only take on certain values.
Examples: a die roll, because it can only take on six values,
and the number of customers in a coffee shop, because you
can't have a fractional number of people.
Continuous data: This describes data that is measured. It
exists on an infinite range of values.
Example: a person's weight, because it can be 68kg or
89.66kg (note the decimals). The height of a person or
building is a continuous number because an infinite scale of
decimals is possible.
Digging deeper
Qualitative data can be broken down one step further into
Nominal data: This labels variables without any order or
quantitative value.
Example: The color of hair can be considered nominal data,
as one color can't be ranked above or below another color.
Ordinal data: This is data whose values have a natural order,
given by their position on a scale.
Example: Ranking of people in a competition (First, Second,
Third, etc.)
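The practical difference between the two can be sketched as follows: nominal values can only be counted, while ordinal values can also be sorted once their order is written down. The sample values and the `RANK_ORDER` mapping are assumptions made for illustration, and Python is used only as an example language.

```python
from collections import Counter

# Hypothetical observations (invented for illustration).
hair_colors = ["brown", "black", "brown", "red"]  # nominal: no meaningful order
placings = ["Third", "First", "Second", "First"]  # ordinal: ordered positions

# Nominal data: counting categories is meaningful, sorting them by "size" is not.
color_counts = Counter(hair_colors)

# Ordinal data: the order must be stated explicitly, then sorting is meaningful.
RANK_ORDER = {"First": 1, "Second": 2, "Third": 3}
sorted_placings = sorted(placings, key=RANK_ORDER.get)

print(color_counts)
print(sorted_placings)
```

The numbers in `RANK_ORDER` encode position only; the gap between First and Second is not assumed to equal the gap between Second and Third.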
