L1 - Introduction To Data Science

ICT583 Data Science Applications

TOPIC 1: Introduction to Data Science

Outline
· What is data science and why now?
· The data science Venn diagram
· Types of data
· Asking interesting questions from data
From industry age to information age
Data explosion
What is Data Science?
Whenever we use the word "data," we refer to a
collection of information in either an organized or
unorganized format. These formats have the
following qualities:
Organized data: This refers to data that is
sorted into a row/column structure, where every
row represents a single observation and the
columns represent the characteristics of that
observation.
Unorganized data: This is the type of data that
is in a free form, usually text or raw audio/signals
that must be parsed further to become organized.
What is Data Science?
Data science is all about how we take data, use it to
acquire knowledge, and then use that knowledge to do the
following:
• Make decisions
• Predict the future
• Understand the past/present
• Create new industries/products
This unit is all about the methods of data science,
including how to process data, gather insights, and use
those insights to make informed decisions and predictions.
Data science is about using data in order to gain new
insights that you would otherwise have missed.
What is Data Science?

Data science won't replace the human brain but will complement it, working alongside it. Data science should not be thought of as an end-all solution; it is merely an opinion, a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.
Why Data Science?
But why should that necessitate an entirely new set
of vocabulary? What was wrong with our previous
forms of analysis?
• The volume of data makes it literally impossible for a human to parse it in a reasonable time frame.
• Data is collected in various forms and from different sources, and often comes in a very unorganized format.
• Data can be missing, incomplete, or just flat out wrong.
• Data can be on very different scales, which makes it tough to compare.
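One common way to handle the scale problem is to rescale every column to the same range before comparing. The following is a minimal sketch in Python (used here only for illustration; the unit itself works in R). The function name `min_max_scale` and the sample values are assumptions made up for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns on very different scales (values invented for illustration).
revenue_dollars = [12000, 45000, 30000]
customer_rating = [3.5, 4.8, 4.1]

scaled_revenue = min_max_scale(revenue_dollars)
scaled_rating = min_max_scale(customer_rating)

print(scaled_revenue)
print(scaled_rating)
```

After scaling, both columns run from 0 to 1, so a value of 0.5 means "halfway between the smallest and largest observation" in either column.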
The data science Venn diagram
The Venn diagram of data science
Skill Sets for Data Science
Data Science is the intersection of these three key areas.
To gain knowledge from data, we must be able to utilize
computer programming to access the data, understand
the mathematics behind the models we derive, and above
all, understand our analyses' place in the domain we are
in. This includes the presentation of data.
If we are creating a model to predict heart attacks in
patients, is it better to create a PDF of information or an
app where you can type in numbers and get a quick
prediction? All these decisions must be made by the data
scientist.
Machine learning
The intersection of math and coding is machine learning.
It is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just that: algorithms sitting on your computer.
The math
This unit will guide you through some basic math needed for data science, specifically statistics and probability. These subdomains of mathematics can help you better understand some main concepts in data science and what are called models.
A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.
Essentially, we will use math in order to formalize relationships between variables.
Example - Spawner-recruit models

In biology, a model known as the spawner-recruit model is used to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals.
Example - Spawner-recruit models

Essentially, models allow us to plug in one variable to get the other.
Let's say we knew that a group of salmon had 1.15 (in thousands) spawners. Then, plugging this value into the model gives an estimate of the number of recruits.
This result can be very beneficial for estimating how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.
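The idea of "plugging in one variable to get the other" can be sketched with a fitted straight line. This is a minimal Python illustration (the unit itself uses R), and it rests on two loud assumptions: the relationship is taken to be linear, which is a simplification and not necessarily the form used on the slide, and the spawner and recruit numbers below are invented toy data. The helper names `fit_line` and `predict_recruits` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data, both measured in thousands (not real salmon counts).
spawners = [1.0, 2.0, 3.0, 4.0]
recruits = [2.5, 3.0, 3.5, 4.0]

slope, intercept = fit_line(spawners, recruits)


def predict_recruits(n_spawners):
    """Plug a spawner count into the fitted line to estimate recruits."""
    return intercept + slope * n_spawners


# Estimate recruits for the 1.15 (thousand) spawners mentioned in the text.
estimate = predict_recruits(1.15)
print(slope, intercept, estimate)
```

With this toy data the fitted line has a slope of 0.5 and an intercept of 2.0, so the numbers carry no biological meaning; the point is only the mechanics of fitting a relationship and then reusing it for prediction.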
Computer programming

Computer languages are how we communicate with the machine and tell it what to do. Just as a book can be written in many languages, a computer can be programmed in many languages; similarly, data science can also be done in many languages.
R is a preferred programming language for data science, and it is beginner-friendly.
Why R?

R is one of the most popular choices for data scientists. Following are some of the key reasons why we will use R:
• R has been reliable and useful in academia for many years. Traditionally, R was used for research purposes in academia because it provided various statistical tools for analysis. With the advancements in data science and the need for analyzing data, R became a popular choice in industry as well.
• R is an ideal tool when it comes to data wrangling. It offers several prebuilt packages that make data wrangling a lot easier.
• R provides the ggplot2 package, which is well known for its visualizations. ggplot2 produces aesthetically pleasing visualizations for a wide range of data. Furthermore, ggplot2 plots can be combined with other packages to give users a degree of interactivity, so that they can understand the data embedded in the visualization more clearly.
• R contains machine learning packages for various operations. Whether you are boosting, building random forests, or performing regression and classification, R provides a wide array of packages.
Domain knowledge

Domain knowledge focuses mainly on having knowledge of the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you need a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in that field.
Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.
Developing Curiosity

A good data scientist develops a curiosity about the domain/application they are working in.
• They talk shop with the people whose data they
are working on.
• They read the newspaper every day, to get a
broader perspective on the world.
Some more terminology
Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer.
It combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and the creation of powerful data models. Speaking of data models, we will concern ourselves with the following two basic types of data models:
• Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
• Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.
Some more terminology
Exploratory data analysis (EDA) refers to preparing
data in order to standardize results and gain quick
insights.
It is concerned with data visualization and preparation.
This is where we turn unorganized data into organized
data and also clean up missing/incorrect data points.
During EDA, we will create many types of plots and use
these plots to identify key features and relationships to
exploit in our data models.
Data mining is the process of finding relationships
between elements of data.
It is the part of data science where we try to find
relationships between variables (think spawner-recruit
model).
Data science case studies
Marketing dollars
A dataset shows the relationships between TV, radio, and newspaper advertising spend and sales. The goal is to analyze the relationships between the three different marketing mediums and how they affect the sale of a product.
TV, radio, and newspaper categories are measured in "thousands of dollars"; the sales in "thousands of widgets sold."
Data science case studies
If we plot each variable against the sales, we get the following graph:

Note that none of these variables forms a very strong line, and therefore they might not work well in predicting sales on their own. In this case, we will have to create a more complex model than the one we used in the spawner-recruit model and combine all three variables in order to model sales.
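How strongly a variable "forms a line" with sales can be quantified with the Pearson correlation coefficient, which is close to 1 or -1 for a strong line and close to 0 for a weak one. The sketch below is in Python for illustration only (the unit uses R); the spend and sales figures are invented and are not the dataset from the slide, and the function name `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented advertising spend (thousands of dollars) and sales (thousands of widgets).
spend = {
    "tv": [100, 150, 200, 250, 300],
    "radio": [20, 35, 10, 40, 25],
    "newspaper": [30, 10, 50, 20, 40],
}
sales = [10, 14, 13, 20, 19]

# One correlation per marketing medium, each measured against sales.
correlations = {medium: pearson(values, sales) for medium, values in spend.items()}
for medium, r in correlations.items():
    print(medium, round(r, 3))
```

A per-variable correlation only describes each medium on its own, which is why combining all three in a single model is the next step.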
Types of data

We will look at the following topics:
• Structured versus unstructured data
• Quantitative versus qualitative data
If you don't understand the type of data that you are working with, then you might waste a lot of time applying models that are known to be ineffective with that specific type of data.
Structured versus unstructured data

Structured (organized) data: This is data that can be thought of as observations and characteristics. It is usually organized using a table method (rows and columns).
Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard organization hierarchy.
Examples

• Most data that exists in text form, including server logs and Facebook posts, is unstructured.
• Scientific observations, as recorded by careful scientists, are kept in a very neat and organized (structured) format.
• A genetic sequence of chemical nucleotides (for example, ACGTATTGCA) is unstructured, even if the order of the nucleotides matters, as we cannot form descriptors of the sequence using a row/column format without taking a further look.
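"Taking a further look" at the nucleotide example could mean computing a few descriptors of the sequence so that it fits into one row of a table. This is a minimal Python sketch for illustration (the unit itself uses R); the choice of descriptors and the column names such as `gc_fraction` are assumptions, not something the slide prescribes.

```python
from collections import Counter

sequence = "ACGTATTGCA"  # the example sequence from the text

# Count how often each nucleotide appears.
counts = Counter(sequence)

# One structured row: each key is a column, each value a characteristic.
row = {
    "length": len(sequence),
    "count_A": counts["A"],
    "count_C": counts["C"],
    "count_G": counts["G"],
    "count_T": counts["T"],
    "gc_fraction": (counts["G"] + counts["C"]) / len(sequence),
}

print(row)
```

Many sequences processed this way would form a table with one row per sequence, at the cost of discarding the exact ordering of the nucleotides.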
Structured versus unstructured data

Most statistical and machine learning models were built with structured data in mind and cannot work on the loose interpretation of unstructured data.
So, why even talk about unstructured data? Because it is so common!
Tweets, emails, literature, and server logs are generally unstructured forms of data.
With most of our data existing in this free-form format, we must turn to pre-analysis techniques, called pre-processing, in order to apply structure to at least a part of the data for further analysis.
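As a rough illustration of pre-processing, the sketch below turns free-form server log lines into rows with named columns. It is written in Python for illustration (the unit uses R). The log format, the sample lines, and the helper name `parse_line` are all invented for this example; real logs vary widely and would need their own pattern.

```python
import re

# Hypothetical log lines in an invented "date time LEVEL message" format.
raw_logs = [
    "2024-01-05 10:15:32 ERROR disk full on /dev/sda1",
    "2024-01-05 10:16:01 INFO backup started",
    "not a well-formed line",
]

# Capture groups: date, time, level, and the rest of the line as the message.
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def parse_line(line):
    """Return a dict of columns for one log line, or None if it does not match."""
    match = pattern.match(line)
    if match is None:
        return None
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


# Keep only the lines that could be given structure.
rows = [row for row in (parse_line(line) for line in raw_logs) if row is not None]

for row in rows:
    print(row)
```

Note that structure is applied to only part of the data: the malformed line is dropped, and the message column is still free text.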
Quantitative versus qualitative data

Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.
Qualitative data: This data cannot be described using numbers and basic mathematics. This data is generally thought of as being described using natural categories and language.
Example - Coffee shop data

• Name of coffee shop: qualitative
• Revenue (in thousands of dollars): quantitative
• Zip code: qualitative
• Average monthly customers: quantitative
• Country of coffee origin: qualitative
Each of these characteristics can be classified as either quantitative or qualitative.
Important things to note

• Even though a zip code is being described using numbers, it is not quantitative. This is because you can't talk about the sum of all zip codes or an average zip code. These are nonsensical descriptions.
• Pretty much whenever a word is used to describe a characteristic, it is a qualitative factor.
Types of questions you may ask...

For a quantitative column, you may ask questions such as the following:
• What is the average value?
• Does this quantity increase or decrease over time (if time is a factor)?
• Is there a threshold where if this number became too high or too low, it would signal trouble for the company?
Types of questions you may ask...

For a qualitative column, none of the preceding questions can be answered. However, the following questions can be asked of qualitative values:
• Which value occurs the most and the least?
• How many unique values are there?
• What are these unique values?
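The two kinds of questions map onto different operations. Below is a minimal Python sketch (for illustration only; the unit uses R) that asks an averaging question of a quantitative column and counting questions of a qualitative one. The coffee shop rows, the column names, and the values are invented for this example.

```python
from collections import Counter
from statistics import mean

# Invented coffee shop rows; revenue is in thousands of dollars.
shops = [
    {"name": "Shop A", "revenue_k": 120.0, "origin": "Ethiopia"},
    {"name": "Shop B", "revenue_k": 80.0, "origin": "Colombia"},
    {"name": "Shop C", "revenue_k": 100.0, "origin": "Ethiopia"},
]

revenues = [shop["revenue_k"] for shop in shops]  # quantitative column
origins = [shop["origin"] for shop in shops]      # qualitative column

# Quantitative question: what is the average value?
average_revenue = mean(revenues)

# Qualitative questions: which value occurs most, and what are the unique values?
origin_counts = Counter(origins)
most_common_origin, most_common_count = origin_counts.most_common(1)[0]
unique_origins = sorted(set(origins))

print(average_revenue)
print(most_common_origin, most_common_count)
print(len(unique_origins), unique_origins)
```

Averaging the `origin` column would raise an error, which mirrors the point that such questions make no sense for qualitative data.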
Digging deeper
Quantitative data can be broken down one step further into:
• Discrete data: This describes data that is counted. It can only take on certain values.
Example: a dice roll, because it can only take on six values, and the number of customers in a coffee shop, because you can't have a fractional number of people.
• Continuous data: This describes data that is measured. It exists on an infinite range of values.
Example: a person's weight, because it can be 68 kg or 89.66 kg (note the decimals). The height of a person or building is a continuous number because an infinite scale of decimals is possible.
Digging deeper
Qualitative data can be broken down one step further into:
• Nominal data: This labels variables without any order or quantitative value.
Example: The color of hair can be considered nominal data, as one color can't be ranked above or below another color.
• Ordinal data: This labels variables that have a natural order, given by their position on a scale.
Example: Ranking of people in a competition (First, Second, Third, etc.)
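The practical difference is that ordinal values can be sorted and compared, while nominal values can only be counted or tested for equality. The following Python sketch (illustrative only; the unit uses R) makes the order explicit with a lookup table. The name `RANK_ORDER` and the sample values are assumptions for this example.

```python
from collections import Counter

# Ordinal: the order is part of the data, so we state it explicitly.
RANK_ORDER = {"First": 1, "Second": 2, "Third": 3}

results = ["Third", "First", "Second"]
sorted_results = sorted(results, key=RANK_ORDER.get)

# Comparisons are meaningful for ordinal values.
first_beats_third = RANK_ORDER["First"] < RANK_ORDER["Third"]

# Nominal: no order exists, so counting and equality are the only sensible operations.
hair_colors = ["brown", "black", "brown"]
hair_counts = Counter(hair_colors)

print(sorted_results)
print(first_beats_third)
print(hair_counts)
```

The numbers in `RANK_ORDER` encode position only; subtracting or averaging them would assume equal gaps between ranks, which ordinal data does not guarantee.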
