Unit1

The document outlines a course on Data Science at the Noida Institute of Engineering and Technology, detailing its objectives, outcomes, and evaluation scheme. It covers foundational concepts, data formats, exploratory data analysis, and visualization techniques using programming languages like R and Python. Additionally, it highlights the significance of data science in various industries and the skills required for successful implementation.


Noida Institute of Engineering and Technology, Greater Noida

Introduction To Data Science

Unit: 1

Introduction To Data Science


Dr. Mritunjay Rai
Assistant Professor
ECE-DEPT
B-Tech VIIth Sem
NIET

12 December 2023 OS Unit-1 1


Faculty Introduction

Dr.


Evaluation scheme

Columns CT, TA, Total and PS fall under Evaluation Schemes; TE and PE fall under End Semester.

Sl. No. | Subject Codes | Subject                            | Periods (L T P) | CT | TA | Total | PS | TE  | PE | Total | Credit
1       |               | Departmental Core - I              | 3 0 0           | 30 | 20 | 50    |    | 100 |    | 150   | 3
2       |               | Departmental Elective              | 3 0 0           | 30 | 20 | 50    |    | 100 |    | 150   | 3
3       |               | Open Elective II                   | 3 0 0           | 30 | 20 | 50    |    | 100 |    | 150   | 3
4       |               | Open Elective III                  | 3 0 0           | 30 | 20 | 50    |    | 100 |    | 150   | 3
5       |               | Lab – I                            | 0 0 2           |    |    |       | 25 |     | 25 | 50    | 1
6       |               | Internship Assessment              | 0 0 2           |    |    |       | 50 |     |    | 50    | 1
        |               | MOOCs (Essential for Hons. Degree) | 0 0 2           |    |    |       |    |     |    |       |
        |               | Total                              |                 |    |    |       |    |     |    | 700   | 14
Course objective
B. TECH. (Data Science)

Course code:                 L T P: 3 0 0     Credits: 3
Course title: Data Analytics

Course objective:

The objective of this course is to understand the fundamental concepts of Data Science and
to learn about various types of data formats and their manipulation. It also helps students
learn exploratory data analysis and visualization techniques, in addition to the R
programming language.


Course Outcomes

Course outcomes: After completion of this course, students will be able to:

CO 1  Understand the fundamental concepts of data analytics in the areas that play a major
      role within the realm of data science. (K1)
CO 2  Explain and exemplify the most common forms of data and their representations. (K2)
CO 3  Understand and apply data pre-processing techniques. (K3)
CO 4  Analyse data using exploratory data analysis. (K4)
CO 5  Illustrate various visualization methods for different types of data sets and
      application scenarios. (K3)


Text Books

Text books:

1) Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis
and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.

Reference Books:

1) Neha Sharma, Santanu Ghosh, Monodeep Saha, Open Data for Sustainable Community:
Glocalized Sustainable Development Goals, Springer, 2021.
2) Field Cady, The Data Science Handbook, John Wiley & Sons, Inc, 2017.
3) Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques,
Third Edition, Morgan Kaufmann, 2012.


Branch wise Applications

• Security.
• Transportation.
• Risk detection.
• Risk management.
• Delivery.
• Fast internet allocation.
• Reasonable expenditure.
• Interaction with customers.
• Planning of cities.


Course Objectives

• The objective of this course is to understand the fundamental concepts of data
analytics and learn about various types of data formats and their manipulations.

• It helps students to learn exploratory data analysis and visualization techniques
using R, Python, and Tableau.
Course Outcomes

Course outcome: After completion of this course, students will be able to:

CO 1  Understand the fundamentals of operating systems, their structure and their
      functions. (K1, K2)
CO 2  Implement the concepts of process management policies, CPU scheduling and
      thread management. (K5)
CO 3  Understand and implement the requirements of process synchronization
      and apply deadlock handling algorithms. (K2, K5)
CO 4  Evaluate memory management and its allocation policies. (K5)
CO 5  Understand and analyze I/O management and file systems. (K2, K4)

Program Outcomes

1. Engineering knowledge
2. Problem analysis
3. Design/development of solutions
4. Conduct investigations of complex problems
5. Modern tool usage
6. The engineer and society
7. Environment and sustainability
8. Ethics
9. Individual and team work
10. Communication
11. Project management and finance
12. Life-long learning


COs and POs Mapping

Course Outcome | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12
1              | 3   | 2   | 2   | -   | -   | -   | -   | -   | -   | -    | -    | 1
2              | 3   | 3   | 3   | -   | -   | -   | -   | -   | -   | -    | -    | 1
3              | 3   | 3   | 3   | -   | -   | -   | -   | -   | -   | -    | -    | 1
4              | 3   | 2   | 1   | -   | -   | -   | -   | -   | -   | -    | -    | 1
5              | 3   | 2   | 2   | -   | -   | -   | -   | -   | -   | -    | -    | 1
Average        | 3   | 2.4 | 2.2 | -   | -   | -   | -   | -   | -   | -    | -    | 1


Program Specific Outcomes (PSOs)

On successful completion of the B. Tech. (DS) Program, the Data Science graduates will be
able to:

• PSO1: Analyse, design and develop solutions by applying fundamental concepts of
Data Science.
• PSO2: Apply technical knowledge while using modern tools and technologies for
solving complex problems.
• PSO3: Bring together different fields of science and technology with the right attitude,
work as an individual or as part of a team, and demonstrate professional ethics for the
well-being of society.


COs and PSOs Mapping

Course Outcome | PSO1 | PSO2 | PSO3
1              | 3    | -    | -
2              | 3    | 2    | -
3              | 3    | 2    | -
4              | 3    | 2    | 2
5              | 3    | 2    | -
Average        | 3    | 2    | 2


Program Educational Objectives (PEOs)

•Solve real-time complex problems and adapt to technological changes with the ability of
lifelong learning.

•Work as data scientists, entrepreneurs, and bureaucrats for the goodwill of the society
and pursue higher education.

•Exhibit professional ethics and moral values with good leadership qualities and effective
interpersonal skills.


Faculty wise Result Analysis

• NA


End Semester Question Paper Templates (Offline Pattern/Online Pattern)


Prerequisite and Recap
Prerequisite
• Basic knowledge of Statistics and Probability.
• Basic knowledge of DBMS.
• Basic knowledge of Python.

Recap
• To understand the types of data.
• To understand data classification and analyze the various file formats.
• To import and export data in R/Python.


Brief Introduction about the subject with video

Data analytics (DA) is the practice of examining data sets in order to find trends
and draw conclusions about the information they contain. Increasingly, data
analytics is done with the aid of specialized systems and software.

YouTube/other Video Links

https://www.youtube.com/watch?v=KxryzSO1Fjs


CONTENT
Introduction to Data Science:
1. What is Data Science
2. Big Data, the 5 V’s
3. Evolution of Data Science
4. Datafication
5. Skill sets needed
6. Data Science Lifecycle
7. Types of Data Analysis
8. Data Science Tools and technologies
9. Need for Data Science
10. Analysis Vs Analytics Vs Reporting
11. Big Data Ecosystem
12. Future of Data Science
13. Applications of Data Science in various fields
14. Crowd sourcing analytics
15. Data Security Issues
16. Use cases of Data Science: Facebook, Netflix, Amazon, Uber, Airbnb.


Unit Objective

⮚ Understand the significance of Data Science in industry.
⮚ Understand the basic concepts of Data Science implementation and its techniques.
⮚ Describe a formal definition of its frameworks.
⮚ Understand the concept of Data Science in industry.
⮚ Understand the challenges faced in the Data Science implementation process.
⮚ Describe the standards and requirements for implementing Data Science in a
cloud environment.


Unit Objective

The objectives of Unit 1 are:

1. To provide an overview of the exciting, growing field of data science.
2. To impart preliminary knowledge of the data science domain and elaborate on
topics such as:
• History of Data Science
• Data Science platform
• Challenges in traditional systems
• Data analytics tools


Topic Objective

Objective:
▪ In this topic we learn how data science came into existence
and what the industry need for Big Data was. This shows the
innovative importance of the technology as an open source
framework.

Recap:

▪ Revision of database systems.


What is Data Science?

[Figure: Data Analyst vs. Data Scientist]

• As the image above shows, a Data Analyst usually explains what is going on by
processing the history of the data. A Data Scientist, on the other hand, not only
performs exploratory analysis to discover insights from the data, but also uses
various advanced machine learning algorithms to identify the occurrence of a
particular event in the future. A Data Scientist will look at the data from many
angles, sometimes angles not known earlier.

• So, Data Science is primarily used to make decisions and predictions using
predictive causal analytics, prescriptive analytics (predictive plus decision
science) and machine learning.

• Predictive causal analytics: If you want a model that can predict the
possibilities of a particular event in the future, you need to apply predictive
causal analytics. Say you are providing money on credit; the probability
of customers making future credit payments on time is then a matter of concern for
you. Here, you can build a model that performs predictive analytics on the
payment history of the customer to predict whether future payments will be on
time or not.
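
The credit example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the payment history is made-up toy data, the single feature (number of past late payments) and the names `fit_logistic` and `prob_on_time` are hypothetical, and logistic regression is just one possible choice of predictive model.

```python
import math

# Hypothetical toy history: (late payments in the past year, next bill paid on time? 1 = yes, 0 = no)
history = [(0, 1), (0, 1), (1, 1), (1, 1), (2, 1), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, lr=0.1, epochs=5000):
    """Fit P(on time) = sigmoid(w * late_count + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_logistic(history)

def prob_on_time(late_count):
    """Predicted probability that the next payment is on time."""
    return sigmoid(w * late_count + b)

for late in (0, 2, 5):
    print(f"late payments = {late}: P(on time) = {prob_on_time(late):.2f}")
```

In this sketch the fitted weight comes out negative, so more past late payments lower the predicted probability of paying on time.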

• Prescriptive analytics: If you want a model that has the intelligence to take its own
decisions and the ability to modify them with dynamic parameters, you need
prescriptive analytics. This relatively new field is all about providing advice. In
other words, it not only predicts but also suggests a range of prescribed actions and
associated outcomes. The best example of this is Google’s self-driving car. The data
gathered by vehicles can be used to train self-driving cars. You can run algorithms on
this data to bring intelligence to it. This enables the car to take decisions such as
when to turn, which path to take, and when to slow down or speed up.
• Machine learning for making predictions: If you have transactional data of a finance
company and need to build a model to determine the future trend, then machine
learning algorithms are the best bet. This falls under the paradigm of supervised
learning. It is called supervised because you already have the data on which you
can train your machines. For example, a fraud detection model can be trained using a
historical record of fraudulent purchases.
• Machine learning for pattern discovery: If you don’t have the parameters on
which to base predictions, then you need to find the hidden patterns within
the dataset to be able to make meaningful predictions. This is the
unsupervised model, as you don’t have any predefined labels for grouping.
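
To make the unsupervised case concrete, here is a minimal sketch of pattern discovery without labels. It assumes k-means clustering as the technique (the slides do not name one), uses made-up transaction amounts, and the function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2): returns cluster centers and a label per value."""
    # Deterministic start: spread the initial centers evenly between min and max.
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

# Hypothetical transaction amounts with no labels attached
amounts = [12, 15, 14, 10, 13, 980, 1020, 995]
centers, labels = kmeans_1d(amounts)
print(centers)
print(labels)
```

No one told the algorithm which transactions are "small" or "large"; the two groups emerge from the data alone, which is the point of the unsupervised setting.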

What is Data Science

Why the Hype Around Data Science?

● IBM predicts that demand for data scientists will soar by 28% by 2022.

● Data scientist roles have grown over 650% since 2012, but currently only 35,000
people in the US have data science skills, while hundreds of companies are
hiring for those roles.

● Software engineering is a common starting point for professionals who
are in the top five fastest growing jobs today. The career path to Machine
Learning Engineer and Big Data Developer begins with a solid software
engineering background.

● Data Science gives you career flexibility.


Who Are Data Scientists?


Big Data, the 5 V’s

Objective:
▪ This topic introduces big data as an open source framework
with its ecosystems, and also how processing takes place with huge
amounts of data in cloud infrastructure.

Recap:

▪ Revision of cloud interfaces and high performance computing.


Types of digital data

Objective:
▪ This topic deals with the concept of different types of big data
requirements on digital platforms and how they occupy space
in our day-to-day environment.

Recap:

▪ Revision of cloud infrastructure basics.


Evolution of Data Science

Objective:
▪ This unit deals with data science and its evolution. It also
focuses on how we can manage data science with open source
frameworks.

Recap:

▪ Revision of data generation process.


Need of Data Science

• Traditionally, the data that we had was mostly structured and small in size, and
could be analyzed using simple BI tools. Unlike data in traditional
systems, which was mostly structured, today most data is unstructured or
semi-structured.

• This data is generated from different sources like financial logs, text files,
multimedia forms, sensors, and instruments. Simple BI tools are not capable of
processing this huge volume and variety of data. This is why we need more
complex and advanced analytical tools and algorithms for processing, analyzing
and drawing meaningful insights out of it.
• This is not the only reason why Data Science has become so popular.


Let’s dig deeper and see how Data Science is being used in various domains.

• What if you could understand the precise requirements of your customers
from existing data such as their past browsing history, purchase history,
age and income? No doubt you had all this data earlier too, but now, with the vast
amount and variety of data, you can train models more effectively and
recommend products to your customers with more precision. This is valuable
because it brings more business to your organization.
• Let’s take a different scenario to understand the role of Data Science in decision
making. What if your car had the intelligence to drive you home? Self-driving
cars collect live data from sensors, including radars, cameras and lasers, to
create a map of their surroundings. Based on this data, the car takes decisions such as
when to speed up, when to slow down, when to overtake and where to take a turn,
making use of advanced machine learning algorithms.
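
The first scenario, recommending products from purchase history, can be illustrated with a very small sketch. It assumes a simple "bought together" co-occurrence count rather than a trained model; the customers, products and the names `build_cooccurrence` and `recommend` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history per customer
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"laptop", "mouse", "monitor"},
    "dave": {"phone", "case"},
}

def build_cooccurrence(baskets):
    """Count how often each ordered pair of products appears in the same basket."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(item, counts, top_n=2):
    """Return the products most often bought together with `item`."""
    scored = [(other, n) for (first, other), n in counts.items() if first == item]
    # Highest count first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:top_n]]

counts = build_cooccurrence(purchases)
print(recommend("laptop", counts))
```

Real recommender systems use far richer signals (browsing history, age, income, as the slide mentions), but the principle of learning from past behaviour is the same.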

• Let’s see how Data Science can be used in predictive analytics, taking weather
forecasting as an example. Data from ships, aircraft, radars and satellites can be
collected and analyzed to build models. These models will not only forecast the
weather but also help in predicting the occurrence of natural calamities. This
helps you take appropriate measures beforehand and save many precious lives.


Datafication

Objective:
▪ The objective of this unit is to explain datafication, which refers to the fact that
daily interactions of living things can be rendered into a data format and put
to social use.

Recap:

▪ Revision of need of Data science in Industry.



• Datafication, according to Mayer-Schönberger and Cukier, is the transformation
of social action into online quantified data, thus allowing for real-time tracking and
predictive analysis. Simply put, it is about taking a previously invisible
process or activity and turning it into data that can be monitored, tracked, analysed
and optimised. The latest technologies we use have enabled many new ways to
‘datify’ our daily and basic activities.

• In summary, datafication is a technological trend that turns many aspects of our
lives into computerized data and transforms organizations into data-driven
enterprises by converting this information into new forms of value.
• Datafication refers to the fact that daily interactions of living things can be
rendered into a data format and put to social use.



Examples:
• There are many examples of datafication.
• Social platforms such as Facebook or Instagram collect and
monitor data about our friendships to market products and services to
us and to provide surveillance services to agencies, which in turn changes our
behavior. The promotions that we see daily on social media are also the result of the
monitored data. In this model, data is used to redefine how content is created, with
datafication being used to inform content rather than recommendation
systems.

• There are other industries where the datafication process is actively used:
• Insurance: Data used to update risk profile development and business models.
• Banking: Data used to establish trustworthiness and the likelihood of a person
paying back a loan.
• Human resources: Data used to identify, e.g., employees’ risk-taking profiles.
• Hiring and recruitment: Data used to replace personality tests.
• Social science research: Datafication replaces sampling techniques and
restructures the manner in which social science research is performed.



Skill sets needed

Objective:
▪ The objective of this unit is to describe the programming skills for Data Science,
which bring together all the fundamental skills needed to transform
raw data into actionable insights. While there is no specific rule
about the selection of a programming language, Python and R are the
most favored ones.

Recap:

▪ Revision of need of Data science in Industry.



Data Science Lifecycle

Objective:
▪ The objective of this unit is to describe the data science life cycle, which is an
iterative set of steps you take to deliver a data science project or product.
Because every data science project and team is different, every specific
data science life cycle is different. However, most data science projects tend
to flow through the same general life cycle.

Recap:

▪ Revision of need of Data science in Industry.

Business Intelligence (BI) vs. Data Science

• BI analyzes previous data to find hindsight and insight that describe
business trends. BI enables you to take data from external and internal sources,
prepare it, run queries on it and create dashboards to answer questions
such as quarterly revenue analysis or business problems. BI can evaluate the impact of
certain events in the near future.
• Data Science is a more forward-looking, exploratory approach that
focuses on analyzing past or current data and predicting future outcomes
with the aim of making informed decisions. It answers open-ended questions
as to “what” and “how” events occur.


Lifecycle of Data Science


• Here is a brief overview of the main phases of the Data Science Lifecycle:


• Phase 1: Discovery. Before you begin the project, it is important to understand the various
specifications, requirements, priorities and required budget. You must possess the ability to
ask the right questions. Here, you assess whether you have the required resources in terms
of people, technology, time and data to support the project. In this phase, you also need to
frame the business problem and formulate initial hypotheses (IH) to test.
• Phase 2: Data preparation. In this phase, you require an analytical sandbox in which you can
perform analytics for the entire duration of the project. You need to explore, preprocess and
condition data prior to modeling. Further, you will perform ETLT (extract, transform, load
and transform) to get data into the sandbox. Let’s have a look at the Statistical Analysis flow
below.

• You can use R for data cleaning, transformation, and visualization. This will help you to spot
the outliers and establish relationships between the variables. Once you have cleaned and
prepared the data, it’s time to do exploratory analytics on it. Let’s see how you can achieve
that.

• Phase 3: Model planning. Here, you will determine the methods and techniques to draw the
relationships between variables. These relationships will set the base for the algorithms
which you will implement in the next phase. You will apply Exploratory Data Analytics (EDA)
using various statistical formulas and visualization tools.
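
The data preparation step (cleaning, then spotting outliers) can be sketched as follows. The slides suggest R for this; the sketch below uses Python's standard library instead, purely as an illustration. The raw values are made up, the 1.5 x IQR rule is one common outlier convention among several, and `clean_numeric` and `iqr_outliers` are hypothetical helper names.

```python
import statistics

# Hypothetical raw column with missing and malformed entries
raw_ages = ["34", "29", "", "41", "38", "n/a", "36", "240", "31"]

def clean_numeric(values):
    """Keep only entries that parse as numbers; drop blanks and junk."""
    cleaned = []
    for v in values:
        try:
            cleaned.append(float(v))
        except ValueError:
            continue
    return cleaned

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

ages = clean_numeric(raw_ages)
outliers = iqr_outliers(ages)
print(ages)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment for the analyst; the code only surfaces candidates.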

• Phase 4: Model building. In this phase, you will develop datasets for training and testing
purposes. You will consider whether your existing tools will suffice for running the
models or whether you need a more robust environment (such as fast and parallel
processing). You will analyze various learning techniques like classification, association
and clustering to build the model.
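
As a rough sketch of the training/testing idea in Phase 4, the snippet below splits a labelled dataset, fits a very simple classifier on the training part and measures accuracy on the held-out part. The data is synthetic, the nearest-centroid classifier is only a stand-in for the classification techniques mentioned above, and all function names are hypothetical.

```python
import random

# Synthetic labelled points: (feature_1, feature_2, label)
random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1), "A") for _ in range(50)] + \
       [(random.gauss(5, 1), random.gauss(5, 1), "B") for _ in range(50)]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = len(rows) - round(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]

def fit_centroids(train):
    """Compute the mean point (centroid) of each class."""
    sums = {}
    for x, y, label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, x, y):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda l: (x - centroids[l][0]) ** 2 + (y - centroids[l][1]) ** 2)

train, test = train_test_split(data)
centroids = fit_centroids(train)
accuracy = sum(predict(centroids, x, y) == label for x, y, label in test) / len(test)
print(f"train={len(train)} test={len(test)} accuracy={accuracy:.2f}")
```

Evaluating on rows the model never saw during fitting is what makes the accuracy figure a fair estimate of how it would behave on new data.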

You can achieve model building through the following tools


• Phase 5: Operationalize. In this phase, you deliver final reports, briefings,
code and technical documents. In addition, a pilot project is sometimes
implemented in a real-time production environment. This gives you a
clear picture of the performance and other related constraints on a small
scale before full deployment.

• Phase 6: Communicate results. Now it is important to evaluate whether you have
been able to achieve the goal that you planned in the first phase. So, in
the last phase, you identify all the key findings, communicate them to the
stakeholders and determine whether the results of the project are a success or a
failure based on the criteria developed in Phase


Data Science Tools and technologies


Objective:
▪ The objective of this unit is to list the top data science tools (Top 9 Tools, most used in 2021):
❖ Apache Spark.
❖ BigML.
❖ D3.js.
❖ MATLAB.
❖ SAS.
❖ Tableau.
❖ Matplotlib.
❖ Scikit-learn.
Recap:
▪ Revision of need of Data science in Industry.


Let’s have a look at various model planning tools.

• R has a complete set of modeling capabilities and provides a good environment for
building interpretive models.
• SQL Analysis services can perform in-database analytics using common data mining
functions and basic predictive models.
• SAS/ACCESS can be used to access data from Hadoop and is used for creating
repeatable and reusable model flow diagrams.

Although many tools are available in the market, R is the most commonly used
tool.
Now that you have insights into the nature of your data and have decided which
algorithms to use, in the next stage you will apply the algorithm and build the
model.


Types of Data Analysis


Objective:
▪ The objective of this unit is to show that Data Analysis can be organized
into 6 types, arranged in increasing order of difficulty:
❖ Descriptive Analysis
❖ Exploratory Analysis
❖ Inferential Analysis
❖ Predictive Analysis
❖ Causal Analysis
❖ Mechanistic Analysis
Recap:
▪ Revision of need of Data science in Industry.


1. Descriptive Analysis
Goal: Describe or summarize a set of data.
Description:
• The very first analysis performed.
• Generates simple summaries about samples and measurements.
• Uses common descriptive statistics (measures of central tendency, variability,
frequency, position, etc.).
Example:
Take the COVID-19 statistics page on Google, for example. The line graph is just a
pure summary of the cases/deaths, a presentation and description of the
population of a particular country infected by the virus.
Summary:
Descriptive Analysis is the first step in analysis, where you summarize and
describe the data you have using descriptive statistics. Its result is a simple
presentation of your data.
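
A minimal sketch of descriptive statistics with Python's standard `statistics` module follows. The daily case counts are made-up numbers chosen only for illustration, not real COVID-19 data.

```python
import statistics

# Hypothetical daily case counts for one week
daily_cases = [120, 135, 150, 150, 160, 175, 210]

summary = {
    "count": len(daily_cases),
    "mean": statistics.mean(daily_cases),      # central tendency
    "median": statistics.median(daily_cases),  # central tendency / position
    "mode": statistics.mode(daily_cases),      # most frequent value
    "stdev": statistics.stdev(daily_cases),    # variability (sample standard deviation)
    "min": min(daily_cases),
    "max": max(daily_cases),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Nothing here predicts or explains anything; the output is only a compact description of the data at hand, which is exactly the scope of descriptive analysis.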

2. Exploratory Analysis (EDA)
Goal: Examine or explore data and find relationships between variables which
were previously unknown.
Description:
• EDA helps you discover relationships between measures in your data. These
relationships are not evidence of causation, as captured by the phrase
“correlation doesn’t imply causation”.
• Useful for discovering new connections, forming hypotheses, and driving design
planning and data collection.
Example:
Climate change is an increasingly important topic as the global temperature is
gradually rising over the years. One example of EDA on climate change is
taking the rise in temperature over the years, say 1950 to 2020, together with
the increase in human activities and industrialization, and forming relationships
from the data, e.g. the rise in temperature correlates with the increasing number
of factories, cars on the road and airplane flights.
Summary:
EDA explores data to find relationships between measures. It tells us that they exist,
without establishing the cause. They can be used to formulate hypotheses.
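
One common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it by hand; the two series are invented illustrative numbers, not real climate measurements, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented illustrative series (not real climate data)
factory_index = [10, 20, 30, 40, 50, 60]
temp_anomaly = [0.10, 0.18, 0.33, 0.41, 0.48, 0.62]

r = pearson_r(factory_index, temp_anomaly)
print(f"r = {r:.3f}")
```

A value of r close to 1 says the two series move together; it does not by itself say that one causes the other, which is why EDA results are treated as hypotheses to test.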
3. Inferential Analysis
Goal: Use a small sample of data to infer about a larger population.
The goal of statistical modeling itself is to use a small amount of information to
extrapolate and generalize to a larger group.
Description:
• Estimates a value in the population from sample data and gives a measure of
uncertainty (standard deviation) for the estimate.
• Accuracy of inference depends heavily on the sampling scheme; if the sample isn’t
representative of the population, the generalization will be inaccurate (see the Central
Limit Theorem).
Example:
The idea of inferring about the population at large from a smaller sample is quite intuitive.
Many statistics you see in the media and on the internet are inferential: a prediction of an
event based on a small sample. For example, consider a psychology study on the benefits of
sleep with a total of 500 people involved. When followed up, the participants
reported better overall attention and well-being with 7–9 hours of sleep, while
those with less sleep or more sleep suffered from reduced attention and energy. This
report from 500 people covers just a tiny portion of the 7 billion people in the world, and
is thus an inference about the larger population.
Summary:
IA extrapolates and generalizes information about the larger group from a smaller sample to
generate analysis and predictions.
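
The sample-to-population step can be sketched numerically. The snippet below draws a sample of 500 from a simulated population and reports the sample mean with its standard error and an approximate 95% interval. The population is synthetic (normally distributed sleep hours, an assumption made only for illustration), and the 1.96 multiplier is the usual normal approximation.

```python
import math
import random
import statistics

random.seed(1)

# Synthetic population: nightly sleep hours for 100,000 people (assumed normal)
population = [random.gauss(7.0, 1.2) for _ in range(100_000)]

# Simple random sample of 500 people, as in the study example
sample = random.sample(population, 500)

sample_mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation divided by sqrt(n)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval using the normal critical value 1.96
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"sample mean = {sample_mean:.2f} h")
print(f"approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is only trustworthy because the sample was drawn at random; a biased sampling scheme would make the same arithmetic misleading.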
4. Predictive Analysis
Goal: Using historical or current data to find patterns and make predictions about
the future.
Description:
Accuracy of the predictions depends on the input variables.
Accuracy also depends on the type of model; a linear model might work well in
some cases and poorly in others.
Using one variable to predict another doesn’t imply a causal relationship.
Example:
The 2020 US election was a popular topic, and many prediction models were built to
predict the winning candidate. FiveThirtyEight published a widely followed 2016 election
forecast and did so again in 2020. Predictive analysis for an election requires
input variables such as historical polling data, trends, and current polling data
in order to get a good prediction. Something as large as an election wouldn’t
use just a simple linear model, but a complex model tuned to best serve
its purpose.
Summary:
PA takes data from the past and present to make predictions about the future.
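As a small illustration of predicting a future value from historical data, the sketch below fits a straight line by ordinary least squares and extrapolates one step ahead. The monthly figures are hypothetical and deliberately simple; as noted above, a linear model is only one possible choice and may fit real data poorly.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical historical data: month index -> units sold.
months = [1, 2, 3, 4, 5, 6]
units = [110, 120, 130, 140, 150, 160]

intercept, slope = fit_line(months, units)

# Predict the next (unseen) month from the fitted pattern.
forecast_month_7 = intercept + slope * 7
print(forecast_month_7)
```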
5. Causal Analysis
Goal: Examine cause-and-effect relationships between variables, with a focus on finding
the cause of a correlation.
Description:
To find the cause, you have to question whether the observed correlations driving your
conclusion are valid; just looking at the data (the surface) won’t help you discover the
hidden mechanisms underlying the correlations.
Applied in randomized studies focused on identifying causation.
Considered the gold standard in data analysis and scientific studies, where the cause of a
phenomenon is to be extracted and singled out, like separating wheat from chaff.
Challenges:
Good data is hard to find and requires expensive research and studies. These studies are
analyzed in aggregate (multiple groups), and the observed relationships are just average
effects (means) over the whole population, meaning the results might not apply to everyone.
Example: Say you want to test a new drug intended to improve human strength and focus.
To do that, you perform a randomized controlled trial. You compare the candidates receiving
the new drug with the candidates receiving a placebo control on a few tests of strength and
of overall focus and attention, and observe how the drug affects the outcome.
Summary: CA is about finding the causal relationship between variables: change one
variable and see what happens to another.
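The randomized controlled trial described above can be sketched as a small simulation. Everything here is assumed for illustration: the participants, their baseline scores, and the "true" drug effect of 3 points are invented so that the estimate can be compared with a known answer.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical participants, each with a baseline focus score.
participants = [rng.gauss(50, 5) for _ in range(200)]

# Random assignment: shuffle, then split into treatment and control halves.
rng.shuffle(participants)
treatment_baseline = participants[:100]
control_baseline = participants[100:]

TRUE_EFFECT = 3.0  # assumed effect of the drug, built into the simulation

# Outcome = baseline + noise (+ effect for the treatment group only).
treatment = [b + TRUE_EFFECT + rng.gauss(0, 1) for b in treatment_baseline]
control = [b + rng.gauss(0, 1) for b in control_baseline]

# The difference in group means estimates the average treatment effect.
estimated_effect = statistics.mean(treatment) - statistics.mean(control)
print(f"estimated average effect: {estimated_effect:.2f}")
```

Random assignment is what allows the difference in means to be read as a causal effect; note that it is an average over the groups and may not hold for every individual.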
12 December 2023 OS Unit-1 75
6. Mechanistic Analysis
Goal: Understand the exact changes in variables that lead to changes in other variables.
Description:
Applied in the physical or engineering sciences, in situations that require high precision
and leave little room for error (the only noise in the data is measurement error).
Designed to understand a biological or behavioral process, the pathophysiology of a
disease, or the mechanism of action of an intervention (definition by NIH).
Example:
Much graduate-level research on complex topics provides suitable examples. To put it
simply, suppose an experiment is done to simulate safe and effective nuclear fusion
to power the world. A mechanistic analysis of the study would entail a precise balance of
controlling and manipulating variables, with highly accurate measurements of both the
variables and the desired outcomes. It’s this intricate and meticulous modus operandi
(strategy) towards these big topics that allows for scientific breakthroughs and the
advancement of society.
Summary:
MA is in some ways a predictive analysis, but modified to tackle studies that require high
precision and meticulous methodologies in the physical or engineering sciences.
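A toy illustration of mechanistic analysis, where the governing relationship is known exactly and the only noise is measurement error: the sketch below uses the free-fall equation d = 0.5 * g * t^2 and recovers g from simulated measurements. The example, the timings, and the 1 cm measurement error are assumptions chosen for illustration, not taken from the slides.

```python
import random

G = 9.81  # m/s^2, standard gravitational acceleration

def fall_distance(t, g=G):
    """Distance fallen from rest after t seconds (no air resistance)."""
    return 0.5 * g * t ** 2

rng = random.Random(1)
times = [0.5, 1.0, 1.5, 2.0, 2.5]

# Simulated measurements: exact physics plus small measurement error (1 cm std dev).
measured = [fall_distance(t) + rng.gauss(0, 0.01) for t in times]

# Least-squares estimate of g for the model d = g * x, where x = 0.5 * t^2.
xs = [0.5 * t ** 2 for t in times]
g_hat = sum(x * d for x, d in zip(xs, measured)) / sum(x * x for x in xs)
print(f"estimated g = {g_hat:.3f} m/s^2")
```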


Need for Data Science

Objective:
▪ The objective of this unit is to explain why we need data science:
it gives us the ability to process and interpret data. This enables companies
to make informed decisions around growth, optimization, and
performance. For example, machine learning is now being used to
make sense of every kind of data, big or small.

Recap:

▪ Revision of the need for data science in industry.

• Data science is an emerging field with multiple aspects: for one, it
studies and examines huge amounts of data; for another, its branches extend into
almost every field.
• The data we work on is not simple; it is complex data that is structured in many layers.
Data science is founded on three main components: statistics,
mathematics, and programming.
• Artificial intelligence draws on the concepts of all three fields and acts as the
machinery or brain of data science. Data science uses techniques, procedures,
algorithms, rules, and tools from all three components and works as a unified
mechanism to solve the complex problems that arise in the world around us.
• As the name says, the world of data science revolves around data. Data is a unit of
recorded information, whereas data science uses
mathematical algorithms, rules, and artificial intelligence to handle the collection,
refining, aligning, storage, manipulation, and utilization of data.
• The entire idea is to perform result-driven calculations on the data to get insights for
business and research (Blei and Smyth, 2017).
• From business to the health industry, science to our everyday lives, and marketing to
research, data is required in every field to drive progress. Computer science and
information technology have taken over our lives, and they are advancing with each passing
day with such velocity and variety that the operational techniques used a few years
back have now become obsolete.


• The same is the case with challenges and problems. The problems and
concerns of the past for a specific theme, illness, or shortfall may not be the
same today, as they have grown in complexity.

• Every field of science and study, and every organization, therefore needs an updated
set of operational systems and technology to keep up with the challenges of
today and tomorrow, as well as to derive solutions for unanswered questions.
Data science is changing the course of business procedures
and plans. It is solving intricate problems with the help of technology. Data
science provides insight derived from big data through
extraction and processing. This data is sometimes
collected from ongoing sources within the system, and most of the time it
is mined from external sources.

• Data is the key component for every business, as businesses need it to analyze
their current situation based on past facts and performance and to make decisions
for future challenges. They need data to survive in today’s competitive market
and to mature their decision-making, which enhances their
productivity and profitability. Today, every business requires data science
to make forecasts and predictions based on facts and figures,
which are collected in the form of data and processed through data science.

Data Science
• There are two ways of looking at data: with the intent to explain
behavior that has already occurred and for which you have gathered data, or
with the intent to use the data you already have to predict future behavior that
has not yet happened.


Data Science explaining the past


• Business Intelligence
• Before data science jumps into predictive analytics, it must look at the patterns
of behaviour the past provides, analyse them to draw insight and inform the path
for forecasting. Business intelligence focuses precisely on this: providing data-
driven answers to questions like: How many units were sold? In which region
were the most goods sold? Which type of goods sold where? How did the email
marketing perform last quarter in terms of click-through rates and revenue
generated? How does that compare to the performance in the same quarter of
last year?
• Although Business Intelligence does not have “data science” in its title, it is part
of data science, and not in any trivial sense.
• What does Business Intelligence do?
• Of course, Business Intelligence Analysts can apply Data Science to measure
business performance. But in order for the Business Intelligence Analyst to
achieve that, they must employ specific data handling techniques.
• The starting point of all data science is data. Once the relevant data is in the
hands of the BI Analyst (monthly revenue, customer data, sales volume, etc.), they
must quantify the observations, calculate KPIs, and examine measures to extract
insights from their data.
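The kind of KPI calculation a BI analyst performs can be sketched as follows. The sales records, region names, and email counts are hypothetical values chosen only to show the arithmetic behind questions such as "How many units were sold?" and "In which region were the most goods sold?".

```python
from collections import defaultdict

# Hypothetical sales records: (region, units_sold, revenue).
sales = [
    ("North", 120, 2400.0),
    ("South", 80, 1700.0),
    ("North", 60, 1250.0),
    ("East", 150, 2900.0),
]

# KPI 1: total units sold.
total_units = sum(units for _, units, _ in sales)

# KPI 2: units sold per region, and the best-selling region.
units_by_region = defaultdict(int)
for region, units, _ in sales:
    units_by_region[region] += units
top_region = max(units_by_region, key=units_by_region.get)

# KPI 3: email click-through rate = clicks / emails delivered (made-up counts).
emails_delivered, clicks = 5000, 150
click_through_rate = clicks / emails_delivered

print(total_units, top_region, f"{click_through_rate:.1%}")
```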


Data Science is about telling a story


• Apart from handling strictly numerical information, data science, and specifically
business intelligence, is about visualizing the findings, and creating easily
digestible images supported only by the most relevant numbers. After all, all levels
of management should be able to understand the insights from the data and
inform their decision-making.

• Business intelligence analysts create dashboards and reports, accompanied by
graphs, diagrams, maps, and other comparable visualizations to present the
findings relevant to the current business objectives.

Data Science predicting the future

• Predictive analytics in data science rests on the shoulders of
explanatory data analysis, which is precisely what we have been
discussing up to this point. Once the BI reports and
dashboards have been prepared and insights have been extracted
from them, this information becomes the basis for
predicting future values. The accuracy of these
predictions depends on the methods used.
• Recall the distinction between traditional data and big data
in data science.
• We can make a similar distinction regarding predictive
analytics and their methods: traditional data science
methods vs. machine learning. The former deal primarily with
traditional data, and the latter with big data.


Traditional forecasting methods in Data Science: What are they?
• Traditional forecasting methods comprise the classical statistical methods for
forecasting: linear regression analysis, logistic regression analysis, clustering,
factor analysis, and time series. The output of each of these can feed into more
sophisticated machine learning analytics, but let’s first review them individually.
• A quick side note: some in the data science industry refer to several of these
methods as machine learning too, but here machine learning refers to
newer methods, such as deep learning.


• Linear regression
• In data science, the linear regression model is used for quantifying
relationships among the variables included in the analysis, such as the
relationship between house prices and the size of the house, the neighborhood,
and the year built. The model calculates coefficients with which you can predict
the price of a new house, if you have the relevant information available. On its own,
a fitted coefficient describes an association, not necessarily a cause.
• Logistic regression
• Since it’s not possible to express all relationships between variables as linear,
data science makes use of methods like logistic regression to create non-linear
models. Logistic regression models a binary outcome coded as 0 or 1. For example,
a company might apply a logistic regression model to filter job candidates during
screening. If the model estimates that the probability that a prospective
candidate will perform well in the company within a year is above 50%, it
predicts 1, a successful application. Otherwise, it predicts 0.
• Cluster analysis
• This exploratory data science technique is applied when the observations in the
data form groups according to some criteria. Cluster analysis takes into account
that some observations exhibit similarities, and facilitates the discovery of new
significant predictors, ones that were not part of the original conceptualization
of the data.
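The 50% decision rule described for logistic regression above can be sketched as follows. The features, weights, and bias are hypothetical, hand-picked numbers; a real model would learn its coefficients from historical data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, 1 if p > threshold else 0

# Hypothetical, hand-picked coefficients (a real model would learn them from data).
# Features: years of experience, skills-test score scaled to 0..1.
weights = [0.4, 2.0]
bias = -2.0

p_strong, label_strong = predict([3, 0.9], weights, bias)  # z is about 1.0
p_weak, label_weak = predict([1, 0.3], weights, bias)      # z is about -1.0
print(label_strong, label_weak)
```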


Factor analysis
If clustering is about grouping observations together, factor analysis is about
grouping features together. Data science uses factor analysis to
reduce the dimensionality of a problem. For example, if in a 100-item
questionnaire every 10 questions pertain to a single general attitude, factor
analysis will identify these 10 factors, which can then be used in a regression
that delivers a more interpretable prediction. Many of the techniques in data
science are combined like this.
Time series analysis
Time series analysis is a popular method for following the development of specific
values over time. Experts in economics and finance use it because their subject
matter includes stock prices and sales volume, variables that are typically plotted
against time.
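One of the simplest time series techniques is a moving average, which smooths a series so that its development over time is easier to follow. The sketch below is a minimal illustration with hypothetical monthly sales volumes; it is not a full forecasting method.

```python
def moving_average(values, window):
    """Simple moving average; returns one value per full window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical monthly sales volumes.
sales = [10, 12, 14, 13, 15, 18]
smoothed = moving_average(sales, window=3)
print(smoothed)
```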

Analysis Vs Analytics Vs Reporting

Objective:
▪ The objective of this unit is to distinguish analysis from reporting. Analysis is the
process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business
performance. Reporting translates raw data into information.
Analysis transforms data and information into insights.

Recap:

▪ Revision of the need for data science in industry.


Data Analytics
• Data analytics is the often complex process of examining big data to uncover
information -- such as hidden patterns, correlations, market trends and
customer preferences -- that can help organizations make informed business
decisions.
• On a broad scale, data analytics technologies and techniques give organizations
a way to analyze data sets and gather new information. Business intelligence
(BI) queries answer basic questions about business operations and
performance.
• Big data analytics is a form of advanced analytics, which involves complex
applications with elements such as predictive models, statistical algorithms and
what-if analysis powered by analytics systems.
• Why is big data analytics important?
• Organizations can use big data analytics systems and software to make data-
driven decisions that can improve business-related outcomes. The benefits may
include more effective marketing, new revenue opportunities, customer
personalization and improved operational efficiency. With an effective strategy,
these benefits can provide competitive advantages over rivals.
Analysis vs Reporting

1. Purpose: Reporting has helped companies monitor their data since before digital
technology boomed. Organizations depend on the information it brings to their
business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link
cross-channel data, provide comparisons, and make information easier to understand (think
of dashboards, charts, and graphs, which are reporting tools and not analysis outputs),
analysis interprets this information and provides recommendations on actions.

2. Tasks: Reporting includes building, configuring, consolidating, organizing,
formatting, and summarizing. Analysis consists of questioning, examining, interpreting,
comparing, and confirming. With big data, predicting is possible as well.


3. Outputs: Reporting has a push approach: it pushes information to users, and outputs
come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst pulls information to probe further and
to answer business questions. Its outputs can take the form of ad hoc responses and
analysis presentations.

4. Delivery: Because reporting involves repetitive tasks, often with truckloads of
data, automation has been a lifesaver, especially now with big data. It’s not surprising
that data entry services are among the first things outsourced, since outsourcing companies
are perceived as data reporting experts.
Analysis requires a more custom approach, with human minds doing the reasoning
and analytical thinking to extract insights, and technical skills to provide efficient steps
towards accomplishing a specific goal.

5. Value: The path to value illustrates how data is converted into value by reporting and
analysis together; neither can achieve it without the other.

Big Data Ecosystem

Objective:
▪ This unit deals with the big data ecosystem and
frameworks. It also focuses on how we can manage big data with
open source frameworks.

Recap:

▪ Revision of data generation process.


Main Technology Components Of Big Data

1. Data Management
2. Data Mining
3. Hadoop
4. In-Memory Analytics
5. Predictive Analytics
6. Text Mining
Why is big data analytics important?
1. Reduced cost
2. Quick decision making
3. New products and features

Application of Big Data


Future of Data Science

Objective:
▪ The objective of this unit is to outline the future of data science. Consider the
growth of data coming from IoT or from social data at the edge. Looking a little
further ahead, the US Bureau of Labor Statistics has been cited as predicting that by
2026 there will be 11.5 million jobs
in data science and analytics.

Recap:

▪ Revision of the need for data science in industry.

• Data science is a large pool of data operations. These data operations also
involve machine learning and statistics. Machine learning algorithms are very much
dependent on data. This data is fed to the model in the form of a training set and a test
set: the training set is used to fit and tune the model’s
parameters, and the test set is used to evaluate it.

• By all means, advancement in Machine Learning is the key contributor towards the
future of data science.

• In particular, Data Science also covers:

• Data Integration.
• Distributed Architecture.
• Automating Machine learning.
• Data Visualization.
• Dashboards and BI.
• Data Engineering.
• Deployment in production mode
• Automated, data-driven decisions.
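The training set and test set mentioned earlier in this slide can be produced with a simple shuffled split. The helper below is a hypothetical sketch (the function name, the 80/20 ratio, and the toy dataset are assumptions); libraries such as scikit-learn provide their own equivalents.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 10 labelled examples: (feature, label).
data = [(x, x % 2) for x in range(10)]
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

Keeping the test rows out of model fitting is what makes the evaluation on them an honest estimate of performance on unseen data.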


• i. Data science currently does not have a fixed definition because of the vast number of
data operations it covers. These data operations will only increase in the future. However,
the definition of data science is expected to become more specific and constrained, as it
will incorporate only the essential areas that define core data science.

• ii. In the near future, data scientists will be able to take on
business-critical areas as well as several complex challenges. This will help businesses
make large leaps forward. Companies at present face a huge
shortage of data scientists. However, this is set to change in the future.

• In India alone, an acute shortage of data science professionals was projected until 2020.
The main reason for this shortage is the varied set of skills required
for data science operations.

• There are very few existing curricula that address the requirements of data scientists
and train them. However, this is gradually changing with the introduction of data
science degrees and bootcamps that can transform a professional from a quantitative
background or a software background into a fully-fledged data scientist.


Data Science Future Career Predictions


• According to IBM, data science job openings were predicted to increase by
364,000 to 2,720,000 (IBM’s demand prediction for data scientists for 2020).

• We can summarize the trends leading to the future of data science in the following
three points:

• Increasingly complex data science algorithms will be packaged into libraries,
making them much easier to deploy. For example, machine
learning algorithms such as decision trees, which once required considerable resources, can
now be deployed easily.
• Large-scale enterprises are rapidly adopting machine learning to drive their
business in several ways. Automation of several tasks is one of the key future goals of
industry. As a result, companies are better able to prevent losses.
• As discussed above, the growing prevalence of academic programs and data literacy
initiatives is exposing students to data-related disciplines. This gives students a
competitive edge and helps them stay ahead of the curve.

Applications of Data Science in various fields

Objective:
▪ The objective of this unit is to survey applications of data science. The disciplinary
areas that make up the data science field include data mining, statistics, machine learning,
analytics, and programming. Statistical measures or predictive
analytics use the extracted data to gauge events that are likely to
happen in the future based on what the data shows happened in the
past.

Recap:

▪ Revision of the need for data science in industry.


• The following 10 applications build upon the concepts of data science
across various domains:

• Fraud and Risk Detection
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
• Advanced Image Recognition
• Speech Recognition
• Airline Route Planning
• Gaming
• Augmented Reality



Crowd sourcing analytics

Objective:
▪ The objective of this unit is to explain that crowdsourcing data is an
effective way to seek the help of a large audience, usually through
the internet, to gather information on how to solve a company's
problems and to generate new ideas and innovations. ... These brief
understudies towards idea and collaboration, and give useable
information to them.

Recap:

▪ Revision of the need for data science in industry.

Big challenge in big data analytics: Manpower bottleneck
▪ Automatic data analysis techniques (e.g. machine learning) are often
considered as main components of data analytics
▪ Data analysis is heavily labor intensive
– Manual processing dominates a large portion of data analysis process
– 1990s-2000s: introduction of data mining techniques and data analysis process
standards (e.g., CRISP-DM)
Data analysis process: business understanding -> data collection/computerization ->
data cleansing/curation -> modeling/visualization -> evaluation/interpretation

Data Security Issues

Objective:
▪ The objective of this unit is to explain how the integrity and privacy of data are
put at risk by unauthorized users, external parties listening in on the
network, and internal users giving away the store. This section
explains the risky situations and potential attacks that could
compromise your data.

Recap:

▪ Revision of the need for data science in industry.

Data Security Risks
• Below are several common issues faced by organizations of all sizes as they attempt to
secure sensitive data.
• Accidental Exposure
• A large percentage of data breaches are not the result of a malicious attack but are
caused by negligent or accidental exposure of sensitive data. It is common for an
organization’s employees to share, grant access to, lose, or mishandle valuable data,
either by accident or because they are not aware of security policies.
• This major problem can be addressed by employee training, but also by other
measures, such as data loss prevention (DLP) technology and improved access
controls.
• Phishing and Other Social Engineering Attacks
• Social engineering attacks are a primary vector used by attackers to access sensitive
data. They involve manipulating or tricking individuals into providing private
information or access to privileged accounts.


• Phishing is a common form of social engineering. It involves messages
that appear to be from a trusted source, but in fact are sent by an
attacker. When victims comply, for example by providing private
information or clicking a malicious link, attackers can compromise their
device or gain access to a corporate network.
• Insider Threats
• Insider threats are employees who inadvertently or intentionally
threaten the security of an organization’s data. There are three types of
insider threats:
• Non-malicious insider—these are users that can cause harm
accidentally, via negligence, or because they are unaware of security
procedures.
• Malicious insider—these are users who actively attempt to steal data or
cause harm to the organization for personal gain.
• Compromised insider—these are users who are not aware that their
accounts or credentials were compromised by an external attacker. The
attacker can then perform malicious activity, pretending to be a
legitimate user.

• Ransomware
• Ransomware is a major threat to data in companies of all sizes.
Ransomware is malware that infects corporate devices and encrypts data,
making it useless without the decryption key. Attackers display a ransom
message asking for payment to release the key, but in many cases, even
paying the ransom is ineffective and the data is lost.
• Many types of ransomware can spread rapidly, and infect large parts of a
corporate network. If an organization does not maintain regular backups,
or if the ransomware manages to infect the backup servers, there may be
no way to recover.
• Data Loss in the Cloud
• Many organizations are moving data to the cloud to facilitate easier
sharing and collaboration. However, when data moves to the cloud, it is
more difficult to control and prevent data loss. Users access data from
personal devices and over unsecured networks. It is all too easy to share a
file with unauthorized parties, either accidentally or maliciously.


• There are several technologies and practices that can improve data
security. No one technique can solve the problem, but by combining
several of the techniques below, organizations can significantly improve
their security posture.
• Data Discovery and Classification
• Modern IT environments store data on servers, endpoints, and cloud
systems. Visibility over data flows is an important first step in
understanding what data is at risk of being stolen or misused. To properly
protect your data, you need to know the type of data, where it is, and
what it is used for. Data discovery and classification tools can help.
• Data discovery is the basis for knowing what data you have. Data
classification allows you to create scalable security solutions by
identifying which data is sensitive and needs to be secured. Data
discovery and classification solutions enable tagging files on endpoints,
file servers, and cloud storage systems, letting you visualize data across
the enterprise and apply the appropriate security policies.
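A very small sketch of classification by pattern matching is shown below: each document is tagged according to the kinds of sensitive data it appears to contain. The patterns, tag names, and file paths are illustrative assumptions; commercial discovery and classification tools use far more robust detection.

```python
import re

# Illustrative patterns only; real classifiers use far more robust detection.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def classify(text):
    """Return the set of sensitive-data tags found in text."""
    return {tag for tag, pattern in PATTERNS.items() if pattern.search(text)}

def sensitivity(text):
    """Label a document 'sensitive' if any pattern matches, else 'public'."""
    return "sensitive" if classify(text) else "public"

# Hypothetical documents keyed by storage location.
documents = {
    "fileserver/notes.txt": "Meeting moved to Tuesday.",
    "endpoint/export.csv": "jane@example.com,4111 1111 1111 1111",
}
labels = {path: sensitivity(body) for path, body in documents.items()}
print(labels)
```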

Common Data Security Solutions and Techniques
• Data Masking
• Data masking lets you create a synthetic version of your organizational data, which you can use
for software testing, training, and other purposes that don’t require the real data. The goal is
to protect data while providing a functional alternative when needed.
• Data masking retains the data type, but changes the values. Data can be modified in a number
of ways, including encryption, character shuffling, and character or word substitution.
Whichever method you choose, you must change the values in a way that cannot be reverse-
engineered.
• Identity Access Management
• Identity and Access Management (IAM) is a business process, strategy, and technical
framework that enables organizations to manage digital identities. IAM solutions allow IT
administrators to control user access to sensitive information within an organization.
• Systems used for IAM include single sign-on systems, two-factor authentication, multi-factor
authentication, and privileged access management. These technologies enable the
organization to securely store identity and profile data, and support governance, ensuring that
the appropriate access policies are applied to each part of the infrastructure.
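The data masking approach described earlier in this slide (keep the data type and format, change the values by character substitution) can be sketched as follows. The record and field names are hypothetical. Random substitution is used here so that no reversible mapping is kept; this is one possible design, not the only way to mask data.

```python
import random
import string

def mask_value(value, rng):
    """Replace each letter with a random letter and each digit with a random digit.

    Punctuation, spacing, and length are preserved so the format stays valid.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)

rng = random.SystemRandom()  # not seeded, so the substitution cannot be replayed
record = {"name": "Asha Verma", "phone": "98765-43210"}
masked = {field: mask_value(value, rng) for field, value in record.items()}
print(masked)
```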

• Data Encryption
• Data encryption is a method of converting data from a readable format (plaintext)
to an unreadable encoded format (ciphertext). The data can be read or processed only
after it has been decrypted using the decryption key.
• In public-key cryptography, there is no need to share the decryption
key: each party has a key pair, the sender encrypts with the recipient’s public key, and
only the recipient’s private key can decrypt the message. This avoids having to exchange
a secret key over an insecure channel.
• Data encryption can prevent hackers from accessing sensitive information. It is
essential for most security strategies and is explicitly required by many
compliance standards.
• Data Loss Prevention (DLP)
• To prevent data loss, organizations can use a number of safeguards, including
backing up data to another location. Physical redundancy can help protect data
from natural disasters, outages, or attacks on local servers. Redundancy can be
performed within a local data center, or by replicating data to a remote site or
cloud environment.
• Beyond basic measures like backup, DLP software solutions can help protect
organizational data. DLP software automatically analyzes content to identify
sensitive data, enabling central control and enforcement of data protection
policies, and alerting in real-time when it detects anomalous use of sensitive data,
for example, large quantities of data copied outside the corporate network.
• Governance, Risk, and Compliance (GRC)
• GRC is a methodology that can help improve data security and compliance:
• Governance creates controls and policies enforced throughout an organization to ensure compliance
and data protection.
• Risk involves assessing potential cybersecurity threats and ensuring the organization is prepared for
them.
• Compliance ensures organizational practices are in line with regulatory and industry standards when
processing, accessing, and using data.
• Password Hygiene
• One of the simplest best practices for data security is ensuring users have unique, strong passwords.
Without central management and enforcement, many users will use easily guessable passwords or
use the same password for many different services. Password spraying and other brute force attacks
can easily compromise accounts with weak passwords.
• A simple measure is enforcing longer passwords and asking users to change passwords frequently.
However, these measures are not enough, and organizations should consider multi-factor
authentication (MFA) solutions that require users to identify themselves with a token or device they
own, or via biometric means.
• Another complementary solution is an enterprise password manager that stores employee
passwords in encrypted form, reducing the burden of remembering passwords for multiple
corporate systems and making it easier to use stronger passwords. However, the password manager
itself becomes a high-value target for attackers, so it must be protected accordingly.
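Central enforcement of password rules can be sketched as a simple policy check. The minimum length, the tiny list of common passwords, and the character rules below are assumed values for illustration, not a recommended standard.

```python
import re

MIN_LENGTH = 12  # assumed policy value
COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein"}  # tiny sample list


def password_problems(password):
    """Return a list of policy violations; an empty list means the password passes."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append("too short")
    if password.lower() in COMMON_PASSWORDS:
        problems.append("commonly used")
    if not re.search(r"[a-z]", password) or not re.search(r"[A-Z]", password):
        problems.append("needs both lower-case and upper-case letters")
    if not re.search(r"\d", password):
        problems.append("needs a digit")
    return problems


print(password_problems("password"))
print(password_problems("Correct-Horse-Battery-9"))
```

As the slide notes, checks like this are only a first step; multi-factor authentication addresses the cases where a strong password is still stolen.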

Authentication and Authorization
• Organizations must put in place strong authentication methods, such as OAuth 2.0
with OpenID Connect for web-based systems. It is highly recommended to enforce
multi-factor authentication whenever any user, internal or external, requests
sensitive or personal data.
• In addition, organizations must have a clear authorization framework in place,
which ensures that each user has exactly the access rights they need to perform a
function or consume a service, and no more. Periodic reviews and automated
tools should be used to clean up permissions and revoke access from users
who no longer need it.
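The "exactly the access rights they need, and no more" rule can be sketched as a role-to-permission lookup plus a review helper. The role names and permission names are hypothetical and exist only for this example.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"read_reports"},
    "admin": {"read_reports", "export_personal_data", "manage_users"},
}


def is_allowed(role, permission):
    """Allow an action only if the role was explicitly granted the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def unused_permissions(granted, used):
    """Permissions granted but never used: candidates for removal in a periodic review."""
    return granted - used


print(is_allowed("analyst", "read_reports"))
print(is_allowed("analyst", "export_personal_data"))
print(unused_permissions(ROLE_PERMISSIONS["admin"], {"read_reports"}))
```

Unknown roles fall back to an empty permission set, so access is denied by default.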
Data Security Audits
• The organization should perform security audits at least every few months. An audit
identifies gaps and vulnerabilities across the organization's security posture. It is a
good idea to have the audit performed by a third-party expert, for example in a
penetration testing model, although a security audit can also be performed
in house. Most importantly, when the audit exposes security issues, the
organization must devote time and resources to address and remediate them.
Anti-Malware, Antivirus, and Endpoint Protection
• Malware is one of the most common vectors of modern cyberattacks, so organizations
must ensure that endpoints such as employee workstations, mobile devices, servers,
and cloud systems have appropriate protection. The basic measure is antivirus
software, but this is no longer enough to address newer threats such as fileless attacks
and unknown zero-day malware.
• Endpoint protection platforms (EPP) take a more comprehensive approach to
endpoint security. They combine antivirus with a machine-learning-based analysis
of anomalous behavior on the device, which can help detect unknown attacks.
Most platforms also provide endpoint detection and response (EDR) capabilities,
which help security teams identify breaches on endpoints as they happen,
investigate them, and respond by locking down and reimaging affected endpoints.

Use Cases of Data Science: Facebook, Netflix, Amazon, Uber, Airbnb

Objective:
▪ The objective of this topic is to study data analytics use cases.
Fundamentally, a data analytics use case is the manner in which a business
user leverages data and the analytics system to derive insights that answer
tangible business questions for decision making.

Recap:

▪ Revision of the need for data science in industry.


Netflix Case
• Netflix, an internet streaming media provider, is a clear example of the datafication
process. It provides services to 33 million streaming members in more than 40 countries.
Originally, operations were more physical in nature, with a core business in
mail-order disc rental (DVD and Blu-ray). Put simply, the subscriber creates and
maintains a queue (an ordered list) of the media content they want to rent (for
example, a movie). Within a limit on the total number of discs held at one time, the
subscriber can keep a disc for as long as they wish. To rent a new disc, however, the
subscriber sends the previous one back to Netflix, which then sends out the next
available disc in the subscriber's queue. Thus, the business goal of the disc
rental model is to help people fill their queue. The model has since changed, and
Netflix is transforming its service into a smart one by actively using datafication
processes.

• Across the move to streaming, a gradual change occurs in which the IT infrastructure
and artifacts completely free media content from its physical manifestation, for
example a disc and its mail delivery. With streaming, subscribers can sample videos
before committing to one and can watch several videos in one session, and viewing
statistics can be observed at a much finer level of detail and closer to real time.

• Therefore, much more data is dematerialized in the streaming model. In addition, data
sources have become more diverse, including catalog data (more than 1000
facets are now associated with each title), search terms, streaming queues and plays,
interactions, and external sources such as movie reviews and social data. Removing time
and distance from the business model has increased the potential for interaction
between the provider and the subscriber through dynamic personalization, for example
by household or genre. Personalization is supported by explanations of why content is
suggested (to promote trust), by rankings and reviews, and by social influence, such as
the fact that connected friends watched or rated a title.
• On a daily basis, Netflix's dematerialization yields about 30 million plays and about 3
million search queries, which inform the dynamics of its recommendations. The
combination of dematerialization and liquidity has enabled an interesting
manifestation of density in Netflix's recent move from streaming content to
creating it. Statistical analysis of user behaviour over the years has been used to inform
content creation, not only recommendations, showing Netflix an interesting intersection
of genre, actors and director. The result of this data crossing was its remake of
the television series House of Cards, a political thriller.

Netflix

• Netflix started as a DVD rental service in 1998. It relied mostly on third-party
postal services to deliver its DVDs to users. This resulted in heavy losses, which
it soon mitigated by introducing its online streaming service in 2007.
• To make this happen, Netflix invested in many algorithms to provide a
flawless movie experience to its users. One such algorithm is the recommendation
system that Netflix uses to provide suggestions to its users.
• A recommendation system provides users with content based on their preferences
and likings; in Netflix's case, it suggests films and shows. It takes
information about the user as input.
• This information can be the user's past usage of a product or the ratings the user
gave to it. The system processes this information to predict how highly
the user would rate or prefer a product. A recommendation system makes use of a
variety of machine learning algorithms.

• Another important role that a recommendation system plays today is to search for
similarity between different products. In the case of Netflix, the recommendation
system searches for movies that are similar to the ones you have watched or have
liked previously.
• This is an important method for scenarios that involve cold start. In cold start, the
company does not have much of the user data available to generate
recommendations.
• Therefore, based on the movies that are watched, Netflix recommends
films that share a degree of similarity. There are two main types of
recommendation systems:
1. Content-based recommendation systems
• In a content-based recommendation system, background knowledge of the
products and customer information are taken into consideration. Based on the
content that you have viewed on Netflix, it provides you with similar suggestions.
• For example, if you have watched a sci-fi film, the content-based
recommendation system will suggest other films in the same genre.
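A minimal sketch of content-based recommendation follows, assuming each film is described only by a set of genre tags and that similarity is measured with the Jaccard index (shared genres divided by all genres). The film titles and tags are made up, and this is not Netflix's actual method.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of tags."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical catalog: title -> genre tags.
CATALOG = {
    "Film A": {"sci-fi", "thriller"},
    "Film B": {"sci-fi", "adventure"},
    "Film C": {"romance", "comedy"},
}


def recommend_similar(watched_title, catalog, top_n=2):
    """Rank other titles by genre overlap with the watched title."""
    watched = catalog[watched_title]
    scores = [
        (jaccard(watched, genres), title)
        for title, genres in catalog.items()
        if title != watched_title
    ]
    # Highest similarity first; ties broken alphabetically by title.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in scores[:top_n] if score > 0]


print(recommend_similar("Film A", CATALOG))
```

Because only item descriptions are compared, this approach needs no information about other users.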

2. Collaborative filtering recommendation systems
• Unlike content-based filtering, which recommends similar products,
collaborative filtering recommends products based on the similarity between user
profiles. One key advantage of collaborative filtering is that it is independent of
product knowledge.
• Instead, it relies on the users, with the basic assumption that what users liked in the
past they will also like in the future, and that users with similar tastes will like
similar items. For example, if person A watches crime, sci-fi and thriller genres and
person B watches sci-fi, thriller and action genres, then A is likely to enjoy
action and B is likely to enjoy crime.
• There is also a third type of recommendation system that combines content-based and
collaborative techniques. This form is known as a hybrid
recommendation system. Netflix primarily uses a hybrid recommendation
system to suggest content to its users.
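The A and B example above can be sketched as a simple user-based collaborative filter. The viewing histories, the Jaccard similarity measure, and the `min_similarity` threshold are assumptions for illustration; production systems use much larger rating matrices and learned models.

```python
# Hypothetical viewing histories: user -> genres watched.
HISTORY = {
    "A": {"crime", "sci-fi", "thriller"},
    "B": {"sci-fi", "thriller", "action"},
    "C": {"romance"},
}


def similarity(u, v):
    """Jaccard similarity between the histories of two users."""
    union = HISTORY[u] | HISTORY[v]
    return len(HISTORY[u] & HISTORY[v]) / len(union) if union else 0.0


def recommend(user, min_similarity=0.3):
    """Suggest genres watched by sufficiently similar users but not yet by this user."""
    suggestions = set()
    for other in HISTORY:
        if other != user and similarity(user, other) >= min_similarity:
            suggestions |= HISTORY[other] - HISTORY[user]
    return suggestions


print(recommend("A"))
print(recommend("B"))
```

Note that no genre descriptions are compared here: the suggestion comes purely from the overlap between users.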

Uber – Using Data to Make Rides Better

• The next data science use case is Uber, a popular smartphone application that
allows you to book a cab. Uber makes extensive use of big data: it has to
maintain a large database of drivers, customers, and several other records.
• It is therefore rooted in big data and uses it to derive insights and provide
the best service to its users. Uber combines big data with the principle of crowdsourcing:
registered drivers in the area can help anyone who wants to go somewhere.
• As mentioned above, Uber maintains a database of drivers. Whenever you
hail a cab, Uber matches your profile with the most suitable driver. What
differentiates Uber from other cab companies is that its fares take into account the
time it takes to cover the distance, not just the distance itself.
• It estimates the time taken through various algorithms that also make use of data
on traffic density and weather conditions.
• Uber uses data science to calculate its surge pricing. When there are
fewer drivers available than riders requesting trips, the price of the ride goes up. This
happens only while drivers are scarce in a given area.
• However, if the demand for Uber rides is low, Uber charges a lower rate. This
dynamic pricing is rooted in big data and uses data science to
calculate fares based on these parameters.
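The idea behind surge pricing can be sketched with a toy formula in which the multiplier follows the ratio of waiting riders to available drivers. The formula, the floor of 1.0, the cap of 3.0, and the fare parameters are all assumptions for illustration; Uber's real pricing model is not described in this unit.

```python
def surge_multiplier(riders_waiting, drivers_available, cap=3.0):
    """Toy multiplier: demand/supply ratio, never below 1.0 and never above cap."""
    if drivers_available <= 0:
        return cap
    ratio = riders_waiting / drivers_available
    return min(cap, max(1.0, ratio))


def fare(base_fare, minutes, per_minute_rate, multiplier):
    """Time-based fare scaled by the surge multiplier, rounded to 2 decimals."""
    return round((base_fare + minutes * per_minute_rate) * multiplier, 2)


m = surge_multiplier(riders_waiting=30, drivers_available=20)
print(m)
print(fare(base_fare=2.0, minutes=10, per_minute_rate=0.5, multiplier=m))
```

In this sketch the price never drops below the base rate; modelling discounts during low demand would need a different floor.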

Facebook – Using Data to Revolutionize Social Networking & Advertising

• Facebook is a world leader in social media. With billions of users around the
world, Facebook carries out large-scale quantitative research through data science to
gain insights into people's social interactions.
• Facebook has become a hub of innovation, using advanced
techniques in data science to study user behavior and gain insights to improve its
product. In particular, Facebook makes use of an advanced technology called deep
learning.
• Using deep learning, Facebook performs facial recognition and text analysis. In
facial recognition, Facebook uses powerful neural networks to classify faces in
photographs. It uses its own text understanding engine, called "DeepText", to
understand user sentences.
• It also uses DeepText to understand people's interests and to align photographs with
text.
• Beyond being a social media platform, Facebook is largely an
advertising company. It uses deep learning for targeted advertising, which
decides what kind of advertisements users should see.
• It uses the insights gained from the data to cluster users based on their preferences
and shows them the advertisements that appeal to them.

Amazon – Transforming E-commerce with Data Science
• Since its inception, Amazon has been working hard to make itself a customer-centric
platform. Amazon heavily relies on predictive analytics to increase customer
satisfaction. It does so through a personalized recommendation system.
• This recommendation system is a hybrid type that includes collaborative filtering.
Amazon analyzes the user's historical purchases to recommend more products.
• Recommendations also draw on other users who buy
similar products or give similar ratings.
• Amazon has an anticipatory shipping model that uses big data to predict the
products its users are most likely to purchase. It analyzes the pattern of
your purchases and moves products you may buy in the future to the warehouse
nearest you.
• Amazon also optimizes the prices on its websites by keeping in mind various
parameters like the user activity, order history, prices offered by the competitors,
product availability, etc. Using this method, Amazon provides discounts on popular
items and earns profits on less popular items.
• Another area that every e-commerce platform must address is fraud detection.
Amazon has its own novel methods and algorithms to detect fraudulent sellers and fraudulent
purchases.
• Other than online platforms, Amazon has been optimizing the packaging of products in
warehouses and increasing the efficiency of packaging lines through the data collected
from the workers.


Airbnb – Using Data to Make Stays More Comfortable

• Airbnb is an international hospitality company that allows you to host accommodations
as well as find them through its mobile app and website. It is a data-centric company
that holds massive amounts of data on customers and hosts, homestay and lodging
records, and website traffic.
• Data science plays a pivotal role in this company. It uses data to provide better search
results to its customers. It makes use of demographic analytics to analyze bounce rates
on its website.
• In 2014, Airbnb found that users from certain countries would click the
neighborhood link, browse the page and photos, and then not make any booking.
• To mitigate this issue, Airbnb released a different version of the site for users from
those countries, replacing neighborhood links with the top travel destinations. This
produced a 10% lift in conversion (booking) rate for those users.
• Furthermore, Airbnb makes use of knowledge graphs in which the user's preferences are
matched against various parameters to suggest ideal lodgings and localities. It has also
optimized its search engine to provide better results to customers and find
compatible hosts.
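The "10% improvement" reported for the redesigned page is a relative lift between two versions of a page. A minimal sketch of that calculation follows; the visitor and booking counts are hypothetical numbers chosen to produce a 10% lift, not Airbnb's data.

```python
def conversion_rate(bookings, visitors):
    """Fraction of visitors who made a booking."""
    return bookings / visitors


def relative_lift(control_rate, variant_rate):
    """Relative improvement of the variant over the control."""
    return (variant_rate - control_rate) / control_rate


# Hypothetical counts for the original page (control) and the new page (variant).
control = conversion_rate(bookings=200, visitors=10_000)
variant = conversion_rate(bookings=220, visitors=10_000)
print(f"lift: {relative_lift(control, variant):.1%}")
```

In practice the two versions would be shown to randomly assigned users, and the difference would be checked for statistical significance before rollout.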

Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

YouTube videos

https://www.youtube.com/watch?v=KxryzSO1Fjs

https://www.youtube.com/watch?v=-ETQ97mXXF0

https://www.youtube.com/watch?v=H4YcqULY1-Q

https://www.youtube.com/watch?v=fn1rKKNLuzk&list=PL15FRvx6P0OWTlNBS_93NHG2hIn9cynVT

https://www.youtube.com/watch?v=XohgKT13FKY&list=PLqICp9VkFcbEWeZ0Q-_6gs-HCRaqe5eyf

Daily Quiz

1. Explain the differences between supervised and unsupervised learning.
2. Why is Python used for data cleaning in data science?
3. Why is R used in data visualization?
4. What do you understand by linear regression?
5. What do you understand by logistic regression?
6. How is Data Science different from traditional application programming?
7. How are Data Science and Machine Learning related to each other?
8. Explain the difference between Data Science and Data Analytics.
9. Define the term deep learning.
10. List the libraries in Python used for data analysis and scientific
computation.

Weekly/Monthly/Unit-Wise Assignment

Assignment 1
1. Explain the characteristics and applications of Big Data.
2. Explain the building blocks of Hadoop.
3. Explain why Big Data is important.
4. What is data analysis? Why is Python used for data analysis?
5. What are the applications of machine learning in data science?
6. What problems are faced when handling large data?
7. What do you understand by crowdsourcing analytics?
8. What do you mean by the 5 V's of Big Data?
9. What are the security challenges of data science?
10. How can data science be used in the medical industry? Explain briefly.

MCQ unit wise/weekly

Old Question Papers


Glossary Questions

1. Raw data is the original _______ of data.
2. Creating reproducible _______ is performed by the data scientist.
3. Hadoop is a framework that works with a variety of related tools. Common
cohorts include ____________.
4. __________characteristic of big data is relatively more concerned to data
science.
5. ____________analytical capabilities are provided by information management
company?
6. __________step is performed by data scientist after acquiring the data.
7. _______ are not sufficient to describe big data.
8. _____focuses on the discovery of (previously) unknown properties on the data.
9. As companies move past the experimental phase with Hadoop, many cite the
need for additional capabilities, including _______________.
10. Data in ____ bytes size is called big data. ...

Expected Questions for University Exam

1. What is data science, and what are its benefits?
2. Explain the roles and stages in data science.
3. What are the goals of data science?
4. What is data analysis? Why is Python used for data analysis?
5. Explain supervised and unsupervised machine learning.
6. Why do we need machine learning in data science?
7. Explain general techniques for handling large volumes of data.
8. What problems are faced when handling large data?
9. Explain the different stages of data science.
10. What is the difference between data analysis and data analytics?

Recap

⮚ This unit provides the fundamentals of the Big Data domain and its latest
trends in industry.
⮚ It also covers the different types of data.
⮚ Especially important are the 5 V's of Big Data; the unit also goes through
the concept of reporting versus analysis as it is used in industry.
⮚ The unit imparts knowledge of business analytics and the tools
used in data science.
