Module - 1 IDS

This document provides an introduction to data science. It defines data science as dealing with vast volumes of data using tools and techniques to find patterns and derive meaningful information for business decisions. The data science lifecycle involves five stages: capture, maintain, process, analyze, and communicate. Applications of data science discussed include healthcare, gaming, image recognition, logistics, and fraud detection. Statistical modeling techniques covered are regression analysis, time series analysis, cluster analysis, and survival analysis.

Uploaded by

Druthi Gs

MODULE 1: INTRODUCTION

Introduction: What is Data Science?

Data science is the domain of study that deals with vast volumes of data using modern tools and techniques
to find unseen patterns, derive meaningful information, and make business decisions.
Data science combines statistics, mathematics, programming, and problem-solving; capturing
data in ingenious (clever) ways; the ability to look at things differently; and the activity of cleansing,
preparing, and aligning data.

The Data Science Lifecycle

The data science lifecycle consists of five distinct stages, each with its own tasks:
1. Capture: This stage involves gathering raw structured and unstructured data.
2. Maintain: This stage covers taking the raw data and putting it in a form that can be used.
3. Process: This stage covers data mining, clustering/classification, data modeling, and data summarization.
4. Analyze: This stage is the real meat of the lifecycle. It involves performing the various analyses on
the data: exploratory/confirmatory analysis, predictive analysis, regression, and text mining.
5. Communicate: In this final step, analysts prepare the analyses in easily readable forms such as
charts, graphs, and reports.

Applications of Data Science

1. Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and cure
diseases.
2. Gaming
Video and computer games are now being created with the help of data science and that has taken the
gaming experience to the next level.
3. Image Recognition
Identifying patterns in images is one of the most commonly known applications of data science.
4. Logistics
Data science is used by logistics companies to optimize routes to ensure faster delivery of products and
increase operational efficiency.
5. Fraud Detection
Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.

Big Data and Data Science hype – and getting past the hype
Note: Big data refers to significant volumes of data that cannot be processed effectively with the
traditional applications that are currently used. The processing of big data begins with raw data that isn’t
aggregated and is most often impossible to store in the memory of a single computer.
Data science enables companies not only to understand data from multiple sources but also to enhance
decision making. As a result, data science is widely used in almost every industry, including health care,
finance, marketing, banking, city planning, and more. If you are skeptical of the hype, that probably means you have something
useful to contribute to making data science into a more legitimate field that has the power to have a positive
impact on society. So, what is eyebrow-raising (surprising) about Big Data and data science? Let’s
count the ways:
There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What
does “data science” mean? What is the relationship between Big Data and data science? Is data science the
science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and
tech companies?
Why do many people refer to Big Data as crossing disciplines such as finance, tech, etc. and to data science
as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so
ambiguous; they’re more or less meaningless.
There’s a distinct lack of respect for the researchers in academia and industry labs who have been
working on this kind of stuff for years, and whose work is based on decades of work by statisticians,
computer scientists, mathematicians, engineers, and scientists of all types.
The hype is crazy. The longer the hype goes on, the more many of us will get turned off by it, and the
harder it will be to see what’s good underneath it all, if anything.
Statisticians already feel that they are studying and working on the “Science of Data.” That’s their bread
and butter. Although we will make the case that data science is not just a rebranding of statistics or machine
learning but rather a field unto itself, the media often describes data science in a way that makes it sound
as if it’s simply statistics or machine learning in the context of the tech industry.
People have said to us, “Anything that has to call itself a ‘science’ probably isn’t.” Although there might
be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what
it represents may not be science but more of a craft (creating documents that will make an impact).

Why now? – Datafication, Current landscape of perspectives, Skill sets

Data Science helps businesses to comprehend vast amounts of data from different sources, extract useful
insights, and make better data-driven choices. Data Science is used extensively in several industrial fields,
such as marketing, healthcare, finance, banking, and policy work.
It’s not only the massiveness that makes all this new data interesting (or poses challenges). It’s that the data
itself, often in real time, becomes the building blocks of data products. On the Internet, this means Amazon
recommendation systems, friend recommendations on Facebook, film and music recommendations, and so
on.
Datafication can be defined as a process that “aims to transform most aspects of a business into
quantifiable data (data that can be counted or measured in numerical values) that can be tracked,
monitored, and analyzed.”
Datafication is also described as a process of “taking all aspects of life and turning them into data.”
Ex: LinkedIn datafies professional networks.
Datafication is an interesting concept and led us to consider its importance with respect to people’s
intentions about sharing their own data. We are being datafied, or rather our actions are, and when we
“like” something online, we are intending to be datafied.
When we merely browse the Web, we are unintentionally, or at least passively, being datafied through
cookies that we might or might not be aware of.
And when we walk around in a store, or even on the street, we are being datafied in a completely
unintentional way, via sensors or cameras.
The Current Landscape
Data science is the process of extracting information, understanding and learning from raw data to
inform decision making in a proactive and systematic fashion that can be generalized.
Data Science Jobs:
Data scientists need to be experts in computer science, statistics, communication, and data visualization, and to
have extensive domain expertise.
Data analysts bridge the gap between the data scientists and the business analysts, organizing and
analyzing data to answer the questions the organization poses.
Data engineers focus on developing, deploying, managing, and optimizing the organization’s data.

A data science profile needs skill levels in the following domains:

• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization

Statistical Inference: Populations and Samples
Population: A population is the entire group that you want to draw conclusions about.
Sample: A sample is the specific group that you will collect data from. The sample size is less than the size of
the population.
In everyday usage, population refers to the people who live in a particular area at a specific time. In statistics,
population refers to all the data relevant to your study of interest. It can be a group of individuals, objects, events,
organizations, etc.

If you had to collect the same data from a larger population, say the entire country of India, it would be
impossible to draw reliable conclusions because of geographical and accessibility constraints, not to
mention time and resource constraints. A lot of data would be missing or might be unreliable. Furthermore,
due to accessibility issues, marginalized tribes or villages might not provide data at all, making the data
biased towards certain regions or groups.
Samples are used when:
• The population is too large to collect data from.
• The data collected from the whole population would not be reliable.
• The population is hypothetical and is unlimited in size.
Statistical inference is the process of using a sample to infer the properties of a population.
Let N denote the sample size, that is, the number of observations in the sample, and let ALL denote the
entire population.
For statistical inference, N < ALL.
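
The idea of inferring a population property from a sample can be sketched in R. This is only an illustration: the population below is simulated, and the mean of 170, the standard deviation of 10, and the sample size of 500 are all made-up assumptions.

```r
set.seed(42)   # arbitrary seed so the run is repeatable

# Simulated population ("ALL"); in practice the full population is rarely available
population <- rnorm(100000, mean = 170, sd = 10)

# Draw a random sample of size N without replacement
N <- 500
smpl <- sample(population, size = N)

# The sample mean is used to infer the population mean
mean(smpl)
mean(population)
```

With a random sample of this size, the two means should be close, which is what makes inference from N < ALL observations workable.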

Statistical Modeling:

Note: Data modeling is a process of creating a conceptual representation of data objects and their
relationships to one another. The process of data modeling typically involves several steps, including
requirements gathering, conceptual design, logical design, physical design, and implementation.

Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think
the underlying process might be with your model. What comes first? What influences what? What causes
what? What’s a test of that?
But different people think in different ways. Some prefer to express these kinds of relationships in terms of
math.
So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship,
you’d write down y=mx + b
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how
things affect other things or what happens over time.
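
For readers who prefer code to math or pictures, the linear relationship y = mx + b mentioned above could be fitted in R roughly as follows. The data values are made up for illustration and lie exactly on the line y = 2x + 1.

```r
# Made-up data: two columns, x and y, with y = 2x + 1
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1

# Fit the linear model y = m*x + b by least squares
fit <- lm(y ~ x)

# coef() returns the intercept (b) first, then the slope (m)
coef(fit)
```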

Some techniques addressed under statistical modeling:

Regression analysis: Regression analysis is used to discover the connection between one or more
independent variables and one or more dependent variables.
Time series analysis: Time series analysis is used to evaluate data that has been gathered over time. It is
used to identify data trends, patterns, and seasonal fluctuations.
Cluster analysis: This technique is used to group comparable things.
Survival analysis: Survival analysis is used to assess time-to-event data, such as how long it takes for a
patient to recover.
Decision trees: They are used to discover the most critical factors in a decision-making process.
Neural networks: Neural networks are used to simulate complicated interactions between variables. They
are used in image recognition, natural language processing, among other things.
Probability Distribution
Note: Probability denotes the possibility of something happening. It is a mathematical concept that predicts
how likely events are to occur. The probability values are expressed between 0 and 1. The definition of
probability is the degree to which something is likely to occur. This fundamental theory of probability is
also applied to probability distributions.

A probability distribution is a statistical function that describes all the possible values and probabilities
for a random variable within a given range. This range is bounded by the minimum and maximum
possible values, but where a given value falls within it depends on the shape of the distribution.

Types of Probability Distribution

Probability distributions are divided into two types:
1. Discrete Probability Distributions
2. Continuous Probability Distributions
A discrete distribution describes the probability of occurrence of each value of a discrete random
variable. The number of spoiled apples out of 6 in your refrigerator can be an example of a discrete
probability distribution.
A continuous distribution describes the probabilities of a continuous random variable's possible values. A
continuous random variable has an infinite and uncountable set of possible values.

Note: A random variable is a variable whose value is unknown, or a function that assigns a value to each of an
experiment's outcomes.
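
The two types of distribution can be explored with R's built-in distribution functions. The sketch below assumes, purely for illustration, that each of the 6 apples spoils independently with probability 0.2 (a binomial model), and it uses the standard normal distribution as the continuous example; neither assumption comes from the text above.

```r
# Discrete: number of spoiled apples out of 6 (assumed binomial model)
n_apples  <- 6
p_spoiled <- 0.2                       # assumed probability, for illustration only
probs <- dbinom(0:n_apples, size = n_apples, prob = p_spoiled)
names(probs) <- 0:n_apples
probs                                  # one probability per possible count
total <- sum(probs)                    # probabilities of all values add up to 1

# Continuous: standard normal distribution
# Probabilities are assigned to intervals, not to single values
p_within_one_sd <- pnorm(1) - pnorm(-1)   # P(-1 < X < 1), about 0.68
p_within_one_sd
```
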
Conditional Probability
The probability of A given B is called the conditional probability and it is calculated using the formula
P(A | B) = P(A ∩ B) / P(B) , when P(B) > 0.

Example:

Suppose we roll a balanced 6 sided die once. Consider the events A={1,2,3,4,5} and
B={3,4,5,6}. What is the conditional probability of A, given B?

A ∩ B = {3,4,5}, so P(A∩B) = 3/6

P(B) = 4/6

P(A|B) = P(A∩B) / P(B) = (3/6) / (4/6) = 3/4
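
The die example can be checked in R by counting outcomes. The helper function prob() is a hypothetical name introduced here; it assumes all six outcomes are equally likely.

```r
outcomes <- 1:6                  # a balanced six-sided die
A <- c(1, 2, 3, 4, 5)
B <- c(3, 4, 5, 6)

# Hypothetical helper: probability of an event when outcomes are equally likely
prob <- function(event) length(event) / length(outcomes)

p_A_and_B   <- prob(intersect(A, B))   # P(A ∩ B) = 3/6
p_B         <- prob(B)                 # P(B) = 4/6
p_A_given_B <- p_A_and_B / p_B         # P(A | B)
p_A_given_B                            # 0.75
```
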
Joint Probability:

Joint probability is the probability of two events happening at the same time. For independent events, it is
the product of the individual probabilities.

Mathematically, for independent events A and B, P (A and B) = P(A) x P(B). The probability of A times the
probability of B equals the joint probability of A and B happening at the same time.
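
A small check of the product rule in R, using two events on a single roll of a fair die. The events chosen here (an even number, and a number no greater than 4) are hypothetical examples that happen to be independent.

```r
outcomes <- 1:6
A <- c(2, 4, 6)        # the roll is even
B <- c(1, 2, 3, 4)     # the roll is at most 4

# Hypothetical helper: probability of an event when outcomes are equally likely
prob <- function(event) length(event) / length(outcomes)

p_joint   <- prob(intersect(A, B))   # P(A and B), counted directly: {2, 4}
p_product <- prob(A) * prob(B)       # P(A) x P(B)

# For independent events the two values agree (up to rounding)
all.equal(p_joint, p_product)
```

When the two values differ, the events are not independent and the product rule does not give the joint probability.
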
Model fitting

Model fitting is a measure of how well a machine learning model generalizes to similar data to
that on which it was trained.

Fitting refers to adjusting the parameters in the model to improve accuracy.

Note: Bias is the difference between our actual and predicted values. It arises from the simplifying
assumptions that our model makes about our data to be able to predict new data.

When the bias is high, the assumptions made by our model are too basic, and the model can’t capture the
important features of our data.

Overfitting negatively impacts the performance of the model on new data. It occurs when a
model learns the details and noise in the training data too well. When random fluctuations
or the noise in the training data are picked up and learned as concepts by the model, the model
“overfits”. Overfitting has low bias.

Underfitting happens when the machine learning model can neither sufficiently model the training
data nor generalize to new data. An underfit machine learning model is not a suitable model; this
will be obvious as it will have poor performance on the training data. Underfitting has high
bias.

A model that is well-fitted produces more accurate outcomes. A well-fitted model has low bias.
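
Underfitting and overfitting can be illustrated with a rough R sketch. Everything here is an assumption made for the example: the data are simulated from a sine curve with noise, and polynomial degrees 1, 3 and 10 stand in for a too-simple, a moderate and a very flexible model.

```r
set.seed(1)   # arbitrary seed so the run is repeatable

# Made-up training data: a sine curve plus random noise
x <- seq(0, 1, length.out = 20)
train <- data.frame(x = x, y = sin(2 * pi * x) + rnorm(20, sd = 0.2))

# Made-up test data drawn from the same process
x_new <- runif(200)
test <- data.frame(x = x_new, y = sin(2 * pi * x_new) + rnorm(200, sd = 0.2))

fits <- list(
  underfit = lm(y ~ x, data = train),            # straight line: too simple
  moderate = lm(y ~ poly(x, 3), data = train),   # cubic polynomial
  overfit  = lm(y ~ poly(x, 10), data = train)   # degree 10: very flexible
)

# Mean squared error on the data used for fitting
train_mse <- sapply(fits, function(f) mean(residuals(f)^2))

# Mean squared error on data the models have not seen
test_mse <- sapply(fits, function(f) mean((test$y - predict(f, newdata = test))^2))

rbind(train_mse, test_mse)
```

The training error cannot increase as the polynomial degree grows, because each model contains the simpler ones. A gap between a very low training error and a higher test error is the typical sign of overfitting, while a high error on both is the typical sign of underfitting.
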
Introduction to R

R is a popular programming language used for statistical computing and graphical presentation.

Its most common use is to analyze and visualize data.

Why R?

• It is a great resource for data analysis, data visualization, data science and machine
learning

• It provides many statistical techniques (such as statistical tests, classification, clustering
and data reduction)

• It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.

• It works on different platforms (Windows, Mac, Linux)

• It is open-source and free

Unlike many other programming languages, you can output values in R without using a print
function: typing an expression on its own prints its value. The print() function is also available.

Ex:

"Hello World!"

print("Hello World!")

Comments: Comments can be used to explain R code, and to make it more readable. They can also
be used to prevent execution when testing alternative code.

Comments start with a #. When executing code, R will ignore everything from the # to the end of the line.
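
A short sketch of both uses of comments described above. The variable names are arbitrary.

```r
# This is a comment on its own line
x <- 5          # a comment can also follow code on the same line

# y <- x * 100  # this alternative is commented out, so it is not executed
y <- x * 2
y               # prints 10
```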

Creating Variables in R

Variables are containers for storing data values.

R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name.

Ex:

name <- "John"

age <- 40

name # output "John"

age # output 40

Note: In other programming languages, it is common to use = as an assignment operator. In R, we
can use both = and <- as assignment operators.

A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume). Rules for R variables are:

• A variable name must start with a letter or a period (.) and can be a combination of letters, digits,
periods (.) and underscores (_). If it starts with a period (.), it cannot be followed by a digit.

• A variable name cannot start with a number or underscore (_)

• Variable names are case-sensitive (age, Age and AGE are three different variables)

• Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
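
The naming rules above can be tried out directly. The variable names below are arbitrary examples; make.names() is a base R function that converts arbitrary strings into syntactically valid names, used here only to show which strings already follow the rules.

```r
# Valid names: letters, digits, periods and underscores
my_var  <- 1
my.var2 <- 2
.hidden <- 3     # may start with a period if the next character is not a digit

# Names are case-sensitive: these are two different variables
age <- 40
Age <- 41

# make.names() leaves valid names unchanged and rewrites invalid ones
candidates <- c("my_var", ".hidden", "2cool", "_tmp")
valid <- make.names(candidates) == candidates
names(valid) <- candidates
valid
```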

Data Types

• numeric - (10.5, 55, 787)

• integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)

• complex - (9 + 3i, where "i" is the imaginary part)

• character - ("k", "R is exciting", "FALSE", "11.5")

• logical - (TRUE or FALSE)

Ex:

# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)

# complex
x <- 9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

# logical/boolean
x <- TRUE
class(x)

Operators

Operators are used to perform operations on variables and values.

R divides the operators into the following groups:

• Arithmetic operators

• Assignment operators

• Comparison operators

• Logical operators

• Miscellaneous operators

Arithmetic operators are used with numeric values to perform common mathematical operations.

Assignment operators are used to assign values to variables (<-)

R Comparison Operators
R Logical Operators

R Miscellaneous Operators
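
The operator groups named above can be illustrated with a few lines of R. The values 7 and 2 are arbitrary, and the selection of operators is a sample rather than a complete list.

```r
a <- 7
b <- 2

# Arithmetic operators
a + b      # addition
a - b      # subtraction
a * b      # multiplication
a / b      # division
a ^ b      # exponent
a %% b     # modulus (remainder of the division)
a %/% b    # integer division

# Comparison operators (each returns TRUE or FALSE)
a == b
a != b
a > b
a <= b

# Logical operators
(a > 5) & (b > 5)    # AND
(a > 5) | (b > 5)    # OR
!(a > 5)             # NOT

# Miscellaneous operators
1:5                  # creates the sequence 1, 2, 3, 4, 5
3 %in% 1:5           # checks whether 3 belongs to the vector
```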

Vectors
A vector is simply a list of items that are of the same type.
To combine the list of items into a vector, use the c() function and separate the items by a comma.
In the example below, we create a vector variable called fruits, that combines strings.
Ex 1:
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits
Ex 2:
# Vector with numerical values in a sequence
numbers <- 1:10

numbers
Ex 3:
fruits <- c("banana", "apple", "orange")

length(fruits) # length function is used to find the number of items

Access Vectors

You can access the vector items by referring to its index number inside brackets []. The first item
has index 1, the second item has index 2, and so on.

Ex 1:

fruits <- c("banana", "apple", "orange")

# Access the first item (banana)

fruits[1]

Ex 2:

fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first and third item (banana and orange)

fruits[c(1, 3)]

Ex 3:

fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access all items except for the first item

fruits[c(-1)]

Ex 4:

fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Change "banana" to "pear"

fruits[1] <- "pear"

# Print fruits
fruits

Lists

A list in R can contain many different data types inside it. A list is a collection of data which is
ordered and changeable.

To create a list, use the list( ) function.

Ex 1:

# List of strings
thislist <- list("apple", "banana", "cherry")

# Print the list

thislist

Access Lists

You can access the list items by referring to its index number, inside brackets. The first item has
index 1, the second item has index 2, and so on:

Ex 1:

thislist <- list("apple", "banana", "cherry")

thislist[1]

Ex 2:

thislist <- list("apple", "banana", "cherry")

thislist[1] <- "blackcurrant"

# Print the updated list

thislist

Ex 3:

thislist <- list("apple", "banana", "cherry")

length(thislist)
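
One detail worth adding to the list access examples: single brackets return a list containing the selected item, while double brackets return the item itself. A minimal sketch, with arbitrary variable names:

```r
thislist <- list("apple", "banana", "cherry")

sub  <- thislist[1]      # single brackets: a list of length 1
item <- thislist[[1]]    # double brackets: the element itself

class(sub)     # "list"
class(item)    # "character"
```

Double brackets are usually what is wanted when the value will be passed on to a function that expects a plain string or number.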

Check if Item Exists

To find out if a specified item is present in a list, use the %in% operator.

Ex:

thislist <- list("apple", "banana", "cherry")

"apple" %in% thislist

Add List Items

To add an item to the end of the list, use the append() function.

Ex:

thislist <- list("apple", "banana", "cherry")

append(thislist, "orange")

To add an item to the right of a specified index, add "after=index number" in the append() function.

Ex:

thislist <- list("apple", "banana", "cherry")

append(thislist, "orange", after = 2)
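
Note that append() returns a new list and leaves the original unchanged. To keep the added item, the result has to be assigned back, as in this sketch:

```r
thislist <- list("apple", "banana", "cherry")

append(thislist, "orange")        # prints the longer list, but thislist is unchanged
n_before <- length(thislist)      # still 3

# Assign the result to keep the change
thislist <- append(thislist, "orange", after = 2)
n_after <- length(thislist)       # now 4
thislist[[3]]                     # "orange" sits right after index 2
```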

Join Two Lists

There are several ways to join, or concatenate, two or more lists in R.

The most common way is to use the c() function, which combines two elements together.

Ex:

list1 <- list("a", "b", "c")

list2 <- list(1,2,3)
list3 <- c(list1,list2)

list3

R Arrays

Compared to matrices, arrays can have more than two dimensions.

We can use the array( ) function to create an array, and the dim parameter to specify the
dimensions.

Ex:
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray

# An array with more than one dimension

multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
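
Items in a multi-dimensional array are accessed with one index per dimension. The sketch below rebuilds the same 4 x 3 x 2 array and reads from it; R fills arrays column by column, which determines where each value lands.

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

dim(multiarray)          # 4 3 2: rows, columns, matrices

# Element in row 2, column 3 of the second matrix
multiarray[2, 3, 2]      # 22

# Leaving an index empty selects everything along that dimension
multiarray[1, , 1]       # first row of the first matrix: 1 5 9
multiarray[, , 2]        # the whole second matrix
```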

Taking Input from User in R Programming

In R, the readline() function takes input in string format. If one inputs an integer, it is
read as a string; for example, if one wants to input 255, it will be read as “255”. So
one needs to convert the input value to the required format. In this case, the string “255” is
converted to the integer 255. To convert the input value to the desired data type, R provides
functions such as:

as.integer(n): converts to integer

as.numeric(n): converts to numeric type

Ex:

var <- readline()       # read one line of input as a string
var <- as.integer(var)  # convert the string to an integer
print(var)
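
Because readline() only waits for input in an interactive session, the conversion step is easier to try with fixed strings. The sketch below uses a hypothetical helper, to_integer(), which is not part of R; it wraps as.integer() and stops with an error when the text is not a number.

```r
# Hypothetical helper: convert text to an integer, or stop with a clear error
to_integer <- function(text) {
  value <- suppressWarnings(as.integer(text))   # as.integer() gives NA for non-numeric text
  if (is.na(value)) {
    stop("not an integer: ", text)
  }
  value
}

to_integer("255")      # 255, now an integer rather than a string
as.numeric("11.5")     # 11.5, for input with a decimal part
```

In an interactive session the same helper could be applied to the result of readline(), for example to_integer(readline()).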
