Module - 1 IDS
Big Data and Data Science hype – and getting past the hype
Note: Big data refers to significant volumes of data that cannot be processed effectively with the
traditional applications that are currently used. The processing of big data begins with raw data that isn’t
aggregated and is most often impossible to store in the memory of a single computer.
Data science enables companies not only to understand data from multiple sources but also to enhance
decision making. As a result, data science is widely used in almost every industry, including health care,
finance, marketing, banking, city planning, and more. If you are skeptical, that probably means you have something
useful to contribute to making data science into a more legitimate field that has the power to have a positive
impact on society. So, what is eyebrow-raising (surprising) about Big Data and data science? Let’s
count the ways:
There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What
does “data science” mean? What is the relationship between Big Data and data science? Is data science the
science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and
tech companies?
Why do many people refer to Big Data as crossing disciplines such as finance, tech, etc. and to data science
as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so
ambiguous; they’re more or less meaningless.
There’s a distinct lack of respect for the researchers in academia and industry labs who have been
working on this kind of stuff for years, and whose work is based on decades of work by statisticians,
computer scientists, mathematicians, engineers, and scientists of all types.
The hype is crazy. The longer the hype goes on, the more many of us will get turned off by it, and the
harder it will be to see what’s good underneath it all, if anything.
Statisticians already feel that they are studying and working on the “Science of Data.” That’s their bread
and butter. Although we will make the case that data science is not just a rebranding of statistics or machine
learning but rather a field unto itself, the media often describes data science in a way that makes it sound
as if it’s simply statistics or machine learning in the context of the tech industry.
People have said to us, “Anything that has to call itself a ‘science’ probably isn’t.” Although there might
be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what
it represents may not be science but more of a craft (Create documents, which will make an impact).
If you had to collect the same data from a larger population, say the entire country of India, it would be
impossible to draw reliable conclusions because of geographical and accessibility constraints, not to
mention time and resource constraints. A lot of data would be missing or might be unreliable. Furthermore,
due to accessibility issues, marginalized tribes or villages might not provide data at all, making the data
biased towards certain regions or groups.
Samples are used when:
The population is too large to collect data.
The data collected is not reliable.
The population is hypothetical and is unlimited in size.
Statistical inference is the process of using a sample to infer the properties of a population.
Let N denote the sample size (the number of observations in the sample), and let ALL denote the total
number of observations in the population.
For statistical inference, N < ALL.
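The idea of inferring a population property from a sample can be sketched in R as follows. This is only an illustration under stated assumptions: the population is simulated with rnorm(), and the names population, smp, ALL and N, along with the chosen sizes, are hypothetical.

```r
set.seed(123)                        # make the random draws reproducible

# Hypothetical population: 100,000 simulated measurements
population <- rnorm(100000, mean = 50, sd = 10)
ALL <- length(population)            # total number of observations

# Draw a sample of N observations without replacement (N < ALL)
N <- 500
smp <- sample(population, size = N)

# The sample mean is used as an estimate of the population mean
sample_mean <- mean(smp)
population_mean <- mean(population)
sample_mean
population_mean
```

In practice the population mean is unknown; it is computed here only so the estimate can be compared with it.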
Statistical Modeling:
Note: Data modeling is a process of creating a conceptual representation of data objects and their
relationships to one another. The process of data modeling typically involves several steps, including
requirements gathering, conceptual design, logical design, physical design, and implementation.
Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think
the underlying process might be with your model. What comes first? What influences what? What causes
what? What’s a test of that?
But different people think in different ways. Some prefer to express these kinds of relationships in terms of
math.
So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship,
you’d write down y=mx + b
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how
things affect other things or what happens over time.
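For the linear relationship y = mx + b mentioned above, a minimal R sketch might look like this. The data are simulated, and the true slope (2) and intercept (1) are assumptions chosen only for illustration; lm() is base R's linear model function.

```r
set.seed(42)                          # reproducible noise

# Simulated data: y depends linearly on x, plus some random noise
x <- 1:20
y <- 2 * x + 1 + rnorm(20, sd = 0.5)

# Fit the model y = m*x + b by least squares
fit <- lm(y ~ x)

# Estimated intercept (b) and slope (m)
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["x"]]
b
m
```

The estimates should land close to the values used to simulate the data, but not exactly on them, because of the added noise.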
A probability distribution is a statistical function that describes all the possible values a random
variable can take within a given range, together with their probabilities. The range is bounded by the
minimum and maximum possible values; where a particular value falls on the distribution depends on
properties such as the distribution's mean and standard deviation.
Note: A random variable is a variable whose value is unknown, or a function that assigns a value to each
outcome of an experiment.
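As a small illustration, the probability distribution of a fair six-sided die can be written down directly in R. The variable names are hypothetical; the sketch only shows that the probabilities cover every possible value and sum to 1.

```r
# Random variable: the face shown by a fair six-sided die
outcomes <- 1:6                 # minimum possible value 1, maximum 6
probs <- rep(1 / 6, 6)          # each face is equally likely

# A valid distribution assigns a total probability of 1
total_probability <- sum(probs)

# Expected value (mean) of the random variable
expected_value <- sum(outcomes * probs)
expected_value                  # 3.5

# Simulate ten rolls drawn from this distribution
rolls <- sample(outcomes, size = 10, replace = TRUE, prob = probs)
```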
Conditional Probability
The probability of A given B is called the conditional probability and it is calculated using the formula
P(A | B) = P(A ∩ B) / P(B) , when P(B) > 0.
Example:
Suppose we roll a balanced 6 sided die once. Consider the events A={1,2,3,4,5} and
B={3,4,5,6}. What is the conditional probability of A, given B?
P(A∩B)=3/6
P(B)=4/6
P(A|B) = 3/4
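The same calculation can be checked in R by counting outcomes. This is a sketch that assumes all six faces are equally likely, so each probability is a count divided by 6; the variable names are illustrative.

```r
omega <- 1:6                    # sample space of one die roll
A <- c(1, 2, 3, 4, 5)
B <- c(3, 4, 5, 6)

# Probabilities as (favourable outcomes) / (all outcomes)
p_B <- length(B) / length(omega)                        # 4/6
p_A_and_B <- length(intersect(A, B)) / length(omega)    # 3/6

# Conditional probability P(A | B) = P(A and B) / P(B)
p_A_given_B <- p_A_and_B / p_B
p_A_given_B                     # 0.75
```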
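Joint Probability:
The joint probability of two events A and B is the probability that both occur together, written
P(A ∩ B). In the die example above, P(A ∩ B) = 3/6.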
Model fitting is a measure of how well a machine learning model generalizes to data similar to
that on which it was trained.
Note: Bias is the difference between our actual and predicted values. It reflects the simplifying
assumptions that our model makes about our data in order to predict new data.
When the bias is high, the assumptions made by our model are too basic, and the model can’t capture the
important features of our data.
Overfitting negatively impacts the performance of the model on new data. It occurs when a
model learns the details and noise in the training data too well. When random fluctuations
or the noise in the training data are picked up and learned as concepts by the model, the model
“overfits”. Overfitting has low bias.
Underfitting happens when the machine learning model can neither model the training data
sufficiently nor generalize to new data. An underfit machine learning model is not a suitable model; this
will be obvious, as it will perform poorly even on the training data. Underfitting has high
bias.
A well-fitted model produces more accurate outcomes and has low bias.
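One way to see underfitting and overfitting is to fit polynomials of different degree to the same noisy data and compare the error on the training data with the error on held-out data. The sketch below is only an illustration: the sine-shaped data, the degrees 1, 3 and 12, and the helper rmse() are all assumptions made for this example.

```r
set.seed(1)

# Simulated data: a sine curve plus noise
d <- data.frame(x = seq(0, 1, length.out = 30))
d$y <- sin(2 * pi * d$x) + rnorm(30, sd = 0.2)

# Alternate rows go to the training and test sets
train <- d[seq(1, 30, by = 2), ]
test  <- d[seq(2, 30, by = 2), ]

# Root mean squared error of a fitted model on a data set
rmse <- function(fit, data) {
  sqrt(mean((data$y - predict(fit, newdata = data))^2))
}

fit_under <- lm(y ~ x, data = train)            # straight line: too simple
fit_mid   <- lm(y ~ poly(x, 3), data = train)   # moderate flexibility
fit_over  <- lm(y ~ poly(x, 12), data = train)  # very flexible: can chase noise

errors <- data.frame(
  model      = c("degree 1", "degree 3", "degree 12"),
  train_rmse = c(rmse(fit_under, train), rmse(fit_mid, train), rmse(fit_over, train)),
  test_rmse  = c(rmse(fit_under, test), rmse(fit_mid, test), rmse(fit_over, test))
)
errors
```

The training error can only go down as the degree increases. An underfit model typically shows a high error on both sets, while an overfit model typically shows a low training error but a noticeably higher test error.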
Introduction to R
R is a popular programming language used for statistical computing and graphical presentation.
Why R?
It is a great resource for data analysis, data visualization, data science and machine
learning
It is easy to draw graphs in R, like pie charts, histograms, box plot, scatter plot, etc.
Unlike many other programming languages, you can output values in R without using a print
function.
Comments: Comments can be used to explain R code and to make it more readable. They can also
be used to prevent execution when testing alternative code.
Comments start with a #. When executing code, R ignores everything from the # to the end of the line.
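A minimal sketch of comments, and of output without a print function; the variable name total is only illustrative:

```r
# This whole line is a comment and is ignored by R
total <- 5 + 5     # a comment can also follow code on the same line

total              # at the top level, a bare name prints its value

# print("not run") # commenting out a line prevents it from executing

print(total)       # print() produces the same output explicitly
```

Automatic printing applies only at the top level, for example at the console. Inside loops and functions, print() is still needed to display a value.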
Creating Variables in R
Ex:
A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume). Rules for R variables are:
A variable name must start with a letter and can be a combination of letters, digits, period (.)
and underscore (_). If it starts with a period (.), it cannot be followed by a digit.
Variable names are case-sensitive (age, Age and AGE are three different variables)
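The sketch below creates a few variables with the assignment operator <- and checks some names against the rules above. The names and values are hypothetical. The check relies on base R's make.names(), which returns a name unchanged only when it is already syntactically valid.

```r
# Creating variables with the assignment operator <-
name <- "John"
age <- 40
Age <- 41              # a different variable: names are case-sensitive
total_volume <- 12.5
.rate <- 0.07          # valid: the period is not followed by a digit

# Check which candidate names are syntactically valid
candidates <- c("age", "total_volume", ".rate", "2nd_value", ".2way", "_id")
is_valid <- make.names(candidates) == candidates
data.frame(candidates, is_valid)
```

The last three candidates are rejected because they start with a digit, a period followed by a digit, and an underscore, respectively.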
Data Types
Ex:
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
Operators
Arithmetic operators
Assignment operators
Comparison operators
Logical operators
Miscellaneous operators
Arithmetic operators are used with numeric values to perform common mathematical operations:
R Comparison Operators
R Logical Operators
R Miscellaneous Operators
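The operator groups listed above can be illustrated with a short sketch. The values of a and b are arbitrary, and only a few common operators from each group are shown.

```r
a <- 10                 # assignment operator
b <- 3

# Arithmetic operators
total      <- a + b     # addition: 13
difference <- a - b     # subtraction: 7
product    <- a * b     # multiplication: 30
quotient   <- a / b     # division
power      <- a ^ b     # exponent: 1000
remainder  <- a %% b    # modulus: 1
int_div    <- a %/% b   # integer division: 3

# Comparison operators
is_equal   <- a == b    # FALSE
not_equal  <- a != b    # TRUE
is_greater <- a > b     # TRUE

# Logical operators
both    <- (a > 5) & (b > 5)   # AND: FALSE
either  <- (a > 5) | (b > 5)   # OR: TRUE
negated <- !is_greater         # NOT: FALSE

# Miscellaneous operators
one_to_five <- 1:5                 # : creates a sequence of numbers
has_three   <- b %in% one_to_five  # %in% tests membership: TRUE
```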
Vectors
A vector is simply a sequence of items that are all of the same type.
To combine items into a vector, use the c() function and separate the items with commas.
In the example below, we create a vector variable called fruits that combines strings.
Ex 1:
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
Ex 2:
# Vector with numerical values in a sequence
numbers <- 1:10
numbers
Ex 3:
fruits <- c("banana", "apple", "orange")
You can access the vector items by referring to their index numbers inside brackets []. The first item
has index 1, the second item has index 2, and so on.
Ex 1:
Ex 2:
Ex 3:
Ex 4:
# Print fruits
fruits
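A short sketch of index-based access on the fruits vector. These particular operations are illustrative and are not necessarily the ones from the original examples.

```r
fruits <- c("banana", "apple", "orange")

first_item <- fruits[1]          # "banana": indexing starts at 1
two_items  <- fruits[c(1, 3)]    # "banana" "orange": several positions at once
all_but_1  <- fruits[-1]         # "apple" "orange": a negative index excludes an item

# Change an item by assigning to its position
fruits[1] <- "pear"

# Print fruits
fruits
```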
Lists
A list in R can contain many different data types inside it. A list is a collection of data which is
ordered and changeable.
# List of strings
thislist <- list("apple", "banana", "cherry")
Access Lists
You can access the list items by referring to their index numbers inside brackets. The first item has
index 1, the second item has index 2, and so on:
Ex 1:
thislist[1]
Ex 2:
Ex 3:
length(thislist)
To find out if a specified item is present in a list, use the %in% operator.
Ex:
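A minimal sketch of such a membership check, assuming the same thislist as above; the item "mango" is just an example of a missing value:

```r
thislist <- list("apple", "banana", "cherry")

has_apple <- "apple" %in% thislist   # TRUE: "apple" is in the list
has_mango <- "mango" %in% thislist   # FALSE: "mango" is not
has_apple
has_mango
```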
To add an item to the end of the list, use the append() function.
Ex:
append(thislist, "orange")
To add an item to the right of a specified index, pass the after argument (after = index number) to the append() function.
Ex:
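For instance, a sketch that inserts "orange" after the second item; the position and the name longer are chosen only for illustration:

```r
thislist <- list("apple", "banana", "cherry")

# Insert "orange" to the right of index 2 ("banana")
longer <- append(thislist, "orange", after = 2)
longer
```

Note that append() returns a new list and leaves thislist unchanged, so the result has to be assigned to keep it.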
To join two lists, the most common way is to use the c() function, which combines them into a single list.
Ex:
list3
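A self-contained sketch of joining two lists with c(). The contents of list1 and list2 are assumptions, since only the name list3 survives in the example above.

```r
list1 <- list("a", "b", "c")
list2 <- list(1, 2, 3)

# c() concatenates the two lists into one
list3 <- c(list1, list2)

length(list3)   # 6
list3
```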
R Arrays
We can use the array( ) function to create an array, and the dim parameter to specify the
dimensions.
Ex:
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
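Building on the vector above, a sketch of the array() function with the dim parameter. The dimensions 4 x 3 x 2 and the name multiarray are an arbitrary choice for illustration.

```r
thisarray <- c(1:24)

# Reshape the 24 values into 4 rows, 3 columns and 2 layers
multiarray <- array(thisarray, dim = c(4, 3, 2))

dim(multiarray)                   # 4 3 2

# Access one element by [row, column, layer]
element <- multiarray[2, 3, 2]
element                           # 22
```

R fills arrays column by column, which is why row 2, column 3 of the second layer holds the value 22.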
In R, the readline() function reads input as a string. If the user enters an integer, say 255, it is
read as the string “255”, so the value must be converted to the type that is needed. In this case, the
string “255” is converted to the integer 255. R provides functions for converting an input value to the
desired data type, such as as.integer(), as.numeric() and as.character().
Ex:
# Read a line of input as a string
var <- readline()
# Convert the string to an integer
var <- as.integer(var)
print(var)
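Because readline() waits for input only in an interactive session, the conversion step can be tried on its own with a fixed string standing in for the user's input. The names raw_input, value and price are hypothetical.

```r
raw_input <- "255"               # stands in for what readline() would return
input_class <- class(raw_input)  # "character"

# Convert the string to an integer before doing arithmetic
value <- as.integer(raw_input)
value_class <- class(value)      # "integer"
value + 1L                       # 256

# as.numeric() handles decimal input in the same way
price <- as.numeric("10.5")
price
```

If the string cannot be interpreted as a number, as.integer() returns NA with a warning, so it is worth checking the result with is.na() before using it.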