Unit 2
PROBABILITY AND STATISTICS
Introduction to probability and statistics, Population and sample, Normal and Gaussian
distributions, Probability Density Function, Descriptive statistics, notion of probability,
distributions, mean, variance, covariance, covariance matrix, understanding univariate and
multivariate normal distributions, introduction to hypothesis testing, confidence interval for
estimates.
Typical examples of categorical data include:
1. Gender
2. Binary data (Yes/No)
3. Attributes of a vehicle such as color, mileage, number of doors, etc.
1.2. What are Statistics?
The field of Statistics deals with the collection, presentation, analysis, and use of data to make
decisions, solve problems, and design products and processes. Statistics is the science of
learning from data, and of measuring, controlling, and communicating uncertainty; and it thereby
provides the navigation essential for controlling the course of scientific and societal advances. In
simple terms, statistics is the science of data. Statistics is defined as collection, compilation,
analysis and interpretation of numerical data.
1. Variable
A variable is a characteristic or condition that can change or take on different values. Most
research begins with a general question about the relationship between two variables for a
specific group of individuals.
Types of Variables
Variables can be classified as discrete or continuous. Discrete variables (such as class size)
consist of indivisible categories, and continuous variables (such as time or weight) are
infinitely divisible into whatever units a researcher may choose. For example, time can be
measured to the nearest minute, second, half-second, etc.
2. Measuring Variables
To establish relationships between variables, researchers must observe the variables and record
their observations. This requires that the variables be measured. The process of measuring a
variable requires a set of categories called a scale of measurement and a process that classifies
each individual into one category.
3. Data
The measurements obtained in a research study are called the data. The goal of statistics is to
help researchers organize and interpret the data.
Quantitative
The data which are statistical or numerical are known as Quantitative data. Quantitative data is
generated through experiments, tests, surveys, and market reports. Quantitative data is also
known as Structured data. Quantitative data is further divided into Continuous data and
Discrete data.
Continuous Data
Continuous data is the data which can have any value. That means Continuous data can give
infinite outcomes so it should be grouped before representing on a graph.
Examples
• The speed of a vehicle as it passes a checkpoint
• The mass of a cooking apple
• The time taken by a volunteer to perform a task
Discrete Data
Discrete data can take only certain values; that is, only a countable set of distinct values is
possible.
Examples
• Number of cars sold at a dealership during a given month
• Number of houses in certain block
• Number of fish caught on a fishing trip
• Number of complaints received at the office of airline on a given day
• Number of customers who visit at bank during any given hour
• Number of heads obtained in three tosses of a coin
Differences between Discrete and Continuous data
• Numerical data could be either discrete or continuous
• Continuous data can take any numerical value (within a range); For example, weight,
height, etc.
• There can be an infinite number of possible values in continuous data
• Discrete data can take only certain values by finite ‘jumps’, i.e., it ‘jumps’ from one
value to another but does not take any intermediate value between them (for example,
the number of students in a class).
Qualitative
Data that deal with descriptions or qualities instead of numbers are known as Qualitative
data. Qualitative data is also known as unstructured data, because this type of data is
loosely structured and cannot be analyzed by conventional methods.
4. Population
The entire group of individuals is called the population. For example, a researcher may
be interested in the relation between class size (variable 1) and academic performance
(variable 2) for the population of third-grade children.
5. Sample
Usually, populations are so large that a researcher cannot examine the entire group.
Therefore, a sample is selected to represent the population in a research study. The goal
is to use the results obtained from the sample to help answer questions about the
population.
Sampling Error
• The discrepancy between a sample statistic and its population parameter is called
sampling error.
• Defining and measuring sampling error is a large part of inferential statistics.
1.3. Frequency Distribution
A frequency distribution (or frequency table) shows how a data set is partitioned among
several categories (or classes) by listing the categories along with the number (frequency)
of data values that fall in each of them.
When data are in their original form, they are called raw data.
Organizing Data:
Categorical distribution
Grouped distribution
Ungrouped distribution
A frequency distribution refers to data classified on the basis of some variable that can be
measured, such as price, weight, height, wages, etc.
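As a minimal sketch, a frequency table can be built with Python's standard library; the colour data below is hypothetical:

```python
from collections import Counter

raw_data = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical raw data

frequency_table = Counter(raw_data)  # maps each category to its frequency
for category, frequency in frequency_table.most_common():
    print(category, frequency)
# red 3, blue 2, green 1
```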
Mean
• The mean represents the average value of the dataset.
• It can be calculated as the sum of all the values in the dataset divided by the number of
values. In general, it is considered as the arithmetic mean.
• Some other measures of mean used to find the central tendency are as follows:
• Geometric Mean (nth root of the product of n numbers)
• Harmonic Mean (the reciprocal of the average of the reciprocals)
• Weighted Mean (where some values contribute more than others)
• It is observed that if all the values in the dataset are the same, then all geometric,
arithmetic and harmonic mean values are the same. If there is variability in the data, then
the mean value differs.
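A short sketch of these different means, assuming Python 3.8+ for statistics.fmean and statistics.geometric_mean; the values and weights are hypothetical:

```python
import statistics

values = [2, 4, 8]  # hypothetical data

arithmetic = statistics.fmean(values)          # (2 + 4 + 8) / 3 ≈ 4.667
geometric = statistics.geometric_mean(values)  # (2 * 4 * 8) ** (1/3) = 4.0
harmonic = statistics.harmonic_mean(values)    # 3 / (1/2 + 1/4 + 1/8) ≈ 3.429

# Weighted mean: some values contribute more than others.
weights = [1, 1, 2]
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)  # 5.5

print(arithmetic, geometric, harmonic, weighted)
```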
Calculating the Mean
Calculate the mean of the following data:
1 5 4 3 2
Sum the scores (ΣX):
1 + 5 + 4 + 3 + 2 = 15
Divide the sum (ΣX = 15) by the number of scores (N = 5):
15 / 5 = 3
Mean = X̄ = 3
The Median
• The median is simply another name for the 50th percentile
• Sort the data from highest to lowest
• Find the score in the middle
• If N, the number of scores, is even the median is the average of the middle two scores
Median Example
What is the median of the following scores:
10 8 14 15 7 3 3 8 12 10 9
Sort the scores:
15 14 12 10 10 9 8 8 7 3 3
Determine the middle score:
middle = (N + 1) / 2 = (11 + 1) / 2 = 6
Middle score = median = 9
Median Example
What is the median of the following scores:
24 18 19 42 16 12
• Sort the scores:
42 24 19 18 16 12
• Determine the middle score:
middle = (N + 1) / 2 = (6 + 1) / 2 = 3.5
• Median = average of 3rd and 4th scores:
(19 + 18) / 2 = 18.5
Mode
The mode is the score that occurs most frequently in a set of data.
Variance
• Variance is the average squared deviation from the mean of a set of data.
• It is used to find the Standard deviation.
• σ² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²
• This is a good measure of how much variation exists in the sample, normalized by sample
size.
• It has the nice property of being additive.
• The only problem is that the variance is measured in squared units.
How to find Variance
• Find the Mean of the data.
• Subtract the mean from each value – the result is called the deviation from the mean.
• Square each deviation from the mean.
• Find the sum of the squares.
• Divide the total by the number of items (n) for a population, or by (n − 1) for a sample.
How to find Variance? - Example
• Suppose you're given the data set 1, 2, 2, 4, 6 (a single variable X).
• Calculate the mean of your data set. The mean of the data is (1+2+2+4+6)/5
• Mean= 15/5 = 3.
• Subtract the mean from each of the data values and list the differences. Subtract 3 from
each of the values 1, 2, 2, 4, 6
• 1-3 = -2 2-3 = -1 2-3 = -1 4-3 = 1 6-3 = 3
• Your list of differences is -2, -1, -1, 1, 3 (deviation)
• You need to square each of the numbers -2, -1, -1, 1, 3
(-2)2 = 4, (-1)2 = 1, (-1)2 = 1, (1)2 = 1, (3)2 = 9
• Your list of squares is 4, 1, 1, 1, 9, Add the squares 4+1+1+1+9 = 16
• Since we treat these values as a sample, subtract one from the number of data values you
started with. You began this process with five data values; one less than this is 5 - 1 = 4.
• Divide the sum from step four by the number from step five. The sum was 16, and the
number from the previous step was 4. You divide these two numbers: 16 / 4 = 4, so the sample variance is 4.
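The steps above can be checked directly against this worked example with the standard statistics module:

```python
import statistics

data = [1, 2, 2, 4, 6]

mean = sum(data) / len(data)            # 15 / 5 = 3
deviations = [x - mean for x in data]   # [-2, -1, -1, 1, 3]
squares = [d ** 2 for d in deviations]  # [4, 1, 1, 1, 9]
ssd = sum(squares)                      # 16

population_variance = ssd / len(data)       # 16 / 5 = 3.2
sample_variance = ssd / (len(data) - 1)     # 16 / 4 = 4.0, as in the example

print(population_variance, statistics.pvariance(data))  # both 3.2
print(sample_variance, statistics.variance(data))       # both 4.0
```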
Variation in one variable
• So, these four measures all describe aspects of the variation in a single variable:
• a. Sum of the squared deviations
• b. Variance
• c. Standard deviation
• d. Standard error
• Can we adapt them for thinking about the way in which two variables might vary
together?
Covariance
• In mathematics and statistics, covariance is a measure of the relationship between two
random variables. (X, Y)
• More precisely, covariance refers to the measure of how two random variables in a
data set will change together.
• Positive covariance: Indicates that two variables tend to move in the same direction.
• Negative covariance: Reveals that two variables tend to move in inverse directions.
• The covariance between two random variables X and Y can be calculated using the
following formula (for a population; for a sample, divide by n − 1 instead, as in the example below):
Cov(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / n
The covariance matrix is a math concept that occurs in several areas of machine
learning. If you have a set of n numeric data items, where each data item has d
dimensions, then the covariance matrix is a d-by-d symmetric square matrix with
variance values on the diagonal and covariance values off the diagonal.
• Suppose you have a set of n=5 data items, representing 5 people, where each data item
has a Height (X), test Score (Y), and Age (Z) (therefore d = 3):
Covariance Matrix
• The covariance matrix for this data set is:
• The 11.50 is the variance of X, 1250.0 is the variance of Y, and 110.0 is the variance of
Z. For variance, in words, subtract each value from the dimension mean. Square, add
them up, and divide by n-1. For example, for X:
• Var(X) = [ (64−68.0)² + (66−68.0)² + (68−68.0)² + (69−68.0)² + (73−68.0)² ] / (5−1)
= (16.0 + 4.0 + 0.0 + 1.0 + 25.0) / 4 = 46.0 / 4 = 11.50.
Covar(XY) = [ (64-68.0)*(580-600.0) + (66-68.0)*(570-600.0) + (68-68.0)*(590-600.0)
+ (69-68.0)*(660-600.0) + (73-68.0)*(600-600.0) ] / (5-1) = [80.0 + 60.0 + 0 + 60.0 +
0] / 4 = 200 / 4 = 50.0
If you examine the calculations carefully, you’ll see the pattern to compute the
covariance of the XZ and YZ columns. And you’ll see that Covar(XY) = Covar(YX).
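A minimal sketch reproducing the Height/Score part of this example with NumPy (the Age column is omitted because its raw values are not listed above):

```python
import numpy as np

X = [64, 66, 68, 69, 73]       # heights, mean 68.0
Y = [580, 570, 590, 660, 600]  # test scores, mean 600.0

C = np.cov(np.vstack([X, Y]))  # rows are variables; divisor n-1 by default
print(C)
# [[  11.5   50. ]
#  [  50.  1250. ]]  -> Var(X) and Var(Y) on the diagonal, Covar(XY) off it
```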
Standard Deviation
• Variability is a term that describes how spread out a distribution of scores (or darts) is.
• Variance and standard deviation are closely related ways of measuring, or quantifying,
variability.
• Standard deviation is simply the square root of variance
• Find the mean (or arithmetic average) of the scores. To find the mean, add up the scores
and divide by n where n is the number of scores.
• Find the sum of squared deviations (abbreviated SSD). To get the SSD, find the sum of
the squares of the differences (or deviations) between the mean and all the individual
scores.
• Find the variance. If you are told that the set of scores constitute a population, divide the
SSD by n to find the variance. If instead you are told, or can infer, that the set of scores
constitute a sample, divide the SSD by (n – 1) to get the variance.
• Find the standard deviation. To get the standard deviation, take the square root of the
variance.
How to find Standard Deviation – Example (in Population score)
Example 1: Find the SSD, variance, and standard deviation for the following population of
scores: 1, 2, 3, 4, 5 using the list of steps given above.
• Find the mean. The mean of these five numbers (the population mean) is (1+2+3+4+5)/5
= 15/5 = 3.
• Let’s use the definitional formula for SSD for its calculation: SSD is the sum of the
squares of the differences (squared deviations) between the mean and the individual
scores. The squared deviations are (3−1)², (3−2)², (3−3)², (3−4)², and (3−5)²; that is, 4, 1,
0, 1, and 4. The SSD is then 4 + 1 + 0 + 1 + 4 = 10.
• Divide SSD by n, since this is a population of scores, to get the variance. So the variance
is 10/5 = 2.
• The standard deviation is the square root of the variance. So the standard deviation is the
square root of 2. = √ 2 =1.4142
• For practice, let's also compute the SSD using the computational formula, Σᵢxᵢ² − (1/N)(Σᵢxᵢ)².
Here Σᵢxᵢ² = 1² + 2² + 3² + 4² + 5² = 1 + 4 + 9 + 16 + 25 = 55, and
(1/N)(Σᵢxᵢ)² = (1/5)(1 + 2 + 3 + 4 + 5)² = (1/5)(15²) = 45. So SSD = 55 − 45 = 10, just like
before.
• If the same scores were instead treated as a sample, the variance would be SSD/(n − 1) =
10/4 = 2.5, and the sample standard deviation would be √2.5 ≈ 1.5811.
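These population and sample calculations are also available directly in the statistics module:

```python
import statistics

scores = [1, 2, 3, 4, 5]

print(statistics.pstdev(scores))  # population SD: sqrt(10 / 5) = sqrt(2) ≈ 1.4142
print(statistics.stdev(scores))   # sample SD: sqrt(10 / 4) = sqrt(2.5) ≈ 1.5811
```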
• But there are many cases where the data tend to cluster around a central value with no bias
to the left or right, getting close to a "Normal Distribution".
Normal Distribution is symmetric, which means its tails on one side are the mirror
image of the other side. But this is not the case with most datasets. Generally, data
points cluster on one side more than the other. We call these types of distributions
Skewed Distributions.
In the Normal Distribution, Mean, Median and Mode are equal, but in a negatively skewed
distribution the general relationship between the measures of central tendency is:
Mean < Median < Mode.
When data points cluster on the left side of the distribution, then the tail would be longer on
the right side. This is the property of Right Skewed Distribution. Here, the tail is longer in the
positive direction so we also call it Positively Skewed Distribution.
A positive z-score means that your x-value is greater than the mean.
A negative z-score means that your x-value is less than the mean.
A z-score of zero means that your x-value is equal to the mean.
Converting a normal distribution into the standard normal distribution allows you to compare
scores from different normal distributions and to read probabilities from a single standard
table. The z-score is calculated as:
z = (x − μ) / σ
where:
x = individual value
μ = mean
σ = standard deviation
For example, Suppose there are two students: Ross and Rachel. Ross scored 65 in the exam of
paleontology and Rachel scored 80 in the fashion designing exam.
Can we conclude that Rachel performed better simply because 80 > 65? No, because the way
people performed in paleontology may be different from the way people performed in fashion
designing; the variability may not be the same here.
So, a direct comparison by just looking at the scores will not work.
Consider paleontology marks follow a normal distribution with mean 60 and a standard deviation
of 4. On the other hand, the fashion designing marks follow a normal distribution with mean 79
and standard deviation of 2.
Ross's z-score is (65 − 60) / 4 = 1.25, while Rachel's is (80 − 79) / 2 = 0.5. Ross scored 1.25
standard deviations above the mean, compared to Rachel's 0.5, hence we can say that Ross
performed better than Rachel.
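A small sketch of this comparison as code:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x differs from the mean."""
    return (x - mu) / sigma

ross = z_score(65, mu=60, sigma=4)    # 1.25
rachel = z_score(80, mu=79, sigma=2)  # 0.5
print(ross, rachel)  # Ross is further above his class mean, so he did better
```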
A density plot is a smoothed, continuous version of a histogram estimated from the data. The
most common form of estimation is known as kernel density estimation (KDE). In this method,
a continuous curve (the kernel) is drawn at every individual data point and all of these curves
are then added together to make a single smooth density estimation.
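A minimal KDE sketch, assuming SciPy is available; the sample is synthetic:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)  # synthetic sample

kde = gaussian_kde(sample)   # places a kernel at each point and sums them
xs = np.linspace(30, 70, 5)
print(kde(xs))               # estimated density at each point in xs
```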
Q-Q Plot
Quantiles are cut points dividing the range of a probability distribution into continuous intervals
with equal probabilities or dividing the observations in a sample in the same way.
The 2-quantile is known as the Median.
The 4-quantiles are known as Quartiles.
The 10-quantiles are known as Deciles.
The 100-quantiles are known as Percentiles.
The 10-quantiles divide the Normal Distribution into 10 parts, each holding 10% of the data points.
The Q-Q plot or quantile-quantile plot is a scatter plot created by plotting two sets of quantiles
against one another.
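A minimal Q-Q plot sketch against the normal distribution, assuming SciPy and Matplotlib are available:

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.normal(size=100)  # synthetic sample

# Plots sample quantiles against theoretical normal quantiles.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()  # points close to the line suggest the sample is roughly normal
```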
Histogram
Density With a Histogram
• The first step in density estimation is to create a histogram of the observations in the
random sample.
• A histogram is a plot that involves first grouping the observations into bins and counting
the number of events that fall into each bin.
• The counts, or frequencies of observations, in each bin are then plotted as a bar graph
with the bins on the x-axis and the frequency on the y-axis.
• The choice of the number of bins is important as it controls the coarseness of the
distribution (number of bars) and, in turn, how well the density of the observations is
plotted.
• It is a good idea to experiment with different bin sizes for a given data sample to get
multiple perspectives or views on the same data.
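A short sketch of this binning step with NumPy, comparing a few bin counts on a synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(loc=0, scale=1, size=1000)  # synthetic sample

for bins in (5, 20, 50):  # experiment with different bin counts
    counts, edges = np.histogram(sample, bins=bins)
    print(bins, counts.max(), len(edges) - 1)  # coarser vs. finer views of the density
```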
Correlational Studies
• The goal of a correlational study is to determine whether there is a relationship between
two variables and to describe the relationship.
• A correlational study simply observes the two variables as they exist naturally.
Experiment
• The goal of an experiment is to demonstrate a cause-and-effect relationship between two
variables; that is, to show that changing the value of one variable causes change to occur
in a second variable.
• In an experiment, one variable is manipulated to create treatment conditions.
• A second variable is observed and measured to obtain scores for a group of individuals in
each of the treatment conditions.
• The measurements are then compared to see if there are differences between treatment
conditions.
• All other variables are controlled to prevent them from influencing the results.
• In an experiment, the manipulated variable is called the independent variable and the
observed variable is the dependent variable.
1.6. What Is a Probability Density Function (PDF)?
Now consider a continuous random variable x with a probability density function f(x), which
defines how probability is distributed over the values the variable can take. After plotting the
pdf, you get a graph as shown below:
In the above graph, you get a bell-shaped curve after plotting the function against the variable.
The blue curve shows this. Now consider the probability of a point b. To find it, you need to find
the area under the curve to the left of b. This is represented by P(b). To find the probability of a
variable falling between points a and b, you need to find the area of the curve between a and b.
Since this is the area to the left of b minus the area to the left of a, you can represent it as:
P(a < x ≤ b) = P(b) − P(a).
For the probability of 3 inches of rainfall, you plot a line that intersects the y-axis at the same
point on the graph as a line extending from 3 on the x-axis does. This tells you that the
probability of 3 inches of rainfall is less than or equal to 0.5.
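A minimal sketch of this area calculation, assuming SciPy; the distribution parameters for the rainfall example are hypothetical:

```python
from scipy.stats import norm

mu, sigma = 2.5, 1.0  # hypothetical rainfall distribution (inches)
a, b = 2.0, 3.0

p_b = norm.cdf(b, loc=mu, scale=sigma)  # area to the left of b, P(b)
p_a = norm.cdf(a, loc=mu, scale=sigma)  # area to the left of a, P(a)
print(p_b - p_a)                        # P(a < x <= b) = P(b) - P(a)
```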
1.7. Descriptive Statistics
What is Statistics?
Statistics is the science of collecting data and analyzing them to infer proportions (sample) that
are representative of the population. In other words, statistics is interpreting data in order to make
predictions for the population.
Descriptive Statistics
Descriptive Statistics is summarizing the data at hand through certain numbers like mean, median
etc. so as to make the understanding of the data easier. It does not involve any generalization or
inference beyond what is available. This means that the descriptive statistics are just the
representation of the data (sample) available and not based on any theory of probability.
Commonly Used Measures
1. Measures of Central Tendency
2. Measures of Dispersion (or Variability)
Measures of Central Tendency
A Measure of Central Tendency is a one-number summary of the data that typically describes the
center of the data. This one-number summary is of three types.
1. Mean : Mean is defined as the ratio of the sum of all the observations in the data to
the total number of observations. This is also known as Average. Thus mean is a
number around which the entire data set is spread.
2. Median : Median is the point which divides the entire data into two equal halves.
One-half of the data is less than the median, and the other half is greater than the
same. Median is calculated by first arranging the data in either ascending or
descending order.
If the number of observations is odd, the median is given by the middle observation in
the sorted form.
If the number of observations is even, the median is given by the mean of the two
middle observations in the sorted form.
An important point to note is that the order of the data (ascending or descending) does not affect the
median.
3. Mode : Mode is the number which has the maximum frequency in the entire data set, or in
other words, mode is the number that appears the maximum number of times. A data set can have one
or more than one mode.
If there is only one number that appears the maximum number of times, the data has one
mode, and is called Uni-modal.
If there are two numbers that appear the maximum number of times, the data has two
modes, and is called Bi-modal.
If there are more than two numbers that appear the maximum number of times, the data
has more than two modes, and is called Multi-modal.
Example to compute the Measures of Central Tendency
Consider the following data points.
17, 16, 21, 18, 15, 17, 21, 19, 11, 23
Mean — Mean is calculated as (17 + 16 + 21 + 18 + 15 + 17 + 21 + 19 + 11 + 23) / 10 = 178 / 10 = 17.8.
Median — Sorting the data gives 11, 15, 16, 17, 17, 18, 19, 21, 21, 23; with 10 observations (an
even number), the median is the mean of the 5th and 6th values: (17 + 18) / 2 = 17.5.
Mode — Mode is given by the number that occurs maximum number of times. Here, 17
and 21 both occur twice. Hence, this is a Bimodal data and the modes are 17 and 21.
Since Median and Mode do not take all the data points into account for their calculations, they
are robust to outliers, i.e. they are not affected by outliers.
At the same time, Mean shifts towards the outlier as it considers all the data points. This
means if the outlier is big, mean overestimates the data and if it is small, the data is
underestimated.
If the distribution is symmetrical, Mean = Median = Mode. Normal distribution is an
example.
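As a quick check of this example, here is a minimal sketch using Python's standard statistics module (3.8+ for multimode):

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

print(statistics.mean(data))       # 17.8
print(statistics.median(data))     # 17.5 (mean of the two middle values)
print(statistics.multimode(data))  # [17, 21], i.e. the data is bimodal
```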
Probability and Statistics form the basis of Data Science. The probability theory is very much
helpful for making the prediction. Estimates and predictions form an important part of Data
science. With the help of statistical methods, we make estimates for the further analysis. Thus,
statistical methods are largely dependent on the theory of probability. And all of probability and
statistics is dependent on Data.
1.8.1. Data
Data is the collected information (observations) we have about something, or facts and statistics
collected together for reference or analysis.
Data — a collection of facts (numbers, words, measurements, observations, etc) that has been
translated into a form that computers can process
Helps in understanding more about the data by identifying relationships that may exist
between 2 variables.
Helps in predicting the future or forecast based on the previous trend of data.
Data matters a lot nowadays as we can infer important information from it. Now let's delve into
how data is categorized. Data can be of 2 types: categorical and numerical data. For example, in a
bank, we have region, occupation class, and gender, which follow categorical data as the data lies
within a fixed set of values, while balance, credit score, age, and tenure months follow a numerical
continuous distribution as the data can take an unlimited range of values.
Note: Categorical Data can be visualized by Bar Plot, Pie Chart, Pareto Chart. Numerical Data
can be visualized by Histogram, Line Plot, Scatter Plot
The qualitative and quantitative data is very much similar to the above categorical and numerical
data.
Nominal: Data at this level is categorized using names, labels or qualities. eg: Brand Name,
ZipCode, Gender.
Ordinal: Data at this level can be arranged in order or ranked and can be compared. eg: Grades,
Star Reviews, Position in Race, Date
Interval: Data at this level can be ordered as it is in a range of values and meaningful differences
between the data points can be calculated. eg: Temperature in Celsius, Year of Birth
Ratio: Data at this level is similar to interval level with added property of an inherent zero.
Mathematical calculations can be performed on these data points. eg: Height, Age, Weight
Before performing any analysis of data, we should determine if the data we’re dealing with is
population or sample.
Population: The collection of all items (N); it includes each and every unit of our study. It is hard
to measure in full, and a measure of a characteristic of the population, such as the mean or mode,
is called a parameter.
Sample: A subset of the population (n); it includes only a handful of units of the population. It is
selected at random, and a measure of a characteristic of the sample is called a statistic.
For Example, say you want to know the mean income of the subscribers to a movie subscription
service(parameter). We draw a random sample of 1000 subscribers and determine that their mean
income(x̄ ) is $34,500 (statistic). We conclude that the population mean income (μ) is likely to be
close to $34,500 as well.
1.8.4. Measures of Central Tendency
The measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also classed as summary statistics.
1.8.5. Mean: The mean is equal to the sum of all the values in the data set divided by the number
of values in the data set, i.e. the calculated average. It is susceptible to outliers: when unusual
values are added, it gets skewed, i.e. deviates from the typical central value.
1.8.6. Median: The median is the middle value for a dataset that has been arranged in order of
magnitude. Median is a better alternative to mean as it is less affected by outliers and
skewness of the data. The median value is much closer to the typical central value.
1.8.7. Mode: The mode is the most commonly occurring value in the dataset; one can therefore
think of the mode as the most popular option.
For Example, In a dataset containing {13,35,54,54,55,56,57,67,85,89,96} values. Mean is 60.09.
Median is 56. Mode is 54.
Skewness: Skewness is the asymmetry in a statistical distribution, in which the curve appears
distorted or skewed towards to the left or to the right. Skewness indicates whether the data is
concentrated on one side.
Positive Skewness: Positive Skewness is when the mean>median>mode. The outliers are skewed
to the right i.e the tail is skewed to the right.
Negative Skewness: Negative Skewness is when the mean<median<mode. The outliers are
skewed to the left i.e the tail is skewed to the left.
The measure of central tendency gives a single value that represents the whole dataset; however,
the central tendency cannot describe the observations fully. The measure of dispersion helps us to
study the variability of the items, i.e. the spread of the data.
Remember: for Population Data we divide by N, while for Sample Data we divide by (n-1). The
use of (n-1) is called Bessel's Correction and it is used to reduce bias.
1.8.10. Range: The difference between the largest and the smallest value of a dataset is termed
the range of the distribution. Range does not consider all the values of a series, i.e. it takes
only the extreme items; middle items are not considered significant. eg: For
{13,33,45,67,70} the range is 57, i.e. (70–13).
1.8.11. Variance: Variance measures the average of the squared distances from each point to
the mean, i.e. the dispersion around the mean.
Note: The units of the values and of the variance are not the same, so we use another variability measure.
1.8.12. Standard Deviation: As Variance suffers from unit difference so standard deviation is
used. The square root of the variance is the standard deviation. It tells about the
concentration of the data around the mean of the data set.
For eg: for the dataset {3,5,6,9,10}, the mean is 33/5 = 6.6, the sample variance is 33.2/4 = 8.3,
and the sample standard deviation is √8.3 ≈ 2.88 (see the sketch below).
Given measurements on a sample, what is the difference between a standard deviation and
a standard error?
A standard deviation is a sample estimate of the population parameter; that is, it is an estimate
of the variability of the observations. Since the population is unique, it has a unique standard
deviation, which may be large or small depending on how variable the observations are. We
would not expect the sample standard deviation to get smaller because the sample gets larger.
However, a large sample would provide a more precise estimate of the population standard
deviation than a small sample.
A standard error, on the other hand, is a measure of precision of an estimate of a population
parameter. A standard error is always attached to a parameter, and one can have standard errors
of any estimate, such as mean, median, fifth centile, even the standard error of the standard
deviation. Since one would expect the precision of the estimate to increase with the sample size,
the standard error of an estimate will decrease as the sample size increases.
1.8.13. Coefficient of Variation (CV): It is also called the relative standard deviation. It is the
ratio of the standard deviation to the mean of the dataset.
Standard deviation measures the variability of a single dataset, whereas the coefficient of
variation can be used for comparing 2 datasets.
If two datasets have the same CV, they are equally variable relative to their means, i.e. equally
precise; this is what makes the CV well suited for comparisons.
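A sketch of these dispersion measures for the example dataset {3, 5, 6, 9, 10}:

```python
import statistics

data = [3, 5, 6, 9, 10]

data_range = max(data) - min(data)      # 10 - 3 = 7
mean = statistics.fmean(data)           # 6.6
sample_var = statistics.variance(data)  # 33.2 / 4 = 8.3
sample_sd = statistics.stdev(data)      # sqrt(8.3) ≈ 2.881
cv = sample_sd / mean                   # ≈ 0.437, the relative standard deviation
print(data_range, sample_var, sample_sd, cv)
```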
Measures of Relationship
Covariance does not give effective information about the relation between 2 variables as it is not
normalized.
The value of correlation ranges from -1 to 1. A value of -1 indicates negative correlation, i.e. with
an increase in the independent variable there is a decrease in the dependent variable. A value of 1
indicates positive correlation, i.e. with an increase in the independent variable there is an increase
in the dependent variable. A value of 0 indicates that the variables are independent of each other.
For Example,
A correlation of 0.889 tells us Height and Weight have a positive correlation. It is obvious that as
the height of a person increases, weight too increases.
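A small sketch with NumPy; the height/weight numbers are hypothetical:

```python
import numpy as np

height = [160, 165, 170, 175, 180]  # cm (hypothetical)
weight = [55, 60, 66, 70, 78]       # kg (hypothetical)

r = np.corrcoef(height, weight)[0, 1]
print(r)  # close to 1: a strong positive correlation
```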
Gaussian distribution is a synonym for normal distribution: a set of random values whose
probability distribution looks like the picture below.
This is a bell-shaped curve. If a probability distribution plot forms a bell-shaped curve like the one
above, and the mean, median, and mode of the sample are equal, that distribution is called a
normal distribution or Gaussian distribution.
The Gaussian distribution is parameterized by two parameters:
• The mean and The variance
So the Gaussian density is highest at the mean µ, and the further you move from the mean, the
lower the Gaussian density becomes.
This is the formula for the bell-shaped curve, where sigma squared is called the variance:
f(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
Mean = 0, with different sigmas
The first picture shows the probability distribution of a set of random numbers with µ equal to 0
and sigma equal to 1. Since µ is 0, the highest probability density is around 0; since sigma is 1,
the width of the curve is 1, the height of the peak is about 0.4, and the range is -4 to 4 (look at
the x-axis). The variance sigma squared is 1.
The second figure shows another set of random numbers with a µ of 0 and a sigma of 0.5.
Because µ is 0, as in the previous picture, the highest probability density is around 0, and
because sigma is 0.5, the width of the curve is 0.5; the variance sigma squared becomes 0.25.
As the width of the curve is half that of the previous curve, the height becomes double (about
0.8). The range changes to -2 to 2 (x-axis), half of the previous picture.
In the third picture, sigma is 2 and µ is 0, as in the previous two pictures. Compared to figure 1,
where sigma was 1, the height becomes half because the width becomes double as sigma
doubled. The variance sigma squared is 4, four times bigger than in figure 1. Look at the range
on the x-axis: it is -8 to 8.
Finally, we change µ to 3 and keep sigma at 0.5, as in figure 2. The shape of the curve is exactly
the same as figure 2, but the center has shifted to 3; the highest density is now around 3.
The curve changes shape with different values of sigma, but the area under it stays the same:
one important property of a probability distribution is that the area under the curve integrates to one.
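The peak heights quoted above follow from the formula: the density peaks at 1 / (σ√(2π)), so halving sigma doubles the height. A quick check:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for sigma in (1.0, 0.5, 2.0):
    print(sigma, round(normal_pdf(0.0, mu=0.0, sigma=sigma), 4))
# 1.0 -> 0.3989, 0.5 -> 0.7979, 2.0 -> 0.1995
```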
Parameter Estimation
Calculating µ is straightforward: it is simply the average. Take the summation of all the data
points and divide by their total number:
µ = (1/n) Σᵢ xᵢ
In the multivariate density, the symbol |Σ| denotes the determinant of the covariance matrix Σ,
which is a d x d matrix (one row and column per dimension).
Visual Representation of Multivariate Gaussian Distribution
Standard Normal Distribution
When the standard deviation sigma shrinks, the range also shrinks; at the same time, the height of
the curve becomes greater to keep the area constant.
In contrast, when sigma is larger, the variability becomes wider, so the height of the curve
gets lower.
The sigma values for x1 and x2 will not always be the same.
When they differ, the range looks like an ellipse: it shrinks along x1 because the standard
deviation sigma for x1 is smaller.
Change the Correlation Factor Between the Variables
This is a completely different scenario. The off-diagonal values are not zeros anymore. It’s 0.5. It
shows that x1 and x2 are correlated by a factor of 0.5.
The ellipse now has a diagonal direction: x1 and x2 grow together as they are positively
correlated.
When x1 is large x2 also large and when x1 is small, x2 is also small.
Different Means
The center of the curve now shifts away from zero for x2. The center of the highest probability in
the x1 direction is 1.5, and at the same time, the center of the highest probability is -0.5 in the x2
direction.
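A minimal sketch of such a correlated bivariate Gaussian, assuming SciPy; the means and covariance mirror the last figure described above:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = [1.5, -0.5]    # centers for x1 and x2
cov = [[1.0, 0.5],  # off-diagonal 0.5: x1 and x2 are positively
       [0.5, 1.0]]  # correlated, so the ellipse tilts diagonally

dist = multivariate_normal(mean=mu, cov=cov)
print(dist.pdf(mu))                      # density is highest at the mean
print(dist.rvs(size=3, random_state=0))  # a few correlated samples
```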
1.10. Hypothesis Testing
Hypothesis testing is a part of statistical analysis, where we test the assumptions made regarding a
population parameter.
It is generally used when we want to compare:
a single group with an external standard
two or more groups with each other
A Parameter is a number that describes the data from the population whereas, a Statistic is a
number that describes the data from a sample.
Terminologies
1.10.1. Null Hypothesis: The null hypothesis is a statistical theory suggesting that no statistical
significance exists between the populations.
It is denoted by H0 and read as H-naught.
1.10.2. Alternative Hypothesis: An Alternative hypothesis suggests there is a significant
difference between the population parameters; the difference could be greater or smaller.
Basically, it is the opposite of the Null Hypothesis.
It is denoted by Ha or H1.
H0 must always contain equality(=). Ha always contains difference(≠, >, <).
For example, if we were to test the equality of average means (µ) of two groups:
for a two-tailed test, we define H0: µ1 = µ2 and Ha: µ1≠µ2
for a one-tailed test, we define H0: µ1 = µ2 and Ha: µ1 > µ2 or Ha: µ1 < µ2
1.10.3. Level of significance: Denoted by alpha or α. It is a fixed probability of wrongly rejecting
a True Null Hypothesis. For example, if α=5%, that means we are okay to take a 5% risk
and conclude there exists a difference when there is no actual difference.
1.10.4. Test Statistic: It is denoted by t and depends on the test that we run. It is the deciding
factor for rejecting or accepting the Null Hypothesis.
The four main test statistics are given in the below table:
Test statistic          When it is used
Z-statistic (Z-test)    Comparing means when the population variance is known or samples are large
t-statistic (t-test)    Comparing means when the population variance is unknown and samples are small
F-statistic (ANOVA)     Comparing the means of more than two groups
Chi-square statistic    Testing relationships between categorical variables
In hypothesis testing, the following rules are used to either reject or accept the hypothesis given
an α of 0.05. Keep in mind that if you were to have an α of 0.1, your results would be given
with 90% confidence, and the example above, with a p-value of 0.06, would reject H0.
p-value: It is the probability, assuming the Null Hypothesis is true, of observing samples at least
as extreme as the test statistic. It is denoted by the letter p.
Now, assume we are running a two-tailed Z-Test at 95% confidence. Then, the level of
significance (α) = 5% = 0.05. Thus, we will have (1-α) = 0.95 proportion of data at the center, and
α = 0.05 proportion will be equally shared to the two tails. Each tail will have (α/2) = 0.025
proportion of data.
The critical value i.e., Z95% or Zα/2 = 1.96 is calculated from the Z-scores table.
Now, take a look at the below figure for a better understanding of critical value, test-statistic, and
p-value.
Steps of Hypothesis testing
For a given business problem,
1. Start with specifying Null and Alternative Hypotheses about a population parameter
2. Set the level of significance (α)
3. Collect Sample data and calculate the Test Statistic and P-value by running a Hypothesis
test that well suits our data
4. Make Conclusion: Reject or Fail to Reject Null Hypothesis
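A minimal sketch of these steps for a two-tailed one-sample z-test, assuming SciPy; the null value, sigma, and sample summary are hypothetical:

```python
import math
from scipy.stats import norm

mu0, sigma = 100, 15      # Step 1: H0: mu = 100, Ha: mu != 100 (hypothetical)
alpha = 0.05              # Step 2: level of significance
sample_mean, n = 105, 36  # Step 3: hypothetical sample summary

z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # test statistic: 2.0
p_value = 2 * (1 - norm.cdf(abs(z)))              # two-tailed p-value, ≈ 0.0455

# Step 4: conclusion
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```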
Confusion Matrix in Hypothesis testing
To plot a confusion matrix, we can take actual values in columns and predicted values in rows or
vice versa.
Confidence: The probability of accepting a True Null Hypothesis. It is denoted as (1-α)
Power of test: The probability of rejecting a False Null Hypothesis i.e., the ability of the test to
detect a difference. It is denoted as (1-β) and its value lies between 0 and 1.
Type I error: Occurs when we reject a True Null Hypothesis and is denoted as α.
Type II error: Occurs when we accept a False Null Hypothesis and is denoted as β.
Accuracy: Number of correct predictions / Total number of cases
The factors that affect the power of the test are the sample size, the population variability, and
the significance level (α).
Power and significance are linked: for a fixed sample size, increasing the confidence level
(lowering α) decreases the power of the test.
Type 1 and 2 errors occur when we reject or accept our null hypothesis when, in reality, we
shouldn’t have. This happens because, while statistics is powerful, there is a certain chance that
you may be wrong. The table below summarizes these types of errors.
                        Accept H0                                Reject H0
H0 is actually true     Correct: H0 is true and the test         Incorrect: Type I error (H0 is true
                        accepts it                               but the test rejects it)
H0 is actually false    Incorrect: Type II error (H0 is false    Correct: H0 is false and the test
                        but the test accepts it)                 rejects it
Confidence Interval
A confidence interval, in statistics, refers to the probability that a population parameter will fall
between a set of values for a certain proportion of times.
A confidence interval is the mean of your estimate plus and minus the variation in that estimate.
This is the range of values you expect your estimate to fall between if you redo your test, within a
certain level of confidence.
Confidence, in statistics, is another way to describe probability. For example, if you construct a
confidence interval with a 95% confidence level, you are confident that 95 out of 100 times the
estimate will fall between the upper and lower values specified by the confidence interval.
The desired confidence level is usually one minus the alpha (α) value you used in the statistical
test:
Confidence level = 1 − α
So if you use an alpha value of 0.05 for statistical significance, then your confidence level
would be 1 − 0.05 = 0.95, or 95%.
When to use confidence intervals?
Confidence intervals can be calculated for many kinds of statistical estimates, including:
Proportions
Population means
Differences between population means or proportions
Estimates of variation among groups
These are all point estimates, and don’t give any information about the variation around the
number. Confidence intervals are useful for communicating the variation around a point estimate.
Example: Variation around an estimate
You survey 100 Brits and 100 Americans about their television-watching habits, and find that
both groups watch an average of 35 hours of television per week.
However, the British people surveyed had a wide variation in the number of hours watched, while
the Americans all watched similar amounts.
Even though both groups have the same point estimate (average number of hours watched), the
British estimate will have a wider confidence interval than the American estimate because there is
more variation in the data.
Calculating a confidence interval
Most statistical programs will include the confidence interval of the estimate when you run a
statistical test.
If you want to calculate a confidence interval on your own, you need to know:
The point estimate you are constructing the confidence interval for
The critical values for the test statistic
The standard deviation of the sample
The sample size
Once you know each of these components, you can calculate the confidence interval for your
estimate by plugging them into the confidence interval formula that corresponds to your data.
Point estimate
The point estimate of your confidence interval will be whatever statistical estimate you are
making (e.g. population mean, the difference between population means, proportions, variation
among groups).
Example: Point estimate - In the TV-watching example, the point estimate is the mean number of
hours watched: 35.
Finding the critical value
Critical values tell you how many standard deviations away from the mean you need to go in
order to reach the desired confidence level for your confidence interval.
There are three steps to find the critical value.
1. Choose your alpha (α) value.
The alpha value is the probability threshold for statistical significance. The most common alpha
value is 0.05, but 0.1, 0.01, and even 0.001 are sometimes used. It's best to look at the papers
published in your field to decide which alpha value to use.
2. Decide if you need a one-tailed interval or a two-tailed interval.
You will most likely use a two-tailed interval unless you are doing a one-tailed t-test.
For a two-tailed interval, divide your alpha by two to get the alpha value for the upper and lower
tails.
3. Look up the critical value that corresponds with the alpha value.
If your data follows a normal distribution, or if you have a large sample size (n > 30) that is
approximately normally distributed, you can use the z-distribution to find your critical values.
For a z-statistic, some of the most common values are shown in this table:
Confidence level       90%      95%      99%
Critical value (Z*)    1.645    1.96     2.576
If you are using a small dataset (n ≤ 30) that is approximately normally distributed, use the t-
distribution instead.
The t-distribution follows the same shape as the z-distribution, but corrects for small sample sizes.
For the t-distribution, you need to know your degrees of freedom (sample size minus 1).
Check out this set of t tables to find your t-statistic. The author has included the confidence level
and p-values for both one-tailed and two-tailed tests to help you find the t-value you need.
For normal distributions, like the t-distribution and z-distribution, the critical value is the same on
either side of the mean.
Example: Critical value In the TV-watching survey, there are more than 30 observations and the
data follow an approximately normal distribution (bell curve), so we can use the z-distribution for
our test statistics.
For a two-tailed 95% confidence interval, the alpha value in each tail is 0.025, and the
corresponding critical value is 1.96.
This means that to calculate the upper and lower bounds of the confidence interval, we can take
the mean ±1.96 standard deviations from the mean.
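Instead of a printed table, the critical values can be looked up with SciPy's inverse CDF; a short sketch:

```python
from scipy.stats import norm

for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    z_star = norm.ppf(1 - alpha / 2)  # two-tailed critical value
    print(confidence, round(z_star, 3))
# 0.9 -> 1.645, 0.95 -> 1.96, 0.99 -> 2.576
```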
Finding the standard deviation
Most statistical software will have a built-in function to calculate your standard deviation, but to
find it by hand you can first find your sample variance, then take the square root to get the
standard deviation.
1. Find the sample variance
Sample variance is defined as the sum of squared differences from the mean divided by (n − 1);
it is closely related to the mean squared error (MSE).
To find it, subtract your sample mean from each value in the dataset and square each resulting
number. Then add up all of these squared differences and divide the total by n − 1 (sample size
minus 1) to get your sample variance (s²). For larger sample sets, it's easiest to do this in Excel.
2. Find the standard deviation.
The standard deviation of your estimate (s) is equal to the square root of the sample
variance (s²):
Example: Standard deviation In the television-watching survey, the variance in the GB estimate
is 100, while the variance in the USA estimate is 25. Taking the square root of the variance gives
us a sample standard deviation (s) of:
10 for the GB estimate.
5 for the USA estimate.
Sample size
The sample size is the number of observations in your data set.
Example: Sample size In our survey of Americans and Brits, the sample size is 100 for each
group.
Confidence interval for the mean of normally-distributed data
Normally-distributed data forms a bell shape when plotted on a graph, with the sample mean in
the middle and the rest of the data distributed fairly evenly on either side of the mean.
The confidence interval for data which follows a standard normal distribution is:
CI = X̄ ± Z* × (σ / √n)
Where:
CI = the confidence interval
X̄ = the population mean
Z* = the critical value of the z-distribution
σ = the population standard deviation
√n = the square root of the population size
The confidence interval for the t-distribution follows the same formula, but replaces the Z* with
the t*.
In real life, you never know the true values for the population (unless you can do a complete
census). Instead, we replace the population values with the values from our sample data, so the
formula becomes:
Where:
x̄ = the sample mean
s = the sample standard deviation
Example: Calculating the confidence interval- In the survey of Americans’ and Brits’
television watching habits, we can use the sample mean, sample standard deviation, and sample
size in place of the population mean, population standard deviation, and population size.
To calculate the 95% confidence interval, we can simply plug the values into the formula.
For the USA: 35 ± 1.96 × (5 / √100) = 35 ± 0.98
So for the USA, the lower and upper bounds of the 95% confidence interval are 34.02 and 35.98.
For GB: 35 ± 1.96 × (10 / √100) = 35 ± 1.96
So for GB, the lower and upper bounds of the 95% confidence interval are 33.04 and 36.96.
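A small sketch of these two interval calculations:

```python
import math

def ci_mean(x_bar, s, n, z_star=1.96):
    """Confidence interval x_bar +/- z* * s / sqrt(n)."""
    margin = z_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

print(ci_mean(35, s=5, n=100))   # USA: (34.02, 35.98)
print(ci_mean(35, s=10, n=100))  # GB:  (33.04, 36.96)
```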
1. The confidence ‘level’ refers to the long term success rate of the method i.e. how often this type
of interval will capture the parameter of interest.
2. A specific confidence interval gives a range of plausible values for the parameter of interest.
3. A larger margin of error produces a wider confidence interval that is more likely to contain the
parameter of interest(increased confidence)
4. Increasing the confidence will increase the margin of error resulting in a wider interval.
1) Find the median for the data set: 34, 22, 15, 25, 10.
Step 1: Arrange the data in increasing order: 10, 15, 22, 25, 34
Step 2: There are 5 numbers in the data set, n = 5, which is odd, so the median is the middle
(3rd) value: median = 22.
2) Find the median for the data set: 19, 34, 22, 15, 25, 10.
Step 1: Arrange data in increasing order 10, 15, 19, 22, 25, 34
Step 2: There are 6 numbers in the data set, n = 6.
Step 3: n = 6, so n is an even number Median = average of two middle numbers median =
(19+22)/2= 20.5
Notes: Mean and median don't have to be numbers from the data set! Mean and median can
only take one value each. Mean is influenced by extreme values, while median is resistant.
3) Find the mode for the data set: 19, 19, 34, 3, 10, 22, 10, 15, 25, 10, 6.
The number that occurs the most is number 10, mode = 10.
4) Find the mode for the data set: 19, 19, 34, 3, 10, 22, 10, 15, 25, 10, 6, 19.
Number 10 occurs 3 times, but number 19 also occurs 3 times; since there is no number that
occurs 4 times, both 10 and 19 are modes: mode = {10, 19}.
Notes: Mode is always the number from the data set. Mode can take zero, one, or more than one
values. (There can be zero modes, one mode, two modes, ...)
5) Find the mean, median, mode, and range for the following list of values: 13, 18, 13,
14, 13, 16, 14, 21, 13
Solution: The mean is the usual average, so we’ll add and then divide:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean, in this case, isn’t a value from the original list. This is a common result. You
should not assume that your mean will be one of your original numbers.
The median is the middle value, so first we'll have to rewrite the list in numerical order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number, which is 14.
The mode is the number that is repeated more often than any other, so 13 is the mode, since 13 is
being repeated 4 times.
The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.
Mean: 15 | median: 14 | mode: 13 | range: 8
6) Find the mean, median, mode, and range for the following list of values: 1, 2, 4, 7
The mean is the usual average: (1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5
The median is the middle number. In this example, the numbers are already listed in numerical
order, so we don’t have to rewrite the list. But there is no “middle” number, because there are
even number of numbers. Because of this, the median of the list will be the mean (that is, the
usual average) of the middle two values within the list. The middle two numbers are 2 and 4, so:
(2 + 4) ÷ 2 = 6 ÷ 2 = 3
So the median of this list is 3, a value that isn’t in the list at all.
The mode is the number that is repeated most often, but all the numbers in this list appear only
once, so there is no mode.
The largest value in the list is 7, the smallest is 1, and their difference is 6, so the range is 6.
Because we know the population mean and standard deviation, as well as the distribution (IQ’s
are generally normally distributed), we can use a z-test.
1. H0: µ ≤ 100
Ha: µ > 100
The null and alternative hypotheses are for the parameter µ because the number of dollars per
contract is a continuous random variable. Also, this is a one-tailed test because the company is
only interested in whether the number of dollars per contract is below a particular number, not
in "too high" a number. This can be thought of as making a claim that the requirement is being
met, and thus the claim is in the alternative hypothesis.
2. Test statistic:
The test statistic is a Student's t because the sample size is below 30; therefore, we cannot use
the normal distribution. Comparing the calculated value of the test statistic to the critical value
of t at a 5% significance level, we see that the calculated value is in the tail of the
distribution. Thus, we conclude that 108 dollars per contract is significantly larger than the
hypothesized value of 100, and thus we cannot accept the null hypothesis. There is evidence
that supports the claim that Jane's performance meets company standards.
10) A teacher believes that 85% of students in the class will want to go on a field trip to
the local zoo. She performs a hypothesis test to determine if the percentage is the
same or different from 85%. The teacher samples 50 students and 39 reply that they
would want to go to the zoo. For the hypothesis test, use a 1% level of significance.
Since the problem is about percentages, this is a test of single population proportions.
H0 : p = 0.85
Ha: p ≠ 0.85
p-value = 0.7554
Because p > α, we fail to reject the null hypothesis. There is not sufficient evidence to suggest
that the proportion of students that want to go to the zoo is not 85%.
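As a sketch, the same test can be run with a normal approximation in Python, assuming SciPy; the p-value from this approximation differs from the one quoted above (software and exact methods vary), but the conclusion at α = 0.01 is the same:

```python
import math
from scipy.stats import norm

p0, n, successes = 0.85, 50, 39
p_hat = successes / n                 # 0.78

se = math.sqrt(p0 * (1 - p0) / n)     # standard error under H0
z = (p_hat - p0) / se                 # ≈ -1.39
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed, ≈ 0.166

print("reject H0" if p_value < 0.01 else "fail to reject H0")
```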
11) Elaborate about Normal Distribution, and calculate the Z score for the following
scenario:
You collect SAT scores from students in a new test preparation course. The data
follows a normal distribution with a mean score (M) of 1150 and a standard
deviation (SD) of 150, i.e. μ = 1150, σ = 150. One student scores 1380 on the test.
To standardize your data, you first find the z-score for 1380. The z-score tells you how many
standard deviations away 1380 is from the mean.
The z-score for a value of 1380 is z = (1380 − 1150) / 150 = 230 / 150 ≈ 1.53. That means 1380
is 1.53 standard deviations above the mean of your distribution.
QUESTION BANK
Part-A
Q.No | Question | CO | BT Level
1. Interpret Statistics | CO1 | L2
2. Differentiate discrete data and continuous data | CO1 | L2
3. Outline about the Normal Distribution | CO1 | L2
4. List the properties of Normal distribution | CO1 | L2
5. Enumerate the measures of central tendency | CO1 | L2
6. Differentiate population and sample | CO1 | L2
7. Interpret mean and median | CO1 | L2
8. Outline about Standard Deviation | CO1 | L2
9. Find the mean, median, mode, and range for the following list of values: 1, 2, 4, 7 | CO1 | L3
10. Calculate covariance for the following data set: x: 2.1, 2.5, 3.6, 4.0 (mean = 3.1); y: 8, 10, 12, 14 (mean = 11) | CO1 | L3
11. Find the median for the data set: 34, 22, 15, 25, 10 | CO1 | L3
12. Enumerate covariance matrix | CO1 | L2
13. Interpret covariance | CO1 | L2
14. Differentiate positive and negative covariance | CO1 | L2
15. Describe the measures of Variability | CO1 | L2
16. Why is the multivariate normal distribution so important? | CO1 | L4
17. Enumerate measures of asymmetry | CO1 | L2
18. Differentiate Null and Alternate Hypothesis | CO1 | L2
19. Interpret Hypothesis testing | CO1 | L2
20. Consider the following data points: 17, 16, 21, 18, 15, 17, 21, 19, 11, 23. Find the mean and median. | CO1 | L3
PART B
Q.No | Question | CO | BT Level
7. Explain about descriptive statistics | CO1 | L2
8. Elaborate about Standard Normal Distribution, and calculate the Z score for the following scenario: You collect SAT scores from students in a new test preparation course. The data follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD) of 150, i.e. μ = 1150, σ = 150. | CO1 | L4