WEEK - 01 to 12 NOTES
3. Sample space: A sample space is a set that contains all outcomes of an experiment.
• The sample space of an experiment is a set, typically denoted S.
• example: Toss a coin: S = { heads, tails }
5. Disjoint events: Two events with an empty intersection are said to be disjoint events.
• A higher probability of an event means a higher chance of that event occurring.
• 0 means the event cannot occur and 1 means the event always occurs.
P (∅) = 0
P (Eᶜ) = 1 − P (E)
If E ⊆ F : P (F ) = P (E) + P (F \ E) ⇒ P (E) ≤ P (F )
P (E) = P (E ∩ F ) + P (E \ F )
P (F ) = P (E ∩ F ) + P (F \ E)
P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F )
13. Equally likely outcomes: assign the same probability to each outcome of the sample space.
15. Conditional probability space: Consider a probability space (S, E, P ), where S
represents the sample space, E represents the collection of events, and P represents
the probability function.
• Let B be an event in S with P (B) > 0. The conditional probability space given B is defined as follows: for any event A in the original probability space (S, E, P ), the conditional probability of A given B is P (A ∩ B)/P (B).
• It is denoted by P (A | B), and
P (A ∩ B) = P (B)P (A | B)
• If the events B1 and B2 partition the sample space S such that P (B1 ), P (B2 ) ≠ 0, then for any event A of S,
P (A) = P (B1 )P (A | B1 ) + P (B2 )P (A | B2 )
17. Bayes’ theorem: Let A and B be two events such that P (A) > 0, P (B) > 0. Then
P (A ∩ B) = P (B)P (A | B) = P (A)P (B | A)
⇒ P (B | A) = P (B)P (A | B)/P (A)
In general, if the events B1 , B2 , · · · , Bk partition S such that P (Bi ) ≠ 0 for i = 1, 2, · · · , k, then for any event A in S such that P (A) ≠ 0,
P (Br | A) = P (Br )P (A | Br ) / ∑_{i=1}^{k} P (Bi )P (A | Bi )
for r = 1, 2, · · · , k.
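As a quick numerical check, here is a minimal Python sketch of the partition form of Bayes’ theorem; the prior and conditional probabilities are made-up values for a hypothetical two-event partition (e.g. a diagnostic test).

```python
# Hypothetical example: B1 = "condition present", B2 = "condition absent",
# A = "test positive". All numbers are assumptions for illustration.
p_B = [0.01, 0.99]          # prior probabilities P(B1), P(B2)
p_A_given_B = [0.95, 0.10]  # P(A | B1), P(A | B2)

# Total probability: P(A) = sum_i P(Bi) P(A | Bi)
p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_given_B))

# Bayes' theorem: P(Br | A) = P(Br) P(A | Br) / P(A)
posterior = [pb * pa / p_A for pb, pa in zip(p_B, p_A_given_B)]
print(posterior)  # ≈ [0.0876, 0.9124]
```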
18. Independence of two events: Two events A and B are independent iff
P (A ∩ B) = P (A)P (B)
• Disjoint events (with nonzero probabilities) are never independent.
• A and B independent ⇒ A and Bᶜ are independent.
• A and B independent ⇒ Aᶜ and Bᶜ are independent.
19. Mutual independence of three events: Events A, B, and C are mutually indepen-
dent if
(a) P (A ∩ B) = P (A)P (B)
(b) P (A ∩ C) = P (A)P (C)
(c) P (B ∩ C) = P (B)P (C)
(d) P (A ∩ B ∩ C) = P (A)P (B)P (C)
20. Mutual independence of multiple events: Events A1 , A2 , · · · , An are mutually independent if, for every choice of distinct indices i1 , i2 , · · · , ik ,
P (Ai1 ∩ Ai2 ∩ · · · ∩ Aik ) = P (Ai1 )P (Ai2 ) · · · P (Aik )
If n events are mutually independent, then any subset of them, with or without complementing, is independent as well.
21. Occurrence of event A in a sample space is considered a success.
22. Non-occurrence of event A in a sample space is considered a failure.
23. Repeated independent trials:
(a) Bernoulli trials
• Single Bernoulli trial:
– Sample space is {success, failure} with P(success) = p.
– We can also write the sample space S as {0, 1}, where 0 denotes the
failure and 1 denotes the success with P (1) = p, P (0) = 1 − p.
This kind of distribution is denoted by Bernoulli(p).
• Repeated Bernoulli trials:
– Repeat a Bernoulli trial multiple times independently.
– For each trial, the outcome is either 0 or 1.
(b) Binomial distribution: Perform n independent Bernoulli(p) trials.
• It models the number of successes in n independent Bernoulli trials.
• Denoted by B(n, p).
• Sample space is {0, 1, · · · , n}.
• Probability distribution is given by
P (B(n, p) = k) = nCk p^k (1 − p)^(n−k)
where n is the total number of trials and k is the number of successes in n trials.
• P (B = 0) + P (B = 1) + · · · + P (B = n) = 1
⇒ (1 − p)^n + nC1 p(1 − p)^(n−1) + nC2 p^2 (1 − p)^(n−2) + · · · + p^n = 1.
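A minimal Python sketch of the B(n, p) PMF, with assumed values n = 10, p = 0.3; the final line checks the identity above (the PMF sums to 1).

```python
from math import comb

n, p = 10, 0.3  # assumed example values

def binom_pmf(k: int) -> float:
    # P(B(n, p) = k) = nCk * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binom_pmf(3))                             # P(exactly 3 successes)
print(sum(binom_pmf(k) for k in range(n + 1)))  # ≈ 1.0
```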
(c) Geometric distribution: It models the number of trials needed for the first success.
• Outcome: the number of trials needed for the first success; denoted G(p).
• Sample space: {1, 2, 3, 4, · · · }
• P (G = k) = P (first k − 1 trials result in 0 and kth trial results in 1) = (1 − p)^(k−1) p.
• Identity: P (G ≤ k) = 1 − (1 − p)^k .
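A small simulation sketch (assumed values p = 0.25, k = 5) comparing the empirical value of P (G ≤ k) with the identity 1 − (1 − p)^k.

```python
import random

random.seed(0)
p, k, trials = 0.25, 5, 100_000  # assumed example values

def geometric_sample() -> int:
    # number of Bernoulli(p) trials up to and including the first success
    count = 1
    while random.random() >= p:  # failure occurs with probability 1 - p
        count += 1
    return count

empirical = sum(geometric_sample() <= k for _ in range(trials)) / trials
print(empirical, 1 - (1 - p)**k)  # the two values should be close
```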
Statistics for Data Science - 2
1. Random variable: A random variable is a function with domain as the sample space
of an experiment and range as the real numbers, i.e. a function from the sample space
to the real line.
3. Range of a random variable: The range of a random variable is the set of values
taken by it. Range is a subset of the real line.
5. Probability Mass Function (PMF): The probability mass function (PMF) of a dis-
crete random variable (r.v.) X with range set T is the function fX : T → [0, 1] defined
as
fX (t) = P (X = t) for t ∈ T .
6. Properties of PMF:
• 0 ≤ fX (t) ≤ 1
• ∑_{t∈T} fX (t) = 1
• Uniform distribution:
– Range: finite set T
– PMF: fX (t) = 1/|T | for all t ∈ T
Statistics for Data Science - 2
Week 3 Notes
Multiple Random Variables
1. Joint probability mass function: Suppose X and Y are discrete random variables defined in the same probability space. Let the range of X and Y be TX and TY , respectively. The joint PMF of X and Y , denoted fXY , is a function from TX × TY to [0, 1] defined as
fXY (t1 , t2 ) = P (X = t1 , Y = t2 ) for t1 ∈ TX , t2 ∈ TY .
2. Marginal PMF: Suppose X and Y are jointly distributed discrete random variables with joint PMF fXY . The PMFs of the individual random variables X and Y are called marginal PMFs. It can be shown that
fX (t1 ) = P (X = t1 ) = ∑_{t2 ∈TY } fXY (t1 , t2 ) for t1 ∈ TX
fY (t2 ) = P (Y = t2 ) = ∑_{t1 ∈TX } fXY (t1 , t2 ) for t2 ∈ TY
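A minimal sketch of marginalization in Python: the joint PMF is stored as a dictionary (the values are made-up numbers that sum to 1), and each marginal is obtained by summing out the other variable.

```python
from collections import defaultdict

f_XY = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # assumed joint PMF

f_X, f_Y = defaultdict(float), defaultdict(float)
for (t1, t2), p in f_XY.items():
    f_X[t1] += p  # sum over t2 in TY
    f_Y[t2] += p  # sum over t1 in TX

print(dict(f_X))  # {0: 0.3, 1: 0.7}
print(dict(f_Y))  # {0: 0.4, 1: 0.6}
```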
3. Conditioning on an event: Let X be a discrete random variable and let A be an event with P (A) > 0. We will denote the conditional random variable by X|A. (Note that X|A is a valid random variable with PMF fX|A .)
• fX|A (t) = P ((X = t) ∩ A)/P (A)
• Range of (X|A) can be different from TX and will depend on A.
4. Conditional distribution of one random variable given another:
Suppose X and Y are jointly distributed discrete random variables with joint PMF
fXY . The conditional PMF of Y given X = t is defined as the PMF
fY |X=x (y) = P (X = x, Y = y)/P (X = x) = fXY (x, y)/fX (x)
We will denote the conditional random variable by Y |(X = x). (Note that Y |(X = x) is a valid random variable with PMF fY |(X=x) .)
• Marginal PMFs from the joint PMF of n random variables X1 , X2 , . . . , Xn :
fX2 (t2 ) = P (X2 = t2 ) = ∑_{t1 ∈TX1 , t3 ∈TX3 , ..., tn ∈TXn } fX1 X2 ...Xn (t1 , t2 , . . . , tn )
⋮
fXn (tn ) = P (Xn = tn ) = ∑_{t1 ∈TX1 , t2 ∈TX2 , ..., tn−1 ∈TXn−1 } fX1 X2 ...Xn (t1 , t2 , . . . , tn )
8. Conditioning with multiple discrete random variables:
• A wide variety of conditioning is possible when there are many random variables.
Some examples are:
• Suppose X1 , X2 , X3 , X4 ∼ fX1 X2 X3 X4 and xi ∈ TXi , then
– fX1 |X2 =x2 (x1 ) = fX1 X2 (x1 , x2 )/fX2 (x2 )
– fX1 ,X2 |X3 =x3 (x1 , x2 ) = fX1 X2 X3 (x1 , x2 , x3 )/fX3 (x3 )
– fX1 |X2 =x2 ,X3 =x3 (x1 ) = fX1 X2 X3 (x1 , x2 , x3 )/fX2 X3 (x2 , x3 )
– fX1 X4 |X2 =x2 ,X3 =x3 (x1 , x4 ) = fX1 X2 X3 X4 (x1 , x2 , x3 , x4 )/fX2 X3 (x2 , x3 )
9. Conditioning and factors of the joint PMF:
Let X1 , X2 , X3 , X4 ∼ fX1 X2 X3 X4 , Xi ∈ TXi .
fX1 X2 X3 X4 (t1 , t2 , t3 , t4 ) = P (X1 = t1 and (X2 = t2 , X3 = t3 , X4 = t4 ))
= fX1 |X2 =t2 ,X3 =t3 ,X4 =t4 (t1 ) P (X2 = t2 and (X3 = t3 , X4 = t4 ))
= fX1 |X2 =t2 ,X3 =t3 ,X4 =t4 (t1 ) fX2 |X3 =t3 ,X4 =t4 (t2 ) P (X3 = t3 and X4 = t4 )
= fX1 |X2 =t2 ,X3 =t3 ,X4 =t4 (t1 ) fX2 |X3 =t3 ,X4 =t4 (t2 ) fX3 |X4 =t4 (t3 ) fX4 (t4 ).
• Factoring can be done in any sequence.
10. Independence of two random variables:
Let X and Y be two random variables defined in a probability space with ranges TX
and TY , respectively. X and Y are said to be independent if any event defined using
X alone is independent of any event defined using Y alone. Equivalently, if the joint
PMF of X and Y is fXY , X and Y are independent if
fXY (x, y) = fX (x)fY (y)
for x ∈ TX and y ∈ TY
• To show X and Y dependent, verify
fXY (x, y) ≠ fX (x)fY (y)
for some x ∈ TX and y ∈ TY
– Special case: fXY (t1 , t2 ) = 0 when fX (t1 ) ≠ 0, fY (t2 ) ≠ 0.
11. Independence of multiple random variables:
Let X1 , X2 , . . . , Xn be random variables defined in a probability space with range of
Xi denoted TXi . X1 , X2 , . . . , Xn are said to be independent if events defined using
different Xi are mutually independent. Equivalently, X1 , X2 , . . . , Xn are independent
iff
fX1 X2 ...Xn (t1 , t2 , . . . , tn ) = fX1 (t1 )fX2 (t2 ) . . . fXn (tn )
for all ti ∈ TXi
• All subsets of independent random variables are independent.
12. Independent and Identically Distributed (i.i.d.) random variables:
Random variables X1 , X2 , . . . , Xn are said to be independent and identically distributed
(i.i.d.), if
(i) they are independent.
(ii) the marginal PMFs fXi are identical.
Examples:
• Repeated trials of an experiment creates i.i.d. sequence of random variables
– Toss a coin multiple times.
– Throw a die multiple times.
• Let X1 , X2 , . . . , Xn ∼ i.i.d. X, where X ∼ Geometric(p).
X takes values in {1, 2, . . .} with P (X = k) = (1 − p)^(k−1) p.
13. Function of random variables (g(X1 , X2 , . . . , Xn )):
Suppose X1 , X2 , . . . , Xn have joint PMF fX1 X2 ...Xn with TXi denoting the range of Xi .
Let g : TX1 × TX2 × . . . × TXn → R be a function with range Tg . The PMF of
X = g(X1 , X2 , . . . , Xn ) is given by
fX (t) = P (g(X1 , X2 , . . . , Xn ) = t) = ∑_{(t1 ,...,tn ): g(t1 ,...,tn )=t} fX1 X2 ...Xn (t1 , t2 , . . . , tn )
• PMF of Z = X + Y :
P (Z = z) = P (X + Y = z)
= ∑_{x=−∞}^{∞} P (X = x, Y = z − x)
= ∑_{x=−∞}^{∞} fXY (x, z − x)
= ∑_{y=−∞}^{∞} fXY (z − y, y)
• Convolution: If X and Y are independent, fX+Y (z) = ∑_{x=−∞}^{∞} fX (x)fY (z − x)
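A minimal sketch of the convolution formula for independent X and Y, using made-up PMFs on small finite ranges:

```python
from collections import defaultdict

f_X = {0: 0.5, 1: 0.5}            # assumed PMF, e.g. a fair Bernoulli
f_Y = {0: 0.25, 1: 0.5, 2: 0.25}  # assumed PMF

f_Z = defaultdict(float)
for x, px in f_X.items():
    for y, py in f_Y.items():
        f_Z[x + y] += px * py  # f_{X+Y}(z) = sum_x f_X(x) f_Y(z - x)

print(dict(f_Z))  # PMF of Z = X + Y
```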
• PMF of Z = min{X, Y }:
fZ (z) = P (Z = z) = P (min{X, Y } = z)
= P (X = z, Y = z) + P (X = z, Y > z) + P (X > z, Y = z)
= fXY (z, z) + ∑_{t2 >z} fXY (z, t2 ) + ∑_{t1 >z} fXY (t1 , z)
• CDF of Z = min{X, Y }:
FZ (z) = P (Z ≤ z) = P (min{X, Y } ≤ z)
= 1 − P (min{X, Y } > z)
= 1 − P (X > z, Y > z)
• PMF of Z = max{X, Y }:
fZ (z) = P (Z = z) = P (max{X, Y } = z)
= P (X = z, Y = z) + P (X = z, Y < z) + P (X < z, Y = z)
= fXY (z, z) + ∑_{t2 <z} fXY (z, t2 ) + ∑_{t1 <z} fXY (t1 , z)
• CDF of Z = max{X, Y }:
FZ (z) = P (Z ≤ z) = P (max{X, Y } ≤ z) = P (X ≤ z, Y ≤ z)
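A small simulation sketch of the max formula, assuming X and Y are two independent fair dice, so that F_max(z) = P (X ≤ z)P (Y ≤ z) = (z/6)²:

```python
import random

random.seed(0)
n, z = 100_000, 4  # assumed values

hits = sum(max(random.randint(1, 6), random.randint(1, 6)) <= z for _ in range(n))
print(hits / n, (z / 6) ** 2)  # empirical vs. exact (4/6)^2 ≈ 0.444
```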
Important Points:
6. Sum of r i.i.d. Geometric(p) is Negative-Binomial(r, p).
8. If X and Y are independent, then g(X) and h(Y ) are also independent.
Statistics for Data Science - 2
Week 4 Notes
Expected value
• If X is a non-negative random variable, then E[X] ≥ 0.
• For a function g : TX1 × . . . × TXn → R, E[g(X1 , . . . , Xn )] = ∑ g(t1 , . . . , tn ) fX1 ...Xn (t1 , . . . , tn ), summing over all (t1 , . . . , tn ).
Variance measures the spread about the expected value.
Variance of random variable X is also given by Var(X) = E[X²] − E[X]²
1. Var(aX) = a2 Var(X)
2. SD(aX) = |a| SD(X)
3. Var(X + a) = Var(X)
4. SD(X + a) = SD(X)
• Covariance:
Definition: Suppose X and Y are random variables on the same probability space. The covariance of X and Y , denoted Cov(X, Y ), is defined as
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
1. Cov(X, X) = Var(X)
2. Cov(X, Y ) = E[XY ] − E[X]E[Y ]
3. Covariance is symmetric: Cov(X, Y ) = Cov(Y, X)
4. Covariance is a “linear” quantity.
(a) Cov(X, aY + bZ) = aCov(X, Y ) + bCov(X, Z)
(b) Cov(aX + bY, Z) = aCov(X, Z) + bCov(Y, Z)
5. Independence: If X and Y are independent, then X and Y are uncorrelated, i.e.
Cov(X, Y ) = 0
6. If X and Y are uncorrelated, they may be dependent.
• Correlation coefficient:
Definition: The correlation coefficient (or correlation) of two random variables X and Y , denoted ρ(X, Y ), is defined as
ρ(X, Y ) = Cov(X, Y )/(SD(X) SD(Y ))
1. −1 ≤ ρ(X, Y ) ≤ 1.
2. ρ(X, Y ) summarizes the trend between random variables.
3. ρ(X, Y ) is a dimensionless quantity.
4. If ρ(X, Y ) is close to zero, there is no clear linear trend between X and Y .
5. If ρ(X, Y ) = 1 or ρ(X, Y ) = −1, Y is a linear function of X.
6. If | ρ(X, Y ) | is close to one, X and Y are strongly correlated.
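A minimal numpy sketch with made-up data that has a strong linear trend, illustrating that ρ(X, Y ) is then close to +1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)  # roughly linear in x (assumed model)

cov_xy = np.cov(x, y)[0, 1]    # off-diagonal entry of the 2x2 covariance matrix
rho = np.corrcoef(x, y)[0, 1]  # Cov(X,Y) / (SD(X) SD(Y))
print(cov_xy, rho)             # rho should be close to +1
```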
• Bounds on probabilities using mean and variance
1. Markov’s inequality: Let X be a discrete random variable taking non-negative values with a finite mean µ. Then,
P (X ≥ c) ≤ µ/c
Mean µ, through Markov’s inequality, bounds the probability that a non-negative random variable takes values much larger than the mean.
2. Chebyshev’s inequality: Let X be a discrete random variable with a finite mean µ and a finite variance σ². Then,
P (|X − µ| ≥ kσ) ≤ 1/k²
Other forms:
(a) P (|X − µ| ≥ c) ≤ σ²/c², P ((X − µ)² > k²σ²) ≤ 1/k²
(b) P (µ − kσ < X < µ + kσ) ≥ 1 − 1/k²
Mean µ and standard deviation σ, through Chebyshev’s inequality, bound the probability that X is away from µ by kσ.
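A quick empirical sketch of Chebyshev’s bound for an assumed distribution (the sum of two fair dice):

```python
import random, statistics

random.seed(0)
samples = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100_000)]
mu = statistics.mean(samples)
sigma = statistics.stdev(samples)

k = 2
tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
print(tail, 1 / k**2)  # empirical P(|X - mu| >= k*sigma) vs. the bound 1/k^2
```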
Statistics for Data Science - 2
Week 5 Notes
Continuous Random Variables
• P (X = x) = F (x) − F (x−), the jump of the CDF F at x, where F (x−) is the left limit of F at x.
• If F is continuous at x, then
P (X = x) = 0
• The probability of X falling in an interval can still be nonzero.
6. For a random variable X with PDF fX , an event A is a subset of the real line and its
probability is computed as
P (A) = ∫_A fX (x) dx
• P (a < X < b) = FX (b) − FX (a) = ∫_a^b fX (x) dx
7. Density function:
A function f : R → R is said to be a density function if
(i) f (x) ≥ 0
(ii) ∫_{−∞}^{∞} f (x) dx = 1
(iii) f (x) is piecewise continuous
10. Continuous Uniform distribution:
• X ∼ Uniform[a, b]
• PDF:
fX (x) = 1/(b − a) for a < x < b, and 0 otherwise
• CDF:
FX (x) = 0 for x ≤ a; (x − a)/(b − a) for a < x < b; 1 for x ≥ b
15. Functions of continuous random variable:
Suppose X is a continuous random variable with CDF FX and PDF fX and suppose
g : R → R is a (reasonable) function. Then, Y = g(X) is a random variable with CDF
FY determined as follows:
• FY (y) = P (Y ≤ y) = P (g(X) ≤ y) = P (X ∈ {x : g(x) ≤ y})
• To evaluate the above probability:
– Convert the subset Ay = {x : g(x) ≤ y} into intervals in the real line.
– Find the probability that X falls in those intervals.
– FY (y) = P (X ∈ Ay ) = ∫_{Ay } fX (x) dx
• If FY has no jumps, you may be able to differentiate and find a PDF.
16. Theorem: Monotonic differentiable function
Suppose X is a continuous random variable with PDF fX . Let g(x) be monotonic for x ∈ supp(X) with derivative g′(x) = dg(x)/dx. Then, the PDF of Y = g(X) is
fY (y) = fX (g⁻¹(y)) / |g′(g⁻¹(y))|
• Translation: Y = X + a: fY (y) = fX (y − a)
• Scaling: Y = aX: fY (y) = (1/|a|) fX (y/a)
• Affine: Y = aX + b: fY (y) = (1/|a|) fX ((y − b)/a)
• An affine transformation of a normal random variable is normal.
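A simulation sketch of the theorem for the assumed monotonic map g(x) = eˣ with X ∼ Normal(0, 1); here g⁻¹(y) = ln y and g′(g⁻¹(y)) = y, so fY (y) = fX (ln y)/y.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
y = np.exp(x)  # Y = g(X) with g monotonic on the support of X

# empirical density of Y near y0: P(y0 - h < Y < y0 + h) / (2h)
y0, h = 1.5, 0.05
empirical = np.mean((y > y0 - h) & (y < y0 + h)) / (2 * h)

f_X = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)  # standard normal PDF
exact = f_X(np.log(y0)) / y0  # f_X(g^{-1}(y)) / |g'(g^{-1}(y))|
print(empirical, exact)       # the two values should be close
```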
17. Expected value of function of continuous random variable:
Let X be a continuous random variable with density fX (x). Let g : R → R be a
function. The expected value of g(X), denoted E[g(X)], is given by
E[g(X)] = ∫_{−∞}^{∞} g(x)fX (x) dx
19. Variance of a continuous random variable:
Variance, denoted Var(X) or σX² or simply σ², is given by
Var(X) = E[(X − E[X])²] = ∫_{−∞}^{∞} (x − µ)² fX (x) dx
X | E[X] | Var(X)
Uniform[a, b] | (a + b)/2 | (b − a)²/12
Exp(λ) | 1/λ | 1/λ²
Normal(µ, σ²) | µ | σ²
Statistics for Data Science - 2
Week 6 Notes
1. Marginal density: Let (X, Y ) be jointly distributed where X is discrete with range
TX and PMF pX (x).
For each x ∈ TX , we have a continuous random variable Yx with density fYx (y).
fYx (y) : conditional density of Y given X = x, denoted fY |X=x (y).
• Marginal density of Y :
fY (y) = ∑_{x∈TX } pX (x) fY |X=x (y)
4. 2D uniform distribution: Fix some (reasonable) region D in R2 with total area |D|.
We say that (X, Y ) ∼ Uniform(D) if they have the joint density
fXY (x, y) = 1/|D| for (x, y) ∈ D, and 0 otherwise
5. Marginal density: Suppose (X, Y ) have joint density fXY (x, y). Then,
• X has the marginal density fX (x) = ∫_{y=−∞}^{∞} fXY (x, y) dy.
• Y has the marginal density fY (y) = ∫_{x=−∞}^{∞} fXY (x, y) dx.
• Independence: X and Y are independent if fXY (x, y) = fX (x)fY (y) for all (x, y).
– If independent, the marginals determine the joint density.
7. Conditional density: Let (X, Y ) be random variables with joint density fXY (x, y).
Let fX (x) and fY (y) be the marginal densities.
• For a such that fX (a) > 0, the conditional density of Y given X = a, denoted fY |X=a (y), is defined as
fY |X=a (y) = fXY (a, y)/fX (a)
• For b such that fY (b) > 0, the conditional density of X given Y = b, denoted fX|Y =b (x), is defined as
fX|Y =b (x) = fXY (x, b)/fY (b)
Statistics for Data Science - 2
Important results
Continuous random variables:
• Test for mean
Case (1): When population variance σ² is known (z-test)

Test | H0 | HA | Test statistic | Reject H0 if
right-tailed | µ = µ0 | µ > µ0 | T = X, Z = (X − µ0 )/(σ/√n) | X > c
left-tailed | µ = µ0 | µ < µ0 | T = X, Z = (X − µ0 )/(σ/√n) | X < c
two-tailed | µ = µ0 | µ ≠ µ0 | T = X, Z = (X − µ0 )/(σ/√n) | |X − µ0 | > c

Case (2): When population variance σ² is unknown (t-test)

Test | H0 | HA | Test statistic | Reject H0 if
right-tailed | µ = µ0 | µ > µ0 | T = X, tn−1 = (X − µ0 )/(S/√n) | X > c
left-tailed | µ = µ0 | µ < µ0 | T = X, tn−1 = (X − µ0 )/(S/√n) | X < c
two-tailed | µ = µ0 | µ ≠ µ0 | T = X, tn−1 = (X − µ0 )/(S/√n) | |X − µ0 | > c
• χ²-test for variance:

Test | H0 | HA | Test statistic | Reject H0 if
right-tailed | σ = σ0 | σ > σ0 | T = (n − 1)S²/σ0² ∼ χ²n−1 | S² > c²
left-tailed | σ = σ0 | σ < σ0 | T = (n − 1)S²/σ0² ∼ χ²n−1 | S² < c²
two-tailed | σ = σ0 | σ ≠ σ0 | T = (n − 1)S²/σ0² ∼ χ²n−1 | S² > c2² where α/2 = P (S² > c2²), or S² < c1² where α/2 = P (S² < c1²)
• Two-sample z-test for means (known variances):

Test | H0 | HA | Test statistic | Reject H0 if
right-tailed | µ1 = µ2 | µ1 > µ2 | T = X − Y ∼ Normal(0, σ1²/n1 + σ2²/n2 ) if H0 is true | X − Y > c
left-tailed | µ1 = µ2 | µ1 < µ2 | T = Y − X ∼ Normal(0, σ2²/n2 + σ1²/n1 ) if H0 is true | Y − X > c
two-tailed | µ1 = µ2 | µ1 ≠ µ2 | T = X − Y ∼ Normal(0, σ1²/n1 + σ2²/n2 ) if H0 is true | |X − Y | > c
• Two-sample F -test for variances:

Test | H0 | HA | Test statistic | Reject H0 if
one-tailed | σ1 = σ2 | σ1 > σ2 | T = S1²/S2² ∼ F(n1 −1,n2 −1) | S1²/S2² > 1 + c
one-tailed | σ1 = σ2 | σ1 < σ2 | T = S1²/S2² ∼ F(n1 −1,n2 −1) | S1²/S2² < 1 − c
two-tailed | σ1 = σ2 | σ1 ≠ σ2 | T = S1²/S2² ∼ F(n1 −1,n2 −1) | S1²/S2² > 1 + cR where α/2 = P (T > 1 + cR ), or S1²/S2² < 1 − cL where α/2 = P (T < 1 − cL )
• χ²-test for goodness of fit:
H0 : Samples are i.i.d. X, HA : Samples are not i.i.d. X
Statistics for Data Science - 2
Week 7 Notes
Statistics from samples and Limit theorems
1. Empirical distribution:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. Let #(Xi = t) denote the number of times
t occurs in the samples. The empirical distribution is the discrete distribution with
PMF
p(t) = #(Xi = t)/n
• The empirical distribution is random because it depends on the actual sample
instances.
• Descriptive statistics: Properties of empirical distribution. Examples :
– Mean of the distribution
– Variance of the distribution
– Probability of an event
• As the number of samples increases, the properties of the empirical distribution should become close to those of the original distribution.
2. Sample mean:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. The sample mean, denoted X, is defined to
be the random variable
X = (X1 + X2 + . . . + Xn )/n
• Given a sampling x1 , . . . , xn , the value taken by the sample mean X is x = (x1 + x2 + . . . + xn )/n. Often, X and x are both called the sample mean.
• E[X] = µ, Var(X) = σ²/n
• Expected value of sample mean equals the expected value or mean of the distribution.
• Variance of sample mean decreases with n.
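A short simulation sketch of Var(X) = σ²/n, using assumed die throws (σ² = 35/12 for a fair die):

```python
import random, statistics

random.seed(0)
n, reps = 25, 20_000  # assumed sample size and number of repetitions

means = [statistics.mean(random.randint(1, 6) for _ in range(n)) for _ in range(reps)]
sigma2 = 35 / 12  # variance of a single fair die throw
print(statistics.variance(means), sigma2 / n)  # the two values should be close
```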
4. Sample variance:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. The sample variance, denoted S 2 , is defined
to be the random variable
S² = [(X1 − X)² + (X2 − X)² + . . . + (Xn − X)²]/(n − 1)
6. Sample proportion:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples and let A be an event. The sample proportion of A, denoted S(A), is defined as S(A) = #(Xi ∈ A)/n, the fraction of samples in which A occurs.
8. Chernoff inequality:
Let X be a random variable such that E[X] = 0, then
P (X > t) ≤ E[e^(λX)]/e^(λt) , λ > 0
9. Moment generating function (MGF):
Let X be a zero-mean random variable (E[X] = 0). The MGF of X, denoted MX (λ),
is a function from R to R defined as
MX (λ) = E[e^(λX)]
Expanding the exponential,
MX (λ) = E[1 + λX + λ²X²/2! + λ³X³/3! + . . .]
= 1 + λ E[X] + (λ²/2!) E[X²] + (λ³/3!) E[X³] + . . .
That is, the coefficient of λ^k/k! in the MGF of X gives the kth moment of X.
• If X ∼ Normal(0, σ²), then MX (λ) = e^(λ²σ²/2)
12. Beta distribution:
X ∼ Beta(α, β) if PDF fX (x) ∝ x^(α−1) (1 − x)^(β−1) , 0 < x < 1
Cauchy distribution: X ∼ Cauchy(θ, α) has PDF fX (x) ∝ 1/(α² + (x − θ)²).
• θ is a location parameter.
• α > 0 is a scale parameter.
• Mean and variance are undefined.
• Gamma(n/2, 1/2) is called the chi-square distribution with n degrees of freedom, denoted χ²n .
Statistics for Data Science - 2
Week 8 notes
Note:
1. θ is an unknown parameter.
2. θ̂ is a function of n random variables.
• Risk: The (squared-error) risk of the estimator θ̂ for a parameter θ, denoted Risk(θ̂, θ),
is defined as
Risk(θ̂, θ) = E[(θ̂ − θ)²]
1. Risk is the expected value of the “squared error” and is often called the mean squared error (MSE).
2. Squared-error risk is the second moment of the error (θ̂ − θ).
• Variance of estimator:
Variance(θ̂) = E[(θ̂ − E[θ̂])²]
Var(Error) = Var(θ̂)
• Bias-Variance tradeoff: The risk of the estimator satisfies the following relationship:
Risk(θ̂, θ) = Bias(θ̂, θ)² + Variance(θ̂), where Bias(θ̂, θ) = E[θ̂] − θ.
1. Method of moments
(a) Sample moments: Mk (X1 , . . . , Xn ) = (1/n) ∑_{i=1}^{n} Xi^k
(b) Mk is a random variable, and mk is the value taken by it in one sampling
instance. We expect that Mk will take values around E[X k ]
(c) Procedure:
– Equate sample moments to expression for moments in terms of unknown
parameters.
– Solve for the unknown parameters.
(d) One parameter θ usually needs one moment
– Sample moment: m1
– Distribution moment: E[X] = f (θ)
– Solve for θ from f (θ) = m1 in terms of m1 .
– θ̂: replace m1 by M1 in above solution.
(e) Two parameters θ1 , θ2 usually need two moments.
– Sample moments: m1 , m2
– Distribution moment: E[X] = f (θ1 , θ2 ), E[X 2 ] = g(θ1 , θ2 )
– Solve for θ1 , θ2 from f (θ1 , θ2 ) = m1 , g(θ1 , θ2 ) = m2 in terms of m1 , m2 .
– θ̂: replace m1 by M1 and m2 by M2 in above solution.
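A minimal sketch of the two-parameter procedure for Uniform[a, b], where E[X] = (a + b)/2 and Var(X) = (b − a)²/12; the data are simulated with assumed true values a = 2, b = 5.

```python
import random, statistics

random.seed(0)
data = [random.uniform(2, 5) for _ in range(10_000)]  # assumed true a=2, b=5

m1 = statistics.mean(data)                 # first sample moment
m2 = statistics.mean(x * x for x in data)  # second sample moment
var = m2 - m1**2                           # Var = E[X^2] - E[X]^2

half_width = (3 * var) ** 0.5              # from Var = (b - a)^2 / 12
a_hat, b_hat = m1 - half_width, m1 + half_width
print(a_hat, b_hat)                        # should be close to 2 and 5
```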
2. Maximum Likelihood estimators
(a) Likelihood of i.i.d. samples: Likelihood of a sampling x1 , x2 , . . . , xn , denoted
L(x1 , x2 , . . . , xn )
L(x1 , x2 , . . . , xn ) = ∏_{i=1}^{n} fX (xi ; θ1 , θ2 , . . .)
– Maximum likelihood (ML) estimation
θ1∗ , θ2∗ , . . . = arg max_{θ1 ,θ2 ,...} ∏_{i=1}^{n} fX (xi ; θ1 , θ2 , . . .)
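A minimal sketch of ML estimation for Bernoulli(p): the closed-form answer is just the sample mean, and the grid search below illustrates the arg-max definition (grid resolution and sample size are assumptions).

```python
import math, random

random.seed(0)
p_true = 0.3
xs = [1 if random.random() < p_true else 0 for _ in range(1000)]

def log_likelihood(p: float) -> float:
    # log of prod_i f_X(x_i; p) for the Bernoulli PMF
    return sum(math.log(p if x == 1 else 1 - p) for x in xs)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat, sum(xs) / len(xs))  # grid arg-max vs. closed-form sample mean
```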
• Properties of estimators:
• Confidence interval:
X1 , . . . , Xn ∼ i.i.d. X, µ = E[X]
Estimator: µ̂ = (X1 + . . . + Xn )/n
– Suppose P (|µ̂ − µ| < α) = β, where α is a small fraction and β is a large fraction.
– µ̂ in one sampling instance: estimate with margin of error (100α)% at confidence level (100β)%.
1. If samples are normal with σ² known: µ̂ ∼ Normal(µ, σ²/n)
P (|µ̂ − µ| < α) = β
⇒ P (|µ̂ − µ|/(σ/√n) < α/(σ/√n)) = β
⇒ P (|Normal(0, 1)| < α/(σ/√n)) = β
2. If samples are normal with σ² unknown: µ̂ ∼ Normal(µ, σ²/n) and T = (µ̂ − µ)/(S/√n) ∼ tn−1
P (|µ̂ − µ| < α) = β
⇒ P (|µ̂ − µ|/(S/√n) < α/(S/√n)) = β
⇒ P (|tn−1 | < α/(S/√n)) = β
3. If samples are not normal: Use CLT to argue that sample mean will have a normal
distribution
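A minimal sketch of a 95% confidence interval for the mean via the CLT; 1.96 is the value with P (|Normal(0, 1)| < 1.96) ≈ 0.95, and the sample standard deviation S stands in for the unknown σ (data are assumed die throws).

```python
import random, statistics

random.seed(0)
xs = [random.randint(1, 6) for _ in range(400)]  # assumed data; true mean 3.5
n = len(xs)

mu_hat = statistics.mean(xs)
s = statistics.stdev(xs)    # S in place of the unknown sigma
margin = 1.96 * s / n**0.5  # alpha such that alpha/(s/sqrt(n)) = 1.96
print(mu_hat - margin, mu_hat + margin)  # should usually contain 3.5
```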
Statistics for Data Science - 2
Week 9 Notes
Prior: M ∼ Normal(µ0 , σ0²)
⇒ fM (µ) = (1/(√(2π) σ0)) exp(−(µ − µ0 )²/(2σ0²))
Samples: x1 , . . . , xn , Sample mean: x = (x1 + . . . + xn )/n
Posterior: M | (X1 = x1 , . . . , Xn = xn )
Posterior density ∝ f (X1 = x1 , . . . , Xn = xn | M = µ) × fM (µ)
Posterior density ∝ exp(−((x1 − µ)² + . . . + (xn − µ)²)/(2σ²)) exp(−(µ − µ0 )²/(2σ0²))
⇒ Posterior density: Normal
Posterior mean: µ̂ = [(X1 + X2 + . . . + Xn )/n] · [nσ0²/(nσ0² + σ²)] + µ0 · [σ²/(nσ0² + σ²)]
For Poisson(λ) samples with a Gamma(α, β) prior on λ:
Posterior mean: λ̂ = (X1 + X2 + . . . + Xn + α)/(n + β)
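A minimal sketch of the normal-prior posterior mean formula above, with all numbers assumed for illustration:

```python
import random, statistics

random.seed(0)
mu0, s0_sq = 0.0, 4.0  # assumed prior mean and prior variance sigma_0^2
sigma_sq = 1.0         # assumed known sample variance sigma^2
xs = [random.gauss(2.0, sigma_sq**0.5) for _ in range(50)]  # assumed true mean 2.0

n, x_bar = len(xs), statistics.mean(xs)
w = n * s0_sq / (n * s0_sq + sigma_sq)  # weight on the sample mean
mu_hat = w * x_bar + (1 - w) * mu0      # posterior mean
print(mu_hat)  # pulled from x_bar slightly toward the prior mean mu0
```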
Statistics for Data Science - 2
Week 10 Notes
Hypothesis testing
1. Null hypothesis:
The null hypothesis is a statement about the population parameter whose validity is tested using the experimental data. It is denoted by H0 . The null hypothesis is the default hypothesis, assumed to hold unless the data provide evidence against it.
2. Alternative hypothesis:
The alternative hypothesis is a statement used in statistical inference that contradicts the null hypothesis. It is denoted by HA or H1 .
3. Test statistic:
A test statistic is a numerical quantity computed from the values in a sample, used in statistical hypothesis testing.
4. Type I error:
A type I error is a kind of fault that occurs during the hypothesis testing process when
a null hypothesis is rejected, even though it is true.
5. Type II error:
A type II error is a kind of fault that occurs during the hypothesis testing process when
a null hypothesis is accepted, even though it is not true (HA is true).
6. Significance level (Size):
Significance level (also called size) of a test, denoted α, is the probability of type I
error.
α = P (Type I error)
7. β = P (Type II error)
8. Power of a test:
Power = 1 − β
9. Types of hypothesis:
(a) Simple hypothesis: A hypothesis that completely specifies the distribution of
the samples is called a simple hypothesis.
(b) Composite hypothesis: A hypothesis that does not completely specify the
distribution of the samples is called a composite hypothesis.
10. Standard testing method: z-test:
Consider a sample X1 , X2 , . . . , Xn ∼ i.i.d. X.
• Test statistic, denoted T , is some function of the samples. For example: sample
mean X
• Acceptance and rejection regions are specified through T .
Note: In the test for mean (σ² known), T = X and, when the null is true, (X − µ0 )/(σ/√n) ∼ Normal(0, 1).
11. P -value:
Suppose the test statistic T = t in one sampling. The lowest significance level α at
which the null will be rejected for T = t is said to be the P -value of the sampling.
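A minimal sketch of a two-tailed z-test P-value with known σ; the data and parameters are assumptions (die throws, so H0: µ = 3.5 is actually true and the P-value should usually be large).

```python
import math, random, statistics

def phi(z: float) -> float:
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(0)
mu0, sigma = 3.5, 1.71  # assumed null mean and known sigma
xs = [random.randint(1, 6) for _ in range(100)]

z = (statistics.mean(xs) - mu0) / (sigma / math.sqrt(len(xs)))
p_value = 2 * (1 - phi(abs(z)))  # lowest alpha at which H0 would be rejected
print(z, p_value)
```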
Statistics for Data Science - 2
Week 11 Notes
t-test, χ²-test, two-sample z/F -test
1. Sampling distributions: Let X1 , . . . , Xn ∼ i.i.d. Normal(µ, σ²). Then:
• X ∼ Normal(µ, σ²/n)
• (n − 1)S²/σ² ∼ χ²n−1 , the chi-squared distribution with n − 1 degrees of freedom.
• (X − µ)/(S/√n) ∼ tn−1 , the t-distribution with n − 1 degrees of freedom.
2. t-test for mean (Variance unknown)
Consider the samples X1 , . . . , Xn ∼ iid Normal(µ, σ 2 ), σ 2 unknown. Following are the
three different possibilities:
H0 : µ = µ0
HA : µ > µ0
Test Statistic: T = X
Test: Reject H0 if T > c
Given H0 , (X − µ0 )/(S/√n) ∼ tn−1
α = P (reject H0 | H0 is true)
= P (T > c | µ = µ0 )
= P (tn−1 > (c − µ0 )/(s/√n)) = 1 − Ftn−1 ((c − µ0 )/(s/√n))
⇒ c = (s/√n) · Ftn−1⁻¹(1 − α) + µ0
Note: Ftn−1 is the CDF of t-distribution with n − 1 degrees of freedom.
• The null and alternative hypotheses are:
H0 : µ = µ0
HA : µ < µ0
Test Statistic: T = X
Test: Reject H0 if T < c
Given H0 , (X − µ0 )/(S/√n) ∼ tn−1
α = P (reject H0 | H0 is true)
= P (T < c | µ = µ0 )
= P (tn−1 < (c − µ0 )/(s/√n)) = Ftn−1 ((c − µ0 )/(s/√n))
⇒ c = (s/√n) · Ftn−1⁻¹(α) + µ0
• The null and alternative hypotheses are:
H0 : µ = µ0
HA : µ ≠ µ0
Test Statistic: T = X − µ0
Test: Reject H0 if |X − µ0 | > c
Given H0 , (X − µ0 )/(S/√n) ∼ tn−1
α = P (reject H0 | H0 is true)
= P (|X − µ0 | > c | µ = µ0 )
= P (|tn−1 | > c/(s/√n)) = 2 Ftn−1 (−c/(s/√n))
⇒ c = −(s/√n) · Ftn−1⁻¹(α/2)
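A minimal sketch of the right-tailed rejection threshold c = (s/√n) · Ftn−1⁻¹(1 − α) + µ0 from the derivation above, assuming scipy is available for the t quantile; the data are made-up normal samples.

```python
import random, statistics
from scipy import stats  # assumed available for the t-distribution quantile

random.seed(0)
mu0, alpha = 10.0, 0.05
xs = [random.gauss(10.5, 2.0) for _ in range(30)]  # assumed data

n = len(xs)
x_bar, s = statistics.mean(xs), statistics.stdev(xs)
c = s / n**0.5 * stats.t.ppf(1 - alpha, df=n - 1) + mu0
print(x_bar, c, x_bar > c)  # reject H0: mu = mu0 if the sample mean exceeds c
```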
3. χ²-test for variance
• The null and alternative hypotheses are:
H0 : σ = σ0
HA : σ > σ0
Test Statistic: S²
Test: Reject H0 if S² > c²
Given H0 , (n − 1)S²/σ0² ∼ χ²n−1
α = P (reject H0 | H0 is true)
= P (S² > c² | σ = σ0 )
= P (χ²n−1 > (n − 1)c²/σ0²) = 1 − Fχ²n−1 ((n − 1)c²/σ0²)
Note: Fχ²n−1 is the CDF of the chi-squared distribution with n − 1 degrees of freedom.
• The null and alternative hypotheses are:
H0 : σ = σ0
HA : σ < σ0
Test Statistic: S²
Test: Reject H0 if S² < c²
Given H0 , (n − 1)S²/σ0² ∼ χ²n−1
α = P (reject H0 | H0 is true)
= P (S² < c² | σ = σ0 )
= P (χ²n−1 < (n − 1)c²/σ0²) = Fχ²n−1 ((n − 1)c²/σ0²)
Note: Fχ²n−1 is the CDF of the chi-squared distribution with n − 1 degrees of freedom.
• The null and alternative hypotheses are:
H0 : σ = σ0
HA : σ ≠ σ0
Test Statistic: S²
Test: Reject H0 if S² < c1² or S² > c2²
Given H0 , (n − 1)S²/σ0² ∼ χ²n−1
α/2 = P (S² < c1² | H0 ) = P (S² > c2² | H0 )
4. Two-sample z-test (known variances)
• The null and alternative hypotheses are:
H0 : µ1 = µ2
HA : µ1 ≠ µ2
Test Statistic: T = X − Y
Test: Reject H0 if |T | > c
Given H0 , T ∼ Normal(0, σT²), where σT² = σ1²/n1 + σ2²/n2
α = P (reject H0 | H0 is true)
= P (|T | > c | µ1 = µ2 )
= 2 FZ (−c/σT )
H0 : µ1 = µ2
HA : µ1 > µ2
Test Statistic: T = X − Y
Test: Reject H0 if X − Y > c
Given H0 , T ∼ Normal(0, σT²), where σT² = σ1²/n1 + σ2²/n2
α = P (reject H0 | H0 is true)
= P (X − Y > c | µ1 = µ2 )
= 1 − FZ (c/σT )
H0 : µ1 = µ2
HA : µ1 < µ2
Test Statistic: T = Y − X
Test: Reject H0 if Y − X > c
Given H0 , T ∼ Normal(0, σT²), where σT² = σ1²/n1 + σ2²/n2
α = P (reject H0 | H0 is true)
= P (Y − X > c | µ1 = µ2 )
= 1 − FZ (c/σT )
5. Two-sample F -test for variances
6. Likelihood Ratio test:
For simple null and alternative hypotheses, the likelihood ratio test suffices.
X1 , . . . , Xn ∼ P
H0 : P = fX
HA : P = gX
Likelihood ratio: L(X1 , . . . , Xn ) = ∏_{i=1}^{n} gX (Xi ) / ∏_{i=1}^{n} fX (Xi )
Likelihood ratio test: Reject H0 if T = L(X1 , . . . , Xn ) > c
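A minimal sketch of a likelihood ratio test between two simple hypotheses, H0: X ∼ Normal(0, 1) vs HA: X ∼ Normal(1, 1); the threshold c is an assumed value for illustration (in practice it is set from the level α).

```python
import math, random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(50)]  # data generated under H0

def normal_pdf(x: float, mu: float) -> float:
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

# log L = sum_i [log g_X(x_i) - log f_X(x_i)]
log_ratio = sum(math.log(normal_pdf(x, 1.0)) - math.log(normal_pdf(x, 0.0)) for x in xs)
c = 1.0  # assumed threshold
print(log_ratio, log_ratio > math.log(c))  # reject H0 if L > c
```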