WEEK - 01 to 12 NOTES

Statistics for Data Science - 2

Week 1 Important formulas


Basic Probability

1. Experiment: Process or phenomenon that we wish to study statistically.


Example: Tossing a fair coin.

2. Outcome: Result of the experiment.


Example: head is an outcome on tossing a fair coin.

3. Sample space: A sample space is a set that contains all outcomes of an experiment.
• The sample space of an experiment is a set, typically denoted S.
• Example: Toss a coin: S = { heads, tails }

4. Event: An event is a subset of the sample space.

• Toss a coin: S = { heads, tails }


– Events: empty set, {heads}, {tails}, { heads, tails }
– 4 events
• An event is said to have “occurred” if the actual outcome of the experiment
belongs to the event.
• One event can be contained in another, i.e. A ⊆ B
• Complement of an event A, denoted A^c = { outcomes in S not in A } = (S \ A).
• Since events are subsets, one can do complements, unions, intersections.

5. Disjoint events: Two events with an empty intersection are said to be disjoint events.

• Throw a die: even number, odd number are disjoint.


• Multiple events: E1 , E2 , E3 , ... are disjoint if, for any i ≠ j, Ei ∩ Ej = empty set.

6. De Morgan’s laws: For any two events A and B,


(A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c.

7. Probability: “Probability” is a function P that assigns to each event a real number


between 0 and 1 and satisfies the following two axioms:

(i) P (S) = 1 (probability of the entire sample space equals 1).


(ii) If E1 , E2 , E3 , ... are disjoint events ( Could be infinitely many),

P (E1 ∪ E2 ∪ E3 ∪ ...) = P (E1 ) + P (E2 ) + P (E3 ) + ...

• Probability function: assigns a value that represents the chance of occurrence of the event.

• A higher probability value for an event means a higher chance of that event occurring.
• 0 means event cannot occur and 1 means event always occurs.

8. Probability of the empty set (denoted φ) equals 0, that is,

P (φ) = 0

9. Let E^c be the complement of event E. Then,

P (E^c) = 1 − P (E)

10. If event E is the subset of event F , that is E ⊆ F , then

P (F ) = P (E) + P (F \ E)

⇒ P (E) ≤ P (F )

11. If E and F are events, then

P (E) = P (E ∩ F ) + P (E \ F )

P (F ) = P (E ∩ F ) + P (F \ E)

12. If E and F are events, then

P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F )

13. Equally likely events: assign the same probability to each outcome.

14. If sample space S contains equally likely outcomes, then

• P (one outcome) = 1 / (Number of outcomes in S)
• P (event) = (Number of outcomes in event) / (Number of outcomes in S)

15. Conditional probability space: Consider a probability space (S, E, P ), where S
represents the sample space, E represents the collection of events, and P represents
the probability function.

• Let B be an event in S with P (B) > 0. The conditional probability space given
B is defined as follows: for any event A in the original probability space (S, E, P ),
the conditional probability of A given B is P (A ∩ B)/P (B).
• It is denoted by P (A | B). And

P (A ∩ B) = P (B)P (A | B)

16. Law of total probability:

• If the events B and B^c partition the sample space S such that P (B), P (B^c) ≠ 0,
then for any event A of S,

P (A) = P (A | B)P (B) + P (A | B^c)P (B^c).

• In general, if we have k events B1 , B2 , · · · , Bk that partition S, then for any event A in S,

P (A) = Σ_{i=1}^{k} P (Bi ∩ A) = Σ_{i=1}^{k} P (A | Bi )P (Bi ).

17. Bayes’ theorem: Let A and B be two events such that P (A) > 0, P (B) > 0.

P (A ∩ B) = P (B)P (A | B) = P (A)P (B | A)

⇒ P (B | A) = P (B)P (A | B) / P (A)

In general, if the events B1 , B2 , · · · , Bk partition S such that P (Bi ) ≠ 0 for i = 1, 2, · · · , k, then for any event A in S such that P (A) ≠ 0,

P (Br | A) = P (Br )P (A | Br ) / Σ_{i=1}^{k} P (Bi )P (A | Bi )

for r = 1, 2, · · · , k.

18. Independence of two events: Two events A and B are independent iff

P (A ∩ B) = P (A)P (B)

• A and B independent ⇒ P (A | B) = P (A) and P (B | A) = P (B) for P (A), P (B) > 0.

• Disjoint events are never independent.
• A and B independent ⇒ A and B^c are independent.
• A and B independent ⇒ A^c and B^c are independent.
19. Mutual independence of three events: Events A, B, and C are mutually indepen-
dent if
(a) P (A ∩ B) = P (A)P (B)
(b) P (A ∩ C) = P (A)P (C)
(c) P (B ∩ C) = P (B)P (C)
(d) P (A ∩ B ∩ C) = P (A)P (B)P (C)
20. Mutual independence of multiple events: Events A1 , A2 , · · · , An are mutually independent if, ∀i1 , i2 , · · · , ik ,
P (Ai1 ∩ Ai2 ∩ · · · ∩ Aik ) = P (Ai1 )P (Ai2 ) · · · P (Aik )
n events are mutually independent ⇒ any subset with or without complementing are
independent as well.
21. Occurrence of event A in a sample space is considered as success.
22. Non-occurrence of event A in a sample space is considered as failure.
23. Repeated independent trials:
(a) Bernoulli trials
• Single Bernoulli trial:
– Sample space is {success, failure} with P(success) = p.
– We can also write the sample space S as {0, 1}, where 0 denotes the
failure and 1 denotes the success with P (1) = p, P (0) = 1 − p.
This kind of distribution is denoted by Bernoulli(p).
• Repeated Bernoulli trials:
– Repeat a Bernoulli trial multiple times independently.
– For each of the trial, the outcome will be either 0 or 1.
(b) Binomial distribution: Perform n independent Bernoulli(p) trials.
• It models the number of successes in n independent Bernoulli trials.
• Denoted by B(n, p).
• Sample space is {0, 1, · · · , n}.
• Probability distribution is given by
P (B(n, p) = k) = nCk p^k (1 − p)^(n−k),
where n represents the total number of trials and k represents the number of successes in n trials.

• P (B = 0) + P (B = 1) + · · · + P (B = n) = 1
⇒ (1 − p)^n + nC1 p (1 − p)^(n−1) + · · · + p^n = 1.
(c) Geometric distribution: It models the number of trials needed to get the first success.
• Outcomes: Number of trials needed for first success and is denoted by G(p).
• Sample space: {1, 2, 3, 4, · · · }
• P (G = k) = P (first k − 1 trials result in 0 and the kth trial results in 1) = (1 − p)^(k−1) p.
• Identity: P (G ≤ k) = 1 − (1 − p)^k.
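
A quick numerical check of these formulas (a sketch using scipy.stats; the values n = 10, p = 0.3, k = 4 are illustrative, not from the notes):

    # Sketch: verifying the Binomial and Geometric formulas with scipy.stats.
    # The numbers n=10, p=0.3, k=4 are illustrative choices.
    from math import comb
    from scipy.stats import binom, geom

    n, p, k = 10, 0.3, 4

    # Binomial PMF: P(B(n, p) = k) = nCk * p^k * (1-p)^(n-k)
    manual_binom = comb(n, k) * p**k * (1 - p)**(n - k)
    print(manual_binom, binom.pmf(k, n, p))        # should match

    # Geometric PMF and the identity P(G <= k) = 1 - (1-p)^k
    manual_geom_pmf = (1 - p)**(k - 1) * p
    print(manual_geom_pmf, geom.pmf(k, p))         # should match
    print(1 - (1 - p)**k, geom.cdf(k, p))          # should match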

Statistics for Data Science - 2

Week 2 Important formulas

1. Random variable: A random variable is a function with domain as the sample space
of an experiment and range as the real numbers, i.e. a function from the sample space
to the real line.

• Toss a coin, Sample space = {H, T }


– Random variable X : X(H) = 0, X(T ) = 1

2. Random variables and events: If X is a random variable,


(X < x) = {s ∈ S : X(s) < x} is an event for all real x.
So, (X > x), (X = x), (X ≤ x), (X ≥ x) are all events.

• Throw a die, Sample space = {1, 2, 3, 4, 5, 6}


– E =0: event {1, 3, 5}
– E =1: event {2, 4, 6}
– E <0: null event
– E ≤1: event {1, 2, 3, 4, 5, 6}

3. Range of a random variable: The range of a random variable is the set of values
taken by it. Range is a subset of the real line.

• Throw a die, E = 0 if number is odd, E = 1 if number is even


– Range = {0, 1}

4. Discrete random variable: A random variable is said to be discrete if its range is a


discrete set.

5. Probability Mass Function (PMF): The probability mass function (PMF) of a dis-
crete random variable (r.v.) X with range set T is the function fX : T → [0, 1] defined
as
fX (t) = P (X = t) for t ∈ T .

6. Properties of PMF:

• 0 ≤ fX (t) ≤ 1

• Σ_{t∈T} fX (t) = 1

7. Uniform random variable: X ∼ Uniform(T ), where T is some finite set.

• Range: Finite set T
• PMF: fX (t) = 1/|T | for all t ∈ T

8. Bernoulli random variable: X ∼ Bernoulli(p), where 0 ≤ p ≤ 1.


• Range: {0, 1}
• PMF: fX (0) = 1 − p, fX (1) = p
9. Binomial random variable: X ∼ Binomial(n, p), where n: positive integer, 0 ≤ p ≤ 1.
• Range: {0, 1, 2, . . . , n}
• PMF: fX (k) = nCk p^k (1 − p)^(n−k)
10. Geometric random variable: X ∼ Geometric(p), where 0 < p ≤ 1.
• Range: {1, 2, 3, . . .}
• PMF: fX (k) = (1 − p)^(k−1) p
11. Negative Binomial random variable: X ∼ Negative Binomial(r, p), where r: posi-
tive integer, 0 < p ≤ 1.
• Range: {r, r + 1, r + 2, . . . .}
• PMF: fX (k) = (k−1)C(r−1) (1 − p)^(k−r) p^r
12. Poisson random variable: X ∼ Poisson(λ), where λ > 0.
• Range: {0, 1, 2, 3, . . . .}
• PMF: fX (k) = e^(−λ) λ^k / k!
13. Hypergeometric random variable: X ∼ HyperGeo(N, r, m), where N, r, m: positive
integers
• Range: {max(0, m − (N − r)), . . . , min(r, m)}
• PMF: fX (k) = rCk · (N−r)C(m−k) / NCm

14. Functions of a random variable: X : random variable with PMF fX (t).


f (X) : random variable whose PMF is given as follows.

f_f(X) (a) = P (f (X) = a) = P (X ∈ {t : f (t) = a}) = Σ_{t: f (t)=a} fX (t)

• PMF of f (X) can be found using PMF of X.
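
As a small illustration of this rule (a sketch with an assumed PMF, not one from the notes), the PMF of f(X) = X^2 is built by summing fX over all t that map to the same value:

    # Sketch: PMF of f(X) from the PMF of X, grouping values t with the same f(t).
    # The PMF below (X uniform on {-1, 0, 1, 2}) is an assumed example.
    from collections import defaultdict

    f_X = {-1: 0.25, 0: 0.25, 1: 0.25, 2: 0.25}   # assumed PMF of X

    def pmf_of_function(f_X, f):
        """Return the PMF of f(X): sum f_X(t) over all t with f(t) = a."""
        f_Y = defaultdict(float)
        for t, prob in f_X.items():
            f_Y[f(t)] += prob
        return dict(f_Y)

    print(pmf_of_function(f_X, lambda t: t**2))
    # {1: 0.5, 0: 0.25, 4: 0.25}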

Statistics for Data Science - 2
Week 3 Notes
Multiple Random Variables

1. Joint probability mass function: Suppose X and Y are discrete random variables
defined in the same probability space. Let the range of X and Y be TX and TY ,
respectively. The joint PMF of X and Y , denoted fXY , is a function from TX × TY
to [0, 1] defined as

fXY (t1 , t2 ) = P (X = t1 and Y = t2 ), t1 ∈ TX , t2 ∈ TY

• Joint PMF is usually written as table or a matrix.


• P (X = t1 and Y = t2 ) is denoted P (X = t1 , Y = t2 )

2. Marginal PMF: Suppose X and Y are jointly distributed discrete random variables
with joint PMF fXY . The PMF of the individual random variables X and Y are called
as marginal PMFs. It can be shown that
fX (t1 ) = P (X = t1 ) = Σ_{t2 ∈TY} fXY (t1 , t2 )

fY (t2 ) = P (Y = t2 ) = Σ_{t1 ∈TX} fXY (t1 , t2 )

Note: Given the joint PMF, the marginal is unique.

3. Conditional distribution given an event: Suppose X is a discrete random variable


with range TX , and A is an event in the same probability space. The conditional PMF
of X given A is defined as the PMF

fX|A (t) = P (X = t|A)

where t ∈ TX
We will denote the conditional random variable by X|A. (Note that X|A is a valid
random variable with PMF fX|A ).

• fX|A (t) = P ((X = t) ∩ A) / P (A)
• Range of (X|A) can be different from TX and will depend on A.

4. Conditional distribution of one random variable given another:
Suppose X and Y are jointly distributed discrete random variables with joint PMF
fXY . The conditional PMF of Y given X = t is defined as the PMF

fY |X=x (y) = P (X = x, Y = y) / P (X = x) = fXY (x, y) / fX (x)

We will denote the conditional random variable by Y |(X = x). (Note that Y |(X = x)
is a valid random variable with PMF fY |(X=x) .

• Range of (Y |X = t) can be different from TY and will depend on t.


• fXY (x, y) = fY |X=x (y) · fX (x) = fX|Y =y (x) · fY (y)

• Σ_{y∈TY} fY |X=x (y) = 1

5. Joint PMF of more than two discrete random variables:


Suppose X1 , X2 , . . . , Xn are discrete random variables defined in the same probability
space. Let the range of Xi be TXi . The joint PMF of Xi , denoted by fX1 X2 ...Xn , is a
function from TX1 × TX2 × . . . × TXn to [0, 1] defined as

fX1 X2 ...Xn (t1 , t2 , . . . , tn ) = P (X1 = t1 , X2 = t2 , . . . , Xn = tn ); ti ∈ TXi

6. Marginal PMF in case of more than two discrete random variables:


Suppose X1 , X2 , . . . , Xn are jointly distributed discrete random variables with joint
PMF fX1 X2 ...Xn . The PMF of the individual random variables X1 , X2 , . . . , Xn are
called as marginal PMFs. It can be shown that
fX1 (t1 ) = P (X1 = t1 ) = Σ_{t2 ∈TX2 , t3 ∈TX3 , ..., tn ∈TXn} fX1 X2 ...Xn (t1 , t2 , . . . , tn )

fX2 (t2 ) = P (X2 = t2 ) = Σ_{t1 ∈TX1 , t3 ∈TX3 , ..., tn ∈TXn} fX1 X2 ...Xn (t1 , t2 , . . . , tn )

...

fXn (tn ) = P (Xn = tn ) = Σ_{t1 ∈TX1 , t2 ∈TX2 , ..., tn−1 ∈TXn−1} fX1 X2 ...Xn (t1 , t2 , . . . , tn )

7. Marginalisation: Suppose X1 , X2 , . . . , Xn are jointly distributed discrete random


variables with joint PMF fX1 X2 ...Xn . The joint PMF of the random variables Xi1 , Xi2 , . . . Xik ,
denoted by fXi1 Xi2 ...Xik is given by
fXi1 Xi2 ...Xik (ti1 , ti2 , . . . , tik ) = Σ_{tj : j ∉ {i1 ,...,ik }} fX1 X2 ...Xn (t1 , t2 , . . . , tn )

• Sum over everything you don’t want.

8. Conditioning with multiple discrete random variables:

• A wide variety of conditioning is possible when there are many random variables.
Some examples are:
• Suppose X1 , X2 , X3 , X4 ∼ fX1 X2 X3 X4 and xi ∈ TXi , then
– fX1 |X2 =x2 (x1 ) = fX1 X2 (x1 , x2 ) / fX2 (x2 )
– fX1 ,X2 |X3 =x3 (x1 , x2 ) = fX1 X2 X3 (x1 , x2 , x3 ) / fX3 (x3 )
– fX1 |X2 =x2 ,X3 =x3 (x1 ) = fX1 X2 X3 (x1 , x2 , x3 ) / fX2 X3 (x2 , x3 )
– fX1 X4 |X2 =x2 ,X3 =x3 (x1 , x4 ) = fX1 X2 X3 X4 (x1 , x2 , x3 , x4 ) / fX2 X3 (x2 , x3 )
9. Conditioning and factors of the joint PMF:
Let X1 , X2 , X3 , X4 ∼ fX1 X2 X3 X4 , Xi ∈ TXi .
fX1 X2 X3 X4 (t1 , t2 , t3 , t4 ) =P (X1 = t1 and (X2 = t2 , X3 = t3 , X4 = t4 ))
=fX1 |X2 =t2 ,X3 =t3 ,X4 =t4 (t1 )P (X2 = t2 and (X3 = t3 , X4 = t4 ))
=fX1 |X2 =t2 ,X3 =t3 ,X4 =t4 (t1 )fX2 |X3 =t3 ,X4 =t4 (t2 )P (X3 = t3 and X4 = t4 )
=fX1 |X2 =t2 ,X3 =t3 ,X4 =t4 (t1 )fX2 |X3 =t3 ,X4 =t4 (t2 )fX3 |X4 =t4 (t3 )fX4 (t4 ).
• Factoring can be done in any sequence.
10. Independence of two random variables:
Let X and Y be two random variables defined in a probability space with ranges TX
and TY , respectively. X and Y are said to be independent if any event defined using
X alone is independent of any event defined using Y alone. Equivalently, if the joint
PMF of X and Y is fXY , X and Y are independent if
fXY (x, y) = fX (x)fY (y)
for x ∈ TX and y ∈ TY

• X and Y are independent if


fX|Y =y (x) = fX (x)
fY |X=x (y) = fY (y)
for x ∈ TX and y ∈ TY

• To show X and Y independent, verify


fXY (x, y) = fX (x)fY (y)
for all x ∈ TX and y ∈ TY

• To show X and Y dependent, verify
fXY (x, y) 6= fX (x)fY (y)
for some x ∈ TX and y ∈ TY
– Special case: fXY (t1 , t2 ) = 0 when fX (t1 ) ≠ 0, fY (t2 ) ≠ 0.
11. Independence of multiple random variables:
Let X1 , X2 , . . . , Xn be random variables defined in a probability space with range of
Xi denoted TXi . X1 , X2 , . . . , Xn are said to be independent if events defined using
different Xi are mutually independent. Equivalently, X1 , X2 , . . . , Xn are independent
iff
fX1 X2 ...Xn (t1 , t2 , . . . , tn ) = fX1 (t1 )fX2 (t2 ) . . . fXn (tn )
for all ti ∈ TXi
• All subsets of independent random variables are independent.
12. Independent and Identically Distributed (i.i.d.) random variables:
Random variables X1 , X2 , . . . , Xn are said to be independent and identically distributed
(i.i.d.), if
(i) they are independent.
(ii) the marginal PMFs fXi are identical.
Examples:
• Repeated trials of an experiment creates i.i.d. sequence of random variables
– Toss a coin multiple times.
– Throw a die multiple times.
• Let X1 , X2 , . . . , Xn ∼ i.i.d. X, where X ∼ Geometric(p).
X takes values in {1, 2, . . .} with P (X = k) = (1 − p)^(k−1) p.

Since Xi ’s are independent and identically distributed, we can write


P (X1 > j, X2 > j, . . . , Xn > j) = P (X1 > j)P (X2 > j) . . . P (Xn > j) = [P (X > j)]^n

P (X > j) = Σ_{k=j+1}^{∞} (1 − p)^(k−1) p
          = (1 − p)^j p + (1 − p)^(j+1) p + (1 − p)^(j+2) p + . . .
          = (1 − p)^j p [1 + (1 − p) + (1 − p)^2 + . . .]
          = (1 − p)^j p · 1/(1 − (1 − p))
          = (1 − p)^j

⇒ P (X1 > j, X2 > j, . . . , Xn > j) = [P (X > j)]^n = (1 − p)^(jn)

13. Function of random variables (g(X1 , X2 , . . . , Xn )):
Suppose X1 , X2 , . . . , Xn have joint PMF fX1 X2 ...Xn with TXi denoting the range of Xi .
Let g : TX1 × TX2 × . . . × TXn → R be a function with range Tg . The PMF of
X = g(X1 , X2 ..., Xn ) is given by
fX (t) = P (g(X1 , X2 , ..., Xn ) = t) = Σ_{(t1 ,...,tn ): g(t1 ,...,tn )=t} fX1 X2 ...Xn (t1 , t2 , . . . , tn )

• Sum of two random variables taking integer values:


X, Y ∼ fXY , Z = X + Y.
Let z be some integer,

P (Z = z) = P (X + Y = z)
          = Σ_{x=−∞}^{∞} P (X = x, Y = z − x)
          = Σ_{x=−∞}^{∞} fXY (x, z − x)
          = Σ_{y=−∞}^{∞} fXY (z − y, y)

• Convolution: If X and Y are independent, fX+Y (z) = Σ_{x=−∞}^{∞} fX (x)fY (z − x)

• Let X ∼ Poisson(λ1 ), Y ∼ Poisson(λ2 )


– X and Y are independent.
– Z = X + Y , z ∈ {0, 1, 2, . . .}
Z ∼ Poisson(λ1 + λ2 )
(X | Z = n) ∼ Binomial(n, λ1 /(λ1 + λ2 )), (Y | Z = n) ∼ Binomial(n, λ2 /(λ1 + λ2 ))
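
A small numerical check of these two facts (a sketch using scipy.stats; λ1 = 2, λ2 = 3 and the evaluation points are illustrative):

    # Sketch: sum of independent Poissons via convolution, checked against Poisson(lam1 + lam2),
    # plus the conditional Binomial fact. lam1, lam2, z, n, k are illustrative values.
    from scipy.stats import poisson, binom

    lam1, lam2 = 2.0, 3.0

    def pmf_sum(z):
        # f_{X+Y}(z) = sum_{x=0}^{z} f_X(x) * f_Y(z - x)  (Y >= 0, so x runs up to z)
        return sum(poisson.pmf(x, lam1) * poisson.pmf(z - x, lam2) for x in range(z + 1))

    print(pmf_sum(4), poisson.pmf(4, lam1 + lam2))          # nearly equal

    # Conditional: P(X = k | Z = n) equals the Binomial(n, lam1/(lam1+lam2)) PMF at k
    n, k = 6, 2
    cond = poisson.pmf(k, lam1) * poisson.pmf(n - k, lam2) / poisson.pmf(n, lam1 + lam2)
    print(cond, binom.pmf(k, n, lam1 / (lam1 + lam2)))      # nearly equal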
14. CDF of a random variable:
Cumulative distribution function of a random variable X is a function FX : R → [0, 1]
defined as
FX (x) = P (X ≤ x)

15. Minimum of two random variables:


Let X, Y ∼ fXY and let Z = min{X, Y }, then

fZ (z) = P (Z = z) = P (min{X, Y } = z)
       = P (X = z, Y = z) + P (X = z, Y > z) + P (X > z, Y = z)
       = fXY (z, z) + Σ_{t2 >z} fXY (z, t2 ) + Σ_{t1 >z} fXY (t1 , z)


FZ (z) = P (Z ≤ z) = P (min{X, Y } ≤ z)
= 1 − P (min{X, Y } > z)
= 1 − [P (X > z, Y > z)]

16. Maximum of two random variables:


Let X, Y ∼ fXY and let Z = max{X, Y }, then

fZ (z) = P (Z = z) = P (max{X, Y } = z)
       = P (X = z, Y = z) + P (X = z, Y < z) + P (X < z, Y = z)
       = fXY (z, z) + Σ_{t2 <z} fXY (z, t2 ) + Σ_{t1 <z} fXY (t1 , z)

FZ (z) = P (Z ≤ z) = P (max{X, Y } ≤ z)
= [P (X ≤ z, Y ≤ z)]

17. Maximum and Minimum of n i.i.d. random variables

• Let X ∼ Geometric(p), Y ∼ Geometric(q)


X and Y are independent.
Z = min(X, Y )
Z ∼ Geometric(1 − (1 − p)(1 − q))
• Maximum of 2 independent geometric random variables is not geometric.

Important Points:

1. Let N ∼ Poisson(λ) and X|N = n ∼ Binomial(n, p), then X ∼ Poisson(λp) (see the simulation sketch after this list).

2. Memory less property of Geometric(p)


If X ∼ Geometric(p), then

P (X > m + n|X > m) = P (X > n)

3. Sum of n independent Bernoulli(p) trials is Binomial(n, p).

4. Sum of 2 independent Uniform random variables is not Uniform.

5. Sum of independent Binomial(n, p) and Binomial(m, p) is Binomial(n + m, p).

6. Sum of r i.i.d. Geometric(p) is Negative-Binomial(r, p).

7. Sum of independent Negative-Binomial(r, p) and Negative-Binomial(s, p) is Negative-


Binomial(r + s, p)

8. If X and Y are independent, then g(X) and h(Y ) are also independent.
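
A simulation sketch of the first result above (the values λ = 5, p = 0.3 and the number of repetitions are assumed; this is a rough check, not a proof):

    # Sketch: simulate N ~ Poisson(lam), then X | N=n ~ Binomial(n, p),
    # and compare the empirical behaviour of X with Poisson(lam * p).
    import numpy as np

    rng = np.random.default_rng(0)
    lam, p, reps = 5.0, 0.3, 200_000

    N = rng.poisson(lam, size=reps)
    X = rng.binomial(N, p)                 # one Binomial(N_i, p) draw per N_i

    print(X.mean(), X.var())               # both should be close to lam * p = 1.5
    print((X == 0).mean(), np.exp(-lam * p))   # empirical P(X = 0) vs Poisson(lam*p) value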

Statistics for Data Science - 2
Week 4 Notes
Expected value

• Expected value of a random variable


Definition: Suppose X is a discrete random variable with range TX and PMF fX . The
expected value of X, denoted E[X], is defined as
E[X] = Σ_{t∈TX} t P (X = t)

assuming the above sum exists.


Expected value represents “center” of a random variable.

1. Consider a constant c as a random variable X with


P (X = c) = 1.
E[c] = c × 1 = c
2. If X takes only non-negative values, i.e. P (X ≥ 0) = 1. Then,

E[X] ≥ 0

• Expected value of a function of random variables


Suppose X1 . . . Xn have joint PMF fX1 ...Xn with range of Xi denoted as TXi . Let

g : TX1 × . . . × TXn → R

be a function, and let Y = g(X1 , . . . , Xn ) have range TY and PMF fY . Then,


E[g(X1 , . . . , Xn )] = Σ_{t∈TY} t fY (t) = Σ_{ti ∈TXi} g(t1 , . . . , tn ) fX1 ...Xn (t1 , . . . , tn )

• Linearity of Expected value:

1. E[cX] = cE[X] for a random variable X and a constant c.


2. E[X + Y ] = E[X] + E[Y ] for any two random variables X, Y .

• Zero mean Random variable:


A random variable X with E[X] = 0 is said to be a zero-mean random variable.

• Variance and Standard deviation:


Definition: The variance of a random variable X, denoted by Var(X), is defined as

Var(X) = E[(X − E[X])2 ]

Variance measures the spread about the expected value.
Variance of random variable X is also given by Var(X) = E[X 2 ] − E[X]2

The standard deviation of X, denoted by SD(X), is defined as


SD(X) = +√Var(X)

Units of SD(X) are same as units of X.

• Properties: Scaling and translation


Let X be a random variable. Let a be a constant real number.

1. Var(aX) = a2 Var(X)
2. SD(aX) =| a | SD(X)
3. Var(X + a) = Var(X)
4. SD(X + a) = SD(X)

• Sum and product of independent random variables

1. For any two random variables X and Y (independent or dependent), E[X + Y ] =


E[X] + E[Y ].
2. If X and Y are independent random variables,
(a) E[XY ] = E[X]E[Y ]
(b) Var(X + Y ) = Var(X) + Var(Y )

• Standardised random variables:

1. Definition: A random variable X is said to be standardised if E[X] = 0, Var(X) =


1.
2. Let X be a random variable. Then, Y = (X − E[X]) / SD(X) is a standardised random
variable.

• Covariance:
Definition: Suppose X and Y are random variables on the same probability space. The
covariance of X and Y , denoted as Cov(X, Y ), is defined as

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]

It summarizes the relationship between two random variables.


Properties:

1. Cov(X, X) = Var(X)
2. Cov(X, Y ) = E[XY ] − E[X]E[Y ]

3. Covariance is symmetric: Cov(X, Y ) = Cov(Y, X)
4. Covariance is a “linear” quantity.
(a) Cov(X, aY + bZ) = aCov(X, Y ) + bCov(X, Z)
(b) Cov(aX + bY, Z) = aCov(X, Z) + bCov(Y, Z)
5. Independence: If X and Y are independent, then X and Y are uncorrelated, i.e.
Cov(X, Y ) = 0
6. If X and Y are uncorrelated, they may still be dependent (uncorrelated does not imply independent).
• Correlation coefficient:
Definition: The correlation coefficient or correlation of two random variables X and Y
, denoted by ρ(X, Y ), is defined as
ρ(X, Y ) = Cov(X, Y ) / (SD(X) SD(Y ))
1. −1 ≤ ρ(X, Y ) ≤ 1.
2. ρ(X, Y ) summarizes the trend between random variables.
3. ρ(X, Y ) is a dimensionless quantity.
4. If ρ(X, Y ) is close to zero, there is no clear linear trend between X and Y .
5. If ρ(X, Y ) = 1 or ρ(X, Y ) = −1, Y is a linear function of X.
6. If | ρ(X, Y ) | is close to one, X and Y are strongly correlated.
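
A sketch computing Cov(X, Y) and ρ(X, Y) directly from a small joint PMF (the PMF values below are assumed for illustration):

    # Sketch: covariance and correlation from an assumed joint PMF of (X, Y).
    import math

    f_XY = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}   # assumed joint PMF

    def E(g):
        # expectation of g(X, Y) under the joint PMF
        return sum(p * g(x, y) for (x, y), p in f_XY.items())

    EX, EY = E(lambda x, y: x), E(lambda x, y: y)
    var_X = E(lambda x, y: x**2) - EX**2
    var_Y = E(lambda x, y: y**2) - EY**2
    cov = E(lambda x, y: x * y) - EX * EY                 # Cov = E[XY] - E[X]E[Y]
    rho = cov / (math.sqrt(var_X) * math.sqrt(var_Y))
    print(cov, rho)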
• Bounds on probabilities using mean and variance
1. Markov’s inequality: Let X be a discrete random variable taking non-negative
values with a finite mean µ. Then,
P (X ≥ c) ≤ µ/c
Mean µ, through Markov’s inequality: bounds the probability that a non-negative
random variable takes values much larger than the mean.
2. Chebyshev’s inequality: Let X be a discrete random variable with a finite mean
µ and a finite variance σ 2 . Then,
P (| X − µ | ≥ kσ) ≤ 1/k^2
Other forms:
(a) P (| X − µ | ≥ c) ≤ σ^2/c^2 ,  P ((X − µ)^2 > k^2 σ^2 ) ≤ 1/k^2
(b) P (µ − kσ < X < µ + kσ) ≥ 1 − 1/k^2
Mean µ and standard deviation σ, through Chebyshev’s inequality: bound the
probability that X is away from µ by kσ.
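
A sketch comparing both bounds with an exact tail probability for an assumed Binomial(20, 0.5) variable:

    # Sketch: Markov and Chebyshev bounds vs the exact tail probability
    # for X ~ Binomial(20, 0.5) (an assumed example distribution).
    import math
    from scipy.stats import binom

    n, p = 20, 0.5
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))

    c = 15
    exact = binom.sf(c - 1, n, p)            # P(X >= 15); sf(k) = P(X > k)
    markov = mu / c                          # Markov: P(X >= c) <= mu/c
    k = (c - mu) / sigma
    cheby = 1 / k**2                         # Chebyshev bound on P(|X - mu| >= k*sigma)
    print(exact, markov, cheby)              # the bounds are valid but loose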

Statistics for Data Science - 2
Week 5 Notes
Continuous Random Variables

1. Cumulative distribution function:


A function F : R → [0, 1] is said to be a Cumulative Distribution Function (CDF) if
(i) F is a non-decreasing function taking values between 0 and 1.
(ii) As x → −∞, F → 0
(iii) As x → ∞, F → 1
(iv) Technical: F is continuous from the right.

2. CDF of a random variable:


Cumulative distribution function of a random variable X is a function FX : R → [0, 1]
defined as
FX (x) = P (X ≤ x)
Properties of CDF

• FX (b) − FX (a) = P (a < X ≤ b)


• FX is a non-decreasing function of x.
• FX takes non-negative values.
• As x → −∞, FX (x) → 0
• As x → ∞, FX (x) → 1

3. Theorem: Random variable with CDF F(x)


Given a valid CDF F (x), there exists a random variable X taking values in R such
that
P (X ≤ x) = F (x)

• If F is not continuous at x and F rises from F1 to F2 at x (a jump at x), then

P (X = x) = F2 − F1

• If F is continuous at x, then
P (X = x) = 0

4. Continuous random variable:


A random variable X with CDF FX (x) is said to be a continuous random variable if
FX (x) is continuous at every x.
Properties of continuous random variables

• CDF has no jumps or steps.


• P (X = x) = 0 for all x.

• Probability of X falling in an interval can be nonzero:

P (a < X ≤ b) = F (b) − F (a)

• Since P (X = a) = 0 and P (X = b) = 0, we have

P (a ≤ X ≤ b) = P (a < X ≤ b) = P (a ≤ X < b) = P (a < X < b)

5. Probability density function (PDF):


A continuous random variable X with CDF FX (x) is said to have a PDF fX (x) if, for all x0 ,

FX (x0 ) = ∫_{−∞}^{x0} fX (x) dx

• CDF is the integral of the PDF.


• Derivative of the CDF (wherever it exists) is usually taken as the PDF.
• Value of PDF around fX (x0 ) is related to X taking a value around x0 .
• Higher the PDF, higher the chance that X lies there.

6. For a random variable X with PDF fX , an event A is a subset of the real line and its probability is computed as

P (A) = ∫_A fX (x) dx

• P (a < X < b) = FX (b) − FX (a) = ∫_a^b fX (x) dx

7. Density function:
A function f : R → R is said to be a density function if
(i) f (x) ≥ 0
(ii) ∫_{−∞}^{∞} f (x) dx = 1
(iii) f (x) is piece-wise continuous

8. Given a density function f , there is a continuous random variable X with PDF as f .

9. Support of random variable X


Support of the random variable X with PDF fX is

supp(X) = {x : fX (x) > 0}

• supp(X) contains intervals in which X can fall with positive probability.

10. Continuous Uniform distribution:
• X ∼ Uniform[a, b]
• PDF:
fX (x) = 1/(b − a) for a < x < b; 0 otherwise
• CDF:
FX (x) = 0 for x ≤ a; (x − a)/(b − a) for a < x < b; 1 for x ≥ b

11. Exponential distribution:


• X ∼ Exp(λ)
• PDF:
fX (x) = λe^(−λx) for x > 0; 0 otherwise
• CDF:
FX (x) = 0 for x ≤ 0; 1 − e^(−λx) for x > 0

12. Normal distribution:


• X ∼ Normal(µ, σ^2)
• PDF:
fX (x) = (1/(σ√(2π))) exp(−(x − µ)^2/(2σ^2)), −∞ < x < ∞
• CDF:
FX (x) = ∫_{−∞}^{x} fX (u) du
• CDF has no closed form expression.
• Standard normal: Z ∼ Normal(0, 1)
– PDF: fZ (z) = (1/√(2π)) exp(−z^2/2), −∞ < z < ∞
13. Standardization:
If X ∼ Normal(µ, σ 2 ), then
Z = (X − µ)/σ ∼ Normal(0, 1)
14. To compute the probabilities of the normal distribution, convert probability computa-
tion to that of a standard normal.
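
A sketch of this conversion for an assumed X ∼ Normal(100, 15^2), computing P(90 < X ≤ 120):

    # Sketch: P(90 < X <= 120) for X ~ Normal(100, 15^2) via standardization.
    # mu, sigma and the interval endpoints are assumed example numbers.
    from scipy.stats import norm

    mu, sigma = 100, 15
    a, b = 90, 120

    # Standardize: P(a < X <= b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
    z_a, z_b = (a - mu) / sigma, (b - mu) / sigma
    via_standard = norm.cdf(z_b) - norm.cdf(z_a)

    # Direct computation with the same distribution, for comparison
    direct = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
    print(via_standard, direct)    # identical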

15. Functions of continuous random variable:
Suppose X is a continuous random variable with CDF FX and PDF fX and suppose
g : R → R is a (reasonable) function. Then, Y = g(X) is a random variable with CDF
FY determined as follows:
• FY (y) = P (Y ≤ y) = P (g(X) ≤ y) = P (X ∈ {x : g(x) ≤ y})
• To evaluate the above probability
– Convert the subset Ay = {x : g(x) ≤ y} into intervals in real line.
– Find the probability that X falls in those intervals.
– FY (y) = P (X ∈ Ay ) = ∫_{Ay} fX (x) dx
• If FY has no jumps, you may be able to differentiate and find a PDF.
16. Theorem: Monotonic differentiable function
Suppose X is a continuous random variable with PDF fX . Let g(x) be monotonic for x ∈ supp(X) with derivative g′(x) = dg(x)/dx. Then, the PDF of Y = g(X) is

fY (y) = fX (g^(−1)(y)) / |g′(g^(−1)(y))|

• Translation: Y = X + a
fY (y) = fX (y − a)
• Scaling: Y = aX
fY (y) = (1/|a|) fX (y/a)
• Affine: Y = aX + b
fY (y) = (1/|a|) fX ((y − b)/a)
• Affine transformation of a normal random variable is normal.
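
A sketch checking the affine rule for an assumed X ∼ Exp(λ = 2) and Y = 3X + 1, against scipy's shifted and scaled exponential:

    # Sketch: PDF of Y = aX + b for X ~ Exp(lam), using f_Y(y) = (1/|a|) f_X((y-b)/a).
    # Checked against scipy's location-scale exponential (scale = a/lam, loc = b).
    # lam, a, b and the evaluation point y are assumed example values.
    import math
    from scipy.stats import expon

    lam, a, b, y = 2.0, 3.0, 1.0, 4.0

    def f_X(x):                      # Exp(lam) density
        return lam * math.exp(-lam * x) if x > 0 else 0.0

    f_Y_formula = f_X((y - b) / a) / abs(a)
    f_Y_scipy = expon.pdf(y, loc=b, scale=a / lam)   # Y = aX + b, shifted/scaled exponential
    print(f_Y_formula, f_Y_scipy)    # should match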
17. Expected value of function of continuous random variable:
Let X be a continuous random variable with density fX (x). Let g : R → R be a
function. The expected value of g(X), denoted E[g(X)], is given by
E[g(X)] = ∫_{−∞}^{∞} g(x) fX (x) dx

whenever the above integral exists.


• The integral may diverge to ±∞ or may not exist in some cases.
18. Expected value (mean) of a continuous random variable:
Mean, denoted E[X] or µX or simply µ is given by
E[X] = ∫_{−∞}^{∞} x fX (x) dx

19. Variance of a continuous random variable:
Variance, denoted Var(X) or σX^2 or simply σ^2, is given by

Var(X) = E[(X − E[X])^2] = ∫_{−∞}^{∞} (x − µ)^2 fX (x) dx

• Variance is a measure of spread of X about its mean.


• Var(X) = E[X 2 ] − E[X]2

X               E[X]        Var(X)
Uniform[a, b]   (a + b)/2   (b − a)^2/12
Exp(λ)          1/λ         1/λ^2
Normal(µ, σ^2)  µ           σ^2

20. Markov’s inequality:


If X is a continuous random variable with mean µ and non-negative supp(X) (i.e.
P (X < 0) = 0), then
P (X > c) ≤ µ/c
21. Chebyshev’s inequality:
If X is a continuous random variable with mean µ and variance σ 2 , then
P (|X − µ| ≥ kσ) ≤ 1/k^2

Statistics for Data Science - 2

Week 6 Notes

1. Marginal density: Let (X, Y ) be jointly distributed where X is discrete with range
TX and PMF pX (x).
For each x ∈ TX , we have a continuous random variable Yx with density fYx (y).
fYx (y) : conditional density of Y given X = x, denoted fY |X=x (y).

• Marginal density of Y
– fY (y) = Σ_{x∈TX} pX (x) fY |X=x (y)

2. Conditional probability of discrete given continuous: Suppose X and Y are


jointly distributed with X ∈ TX being discrete with PMF pX (x) and conditional densi-
ties fY |X=x (y) for x ∈ TX . The conditional probability of X given Y = y0 ∈ supp(Y ) is
defined as

• P (X = x | Y = y0 ) = pX (x) fY |X=x (y0 ) / fY (y0 )
3. Joint density: A function f (x, y) is said to be a joint density function if

• f (x, y) ≥ 0, i.e. f is non-negative.


• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = 1

4. 2D uniform distribution: Fix some (reasonable) region D in R2 with total area |D|.
We say that (X, Y ) ∼ Uniform(D) if they have the joint density
fXY (x, y) = 1/|D| for (x, y) ∈ D; 0 otherwise

5. Marginal density: Suppose (X, Y ) have joint density fXY (x, y). Then,
• X has the marginal density fX (x) = ∫_{y=−∞}^{∞} fXY (x, y) dy.
• Y has the marginal density fY (y) = ∫_{x=−∞}^{∞} fXY (x, y) dx.

– In general the marginals do not determine joint density.

6. Independence: (X, Y ) with joint density fXY (x, y) are independent if

• fXY (x, y) = fX (x)fY (y)
– If independent, the marginals determine the joint density.

7. Conditional density: Let (X, Y ) be random variables with joint density fXY (x, y).
Let fX (x) and fY (y) be the marginal densities.

• For a such that fX (a) > 0, the conditional density of Y given X = a, denoted as
fY |X=a (y), is defined as

fY |X=a (y) = fXY (a, y) / fX (a)

• For b such that fY (b) > 0, the conditional density of X given Y = b, denoted as
fX|Y =b (x), is defined as

fX|Y =b (x) = fXY (x, b) / fY (b)

8. Properties of conditional density: Joint = Marginal × Conditional, for x = a and


y = b such that fX (a) > 0 and fY (b) > 0.

• fXY (a, b) = fX (a)fY |X=a (b) = fY (b)fX|Y =b (a)
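
A numerical sketch of these marginal and conditional densities for an assumed (X, Y) ∼ Uniform(D) on the triangle D = {(x, y): 0 < y < x < 1}, so |D| = 1/2 (the region and grid are illustrative):

    # Sketch: marginals/conditionals of (X, Y) ~ Uniform(D), D = {0 < y < x < 1}, |D| = 1/2.
    # Integrals are approximated with a simple Riemann sum on an assumed grid.
    import numpy as np

    def f_XY(x, y):
        return 2.0 if 0 < y < x < 1 else 0.0     # joint density: 1/|D| on D, 0 outside

    ys = np.linspace(0.0, 1.0, 2001)
    dy = ys[1] - ys[0]
    a = 0.6

    # Marginal f_X(a) = integral over y of f_XY(a, y); exact value here is 2a = 1.2
    f_X_a = sum(f_XY(a, y) for y in ys) * dy
    print(f_X_a)

    # Conditional f_{Y|X=a}(y) = f_XY(a, y) / f_X(a) should integrate to 1
    print(sum(f_XY(a, y) / f_X_a for y in ys) * dy)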

Statistics for Data Science - 2

Important results

Discrete random variables:

Uniform(A), A = {a, a + 1, . . . , b}, n = b − a + 1:
• PMF: fX (k) = 1/n, k = a, a + 1, . . . , b
• CDF: FX (x) = 0 for x < a; (k − a + 1)/n for k ≤ x < k + 1, k = a, . . . , b − 1; 1 for x ≥ b
• E[X] = (a + b)/2, Var(X) = (n^2 − 1)/12

Bernoulli(p):
• PMF: fX (1) = p, fX (0) = 1 − p
• CDF: FX (x) = 0 for x < 0; 1 − p for 0 ≤ x < 1; 1 for x ≥ 1
• E[X] = p, Var(X) = p(1 − p)

Binomial(n, p):
• PMF: fX (k) = nCk p^k (1 − p)^(n−k), k = 0, 1, . . . , n
• CDF: FX (x) = 0 for x < 0; Σ_{i=0}^{k} nCi p^i (1 − p)^(n−i) for k ≤ x < k + 1, k = 0, 1, . . . , n − 1; 1 for x ≥ n
• E[X] = np, Var(X) = np(1 − p)

Geometric(p):
• PMF: fX (k) = (1 − p)^(k−1) p, k = 1, 2, . . .
• CDF: FX (x) = 0 for x < 1; 1 − (1 − p)^k for k ≤ x < k + 1, k = 1, 2, . . .
• E[X] = 1/p, Var(X) = (1 − p)/p^2

Poisson(λ):
• PMF: fX (k) = e^(−λ) λ^k/k!, k = 0, 1, 2, . . .
• CDF: FX (x) = 0 for x < 0; Σ_{i=0}^{k} e^(−λ) λ^i/i! for k ≤ x < k + 1, k = 0, 1, 2, . . .
• E[X] = λ, Var(X) = λ

Continuous random variables:

Uniform[a, b]:
• PDF: fX (x) = 1/(b − a), a ≤ x ≤ b
• CDF: FX (x) = 0 for x ≤ a; (x − a)/(b − a) for a < x < b; 1 for x ≥ b
• E[X] = (a + b)/2, Var(X) = (b − a)^2/12

Exp(λ):
• PDF: fX (x) = λe^(−λx), x > 0
• CDF: FX (x) = 0 for x ≤ 0; 1 − e^(−λx) for x > 0
• E[X] = 1/λ, Var(X) = 1/λ^2

Normal(µ, σ^2):
• PDF: fX (x) = (1/(σ√(2π))) exp(−(x − µ)^2/(2σ^2)), −∞ < x < ∞
• CDF: no closed form
• E[X] = µ, Var(X) = σ^2

Gamma(α, β):
• PDF: fX (x) = (β^α/Γ(α)) x^(α−1) e^(−βx), x > 0
• E[X] = α/β, Var(X) = α/β^2

Beta(α, β):
• PDF: fX (x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1), 0 < x < 1
• E[X] = α/(α + β), Var(X) = αβ/((α + β)^2 (α + β + 1))

1. Markov’s inequality: Let X be a discrete random variable taking non-negative values with a finite mean µ. Then,

P (X ≥ c) ≤ µ/c

2. Chebyshev’s inequality: Let X be a discrete random variable with a finite mean µ and a finite variance σ^2. Then,

P (| X − µ | ≥ kσ) ≤ 1/k^2

3. Weak Law of Large numbers: Let X1 , X2 , . . . , Xn ∼ iid X with E[X] = µ, Var(X) = σ^2.
Define sample mean X = (X1 + X2 + . . . + Xn )/n. Then,

P (|X − µ| > δ) ≤ σ^2/(nδ^2)

4. Using CLT to approximate probability: Let X1 , X2 , . . . , Xn ∼ iid X with E[X] = µ, Var(X) = σ^2.
Define Y = X1 + X2 + . . . + Xn . Then,

(Y − nµ)/(σ√n) ≈ Normal(0, 1).
• Test for mean

Case (1): When population variance σ^2 is known (z-test)
– right-tailed: H0 : µ = µ0 , HA : µ > µ0 ; test statistic T = X with Z = (X − µ0 )/(σ/√n); rejection region X > c
– left-tailed: H0 : µ = µ0 , HA : µ < µ0 ; test statistic T = X with Z = (X − µ0 )/(σ/√n); rejection region X < c
– two-tailed: H0 : µ = µ0 , HA : µ ≠ µ0 ; test statistic T = X with Z = (X − µ0 )/(σ/√n); rejection region |X − µ0 | > c

Case (2): When population variance σ^2 is unknown (t-test)
– right-tailed: H0 : µ = µ0 , HA : µ > µ0 ; test statistic T = X with tn−1 = (X − µ0 )/(S/√n); rejection region X > c
– left-tailed: H0 : µ = µ0 , HA : µ < µ0 ; test statistic T = X with tn−1 = (X − µ0 )/(S/√n); rejection region X < c
– two-tailed: H0 : µ = µ0 , HA : µ ≠ µ0 ; test statistic T = X with tn−1 = (X − µ0 )/(S/√n); rejection region |X − µ0 | > c

• χ^2-test for variance:
– right-tailed: H0 : σ = σ0 , HA : σ > σ0 ; test statistic T = (n − 1)S^2/σ0^2 ∼ χ^2_{n−1}; rejection region S^2 > c^2
– left-tailed: H0 : σ = σ0 , HA : σ < σ0 ; test statistic T = (n − 1)S^2/σ0^2 ∼ χ^2_{n−1}; rejection region S^2 < c^2
– two-tailed: H0 : σ = σ0 , HA : σ ≠ σ0 ; test statistic T = (n − 1)S^2/σ0^2 ∼ χ^2_{n−1}; rejection region S^2 > cR^2 where α/2 = P (S^2 > cR^2), or S^2 < cL^2 where α/2 = P (S^2 < cL^2)

• Two samples z-test for means:
– right-tailed: H0 : µ1 = µ2 , HA : µ1 > µ2 ; test statistic T = X − Y , with X − Y ∼ Normal(0, σ1^2/n1 + σ2^2/n2 ) if H0 is true; rejection region X − Y > c
– left-tailed: H0 : µ1 = µ2 , HA : µ1 < µ2 ; test statistic T = Y − X, with Y − X ∼ Normal(0, σ2^2/n2 + σ1^2/n1 ) if H0 is true; rejection region Y − X > c
– two-tailed: H0 : µ1 = µ2 , HA : µ1 ≠ µ2 ; test statistic T = X − Y , with X − Y ∼ Normal(0, σ1^2/n1 + σ2^2/n2 ) if H0 is true; rejection region |X − Y | > c

• Two samples F-test for variances:
– one-tailed: H0 : σ1 = σ2 , HA : σ1 > σ2 ; test statistic T = S1^2/S2^2 ∼ F(n1 −1, n2 −1); rejection region S1^2/S2^2 > 1 + c
– one-tailed: H0 : σ1 = σ2 , HA : σ1 < σ2 ; test statistic T = S1^2/S2^2 ∼ F(n1 −1, n2 −1); rejection region S1^2/S2^2 < 1 − c
– two-tailed: H0 : σ1 = σ2 , HA : σ1 ≠ σ2 ; test statistic T = S1^2/S2^2 ∼ F(n1 −1, n2 −1); rejection region S1^2/S2^2 > 1 + cR where α/2 = P (T > 1 + cR ), or S1^2/S2^2 < 1 − cL where α/2 = P (T < 1 − cL )

• χ^2-test for goodness of fit:
H0 : Samples are i.i.d. X, HA : Samples are not i.i.d. X

Test statistic: T = Σ_{i=1}^{k} (yi − npi )^2/(npi ) = Σ_{i=1}^{k} (observed value − expected value)^2/(expected value) ∼ χ^2_{k−1}

Test: Reject H0 if T > c.

• Test for independence:
H0 : Joint PMF is product of marginals, HA : Joint PMF is not product of marginals

Test statistic: T = Σ_{i,j} (yij − npij )^2/(npij ) = Σ (observed value − expected value)^2/(expected value) ∼ χ^2_dof

where dof = (number of rows − 1) × (number of columns − 1),
yij = observed count in cell (i, j),
pij = product of the marginal proportions for (i, j), so npij is the expected count if independent.

Test: Reject H0 if T > c.

Statistics for Data Science - 2
Week 7 Notes
Statistics from samples and Limit theorems

1. Empirical distribution:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. Let #(Xi = t) denote the number of times
t occurs in the samples. The empirical distribution is the discrete distribution with
PMF
p(t) = #(Xi = t)/n
• The empirical distribution is random because it depends on the actual sample
instances.
• Descriptive statistics: Properties of empirical distribution. Examples :
– Mean of the distribution
– Variance of the distribution
– Probability of an event
• As number of samples increases, the properties of empirical distribution should
become close to that of the original distribution.

2. Sample mean:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. The sample mean, denoted X, is defined to
be the random variable
X = (X1 + X2 + . . . + Xn )/n

• Given a sampling x1 , . . . , xn the value taken by the sample mean X is x = (x1 + x2 + . . . + xn )/n. Often, X and x are both called sample mean.

3. Expected value and variance of sample mean:


Let X1 , X2 , . . . , Xn be i.i.d. samples whose distribution has a finite mean µ and variance
σ 2 . The sample mean X has expected value and variance given by

E[X] = µ, Var(X) = σ^2/n
• Expected value of sample mean equals the expected value or mean of the distri-
bution.
• Variance of sample mean decreases with n.

4. Sample variance:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. The sample variance, denoted S 2 , is defined
to be the random variable
S^2 = [(X1 − X)^2 + (X2 − X)^2 + . . . + (Xn − X)^2]/(n − 1),

where X is the sample mean.

5. Expected value of sample variance:


Let X1 , X2 , . . . , Xn be i.i.d. samples whose distribution has a finite variance σ^2. The sample variance S^2 = [(X1 − X)^2 + . . . + (Xn − X)^2]/(n − 1) has expected value given by

E[S^2] = σ^2

• Values of sample variance, on average, give the variance of distribution.


• Variance of sample variance will decrease with number of samples (in most cases).
• As n increases, sample variance takes values close to distribution variance.

6. Sample proportion:
The sample proportion of A, denoted S(A), is defined as

S(A) = (number of Xi for which A is true)/n
• As n increases, values of S(A) will be close to P (A).
• Mean of S(A) equals P (A).
• Variance of S(A) tends to 0.

7. Weak law of large numbers:


Let X1 , X2 , . . . , Xn ∼ iid X with E[X] = µ, Var(X) = σ^2.
Define sample mean X = (X1 + X2 + . . . + Xn )/n. Then,

P (|X − µ| > δ) ≤ σ^2/(nδ^2)

8. Chernoff inequality:
Let X be a random variable such that E[X] = 0, then

P (X > t) ≤ E[e^(λX)]/e^(λt) , λ > 0

9. Moment generating function (MGF):
Let X be a zero-mean random variable (E[X] = 0). The MGF of X, denoted MX (λ),
is a function from R to R defined as

MX (λ) = E[e^(λX)]

MX (λ) = E[e^(λX)]
        = E[1 + λX + λ^2 X^2/2! + λ^3 X^3/3! + . . .]
        = 1 + λE[X] + (λ^2/2!)E[X^2] + (λ^3/3!)E[X^3] + . . .

That is, the coefficient of λ^k/k! in the MGF of X gives the kth moment of X.

• If X ∼ Normal(0, σ^2) then MX (λ) = e^(λ^2 σ^2/2)

• Let X1 , X2 , . . . , Xn ∼ i.i.d. X and let S = X1 + X2 + . . . + Xn , then

MS (λ) = (E[eλX ])n = [MX (λ)]n

It implies that MGF of sum of independent random variables is product of the


individual MGFs.

10. Central limit theorem: Let X1 , X2 , . . . , Xn ∼ iid X with E[X] = µ, Var(X) = σ^2.
Define Y = X1 + X2 + . . . + Xn . Then,

(Y − nµ)/(σ√n) ≈ Normal(0, 1).
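
A sketch using this approximation for the sum of n = 48 assumed i.i.d. Uniform[0, 1] samples, compared with a Monte Carlo estimate:

    # Sketch: CLT approximation of P(Y <= 26) for Y = sum of 48 iid Uniform[0,1] samples,
    # checked against simulation. n, the threshold and the seed are assumed values.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n = 48
    mu, sigma2 = 0.5, 1 / 12                     # mean and variance of Uniform[0, 1]

    y0 = 26
    clt_approx = norm.cdf((y0 - n * mu) / np.sqrt(n * sigma2))

    sims = rng.random((200_000, n)).sum(axis=1)  # Monte Carlo draws of Y
    print(clt_approx, (sims <= y0).mean())       # should be close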

11. Gamma distribution:


X ∼ Gamma(α, β) if PDF fX (x) ∝ x^(α−1) e^(−βx), x > 0

• α > 0 is a shape parameter.
• β > 0 is a rate parameter.
• θ = 1/β is a scale parameter.
• Mean, E[X] = α/β
• Variance, Var(X) = α/β^2

12. Beta distribution:
X ∼ Beta(α, β) if PDF fX (x) ∝ x^(α−1) (1 − x)^(β−1), 0 < x < 1

• α > 0, β > 0 are the shape parameters.
• Mean, E[X] = α/(α + β)
• Variance, Var(X) = αβ/((α + β)^2 (α + β + 1))

13. Cauchy distribution:


X ∼ Cauchy(θ, α^2) if PDF fX (x) ∝ (1/π) · α/(α^2 + (x − θ)^2)

• θ is a location parameter.
• α > 0 is a scale parameter.
• Mean and variance are undefined.

14. Some important results:

• Let Xi ∼ Normal(µi , σi^2) be independent and let Y = a1 X1 + a2 X2 + . . . + an Xn , then

Y ∼ Normal(µ, σ^2)

where µ = a1 µ1 + a2 µ2 + . . . + an µn and σ^2 = a1^2 σ1^2 + a2^2 σ2^2 + . . . + an^2 σn^2.
That is, a linear combination of independent normal random variables is again normal.

• Sum of n i.i.d. Exp(β) is Gamma(n, β).

• Square of Normal(0, σ^2) is Gamma(1/2, 1/(2σ^2)).

• Suppose X, Y ∼ i.i.d. Normal(0, σ^2). Then, X/Y ∼ Cauchy(0, 1).

• Suppose X ∼ Gamma(α, k), Y ∼ Gamma(β, k) are independent random variables, then X/(X + Y ) ∼ Beta(α, β).

• Sum of n independent Gamma(α, β) is Gamma(nα, β).

• If X1 , X2 , . . . , Xn ∼ i.i.d. Normal(0, σ^2), then X1^2 + X2^2 + . . . + Xn^2 ∼ Gamma(n/2, 1/(2σ^2)).

 
• Gamma(n/2, 1/2) is called the Chi-square distribution with n degrees of freedom, denoted χ^2_n.

• Suppose X1 , X2 , . . . , Xn ∼ i.i.d. Normal(µ, σ^2). Suppose that X and S^2 denote the sample mean and sample variance, respectively, then
(i) (n − 1)S^2/σ^2 ∼ χ^2_{n−1}
(ii) X and S^2 are independent.

Statistics for Data Science - 2
Week 8 notes

• Let X1 , . . . , Xn ∼ i.i.d.X, where X has the distribution described by parameters


θ1 , θ2 , . . ..

– The parameters θi are unknown but a fixed constant.


– Define the estimator for θ as the function of the samples: θ̂(X1 , . . . , Xn ).

Note:

1. θ is an unknown parameter.
2. θ̂ is a function of n random variables.

Remark: Infinite number of estimators are possible for a parameter of a distribution.

• Estimation error: θ̂(X1 , . . . , Xn ) − θ is a random variable.

– We expect the estimator random variable θ̂(X1 , . . . , Xn ) to take values around


the actual value of the parameter θ. So, the random variable ‘Error’ should take
values close to 0.
– Mathematically, it is expressed as P (| Error |> δ) should be small.
– Chebyshev bound on error: P (| Error − E[Error] | > δ) ≤ Var(Error)/δ^2.
– Good design: P (| Error |> δ) will fall with n.

• Good design principles:

1. Error should be close to or equal to 0.


2. Var(Error) → 0 with n.

• Bias: The bias of the estimator θ̂ for a parameter θ, denoted Bias(θ̂, θ), is defined as

Bias(θ̂, θ) = E[θ̂ − θ] = E[θ̂] − θ

1. Bias is the expected value of Error.


2. An estimator with bias equal to 0 is said to be an unbiased estimator.

• Risk: The (squared-error) risk of the estimator θ̂ for a parameter θ, denoted Risk(θ̂, θ),
is defined as
Risk(θ̂, θ) = E[(θ̂ − θ)2 ]

1. Risk is the expected value of “squared error” and is also called mean squared error
(MSE) often.
2. Squared-error risk is the second moment of Error.

• Variance of estimator:
Variance(θ̂) = E[(θ̂ − E[θ̂])^2]
Var(Error) = Var(θ̂)

• Bias-Variance tradeoff: The risk of the estimator satisfies the following relationship:

Risk(θ̂, θ) = Bias(θ̂, θ)2 + Variance(θ̂)

• Estimator design approach:

1. Method of moments
1P n
(a) Sample moments: Mk (X1 , . . . , Xn ) = Xk
n i=1 i
(b) Mk is a random variable, and mk is the value taken by it in one sampling
instance. We expect that Mk will take values around E[X k ]
(c) Procedure:
– Equate sample moments to expression for moments in terms of unknown
parameters.
– Solve for the unknown parameters.
(d) One parameter θ usually needs one moment
– Sample moment: m1
– Distribution moment: E[X] = f (θ)
– Solve for θ from f (θ) = m1 in terms of m1 .
– θ̂: replace m1 by M1 in above solution.
(e) Two parameters θ1 , θ2 usually needs two moments.
– Sample moments: m1 , m2
– Distribution moment: E[X] = f (θ1 , θ2 ), E[X 2 ] = g(θ1 , θ2 )
– Solve for θ1 , θ2 from f (θ1 , θ2 ) = m1 , g(θ1 , θ2 ) = m2 in terms of m1 , m2 .
– θ̂: replace m1 by M1 and m2 by M2 in above solution.
2. Maximum Likelihood estimators
(a) Likelihood of i.i.d. samples: Likelihood of a sampling x1 , x2 , . . . , xn , denoted
L(x1 , x2 , . . . , xn )
L(x1 , x2 , . . . , xn ) = Π_{i=1}^{n} fX (xi ; θ1 , θ2 , . . .)

– Likelihood L(x1 , x2 , . . . , xn ) is a function of parameters.

– Maximum likelihood (ML) estimation
θ1*, θ2*, . . . = arg max_{θ1 ,θ2 ,...} Π_{i=1}^{n} fX (xi ; θ1 , θ2 , . . .)

We find parameters that maximize likelihood for a given set of samples.
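
A sketch of both estimator designs for assumed Exp(λ) samples; for this distribution both the method-of-moments and the ML estimate reduce to 1/(sample mean), and a numerical maximisation of the log-likelihood agrees:

    # Sketch: method of moments and maximum likelihood for assumed Exp(lambda) samples.
    # Data are simulated with an assumed true lambda = 2.
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1 / 2.0, size=500)

    lam_mom = 1 / x.mean()                            # method of moments: solve E[X] = 1/lambda = m1

    def neg_log_lik(lam):
        # negative log-likelihood of Exp(lambda): -(n*log(lam) - lam*sum(x))
        return -(len(x) * np.log(lam) - lam * x.sum())

    lam_mle = minimize_scalar(neg_log_lik, bounds=(1e-6, 100), method="bounded").x
    print(lam_mom, lam_mle)                           # essentially equal, near 2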

• Properties of estimators:

1. Consistency of estimators: If an estimator satisfies the following requirement, it


is said to be consistent. Technically, it is called convergence in probability.
P (| Error |> δ) → 0 as n → ∞ for any δ > 0.
2. To compare the estimators, use mean squared error (MSE).

• Confidence interval:

X1 , . . . , Xn ∼ iid X, µ = E[X]
Estimator: µ̂ = (X1 + . . . + Xn )/n
– Suppose P (| µ̂ − µ |< α) = β, where α is a small fraction and β is a large fraction.
– µ̂ in one sampling instance: estimate with margin of error (100α)% at confidence
level (100β)%.

1. Normal samples with known variance: X1 , . . . , Xn ∼ iid Normal(µ, σ^2), σ^2 known.
Estimator: µ̂ = (X1 + . . . + Xn )/n
µ̂ ∼ Normal(µ, σ^2/n), Z = (µ̂ − µ)/(σ/√n) ∼ Normal(0, 1)

P (| µ̂ − µ | < α) = β
⇒ P (| (µ̂ − µ)/(σ/√n) | < α/(σ/√n)) = β
⇒ P (| Normal(0, 1) | < α/(σ/√n)) = β

2. Normal samples with unknown variance: X1 , . . . , Xn ∼ iid Normal(µ, σ^2), σ^2 unknown.
Sampling instance: x1 , . . . , xn .
Estimated mean and variance: x̄ = (1/n) Σ_{i=1}^{n} xi , σ̂^2 = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)^2
µ̂ ∼ Normal(µ, σ^2/n), T = (µ̂ − µ)/(S/√n) ∼ tn−1

P (| µ̂ − µ | < α) = β
⇒ P (| (µ̂ − µ)/(S/√n) | < α/(σ̂/√n)) = β
⇒ P (| tn−1 | < α/(σ̂/√n)) = β

3. If samples are not normal: Use CLT to argue that sample mean will have a normal
distribution
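
A sketch computing a 95% confidence interval for µ from assumed normal samples, using the t-quantile as in case 2 above:

    # Sketch: 95% confidence interval for the mean from assumed Normal samples,
    # using the t_{n-1} quantile since the variance is estimated from the data.
    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(3)
    x = rng.normal(loc=10.0, scale=2.0, size=25)     # assumed data, true mu = 10

    n, xbar, s = len(x), x.mean(), x.std(ddof=1)     # ddof=1 gives the (n-1)-denominator S
    beta = 0.95
    margin = t.ppf((1 + beta) / 2, df=n - 1) * s / np.sqrt(n)
    print(xbar - margin, xbar + margin)              # interval covering mu ~95% of the time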

Statistics for Data Science - 2

Week 9 Notes

1. Parameter estimation: Let X1 , . . . , Xn ∼ iid X, parameter Θ


Prior distribution of Θ: Θ ∼ fΘ (θ)
Samples: x1 , . . . , xn , notation S = (X1 = x1 , . . . Xn = xn )
Bayes’ rule: posterior ∝ likelihood × prior

P (Θ = θ | S) = P (S | Θ = θ)fΘ (θ)/P (S)


In case of discrete Θ: P (S) = Σ_θ P (S | Θ = θ)fΘ (θ)
In case of continuous Θ: P (S) = ∫ P (S | Θ = θ)fΘ (θ) dθ
Posterior mode: θ̂ = arg maxθ P (S | Θ = θ)fΘ (θ)
Posterior mean: E[Θ | S], mean of posterior distribution.

2. Bernoulli(p) samples with uniform prior: X1 , . . . , Xn ∼ iid Bernoulli(p)


Prior p ∼ Uniform[0, 1]
Samples: x1 , . . . , xn
Posterior: p| (X1 = x1 , . . . Xn = xn )
Posterior density ∝ P (X1 = x1 , . . . Xn = xn | p = p) × fp (p)
Posterior density ∝ p^w (1 − p)^(n−w), where w = x1 + . . . + xn
⇒ Posterior density: Beta(w + 1, n − w + 1)
Posterior mean: p̂ = (X1 + X2 + . . . + Xn + 1)/(n + 2)
3. Bernoulli(p) samples with beta prior: X1 , . . . , Xn ∼ iid Bernoulli(p)
Prior p ∼ Beta(α, β)
⇒ fp (p) ∝ p^(α−1) (1 − p)^(β−1)
Samples: x1 , . . . , xn
Posterior: p | (X1 = x1 , . . . , Xn = xn )
Posterior density ∝ P (X1 = x1 , . . . , Xn = xn | p = p) × fp (p)
Posterior density ∝ p^(w+α−1) (1 − p)^(n−w+β−1)

⇒ Posterior density: Beta(w + α, n − w + β)

Posterior mean: p̂ = (X1 + X2 + . . . + Xn + α)/(n + α + β)
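
A sketch of this conjugate update for assumed data (n = 20 Bernoulli samples containing w = 13 ones) and an assumed Beta(2, 2) prior:

    # Sketch: Beta prior + Bernoulli likelihood => Beta posterior.
    # The prior (alpha=2, beta=2) and the data summary (n=20, w=13 ones) are assumed.
    from scipy.stats import beta

    alpha0, beta0 = 2, 2
    n, w = 20, 13

    posterior = beta(alpha0 + w, beta0 + n - w)       # Beta(w + alpha, n - w + beta)
    post_mean = (w + alpha0) / (n + alpha0 + beta0)   # posterior mean formula from the notes
    print(post_mean, posterior.mean())                # identical
    print(posterior.interval(0.95))                   # a 95% credible interval for p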

4. Normal samples with unknown mean and known variance: X1 , . . . , Xn ∼ iid


Normal(M, σ 2 )

Prior M ∼ Normal(µ0 , σ0^2)
⇒ fM (µ) = (1/(σ0 √(2π))) exp(−(µ − µ0 )^2/(2σ0^2))
Samples: x1 , . . . , xn , Sample mean: x = (x1 + . . . + xn )/n
Posterior: M | (X1 = x1 , . . . , Xn = xn )
Posterior density ∝ f (X1 = x1 , . . . , Xn = xn | M = µ) × fM (µ)
Posterior density ∝ exp(−((x1 − µ)^2 + . . . + (xn − µ)^2)/(2σ^2)) exp(−(µ − µ0 )^2/(2σ0^2))
⇒ Posterior density: Normal
Posterior mean: µ̂ = [(X1 + X2 + . . . + Xn )/n] · nσ0^2/(nσ0^2 + σ^2) + µ0 · σ^2/(nσ0^2 + σ^2)

5. Geometric(p) samples with Uniform[0, 1] prior: X1 , . . . , Xn ∼ iid Geometric(p)


Prior p ∼ Uniform[0, 1]
Samples: x1 , . . . , xn
Posterior: p| (X1 = x1 , . . . Xn = xn )
Posterior density ∝ P (X1 = x1 , . . . Xn = xn | p = p) × fp (p)
Posterior density ∝ p^n (1 − p)^(x1 +...+xn −n)

⇒ Posterior density: Beta(n + 1, x1 + . . . + xn − n + 1)

Posterior mean: p̂ = (n + 1)/(X1 + . . . + Xn + 2)

6. Poisson(λ) samples with gamma prior: X1 , . . . , Xn ∼ iid Poisson(Λ)


Prior Λ ∼ Gamma(α, β)
⇒ fΛ (λ) ∝ λ^(α−1) e^(−βλ)
Samples: x1 , . . . , xn
Posterior: Λ | (X1 = x1 , . . . , Xn = xn )
Posterior density ∝ P (X1 = x1 , . . . , Xn = xn | Λ = λ) × fΛ (λ)
Posterior density ∝ e^(−nλ) λ^(x1 +...+xn ) λ^(α−1) e^(−βλ)

⇒ Posterior density: Gamma(x1 + . . . + xn + α, β + n)

Posterior mean: λ̂ = (X1 + X2 + . . . + Xn + α)/(n + β)

Statistics for Data Science - 2
Week 10 Notes
Hypothesis testing

1. Null hypothesis:
The null hypothesis is a statement about the population parameter whose validity is tested against the given experimental data. It is denoted by H0 . The null hypothesis is the default hypothesis that is assumed to hold unless the data provide evidence against it.
2. Alternative hypothesis:
The alternative hypothesis is the statement tested against the null hypothesis in a statistical inference experiment. It is contradictory to the null hypothesis and is denoted by HA or H1 .
3. Test statistic:
A test statistic is a numerical quantity computed from the sample values and used in statistical hypothesis testing.
4. Type I error:
A type I error is a kind of fault that occurs during the hypothesis testing process when
a null hypothesis is rejected, even though it is true.
5. Type II error:
A type II error is a kind of fault that occurs during the hypothesis testing process when
a null hypothesis is accepted, even though it is not true (HA is true).
6. Significance level (Size):
Significance level (also called size) of a test, denoted α, is the probability of type I
error.
α = P (Type I error)

7. β = P (Type II error)
8. Power of a test:
Power = 1 − β
9. Types of hypothesis:
(a) Simple hypothesis: A hypothesis that completely specifies the distribution of
the samples is called a simple hypothesis.
(b) Composite hypothesis: A hypothesis that does not completely specify the
distribution of the samples is called a composite hypothesis.
10. Standard testing method: z-test:
Consider a sample X1 , X2 , . . . , Xn ∼ i.i.d. X.

• Test statistic, denoted T , is some function of the samples. For example: sample
mean X
• Acceptance and rejection regions are specified through T .

(a) Right-tailed z-test:


• H0 : µ = µ0 , HA : µ > µ0
• Test: reject H0 if T > c.
• Significance level α depends on c and the distribution of T |H0 .
• α = P (T > c|H0 )
• Fix α and find c.
(b) Left-tailed z-test:
• H0 : µ = µ0 , HA : µ < µ0
• Test: reject H0 if T < c.
• Significance level α depends on c and the distribution of T |H0 .
• α = P (T < c|H0 )
• Fix α and find c.
(c) two-tailed z-test:
• H0 : µ = µ0 , HA : µ ≠ µ0
• Test: reject H0 if |T | > c.
• Significance level α depends on c and the distribution of T |H0 .
• α = P (|T | > c|H0 )
• Fix α and find c.

Note: In the test for mean (σ^2 known), T = X and, when the null is true, (X − µ0 )/(σ/√n) ∼ Normal(0, 1).

11. P -value:
Suppose the test statistic T = t in one sampling. The lowest significance level α at
which the null will be rejected for T = t is said to be the P -value of the sampling.
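
A sketch of a right-tailed z-test at α = 0.05 with assumed numbers (µ0 = 50, σ = 10, n = 36, observed sample mean 53.2), finding the critical value c and the P-value:

    # Sketch: right-tailed z-test for the mean with sigma known.
    # mu0, sigma, n and the observed sample mean are assumed example numbers.
    import math
    from scipy.stats import norm

    mu0, sigma, n = 50.0, 10.0, 36
    xbar, alpha = 53.2, 0.05

    se = sigma / math.sqrt(n)
    c = mu0 + norm.ppf(1 - alpha) * se         # reject H0 if xbar > c
    z = (xbar - mu0) / se                      # observed test statistic
    p_value = 1 - norm.cdf(z)                  # lowest alpha at which H0 is rejected
    print(c, z, p_value, xbar > c)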

Statistics for Data Science - 2
Week 11 Notes
t-test, χ2 -test, two samples z/F -test

1. Normal samples and statistics: Consider the samples X1 , . . . , Xn ∼ iid Normal(µ, σ 2 ).


The sample mean, X = (X1 + . . . + Xn )/n
The sample variance, S^2 = [(X1 − X)^2 + . . . + (Xn − X)^2]/(n − 1)
E[X] = µ, E[S^2] = σ^2

• X ∼ Normal(µ, σ^2/n)
• (n − 1)S^2/σ^2 ∼ χ^2_{n−1}, the chi-squared distribution with n − 1 degrees of freedom.
• (X − µ)/(S/√n) ∼ tn−1 , the t-distribution with n − 1 degrees of freedom.
2. t-test for mean (Variance unknown)
Consider the samples X1 , . . . , Xn ∼ iid Normal(µ, σ 2 ), σ 2 unknown. Following are the
three different possibilities:

• The null and alternative hypothesis are:

H0 : µ = µ0

HA : µ > µ0
Test Statistic: T = X
Test: Reject H0 , if T > c
Given H0 , (X − µ0 )/(S/√n) ∼ tn−1

α = P (reject H0 | H0 is true)
  = P (T > c | µ = µ0 )
  = P (tn−1 > (c − µ0 )/(s/√n)) = 1 − Ftn−1 ((c − µ0 )/(s/√n))
⇒ c = (s/√n) · Ftn−1^(−1)(1 − α) + µ0
Note: Ftn−1 is the CDF of t-distribution with n − 1 degrees of freedom.
• The null and alternative hypothesis are:

H0 : µ = µ0

HA : µ < µ0
Test Statistic: T = X
Test: Reject H0 , if T < c
Given H0 , (X − µ0 )/(S/√n) ∼ tn−1

α = P (reject H0 | H0 is true)
  = P (T < c | µ = µ0 )
  = P (tn−1 < (c − µ0 )/(s/√n)) = Ftn−1 ((c − µ0 )/(s/√n))
⇒ c = (s/√n) · Ftn−1^(−1)(α) + µ0

Note: Ftn−1 is the CDF of t-distribution with n − 1 degrees of freedom.


• The null and alternative hypothesis are:

H0 : µ = µ0

HA : µ ≠ µ0
Test Statistic: T = X − µ0
Test: Reject H0 , if | X − µ0 | > c
Given H0 , (X − µ0 )/(S/√n) ∼ tn−1

α = P (reject H0 | H0 is true)
  = P (| X − µ0 | > c | µ = µ0 )
  = P (| tn−1 | > c/(s/√n)) = 2Ftn−1 (−c/(s/√n))
⇒ c = −(s/√n) · Ftn−1^(−1)(α/2)

Note: Ftn−1 is the CDF of t-distribution with n − 1 degrees of freedom.
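
A sketch of the two-tailed t-test with assumed data, done once with the formulas above and once with scipy's ttest_1samp for comparison:

    # Sketch: two-tailed t-test for H0: mu = mu0 with sigma unknown.
    # The data array and mu0 = 12 are assumed example values.
    import numpy as np
    from scipy.stats import t, ttest_1samp

    x = np.array([12.9, 11.4, 13.1, 12.7, 11.8, 12.3, 13.4, 12.0])
    mu0, alpha = 12.0, 0.05

    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    t_stat = (xbar - mu0) / (s / np.sqrt(n))
    c = -t.ppf(alpha / 2, df=n - 1) * s / np.sqrt(n)    # reject if |xbar - mu0| > c
    p_manual = 2 * t.cdf(-abs(t_stat), df=n - 1)

    print(abs(xbar - mu0) > c, p_manual)
    print(ttest_1samp(x, popmean=mu0))                   # same t statistic and p-value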

3. χ2 -test for variance


Consider the samples X1 , . . . , Xn ∼ iid Normal(µ, σ 2 ), σ 2 unknown. Following are the
three different possibilities:

• The null and alternative hypothesis are:

H0 : σ = σ0

HA : σ > σ0

Test Statistic: S^2
Test: Reject H0 , if S^2 > c^2
Given H0 , (n − 1)S^2/σ0^2 ∼ χ^2_{n−1}

α = P (reject H0 | H0 is true)
  = P (S^2 > c^2 | σ = σ0 )
  = P (χ^2_{n−1} > (n − 1)c^2/σ0^2) = 1 − Fχ2_{n−1} ((n − 1)c^2/σ0^2)
Note: Fχ2_{n−1} is the CDF of the chi-square distribution with n − 1 degrees of freedom.
• The null and alternative hypothesis are:
H0 : σ = σ0
HA : σ < σ0
Test Statistic: S^2
Test: Reject H0 , if S^2 < c^2
Given H0 , (n − 1)S^2/σ0^2 ∼ χ^2_{n−1}

α = P (reject H0 | H0 is true)
  = P (S^2 < c^2 | σ = σ0 )
  = P (χ^2_{n−1} < (n − 1)c^2/σ0^2) = Fχ2_{n−1} ((n − 1)c^2/σ0^2)
Note: Fχ2_{n−1} is the CDF of the chi-square distribution with n − 1 degrees of freedom.
• The null and alternative hypothesis are:
H0 : σ = σ0
HA : σ ≠ σ0
Test Statistic: S^2
Test: Reject H0 , if S^2 < cL^2 or S^2 > cR^2
Given H0 , (n − 1)S^2/σ0^2 ∼ χ^2_{n−1}

α/2 = P (S^2 < cL^2 | H0 ) = P (S^2 > cR^2 | H0 )

Note: Fχ2_{n−1} is the CDF of the chi-square distribution with n − 1 degrees of freedom.

4. Two samples z-test (known variances)

Let X1 , . . . , Xn1 ∼ iid Normal(µ1 , σ12 )


and Y1 , . . . , Yn2 ∼ iid Normal(µ2 , σ22 )
Following are the three different possibilities:

• The null and alternative hypothesis are:

H0 : µ1 = µ2

HA : µ1 ≠ µ2
Test Statistic: T = X − Y
Test: Reject H0 , if | T | > c
Given H0 , T ∼ Normal(0, σT^2), where σT^2 = σ1^2/n1 + σ2^2/n2

α = P (reject H0 | H0 is true)
  = P (| T | > c | µ1 = µ2 )
  = 2FZ (−c/σT )

• The null and alternative hypothesis are:

H0 : µ1 = µ2

HA : µ1 > µ2
Test Statistic: T = X − Y
Test: Reject H0 , if X − Y > c
Given H0 , T ∼ Normal(0, σT^2), where σT^2 = σ1^2/n1 + σ2^2/n2

α = P (reject H0 | H0 is true)
  = P (X − Y > c | µ1 = µ2 )
  = 1 − FZ (c/σT )

• The null and alternative hypothesis are:

H0 : µ1 = µ2

HA : µ1 < µ2
Test Statistic: T = X − Y
Test: Reject H0 , if Y − X > c
Given H0 , T ∼ Normal(0, σT^2), where σT^2 = σ1^2/n1 + σ2^2/n2

α = P (reject H0 | H0 is true)
  = P (Y − X > c | µ1 = µ2 )
  = 1 − FZ (c/σT )

5. Two samples F -test (for variances)

Let X1 , . . . , Xn1 ∼ iid Normal(µ1 , σ12 )


and Y1 , . . . , Yn2 ∼ iid Normal(µ2 , σ22 )
Following are the three different possibilities:
• The null and alternative hypothesis are:
H0 : σ1 = σ2
HA : σ1 > σ2
Test Statistic: T = S1^2/S2^2
Test: Reject H0 , if T > 1 + c
Given H0 , T ∼ F (n1 − 1, n2 − 1)
α = P (reject H0 | H0 is true)
  = P (T > 1 + c | σ1 = σ2 )
  = 1 − FF (n1 −1,n2 −1) (1 + c)

• The null and alternative hypothesis are:


H0 : σ1 = σ2
HA : σ1 < σ2
Test Statistic: T = S1^2/S2^2
Test: Reject H0 , if T < 1 − c
Given H0 , T ∼ F (n1 − 1, n2 − 1)
α = P (reject H0 | H0 is true)
  = P (T < 1 − c | σ1 = σ2 )
  = FF (n1 −1,n2 −1) (1 − c)

• The null and alternative hypothesis are:


H0 : σ1 = σ2
HA : σ1 ≠ σ2
Test Statistic: T = S1^2/S2^2
Test: Reject H0 , if T > 1 + cR or T < 1 − cL
Given H0 , T ∼ F (n1 − 1, n2 − 1)
α/2 = P (T > 1 + cR | H0 ) = P (T < 1 − cL | H0 )

6. Likelihood Ratio test:
For a simple null and a simple alternative hypothesis, the likelihood ratio test suffices.

X1 , . . . , X n ∼ P

Consider the simple null and alternative hypothesis:

H0 : P = fX

HA : P = gX
Likelihood ratio: L(X1 , . . . , Xn ) = Π_{i=1}^{n} gX (Xi ) / Π_{i=1}^{n} fX (Xi )
Likelihood ratio test: Reject H0 , if T = L(X1 , . . . , Xn ) > c

7. χ2 -test for goodness of fit:


H0 : Samples are i.i.d X, HA : Samples are not i.i.d X

Test statistic: T = Σ_{i=1}^{k} (yi − npi )^2/(npi ) = Σ_{i=1}^{k} (observed value − expected value)^2/(expected value) ∼ χ^2_{k−1}

Test: Reject H0 if T > c.


Significance level: α = P (T > c | H0 ) ≈ 1 − Fχ2k−1 (c)
Note: In case of continuous distribution, convert continuous to discrete by binning.

8. Test for independence:


H0 : Joint PMF is product of marginals, HA : Joint PMF is not product of marginals

Test statistic: T = Σ_{i,j} (yij − npij )^2/(npij ) = Σ (observed value − expected value)^2/(expected value) ∼ χ^2_dof

where dof = (number of rows − 1) × (number of columns − 1),
yij = observed count in cell (i, j),
pij = product of the marginal proportions for (i, j), so npij is the expected count if independent.

Test: Reject H0 if T > c.
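
A sketch of the independence test on an assumed 2×3 table of counts, computed from the formula above and checked against scipy.stats.chi2_contingency:

    # Sketch: chi-square test of independence on an assumed 2x3 contingency table.
    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    obs = np.array([[20, 30, 25],
                    [30, 20, 25]])                    # assumed observed counts y_ij
    n = obs.sum()
    p_ij = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n**2   # product of marginal proportions
    expected = n * p_ij

    T = ((obs - expected)**2 / expected).sum()
    dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    p_value = 1 - chi2.cdf(T, df=dof)
    print(T, dof, p_value)

    stat, pval, dof2, exp = chi2_contingency(obs, correction=False)
    print(stat, pval)                                  # same statistic and p-value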

