STATS Notes
Introductory Statistics
LECTURE 1 : INTRODUCTION TO STATISTICS
Summation notation: $\sum_{i=1}^{10} i = 1 + 2 + 3 + \dots + 9 + 10$
Parameters of the population
↳ Population mean (summing over the population): $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
↳ Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
MARGIN OF ERROR
→ Difference between population mean and sample mean
FORMULA: $E = z\,\frac{\sigma}{\sqrt{n}}$
↳ * need the probability of being selected (p) to find the MOE
↳ non-probability sampling: margin of error cannot be calculated
PROBABILITY SAMPLING
→ every member of the population has an equal & known probability of being selected
° Random
↳ members are selected randomly: each has an equal chance of being selected, independent of another member being selected
° Systematic
↳ first member selected from the sampling frame at random, using a random number
° Stratified
↳ equal chance of being selected (Eg: to choose 100 people out of 500, randomly choose a ...
NON-PROBABILITY SAMPLING
° Quota (to compare 2 diff groups in the population, eg: males vs females)
↳ sample will be proportional to actual population
° Convenience
↳ choosing samples based on ease of accessibility (eg: first 100 people who walk into school)
° Purposive / Judgemental / Subjective
↳ choosing samples with a specific goal
° Self selection
° Snowball sampling
↳ finding specific hard to reach samples (eg: drug addicts)
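The two probability sampling methods that the notes describe most fully (random and systematic) can be sketched in Python. This is only an illustration under stated assumptions: the sampling frame of 500 numbered people, the function names, and the choice of a random start inside the first interval for the systematic method are all hypothetical, since the notes' own worked example is cut off.

```python
import random

def simple_random_sample(frame, n, rng):
    """Random sampling: every member has the same chance of being selected."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Systematic sampling: random first member, then every k-th member after it.

    Assumption: the interval k is len(frame) // n and the start is drawn
    from the first interval; the notes do not spell this step out.
    """
    k = len(frame) // n                 # sampling interval, e.g. 500 // 100 = 5
    start = rng.randrange(k)            # random number picks the first member
    return [frame[start + i * k] for i in range(n)]

rng = random.Random(42)                 # fixed seed so the sketch is repeatable
frame = list(range(1, 501))             # hypothetical sampling frame of 500 people
random_picks = simple_random_sample(frame, 100, rng)
systematic_picks = systematic_sample(frame, 100, rng)
```

Both calls return 100 distinct members; the systematic picks are always exactly 5 positions apart in the frame.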
ERRORS
° NON-SAMPLING ERROR (difficult to quantify)
- Imperfect research design
- Mistake in execution
- Any other reason not related to sample selection
° SAMPLING ERROR (related to sample selection)
- Systematic Error
- Random Errors
↳ a larger sample size reduces random error when estimating population parameters
DATA
° NUMERICAL (countable, quantitative)
- DISCRETE
- CONTINUOUS
° CATEGORICAL
- NOMINAL DATA: no order (ranking), eg: postal code
° INTERVAL DATA: quantitative, continuous; differences between measurements are meaningful, NO TRUE ZERO (eg: 15°C - 21°C)
° RATIO DATA: quantitative, continuous; differences between measurements are meaningful, and there is a TRUE ZERO
* TRUE ZERO: absence of the property being measured (eg: 0 Kelvin: there is no heat at 0 K)
GRAPHICAL PRESENTATION OF DATA
° Line chart (show changes in a value over time)
° Frequency distribution
° Cross table
° Frequency distribution → intervals are mutually exclusive
↳ class intervals: interval width = $\frac{\max - \min}{\text{no. of desired intervals}}$
° Histogram (numerical variables; no spacing b/w intervals)
[Figure: histogram of Frequency against Age, with the Age axis marked 30, 35, 40, 45, 50]
° Bar chart
[Figure: bar chart of categories A, B, C]
° Pie chart
° Stem & leaf (52, 55, 63, 69, 82, 83)
(tens) | (ones)
5 | 2 5
6 | 3 9
7 |
8 | 2 3
* intervals too narrow: jagged distribution with gaps from empty classes, poor indication of how frequency varies across classes
* intervals too wide: compress variation too much, yield blocky distribution
DATA PRESENTATION ERRORS
° histogram interval widths must be equal
° compressing / distorting the vertical axis
° no zero point on the vertical axis (makes data diff appear vast when in actuality there is little diff)
↳ see slides pg 19
* do provide a basis of relation b/w 2 groups
< GRAPH PLOTTING ON EXCEL >
° BAR & PIE CHARTS
1. Select data
2. Insert (top left toggle)
3. Recommended charts → choose the chart you want
4. Add chart element → axis titles
↳ x-axis
↳ y-axis
↳ chart title
° STACK PIE & PIVOT
TO PRESENT IN A CHART: [steps illegible in the source; surviving fragments: pivot table, drag ... → VALUES]
° CHANGING BIN WIDTH
MEAN: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
MEDIAN: arrange the values in ascending order, then find the median position
- Even no. of values: median = average of the 2 middle numbers
MODE: most frequent value
Distribution
° Positive skew / right skewed: mode < median < mean
° Symmetrical: mean = median = mode
° Negative skew / left skewed: mean < median < mode
QUARTILES
LOCATIONS (n = no. of values, in ascending order)
° Q1 (25%): $\frac{1}{4}(n+1)$
° Q2 (50%): $\frac{2}{4}(n+1)$ → median
° Q3 (75%): $\frac{3}{4}(n+1)$
IQR: Q3 - Q1 → middle 50%
Eg: 0.8, 0.9, 1.0, 1.2, 1.3, 1.5, 1.6, 2.0
Q1: $\frac{1}{4}(8+1) = 2.25$ → $0.9 + \frac{1}{4}(1.0 - 0.9) = 0.925$
↳ Position 2 is 0.9 and Position 3 is 1.0; Q1 lies 1/4 of the way from Position 2 to Position 3
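The (n+1) location rule and the interpolation step from the quartile example can be checked with a short Python sketch. The function name `quartile` is made up for illustration, and the interpolation follows the "fraction of the way between two positions" reading of the example.

```python
def quartile(sorted_values, fraction):
    """Value at position fraction * (n + 1), counting positions from 1.

    A fractional position is resolved by moving that fraction of the way
    from the lower position to the next one.
    """
    n = len(sorted_values)
    position = fraction * (n + 1)           # e.g. 0.25 * (8 + 1) = 2.25
    lower = max(1, min(int(position), n))   # whole part, kept inside 1..n
    upper = min(lower + 1, n)
    weight = position - int(position)       # fractional part, e.g. 0.25
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[upper - 1]
    return low_value + weight * (high_value - low_value)

data = [0.8, 0.9, 1.0, 1.2, 1.3, 1.5, 1.6, 2.0]   # already in ascending order
q1 = quartile(data, 0.25)
q2 = quartile(data, 0.50)
q3 = quartile(data, 0.75)
iqr = q3 - q1
```

For this data set the sketch reproduces Q1 = 0.925 from the notes, and gives Q2 = 1.25 and Q3 = 1.575 by the same rule.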
BOX & WHISKERS
Minimum - Q1 - Q2 (Median) - Q3 - Maximum
° Minimum: smallest value that is not an outlier
° Maximum: largest value that is not an outlier
° Box (Q1 to Q3): middle 50% of values (IQR), with the median marked inside
VARIABILITY
VARIATION
↳ population variance ($\sigma^2$): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
↳ sample variance ($s^2$): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
STANDARD DEVIATION
↳ population std dev ($\sigma$): $\sigma = \sqrt{\sigma^2}$
↳ sample std dev ($s$): $s = \sqrt{s^2}$
COMPARING VARIANCES: COEFFICIENT OF VARIATION (CV)
(relative variation, in percentage)
Population CV: $\frac{\sigma}{\mu} \times 100\%$ / Sample CV: $\frac{s}{\bar{x}} \times 100\%$
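A small Python sketch may help keep the population and sample versions apart: the only difference is dividing by N versus n - 1. The data values and function names below are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def population_variance(values):
    """sigma^2: average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def coefficient_of_variation(std_dev, mean):
    """Relative variation, expressed as a percentage."""
    return std_dev / mean * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical values, mean = 5
mean = sum(data) / len(data)
pop_var = population_variance(data)         # 32 / 8
samp_var = sample_variance(data)            # 32 / 7
pop_cv = coefficient_of_variation(math.sqrt(pop_var), mean)
samp_cv = coefficient_of_variation(math.sqrt(samp_var), mean)
```

Here the squared deviations sum to 32, so the population variance is 4 (std dev 2, CV 40%), while the sample variance is slightly larger at 32/7.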
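* $f_i$: frequency of class i, $m_i$: midpoint of class i
POPULATION
↳ TOTAL COUNTS: $N = \sum f_i$
↳ MEAN: $\mu = \frac{\sum f_i m_i}{N}$
↳ VARIATION: $\sigma^2 = \frac{1}{N}\sum f_i (m_i - \mu)^2 = \frac{\sum f_i m_i^2}{N} - \mu^2$
SAMPLE
↳ TOTAL COUNTS: $n = \sum f_i$
↳ MEAN: $\bar{x} = \frac{\sum f_i m_i}{n}$
↳ VARIATION: $s^2 = \frac{1}{n-1}\sum f_i (m_i - \bar{x})^2 = \frac{\sum f_i m_i^2 - n\bar{x}^2}{n-1}$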
LECTURE 3: PROBABILITY
[Figure: Venn diagrams of events A and B showing the intersection (A ∩ B), the union (A ∪ B) and the complement]
CLASSICAL METHOD
° $P(A) = \frac{\text{no. of outcomes that satisfy A}}{\text{total no. of outcomes in sample space}}$
RELATIVE FREQUENCY METHOD
° $P(A) = \frac{\text{no. of times A occurred}}{\text{total no. of trials}}$
THE ADDITIVE RULE
° $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
COMPLEMENT
° $P(\bar{A}) = 1 - P(A)$
CONDITIONAL PROBABILITY
° $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ ; $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
Independent events: one event occurring does not make the occurrence of the other event any more / less probable
TESTS OF INDEPENDENCE
° $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = P(A)$, with $P(B) > 0$
° $P(B \mid A) = \frac{P(A \cap B)}{P(A)} = P(B)$, with $P(A) > 0$
° $P(A)P(B) - P(A \cap B) = 0$ → independent
VS Mutually exclusive: $P(A \cap B) = 0$
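The rules above can be tried out on a single roll of a fair die using the classical method. The events A, B and C below are hypothetical examples picked for this sketch; exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction

SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}           # one roll of a fair die

def prob(event):
    """Classical method: outcomes satisfying the event / outcomes in the sample space."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

A = {2, 4, 6}                               # hypothetical event: roll is even
B = {1, 2, 3, 4}                            # hypothetical event: roll is at most 4
C = {1, 3, 5}                               # hypothetical event: roll is odd

union_by_rule = prob(A) + prob(B) - prob(A & B)     # additive rule
complement_of_a = 1 - prob(A)                       # complement rule
a_given_b = prob(A & B) / prob(B)                   # conditional probability

a_b_independent = prob(A & B) == prob(A) * prob(B)  # test of independence
a_c_mutually_exclusive = prob(A & C) == 0           # no shared outcomes
a_c_independent = prob(A & C) == prob(A) * prob(C)
```

A and B come out independent (1/3 = 1/2 × 2/3), while A and C are mutually exclusive and therefore not independent, which is the contrast the notes draw.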
WEEK 4: DISCRETE PROBABILITY DISTRIBUTIONS
Probability of X taking a specific value x is $P(X = x) = P(x)$
Eg: $P(X = 2) = \frac{1}{4} = 0.25$
PROPERTIES OF DISCRETE PROBABILITY
° $0 \le P(x) \le 1$ and $\sum_x P(x) = 1$
° MEAN: $E(X) = \mu = \sum_x [x\,P(x)]$
° VARIANCE: $Var(X) = \sigma^2 = \sum_x (x - \mu)^2 P(x) = E(X^2) - \mu^2$, where $E(X^2) = \sum_x x^2 P(x)$
° CUMULATIVE PROBABILITY: $F(x_0) = P(X \le x_0) = \sum_{x \le x_0} P(x)$
FORMULAS
° No. of combinations: $C^n_x = \frac{n!}{x!\,(n-x)!}$, where $n! = n(n-1)(n-2)\dots \times 2 \times 1$
POISSON DISTRIBUTION
- used to determine the probability of a random variable: no. of occurrences
° $P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$ for $x = 0, 1, 2, \dots$ ; $P(x) = 0$ for all other values of x
° MEAN: $E(X) = \lambda$
° VARIANCE: $Var(X) = \lambda$
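As a rough numerical check of the discrete formulas, the sketch below evaluates the combinations formula and the Poisson probabilities, then recovers the mean and variance from the general definitions E(X) = Σ x P(x) and Var(X) = Σ (x - μ)² P(x). The rate λ = 3 and the cut-off of 60 terms are arbitrary assumptions; the infinite sum is only approximated.

```python
import math

def combinations(n, x):
    """n! / (x! (n - x)!), the number of ways to choose x items from n."""
    return math.factorial(n) // (math.factorial(x) * math.factorial(n - x))

def poisson_pmf(x, lam):
    """P(x) = e^(-lam) * lam^x / x! for x = 0, 1, 2, ...; 0 for all other values."""
    if not isinstance(x, int) or x < 0:
        return 0.0
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.0                                   # hypothetical average no. of occurrences
support = range(60)                         # enough terms for the tail to be negligible
total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
```

With these assumptions the probabilities sum to 1 and both the mean and the variance come out at λ, up to floating-point error.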
LECTURE 5: CONTINUOUS PROBABILITY
° Probability density function, $f(x)$: function of x when X is a continuous random variable
° Probability at a specific value: $P(X = a) = 0$, so $P(a \le X \le b) = P(a < X < b)$
↳ * it does not matter how the qn is phrased (≤ vs <)
° Mean: $E(X) = \int x\,f(x)\,dx$
° Variance: $Var(X) = \int (x - E(X))^2 f(x)\,dx = \int x^2 f(x)\,dx - (E(X))^2$
° Cumulative probability: $F(x_0) = P(X \le x_0) = \int_{-\infty}^{x_0} f(x)\,dx$
° At a point, the height of the curve (y axis) is $f(x)$, the PDF; the probability $P(a \le X \le b)$ is the area under the curve between a and b
° Discrete: PMF = probability
Uniform distribution: equal probabilities for all equal-width intervals within the range of the random var.
$f(x) = \frac{1}{b - a}$ for $a \le x \le b$
where $x_{min} = a$ and $x_{max} = b$
Exponential Distribution: distribution for events that happen randomly (eg: customer walking into a shop).
$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
MEAN: $E(X) = \frac{1}{\lambda}$
VARIANCE: $Var(X) = \frac{1}{\lambda^2}$
$P(a \le X \le b) = e^{-\lambda a} - e^{-\lambda b}$
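Normal distribution: random variable has an infinite theoretical range
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
° Mean: $E(X) = \mu$
° Variance: $Var(X) = E(X - \mu)^2 = \sigma^2$
° Notation: $X \sim N(\mu, \sigma^2)$
[Figure: bell curve of f(x), symmetric about μ, where mean = median = mode]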
Standardization: standard normal distribution Z, where $Z \sim N(0, 1)$
° X → Z: $Z = \frac{X - \mu}{\sigma}$
Eg: $\mu = 8$, $\sigma = 5$
$P(X < 8.6) = P\left(Z < \frac{8.6 - 8.0}{5.0}\right) = P(Z < 0.12) = 0.5478$ → TABLE 4
$P(X > 8.6) = 1 - P(X < 8.6) = 1 - 0.5478 = 0.4522$
° Z → X: look up z for the given probability (eg: $z = -0.8416$), then $X = \mu + z\sigma$
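The table lookups in the standardization example can be reproduced with Python's standard library, which offers `statistics.NormalDist` for the normal CDF and its inverse. The sketch assumes μ = 8 and σ = 5, as implied by the z value of 0.12 in the example, and assumes that z = -0.8416 is the cut-off for a lower-tail probability of 0.20.

```python
from statistics import NormalDist

mu, sigma = 8.0, 5.0                 # assumed from the worked example (z = 0.12 at x = 8.6)
standard_normal = NormalDist()       # Z ~ N(0, 1)

# X -> Z: standardize, then read the cumulative probability.
z = (8.6 - mu) / sigma               # 0.12
p_less = standard_normal.cdf(z)      # P(X < 8.6) = P(Z < 0.12)
p_greater = 1 - p_less               # P(X > 8.6)

# Z -> X: find the z cutting off a lower-tail probability, then un-standardize.
z_lower_20 = standard_normal.inv_cdf(0.20)
x_lower_20 = mu + z_lower_20 * sigma
```

Rounded to four decimal places these match the table values in the notes: 0.5478, 0.4522 and -0.8416.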
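Empirical Rule
1. $\mu \pm 1\sigma$ contains 68.26% of data
2. $\mu \pm 2\sigma$ contains 95.44% of data
3. $\mu \pm 3\sigma$ contains 99.74% of data
[Figure: normal curve with the x axis marked μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ, μ+3σ and the matching Z axis marked -3, -2, -1, 0, 1, 2, 3]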
LECTURE 6 :
SAMPLING DISTRIBUTION
$\bar{x} - z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z\frac{\sigma}{\sqrt{n}}$
↳ need the sampling frame to calculate the probability of selection; without it the margin of error cannot be estimated (eg: convenience sample)
Sample mean:
° If the population is normally distributed, $\bar{X}$ will also be normally distributed
° Population mean: $\mu$ ; population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
° Mean of the sample mean: $\mu_{\bar{x}} = \mu$
° Variance of the sample mean: $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$
° $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ → STANDARD ERROR
° If $n \ge 30$, BY CLT: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
° $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Sample proportion:
° $\mu_{\hat{p}} = p$
° $\sigma_{\hat{p}}^2 = \frac{p(1-p)}{n}$
° BY CLT, normal distribution if $np(1-p) > 5$: $\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$
° $Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$
[Figure: normal curves with shaded tail areas marked on the Z axis]
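Point & Interval estimates
° Point estimate: $\bar{x}$ is an estimate for $\mu$ ; $s^2$ is an estimate for $\sigma^2$ (if $E(\bar{X})$ is exactly $\mu$, the point estimate is an unbiased estimate)
° Confidence Interval (CI): lower confidence limit - point estimate - upper confidence limit
° Confidence Level (CL): eg 85%, 90%, 95%
CI for the mean
→ σ known: $\mu = \bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$, where $E = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$ is the margin of error
→ σ unknown: use the t distribution → need to find degree of freedom (ν)
$\mu = \bar{x} \pm t_{\nu,\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$, where $\nu = n - 1$ and $E = t_{\nu,\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$ is the margin of error
↳ lower limit: $\bar{x} - E$ ; upper limit: $\bar{x} + E$
Interpretation of the CI: We are c% confident that the population mean is between $\bar{x} - E$ and $\bar{x} + E$.
CI for a proportion: $p = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Sample size
→ mean: $n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$
→ proportion: $n = \frac{z_{\alpha/2}^2\,\hat{p}(1-\hat{p})}{E^2}$
* ALWAYS ROUND UP
* if $\hat{p}$ is not given in the question, use the worst case possible scenario: take $\hat{p}$ to be 0.5
* sample size depends on the population and is restricted due to financial reasons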
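The interval and sample-size rules can be sketched in Python for the σ-known case. The formulas used are the standard ones, $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ and $n = z_{\alpha/2}^2\,\hat{p}(1-\hat{p})/E^2$; the function names and the numbers (mean 50, σ = 10, n = 100, margin 0.05) are hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z for the upper alpha/2 tail
    margin = z * sigma / math.sqrt(n)            # margin of error
    return sample_mean - margin, sample_mean + margin

def sample_size_for_proportion(margin, confidence, p_hat=0.5):
    """Smallest n giving the required margin; p_hat = 0.5 is the worst case."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)   # always round up

lower, upper = z_interval(sample_mean=50.0, sigma=10.0, n=100, confidence=0.95)
n_required = sample_size_for_proportion(margin=0.05, confidence=0.95)
```

With these inputs the 95% interval is roughly 48.04 to 51.96, and the worst-case sample size is 385 (384.1 rounded up).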
LECTURE 9 : SAMPLING DISTRIBUTIONS ( 2 SAMPLES )
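Dependent samples
° Difference between paired values: $d_i = x_i - y_i$
* Both populations are normally distributed
° Sample mean of the paired differences: $\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$
° Sample variance of the paired differences: $s_d^2 = \frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n-1}$
° Sampling distribution of the mean paired difference: Mean: $\mu_{\bar{d}} = \mu_d$ , variance: $\frac{s_d^2}{n}$
° Confidence Interval (CI), variance unknown: $\mu_d = \bar{d} \pm t_{\nu,\alpha/2}\frac{s_d}{\sqrt{n}}$, where $\nu = n - 1$
→ If the CI does not include zero or a negative number: sample A - sample B > 0
Independent samples
° $E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$
° $Var(\bar{x}_1 - \bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
(i) σ known
CI: $(\mu_1 - \mu_2) = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
(ii) σ unknown and equal
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
CI: $(\mu_1 - \mu_2) = (\bar{x}_1 - \bar{x}_2) \pm t_{\nu,\alpha/2}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, where $\nu = n_1 + n_2 - 2$
(iii) σ unknown and unequal
CI: $(\mu_1 - \mu_2) = (\bar{x}_1 - \bar{x}_2) \pm t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
where (Welch-Satterthwaite formula) $\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
* round DOWN a non-integer df
↳ in each CI, the term after ± is the margin of error
Proportions
$Var(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}$
CI: $(p_1 - p_2) = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$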
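The two degrees-of-freedom rules for independent samples are easy to mix up, so here is a minimal Python sketch of the pooled variance (equal variances) and the Welch-Satterthwaite degrees of freedom (unequal variances, rounded down). The sample variances and sizes are made-up numbers for illustration.

```python
import math

def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled variance, weighting each sample variance by its n - 1."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite degrees of freedom, rounded DOWN to an integer."""
    a = s1_sq / n1
    b = s2_sq / n2
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return math.floor(df)

# Hypothetical summary statistics for two independent samples.
s1_sq, n1 = 4.0, 10
s2_sq, n2 = 9.0, 15

sp_sq = pooled_variance(s1_sq, n1, s2_sq, n2)     # used with df = n1 + n2 - 2
equal_var_df = n1 + n2 - 2
unequal_var_df = welch_df(s1_sq, n1, s2_sq, n2)
```

For these numbers the pooled variance is 162/23 with 23 degrees of freedom, while the Welch-Satterthwaite value is about 22.99, which rounds down to 22.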
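[Figure: rejection regions on the Z axis for the lower tail test, upper tail test and two tail test]
° Lower tail test: $H_0: \mu \ge 3$ , $H_1: \mu < 3$ (rejection region below $-z_\alpha$)
° Upper tail test: $H_0: \mu \le 3$ , $H_1: \mu > 3$ (rejection region above $z_\alpha$)
° Two tail test: $H_0: \mu = 3$ , $H_1: \mu \ne 3$ (rejection regions below $-z_{\alpha/2}$ and above $z_{\alpha/2}$)
Step 1: Hypothesis
Eg: left-tail test
$H_0: \mu \ge 3$
$H_1: \mu < 3$
$\alpha = 5\%$
Step 2: Test statistics
(i) σ known
* Assume normal distribution
$\bar{X} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$, where $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$
Test stat: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
(ii) σ unknown
$\sigma_{\bar{x}}^2$ is estimated by $\frac{s^2}{n}$
Test stat: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\nu = n - 1$
Step 3: Decision
METHOD 1: USING CRITICAL VALUE (test stat in relation to $z_\alpha$)
↳ reject $H_0$ if the test statistic falls in the critical region
↳ eg: $z_{0.05} = 1.6449$ ; we rej $H_0$ if test stat < $-z_\alpha$ (for lower tail test)
METHOD 2: P-VALUE
↳ p-value = $P(Z > z_{test})$ for an upper tail test ; for 2-tail, double the one-tail area
↳ eg: p-value = 1 - 0.97725 = 0.02275 ; since 0.02275 < 0.05 → rej $H_0$
Proportions
$\mu_{\hat{p}} = p$ , $\sigma_{\hat{p}}^2 = \frac{p(1-p)}{n}$
* Assume normal distribution: since $np(1-p) > 5$ → CLT → $\hat{p} \sim N(\mu_{\hat{p}}, \sigma_{\hat{p}}^2)$
Test stat: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$, where $p_0$ is the proportion we test against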
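Both decision methods (critical value and p-value) can be sketched for a one-sample z test with σ known. The sample figures below are hypothetical and were picked so that the test statistic is 2, which gives the same upper-tail p-value of 0.02275 that appears in the notes.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Test statistic for a one-sample z test (sigma known)."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

def p_value(z, tail):
    """Area beyond the test statistic for a 'lower', 'upper' or 'two' tail test."""
    cdf = NormalDist().cdf
    if tail == "lower":
        return cdf(z)
    if tail == "upper":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha)        # z_0.05 for a one-tail test

# Hypothetical upper tail test: H0: mu <= 3, H1: mu > 3.
z = z_statistic(sample_mean=3.4, mu0=3.0, sigma=1.0, n=25)
p = p_value(z, "upper")
reject_by_critical_value = z > z_critical           # METHOD 1
reject_by_p_value = p < alpha                       # METHOD 2
```

The two methods always agree: here z = 2 exceeds the critical value 1.6449, and equivalently the p-value 0.02275 is below 0.05, so H0 is rejected.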
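Conclusion for Hypothesis Testing:
                   | H0 is true                                     | H0 is false
DO NOT reject H0   | ✓ Confidence Level (1 - α)                     | ✗ TYPE II ERROR (β): probability of not rej. a false H0
Reject H0          | ✗ TYPE I ERROR (α): probability of rej. a true H0 | ✓ Power (1 - β)
* ↓ type I error → ↑ type II error, & vice versa
Finding β (lower tail test, where $H_0$ is not rejected when $\bar{x} > \bar{x}_c$):
1. Find $z_\alpha$ such that $P(Z < -z_\alpha) = \alpha$
2. Find the critical sample mean using $\bar{x}_c = \mu_0 - z_\alpha\left(\frac{\sigma}{\sqrt{n}}\right)$ (use $+$ for an upper tail test)
3. METHOD FROM ACTUAL MEAN GRAPH: with the actual mean $\mu^*$, $\bar{X} \sim N\left(\mu^*, \frac{\sigma^2}{n}\right)$ and $z_\beta = \frac{\bar{x}_c - \mu^*}{\sigma / \sqrt{n}}$
4. Find $\beta = P(Z > z_\beta)$ where $Z \sim N(0, 1)$, i.e. $\beta = P(\bar{X} > \bar{x}_c \mid \mu = \mu^*)$
[Figure: sampling distributions under the hypothesised mean and under the actual mean μ*, with the critical value and the area β marked]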
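LECTURE 11: HYPOTHESIS TESTING (2 samples)
Dependent samples
(i) σ unknown
$\bar{d} = \frac{\sum d_i}{n}$ ; $s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1}$
$\mu_{\bar{d}} = \mu_D$ ; $Var(\bar{d}) = \frac{s_d^2}{n}$
Test stat: $t = \frac{\bar{d} - \mu_D}{s_d / \sqrt{n}}$, where $\nu = n - 1$ (use $t_{\nu,\alpha/2}$ for two tail)
Independent samples
(i) σ known
$Var(\bar{x}_1 - \bar{x}_2) = Var(\bar{x}_1) + Var(\bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
Test stat: $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
↳ if we hypothesise that there is no difference, $\mu_1 - \mu_2 = 0$
↳ if we set the hypothesis as $H_0: \mu_1 - \mu_2 \le 0$ , $H_1: \mu_1 - \mu_2 > 0$ → claiming that $\mu_1 > \mu_2$
(ii) σ unknown and equal
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Test stat: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\nu = n_1 + n_2 - 2$
(iii) σ unknown and unequal
$Var(\bar{x}_1 - \bar{x}_2) = Var(\bar{x}_1) + Var(\bar{x}_2) = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$
Test stat: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Proportions
$\bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$ (pooled proportion in both samples)
Test stat: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$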
LECTURE 12: CORRELATION & REGRESSION
CORRELATION
° Correlation measures the LINEAR relationship between two variables
° Correlation ≠ causation → spurious correlation (3rd variable causing a correlation b/w the 2)
° Causation → correlation
Population correlation coefficient: $\rho = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum (x_i - \mu_x)^2 \sum (y_i - \mu_y)^2}}$
Sample correlation coefficient: $r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y} = \frac{Cov(x, y)}{s_x s_y}$
where
$SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$
$SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$
$SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
sample std dev of X: $s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
sample std dev of Y: $s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}}$
$Cov(x, y) = \frac{SS_{xy}}{n-1}$
° Cov. is tied to the units of x & y
↳ eg: x - weight (kg), y - height (cm)
↳ use the r-value to compare instead (r-value has no units)
General Guidelines
> VERY STRONG CORRELATION COEFFICIENT: r ≥ 0.8
> RELATIVELY STRONG CORRELATION COEFFICIENT: 0.6 ≤ r ≤ 0.7
> MILD / RELATIVELY WEAK CORRELATION COEFFICIENT: 0.5 ≤ r ≤ 0.6
> WEAK CORRELATION COEFFICIENT: 0.4 ≤ r ≤ 0.5
> VERY WEAK CORRELATION COEFFICIENT: r ≤ 0.3
REGRESSION: how much y will change given a change in x
X: Independent variable / Exogenous variable
Y: Dependent variable / Endogenous variable
Simple linear regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
where ε is the random error term, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$
Simple linear regression equation (best fit line): $\hat{y} = b_0 + b_1 x$
where $b_0$: y intercept, $b_0 = \bar{y} - b_1\bar{x}$
$b_1$: gradient of the slope, $b_1 = \frac{SS_{xy}}{SS_{xx}}$
Predictive Modelling (least squares): $e_i$ = difference between raw & predicted value of y, $e_i = y_i - \hat{y}_i$
↳ when you square all the $e_i$ & sum them together, the least squares line gives the smallest total
Assumptions:
i. True relationship form is linear (y is a linear function of x + ε)
ii. $e_i$ is independent of x
Sum of squares total (SST) = SSE + SSR
> $SST = \sum (y_i - \bar{y})^2$: difference between raw y & mean of y
> $SSR = \sum (\hat{y}_i - \bar{y})^2$: difference between predicted value $\hat{y}$ & mean of y (ONLY look @ predicted line)
> $SSE = \sum (y_i - \hat{y}_i)^2$: difference between raw y & predicted $\hat{y}$
$s_e = \sqrt{\frac{SSE}{n-2}}$, where $s_e$ is the standard error of the estimates
$R^2$ (coefficient of determination): portion of total variation of y that is explained by x
$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$
$R^2 = r^2$ for simple regression
Least squares estimates
$b_1 = \frac{Cov(x, y)}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{r\,s_x s_y}{s_x^2} = r\left(\frac{s_y}{s_x}\right)$
Step 1: Hypothesis
$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$
Step 2: Decision rule
reject $H_0$ if p-value ≤ α
Step 3: Conclusion
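To tie the correlation and regression formulas together, the sketch below computes the sums of squares, the least squares estimates, r and R² for a tiny data set. The five (x, y) pairs are invented for illustration and the function name `least_squares` is an assumption of this sketch.

```python
import math

def least_squares(xs, ys):
    """Least squares estimates b0, b1 plus r and R^2 for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)             # this is also SST
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx                                    # gradient
    b0 = y_bar - b1 * x_bar                               # y intercept
    r = ss_xy / math.sqrt(ss_xx * ss_yy)                  # sample correlation
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / ss_yy                           # 1 - SSE / SST
    return {"b0": b0, "b1": b1, "r": r, "sse": sse, "r_squared": r_squared}

xs = [1, 2, 3, 4, 5]          # hypothetical independent variable
ys = [2, 4, 5, 4, 5]          # hypothetical dependent variable
fit = least_squares(xs, ys)
predicted_at_6 = fit["b0"] + fit["b1"] * 6
```

For this data SSxy = 6, SSxx = 10 and SSyy = 6, so the fitted line is ŷ = 2.2 + 0.6x, and R² = 0.6 equals r², as expected for simple regression.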