DAO Cheatsheet
String functions
line_upper = line.upper()    # to upper case
line_lower = line.lower()    # to lower case
line_cap = line.capitalize() # first letter to upper case
line_swap = line.swapcase()  # swap upper and lower case
line_title = line.title()    # capitalize the 1st letter of each word
The replace() method for strings:
string_uk = string_us.replace('modeling', 'modelling')

The returned value of print()
result = print('What?')
print(result)       # the returned value of the print function is None
print(type(result)) # the data type of None is NoneType

Notes: In Python programming, all positional arguments must be specified before keyword arguments; otherwise there will be an error message.
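A minimal sketch of the positional-before-keyword rule; the function and values here are hypothetical:
def greet(name, punct='!'):  # hypothetical example function
    print('Hello ' + name + punct)
greet('Ann', punct='?')      # OK: positional argument first, then keyword argument
# greet(punct='?', 'Ann')    # SyntaxError: positional argument follows keyword argument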
Notes: syntax for variable names:
- Only one word
- Only consists of letters, numbers, and underscores
- Cannot begin with a number
- Avoid conflicts with Python keywords or other variable/function names

List methods
empty_list = []
.append()      # the item is added to the end of the list
.extend([,])   # add multiple items to the list
.insert(2, '') # insert an item at index 2
.remove('')    # only removes the first occurrence of the specific item
.pop(2)        # remove and return the item at index 2; if the position index is not specified, the pop() method removes the last item of the list

Comparison operators
output = [n for n in range(1, 101) if n%7 == 0 and n%5 > 0]
print(greetings[::-1]) # result: dlroW olleH
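A quick self-contained run of the list methods above; the example values are arbitrary:
items = ['a', 'b']
items.append('c')         # ['a', 'b', 'c']
items.extend(['d', 'e'])  # ['a', 'b', 'c', 'd', 'e']
items.insert(2, 'X')      # ['a', 'b', 'X', 'c', 'd', 'e']
items.remove('X')         # removes the first occurrence of 'X'
last = items.pop()        # returns 'e'; items is now ['a', 'b', 'c', 'd']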
Bitwise operators ("and" & / "or" | / "not" ~, element-wise on Boolean series)
e.g. cond1 = data_frame['gender'] == 'F'
cond2 = data_frame['married']
is_wife = cond1 & cond2
data_frame.loc[is_wife, 'remarks'] = 'Wife'
New column named "husband", where each element equals 1 if this individual is a male and is married, otherwise the element is 0:
wage_data.loc[:, 'husband'] = 1
wage_data.loc[(wage_data['female']==1) | (wage_data['married']==0), 'husband'] = 0

Logical operators: and / or / not (for single Boolean values; use & | ~ element-wise on pandas objects)
Membership operators: in / not in, e.g. 'DAO' in item (see the dictionary example below)
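A runnable sketch of the 'husband' example on a toy data frame; the column values below are assumptions:
import pandas as pd
wage_data = pd.DataFrame({'female': [0, 1, 0], 'married': [1, 1, 0]})  # toy data
wage_data.loc[:, 'husband'] = 1
wage_data.loc[(wage_data['female'] == 1) | (wage_data['married'] == 0), 'husband'] = 0
print(wage_data)  # husband is 1 only for the male & married record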
Built-in data structures
Tuples
item = '', ''   # comma-separated items; tuples can go without parentheses, unlike lists
item = ('', '') # comma-separated items within parentheses
feel_empty = () # len() = 0
tuple_one = 'here', # the comma creates a tuple type object; the comma is necessary
item_one = ('there') # the parentheses alone do not create a tuple (item_one is just a str)
s2 = tuple(range(3)) # a tuple with three items: (0, 1, 2)
mixed = ['Jack', 32.5, (1, 2)] # comma-separated items within brackets make a list
type(mixed)    # <class 'list'>
type(mixed[2]) # <class 'tuple'>
x, y, z = 'abc' # unpack a string of three characters
cage = 'bad'; trav = 'good'
cage, trav = trav, cage # swapping variables
range() function: with only 1 argument, the sequence starts at 0 and stops before that argument
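The swap works because the right-hand side is packed into a tuple and then unpacked; a minimal check:
cage, trav = 'bad', 'good'  # tuple packing
cage, trav = trav, cage     # unpack in swapped order
print(cage, trav)           # good bad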
break vs continue
- Break the loop with break
- Skip the subsequent code in the current iteration with continue
a_string = 'abcdef'
new_string = ''
for letter in a_string:
    new_string = new_string + letter
    if letter == 'c':
        break
print(new_string) # abc
With continue instead (if letter == 'c': continue, placed before the concatenation), the character 'c' is skipped and the loop moves to the next iteration directly; only the code below continue is skipped, so if no code is under continue then no code is skipped. Result: abdef

zip function
for outcome, prob in zip(outcomes, probs): # iterate two sequences in parallel

Functions
def sum_sq(x):
    output = sum([item**2 for item in x])
x = [1, 1, 1]
y = [2, 2, 2]
print(sum_sq(y)) # result: None, due to no return in the function
*The output of the input() function is always a str type object
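A minimal fix for sum_sq, adding the missing return so the call no longer evaluates to None:
def sum_sq(x):
    output = sum([item**2 for item in x])
    return output
print(sum_sq([2, 2, 2]))  # 12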
Dictionaries {'key': 'value'}
for name in stocks.keys(): # OR: for name in stocks:
    print(name, stocks[name])
for price in stocks.values():
    print(price) # print the values only
for item in stocks.items():
    print(item) # iterate both keys and values in parallel

Create a new dictionary all_dao that includes all DAO courses:
all_dao = {}
for item in courses:
    if 'DAO' in item:
        all_dao[item] = courses[item]
OR:
for item, value in courses.items():
    if 'DAO' in item:
        all_dao[item] = value

*Data type of a list inside dictionaries: when a dict of lists is converted to a pandas object, each list becomes a Series.
Gives a list of ALL possible name combinations:
first = names['first']
last = names['last']
all_names = []
for a in first:
    for b in last:
        all_names.append(a + ' ' + b)
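A self-contained run of the all_dao filter; the course entries below are made up for illustration:
courses = {'DAO2702': 'Programming', 'DAO2703': 'Operations', 'ACC1701': 'Accounting'}  # toy dict
all_dao = {}
for item, value in courses.items():
    if 'DAO' in item:   # membership test on the key
        all_dao[item] = value
print(all_dao)          # only the keys containing 'DAO'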
NumPy
import numpy as np
Create NumPy arrays: np.array()
1D array: similar to a list (one row) or tuple | 2D array: (row, col); rows are stacked vertically, so the number of rows is shown going down
Functions and array methods
np.arange(start=0 by default, stop, step)
.reshape((rows, cols))
np.sin() / np.cos()
print(np.log(3)) # natural logarithm of 3
print(np.exp(np.arange(1, 3, 0.5))) # natural exponentials of 1, 1.5, 2, 2.5
print(np.square(np.arange(3))) # squares of 0, 1, 2
print(np.power(2, np.arange(3))) # 2 to the power of 0, 1, 2

Variance and standard deviation
pd.Series(x).var() # sample variance
np.array(x).var() # population variance
np.array(x).var() * (n/(n-1)) # = sample variance
np.array(x).std() * (n/(n-1))**0.5 # sample std
Transforming the variances to standard deviations:
import numpy as np
var = [3.5, 6.2, 7.3, 8.5]
std = np.array(var) ** 0.5
std = list(std)
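A quick check of the population-vs-sample variance rescaling above; x is an arbitrary example:
import numpy as np
x = [3.0, 5.0, 9.0]
n = len(x)
print(np.array(x).var() * n / (n - 1))  # population variance rescaled to sample variance
print(np.array(x).var(ddof=1))          # sample variance directly; the two values match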
pandas objects
pd.DataFrame: .columns & .index | .dtypes: data type of each column | .astype(): convert types of pandas objects
pd.Series: series.values (values of the data elements); series.index, e.g. RangeIndex(start=0, stop=6, step=1)
price = stocks.values # convert into an array; shows the values without index and column names
Specifying row indexes: set_index('colname')
The row indexes can be converted into a new column, while the default integer row indexes are recovered, by the reset_index() method.
output.reset_index(drop=True) # drop the original index

Indexing
- The indexer iloc[] is for integer-position based indexes; the stop index is exclusive of the selection.
- The indexer loc[] is for label based indexes; the stop index is inclusive of the selection.
data_frame.iloc[row:row, col:col]
- pandas.DataFrame objects are also mutable, e.g. data_frame.loc[2:3, 'educ'] = 9.0
- The loc[] indexer can even be used to create new columns, e.g. data_frame.loc[:, 'new col'] = 'value'
Boolean indexing is typically used for filtering or segmenting data according to given conditions.
e.g. is_male = data_frame['gender'] != 'F' # True if the record is a male

Descriptive statistics
Avg: data.mean() # only numerical columns, incl. Boolean (proportion of True)
Median: data.median() | SD: data.std() | Var: data.var()
data.max() | data.min()
data['gender'].value_counts() # actual count of each category
data['gender'].value_counts(normalize=True) # proportion of each category
data.corr()
data.cov()
e.g. locate a specific value:
corr_table.loc['educ'].loc['expr']
corr_table.loc['educ'].iloc[2]
summary = data.describe() # shows count, mean, std, min, quartiles (25%, 50%, 75%), max; the median is shown as the 50% quartile

Operations on series
Data stored as a series cannot use string operations directly → convert via .str first! With .str, the function is applied to all rows.
1) Vectorized len, e.g. is_long = condo['name'].str.len() > 20
   Output is Boolean; condo.loc[is_long] shows all rows with names longer than 20 characters
2) Vectorized count, good for counting the number of words, e.g. project.str.count('A')
3) Vectorized indexing, e.g. names starting with 'D':
   lower_names = condo['name'].str.lower() # convert to lowercase before testing for 'd'
   d_start = lower_names.str[0] == 'd'
   condo.loc[d_start]
4) Data type conversion (convert all elements in a series to a certain type)
   e.g. levels from str ('06 to 10') to int (6th floor to 10th floor): #1 create a new column, #2 index & slice with .str, #3 astype(int)
   condo['level_from'] = condo['level'].str[:2].astype(int)
   condo['level_to'] = condo['level'].str[-2:].astype(int)
stocks.pct_change() # method to calculate the rate of change between rows

Handling missing data (NaN: Not a Number, for missing values)
Detecting missing data, e.g. gdp.loc[:, ['1960', '1961']].isnull()
isnull(): returns True if NaN, False if not missing
notnull(): returns True if not missing, False if NaN
- Select only the not-null rows, e.g. gdp1960 = gdp.loc[gdp['1960'].notnull()]
Dropping missing values (empty brackets leave the original data unchanged by default):
- gdp_subset.dropna() returns a new data frame without the rows that contain missing values
- output = gdp_subset.dropna(inplace=True): the original data frame is overwritten, but the dropna() method returns nothing (print(output) gives None)
Replacing missing values with a value (empty brackets leave the original unchanged):
- gdp_subset.fillna('ABC') fills all NaN items with 'ABC'
- output = gdp_subset.fillna(0, inplace=True): the original data is changed; inplace=True behaves as it does for dropna()

Discrete random variables
from scipy.stats import binom
P(X ≥ 6) = 1 − P(X ≤ 5)
e.g. 50 multiple-choice questions, each with only one correct answer among four choices. John is certain about n = 28 questions and is randomly guessing the remaining questions. What is the probability that John can answer at least m = 32 questions correctly?
n = 28
m = 32
p_right = 1 / 4
p_wrong = 1 - p_right
p_answer = binom.cdf(50 - m, 50 - n, p_wrong)
print(p_answer)
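Why binom.cdf(50 - m, 50 - n, p_wrong) works: John must guess at least m - n = 4 of the 50 - n = 22 remaining questions correctly, which is the same event as getting at most 18 of them wrong. A cross-check:
from scipy.stats import binom
print(binom.cdf(18, 22, 0.75))     # P(at most 18 wrong guesses)
print(1 - binom.cdf(3, 22, 0.25))  # P(at least 4 correct guesses); same value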
Continuous random variables
from scipy.stats import norm
e.g. probability that the amount of the soft drink is between 11.92 ounces and 12.12 ounces (mean 12, std 0.05):
prob = norm.cdf(12.12, 12, 0.05) - norm.cdf(11.92, 12, 0.05)
- Inverse of the cdf:
lower = -1.5
p_in = 0.90 # the random variable has a probability of 0.9 to be within the interval
p_temp = 1 - p_in - norm.cdf(lower) # probability above the upper bound
upper = norm.ppf(1 - p_temp)

Expected values and variances
For the Poisson distribution, the mean equals the variance.

Poisson; Monte Carlo simulations for decision-making
sample = np.random.normal(mean, std, size=1000) # 1000 records following a normal distribution
sample = np.random.poisson(lam, size=1000) # Poisson distribution; lam is both the mean and the variance
rolls = np.random.choice(outcomes, p=probs) # random draws from given outcomes with given probabilities; a size argument can be added for multiple draws

The log return
If you buy $1000 worth of this stock at time t = 1, what is the probability that after five trading days, i.e. at time t = 6, your investment is worth less than $990?

Confidence intervals
Repeat the sampling experiment 1000 times using the for-loop. Among the 1000 experiments, the chance that the population mean value μ falls within the confidence interval is around 1 − α = 95%.
Cut-off value:
cut_value = norm.ppf(1 - alpha) * se
cut_value = -norm.ppf(alpha) * se # equivalent form

Hypothesis testing
- Two-tailed test: decide whether a population mean, μ, is different from a given constant value μ0; Ha: μ ≠ μ0. μ is either larger or smaller than the constant μ0, so μ could vary from μ0 in two directions.
- Left-tailed test: the population mean, μ, is less than a given constant value μ0; Ha: μ < μ0. The test checks if μ is on the left-hand side of μ0.
- Right-tailed test: the population mean, μ, is greater than a given constant value μ0; Ha: μ > μ0. The test checks if μ is on the right-hand side of μ0.
If the P-value is larger than α, there is insufficient evidence to reject the null hypothesis under the given significance level. In other words, we do not have sufficient evidence to support the conclusion that the alternative hypothesis is true.
If the P-value is lower than α, we reject the null hypothesis in favor of the alternative hypothesis, which implies that the alternative hypothesis is true.
Example: prove that the population mean is larger than a given value, so it is a right-tailed test.
1. Hypotheses
   Null hypothesis H0: μ ≤ μ0 = 1340
   Alternative hypothesis Ha: μ > μ0 = 1340
2. Sampling distribution: the distribution of X̄, the sample mean
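A sketch of the right-tailed test in the example; the sample statistics below are hypothetical placeholders:
from scipy.stats import norm
mu0 = 1340
x_bar, sigma, n = 1390.0, 80.0, 25  # hypothetical sample mean, population std, sample size
se = sigma / n ** 0.5               # standard error of the sample mean
z = (x_bar - mu0) / se              # test statistic
p_value = 1 - norm.cdf(z)           # right-tailed P-value
print(p_value)                      # reject H0 if p_value < alpha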