AL Notes

This document provides an overview of various Python programming concepts covered across multiple tutorials and classes. Key topics discussed include data types, operators, strings, lists, dictionaries, NumPy arrays, Pandas DataFrames, Matplotlib and Seaborn for visualization, functions, Object Oriented Programming concepts like classes and inheritance. More advanced topics covered include exception handling, custom exceptions, parameterization, and Tableau dashboarding concepts like actions, blending, pivoting, splitting columns.


Udemy https://cashify.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/learn/lecture/5733434#overview

Classes to revise: 37, 38, 39

Udemy https://cashify.udemy.com/course/complete-python-scripting-for-automation/learn/lecture/15328078#overview

Class 10 Escape sequences
\n >> new line
\b >> backspace
\t >> tab space
\' >> escaped apostrophe, eg: 'python\'s script'
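A quick runnable sketch of these escape sequences (the string contents are only illustrative):

# Escape sequences in Python string literals
print("line 1\nline 2")      # \n -> new line
print("col1\tcol2")          # \t -> tab space
print("abc\b")               # \b -> backspace (removes the previous character on most terminals)
print('python\'s script')    # \' -> escaped apostrophe inside single quotes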

Class 12 del x to delete a variable

Class 17 my_string.lower() >> converts to lowercase
my_string.swapcase() >> lowercase letters become uppercase & uppercase become lowercase
my_string.title() >> every word's 1st letter will be in caps
my_string.capitalize() >> only the 1st letter of the sentence will be in caps
my_string = "Python"
"*".join(my_string) >> Output: P*y*t*h*o*n

Class 19 print(my_str.zfill(10)) >> total length will be 10, and blanks will be padded with 0 on the left
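A small sketch tying these string methods together (my_string / my_str are placeholder names from the notes):

my_string = "python is Fun"
print(my_string.lower())       # 'python is fun'
print(my_string.swapcase())    # 'PYTHON IS fUN'
print(my_string.title())       # 'Python Is Fun'
print(my_string.capitalize())  # 'Python is fun'
print("*".join("Python"))      # 'P*y*t*h*o*n'

my_str = "42"
print(my_str.zfill(10))        # '0000000042' -> padded with zeros to length 10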

Class 36

LinkedIn https://www.linkedin.com/learning/python-for-data-science-essential-training-part-1/what-you-should-know?autoSkip=true&autoplay=true&resume=false&u=138505057
Krish Naik https://www.youtube.com/watch?v=bPrmA1SEN2k&list=PLZoTAELRMXVNUL99R4bDlVYsncUNvwUBB

Tutorial 1 Operators
type(1)
Data Types
Format Paste
len('Pranay')

Tutorial 2 Data Types


Shift + Tab
Logical Operators
List
List Functions

Tutorial 3 Sets
Dictionaries

Tuple

Tutorial 4 Array
Numpy
arr.shape
Creating Array
(Rows, Columns)
Indexing
np.linspace(1,10,50)
copy function
np.arange(1,10)
np.ones(4) || np.ones((2,5))
np.random.rand(3,3)
np.random.randn(4,4)
np.random.randint(0,100,8).reshape(4,2)
Tutorial 5 & 6 Pandas
pd.DataFrame()
pd.to_csv('path/link//
file_name.csv')
.loc | .iloc
df.isnull().sum()
pd DF functions
String to csv

Tutorial 7 pd.read_json()
html (page link) data read
Read Excel
Pickling

Tutorial 8 Matplotlib
For multiple graphs
More graphs code
Pie charts
Tutorial 9 Univariate Vs Bivariate Vs
Multivariate Analysis
Distribution of Plot in Seaborn
Seaborn
Tutorial 10 Seaborn
Tutorial 11 EDA (Codes)

Tutorial 12 Functions
Print Vs Return
Add Function
Default value in function
Even Odd Sum function

Tutorial 13 Lambda Function

Tutorial 15 Map Function

Tutorial 16 Filter Function


Filter with Lambda
Tutorial 17 List Comprehension
Tutorial 18 String Formatting
Tutorial 19 Iterables vs Iterators
Tutorial 20 OOPS
Class

Class 24 Advanced Python


Exception Handling

Class 25 Custom Exception Handling

Class 26 Inheritance In Python



1+1 = 2, 2*5 = 10, 5**2 = 25, 10/2 = 5.0, 10%2 = 0


Check Datatype
Integer, Float, String, Bool
print("My Name is {First} & last Name is {Last}".format(first = first_name, last = last_name))
Gives Length >> 6
x.isalnum() >> alpha numeric >> Num + Alphabet
x.istitle(): 1st letter caps
To show detail
and
or
List: mutable, changeable, ordered sequence
- Can contain any value; multiple values in a list are comma separated
- Can add elements
lst*5 - elements will be repeated 5 times (not multiplied by 5)
set_2.intersection_update(set_1) - common records in both sets; the output is saved in set_2
for i in dict_1.items():
    print(i)  # will print both key & value
dict_1["Car1"] = "Renault"  # will replace the "Car1" value
Tuple: cannot be changed; defined with ()

Contain same Datatypes


import numpy as np
arr.reshape(3,5)
- If there are total 15 elements, then it can be reshaped only with multiple of 15. Eg: (3,5) , (5,3), (1,15)
- Array is too fast

arr[0:2, 0:2]
Give value between 1 to 10 in 50 parts. Eg: [1, 1.18, 1.36 .... 9.81, 10]

will give value in array from 1 to 9


np.ones() - create array with value 1
np.random.rand(3,3) - array with random values, 9 elements
np.random.randn(4,4) - array with random values from a normal distribution
np.random.randint(0,100,8).reshape(4,2) - create random integer values from 0 to 100 with 8 elements, 4 rows & 2 columns

import pandas as pd
To create DF
To export data in csv
pd.read_csv('path', sep = ',') # It can be changed from "," to any other if required
df.iloc[:,:] # will show all columns & rows

df["Col_1"].value_counts()df["Col_1"].unique()df[[Col1, Col2]]df.head()df.info() # shape, memory us


from io import StringIO, BytesIO

pd.read_html('url', match = "Any word to match, if there are multiple tables on the page", header=0)
pd.read_excel("path//fileNm.xlsx", sheet_name = 1)
Pickling: excel file converted/saved into pickles
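A hedged sketch of the pandas I/O calls above; the file names and URL are placeholders, not the course's files:

import pandas as pd

df = pd.DataFrame({"name": ["A", "B"], "sales": [100, 200]})   # create DF
df.to_csv("file_name.csv", index=False)                        # export data to csv

df2 = pd.read_csv("file_name.csv", sep=",")    # separator can be changed if required
print(df2.isnull().sum())                      # null count per column
print(df2.loc[:, ["name"]])                    # label-based selection
print(df2.iloc[:, :])                          # position-based selection (all rows & columns)

# tables = pd.read_html("https://example.com/page", match="Any word", header=0)  # read HTML tables
# xls = pd.read_excel("file_name.xlsx", sheet_name=1)                            # read 2nd sheet
df2.to_pickle("file_name.pkl")                 # save DF as a pickle
df3 = pd.read_pickle("file_name.pkl")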

plt.plot()
plt.subplot(2,2,2)  # multiple graphs; graph is shown on the 2nd index; options: color, line format, line width, etc.
plt.plot(x,y)
plt.hist(y)
plt.boxplot(Num_Col)
plt.axis('equal')
plt.show()
2 Features (F1, F2) >> Bivariate Analysis
>2 Features (F1, F2, F3, .... Fn) >> Multivariate Analysis
jointplot, pairplot
sns.pairplot(df, hue = 'Gender')  # Will show M/F separate graph in every chart
sns.distplot(df['Y_var'])  # Creates histogram with distribution line
sns.violinplot("Gender", "Age", data=DF)  # Gives violin-shaped graph
https://github.com/krishnaik06/EDA1/blob/master/EDA.ipynb

starts with "def" keyword


But if inplace of print, "return" was there, then it show the value while printing x
val # Output will be 10
hello(*lst, **dict_args)
odd_sum += i # output >> ('Pranay', 'Singh') {'age': 26, 'dob': 1997}
return even_sum, odd_sum
- Works faster from other function
-addition
Only Single Operator
= lambda a,b: at
a+ba time
addition(12,24) # Output >> 36

- Without using a loop, it runs over all the records of a list


list(map(odd_even,lst))
list(filter(even,lst)) # Show only Even records
list(filter(lambda num: num%2 == 0, lst))
[i*i for i in lst if i%2==0]
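A small sketch combining lambda, map, filter and list comprehension (lst is a placeholder list):

lst = [1, 2, 3, 4, 5, 6]

addition = lambda a, b: a + b
print(addition(12, 24))                              # 36

odd_even = lambda n: "Even" if n % 2 == 0 else "Odd"
print(list(map(odd_even, lst)))                      # label every record without an explicit loop
print(list(filter(lambda num: num % 2 == 0, lst)))   # keep only even records
print([i * i for i in lst if i % 2 == 0])            # squares of even records: [4, 16, 36]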
string("Pranay", 25)
next(itr)

A class can contain different functions (methods) & attributes >> Window, Door, Mirror, etc.

Eg:
class Car:
    def drive(self):
        return("This Car is {} car".format(self.enginetype))

finally:  # It will always run, whether the code is correct or an exception is raised
    print("Code ran successfully!")

Py File
w?autoSkip=true&autoplay=true&resume=false&u=138505057

Eg: (3,5) , (5,3), (1,15)

# shape, memory usagedf.describe() # count, mean, std, min, max, percentil # Only int & float col

e", header=0)
t

h
centil # Only int & float columns takendf.Col1[df.Col1 > 100]
Percentile
get_dummies(): If we remove the 1st col (drop_first), then how will we identify whether it was significant or not?
Pass Vs Break command
Udemy Tableau Classes https://cashify.udemy.com/course/tableau10/learn/lecture/5618178#overview
Hover on 1st map & it show filter on another she
Class 31 Action Filter "Dashboard" >> "Actions"
Highliter

Class 37 Data Blending


Joining Data Vs Data Blending

Class 49 Creating Bins

Class 50 Parameters with Top 10,20,30... Values

Class 58 Data Interpreter In Data Source Sheet, check box after importing data; removes null rows & columns

Class 59 Pivot In Data Source Sheet, mark columns in Data and select pivot

Class 60 Splitting Column into Multiple columns In Data Source Sheet, right-click on column & Split

Class 67 Analytics Tab in Sheets Cluster

Advanced Tableau Classes https://cashify.udemy.com/course/tableau10-advanced/learn/lecture/5843098#overview

Class 10 Grouping the Data

Class 11 Sets

Class 13 Combining Sets

Class 14 Controlling sets with Parameters

Class 48 Increase size of small bubbles in chart

Class 51 Tableau Animation


Class 57 LOD Calculation (Level of Detail) - Fixed, Include, Exclude
LOD Syntax: { INCLUDE [Customer Name] : SUM([Sales]) }
Python Packages
Apache Spark
RDD

Logging

apportion

Databricks

JSON

By Jatin
https://towardsdatascience.com/a-complete-data-science-roadmap-in-2021-77a15d6be1d9
https://medium.com/coriers/data-engineering-roadmap-for-2021-eac7898f0641

For Python Practice


https://pynative.com/python-if-else-and-for-loop-exercise-with-solutions/#h-exercise-1-print-first-10-natural-numbers-using-while-loop

For SQL Practice


sqlzoo
leetcode
Used to connect with very huge datasets
For Python we use library: Pyspark
RDD: Resilient Distributed Dataset
Import pyspark
from pyspark.sql import SparkSession
Create a session using SparkSession

Basically used to capture errors & warnings and show it in graphical way
Helps to create text files

Apportioning extra values to previous values in the same ratio

AWS, cloud, azure

JavaScript Object Notation

EG >> A: 50, B: 30, C: 20, D: 10 || Callaboration of D in others
A: 50+5, B: 30+3, C: 20+2
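A small sketch of this apportioning idea, redistributing D's value to A, B, C in proportion to their own values (numbers taken from the example above):

values = {"A": 50, "B": 30, "C": 20, "D": 10}
extra = values.pop("D")                      # value to be apportioned
total = sum(values.values())                 # 100

apportioned = {k: v + extra * v / total for k, v in values.items()}
print(apportioned)                           # {'A': 55.0, 'B': 33.0, 'C': 22.0}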
Constraints

Trigger

Index

Window function
Rank() vs Dense_Rank()
CTE (Common Table Expression)
Create Views
NOT NULL
UNIQUE
PRIMARY KEY
FOREIGN KEY
CHECK: This constraint helps to validate the values of a column to meet a particular condition. That is, it helps to ensure that the value stored in a column meets a specific condition. || Like a dropdown in Excel
DEFAULT: This constraint specifies a default value for the column when no value is specified by the user.

- UPDATE TRIGGER
- DELETE TRIGGER
- BEFORE UPDATE
- BEFORE DELETE

- Indexes are used to retrieve data from the database very fast

- with NTILE()
- with lag() & lead()

- Enable users to maintain complex queries via increased readability & simplification
- Can be accessed in SELECT, INSERT, DELETE, UPDATE, MERGE statement
- To create views, for making Tableau reports
https://cashify.udemy.com/course/sql-and-postgresql/learn/lecture/22800007#overview

Class 3 Add Data


Read Data
Update Data
Delete Data

Class 5 Creating Table

Class 7 Insert Data

Class 9 Math Operators


Alias name

12 String Operators

13 Where condition

14 Comparision Operators

22 UPDATE

23 DELETE

28 Relationships

30 Primary Keys
Foreign Keys

32 SERIAL

40 DROP

On DELETE Option what happens



INSERT INTO cities (name, country, population)
VALUES ('Lucknow', 'India', 15000);

SELECT * FROM cities;

UPDATE cities
SET population = 20000
WHERE name = 'Lucknow';

DELETE FROM cities
WHERE name = 'Lucknow';
CREATE TABLE cities (
    name VARCHAR(50),
    country VARCHAR(50),
    population INTEGER,
    area INTEGER
);
INSERT INTO cities (name, country, population, area)
VALUES ('Lucknow', 'India', 5600, 2400),
('DC', 'US', 3400, 300);
+ Add, - Subtract, * Multiply, / Divide, ^ Exponent, |/ Square Root, @ Absolute Value, % Remainder

|| Join 2 strings, concat(), lower(), length(), Upper()


Eg: concat(name, ' ', country) = name || ' ' || country

> < = <> != >= <= IN NOTIN BETWEEN


UPDATE cities
SET population = 200000
where name = 'Tokyo'
DELETE FROM cities
where name = 'Tokyo'
One to Many
Many to Many

Primary Key:
- Unique
- Not Null
- 1 table can have 1 Primary Key
Foreign Key:
- Primary Key of the other table
- 1 table can have multiple Foreign Keys

SERIAL example:
id SERIAL PRIMARY KEY,
username VARCHAR(50)

Delete complete table


DROP TABLE photos;

When a tbl 1 PRIMARY KEY is used as a FOREIGN KEY in another table, & we try to delete a record from tbl 1, the ON DELETE option decides what happens (eg: RESTRICT, CASCADE, SET NULL, SET DEFAULT)
Classes Topics
Class 12 (1) Why Statistics
Structure
Population
Sample
Parameter
What is Statistics
How Samples are
created
Types of Stats
Descriptive Stats

Class 12 (2) Variability


Standard Deviation
Squared Deviation
Variance Formula
Data Shape
Type of Shape
Normal Distribution
Asymmetrical
Distribution
Negative Skewness
Positive Skewness
Kurtosis

Class 13 (1) Sample Types


Traditional Way
Data Science Way
Central Limit
Theorem
3/6 Sigma Rule
Inferential Stats
Probability
z score
AUC

Hypothesis Testing
Significance Level

Z Test

Class 14 (1) ND & SND


Significance Level
Confidence Level
Hypothesis Testing
AUC
z value (Hypo Test)
Standard Error
Common z scores
used
Class 14 (2) Error Types
T Test
Z Test used
Degree of Freedom
t Test types
Dependent t Test
Independent t Test

Class 15 (1) ANOVA


Consider groups to be
different from each
other
F-value
Parametric vs Non
Parametric Test
Chi Square Test
Correlation
Test we have to
perform in Py
Py Implementation
Scatter Plot Py

Stats by Krish Naik


Class 2 Stats

Class 4 Random Variable


Description
Why Statistics:
1. Describe features/vars/cols
2. Relation b/w 2 cols/features & vars
What is Statistics: Descriptive Stats, Sampling, Probabilities, Distributions
Population: Total no.
Sample: Subset of the Population
Parameter: When a mathematical calculation is performed on the population, the outcome is a Parameter
Statistic: When a mathematical calculation is performed on sample data, the outcome is a Statistic
How samples are created: Random Sampling
Descriptive Stats: To better understand the data; mostly it is Univariate
Measure of Frequency: Count of distinct (Hist chart)
Std Deviation: SQRT of Variance || Variance: Avg Squared Deviation
Sample Std Dev: SQRT[ SIGMA(Sample - Sample Mean)^2 / (No. of Samples - 1) ] || SQRT( sigma(X - Xbar)^2 / (n-1) )
Squared Deviation: (X - Xbar)^2
Population Variance (sigma^2): sigma(X - u)^2 / N || u = Population mean, N = No. in Population
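A quick numeric sketch of these formulas (the sample list is made up):

import numpy as np

x = np.array([4, 8, 6, 5, 3, 7])
xbar = x.mean()

squared_dev = (x - xbar) ** 2                    # (X - Xbar)^2
sample_var = squared_dev.sum() / (len(x) - 1)    # sigma(X - Xbar)^2 / (n - 1)
sample_std = np.sqrt(sample_var)                 # SQRT of variance

print(sample_var, sample_std)
print(x.var(ddof=1), x.std(ddof=1))              # same values via NumPy (ddof=1 -> sample formula)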
Each dataset has a shape. Eg: Histogram
Symmetrical: When divided from the center, gives a mirror opposite of 50% of the whole data || Eg: Rectangular, Uniform, Gaussian, etc.
Normal Distribution: Also known as Gaussian Distribution -- bell shaped
Asymmetrical Distribution: Skewed curve -- Mean != Median != Mode
Negative Skewness (Left Skewed) -- Mean < Median < Mode
Positive Skewness (Right Skewed) -- Mode < Median < Mean
Kurtosis - Leptokurtic: Abnormally high peak || Eg: Income data collected from Govt quarters only

Sample Types: Traditional Way, Data Science Way
Traditional Way - Eg: Simply asking any 6 people from the population will be a sample
Data Science Way - Eg: 6 people collect different samples of any 10 (sub-samples), then the average of these sub-samples is used (Central Limit Theorem)
3/6 Sigma Rule - For Normally Distributed Data
- Rule 1: +- 1 STD will cover 68.2% of the data
Inferential Stats: Probability, Hypothesis Testing
z score: (x - u) / STD || u = Mean; use the z score table || Only for Normally Distributed data; gives the percentile, AKA P-value
AUC: Area under the curve
Hypothesis Testing - Proving something statistically
- Null Hypothesis (u = xbar), Alternative Hypothesis (u != xbar)
Z Test:
- No. of samples should be > 30
- Significance level: normally 5% or 1%

- ND: Normal Distribution
- SND: Standard Normal Distribution, same as the Normal Distribution, but here u = 0 & STD = 1
Significance Level: It is decided by the client or by the Data Scientist; usually 5% or 1% are considered
Confidence Level: 1 - Significance Level || Eg: Significance Level = 0.05 or 5%, Confidence Level = 0.95 or 95%
- One Tailed Test: When the significance level is at 1 side of the distribution
- Two Tailed Test: When the significance level is at both sides of the distribution
P-value tells the AUC
z value (Hypo Test): (xbar - u) / (STD / SQRT(n)) || u = Population Mean, xbar = Sample Mean, n = Total no. of samples
Standard Error: STD / SQRT(n)
Common z scores used:
- 90% : 1.645
- 95% : 1.96
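A small sketch of the z value and standard error formulas above (the numbers are illustrative):

import math

xbar, u, std, n = 52.0, 50.0, 8.0, 64     # sample mean, population mean, population STD, sample size

standard_error = std / math.sqrt(n)        # STD / SQRT(n)
z = (xbar - u) / standard_error            # (xbar - u) / (STD / SQRT(n))

print(standard_error, z)                   # 1.0, 2.0 -> |z| > 1.96, so significant at the 5% level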
Error Types:
- Type 1 Error: Rejecting the Null while it should be accepted
- Type 2 Error: Accepting the Null while it should be rejected
T Test - Used when: n < 30
Z Test - Used when: n > 30, u & STD are known
Degree of Freedom - Formula: n - 1
t Test types:
- Dependent (Paired): Subjects b/w the 2 samples are the same; equal no. of samples
- Independent (2 Sample): No. of samples should be almost the same; STD should be the same
ANOVA: Analysis of Variance
- Uses the F distribution
- Variance within a group is low >> better
- Variance b/w the groups is high >> better
- F-value: Variance b/w the groups / Variance within the group
- Consider groups to be different from each other
Parametric vs Non-Parametric Test:
- Parametric Test: Traditional method, a proper probability distribution is involved, concepts of mean are there
Chi Square Test: Used only for categorical vars; diff b/w observed & expected
Correlation: Helps to understand the nature of the relationship b/w 2 num samples; check +ve, -ve or no relationship
Test we have to perform in Py: T test - One Sample T test >> 1 sample vs 1 value
Py Implementation:
- import scipy.stats as stats
- import matplotlib
- from matplotlib import pyplot as plt
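A minimal one-sample t test sketch with scipy.stats, matching the "1 sample vs 1 value" note (the data and the hypothesised mean are made up):

import scipy.stats as stats

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2]
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)  # H0: population mean = 12.0

print(t_stat, p_value)
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")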

https://www.youtube.com/watch?v=Vtvj6fPZ1Ww&list=PLZoTAELRMXVMhVyr3Ri9IQ-t5QPBtxzJO&index=2
Descriptive Stats: Include central tendency, summarizing in the form of numbers & graphs, focus on describing the visible characteristics of a dataset
Discrete Random Variable
LAMBDA Function >> Code >> To call out come calculations, functions, etc
Amazon EventBridge
EC2 Bucket, which contains several folders, inside that CSV file is there
S3 Files Can be pushed in S3 by: Boto 3, Py Code, etc
CloudWatch Monitoring >> Uses Consumption >> How much load on data/server, etc (Like Google Analytics)
IAM Users >> Login Credentials, No of users, etc

Interview Q
Entropy
Deflection
Classes Topics
Class 16 X & Y Variable
Linear Regression
Logistic Regression
B/s Problem
Learning Setup
Discipline
Parametric & Non
Parametric
y = mx + c
Linear Regression
equation
Statistical way (b & c
Formula)
ML Process
SSE (Sum of Squared
Error)
OLS
OLS py
R-squared
SSR
R^2 (Accuracy)
SST
Adjusted R-Squared
Bita
Generalization
Model Validation
Underfitting &
Overfitting

Class 17 Linear Regression


pandas_profiling
Coeff of Variance
Pre-Modeling / EDA

Y_Var Assumptions
Data Preparation Level
2: Assumptions
Data Preparation Level
3: Feature Reduction
RFE
F - Regression
K - best
VIF
Corr b/w X & X
Data Preparation Level 4
Model Implementation
Post Modeling

Class 18 MAPE
RMSE / MSE
Correlation
Decile Analysis
Logistic Regression /
Classification
GLM
Logistic
Confusion Metrix
Threshold Value
Concordance
Discordance
Decile Analysis
KS Value

Class 19 Bankloans Data

Feature Reduction
Techniques
WOE
Sommer's D (Ginni)
Sommer's D Py
VIF
Model Implementation
Model Evaluation
Threshold Value
Model Evaluation

IQR
Outlier
Quartile

fuzzywuzzy
fuzz.ratio
fuzz.partial_ratio
fuzz.token_sort_ratio
process.extract

Label Encoding
BM25
NMSLIB
Tocken Vectorizer
Description

B/s Problem - Eg: factors affecting sales
Learning Setup:
- Optimisation >> Used to maximise or minimise || Eg: A company wants to maximise its sales
- Reinforcement Learning >> Includes "sticks & carrots" or "rewards & punishment" || Deep Learning setup (ANN, etc.)
- Bayesian >> Naïve Bayes
- Ensemble >> Random Forest, Bagging, Boosting
Parametric & Non-Parametric:
- Parametric >> Linear & Logistic
- Non-Parametric >> DT, KNN
y = mx + c
For multiple X vars: y = m1x1 + m2x2 + m3x3 + ….. + c
- b: slope, beta or coefficient; c: constant, alpha or intercept
Statistical way (b & c formula):
- Beta (b): For every unit increase in an X var, Y will change by b || b = corr * (stdev_y / stdev_x)
- Constant (c): Mean_y - (b * Mean_x)
ML Process:
- 5: Identify the objective >> Eg: Minimise error [Sum of Squared Error (SSE)]
- 6: Convert the whole problem into an optimization problem
SSE: (Y - YBar)^2 || Y: Actual Y, YBar: Predicted Y
OLS - Ordinary Least Squares Regression >> Come up with the best fit line which has minimum SSE
- Remove X vars which have > 0.05 P-value
- Pred_Y: model.predict(df)
R-squared: Accuracy || It should be high while building the model
SSR: (Predicted_Y - Avg_Y)^2 || Best fit line vs Base model || Base Model: Straight line at the Avg
R^2 (Accuracy): SSR / SST
SST (Sum of Squares Total): SSE + SSR
Adjusted R-squared
Beta: In a stats model, for every unit increase in X, Y will change by beta. But comparing raw betas is not the correct way; 1st we need to convert the beta values into a standard format.
Generalization: How the model fits in a real scenario
Model Validation: Train & Test, or Model Development & Validation
- Underfitting: Model performing bad on the Training Data
- Overfitting: Model performed well on Training but failed on Testing

pandas_profiling: Used to see an HTML summary of a DF
- Go to the anaconda prompt >> pip install pandas_profiling OR conda install pandas_profiling
- import pandas_profiling
- report = pandas_profiling.ProfileReport(df)
- report.to_file('report.html')

Pre-Modeling / EDA:
- Understand the data >> df.describe()
- Boxplot to identify outliers
- Histogram to check whether the Y var is normally distributed or not
- Drop rows where the Y var is null >> df.dropna(axis = 0, subset = ['Y_Var'])
- Check for special characters
Y_Var Assumptions: If the Y var is not normally distributed, then perform a log transformation to bring it to normal
Data Preparation Level 2:
- Dividing data into Y var, Num & Cat datasets
- Col name correction >> df.columns.str.replace(".", "_")
- Outlier & missing value treatment for Num_Vars
- Replace missing values with the mode for Cat_Vars
- Changing data types >> df.ColNm.astype("float64")
- df.ColNm.str.split("-", expand = True)
- Create dummies for Cat Vars >> pd.get_dummies(df_cat, drop_first = True)
- Combine datasets >> pd.concat([df_Num, df_cat_dummies, df_Y], axis=1)
Coeff of Variance: Variance / Avg(x) || Variance: (x - Avg(x))^2
Assumptions:
- X_Vars should be correlated with Y_Var
- X_Vars should not be correlated with each other
- No missing values, no outliers
- Not suffering from heteroscedasticity (variance is increasing over a period of time)
RFE >> Recursive Feature Elimination (Hypothesis Tests)
Data Preparation Level 3: Feature Reduction
- Check whether the Y Var is normal or not: If np.log() can't normalise it, then there are several other ways to do it (can search on Google, eg: squared, cube, etc.)
- RFE >> Recursive Feature Elimination >> It internally runs several iterations (Hypothesis Tests) >> Y ~ X1 + X2 + X3…
  - It drops 1 var in each iteration; eg: if there are 10 X vars and we want 3 vars, then it will run 7 iterations
  - Here we can provide the number of X_Vars we want (n = ?)
- F-Regression >> aka Univariate Regression >> It runs an iteration of Y_Var vs every X_Var and gives the P-Value & F-value, dropping a single X var in each iteration: Y ~ X1, Y ~ X2, Y ~ X3, ….
  - Sorting on min P-Value or max F-Value, we choose the number of X_Vars we want
  - from sklearn.feature_selection import f_regression
- K-Best >> Select the top X vars with min P-Val or max F-Value || K means the number of X_Vars
  - And F-classifier or Chi-Square test, whichever we want to perform
  - from sklearn.feature_selection import SelectKBest, f_classif, chi2
- VIF >> Variance Inflation Factor >> Check collinearity b/w X & X
  - If the VIF value is > 2 then we drop the X_Var with max VIF & re-run, and again run the iteration
  - from statsmodels.stats.outliers_influence import variance_inflation_factor
  - from patsy import dmatrices
  - VIF = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]  # df.values will give arrays of all columns
  - VIF = pd.Series(VIF)
  - col = pd.Series(df.columns)
  - VIF = pd.concat([col, VIF], axis = 1)
- Corr b/w X & X >> Will be taking only those variables whose correlation falls under -0.5 to 0.5 || Its range is -1 to 1
Data Preparation Level 4:
- Data should be on the same scale (used in KNN, K-Means, PCA, etc.)
- Split data into Training & Testing
  - from sklearn.model_selection import train_test_split
  - Train & Test = train_test_split(df, test_size = 0.3, random_state = 123)
Model Implementation:
- import statsmodels.formula.api as smf
- model = smf.ols('Y ~ X1 + X2 + X3….', data = train).fit()
- print(model.summary2())
- Drop the X var which has the highest P-value & re-run the iteration
Post Modeling:
- Implement the prediction model on Train & Test || If we have taken log of Y, then we have to take its exponential
- Train_Pred = np.exp(model.predict(train))
- Test_Pred = np.exp(model.predict(test))
- Check accuracy, MAPE, SSE, RMSE, Correlation; it should be similar for both Train & Test / positive & high
MAPE: Mean Absolute Percent Error (0 to 100)
- Should be as low as possible
- np.mean(np.abs((Actual_Y - Pred_Y) / Actual_Y)) * 100
RMSE / MSE:
- Gives the overall error
- from sklearn import metrics
- metrics.mean_squared_error(Y, Y_Pred); take the root of it to get RMSE
- Train & Test RMSE / MSE should be similar
Correlation:
- Corr b/w Y & Pred_Y should be positive & very high
- stats.stats.pearsonr(Y, Pred_Y)
Decile Analysis:
- Sort Y & Pred_Y data || group them in 10 parts
- pd.qcut(train['Decile_no'], 10, labels = False) || will give numbering which helps to sort the df
- Group by it on deciles with the Avg of Y & Pred_Y
- Take the Avg of each part & make a line chart || the lines should overlap each other; if they show a difference at any level then it is easy to identify
- Training & Testing should have similar values; it should be similar for Train & Test
Logistic Regression / Classification:
- It uses a function to normalize Y values
- It has 2 functions to do so: Probit, Logit
- Sigmoid Curve >> S-shaped curve
- The Logit function internally converts Y (0 & 1) to normal values (0.29, 0.78, etc.) as probability: exp(mx + c) / (1 + exp(mx + c)) = Probability
- Here, the Y value is 0 or 1, aka a Bernoulli Distribution, so we cannot come up with a best fit line directly
- But there are functions which bring it into a Normal form >> GLM; then we can come up with a best fit line
- When we use GLM with the Logit function, it is known as Logistic Regression
- Solves Binary / Binomial Problems & Multiclass Problems

Confusion Matrix: True Positive, False Positive, False Negative, True Negative
- Sensitivity (Recall) >> TP / (TP + FN) || Should be as high as possible
- Specificity >> TN / (FP + TN) || Should be as high as possible
- Precision >> TP / (TP + FP)
- Accuracy >> (TP + TN) / (TP + TN + FP + FN)
- F1 Score >> 2 * (Recall * Precision) / (Precision + Recall) || Higher is better

Threshold Value:
- A cutoff point which decides the value to be 0 or 1
- Total 1's / (Total 1's + Total 0's)
- ROC (Receiver Operating Characteristic Curve) >> Gives the optimal threshold value

Concordance / Discordance:
- Concordance >> Sum(Concordance) / Sum(Discordance) || If > 1 then good, else bad
- Discordance >> 1 - Concordance

Decile Analysis / KS Value:
- In the 1st 4 to 5 deciles, we should have 70 to 80% of the customers
- KS Value: Max[Abs(Bad - Good)] >> It should be in the top 5 deciles
- Lift chart

Class 19 Bankloans Data (Classification):
- EDA
- Sometimes we get 2 different datasets >> Prev & New
- Sometimes we get a single dataset as Actuals & New
- Correlation b/w X & Y should be partially fulfilled >> df.corrwith(df.Y)

Feature Reduction Techniques:
- WOE (Weight of Evidence) >> If X_Vars include a categorical-like var, eg: income, we will bin it in 4 parts
- Sommer's D (Gini) >> Y ~ X1, Y ~ X2, …. || Just like F-Regression, but run with X_Vars individually || Code in image >>
  - AUC_score = metrics.roc_auc_score(Y_var, p)
  - SommersD = 2 * AUC_score - 1
  - Sommer's D should be high, so it's a good model
  - This is for all X_Vars, but we need it for each X_Var individually; take the vars which have a higher SommersD value
- VIF
- RFE, OR with a Random Forest Classifier
- Correlation b/w X & X

Model Implementation:
- import statsmodels.formula.api as sm
- model = sm.logit('Y_Var ~ X1 + X2 + ...', df).fit()
- print(model.summary2()) || Remove X vars which have a high P-value and re-run the model
- p = model.predict(df)

Model Evaluation:
- Calculate Sommer's D for Train & Test || it should be high & similar
- AUC score
- Accuracy
- Concordance: (Concordance - Discordance) / (Concordance + Discordance + Tie) (have to explore from Google)
- Py >> print(metrics.classification_report(Actual_Y, Pred_Y))

Threshold Value:
- Thres = Total 1's / (Total 1's + Total 0's)
- < Thres = 0 || > Thres = 1
- OR >> Max(Sensitivity + Specificity)
- Decile Analysis (KS Value)
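A hedged sketch of turning predicted probabilities into classes with a threshold and reading the confusion-matrix metrics (the data is made up):

import numpy as np
from sklearn import metrics

actual_y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_prob = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.5, 0.9, 0.1])

threshold = actual_y.mean()                   # Total 1's / (Total 1's + Total 0's)
pred_y = (pred_prob > threshold).astype(int)  # < threshold -> 0, > threshold -> 1

print(metrics.confusion_matrix(actual_y, pred_y))
print(metrics.classification_report(actual_y, pred_y))
auc = metrics.roc_auc_score(actual_y, pred_prob)
print("AUC:", auc, "Somers' D:", 2 * auc - 1)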

IQR (Interquartile Range): Q3 - Q1
Outlier:
- Upper: Q3 + 1.5*IQR
- Lower: Q1 - 1.5*IQR
Quartile:
- Q1: 1/4 * (n+1)
- Q3: 3/4 * (n+1)

fuzzywuzzy:
pip install fuzzywuzzy
pip install python-Levenshtein
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio('geeksforgeeks', 'geeksgeeks')                         # 87
fuzz.ratio('GeeksforGeeks', 'GeeksforGeeks')                      # 100
fuzz.ratio('geeks for geeks', 'Geeks For Geeks ')                 # 80
fuzz.partial_ratio("geeks for geeks", "geeks for geeks!")         # 100
fuzz.partial_ratio("geeks for geeks", "geeks geeks")              # 64
fuzz.token_sort_ratio("geeks for geeks", "for geeks geeks")       # 100
fuzz.token_sort_ratio("geeks for geeks", "geeks for for geeks")   # 88

query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
# [('geeks geeks', 95), ('g. for geeks', 95), ('geek geek', 93)]
# If we want only the top one
process.extractOne(query, choices)
# ('geeks geeks', 95)
- Nominal Encoding: for categorical vars with no inherent order (eg: dummy / one-hot vars)
- Ordinal Encoding: for categorical vars with an inherent order (eg: low < medium < high), numbered accordingly
Classes Topics
Class 21 Customer Segmentation
Types of Techniques
Regression Problems
Classification Problems
Segmentation Problems
Forecasting Problem

Gradient Descent Algorithm


Gradient Ascent Algorithm

Class 22 Bias
Regularization
Cross Validation - Kfold
Validation

KNN
Similarity Metrics
Scaling of Data

Distance
Correlation
Cosine Similarity
Z_transformation (Standard
Scaler)
Min-Max Scaler

Weightage
Uniform & Distance
Parameters for KNN
GridsearchCV
Best Fit Line
KNN Imputation

Class 23 Packages Used in Python


Feature Selection
KNN for Classification Problem
Standardise Data
GridsearchCV
KNN py
KNN for regression

Naïve Bayes
Probability Understanding in
detail
Limitations of NB
Types of NB in py
NB py

Class 24 Decision Tree


DT Vs Others
Decision Boundry
Decision Tree Classifier
Nodes
Splitting Criteria
Stopping Criteria
Tunning Parameters
Gini
Entropy
Algorithms (Types of DT)

DT Regressor
DT Segmentation

DT as Feature Reduction
Technique
DT in Grouping Vars

Class 25 Advantages of DT
Disadvantages of DT
Py Implementation
pydotplus
Graphviz
DT Tunning parameters
DT Tunning Parameters
DT Feature importance

Ensemble Learning
Classification of ensemble models
Homogenious Ensembling
Hetrogenious Ensembling
Bagging
Bag vs Out of Bag
Tunning Parameters of EL
Bagging & Random Forest
Py Implementation

Class 26 Boosting Algorithm


Adaboost
Gradient Boosting
Description
Customer Segmentation: Dividing customers into groups such that within a group there is similarity & b/w groups there is dissimilarity
Regression Problems: Bagging Regressor, Random Forest Regressor, XGBoost Regressor, K-Nearest Neighbors Regressor, Support Vector Regressor, Artificial Neural Network Regressor
Classification Problems: K-Nearest Neighbors Classifier, Support Vector Classifier, Artificial Neural Network Classifier, Naive Bayes Classifier
Segmentation Problems: Scientific Segmentation - K-Means/Medians Clustering, Hierarchical Clustering, DBScan Clustering (Density Based Clustering)
Forecasting Problems: Using Regression - all Regressor techniques can be used

Gradient Descent Algorithm:
- Used to minimise SSE by changing Beta (X) values
- It helps to adjust Betas such that in the next iteration, the value of the objective function will be decreased

Gradient Ascent Algorithm: Helps to solve maximization problems

Bias: Bias means Errors
Regularization: Helps to reduce the problem of overfitting by giving less importance to insignificant variables & high importance to significant vars
Cross Validation - Kfold:
- Helps to reduce the problem of overfitting by validating the model while building it
- Eg: Train: 700 rows & 20 vars, Test: 300 rows & 20 vars, take K=5
- It will divide 700 into 5 parts (i.e., 140 each)
- Refer Fig. M4 as the final model; it can give max accuracy on Test
- If the data size is less, then the K value should be high, & vice versa

KNN:
- KNN Classifier: Classification Problem
- KNN Regressor: Regression Problem
- KNN Imputation: Missing Value Imputation
Similarity Metrics: Distance, Correlation, Cosine Similarity
Scaling of Data: Z_transformation (Standard Scaler), Min-Max Scaler

Distance:
- SQRT[(x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 + ….]
- Here X & Y are both X_Vars of different rows
- Rank observations based on similarity for a given observation
- Low means similar
Correlation:
- Corr of Row 1 data vs other row data
- High means similar
Cosine Similarity:
- Finding similarity by angle; the smaller the angle, the higher the similarity
- High means similar
Z_transformation (Standard Scaler): Z_Age = (Age - mean(age)) / std(age)
Min-Max Scaler: Transformed_Age = (Age - min(Age)) / range(Age) || Range = Max - Min

Weightage:
- Distance is less >> weight is more, and vice versa
- Weight = 1 / Distance
- Weight_final = Weight / sum(All Weights)
- Sigma_Weight_Final = Actual_Y1 * Weight_final_X1 + .....
- Weighted Average = if Sigma_Weight_Final > 0.5 then 1 else 0
- If we put K=5, it will look for 5 neighbours, and predict by taking their avg
- K: 3,4,5,6,7,...
Uniform & Distance:
- Simple Avg: weights are uniform
- Weighted Avg: weights depend on distance
- Whichever combination gives higher accuracy, that is best

Parameters for KNN:
- K is low >> more fluctuating line
- K is high >> smoother line
- So normally the K value is taken b/w 5 to 15
GridsearchCV: To find the best parameters
KNN Imputation:
- If any value is missing, then try to find out which rows are similar to it for the rest of the columns
- Find the nearest 3 or 5 neighbours and take their avg

Packages Used in Python:
- numpy & pandas
- pandas_profiling
- scipy
- matplotlib
- seaborn
- statsmodel >> Linear models, Time Series
- sklearn >> DT, Bagging (RF, Adaboost, Gradient, etc.), KNN, SVM, ANN, NB
- xgboost
- keras >> ANN
Feature Selection (RFE with Random Forest):
- from sklearn.ensemble import RandomForestClassifier
- rfe = RFE(RandomForestClassifier(), n_features_to_select = 6)
- RandomForestClassifier is not influenced by multicollinearity

Standardise Data:
- from sklearn.preprocessing import StandardScaler
- std_data = StandardScaler()
- std_data = std_data.fit(train_X)
- train_X_std_data = std_data.transform(train_X)
- train_X_std_data = pd.DataFrame(train_X_std_data, columns = train_X.columns)

GridsearchCV (helps to identify the best parameters):
- from sklearn.model_selection import GridSearchCV
- para_grid = {'n_neighbors' : [3,4,5,…,10], 'weights' : ['uniform', 'distance']}
- GS = GridSearchCV(KNeighborsClassifier(), para_grid, scoring = 'accuracy')
- GS.fit(train_X_std_data, train_y)
- GS.best_params_

KNN py:
- from sklearn.neighbors import KNeighborsClassifier
- Knn_model = KNeighborsClassifier(n_neighbors = 5, weights = 'distance')
- Knn_model.fit(train_X_std_data, train_Y)
- Knn_model.predict_proba(train_X)  # to predict probabilities
- Knn_model.predict(train_X)  # to predict 0 & 1
- Knn_model.predict(test_X)

KNN for regression:
- from sklearn.neighbors import KNeighborsRegressor
- for scoring >> 'r2'

Naïve Bayes - Probability Understanding in detail:
- Conditional Probability (when there are several conditions)
- Bayes Theorem (Conditional Probability) >> P(Y/X) = P(X/Y)*P(Y)/P(X) || P: Probability
- 1. P(Cust=Bad) + P(Cust=Good) = 1
- 2. If X & Y are 2 independent vars: P(X and Y) = P(X)*P(Y)
- 3. X = [X1, X2, X3], where X1, X2, X3 are independent vars: P(X/Y) = P([X1,X2,X3]/Y) = P(X1/Y)*P(X2/Y)*P(X3/Y)
- Try yourself: for solving classification problems

Limitations of NB:
- Can only predict if the new data has the same classes as the previous data
- If new data contains any other class which is not present in the X_Vars data, then it can't predict

Naïve Bayes Classifier:
- Naïve Assumptions (there is no multicollinearity: all vars are independent)
- All X_Vars should be categorical; Eg: if Income is there, then we have to bin it into 4 or 5 parts
- Or a numerical variable should be normally distributed
- Calculates P of Good and P of Bad, then the higher one is selected
- Bayes: Conditional Probability

Types of NB in py:
- Bernoulli NB: If X_Vars are categorical & binary (0,1)
- Multinomial NB: If X_Vars are categorical & multinomial
- Gaussian NB: If X_Vars are num & normally distributed
- NB works well if the data is categorical/discrete
- If X_Vars are mixed (Num, Cat): convert num vars into categorical & use Multinomial or Binomial depending on the X_Vars
- If X_Vars are num, then take the log of all vars to bring them to a normal distribution

NB py:
- from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
- nb_model = GaussianNB()
- nb_model.fit(train_X, train_Y)
- nb_model.predict(train_X)
Decision Tree:
- A tree structure which helps to take decisions
- Conditional Decisions / Rule Based Decisions
- NB Classifier doesn't have any parameters
- Classification Problem: Decision Tree Classifier
- Regression Problem: Decision Tree Regressor
- Segmentation Problem: Objective >> Customer Segmentation
- Eg: A call center has 3 depts >> Prepaid, Broadband, DTH. A customer facing an issue calls the service center & gets the following options to select >> Services (Postpaid, Prepaid), Network, Numbers / Billing >> a support guy is assigned. This helps to divide the data.

DT Vs Others:
- Takes care of non-linear relationships | while a linear model can't
- Takes care of interaction b/w vars | linear models can't handle interaction b/w vars
- It doesn't have any assumptions | linear models have assumptions (correlation, etc.)
- Doesn't require much data preparation | linear models require a lot of data preparation

Decision Boundary:
- Linear Decision Boundary
- Non-Linear Decision Boundary

Nodes: Root Node (top) >> Child Node (b/w top & last) >> Leaf Node (last)
Splitting Criteria: Chi Square >> by significance level >> P-value < 0.05; Gini; Entropy; Information Gain
Stopping Criteria: max_depth; min no. of nodes; no. of leaf nodes; no. of vars we want to perform (that means x vars will be taken, eg: gender, income, age); min no. of obs in each node
Tuning Parameters: To stop the criteria, we can put any of them, or a combination of them; GridSearchCV() to get the best combination >> whichever combination gives high Train & Test accuracy
Gini >> 1 - (Probability(1's %))^2 - (Probability(0's %))^2 || On node level || Min Gini = 0 & Max Gini = 0.5 || Best splitting criteria >> splitting criteria giving the least Gini
Entropy Formula >> -(1's%) * log(1's%, base 2) - (0's%) * log(0's%, base 2) || Best fit = Min Entropy
Information Gain >> Difference from the previously used Entropy

Algorithms (Types of DT):
- CHAID >> Chi-square Automatic Interaction Detection Tree | Splitting Criteria: Chi-square (Multi Split)
- CART >> Classification & Regression Tree | Splitting: Gini/Entropy (Binary Split)
- C5.0 / ID3 >> Classification Trees | Splitting: Information Gain (Binary Split)
- Multi Split >> A node can distribute into multiple child nodes
- Binary Split >> A node can distribute into 2 child nodes only
- Mainly CART is used; it is faster

DT Regressor:
- Split on the basis of the Y_Var (num)
- Splitting Criteria >> F-Value >> split on highest; MSE >> split on lowest
- Stopping criteria is the same

DT Segmentation:
- DT Classification can be used as segmentation
- For eg: a particular group is giving the most bad customers

DT as Feature Reduction Technique:
- Root Node >> Can be said to be the most imp var
- Child nodes >> Can be said to be the 2nd, 3rd,.. imp vars
- Leaf Nodes >> Can be said to be the least imp vars

DT in Grouping Vars:
- DT, while creating nodes, divides vars into 2/3/.. groups
- So it can also be used to bin num vars (non-linear models)
Advantages of DT:
- Easy to implement (no multicollinearity handling required, no feature reduction required)
- Easy to understand & explain DT rules
- Model building is very quick

Disadvantages of DT:
- Doesn't use all the vars in the DT
- High chances of overfitting
- If the tree is big, it is hard to explain in the form of a tree

Py Implementation:
- from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
- pydotplus: package in python to view the decision tree graph
- Graphviz: package to export the graphviz object

DT Tuning Parameters:
- Splitting Criteria: for Regressor: f-value, MSE; for Classifier: Gini, Entropy, Information Gain
- Stopping Criteria is the same for both: max depth, #nodes, min objects in each node, min objects to split further
- param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [3,4,5,6], 'max_features': [3,4,5,6], 'max_leaf_nodes': [4,5,6,7,8,9]}

DT Feature importance:
- clf_tree.feature_importances_

Ensemble Learning - Collective Model:
- Build multiple models separately
- Consolidate output into a single output
- Final output will be compared with the original data to calculate accuracies
- Automate the process
- Parallel Processing Algo >> Bagging Algo (helps to handle the problem of Overfitting) >> Bagging, Random Forest
- Sequential Processing Algo >> Boosting Algo (helps to handle the problem of Underfitting) >> Gradient Boost, XGBoost

Classification of ensemble models:
- Homogeneous Ensembling >> All individual models use the same algos
- Heterogeneous Ensembling >> Individual models may use different algos

Bagging (Bootstrap Aggregating Algo):
- Within a sample, data will not have duplicate records
- Samples may have overlapping records
- Getting a single output by aggregating all outputs
- Manual process: use the Decision Tree method & Decision Tree tuning parameters for the ensemble

Bag vs Out of Bag:
- Eg: 700 samples were there; multiple times 500 records are picked & 200 are left
- 500 are Bag
- 200 are Out of Bag (OOB)
- Bag Size >> 2/3rd of the data

Tuning Parameters of EL:
- Number of models
- Random Trees (trees are different) >> when using few vars instead of all in every model
- No. of vars to consider in each sample

Bagging & Random Forest:
- Bagging builds models from Train Data using all columns/vars; Random Forest builds models on few vars
- These are a collection of trees = forest; it is known as Random Forest

Py Implementation:
- sklearn.ensemble
- Bagging_classifier
- Bagging_Regressor
- Random_Forest_Classifier
- Random_Forest_Regressor

Boosting Algorithm:
- Weak learners (incorrect predictions) will be boosted
- Build multiple models & decrease the weight of correctly predicted elements & increase the weight of incorrectly predicted elements
- Gives accuracy at the end; whichever model gives high accuracy will be selected
- Weight1 * Error1 + W2*E2 + ....

Adaboost >> Adaptive Boosting
Gradient Boosting:
- GradientBoostClassifier
- GradientBoostRegressor
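A minimal bagging / random forest / boosting sketch with sklearn (the dataset is synthetic, used only to make the snippet self-contained):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=123)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=123)

for model in (BaggingClassifier(n_estimators=50, random_state=123),            # bagging: all vars per model
              RandomForestClassifier(n_estimators=50, random_state=123),       # random forest: few vars per split
              GradientBoostingClassifier(n_estimators=50, random_state=123)):  # boosting: sequential weak learners
    model.fit(train_X, train_y)
    print(type(model).__name__, model.score(test_X, test_y))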
NLP https://www.youtube.com/watch?v=zlUpTlaxAKI&list=PLKnIA16_RmvZo7fp5kkIth6nRTeQQsjfX

Class 1 NLP
Real World NLP Apps
Common NLP Task
Approaches to NLP
Heuristic Approaches
Machine learning
Approaches
Deep Learning
Approaches
Challenges in NLP

Class 2
NLP - It is a subfield of CS, AI & Human Language
Real World NLP Apps:
- Search Engines
- Chatbots
Common NLP Tasks:
- Text Parsing
- Speech to text: voice typing
Approaches to NLP:
- Heuristic Approaches: Wordnet, Open Mind Common Sense
- Machine Learning Methods: LDA, Hidden Markov Models
- Deep Learning Methods: Transformers (heavily used, boosted NLP), Autoencoders
Challenges in NLP:
- Creativity
- Diversity
EDA Process:
- Import important libraries
- If X vars are many, then remove a few of them manually
- If a column contains >25% null, then drop that column
- If the Y Var is missing, remove that row
- Separate Num_Vars, Cat_Vars & Y_Var
- Outlier & missing value capping on Num_Vars
- Capping by mode on Cat_Vars
  * Creating Dummy Vars (0,1 flag) >> when independent vars
  * Creating Label encoding (numbering the vars) >> when dependent vars
- Concat both data (Num vars, Cat Num vars)
- (Regression Problem) Check whether Y_Var is normally distributed or not; if not, then take log of it
- Feature Reduction:
  5. VIF
  * Sommer's D (Gini) >> Classification Problem
- Taking unique col names after the above process
- Train Test Split
- Rebuild the final model
- (Regression Problem) Convert the Y value back to normal (exp) if it was converted into log
- (Regression Problem) Compare it with Test Data
- (Classification Problem)
  * AUC ROC Score >> metrics.roc_auc_score()
  * Gini
  * Decide the cutoff point & make data into boolean form
  * Compare it with Test Data
https://www.youtube.com/watch?v=qCR2Weh64h4&list=PLzMcBGfZo4-lUA8uGjeXhBUUzPYc6vZRn

Tutorial 1
Variability Std Deviation: SQRT of Variance || Variance: Avg Squared Deviation
Standard Deviation SQRT[ SIGMA(Sample - Sample Mean)^2 / (No. of Samples - 1) ] || SQRT( sigma(X - Xbar)^2 / (n-1) )
Squared Deviation (X - Xbar)^2
Variance Formula Population Variance (sigma^2): sigma(X - u)^2 / N || u = Population mean, N = No. in Population
3/6 Sigma Rule For Normally Distributed Data; Rule 1: +- 1 STD will cover 68.2% of the data
Standard Error STD / SQRT(n)
Error Types - Type 1 Error: Rejecting the Null while it should be accepted
- Type 2 Error: Accepting the Null while it should be rejected
Test we have to perform in Py - T test - One Sample T test >> 1 sample vs 1 value
Py Implementation - import scipy.stats as stats

SSE (Sum of Squared Error): (Y - YBar)^2 || Y: Actual Y, YBar: Predicted Y
SSR: (Predicted_Y - Avg_Y)^2 || Best fit line vs Base model || Base Model: Straight line at the Avg
R^2 (Accuracy): SSR / SST
SST (Sum of Squares Total): SSE + SSR

Data Preparation Level - Correlation b/w X & X


3: Feature Reduction -- Sort
VIF >>over
Gives Variation Inflation
all error
Y & Pred_Y dataFactor
||inGroup them in 10 parts
MAPE --True
np.mean(np.abs((Actual_Y
TakePositive
Avg
root of
ofeach
it to part
measure - Pred_Y)
&Positive
make
RMSE / Actual_Y))*100
a line(Precision)
chart || Line should overlap each other
-- Corr b/w Y & Pred_YFalse should be positive & very high to identify
RMSE / MSE False If it shows
Train & Test
Negativeany difference
RMSE True/ MSE at any level
should
Negative be than
similar it is easy
-- stats.stats.pearsonr(Y,
pd.qcut(train['Decile_no'], Pred_Y)
10, labels = False) || will give numbering which help to sort df
Correlation -- Training (Senstivity) (Specificity)
Group by&itTesting
on decilesshouldwithhave
Avg similar values
of Y & Pred_Y
Decile Analysis -- It should be
Senstivity >>similar
TP / (TP for+ Train
FN) & Test || Should be as high as possible
-- Specificity >> TN / (FP
A cutoff point which decided + TN) value to be || 0 Should
or 1 be as high as possible
-- Precision >> TP / (TP + FP)
ROC (Receiving Operator Curve) >> Gives optimal Threshold value
-- Accuracy
Totsl 1's / >>Total(TP1's+ +TN) / (TP
Total 0's+ TN + FP + FN)
Confusion Metrix -- F1 Score >> (2*Recall) + Precision / (Precision + Recall) || Higher is better
Concordance
Threshold Value -Max[Abs(Bad
AUC
- Weight of Evidence - Good)](WOE) >> It should be in top 5 deciles
KS Value -- Lift chart
Feature Reduction Sommer's D (Ginni)
Techniques - VIF
Interquartile Ratio
IQR
- -Upper:
Q3 - Q1
Q3 + 1.5*IQR
Outlier -- Lower:
Q1: 1/4Q1 - 1.5*IQR
* (n+1)
Quartile - Q3: 3/4 * (n+1)
Feature reduction method for Classification problem
KNN for Regression parameters
DT for Regression
Naïve Bayes
How we can find Skewness from Boxplot?
Which part of an SQL query runs first, then 2nd, and so on
Function to find out the count of vowels in a string (a small sketch follows this list)
Why are collinear vars removed?
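A minimal sketch for the vowel-count interview question above:

def count_vowels(text):
    # Count how many characters of the string are vowels (case-insensitive)
    return sum(1 for ch in text.lower() if ch in "aeiou")

print(count_vowels("Pranay Singh"))   # 3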

SQL
Remove Duplicate Records
Lead / Lag Date Function
Replacing Mbl No with 'xxx'
3rd Highest Salary
To check data is not present in both tbls
Only Columns of table
Split KG & Gram
Generalization (Bottom to Up)
Specialization (Up to bottom)

Tableau
LOD Syntax
Sets

Python
Iterators Vs Generators
SQL answers:
- Remove Duplicate Records: number rows in a subquery (row_number() over a window), then WHERE row_num = 1
- Lead / Lag Date Function: date_part('days', lag(date,1) over(partition by customer_id order by date desc) - date) as order_Date_diff from order_table ) a
- Replacing Mbl No with 'xxx': CONCAT(SUBSTR(phone, 1, LENGTH(phone) - 5), 'xxxxx') as update_mbl from tbl_nm
- 3rd Highest Salary: rank salaries in a subquery, then from tbl_nm) a where rn = 3; or use order by + limit
- To check data is not present in both tbls: LEFT JOIN table1 ON table2.id = table1.id WHERE table1.id IS NULL;
- Only Columns of table: select * from Employee limit 0
- Split KG & Gram: split_part(weight::text, '.', 2) AS second_part from employee
Generalization (Bottom to Up): Creating a new (higher level) entity by combining lower level entities
- Eg: Tbl1: Employee, Tbl2: Customer >> Tbl3: Name, Add, Ph no.
Specialization (Up to Bottom): Vice versa of Generalization

Tableau:
- LOD Syntax: { FIXED [Customer Name] : SUM([Sales]) }
- Sets: Helped to filter out Top, by condition, etc. along with functions

Python - Iterators Vs Generators:
- To create iterators we use the iter() keyword || iter processes values 1 by 1
- To create a generator we use the yield keyword || the yield keyword saves local variables
- It can be run using next() or a for loop
- Generators help to write fast & compact code compared to iterators
- An iterator is more memory efficient
- A generator is also a type of iterator
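A short sketch contrasting iter() and yield as described above (the list and generator are illustrative):

# Iterator: iter() processes values one by one via next()
itr = iter([1, 2, 3])
print(next(itr), next(itr))       # 1 2

# Generator: yield saves the local state between calls
def squares(n):
    for i in range(1, n + 1):
        yield i * i               # resumes here on the next call

gen = squares(3)
print(next(gen))                  # 1
for value in gen:                 # can also be consumed with a for loop
    print(value)                  # 4, 9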
https://cashify.udemy.com/course/python-coding/learn/lecture/5488068#overview
For Practice https://pynative.com/python-if-else-and-for-loop-exercise-with-solutions/#h-exercise-1-print-first-10-natural-numbers-using-while-loop
Section 2: Core Programming Principles
type(x) shows the var type
Dtypes: int, float, string, bool
--- It will give a line

import math
