Statistics: Dealing With Skewed Data
When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself -- this prevents
the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models.
However, when applying a predictive model that was fit to the log of the dependent variable, remember to apply exp() to the predictions to get back the actual predicted values.
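A minimal sketch in R, assuming a hypothetical data frame df with a skewed outcome y and a predictor x:
# Fit the model on the log of the dependent variable
logModel = lm(log(y) ~ x, data=df)
# Predictions come back on the log scale; apply exp() to get actual predicted values
predActual = exp(predict(logModel, newdata=df))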
The Z-Score
• It is a measure of relative location of an observation in the dataset and helps us determine how far a particular value is from the mean
• There is a Z-score associated with each value (observation) of the population/sample
• It is often called the Standardized Value
• It is interpreted as the number of standard deviations Xi is from the mean x̅.
• Any value with Z>3 or Z<-3 is an outlier
Zi = (Xi - x̅)/s
where Zi is the Z-score for Xi, x̅ is the sample mean, and s is the sample standard deviation
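A quick sketch in R, using a small hypothetical vector:
x = c(10, 12, 15, 22, 35)
z = (x - mean(x)) / sd(x)    # standardized values; equivalent to as.vector(scale(x))
which(abs(z) > 3)            # positions of potential outliers (|Z| > 3), if any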
Chebyshev's Theorem:
At least (1 - 1/Z²) of the data values must be within Z standard deviations of the mean, where Z > 1 (for example, at least 75% of values lie within Z = 2 standard deviations)
The theorem applies to all data sets irrespective of the shape of distribution of the data
Normal Distribution
1. Mean = Median = Mode
2. Symmetric about mean
3. Standard Deviation determines how flat or wide the curve is
4. Probability for normal random variable is given by area under the curve
5. 68.3% of the values of a Normal Random Variable are within +/- 1 standard deviation of its mean
6. 95.4% of the values of a Normal Random Variable are within +/- 2 standard deviations of its mean
7. 99.7% of the values of a Normal Random Variable are within +/- 3 standard deviations of its mean
8. f(x) = (1/(σ√(2π))) * e^(-(x - μ)²/(2σ²))
Standard Normal Density Function
Normal distribution with a mean = 0 and standard deviation = 1.
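The empirical-rule percentages above can be checked with R's standard normal CDF, pnorm():
pnorm(1) - pnorm(-1)   # ~0.683, within 1 SD of the mean
pnorm(2) - pnorm(-2)   # ~0.954, within 2 SDs
pnorm(3) - pnorm(-3)   # ~0.997, within 3 SDs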
Point Estimation
Population Parameter
Mean = μ
Standard Deviation = σ
Sample Statistic
Mean = x̅
Standard Deviation = s
Proportion Estimate = p̅ = (no. of selected observations)/(total no. of observations)
Sample Mean x̅ becomes the Point Estimator for Population Mean μ
Sample Standard Deviation s becomes the Point Estimator for Population Standard Deviation σ
Sampling Distribution
Consider selecting a simple random sample as an experiment, repeated several times
Each sample gives us a sample mean x̅.
As a result we will have a random variable x̅, which has a mean or expected value, a standard deviation, and a probability distribution.
This is known as the sampling distribution of x̅.
Expected Value of x̅: E(x̅) = μ, the Population Mean
Standard Deviation of x̅ (σ = Population SD, n = Sample Size, N = Population Size):
Finite Population: σx̅ = √{(N - n)/(N - 1)} * σ/√n
Infinite Population: σx̅ = σ/√n
Central Limit Theorem
In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̅ can be approximated by a normal distribution as the sample size becomes large.
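A small simulation sketch in R illustrating the CLT (the population here is hypothetical, generated from a skewed exponential distribution):
set.seed(1)
population = rexp(100000, rate=1)                     # skewed population with mean 1, SD 1
xbar = replicate(5000, mean(sample(population, 50)))  # 5000 sample means, each with n = 50
mean(xbar)   # close to the population mean
sd(xbar)     # close to sigma/sqrt(n) = 1/sqrt(50)
hist(xbar)   # roughly bell-shaped despite the skewed population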
Sampling Types
1. Simple Random Sampling
2. Stratified Random Sampling:
Elements in Population first divided into groups called strata, based on certain attributes such as department, location, age, type etc.
After this, a simple random sample is taken from each stratum (see the R sketch after this list).
3. Cluster Sampling
The population is divided into smaller groups called clusters.
Each cluster should ideally be representative of the population
Samples are taken from each cluster
4. Systematic Sampling
Example: selecting one element from the first 100 records, another from the next 100, and so on.
5. Convenience Sampling
Non-Probability Sampling technique.
Elements are included in the sample without prespecified or known probabilities of being selected
6. Judgement Sampling
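A rough sketch in R of the first two sampling types, assuming a hypothetical data frame df with a grouping column stratum (and at least 10 rows per stratum):
set.seed(1)
# Simple random sample of 30 rows
srs = df[sample(nrow(df), 30), ]
# Stratified random sample: 10 rows from each stratum
strat = do.call(rbind, lapply(split(df, df$stratum),
                              function(g) g[sample(nrow(g), 10), ]))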
Interval Estimation
The purpose of an interval estimate is to provide information about how close the point estimate, provided by the sample, is to the value of the Population Parameter.
It helps us estimate the value of the Population Mean μ, using the value of the sample mean x̅ and the sample size n
Interval estimate of Population Mean:
x̅ ± Margin of Error
Population SD σ known:
x̅ ± Zα/2 (σ/√n)
where Zα/2 is the Z value providing an area of α/2 in the upper/lower tail of the standard normal probability distribution.
Confidence Coefficient = (1 - α) = Confidence Level/100; so with a confidence level of 95%, the Confidence Coefficient is 0.95 and α = 0.05.
Population SD σ not known:
x̅ ± tα/2 (s/√n)
where s is the Sample SD, t is the random variable for the t-Distribution, and tα/2 is the t value providing an area of α/2 in the upper tail of the t-Distribution with n-1 degrees of freedom.
Margin of Error, E = Zα/2(σ/√n)
Desired Sample Size, n = (Zα/2)² * σ²/E²
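A worked sketch in R with hypothetical numbers (x̅ = 82, s = 20, n = 36, 95% confidence):
xbar = 82; s = 20; n = 36; alpha = 0.05
# sigma known (z-interval); here s stands in for sigma
E_z = qnorm(1 - alpha/2) * s / sqrt(n)
c(xbar - E_z, xbar + E_z)
# sigma unknown (t-interval with n-1 degrees of freedom)
E_t = qt(1 - alpha/2, df=n-1) * s / sqrt(n)
c(xbar - E_t, xbar + E_t)
# sample size needed for a margin of error E = 5 when sigma = 20
ceiling((qnorm(1 - alpha/2) * 20 / 5)^2)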
Hypothesis Testing
Type-1 & Type-2 Errors
            H0 True        Ha True
Accept H0   Correct        Type-II Error
Reject H0   Type-I Error   Correct
(μ0 is the Hypothesized Value; μ is the Population Mean)
Linear Regression
Y = a + bX + e ---- (a -> Intercept; b-> slope of line; e -> error in prediction)
Baseline Model
predicts the average value of the dependent variable regardless of the value of the independent variable. Always a flat line. Gives maximum SSE.
Y = a ---- where a = avg(Yi)
R-Squared
R² = 1 - (SSE/SST), where 0 <= SSE <= SST and SST > 0
Multiple linear regression allows you to use multiple variables at once to improve the model. The multiple linear regression model is similar to the one-variable regression model, but has a coefficient beta for each independent variable.
We predict the dependent variable y using the independent variables x1, x2, through xk, where k denotes the number of independent variables in our model. Beta 0 is, again, the coefficient for the intercept term, and beta 1, beta 2, through beta k are the coefficients for the independent variables. We use i to denote the data for a particular data point or observation.
The best model is selected in the same way as before: by minimizing the sum of squared errors, using the error terms epsilon.
We can start by building a linear regression model that just uses the variable with the best R-squared. Then we can add variables one at a time and look at the improvement in R-squared. Note that the improvement is not equal to the one-variable R-squared for each independent variable we add, since there are interactions between the independent variables.
Adding independent variables improves the R squared to almost double what it was with a single independent variable. But there are diminishing returns. The marginal
improvement from adding an additional variable decreases as we add more and more variables.
Overfitting
Often not all variables should be used. This is because each additional variable used requires more data, and using more variables creates a more complicated model. Overly
complicated models often cause what's known as overfitting. This is when you have a higher R squared
on data used to create the model, but bad performance on unseen data.
Adjusted R-squared
This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points. Multiple R-squared will always
increase if you add more independent variables. But Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to
determine whether an additional variable should even be included in the model.
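One way to see this in R, assuming the wine data from the code section below has been read in (the choice of variables here is only illustrative):
m1 = lm(Price ~ AGST, data=wine)
m2 = lm(Price ~ AGST + FrancePop, data=wine)
summary(m1)$r.squared; summary(m1)$adj.r.squared
summary(m2)$r.squared; summary(m2)$adj.r.squared
# Multiple R-squared never decreases when a variable is added;
# Adjusted R-squared falls if the added variable does not help the model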
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.45039886 10.18888394 -0.044 0.965202
AGST 0.60122388 0.10302027 5.836 0.0000127 ***
HarvestRain -0.00395812 0.00087511 -4.523 0.000233 ***
WinterRain 0.00104251 0.00053099 1.963 0.064416 .
Age 0.00058475 0.07900313 0.007 0.994172
FrancePop -0.00004953 0.00016668 -0.297 0.769578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
A coefficient of 0 means that the value of the independent variable does not change our prediction for the dependent variable. If a coefficient is not significantly different
from 0, then we should probably remove the variable from our model since it's not helping to predict the dependent variable.
Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model
constant.
The standard error column gives a measure of how much the coefficient is likely to vary from the estimate value
The t value is the estimate divided by the standard error. It will be negative if the estimate is negative and positive if the estimate is positive. The larger the absolute value of
the t value, the more likely the coefficient is to be significant. So we want independent variables with a large absolute t-value.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable. This number will be large if the absolute value of the t value is small, and it will be small if the absolute value of the t value is large. We want independent variables with small values in this column.
After removing FrancePop, the variable Age becomes significant in the model. This is because population and Age were highly correlated with each other. Also note that this is a better model than the previous one, as the Adjusted R² has increased.
Correlation
Correlation measures the linear relationship between two variables and is a number between -1 and +1.
Multicollinearity
refers to the situation when two or more independent variables are highly correlated with each other (typically, a correlation greater than 0.7 or less than -0.7).
A high correlation between an independent variable and the dependent variable is a good thing, since we're trying to predict the dependent variable using the independent variables.
Due to the possibility of multicollinearity, you should always remove insignificant variables one at a time.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Important R Codes
# Read in data
wine = read.csv("C:\\Users/Raktim/Documents/IIM-Trichy/ClassRoom/Term-5/MASDM/Analytics Edge/Wine Test/wine.csv")
str(wine)
summary(wine)
# Multi variable Regression
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
# Correlations
cor(wine$WinterRain, wine$Price)
cor(wine$Age, wine$FrancePop)
cor(wine)
# Make test set PREDICTIONs
predictTest = predict(model4, newdata=wineTest)
predictTest
# Compute R-squared
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
Rsquared = 1 - SSE/SST
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Logistic Regression
Logistic regression predicts the probability of the outcome variable being true. The probability that the outcome variable is 0 is just 1 minus the probability that the outcome
variable is 1.
Logistic Response Function
P(y=1) = 1/(1 + e^-(B0 + B1X1 + B2X2 + … + BnXn))
Nonlinear transformation of Linear Regression equation to produce number between 0 and 1.
Odds
The Odds are the probability of 1 divided by the probability of 0
Odds = P(y=1)/P(y=0) = P(y=1)/(1 - P(y=1))
The Odds are greater than 1 if 1 is more likely, and less than 1 if 0 is more likely.
The Odds are equal to 1 if the outcomes are equally likely.
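For example, in R:
p = c(0.50, 0.75, 0.25)
p / (1 - p)    # 1, 3, 1/3: equally likely, 1 more likely, 0 more likely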
Logit
If you substitute the Logistic Response Function for the probabilities in the Odds equation above, you can see that the Odds are equal to e raised to the power of the linear regression equation.
Substituting the Logistic Response Function for the probabilities in the Odds equation:
Odds = e^(B0 + B1X1 + B2X2 + … + BnXn)
Log(Odds) = B0 + B1X1 + B2X2 + …. + BnXn
This is the Logit, which looks exactly like the linear regression equation.
A positive beta value increases the Logit, which in turn increases the Odds of 1.
A negative beta value decreases the Logit, which in turn, decreases the Odds of 1.
In R
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial) # family=binomial tells R to build a logistic regression model; glm stands for generalized linear model
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6818 -0.6250 -0.4767 -0.1496  2.1060
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.80449    0.59745  -4.694 2.68e-06 ***
OfficeVisits  0.07995    0.03488   2.292  0.02191 *
Narcotics     0.12743    0.04650   2.740  0.00614 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The coefficient of OfficeVisits means that for two people (A and B) who are otherwise identical, one additional office visit increases the Predicted Log Odds of A by 0.08 over the Predicted Log Odds of B:
Ln(OddsA) = Ln(OddsB) + 0.08
=> exp(Ln(OddsA)) = exp(Ln(OddsB) + 0.08)
=> OddsA = exp(0.08) * OddsB
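In R, the odds ratio for a one-unit increase in OfficeVisits can be computed directly from the fitted model:
exp(coef(QualityLog)["OfficeVisits"])   # about exp(0.08) ≈ 1.08, i.e. the odds are multiplied by ~1.08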
AIC
Measure of Quality of the model. Similar to Adjusted R-square. Accounts for number of variables used compared to the number of observations. Can only be used to
compare between models on the same data set. The preferred model is the model with minimum AIC.
Sensitivity
is equal to the true positives divided by the true positives plus the false negatives, and measures the percentage of actual poor care cases that we classify correctly. This is
often called the true positive rate.
= TP/(TP + FN)
Specificity
is equal to the true negatives divided by the true negatives plus the false positives, and measures the percentage of actual good care cases that we classify correctly. This is
often called the true negative rate.
= TN/(TN + FP)
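A sketch in R of a confusion matrix and these two measures at a 0.5 threshold, assuming the QualityLog model and qualityTrain data from above (and that both classes and both prediction outcomes appear in the table):
predictTrain = predict(QualityLog, type="response")         # predicted probabilities
confMat = table(qualityTrain$PoorCare, predictTrain > 0.5)  # rows = actual, columns = predicted
sensitivity = confMat["1", "TRUE"]  / sum(confMat["1", ])   # TP/(TP + FN)
specificity = confMat["0", "FALSE"] / sum(confMat["0", ])   # TN/(TN + FP)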
Threshold, Specificity & Sensitivity
• A model with a higher threshold will have a lower sensitivity and a higher specificity.
• A model with a lower threshold will have a higher sensitivity and a lower specificity.
The ROC curve always starts at the point (0, 0). This corresponds to a threshold value of 1. If you have a threshold of 1, you will not catch any poor care cases, i.e. you have a
sensitivity of 0. But you will correctly label all of the good care cases, meaning you have a false positive rate of 0.
The ROC curve always ends at the point (1,1), which corresponds to a threshold value of 0. If you have a threshold of 0, you'll catch all of the poor care cases, or have a
sensitivity of 1, but you'll label all of the good care cases as poor care cases too, meaning you have a false positive rate of 1. The threshold decreases as you move from (0,0)
to (1,1).
This helps you select a threshold value by visualizing the error that would be made in the process.
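One common way to draw the ROC curve in R is with the ROCR package, assuming predictTrain from the confusion-matrix sketch above:
library(ROCR)
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0, 1, 0.1))  # thresholds labeled along the curve
performance(ROCRpred, "auc")@y.values                           # area under the curve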