Statistics: Dealing With Skewed Data
When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself -- this prevents
the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models.
However, when applying a predictive model that was fit to the log of the dependent variable, remember to apply exp() to the predictions to get back the actual predicted values.
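A minimal sketch in R, assuming a hypothetical data frame df with a skewed outcome y and a predictor x:
# Fit the model on the log of the dependent variable
logModel = lm(log(y) ~ x, data=df)
# Predictions come back on the log scale; apply exp() to get actual predicted values
predActual = exp(predict(logModel, newdata=df))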
The Z-Score
• It is a measure of relative location of an observation in the dataset and helps us determine how far a particular value is from the mean
• There is a Z-score associated with each value (observation) of the population/sample
• It is often called the Standardized Value
• It is interpreted as the number of standard deviations Xi is from the mean x̅.
• Any value with Z>3 or Z<-3 is an outlier
Zi = (Xi - x̅)/s
where Zi is the Z-score for Xi, x̅ is the sample mean, and s is the sample standard deviation
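A quick sketch in R, using a small hypothetical vector:
x = c(10, 12, 15, 22, 35)
z = (x - mean(x)) / sd(x)    # standardized values; equivalent to as.vector(scale(x))
which(abs(z) > 3)            # positions of potential outliers (|Z| > 3), if any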
Chebyshev's Theorem:
At least (1 - 1/Z²) of the data values must be within Z standard deviations of the mean, where Z > 1 (for example, at least 75% of values lie within Z = 2 standard deviations)
The theorem applies to all data sets irrespective of the shape of distribution of the data
Normal Distribution
1. Mean = Median = Mode
2. Symmetric about mean
3. Standard Deviation determines how flat or wide the curve is
4. Probability for normal random variable is given by area under the curve
5. 68.3% of the values of a Normal Random Variable are within +/- 1 standard deviation of its mean
6. 95.4% of the values of a Normal Random Variable are within +/- 2 standard deviations of its mean
7. 99.7% of the values of a Normal Random Variable are within +/- 3 standard deviations of its mean
8. f(x) = (1/(σ√(2π))) * e^(-(x - μ)²/(2σ²))
Standard Normal Density Function
Normal distribution with a mean = 0 and standard deviation = 1.
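The empirical-rule percentages above can be checked with R's standard normal CDF, pnorm():
pnorm(1) - pnorm(-1)   # ~0.683, within 1 SD of the mean
pnorm(2) - pnorm(-2)   # ~0.954, within 2 SDs
pnorm(3) - pnorm(-3)   # ~0.997, within 3 SDs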
Point Estimation
Population Parameter
Mean = μ
Standard Deviation = σ
Sample Statistic
Mean = x̅
Standard Deviation = s
Proportion Estimate = p̅ = (no. of selected observations)/(total no. of observations)
Sample Mean x̅ becomes the Point Estimator for Population Mean μ
Sample Standard Deviation s becomes the Point Estimator for Population Standard Deviation σ
Sampling Distribution
Consider selecting a simple random sample as an experiment, repeated several times
Each sample gives us a sample mean x̅.
As a result we will have a random variable x̅, which has a mean or expected value, a standard deviation, and a probability distribution.
This is known as the sampling distribution of x̅.
Expected Value of x̅: E(x̅) = μ, the Population Mean
Standard Deviation of x̅ (σ = Population SD, n = Sample Size, N = Population Size):
Finite Population: σx̅ = √{(N - n)/(N - 1)} * σ/√n
Infinite Population: σx̅ = σ/√n
Central Limit Theorem
In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̅ can be approximated by a normal distribution as the sample size becomes large.
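A small simulation sketch in R illustrating the CLT (the population here is hypothetical, generated from a skewed exponential distribution):
set.seed(1)
population = rexp(100000, rate=1)                     # skewed population with mean 1, SD 1
xbar = replicate(5000, mean(sample(population, 50)))  # 5000 sample means, each with n = 50
mean(xbar)   # close to the population mean
sd(xbar)     # close to sigma/sqrt(n) = 1/sqrt(50)
hist(xbar)   # roughly bell-shaped despite the skewed population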
Sampling Types
1. Simple Random Sampling
2. Stratified Random Sampling:
Elements in Population first divided into groups called strata, based on certain attributes such as department, location, age, type etc.
After this, a simple random sample is taken from each stratum (see the R sketch after this list).
3. Cluster Sampling
The population is divided into smaller groups called clusters.
Each cluster should ideally be representative of the population
Samples are taken from each cluster
4. Systematic Sampling
Example: selecting one element from the first 100 records, another from the next 100, and so on.
5. Convenience Sampling
Non-Probability Sampling technique.
Elements are included in the sample without prespecified or known probabilities of being selected
6. Judgement Sampling
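A rough sketch in R of the first two sampling types, assuming a hypothetical data frame df with a grouping column stratum (and at least 10 rows per stratum):
set.seed(1)
# Simple random sample of 30 rows
srs = df[sample(nrow(df), 30), ]
# Stratified random sample: 10 rows from each stratum
strat = do.call(rbind, lapply(split(df, df$stratum),
                              function(g) g[sample(nrow(g), 10), ]))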
Interval Estimation
The purpose of an interval estimate is to provide information about how close the point estimate, provided by the sample, is to the value of the Population Parameter.
It helps us estimate the value of the Population Mean μ, using the value of the sample mean x̅ and the sample size n
Interval estimate of Population Mean:
x̅ ± Margin of Error
Population SD σ known:
x̅ ± Zα/2 (σ/√n)
where Zα/2 is the Z value providing an area of α/2 in the upper/lower tail of the standard normal probability distribution.
Confidence Coefficient = (1 - α) = Confidence Level/100; so with a confidence level of 95%, the Confidence Coefficient is 0.95 and α = 0.05.
Population SD σ not known:
x̅ ± tα/2 (s/√n)
where s is the Sample SD, t is the random variable for the t-Distribution, and tα/2 is the t value providing an area of α/2 in the upper tail of the t-Distribution with n-1 degrees of freedom.
Margin of Error, E = Zα/2(σ/√n)
Desired Sample Size, n = (Zα/2)² * σ²/E²
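A worked sketch in R with hypothetical numbers (x̅ = 82, s = 20, n = 36, 95% confidence):
xbar = 82; s = 20; n = 36; alpha = 0.05
# sigma known (z-interval); here s stands in for sigma
E_z = qnorm(1 - alpha/2) * s / sqrt(n)
c(xbar - E_z, xbar + E_z)
# sigma unknown (t-interval with n-1 degrees of freedom)
E_t = qt(1 - alpha/2, df=n-1) * s / sqrt(n)
c(xbar - E_t, xbar + E_t)
# sample size needed for a margin of error E = 5 when sigma = 20
ceiling((qnorm(1 - alpha/2) * 20 / 5)^2)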
Hypothesis Testing
Type-1 & Type-2 Errors
            H0 True        Ha True
Accept H0   Correct        Type-II Error
Reject H0   Type-I Error   Correct
(μ0 is the Hypothesized Value; μ is the Population Mean)
Linear Regression
Y = a + bX + e ---- (a -> Intercept; b-> slope of line; e -> error in prediction)
Baseline Model
predicts the average value of the dependent variable regardless of the value of the independent variable. Always a flat line. Gives maximum SSE.
Y = a ---- where a = avg(Yi)
R-Squared
R² = 1 - (SSE/SST), where 0 <= SSE <= SST and SST > 0
Multiple linear regression allows you to use multiple variables at once to improve the model. The multiple linear regression model is similar to the one-variable regression model, but has a coefficient beta for each independent variable.
We predict the dependent variable y using the independent variables x1, x2, through xk, where k denotes the number of independent variables in our model. Beta 0 is, again, the coefficient for the intercept term, and beta 1, beta 2, through beta k are the coefficients for the independent variables. We use i to denote the data for a particular data point or observation.
The best model is selected in the same way as before: by minimizing the sum of squared errors, using the error terms epsilon.
We can start by building a linear regression model that just uses the variable with the best R-squared. Then we can add variables one at a time and look at the improvement in R-squared. Note that the improvement is not equal to the one-variable R-squared for each independent variable we add, since there are interactions between the independent variables.
Adding independent variables improves the R squared to almost double what it was with a single independent variable. But there are diminishing returns. The marginal
improvement from adding an additional variable decreases as we add more and more variables.
Overfitting
Often not all variables should be used. This is because each additional variable used requires more data, and using more variables creates a more complicated model. Overly
complicated models often cause what's known as overfitting. This is when you have a higher R squared
on data used to create the model, but bad performance on unseen data.
Adjusted R-squared
This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points. Multiple R-squared will always
increase if you add more independent variables. But Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to
determine whether an additional variable should even be included in the model.
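One way to see this in R, assuming the wine data from the code section below has been read in (the choice of variables here is only illustrative):
m1 = lm(Price ~ AGST, data=wine)
m2 = lm(Price ~ AGST + FrancePop, data=wine)
summary(m1)$r.squared; summary(m1)$adj.r.squared
summary(m2)$r.squared; summary(m2)$adj.r.squared
# Multiple R-squared never decreases when a variable is added;
# Adjusted R-squared falls if the added variable does not help the model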
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.45039886 10.18888394 -0.044 0.965202
AGST 0.60122388 0.10302027 5.836 0.0000127 ***
HarvestRain -0.00395812 0.00087511 -4.523 0.000233 ***
WinterRain 0.00104251 0.00053099 1.963 0.064416 .
Age 0.00058475 0.07900313 0.007 0.994172
FrancePop -0.00004953 0.00016668 -0.297 0.769578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
A coefficient of 0 means that the value of the independent variable does not change our prediction for the dependent variable. If a coefficient is not significantly different
from 0, then we should probably remove the variable from our model since it's not helping to predict the dependent variable.
Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model
constant.
The standard error column gives a measure of how much the coefficient is likely to vary from the estimate value
The t value is the estimate divided by the standard error. It will be negative if the estimate is negative and positive if the estimate is positive. The larger the absolute value of
the t value, the more likely the coefficient is to be significant. So we want independent variables with a large absolute t-value.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable. This number will be large if the absolute value of the t value is small, and it will be small if the absolute value of the t value is large. We want independent variables with small values in this column.
After removing FrancePop, the variable Age becomes significant in the model. This is because population and Age were highly correlated with each other. Also note that this is a better model than the previous one, as the Adjusted R² has increased.
Correlation
Correlation measures the linear relationship between two variables and is a number between -1 and +1.
Multicollinearity
refers to the situation when two or more independent variables are highly correlated with each other (typically, a correlation greater than 0.7 or less than -0.7).
A high correlation between an independent variable and the dependent variable is a good thing, since we're trying to predict the dependent variable using the independent variables.
Due to the possibility of multicollinearity, you should always remove insignificant variables one at a time.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Important R Codes
# Read in data
wine = read.csv("C:\\Users/Raktim/Documents/IIM-Trichy/ClassRoom/Term-5/MASDM/Analytics Edge/Wine Test/wine.csv")
str(wine)
summary(wine)
# Multi variable Regression
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
# Correlations
cor(wine$WinterRain, wine$Price)
cor(wine$Age, wine$FrancePop)
cor(wine)
# Make test set PREDICTIONs
predictTest = predict(model4, newdata=wineTest)
predictTest
# Compute R-squared
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
Rsquared = 1 - SSE/SST
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Logistic Regression
Logistic regression predicts the probability of the outcome variable being true. The probability that the outcome variable is 0 is just 1 minus the probability that the outcome
variable is 1.
Logistic Response Function
P(y=1) = 1/(1 + e^-(B0 + B1X1 + B2X2 + … + BnXn))
Nonlinear transformation of Linear Regression equation to produce number between 0 and 1.
Odds
The Odds are the probability of 1 divided by the probability of 0
Odds = P(y=1)/P(y=0) = P(y=1)/(1 - P(y=1))
The Odds are greater than 1 if 1 is more likely, and less than 1 if 0 is more likely.
The Odds are equal to 1 if the outcomes are equally likely.
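For example, in R:
p = c(0.50, 0.75, 0.25)
p / (1 - p)    # 1, 3, 1/3: equally likely, 1 more likely, 0 more likely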
Logit
If you substitute the Logistic Response Function for the probabilities in the Odds equation above, you can see that the Odds are equal to e raised to the power of the linear regression equation.
Substituting the Logistic Response Function for the probabilities in the Odds equation:
Odds = e^(B0 + B1X1 + B2X2 + … + BnXn)
Log(Odds) = B0 + B1X1 + B2X2 + …. + BnXn
This is the Logit, which looks exactly like the linear regression equation.
A positive beta value increases the Logit, which in turn increases the Odds of 1.
A negative beta value decreases the Logit, which in turn, decreases the Odds of 1.
In R
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial) # family=binomial tells R to build a logistic regression model; glm stands for generalized linear model
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6818 -0.6250 -0.4767 -0.1496  2.1060
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.80449    0.59745  -4.694 2.68e-06 ***
OfficeVisits  0.07995    0.03488   2.292  0.02191 *
Narcotics     0.12743    0.04650   2.740  0.00614 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The coefficient of OfficeVisits means that for two people (A and B) who are otherwise identical, one additional office visit increases the Predicted Log Odds of A by 0.08 over the Predicted Log Odds of B:
Ln(OddsA) = Ln(OddsB) + 0.08
=> exp(Ln(OddsA)) = exp(Ln(OddsB) + 0.08)
=> OddsA = exp(0.08) * OddsB
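In R, the odds ratio for a one-unit increase in OfficeVisits can be computed directly from the fitted model:
exp(coef(QualityLog)["OfficeVisits"])   # about exp(0.08) ≈ 1.08, i.e. the odds are multiplied by ~1.08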
AIC
Measure of Quality of the model. Similar to Adjusted R-square. Accounts for number of variables used compared to the number of observations. Can only be used to
compare between models on the same data set. The preferred model is the model with minimum AIC.
Sensitivity
is equal to the true positives divided by the true positives plus the false negatives, and measures the percentage of actual poor care cases that we classify correctly. This is
often called the true positive rate.
= TP/(TP + FN)
Specificity
is equal to the true negatives divided by the true negatives plus the false positives, and measures the percentage of actual good care cases that we classify correctly. This is
often called the true negative rate.
= TN/(TN + FP)
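A sketch in R of a confusion matrix and these two measures at a 0.5 threshold, assuming the QualityLog model and qualityTrain data from above (and that both classes and both prediction outcomes appear in the table):
predictTrain = predict(QualityLog, type="response")         # predicted probabilities
confMat = table(qualityTrain$PoorCare, predictTrain > 0.5)  # rows = actual, columns = predicted
sensitivity = confMat["1", "TRUE"]  / sum(confMat["1", ])   # TP/(TP + FN)
specificity = confMat["0", "FALSE"] / sum(confMat["0", ])   # TN/(TN + FP)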
Threshold, Specificity & Sensitivity
• A model with a higher threshold will have a lower sensitivity and a higher specificity.
• A model with a lower threshold will have a higher sensitivity and a lower specificity.
The ROC curve always starts at the point (0, 0). This corresponds to a threshold value of 1. If you have a threshold of 1, you will not catch any poor care cases, i.e. you have a
sensitivity of 0. But you will correctly label all of the good care cases, meaning you have a false positive rate of 0.
The ROC curve always ends at the point (1,1), which corresponds to a threshold value of 0. If you have a threshold of 0, you'll catch all of the poor care cases, or have a
sensitivity of 1, but you'll label all of the good care cases as poor care cases too, meaning you have a false positive rate of 1. The threshold decreases as you move from (0,0)
to (1,1).
This helps you select a threshold value by visualizing the error that would be made in the process.
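One common way to draw the ROC curve in R is with the ROCR package, assuming predictTrain from the confusion-matrix sketch above:
library(ROCR)
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0, 1, 0.1))  # thresholds labeled along the curve
performance(ROCRpred, "auc")@y.values                           # area under the curve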