Lab4 - SLR - Ipynb - Colaboratory

The document describes developing a simple linear regression model to predict salary based on percentage in grade 10 using ordinary least squares (OLS). [1] The model was fit using statsmodels OLS on training data with salary as the target variable Y and percentage in grade 10 as the feature X. [2] The model had an R-squared of 0.211, indicating it explains 21.1% of variation in salary. [3] Residual analysis and outlier detection showed the residuals had constant variance and there were no highly influential outliers.

Uploaded by

PATTABHI RAMANJANEYULU


Simple Linear Regression Using the Ordinary Least Squares (OLS) Method

Develop a regression model to predict Salary based on Percentage in Grade 10.

#This code is to upload the data set from local drive into Colab
from google.colab import files
uploaded = files.upload()


1. Import the MBA Salary dataset

#import the data from MBA Salary.csv
import pandas as pd
mba_salary_df = pd.read_csv('/content/MBA Salary.csv')
mba_salary_df.head(3)

   S. No.  Percentage in Grade 10  Salary
0       1                   62.00  270000
1       2                   76.33  200000
2       3                   72.00  240000

#print information about the data set

mba_salary_df.info()

RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   S. No.                  50 non-null     int64
 1   Percentage in Grade 10  50 non-null     float64
 2   Salary                  50 non-null     int64
dtypes: float64(1), int64(2)
memory usage: 1.3 KB

mba_salary_df.shape

(50, 3)

2. Create the feature set X and the outcome variable Y. The
statsmodels library is used for building statistical models.
The OLS API in statsmodels.api estimates the parameters of a
simple linear regression. It takes two parameters, Y and X.
In this data, Y is Salary and X is Percentage in Grade 10.
By default, OLS does not add an intercept, so on its own it
estimates only the coefficient of X (Beta 1, the slope). To
estimate the intercept (Beta 0) as well, a constant column of
1s must be added to X as a separate column, which is what
sm.add_constant() does.

import statsmodels.api as sm

X = sm.add_constant(mba_salary_df['Percentage in Grade 10'])

X.head(5)


   const  Percentage in Grade 10
0    1.0                   62.00
1    1.0                   76.33
2    1.0                   72.00
3    1.0                   60.00
4    1.0                   61.00
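
To see why the constant column matters, here is a minimal NumPy sketch. It is not the statsmodels call itself: it assumes made-up data generated from y = 2 + 3x and uses np.linalg.lstsq as a stand-in for OLS, and all variable names are illustrative.

```python
import numpy as np

# Made-up data that follows y = 2 + 3x exactly (an assumption for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Design matrix with a leading column of 1s, analogous to what sm.add_constant builds
X_with_const = np.column_stack([np.ones_like(x), x])
beta_with_const, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

# Without the column of 1s the fitted line is forced through the origin
X_no_const = x.reshape(-1, 1)
beta_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

print("with constant   :", np.round(beta_with_const, 3))  # intercept and slope
print("without constant:", np.round(beta_no_const, 3))    # slope only
```

With the constant column the fit recovers both the intercept and the slope. Without it, the single coefficient has to absorb the missing intercept, so the slope estimate is distorted.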

3. Create the outcome variable Y

Y = mba_salary_df['Salary']

Y.head()

0    270000
1    200000
2    240000
3    250000
4    180000
Name: Salary, dtype: int64

4. Split the dataset into training and validation sets. Use 80%
for training and 20% for validation.

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X,Y,train_size = 0.8, random_state = 1
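
As a rough illustration of what an 80/20 split does to the 50 rows, the sketch below shuffles row positions with NumPy and cuts them at 80%. It is an assumption-laden stand-in, not scikit-learn's implementation: the seed and variable names are arbitrary, and train_test_split will pick different rows.

```python
import numpy as np

n_rows = 50           # number of rows reported by mba_salary_df.shape
train_fraction = 0.8  # share of rows used for training

rng = np.random.default_rng(0)       # arbitrary seed, chosen only for repeatability
shuffled = rng.permutation(n_rows)   # random ordering of the row positions

n_train = int(round(train_fraction * n_rows))
train_idx = shuffled[:n_train]       # rows used to fit the model
test_idx = shuffled[n_train:]        # rows held back for validation

print(len(train_idx), len(test_idx))
```

The 40 training rows correspond to the "No. Observations: 40" entry in the model summary in step 7, and the remaining 10 rows form the validation set used in step 11.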

5. Fit the model

mba_salary_lm = sm.OLS(train_y, train_X).fit()

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fada1eb6290>

The fit() method on OLS estimates the parameters and returns the model information, such as
the model parameters (coefficients), accuracy measures and residual values, which is stored in the variable
mba_salary_lm.
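
For a single predictor, the least-squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of X, and the intercept follows from the means. The sketch below applies it to only the five rows printed by X.head(5) and Y.head(), so it will not reproduce the notebook's coefficients, which come from the 40 training rows. The variable names are illustrative.

```python
import numpy as np

# The five rows shown in the X.head(5) and Y.head() outputs above
pct = np.array([62.00, 76.33, 72.00, 60.00, 61.00])
salary = np.array([270000, 200000, 240000, 250000, 180000], dtype=float)

# Closed-form least-squares estimates for one predictor
pct_dev = pct - pct.mean()
slope = np.sum(pct_dev * (salary - salary.mean())) / np.sum(pct_dev ** 2)
intercept = salary.mean() - slope * pct.mean()

# Residuals sum to zero (up to rounding) whenever an intercept is fitted
residuals = salary - (intercept + slope * pct)
print(slope, intercept, residuals.sum())
```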

6. Print the estimated parameters

print(mba_salary_lm.params)

const                     30587.285652
Percentage in Grade 10     3560.587383
dtype: float64

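Hence Beta 0 = 30587.286 and Beta 1 = 3560.587. The estimated model is MBA Salary =
30587.285652 + 3560.587383 * (Percentage in Grade 10). Each additional percentage point in
Grade 10 is associated with an increase of about 3,560 in predicted salary.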
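
As a worked example, the sketch below plugs one value into the estimated equation. The coefficients are those printed above; predict_salary is a hypothetical helper name, and the input is the first row shown by mba_salary_df.head(3).

```python
# Coefficients printed by mba_salary_lm.params in the notebook
beta_0 = 30587.285652
beta_1 = 3560.587383

def predict_salary(percentage_grade_10):
    """Predicted salary from the fitted line (hypothetical helper)."""
    return beta_0 + beta_1 * percentage_grade_10

# First row of the data set: 62.00 percent, actual salary 270000
predicted = predict_salary(62.00)
residual = 270000 - predicted
print(round(predicted, 2), round(residual, 2))
```

For 62 percent the line predicts roughly 251,344, about 18,656 below the salary recorded for that row.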

7. Model Diagnostics - Printing the coefficient of determination (R-squared)

print(mba_salary_lm.summary2())

                             Results: Ordinary least squares
=========================================================================================
Model:              OLS               Adj. R-squared:     0.190
Dependent Variable: Salary            AIC:                1008.8680
Date:               2022-10-12 09:27  BIC:                1012.2458
No. Observations:   40                Log-Likelihood:     -502.43
Df Model:           1                 F-statistic:        10.16
Df Residuals:       38                Prob (F-statistic): 0.00287
R-squared:          0.211             Scale:              5.0121e+09
-----------------------------------------------------------------------------------------
                             Coef.    Std.Err.       t   P>|t|        [0.025       0.975]
-----------------------------------------------------------------------------------------
const                   30587.2857  71869.4497  0.4256  0.6728  -114904.8089  176079.3802
Percentage in Grade 10   3560.5874   1116.9258  3.1878  0.0029     1299.4892    5821.6855
-----------------------------------------------------------------------------------------
Omnibus:            2.048             Durbin-Watson:      2.611
Prob(Omnibus):      0.359             Jarque-Bera (JB):   1.724
Skew:               0.369             Prob(JB):           0.422
Kurtosis:           2.300             Condition No.:      413
=========================================================================================

Hence the R-squared of the model is 0.211, so the model explains 21.1% of the variation in salary in the training data.
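
Several figures in the summary are tied together by standard formulas, which gives a quick consistency check. The sketch below starts from the rounded values reported above and recomputes the adjusted R-squared and the F-statistic. Because the inputs are rounded, agreement is only to the displayed precision, and the variable names are illustrative.

```python
# Values reported in the summary2() output
r_squared = 0.211
n_obs = 40        # No. Observations
df_model = 1      # Df Model (number of predictors)
df_resid = 38     # Df Residuals = n_obs - df_model - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# With a single predictor, F equals the square of the slope's t-statistic
t_slope = 3.1878
print(round(adj_r_squared, 3), round(f_stat, 2), round(t_slope ** 2, 2))
```

The recomputed values round to 0.190 and 10.16, matching the Adj. R-squared and F-statistic entries in the summary.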

8. Model Diagnostics - Residual Analysis. The variance of the
residuals has to be constant across different values of the
predicted value (Y'), a property known as homoscedasticity.
A non-constant variance of the residuals is known as
heteroscedasticity and is not desired. If there is
heteroscedasticity, a plot of the standardised residuals
against the standardised predicted values will be funnel
shaped. To standardise a set of values, subtract their mean
and divide by their standard deviation.

import matplotlib.pyplot as plt

def get_std_values(vals):
    # Standardise: subtract the mean, then divide by the standard deviation
    return (vals - vals.mean()) / vals.std()

x_axis = get_std_values(mba_salary_lm.fittedvalues)

y_axis = get_std_values(mba_salary_lm.resid)

plt.scatter(x_axis, y_axis)

plt.xlabel("Standardised Predicted values")

plt.ylabel("Standardised Residual Values")

plt.title("Residual Plot")

plt.show()

The residual plot is not funnel shaped. Hence the residuals appear to have constant variance.
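
A visual check can be complemented with a crude numerical one. The sketch below is not part of the original notebook and is not a formal test: it compares the residual spread in the upper half of the fitted values with the spread in the lower half. The data are simulated, and spread_ratio is a hypothetical helper.

```python
import numpy as np

def spread_ratio(fitted, resid):
    """Std. dev. of residuals in the upper half of fitted values divided by the lower half.

    Hypothetical helper for illustration; values near 1 suggest similar spread.
    """
    order = np.argsort(fitted)
    half = len(order) // 2
    low = resid[order[:half]].std()
    high = resid[order[half:]].std()
    return high / low

rng = np.random.default_rng(0)            # arbitrary seed
fitted = np.linspace(1.0, 10.0, 1000)     # made-up fitted values
noise = rng.normal(0.0, 1.0, size=1000)

ratio_constant = spread_ratio(fitted, noise)           # same spread everywhere
ratio_funnel = spread_ratio(fitted, noise * fitted)    # spread grows with the fitted value

print(round(ratio_constant, 2), round(ratio_funnel, 2))
```

A ratio close to 1 is consistent with constant variance, while a ratio well above 1 corresponds to the funnel shape described above. With only 40 training rows, a ratio computed this way would be noisy, so treat it as a rough cue.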

9. Model Diagnostics - Outlier Detection. Outliers are
observations whose values show a large deviation from
the mean value. Their presence can have a significant
influence on the values of the regression coefficients.
Hence we use the Z-score to identify their existence in the
data. Any observation with a Z-score above 3.0 or below
-3.0 is treated as an outlier.

from scipy.stats import zscore

mba_salary_df['z_score_salary'] = zscore(mba_salary_df.Salary)

#mba_salary_df.head()

mba_salary_df[(mba_salary_df.z_score_salary > 3.0)| (mba_salary_df.z_score_salary< -3.0)]

S. No. Percentage in Grade 10 Salary z_score_salary

The filter returns no rows. Hence there are no outliers in the data.
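
To show what the filter looks like when it does find something, the sketch below computes Z-scores by hand on made-up salaries containing one extreme value. The data are an assumption for illustration only. The calculation uses the population standard deviation (ddof=0), which is the default in scipy.stats.zscore.

```python
import numpy as np

# Made-up salaries (an assumption for illustration): 19 typical values and one extreme
salaries = np.array([200000] * 10 + [300000] * 9 + [1500000], dtype=float)

# Z-score with the population standard deviation (ddof=0), the default in scipy.stats.zscore
z_scores = (salaries - salaries.mean()) / salaries.std(ddof=0)

# Flag observations more than 3 standard deviations from the mean, in either direction
outlier_idx = np.where(np.abs(z_scores) > 3.0)[0]
print(outlier_idx, np.round(z_scores[outlier_idx], 2))
```

Only the last value is flagged. Using np.abs covers both tails in one condition, which is equivalent to the two-sided comparison in the notebook cell.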

10. Model Diagnostics - Finding highly influential
observations using Cook's distance. This distance
measures how much the predicted values of the dependent
variable change across all observations in the sample when
a particular observation is removed from the sample while
estimating the regression parameters. get_influence()
returns the influence measures for each observation, and
its cooks_distance attribute provides the Cook's distance
values. An observation with a Cook's distance of more
than 1 is considered highly influential.

import numpy as np

mba_influence = mba_salary_lm.get_influence()

(c,p) = mba_influence.cooks_distance

plt.stem(np.arange(len(train_X)), np.round(c,3))

plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")

plt.show()


There is no observation with a Cook's distance greater than 1. Hence none of the observations is highly influential.
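
For intuition about what the measure responds to, the sketch below computes Cook's distance from the textbook formula with NumPy alone, using leverages from the hat matrix. It is a from-scratch illustration on made-up data, an assumption rather than the statsmodels code path, with one point placed far from the others in both X and Y.

```python
import numpy as np

# Made-up data (an assumption): nine points on y = 2x + 1 plus one point far off the line
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.append(2.0 * x[:9] + 1.0, 0.0)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
n, p = X.shape                              # p counts the intercept as a parameter

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

resid = y - H @ y                           # residuals from the fitted line
mse = np.sum(resid ** 2) / (n - p)          # residual variance estimate

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
cooks_d = resid ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2

print(np.round(cooks_d, 3))
print("most influential row:", int(np.argmax(cooks_d)))
```

The planted point combines high leverage with a large residual, so its distance is far above 1 while the others stay well below it. That is the pattern the stem plot would reveal if the MBA data contained such a row.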

11. Making predictions on the validation set and measuring accuracy - R-squared and RMSE

import numpy as np

from sklearn.metrics import r2_score, mean_squared_error

pred_y = mba_salary_lm.predict(test_X)

# Caution: np.abs() hides the sign of R-squared, which can be negative on held-out data
print('R2 Score =',np.abs(r2_score(test_y,pred_y)))

print('RMSE = ', np.sqrt(mean_squared_error(test_y,pred_y)))

R2 Score = 0.156645849742304

RMSE = 73458.04348346895
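
Both metrics follow directly from their definitions. The sketch below recomputes them with NumPy on made-up numbers, chosen so that the predictions do worse than simply predicting the mean. This is an assumption for illustration and says nothing about the sign of the notebook's own validation score. r2_and_rmse is a hypothetical helper.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R-squared, RMSE) computed from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    sst = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - sse / sst, np.sqrt(sse / len(y_true))

# Made-up values (an assumption) where the predictions are worse than the mean
r2, rmse = r2_and_rmse([1, 2, 3], [3, 3, 3])
print(r2, round(rmse, 3))   # R-squared is negative here
print(abs(r2))              # taking the absolute value would hide that
```

Because R-squared on a validation set can fall below zero, wrapping r2_score in np.abs() as the cell above does can make a poor fit look acceptable. Printing the raw value is the safer habit.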
