Lab4 - SLR - Ipynb - Colaboratory
Develop a regression model to predict Salary based on Percentage in Grade 10.
#This code is to upload the data set from local drive into Colab
from google.colab import files
uploaded = files.upload()
#import the data from MBA Salary.csv
import pandas as pd
mba_salary_df = pd.read_csv('/content/MBA Salary.csv')
mba_salary_df.head(3)
0 1 62.00 270000
1 2 76.33 200000
2 3 72.00 240000
#print information about the data set
mba_salary_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
mba_salary_df.shape
(50, 3)
2. Creating the feature set X and the outcome variable Y. The
statsmodels library is used for building statistical models.
The OLS API in statsmodels.api is used to estimate the
parameters of simple linear regression. It takes two
parameters, Y and X. In this data, Y is Salary and X is
Percentage in Grade 10. By default, OLS estimates only the
coefficient of X (Beta 1, the slope). To estimate Beta 0, a
constant column of 1s needs to be added to X as a separate
column. Its coefficient is the intercept term.
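To see why the constant column matters, here is a minimal sketch that uses NumPy's least-squares solver instead of statsmodels. The toy arrays `x` and `y` are made up for illustration and are not taken from MBA Salary.csv.

```python
import numpy as np

# Hypothetical toy data generated exactly from y = 2 + 3x (not the MBA data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Without a constant column the only parameter is the slope,
# so the fitted line is forced through the origin.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a column of ones, least squares estimates both Beta 0 and Beta 1.
X_with_const = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

print("slope only:", slope_only)
print("intercept, slope:", beta)
```

With the column of ones the fit recovers the intercept and slope used to generate the data. Without it, the slope is distorted because the missing intercept has to be absorbed somewhere.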
import statsmodels.api as sm
X = sm.add_constant(mba_salary_df['Percentage in Grade 10'])
X.head(5)
0 1.0 62.00
1 1.0 76.33
2 1.0 72.00
3 1.0 60.00
4 1.0 61.00
Y = mba_salary_df['Salary']
Y.head()
0 270000
1 200000
2 240000
3 250000
4 180000
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X,Y,train_size = 0.8, random_state = 1
mba_salary_lm = sm.OLS(train_y, train_X).fit()
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fada1eb6290>
The fit() method on OLS estimates the parameters and returns the model information, such as
model parameters (coefficients), accuracy measures, and residual values, to the variable
mba_salary_lm
print(mba_salary_lm.params)
const                     30587.285652
Percentage in Grade 10     3560.587383
dtype: float64
Hence Beta 0 = 30587.286 and Beta 1 = 3560.587. The estimated model is MBA Salary =
30587.285652 + 3560.587383 × (Percentage in Grade 10)
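As a quick illustration of how the estimated line is used, the sketch below plugs one Grade 10 percentage into the equation. The helper name `predict_salary` is hypothetical; the coefficients are the ones reported above, and the example row is the first one displayed by head().

```python
# Coefficients copied from the fitted model reported above.
BETA_0 = 30587.285652   # intercept
BETA_1 = 3560.587383    # change in salary per percentage point in Grade 10

def predict_salary(percentage_grade_10):
    """Return the salary given by the estimated line (hypothetical helper)."""
    return BETA_0 + BETA_1 * percentage_grade_10

# First row shown by head(): 62.00 percent, actual salary 270000.
predicted = predict_salary(62.00)
residual = 270000 - predicted   # actual minus predicted

print(round(predicted, 2), round(residual, 2))
```

The residual is the part of the observed salary that the line does not account for. The diagnostic plots later in the lab are built from these residuals.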
print(mba_salary_lm.summary2())
Hence R Square of the model is 0.211. So the model explains 21.1% of the variation in salary.
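For reference, R-squared can be computed by hand as 1 - SSE/SST. In simple linear regression with an intercept it also equals the squared correlation between X and Y. The sketch below checks this on made-up numbers (not the MBA data), using NumPy rather than statsmodels.

```python
import numpy as np

# Hypothetical toy data, not the MBA salary data.
x = np.array([55.0, 60.0, 62.0, 70.0, 72.0, 76.0, 80.0])
y = np.array([200.0, 250.0, 270.0, 240.0, 300.0, 260.0, 330.0])

# Fit y = b0 + b1 * x by least squares.
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b

# R-squared = 1 - SSE / SST.
sse = np.sum((y - fitted) ** 2)      # unexplained variation
sst = np.sum((y - y.mean()) ** 2)    # total variation
r_squared = 1.0 - sse / sst

# With one predictor and an intercept, this equals the squared correlation.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```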
import matplotlib.pyplot as plt
def get_std_values(vals):
    # Standardise: subtract the mean and divide by the standard deviation
    return (vals - vals.mean()) / vals.std()
x_axis = get_std_values(mba_salary_lm.fittedvalues)
y_axis = get_std_values(mba_salary_lm.resid)
plt.scatter(x_axis, y_axis)
plt.xlabel("Standardised Predicted values")
plt.ylabel("Standardised Residual Values")
plt.title("Residual Plot")
plt.show()
The residual plot shows no funnel shape, which suggests that the residuals have roughly constant variance (homoscedasticity).
from scipy.stats import zscore
mba_salary_df['z_score_salary'] = zscore(mba_salary_df.Salary)
#mba_salary_df.head()
mba_salary_df[(mba_salary_df.z_score_salary > 3.0)| (mba_salary_df.z_score_salary< -3.0)]
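The z-score filter above keeps only rows that lie more than 3 standard deviations from the mean salary. A minimal sketch of the same rule on made-up salaries, written with NumPy only; like the default of scipy.stats.zscore, it uses the population standard deviation (ddof=0).

```python
import numpy as np

# Hypothetical salaries: 19 typical values plus one extreme value.
salaries = np.array([200000.0] * 10 + [250000.0] * 9 + [2000000.0])

# z-score with the population standard deviation (ddof=0),
# which is also the default of scipy.stats.zscore.
z = (salaries - salaries.mean()) / salaries.std(ddof=0)

# Keep only observations more than 3 standard deviations from the mean.
outliers = salaries[np.abs(z) > 3.0]
print(outliers)
```

Note that get_std_values earlier in the lab calls the pandas std() method, whose default is the sample standard deviation (ddof=1), so the two standardisations differ slightly on small samples.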
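import numpy as np

# Cook's distance for each training observation (p holds the matching p-values)
mba_influence = mba_salary_lm.get_influence()
(c, p) = mba_influence.cooks_distance
plt.stem(np.arange(len(train_X)), np.round(c, 3))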
plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")
plt.show()
No observation has a Cook's distance greater than 1. Hence none of them is considered influential.
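For intuition about what the stem plot measures, here is a sketch of Cook's distance computed from its textbook formula, D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2, where p is the number of estimated parameters and h_ii is the leverage. The data is made up, with the last point placed deliberately far from the trend; it is an illustration, not the statsmodels implementation.

```python
import numpy as np

# Hypothetical toy data; the last point is deliberately far off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 2.0])

X = np.column_stack([np.ones_like(x), x])   # design matrix with constant
n, p = X.shape                               # p = number of parameters (2)

# Leverage: diagonal of the hat matrix X (X'X)^-1 X'.
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# OLS coefficients, residuals, and mean squared error.
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
mse = np.sum(resid ** 2) / (n - p)

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2.
cooks_d = resid ** 2 / (p * mse) * hat / (1.0 - hat) ** 2

print(np.round(cooks_d, 3))
print("rows with D > 1:", np.where(cooks_d > 1.0)[0])
```

A point needs both a large residual and high leverage to reach a large distance, which is why only the last, far-out observation crosses the threshold of 1 in this toy example.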
from sklearn.metrics import r2_score, mean_squared_error
pred_y = mba_salary_lm.predict(test_X)
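# np.abs drops the sign of R2; a negative test R2 would mean the model predicts worse than the mean of test_y
print('R2 Score =', np.abs(r2_score(test_y, pred_y)))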
print('RMSE = ', np.sqrt(mean_squared_error(test_y,pred_y)))
R2 Score = 0.156645849742304
RMSE = 73458.04348346895
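The two test-set metrics can also be written out by hand, which makes their interpretation clearer. The sketch below uses the standard definitions with made-up hold-out values (not the MBA test split); the helper names `rmse` and `r2` are hypothetical. It also shows that R2 on unseen data can be negative when predictions are worse than always predicting the mean.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def r2(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse = np.sum((actual - predicted) ** 2)
    sst = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - sse / sst

# Hypothetical hold-out values (not the MBA test split).
actual = [200.0, 250.0, 300.0]
good_pred = [210.0, 240.0, 290.0]
bad_pred = [300.0, 250.0, 200.0]   # worse than always predicting the mean

print(rmse(actual, good_pred), r2(actual, good_pred))
print(rmse(actual, bad_pred), r2(actual, bad_pred))
```

Because RMSE is in salary units, the value reported above reads as a typical prediction error of roughly 73,000 on the test set.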