Regression Analysis in Machine Learning
We can understand the concept of regression analysis using the below example: suppose a
company runs advertisements every year and records the corresponding sales. Now, the
company wants to spend $200 on advertisement in the year 2019 and wants to know the
prediction about the sales for this year. To solve such prediction problems in machine
learning, we need regression analysis.
o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor affects the other factors.
Types of Regression
There are various types of regression which are used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all the
regression methods analyze the effect of the independent variables on the dependent
variable. Here we are discussing some important types of regression, which are given
below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the simplest and easiest algorithms; it works on regression and
shows the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be expressed
mathematically. For example, we can predict the salary of an employee on the
basis of years of experience using:
Y = aX + b
Here, Y = dependent variable (target variable), X = independent variable
(predictor variable), and a and b are the linear coefficients.
Logistic Regression:
Logistic regression is used for solving classification problems, where the target variable is
categorical or discrete (for example, 0 or 1). It uses the sigmoid (logistic) function, and
when we provide the input values (data) to this function, it produces an S-shaped curve.
o It uses the concept of threshold levels: values above the threshold level are
rounded up to 1, and values below the threshold level are rounded down to 0 (see
the sketch after the list below).
Based on the categories of the target variable, there are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multi (cats, dogs, lions)
o Ordinal (low, medium, high)
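As a minimal, hedged sketch of the sigmoid and thresholding idea described above (the sample scores and the 0.5 threshold are illustrative assumptions, not part of the original example):

import numpy as np

def sigmoid(z):
    # Standard logistic (sigmoid) function, which produces the S-shaped curve
    return 1 / (1 + np.exp(-z))

# Illustrative linear scores for a few observations (assumed values)
scores = np.array([-2.0, -0.3, 0.4, 3.1])
probabilities = sigmoid(scores)                    # values between 0 and 1
predictions = (probabilities >= 0.5).astype(int)   # above the threshold -> 1, below -> 0

print(probabilities)   # approximately [0.12 0.43 0.60 0.96]
print(predictions)     # [0 0 1 1]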
Polynomial Regression:
Note: This is different from Multiple Linear Regression in that, in Polynomial Regression,
a single variable is raised to different degrees (powers), instead of using multiple variables
each with the same degree.
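A minimal, hedged sketch of polynomial regression using scikit-learn is given below; the toy data and the choice of degree 2 are assumptions made only for illustration.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative non-linear data: y roughly follows the square of x
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.4])

# Expand the single feature x into [x, x^2]: one variable with different degrees
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# Fit an ordinary linear model on the expanded polynomial features
model = LinearRegression()
model.fit(x_poly, y)

print(model.predict(poly.transform([[6]])))   # prediction for x = 6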
Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:
o Hyperplane: the best-fit line (typically drawn in blue in illustrations) that covers the
maximum number of data points.
o Boundary lines: the two lines drawn on either side of the hyperplane, which define a
margin around it.
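A minimal, hedged sketch of Support Vector Regression with scikit-learn follows; the kernel, C, epsilon, and toy data are assumptions for illustration only.

import numpy as np
from sklearn.svm import SVR

# Toy continuous data for illustration
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([3.0, 4.5, 6.1, 7.9, 10.2, 12.1])

# epsilon sets the width of the margin between the hyperplane and the
# boundary lines; points outside that margin drive the fit
svr = SVR(kernel="rbf", C=100, epsilon=0.5)
svr.fit(x, y)

print(svr.predict([[7]]))   # prediction for a new observation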
o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision Tree regression builds a tree-like structure in which each internal node
represents a "test" on an attribute, each branch represents the result of the test,
and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (the dataset),
which splits into left and right child nodes (subsets of the dataset). These child nodes
are further divided into their own child nodes, and themselves become the parent
nodes of those nodes. A typical example of Decision Tree regression is a model that
predicts a person's choice between a sports car and a luxury car.
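The car-choice example above is categorical; as a hedged numeric illustration of decision tree regression, the sketch below uses scikit-learn's DecisionTreeRegressor on made-up data (the data and max_depth are assumptions).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: predicting a price from a single numeric attribute
x = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([10, 12, 15, 20, 28, 35, 40, 44])

# Each internal node of the fitted tree tests the attribute against a
# threshold; each leaf stores the average target value of its subset
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(x, y)

print(tree.predict([[5.5]]))   # follows the splits down to a leaf value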
o Random forest is one of the most powerful supervised learning algorithms which
is capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output based on the average of each
tree's output. The combined decision trees are called base models, and the
combination can be represented more formally as:
g(x) = f0(x) + f1(x) + f2(x) + ....
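As a hedged sketch of this averaging idea, the example below uses scikit-learn's RandomForestRegressor on toy data; the number of trees and the data itself are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data for illustration
x = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([10, 12, 15, 20, 28, 35, 40, 44])

# n_estimators decision trees (the base models) are trained on random
# subsets of the data; the forest prediction averages their outputs
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x, y)

print(forest.predict([[5.5]]))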
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a
small amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty.
We can compute this penalty term by multiplying lambda with the squared weight
of each individual feature.
o The cost function for ridge regression therefore becomes:
Cost = sum of squared residuals + λ × (sum of the squared weights)
o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables, so to solve such problems, Ridge regression
can be used.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples. A short
code sketch is given after this list.
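A minimal, hedged sketch of ridge regression with scikit-learn; here alpha plays the role of lambda, and the collinear toy data is an assumption for illustration.

import numpy as np
from sklearn.linear_model import Ridge

# Two highly collinear features, a case where plain linear regression struggles
x = np.array([[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.1], [5, 10.0]])
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1])

# alpha scales the sum of squared weights added to the cost
# (the ridge penalty, i.e. L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(x, y)

print(ridge.coef_, ridge.intercept_)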
Lasso Regression:
o Lasso regression is another regularization technique used to reduce the complexity of the model.
o It is similar to Ridge regression, except that the penalty term contains only the absolute weights
instead of the squared weights; it is therefore also called L1 regularization.

Linear Regression in Machine Learning
Linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence it is called linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Mathematically, we can represent a linear regression as:
y= a0+a1x+ ε
Here,
The values for x and y variables are training datasets for Linear Regression model
representation.
Different values for the weights or coefficients of the line (a0, a1) give different lines of
regression, so we need to calculate the best values for a0 and a1 to find the best-fit line.
To calculate these, we use a cost function.
Cost function-
o The different values for weights or coefficient of lines (a0, a1) gives the different line
of regression, and the cost function is used to estimate the values of the coefficient
for the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the
above linear equation, MSE can be calculated as:
MSE = (1/N) Σ (yi − (a1xi + a0))²
Where,
N = total number of observations, yi = actual value, and (a1xi + a0) = predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is
called a residual. If the observed points are far from the regression line, the residuals will
be high, and so the cost function will be high. If the scatter points are close to the
regression line, the residuals will be small and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting initial values for the coefficients and then iteratively
updating them to reach the minimum of the cost function, as shown in the sketch below.
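The sketch below illustrates gradient descent on toy data by minimizing the MSE of y = a0 + a1x; the data, learning rate, and number of iterations are assumptions chosen only for demonstration.

import numpy as np

# Toy data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

a0, a1 = 0.0, 0.0        # initial coefficient values
learning_rate = 0.01
n = len(x)

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of MSE = (1/n) * sum((a0 + a1*x - y)^2) with respect to a0 and a1
    grad_a0 = (2 / n) * np.sum(error)
    grad_a1 = (2 / n) * np.sum(error * x)
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print(a0, a1)   # should approach the intercept and slope of the best-fit line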
Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations.
The process of finding the best model out of various models is called optimization. It can
be achieved by the below method:
1. R-squared method: R-squared is a statistical method that determines the goodness of
fit on a scale of 0 to 100%; it is also called the coefficient of determination.

Simple Linear Regression
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on
continuous or categorical values.
Simple Linear Regression has mainly two objectives:
o Model the relationship between the two variables, such as the relationship between
income and expenditure, or experience and salary.
o Forecast new observations, such as forecasting the weather according to temperature,
or the revenue of a company according to the investments in a year.
y= a0+a1x+ ε
Where,
a0= It is the intercept of the Regression line (it can be obtained by putting x = 0)
a1= It is the slope of the regression line, which tells whether the line is increasing or
decreasing.
ε = The error term. (For a good model it will be negligible)
Implementation of Simple Linear Regression Algorithm
using Python
Problem Statement example for Simple Linear Regression:
Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:
o To find out if there is any correlation between these two variables.
o To find the best-fit line for the dataset.
o To see how the dependent variable changes as the independent variable changes.
In this section, we will create a Simple Linear Regression model to find out the best fitting
line for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we
need to follow the below steps:
The first step for creating the Simple Linear Regression model is data pre-processing. We
have already done it earlier in this tutorial. But there will be some changes, which are
given in the below steps:
o First, we will import the three important libraries, which will help us for loading the dataset,
plotting the graphs, and creating the Simple Linear Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:
data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder
IDE screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.
Note: In Spyder IDE, the folder containing the code file must be saved as a working directory,
and the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable is
salary. Below is code for it:
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable we have used -1, since we want to remove
the last column from the dataset. For the y variable we have used 1 as the index, since we
want to extract the second column and indexing starts from zero.
By executing the above lines of code, we will get the output for the x and y variables. In
the output, we can see that the X (independent) and Y (dependent) variables have been
extracted from the given dataset.
o Next, we will split both variables into the test set and training set. We have 30 observations,
so we will take 20 observations for the training set and 10 observations for the test set.
We are splitting our dataset so that we can train our model using a training dataset and
then test the model using a test dataset. The code for this is given below:
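A minimal sketch of this split, assuming scikit-learn's train_test_split with a one-third test size (10 of the 30 observations) and a fixed random_state for reproducibility:

from sklearn.model_selection import train_test_split

# Split 30 observations into 20 for training and 10 for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)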
Output: the test dataset and the training dataset will appear in the variable explorer.
o For Simple Linear Regression, we will not use feature scaling, because the Python libraries
take care of it in some cases, so we don't need to perform it here. Now our dataset is well
prepared, and we are ready to start building the Simple Linear Regression model for the
given problem.
Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model library from scikit-learn. After importing
the class, we will create an object of the class named regressor.
The code for this is given below:
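A sketch of this step, following the description above (importing LinearRegression from sklearn.linear_model, creating the regressor object, and fitting it to the training set):

from sklearn.linear_model import LinearRegression

# Create the regressor object and fit it to the training data
regressor = LinearRegression()
regressor.fit(x_train, y_train)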
In the above code, we have used the fit() method to fit our Simple Linear Regression object
to the training set. In the fit() function, we have passed x_train and y_train, which are our
training data for the independent and dependent variables. We have fitted our regressor
object to the training set so that the model can easily learn the correlations between the
predictor and target variables. After executing the above lines of code, we will get the
below output.
Output:
In the above output, the regressor object has been fitted to the training set, establishing
the relationship between the dependent variable (Salary) and the independent variable
(Experience). So, now, our model is ready to predict the output for new observations. In
this step, we will provide the test dataset (new observations) to the model to check
whether it can predict the correct output or not.
We will create two prediction vectors, y_pred and x_pred, which will contain the
predictions for the test dataset and the training set, respectively.
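A sketch of the prediction step described above, using the fitted regressor to predict salaries for the test set and the training set:

# Prediction of the test set and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)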
On executing the above lines of code, two variables named y_pred and x_pred will be
generated in the variable explorer; they contain the salary predictions for the test set and
the training set.
Output:
You can check the variable by clicking on the variable explorer option in the IDE, and also
compare the result by comparing values from y_pred and y_test. By comparing these
values, we can check how good our model is performing.
Now in this step, we will visualize the training set result. To do so, we will use the scatter()
function of the pyplot library, which we have already imported in the pre-processing step.
The scatter() function will create a scatter plot of observations.
On the x-axis we will plot the years of experience of employees, and on the y-axis the
salary of employees. In the function, we will pass the real values of the training set, i.e.,
the years of experience (x_train), the training-set salaries (y_train), and the color of the
observations. Here we are taking a green color for the observations, but it can be any
color as per choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of
the pyplot library. In this function, we will pass the years of experience for training set,
predicted salary for training set x_pred, and color of the line.
Next, we will give the title for the plot. So here, we will use the title() function of
the pyplot library and pass the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
Finally, we will represent all above things in a graph using show(). The code is given below:
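A sketch of the plotting code described above; the plot title follows the text, while the exact axis label strings are assumptions.

# Visualizing the Training set results
mtp.scatter(x_train, y_train, color="green")   # actual observations
mtp.plot(x_train, x_pred, color="red")         # regression line (predicted training salaries)
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()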
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real observations as green dots and the predicted values
covered by the red regression line. The regression line shows a correlation between the
dependent and independent variables.
The goodness of fit of the line can be assessed by calculating the difference between the
actual values and the predicted values. As we can see in the above plot, most of the
observations are close to the regression line, hence our model is good for the training set.
In the previous step, we have visualized the performance of our model on the training set.
Now, we will do the same for the Test set. The complete code will remain the same as the
above code, except in this, we will use x_test, and y_test instead of x_train and y_train.
Here we are also changing the color of observations and regression line to differentiate
between the two plots, but it is optional.
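A sketch of the test set plot, reusing the code above with x_test and y_test and with different colors as described (blue observations, red regression line); the title string is an assumption.

# Visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")      # actual test observations
mtp.plot(x_train, x_pred, color="red")         # regression line from the training fit
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()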
Output:
By executing the above line of code, we will get the output as:
In the above plot, the observations are shown in blue, and the prediction is given by the
red regression line. As we can see, most of the observations are close to the regression
line; hence we can say that our Simple Linear Regression model is a good model and is
able to make good predictions.