
CAPSTONE PROJECT

HOUSE PRICE PREDICTION


Sundeep Trivedi
Contents

1. Introduction
2. Problem Statement
2.1. Objective
2.2. Data Description
3. Data Report
3.1. Dimension of data
3.2. Column Names
3.3. Renaming column names
3.4. Overview of data
4. Exploratory Data Analysis
4.1. Structure of the data
4.2. Summary of Data
4.3. Missing Values
5. Data Treatment
5.1. Creating New Variable
5.2. Removing $ Symbols
5.3. Conversion of Variables
5.4. Treating Missing Values and Checking Correlation
5.5. Correlation Plot
6. Univariate Analysis
6.1. Bar Plot
6.2. Density Plot
6.3. Bar Plot after conversion
6.4. Zip Code Analysis
6.4.1. Properties in Zip code according to price
6.4.2. Properties in Zip code according to Furnished
6.4.3. Properties viewed per zip code
7. Bivariate Analysis
7.1. Box Plot
8. Final Dataset
8.1. Correlation Check
8.2. Outliers Check
8.3. Treating Outliers
9. Splitting the Dataset into train and test
10. Model Building
10.1. Linear Regression Model
10.1.1. Linear Model 1
10.1.2. Linear Model 2
10.1.3. Linear Model 3 using Leaps
10.2. Transforming the variables to improve Significance
10.2.1. Linear Model 4
10.2.2. Linear Model 5
10.2.3. Residual Analysis
10.2.4. Linear Model 6
10.2.5. Box Cox Transformation
10.2.6. Linear Model 7
10.3. Comparison of Linear Models
10.4. Decision Trees
10.4.1. Decision Tree 1
10.4.2. Decision Tree 2
10.4.3. Decision Tree 3
10.4.4. Decision Tree 4
10.4.5. Decision Tree 5
10.4.6. Decision Tree 6
10.5. Model Performance of Decision Trees
10.6. Random Forest
10.6.1. Random Forest Model 1
10.6.2. Random Forest Model 2
10.7. Comparison of Random Forest Models
10.8. XG Boosting
10.8.1. XG Model
10.9. Performance of XG Boosting
11. Ensemble Techniques
11.1. Ensemble Model 1
11.2. Ensemble Model 2
11.3. Stacking
12. Performance comparison of all models
13. Conclusion
1. Introduction

The price of a house depends on various factors such as size or area, number of bedrooms, location, the prices of other houses in the vicinity, and many more. Real estate investors would like to find out the actual value of a house in order to buy and sell properties: they lose money when they pay more than the current market value or sell for less than it.

Banks are also interested in knowing the actual value of a house when taking it as collateral for a loan, since at times loan applicants overvalue their houses in order to maximize the loan amount. Local home buyers can also predict the price of a house to ascertain whether a seller is quoting more than the actual market value, and local sellers can likewise ascertain a fair price for their houses.

2. Problem Statement

A house's value is about more than location and square footage; just as with the qualities that make up a person, an educated buyer would want to know all the aspects that give a house its value.

For example, suppose you want to sell a house and do not know what selling price to quote. It cannot be too low or too high. To determine the actual market price, you compare the house with similar properties in the vicinity and, based on the gathered data, try to assess the market value of your house.

2.1. Objective
To analyze and predict the price of a house using the feature variables given in the dataset.

2.2. Data Description
Variable Description
cid ID variable for a house
dayhours Date the house was sold
price Price of the house
room_bed Number of bedrooms per house
room_bath Number of bathrooms per bedroom
living_measure Square footage of the home
lot_measure Square footage of the lot
ceil Total floors in the house
coast Whether the house has a waterfront view
sight Whether the property has been viewed
condition How good the overall condition is
quality Grade given to the housing unit
ceil_measure Area of the house apart from the basement
basement Area of the basement
yr_built Year the house was built
yr_renovated Year the house was renovated
zipcode Zip code
lat Latitude coordinate
long Longitude coordinate
living_measure15 Living area in 2015 (renovations may or may not have affected the lot size)
lot_measure15 Lot size in 2015 (after any renovation)
furnished Whether the home is furnished or not
total_area Sum of living and lot area (home + plot)

3. Data Report

3.1. Dimension of data


The dataset consists of 21613 rows and 23 columns.

3.2. Column Names


Following are the column names in the dataset

Some column names are ambiguous, so we will rename them for easier understanding.

3.3 Renaming column names

Variable Description
House_ID ID variable for a house
Date_of_Sale Date the house was sold
Price Price of the house
Bedrooms Number of bedrooms per house
Bath_Bed_Ratio Number of bathrooms per bedroom
Home_area Square footage of the home
Plot_area Square footage of the lot
Floors Total floors in the house
Water_facing Whether the house has a waterfront view
Viewed Whether the property has been viewed
Condition How good the overall condition is
Quality Grade given to the housing unit
Floor_area Area of the house apart from the basement
Basement Area of the basement
Yr_Built Year the house was built
Yr_renovated Year the house was renovated
Zipcode Zip code
lat Latitude coordinate
long Longitude coordinate
Home_area15 Living area in 2015 (renovations may or may not have affected the lot size)
Plot_area15 Lot size in 2015 (after any renovation)
Furnished Whether the home is furnished or not
Total_area Sum of living and lot area (home + plot)

3.4. Overview of data

4. Exploratory Data Analysis

4.1. Structure of the data

There are a few variables that should be treated as categorical but are stored in numerical format, and a few variables contain unwanted symbols (such as the $ sign). We will treat these variables in the coming analysis.

4.2. Summary of Data

We observe that the dataset consists of 21613 records with 23 columns, and from the summary we see that there are NA's in Bedrooms, Bath_Bed_Ratio, Home_area, Plot_area, Water_facing, Home_area15, Plot_area15 and Furnished.

A few variables, such as Floors, Water_facing, Condition and Total_area, also contain $ symbols in addition to missing entries.

The target variable is Price, which records the house prices. Bedrooms, Bath_Bed_Ratio, Home_area, Plot_area, Quality, Condition, Floor_area, Yr_built, Total_area, etc. are some important variables that tell us about the properties of the house. Zipcode gives us information about the number of properties present in each area.

The variables Bedrooms, Floor_area, Condition and Quality can be converted to categorical variables. Bath_Bed_Ratio and Floors have decimal values and can also be categorized, as can Quality.

Water_facing, Floors, Yr_built, Condition and Total_area contain $ symbols and need to be treated.

Date_of_sale is not in the correct format. We need to reformat it and use only the year value from it.

4.3. Missing Values

From the plot above, the variable Yr_renovated shows 0 percent missing values, but on closer inspection we observe that many entries in this variable are recorded as 0.

From the table, we see that 20699 values are recorded as 0. This amounts to 20699/21613, about 96% of the values effectively absent. Hence, removing this column will not affect our data.

Both Bedrooms and Bath_Bed_Ratio account for about 0.5% missing values.

Water_facing has about 0.26% missing values; this variable can also be categorized.

Home_area has 0.08% missing values.

Plot_area has 0.19% missing values.

Home_area15 accounts for 0.77% missing values.

Plot_area15 and Furnished each have 0.13% missing values.

5. Data Treatment

5.1. Creating New Variable


The variable Date_of_sale is not in the correct format. It gives us information about the date of sale of the house, which we will need for further analysis. Hence, we reformat this variable to extract the year value alone.

We are creating a new variable called Yr_sale which will contain this information.

Similarly, Yr_built has too many levels, so instead of using it directly we will create a new variable called Age_at_sale (to be used at a later stage). Both derivations are sketched below.
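A minimal sketch of these two derived variables in R, assuming the raw dates are strings beginning with the four-digit year (e.g. "20141013T000000"); the data frame name house is illustrative:

# Extract the sale year from the raw date string
house$Yr_sale <- as.numeric(substr(as.character(house$Date_of_sale), 1, 4))

# Age of the house at the time of sale
house$Age_at_sale <- house$Yr_sale - house$Yr_built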

5.2. Removing $ Symbols


The variables Floors, Water_facing, Condition, Yr_built, long and Total_area contain $ symbols. We treat these by replacing the affected entries with NA's, as sketched below.
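A sketch of this treatment, assuming the affected columns were read in as character vectors (the data frame name house is illustrative):

cols <- c("Floors", "Water_facing", "Condition", "Yr_built", "long", "Total_area")
for (col in cols) {
  x <- as.character(house[[col]])
  x[grepl("\\$", x)] <- NA                           # entries containing a $ become NA
  house[[col]] <- suppressWarnings(as.numeric(x))    # coerce the cleaned column back to numeric
}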

5.3. Conversion of Variables
We are converting a few variables to categorical. House_ID, Water_facing, Viewed, Zipcode and Furnished are all converted to categorical variables.

We recode the Water_facing and Furnished variables with the two values 'Yes' and 'No' for ease of understanding.

Similarly, the Viewed variable contains counts 1, 2, 3, 4, ...; we group these as 'Yes' if the property has been viewed at least once, and 'No' otherwise, as sketched below.
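A sketch of these conversions, assuming Water_facing and Furnished are coded 0/1 and Viewed holds view counts (names illustrative):

house$House_ID <- as.factor(house$House_ID)
house$Zipcode  <- as.factor(house$Zipcode)

# Recode the two binary indicators as Yes/No factors
house$Water_facing <- factor(ifelse(house$Water_facing == 1, "Yes", "No"))
house$Furnished    <- factor(ifelse(house$Furnished == 1, "Yes", "No"))

# Viewed: Yes if viewed at least once, otherwise No
house$Viewed <- factor(ifelse(house$Viewed >= 1, "Yes", "No"))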

5.4. Treating Missing Values and Checking Correlation


There are a few NA's in Yr_built (15 NA's), and we remove those values.

We will remove the remaining missing values in a two-step process: first, we group all the numerical variables and treat their missing values; next, we combine the categorical variables with the treated numerical variables and remove any missing values still present.

Before treating the missing values, we plot a correlation matrix to check whether any variables have a correlation value of 1.

From the correlation plot we see that Plot_area and Total_area have a correlation value of 1, meaning both carry the same information, so we must choose one to remove. We remove Total_area, as it has 68 NA's (missing values).

Having removed the Total_area variable, we now check for missing values again.

We see that Bedrooms, Bath_Bed_Ratio, Home_area, Plot_area, Floors, Condition, long, Home_area15 and Plot_area15 have missing values.

We now proceed to treat the missing values using MICE, taking the mean as the method.

The missing values of numerical type have now been treated.

Checking for missing values across the entire dataset, after combining the remaining variables, we see that Water_facing and Furnished still have missing values. These are treated using MICE as well, this time with the logistic regression method, since both variables are categorical. Both passes are sketched below.
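A minimal sketch of the two MICE passes described above, using the mice package's "mean" method for the numeric columns and "logreg" for the binary factors (object names illustrative):

library(mice)

# Step 1: mean imputation for the numeric columns
num_cols <- sapply(house, is.numeric)
imp_num  <- mice(house[, num_cols], method = "mean", m = 1, maxit = 1, seed = 123)
house[, num_cols] <- complete(imp_num)

# Step 2: logistic-regression imputation for the remaining binary factors
imp_cat <- mice(house, method = "logreg", m = 1, seed = 123)
house   <- complete(imp_cat)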

After treating the entire dataset

We have removed all NA’s from our dataset after treating them.

5.5 Correlation Plot

From the correlation plot it can be observed that there is high correlation between Home_area and Floor_area, Home_area and Quality, Home_area and Bath_Bed_Ratio, and Home_area and Price.

Similarly, there is high correlation between Quality and Floor_area, Quality and Bath_Bed_Ratio, and Quality and Price.

6. Univariate Analysis

6.1. Bar Plot

From the bar plots we observe the following for these variables:

Water_facing – most houses do not have a water-facing view
Viewed – many properties have not been viewed
Furnished – many properties are not furnished

6.2. Density Plot

We observe that the distribution is not uniform for a few variables. We will explore these variables.

6.2.1 Bath_Bed_Ratio

From the plot we see that the distribution of Bath_Bed_Ratio is not uniform. We can convert this variable to a factor, as its range is 0 to 8 (and it includes some decimal values). We will categorize it as Low, Medium and High, as sketched below.
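A sketch of this bucketing with cut(); the report does not state the cutpoints, so the breaks below are illustrative assumptions:

house$Bath_Bed_Ratio <- cut(house$Bath_Bed_Ratio,
                            breaks = c(-Inf, 1, 2.5, Inf),   # assumed cutpoints
                            labels = c("Low", "Medium", "High"))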

6.2.2 Bedrooms

As with Bath_Bed_Ratio, the plot shows that the Bedrooms distribution is uneven and can be categorized. Some houses have more than 10 bedrooms. We will categorize them as "0-4, 4-6 and Above 7".

6.2.3 Condition

The Condition variable indicates how good the overall condition of the house is. From the plot we see that some values are decimals; we will remove these by converting the values to integers.

6.2.4 Floors

From the distribution we see a few decimal values; we will remove them by converting the values to integers.

6.2.5 Quality

From the distribution we can also categorize the Quality variable; its values range from 0 to 13. We will categorize this variable as well.

After conversion of all the required variables, an overall summary can be observed.

6.3 Bar Plot after conversion

6.4 Zip Code Analysis

Top 10 zip codes with the greatest number of properties available

Distribution of Properties across Zip Codes

We see that zip code 98103 has the highest number of properties and zip code 98039 has the fewest.

6.4.1 Properties in Zip code according to price

From the plot we can see that zip code 98039 has the highest-priced properties, which also have larger home areas.

Zip code 98168 has properties with low prices and small home sizes.

6.4.2 Properties in Zip code according to Furnished

We see from the plot that zip code 98006 has the highest number of furnished houses, but the prices of those properties are quite low (which may imply that the furnishing is not very good).

Zip code 98039 has fewer furnished properties, but the prices of its houses are high.

6.4.3 Properties viewed per zip code
We can see that the properties in zip code 98006 have been viewed many times, and the prices in that area are cheaper.

The properties in area 98040, which have higher prices, have also been viewed many times, while the properties in area 98039, which have the highest prices, have been viewed fewer times.

Properties in areas 98002, 98031 and 98077 have been viewed the least, meaning that properties in these areas have rarely been visited.

Properties in area 98148 have not been viewed at all.

7. Bivariate Analysis

7.1 Box Plot


7.1.1 Age of house and Quality

7.1.2 Price and Quality

From this plot we see that as the quality increases, the price of the house also increases. It can be observed that the price is highest for houses whose quality rating is more than 3.

7.1.3 Price and Condition

From the plot we see that prices are high for houses whose condition rating is greater than 3.

7.1.4 Price and Furnished

We see that the prices of furnished houses are higher than those of unfurnished houses.

7.1.5 Price and Water Facing

8. Final Dataset

After all the analysis and conversion of the data, we finalize our dataset with the following variables:

Before we proceed to use this dataset to build our model, we exclude the lat and long variables, as their values are not very diverse (the decimal differences are not significant). Hence, we exclude them from our dataset.

As observed from the summary, there are no NA's.

8.1 Correlation Check
From our previous analysis we saw that there was some correlation between the variables. Let us now check for correlation in our final dataset.

From the correlation plot we observe the following:

There is a high correlation between Floor_area and Home_area (0.88). Similarly, we see correlation between Home_area and Price, Home_area15 and Home_area, Home_area15 and Floor_area, and Plot_area15 and Plot_area.

8.2 Outliers Check
We will first check for outliers in the complete dataset.

From the plot we can observe that a few variables have outliers that need to be treated.

Before the treatment, let us analyze them at the individual level, exploring the outliers with respect to price bucket.

We observe that, except for age at sale, there are more outliers in the cheap price categories, followed by the affordable categories. Let us treat them before moving forward.

8.3. Treating Outliers
Before we treat the outliers, a cautious approach must be followed: they should not be removed without proper analysis, as the outliers in some variables may carry real meaning. We treat the outliers by winsorization, substituting the outliers on the high side with the 95th percentile and the outliers on the low side with the 5th percentile, as sketched below.
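A minimal base-R sketch of this winsorization (the helper name and the choice to treat all numeric columns are illustrative):

winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[x < q[1]] <- q[1]   # floor low outliers at the 5th percentile
  x[x > q[2]] <- q[2]   # cap high outliers at the 95th percentile
  x
}

num_cols <- sapply(house, is.numeric)
house[num_cols] <- lapply(house[num_cols], winsorize)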

Plotting the chart after treating the outliers, we see that most of the outliers have been removed.

9. Splitting the Dataset into train and test

Data is split in the ratio of 70:30.
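A sketch of this 70:30 split (the seed and object names are illustrative):

set.seed(123)   # assumed seed, for reproducibility
idx   <- sample(seq_len(nrow(house)), size = floor(0.7 * nrow(house)))
train <- house[idx, ]    # 70% of the rows
test  <- house[-idx, ]   # remaining 30%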

Dimension of Dataset

Train Set dimension

Test Set dimension

10. Model Building
10.1. Linear Regression Model
Linear regression is used to predict a quantitative outcome variable (y) based on one or multiple predictor variables (x).

The goal is to build a mathematical formula that defines y as a function of the x variables. Once we have built a statistically significant model, it can be used to predict future outcomes based on new x values.

There are two important metrics we use to measure the performance of a regression model:

1) Root Mean Squared Error (RMSE) – measures the model prediction error. It corresponds to the average difference between the observed values and the values predicted by the model. The lower the RMSE, the better the model.
2) R-Square (R2) – represents the squared correlation between the observed outcome values and the values predicted by the model. The higher the R2, the better the model.
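For reference, writing y_i for the observed prices, \hat{y}_i for the model predictions and \bar{y} for the mean observed price, the two metrics are

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

For ordinary least squares with an intercept, this R^2 coincides with the squared correlation between observed and predicted values.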

Also, a few other things to note:

The sum of the squares of the residual errors is called the Residual Sum of Squares (RSS). The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE).

10.1.1 Linear Model 1
We build our first model using all the variables, as sketched below. After running the model, we get the following output:
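A sketch of this first fit (object names illustrative):

model1 <- lm(Price ~ ., data = train)   # regress Price on all remaining variables
summary(model1)                         # coefficients, p-values, R-squared, F-statistic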

From the above output we get the following information:

Residuals: provide a quick view of the distribution of the residuals.

Estimate: gives the intercept and the beta coefficient estimate associated with each predictor variable.

Std. Error: the standard error of the coefficient estimate. The larger the standard error, the less confident we are about the estimate.

t value: the coefficient estimate divided by the standard error of the estimate.

Pr(>|t|): the p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate.

Residual Standard Error (RSE), R-Squared (R2) and the F-statistic are metrics used to check how well the model fits our data.

From the model we have built, we see that most of the variables are significant except Floor_area, Basement, Plot_area and Quality; Condition appears only slightly significant.

According to this model's performance, we get an R-square value of 73.17% and an adjusted R-square of 73.12%.

A key purpose of linear regression is interpretation: the coefficients from these models are critical for interpreting the variables.

Checking for multicollinearity

We can check for multicollinearity by computing the VIF (see the sketch below). Following are the observations:

The VIF values for Home_area and Floor_area are high, and Basement also has a higher VIF than the other variables.

As we already saw from the correlation plot, Home_area and Floor_area have a correlation of 0.88. We will remove one of these and build our model again.
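A sketch of the VIF check referred to above, using the car package (a common rule of thumb flags values well above about 5):

library(car)
vif(model1)   # large values indicate multicollinearity among the predictors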

10.1.2 Linear Model 2
We now build our model again after removing Floor_area. We see that the R-square value barely changes, while the problem of multicollinearity is addressed.

From this model we see that most of the variables are significant except Basement and Plot_area; Quality and Condition are barely significant.

According to this model's performance, the R-square value is 73.16% and the adjusted R-square is 73.12%, essentially the same as the previous model. Let us now check whether the multicollinearity has been reduced.
Checking for Multicollinearity

We observe from the VIF output that the multicollinearity has been reduced.

We will now try to improve the R-square value. For this we use feature selection to determine which variables to keep and which to remove before fitting the model. We use the leaps package, which builds regression subsets from combinations of the variables and returns the best ones.

Running leaps gives us the important variables. Now let us build our model using those variables, as sketched below.
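A sketch of this feature selection with leaps::regsubsets; the forward search, the subset-size cap and the adjusted-R-square criterion are illustrative choices:

library(leaps)

subsets <- regsubsets(Price ~ ., data = train, nvmax = 15, method = "forward")
subsum  <- summary(subsets)
best    <- which.max(subsum$adjr2)   # subset size with the highest adjusted R-squared
coef(subsets, best)                  # variables retained in the chosen subset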

10.1.3 Linear Model 3 using Leaps
Let us now observe how many variables are selected by the leaps function. Leaps reduces the number of variables, which facilitates understanding and better interpretation: a simpler model is easier to understand, and the more complex the model, the greater the chance of overfitting.

Building the model with the important variables obtained from the feature selection via the leaps regression subsets gives the following result:

According to this model's performance, the R-square value is 71.49% and the adjusted R-square is 71.47%.

We still observe that the Quality variable is not significant, and with this method the performance of the model is not very high. So let us build the model with some other technique.

Checking for multicollinearity

Model 3's performance is not as good as that of models 1 and 2.

Also, from models 1 and 2 we observed that a few variables were not significant, so we will improve the significance of those variables by transforming them.

Basement was not significant; we will try to increase its significance by dividing it into buckets.

Similarly, Quality was not significant. Quality of 10 and above matters more, as more people tend to purchase properties with a higher rating.

We will also try to improve Condition, as that variable was not significant either.

(Recall that R-square, the coefficient of determination, is the share of variance explained by the predictor variables.)

10.2 Transforming the variables to improve Significance

Basement is transformed into three buckets: Minimal, Adequate and Large basement areas.

If Quality is 10 or above we label the property Premium; otherwise Standard.

Similarly for Condition: a rating of 5 seems to matter most, so if a house is rated 5 we label it Yes, otherwise No. (A sketch of these transformations follows below.)

After the variable transformation, we again divide our dataset into train and test sets in a 70:30 ratio.
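A sketch of the three transformations; the Basement cutpoints are illustrative assumptions, while the Quality and Condition rules follow the text above:

# Basement buckets -- cutpoints assumed for illustration
house$Basement <- cut(house$Basement,
                      breaks = c(-Inf, 500, 1500, Inf),
                      labels = c("Minimal", "Adequate", "Large"))

# Quality: Premium for grades of 10 and above, otherwise Standard
house$Quality <- factor(ifelse(house$Quality >= 10, "Premium", "Standard"))

# Condition: Yes only for the top rating of 5
house$Condition <- factor(ifelse(house$Condition == 5, "Yes", "No"))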

Original Dataset

Train Dataset

Test Dataset

Now we use this new train and test dataset with the transformed variables to build our model.

10.2.1 Linear Model 4
We see that the R-square values are not much affected and that most of the variables are significant except Plot_area.

After running the model with the transformed data, we get an R-square of 72.73% and an adjusted R-square of 72.7%.

Checking for multicollinearity

10.2.2 Linear Model 5
From the previous model we observed that Plot_area was not significant, so we remove that variable and see how the model performs. We also remove the Floor_area variable, as it had high collinearity.

We see that all the variables are now significant, and this model gives us a better interpretation. Still, we can try to improve the R-square value.

Checking for multicollinearity

10.2.3 Residual Analysis


We will try to improve the model's performance by examining its residuals.

Residuals – the error: the difference between the actual and the predicted value (by how much the prediction missed the actual value).

Std. residuals – the standardized representation of the residuals (converted to z-scores).

Fitted values – the predicted values.

Leverage – an outlier measure (related to Cook's distance). A linear regression model has two sides: the dependent side, which is the target variable, and the independent side, the predictor variables. Both sides may contain outliers (extreme values).

There are many independent variables, and the leverage column condenses their combined influence into one value per observation. Using Cook's distance (with a leverage cutoff of 2), an observation whose leverage measure exceeds 2 is highly influential; removing such observations can yield a marginal improvement.

We flag these observations by setting a threshold on the leverage values (taking the 95th percentile here) and marking them in a new column as High if the leverage exceeds the threshold and Ok if it lies within the normal range, as sketched below.
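A sketch of this flagging step (model and column names illustrative):

lev    <- hatvalues(model5)     # leverage of each observation
thresh <- quantile(lev, 0.95)   # the 95th-percentile threshold from the text

train$lev_flag <- ifelse(lev > thresh, "High", "Ok")
train_ok <- train[train$lev_flag == "Ok", ]   # observations kept for the refit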

From the plot we see that almost all the values fall in place, with a few points where the residuals are high. There is no sign of strong heteroscedasticity, and the residuals look approximately normally distributed.

10.2.4 Linear Model 6
Now let us build our linear regression model excluding the two variables Plot_area and Floor_area, and excluding the observations where the residuals are high. We get the following results:

The R-square value is still similar to that of the previous models. We will try one more transformation technique and then see how our model performs.

10.2.5. Box Cox Transformation
A Box-Cox transformation is a way to transform a non-normal dependent variable into a normal shape; normality is an important assumption for many statistical techniques.

Here the transformation is applied to the dependent variable Price. We take the lambda value at which the log-likelihood is maximized, as sketched below.
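A sketch of this step with MASS::boxcox (the lambda grid is an illustrative choice):

library(MASS)

bc     <- boxcox(Price ~ ., data = train, lambda = seq(-0.5, 0.5, 0.01))
lambda <- bc$x[which.max(bc$y)]   # lambda that maximizes the log-likelihood

# Box-Cox transform of the target
train$Price <- (train$Price^lambda - 1) / lambda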

Now proceeding to build our model

10.2.6 Linear Model 7

The R-squared value dropped when we ran the model with the Box-Cox-transformed target. This is because the model is now evaluated against the transformed variable: it explains the variance of the transformed Price rather than the original one.

So we must back-transform the predictions to evaluate the model on the original scale. To transform the predictions back to the original scale and calculate RMSE and R-square, we apply the inverse Box-Cox transformation

y = (lambda * y' + 1)^(1 / lambda)

where y' is a transformed prediction and lambda is the chosen Box-Cox parameter.

Performance of Model 7 (RMSE and R2):

We see that the performance of this model is better than all the models run so far: it gives a lower RMSE and a higher R-square value. Hence, we proceed with this model.

10.3 Comparison of Linear Models

Train

Test

Plotting the comparisons

Looking at the coefficients of Model 7

Relative importance of coefficients

These are the variables of importance in the linear regression model.

Among the linear regression models, Model 7 performs the best, and hence we select it.

10.4 Decision Trees

The decision tree method is a powerful and popular predictive machine learning technique used for both classification and regression, hence the name Classification and Regression Trees (CART).

The R implementation of decision trees is called rpart (Recursive Partitioning and Regression Trees). The algorithm works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible.

The decision rules generated by the CART predictive model are generally visualized as a binary tree.

We will first build the models using the dataset without any transformed variables, and then build them again using the transformed dataset.

(Decision Trees 1, 2 and 3 – without the transformed dataset; Decision Trees 4, 5 and 6 – with the transformed dataset.)

10.4.1 Decision Tree 1


Let us build our initial decision tree, keeping the CP value at 0, and examine the model results, as sketched below.

(Detailed output results are in Decision_Tree1_output.csv.)
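A sketch of this unpruned tree with rpart's anova (regression) method (object names illustrative):

library(rpart)

dt1 <- rpart(Price ~ ., data = train, method = "anova",
             control = rpart.control(cp = 0))   # cp = 0 grows the full tree
printcp(dt1)   # complexity table, used later to choose the best cp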


10.4.2 Decision Tree 2
The decision tree will now be built by taking the best CP value, as sketched below. Building the tree with this CP gives the results below.

(Detailed output results are in Decision_Tree2_output.csv.)
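A sketch of picking the best cp from the cross-validated error and pruning, continuing from the dt1 sketch above:

best_cp <- dt1$cptable[which.min(dt1$cptable[, "xerror"]), "CP"]
dt2     <- prune(dt1, cp = best_cp)   # tree rebuilt at the best complexity parameter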

10.4.3 Decision Tree 3
We will now build our decision tree with the help of tuneLength. We will also build a decision tree based on a manual grid and then compare the results; both variants are sketched below.

Based on tuneLength

Using Manual Grid

We build the model using the grid instead of tuneLength to observe the results.
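A sketch of both caret variants (the cross-validation folds, tuneLength and grid values are illustrative):

library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Variant 1: let caret pick the cp candidates (tuneLength)
dt3_len <- train(Price ~ ., data = train, method = "rpart",
                 trControl = ctrl, tuneLength = 10)

# Variant 2: supply a manual cp grid
dt3_grid <- train(Price ~ ., data = train, method = "rpart",
                  trControl = ctrl,
                  tuneGrid = expand.grid(cp = seq(0.0001, 0.002, by = 0.0002)))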

10.4.4 Decision Tree 4
Now let us build the decision tree model with the transformed dataset and see the results.

First, we build the tree with CP = 0 on the new dataset.

(Full results are in Decision_Tree4_output.csv.)

10.4.5 Decision Tree 5
Now we build our model with the best cost-complexity parameter on the transformed dataset. From our previous model we found the best CP value to be 0.00028.

(The entire result of the decision tree is attached in a CSV.)

10.4.6 Decision Tree 6
We will again build our decision tree with the help of tuneLength, and also build one based on a manual grid, and then compare the results.

Based on tuneLength

Using Manual Grid

10.5 Model Performance of Decision Trees

Train

Test

We see that the DT2 model performs best on both the train and test datasets, giving a good R-square value and a lower RMSE than the other decision tree models.

Important Variables

10.6 Random Forest

Random forest randomly selects observations/rows and specific features to build multiple decision trees, and then averages the results across all trees.

The prediction error is measured by the RMSE, which corresponds to the average difference between the observed values of the outcome and the values predicted by the model.

10.6.1 Random Forest Model 1


We first build our random forest by tuning mtry with a starting point of 2 (mtryStart). Starting with a small mtry helps keep the individual trees different from one another; otherwise the trees risk being correlated. We also supply a step factor and the number of trees to try, and set doBest to TRUE, as sketched below. We get the following results.
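A sketch of this tuning run with randomForest::tuneRF; the step factor, tree count and improvement threshold are illustrative values:

library(randomForest)

X <- train[, setdiff(names(train), "Price")]
y <- train$Price

rf1 <- tuneRF(X, y,
              mtryStart  = 2,      # starting number of variables tried per split
              stepFactor = 1.5,    # multiply mtry by this factor at each step
              ntreeTry   = 201,    # trees grown for each candidate mtry
              improve    = 0.01,   # minimum relative OOB improvement to continue
              doBest     = TRUE)   # return the forest fitted at the best mtry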

Variable Importance

10.6.2 Random Forest Model 2
We will now tune our model further and see how the results look.

As seen from the results, the model with an mtry of 9 gives the best performance: its RMSE is the lowest among all the results and its R-square value is also high.

10.7 Comparison of Random Forest Models

Train

Test

From the random forest model comparison, we observe that RF2 performs better on both train and test: on the train dataset it gives an R-square of 96.17% and an RMSE of 55513.86, and on the test dataset an R-square of 79.52% and an RMSE of 124024.

10.8 XG Boosting

XGBoost works only on numerical values, while our dataset consists of both numerical and categorical values. We therefore converted the categorical variables into dummy variables with the help of the fastDummies library; the process of creating this set of indicator variables is otherwise known as building a sparse matrix.

Once we have created these columns in the dataset, we use them to build our models. The dummy step is sketched below.
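A sketch of the dummy-variable step with fastDummies::dummy_cols (the argument choices are illustrative):

library(fastDummies)

house_num <- dummy_cols(house,
                        remove_first_dummy      = TRUE,   # avoid the dummy-variable trap
                        remove_selected_columns = TRUE)   # drop the original factor columns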

10.8.1 XG Model
Now let us proceed to build our model, as sketched below; we get the following results.
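A sketch of the boosting fit itself; the hyperparameters are illustrative, and train_num stands for the dummy-encoded training set from the step above:

library(xgboost)

X_train <- as.matrix(train_num[, setdiff(names(train_num), "Price")])

xgb1 <- xgboost(data      = X_train,
                label     = train_num$Price,
                nrounds   = 100,                 # boosting iterations
                max_depth = 6,
                eta       = 0.1,                 # learning rate
                objective = "reg:squarederror",  # regression objective
                verbose   = 0)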

10.9 Performance of XG Boosting

Train

Test

11. Ensemble Techniques

11.1. Ensemble Model 1


For this model we compile all the predictions of the regression models (both train and test) and the decision trees (both train and test).

We then take the row means, i.e. the average of the predictions of the various models that have been used, as sketched below.
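A sketch of this averaging on the test set; the prediction vectors are illustrative, with the regression predictions assumed already back-transformed to the price scale:

pred_lm <- predict(model7, newdata = test)   # regression model predictions
pred_dt <- predict(dt2,    newdata = test)   # decision tree predictions

ens1_test <- rowMeans(cbind(pred_lm, pred_dt))   # simple average of the models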

Train

Test

11.2 Ensemble Model 2


For this model we compile the predictions of the random forest and XGBoost models and then take the row means. We observe that this model gives the following results:

Train

Test

11.3 Stacking
In stacking we do not take the average of the models; instead, we take the predictions of all the models (regression model, decision tree, random forest) and train a random forest on them.

Here we have the price values predicted by the regression model, by the decision tree and by the random forest. We treat these three as predictor variables and the actual Price as the dependent variable, as sketched below. Running this model, we get the results shown under Performance.
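A sketch of this stacking step; the base-model prediction vectors on the training data are illustrative names:

library(randomForest)

stack_train <- data.frame(p_lm  = pred_lm_train,   # regression predictions
                          p_dt  = pred_dt_train,   # decision tree predictions
                          p_rf  = pred_rf_train,   # random forest predictions
                          Price = train$Price)

stacker <- randomForest(Price ~ ., data = stack_train)   # meta-model over base predictions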

Performance
Train

Test

12. Performance comparison of all models

Comparing the performance of all the models, we have:

12.1 Train

12.2 Test

12.3 Comparison Plot

13. Conclusion

Since we are dealing with a regression problem, we focus mainly on two metrics: R-square and RMSE. For an ideal model, the R-square value should be high and the RMSE kept to a minimum.

Of the models built, Ensemble Model 2 performs the best: it gives an R-square of 91.37% and an RMSE of 83401.12 on the train dataset, and an R-square of 79.62% and an RMSE of 123729.6 on the test dataset.

Stacking also performs well, giving an R-square of 92.53% and an RMSE of 77571.36 on the train dataset, and an R-square of 77.54% and an RMSE of 129871.3 on the test dataset.
