
CAPSTONE PROJECT

HOUSE PRICE PREDICTION


Sundeep Trivedi
Contents

1. Introduction
2. Problem Statement
2.1. Objective
2.2. Data Description
3. Data Report
3.1. Dimension of data
3.2. Column Names
3.3. Renaming column names
3.4. Overview of data
4. Exploratory Data Analysis
4.1. Structure of the data
4.2. Summary of Data
4.3. Missing Values
5. Data Treatment
5.1. Creating New Variable
5.2. Removing $ Symbols
5.3. Conversion of Variables
5.4. Treating Missing Values and Checking Correlation
5.5. Correlation Plot
6. Univariate Analysis
6.1. Bar Plot
6.2. Density Plot
6.3. Bar Plot after conversion
6.4. Zip Code Analysis
6.4.1. Properties in Zip code according to price
6.4.2. Properties in Zip code according to Furnished
6.4.3. Properties viewed per zip code
7. Bivariate Analysis
7.1. Box Plot
8. Final Dataset
8.1. Correlation Check
8.2. Outliers Check
8.3. Treating Outliers
9. Splitting the Dataset into train and test
10. Model Building
10.1. Linear Regression Model
10.1.1. Linear Model 1
10.1.2. Linear Model 2
10.1.3. Linear Model 3 using Leaps
10.2. Transforming the variables to improve Significance
10.2.1. Linear Model 4
10.2.2. Linear Model 5
10.2.3. Residual Analysis
10.2.4. Linear Model 6
10.2.5. Box Cox Transformation
10.2.6. Linear Model 7
10.3. Comparison of Linear Models
10.4. Decision Trees
10.4.1. Decision Tree 1
10.4.2. Decision Tree 2
10.4.3. Decision Tree 3
10.4.4. Decision Tree 4
10.4.5. Decision Tree 5
10.4.6. Decision Tree 6
10.5. Model Performance of Decision Trees
10.6. Random Forest
10.6.1. Random Forest Model 1
10.6.2. Random Forest Model 2
10.7. Comparison of Random Forest Models
10.8. XG Boosting
10.8.1. XG Model
10.9. Performance of XG Boosting
11. Ensemble Techniques
11.1. Ensemble Model 1
11.2. Ensemble Model 2
11.3. Stacking
12. Performance comparison of all models
13. Conclusion
1. Introduction

The price of a house depends on various factors such as size or area, number of bedrooms, location, the prices of other houses in the vicinity, and many more. Real estate investors would like to find out the actual value of a house in order to buy and sell properties: they lose money when they pay more than the current market value or sell for less than it.

Banks are also interested in knowing the actual value of a house when taking it as collateral for a loan, since at times loan applicants overvalue their houses in order to maximize the loan amount. Local home buyers can also predict the price of a house to ascertain whether a seller is quoting more than the actual market value, and local sellers can likewise ascertain a fair price for their houses.

2. Problem Statement

A house's value is about more than location and square footage; just as with the qualities that make up a person, an educated buyer would want to know all the aspects that give a house its value.

For example, suppose you want to sell a house and do not know what selling price to quote. It cannot be too low or too high. To determine the actual market price, you compare the house with similar properties in the vicinity and, based on the gathered data, try to assess the market value of your house.

2.1. Objective
To analyze and predict the price of a house using the feature variables given in the dataset.

2.2. Data Description
Variable Description
cid ID variable for a house
dayhours Date the house was sold
price Price of the house
room_bed Number of bedrooms per house
room_bath Number of bathrooms per bedroom
living_measure Square footage of the home
lot_measure Square footage of the lot
ceil Total floors in the house
coast Whether the house has a waterfront view
sight Whether the property has been viewed
condition How good the overall condition is
quality Grade given to the housing unit
ceil_measure Area of the house apart from the basement
basement Area of the basement
yr_built Year the house was built
yr_renovated Year the house was renovated
zipcode Zip code
lat Latitude coordinate
long Longitude coordinate
living_measure15 Living area in 2015 (renovations may or may not have affected the lot size)
lot_measure15 Lot size in 2015 (after any renovation)
furnished Whether the home is furnished or not
total_area Sum of living and lot area (home + plot)

3. Data Report

3.1. Dimension of data


The dataset consists of 21613 rows and 23 columns.

3.2. Column Names


Following are the column names in the dataset

Some column names are ambiguous, so we will rename them for easier understanding.

3.3 Renaming column names

Variable Description
House_ID ID variable for a house
Date_of_Sale Date the house was sold
Price Price of the house
Bedrooms Number of bedrooms per house
Bath_Bed_Ratio Number of bathrooms per bedroom
Home_area Square footage of the home
Plot_area Square footage of the lot
Floors Total floors in the house
Water_facing Whether the house has a waterfront view
Viewed Whether the property has been viewed
Condition How good the overall condition is
Quality Grade given to the housing unit
Floor_area Area of the house apart from the basement
Basement Area of the basement
Yr_Built Year the house was built
Yr_renovated Year the house was renovated
Zipcode Zip code
lat Latitude coordinate
long Longitude coordinate
Home_area15 Living area in 2015 (renovations may or may not have affected the lot size)
Plot_area15 Lot size in 2015 (after any renovation)
Furnished Whether the home is furnished or not
Total_area Sum of living and lot area (home + plot)

3.4. Overview of data

4. Exploratory Data Analysis

4.1. Structure of the data

There are a few variables that should be treated as categorical but are stored in numerical format, and a few variables contain unwanted symbols (such as the $ sign). We will treat these variables in the coming analysis.

4.2. Summary of Data

We observe that the dataset consists of 21613 records with 23 columns, and from the summary we see that there are NA's in Bedrooms, Bath_Bed_Ratio, Home_area, Plot_area, Water_facing, Home_area15, Plot_area15 and Furnished.

A few variables, such as Floors, Water_facing, Condition and Total_area, also contain $ symbols in addition to missing entries.

The target variable is Price, which records the house prices. Bedrooms, Bath_Bed_Ratio, Home_area, Plot_area, Quality, Condition, Floor_area, Yr_built, Total_area, etc. are some important variables that tell us about the properties of the house. Zipcode gives us information about the number of properties present in each area.

The variables Bedrooms, Floor_area, Condition and Quality can be converted to categorical variables. Bath_Bed_Ratio and Floors have decimal values and can also be categorized, as can Quality.

Water_facing, Floors, Yr_built, Condition and Total_area contain $ symbols and need to be treated.

Date_of_sale is not in the correct format. We need to reformat it and use only the year value from it.

4.3. Missing Values

From the plot above, the variable Yr_renovated shows 0 percent missing values, but on closer inspection we observe that many entries in this variable are recorded as 0.

From the table, we see that 20699 values are recorded as 0. This amounts to 20699/21613, about 96% of the values effectively absent. Hence, removing this column will not affect our data.

Both Bedrooms and Bath_Bed_Ratio account for about 0.5% missing values.

Water_facing has about 0.26% missing values; this variable can also be categorized.

Home_area has 0.08% missing values.

Plot_area has 0.19% missing values.

Home_area15 accounts for 0.77% missing values.

Plot_area15 and Furnished each have 0.13% missing values.

5. Data Treatment

5.1. Creating New Variable


The variable Date_of_sale is not in the correct format. It gives us information about the date of sale of the house, which we will need for further analysis. Hence, we reformat this variable to extract the year value alone.

We are creating a new variable called Yr_sale which will contain this information.

Similarly, Yr_built has too many levels, so instead of using it directly we will create a new variable called Age_at_sale (to be used at a later stage). Both derivations are sketched below.
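A minimal sketch of these two derived variables in R, assuming the raw dates are strings beginning with the four-digit year (e.g. "20141013T000000"); the data frame name house is illustrative:

# Extract the sale year from the raw date string
house$Yr_sale <- as.numeric(substr(as.character(house$Date_of_sale), 1, 4))

# Age of the house at the time of sale
house$Age_at_sale <- house$Yr_sale - house$Yr_built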

5.2. Removing $ Symbols


The variables Floors, Water_facing, Condition, Yr_built, long and Total_area contain $ symbols. We treat these by replacing the affected entries with NA's, as sketched below.
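A sketch of this treatment, assuming the affected columns were read in as character vectors (the data frame name house is illustrative):

cols <- c("Floors", "Water_facing", "Condition", "Yr_built", "long", "Total_area")
for (col in cols) {
  x <- as.character(house[[col]])
  x[grepl("\\$", x)] <- NA                           # entries containing a $ become NA
  house[[col]] <- suppressWarnings(as.numeric(x))    # coerce the cleaned column back to numeric
}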

5.3. Conversion of Variables
We are converting a few variables to categorical. House_ID, Water_facing, Viewed, Zipcode and Furnished are all converted to categorical variables.

We recode the Water_facing and Furnished variables with the two values 'Yes' and 'No' for ease of understanding.

Similarly, the Viewed variable contains counts 1, 2, 3, 4, ...; we group these as 'Yes' if the property has been viewed at least once, and 'No' otherwise, as sketched below.
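A sketch of these conversions, assuming Water_facing and Furnished are coded 0/1 and Viewed holds view counts (names illustrative):

house$House_ID <- as.factor(house$House_ID)
house$Zipcode  <- as.factor(house$Zipcode)

# Recode the two binary indicators as Yes/No factors
house$Water_facing <- factor(ifelse(house$Water_facing == 1, "Yes", "No"))
house$Furnished    <- factor(ifelse(house$Furnished == 1, "Yes", "No"))

# Viewed: Yes if viewed at least once, otherwise No
house$Viewed <- factor(ifelse(house$Viewed >= 1, "Yes", "No"))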

5.4. Treating Missing Values and Checking Correlation


There are a few NA's in Yr_built (15 NA's), and we remove those values.

We will remove the remaining missing values in a two-step process: first, we group all the numerical variables and treat their missing values; next, we combine the categorical variables with the treated numerical variables and remove any missing values still present.

Before treating the missing values, we plot a correlation matrix to check whether any variables have a correlation value of 1.

From the correlation plot we see that Plot_area and Total_area have a correlation value of 1, meaning both carry the same information, so we must choose one to remove. We remove Total_area, as it has 68 NA's (missing values).

Having removed the Total_area variable, we now check for missing values again.

We see that Bedrooms, Bath_Bed_Ratio, Home_area, Plot_area, Floors, Condition, long, Home_area15 and Plot_area15 have missing values.

We now proceed to treat the missing values using MICE, taking the mean as the method.

The missing values of numerical type have now been treated.

Checking for missing values across the entire dataset, after combining the remaining variables, we see that Water_facing and Furnished still have missing values. These are treated using MICE as well, this time with the logistic regression method, since both variables are categorical. Both passes are sketched below.
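A minimal sketch of the two MICE passes described above, using the mice package's "mean" method for the numeric columns and "logreg" for the binary factors (object names illustrative):

library(mice)

# Step 1: mean imputation for the numeric columns
num_cols <- sapply(house, is.numeric)
imp_num  <- mice(house[, num_cols], method = "mean", m = 1, maxit = 1, seed = 123)
house[, num_cols] <- complete(imp_num)

# Step 2: logistic-regression imputation for the remaining binary factors
imp_cat <- mice(house, method = "logreg", m = 1, seed = 123)
house   <- complete(imp_cat)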

After treating the entire dataset

We have removed all NA’s from our dataset after treating them.

5.5 Correlation Plot

From the correlation plot it can be observed that there is high correlation between Home_area and Floor_area, Home_area and Quality, Home_area and Bath_Bed_Ratio, and Home_area and Price.

Similarly, there is high correlation between Quality and Floor_area, Quality and Bath_Bed_Ratio, and Quality and Price.

6. Univariate Analysis

6.1. Bar Plot

From the bar plots we observe the following for these variables:

Water_facing – most houses do not have a water-facing view
Viewed – many properties have not been viewed
Furnished – many properties are not furnished

6.2. Density Plot

We observe that the distribution is not uniform for a few variables. We will explore these variables.

6.2.1 Bath_Bed_Ratio

From the plot we see that the distribution of Bath_Bed_Ratio is not uniform. We can convert this variable to a factor, as its range is 0 to 8 (and it includes some decimal values). We will categorize it as Low, Medium and High, as sketched below.
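A sketch of this bucketing with cut(); the report does not state the cutpoints, so the breaks below are illustrative assumptions:

house$Bath_Bed_Ratio <- cut(house$Bath_Bed_Ratio,
                            breaks = c(-Inf, 1, 2.5, Inf),   # assumed cutpoints
                            labels = c("Low", "Medium", "High"))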

6.2.2 Bedrooms

As with Bath_Bed_Ratio, the plot shows that the Bedrooms distribution is uneven and can be categorized. Some houses have more than 10 bedrooms. We will categorize them as "0-4, 4-6 and Above 7".

6.2.3 Condition

The Condition variable indicates how good the overall condition of the house is. From the plot we see that some values are decimals; we will remove these by converting the values to integers.

6.2.4 Floors

From the distribution we see a few decimal values; we will remove them by converting the values to integers.

6.2.5 Quality

From the distribution we can also categorize the Quality variable; its values range from 0 to 13. We will categorize this variable as well.

After conversion of all the required variables, an overall summary can be observed.

6.3 Bar Plot after conversion

6.4 Zip Code Analysis

Top 10 zip codes with the greatest number of properties available

Distribution of Properties across Zip Codes

We see that zip code 98103 has the highest number of properties and zip code 98039 has the fewest.

6.4.1 Properties in Zip code according to price

From the plot we can see that zip code 98039 has the highest-priced properties, which also have larger home areas.

Zip code 98168 has properties with low prices and small home sizes.

6.4.2 Properties in Zip code according to Furnished

We see from the plot that zip code 98006 has the highest number of furnished houses, but the prices of those properties are quite low (which may imply that the furnishing is not very good).

Zip code 98039 has fewer furnished properties, but the prices of its houses are high.

6.4.3 Properties viewed per zip code
We can see that the properties in zip code 98006 have been viewed many times, and the prices in that area are cheaper.

The properties in area 98040, which have higher prices, have also been viewed many times, while the properties in area 98039, which have the highest prices, have been viewed fewer times.

Properties in areas 98002, 98031 and 98077 have been viewed the least, meaning that properties in these areas have rarely been visited.

Properties in area 98148 have not been viewed at all.

7. Bivariate Analysis

7.1 Box Plot


7.1.1 Age of house and Quality

7.1.2 Price and Quality

From this plot we see that as the quality increases, the price of the house also increases. It can be observed that the price is highest for houses whose quality rating is more than 3.

7.1.3 Price and Condition

From the plot we see that prices are high for houses whose condition rating is greater than 3.

7.1.4 Price and Furnished

We see that the prices of furnished houses are higher than those of unfurnished houses.

7.1.5 Price and Water Facing

8. Final Dataset

After all the analysis and conversion of the data, we finalize our dataset with the following variables:

Before we proceed to use this dataset to build our model, we exclude the lat and long variables, as their values are not very diverse (the decimal differences are not significant). Hence, we exclude them from our dataset.

As observed from the summary, there are no NA's.

8.1 Correlation Check
From our previous analysis we saw that there was some correlation between the variables. Let us now check for correlation in our final dataset.

From the correlation plot we observe the following:

There is a high correlation between Floor_area and Home_area (0.88). Similarly, we see correlation between Home_area and Price, Home_area15 and Home_area, Home_area15 and Floor_area, and Plot_area15 and Plot_area.

8.2 Outliers Check
We will first check for outliers in the complete dataset.

From the plot we can observe that a few variables have outliers that need to be treated.

Before the treatment, let us analyze them at the individual level, exploring the outliers with respect to price bucket.

We observe that, except for age at sale, there are more outliers in the cheap price categories, followed by the affordable categories. Let us treat them before moving forward.

8.3. Treating Outliers
Before we treat the outliers, a cautious approach must be followed: they should not be removed without proper analysis, as the outliers in some variables may carry real meaning. We treat the outliers by winsorization, substituting the outliers on the high side with the 95th percentile and the outliers on the low side with the 5th percentile, as sketched below.
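A minimal base-R sketch of this winsorization (the helper name and the choice to treat all numeric columns are illustrative):

winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[x < q[1]] <- q[1]   # floor low outliers at the 5th percentile
  x[x > q[2]] <- q[2]   # cap high outliers at the 95th percentile
  x
}

num_cols <- sapply(house, is.numeric)
house[num_cols] <- lapply(house[num_cols], winsorize)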

Plotting the chart after treating the outliers, we see that most of the outliers have been removed.

9. Splitting the Dataset into train and test

Data is split in the ratio of 70:30.
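A sketch of this 70:30 split (the seed and object names are illustrative):

set.seed(123)   # assumed seed, for reproducibility
idx   <- sample(seq_len(nrow(house)), size = floor(0.7 * nrow(house)))
train <- house[idx, ]    # 70% of the rows
test  <- house[-idx, ]   # remaining 30%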

Dimension of Dataset

Train Set dimension

Test Set dimension

10. Model Building
10.1. Linear Regression Model
Linear regression is used to predict a quantitative outcome variable (y) based on one or multiple predictor variables (x).

The goal is to build a mathematical formula that defines y as a function of the x variables. Once we have built a statistically significant model, it can be used to predict future outcomes based on new x values.

There are two important metrics we use to measure the performance of a regression model:

1) Root Mean Squared Error (RMSE) – measures the model prediction error. It corresponds to the average difference between the observed values and the values predicted by the model. The lower the RMSE, the better the model.
2) R-Square (R2) – represents the squared correlation between the observed outcome values and the values predicted by the model. The higher the R2, the better the model.
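For reference, writing y_i for the observed prices, \hat{y}_i for the model predictions and \bar{y} for the mean observed price, the two metrics are

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

For ordinary least squares with an intercept, this R^2 coincides with the squared correlation between observed and predicted values.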

Also, a few other things to note:

The sum of the squares of the residual errors is called the Residual Sum of Squares (RSS). The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE).

10.1.1 Linear Model 1
We build our first model using all the variables, as sketched below. After running the model, we get the following output:
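A sketch of this first fit (object names illustrative):

model1 <- lm(Price ~ ., data = train)   # regress Price on all remaining variables
summary(model1)                         # coefficients, p-values, R-squared, F-statistic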

From the above output we get the following information:

Residuals: provide a quick view of the distribution of the residuals.

Estimate: gives the intercept and the beta coefficient estimate associated with each predictor variable.

Std. Error: the standard error of the coefficient estimate. The larger the standard error, the less confident we are about the estimate.

t value: the coefficient estimate divided by the standard error of the estimate.

Pr(>|t|): the p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate.

Residual Standard Error (RSE), R-Squared (R2) and the F-statistic are metrics used to check how well the model fits our data.

From the model we have built, we see that most of the variables are significant except Floor_area, Basement, Plot_area and Quality; Condition appears only slightly significant.

According to this model's performance, we get an R-square value of 73.17% and an adjusted R-square of 73.12%.

A key purpose of linear regression is interpretation: the coefficients from these models are critical for interpreting the variables.

Checking for multicollinearity

We can check for multicollinearity by computing the VIF (see the sketch below). Following are the observations:

The VIF values for Home_area and Floor_area are high, and Basement also has a higher VIF than the other variables.

As we already saw from the correlation plot, Home_area and Floor_area have a correlation of 0.88. We will remove one of these and build our model again.
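A sketch of the VIF check referred to above, using the car package (a common rule of thumb flags values well above about 5):

library(car)
vif(model1)   # large values indicate multicollinearity among the predictors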

10.1.2 Linear Model 2
We now build our model again after removing Floor_area. We see that the R-square value barely changes, while the problem of multicollinearity is addressed.

From this model we see that most of the variables are significant except Basement and Plot_area; Quality and Condition are barely significant.

According to this model's performance, the R-square value is 73.16% and the adjusted R-square is 73.12%, essentially the same as the previous model. Let us now check whether the multicollinearity has been reduced.
Checking for Multicollinearity

We observe from the VIF output that the multicollinearity has been reduced.

We will now try to improve the R-square value. For this we use feature selection to determine which variables to keep and which to remove before fitting the model. We use the leaps package, which builds regression subsets from combinations of the variables and returns the best ones.

Running leaps gives us the important variables. Now let us build our model using those variables, as sketched below.
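A sketch of this feature selection with leaps::regsubsets; the forward search, the subset-size cap and the adjusted-R-square criterion are illustrative choices:

library(leaps)

subsets <- regsubsets(Price ~ ., data = train, nvmax = 15, method = "forward")
subsum  <- summary(subsets)
best    <- which.max(subsum$adjr2)   # subset size with the highest adjusted R-squared
coef(subsets, best)                  # variables retained in the chosen subset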

10.1.3 Linear Model 3 using Leaps
Let us now observe how many variables are selected by the leaps function. Leaps reduces the number of variables, which facilitates understanding and better interpretation: a simpler model is easier to understand, and the more complex the model, the greater the chance of overfitting.

Building the model with the important variables obtained from the feature selection via the leaps regression subsets gives the following result:

According to this model's performance, the R-square value is 71.49% and the adjusted R-square is 71.47%.

We still observe that the Quality variable is not significant, and with this method the performance of the model is not very high. So let us build the model with some other technique.

Checking for multicollinearity

Model 3's performance is not as good as that of models 1 and 2.

Also, from models 1 and 2 we observed that a few variables were not significant, so we will improve the significance of those variables by transforming them.

Basement was not significant; we will try to increase its significance by dividing it into buckets.

Similarly, Quality was not significant. Quality of 10 and above matters more, as more people tend to purchase properties with a higher rating.

We will also try to improve Condition, as that variable was not significant either.

(Recall that R-square, the coefficient of determination, is the share of variance explained by the predictor variables.)

10.2 Transforming the variables to improve Significance

Basement is transformed into three buckets: Minimal, Adequate and Large basement areas.

If Quality is 10 or above we label the property Premium; otherwise Standard.

Similarly for Condition: a rating of 5 seems to matter most, so if a house is rated 5 we label it Yes, otherwise No. (A sketch of these transformations follows below.)

After the variable transformation, we again divide our dataset into train and test sets in a 70:30 ratio.
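A sketch of the three transformations; the Basement cutpoints are illustrative assumptions, while the Quality and Condition rules follow the text above:

# Basement buckets -- cutpoints assumed for illustration
house$Basement <- cut(house$Basement,
                      breaks = c(-Inf, 500, 1500, Inf),
                      labels = c("Minimal", "Adequate", "Large"))

# Quality: Premium for grades of 10 and above, otherwise Standard
house$Quality <- factor(ifelse(house$Quality >= 10, "Premium", "Standard"))

# Condition: Yes only for the top rating of 5
house$Condition <- factor(ifelse(house$Condition == 5, "Yes", "No"))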

Original Dataset

Train Dataset

Test Dataset

Now we use this new train and test dataset with the transformed variables to build our model.

10.2.1 Linear Model 4
We see that the R-square values are not much affected and that most of the variables are significant except Plot_area.

After running the model with the transformed data, we get an R-square of 72.73% and an adjusted R-square of 72.7%.

Checking for multicollinearity

10.2.2 Linear Model 5
From the previous model we observed that Plot_area was not significant, so we remove that variable and see how the model performs. We also remove the Floor_area variable, as it had high collinearity.

We see that all the variables are now significant, and this model gives us a better interpretation. Still, we can try to improve the R-square value.

Checking for multicollinearity

10.2.3 Residual Analysis


We will try to improve the model's performance by examining its residuals.

Residuals – the error: the difference between the actual and the predicted value (by how much the prediction missed the actual value).

Std. residuals – the standardized representation of the residuals (converted to z-scores).

Fitted values – the predicted values.

Leverage – an outlier measure (related to Cook's distance). A linear regression model has two sides: the dependent side, which is the target variable, and the independent side, the predictor variables. Both sides may contain outliers (extreme values).

There are many independent variables, and the leverage column condenses their combined influence into one value per observation. Using Cook's distance (with a leverage cutoff of 2), an observation whose leverage measure exceeds 2 is highly influential; removing such observations can yield a marginal improvement.

We flag these observations by setting a threshold on the leverage values (taking the 95th percentile here) and marking them in a new column as High if the leverage exceeds the threshold and Ok if it lies within the normal range, as sketched below.
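A sketch of this flagging step (model and column names illustrative):

lev    <- hatvalues(model5)     # leverage of each observation
thresh <- quantile(lev, 0.95)   # the 95th-percentile threshold from the text

train$lev_flag <- ifelse(lev > thresh, "High", "Ok")
train_ok <- train[train$lev_flag == "Ok", ]   # observations kept for the refit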

From the plot we see that almost all the values fall in place, with a few points where the residuals are high. There is no sign of strong heteroscedasticity, and the residuals look approximately normally distributed.

10.2.4 Linear Model 6
Now let us build our linear regression model excluding the two variables Plot_area and Floor_area, and excluding the observations where the residuals are high. We get the following results:

The R-square value is still similar to that of the previous models. We will try one more transformation technique and then see how our model performs.

10.2.5. Box Cox Transformation
A Box-Cox transformation is a way to transform a non-normal dependent variable into a normal shape; normality is an important assumption for many statistical techniques.

Here the transformation is applied to the dependent variable Price. We take the lambda value at which the log-likelihood is maximized, as sketched below.
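A sketch of this step with MASS::boxcox (the lambda grid is an illustrative choice):

library(MASS)

bc     <- boxcox(Price ~ ., data = train, lambda = seq(-0.5, 0.5, 0.01))
lambda <- bc$x[which.max(bc$y)]   # lambda that maximizes the log-likelihood

# Box-Cox transform of the target
train$Price <- (train$Price^lambda - 1) / lambda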

Now proceeding to build our model

10.2.6 Linear Model 7

The R-squared value dropped when we ran the model with the Box-Cox-transformed target. This is because the model is now evaluated against the transformed variable: it explains the variance of the transformed Price rather than the original one.

So we must back-transform the predictions to evaluate the model on the original scale. To transform the predictions back to the original scale and calculate RMSE and R-square, we apply the inverse Box-Cox transformation

y = (lambda * y' + 1)^(1 / lambda)

where y' is a transformed prediction and lambda is the chosen Box-Cox parameter.

Performance of Model 7 (RMSE and R2):

We see that the performance of this model is better than all the models run so far: it gives a lower RMSE and a higher R-square value. Hence, we proceed with this model.

10.3 Comparison of Linear Models

Train

Test

Plotting the comparisons

Looking at the coefficients of Model 7

Relative importance of coefficients

These are the variables of importance in the linear regression model.

Among the linear regression models, Model 7 performs the best, and hence we select it.

10.4 Decision Trees

The decision tree method is a powerful and popular predictive machine learning technique used for both classification and regression, hence the name Classification and Regression Trees (CART).

The R implementation of decision trees is called rpart (Recursive Partitioning and Regression Trees). The algorithm works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible.

The decision rules generated by the CART predictive model are generally visualized as a binary tree.

We will first build the models using the dataset without any transformed variables, and then build them again using the transformed dataset.

(Decision Trees 1, 2 and 3 – without the transformed dataset; Decision Trees 4, 5 and 6 – with the transformed dataset.)

10.4.1 Decision Tree 1


Let us build our initial decision tree, keeping the CP value at 0, and examine the model results, as sketched below.

(Detailed output results are in Decision_Tree1_output.csv.)
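A sketch of this unpruned tree with rpart's anova (regression) method (object names illustrative):

library(rpart)

dt1 <- rpart(Price ~ ., data = train, method = "anova",
             control = rpart.control(cp = 0))   # cp = 0 grows the full tree
printcp(dt1)   # complexity table, used later to choose the best cp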


10.4.2 Decision Tree 2
The decision tree will now be built by taking the best CP value, as sketched below. Building the tree with this CP gives the results below.

(Detailed output results are in Decision_Tree2_output.csv.)
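A sketch of picking the best cp from the cross-validated error and pruning, continuing from the dt1 sketch above:

best_cp <- dt1$cptable[which.min(dt1$cptable[, "xerror"]), "CP"]
dt2     <- prune(dt1, cp = best_cp)   # tree rebuilt at the best complexity parameter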

10.4.3 Decision Tree 3
We will now build our decision tree with the help of tuneLength. We will also build a decision tree based on a manual grid and then compare the results; both variants are sketched below.

Based on tuneLength

Using Manual Grid

We build the model using the grid instead of tuneLength to observe the results.
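A sketch of both caret variants (the cross-validation folds, tuneLength and grid values are illustrative):

library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Variant 1: let caret pick the cp candidates (tuneLength)
dt3_len <- train(Price ~ ., data = train, method = "rpart",
                 trControl = ctrl, tuneLength = 10)

# Variant 2: supply a manual cp grid
dt3_grid <- train(Price ~ ., data = train, method = "rpart",
                  trControl = ctrl,
                  tuneGrid = expand.grid(cp = seq(0.0001, 0.002, by = 0.0002)))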

10.4.4 Decision Tree 4
Now let us build the decision tree model with the transformed dataset and see the results.

First, we build the tree with CP = 0 on the new dataset.

(Full results are in Decision_Tree4_output.csv.)

10.4.5 Decision Tree 5
Now we build our model with the best cost-complexity parameter on the transformed dataset. From our previous model we found the best CP value to be 0.00028.

(The entire result of the decision tree is attached in a CSV.)

10.4.6 Decision Tree 6
We will again build our decision tree with the help of tuneLength, and also build one based on a manual grid, and then compare the results.

Based on tuneLength

Using Manual Grid

10.5 Model Performance of Decision Trees

Train

Test

We see that the DT2 model performs best on both the train and test datasets, giving a good R-square value and a lower RMSE than the other decision tree models.

Important Variables

10.6 Random Forest

Random forest randomly selects observations/rows and specific features to build multiple decision trees, and then averages the results across all trees.

The prediction error is measured by the RMSE, which corresponds to the average difference between the observed values of the outcome and the values predicted by the model.

10.6.1 Random Forest Model 1


We first build our random forest by tuning mtry with a starting point of 2 (mtryStart). Starting with a small mtry helps keep the individual trees different from one another; otherwise the trees risk being correlated. We also supply a step factor and the number of trees to try, and set doBest to TRUE, as sketched below. We get the following results.
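A sketch of this tuning run with randomForest::tuneRF; the step factor, tree count and improvement threshold are illustrative values:

library(randomForest)

X <- train[, setdiff(names(train), "Price")]
y <- train$Price

rf1 <- tuneRF(X, y,
              mtryStart  = 2,      # starting number of variables tried per split
              stepFactor = 1.5,    # multiply mtry by this factor at each step
              ntreeTry   = 201,    # trees grown for each candidate mtry
              improve    = 0.01,   # minimum relative OOB improvement to continue
              doBest     = TRUE)   # return the forest fitted at the best mtry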

Variable Importance

10.6.2 Random Forest Model 2
We will now tune our model further and see how the results look.

As seen from the results, the model with an mtry of 9 gives the best performance: its RMSE is the lowest among all the results and its R-square value is also high.

10.7 Comparison of Random Forest Models

Train

Test

From the random forest model comparison, we observe that RF2 performs better on both train and test: on the train dataset it gives an R-square of 96.17% and an RMSE of 55513.86, and on the test dataset an R-square of 79.52% and an RMSE of 124024.

10.8 XG Boosting

XGBoost works only on numerical values, while our dataset consists of both numerical and categorical values. We therefore converted the categorical variables into dummy variables with the help of the fastDummies library; the process of creating this set of indicator variables is otherwise known as building a sparse matrix.

Once we have created these columns in the dataset, we use them to build our models. The dummy step is sketched below.
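A sketch of the dummy-variable step with fastDummies::dummy_cols (the argument choices are illustrative):

library(fastDummies)

house_num <- dummy_cols(house,
                        remove_first_dummy      = TRUE,   # avoid the dummy-variable trap
                        remove_selected_columns = TRUE)   # drop the original factor columns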

10.8.1 XG Model
Now let us proceed to build our model, as sketched below; we get the following results.
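A sketch of the boosting fit itself; the hyperparameters are illustrative, and train_num stands for the dummy-encoded training set from the step above:

library(xgboost)

X_train <- as.matrix(train_num[, setdiff(names(train_num), "Price")])

xgb1 <- xgboost(data      = X_train,
                label     = train_num$Price,
                nrounds   = 100,                 # boosting iterations
                max_depth = 6,
                eta       = 0.1,                 # learning rate
                objective = "reg:squarederror",  # regression objective
                verbose   = 0)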

10.9 Performance of XG Boosting

Train

Test

11. Ensemble Techniques

11.1. Ensemble Model 1


For this model we compile all the predictions of the regression models (both train and test) and the decision trees (both train and test).

We then take the row means, i.e. the average of the predictions of the various models that have been used, as sketched below.
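A sketch of this averaging on the test set; the prediction vectors are illustrative, with the regression predictions assumed already back-transformed to the price scale:

pred_lm <- predict(model7, newdata = test)   # regression model predictions
pred_dt <- predict(dt2,    newdata = test)   # decision tree predictions

ens1_test <- rowMeans(cbind(pred_lm, pred_dt))   # simple average of the models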

Train

Test

11.2 Ensemble Model 2


For this model we compile the predictions of the random forest and XGBoost models and then take the row means. We observe that this model gives the following results:

Train

Test

11.3 Stacking
In stacking we do not take the average of the models; instead, we take the predictions of all the models (regression model, decision tree, random forest) and train a random forest on them.

Here we have the price values predicted by the regression model, by the decision tree and by the random forest. We treat these three as predictor variables and the actual Price as the dependent variable, as sketched below. Running this model, we get the results shown under Performance.
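A sketch of this stacking step; the base-model prediction vectors on the training data are illustrative names:

library(randomForest)

stack_train <- data.frame(p_lm  = pred_lm_train,   # regression predictions
                          p_dt  = pred_dt_train,   # decision tree predictions
                          p_rf  = pred_rf_train,   # random forest predictions
                          Price = train$Price)

stacker <- randomForest(Price ~ ., data = stack_train)   # meta-model over base predictions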

Performance
Train

Test

12. Performance comparison of all models

Comparing the performance of all the models, we have:

12.1 Train

12.2 Test

12.3 Comparison Plot

13. Conclusion

Since we are dealing with a regression problem, we focus mainly on two metrics: R-square and RMSE. For an ideal model, the R-square value should be high and the RMSE kept to a minimum.

Of the models built, Ensemble Model 2 performs the best: it gives an R-square of 91.37% and an RMSE of 83401.12 on the train dataset, and an R-square of 79.62% and an RMSE of 123729.6 on the test dataset.

Stacking also performs well, giving an R-square of 92.53% and an RMSE of 77571.36 on the train dataset, and an R-square of 77.54% and an RMSE of 129871.3 on the test dataset.
