Stats Project Module 1 2
Group Members: Aishwarya Borra, Allison Lee, Olivia Moody, Yalie Rubin
1) Describe your data set (including the source). Why are you interested in it? What do you
hope to learn? Before exploring your data set, state some hypotheses (guesses) about how
the variables should be related, perhaps based on your knowledge and experience. Be sure
to identify the response variable and the predictor variables.
Our data set, sourced from Kaggle, includes information on Manhattan Property Sales
from 2003 to October 2018. As young and “broke” college students who are in need of
apartments, we would like to know how much a building’s age, square footage, and number of
units (both residential and commercial) (our three predictor variables) actually impact the price
of a building (our response variable), which would correlate to the price of rent/leases. Of course,
this is historical data, but we believe the pattern is still fun to see.
We hypothesize that the square footage and number of units would have a positive
correlation with the sale price, and therefore have a substantial impact on rent. After all, it costs
more to build a bigger building, and more units can be sold to more people, so a building with more
units should be worth more and sell for a higher price.
However, we doubt that the building age will affect the sale price significantly, especially
since we are in NYC and the neighborhood seems to matter more than the building’s age (e.g., an
old building in Soho may be pricier than a nice building in the Financial District). So, we believe
that there will be a negative or very weak correlation between these two variables.
Source of Data:
https://www.kaggle.com/datasets/s0myaj/nyc-annualized-sales-data?resource=download&select=
sales_manhattan_06.xls
2) Make a scatterplot of your response variable (on the Y-axis) versus one of the predictor
variables (on the X-axis). Describe the pattern you see. Is this pattern consistent with what
you expected? Note any apparent outliers in the plot. Can you propose a "cause" for these
outliers? Repeat the entire procedure for the other predictor variables.
Sale Price vs. Total Units:
There is an apparent outlier with 8,800 units and a sale price of $4,040,527,000. That
sounds unrealistic, but after looking up the address of the building, it became clear that it is a
mega apartment complex in StuyTown, where rent for a studio is $4,250+. Expensive.
There is another large outlier with 2,498 units and a sale price of $1,347,433,250. It is also in
StuyTown, with tons of apartment units and rather expensive rent ($6,000+ for 2 beds and 1
bath).
After removing these outliers and a few others (see below), we changed the y-scale to make
the graph easier to read. From there, it looks like there is no clear pattern between the total units
in a building and its sale price, since many buildings have hundreds more units than
others but similar sale prices. The correlation was positive but weak.
Rows excluded as outliers: 750, 751, 778, 860, 881, 890 (row numbers from Minitab)
Rows 750 and 751 are the outliers addressed above.
● Row 881’s building has 102 units and a sale price of $1,520,000,000. This building is on the
Avenue of the Americas and is home to News Corporation, where lots of shows are recorded,
which explains the hefty price tag despite the low number of units.
● Row 778’s building has 1,684 units and a sale price of $472,500. Row 860’s building has
1,417 units and a sale price of $579,000. The sale prices are low because these rows
record the sale of a single unit rather than the whole building. This is an error in the
dataset. We removed these rows from all of the graphs that follow.
● Row 890’s building has 1 unit and a sale price of $1,246,450,000. It’s an office building
in the heart of Midtown and has significant Corporate Row presence, thus the large price
tag despite only having a single unit.
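For anyone repeating this step outside Minitab, the exclusion could be sketched as below. This is only an illustration: the row numbers are the ones listed above, but `records` is a hypothetical stand-in for the data set (900 placeholder rows, not the real row count), and `drop_rows` is a helper name we made up.

```python
# Minitab row numbers excluded in this report (1-based, as listed above).
EXCLUDED_ROWS = {750, 751, 778, 860, 881, 890}


def drop_rows(records, excluded_rows):
    """Return the records whose 1-based position is not in excluded_rows."""
    return [
        record
        for row_number, record in enumerate(records, start=1)
        if row_number not in excluded_rows
    ]


# Hypothetical stand-in for the data set: 900 placeholder records.
records = [{"row": i} for i in range(1, 901)]
kept = drop_rows(records, EXCLUDED_ROWS)
print(len(records) - len(kept), "rows removed")
```

Keeping the excluded row numbers in one place makes it easy to rerun every graph with and without the same points.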
Sale Price vs. Gross Square Feet:
The first graph shows the same large outlier with tons of apartment units, so we removed it and
address the other outliers here. First, there is one building with 2,438,059 gross square
feet but an only moderately higher sale price of $186,388,000. Upon researching more, we found that
the building is an office building, which may explain the high gross square footage but lower sale
price (office buildings probably require lower maintenance costs than apartment buildings).
There is another outlier with 1,172,021 gross square feet and a sale price of
$834,518,000. The likely explanation is that the building is a luxury hotel located near Times
Square, so the real estate costs more despite the smaller square footage, and the hotel can recoup its
investment with high prices, justified by the high tourist demand.
After removing those outliers, we got to the third graph, which shows a moderate positive
correlation, both visually and by calculation (r = 0.635; see graph 4), meaning that higher gross
square footage tends to go with a higher sale price, although the relationship is far from
perfect.
Sale Price vs. Building Age:
There are a few outliers: for example, one building is 218 years old, though it does have a
low sale price, as expected. A more significant outlier is the building developed 73 years ago
with a sale price of $4,040,527,000 (the same major outlier as before). We removed it to be able
to see a pattern more easily (graph 2) and found the correlation (graph 3). Building age appears
to have very little linear relationship with sale price. The correlation is r = -0.173, which is weak
and negative, so older buildings often sell for about the same price as recently constructed ones.
We believe neighborhood and location, as well as square footage, matter more than building age.
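The correlations quoted in this section came from Minitab. As a rough sketch of what that number measures, the sample correlation could be computed by hand as below. The toy ages and prices are hypothetical, not rows from the Kaggle file, and `pearson_r` is our own helper name. Squaring r gives the share of variation explained by a straight line: about 0.40 for r = 0.635 and about 0.03 for r = -0.173.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values, NOT taken from the data set.
building_age = [10, 35, 60, 85, 110, 135]
sale_price = [5.2e6, 3.1e6, 6.0e6, 2.8e6, 4.9e6, 3.3e6]
toy_r = pearson_r(building_age, sale_price)
print("toy r:", round(toy_r, 3))

# r-squared for the two correlations reported in this section.
for reported_r in (0.635, -0.173):
    print(reported_r, round(reported_r ** 2, 3))
```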
3) Can you think of any other variables (not in your data set) that might be useful in
predicting Y? Try to list a few possibilities.
For building prices in New York City, a variable such as NYC Zoning Laws could affect
the prices of housing. Since the Zoning Laws divide the land into districts where similar rules are
in effect, and these rules are the result of issues ranging from climate change to walkability, such
zoning could potentially play a role in predicting the housing prices.
Another variable that could predict the sale price could potentially be the nature of the
building itself. Rather than the age, the idea that the building may be renovated, or even contain
amenities for the residents that drive up building maintenance, could potentially correlate to a
higher building sale price.
Another variable that could affect sale price is rent stabilization. If the
building is in a rent stabilized area, or the rent is not subject to increase for residents in the area,
a building owner who is trying to sell may have to accept a lower sale price.
A lower price could help ensure that the building actually sells, given that people
may be reluctant to buy buildings in rent stabilized areas for fear of making less money from
tenants.
Another possible variable that could explain the building prices could be the time the
building has spent on the market. Perhaps buildings that have been on the market for a longer
amount of time may correlate with a lower sale price, as the building owner may settle for a
lower price out of eagerness to sell the building.
4) For each variable, obtain Minitab's Graphical Summary. To do this, use Stat →Basic
Statistics → Graphical Summary. Enter the variables of interest in the dialog box. For each
variable, the graph gives, first, a histogram with a "normal curve" superimposed. The
graph also gives a boxplot (on its side, corresponding to the X-axis of the histogram) as well
as other numerical and graphical information. Note any points which are outliers (or at
least the two or three most extreme ones), according to the boxplots. Do these correspond to
outliers you found in the scatterplots?
Building Age: According to the box plot, all the data points with a building age of less than
55 years are outliers. There are also outliers at ages
158 and 218. These outliers partially correspond to the outliers found in the scatterplots, as the
prices found at around building age 55 on the scatterplot seem to be outliers. The building
prices at ages 158 and 218 on the scatterplot are also visibly outliers.
Gross Square Feet: For the gross square feet variable, the middle 50% of the values lie
between about 4,340 sq ft and 23,700 sq ft. The box plot is heavily skewed to the right, and the major
outliers are at 8,942,176 sq ft and 3,122,165 sq ft. The outliers on the scatterplot
support this conclusion from the box plot, as the points at 8,942,176 sq ft and 3,122,165 sq
ft on the scatterplot are very far from the majority of the data.
Total Units: The box plot for the total units variable is also heavily skewed to the
right. The middle 50% of the data lies between 2 and 20 units. The values that are
extremely far from this middle 50%, and thus look like outliers, are 8,800 units and
2,498 units. In the scatterplots, which also showed a very right-skewed distribution of
points, the points very far away from the majority of the data are the
sales at 8,800 units and 2,498 units, supporting what was found in the box plot.
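Boxplots conventionally flag a point as an outlier when it lies more than 1.5 times the IQR beyond the quartiles. A minimal sketch of that rule is below, assuming the unit counts are hypothetical apart from the two extreme values named above, and noting that the quartile convention in Python's `statistics.quantiles` may differ slightly from the one Minitab uses.

```python
import statistics


def iqr_fences(values):
    """Return the (lower, upper) fences of the 1.5 * IQR boxplot rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def flag_outliers(values):
    """Return the values that fall outside the boxplot fences."""
    low, high = iqr_fences(values)
    return [v for v in values if v < low or v > high]


# Hypothetical unit counts; only 2498 and 8800 echo values from the report.
total_units = [2, 3, 4, 5, 6, 7, 8, 10, 12, 20, 2498, 8800]
print(iqr_fences(total_units))
print(flag_outliers(total_units))
```

Because the fences depend only on the quartiles, a handful of enormous buildings does not move them, which is why those buildings stand out so clearly.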
5) Often, the variability of a quantity depends on its size. For example, the variation in the
incomes of the top 10% of earners is much greater than in the bottom 10% of earners. If
one of your variables suffers from this size-dependent variability:
(A) The histogram will show a right-skewed distribution,
(B) The mean will be larger than the median,
(C) The boxplots will show that the median line is towards the low side (in this case, left
side) of the box.
(D) The boxplot will show more outliers on the high side than on the low side.
Building Age:
Building age is skew-left with a mean (103.33) that is smaller than the median (114). The median
line is towards the high side of the boxplot and there are more outliers on the low side compared
to the high side. Therefore the Building Age predictor variable does not have the problem of
size-dependent variability.
Total Units:
Total Units is skew-right with a mean (30.83) that is larger than the median (7). The IQR box
was completely obscured by the sheer number of upper outliers, so we used the five-number
summary to create a close-up of the box plot in Excel. The median line is towards the low side of
the boxplot and there are far more outliers on the high side. Therefore the Total Units predictor
variable does have the size-dependent variability problem.
Sales Price:
Sales Price is skew-right with a mean (18,842,320) that is larger than the median (3,475,000).
The IQR box was completely obscured by the sheer number of upper outliers, so we used the
five-number summary to create a close-up of the box plot in Excel. The upper whisker is cut off
in this close-up because the maximum is so large that showing it would have obscured the IQR box,
so the whisker does not actually end at 18,000,000 (we indicated this with a small arrow). The
median line is towards the low side of the boxplot and there are far more outliers on the high side.
Therefore the Sales Price response variable does have the size-dependent variability problem.
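The four criteria can also be checked numerically. The sketch below is one possible way to do it, not Minitab's procedure: the skewness is a simple moment-based version (Minitab's formula differs slightly), the quartiles come from Python's `statistics.quantiles`, and the unit counts are hypothetical. The only real figures are the means and medians quoted above, which are used for criterion (B).

```python
import statistics


def size_dependence_checks(values):
    """Evaluate criteria (A)-(D) for one variable; True means the criterion is met."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical, so skewness is undefined")
    skewness = sum(((v - mean) / sd) ** 3 for v in values) / n
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    high_outliers = sum(v > q3 + 1.5 * iqr for v in values)
    low_outliers = sum(v < q1 - 1.5 * iqr for v in values)
    return {
        "A_right_skewed": skewness > 0,
        "B_mean_above_median": mean > median,
        "C_median_low_in_box": (median - q1) < (q3 - median),
        "D_more_high_outliers": high_outliers > low_outliers,
    }


# Hypothetical unit counts, NOT the real data.
toy_units = [2, 3, 4, 5, 6, 7, 8, 10, 12, 20, 2498, 8800]
print(size_dependence_checks(toy_units))

# Criterion (B) using the (mean, median) pairs reported above.
reported = {
    "Building Age": (103.33, 114),
    "Total Units": (30.83, 7),
    "Sales Price": (18_842_320, 3_475_000),
}
mean_above_median = {name: mean > median for name, (mean, median) in reported.items()}
print(mean_above_median)
```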
For each variable (each predictor and the response), based on the descriptive statistics
output, decide if your particular variable has the problem described above. If so, and if all
of the data values for this variable are positive, try taking natural logs of the variable. To
do this, use Calc → Calculator. If, for example, you want to create a variable, LogPrice,
from the existing variable Price, type LogPrice in the box marked "Store result in
variable:", and type loge(Price) in the box marked "Expression:". Then create the
descriptive statistics graph again for the log of the variable, and decide whether the
problem is reduced, according to the criteria (A)-(D) above.
Total Units:
The LogTotalUnits variable appears to be very slightly skew-right based on the distribution fit
line and positive skewness metric, but the mean (1.9436) is slightly smaller than the median
(1.9459). The median line is shifted slightly towards the high side, and there are more outliers on
the high side than the low side. Overall, since the graphical summary no longer meets two of the
four criteria for size-dependent variability, and since taking the logs greatly increased the
symmetry and reduced the variability of the data, we would say the problem is reduced.
Sale Price:
The LogSalePrice variable appears to be slightly skew-left based on the distribution fit line and
the negative skewness metric with a mean (14.584) that is smaller than the median (15.061). The
median line is towards the high side, and there are more outliers on the low end than the high
end. Overall, since the graphical summary no longer meets any of the criteria for size-dependent
variability, we would say the problem is reduced.
Please note that if a variable has any zero or negative values, then taking logs is NOT
appropriate, so there is no point in trying it in this case. (Minitab will simply generate an
error message).
The reason we worry so much about taking logs is that it often helps the subsequent
statistical analysis. In particular, taking logs tends to bring the high outliers more in line
with the rest of the data, while at the same time "blowing up" the picture at the low end, so
that these points can now be seen more clearly.
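The transformation itself is simple to sketch outside Minitab. In the example below the sale prices are hypothetical and `natural_log` is our own helper name; it refuses zero or negative values for the reason given above. Since the natural log is an increasing function, the median of the logged values should be (at least approximately) the log of the raw median, and the medians in this report are consistent with that: ln(7) is about 1.9459 and ln(3,475,000) is about 15.061.

```python
import math
import statistics


def natural_log(values):
    """Return the natural log of each value, rejecting zero or negative inputs."""
    if any(v <= 0 for v in values):
        raise ValueError("natural log is undefined for zero or negative values")
    return [math.log(v) for v in values]


# Hypothetical sale prices with one very large value, NOT the real data.
sale_price = [450_000, 900_000, 1_800_000, 3_500_000, 7_000_000, 15_000_000, 400_000_000]
log_price = natural_log(sale_price)

# How far the mean is pulled above the median, before and after taking logs.
raw_ratio = statistics.fmean(sale_price) / statistics.median(sale_price)
log_ratio = statistics.fmean(log_price) / statistics.median(log_price)
print(round(raw_ratio, 2), round(log_ratio, 2))

# Logs of the raw medians reported for Total Units and Sales Price.
print(round(math.log(7), 4), round(math.log(3_475_000), 3))
```

In this toy example the mean sits far above the median before the transformation and almost on top of it afterwards, which is the effect criterion (B) looks for.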
6) Rerun the scatterplots (and answer the rest of question 2) using the logged variables wherever
this was found appropriate in question 5). Here are some examples of what I mean: If you decided
to take logs of predictor variable X2 only, then you should run a scatterplot of your response
variable (let's call it Y) against log(X2). If you decided to take logs of X2 and X3, then you should
run scatterplots of Y versus log(X2) and Y versus log(X3). If you decided to take logs of Y only, then
you should run scatterplots of log(Y) versus all of the (non-logged) predictor variables. If you
decided not to take the log of any of the variables, you do not need to do anything. For each
scatterplot you create here, compare it with the corresponding one from question 2). Did taking
logs help you to uncover a relationship between the variables?
From this scatter plot, we can see a weak positive trend. This is in line with the
hypothesis, although we did expect the trend to be stronger. Perhaps this is because the sales
price variable still has the size-dependent variability problem.
We were curious to see how the trends would change if we only log-transformed the response
variable:
Logged Sale Price by Total Units:
The trend is an upward one (although moderate), meaning that the higher the total units,
the greater the sale price is. There are no apparent outliers. This relationship is expected as the
more units in a building, the more units available to sell, so the building may be worth more to a
buyer because they can sell many units instead of few. This graph suggests that the logarithmic
transformation did not change the positive correlation between the variables. One reason why the
correlation may be more moderate is because there is more variety among the prices that
properties with low units sell for.
Logged Sale Price by Gross Square Feet:
The pattern observed here is very clearly a strong positive upward trend, meaning that the
higher the gross square feet, the greater the sale price is. There are no apparent outliers. This is
the relationship we expected, since it costs more money to build bigger properties, so those
properties would sell for more. This graph indicates that the positive correlation between the
variables still exists, and is even clearer, after the logarithmic transformation.
The original scatterplots were heavily influenced by outliers, so we expect that the
scatterplots will show a more defined and possibly linear trend when both the response and the
predictor are log-transformed:
Logged Sales Price by Logged Total Units:
The pattern here is almost non-existent, or very weak. There are no apparent outliers.
While this isn’t the relationship that we expected, it seems clear that this is the true relationship
between these two variables, because even if we manipulate the data to show a relationship, it is
always much weaker than the relationship between, say, sales price and gross square footage.
This is likely due to the fact that there are lots of apartments with low total units that have low
prices and lots of apartments with low total units that have high prices. The logarithmic
transformation made this lack of pattern more clear.
Logged Sales Price by Logged Gross Square Feet:
The pattern here is a strong, positive, linear trend, meaning as square footage increases,
sale prices increase. The logarithmic transformation made this trend more clear and more linear.
This fits with our hypothesis of a correlation between square footage and sale price, since, as
previously mentioned, bigger properties cost more, and are worth more.
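One way to see why logging both variables can make a trend more linear: if price grew like a power of square footage, the raw scatterplot would curve, while the log-log scatterplot would be an exact straight line. The sketch below only illustrates that idea. The power-law numbers (20000 and 0.5) and the square footages are assumptions made up for the example, not estimates from our data, and `pearson_r` is our own helper.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical power-law relationship: price = 20000 * sqft ** 0.5.
gross_sqft = [1_000, 4_000, 16_000, 64_000, 256_000, 1_024_000]
price = [20_000 * sqft ** 0.5 for sqft in gross_sqft]

r_raw = pearson_r(gross_sqft, price)
r_log = pearson_r(
    [math.log(sqft) for sqft in gross_sqft],
    [math.log(p) for p in price],
)
print("raw r:", round(r_raw, 3), "log-log r:", round(r_log, 3))
```

Real data will never fit a power law exactly, so the log-log correlation would be below 1, but the same straightening effect is what the logged scatterplots suggest.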