1. Introduction
Zillow is an American online real estate database company that provides home value estimates along with other services. However, Zillow has realized that its housing market predictions are not as accurate as they could be because they do not incorporate enough local intelligence. Thus, my partner and I built a predictive model of home prices for San Francisco.
Home prices matter greatly in daily life for individuals, governments and businesses. Housing creates a wealth effect that drives consumer spending, which accounts for 70% of U.S. GDP (Tom Silva, 2011). The oscillation of house prices affects the value of asset portfolios for most households, for whom a house is the largest single asset. Moreover, price movements influence the profitability of financial institutions and the soundness of the financial system (Yarui Li, 2011). However, predicting home prices is not easy. For all the buzz around data and algorithms, the difference between a good learner and a bad one usually comes down to which features are fed into the model, yet selecting features is often the most difficult and imprecise part of the machine learning process. What is more, home prices are influenced by many factors, including natural disasters, political unrest and macroeconomic conditions; it is difficult to select those that influence price most, and the number of factors must be kept manageable.
To cope with these challenges, we chose an Ordinary Least Squares (OLS) model as our fundamental model to relate multiple selected predictors to sale price. First, we collected data from the Internet and cleaned it. Then we performed exploratory analysis and feature engineering to select the features. Next, a preliminary model was built and tested using spatial statistics and model testing techniques, with further optimization where needed. After multiple trials and adjustments, we arrived at the model with the smallest MAE and MAPE.
Our resulting model includes 20 significant predictors based on 9433 observations of local characteristics. When checking the spatial structure of the residuals and the dependent variable, we found spatial autocorrelation (Moran's I = -0.027), so we introduced a neighborhood spatial lag into the model. Nearly 77% of the variance in the dependent variable is explained by the model (R2 and adjusted R2 are 0.771 and 0.770, respectively). It has a mean absolute error (MAE) of 224795.8 and a mean absolute percentage error (MAPE) of 22.35%. Overall, it is a good model that captures both property-level and external factors.
2. Methodology
Our methodology for establishing, optimizing and validating the home sale price model is demonstrated in the flow chart below:
2.1 Data collection & wrangling
At the very beginning, we subjectively selected several factors considered to have a significant impact on housing prices (based on the hedonic model). Then we downloaded the data from the Internet and cleaned it in order to match the values of the predictors with our house price observations.
2.2 Exploratory analysis
Next, an exploratory analysis was performed to select the potential predictors, including a correlation test of the variables and visualization of continuous and dummy variables. A thoughtful exploratory analysis also makes the model more interpretable for non-technical domain experts.
2.3 Feature engineering
After the exploratory analysis, we performed feature engineering on the potential predictors. Feature engineering is the art of extracting predictive power from the data: the more we can convert raw data into useful features, the better the model will perform.
2.4 Model estimation and validation
We regressed sale price on the selected features and built our OLS model, which then needed to be tested with spatial statistics and model testing techniques to verify its accuracy and generalizability. If the model's forecasting ability was unsatisfactory, further feature engineering and model optimization were needed, a process of trial and error. After multiple variable adjustments, our final OLS model was built.
3. Model Establishment and Optimization
In Chapter 3 we introduce the establishment and optimization of our home sale price model; the model validation process is covered in the next chapter.
3.1 Data Collection
The factors for our prediction model are chosen according to the hedonic model. The hedonic model is both a theoretical and practical framework for deconstructing the price of a good into the value of its constituent parts. For our purpose, three theoretical components are needed to sufficiently predict house prices. The first is internal, parcel-level characteristics such as property area, stories and number of bedrooms. The second is amenities and demographic characteristics, such as facilities, building violations, racial diversity and education level. The third is the spatial structure of home sale prices.
Price = Internal characteristics + Amenities + Spatial structure + Error
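Written as a regression, this decomposition takes roughly the following hedonic form (a sketch; the groupings are illustrative rather than the exact variable list):

$$\text{Price}_i = \beta_0 + \sum_{k}\beta_k X_{ik} + \sum_{m}\gamma_m A_{im} + \rho\,\bar{P}_i + \varepsilon_i$$

where the $X_{ik}$ are internal, parcel-level characteristics, the $A_{im}$ are amenity and demographic variables, $\bar{P}_i$ is the spatial lag of nearby sale prices introduced in Section 3.5, and $\varepsilon_i$ is the error term.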
Based on the hedonic model, we collected data for 33 variables and ultimately kept 20. A table of summary statistics for the final predictors, along with the data-gathering method, is shown below.
We used Open Data San Francisco to collect facility data, road data, stop data, building violation data, and the existing neighborhood designations (a GeoJSON file), among others. To match these data with the house data, we applied geographic processing (line density analysis, buffer analysis, spatial joins and so on) during data cleaning. We also used 2015 U.S. Census data for demographic variables such as median income, racial diversity, and the percentage of the population holding a bachelor's degree or higher; these data are collected at the block-group level, which makes them easy to summarize for individual observations. Finally, we downloaded median house value data and school score data from Zillow's open data and used a k-nearest-neighbor method and spatial joins to match them with the housing data.
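As an illustration of the spatial-join step, a minimal sketch using sf is shown below; the blockgroups layer, its file path and its column names are hypothetical stand-ins for the Census 2015 block-group data.
library(sf)
library(dplyr)
# hypothetical block-group polygons carrying the demographic attributes,
# projected to the same CRS (EPSG:2227) as the sales points
blockgroups <-
st_read("census_blockgroups_2015.shp") %>%   # hypothetical path
st_transform(2227)
# attach block-group attributes to each sale with a point-in-polygon join
sf <-
sf %>%
st_join(blockgroups %>% dplyr::select(median.income, racial, PCT_bachelor),
join = st_intersects)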
For null values, we filled mainly by the mean-filling principle and the k-proximity principle (that is, we computed the average value of each home sale's k nearest neighbors and used it to fill the null value). In total, we selected 9433 observations as our in-sample data.
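A minimal sketch of the two filling rules is shown below; the columns used are illustrative, and the k-proximity helper is a hypothetical function written here for clarity rather than the exact code we used.
library(FNN)     # for get.knnx()
# mean filling: replace missing values of a column with its overall mean
sf$Baths <- ifelse(is.na(sf$Baths), mean(sf$Baths, na.rm = TRUE), sf$Baths)
# k-proximity filling: replace a missing value with the mean of the same field
# at the k nearest sales that are not missing
fill_knn <- function(points_xy, values, k = 5) {
missing <- is.na(values)
if (!any(missing)) return(values)
nn <- get.knnx(points_xy[!missing, , drop = FALSE],
points_xy[missing, , drop = FALSE], k)$nn.index
values[missing] <- apply(nn, 1, function(idx) mean(values[!missing][idx]))
values
}
# usage sketch: sf$stoFill <- fill_knn(st_coordinates(sf), sf$stoFill, k = 5)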
3.2 Exploratory Analysis
Correlation analysis
To begin with, we use a correlation matrix to examine the relationships among the numeric variables. The correlation matrix indicates whether variables are strongly related to each other and helps us remove predictors that are redundant with one another. As can be seen from the output, every pairwise correlation coefficient is below 0.9, which means that none of the independent variables is strongly collinear with another.
# correlation matrix (reg)
sfCor <-
sf_1 %>%
st_set_geometry(NULL) %>%
dplyr::select(SalePrice,LotArea_Ne,PropArea_N,SaleYr,mdhv_mean,vio_nn5,
facility.Buffer,NumofTree,SchoolScore,RoadDensity,countStops,Age_new,month,
PCT_detached.house.unit,PCT_bachelor,median.income,racial,
Baths,stoFill)
M <- cor(sfCor)
corrplot(M, method = "number")
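As a quick programmatic check of the 0.9 threshold, the highly correlated pairs (if any) can be listed directly from the matrix M computed above; a small sketch:
# list variable pairs whose absolute correlation exceeds 0.9
highCor <- which(abs(M) > 0.9 & upper.tri(M), arr.ind = TRUE)
data.frame(var1 = rownames(M)[highCor[, 1]],
var2 = colnames(M)[highCor[, 2]],
r    = M[highCor])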
Visualization of continuous variables and dummy variables
To visualize a linear correlation between a predictor and the dependent variable, a "least squares" line is drawn through the point cloud (more on least squares below). A point cloud that hugs the line is indicative of a correlation, while one that resembles random noise suggests no relationship between the two variables. Below, SalePrice is plotted as a function of four numeric features: PropArea_N, median.income (median household income in the census tract where the house is located), racial (ethnic diversity index), and SchoolScore (the average score of the 3 nearest schools to the house), the variables we are most interested in.
# scatterplots of 4 continuous variables (reg)
as.data.frame(sf_1) %>%
dplyr::select(SalePrice,PropArea_N,median.income,racial,SchoolScore) %>%
filter(SalePrice <= 1000000) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point(colour = "#9999CC",size=0.7) +
geom_smooth(method = "lm", se=F, colour = "#003366") +
facet_wrap(~Variable, ncol =2, scales = "free") +
labs(title = "Price as a function of continuous variables") +
theme(axis.text.x = element_blank()) +
plotTheme()
At the same time, we also use bar graphs to visualize the impact of the dummy variables on housing price.
Overview maps
To give an overview of housing prices in San Francisco, we mapped the spatial distribution of the 9433 existing housing sales before building the prediction model. As can be seen from the sale price map, high-priced sales cluster in the middle and northern parts of the city, while housing prices in the southern and western parts are much lower.
# map the original data
ggplot() +
geom_sf(data = nhoods) +
geom_sf(data = sf, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) +
labs(title="Sale price, SF") +
scale_color_manual(values=palette5)+
mapTheme()
Besides, we also mapped three of our most interesting independent variables.
The first map shows the average distance to the 5 nearest building violations. We calculate the average distance from each home sale to its 5 nearest building violations, add a new field named vio_nn5, and then map it.
# load the building violation data
vio<-
read.csv("C:/Users/HP/Desktop/building_violation_12to15.csv") %>%
na.omit() %>%
st_as_sf(coords = c("longtitude", "latitude"), crs = 4326, agr = "constant") %>%
st_transform(2227)
# calculate the average nearest neighbor distance from each home sale to its 5 nearest building violations
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
sf$vio_nn5=
nn_function(st_coordinates(sf),st_coordinates(vio),5)
ggplot() +
geom_sf(data = nhoods) +
geom_sf(data = sf, aes(colour = q5(vio_nn5)), show.legend = "point", size = .75) +
scale_color_manual(values=palette5)+
labs(title="5 nearest neighbor distance of building violations, SF") +
mapTheme()
The second map indicates the number of facilities within a 1/2 mile buffer. We count the facilities within a 1/2 mile buffer of each house, add a new field named facility.Buffer, and then map it.
#load the facility data
facility <-
st_read("E:/2019MUSA/MUSA_507_Public_Policy_Analytics/HW_midterm/midtermData_studentVersion/data_add/facilities/cityFacilities.shp") %>%
st_set_crs(4326) %>%
st_transform(2227)
facility.sf <-
facility %>%
dplyr::select(LATITUDE, LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
st_transform(2227) %>%
distinct()
# take the sum of facilities for the 1/2 mile buffer
sf$facility.Buffer =
sf %>%
st_buffer(2640) %>%
aggregate(mutate(facility.sf, facility.Buffer = 1),.,sum) %>%
pull(facility.Buffer)
ggplot() +
geom_sf(data = nhoods) +
geom_sf(data = sf, aes(colour = q5(facility.Buffer)), show.legend = "point", size = .75) +
scale_color_manual(values=palette5)+
labs(title="Number of facilities within 1/2 mile buffer, SF") +
mapTheme()
The third map shows the road density index. We used the line density and extract-to-points tools in ArcGIS to calculate a road density index for each house and then mapped the RoadDensity value.
# map the Road Density data (the index was computed in ArcGIS and joined to the sales beforehand)
ggplot() +
geom_sf(data = nhoods) +
geom_sf(data = sf, aes(colour = q5(RoadDensity)), show.legend = "point", size = .75) +
labs(title="Road Density, SF") +
scale_color_manual(values=palette5)+
mapTheme()
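Because the density surface was produced with ArcGIS tools, the chunk above only maps the already-joined RoadDensity field. For reference, a rough sf-based approximation is sketched below; the roads layer, its path, and the use of ParcelID as the sale identifier are assumptions.
library(sf)
library(dplyr)
# hypothetical roads layer, projected to the same CRS (feet) as the sales
roads <- st_read("sf_roads.shp") %>% st_transform(2227)   # hypothetical path
# total road length within a 1/2 mile (2640 ft) buffer of each sale,
# divided by the (constant) buffer area to give a simple density index
buffers <- st_buffer(sf, 2640) %>% dplyr::select(ParcelID)  # assumes a sale ID column
road_in_buffer <-
st_intersection(roads, buffers) %>%
mutate(seg_length = as.numeric(st_length(.))) %>%
st_set_geometry(NULL) %>%
group_by(ParcelID) %>%
summarize(road_length = sum(seg_length))
buffer_area <- pi * 2640^2     # square feet
sf <-
left_join(sf, road_in_buffer, by = "ParcelID") %>%
mutate(RoadDensity_approx = coalesce(road_length, 0) / buffer_area)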
3.3 Feature Engineering
In the feature engineering process, we conducted variable transformation and dimension reduction on our data. For example, based on the bar plots of the categorical variables, we divided the original ZoningCode types into three new classes, H+ (types with high sale prices), H (types with regular sale prices) and H- (types with low sale prices), in order to improve predictive performance; a sketch of this recode follows the plot below.
as.data.frame(sf_1) %>%
dplyr::select(SalePrice,ZoningCode,ZoneCoCla) %>%
mutate_at(vars(starts_with("R")), funs(factor)) %>%
filter(SalePrice <= 1000000) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_bar(position = "dodge", stat = "summary", fun.y = "mean", col = "white") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of categorical variables", y = "Mean_Price") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
plotTheme()
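The reclassification itself can be expressed as a simple recode; the sketch below uses case_when, and the zoning codes listed in each group are illustrative placeholders rather than our exact grouping.
library(dplyr)
# collapse the original zoning codes into three price-level classes
sf_1 <-
sf_1 %>%
mutate(ZoneCoCla = case_when(
ZoningCode %in% c("RH-1(D)", "RH-1") ~ "H+",   # types with high mean sale price
ZoningCode %in% c("RH-2", "RH-3")    ~ "H",    # types with regular mean sale price
TRUE                                 ~ "H-"    # remaining types with low mean sale price
))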
3.4 Baseline Regression Modeling
Based on the exploratory analysis and further feature engineering, we set up a baseline model of house price forecasting. The output of the baseline regression model is presented below.
# OLS regression
reg<- lm(SalePrice ~ ., data = as.data.frame(sf_1) %>%
dplyr::select(SalePrice,LotArea_Ne,PropArea_N,SaleYr,mdhv_mean,vio_nn5,facility.Buffer,NumofTree,SchoolScore,RoadDensity,countStops,Age_new,month,PCT_detached.house.unit,PCT_bachelor,median.income,racial, ZoneCoCla,PropClassC,Baths,stoFill))
summary(reg)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(sf_1) %>% dplyr::select(SalePrice,
## LotArea_Ne, PropArea_N, SaleYr, mdhv_mean, vio_nn5, facility.Buffer,
## NumofTree, SchoolScore, RoadDensity, countStops, Age_new,
## month, PCT_detached.house.unit, PCT_bachelor, median.income,
## racial, ZoneCoCla, PropClassC, Baths, stoFill))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6696705 -203313 -20757 162264 2809308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.497e+06 2.869e+05 -12.191 < 2e-16 ***
## LotArea_Ne 6.860e-01 4.586e-02 14.959 < 2e-16 ***
## PropArea_N 2.914e+02 7.483e+00 38.939 < 2e-16 ***
## SaleYr 1.598e+05 3.686e+03 43.360 < 2e-16 ***
## mdhv_mean 1.325e+03 4.239e+01 31.260 < 2e-16 ***
## vio_nn5 -5.039e+01 1.628e+01 -3.095 0.001976 **
## facility.Buffer -1.020e+03 1.880e+02 -5.423 5.99e-08 ***
## NumofTree -6.949e+02 1.125e+02 -6.175 6.90e-10 ***
## SchoolScore -1.301e+03 2.277e+03 -0.571 0.567692
## RoadDensity -6.687e+03 7.299e+02 -9.161 < 2e-16 ***
## countStops 2.515e+03 4.554e+02 5.522 3.44e-08 ***
## Age_new 6.141e+02 1.736e+02 3.537 0.000406 ***
## month 8.277e+03 1.242e+03 6.665 2.79e-11 ***
## PCT_detached.house.unit -3.477e+03 2.507e+02 -13.868 < 2e-16 ***
## PCT_bachelor 5.840e+03 5.997e+02 9.738 < 2e-16 ***
## median.income 1.488e+00 1.693e-01 8.788 < 2e-16 ***
## racial -6.784e+05 5.633e+04 -12.044 < 2e-16 ***
## ZoneCoClaH- -7.263e+04 6.000e+04 -1.211 0.226105
## ZoneCoClaH+ -2.547e+04 2.028e+04 -1.256 0.209171
## PropClassCD 1.016e+06 2.783e+05 3.649 0.000264 ***
## PropClassCDA 8.950e+05 2.952e+05 3.032 0.002438 **
## PropClassCF 7.996e+05 2.838e+05 2.818 0.004847 **
## PropClassCLZ 4.407e+05 2.909e+05 1.515 0.129784
## PropClassCOZ -6.504e+05 4.832e+05 -1.346 0.178339
## PropClassCTH 4.929e+05 2.982e+05 1.653 0.098415 .
## PropClassCTIC -4.099e+05 2.833e+05 -1.447 0.148029
## PropClassCZ 5.633e+05 2.787e+05 2.021 0.043341 *
## PropClassCZBM 3.056e+05 3.079e+05 0.993 0.320865
## PropClassCZEU 3.226e+06 3.949e+05 8.169 3.51e-16 ***
## Baths 2.407e+04 5.045e+03 4.770 1.87e-06 ***
## stoFill 1.419e+05 8.099e+03 17.521 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 392700 on 9402 degrees of freedom
## Multiple R-squared: 0.6873, Adjusted R-squared: 0.6863
## F-statistic: 688.8 on 30 and 9402 DF, p-value: < 2.2e-16
The regression results show a satisfactory significance level (most predictors have p-values below 0.05) and decent explanatory power (R2 = 0.6873; that is, 68.73% of the variance in the dependent variable is explained by the model). It can be clearly seen that predictors such as lot area, property area and sale year have a positive association with sale price, while predictors such as the distance to nearby building violations (vio_nn5), school score and road density have a negative association with sale price.
3.5 Spatial Autocorrelation and Model Optimization
In Section 3.1 we discussed the three elements that combine to predict price: internal characteristics, neighborhood amenities and the underlying spatial structure of prices. In this part, we focus on the spatial structure of prices and how to revise our model accordingly.
Map of Price Errors
First, we plot the price errors for our in-sample data. Visually, it appears that errors cluster together, but this needs further confirmation.
sf_1$predict_ori=predict(reg,sf_1)
sf_1$error_ori=sf_1$SalePrice - sf_1$predict_ori
# plot residual map
sf_1 %>%
ggplot() +
geom_sf(data = nhoods, fill = "white") +
geom_sf(aes(colour = q5(error_ori)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(sf_1,"error_ori"),
name="Quintile\nBreaks") +
labs(title="Price Errors of In-sample Data") +
mapTheme()
To further examine spatial autocorrelation, we ran a Moran's I test on our baseline model. Moran's I is a statistical hypothesis test that asks whether, and to what extent, a spatial phenomenon exhibits a given spatial process. A Moran's I approaching 1 describes positive spatial autocorrelation, also known as clustering. Instances where high and low prices 'repel' one another are said to be dispersed, and the Moran's I statistic is then closer to -1. Finally, where positive and negative values are randomly distributed, Moran's I is near 0. In the plot below, the frequencies of 999 randomly permuted Moran's I values are plotted as a histogram and the observed I is indicated by the vertical line. That the observed I is higher than the I of the 999 random permutations provides visual confirmation of spatial autocorrelation.
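For reference, the statistic being permuted takes the standard form below, where $x_i$ is the residual of sale $i$, $\bar{x}$ the mean residual, and $w_{ij}$ the row-standardized k-nearest-neighbor weight between sales $i$ and $j$:

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}$$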
Moran’s I Test
sf_1_copy <-
sf_1 %>%
mutate(SalePrice.Predict = predict(reg, sf_1),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE_ = (abs(SalePrice - SalePrice.Predict)) / SalePrice)
coords_copy <-
filter(sf_1_copy, !is.na(SalePrice.Error)) %>%
st_coordinates()
neighborList_copy <- knn2nb(knearneigh(coords_copy, 5))
spatialWeights_copy <- nb2listw(neighborList_copy, style="W")
moranTest1 <- moran.mc(filter(sf_1_copy, !is.na(SalePrice.Error))$SalePrice.Error,
spatialWeights_copy, nsim = 999)
ggplot(as.data.frame(moranTest1$res[c(1:999)]), aes(moranTest1$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest1$statistic), colour = "#003366",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in red",
x="Moran's I",
y="Count") +
plotTheme()
Regression model with neighborhood effects
Home prices exhibit spatial autocorrelation, so controlling for localized means should improve the predictive accuracy of the model. To account for this neighborhood effect, a new feature named lag_price_nn is created: for each home sale we calculate the average sale price of its 5 nearest (non-hold-out) houses and store it in lag_price_nn.
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(st_coordinates(measureTo), st_coordinates(measureFrom), k)$nn.index
for (i in seq(1,dim(nn)[1],by=1)){
for (j in seq(1,dim(nn)[2],by=1)){
hold <- measureTo[nn[i,j],]$holdOut
if (hold==0){
saleprice <- measureTo[nn[i,j],]$SalePrice
nn[i,j] <- saleprice
}
else{
nn[i,j] <- 0
}
# nn[i,j] <- price
}
}
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_price, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointprice = mean(point_price)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
sf$lag_price_nn <- nn_function(sf,sf,5)
Again, we build a new model with neighborhood effects to predict home prices. The output of the regression model with neighborhood effects is presented below.
sf_1<- filter(sf,sf$holdOut==0)
reg_1<- lm(SalePrice ~ ., data = as.data.frame(sf_1) %>%
dplyr::select(SalePrice,LotArea_Ne,PropArea_N,SaleYr,mdhv_mean,vio_nn5,lag_price_nn,
NumofTree,SchoolScore,RoadDensity,countStops,Age_new,month,facility.Buffer,
PCT_detached.house.unit,PCT_bachelor,median.income,racial,
ZoneCoCla,PropClassC,Baths,stoFill))
summary(reg_1)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(sf_1) %>% dplyr::select(SalePrice,
## LotArea_Ne, PropArea_N, SaleYr, mdhv_mean, vio_nn5, lag_price_nn,
## NumofTree, SchoolScore, RoadDensity, countStops, Age_new,
## month, facility.Buffer, PCT_detached.house.unit, PCT_bachelor,
## median.income, racial, ZoneCoCla, PropClassC, Baths, stoFill))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4890781 -168383 -7781 140874 2431386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.717e+06 2.459e+05 -11.051 < 2e-16 ***
## LotArea_Ne 3.699e-01 3.962e-02 9.336 < 2e-16 ***
## PropArea_N 2.042e+02 6.575e+00 31.064 < 2e-16 ***
## SaleYr 1.413e+05 3.170e+03 44.590 < 2e-16 ***
## mdhv_mean 5.352e+02 3.870e+01 13.828 < 2e-16 ***
## vio_nn5 -4.697e+01 1.393e+01 -3.371 0.000753 ***
## lag_price_nn 6.226e-01 1.062e-02 58.609 < 2e-16 ***
## NumofTree -4.839e+02 9.638e+01 -5.021 5.23e-07 ***
## SchoolScore -7.781e+03 1.952e+03 -3.987 6.75e-05 ***
## RoadDensity -2.893e+03 6.281e+02 -4.607 4.14e-06 ***
## countStops 1.481e+03 3.901e+02 3.795 0.000149 ***
## Age_new -1.188e+02 1.491e+02 -0.796 0.425801
## month 7.133e+03 1.063e+03 6.710 2.06e-11 ***
## facility.Buffer -4.463e+02 1.612e+02 -2.769 0.005635 **
## PCT_detached.house.unit -1.963e+03 2.161e+02 -9.082 < 2e-16 ***
## PCT_bachelor 2.412e+03 5.165e+02 4.669 3.06e-06 ***
## median.income 4.904e-01 1.459e-01 3.362 0.000777 ***
## racial -3.426e+05 4.855e+04 -7.056 1.84e-12 ***
## ZoneCoClaH- -1.147e+04 5.136e+04 -0.223 0.823360
## ZoneCoClaH+ 4.046e+04 1.739e+04 2.326 0.020043 *
## PropClassCD 6.093e+05 2.383e+05 2.557 0.010584 *
## PropClassCDA 5.696e+05 2.527e+05 2.254 0.024220 *
## PropClassCF 5.009e+05 2.429e+05 2.062 0.039246 *
## PropClassCLZ 2.244e+05 2.490e+05 0.901 0.367482
## PropClassCOZ -4.072e+05 4.136e+05 -0.985 0.324865
## PropClassCTH 3.601e+05 2.552e+05 1.411 0.158270
## PropClassCTIC -5.062e+05 2.425e+05 -2.087 0.036871 *
## PropClassCZ 2.271e+05 2.386e+05 0.952 0.341294
## PropClassCZBM -1.707e+04 2.636e+05 -0.065 0.948375
## PropClassCZEU 2.538e+06 3.382e+05 7.504 6.74e-14 ***
## Baths 2.485e+04 4.318e+03 5.755 8.96e-09 ***
## stoFill 8.971e+04 6.989e+03 12.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 336100 on 9401 degrees of freedom
## Multiple R-squared: 0.771, Adjusted R-squared: 0.7702
## F-statistic: 1021 on 31 and 9401 DF, p-value: < 2.2e-16
From the regression result above, the R-squared of the current model is 0.771 (adjusted 0.7702), meaning the model accounts for about 77% of the variance in sale price. However, we should recognize the risk of overfitting: unless the model also generalizes well, the otherwise impressive R-squared may be misleading. Thus, it is important to validate the model further.
4. Model Validation
In this chapter, we validate the current model from two perspectives, accuracy and generalizability, and then check the spatial autocorrelation of its residuals again.
4.1 Accuracy
Maps of predicted and observed sale prices for the in-sample data
In order to test the accuracy of our model, we map the predicted and observed sale prices to get an overview of the predicted results. The maps indicate that the differences between predicted and observed values are small, which supports the accuracy of our final model.
grid.arrange(
ggplot() +
geom_sf(data = nhoods, fill = "white") +
geom_sf(data = sf, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(sf,"SalePrice"),
name="Quintile\nBreaks") +
labs(title="Real Sale Price") +
mapTheme(),
ggplot() +
geom_sf(data = nhoods, fill = "white") +
geom_sf(data = sf, aes(colour = q5(SalePrice.Predict)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(sf,"SalePrice.Predict"),
name="Quintile\nBreaks") +
labs(title="Predicted Sale Price") +
mapTheme(), ncol = 2)
Scatterplot of predicted and observed prices of in-sample data
The figure below shows the scatterplot of predicted and observed prices for our in-sample data, which indicates that our model predicts most of the data precisely (with the exception of several outliers).
sf_1 <-
sf_1 %>%
mutate(Regression = "Spatial lag Effects",
SalePrice.Predict = predict(reg_1, sf_1),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice)
ggplot(sf_1, aes(SalePrice.Predict, SalePrice)) +
geom_point(colour = "#9999CC") +
geom_smooth(method = "lm", se = FALSE, colour = "#003366") +
labs(title = "Price as a function of the Observed price",
x = "Observed price",
y = "Sale Price") +
plotTheme()
Regression results of training set
Then we randomly split the in-sample data into a 60% training set and a 40% test set. The following table shows the regression results for the training set:
#training data
inTrain<- createDataPartition(
y = paste(
sf_1$PCT_detached.house.unit,sf_1$PCT_bachelor,sf_1$median.income,sf_1$racial,
sf_1$PropClassC,sf_1$ZoneCoCla,sf_1$Baths,sf_1$SaleYr,sf_1$month,sf_1$stoFill),
p = .60, list = FALSE)
sf_1.training <- sf_1[inTrain,]
sf_1.test <- sf_1[-inTrain,]
# training regression
reg_traning<- lm(SalePrice ~ ., data = as.data.frame(sf_1.training) %>%
dplyr::select(SalePrice,LotArea_Ne,PropArea_N,
facility.Buffer,SchoolScore,RoadDensity,vio_nn5,countStops,Age_new,
PCT_detached.house.unit,PCT_bachelor,median.income,racial,
PropClassC,ZoneCoCla,Baths,SaleYr,month,stoFill,mdhv_mean,NumofTree,lag_price_nn))
summary(reg_traning)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(sf_1.training) %>%
## dplyr::select(SalePrice, LotArea_Ne, PropArea_N, facility.Buffer,
## SchoolScore, RoadDensity, vio_nn5, countStops, Age_new,
## PCT_detached.house.unit, PCT_bachelor, median.income,
## racial, PropClassC, ZoneCoCla, Baths, SaleYr, month,
## stoFill, mdhv_mean, NumofTree, lag_price_nn))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4893686 -171700 -7968 144197 2428845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.732e+06 2.491e+05 -10.966 < 2e-16 ***
## LotArea_Ne 3.814e-01 4.094e-02 9.316 < 2e-16 ***
## PropArea_N 2.041e+02 6.691e+00 30.511 < 2e-16 ***
## facility.Buffer -4.271e+02 1.651e+02 -2.587 0.00970 **
## SchoolScore -8.292e+03 2.013e+03 -4.120 3.83e-05 ***
## RoadDensity -3.021e+03 6.489e+02 -4.656 3.28e-06 ***
## vio_nn5 -4.713e+01 1.440e+01 -3.274 0.00106 **
## countStops 1.517e+03 3.998e+02 3.793 0.00015 ***
## Age_new -1.486e+02 1.524e+02 -0.975 0.32967
## PCT_detached.house.unit -1.997e+03 2.238e+02 -8.925 < 2e-16 ***
## PCT_bachelor 2.474e+03 5.300e+02 4.667 3.09e-06 ***
## median.income 4.758e-01 1.489e-01 3.194 0.00141 **
## racial -3.389e+05 4.987e+04 -6.795 1.15e-11 ***
## PropClassCD 6.052e+05 2.412e+05 2.509 0.01213 *
## PropClassCDA 5.669e+05 2.558e+05 2.216 0.02671 *
## PropClassCF 4.955e+05 2.459e+05 2.015 0.04391 *
## PropClassCLZ 2.149e+05 2.520e+05 0.853 0.39385
## PropClassCOZ -4.193e+05 4.187e+05 -1.002 0.31656
## PropClassCTH 3.524e+05 2.584e+05 1.364 0.17264
## PropClassCTIC -5.124e+05 2.454e+05 -2.088 0.03683 *
## PropClassCZ 2.200e+05 2.415e+05 0.911 0.36235
## PropClassCZBM -2.231e+04 2.668e+05 -0.084 0.93335
## PropClassCZEU 2.524e+06 3.423e+05 7.373 1.81e-13 ***
## ZoneCoClaH- -1.266e+04 5.200e+04 -0.243 0.80768
## ZoneCoClaH+ 3.836e+04 1.763e+04 2.176 0.02955 *
## Baths 2.534e+04 4.397e+03 5.763 8.55e-09 ***
## SaleYr 1.423e+05 3.267e+03 43.569 < 2e-16 ***
## month 7.355e+03 1.094e+03 6.726 1.84e-11 ***
## stoFill 8.974e+04 7.105e+03 12.632 < 2e-16 ***
## mdhv_mean 5.438e+02 3.954e+01 13.752 < 2e-16 ***
## NumofTree -4.948e+02 9.900e+01 -4.998 5.89e-07 ***
## lag_price_nn 6.232e-01 1.085e-02 57.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 340200 on 9060 degrees of freedom
## Multiple R-squared: 0.7701, Adjusted R-squared: 0.7693
## F-statistic: 978.8 on 31 and 9060 DF, p-value: < 2.2e-16
The output tells us that about 77% of the variance in the dependent variable is explained by the training model. Besides, the test-set MAE (149482.3) suggests that the error is acceptable, considering that the mean sale price for our test data is 894058.6. A mean absolute percentage error (MAPE) of roughly 20% suggests that our model is off by only about one fifth.

R_square | meanTestMAE | meanTestMAPE |
---|---|---|
0.6872774 | 146540.2 | 17.58201 |
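The test-set error figures can be reproduced by predicting on the held-out 40% and averaging the absolute and percentage errors; a short sketch using the training model reg_traning and the sf_1.test split defined above (MAE is the mean of |observed - predicted|, MAPE the mean of |observed - predicted| / observed):
# predict on the held-out test set and summarize the errors
sf_1.test <-
sf_1.test %>%
mutate(SalePrice.Predict  = predict(reg_traning, as.data.frame(sf_1.test)),
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE      = abs(SalePrice - SalePrice.Predict) / SalePrice)
meanTestMAE  <- mean(sf_1.test$SalePrice.AbsError, na.rm = TRUE)
meanTestMAPE <- mean(sf_1.test$SalePrice.APE, na.rm = TRUE) * 100   # in percent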
4.2 Generalizability
There are two types of generalizability: generalizability to data that the model hasn't seen yet, and generalizability across different urban contexts. We validate both in this part.
In order to judge generalizability not on one random hold-out but on many, we used cross-validation to help ensure that the goodness of fit on a single hold-out is not a fluke. The standard deviation of the MAE across all folds is 25188.49, which is small relative to the mean fold MAE of 229456.9, indicating that there is not too much variation across folds. In other words, the model does not show an obvious overfitting problem.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <- train(SalePrice ~ ., data = as.data.frame(sf_1.training) %>%
dplyr::select(SalePrice,LotArea_Ne,PropArea_N,
facility.Buffer,SchoolScore,RoadDensity,vio_nn5,countStops,Age_new,
PCT_detached.house.unit,PCT_bachelor,median.income,racial,
PropClassC,ZoneCoCla,Baths,SaleYr,month,stoFill,mdhv_mean,NumofTree,lag_price_nn),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 9092 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 9003, 9001, 9001, 9003, 9002, 9000, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 338341.2 0.7740983 229456.9
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
meanMAE <- mean(reg.cv$resample[,3])
SD_MAE <- sd(reg.cv$resample[,3])
kfold_train <- data.frame(meanMAE,SD_MAE)
kfold_train %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
meanMAE | SD_MAE |
---|---|
229456.9 | 25188.49 |
However, according to the cross-validation MAE histogram, the distribution of mean absolute errors is relatively dispersed, which means that our model might not generalize as well to random hold-outs as it did to our test set.
ggplot(as.data.frame(reg.cv$resample), aes(MAE)) +
geom_histogram(bins = 50, colour="white", fill = "#9999CC") +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
x="Mean Absolute Error", y="Count") +
plotTheme()
In order to judge the model’s generalizability across different urban contexts, we decided to see how well the Baseline Regression generalizes to different urban contexts and compare that to the Neighborhood Effects regression. To do so, we begin simply by mapping mean errors by neighborhood.It is clear the Neighborhood Effects model is more accurate, but more importantly, the model’s accuracy is more consistent across neighborhoods relative to the Baseline Regression. This consistency, suggests the Neighborhood Effects model is more generalizable.
medhv_nhoods <- data.frame(medhv$ParcelID,medhv$name)
colnames(medhv_nhoods)[1] <- "ParcelID"
sf_1.test <- merge(sf_1.test,medhv_nhoods,by="ParcelID",all=FALSE)
a <- left_join(
st_set_geometry(sf_1.test, NULL) %>%
group_by(medhv.name) %>%
summarize(meanPrice = mean(SalePrice, na.rm = T)),
st_set_geometry(sf_1.test,NULL) %>%
group_by(medhv.name) %>%
summarize(meanAPE = mean(SalePrice.APE,na.rm = T)))
colnames(a)[1] <- "name"
nhoods_merge <- merge(nhoods,a,by='name',all=FALSE)
for (i in seq(1,dim(nhoods_merge)[1], by=1)){
if (is.na(nhoods_merge[i,]$meanAPE)){
nhoods_merge[i,]$meanAPE <- 0
}
}
ggplot() +
geom_sf(data = nhoods_merge, aes(fill = q5(meanAPE))) +
scale_fill_manual(values = palette5,
labels = qBr(nhoods_merge, "meanAPE", rnd = F),
name="Quintile\nBreaks") +
labs(title = "Mean test set MAPE by neighborhood") +
mapTheme()
The following scatterplot of MAPE by neighborhood as a function of mean price by neighborhood indicates that our model under-predicts lower-priced homes in the test data; for the in-sample data, however, the model predicts lower-priced sales better.
ggplot(nhoods_merge, aes(meanPrice, meanAPE)) +
geom_point(colour = "#9999CC") +
geom_smooth(method = "lm", se = FALSE, colour = "#003366") +
labs(title = "MAPE by neighborhood as a function of mean price by neighborhood for test data",
x = "meanPrice.x",
y = "meanAPE.x") +
plotTheme()
Then we examine whether our model works equally well in rich neighborhoods as in poor ones, and in minority neighborhoods as in white neighborhoods. As can be seen from the following maps and tables, our model works reasonably well in both rich and poor neighborhoods, as well as in white and non-white neighborhoods.
grid.arrange(
ggplot() +
geom_sf(data = na.omit(tracts17), aes(fill = raceContext)) +
scale_fill_manual(values = c("#CCCCFF", "#9999CC"),
name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() +
geom_sf(data = na.omit(tracts17), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#9999CC","#CCCCFF"),
name="Income Context") +
labs(title = "Income Context") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
Regression | Majority Non-White | Majority White |
---|---|---|
Baseline Regression | 0.2690264 | 0.2575999 |
Spatial lag Effects | 0.2156201 | 0.2333151 |
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 0.2518319 | 0.2749505 |
Spatial lag Effects | 0.2292996 | 0.2180281 |
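The two tables above report mean test-set APE by context. A sketch of how such a summary can be produced is shown below, assuming sf_1.test still carries its point geometry and that tracts17 holds the raceContext and incomeContext fields mapped above; swapping raceContext for incomeContext gives the second table, and running it on each model's errors gives the two rows.
library(sf)
library(dplyr)
# attach the census-tract context labels to each test-set sale,
# then average the absolute percentage error by context
st_join(sf_1.test, tracts17 %>% dplyr::select(raceContext, incomeContext)) %>%
st_set_geometry(NULL) %>%
group_by(raceContext) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = TRUE))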
4.3 Spatial Autocorrelation Test
From the distribution of residuals in the test data, the sale price prediction errors show little spatial clustering (points with high errors are only weakly clustered together, as are points with low errors). Compared with the spatial distribution of sale prices themselves, the aggregation has clearly abated. However, judging from visualization alone is not persuasive, so we run a further test to confirm that our model has accounted for the spatial variation in sale prices.
grid.arrange(
ggplot() +
geom_sf(data = nhoods, fill = "white") +
geom_sf(data = sf_1.test_original, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(sf_1.test_original,"SalePrice"),
name="Quintile\nBreaks") +
labs(title="Test set sale prices") +
mapTheme(),
ggplot() +
geom_sf(data = nhoods, fill = "white") +
geom_sf(data = sf_1.test_original, aes(colour = q5(SalePrice.AbsError)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(sf_1.test_original,"SalePrice.AbsError"),
name="Quintile\nBreaks") +
labs(title="Sale price errors on the test set") +
mapTheme(),
ncol = 2)
So we did the Moran’s I test for the residuals, and the value of Moran’s I index is 0.057 and p-value is 0.039, which indicates that the spatial autocorrelation in the variables has been accounted for clearly. Also, the accuracy of the model has increased accordingly.
coords.test <-
filter(sf_1.test, !is.na(SalePrice.Error)) %>%
st_coordinates()
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
moranTest.test <- moran.mc(filter(sf_1.test, !is.na(SalePrice.Error))$SalePrice.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest.test$res[c(1:999)]), aes(moranTest.test$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest.test$statistic), colour = "#99CCFF",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in red",
x="Moran's I",
y="Count") +
plotTheme()
5. Conclusion
Generally, our model is an effective model that explains 77% of the variance in sale price (R-squared = 0.771). Additionally, its mean absolute percentage error is about 22%, which is a decent value. Some interesting variables appear in the model, such as sale month, the number of trees around the house and the number of baths, for which we did not expect a strong relationship with sale price beforehand. We calculated the number of trees around each house using a 500-meter buffer, and we filled the null values for sale month and number of baths using each house's neighbors. The regression output also highlights important variables: lot area, proximity to building violations and the neighborhood's median household income have notable effects on sale price, which is consistent with the real world and indicates that the model is reasonable. As mentioned above, the model accounts for the spatial variation in prices well (Moran's I = -0.024).
However, our model does have some flaws. It performs better on lower-priced houses and worse on higher-priced ones, and it even struggles with two outliers. This defect may stem from our methods for addressing null values and outliers in the original data: when filling those values we relied on each observation's neighbors, which may dampen real differences and unique spatial structure between observations. Additionally, when attaching variables to the houses, we used buffers to count point data or to average over several nearby observations, and different scales can produce different results, so how to select a reasonable scale is an issue worth discussing. The classification of dummy variables also influences the model considerably. For example, we classified ZoningCode into broad categories according to their practical meaning in public documents; deciding whether to split or merge the categories of a dummy variable is an intricate problem that is often hard to get right.
Despite these remaining flaws, we still maintain that our model is worth recommending to Zillow for its accuracy and generalizability. Moreover, we believe the model would perform even better at Zillow, which enjoys a wider range of data collection channels, and that it could enhance Zillow's reputation in the future.