It is reported that Seattle has 1.6 million parking spots, which means that there are more than two spots for every person in the city. However, Seattle continues to grapple with how big a role cars should play in getting around the city and how much space we should devote to storing them. What even worse is that Seattle drivers spend 58 hours per year finding a parking space, costing $1,205 in time, fuel and emissions per person.
Part of the answer of why parking in Seattle is hard may be either that the spots are private or that drivers are not willing to pay. The study from the Research Institute for Housing America shows that new parking spaces are mainly garages while residents prefer to parking in curb parking spots because they are cheaper. The information about parking counts in curb parking is then become more and more significant. Thus, a model which predicts the parking counts in each curb parking places is built in this paper to offer information about the parking occupancy for both drivers as well the traffic planners.
#load the packages as well as the map theme
library(ggplot2)
library(sf)
library(viridis)
library(tidyverse)
library(QuantPsyc)
library(spatstat)
library(spdep)
library(FNN)
library(RSocrata)
library(lubridate)
library(riem)
library(tidycensus)
library(osmdata)
library(corrplot)
library(leaflet)
library(gganimate)
library(gifski)
library(png)
library(knitr)
library(kableExtra)
mapTheme <- function(base_size = 15) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "white", fill=NA, size=2)
)
}
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]]), digits = 3),
c(.01,.2,.4,.6,.8), na.rm=T)
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
palette2 <- c("#99cc33","#FF9900")
palette3 <- c("#FFCC00","#f3a333","#f16821")
palette5 <- c("#A64B00","#DA2D2D","#FF7400","#FF9640","#FFE100")
The data that we collected are listed below:
datadesciption <- data.frame(Data = c("Anual parking study data of Seattle","Temperatue","Precipitation","Street type data","Land use type data","Amenities","Percent of white", "Percent of people taking public transpotation","Median age"),Source = c("Open data, Seattle","Iowa Environment Mesonel","Iowa Environment Mesonel","Seattle Geodata","Seattle Geodata","Open Street Map","American Community Service","American Community Service","American Community Service"))
datadesciption %>%
kable() %>%
kable_styling()
Data | Source |
---|---|
Anual parking study data of Seattle | Open data, Seattle |
Temperatue | Iowa Environment Mesonel |
Precipitation | Iowa Environment Mesonel |
Street type data | Seattle Geodata |
Land use type data | Seattle Geodata |
Amenities | Open Street Map |
Percent of white | American Community Service |
Percent of people taking public transpotation | American Community Service |
Median age | American Community Service |
#annual parking data in Seattle
parking_all <-
read.csv("C:/Users/HP/Desktop/parking_final/data/Annual_Parking_Study_Data.csv") %>%
filter(Total_Vehicle_Count>0 & is.na(Elmntkey)==FALSE)
#parking location of the parking lots
parkinglocation <-
st_read("https://opendata.arcgis.com/datasets/3966ff0038bc429c8170085c010b49e2_2.geojson")
#join the two data set
parkdata <-
merge(parking_all,parkinglocation,by.x="Elmntkey",by.y="ELMNTKEY",all.x=TRUE) %>%
st_sf()
#blockface data
blockface <-
st_read("https://opendata.arcgis.com/datasets/a1458ad1abca41869b81f7c0db0cd777_0.geojson")
#streets data
streets <-
st_read("https://opendata.arcgis.com/datasets/383027d103f042499693da22d72d10e3_0.geojson")
#land use data
landuse <-
st_read("https://opendata.arcgis.com/datasets/0ce115e71a894f4e9a7ed2a803b4353a_0.geojson") %>%
filter(is.na(class_desc)==FALSE)
#POI data
#bar
bar <-
getbb("Seattle") %>%
opq() %>%
add_osm_feature("amenity","bar")
bar <- osmdata_sf(bar)
a <- bar$osm_points
a =a[c(1,62)]
bar <-
a %>%
mutate(legend="bar")%>%
st_sf()
#cafe
cafe <-
getbb("Seattle") %>%
opq() %>%
add_osm_feature("amenity","cafe") %>%
osmdata_sf()
a <- cafe$osm_points
a =a[c(1,62)]
cafe <-
a %>%
mutate(legend="cafe")%>%
st_sf() %>%
dplyr::select(-official_name)
#restaurant
restaurant <-
getbb("Seattle") %>%
opq() %>%
add_osm_feature("amenity","restaurant") %>%
osmdata_sf()
a <- restaurant$osm_points
a =a[c(1,62)]
restaurant <-
a %>%
mutate(legend="cafe")%>%
st_sf() %>%
dplyr::select(-level)
#census data
cunsus2010 <-
st_read("https://opendata.arcgis.com/datasets/b524dcea2c7f47d8b976ed2bb17ff615_16.geojson")
tracts17 <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state="WA",
county=c("King"),
geometry=T,
output = "wide",
key='58f0f177458f58fecb4f370dc79e25a7d70c7d51') %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport) %>%
na.omit()
tracts17 <- tracts17 %>%
st_set_geometry(NULL)
census <-
merge(cunsus2010,
tracts17 ,
by.x= "GEOID10",
by.y = "GEOID",
all.x=TRUE) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID10, geometry) %>%
st_sf()
#weather data
weather <-
riem_measures(station="BFI", date_start="2014-04-01", date_end = "2018-09-01") %>%
replace(is.na(.),0) %>%
mutate(interval60=ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarise(Temperature = max(tmpf),
Precipitation = sum(p01i)) %>%
mutate(Temperature = ifelse(Temperature== 0, 42, Temperature))
Since different types of streets and land use have different regulations for parking, they are two significant factors that influence the counts of parking. For example, there are many different curb colors in Seattle, indicating the different regulations of parking; there are also three main land use types in Seattle: residential distracts, commercial distrcts and industrial districts.Among those types, there are seven various functions of areas which are identified with different colors, and the vehicle should follow different instructions of the districts there.
#plots of the streets
streets.plot <-
streets %>%
mutate(type = case_when(STREETTYPE=="Alley" ~ "Alley",
STREETTYPE=="Downtown"|STREETTYPE=="Downtown Neighborhood"|STREETTYPE=="Downtown Neighborhood Access" ~ "Downtown",
STREETTYPE=="Industrial Access"|STREETTYPE=="Minor Industrial Access" ~ "Industrial Acess",
STREETTYPE=="Neighborhood Corridor"|STREETTYPE=="Neighborhood Yield Street" ~"Neighborhood",
STREETTYPE=="Urban Center Connector"~ "Urban Center",
STREETTYPE=="Urban Village Main"|STREETTYPE=="Urban Village Neighborhood"|STREETTYPE=="Urban Village Neighborhood Access"~ "Urban Village"))
palette7 <- c("#A64B00","#BF7130","#FF7400","#FF9640","#FFE100","#FFE840","#FFEE73")
pal_street <- colorFactor(
palette = palette7,
domain = streets.plot$type)
leaflet(streets.plot) %>%
addProviderTiles("CartoDB.Voyager") %>%
addPolygons(stroke = TRUE,
smoothFactor = 1,
weight = 0.7,
color = ~pal_street(type)) %>%
addLegend("bottomright", pal = pal_street, values = ~type,
title = "legendTitle",
labFormat = leaflet::labelFormat(prefix = ""),
opacity = 0.7)
palette8 <- c("#A64B00","#A66F00","#BF7130","#FF7400","#FF9640","#FFE100","#FFE840","#FFEE73")
pal_landuse <- colorFactor(
palette = palette8,
domain = landuse$class_desc)
leaflet(landuse) %>%
addProviderTiles("CartoDB.Voyager") %>%
addPolygons(stroke = TRUE,
smoothFactor = 1,
weight = 1,
fillOpacity = 0.5,
color = ~pal_landuse(class_desc)) %>%
addLegend("bottomright", pal = pal_landuse, values = ~class_desc,
title = "legendTitle",
labFormat = leaflet::labelFormat(prefix = ""),
opacity = 0.7)
We strongly believe that street types and land use types influence the parking counts. Then we plot the figures below to validate our assumptions.
#process raw parking data and add time
dat <-
parking_all %>%
mutate(interval60 = floor_date(mdy_hms(Date.Time),unit = "hour"),
week = week(interval60),
dotw = wday(interval60, label = TRUE))
parkdata.cat <-
merge(dat,parkinglocation,by.x="Elmntkey",by.y="ELMNTKEY",all.x=TRUE) %>%
st_sf() %>%
dplyr::select(Elmntkey,interval60,Total_Vehicle_Count,geometry,Side,Construction, Event.Closure, Peak.Hour...Yes.or.No.,dotw,week) %>%
mutate(lat=unlist(map(geometry,1))) %>%
na.omit()
parkdata.cat <-
parkdata.cat %>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))
parkdata_group.cat <-
parkdata.cat %>%
group_by(Elmntkey) %>%
summarise(mean_occupancy=mean(Total_Vehicle_Count))
# add street type
parkdata_group.cat <-
parkdata_group.cat %>%
st_join(.,streets, join= st_nearest_feature, left = TRUE) %>%
dplyr::select(Elmntkey,geometry,mean_occupancy,STREETTYPE) %>%
filter(is.na(mean_occupancy)==FALSE)
parkdata_group.cat$STREETTYPE=as.character(parkdata_group.cat$STREETTYPE)
parkdata_group.cat[is.na(parkdata_group.cat)] <- "non"
#add land use
parkdata_group.cat <-
parkdata_group.cat %>%
st_join(.,landuse,join=st_intersects, left= TRUE) %>%
dplyr::select(Elmntkey,geometry,mean_occupancy,STREETTYPE,zonelut,zoneid)
parkdata_group.cat <-
parkdata_group.cat %>%
mutate(streettype_change = ifelse(STREETTYPE=="Neighborhood Yield Street",1,0),
zone_change = ifelse(zonelut =="C2", 1,0))
as.data.frame(parkdata_group.cat) %>%
dplyr::select(mean_occupancy,STREETTYPE) %>%
gather(Variable, Value, -mean_occupancy) %>%
ggplot(aes(Value, mean_occupancy)) +
geom_bar(position = "dodge", stat = "summary", fun.y = "mean",fill="#FACF5A",col="white",border = NA) +
facet_wrap(~Variable, ncol = 1, scales = "free") +
labs(title = "Vehicle count as a function of Street type", y = "Vehicle count") +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5)) +
plotTheme
as.data.frame(parkdata_group.cat) %>%
dplyr::select(mean_occupancy,zonelut) %>%
gather(Variable, Value, -mean_occupancy) %>%
ggplot(aes(Value, mean_occupancy)) +
geom_bar(position = "dodge", stat = "summary", fun.y = "mean",fill="#FACF5A",col="white",border = NA) +
facet_wrap(~Variable, ncol = 1, scales = "free") +
labs(title = "Vehicle count as a function of Land use", y = "Vehicle count") +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5)) +
plotTheme
Also, the relationship between the dependent variable and some other categorical variables are plotted below. It is obvious that whether the parking places is under construction, whether there is an event and whether it is parked in peak hour influence the parking counts. However, time of the day and whether it is weekend has little relationship with dependent variable.
as.data.frame(parkdata.cat) %>%
dplyr::select(Total_Vehicle_Count,Side,Construction, Event.Closure, Peak.Hour...Yes.or.No.,weekend,time_of_day) %>%
gather(Variable, Value, -Total_Vehicle_Count) %>%
ggplot(aes(Value, Total_Vehicle_Count)) +
geom_bar(position = "dodge", stat = "summary", fill="#FACF5A",fun.y = "mean",col="white",border = NA) +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Vehicle count as a function of categorical variables", y = "Mean_vehicle_count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
plotTheme
Then we calculate the mean distances of neighboring amenities and parking lots to the observed parking lot. It is easy to understand that when the density of amenities is high, people are more likely to go there and park their cars. Also, when the density of parking lots is high, people have more choices when parking, which may make people park more there. The mean occupancy of the parking lots is set as a function of the two numeric variables and is plotted below.
#gain the data set that contation the DV and numeric variables
#ORIGINAL DATA
parkdata_group <-
parking_all %>%
group_by(Elmntkey) %>%
summarise(mean_occupancy=mean(Total_Vehicle_Count))
dat1 <-
parkinglocation %>%
merge(., parkdata_group, by.x="ELMNTKEY", by.y="Elmntkey", all.x=TRUE)%>%
dplyr::select(ELMNTKEY,SHAPE_LAT,SHAPE_LNG,geometry,mean_occupancy)
#add census
dat1 <-
dat1 %>%
st_join(.,census,join=st_intersects, left=TRUE) %>%
na.omit()
# add amenty
vars_net <-
rbind(bar,cafe,restaurant)
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
#calculate the nearest distance to amenities
dat1$amenity.nn =
nn_function(st_coordinates(dat1), st_coordinates(vars_net), 5)
dat1$station.nn =
nn_function(st_coordinates(dat1), st_coordinates(dat1), 3)
#plot the correlation plots
as.data.frame(dat1) %>%
dplyr::select(mean_occupancy,amenity.nn,station.nn) %>%
gather(Variable, Value, -mean_occupancy) %>%
ggplot(aes(Value, mean_occupancy)) +
geom_point(colour ="#FACF5A",size=0.8) +
geom_smooth(method = "lm", se=F, colour = "#f3a333") +
facet_wrap(~Variable, ncol =2, scales = "free") +
labs(title = "Parking counts as a function of continuous variables") +
theme(axis.text.x = element_blank()) +
plotTheme
The parking places are clustered in downtown areas, thus we focus on the parking counts in the downtown areas. The annual difference is displayed in the animated figure below to help us find the spatial and temperal structure of the dependent variable. It is obvious that the parking counts of the tracts change when time and location change.
data_ani <-
parkdata.cat %>%
st_set_geometry(NULL) %>%
merge(., parkdata_group.cat, by.x="Elmntkey",by.y="Elmntkey", all.x=TRUE) %>%
unique()
datayear <-
data_ani %>%
mutate(Year = year(interval60)) %>%
group_by(zoneid,Year) %>%
summarise(occupancy = sum(Total_Vehicle_Count)) %>%
left_join(landuse) %>%
dplyr::select(zoneid,Year,occupancy,geometry) %>%
st_sf()%>%
mutate(Class = case_when(q5(occupancy)==1 ~1,
q5(occupancy)==2 ~2,
q5(occupancy)==3 ~3,
q5(occupancy)==4 ~4,
q5(occupancy)==5 ~5)) %>%
mutate(Class = factor(Class, levels = c(1,2,3,4,5)))
parking_animation.zone <-
ggplot() +
geom_sf(data = landuse, color = "grey", fill = "transparent")+
geom_sf(data=datayear, aes(fill=Class),alpha=0.7) +
scale_fill_manual(values = palette5) +
ylim(47.59,47.65)+
xlim(-122.36, -122.3)+
labs(title = "Parking counts from 2014 to 2018 in downtown area, Seattle",
subtitle = "Year intervals: {current_frame}") +
transition_manual(Year) +
mapTheme()
animate(parking_animation.zone, duration=5)
From the figure above, it is concluded that several clusters have emerged in the study area, indicating that introducing spatial lag in the model may make a difference. Thus, we calculate the mean parking counts of the three neareast parking lots of the observed parking lot in a specific hour. The relationship between spatial lag and dependent variable is plotted below. To our surprise, the relationship between them is very strong.
#calculate the spatial lag
parkdata <-
merge(dat,parkinglocation,by.x="Elmntkey",by.y="ELMNTKEY",all.x=TRUE) %>%
st_sf() %>%
dplyr::select(Elmntkey,interval60,Total_Vehicle_Count,geometry) %>%
mutate(lat=unlist(map(geometry,1))) %>%
na.omit()
parkdata.weekNest <-
parkdata%>%
nest(-interval60)
nncount_function <- function(measureFrom,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
nn <-
get.knnx(st_coordinates(measureFrom), st_coordinates(measureFrom), k)$nn.index
nn <- nn[,-1]
for (i in seq(1,dim(nn)[1],by=1)){
for (j in seq(1,dim(nn)[2],by=1)){
count <- measureFrom[nn[i,j],]$Total_Vehicle_Count
nn[i,j] <- count
}
# nn[i,j] <- price
}
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_price, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointprice = mean(point_price)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
sptiallag <-
parkdata.weekNest %>%
mutate(spatial_lag = map(.x = data, k=3, .f = nncount_function))
sptiallag_un <-
sptiallag %>%
unnest()
#plot the spatial lag vs. DV
plotData.splag <-
as.data.frame(sptiallag_un) %>%
group_by(interval60) %>%
summarise_at(c("spatial_lag", "Total_Vehicle_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Total_Vehicle_Count)
ggplot(plotData.splag, aes(Total_Vehicle_Count, Value)) +
geom_point(colour = "#FACF5A",size=0.8) +
geom_smooth(method = "lm", se = FALSE, colour = "#f3a333") +
labs(title = "Mean nighboring parking count as a function of the Observed parking count",
x = "Observed parking count",
y = "Mean nighboring parking count") +
plotTheme
From the analysis above, it is obvious that parking counts are influenced by time series. Then we create the time lags to see whether they have relationships with parking counts, similar to what we have done in predicting the house prices. We calculate the mean parking counts of certain hours before the observed time and treat it as time lags.
Parking counts as a function of the various time lags are then visualized below. These effects are very strong. Note that the correlation decreases with each additional hour.
#calculate the time lag
timelag <-
sptiallag_un %>%
arrange(Elmntkey,interval60) %>%
mutate(lagHour = dplyr::lag(Total_Vehicle_Count,1),
lag2Hours = dplyr::lag(Total_Vehicle_Count,2),
lag3Hours = dplyr::lag(Total_Vehicle_Count,3),
lag4Hours = dplyr::lag(Total_Vehicle_Count,4),
lag12Hours = dplyr::lag(Total_Vehicle_Count,12),
lag1day = dplyr::lag(Total_Vehicle_Count,24))
plotData.lag <-
as.data.frame(timelag) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Total_Vehicle_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Total_Vehicle_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag <-
as.data.frame(timelag) %>%
group_by(interval60) %>%
summarise_at(c("lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours","lag1day", "Total_Vehicle_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Total_Vehicle_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Total_Vehicle_Count),2))
#plot time lag vs. DV
plotData.lag %>%
ggplot(aes(Value, Total_Vehicle_Count)) +
geom_point(colour = "#FACF5A",size=0.8) + geom_smooth(method = "lm", se = F,colour = "#f3a333") +
facet_wrap(~Variable) +
geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "#666666",
x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
labs(title = "Parking count as a function of time lags") +
plotTheme
Before starting the modeling process,we check the distribution of the dependent variable to find out which regression model should be used. Since the distribution of parking count for each parking space is normally distributed, we decide to use the OLS regression model.
ggplot(timelag, aes(Total_Vehicle_Count)) +
geom_histogram(binwidth = 1,fill="#FACF5A",col="white") +
labs(title = "Distribution of parking counts")+
plotTheme
#process predictors
predictor <-
dat1 %>%
st_set_geometry(NULL) %>%
merge(., parkdata_group.cat, by.x="ELMNTKEY",by.y="Elmntkey", all.x=TRUE) %>%
dplyr::select(-geometry)
predictor1 <-
predictor %>%
merge(.,parkdata.cat, by.x="ELMNTKEY", by.y="Elmntkey") %>%
dplyr::select(-geometry,-Total_Vehicle_Count)
predictor2 <-
predictor1 %>%
merge(.,timelag, by.x = c("ELMNTKEY","interval60"), by.y=c("Elmntkey","interval60"),all.x=TRUE) %>%
unique()
predictor3 <-
predictor2 %>%
merge(.,weather, by.x = "interval60", by.y="interval60",all.x=TRUE)
We split the data into training and test set,and the data from week 9 to week 14 was selected as our test set. Because of the complexity to do cross-validation, we just split the data subjectively.
#split training and testing data
park.weeknest <-
predictor3 %>%
nest(-week)
parkweek.Train <- filter(predictor3, week>14)
parkweek.Test <- filter(predictor3, week<=14)
Three OLS regressions are built, and a breakdown of the 3 regressions is estimated below:
reg1 includes both time and space fixed effects.
reg2 adds the spatial lag features.
reg3 adds both the spatial lag and the time lags features.
#street and land use do not process and demographic data has been remained
reg1 <-
lm(Total_Vehicle_Count ~ hour(interval60) + dotw + Temperature
+Precipitation+ELMNTKEY
+amenity.nn+STREETTYPE+zonelut+Side
+Event.Closure +Peak.Hour...Yes.or.No.+ dotw,
data=parkweek.Train)
#process street type and zone code, also change to weekend and time period of day
reg2 <-
lm(Total_Vehicle_Count ~ hour(interval60) + dotw + Temperature
+Precipitation+ELMNTKEY
+amenity.nn+STREETTYPE+zonelut+Side
+Event.Closure +Peak.Hour...Yes.or.No.+ dotw +spatial_lag,
data=parkweek.Train)
# remove demographical data
reg3 <-
lm(Total_Vehicle_Count ~ hour(interval60) + dotw + Temperature
+Precipitation+ELMNTKEY
+amenity.nn+STREETTYPE+zonelut+Side
+Event.Closure +Peak.Hour...Yes.or.No.+ dotw +spatial_lag
+lagHour +lag2Hours+lag3Hours +lag4Hours,
data=parkweek.Train)
Then, all of the three models are used to predict the parking occupancy of the test set. We calculated the mean absolute error (MAE), mean absolute percent error (MAPE) and the standard deviation of absolute errors (sd_AE) to validate the accuracy of the model. The results are shown below.
parkweek.Test.weekNest <-
parkweek.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)
}
week_predictions <-
parkweek.Test.weekNest %>%
mutate(ATime_Space_FE = map(.x = data, fit = reg1, .f = model_pred),
BTime_Space_FE_Spatiallags = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE_Spatiallags_Timelags = map(.x = data, fit = reg3, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Total_Vehicle_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
Absolute_PercentError = map2(Observed, Prediction, ~ abs(.x-.y)/.x),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE),
MAPE = map_dbl(Absolute_PercentError,mean, na.rm=TRUE))
we also plot the mean absolute percent error of parking count by model by week. It is obvious that the time-space lag model (model C) has the highest accuracy, with the average MAPE lower than 0.25.Besides, the MAPE of the time-space lag model is similar in different weeks, that is, this model has a better generalizability in time aspect than other models.
#MAPE OF THE MODEL
week_predictions %>%
dplyr::select(week, Regression, MAPE) %>%
gather(Variable, MAPE, -Regression, -week) %>%
ggplot(aes(week, MAPE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette3) +
labs(title = "Mean Absolute Percent Errors by model specification and week") +
plotTheme
To further examine the generalizability of our model in time series, we choose a certain day(Mar 12) as an example to compare the predicted and observed Parking count.Both observed and predicted quantitites are taken out of their spatial context and summed only by each hour.As can be seen from the following charts, the model which contain the time lags and spatial lag (model C) does the best job among all models, but it still overestimate the parking counts in the early periods of the day. However, it does a good job in predicting the parking counts in the middle of the day. The spatial lag model slightly overestimates the parking counts but does well in estimating the parking counts in the later time of the day. The fixed effect model does a bad work since it overestimates a lot in the daytime and underestimates a lot at night.
#predicted value and observed value in time series
week_predictions %>%
filter(week ==11) %>%
mutate(interval60 = map(data, pull, interval60),
ELMNTKEY = map(data, pull, ELMNTKEY)) %>%
dplyr::select(interval60, ELMNTKEY, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -ELMNTKEY) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
scale_colour_manual("",values = c("Observed" = "#999999","Prediction"="#FFCC00"))+
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed parking counts in time series", subtitle = "Seattle; A test set of one day", x = "Hour", y= "Parking counts") +
plotTheme
Since the third model seems to have the best goodness of fit generally, we stick with reg3 to see whether the MAPE by station has spatial patterns. The figure is plotted below. Though the generalizability of the model is not that good in different time periods of the day, it is still generalizable in space generally.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
ELMNTKEY = map(data, pull, ELMNTKEY),
SHAPE_LAT = map(data, pull, SHAPE_LAT),
SHAPE_LNG = map(data, pull, SHAPE_LNG),
dotw = map(data, pull, dotw)) %>%
dplyr::select(interval60, ELMNTKEY, SHAPE_LNG,
SHAPE_LAT, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "CTime_Space_FE_Spatiallags_Timelags")%>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(ELMNTKEY,time_of_day, SHAPE_LNG, SHAPE_LAT) %>%
summarize(MAPE = mean(abs(Observed-Prediction)/Observed, na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = landuse, color = "grey", fill = "transparent")+
geom_point(aes(x = SHAPE_LNG, y = SHAPE_LAT, color = q5(MAPE)),
fill = "transparent", size = 1, alpha = 0.5)+
scale_colour_manual(values = palette5)+
ylim(47.59,47.65)+
xlim(-122.36, -122.3)+
facet_grid(~time_of_day)+
labs(title="Mean Absolute Percent Errors, Test Set")+
mapTheme()
Then we test the generalizability of the model using socio-economic variables. Errors are set as the function of socio-economic variables, and the figure is plotted below. The model is not influenced by the percent of residents’ taking public transportation, while it is influenced by the median age of the residents and the percent of white people in the census tract. Our model does bad in parking places where the percentage of white is high and the residents are younger.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
ELMNTKEY = map(data, pull, ELMNTKEY),
SHAPE_LAT = map(data, pull, SHAPE_LAT),
SHAPE_LNG = map(data, pull, SHAPE_LNG),
dotw = map(data, pull, dotw),
Total_Public_Trans = map(data, pull, Total_Public_Trans),
Total_Pop = map(data, pull, Total_Pop),
Means_of_Transport = map(data, pull, Means_of_Transport),
Med_Age = map(data, pull, Med_Age),
White_Pop = map(data, pull, White_Pop)) %>%
dplyr::select(interval60, ELMNTKEY, SHAPE_LNG,
SHAPE_LAT, Observed, Prediction, Regression,
dotw, Total_Public_Trans,Total_Pop,Med_Age,Means_of_Transport,Med_Age,
White_Pop) %>%
unnest() %>%
filter(Regression == "CTime_Space_FE_Spatiallags_Timelags")%>%
mutate(Percent_Taking_Public_Trans = Total_Public_Trans/Means_of_Transport,
Percent_White = White_Pop/Total_Pop) %>%
group_by(ELMNTKEY, Percent_Taking_Public_Trans, Med_Age, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-ELMNTKEY, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE),colour = "#FACF5A", alpha = 0.5)+
geom_smooth(aes(x = value, y = MAE), colour = "#f3a333",method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
Why does our model predict so well?
In the most sophisticated model, the MAPE in the test set is under 0.25, indicating that the predictive accuracy is pretty impressive - particularly along the time dimension.What is more, the R squared for the model is around 0.88. The time lags are strongest features of the model because there is a tremendous amount of serial correlation between parking counts in time t and time t - 1.We can only use such high resolution lags because the nature of the use case allows parking to occur in near real time. Thus, it is reasonable to condition our predictions on parking from just certain hours ago.
Additionally, our model has a good generalizibility across space. However, since the spatial lag is calculated within certain fixed hours, the result of the model may change if the fixed condition has changed. Since we use the minimum research unit in the project, MAUP in our model is not serious.
In what real-life user senarios would this model be applied?
The public sector
This model could be used as part of Seattle’s curb parking planning algorithm which could offer useful information (the parking demand for the whole city at specific times and parking demand for specific street/blocks) for administrators when planning for parking spaces, monitoring illegal parkings and setting dynamic parking prices.
Driver
An app can be built on the basis of this model which helps drivers search for available parking lots within a specific area at a specific time and find the nearest route there. This will save the time for the drivers and reduce congestion in traffic caused by searching for avaliable parking places.
Based on the assumptions above, we design the app. More information about the app can be reached here: https://youtu.be/YgNFttlalO0.
App design