A bicycle-sharing system is a service in which bicycles are made available for shared use to individuals on a short-term basis, for a price or for free. It allows people to borrow a bike from a “dock” and return it at another dock belonging to the same system. As one of the most important bike share providers in Philadelphia, Indego features over 500 bikes and 60+ stations located across Center City and parts of North, South, and West Philadelphia.
However, one of the big operational challenges for Indego is “re-balancing”: getting bikes to stations that are anticipated to have demand but lack bikes. Figuring out how to do this is one of the keys to operating a successful ride share system. If the bike share COO knew the station capacities, he could see when demand might drive stations to run out of bikes and then move excess bikes from elsewhere, which would greatly improve operational efficiency as well as satisfy the needs of bike share users. To help Indego understand demand for bike share and make recommendations for the allocation of new stations, we built a space-time ride share prediction model.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(gganimate)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(stargazer)
library(gifski)
library(png)
mapTheme <- function(base_size = 12) {
  theme(
    plot.title = element_text(size = 12),
    plot.subtitle = element_text(size = 8),
    plot.caption = element_text(size = 6),
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.background = element_blank(),
    panel.border = element_blank(),
    panel.grid.major = element_line(colour = 'transparent'),
    panel.grid.minor = element_blank(),
    legend.direction = "vertical",
    legend.position = "right",
    plot.margin = margin(1, 1, 1, 1, 'cm'),
    legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm")
  )
}
plotTheme <- function(base_size = 12) {
  theme(
    plot.title = element_text(size = 12),
    plot.subtitle = element_text(size = 8),
    plot.caption = element_text(size = 6),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10),
    axis.title.y = element_text(size = 10),
    # Set the entire chart region to blank
    panel.background = element_blank(),
    plot.background = element_blank(),
    #panel.border=element_rect(colour="#F0F0F0"),
    # Format the grid
    panel.grid.major = element_line(colour = "#D0D0D0", size = .2),
    axis.ticks = element_blank()
  )
}
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette3 <- c("#6baed6","#3182bd","#08519c")
palette2 <- c("#336699","#006600")
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    # unrounded version; the quantile probabilities belong inside quantile(),
    # not as.character(), which the original had misplaced
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
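To illustrate what these helpers return, here is a minimal sketch on a toy data frame (the values below are hypothetical, purely for demonstration):
# Hypothetical demo data, not part of the analysis
toy <- data.frame(value = c(1, 5, 9, 12, 20, 33, 41, 57, 80, 95))
q5(toy$value)     # factor assigning each value to a quintile (1-5)
qBr(toy, "value") # rounded quantile breaks, useful for map legend labels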
We also loaded census geography and variables. These data allow us to explore some of the socioeconomic associations with ride share in Philadelphia.
census <- get_acs(geography = "tract",
variables=c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
key="d72b594e4d0f9b9b34217cdea8a4bcbc60354e21",
year=2017,
state=42,
geometry=TRUE,
county=101,
output="wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
tracts <-
census %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
# join ride data to the tracts
dat_tracts <- st_join(dat %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE ) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
tracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lon = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
Weather data are downloaded following the very reasonable hypothesis that weather conditions play a role in travel mode choice. Given pouring rain or driving snow, there is a clear commuting opportunity cost between walking 8 blocks to the subway for a $2 fare and simply taking a $10 ride share directly to work.
weather <-
riem_measures(station = "PHL", date_start = "2018-07-02", date_end = "2018-08-27")
weather.panel <-
weather %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) # replace missing temperatures (coded as 0) with a plausible value
Below, the weather data are plotted as time series.
# plot the weather data
grid.arrange(
ggplot(weather.panel, aes(interval60,Precipitation)) + geom_line(color="#666666") +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
ggplot(weather.panel, aes(interval60,Wind_Speed)) + geom_line(color="#666666") +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
ggplot(weather.panel, aes(interval60,Temperature)) + geom_line(color="#666666") +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
top="Weather Data - Philly - 2018.07-2018.08")
After loading all the data we needed, we created an empty data frame (study.panel) containing each unique space/time observation.
study.panel <-
expand.grid(interval60=unique(dat_tracts$interval60),
start_station = unique(dat_tracts$start_station)) %>%
left_join(., dat_tracts %>%
select(start_station, Origin.Tract, start_lon, start_lat )%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
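As a quick sanity check (our own addition, assuming the join above is one-to-one on start_station), the panel should contain exactly one row per station per hour:
# The panel length should equal stations x hours in the study period
nrow(study.panel) ==
  length(unique(dat_tracts$interval60)) * length(unique(dat_tracts$start_station))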
Then, we created a full panel (ride.panel) containing total trip counts by station, along with weather and census data, for each unique space/time observation.
ride.panel <-
dat_tracts %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60,Origin.Tract,start_station, start_lon, start_lat) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.panel) %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, census %>%
as.data.frame(), by = c("Origin.Tract" = "GEOID"))%>%
select(-geometry)
Next, some additional feature engineering is undertaken to mine the data for critical time trends. Like spatial lags, time lags are incredibly important when predicting time series trends. To capture these trends, several different time lags are created.
We also controlled for the effects of holidays that disrupt expected demand during a given weekend or weekday. Independence Day falls within the study period; for that three-day weekend we used dummy variables indicating temporal proximity to the holiday.
ride.panel <-
  ride.panel %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>% # lag within each station so lags never cross station boundaries
  mutate(lagHour = dplyr::lag(Trip_Count, 1),
         lag2Hours = dplyr::lag(Trip_Count, 2),
         lag3Hours = dplyr::lag(Trip_Count, 3),
         lag4Hours = dplyr::lag(Trip_Count, 4),
         lag12Hours = dplyr::lag(Trip_Count, 12),
         lag1day = dplyr::lag(Trip_Count, 24),
         holiday = ifelse(yday(interval60) == 185, 1, 0)) %>%
  ungroup() %>%
  mutate(day = yday(interval60)) %>%
  mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, "0"))
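As a quick check that these lags actually carry signal, we can compute each lag's correlation with Trip_Count; a minimal sketch (the use = "complete.obs" choice simply drops the NA rows created by lagging):
# Correlation of each time lag with the outcome; stronger correlations
# suggest more useful predictors
as.data.frame(ride.panel) %>%
  select(Trip_Count, starts_with("lag")) %>%
  gather(Variable, Value, -Trip_Count) %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2))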
In this section we explored our ride share data for time, space, weather, and demographic trends. In order to forecast well, these features must help predict the time/space trends in our data. The hope is that much of the variation in ride share trip counts can be predicted by temporal and spatial effects.
Sections 3.1-3.4 provide several different approaches to visualizing ride share by time. Section 3.5 explores whether ride share trips are correlated with certain socio-economic factors. Sections 3.6-3.7 look at weather trends, 3.8-3.9 look at ride share correlation across space, and 3.10 produces an engaging animated gif of ride share trends across Philadelphia.
Looking at weekly group means in the resulting barplot does not necessarily support the hypothesis that rain and snow increase the propensity to take ride share. In most cases, ridership demand decreases on rainy/snowy days.
as.data.frame(ride.panel) %>%
  group_by(interval60) %>%
  summarize(Trip_Count = sum(Trip_Count)) %>%
  left_join(weather.panel) %>%
  mutate(isPrecip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
  group_by(week = week(interval60), isPrecip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
  ggplot(aes(isPrecip, Mean_Trip_Count)) +
  geom_bar(stat = "identity", fill = "#336699") +
  facet_wrap(~week, ncol = 9) +
  labs(title = "Does ridership vary when it's raining or snowing?",
       subtitle = "Mean trip count by week; July through August, 2018",
       x = "Precipitation", y = "Mean Trip Count") +
  plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Below, we built an animated gif map of ride share trips by tract and hour of the day. This gif confirms our observations in Section 3.9.
week30 <- dat_tracts %>%
filter(week == 30 & dotw == "Mon")
week30.panel <-
expand.grid(
interval60=unique(week30$interval60),
Origin.Tract = unique(week30$Origin.Tract))
ride.animation.data <-
week30 %>%
mutate(Trip_Counter = 1) %>%
right_join(week30.panel) %>%
group_by(interval60, Origin.Tract) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(tracts,by=c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
Trip_Count > 10 ~ "10+ trips")) %>%
mutate(Trips = factor(Trips, levels=c("0 trips","1-3 trips","4-6 trips","7-10 trips","10+ trips")))
rideshare_animation <-
  ggplot() +
  geom_sf(data = ride.animation.data, aes(fill = Trips)) +
  geom_sf(data = tracts, fill = NA) +
  scale_fill_manual(values = palette5) +
  labs(title = "Sum of rideshare trips by tract and hour of the day",
       subtitle = "60 minute intervals: {current_frame}",
       caption = "@Ziqun Li") +
  ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat)) +
  xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon)) +
  transition_manual(interval60) +
  mapTheme()
animate(rideshare_animation, duration=20)
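If the gif needs to live outside the notebook, gganimate's anim_save() can write the most recent render to disk; a small sketch (the filename is illustrative):
# Save the animation rendered above; anim_save() defaults to last_animation()
anim_save("rideshare_week30.gif")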
Our exploratory analysis has revealed a host of patterns in the ride share data. Almost assuredly, these patterns will yield a robust model.
Since the trip counts are sufficiently large to feel at ease with OLS, we developed several models on ride.Train (weeks 27-32) and tested them on ride.Test (weeks 33-34). Here is a breakdown of the 5 regressions estimated below:
1. reg1 includes space fixed effects, Temperature and Precipitation. Lag effects are not included for now.
2. reg2 focuses on hour fixed effects, day of the week fixed effects and the weather effects.
3. reg3 includes both time and space fixed effects, as well as the weather effects.
4. reg4 adds the time lag features.
5. reg5 adds holiday and holidayLag effects.
ride.Train <- filter(ride.panel, week < 33)
ride.Test <- filter(ride.panel, week >= 33)
reg1 <-
  lm(Trip_Count ~ start_station + Temperature + Precipitation, data=ride.Train)
reg2 <-
  lm(Trip_Count ~ dotw + hour(interval60) + Temperature + Precipitation, data=ride.Train)
reg3 <-
  lm(Trip_Count ~ start_station + dotw + hour(interval60) + Temperature + Precipitation,
     data=ride.Train)
reg4 <-
  lm(Trip_Count ~ start_station + dotw + hour(interval60) + Temperature + Precipitation +
       lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day,
     data=ride.Train)
reg5 <-
  lm(Trip_Count ~ start_station + dotw + hour(interval60) + Temperature + Precipitation +
       lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day + holidayLag + holiday,
     data=ride.Train)
Then, we created predictions for all of the above models.
# predict for test data
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat,fit){
pred <- predict(fit,newdata=dat)
}
week_predictions <-
ride.Test.weekNest %>%
mutate(Space_FE_1 = map(.x = data, fit = reg1, .f = model_pred),
Time_FE_2 = map(.x = data, fit = reg2, .f = model_pred),
Time_Space_FE_3 = map(.x = data, fit = reg3, .f = model_pred),
Time_Space_FE_timeLags_4 = map(.x = data, fit = reg4, .f = model_pred),
Time_Space_FE_timeLags_holidayLags_5 = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions
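Since knitr and kableExtra are already loaded, the same errors can also be displayed as a formatted table; a minimal sketch (the styling choices are our own):
# Test set MAE by model and week, rendered as a table
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  spread(week, MAE) %>%
  kable(digits = 2, caption = "Test set MAE by model and week") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)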
The MAE for each model is plotted by week below. It is clear that the best models, the lag models, are accurate to less than an average of one ride per hour; at a glance, that is pretty good overall accuracy.
# MAE by model specification and week
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week, Test Set") +
plotTheme()
Besides, according to the Predicted/Observed bike share time series plot, the holiday effects do not bring much additional predictive power to the model, perhaps because the test set contains no holidays for the holiday lags to act on. In addition, the time-series plots show that our best model (reg5) is able to track the time components of demand, but it misses the peaks and underpredicts during periods of high demand.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
na.omit() %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
scale_colour_manual("",values = c("Observed" = "darkgray","Prediction" = "#006699"))+
geom_line(size = 1)+
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme()
In this section, we selected the Time_Space_FE_timeLags_holidayLags_5 model (reg5), which has the best predictive ability, for further examination of generalizability.
If we plot observed vs. predicted for different times of day during the week and weekend, some patterns begin to emerge. We are certainly underpredicting in general, but what do we begin to see about the outcomes our model cannot explain? To answer this question, we examined the generalizability of the model along several dimensions: space, time, and demographics.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_station_id = map(data, pull, start_station),
from_latitude = map(data, pull, start_lat),
from_longitude = map(data, pull, start_lon),
dotw = map(data, pull, dotw)) %>%
select(interval60, from_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "Time_Space_FE_timeLags_holidayLags_5")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction),size=1,alpha = 0.4,color="#666666")+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#336699")+
geom_abline(slope = 1, intercept = 0,color="black")+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted, Test set, Model 5",
x="Observed trips",
y="Predicted trips")+
plotTheme()
First, we mapped the errors of the model by neighborhood. The highest trip count MAE occurs in the Center City area, where the most trips obviously occur. It also seems that the model is better at predicting in time than in space.
# choropleth map of errors by neighborhood, Model 5
filter(week_predictions, Regression == "Time_Space_FE_timeLags_holidayLags_5") %>%
unnest() %>%
left_join(., tracts, by = c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
dplyr::select(Origin.Tract, Absolute_Error, geometry) %>%
gather(Variable, Value, -Origin.Tract, -geometry) %>%
group_by(Variable, Origin.Tract) %>%
na.omit() %>%
summarize(MAE = mean(Value)) %>%
ggplot() +
geom_sf(aes(fill = q5(MAE))) +
scale_fill_manual(values = palette5,
labels = c("2", "3", "4", "5", "8"),
name = "MAE") +
labs(title="Chloropleth map of errors by neighborhood") +
mapTheme()+
ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
geom_sf(data=tracts,color = "grey", fill = "transparent")
To further investigate the temporal pattern behind these large errors, we mapped the MAE of our model by time and space. The Center City errors increase during the AM and PM rush on weekdays, as well as during midday and the PM rush on weekends, which means the predictive ability of our model decreases during these periods.
# facetted maps of error by time-period, Model 5
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull,start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station,start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "Time_Space_FE_timeLags_holidayLags_5")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = tracts, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", size =3)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
facet_grid(weekend~time_of_day)+
labs(title="Facetted maps of error by time-period")+
mapTheme()
In addition, we explored the relationship between ride share trips and the demographic variables. As the following figures show, the Percent_Taking_Public_Trans and Percent_White variables are strongly associated with MAE. More specifically, our model performs better in areas with a high share of commuters taking public transit, or areas with a low share of white residents.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull,start_lon),
dotw = map(data, pull, dotw),
Med_Inc = map(data, pull, Med_Inc),
Total_Pop = map(data, pull,Total_Pop),
Percent_White = map(data, pull,Percent_White),
Percent_Taking_Public_Trans = map(data, pull,Percent_Taking_Public_Trans))%>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw, Med_Inc,Total_Pop,Percent_White,Percent_Taking_Public_Trans) %>%
unnest() %>%
filter(Regression == "Time_Space_FE_timeLags_holidayLags_5")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "PM Rush") %>%
group_by(start_station, Med_Inc,Total_Pop,Percent_White,Percent_Taking_Public_Trans) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.5,color="#666666")+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE,color="#336699")+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme()
In general, the model we built does have some flaws: it misses the peaks and underpredicts during periods of high demand. However, we still believe our model is an effective one, since it is accurate to less than an average of one ride per hour, which is pretty good overall accuracy. Our model is also able to track the time components of demand; that is, although it does not capture the magnitude of the highest demand, it does capture its timing, which is useful insight for bike allocation. Moreover, we maintain that our model is worth putting into production in certain areas: for tracts with a high share of commuters taking public transit, or a low share of white residents, it still has good predictive power. To improve the algorithm in the future, we could add more effective features to the model, such as traffic conditions and socioeconomic disparities.
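As one possible direction, the sketch below extends the best specification with Wind_Speed, which is already available from the weather join. Note that tract-level census covariates such as Med_Inc would be absorbed by the start_station fixed effects, so they only add value in specifications without station dummies. This is a sketch of a possible extension, not a model estimated in this report:
# Hypothetical extension (not estimated here): add wind speed,
# which is already joined from weather.panel, to the reg5 specification
reg6 <-
  lm(Trip_Count ~ start_station + dotw + hour(interval60) +
       Temperature + Precipitation + Wind_Speed +
       lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day +
       holiday + holidayLag,
     data = ride.Train)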