1. Executive Summary

The bicycle-sharing system is a service in which bicycles are made available for shared use to individuals on a short-term basis, for a price or for free. It allows people to borrow a bike from a “dock” and return it at another dock belonging to the same system. As one of the most important bike share service providers in Philadelphia, Indego features over 500 bikes and 60+ stations located across Center City and parts of North, South, and West Philadelphia.

However, one of the big operational challenges for Indego is “re-balancing”: getting bikes to stations that are anticipated to have demand but lack bikes. Figuring out how to do this is one of the keys to operating a successful ride share system. If the bike share COO knew the station capacities, he could see when demand might drive stations to run out of bikes, and then move excess bikes in from elsewhere, which would greatly improve operational efficiency as well as satisfy the needs of bike share users. To help Indego understand demand for bike share and make recommendations for the allocation of new stations, we built a space-time ride share prediction model.

2. Data

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(gganimate)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(stargazer)
library(gifski)
library(png)

mapTheme <- function(base_size = 12) {
  theme(
    plot.title =element_text(size=12),
    plot.subtitle = element_text(size=8),
    plot.caption = element_text(size = 6),
    axis.line=element_blank(),
    axis.text.x=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks=element_blank(),
    axis.title.x=element_blank(),
    axis.title.y=element_blank(),
    panel.background=element_blank(),
    panel.border=element_blank(),
    panel.grid.major=element_line(colour = 'transparent'),
    panel.grid.minor=element_blank(),
    legend.direction = "vertical", 
    legend.position = "right",
    plot.margin = margin(1, 1, 1, 1, 'cm'),
    legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm")
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    plot.title =element_text(size=12),
    plot.subtitle = element_text(size=8),
    plot.caption = element_text(size = 6),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10),
    axis.title.y = element_text(size = 10),
    # Set the entire chart region to blank
    panel.background=element_blank(),
    plot.background=element_blank(),
    #panel.border=element_rect(colour="#F0F0F0"),
    # Format the grid
    panel.grid.major=element_line(colour="#D0D0D0",size=.2),
    axis.ticks=element_blank()
  )
}

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette3 <- c("#6baed6","#3182bd","#08519c")
palette2 <- c("#336699","#006600")

# returns quintile break labels for map legends
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}
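As a quick illustration of these helpers (toy data, purely for demonstration): qBr returns quantile break labels for map legends, and q5 bins a numeric vector into quintiles.

# toy demonstration of the legend helpers
toy <- data.frame(x = c(1, 5, 9, 12, 20, 33, 47, 51, 68, 94))
qBr(toy, "x")  # character labels at the 1st, 20th, 40th, 60th, 80th percentiles
q5(toy$x)      # factor of quintile bins, pairs with scale_fill_manual(values = palette5)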

2.1 load ride share data

In Philadelphia, July and August have a fairly pleasant temperature range, during which we may see recreational trips as well as commutes. We therefore loaded the bike share data from July to August 2018 (8 weeks in total). The period contains one major holiday, Independence Day.

# read Indego 2018 Q3 trips and keep 8 full weeks (2018-07-02 through 2018-08-26)
dat_ori <- read.csv("E:/2019MUSA/MUSA_507_Public_Policy_Analytics/HW6/indego-trips-2018-q3.csv")
dat <- dat_ori %>%
  filter(as_date(start_time) >= as_date('2018-07-02') &
           as_date(start_time) <= as_date('2018-08-26')) %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = 'hour'),   # hourly bins
         interval15 = floor_date(ymd_hms(start_time), unit = '15 mins'), # 15-minute bins
         week = week(interval60),
         dotw = wday(interval60, label = TRUE),
         weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) 

2.2 load census tract data

We also loaded census geography and variables. These data allow us to explore some of the socioeconomic associations with ride share in Philadelphia.

census <- get_acs(geography = "tract",
                  variables=c("B01003_001", "B19013_001", 
                              "B02001_002", "B08013_001",
                              "B08012_001", "B08301_001", 
                              "B08301_010", "B01002_001"),
                  key="YOUR_CENSUS_API_KEY", # use your own Census API key here
                  year=2017,
                  state=42,
                  geometry=TRUE,
                  county=101,
                  output="wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

tracts <- 
  census %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

# join ride date to the tracts
dat_tracts <- st_join(dat %>% 
                        filter(is.na(start_lon) == FALSE &
                                 is.na(start_lat) == FALSE ) %>%
                        st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                      tracts %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

2.3 load weather data

Weather data are downloaded following the very reasonable hypothesis that weather conditions play a role in travel mode choice. In pouring rain or driving snow, there is a clear opportunity cost between walking 8 blocks to the subway for a $2 fare and taking a $10 ride share directly to work.

weather <- 
  riem_measures(station = "PHL", date_start = "2018-07-02", date_end = "2018-08-27")

weather.panel <-  
  weather %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  # zeros here are missing readings (coded 0 above); impute a plausible temperature
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

Below, the weather data are plotted as time series.

# plot the weather data
grid.arrange(
  ggplot(weather.panel, aes(interval60,Precipitation)) + geom_line(color="#666666") + 
    labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
  ggplot(weather.panel, aes(interval60,Wind_Speed)) + geom_line(color="#666666") + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
  ggplot(weather.panel, aes(interval60,Temperature)) + geom_line(color="#666666") + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - Philly - 2018.07-2018.08")

2.4 create time-space panel

After loading all the data we needed, we created an empty data frame (study.panel) with one row for each unique space/time combination.

study.panel <- 
  expand.grid(interval60=unique(dat_tracts$interval60), 
              start_station = unique(dat_tracts$start_station)) %>%
  left_join(., dat_tracts %>%
              select(start_station,  Origin.Tract, start_lon, start_lat )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))
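As a quick sanity check (an optional step, not part of the original workflow), the panel length should equal the number of unique hours times the number of unique stations:

# sanity check: one row per station per hour
nrow(study.panel) ==
  length(unique(dat_tracts$interval60)) * length(unique(dat_tracts$start_station))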

Then we created a full panel (ride.panel) containing the trip counts by station, along with weather and census data, for each unique space/time observation.

ride.panel <- 
  dat_tracts %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60,Origin.Tract,start_station, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.panel) %>%
  ungroup() %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE),
         weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, census %>%
              as.data.frame(), by = c("Origin.Tract" = "GEOID"))%>%
  select(-geometry)

2.5 create time lag variables

Next, some additional feature engineering is undertaken to mine the data for critical time trends. Like spatial lags, time lags are incredibly important when predicting time series trends, so several different time lags are created.
We also controlled for the effects of holidays that disrupt expected demand during a given weekend or weekday. Our study period contains Independence Day, so we added dummy variables indicating temporal proximity to the holiday.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  group_by(start_station) %>%   # lag within each station so values don't bleed across stations
  mutate(lagHour = dplyr::lag(Trip_Count, 1),
         lag2Hours = dplyr::lag(Trip_Count, 2),
         lag3Hours = dplyr::lag(Trip_Count, 3),
         lag4Hours = dplyr::lag(Trip_Count, 4),
         lag12Hours = dplyr::lag(Trip_Count, 12),
         lag1day = dplyr::lag(Trip_Count, 24)) %>%
  ungroup() %>%
  mutate(day = yday(interval60),
         holiday = ifelse(day == 185, 1, 0),  # day-of-year 185 = July 4, 2018
         # proximity dummies measured in days relative to the holiday
         holidayLag = case_when(day - 185 == 1 ~ "PlusOneDay",
                                day - 185 == 2 ~ "PlusTwoDays",
                                day - 185 == 3 ~ "PlusThreeDays",
                                185 - day == 1 ~ "MinusOneDay",
                                185 - day == 2 ~ "MinusTwoDays",
                                185 - day == 3 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, "0"))
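A quick tabulation (optional, purely diagnostic) shows how many station-hours each holiday-proximity dummy covers:

# each non-"0" level should cover one day's worth of station-hours
table(ride.panel$holidayLag)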

3. Exploratory Analysis

In this section we explore the ride share data for time, space, weather, and demographic trends. In order to forecast well, these features must help predict the time/space trends in our data. The hope is that much of the variation in ride share trip counts can be explained by temporal and spatial effects.

Sections 3.1-3.4 provide several different approaches to visualizing ride share over time. Section 3.5 explores whether ride share trips are correlated with certain socioeconomic factors. Sections 3.6-3.7 look at weather trends, 3.8-3.9 look at ride share across space, and 3.10 produces an animated gif of ride share trends across Philadelphia.

3.1 ride share trip count per hour by station

The following plot shows that over 90 percent of the stations see fewer than 5 bike share trips per hour.

ggplot(dat_tracts %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 1,fill="#336699")+
  labs(title="Bike share trips per hr by station. Philly, 2018.7-2018.8",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme()
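To put a number on that claim, a small optional summary computes the share of station-hours with fewer than 5 trips (note it covers only station-hours in which at least one trip was observed):

# share of observed station-hours with fewer than 5 trips
dat_tracts %>%
  group_by(interval60, start_station) %>%
  tally() %>%
  ungroup() %>%
  summarize(share_under_5 = mean(n < 5))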

3.2 ride share trip count by week

We plotted the ride share trips by week. Aside from the holiday effect, these plots indicate that week-to-week variation in trips is modest. Moreover, the consistent daily peaks and troughs suggest definite time trends that reaffirm our time-lag feature engineering approach.

as.data.frame(ride.panel) %>%
  group_by(week, interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ggplot(aes(interval60,Trip_Count)) + 
  geom_line(color="#336699") +
  plotTheme() +
  facet_wrap(~week, scales="free", ncol=3) +
  ylim(0,500) +
  geom_vline(xintercept = as.POSIXct("2018-07-04 00:00:00 UTC"), color = "darkgray", size=1.1) +
  labs(title="Rideshare trips in Philadelphia by week: July through August, 2018",
       subtitle="Independence Day in darkgray", x="Day", y="Trip Counts") 

3.3 ride share trip count by day of the week

Then we plotted ride share trip counts by day of the week. There is clearly a significant difference between weekdays and weekends, and on weekdays the counts spike during the morning and evening peaks.

ggplot(dat_tracts %>% mutate(hour = hour(start_time)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philly, by day of the week, July, 2018",
       x="Hour", 
       y="Trip Counts")+
  plotTheme()

3.4 ride share trip count as a function of time lags

Ride share trip counts as a function of the various time lags are visualized below. Correlation decreases with each additional hour, but predictive power returns with the 24-hour lag. These effects are strong, which means time effects will almost certainly play a vital role in forecasting ride share.

plotData.lag <-
  as.data.frame(ride.panel) %>%
  filter(week == 30) %>%
  group_by(interval60) %>% 
  summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                              "lag12Hours","lag1day")))
correlation.lag <-
  plotData.lag %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, Trip_Count),2))

plotData.lag %>%
  ggplot(aes(Value, Trip_Count)) + 
  geom_point(size=1, alpha = 0.4,color="#666666") + geom_smooth(method = "lm", se = F,color="#336699") + 
  facet_wrap(~Variable) +
  geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "black", 
            x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  labs(title = "Rideshare trip count as a function of time lags", 
       subtitle= "One week in July, 2018", x = "Lag Trip Count") +
  plotTheme()
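Since correlation.lag is already computed, it can also be printed as a small companion table (an optional addition, using knitr and kableExtra, which are already loaded):

# companion table of lag correlations, week 30
correlation.lag %>%
  kable(caption = "Correlation between Trip_Count and its time lags, week 30") %>%
  kable_styling()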

3.5 ride share trip count as a function of selected Census variables

Because the census variables do not vary over time, they cannot be used in the space/time forecast. Nevertheless, for the purpose of exploratory analysis, we analyzed the correlation between one week of ride share trip counts by tract and 6 Census variables. The following plots show that Mean_Commute_Time, Med_Inc, Percent_White, and Total_Pop are positively related to trip counts, while Percent_Taking_Public_Trans is negatively related.

plotData.census <- 
  as.data.frame(ride.panel) %>%
  filter(week == 27) %>%
  group_by(Origin.Tract) %>% 
  summarize(Trip_Count = sum(Trip_Count))  %>%
  left_join(census, by=c("Origin.Tract" = "GEOID")) %>%
  select(Trip_Count,Origin.Tract,Med_Inc,Total_Pop,Percent_White,Mean_Commute_Time,Percent_Taking_Public_Trans,Med_Age) %>%
  gather(Variable, Value, -Origin.Tract, -Trip_Count) %>%
  na.omit() %>%
  filter(Variable =="Med_Inc"| Variable == "Total_Pop"|Variable == "Percent_White"|
         Variable == "Mean_Commute_Time"|Variable == "Percent_Taking_Public_Trans"|Variable == "Med_Age") 

correlation.census <-
  plotData.census %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, log(Trip_Count)),2))

ggplot(plotData.census, aes(Value,log(Trip_Count))) + geom_point(color="#666666",alpha = 0.4) + geom_smooth(method="lm", se = F,color='#336699') +
  facet_wrap(~Variable, scales="free", ncol=3) +
  geom_text(data=correlation.census, aes(label=paste("R =", correlation)),
            colour = "black", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  plotTheme() +
  labs(title="One week of Ridershare trips (log) by Census Tract\nas a function of selected Census variables")

3.6 does ridership vary when it’s raining or snowing?

Looking at group means by week in the resulting bar plot does not support the hypothesis that rain and snow increase the propensity to take ride share. In most weeks, demand decreases during hours with precipitation.

as.data.frame(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  left_join(weather.panel) %>%
  mutate(isPrecip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
  group_by(week = week(interval60), isPrecip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
  ggplot(aes(isPrecip, Mean_Trip_Count)) + 
  geom_bar(stat = "identity",fill="#336699") +
  facet_wrap(~week,ncol=9) +
  labs(title="Does ridership vary when it's raining or snowing?",
       subtitle="Mean trip count by week; July through August, 2018",
       x="Precipitation", y="Mean Trip Count") +
  plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.7 ride share trip count per hour as a function of temperature

Do more people take ride share when it’s colder? The plot of trip count per hour as a function of temperature suggests the opposite: ride share actually increases as temperatures warm.

as.data.frame(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  left_join(weather.panel) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
  geom_point(size=1, color="#666666",alpha = 0.4) + geom_smooth(method = "lm", se= FALSE,color="#336699") +
  plotTheme() +
  labs(title="Trip Count per hour as a fuction of Temperature",
       subtitle="July through August, 2018",
       x="Temperature", y="Mean Trip Count")

3.8 ride share trips by station, time of day, and weekend

The maps of bike share trips by station reveal some interesting patterns: trips are more frequent on weekdays than on weekends, and trip counts peak during the weekday PM rush.

ggplot()+
  geom_sf(data = tracts %>%
            st_transform(crs=4326),color = "grey", fill = "transparent")+
  geom_point(data = dat_tracts %>% 
               mutate(hour = hour(start_time),
                      weekend = ifelse(dotw %in% c("Sat", "Sun"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
               group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 1, size = 3)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
  xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hour by station.Philly, 2018.7-2018.8")+
  mapTheme()

3.9 sum of ride share trips by tract and hour of the day

Then we mapped the sum of ride share trips by tract and hour of the day. These maps indicate that trips are concentrated in Center City from roughly 7 a.m. to 8 p.m. As the day goes on, trips appear to emanate outward from the central city.

spatial_dis <- 
  left_join(ride.panel, tracts, by = c("Origin.Tract" = "GEOID")) %>% 
  st_sf()
  
spatial_dis  %>% 
  group_by(hour = hour(interval60), Origin.Tract) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
  facet_wrap(~hour, ncol = 8) +
  scale_fill_manual(values = palette5,
                    labels = c("85", "698", "1506", "2629", "4778"),
                    name = "Trip_Count") +
  mapTheme()+
  ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
  xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
  labs(title="Sum of rideshare trips by tract and hour of the day") +
  geom_sf(data=tracts,color = "grey", fill = "transparent") 

3.10 space/time animation

Below, we built an animated gif map of ride share trips by tract and hour of the day for one Monday in week 30. The gif confirms the patterns observed in Section 3.9.

week30 <- dat_tracts %>%
  filter(week == 30 & dotw == "Mon")

week30.panel <-
  expand.grid(
    interval60=unique(week30$interval60),
    Origin.Tract = unique(week30$Origin.Tract))

ride.animation.data <-
  week30 %>%
  mutate(Trip_Counter = 1) %>%
  right_join(week30.panel) %>% 
  group_by(interval60, Origin.Tract) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  left_join(tracts,by=c("Origin.Tract" = "GEOID")) %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                           Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                           Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                           Trip_Count > 10 ~ "10+ trips")) %>%
  mutate(Trips  = factor(Trips, levels=c("0 trips","1-3 trips","4-6 trips","7-10 trips","10+ trips")))

rideshare_animation <-
  ggplot() +
  geom_sf(data=ride.animation.data, aes(fill=Trips)) +
  scale_fill_manual(values = palette5) +
  labs(title = "Rideshare pickups for one day in July 2018, Philly",
       subtitle = "60 minute intervals: {current_frame}", caption = "@Ziqun Li") +
  transition_manual(interval60) +
  mapTheme()+
  ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
  xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
  geom_sf(data=tracts, fill = NA)  # tract outlines drawn on top

animate(rideshare_animation, duration=20)
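If the gif needs to be written to disk (the reason gifski and png were loaded), gganimate’s anim_save can render the animation to a file; the filename below is only a placeholder:

# optionally save the rendered animation (placeholder filename)
anim_save("rideshare_week30.gif", rideshare_animation, duration = 20)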

4. Modeling

Our exploratory analysis has revealed a host of patterns in the rideshare data. Almost assuredly, these patterns will yield a robust model.

4.1 model building

Since the trip counts are sufficiently large to feel at ease with OLS, we developed several models on ride.Train (weeks 27-32) and tested them on ride.Test (weeks 33-34). Here is a breakdown of the 5 regressions estimated below:
1. reg1 includes space fixed effects, Temperature and Precipitation. Lag effects are not included for now.
2. reg2 focuses on hour fixed effects, day-of-the-week fixed effects, and the weather effects.
3. reg3 includes both time and space fixed effects, as well as the weather effects.
4. reg4 adds the time lag features.
5. reg5 adds holiday and holidayLag effects.

ride.Train <- filter(ride.panel, week < 33)
ride.Test <- filter(ride.panel, week >= 33)

reg1 <- 
  lm(Trip_Count ~  start_station + Temperature+ Precipitation , data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  dotw + hour(interval60) + Temperature +Precipitation , data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + dotw + hour(interval60)  + Temperature +Precipitation , 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station + dotw + hour(interval60)  + Temperature +Precipitation+
     lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + dotw + hour(interval60) + Temperature + Precipitation +
       lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

4.2 model estimation

Then, we created predictions for all of the above models.

# predict for test data
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week)

# helper: predict a fitted model on one nested week of test data
model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)
}

week_predictions <- 
  ride.Test.weekNest %>% 
  mutate(Space_FE_1 = map(.x = data, fit = reg1, .f = model_pred),
         Time_FE_2 = map(.x = data, fit = reg2, .f = model_pred),
         Time_Space_FE_3 = map(.x = data, fit = reg3, .f = model_pred),
         Time_Space_FE_timeLags_4 = map(.x = data, fit = reg4, .f = model_pred),
         Time_Space_FE_timeLags_holidayLags_5 = map(.x = data, fit = reg5, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions
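As an optional convenience (assuming week_predictions as built above), mean MAE per model across the two test weeks can also be summarized in a table:

# mean MAE per model across the two test weeks
week_predictions %>%
  group_by(Regression) %>%
  summarize(mean_MAE = round(mean(MAE), 2)) %>%
  kable() %>%
  kable_styling()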

5. Validation

5.1 Accuracy

The MAE for each model is plotted by week below. It is clear that the best models, those with the time lags, are off by less than one ride per hour on average; at a glance, that is decent overall accuracy.

# MAE by model specification and week
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette5) +
  labs(title = "Mean Absolute Errors by model specification and week, Test Set") +
  plotTheme()

According to the predicted/observed bike share time series plot, the holiday effects do not bring much additional predictive power, perhaps because our test set contains no holidays for the holiday lags to act on. The time series plots also show that our best model (reg5) is able to track the time components of demand, but it misses the peaks and underpredicts during periods of high demand.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station)) %>%
  dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -start_station) %>%
  na.omit()  %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  scale_colour_manual("",values = c("Observed" = "darkgray","Prediction" = "#006699"))+
  geom_line(size = 1)+
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
  plotTheme()

5.2 Generalizability

In this section, we selected the Time_Space_FE_timeLags_holidayLags_5 model (reg5), which has the best predictive ability, for further examination of generalizability.

If we plot observed vs. predicted trips for different times of day on weekdays and weekends, some patterns begin to emerge. We are certainly underpredicting in general, but what outcomes can our model not explain? To answer this question, we examined the generalizability of the model along three dimensions: space, time, and demographics.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         from_station_id = map(data, pull, start_station), 
         from_latitude = map(data, pull, start_lat), 
         from_longitude = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, from_station_id, from_longitude, 
         from_latitude, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "Time_Space_FE_timeLags_holidayLags_5")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction),size=1,alpha = 0.4,color="#666666")+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#336699")+
  geom_abline(slope = 1, intercept = 0,color="black")+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted, Test set, Model 5",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()

5.2.1 validate test set by space

First, we mapped the model's errors by census tract. The highest trip count MAE occurs in Center City, where more trips obviously occur. It also seems that the model is better at predicting in time than in space.

# choropleth map of errors by tract, Model 5
filter(week_predictions, Regression == "Time_Space_FE_timeLags_holidayLags_5") %>% 
  unnest() %>% 
  left_join(., tracts, by = c("Origin.Tract" = "GEOID")) %>% 
  st_sf() %>% 
  dplyr::select(Origin.Tract, Absolute_Error, geometry) %>%
  gather(Variable, Value, -Origin.Tract, -geometry) %>%
  group_by(Variable, Origin.Tract) %>%
  na.omit() %>%
  summarize(MAE = mean(Value)) %>%
  ggplot() + 
  geom_sf(aes(fill = q5(MAE))) +
  scale_fill_manual(values = palette5,
                    labels = c("2", "3", "4", "5", "8"),
                    name = "MAE") +
  labs(title="Chloropleth map of errors by neighborhood") +
  mapTheme()+
  ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
  xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
  geom_sf(data=tracts,color = "grey", fill = "transparent") 

5.2.2 validate test set by time

To further investigate the temporal pattern of these large errors, we mapped the MAE of our model across time and space. The Center City errors increase during the weekday AM and PM rushes, as well as during the weekend mid-day and PM rush, which means the model's predictive ability decreases during these periods.

# faceted maps of error by time period, Model 5
week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull,start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw) ) %>%
  select(interval60, start_station,start_lon, 
         start_lat, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "Time_Space_FE_timeLags_holidayLags_5")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = tracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size =3)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_tracts$start_lat), max(dat_tracts$start_lat))+
  xlim(min(dat_tracts$start_lon), max(dat_tracts$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Facetted maps of error by time-period")+
  mapTheme()

5.2.3 validate test set by demographic dimensions

In addition, we explored the relationship between the prediction errors and the demographic variables. As can be seen from the following figures, Percent_Taking_Public_Trans and Percent_White are strongly associated with the MAE. More specifically, our model performs better in tracts with a high share of commuters taking public transit, and in tracts with a low share of white residents.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull,start_lon),
         dotw = map(data, pull, dotw),
         Med_Inc = map(data, pull, Med_Inc),
         Total_Pop = map(data, pull,Total_Pop),
         Percent_White = map(data, pull,Percent_White),
         Percent_Taking_Public_Trans = map(data, pull,Percent_Taking_Public_Trans))%>%
  select(interval60, start_station, start_lon, 
         start_lat, Observed, Prediction, Regression,
         dotw, Med_Inc,Total_Pop,Percent_White,Percent_Taking_Public_Trans) %>%
  unnest() %>%
  filter(Regression == "Time_Space_FE_timeLags_holidayLags_5")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "PM Rush") %>%
  group_by(start_station, Med_Inc,Total_Pop,Percent_White,Percent_Taking_Public_Trans) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.5,color="#666666")+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE,color="#336699")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme()

6. Discussion

In general, the model we built does have some flaws: it misses the peaks and underpredicts during periods of high demand. We still believe it is an effective model, since it is accurate to within an average of one ride per hour, which is decent overall accuracy. Our model also tracks the time components of demand: although it does not capture the magnitude of the highest demand, it does capture its timing, which is useful insight for bike allocation. We also maintain that the model is worth putting into production in certain areas: in tracts with a high share of public transit commuters, or a low share of white residents, it retains good predictive power. To improve the algorithm in the future, we could add more effective features to the model, such as traffic conditions and additional socioeconomic indicators.