Law enforcement agencies face significant resource allocation questions. In 2017, the Chicago Police Department recorded 268,387 incidents, an average of 735 incidents per day. There is a clear need for an operational planning tool aimed at reducing crime. Victims of crime, and even those who simply live in fear of crime, incur real physical, emotional, and financial costs. Crime can depress land values and hinder economic development. It can deter children from walking to school in the morning or families from enjoying an evening on their porch. Crime can be both the cause and the symptom of many significant socio-emotional and economic outcomes.
Predictive policing is a form of geospatial risk prediction that borrows the locational experience of crimes observed in the recent past and tests the extent to which those experiences generalize to places that may be at risk for crime despite a lack of reporting. To better anticipate robbery risk, we built a geospatial risk model using Chicago's robbery crime data.
library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(raster)
# create a mapTheme() function used to standardize the map outputs
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
A useful predictive policing model is one that can predict explicit crime risk by way of meaningful risk factors without being biased by the implicit risk generated from biased policing interventions. However, most crime datasets are so biased that, without an intense focus on model validation, most law enforcement use cases of predictive policing are of little use. Several forms of selection bias exist in our robbery dataset: it may be that neighborhoods with certain environmental risk factors breed the behavior that leads to increased robbery; it may be that systematic over-policing of minority neighborhoods leads to more reported robberies; or it may be that people with a high propensity to commit robbery, or to report it, select into high-crime neighborhoods. None of these selections happens at random, and if the selection criteria go unaccounted for in the model, they fall into the error term, which can lead to serious consequences.
Take police surveillance bias as an example: if it exists, the 'true' amount of robbery may be overstated in some communities and understated in others. In other words, our model might produce one set of errors unique to white neighborhoods and another unique to non-white neighborhoods. These disparities matter because of how the model informs resource allocation. Any algorithm that creates a police surveillance disparity along racial lines is at best unfair and at worst unconstitutional. Thus, in our model, we also examined whether this algorithm would perpetuate racial bias.
To get a sense of the spatial distribution of robbery data in Chicago, we loaded the Chicago boundary data and mapped the robberies as points throughout Chicago.
# load the boundary data
chicagoBoundary <-
st_read("E:/2019MUSA/MUSA_507_Public_Policy_Analytics/HW5/riskPrediction_data/chicagoBoundary.shp") %>%
st_transform(crs=102271)
# load the robbery data
robbery <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "ROBBERY") %>%
dplyr::select(-Date,-Updated.On) %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform(102271) %>%
distinct()
# map the robberies as points
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = robbery, colour="#006699", size=0.7, show.legend = "point") +
labs(title= "Robberies, Chicago - 2017") +
mapTheme()
We also created a fishnet of 500 ft grid cells and kept only those cells that fall within the Chicago boundary (a minimal sketch of this step is shown below). Then we joined the robberies to the fishnet and mapped the count of robberies by grid cell. The spatial structure of robberies begins to take shape in the map below, as the hotspots are immediately apparent: they cluster mainly in the southeastern and northwestern areas of Chicago.
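Since the fishnet code itself does not appear above, the following is a minimal sketch of how it could be built, assuming a 500 ft cell size in the foot-based projected CRS and a uniqueID column, which the later joins rely on.
# hedged sketch: build a 500 ft fishnet over the city and keep only the cells
# that intersect the boundary; uniqueID is referenced by later joins
fishnet <-
st_make_grid(chicagoBoundary, cellsize = 500) %>%
st_sf() %>%
.[chicagoBoundary, ] %>%
mutate(uniqueID = rownames(.))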
# join robbery data to the fishnet
crime_net <-
robbery %>%
dplyr::select() %>%
mutate(count = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count = ifelse(is.na(count), 0, count),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = count)) +
scale_fill_viridis() +
labs(title = "Count of robberies for the fishnet") +
mapTheme()
Next, we downloaded a small set of features (drug crimes, liquor stores, street lights out, sanitation complaints, abandoned buildings, abandoned cars, graffiti, and police stations), which were selected as potential predictors. Again, we mapped these predictors as points to get a sense of their distribution.
police <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv/data") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "police")
drug <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type=="NARCOTICS") %>%
dplyr::select(Y = Latitude, X = Longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "drug")
liquor <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
# plot the predictors
ggplot() +
geom_sf(data=chicagoBoundary) +
geom_sf(data = rbind(drug,liquor,streetLightsOut,abandonCars,graffiti,
sanitation,abandonBuildings,police),
colour="#006699",size = .1) +
facet_wrap(~Legend, ncol = 4) +
labs(title = "Risk Factors") +
mapTheme()
There are many ways to engineer features that test the exposure hypothesis. In this model, we used two methods to engineer features: counting risk factor events by grid cell and computing average nearest neighbor distances.
For the sanitation, abandonBuildings, abandonCars, and graffiti data, we simply summed the number of risk factor points occurring in each grid cell.
vars_net <-
rbind(sanitation,abandonBuildings,abandonCars,graffiti) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
In addition, we used the average nearest neighbor method for the police (k = 1), drug (k = 5), liquor (k = 5), and streetLightsOut (k = 5) data; this approach considers exposure at a different scale than simply counting risk factor events by grid cell.
# knn distance
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
vars_net$police.nn_1 =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(police), 1)
vars_net$drug.nn_5 =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(drug), 5)
vars_net$liquor.nn_5 =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquor), 5)
vars_net$streetLightsOut.nn_5 =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 5)
A small multiple map of all factors in the fishnet is shown below. Interestingly, these factors seem to illustrate slightly different spatial processes. Abandoned_Buildings are mainly clustered in South Chicago, while Abandoned_Cars are mostly in the north. 311 complaints for Graffiti tend to cluster along major thoroughfares. In terms of distance, grid cells in northern Chicago tend to be closer to drug crimes and liquor retail stores. We then joined the robbery fishnet and the factor fishnet into a final dataset, final_net.
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Factors by Fishnet"))
# create the final dataset
final_net <-left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID")
In this section, the local spatial structure of robbery (by grid cell) was quantified and visualized by way of a statistic called Local Moran's I. As can be seen from the maps below, high values of Local Moran's I represent strong and statistically significant evidence of local clustering; there are two significant robbery hotspots, one in the north and one in the south of the city.
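For reference, one common formulation of the Local Moran's I statistic for grid cell $i$, with robbery count $x_i$ and spatial weights $w_{ij}$ (queen contiguity in the code below), is

$$I_i = \frac{x_i - \bar{x}}{\sum_{k=1}^{n}(x_k - \bar{x})^2 / (n-1)} \sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x}),$$

so a large positive $I_i$ indicates a high-count cell surrounded by other high-count cells.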
# check the spatial autocorrelation
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$count, final_net.weights)),
as.data.frame(final_net, NULL)) %>%
st_sf() %>%
dplyr::select(Robbery_Count = count,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(varList, ncol =4, top = "Local Morans I statistics, Robbery"))
A generalizable model is one that predicts equally well in the hotspots and the coldspots. In other words, a model fit only to the coldspots will underfit the hotspots, and a model fit only to the hotspots will overfit the coldspots. Thus, to improve the generalizability of our model, we added two spatial structure features: a dummy variable (robbery.isSig) denoting whether a grid cell belongs to a significant cluster, and a continuous variable (robbery.isSig.dist) measuring the distance from each grid cell to its nearest significant cluster.
final_net <-
final_net %>%
mutate(robbery.isSig = ifelse(localmoran(final_net$count,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(robbery.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, robbery.isSig == 1))), 1 ))
After all the variables (spatial and non-spatial) were created, we performed correlation tests. The plots below show that, for a given risk factor, the count and nearest neighbor features tell similar stories but with opposite directions of association. To our surprise, distance to the nearest police station is also negatively correlated with the number of robberies, which means areas closer to police stations may have higher robbery risk.
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(-uniqueID, -cvID) %>%
gather(Variable, Value, -count)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, count, use = "complete.obs"))
ggplot(correlation.long, aes(Value, count)) +
geom_point(size = 0.2,colour = "#333333") +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#006699") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Robbery count as a function of risk factors")
For a given algorithm to be effective, the observed data must be distributed according to the assumptions of the algorithm. From the histogram of robbery counts, we can see that the distribution resembles a Poisson distribution, which supports the use of Poisson regression later on.
# plot the histogram
ggplot(final_net, aes(count)) +
geom_histogram(binwidth = 1,fill="#006699") +
labs(title = "Distribution of robbery by grid cell")
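To make the comparison with a simulated Poisson distribution explicit, one could overlay counts drawn from a Poisson distribution with the same mean as the observed counts. This check is not part of the original workflow; the sketch below assumes final_net has already been created.
# hedged sketch: compare observed robbery counts per grid cell with counts
# simulated from a Poisson distribution that shares the observed mean
set.seed(1234)
poisson_check <-
data.frame(
count = c(final_net$count,
rpois(nrow(final_net), lambda = mean(final_net$count))),
source = rep(c("Observed", "Simulated Poisson"), each = nrow(final_net)))
ggplot(poisson_check, aes(count, fill = source)) +
geom_histogram(binwidth = 1, position = "dodge") +
labs(title = "Observed vs. simulated Poisson robbery counts by grid cell")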
Geospatial risk models are purely spatial in nature, which means that moving directly to cross-validation is a particularly useful approach to gauging generalizability. In a geospatial risk model, we can perform spatial cross-validation for this purpose. A well generalized crime prediction model must predict equally well in all communities, and a geospatial model that predicts equally well in all neighborhoods is one that has learned crime risk patterns that apply at both citywide and local spatial scales.
In order to perform both random k-fold and spatial cross-validation, we joined the neighborhood names to final_net. The randomly generated cvID associated with each grid cell is used for random k-fold cross-validation, while the neighborhood name is used for spatial cross-validation.
# join the neighborhood name for spatial cross-validation
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(neighborhoods, name)) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
# cross validation
reg.ss.vars <- c('drug.nn_5','liquor.nn_5','streetLightsOut.nn_5','Abandoned_Cars','Graffiti',
'Sanitation','Abandoned_Buildings','police.nn_1',"robbery.isSig", "robbery.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(count ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
# k fold validation
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, count, Prediction, geometry)
# spatial cross validation
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, count, Prediction, geometry)
The principal motivation for geospatial risk prediction is the idea of 'latent risk'. Crime is a relatively rare event, and it may be happening at a greater rate than what is observed in the data. The goal of the model, then, is to predict this latent risk by borrowing the experience of observed crimes and testing how well those experiences generalize to places where no crime has been observed. Assuming these predictions are 'useful', defined as generalizable across different community contexts, it may be possible to create a tool that police can use to allocate their limited resources across space more efficiently.
In this context, accuracy and generalizability are far more nuanced. A perfectly accurate model is one where the predicted crime count equals the observed crime count, and such a model would fail to identify areas at latent risk. This makes goodness of fit very difficult to judge for geospatial risk prediction models. The goal is to achieve a useful accuracy that improves resource allocation relative to the business-as-usual approach, while ensuring this accuracy generalizes appropriately.
To assess the accuracy and generalizability of our model, we first mapped the model's raw errors by random k-fold and spatial cross-validation. The two maps look roughly the same: despite some outliers, the errors in most areas are close to zero, suggesting the model has strong predictive ability.
# map the error
reg.summary <-
rbind(
mutate(reg.ss.cv, Error = Prediction - count,
Regression = "Random k-fold CV: Spatial Structure"),
mutate(reg.ss.spatialCV, Error = Prediction - count,
Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
st_sf()
filter(reg.summary, Regression == "Random k-fold CV: Spatial Structure" |
Regression == "Spatial LOGO-CV: Spatial Structure") %>%
ggplot() +
geom_sf(aes(fill = Error)) +
facet_wrap(~Regression) +
scale_fill_viridis(begin =0, end =0.8) +
labs(title = "Robbery Errors by Regression") +
mapTheme()
We also provide another way to visualize the model's predicted robbery counts compared to the observed counts. It is interesting to note that both models overpredict in the 1st through 7th deciles but underpredict in the 9th and 10th deciles. In other words, we overpredict in low robbery areas and underpredict in high robbery areas.
# visualize the predicted and observed robberies by observed robbery decile
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
mutate(robbery_Decile = ntile(count, 10)) %>%
group_by(Regression, robbery_Decile) %>%
summarize(meanObserved = mean(count, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -robbery_Decile) %>%
ggplot(aes(robbery_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = robbery_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed robberies by observed robbery decile")
Next, we examined the MAE and the standard deviation of the MAE across folds for each regression. The interpretation is that the model errs by roughly 2.38 robberies, on an absolute basis, on average. For context, the mean robbery count per grid cell is 4.62. The two regressions produce similar MAEs, and importantly, the model seems to generalize equally well regardless of the cross-validation type.
# check the mae and the sd_mae
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
summarize(MAE = round(mean(abs(Prediction - count), na.rm = T),2),
SD_MAE = round(sd(abs(Prediction - count), na.rm = T),2)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "white")
| Regression | MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Spatial Structure | 2.35 | 2.61 |
| Spatial LOGO-CV: Spatial Structure | 2.42 | 2.86 |
Now we turn to the important question of whether these algorithms can perpetuate racial bias. Earlier we discussed how systematic over-policing of certain communities may lead to a reporting or selection bias; if left unaccounted for, this variation falls into the error term of the regression. If this reporting bias falls along racial lines, the resulting predictions may exacerbate racist policing practices and have a disparate impact on communities of color. We therefore calculated the raw errors by racial context for the random k-fold and spatial cross-validation regressions.
# A table of raw errors by race context for a random k-fold vs. spatial cross validation regression.
library(tidycensus)
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
key='385fdcbf1110d83b532fcea7941e0631e4f42b7b',year = 2017, state=17, county=031, geometry=T) %>%
st_transform(102271) %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White")) %>%
.[neighborhoods,]
reg.summary <-
reg.summary %>%
mutate(uniqueID = rownames(.))
reg.summary.tracts <-
st_join(st_centroid(reg.summary), tracts17) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(reg.summary, uniqueID)) %>%
st_sf() %>%
na.omit()
From the table below, we can see that our model slightly overpredicts in the majority White context (the mean error is positive) and underpredicts in the majority non-White context (the mean error is negative).
st_set_geometry(reg.summary.tracts, NULL) %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority Non-White | Majority White |
|---|---|---|
| Random k-fold CV: Spatial Structure | -0.1653522 | 0.1787604 |
| Spatial LOGO-CV: Spatial Structure | -0.1460793 | 0.2432189 |
Next, let's contrast this finding with the algorithm's overall usefulness as a police resource allocation tool. Police departments the world over have moved to hotspot policing, targeting police resources to the places where crime is most concentrated. Below we create a kernel density hotspot map built from a 1,000 ft search radius. At the same time, a goodness of fit indicator is created to illustrate whether the risk predictions capture more of the observed robberies than the kernel density does. If the predictions do capture more observed robberies, then the risk prediction model provides a more robust targeting tool for allocating police resources.
## kernel density
# Compute kernel density
rob_ppp <- as.ppp(st_coordinates(robbery), W = st_bbox(final_net))
rob_KD <- spatstat::density.ppp(rob_ppp, 1000)
# Convert kernel density to grid cells taking the mean
rob_KDE_sf <- as.data.frame(rob_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
# Bind to a layer where test set crime counts are spatially joined to the fishnet.
bind_cols(
aggregate(
dplyr::select(robbery) %>% mutate(Count = 1), ., length) %>%
mutate(Count = replace_na(Count, 0))) %>%
#Select the fields we need
dplyr::select(label, Risk_Category, Count)
head(rob_KDE_sf)
## risk predictions
# Random k-fold CV: Spatial Structure
rob_risk_sf <-
filter(reg.summary, Regression == "Random k-fold CV: Spatial Structure") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(robbery) %>% mutate(Count = 1), ., length) %>%
mutate(Count = replace_na(Count, 0))) %>%
dplyr::select(label,Risk_Category,Count)
## The map comparing kernel density to risk predictions
rbind(rob_KDE_sf, rob_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(robbery, 1500), size = .1, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE)+
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
To make the comparison a bit more intuitive, the rate of robbery points by risk category and model type is calculated and plotted. It is clear that the risk predictions capture a greater share of observed robbery events in the highest risk category relative to the kernel density, which means our model allocates resources better than the traditional crime hotspot approach.
rbind(rob_KDE_sf, rob_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(count = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = count / sum(count)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label),position="dodge", stat="identity") +
scale_fill_manual(values = c("#339999", "#006699"))
Overall, our model is effective. The raw error maps indicate that the difference between the predicted and observed robbery counts is small in most areas, meaning the model predicts robbery precisely for most of the city. In addition, during validation by both random k-fold and spatial cross-validation, the MAE results are similar (roughly 2.38); given that the mean observed robbery count is 4.62, our model seems to have strong predictive ability and to generalize equally well regardless of the cross-validation type. However, the model does have a flaw: it slightly overpredicts in majority White neighborhoods and underpredicts in majority non-White neighborhoods. Considering that all of the absolute mean errors are less than 0.25, we believe this problem is still within acceptable limits.
Although some flaws remain, we still believe the model is worth recommending for production. Most importantly, it captures a greater share of observed robbery events in the highest risk category relative to the traditional kernel density model, which means it allocates resources better than traditional crime hotspot mapping. If this model were adopted, it could meaningfully improve law enforcement resource allocation.