On Saturday, March 23, 2013, I attended a "Chicago Crime Hack" sponsored by the Northwestern University Knight Lab and the Chicago Tribune News Applications team. My interest in the data was less news-oriented than that of most of the folks who attended. Rather, I was interested in a nice noisy data set that might become a use case for Ancho modeling.
Regrettably, I wasn't able to stay for the entire event. A snowstorm was going to move into southern Illinois the next day. It didn't affect Chicago that much, but if I'd stayed I would have had to drive through it with two toddlers in the back seat. So, we drove back to St. Louis a day early.
Anyway, one other fellow, Bernie Leung, was interested in using the data set for predictive analytics. From our brief conversation, I believe we were thinking in the same direction. (He comes from the credit card fraud and risk analytics world.)
What's in the Data?
The crime data is event-oriented. Each event has a timestamp and some metadata: approximate location, type of crime, was an arrest made at the time, etc.
My first thought was to model crime kind of like weather: divide the city up into grid sections and model it as a dynamic system. The crime data itself would be more like observational weather data. Instead of "on Thursday it rained in Andersonville," you have "on Thursday there was a car break-in in the 1800 block of West Catalpa St."
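A minimal sketch of that gridding idea, in Python. Everything specific here is an assumption for illustration: the grid origin, the 0.01-degree cell size, and the `(timestamp, lat, lon, crime_type)` event layout are hypothetical and not taken from the actual data set's schema.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical grid: origin and cell size are illustrative assumptions,
# not values taken from the Chicago crime data.
ORIGIN_LAT, ORIGIN_LON = 41.60, -87.95
CELL_DEG = 0.01  # cell width/height in degrees


def grid_cell(lat, lon):
    """Map a coordinate to a (row, col) grid cell index."""
    row = math.floor((lat - ORIGIN_LAT) / CELL_DEG)
    col = math.floor((lon - ORIGIN_LON) / CELL_DEG)
    return row, col


def count_events(events):
    """Count events per (day, grid cell, crime type).

    `events` is an iterable of (timestamp, lat, lon, crime_type) tuples,
    an assumed layout for this sketch.
    """
    counts = Counter()
    for ts, lat, lon, crime_type in events:
        counts[(ts.date(), grid_cell(lat, lon), crime_type)] += 1
    return counts
```

The result is the "observational" series: one count per day, per grid section, per crime type, which is the shape a grid-based dynamic model would consume.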
Crime data by itself is not going to be a very useful predictor, however. So I started thinking, what are variables that might have predictive usefulness, if you could get them as a time series and, in some cases, associate them with specific geographies? Let's ask the FBI, who presumably has spent a lot of time thinking about this.
I assembled a list of crime-related variables by type and potential source. I'm hoping I will be able to get some assistance in locating these data sets, because for some of them, I have no idea where to start looking.
The potentially useful variables I thought of are mostly about demographics, weather, enforcement levels, economic conditions, and the transportation network. Of these, I'd hope to get everything except macroeconomic variables broken down into geographies that are smaller than the entire City of Chicago.
Building a Model
Assuming that I had all that data at my disposal, how would Ancho be helpful in putting it to use?
Ancho is designed for building probabilistic time series models, but in this case, we don't know much about the uncertainties in the data. We're not sure a priori what relationships will actually be present between the "independent" variables and the "dependent" variable of actual reported crime. One key source of uncertainty would be the gap between actual crimes and reported crimes; sometimes people just don't report things.
We do know, however, that there is some amount of uncertainty in the American Community Survey data that is generated by the US Census Bureau, because unlike the decennial census, it is based on sampling. Therefore any data that we use that comes directly from ACS or is scaled and adjusted according to ACS is subject to some "known" uncertainty.
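As a rough illustration of what that "known" uncertainty looks like in practice: ACS tables publish a margin of error alongside each estimate, stated at the 90% confidence level, so a standard error can be recovered by dividing by 1.645. When several estimates are added together (say, tracts rolled up into a grid section), the Census Bureau's usual approximation is the square root of the sum of squared margins. The function names below are hypothetical, and this is only a sketch of that arithmetic, not anything Ancho does.

```python
import math

# ACS margins of error are published at the 90% confidence level.
Z_90 = 1.645


def standard_error(moe):
    """Convert a published ACS margin of error to a standard error."""
    return moe / Z_90


def moe_of_sum(moes):
    """Approximate margin of error for a sum of ACS estimates.

    Uses the root-sum-of-squares approximation, which ignores any
    covariance between the estimates being added.
    """
    return math.sqrt(sum(m * m for m in moes))
```

For example, two tract-level estimates with margins of 30 and 40 would combine to a margin of 50 for their sum under this approximation.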
Ancho will also need to include some machinery for backtesting a model against actual observations. That will be useful here. In fact, the backtesting needs of a project like this will require some features I hadn't previously thought of, like the ability to evaluate a time series of forecast probabilities against actual observations across a geographic grid.
What does that mean? Think about it this way: weather forecasts are produced for a matrix of geographic grid points, not just for a single location. And they are probabilistic: forecasts don't say "it will rain today," but are stated in terms of probability, e.g. "there is a 30% chance of rain today." Forecast models are considered good if, on average, it rains 30% of the time when the forecast says there's a 30% chance of rain.
We would need to evaluate any modeled crime predictions in the same way. The model isn't going to tell you how likely you are to get your purse stolen. Instead, it may be able to say that per-capita violent crimes in a certain grid section will be in a certain range. Evaluating the total amount of error in the entire model versus what actually happened will allow us to say "this is a good model," or not.
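The weather-style check described above can be sketched in a few lines. This assumes the model's output has been flattened into one forecast probability per (day, grid section) pair, with a matching 0/1 outcome for whether the event actually happened; that layout, and the function names, are assumptions for illustration. The reliability table groups forecasts into probability bins and compares the average forecast to the observed frequency in each bin, and the Brier score is one common single-number summary of total error for probabilistic forecasts.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, n_bins=10):
    """Compare forecast probabilities with observed frequencies.

    Returns {bin_index: (mean_forecast, observed_frequency, count)}.
    A well-calibrated model has mean_forecast close to
    observed_frequency in every bin.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])  # [sum of p, hits, count]
    for p, y in zip(forecasts, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return {b: (sp / n, hits / n, n) for b, (sp, hits, n) in sums.items()}


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast and outcome (lower is better)."""
    pairs = list(zip(forecasts, outcomes))
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)
```

Because the inputs are just flat lists, the same functions could be run over the whole grid at once or one grid section at a time, which is one way to see where a model is doing well and where it isn't.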
I have some very ill-formed thoughts about how Ancho could be used in conjunction with genetic algorithms to automatically generate and successively improve upon a "population" of models. But that's a whole other topic and one I understand just well enough to embarrass myself, so I'm going to leave that one for later.
Anyway, it was nice to enjoy the company of proper nerds for a few hours.