Did you know that Oakland Crimespotting is still kicking hard, with hundreds of alert subscribers and a smooth, regular flow of timely data from the Oakland Police Department? The project has essentially been on auto-pilot since we re-launched it back in March, but holiday side projects have been a favorite activity of mine for years, so this time I'm thinking about the relatively short time horizon Crimespotting offers.
The current interface offers up to a month's worth of highly granular information on individual reports, and you can quickly get a sense for how active a given neighborhood is by digging around a little, doing a few searches, and checking out details on local crime reports. What we don't have is a long view.
Heat maps are one effective way to present large volumes of aggregate data over a geographical area, so I've been exploring ways to make them legible for crime data.
There's a ton of existing work out there in this area to draw on, some of it good and some of it dreadful.
First and foremost is Martin Wattenberg's seminal Map of the Market, a live and non-geographical view of stock trading activity, which celebrated its ten-year anniversary this past year. MOTM shows volume and change over time in a tight, clean, effective package, most recently notable for showing how Campbell's Soup and gold mining managed to weather the recent precipitous drops in the Dow.
A more topical geographic example is the Microsoft Research project How We Watch the City: Popularity and Online Maps. Danyel Fisher used server logs from Microsoft Virtual Earth tile servers to show viewing patterns around the world, with the beautiful results shown here.
Finally, HeatMapAPI offers commercial support for making your own heat maps.
The results of HeatMapAPI's software actually illustrate a few of the things I've found weakest about geographic heat maps, a big excuse for why we've not done them for Oakland Crimespotting so far. There are two big shortfalls in the screen shot above: the data obscures the context, and simultaneously fails to communicate much in the way of specifics. The two primary questions you might want to ask of your data are "where?" and "how much?" The answers offered here are apparently "in a place near Whittier whose name I can't read" and "yellow".
So that's the starting point.
The answer I've settled on for the "where?" question is OpenStreetMap. I've been growing steadily more excited about this project for some months now, in part because it offers up the possibility of playing some beautiful visual games with high quality street data. In the HeatMapAPI example above, the context problem arises from the impossibility of manipulating Google's map data at any level more granular than their pre-rendered tiles. The overlays obscure the town and street names that help give them meaning. With OSM data and Mapnik, it's possible to create a semi-transparent streets layer specifically designed to interact well with underlaid data. It took just an afternoon's worth of modifications to my existing OSM visual design to come up with something suitable for layering with quantitative data. Gem helped tune the visual interaction between layers, so now there's a directly-overlaid set of names and icons above a translucent (25% - 50%) black street grid. Each of these layers is a separate Mapnik style, composited with the underlying color heat map.
In these maps, streets have been stripped back to translucent dark stripes, with white edges showing where the shoreline of the Bay begins.
The second question, "how much?", is somewhat more interesting. The difficulty with continuous, analog data lies in communicating something of relevance and urgency in it. If the map is orange, what does that mean exactly? Will my car get broken into?
One approach I've been prodding at takes advantage of a neighborhood sense for time and space. People know how big a city block is, how it feels for a month to go by. We know something of this in our database of crime reports too, so the colors in these experimental designs are keyed to specific meanings. Orange here denotes areas where, on average, the police respond to a call once per month for every 100m x 100m city block. Inside orange, there are two more divisions shown as brighter, hotter colors: two weeks and one week. For the police to show up right on your block every week is quite heavy, and there are just a few places in town that see this kind of activity. Outside orange, there are divisions of green that represent an additional month of peace and quiet for every block at each step.
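To make the keying concrete, here is a minimal sketch of how those divisions might be expressed in Python. The function name, the labels, the 30-day month, the choice to treat each named interval as the upper edge of its class, and the cap on green steps are all assumptions for illustration, not the code behind these maps.

```python
import math

# Assumed length of a "month" in days; the post only says "once per month".
DAYS_PER_MONTH = 30


def classify(days_between_reports, max_green_steps=3):
    """Label a 100m x 100m block by its average days between police responses.

    Hypothetical helper: only the one-week, two-week and one-month steps,
    plus "one more month per green step", come from the post.
    """
    if days_between_reports <= 0:
        raise ValueError("days_between_reports must be positive")
    if days_between_reports <= 7:
        return "hot-1-week"
    if days_between_reports <= 14:
        return "hot-2-weeks"
    if days_between_reports <= DAYS_PER_MONTH:
        return "orange-1-month"
    # Outside orange, each green step stands for one additional month of quiet.
    months = math.ceil(days_between_reports / DAYS_PER_MONTH)
    step = min(months - 1, max_green_steps)
    return f"green-{step}"
```

With these assumed boundaries, a block averaging one report every 10 days falls in the two-week class, and one averaging 45 days lands in the first green step.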
At this level, you can start to see where OpenStreetMap data really begins to shine: all those little flag icons represent Oakland public schools that I added to the OSM database specifically to have such local data available to Crimespotting. The Microsoft Virtual Earth maps we use on the current site are beautiful, but they aren't particularly helpful in the way of local, civic data relevant to a consideration of police activity.
As the map zooms in closer, large amorphous blobs particulate into smaller, more granular bleeps and bloops. When you start seeing individual blocks in the map, you can also see individual corner hot spots. Here, the two downtown Oakland BART stations, a slice of MLK between 14th and 16th streets, and the area immediately around Oakland Police headquarters on Broadway and 7th are especially hot. The colors at every zoom level continue to mean the same things: always orange for "once a month, once per block". The colors here are cribbed from Cynthia Brewer's cpt-city work, a combination of YlGn and Oranges.
I'm happy that Lincoln Elementary School seems to sit in a safe zone of relatively low crime.
At a certain point, increased granularity becomes a problem. Our data is really only accurate to the city block level, so it doesn't make sense to generate a heat map more specific than this. The smooth, swooping whorls at the highest levels of zoom help to communicate the relative imprecision of the data at this level.
Overall, I'm happy with the results so far. These images are being generated through a combination of GDAL, Mapnik, NumPy and PIL. They're not yet ready to be integrated into the Crimespotting site proper, though I imagine that the first place they would eventually show up would be on the static map beat pages. I'm interested in comments or criticisms on how to improve the beauty or clarity of these results, before they're pushed in the direction of a proper release.
Wonderful work. Thanks for sharing.
woaw, that's incredible ! Thanks for taking the time to share your experience & thoughts, it's highly appreciable :)
This is beautiful work, and I like that you thought hard about the metrics (police visits per unit time) for generating the heat maps. But I wonder if perhaps one might draw the wrong conclusions from these maps. Looking at the examples above, I might avoid taking BART as the area around the stations looks quite dangerous. Yet the high activity in those areas might actually reflect the fact that the BART stations have people around them fairly often, reporting criminal or suspicious activity. This would result in more police visits, but not because those areas are particularly dangerous. In fact, they may actually be safer. The fact that the police station is the other high-activity area in your example would seem to support this hypothesis. The *really* dangerous blocks in Oakland are those where people *don't* call the cops when they witness a crime, and thus wouldn't show up in the police department's crime statistics, or your visualizations. All of which is just a roundabout way of saying that as we progress in making data available and using it in a myriad of ways, we can't stop questioning the processes by which that data is generated, and what we therefore don't see or can't see in that data.
Awesome, Mike. These look fantastic, and I'm intrigued by your methodology. We're grappling with similar challenges of geographical data visualization at EveryBlock, so it's helpful to see a great example in context with tools similar to ours. What challenges are there in putting these into production?
Ryan: You're absolutely right that population density and willingness to call the police play into this in a big way. I was thinking it might be interesting to correlate the report data further with population, but I think that census info might not do a great job of communicating how crowded downtown Oakland is during the day, vs. how deserted it can be at other times. I've tried to be fairly scrupulous in this project with terminology like "crime reports" vs. "crimes", since of course there's a bizarre don't-snitch norm in certain parts of town here that keeps many events unreported. A good friend took me on a tour of Baltimore recently, and pointed out how the most dangerous-looking neighborhoods, populated with the shadiest-seeming characters, were actually relatively well-off, with their local Jacobsian eyes-on-the-street. The slice of hot orange around Peralta St. in this image is another example of a place that has high police activity, but also an active, watchful population that makes it feel relatively safe even at night: http://mike.teczno.com/img/oakland-heatmapping/map-14.png
Paul: The actual color parts are super-quick to run, since the data resolution is so low to start with. Most of the heavy calculation seems to be in making the Mapnik overlays, but tile caching helps ensure that it only needs to be done once, at the beginning. The biggest challenge I had was a bunch of false starts in using raster images to generate the underlying heatmaps. I tried gdal_grid after a tip from Josh Livni, but it didn't seem to be quite the right tool for generating smooth, flowing transitions from point to point. Instead I used the GDAL python bindings to manually create TIFF files with one pixel for each data point, you can see them here at a variety of resolutions:
http://mike.teczno.com/img/oakland-heatmapping/heat.14.tif
http://mike.teczno.com/img/oakland-heatmapping/heat.15.tif
...
http://mike.teczno.com/img/oakland-heatmapping/heat.18.tif
If you run gdalinfo on them, you'll see that they're quite small, but correctly georeferenced. The code used to generate them can be found at http://dpaste.com/hold/103818/
gdalwarp stretches them out to the proper coverage area and also throws in the cubic interpolation that makes the smooth blobs look so good. Here's an example call:
gdalwarp -t_srs "+proj=merc +datum=WGS84 +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +no_defs" -te -13618784.782090 4519863.480379 -13603497.376947 4531280.995883 -ts 800 600 -of GTiff -r cubic -dstnodata 6 heat.16.tif out-warped.tif
The actual values in the TIFF are floating point, each pixel is equal to log(#days). I added the logarithm because I was getting some funky artefacts out of the cubic interpolation in high-volume areas, that made them seem a lot busier than they really were. On the other end there's this bit of code that turns the warped TIFF into a visible color map: http://dpaste.com/hold/103821/
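The linked code isn't reproduced here, but the log(#days) idea can be sketched roughly in plain Python. This is only an illustration under stated assumptions: it reads "#days" as the average number of days between reports in a cell, uses projected coordinates in meters, and skips the GDAL step that writes the values into a georeferenced TIFF. The function name and its parameters are hypothetical.

```python
import math
from collections import Counter


def log_days_grid(points, cell_size, period_days):
    """Bin (x, y) report locations into square cells of side cell_size and
    return {(col, row): log(average days between reports)} for non-empty cells.

    Hypothetical helper: assumes projected coordinates in the same units as
    cell_size, and reports collected over a window of period_days.
    """
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in points
    )
    # More reports in the window means fewer days between them; the log
    # compresses the range before any later interpolation.
    return {cell: math.log(period_days / n) for cell, n in counts.items()}
```

For example, three reports over 60 days at (10, 10), (50, 90) and (150, 20) with 100m cells give two populated cells: one averaging 30 days between reports and one averaging 60.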
Andy, Tim, Jonathan: thank you!
Awesome. Finally SimCity 2000 in real life. Now we know where to put the new police station ;) I like your choice of colors.
Ryan, remind me to ask you about particle systems and hachures sometime. =)
swooping in from Pointy Haired Dilbert --- and just gotta say: this is really awesome. Keep up the great work.
simple. brilliant. love it, mike. it's a big DUH! Like wheels on the bottom of a suitcase as eric says. seems obvious but when it's done it changes the whole world.
Mike, I think Ryan's got an interesting point about the BART stations. It would be fascinating if you could find a proxy for the presence of people over time - in places with lots of people, we might expect to find more crime, if only crimes of opportunity like pickpocketing, etc. Makes perfect sense that BART stations, as loci for lots of human activity would also be high crime areas. Are there data sets you could overlay that show population density? Or density of commercial operations? It would be intriguing to find out if there are low-density neighborhoods with high crime incidence - that would be a neighborhood to worry about, perhaps more so than a high density, high incidence neighborhood? As always, gorgeous, fascinating work...