tecznotes

Michal Migurski's notebook, listening post, and soapbox.

Mar 3, 2008 12:09am

oakland crime maps X: return of the jedi

We launched Oakland Crimespotting back in August, and all was well for a short time. There were friendly mails from Pete Wevurski, John Russo, and others who liked what we were up to. Unfortunately, we ran afoul of Oakland's website availability, and by late October it became completely impossible for us to collect data at a sustainable rate. We closed up shop and replaced the front page of the site with an apology and a promise.

After several months of general stagnation, Oakland City IT reconnected us to a current, reliable, and accessible data source in January, and I can now confirm that it all Just Works.

There are a few bits of New sprinkled throughout the site.

We've added pages for individual police beats, such as this one for 04X, where I live. A large number of our users asked for these, though truthfully it wasn't something I expected. I've been historically critical of the forms-first approach that CrimeView Community takes ("Easy wizard interface"), eschewing it in favor of a maps-first approach. Changing standards of cheapness are a recent interest of mine, and it's cheaper to show everything. Expect to hear more of this from Tom at E-Tech tomorrow. In fact, Police Service Area and City Council District aren't ways that Oakland residents commonly locate themselves. The Police department is organized into beats, and this turns out to be the right way to interface with them if you're a concerned, active citizen. Each beat has a consistent set of officers and public contact information. Oakland CTO Bob Glaze told me the beat designations haven't changed in decades. Clearly, maps and data for individual beats were going to be necessary.

Each beat page features a map of recent reports in that area. These maps are the result of Aaron's heroic work in extending Modest Maps' static mapping abilities. WS-compose is now a sweet little map generator that will happily report geographic dot locations in HTTP response headers if you ask it nicely, among other tricks.

There are also per-beat news feeds and downloadable spreadsheets of detailed information for neighborhood crime prevention councils.

The other addition is a proper comment feature. In the past, we've had an error report form on each crime report page where residents could alert us to improperly-placed reports or other mistakes, but this wasn't as effective as it could have been. The primary problem was that posting an error report didn't really set off any alarm bells, and it certainly didn't appear on the site anywhere. I've grown to feel that replacing a clunky web interface with a mute one isn't necessarily much of an improvement, so it's valuable to provide a direct feedback mechanism right there on the site.

The error reports have now been replaced by actual comment forms where you can leave your name, a message, and an optional link at the bottom of each individual report page. The comments are keyed on the case number, so case numbers with multiple reports share a set of comments. Right now these just look like regular blog comments, but the intent of the link is to add news articles or connect reports to one another. I hope very much to see this feature of the site grow into something interesting and unexpected.

Here is the mail I sent last month announcing our return:

Hello Everyone,
We're happy to announce that Oakland Crimespotting is back, thanks to the generous help of Oakland's City Information Technology Department. After three months without access to report data, we've been granted a reliable, regularly-updated source of crime report information. This is great news: it means that the website is back up and running with current information, e-mail alerts and RSS feeds work again, and we at Stamen Design can explore new ways of presenting and publishing this important information.
Here are a few things you can do, now:
  • Visit the site at http://oakland.crimespotting.org/.
  • View a map at http://oakland.crimespotting.org/map/.
  • Sign up for alerts at http://oakland.crimespotting.org/alerts.
We are also interested in what additions to the site you would find useful or interesting. So far, we've had a number of suggestions that we're actively looking into: spreadsheet-friendly downloads, details on individual police beats, a search function, and more than one month's worth of data. If you have any thoughts on these or other ideas, send us a mail at info@crimespotting.org.
Our return would not have been possible without the help of a few key people. Ahsan Baig, Ken Gordon, and Bob Glaze at Oakland City IT built and published a source of information for us. Ted Shelton, Charles Waltner, and others helped us navigate the difficult waters of City Hall communications. Jason Schultz, Ryan Wong, Karla Ruiz, and Jeremy Brown at U.C. Berkeley Law School helped us understand how to best approach city governments for information. Kathleen Kirkwood and Pete Wevurski at The Oakland Tribune helped us understand the journalistic context of the project. Dan O'Neil and Adrian Holovaty at EveryBlock.com were valuable sounding boards for ideas.

Aug 20, 2007 11:53pm

oakland crime maps IX: post-launch

Last week, we launched Oakland Crimespotting, capping off eight months of the occasional data sketching I've been recording on this site. I've covered a few speculative topics here that didn't graduate to the public version of the site, and there have been a number of interesting new things that we were sure to add.

The initial work on scraping (post I, post II) is still in use. Thankfully, the city hasn't changed CrimeWatch much since December, so our nightly collection runs are still chugging along happily. We do four collections every evening: past four days, and then individual days a week, two weeks, and one month in the past. The overlap is because we've noticed that the Oakland PD amends and modifies crime reports, and the whole map site is frequently down altogether.
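A minimal sketch of that nightly schedule, in Python rather than the PHP the site runs on. The function name is hypothetical, and the exact windows are assumptions: "past four days" is read as the four days before today, and "one month" as 30 days.

```python
from datetime import date, timedelta

def collection_dates(today):
    """Return the report dates to re-scrape on a given evening.

    Assumed reading of the schedule: each of the past four days,
    plus the single days 7, 14, and 30 days back.
    """
    recent = [today - timedelta(days=n) for n in range(1, 5)]
    lookbacks = [today - timedelta(days=n) for n in (7, 14, 30)]
    return recent + lookbacks

for day in collection_dates(date(2007, 8, 20)):
    print(day.isoformat())
```

Revisiting the same day at several distances is what picks up amended reports, and it gives a missed night a few more chances when the source site is down.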

Two later pieces (post III, post IV) introduced an idea on time-based display, but ultimately it was effective to just drop in the dots and add live draggy/zoomy controls. This is something we've consistently found with other projects, too: it's so often the case that the "right" design is not the technically complicated one, but the one that gets feedback and interactivity just so.

Finally, I wrote up a few pieces (post VI, post VII) on public data indexing. This is something I continue to find interesting, but at the volume of traffic we're pushing, it's totally unnecessary. Turns out MySQL is kind of awesome at this sort of thing.

There are two big features on the map interface that only emerged when designing and developing it with Tom and Eric. The date slider is something that we shamelessly nicked from Measure Map, though we added the bit where per-day columns act as a display showing which data has been loaded. This part is still under active development. The idea is that the background should be draggable, to allow people to navigate back further in time than 30 days.

[Screenshots: Measure Map's date slider, followed by ours.]

The second is the crime type picker, an interface whose affordances we borrowed from Newsmap. This one's quite simple, but it does trigger the visual spotlight effect that makes it possible to pick out crimes of a certain type throughout the map.

[Screenshots: Newsmap's type picker, followed by ours.]

It was important that every view of the map be linkable and sharable, so we imported a number of ideas that Tom developed for our last map project, Trulia Hindsight. The thing to watch for is how the URL of the page you're looking at changes as you pan and zoom around. It can be copied, shared in an e-mail, sent over IM to a friend, and posted in a blog.

An "official" API has not been described or announced, but it will most likely include the site's Atom / GeoRSS feeds. These implement a small subset of the OpenSearch request specification:

  • bbox is a geographical bounding box in the order west, south, east, north.
  • dtstart and dtend are start and end dates, in YYYY-MM-DDTHH:MM:SSZ format.

Look for these hanging off of the /crime-data endpoint.
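Here is a rough sketch of building such a request in Python. The parameter names and the /crime-data path come from the description above; everything else is an assumption, including the comma-separated bbox encoding, passing the values as a plain query string, and treating the datetimes as UTC. Since no official API has been announced, the real feeds may differ.

```python
from urllib.parse import urlencode
from datetime import datetime

# Host plus the /crime-data endpoint named in the post.
BASE = "http://oakland.crimespotting.org/crime-data"

def feed_url(west, south, east, north, start, end):
    """Build a feed URL from a bounding box and a date range.

    Assumes bbox is comma-separated and datetimes are already UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    params = {
        "bbox": ",".join(str(v) for v in (west, south, east, north)),
        "dtstart": start.strftime(fmt),
        "dtend": end.strftime(fmt),
    }
    # Keep commas and colons readable instead of percent-encoding them.
    return BASE + "?" + urlencode(params, safe=",:")

print(feed_url(-122.3, 37.78, -122.2, 37.83,
               datetime(2007, 8, 13), datetime(2007, 8, 20)))
```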

The site is hosted on Amazon's EC2 service, on a 10 cent/hour virtual server running Debian Linux, MySQL, Apache, and PHP. The static maps are generated by Aaron Cope's recent addition to Modest Maps, ws-compose.py. It's a BaseHTTPServer that stitches tiles into map PNGs, and I've been running four of them (and caching the responses) for the past week with no troubles.

I've rediscovered the joys of procedural PHP4 with this project. EC2 has proven to be a real champ, letting us set up a test machine and deploy a living site while always holding out the possibility of migration to a "real" server. At a total of $80/month, the virtual Debian machine may last for a while.

Next steps may include San Francisco and Berkeley.

Aug 15, 2007 4:16pm

oakland crime maps VIII: first public launch

I promised we'd have something to show, right? In response to the red wave of homicides that swept Oakland two weeks ago, Tom and I published a visual map of crime reports in Oakland.

I'll write more later, but for now go and explore.

May 27, 2007 6:49pm

oakland crime maps VII: public indexes redux

Earlier this month, I described a way of publishing a database in a RESTful, static form. Since then I've been tweaking the data into a more presentable state, which I'll describe in this post.

Also I promise that next time, there'll actually be something to look at beyond me noodling with computer pseudo-science.

When I first opened up the Oakland crime index, I published data in two forms: data about crime was stored in day/type resources, e.g. May 3rd murders or Jan 1st robberies, while binary-search indexes on the case number, latitude, and longitude were published with pointers to the day/type resources. As I've experimented with code to consume the data and kicked these ideas around with others, a few obvious changes had to be made:

First, the separate b-trees on latitude and longitude had to go. Location is 2-dimensional, and requires an appropriate index to fit. I had initially expected to use r-trees but found that quadtrees, a special case, made the most sense. These are closest in spirit to the b-tree, and unlike the r-tree each sub-index does not overlap with any other.
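To make the non-overlapping property concrete, here is a small in-memory point quadtree in Python. It is only an illustration of the data structure, not the published index: the class, the capacity of four points per leaf, the depth limit, and the quadrant names are all assumptions, and points are expected to fall inside the root box.

```python
class QuadTree:
    """Point quadtree over a (west, south, east, north) box."""

    MAX_DEPTH = 16  # stop splitting eventually, e.g. for duplicate points

    def __init__(self, west, south, east, north, capacity=4, depth=0):
        self.box = (west, south, east, north)
        self.capacity = capacity
        self.depth = depth
        self.points = []      # (lon, lat, payload) tuples held by a leaf
        self.children = None  # four sub-trees once this leaf has split

    def _middle(self):
        w, s, e, n = self.box
        return (w + e) / 2, (s + n) / 2

    def _child_for(self, lon, lat):
        mx, my = self._middle()
        ns = "north" if lat >= my else "south"
        ew = "east" if lon >= mx else "west"
        return self.children[ns + ew]

    def insert(self, lon, lat, payload):
        if self.children is not None:
            self._child_for(lon, lat).insert(lon, lat, payload)
            return
        self.points.append((lon, lat, payload))
        if len(self.points) > self.capacity and self.depth < self.MAX_DEPTH:
            self._split()

    def _split(self):
        w, s, e, n = self.box
        mx, my = self._middle()
        args = (self.capacity, self.depth + 1)
        # The four quadrants tile the parent box without overlapping.
        self.children = {
            "northwest": QuadTree(w, my, mx, n, *args),
            "northeast": QuadTree(mx, my, e, n, *args),
            "southwest": QuadTree(w, s, mx, my, *args),
            "southeast": QuadTree(mx, s, e, my, *args),
        }
        points, self.points = self.points, []
        for lon, lat, payload in points:
            self._child_for(lon, lat).insert(lon, lat, payload)

    def query(self, west, south, east, north):
        """Return payloads of all points inside the query box."""
        w, s, e, n = self.box
        if east < w or west > e or north < s or south > n:
            return []  # query box misses this node entirely
        if self.children is None:
            return [p for (lon, lat, p) in self.points
                    if west <= lon <= east and south <= lat <= north]
        found = []
        for child in self.children.values():
            found.extend(child.query(west, south, east, north))
        return found
```

Because each point belongs to exactly one quadrant, a lookup follows a single branch at every level, which is the same shape of traversal as the one-dimensional trees.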

Second, space and time are intricately related, so a spatiotemporal index was an obvious next step. I chose an oct-tree of latitude, longitude, and time. Again, this is a simple extension of the b-tree, and provides simple answers to questions like "show all crimes that are within a mile of a given point, for the following dates..."

Third, I was being too literal with the indexes, insisting that traversing the trees should ultimately lead back to a link to a specific day/type listing. Although this is how a real database index might work, in the context of an index served over HTTP, a large number of transactions can be avoided by just dropping the actual data right into the index. To understand what this means, compare the CSS-styled output of the various indexes to the HTML source: the complete data for each crime is stashed in a display: none block right in the appropriate node.

Finally, my initial implementation used the binary tree lingo "left" and "right" to mark the branches in each index. I've replaced this with more obvious "before", "after", "north", "south", "east", and "west" for greater ease of human-readability and consumption.
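Putting the oct-tree and the readable branch names together, a traversal might look something like the Python sketch below. The function name, the way the three labels are joined at each level, and the use of plain numbers for time are assumptions made for illustration; the published index may lay its branches out differently.

```python
def octant_path(point, bounds, depth):
    """Return the branch labels leading to the cell that holds `point`.

    point  -- (lat, lon, time) as plain numbers
    bounds -- ((lat_min, lat_max), (lon_min, lon_max), (t_min, t_max))
    depth  -- how many levels of the tree to descend
    """
    names = (("south", "north"), ("west", "east"), ("before", "after"))
    bounds = [list(pair) for pair in bounds]
    path = []
    for _ in range(depth):
        step = []
        for axis, value in enumerate(point):
            low, high = bounds[axis]
            mid = (low + high) / 2
            if value >= mid:
                step.append(names[axis][1])
                bounds[axis][0] = mid   # keep the upper half
            else:
                step.append(names[axis][0])
                bounds[axis][1] = mid   # keep the lower half
        path.append("-".join(step))
    return path

# Hypothetical cell: an Oakland-ish box and a time range of 0..100.
print(octant_path((37.86, -122.27, 60),
                  ((37.70, 37.90), (-122.35, -122.10), (0, 100)), 2))
```

Each level halves all three dimensions at once, so one step picks one of eight children.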

I'm still hosting the data on Amazon's S3, but a recent billing change is making me re-think the wisdom of doing this:

New Pricing (effective June 1st, 2007): $0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests.

Eep.

In one week, S3 is going to go from a sensible storage/hosting platform for data consisting of many tiny resources, to one optimized for data consisting of fewer, chunkier resources; think movies instead of tiles. I can see the logic behind this: S3's processing overhead for serving a million 1KB requests must be substantial compared to serving a thousand 1MB requests. Still, it makes my strategy of publishing these indexes as large collections of tiny files, many of which will never be accessed, start to seem a bit problematic.
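A quick back-of-the-envelope calculation in Python, using only the two request prices quoted above. The request counts are hypothetical round numbers, and storage and bandwidth charges are left out.

```python
def request_cost_dollars(puts=0, gets=0):
    """Request charges under the quoted pricing:
    1 cent per 1,000 PUT/LIST requests, 1 cent per 10,000 GET requests."""
    cents = puts / 1000 + gets / 10000
    return cents / 100

# Hypothetical: publishing an index made of a million tiny files...
print(request_cost_dollars(puts=1_000_000))
# ...versus clients reading a million of them.
print(request_cost_dollars(gets=1_000_000))
```

At these prices a million uploads costs about $10 and a million reads about $1, so it is the repeated re-publishing of many small files that stings, more than serving them.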

The obvious answer is to stash them on the filesystem, which I plan to do. However, there is one feature of S3 that I'm going to miss: when publishing data to their servers, any HTTP header starting with "X-AMZ-Meta-" got to ride along as metadata, allowing me to easily implement a variant of mark and sweep garbage collection when posting updates to the indexes. This made it tremendously easy to simulate atomic updates by keeping the entire index tree around for at least 5 minutes after a replacement tree was put in place, a benefit for slow clients.
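The idea can be sketched in Python with a dictionary standing in for the bucket. This is not the code I ran against S3: the function names are hypothetical, and the "marked" field plays the role that a per-resource metadata header played there.

```python
GRACE_SECONDS = 5 * 60  # keep the old tree around for slow clients

def publish(store, resources, now):
    """Write a new index tree, stamping every resource with `now` (the mark)."""
    for key, body in resources.items():
        store[key] = {"body": body, "marked": now}

def sweep(store, latest_publish, now):
    """Delete resources the latest publish did not re-mark,
    but only once the grace period has passed. Returns the deleted keys."""
    if now - latest_publish < GRACE_SECONDS:
        return []
    stale = [key for key, item in store.items()
             if item["marked"] < latest_publish]
    for key in stale:
        del store[key]
    return stale

store = {}
publish(store, {"node-a": "old", "node-b": "old"}, now=0)
publish(store, {"node-a": "new", "node-c": "new"}, now=1000)
print(sweep(store, latest_publish=1000, now=1100))  # still inside the grace period
print(sweep(store, latest_publish=1000, now=1300))  # old-only nodes are collected
```

A client that started walking the old tree just before the update can still finish, since nothing it might reach is removed until the grace period is over.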

When I move the index to a non-S3 location before my Amazon-imposed June 1st deadline, I will no longer have the benefit of per-resource metadata to work with.

For next time: code to consume this, code to show it.

May 7, 2007 10:24am

oakland crime maps VI: public, indexed data

Things have been generally quiet on the Oakland crime scraping front since we released Modest Maps and I demonstrated some potential display ideas for the crime report records I'm borrowing from the Oakland PD. Here, I describe how I've chosen to make the data public in a purely-RESTful way with indexes.

The small demo at that second link above hooks up to a quick database-driven web service written in PHP, and making it live drove home the point that hosting live databases is tedious and unsatisfying.

Meanwhile, Tom Coates is drumming away about natives to a web of data, Matt Biddulph is telling information architects about RDF and APIs, and Mark Atwood is releasing S3-backed MySQL storage engines. Putting these threads together suggests an interesting, or at least more durable, way of publishing pure data on the web. The MySQL engine is an interesting stake in the ground, but it hides its data and its index (the two primary components of a relational database) behind the usual MySQL server process. The contents of storage aren't open to data consumers, ditching many of the cost and scale advantages of a service like S3 by piping it all through your annoying old DB server. Tom and Matt already have the data-on-the-web bit covered, so I'm going to do something about the index.

Indexes to a database table are exactly what they are to anything else: a faster way to look up information than scanning through it all in order. It's how you jump straight to the "M's" in the phone book without a lot of paging back and forth. The most popular style of index is something called a binary tree. Imagine looking for a particular word in the dictionary: you open the book up to some page in the middle of the book, check to see whether your word is before, on, or after the current page, and then move back and forward in the book in large chunks of pages until you've found what you're searching for. This is generally much faster than starting at "A" and turning single pages to find your word. A binary tree works the same way.
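The dictionary analogy is easy to check with a few lines of Python. This is a toy comparison over a sorted list of page numbers, with hypothetical function names, and it says nothing about how any particular database stores its indexes.

```python
def linear_steps(items, target):
    """Count pages turned when starting at the front."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return None

def binary_steps(items, target):
    """Count pages opened when halving the remaining range each time."""
    low, high, steps = 0, len(items) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            return steps
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return None

pages = list(range(1, 1001))
print(linear_steps(pages, 900), binary_steps(pages, 900))
```

For a thousand sorted entries the halving approach never needs more than ten looks, while the front-to-back scan can need all thousand.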

Indexes are rarely exposed, even on good web-of-data citizens. Both Flickr and Twitter make it somewhat difficult to move through giant lists, though not any more difficult than other sites. Meanwhile, the databases quietly running these services are wildly denormalized and indexed like crazy, making it possible to rapidly generate those long, long lists.

For the crime reports, I started by just getting the data up and public. It's at predictable URLs, like these:

If you are looking for crimes on a particular date with a particular type, you just ask for a guessable URL. This is in effect the primary key: the natural, internal storage format for the data. Most common types of crime happen on most days, so the majority of date/type combinations should Just Work, and a simple HTTP 4XX error tells you when there is no match. I've chosen to publish in XHTML format for two reasons: the markup is highly semantic, making it simultaneously machine-readable and human-readable. Realistically, I'll be adding JSON and POX pages soon.
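As a sketch of what "guessable" means, here is a small Python helper that builds a resource name from a date and a crime type. The pattern is inferred from a single example filename quoted in this post, Oakland-2007-02-22-ROBBERY.html; the function name is hypothetical, and how multi-word crime types are written is unknown, so treat this as an assumption rather than the published scheme.

```python
from datetime import date

def resource_name(day, crime_type, city="Oakland"):
    """Guess the day/type resource name, e.g. Oakland-2007-02-22-ROBBERY.html.

    Assumes single-word crime types written in upper case.
    """
    return "{}-{}-{}.html".format(city, day.isoformat(), crime_type.upper())

print(resource_name(date(2007, 2, 22), "robbery"))
```

A client would request the resulting name and treat a 4XX response as "no reports of that type on that day".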

Unfortunately, if you're looking for a particular case number, or crimes at a particular location, it would require hunting through every page of crimes. In database terms, this is known as a table scan, and is something to be avoided at all costs. Instead, I've created a set of indexes to the data, demonstrating the key trade-off: an index helps you find what you want, but takes space to store and time to calculate. Following the Case Number link above takes you to a page with a long, nested list on it, a binary search tree. The idea is that you enter looking for a particular case number or range of case numbers. You start by comparing the one you want to the one at the top of the page. If they match, you're done. If yours is smaller, you proceed to the first nested list. If it's larger, you proceed to the second. Eventually, you arrive at the number you want and get back a pointer to one of the date/type pages above where that particular case number can be found. For example, searching for case number 07-015248 gets you Oakland-2007-02-22-ROBBERY.html.
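The walk described above can be sketched in Python with nested tuples standing in for the nested lists on the index page. Only the root entry is the real example from this post; the other two case numbers and page names are made-up placeholders, and the sketch assumes case numbers share a format so that plain string comparison orders them correctly.

```python
def find_case(node, case_number):
    """Walk a nested (key, pointer, smaller, larger) tree.

    Returns the pointer for case_number, or None if it is not indexed.
    """
    while node is not None:
        key, pointer, smaller, larger = node
        if case_number == key:
            return pointer
        # Smaller numbers live in the first nested list, larger in the second.
        node = smaller if case_number < key else larger
    return None

index = (
    "07-015248", "Oakland-2007-02-22-ROBBERY.html",
    ("07-000001", "placeholder-earlier.html", None, None),  # made-up entry
    ("07-099999", "placeholder-later.html", None, None),    # made-up entry
)

print(find_case(index, "07-015248"))
```

In the HTTP version each step down the tree may cost a request, which is why the lists are kept long and the number of hops small.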

I've also chosen to use b-trees for latitude and longitude, but these will soon be replaced: r-trees are a similar format more suitable to two-dimensional information used by geographic systems such as PostGIS.

In a database, this link-following and tree-climbing process happens very quickly on a single server, ideally in RAM with a minimal number of disk hits. In the scheme I use, a lot of the processing overhead is offloaded to smarter clients: Flash or Ajax apps that know they're looking at an index, and understand a thing or two about traversing data structures. Disk access is replaced by network access. The information is chunkier (longer lists, fewer requests) to minimize network overhead as much as possible, but it's certainly not going to be as speedy as a connection to a real database. There's a short list of reasons to do this:

  1. A "database" that offers nothing but static file downloads will likely be more scalable than one that needs to do work internally. This architecture is even more shared-nothing than systems with multiple database slaves.
  2. Not needing a running process to serve requests makes publishing less of a headache.
  3. I'm using Amazon Web Services to do the hosting, and their pricing plans make it clear that bandwidth and storage are cheap, while processing is expensive. Indexes served over HTTP optimize for the former and make the latter unnecessary. It's interesting to note that the forthcoming S3 pricing change is geared toward encouraging chunkier blocks of data.
  4. The particular data involved is well-suited to this method. A lot of current web services are optimized for heavy reads and infrequent writes. Often, they use a MySQL master/slave setup where the occasional write happens on one master database server, and a small army of slaves along with liberal use of caching makes it possible for large numbers of concurrent users to read. Here, we've got infrequently-updated information from a single source, and no user input whatsoever. It makes sense for the expensive processing of uploading and indexing to happen in one place, about once per day.

I'm reasonably happy with this so far, but I haven't yet written a smart client to take advantage of it. The near-term plan is to replace the two latitude/longitude indexes with a single spatial index, and then revisit the whole thing after I have an idea of how complicated it is to consume.
