tecznotes

Michal Migurski's notebook, listening post, and soapbox. Subscribe to this blog. Check out the rest of my site as well.

May 7, 2007 5:24pm

oakland crime maps VI: public, indexed data

Things have been generally quiet on the Oakland crime scraping front since we released Modest Maps and I demonstrated some potential display ideas for the crime report records I'm borrowing from the Oakland PD. Here, I describe how I've chosen to make the data public in a purely-RESTful way with indexes.

The small demo at that second link above hooks up to a quick database-driven web service written in PHP, and making it live drove home the point that hosting live databases is tedious and unsatisfying.

Meanwhile, Tom Coates is drumming away about natives to a web of data, Matt Biddulph is telling information architects about RDF and API's, and Mark Atwood is releasing S3-backed MySQL storage engines. Putting these threads together suggests an interesting, or at least more durable, way of publishing pure data on the web. The MySQL engine is an interesting stake in the ground, but it hides its data and its index (the two primary components of a relational database) behind the usual MySQL server process. The contents of storage aren't open to data consumers, ditching many of the cost and scale advantages of a service like S3 by piping it all through your annoying old DB server. Tom and Matt already have the data-on-the-web bit covered, so I'm going to do something about the index.

Indexes to a database table are exactly what they are to anything else: a faster way to look up information than scanning through it all in order. It's how you jump straight to the "M's" in the phone book without a lot of paging back and forth. The most popular style of index is something called a binary tree. Imagine looking for a particular word in the dictionary: you open the book up to some page in the middle of the book, check to see whether your word is before, on, or after the current page, and then move back and forward in the book in large chunks of pages until you've found what you're searching for. This is generally much faster than starting at "A" and turning single pages to find your word. A binary tree works the same way.

Indexes are rarely exposed, even on good web-of-data citizens. Both Flickr and Twitter make it somewhat difficult to move through giant lists, though not anymore difficult than other sites. Meanwhile, the databases quietly running these services are wildly denormalized and indexed like crazy, making it possible to rapidly generate those long, long lists.

For the crime reports, I started by just getting the data up and public. It's at predictable URL's, like these:

If you are looking for crimes on a particular date with a particular type, you just ask for a guessable URL. This is in effect the primary key: the natural, internal storage format for the data. Most common types of crime happen on most days, so the majority of date/type combinations should Just Work, and a simple HTTP 4XX error tells you when there is no match. I've chosen to publish in XHTML format for two reasons: the markup is highly semantic, making it simultaneously machine-readable and human-readable. Realistically, I'll be adding JSON and POX pages soon.

Unfortunately, if you're looking for a particular case number, or crimes at a particular location, it would require hunting through every page of crimes. In database terms, this is known as a table scan, and is something to be avoided at all costs. Instead, I've created a set of indexes to the data, demonstrating the key trade-off: an index helps you find what you want, but takes space to store and time to calculate. Following the Case Number link above takes you to a page with a long, nested list on it, a binary search tree. The idea is that you enter looking for a particular case number or range of case numbers. You start by comparing the one you want to the one at the top of the page. If they match, you're done. If yours is smaller, you proceed to the first nested list. If it's larger, you proceed to the second. Eventually, you arrive at the number you want and get back a pointer to one of the date/type pages above where that particular case number can be found. For example, searching for case number 07-015248 gets you Oakland-2007-02-22-ROBBERY.html.

I've also chosen to use b-trees for latitude and longitude, but these will soon be replaced: r-trees are a similar format more suitable to two-dimensional information used by geographic systems such as PostGIS.

In a database, this link-following and tree-climbing process happens very quickly on a single server, ideally in RAM with a minimal number of disk hits. In the scheme I use, a lot of the processing overhead is offloaded to smarter clients: Flash or Ajax apps that know they're looking at an index, and understand a thing or two about traversing data structures. Disk access is replaced by network access. The information is chunkier (longer lists, fewer requests) to minimize network overhead as much as possible, but it's certainly not going to be as speedy as a connection to a real database. There's a short list of reasons to do this:

A "database" that offers nothing but static file downloads will likely be more scalable than one that needs to do work internally. This architecture is even more shared-nothing than systems with multiple database slaves.
Not needing a running process to serve requests makes publishing less of a headache.
I'm using Amazon Web Services to do the hosting, and their pricing plans make it clear that bandwidth and storage are cheap, while processing is expensive. Indexes served over HTTP optimize for the former and make the latter unnecessary. It's interesting to note that the forthcoming S3 pricing change is geared toward encouraging chunkier blocks of data.
The particular data involved is well-suited to this method. A lot of current web services are optimized for heavy reads and infrequent writes. Often, they use a MySQL master/slave setup where the occasional write happens on one master database server, and a small army of slaves along with liberal use of caching makes it possible for large numbers of concurrent users to read. Here, we've got infrequently-updated information from a single source, and no user input whatsoever. It makes sense for the expensive processing of uploading and indexing to happen in one place, about once per day.

I'm reasonably happy with this so far, but I haven't yet written a smart client to take advantage of it. The near-term plan is to replace the two latitude/longitude indexes with a single spatial index, and then revisit the whole thing after I have an idea of how complicated it is to consume.

Comments

Sorry, no new comments on old posts.

permanent link | tecznotes