tecznotes
Michal Migurski's notebook, listening post, and soapbox. Subscribe to this blog. Check out the rest of my site as well.
May 28, 2007 1:49am
oakland crime maps VII: public indexes redux
Earlier this month, I described a way of publishing a database in a RESTful, static form. Since then I've been tweaking the data into a more presentable state, which I'll describe in this post.
Also I promise that next time, there'll actually be something to look at beyond of me noodling with computer pseudo-science.
When I first opened up the Oakland crime index, I published data in two forms: data about crime was stored in day/type resources, e.g. May 3rd murders or Jan 1st robberies, while binary-search indexes on the case number, latitude, and longitude were published with pointers to the day/type resources. As I've experimented with code to consume the data and kicked these ideas around with others, a few obvious changes had to be made:
First, the separate b-trees on latitude and longitude had to go. Location is 2-dimensional, and requires an appropriate index to fit. I had initially expected to use r-trees but found that quadtrees, a special case, made the most sense. These are closest in spirit to the b-tree, and unlike the r-tree each sub-index does not overlap with any other.
Second, space and time are intricately related, so spatiotemporal index was an obvious next step. I chose an oct-tree of latitude, longitude, and time. Again, this is a simple extension of the b-tree, and provides for simple answers like "show all crimes that are within a mile of a given point, for the following dates..."
Third, I was being too literal with the indexes, insisting that traversing the trees should ultimately lead back to a link to a specific day/type listing. Although this is how a real database index might work, in the context of an index served over HTTP, a large number of transactions can be avoided by just dropping the actual data right into the index. To understand what this means, compare the CSS-styled output of the various indexes to the HTML source: the complete data for each crime is stashed in a display: none block right in the appropriate node.
Finally, my initial implementation used the binary tree lingo "left" and "right" to mark the branches in each index. I've replaced this with more obvious "before", "after", "north", "south", "east", and "west" for greater ease of human-readability and consumption.
I'm still hosting the data on Amazon's S3, but a recent billing change is making me re-think the wisdom of doing this:
New Pricing (effective June 1st, 2007): $0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests.
Eep.
In one week, S3 is going to go from a sensible storage/hosting platform for data consisting of many tiny resources, to one optimized for data consisting of fewer, chunkier resources; think movies instead of tiles. I can see the logic behind this: S3's processing overhead for serving a million 1KB requests must be substantial compared to serving a thousand 1MB requests. Still, it makes my strategy of publishing these indexes as large collections of tiny files, many of which will never be accessed, start to seem a bit problematic.
The obvious answer is to stash them on the filesystem, which I plan to do. However, there is one feature of S3 that I'm going to miss: when publishing data to their servers, any HTTP header starting with "X-AMZ-Meta-" got to ride along as metadata, allowing me to easily implement a variant of mark and sweep garbage collection when posting updates to the indexes. This made it tremendously easy to simulate atomic updates by keeping the entire index tree around for at least 5 minutes after a replacement tree was put in place, a benefit for slow clients.
When I move the index to a non-S3 location before my Amazon-imposed June 1st deadline, I will no longer have the benefit of per-resource metadata to work with.
For next time: code to consume this, code to show it.