Michal Migurski's notebook, listening post, and soapbox.

May 27, 2007 9:49pm

oakland crime maps VII: public indexes redux

Earlier this month, I described a way of publishing a database in a RESTful, static form. Since then I've been tweaking the data into a more presentable state, which I'll describe in this post.

Also, I promise that next time there'll actually be something to look at beyond me noodling with computer pseudo-science.

When I first opened up the Oakland crime index, I published data in two forms: data about crime was stored in day/type resources, e.g. May 3rd murders or Jan 1st robberies, while binary-search indexes on the case number, latitude, and longitude were published with pointers to the day/type resources. As I've experimented with code to consume the data and kicked these ideas around with others, a few obvious changes had to be made:

First, the separate b-trees on latitude and longitude had to go. Location is two-dimensional, and needs an index that fits. I had initially expected to use r-trees, but found that quadtrees, a special case, made the most sense. These are closest in spirit to the b-tree, and unlike the r-tree, no sub-index overlaps with any other.
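A minimal quadtree sketch in Python illustrates the idea (class and field names are mine, not the published index format): each node covers a bounding box, and a split produces four non-overlapping quadrants.

```python
# Minimal quadtree sketch: each node covers a lat/lon bounding box and
# splits into four non-overlapping quadrants, unlike an r-tree whose
# sub-indexes may overlap. Names are illustrative only.

class QuadNode:
    def __init__(self, south, west, north, east, capacity=4):
        self.bounds = (south, west, north, east)
        self.points = []          # (lat, lon, payload) tuples
        self.children = None      # four QuadNodes after a split
        self.capacity = capacity

    def insert(self, lat, lon, payload):
        if self.children is None:
            self.points.append((lat, lon, payload))
            if len(self.points) > self.capacity:
                self._split()
        else:
            self._child_for(lat, lon).insert(lat, lon, payload)

    def _split(self):
        south, west, north, east = self.bounds
        mid_lat, mid_lon = (south + north) / 2, (west + east) / 2
        self.children = [
            QuadNode(south, west, mid_lat, mid_lon),   # southwest
            QuadNode(south, mid_lon, mid_lat, east),   # southeast
            QuadNode(mid_lat, west, north, mid_lon),   # northwest
            QuadNode(mid_lat, mid_lon, north, east),   # northeast
        ]
        for lat, lon, payload in self.points:
            self._child_for(lat, lon).insert(lat, lon, payload)
        self.points = []

    def _child_for(self, lat, lon):
        south, west, north, east = self.bounds
        mid_lat, mid_lon = (south + north) / 2, (west + east) / 2
        row = 0 if lat < mid_lat else 2
        col = 0 if lon < mid_lon else 1
        return self.children[row + col]

    def query(self, south, west, north, east):
        """Yield payloads falling inside the query box."""
        s, w, n, e = self.bounds
        if south > n or north < s or west > e or east < w:
            return
        for lat, lon, payload in self.points:
            if south <= lat <= north and west <= lon <= east:
                yield payload
        for child in self.children or []:
            yield from child.query(south, west, north, east)
```

Because the four children partition the parent's box exactly, a point always has one unambiguous home, which keeps traversal as simple as a b-tree's.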

Second, space and time are intricately related, so a spatiotemporal index was an obvious next step. I chose an oct-tree of latitude, longitude, and time. Again, this is a simple extension of the b-tree, and makes it easy to answer questions like "show all crimes that are within a mile of a given point, for the following dates..."
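The branching step of such an oct-tree can be sketched in a few lines (the function and bucket names here are my own illustration): three midpoint comparisons, one per dimension, select one of eight children.

```python
from datetime import datetime

# Sketch of the oct-tree branching step: three midpoint comparisons
# (latitude, longitude, time) pick one of eight children. The bucket
# names follow the compass/before/after vocabulary described in the
# post; the function itself is illustrative, not the real code.

def octant(lat, lon, when, mid_lat, mid_lon, mid_time):
    ns = "north" if lat >= mid_lat else "south"
    ew = "east" if lon >= mid_lon else "west"
    ba = "after" if when >= mid_time else "before"
    return (ns, ew, ba)
```

Each level of the tree halves all three dimensions at once, so a query constrained in both space and time prunes branches quickly.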

Third, I was being too literal with the indexes, insisting that traversing the trees should ultimately lead back to a link to a specific day/type listing. Although this is how a real database index might work, in the context of an index served over HTTP, a large number of transactions can be avoided by just dropping the actual data right into the index. To understand what this means, compare the CSS-styled output of the various indexes to the HTML source: the complete data for each crime is stashed in a display: none block right in the appropriate node.
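Dropping the data into the index might look something like this when generating a node page (the markup and field names are illustrative, not the actual published pages): the full record rides along in a hidden block, so a client that has navigated to the right node needs no further requests.

```python
# Sketch of embedding a complete record directly in an index node's
# HTML, so a client that reaches this node doesn't have to fetch the
# day/type resource separately. Markup and fields are illustrative.

def render_node_html(case_number, crime_type, date, lat, lon):
    return (
        '<div class="crime">\n'
        '  <a href="crime/%s/%s.html">%s, %s</a>\n'
        '  <div class="data" style="display: none">\n'
        '    case: %s; type: %s; date: %s; lat: %.5f; lon: %.5f\n'
        '  </div>\n'
        '</div>' % (date, crime_type, crime_type, date,
                    case_number, crime_type, date, lat, lon)
    )
```

The styled page shows only the link, while any consumer parsing the HTML source gets the whole record for free.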

Finally, my initial implementation used the binary tree lingo "left" and "right" to mark the branches in each index. I've replaced this with the more obvious "before", "after", "north", "south", "east", and "west", which are easier for humans to read and for consuming code to work with.

I'm still hosting the data on Amazon's S3, but a recent billing change is making me re-think the wisdom of doing this:

New Pricing (effective June 1st, 2007): $0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests.


In one week, S3 is going to go from a sensible storage/hosting platform for data consisting of many tiny resources, to one optimized for data consisting of fewer, chunkier resources; think movies instead of tiles. I can see the logic behind this: S3's processing overhead for serving a million 1KB requests must be substantial compared to serving a thousand 1MB requests. Still, it makes my strategy of publishing these indexes as large collections of tiny files, many of which will never be accessed, start to seem a bit problematic.
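To put numbers on it under the quoted June 2007 rates: both workloads below move roughly the same gigabyte of data, but the many-tiny-files pattern pays a thousand times the request fees.

```python
# Request-charge arithmetic under the quoted June 2007 rates:
# $0.01 per 10,000 GET requests. A million 1 KB resources and a
# thousand 1 MB resources transfer about the same gigabyte, but
# incur very different request charges.
GET_RATE = 0.01 / 10_000           # dollars per GET request

tiny_files   = 1_000_000 * GET_RATE  # a million 1 KB resources
chunky_files = 1_000 * GET_RATE      # a thousand 1 MB resources
```

That works out to $1.00 in request fees for the tiny-file workload against a tenth of a cent for the chunky one, before any bandwidth or storage charges.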

The obvious answer is to stash them on the filesystem, which I plan to do. However, there is one feature of S3 that I'm going to miss: when publishing data to their servers, any HTTP header starting with "X-AMZ-Meta-" got to ride along as metadata, allowing me to easily implement a variant of mark and sweep garbage collection when posting updates to the indexes. This made it tremendously easy to simulate atomic updates by keeping the entire index tree around for at least 5 minutes after a replacement tree was put in place, a benefit for slow clients.
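That garbage collection scheme might be simulated like this (a sketch only: the real version stamps S3 objects via X-AMZ-Meta- headers, which I've stood in for with an in-memory dict): every upload is marked with its generation, and a sweep deletes objects from older generations only after the grace period has passed.

```python
# Sketch of mark-and-sweep over published index files, simulating
# S3's X-AMZ-Meta- metadata with an in-memory dict. Each object is
# stamped with the generation (a timestamp) that wrote it; a sweep
# removes objects from older generations only once the grace period
# has passed, so a slow client mid-traversal still sees a complete
# old tree. Function names and structure are illustrative.

GRACE_SECONDS = 300  # keep replaced trees around for five minutes

store = {}  # key -> {"data": ..., "meta": {"generation": float}}

def put(key, data, generation):
    """Publish an object, marking it with its generation."""
    store[key] = {"data": data, "meta": {"generation": generation}}

def sweep(current_generation, now):
    """Delete objects not marked with the current generation, but
    only once that generation is older than the grace period."""
    if now - current_generation < GRACE_SECONDS:
        return 0
    stale = [k for k, v in store.items()
             if v["meta"]["generation"] != current_generation]
    for k in stale:
        del store[k]
    return len(stale)
```

Publishing a replacement tree is then just a fresh generation of puts followed by a deferred sweep, which approximates an atomic swap without any server-side coordination.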

When I move the index to a non-S3 location before my Amazon-imposed June 1st deadline, I will no longer have the benefit of per-resource metadata to work with.

For next time: code to consume this, code to show it.

May 27, 2007 1:55pm

data bill of rights

I'm de-cloaking for a moment here to mention John Battelle's excellent Data Bill of Rights, published about a month ago. It popped into relief again for me with the announcement of Google's purchase of FeedBurner, and all the RSS traffic data that rides along.

The rights, enumerated:

  • Data Transparency. We can identify and review the data that companies have about us.
  • Data Portability. We can take copies of that data out of the company's coffers and offer it to others or just keep copies for ourselves.
  • Data Editing. We can request deletions, editing, clarifications of our data for accuracy and privacy.
  • Data Anonymity. We can request that our data not be used, cognizant of the fact that that may mean services are unavailable to us.
  • Data Use. We have rights to know how our data is being used inside a company.
  • Data Value. The right to sell our data to the highest bidder.
  • Data Permissions. The right to set permissions as to who might use/benefit from/have access to our data.

I like where this is going, but I believe that it's a bit toothless unless the ownership of that data is clarified. As long as the legal owner of personal data is assumed to be the company in possession (Google, FeedBurner, Facebook, etc.), the enumerated rights will be considered the responsibility of P.R. and marketing. If it were somehow possible to push the bill of rights into the legal department, this idea would gain some serious traction. It would also have the possibly-beneficial side effect of depressing valuations for data collection companies like FeedBurner or DoubleClick, or even Google itself. It might also have a similar effect on the financial world, giving companies such as ChoicePoint a well-deserved kick in the teeth.
