tecznotes

Michal Migurski's notebook, listening post, and soapbox. Subscribe to this blog. Check out the rest of my site as well.

Aug 12, 2010 1:30am

census-tools

The U.S. Census publishes an astonishing volume of data, notably with the most recent 2000 count. The demographic data contained in each of the summary files is precise, detailed, and distributed in a difficult-to-understand text format. The documentation for summary file #1 alone (race, age, sex) is a 637 page PDF file, and the actual data is stored in a maze of zip files all alike.

I've poked at these before, but I recently got a bee in my bonnet about making them available in a more useful form so they could be mapped. I talked to Josh Livni (of Land Summary) quite a while back about his plans for a demographic summary site that would store everything in a database in the cloud. Then Amazon made it available as a public dataset. Still I was not satisfied - both approaches to handling the data seemed a bit ocean-boiling in retrospect.

I've been experimenting with something I'm tentatively calling census-tools that seeks to make this data a bit more accessible. I'm motivated by the idea that predictably-structured zip files stored on a web server and accessed with Python's excellent stream-handling libraries might actually be considered quite a good API, so the first tool in the repository proceeds from there. It does a very simple thing: given an optional U.S. state, a geographic summary level (e.g. census tract or county), and a type of data, it unzips those remote files into memory and converts them to a tab-separated values file.

Here's an example:

python census2text.py ––verbose ––wide ––state=Hawaii ––geography=county ––table=P18 ––output=hawaii-households.txt

It outputs a chatty text file of household data for every county in Hawaii into a file called hawaii-households.txt. It takes about a minute to churn through a 2.8MB zip file and output the results. Omitting the state name gets you every county in the U.S. in about 20 minutes:

python census2text.py ––verbose ––wide ––geography=county ––table=P18 ––output=national-households.txt

I tested with Hawaii because it's small, and immediately discovered the strangely underpopulated Kalawao County:

The county is coextensive with the Kalaupapa National Historical Park, and encompasses the Kalaupapa Settlement where the Kingdom of Hawai'i, the territory, and the state once exiled persons suffering from leprosy (Hansen's disease) beginning in the 1860s. The quarantine policy was lifted in 1969, after the disease became treatable on an outpatient basis and could be rendered non-contagious. However, many of the resident patients chose to remain, and the state has promised they can stay there for the rest of their lives. No new patients, or other permanent residents, are admitted. Visitors are only permitted as part of officially sanctioned tours. State law prohibits anyone under the age of 16 from visiting or living there.

Fascinating.

Anyway, this small amount of information can be quite hard to get to. Between the impenetrable formatting of the geographic record files, the bewildering array of different kinds of geographic entities, and the depth of geographic minutiae, it can take quite a bit of head-scratching to extract even the first bits of information from the U.S. Census.

I hope this first tool makes it a little bit less of a hassle. I'd accept whatever patches people choose to offer: support for summary files beyond SF1, additional geograph summary levels, general patches, and more.

Comments (5)

  1. Have you considered to access the 2000 U.S. Census using Linked Data? http://www.rdfabout.com/demo/census/

    Posted by Alain on Thursday, August 12 2010 11:23am EDT

  2. To be honest, I haven't. This all seemed easier and more fun than consuming RDF. Also, the Census data here isn't really "linked", or if it is that's not really its primary characteristic. It's plain old tables, but they're very big and need to be filtered.

    Posted by Michal Migurski on Thursday, August 12 2010 12:04pm EDT

  3. Any 2010 plans?

    Posted by Ben on Thursday, August 12 2010 8:58pm EDT

  4. Ben I'd love to know what they have in store for publishing! I'm told there will be Census people at this weekend's State Of The Map conference in Atlanta, I plan to get them drunk and find out. =D

    Posted by Michal Migurski on Thursday, August 12 2010 9:15pm EDT

  5. Any update from SOTMUS re: Census 2010 data publishing plans? We're in the midst of converting our 2010 Census Hard To Count mapping site (http://www.censushardtocountmaps.org/) to one that focuses on post-2000 demographic change, and are casting about for better approaches. Things like Census Tools, TileStache, and Polymaps are all quite intriguing.

    Posted by Steven Romalewski on Monday, August 23 2010 8:55am EDT

Sorry, no new comments on old posts.

July 2017
Su M Tu W Th F Sa
      
     

Recent Entries

  1. blog all dog-eared pages: human transit
  2. the levity of serverlessness
  3. three open data projects: openstreetmap, openaddresses, and who’s on first
  4. building up redistricting data for North Carolina
  5. district plans by the hundredweight
  6. baby steps towards measuring the efficiency gap
  7. things I’ve recently learned about legislative redistricting
  8. oh no
  9. landsat satellite imagery is easy to use
  10. openstreetmap: robots, crisis, and craft mappers
  11. quoted in the news
  12. dockering address data
  13. blog all dog-eared pages: the best and the brightest
  14. five-minute geocoder for openaddresses
  15. notes on debian packaging for ubuntu
  16. guyana trip report
  17. openaddresses population comparison
  18. blog all oft-played tracks VII
  19. week 1,984: back to the map
  20. bike eleven: trek roadie

Archives