Michal Migurski's notebook, listening post, and soapbox. Subscribe to this blog. Check out the rest of my site as well.

Aug 12, 2010 5:30am


The U.S. Census publishes an astonishing volume of data, notably with the most recent 2000 count. The demographic data contained in each of the summary files is precise, detailed, and distributed in a difficult-to-understand text format. The documentation for summary file #1 alone (race, age, sex) is a 637 page PDF file, and the actual data is stored in a maze of zip files all alike.

I've poked at these before, but I recently got a bee in my bonnet about making them available in a more useful form so they could be mapped. I talked to Josh Livni (of Land Summary) quite a while back about his plans for a demographic summary site that would store everything in a database in the cloud. Then Amazon made it available as a public dataset. Still I was not satisfied - both approaches to handling the data seemed a bit ocean-boiling in retrospect.

I've been experimenting with something I'm tentatively calling census-tools that seeks to make this data a bit more accessible. I'm motivated by the idea that predictably-structured zip files stored on a web server and accessed with Python's excellent stream-handling libraries might actually be considered quite a good API, so the first tool in the repository proceeds from there. It does a very simple thing: given an optional U.S. state, a geographic summary level (e.g. census tract or county), and a type of data, it unzips those remote files into memory and converts them to a tab-separated values file.

Here's an example:

python census2text.py ––verbose ––wide ––state=Hawaii ––geography=county ––table=P18 ––output=hawaii-households.txt

It outputs a chatty text file of household data for every county in Hawaii into a file called hawaii-households.txt. It takes about a minute to churn through a 2.8MB zip file and output the results. Omitting the state name gets you every county in the U.S. in about 20 minutes:

python census2text.py ––verbose ––wide ––geography=county ––table=P18 ––output=national-households.txt

I tested with Hawaii because it's small, and immediately discovered the strangely underpopulated Kalawao County:

The county is coextensive with the Kalaupapa National Historical Park, and encompasses the Kalaupapa Settlement where the Kingdom of Hawai'i, the territory, and the state once exiled persons suffering from leprosy (Hansen's disease) beginning in the 1860s. The quarantine policy was lifted in 1969, after the disease became treatable on an outpatient basis and could be rendered non-contagious. However, many of the resident patients chose to remain, and the state has promised they can stay there for the rest of their lives. No new patients, or other permanent residents, are admitted. Visitors are only permitted as part of officially sanctioned tours. State law prohibits anyone under the age of 16 from visiting or living there.


Anyway, this small amount of information can be quite hard to get to. Between the impenetrable formatting of the geographic record files, the bewildering array of different kinds of geographic entities, and the depth of geographic minutiae, it can take quite a bit of head-scratching to extract even the first bits of information from the U.S. Census.

I hope this first tool makes it a little bit less of a hassle. I'd accept whatever patches people choose to offer: support for summary files beyond SF1, additional geograph summary levels, general patches, and more.

Comments (5)

  1. Have you considered to access the 2000 U.S. Census using Linked Data? http://www.rdfabout.com/demo/census/

    Posted by Alain on Thursday, August 12 2010 3:23pm UTC

  2. To be honest, I haven't. This all seemed easier and more fun than consuming RDF. Also, the Census data here isn't really "linked", or if it is that's not really its primary characteristic. It's plain old tables, but they're very big and need to be filtered.

    Posted by Michal Migurski on Thursday, August 12 2010 4:04pm UTC

  3. Any 2010 plans?

    Posted by Ben on Friday, August 13 2010 12:58am UTC

  4. Ben I'd love to know what they have in store for publishing! I'm told there will be Census people at this weekend's State Of The Map conference in Atlanta, I plan to get them drunk and find out. =D

    Posted by Michal Migurski on Friday, August 13 2010 1:15am UTC

  5. Any update from SOTMUS re: Census 2010 data publishing plans? We're in the midst of converting our 2010 Census Hard To Count mapping site (http://www.censushardtocountmaps.org/) to one that focuses on post-2000 demographic change, and are casting about for better approaches. Things like Census Tools, TileStache, and Polymaps are all quite intriguing.

    Posted by Steven Romalewski on Monday, August 23 2010 12:55pm UTC

Sorry, no new comments on old posts.

October 2021
Su M Tu W Th F Sa

Recent Entries

  1. Mapping Remote Roads with OpenStreetMap, RapiD, and QGIS
  2. How It’s Made: A PlanScore Predictive Model for Partisan Elections
  3. Micromobility Data Policies: A Survey of City Needs
  4. Open Precinct Data
  5. Scoring Pennsylvania
  6. Coming To A Street Near You: Help Remix Create a New Tool for Street Designers
  7. planscore: a project to score gerrymandered district plans
  8. blog all dog-eared pages: human transit
  9. the levity of serverlessness
  10. three open data projects: openstreetmap, openaddresses, and who’s on first
  11. building up redistricting data for North Carolina
  12. district plans by the hundredweight
  13. baby steps towards measuring the efficiency gap
  14. things I’ve recently learned about legislative redistricting
  15. oh no
  16. landsat satellite imagery is easy to use
  17. openstreetmap: robots, crisis, and craft mappers
  18. quoted in the news
  19. dockering address data
  20. blog all dog-eared pages: the best and the brightest