Michal Migurski's notebook, listening post, and soapbox.

Aug 12, 2010 1:30am


The U.S. Census publishes an astonishing volume of data, notably with the most recent 2000 count. The demographic data contained in each of the summary files is precise, detailed, and distributed in a difficult-to-understand text format. The documentation for summary file #1 alone (race, age, sex) is a 637 page PDF file, and the actual data is stored in a maze of zip files all alike.

I've poked at these before, but I recently got a bee in my bonnet about making them available in a more useful form so they could be mapped. I talked to Josh Livni (of Land Summary) quite a while back about his plans for a demographic summary site that would store everything in a database in the cloud. Then Amazon made the census data available as a public dataset. Still I was not satisfied - both approaches to handling the data seemed a bit ocean-boiling in retrospect.

I've been experimenting with something I'm tentatively calling census-tools that seeks to make this data a bit more accessible. I'm motivated by the idea that predictably-structured zip files stored on a web server and accessed with Python's excellent stream-handling libraries might actually be considered quite a good API, so the first tool in the repository proceeds from there. It does a very simple thing: given an optional U.S. state, a geographic summary level (e.g. census tract or county), and a type of data, it unzips those remote files into memory and converts them to a tab-separated values file.
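To make the "zip files on a web server as an API" idea concrete, here is a minimal sketch of the general technique, not the actual code in census2text.py. It assumes the member file inside the zip is comma-delimited text; the function names, the URL, and the member name are all hypothetical placeholders.

```python
import csv
import io
import zipfile
from urllib.request import urlopen


def zip_member_to_tsv(zip_bytes, member_name, out):
    """Write one comma-delimited member of an in-memory zip to `out` as TSV."""
    # Wrap the raw bytes so zipfile can seek around in them without a temp file.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # Assumption: the member is plain comma-delimited text.
            text = io.TextIOWrapper(raw, encoding="latin-1", newline="")
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            for row in csv.reader(text):
                writer.writerow(row)


def remote_zip_to_tsv(url, member_name, out):
    """Fetch a zip over HTTP into memory, then convert one member to TSV."""
    # Hypothetical helper: `url` and `member_name` are placeholders,
    # not real Census file locations.
    with urlopen(url) as response:
        zip_member_to_tsv(response.read(), member_name, out)
```

Nothing touches the disk until the tab-separated rows are written, which is what makes a predictable directory of zip files behave a bit like an API.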

Here's an example:

python census2text.py --verbose --wide --state=Hawaii --geography=county --table=P18 --output=hawaii-households.txt

It outputs a chatty text file of household data for every county in Hawaii into a file called hawaii-households.txt. It takes about a minute to churn through a 2.8MB zip file and output the results. Omitting the state name gets you every county in the U.S. in about 20 minutes:

python census2text.py --verbose --wide --geography=county --table=P18 --output=national-households.txt

I tested with Hawaii because it's small, and immediately discovered the strangely underpopulated Kalawao County:

The county is coextensive with the Kalaupapa National Historical Park, and encompasses the Kalaupapa Settlement where the Kingdom of Hawai'i, the territory, and the state once exiled persons suffering from leprosy (Hansen's disease) beginning in the 1860s. The quarantine policy was lifted in 1969, after the disease became treatable on an outpatient basis and could be rendered non-contagious. However, many of the resident patients chose to remain, and the state has promised they can stay there for the rest of their lives. No new patients, or other permanent residents, are admitted. Visitors are only permitted as part of officially sanctioned tours. State law prohibits anyone under the age of 16 from visiting or living there.


Anyway, this small amount of information can be quite hard to get to. Between the impenetrable formatting of the geographic record files, the bewildering array of different kinds of geographic entities, and the depth of geographic minutiae, it can take quite a bit of head-scratching to extract even the first bits of information from the U.S. Census.

I hope this first tool makes it a little bit less of a hassle. I'd accept whatever patches people choose to offer: support for summary files beyond SF1, additional geographic summary levels, general patches, and more.
