tecznotes

Michal Migurski's notebook, listening post, and soapbox.

Jun 19, 2016 2:03am

dockering address data

I’ve been on a Docker vision quest for the past week, since Tom Lee made this suggestion in OpenAddresses chat:

hey, slightly crazy idea: @migurski what would you think about defining an API for machine plugins? basically a set of whitelisted scripts that do the sorts of things defined in openaddresses/scripts with configurable frequency (both for the sake of the stack and bc some data sources are only released quarterly).

Mostly, OpenAddresses sources retrieve data directly from government authorities, via file downloads or ESRI feature server APIs. Each week, we re-run the entire collection and crawl for new data. Some data sources, however, are hard to integrate and require special processing: they might need a session token for download, or they might be released in some special snowflake format. For those sources, we download the data, convert it to a plainer cached format, and keep a script around so we can repeat the process.
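For a sense of the common case, an ESRI feature server retrieval is just an HTTP query; this sketch uses a made-up endpoint, not a real source:

    # Ask an ESRI/ArcGIS feature server for address records as GeoJSON.
    # The host and layer path below are illustrative only.
    curl --get 'https://gis.example.gov/arcgis/rest/services/Addresses/FeatureServer/0/query' \
        --data-urlencode 'where=1=1' \
        --data-urlencode 'outFields=*' \
        --data-urlencode 'f=geojson'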

Docker seemed like it might be a good answer to Tom’s question, for a few reasons. A script might depend on a specific installed version of some tool, or some other particular environment; Docker can encapsulate that expectation. Using “docker run”, it’s possible to have this environment behave essentially like a shell script, cleaning up after itself upon exit. I am a card-carrying docker skeptic, but this does appear to be right in the sweet spot, without triggering any of the wacko microservices shenanigans that docker people seem excited about.

(docker is the future)
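The clean-up-after-itself behavior comes from the “--rm” flag. Here’s a minimal sketch, with a made-up image name:

    # Build an image from the Dockerfile in the current directory.
    docker build --tag oa-example .

    # Run it like a shell command; --rm discards the stopped container
    # (though not the image) as soon as the process exits.
    docker run --rm oa-example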

Waldo Jaquith followed up to report that it worked well for him in similar situations: “For my purposes, Docker’s sweet spot is periodically running computationally-intensive batch processes in strictly-defined environments.”

So, I decided to try with three OpenAddresses sources.

Australia

Australia’s G-NAF was my first test subject. It’s distributed via S3 in two big downloadable files, and there is a maintained loader script by Hugh Saalmans. This was easy to make into a completely self-contained Dockerfile, even with the Postgres requirement. The main challenge with this one was disk usage: Docker is a total disk hog, but individual containers are given only ~8GB of disk space. I didn’t want to mess with defaults, so I learned how to use “docker run --volume” to mount a temporary directory on the host system, and then configured the contained Postgres database to use a tablespace in that directory.
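In outline, the approach looks like this; the paths and image name are stand-ins for the real ones:

    # Mount a scratch directory from the host, so big intermediate data
    # isn't constrained by the container's ~8GB of disk.
    mkdir -p /tmp/gnaf-scratch
    docker run --rm --volume /tmp/gnaf-scratch:/gnaf oa-au-gnaf

    # Inside the container, Postgres keeps its tables on the mounted
    # directory by way of a tablespace:
    #   CREATE TABLESPACE gnaf_space LOCATION '/gnaf';
    #   CREATE DATABASE gnaf TABLESPACE gnaf_space;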

No big challenges with Australia; it’s just a big dataset, and it takes a lot of time and space to handle.

Here’s the script: openaddresses/scripts/au.

New Zealand

Downloading data from Land Information New Zealand (LINZ) had a twist: the data set must be downloaded manually, through an asynchronous web UI. I didn’t want to mess with Selenium here, so running the NZ script requires that a copy of the Street Address data file be provided ahead of time. It’s not as awful as USGS’s occasional “shopping cart” metaphor with emailed links, but it’s still hard to automate. The docker README for New Zealand includes directions for this manual process.
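Handing the pre-downloaded file to the container might look like this; the file and image names are illustrative:

    # Provide the manually-downloaded LINZ export to the container as a
    # read-only volume mount.
    docker run --rm \
        --volume "$(pwd)/nz-street-address.csv:/input/nz-street-address.csv:ro" \
        oa-nz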

With the file in place, the process is straightforward. At this point I learned about using “docker save” to stash complete docker images, and started noting in the README files that they could be loaded from cache.
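That caching step is roughly the following, again with an illustrative image name:

    # Build the image once, then stash it as a tarball...
    docker build --tag oa-nz .
    docker save oa-nz > oa-nz.tar

    # ...and later load it back from the cache instead of rebuilding:
    docker load < oa-nz.tar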

Here’s the script: openaddresses/scripts/nz.

Tennessee

The Tennessee Property Viewer website is backed by a traditional ESRI feature service, but it requires a token to use. The token is created in Javascript… somehow… so the Tennessee docker directions include a note about using a browser debug console to find it.

The token is passed to docker via the run command, though I could have also used an environment variable.
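Both approaches look about the same from the command line; the image name and token value here are stand-ins:

    # Pass the service token as a command-line argument to the script...
    docker run --rm oa-us-tn 'TOKEN_FROM_DEBUG_CONSOLE'

    # ...or hand it over as an environment variable instead:
    docker run --rm --env ESRI_TOKEN='TOKEN_FROM_DEBUG_CONSOLE' oa-us-tn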

Here’s the script: openaddresses/scripts/us/tn.

What Now?

These scripts work pretty well. Once the data is processed, we post it to S3 where the normal OpenAddresses weekly update cycle takes over, using the S3 cache URL instead of the authority’s original.
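The hand-off is a plain S3 upload, something like this sketch; the bucket and key are assumptions, not the project’s real locations:

    # Push the processed file to S3, so the weekly run can point at the
    # cache URL instead of the authority's original download.
    aws s3 cp nz-street-address.zip \
        s3://example-openaddresses-cache/nz-street-address.zip \
        --acl public-read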

With some basic direction-following, anyone can update data dependencies for OpenAddresses. As the project moves forward and manually-created caches fall out of date, we’re going to see an increasing number of sources in need of manual intervention, and I hope they’ll be easier to work with in the future.

