tecznotes

Michal Migurski's notebook, listening post, and soapbox.

Jun 19, 2016 2:03am

dockering address data

I’ve been on a Docker vision quest for the past week, since Tom Lee made this suggestion in OpenAddresses chat:

hey, slightly crazy idea: @migurski what would you think about defining an API for machine plugins? basically a set of whitelisted scripts that do the sorts of things defined in openaddresses/scripts with configurable frequency (both for the sake of the stack and bc some data sources are only released quarterly).

Mostly, OpenAddresses sources retrieve data directly from government authorities, via file downloads or ESRI feature server APIs. Each week, we re-run the entire collection and crawl for new data. Some data sources, however, are hard to integrate and require special processing. They might need a session token for download, or they might be released in some special-snowflake format. For these sources, we download the data, convert it to a plainer cached format, and keep a script around so we can repeat the process.

Docker seemed like it might be a good answer to Tom’s question, for a few reasons. A script might depend on a specific installed version of some tool, or on some similarly particular environment. Docker can encapsulate that expectation. Using “docker run”, it’s possible to have this environment behave essentially like a shell script, cleaning up after itself upon exit. I am a card-carrying docker skeptic but this does appear to be right in the sweet spot, without triggering any of the wacko microservices shenanigans that docker people seem excited about.
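
As a rough illustration of the "behaves like a shell script" idea, here is a minimal Python sketch that assembles a one-shot "docker run" invocation. The image name and the script argument are hypothetical placeholders, not names from the OpenAddresses scripts; the only Docker feature relied on is the "--rm" flag, which removes the container when its process exits.

```python
import shlex


def one_shot_command(image, script_args=()):
    """Build a `docker run` command whose container is removed on exit."""
    # --rm deletes the container once the process inside it finishes,
    # so nothing is left behind between runs.
    return ["docker", "run", "--rm", image, *script_args]


# Hypothetical image name and argument, for illustration only.
cmd = one_shot_command("example/address-source", ["--verbose"])
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would run it on a machine with Docker installed; the sketch only builds and prints the command.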

(docker is the future)

Waldo Jaquith followed up to report that it worked well for him in similar situations: “For my purposes, Docker’s sweet spot is periodically running computationally-intensive batch processes in a strictly-defined environments.”

So, I decided to try with three OpenAddresses sources.

Australia

Australia’s G-NAF was my first test subject. It’s distributed via S3 in two big downloadable files, and there is a maintained loader script by Hugh Saalmans. This was easy to make into a completely self-contained Dockerfile even with the Postgres requirement. The main challenge with this one was disk usage. Docker is a total couch hog, but individual containers are given ~8GB of disk space. I didn’t want to mess with defaults, so I learned how to use “docker run --volume” to mount a temporary directory on the host system, and then configured the contained Postgres database to use a tablespace in that directory.
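
The volume-plus-tablespace arrangement can be sketched roughly as follows. This is a minimal Python sketch, not the actual au script: the image name, the "/work" mount point, and the tablespace name are assumptions made for illustration. It builds a "docker run --volume" command that mounts a temporary host directory, plus the Postgres CREATE TABLESPACE statement that would point bulk storage at a directory under that mount.

```python
import tempfile


def volume_command(image, host_dir, container_dir="/work"):
    """Build a `docker run` command that mounts host_dir inside the container."""
    # --volume HOST:CONTAINER makes the host directory visible in the
    # container, so large files land on the host disk instead of in the
    # container's own limited storage.
    return ["docker", "run", "--rm",
            "--volume", f"{host_dir}:{container_dir}", image]


def tablespace_sql(name, location):
    """Return SQL that creates a Postgres tablespace at the given directory."""
    # Postgres expects the directory to already exist, be empty, and be
    # owned by the database server's system user.
    return f"CREATE TABLESPACE {name} LOCATION '{location}';"


# Hypothetical image and tablespace names, for illustration only.
with tempfile.TemporaryDirectory() as host_dir:
    cmd = volume_command("example/au-loader", host_dir)
    sql = tablespace_sql("scratch", "/work/pgdata")
    print(" ".join(cmd))
    print(sql)
```

Tables created with a TABLESPACE clause naming "scratch" would then be stored under the mounted directory rather than inside the container.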

No big challenges with Australia; it’s just a big dataset and takes a lot of time and space to handle.

Here’s the script: openaddresses/scripts/au.

New Zealand

Downloading data from Land Information New Zealand (LINZ) had a twist. The data set must be downloaded manually using an asynchronous web UI. I didn’t want to mess with Selenium here, so running the NZ script requires that a copy of the Street Address data file be provided ahead of time. It’s not as awful as USGS’s occasional “shopping cart” metaphor with emailed links, but it’s still hard to automate. The docker instructions for New Zealand cover this manual process.

With the file in place, the process is straightforward. At this point I learned about using “docker save” to stash complete docker images, and started noting that they could be loaded from cache in the README files.
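
Stashing and restoring an image might look roughly like this. It is a minimal Python sketch with a hypothetical image name and tarball path; it relies only on "docker save --output" and "docker load --input", and it builds the commands without running them.

```python
def save_command(image, tar_path):
    """Build a command that writes a complete image to a tar archive."""
    return ["docker", "save", "--output", tar_path, image]


def load_command(tar_path):
    """Build a command that restores an image from a saved tar archive."""
    return ["docker", "load", "--input", tar_path]


# Hypothetical image name and archive path, for illustration only.
save_cmd = save_command("example/nz-loader", "nz-loader.tar")
load_cmd = load_command("nz-loader.tar")
print(" ".join(save_cmd))
print(" ".join(load_cmd))
```

Loading from a saved archive skips the image build step, which helps when the build depends on downloads that may change or disappear.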

Here’s the script: openaddresses/scripts/nz.

Tennessee

The Tennessee Property Viewer website is backed by a traditional ESRI feature service, but it requires a token to use. The token is created in JavaScript… somehow… so the Tennessee docker directions include a note about using a browser debug console to find it.

The token is passed to docker via the run command, though I could have also used an environment variable.
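
The two ways of handing over the token can be compared in a short Python sketch. The image name, the token value, and the ESRI_TOKEN variable name are all hypothetical; the real tn script may expect something different. The first form appends the token as an argument to the container's command, and the second sets it with the "--env" flag of "docker run".

```python
def token_as_argument(image, token):
    """Pass the token as a trailing argument to the container's command."""
    return ["docker", "run", "--rm", image, token]


def token_as_env(image, token, var_name="ESRI_TOKEN"):
    """Pass the token through an environment variable instead."""
    # var_name is an assumed name; the script inside the container would
    # need to read the same variable.
    return ["docker", "run", "--rm", "--env", f"{var_name}={token}", image]


# Hypothetical image name and token value, for illustration only.
arg_cmd = token_as_argument("example/tn-loader", "abc123")
env_cmd = token_as_env("example/tn-loader", "abc123")
print(" ".join(arg_cmd))
print(" ".join(env_cmd))
```

Either way the token has to be found by hand first, so the choice is mostly about how the script inside the container prefers to read it.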

Here’s the script: openaddresses/scripts/us/tn.

What Now?

These scripts work pretty well. Once the data is processed, we post it to S3 where the normal OpenAddresses weekly update cycle takes over, using the S3 cache URL instead of the authority’s original.

With some basic direction-following, anyone can update data dependencies for OpenAddresses. As the project moves forward and manually-created caches fall out of date, we’re going to see an increasing number of sources in need of manual intervention, and I hope that they’ll be easier to work with in the future.
