The OpenAddresses project recently crossed 250 million worldwide address points with the addition of countrywide data for Australia. Data from OA is used by Mapbox, Consumer Finance Protection Bureau, and my company, Mapzen.
Now, you can use OpenAddresses in a high-quality geocoder yourself. “Geocoding” is the process of transforming input text, such as an address, or a name of a place to a geographic location on the earth's surface. Every time you search for a destination on your phone, you’re geocoding. Mapzen’s Search service uses an open source server we call Pelias, and if you’re using the popular Ubuntu Linux operating system, you can get it set up and serving addresses in just a few minutes.
Start with a clean server running a current version of Ubuntu LTS (long-term support); either 14.04 or 16.04 will work. Amazon has readymade Ubuntu images available on EC2, or a local copy running under Virtualbox will do for testing. Both the address import process and the Elasticsearch index are hungry for lots of memory, so pick a server with 4-8GB of memory to prevent failures.
Next, install the Pelias software using instructions from OpenAddresses:
# Tell Ubuntu where to find packages: add-apt-repository ppa:openaddresses/geocoder -y wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add - echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list # Install Pelias and dependencies: apt-get update && apt-get install pelias
This installs the Pelias geocoder, the OpenAddresses importer, a simple web-based map search interface, and the underlying Elasticsearch index.
After installation, you will need to import data. Visit results.openaddresses.io and pick a processed zip file to download. Start small with a city like Berkeley, CA to test the process. Download and unzip it in the directory `/var/tmp/openaddresses` where Pelias expects to find CSV files, then run `pelias-openaddresses-import` to index the data.
# Get a sample file of address data: cd /var/tmp/openaddresses curl -OL https://results.openaddresses.io/latest/run/us/ca/berkeley.zip apt-get install unzip && unzip berkeley.zip # Index the addresses: pelias-openaddresses-import
The Mapzen Search service includes some additional features that aren’t yet covered here. For example, to include administrative areas like cities or states in searches, it’s necessary to do an admin lookup while importing, and to include data from Who’s On First. I’m also interested to learn more about tuning Elasticsearch for smaller-sized servers with less system RAM. It should be possible to run a geocoder with 1-2GB of memory, and Elasticsearch may require adjustments to make this possible.
Links to more information about geocoding with OpenAddresses:
I’ve devoted some time over the past month to learning how to create distribution packages for Ubuntu, using the Debian packaging system. This has been a longstanding interest for me since Dane Springmeyer and Robert Coup created a Personal Package Archive (PPA) to easily and quickly distribute reliable versions of Mapnik. It’s become more important lately as programming languages have developed their own mutually-incompatible packaging systems like npm, Ruby Gems, or PyPi, while developer interest has veered toward container-style technology such as Vagrant or Docker. My immediate interest comes from an OpenAddresses question from Waldo Jaquith: what would a packaged OA geocoder look like? I’ve watched numerous software projects create Vangrantfile or Dockerfile scripts before, only to let those fall out of date and eventually become a source of unanswerable help requests. OS-level packages offer a stable, fully-enclosed download with no external dependencies on 3rd party hosting services.
What does it take to prepare a package?
The staff at Launchpad, particularly Colin Watson, have been helpful in answering my questions as I moved from getting a tiny “hello world” package onto a PPA to wrapping up Mapzen’s worldwide geocoder software, Pelias.
I started with a relatively-simple tutorial from Ask Ubuntu, where a community member steps you through each part of wrapping a complete Debian package from a single shell script source. This mostly worked, but there are a few tricks along the way. The various required files are often shown inside a directory called “DEBIAN”, but I’ve found that it needs to be lower-case “debian” in order to work with the various preparation scripts. The control file, debian/control, is the most important one, and has a set of required fields arranged in stanzas that must conform to a particular pattern. My first Launchpad question addressed a series of mistakes I was making.
- My final debian/control file: https://github.com/openaddresses/pelias-api-ubuntu-xenial/blob/2.2.0-ubuntu1-xenial4/debian/control
- Introduction to control files: https://www.debian.org/doc/manuals/maint-guide/dreq.en.html#control
- Deeper documentation for control file fields: https://www.debian.org/doc/debian-policy/ch-controlfields.html
The file debian/changelog was the second challenge. It needs to conform to an exact syntax, and it’s easiest to have the utility dch (available in the devscripts package) do it for you. You need to provide a meaningful version number, release target, notes, and signature for this file to work. The release target is usually a Debian or Ubuntu codename, in this case “xenial” for Ubuntu’s 16.04 Xenial Xerus release. The version number is also tricky; it’s generally assumed that a package maintainer is downstream from the original developer, so the version number will be a combination of upstream version and downstream release target, in my case Pelias 2.2.0 + “ubuntu” + “xenial”.
- My final debian/changelog file: https://github.com/openaddresses/pelias-api-ubuntu-xenial/blob/2.2.0-ubuntu1-xenial4/debian/changelog
- An explanation of version numbering for Ubuntu: http://www.ducea.com/2006/06/17/ubuntu-package-version-naming-explanation/
A debian/rules file is also required, but it seems to be sufficient to use a short default file that calls out to the Debian helper script dh.
- My debian/rules file: https://github.com/openaddresses/pelias-api-ubuntu-xenial/blob/2.2.0-ubuntu1-xenial4/debian/rules
- Another sample rules file: https://www.debian.org/doc/manuals/maint-guide/dreq.en.html#defaultrules
I have not been able to determine how to test my debian director package locally, but I have found that the emails sent from Launchpad after using dput to post versions of my packages can be helpful when debugging. I tested with a simple package called “hellodeb”; here is a complete listing of each attempt I made to publish this package in my “hello” PPA as I learned the process.
My second Launchpad question concerned the contents of the package: why wasn’t anything being installed? The various Debian helper scripts try to do a lot of work for you, and as a newcomer it’s sometimes hard to guess where it’s being helpful, and where it’s subtly chastising you for doing the wrong thing. For example, after I determined that including an “install” make target in the default project Makefile that wrote files to $DESTDIR was the way to create output, it turned out that my attempt to install under /usr/local was being thwarted by dh_usrlocal, a script which enforces the Linux filesystem standard convention that only users should write files to /usr/local, never OS-level packages. In the end, while it’s possible to simply list everything in debian/install, it seems better to do that work in a more central and easy-to-find Makefile.
- My Makefile, with a dummy “all” target and a critical “install” target writing to $DESTDIR: https://github.com/openaddresses/pelias-api-ubuntu-xenial/blob/2.2.0-ubuntu1-xenial4/Makefile#L11-L13
Finally, I learned through trial-and-error that the Launchpad build system prevents network access. Since Pelias is written in Node, it is necessary to send the complete code along with all dependencies under node_modules to the build system. This ensures that builds are more predictable and reliable, and circumvents many of the SNAFU situations that can result from dynamic build systems.
- Complete code for Pelias API with all dependencies is included in our package: https://github.com/openaddresses/pelias-api-ubuntu-xenial/tree/2.2.0-ubuntu1-xenial4/pelias-api
- Makefile includes some rules on how to regenerate the contents of pelias-api anyway: https://github.com/openaddresses/pelias-api-ubuntu-xenial/blob/2.2.0-ubuntu1-xenial4/Makefile#L3-L9
- I had to make some maintainer tweaks to one package that made incompatible assumptions about where it’s allowed to touch the filesystem: https://github.com/openaddresses/pelias-api-ubuntu-xenial/blob/2.2.0-ubuntu1-xenial4/patches/cluster2-pids-logs.patch
Rolling a new release is a four step process:
- Create a new entry in the debian/changelog file using dch, which will determine the version number.
- From inside the project directory, run debuild -k'8CBDE645' -S (“8CBDE645” is my GPG key ID, used by Launchpad to be sure that I’m me) to create a set of files with names like pelias-api_2.2.0-ubuntu1~xenial5_source.*.
- From outside the project directory, run dput ppa:migurski/hello "pelias-api_2.2.0-ubuntu1~xenial5_source.changes” to push the new package version to Launchpad.