18 November 2019

Improved autocomplete parsing is here

At Geocode Earth, one of our primary goals is to build a high quality, accurate, and fast autocomplete geocoder for the entire world.

A few weeks ago, we released a major new step in that journey: a completely new autocomplete parser that vastly improves accuracy.

Autocomplete is a challenging problem for geocoders: it requires sifting through huge amounts of data (over 600 million records for the whole planet in our case), returning results in only a fraction of a second, and most challenging of all, attempting to predict the desired result from incomplete input.

To tackle this challenge, our geocoder, Pelias, has always used a mixed approach where a parser first attempts to classify components of the input text into fields such as house number, street, or city.

Based on the result of this parsing, querying our full database of places becomes much more of an educated guess than a shot in the dark.

Pelias has long used a similar strategy for forward geocoding, where the excellent libpostal library uses a powerful machine learning model trained on a global dataset to return extremely accurate parses. However, the challenges of partial input mean that libpostal is not suitable for autocomplete.

For years, we’ve been narrowing in on the ideal properties of an autocomplete-friendly parser, and in early 2019 we finally started serious work on what has now become the Pelias Parser.

Introducing the Pelias Parser

The core requirements we settled on for our new Pelias Parser are:

  1. Support incomplete input as a first-class concern
  2. Return multiple parser interpretations to handle inherently ambiguous inputs
  3. Support numerous result types including point of interest, address, street, and most notably, street intersection
  4. Be extensible, configurable, and debuggable by humans, rather than relying on black-box machine learning models

Considering the current excitement around AI and machine learning, that last requirement is perhaps the most contentious and interesting. We’ll save it for a separate blog post where we dive into the internals of the Pelias Parser.

Let’s take a look at the other goals:

Incomplete input

Unlike forward or reverse geocoding, where addresses, venue names, or coordinates are likely coming from an existing dataset, autocomplete is primarily used directly by people.

While returning correct results is an obvious requirement, a good autocomplete interface saves people time by returning what they want without typing it out completely.

Autocomplete test output in Budapest, Hungary
An example of the test output from our autocomplete test suite. The green characters show where the autocomplete interface is able to return the correct result. Note that it saves the user from typing most of the address.

Our previous parser, on the other hand, used signals such as a comma to separate parts of an address, or street types (such as road, avenue, or way) to signal the end of a street address.

Assumptions such as “the word ‘avenue’ signifies the end of the street name” are not only often incorrect (in France, for example, street names like Avenue Aristide Briand are common), but also mean a street can’t be classified correctly until it’s been typed completely.

For the new Pelias Parser, we worked hard to eliminate reliance on complete input wherever possible. As a result, the Pelias Parser has better support for international address formats and, after many years, Pelias finally no longer requires commas before city or country names!
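To make the idea concrete, here is a minimal Python sketch of one way a classifier could avoid depending on complete input: the token still being typed is matched as a prefix, while earlier tokens must match exactly. This is an illustration only, not the Pelias Parser's actual code; the `STREET_TYPES` list and the `street_type_candidates` function are hypothetical names made up for this example.

```python
from typing import List

# Hypothetical, tiny vocabulary; a real parser would use per-language dictionaries.
STREET_TYPES = ("avenue", "road", "street", "way")


def street_type_candidates(token: str, is_last_token: bool) -> List[str]:
    """Return the street types a token could stand for.

    Assumption for this sketch: only the final token may be incomplete,
    because it is the one the user is still typing.
    """
    word = token.lower()
    if not word:
        return []
    if is_last_token:
        # Partial input: accept any street type that starts with the token.
        return [t for t in STREET_TYPES if t.startswith(word)]
    # Earlier tokens are treated as complete words and must match exactly.
    return [t for t in STREET_TYPES if t == word]
```

With this rule, `street_type_candidates("ave", True)` yields `["avenue"]` while the user is mid-word, whereas the same token earlier in the input yields no match.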

Multiple interpretations

Quick, what place do you think of when you see the name Ontario, CA?

If you said the Canadian province Ontario, you are correct!

But, if you said the city of Ontario in California, you are also correct.

One job of an autocomplete interface is to present multiple possible options, and have the user select the one they want.

All previous parsers used by Pelias returned only a single interpretation of a given input. No matter how good a job these parsers did, they were never telling the whole story.

The Pelias Parser was designed from the start to provide multiple interpretations of a given input, ordered with a confidence score.

Multiple parse solutions for Ontario, CA
Multiple solutions for the input 'Ontario, CA'. The parser uses data from Who's on First to generate several possible solutions.
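The following Python sketch shows the general shape of the idea: every token can carry several labels, and each consistent combination becomes one interpretation with a score. It is a toy under stated assumptions, not how the Pelias Parser works internally. The `GAZETTEER` table stands in for real Who's on First data, and the scoring rule (prefer adjacent levels of the place hierarchy) is invented purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List

# Hypothetical lookup: which place types each token could be.
GAZETTEER = {
    "ontario": ["locality", "region"],  # city in California, province in Canada
    "ca": ["region", "country"],        # California, Canada
}
# Place types ordered from most to least specific.
RANK = {"locality": 0, "region": 1, "country": 2}


@dataclass
class Interpretation:
    fields: Dict[str, str]
    score: float


def interpretations(text: str) -> List[Interpretation]:
    tokens = [t.strip().lower() for t in text.split(",") if t.strip()]
    if not tokens:
        return []
    results = []
    # Try every combination of labels, one label per token.
    for combo in product(*(GAZETTEER.get(t, []) for t in tokens)):
        ranks = [RANK[label] for label in combo]
        gaps = [b - a for a, b in zip(ranks, ranks[1:])]
        # Keep only combinations that go from specific to general.
        if all(g > 0 for g in gaps):
            # Invented heuristic: penalize skipped hierarchy levels.
            score = 1.0 / (1 + sum(g - 1 for g in gaps))
            results.append(Interpretation(dict(zip(combo, tokens)), score))
    # Highest score first; ties keep their generation order.
    return sorted(results, key=lambda r: -r.score)
```

Calling `interpretations("Ontario, CA")` returns three candidates, including both the city-in-California and province-in-Canada readings, rather than forcing a single answer.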

Returning multiple parsing interpretations is a massive new superpower, one we’ve only just begun to explore. We have a lot of exciting work left to do to start using these interpretations elsewhere in our geocoder to return better results, and we will likely keep making improvements based on this new capability for years.

Street intersection support

In addition to running a hosted geocoding API, we also help teams and organizations run their own geocoders that meet their custom needs.

One of our very first clients, TriMet, needed street intersection support for their new trip planner, so we began investigating what it would take.

Street intersection support has been one of the goals of the Pelias geocoder almost since the beginning, but we quickly confirmed our fears that our existing parser simply could not understand intersections. The same code that erroneously assumed the word ‘street’ signaled the end of an address was even more confused when it would see the word ‘street’ twice.

As a result, when designing the Pelias Parser, a modular, reusable set of classification rules was one of the key considerations.

Essentially, we are able to define parsing logic for intersections as anything that matches a street, followed by a separator (such as ‘and’ or ‘at’), followed by another street. Any improvement to our street parsing logic is then automatically an improvement to our street intersection parsing logic too!
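The composition described above can be sketched in a few lines of Python. This is a simplified illustration, not the Pelias Parser's real rule set: `is_street` is a deliberately naive, hypothetical stand-in for a proper street classifier, and the `STREET_TYPES` and `SEPARATORS` sets are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical vocabularies for this sketch only.
STREET_TYPES = {"street", "st", "avenue", "ave", "road", "rd", "rue"}
SEPARATORS = {"and", "at", "&", "@"}


def is_street(tokens: List[str]) -> bool:
    """Naive street rule: at least two words, with a street type first or last."""
    if len(tokens) < 2:
        return False
    first = tokens[0].lower().rstrip(".")
    last = tokens[-1].lower().rstrip(".")
    return first in STREET_TYPES or last in STREET_TYPES


def parse_intersection(text: str) -> Optional[Tuple[str, str]]:
    """Intersection rule: street, then separator, then street."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token.lower() in SEPARATORS:
            left, right = tokens[:i], tokens[i + 1:]
            # The intersection rule only reuses the street rule on each side.
            if is_street(left) and is_street(right):
                return " ".join(left), " ".join(right)
    return None
```

Because `parse_intersection` delegates entirely to `is_street`, replacing the naive street rule with a better one improves intersection parsing without touching the intersection logic itself.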

Search intersection query
TriMet's trip planner, showing an autocomplete query for a street intersection.

A street intersection query is no use without a street intersection database. TriMet, thanks to their focus on the Portland metro area, was able to generate a dataset of intersections. One of our projects in the future will be to generate a suitable global street intersection dataset so we can bring this functionality to Geocode Earth users.

Moving Forward

The new Pelias Parser has been in use on the Geocode Earth service since mid-October, and is already showing significant improvements.

Thanks to its extensibility and configurability, we anticipate further gains as we continue to work on the parser with our users, partners, and community members.

Like its inputs, the Pelias Parser will always be incomplete. As with all of Pelias, we release our work as open source because we believe good software is built by the contributions and feedback of a wide range of people.

If you’re interested in being involved, reach out to us or take a look at the Pelias Parser on GitHub.