G-NAF is Australia’s authoritative, geocoded address file built using government data. In this post I’ll be exploring the G-NAF dataset with the aid of SQLite.
Sourcing the latest G-NAF data.
I downloaded the August vintage directly from data.gov.au, new versions are published on a quarterly basis.
The zip file itself is ~1.7GB and decompresses to a little under 8GB.
What’s inside the archive?
The documentation is nicely bundled within the /Documents/ directory, there’s a few goodies in /Extras/, but the bulk of the data is stored in .psv (pipe separated) files under /G-NAF AUGUST 2021/.
Generating the SQLite schema
You’ll find a convenient ANSI SQL schema file within the archive which is compatible with SQLite.
There’s also a file which defines a view called ADDRESS_VIEW, SQLite doesn’t support the OR REPLACE syntax so I needed to make a minor change to get it working:
Generating an import query
Pasting this bash script in the terminal will generate a file named gnaf-import.sql.
Running the import
Time to take a break and grab a tea or coffee, on my laptop it took a little under 10 minutes and resulted in a 8.2GB database file.
Exploring the database
Data from each state has been merged into national tables, so for instance, the ADDRESS_DETAIL table will contain rows from both the VIC_ADDRESS_DETAIL_psv.psv and NSW_ADDRESS_DETAIL_psv.psv file.
Querying the database
First, let’s find the unique street ID for Brunswick Street, Fitzroy, VIC
Using the ADDRESS_VIEW mentioned earlier we can easily retrieve a count of all addresses on this street.
Speeding things up
That worked great, we found 1246 addresses 🎉 …but it took 55 seconds 😿
Unfortunately the add_fk_constraints.sql script from the archive isn’t compatible with SQLite.
I wanted to avoid manually defining indices, so the following script simply creates an index for every column whos name ends with either pid or code:
It’s worth noting here that it’s unlikely all of these indices are actually required, this is just the simplest method of creating them.
Time to start heating up the billycan again, generating the indices takes another 8 minutes and doubles the database to 17GB.
Let’s try that again
The same query which took 55 seconds without indices now takes 200ms ⚡
Converting to GeoJSON
Most SQLite distributions come with the JSON1 extension installed, we can use these functions to generate a GeoJSON FeatureCollection from within SQLite.
It’s worth mentioning that I downloaded the data in the GDA94 CRS whereas GeoJSON uses WGS84. I’ll try to follow up with a blog post about using spatialite to reproject the data to correct this error.
Displaying the data on a map
Here you can see a rendering of all 1246 G-NAF address points for Brunswick Street.
👋 That’s all for today, reach out if you’re interested in seeing another blog post about working with geographic data in SQLite.