25 October 2021

Exploring G-NAF with SQLite

G-NAF is Australia’s authoritative, geocoded address file built using government data. In this post I’ll be exploring the G-NAF dataset with the aid of SQLite.

Sourcing the latest G-NAF data.

I downloaded the August vintage directly from data.gov.au, new versions are published on a quarterly basis.

The zip file itself is ~1.7GB and decompresses to a little under 8GB.

What’s inside the archive?

The documentation is nicely bundled within the /Documents/ directory, there’s a few goodies in /Extras/, but the bulk of the data is stored in .psv (pipe separated) files under /G-NAF AUGUST 2021/.

├── Documents
│   ├── G-NAF Product Description.pdf
│   └── G-NAF Release Report August 2021.pdf
├── Extras
│   ├── GNAF_TableCreation_Scripts
│   │   ├── add_fk_constraints.sql
│   │   ├── create_tables_ansi.sql
│   │   └── create_tables_sqlserver.sql
│   └── GNAF_View_Scripts
│       └── address_view.sql
└── G-NAF AUGUST 2021
    ├── Authority Code
    │   ├── Authority_Code_ADDRESS_ALIAS_TYPE_AUT_psv.psv
    │   ├── Authority_Code_ADDRESS_CHANGE_TYPE_AUT_psv.psv
    │   ├── ...
    └── Standard
        ├── ACT_ADDRESS_ALIAS_psv.psv
        ├── ACT_ADDRESS_DEFAULT_GEOCODE_psv.psv
        ├── ...

Generating the SQLite schema

You’ll find a convenient ANSI SQL schema file within the archive which is compatible with SQLite.

# generate the database schema
sqlite3 gnaf.db < Extras/GNAF_TableCreation_Scripts/create_tables_ansi.sql

# print the newly generated schema
sqlite3 gnaf.db .schema

There’s also a file which defines a view called ADDRESS_VIEW, SQLite doesn’t support the OR REPLACE syntax so I needed to make a minor change to get it working:

# generate the database views
sqlite3 gnaf.db <<(sed 's/ OR REPLACE//g' Extras/GNAF_View_Scripts/address_view.sql)

Generating an import query

Pasting this bash script in the terminal will generate a file named gnaf-import.sql.

# handle spaces in filenames
exec > gnaf-import.sql
OIFS="$IFS"
IFS=$'\n'

# csv mode (configured for .psv files)
echo '.mode csv'
echo '.separator "|"'

# fast import pragmas
echo 'PRAGMA synchronous=OFF;'
echo 'PRAGMA journal_mode=OFF;'
echo 'PRAGMA temp_store=MEMORY;'

# be verbose
echo '.echo on'

# import 'authority code'
for FILEPATH in `find "G-NAF AUGUST 2021/Authority Code" -type f -name "*.psv"`; do
  BASENAME=$(basename $FILEPATH)
  TABLE_NAME="${BASENAME/Authority_Code_/}"
  TABLE_NAME="${TABLE_NAME/_psv.psv/}"
  TABLE_NAME="${TABLE_NAME/.psv/}"
  echo ".import '${FILEPATH}' '${TABLE_NAME}'"
done

# import 'standard'
for FILEPATH in `find "G-NAF AUGUST 2021/Standard" -type f -name "*.psv"`; do
  BASENAME=$(basename $FILEPATH)
  TABLE_NAME="${BASENAME#*_}"
  TABLE_NAME="${TABLE_NAME/_psv.psv/}"
  TABLE_NAME="${TABLE_NAME/.psv/}"

  # only import to uppercase tables
  # this avoids files like 'nt_locality_pid_linkage.psv which dont exist in the schema
  if [[ $TABLE_NAME != $(echo $TABLE_NAME | tr '[:lower:]' '[:upper:']) ]]; then
    continue
  fi

  # skip the header row
  echo ".import '| tail -n +2 \"${FILEPATH}\"' '${TABLE_NAME}'"
done

IFS="$OIFS"
exec >/dev/tty

Running the import

Time to take a break and grab a tea or coffee, on my laptop it took a little under 10 minutes and resulted in a 8.2GB database file.

sqlite3 gnaf.db < gnaf-import.sql

Exploring the database

Data from each state has been merged into national tables, so for instance, the ADDRESS_DETAIL table will contain rows from both the VIC_ADDRESS_DETAIL_psv.psv and NSW_ADDRESS_DETAIL_psv.psv file.

sqlite3 gnaf.db '.tables'
ADDRESS_ALIAS                   LOCALITY_ALIAS
ADDRESS_ALIAS_TYPE_AUT          LOCALITY_ALIAS_TYPE_AUT
ADDRESS_CHANGE_TYPE_AUT         LOCALITY_CLASS_AUT
ADDRESS_DEFAULT_GEOCODE         LOCALITY_NEIGHBOUR
ADDRESS_DETAIL                  LOCALITY_POINT
ADDRESS_FEATURE                 MB_2011
ADDRESS_MESH_BLOCK_2011         MB_2016
ADDRESS_MESH_BLOCK_2016         MB_2021
ADDRESS_MESH_BLOCK_2021         MB_MATCH_CODE_AUT
ADDRESS_SITE                    PRIMARY_SECONDARY
ADDRESS_SITE_GEOCODE            PS_JOIN_TYPE_AUT
ADDRESS_TYPE_AUT                STATE
ADDRESS_VIEW                    STREET_CLASS_AUT
FLAT_TYPE_AUT                   STREET_LOCALITY
GEOCODED_LEVEL_TYPE_AUT         STREET_LOCALITY_ALIAS
GEOCODE_RELIABILITY_AUT         STREET_LOCALITY_ALIAS_TYPE_AUT
GEOCODE_TYPE_AUT                STREET_LOCALITY_POINT
LEVEL_TYPE_AUT                  STREET_SUFFIX_AUT
LOCALITY                        STREET_TYPE_AUT

Querying the database

First, let’s find the unique street ID for Brunswick Street, Fitzroy, VIC

sqlite3 gnaf.db <<SQL
.headers on

  SELECT street_locality_pid
  FROM STREET_LOCALITY
  JOIN LOCALITY USING(locality_pid)
  WHERE street_name = 'BRUNSWICK'
  AND street_type_code = 'STREET'
  AND locality_name = 'FITZROY'
SQL

street_locality_pid
VIC1930629

Using the ADDRESS_VIEW mentioned earlier we can easily retrieve a count of all addresses on this street.

sqlite3 gnaf.db <<SQL
.headers on
.timer on

  SELECT COUNT(*)
  FROM ADDRESS_VIEW
  WHERE STREET_LOCALITY_PID = 'VIC1930629'
SQL

COUNT(*)
1246
Run Time: real 55.672 user 34.325452 sys 17.280557

Speeding things up

That worked great, we found 1246 addresses 🎉 …but it took 55 seconds 😿

Unfortunately the add_fk_constraints.sql script from the archive isn’t compatible with SQLite.

I wanted to avoid manually defining indices, so the following script simply creates an index for every column whos name ends with either pid or code:

{ sqlite3 gnaf.db > gnaf-indices.sql } <<SQL

  /* use fast import pragmas */
  SELECT 'PRAGMA synchronous=OFF;';
  SELECT 'PRAGMA journal_mode=OFF;';
  SELECT 'PRAGMA temp_store=MEMORY;';

  SELECT printf(
    'CREATE INDEX IF NOT EXISTS %s ON %s (%s);',
    printf('%s_%s', LOWER(t.name), LOWER(c.name)),
    t.name, c.name
  )
  FROM sqlite_master t
  LEFT OUTER JOIN pragma_table_info(t.name) c
  WHERE t.type = 'table'
  AND (
    c.name LIKE '%\_pid' ESCAPE '\' OR
    c.name LIKE '%\_code' ESCAPE '\' OR
    c.name == 'code'
  );
SQL

It’s worth noting here that it’s unlikely all of these indices are actually required, this is just the simplest method of creating them.

Time to start heating up the billycan again, generating the indices takes another 8 minutes and doubles the database to 17GB.

sqlite3 gnaf.db < gnaf-indices.sql

Let’s try that again

The same query which took 55 seconds without indices now takes 200ms ⚡

sqlite3 gnaf.db <<SQL
.headers on
.timer on

  SELECT COUNT(*)
  FROM ADDRESS_VIEW
  WHERE STREET_LOCALITY_PID = 'VIC1930629'
SQL

COUNT(*)
1246
Run Time: real 0.022 user 0.014577 sys 0.007477

Converting to GeoJSON

Most SQLite distributions come with the JSON1 extension installed, we can use these functions to generate a GeoJSON FeatureCollection from within SQLite.

It’s worth mentioning that I downloaded the data in the GDA94 CRS whereas GeoJSON uses WGS84. I’ll try to follow up with a blog post about using spatialite to reproject the data to correct this error.

{ sqlite3 gnaf.db > brunswick-street.geojson } <<SQL
  SELECT json_object(
    'type', 'FeatureCollection',
    'features', json_group_array(
      json_object(
        'type', 'Feature',
        'geometry', json_object(
          'type', 'Point',
          'coordinates', json_array(
            LONGITUDE, LATITUDE
          )
        ),
        'properties', json_object(
          'ADDRESS_DETAIL_PID', ADDRESS_DETAIL_PID,
          'STREET_LOCALITY_PID', STREET_LOCALITY_PID,
          'LOCALITY_PID', LOCALITY_PID,
          'BUILDING_NAME', BUILDING_NAME,
          'LOT_NUMBER_PREFIX', LOT_NUMBER_PREFIX,
          'LOT_NUMBER', LOT_NUMBER,
          'LOT_NUMBER_SUFFIX', LOT_NUMBER_SUFFIX,
          'FLAT_TYPE', FLAT_TYPE,
          'FLAT_NUMBER_PREFIX', FLAT_NUMBER_PREFIX,
          'FLAT_NUMBER', FLAT_NUMBER,
          'FLAT_NUMBER_SUFFIX', FLAT_NUMBER_SUFFIX,
          'LEVEL_TYPE', LEVEL_TYPE,
          'LEVEL_NUMBER_PREFIX', LEVEL_NUMBER_PREFIX,
          'LEVEL_NUMBER', LEVEL_NUMBER,
          'LEVEL_NUMBER_SUFFIX', LEVEL_NUMBER_SUFFIX,
          'NUMBER_FIRST_PREFIX', NUMBER_FIRST_PREFIX,
          'NUMBER_FIRST', NUMBER_FIRST,
          'NUMBER_FIRST_SUFFIX', NUMBER_FIRST_SUFFIX,
          'NUMBER_LAST_PREFIX', NUMBER_LAST_PREFIX,
          'NUMBER_LAST', NUMBER_LAST,
          'NUMBER_LAST_SUFFIX', NUMBER_LAST_SUFFIX,
          'STREET_NAME', STREET_NAME,
          'STREET_CLASS_CODE', STREET_CLASS_CODE,
          'STREET_CLASS_TYPE', STREET_CLASS_TYPE,
          'STREET_TYPE_CODE', STREET_TYPE_CODE,
          'STREET_SUFFIX_CODE', STREET_SUFFIX_CODE,
          'STREET_SUFFIX_TYPE', STREET_SUFFIX_TYPE,
          'LOCALITY_NAME', LOCALITY_NAME,
          'STATE_ABBREVIATION', STATE_ABBREVIATION,
          'POSTCODE', POSTCODE,
          'LATITUDE', LATITUDE,
          'LONGITUDE', LONGITUDE,
          'GEOCODE_TYPE', GEOCODE_TYPE,
          'CONFIDENCE', CONFIDENCE,
          'ALIAS_PRINCIPAL', ALIAS_PRINCIPAL,
          'PRIMARY_SECONDARY', PRIMARY_SECONDARY,
          'LEGAL_PARCEL_ID', LEGAL_PARCEL_ID,
          'DATE_CREATED', DATE_CREATED
        )
      )
    )
  )
  FROM ADDRESS_VIEW
  WHERE STREET_LOCALITY_PID = 'VIC1930629';
SQL

Displaying the data on a map

Here you can see a rendering of all 1246 G-NAF address points for Brunswick Street.

Brunswick Street
view an interactive version.

That’s all for today, reach out if you’re interested in seeing another blog post about working with geographic data in SQLite.

Photo credit: ‘Melbourne’ by Steven Penton