25 October 2021
Exploring G-NAF with SQLite
data:image/s3,"s3://crabby-images/5ad56/5ad56bd2dd5f7fd3c2e5d98216d068d3111babe8" alt=""
G-NAF is Australia’s authoritative, geocoded address file built using government data. In this post I’ll be exploring the G-NAF dataset with the aid of SQLite.
Sourcing the latest G-NAF data.
I downloaded the August vintage directly from data.gov.au, new versions are published on a quarterly basis.
The zip file itself is ~1.7GB and decompresses to a little under 8GB.
What’s inside the archive?
The documentation is nicely bundled within the /Documents/
directory, there’s a few goodies in /Extras/
, but the bulk of the data is stored in .psv
(pipe separated) files under /G-NAF AUGUST 2021/
.
├── Documents
│ ├── G-NAF Product Description.pdf
│ └── G-NAF Release Report August 2021.pdf
├── Extras
│ ├── GNAF_TableCreation_Scripts
│ │ ├── add_fk_constraints.sql
│ │ ├── create_tables_ansi.sql
│ │ └── create_tables_sqlserver.sql
│ └── GNAF_View_Scripts
│ └── address_view.sql
└── G-NAF AUGUST 2021
├── Authority Code
│ ├── Authority_Code_ADDRESS_ALIAS_TYPE_AUT_psv.psv
│ ├── Authority_Code_ADDRESS_CHANGE_TYPE_AUT_psv.psv
│ ├── ...
└── Standard
├── ACT_ADDRESS_ALIAS_psv.psv
├── ACT_ADDRESS_DEFAULT_GEOCODE_psv.psv
├── ...
Generating the SQLite schema
You’ll find a convenient ANSI SQL schema file within the archive which is compatible with SQLite.
# generate the database schema
sqlite3 gnaf.db < Extras/GNAF_TableCreation_Scripts/create_tables_ansi.sql
# print the newly generated schema
sqlite3 gnaf.db .schema
There’s also a file which defines a view called ADDRESS_VIEW
, SQLite doesn’t support the OR REPLACE
syntax so I needed to make a minor change to get it working:
# generate the database views
sqlite3 gnaf.db <<(sed 's/ OR REPLACE//g' Extras/GNAF_View_Scripts/address_view.sql)
Generating an import query
Pasting this bash script in the terminal will generate a file named gnaf-import.sql
.
# handle spaces in filenames
exec > gnaf-import.sql
OIFS="$IFS"
IFS=$'\n'
# csv mode (configured for .psv files)
echo '.mode csv'
echo '.separator "|"'
# fast import pragmas
echo 'PRAGMA synchronous=OFF;'
echo 'PRAGMA journal_mode=OFF;'
echo 'PRAGMA temp_store=MEMORY;'
# be verbose
echo '.echo on'
# import 'authority code'
for FILEPATH in `find "G-NAF AUGUST 2021/Authority Code" -type f -name "*.psv"`; do
BASENAME=$(basename $FILEPATH)
TABLE_NAME="${BASENAME/Authority_Code_/}"
TABLE_NAME="${TABLE_NAME/_psv.psv/}"
TABLE_NAME="${TABLE_NAME/.psv/}"
echo ".import '${FILEPATH}' '${TABLE_NAME}'"
done
# import 'standard'
for FILEPATH in `find "G-NAF AUGUST 2021/Standard" -type f -name "*.psv"`; do
BASENAME=$(basename $FILEPATH)
TABLE_NAME="${BASENAME#*_}"
TABLE_NAME="${TABLE_NAME/_psv.psv/}"
TABLE_NAME="${TABLE_NAME/.psv/}"
# only import to uppercase tables
# this avoids files like 'nt_locality_pid_linkage.psv which dont exist in the schema
if [[ $TABLE_NAME != $(echo $TABLE_NAME | tr '[:lower:]' '[:upper:']) ]]; then
continue
fi
# skip the header row
echo ".import '| tail -n +2 \"${FILEPATH}\"' '${TABLE_NAME}'"
done
IFS="$OIFS"
exec >/dev/tty
Running the import
Time to take a break and grab a tea or coffee, on my laptop it took a little under 10 minutes and resulted in a 8.2GB database file.
sqlite3 gnaf.db < gnaf-import.sql
Exploring the database
Data from each state has been merged into national tables, so for instance, the ADDRESS_DETAIL
table will contain rows from both the VIC_ADDRESS_DETAIL_psv.psv
and NSW_ADDRESS_DETAIL_psv.psv
file.
sqlite3 gnaf.db '.tables'
ADDRESS_ALIAS LOCALITY_ALIAS
ADDRESS_ALIAS_TYPE_AUT LOCALITY_ALIAS_TYPE_AUT
ADDRESS_CHANGE_TYPE_AUT LOCALITY_CLASS_AUT
ADDRESS_DEFAULT_GEOCODE LOCALITY_NEIGHBOUR
ADDRESS_DETAIL LOCALITY_POINT
ADDRESS_FEATURE MB_2011
ADDRESS_MESH_BLOCK_2011 MB_2016
ADDRESS_MESH_BLOCK_2016 MB_2021
ADDRESS_MESH_BLOCK_2021 MB_MATCH_CODE_AUT
ADDRESS_SITE PRIMARY_SECONDARY
ADDRESS_SITE_GEOCODE PS_JOIN_TYPE_AUT
ADDRESS_TYPE_AUT STATE
ADDRESS_VIEW STREET_CLASS_AUT
FLAT_TYPE_AUT STREET_LOCALITY
GEOCODED_LEVEL_TYPE_AUT STREET_LOCALITY_ALIAS
GEOCODE_RELIABILITY_AUT STREET_LOCALITY_ALIAS_TYPE_AUT
GEOCODE_TYPE_AUT STREET_LOCALITY_POINT
LEVEL_TYPE_AUT STREET_SUFFIX_AUT
LOCALITY STREET_TYPE_AUT
Querying the database
First, let’s find the unique street ID for Brunswick Street, Fitzroy, VIC
sqlite3 gnaf.db <<SQL
.headers on
SELECT street_locality_pid
FROM STREET_LOCALITY
JOIN LOCALITY USING(locality_pid)
WHERE street_name = 'BRUNSWICK'
AND street_type_code = 'STREET'
AND locality_name = 'FITZROY'
SQL
street_locality_pid
VIC1930629
Using the ADDRESS_VIEW
mentioned earlier we can easily retrieve a count of all addresses on this street.
sqlite3 gnaf.db <<SQL
.headers on
.timer on
SELECT COUNT(*)
FROM ADDRESS_VIEW
WHERE STREET_LOCALITY_PID = 'VIC1930629'
SQL
COUNT(*)
1246
Run Time: real 55.672 user 34.325452 sys 17.280557
Speeding things up
That worked great, we found 1246 addresses 🎉 …but it took 55 seconds 😿
Unfortunately the add_fk_constraints.sql
script from the archive isn’t compatible with SQLite.
I wanted to avoid manually defining indices, so the following script simply creates an index for every column whos name ends with either pid
or code
:
{ sqlite3 gnaf.db > gnaf-indices.sql } <<SQL
/* use fast import pragmas */
SELECT 'PRAGMA synchronous=OFF;';
SELECT 'PRAGMA journal_mode=OFF;';
SELECT 'PRAGMA temp_store=MEMORY;';
SELECT printf(
'CREATE INDEX IF NOT EXISTS %s ON %s (%s);',
printf('%s_%s', LOWER(t.name), LOWER(c.name)),
t.name, c.name
)
FROM sqlite_master t
LEFT OUTER JOIN pragma_table_info(t.name) c
WHERE t.type = 'table'
AND (
c.name LIKE '%\_pid' ESCAPE '\' OR
c.name LIKE '%\_code' ESCAPE '\' OR
c.name == 'code'
);
SQL
It’s worth noting here that it’s unlikely all of these indices are actually required, this is just the simplest method of creating them.
Time to start heating up the billycan again, generating the indices takes another 8 minutes and doubles the database to 17GB.
sqlite3 gnaf.db < gnaf-indices.sql
Let’s try that again
The same query which took 55 seconds without indices now takes 200ms ⚡
sqlite3 gnaf.db <<SQL
.headers on
.timer on
SELECT COUNT(*)
FROM ADDRESS_VIEW
WHERE STREET_LOCALITY_PID = 'VIC1930629'
SQL
COUNT(*)
1246
Run Time: real 0.022 user 0.014577 sys 0.007477
Converting to GeoJSON
Most SQLite distributions come with the JSON1 extension installed, we can use these functions to generate a GeoJSON FeatureCollection from within SQLite.
It’s worth mentioning that I downloaded the data in the GDA94
CRS whereas GeoJSON uses WGS84
. I’ll try to follow up with a blog post about using spatialite to reproject the data to correct this error.
{ sqlite3 gnaf.db > brunswick-street.geojson } <<SQL
SELECT json_object(
'type', 'FeatureCollection',
'features', json_group_array(
json_object(
'type', 'Feature',
'geometry', json_object(
'type', 'Point',
'coordinates', json_array(
LONGITUDE, LATITUDE
)
),
'properties', json_object(
'ADDRESS_DETAIL_PID', ADDRESS_DETAIL_PID,
'STREET_LOCALITY_PID', STREET_LOCALITY_PID,
'LOCALITY_PID', LOCALITY_PID,
'BUILDING_NAME', BUILDING_NAME,
'LOT_NUMBER_PREFIX', LOT_NUMBER_PREFIX,
'LOT_NUMBER', LOT_NUMBER,
'LOT_NUMBER_SUFFIX', LOT_NUMBER_SUFFIX,
'FLAT_TYPE', FLAT_TYPE,
'FLAT_NUMBER_PREFIX', FLAT_NUMBER_PREFIX,
'FLAT_NUMBER', FLAT_NUMBER,
'FLAT_NUMBER_SUFFIX', FLAT_NUMBER_SUFFIX,
'LEVEL_TYPE', LEVEL_TYPE,
'LEVEL_NUMBER_PREFIX', LEVEL_NUMBER_PREFIX,
'LEVEL_NUMBER', LEVEL_NUMBER,
'LEVEL_NUMBER_SUFFIX', LEVEL_NUMBER_SUFFIX,
'NUMBER_FIRST_PREFIX', NUMBER_FIRST_PREFIX,
'NUMBER_FIRST', NUMBER_FIRST,
'NUMBER_FIRST_SUFFIX', NUMBER_FIRST_SUFFIX,
'NUMBER_LAST_PREFIX', NUMBER_LAST_PREFIX,
'NUMBER_LAST', NUMBER_LAST,
'NUMBER_LAST_SUFFIX', NUMBER_LAST_SUFFIX,
'STREET_NAME', STREET_NAME,
'STREET_CLASS_CODE', STREET_CLASS_CODE,
'STREET_CLASS_TYPE', STREET_CLASS_TYPE,
'STREET_TYPE_CODE', STREET_TYPE_CODE,
'STREET_SUFFIX_CODE', STREET_SUFFIX_CODE,
'STREET_SUFFIX_TYPE', STREET_SUFFIX_TYPE,
'LOCALITY_NAME', LOCALITY_NAME,
'STATE_ABBREVIATION', STATE_ABBREVIATION,
'POSTCODE', POSTCODE,
'LATITUDE', LATITUDE,
'LONGITUDE', LONGITUDE,
'GEOCODE_TYPE', GEOCODE_TYPE,
'CONFIDENCE', CONFIDENCE,
'ALIAS_PRINCIPAL', ALIAS_PRINCIPAL,
'PRIMARY_SECONDARY', PRIMARY_SECONDARY,
'LEGAL_PARCEL_ID', LEGAL_PARCEL_ID,
'DATE_CREATED', DATE_CREATED
)
)
)
)
FROM ADDRESS_VIEW
WHERE STREET_LOCALITY_PID = 'VIC1930629';
SQL
Displaying the data on a map
Here you can see a rendering of all 1246 G-NAF address points for Brunswick Street.
data:image/s3,"s3://crabby-images/9611d/9611ddb32993087133e0fa77a1af102448a00ecd" alt="Brunswick Street"
👋 That’s all for today, reach out if you’re interested in seeing another blog post about working with geographic data in SQLite.
Image credit: Melbourne by Steven Penton.