Table of contents
This note describes my efforts in processing AddressBase from Ordnance Survey into SQLite.
The Addresspack project processes a standard “AddressBase Premium full (CSV)” into a SQLite database.
Before jumping into AddressBase details, you might want to consider OS Open UPRN. It is a slim option if you only want the coordinates for a given URPN. It is licenced under the Open Government Licence and offers two options: a table-oriented CSV and a GeoPackage.
Notice that now AddressBase Core is also offered as GeoPackage.
AddressBase Premium CSV is a particular beast. The download is a ~8GB zip file that contains around 352 files with a
Although these files are labelled as "csv" they are not structured as a table. They have to be treated as a single list of comma-separated lines where each line starts with the numeric identifier for the table they belong to.
So, to recompose the original relational model, you need to read the technical specification and create the expected schema.
If you only want construct tabular CSV files, you might be able to get away with just using the header files.
At this point, you need to read the given files line by line, parse each line as CSV, promote empty values as
NULL, and load all ~352 million records into the expected table.
# Why SQLite
The journey you have to go through to get from what Ordnance Survey gives you to the point where you are able to run your first query against the AddressBase database can't be qualified as a good experience.
SQLite is a single-file relational database with a good SQL engine. What would be the experience if instead of downloading the CSV zip you could instead download a SQLite database ready to be used? Perhaps the database could include some metadata and enough documentation to let you get an understanding of what is each table for and what are the relationships between them. Similar to what you get in the technical specification but co-located with the data.
Well, Addresspack is an exercise to explore this idea.
The bare schema before loading the full dataset provides
column_info which lets you explore the basic documentation for each table and columns.
For example, to know what columns are available and what are they for for the table with id
10 (the first record of every CSV file starts with
10) you could:
sqlite3 addresspack.sqlite \ "select table_info.name, column_info.id, column_info.definition \ from column_info join table_info on column_info.table_id = table_info.id \ where table_id = 10;"
# Getting the data
The Ordnance Survey website has an order's section where you can buy access to the AddressBase Premium database. The one Addresspack expects is the CSV with Download as a delivery method.
You can download the pre-build package for your platform from the release page or you can build and install with
cargo install \ --git https://github.com/arnau/addresspack \ --tag v0.1.0
Note: It has only been tested with Rust 1.44 but it is likely that it works for older versions as well.
SQLite only allows a single writer so Addresspack can't use multiple threads to parallelise. Although I tried to make the ingestion process fast, it is extremely slow. In the order of hours.
By default, a transaction is commited every 10 files (i.e. every 10,000,000 inserts) using the SQLite WAL journaling. If the process stops (e.g. you kill the process), you can run it again and it will resume where it stopped with a potential data loss of 10 million rows.
If you have ideas or want to contribute to make it faster, you are more than welcome to reach to me!
Another limitation is that Addresspack is only able to process the full version of AddressBase Premium. I haven't looked at what would it mean to load a data update.
application_cross_reference table can be impractical to use due to its size:
$ time sqlite3 addresspack.sqlite "select count(*) from application_cross_reference;" 191207473 2.26s user 29.48s system 21% cpu 2:29.76 total
So, if there is a need for a query that requires a full scan, expect it to be slow.
Out of curiosity, this is the same count using xsv after exporting the table to CSV:
$ time xsv count xref.csv 191207473 30.49s user 8.93s system 96% cpu 40.764 total
And of course, SQLite is a single file for the whole database which was the main driver for this exercise.
These are some numbers that might help understand the scale of the task:
|AB76GB_CSV (unzipped, 352 files)||44 GB|