Generating a Postgres DB dump of a filtered relational set
2015-06-22 | Jaza | https://greenash.net.au/thoughts/2015/06/generating-a-postgres-db-dump-of-a-filtered-relational-set/

PostgreSQL is my favourite RDBMS, and it's the fave of many others too. And rightly so: it's a good database! Nevertheless, nobody's perfect.

When it comes to exporting Postgres data (as SQL INSERT statements, at least), the tool of choice is the standard pg_dump utility. Good ol' pg_dump is rock solid but, unfortunately, it doesn't allow for any row-level filtering. Turns out that, for a recent project of mine, a filtered SQL dump is exactly what the client ordered.
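
To illustrate the gap: pg_dump can happily restrict a dump to chosen tables, but it has no notion of restricting output to chosen rows. Here's a minimal sketch of invoking it from Python (the database name is the one from the example later in this post):

# Table-level filtering via -t is supported; but there's no flag akin
# to "only the rows WHERE world_id = 2".
import subprocess

subprocess.check_call([
    'pg_dump', '--data-only', '--inserts',
    '-t', 'world', '-t', 'country',
    'pg_dump_test',
])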

On account of this shortcoming, I spent some time whipping up a lil' Python script to take care of this functionality. I've converted the original code (written for a client-specific data set) to a more generic example script, which I've put up on GitHub under the name "PG Dump Filtered". If you're just after the code, then feel free to head over to the repo without further ado. If you'd like to stick around for the tour, then read on.

Worlds apart

For the example script, I've set up a simple schema of four entities: worlds, countries, cities, and people. For the sake of simplicity, this schema happens to be purely hierarchical (i.e. each world has zero or more countries, each country has zero or more cities, and each city has zero or more people); but the script could be adapted to any valid set of foreign-key-based relationships.

CREATE TABLE world (
    name character varying(255) NOT NULL,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    active boolean NOT NULL,
    uuid bytea,
    id integer NOT NULL
);

ALTER TABLE ONLY world
    ADD CONSTRAINT world_pkey PRIMARY KEY (id);

CREATE TABLE country (
    name character varying(255) NOT NULL,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    active boolean NOT NULL,
    uuid bytea,
    id integer NOT NULL,
    world_id integer,
    bigness numeric(10,2)
);

ALTER TABLE ONLY country
    ADD CONSTRAINT country_pkey PRIMARY KEY (id);
ALTER TABLE ONLY country
    ADD CONSTRAINT country_world_id_fkey FOREIGN KEY (world_id)
    REFERENCES world(id);

CREATE TABLE city (
    name character varying(255) NOT NULL,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    active boolean NOT NULL,
    uuid bytea,
    id integer NOT NULL,
    country_id integer,
    weight integer,
    is_big boolean DEFAULT false NOT NULL,
    pseudonym character varying(255) DEFAULT ''::character varying
        NOT NULL,
    description text DEFAULT ''::text NOT NULL
);

ALTER TABLE ONLY city
    ADD CONSTRAINT city_pkey PRIMARY KEY (id);
ALTER TABLE ONLY city
    ADD CONSTRAINT city_country_id_fkey FOREIGN KEY (country_id)
    REFERENCES country(id);

CREATE TABLE person (
    name character varying(255) NOT NULL,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    active boolean NOT NULL,
    uuid bytea,
    id integer NOT NULL,
    city_id integer,
    person_type character varying(255) NOT NULL
);

ALTER TABLE ONLY person
    ADD CONSTRAINT person_pkey PRIMARY KEY (id);
ALTER TABLE ONLY person
    ADD CONSTRAINT person_city_id_fkey FOREIGN KEY (city_id)
    REFERENCES city(id);

Using this schema, data belonging to two different worlds can co-exist in the same database. For example, we can have data for the world "Krypton" co-exist with data for the world "Romulus":

INSERT INTO world (name, created_at, updated_at, active, uuid, id)
VALUES ('Krypton', '2015-06-01 09:00:00.000000',
        '2015-06-06 09:00:00.000000', true,
        '\x478a43577ebe4b07ba8631ca228ee42a', 1);
INSERT INTO world (name, created_at, updated_at, active, uuid, id)
VALUES ('Romulus', '2015-06-01 10:00:00.000000',
        '2015-06-05 13:00:00.000000', true,
        '\x82e2c0ac3ba84a34a1ad3bbbb2063547', 2);

INSERT INTO country (name, created_at, updated_at, active, uuid, id,
                     world_id, bigness)
VALUES ('Crystalland', '2015-06-02 09:00:00.000000',
        '2015-06-08 09:00:00.000000', true,
        '\xcd0338cf2e3b40c3a3751b556a237152', 1, 1, 3.86);
INSERT INTO country (name, created_at, updated_at, active, uuid, id,
                     world_id, bigness)
VALUES ('Greenbloodland', '2015-06-03 11:00:00.000000',
        '2015-06-07 13:00:00.000000', true,
        '\x17591321d1634bcf986d0966a539c970', 2, 2, NULL);

INSERT INTO city (name, created_at, updated_at, active, uuid, id,
                  country_id, weight, is_big, pseudonym, description)
VALUES ('Kryptonopolis', '2015-06-05 09:00:00.000000',
        '2015-06-11 09:00:00.000000', true,
        '\x13659f9301d24ea4ae9c534d70285edc', 1, 1, 100, true,
        'Pointyville',
        'Nice place, once you get used to the pointiness.');

INSERT INTO city (name, created_at, updated_at, active, uuid, id,
                  country_id, weight, is_big, pseudonym, description)
VALUES ('Rom City', '2015-06-04 09:00:00.000000',
        '2015-06-13 09:00:00.000000', true,
        '\xc45a9fb0a92a43df91791b11d65f5096', 2, 2, 200, false,
        '',
        'Gakkkhhhh!');

INSERT INTO person (name, created_at, updated_at, active, uuid, id,
                    city_id, person_type)
VALUES ('Superman', '2015-06-14 09:00:00.000000',
        '2015-06-15 22:00:00.000000', true,
        '\xbadd1ca153994deca0f78a5158215cf6', 1, 1,
        'Awesome Heroic Champ');
INSERT INTO person (name, created_at, updated_at, active, uuid, id,
                    city_id, person_type)
VALUES ('General Zod', '2015-06-14 10:00:00.000000',
        '2015-06-15 23:00:00.000000', true,
        '\x796031428b0a46c2a9391eb5dc45008a', 2, 1,
        'Bad Bloke');

INSERT INTO person (name, created_at, updated_at, active, uuid, id,
                    city_id, person_type)
VALUES ('Mister Funnyears', '2015-06-14 11:00:00.000000',
        '2015-06-15 22:30:00.000000', false,
        '\x22380f6dc82d47f488a58153215864cb', 3, 2,
        'Mediocre Dude');
INSERT INTO person (name, created_at, updated_at, active, uuid, id,
                    city_id, person_type)
VALUES ('Captain Greeny', '2015-06-15 05:00:00.000000',
        '2015-06-16 08:30:00.000000', true,
        '\x485e31758528425dbabc598caaf86fa4', 4, 2,
        'Weirdo');

In this case, our two key stakeholders – the Kryptonians and the Romulans – have been good enough to agree to their respective data records being stored in the same physical database. After all, they're both storing the same type of data, and they accept the benefits of a shared schema in terms of cost-effectiveness, maintainability, and scalability.

However, these two stakeholders aren't exactly the best of friends. In fact, they're not even on speaking terms (have you even seen them both feature in the same franchise, let alone the same movie?). Plus, for legal reasons (and in the interests of intergalactic peace), there can be no possibility of Kryptonian records falling into Romulan hands, or vice versa. So, it really is critical that, as far as these two groups are concerned, the data appears to be completely partitioned.

(It's also lucky that we're using Postgres and Python, which all parties appear to be cool with – the Klingons are mad about Node.js and MongoDB these days, so the Romulans would never have come on board if we'd gone down that path…).

Fortunately, thanks to the wondrous script that's now been written, these unlikely DB room-mates can have their dilithium and eat it, too. The Romulans, for example, can simply specify their World ID of 2:

./pg_dump_filtered.py \
"postgresql://pg_dump_test:pg_dump_test@localhost:5432/pg_dump_test" 2 \
> ~/pg_dump_test_output.sql

And they'll get a DB dump of what is (as far as they're concerned) … well, the whole world! (Note: please do not change your dietary habits per the above innuendo; dilithium can harm your unborn baby).

And all thanks to a lil' bit of Python / SQL trickery, to filter things according to their world:

# ...

# Thanks to:
# http://bytes.com/topic/python/answers/438133-find-out-schema-psycopg
t_cur.execute((
    "SELECT        column_name "
    "FROM          information_schema.columns "
    "WHERE         table_name = '%s' "
    "ORDER BY      ordinal_position") % table)

t_fields_str = ', '.join([x[0] for x in t_cur])
d_cur = conn.cursor()

# Start constructing the query to grab the data for dumping.
query = (
    "SELECT        x.* "
    "FROM          %s x ") % table

# The rest of the query depends on which table we're at.
if table == 'world':
    query += "WHERE         x.id = %(world_id)s "
elif table == 'country':
    query += "WHERE         x.world_id = %(world_id)s "
elif table == 'city':
    query += (
        "INNER JOIN    country c "
        "ON            x.country_id = c.id ")
    query += "WHERE         c.world_id = %(world_id)s "
elif table == 'person':
    query += (
        "INNER JOIN    city ci "
        "ON            x.city_id = ci.id "
        "INNER JOIN    country c "
        "ON            ci.country_id = c.id ")
    query += "WHERE         c.world_id = %(world_id)s "

# For all tables, filter by the top-level ID.
d_cur.execute(query, {'world_id': world_id})

With a bit more trickery thrown in for good measure, to more-or-less emulate pg_dump's export of values for different data types:

# ...

# Start constructing the INSERT statement to dump.
d_str = "INSERT INTO %s (%s) VALUES (" % (table, t_fields_str)
d_vals = []

for i, d_field in enumerate(d_row):
    d_type = type(d_field).__name__

    # Rest of the INSERT statement depends on the type of
    # each field.
    if d_type == 'datetime':
        d_vals.append("'%s'" % d_field.isoformat().replace('T', ' '))
    elif d_type == 'bool':
        d_vals.append('%s' % (d_field and 'true' or 'false'))
    elif d_type == 'buffer':
        d_vals.append(r"'\x" + ("%s'" % hexlify(d_field)))
    elif d_type == 'int':
        d_vals.append('%d' % d_field)
    elif d_type == 'Decimal':
        d_vals.append('%f' % d_field)
    elif d_type in ('str', 'unicode'):
        d_vals.append("'%s'" % d_field.replace("'", "''"))
    elif d_type == 'NoneType':
        d_vals.append('NULL')

d_str += ', '.join(d_vals)
d_str += ');'

(Above code samples from: pg_dump_filtered.py).
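
As a quick sanity check (my own addition, not part of the script), you can replay the filtered dump into a scratch database that already has the schema loaded, and confirm that no foreign worlds leaked through. The scratch database name here is hypothetical:

# The dump contains INSERTs only, so the scratch database must already
# have the schema loaded. If any Kryptonian rows had leaked into the
# Romulan dump, the assertion below would fail.
import os

import psycopg2

conn = psycopg2.connect(
    'postgresql://pg_dump_test:pg_dump_test'
    '@localhost:5432/pg_dump_test_scratch')
cur = conn.cursor()
cur.execute(open(os.path.expanduser('~/pg_dump_test_output.sql')).read())
cur.execute('SELECT DISTINCT world_id FROM country')
assert cur.fetchall() == [(2,)]
conn.commit()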

And that's the easy part done! Now, on to working out how to efficiently do Postgres master-slave replication over a distance of several thousand light years, without disrupting the space-time continuum.

(livelong AND prosper);

Hope my little example script comes in handy for anyone else needing a version of pg_dump that can do arbitrary filtering on inter-related tables. As I said in the README, with only a small amount of tweaking, this script should be able to produce a dump of virtually any relational data set, filtered by virtually any criteria that you might fancy.

Also, a note on scope: this script exists because Postgres' own pg_dump utility lacks any query-level filtering functionality, so using pg_dump for this job is simply not an option. The script could quite easily be adapted to other DBMSes (e.g. MySQL, SQL Server, Oracle), although most of Postgres' competitors have a dump utility with at least some filtering capability, so a script like this may be less necessary there.

Australian LGA to postcode mappings with PostGIS and Intersects
2014-07-12 | Jaza | https://greenash.net.au/thoughts/2014/07/australian-lga-to-postcode-mappings-with-postgis-and-intersects/

For a recent project, I needed to know the LGAs (Local Government Areas) of all postcodes in Australia, and vice versa. As it turns out, there is no definitive Australia-wide list containing this data anywhere. People have been discussing the issue for some time, with no clear outcome. So, I decided to get creative.

To cut a long story short: I've produced my own list! You can download my Australian LGA postcode mappings spreadsheet from Google Docs.

If you want the full story: I imported both the LGA boundaries data and the Postal Area boundaries data from the ABS, into PostGIS, and I did an "Intersects" query on the two datasets. I exported the results of this query to CSV. Done! And all perfectly reproducible, using freely available public data sets, and using free and open-source software tools.

The process

I started by downloading the geo data that I needed from the ABS. My source was the page Australian Statistical Geography Standard (ASGS): Volume 3 - Non ABS Structures, July 2011. This was the most recent page that I could find on the ABS site containing all the data that I needed. I downloaded the files "Local Government Areas ASGS Non ABS Structures Ed 2011 Digital Boundaries in MapInfo Interchange Format" and "Postal Areas ASGS Non ABS Structures Ed 2011 Digital Boundaries in MapInfo Interchange Format".

Big disclaimer: I'm not an expert at anything GIS- or spatial-related; I'm a complete n00b at this. I decided to download the data I needed in MapInfo format; it's also available on the ABS web site in ArcGIS Shapefile format. I could have downloaded the Shapefiles instead – they can also be imported into PostGIS, using the same tools that I used. I chose the MapInfo files because, after some quick Googling around, I got the impression that MapInfo files are less complex and somewhat more portable. I may have made the wrong choice. Feel free to debate the merits of MapInfo vs ArcGIS files for this task, and to try this out yourself using ArcGIS instead of MapInfo. I'd be interested to see the difference in results (theoretically there should be no difference… in practice, who wants to bet there is?).

I then had to install PostGIS (I already had Postgres installed) and related tools on my local machine (running Ubuntu 12.04). I'm not providing PostGIS installation instructions here; there's plenty of information available elsewhere to help you get set up with all the tools you need for your specific OS / requirements. Installing PostGIS and related tools can get complicated, so if you do decide to try all this yourself, don't say I didn't warn you. Ubuntu is probably one of the easier platforms on which to install it, but there are plenty of guides out there for Windows and Mac too.

Once I was all set up, I imported the data files into a PostGIS-enabled Postgres database with these commands:

ogr2ogr -a_srs EPSG:4283 -f "PostgreSQL" \
PG:"host=localhost user=lgapost dbname=lgapost password=PASSWORD" \
-lco OVERWRITE=yes -nln lga LGA_2011_AUST.mid

ogr2ogr -a_srs EPSG:4283 -f "PostgreSQL" \
PG:"host=localhost user=lgapost dbname=lgapost password=PASSWORD" \
-lco OVERWRITE=yes -nln postcodes POA_2011_AUST.mid

If you're interested in the OGR Toolkit (ogr2ogr and friends), there are plenty of resources available; in particular, this OGR Toolkit guide was very useful for me.
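
For instance, ogrinfo (ogr2ogr's read-only sibling) is handy for peeking at a data file's layers, fields, and projection before committing to an import. A quick sketch of invoking it from Python against the LGA file downloaded above:

# -al lists all layers, -so keeps the output to a summary (fields,
# feature count, projection info).
import subprocess

subprocess.check_call(['ogrinfo', '-al', '-so', 'LGA_2011_AUST.mid'])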

After playing around with a few different map projections, I decided that EPSG:4283 was probably the correct one to pass to ogr2ogr. I based my decision on seeing the MapInfo projection string "CoordSys Earth Projection 1, 116" in the header of the ABS data files, and then finding this list of map projections commonly used in Australia. Once again: I am a total n00b at this. I know very little about map projections (except that it's a big and complex topic). Feel free to let me know if I've used completely the wrong projection for this task.
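
If you want to double-check which SRID actually got recorded on import (an easy way to catch projection mix-ups), you can ask PostGIS directly. A minimal sketch, reusing the connection details from the ogr2ogr commands above and assuming the tables landed in the default public schema:

# Find_SRID() returns the SRID registered for a geometry column; for
# the EPSG:4283 import above, this should print 4283.
import psycopg2

conn = psycopg2.connect(host='localhost', user='lgapost',
                        dbname='lgapost', password='PASSWORD')
cur = conn.cursor()
cur.execute("SELECT Find_SRID('public', 'lga', 'wkb_geometry')")
print(cur.fetchone()[0])
conn.close()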

I renamed the imported tables to 'lga' and 'postcodes' respectively, and I then ran this from the psql shell, to find all LGAs that intersect with all postal areas, and to export the result to a CSV:

\copy (SELECT     l.state_name_2011,
                  l.lga_name_2011,
                  p.poa_code_2011
       FROM       lga l
       INNER JOIN postcodes p
       ON         ST_Intersects(
                      l.wkb_geometry,
                      p.wkb_geometry)
       ORDER BY   l.state_name_2011,
                  l.lga_name_2011,
                  p.poa_code_2011)
TO '/path/to/lga_postcodes.csv' WITH CSV HEADER;
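
Once the CSV exists, building a lookup in either direction is easy. As a small illustration (my own addition, not part of the original process), here's a Python sketch that maps each postcode to the set of LGA names intersecting it; the column names come from the query's header row:

# Build a postcode -> LGA names lookup from the exported CSV.
import csv
from collections import defaultdict

postcode_to_lgas = defaultdict(set)
with open('/path/to/lga_postcodes.csv') as f:
    for row in csv.DictReader(f):
        postcode_to_lgas[row['poa_code_2011']].add(row['lga_name_2011'])

# e.g. all LGAs that intersect postcode 2000 (Sydney CBD):
print(sorted(postcode_to_lgas['2000']))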

Final remarks

That's about it! Also, some notes of mine (mainly based on the trusty Wikipedia page Local Government in Australia):

  • There's no data for the ACT, since the ACT has no LGAs
  • The Brisbane and Gold Coast metro areas are each almost entirely a single LGA
  • Some areas of Australia aren't part of any LGA (although they're all remote areas with very small populations)
  • Quite a large number of valid Australian postcodes are not part of any LGA (because they're for PO boxes, for bulk mail handlers, etc, and they don't cover a geographical area as such, in the way that "normal" postcodes do)

I hope that this information is of use to anyone else who needs to link up LGAs and postcodes in a database or in a GIS project.
