Importing data from PostgreSQL to Google BigQuery

Cedric Sadai
Powerspace Engineering
Nov 14, 2017 · 2 min read

As we carry out our transition from an on-premises environment to Google Cloud, we face new challenges.

We operate in the advertising industry, offering an automated and scalable platform to purchase and serve non-intrusive advertising.

A key component of our mission is to track all the user events in a timely and safe manner, and let business users query this data in any way possible.

As we started our move to the cloud last year, we needed to offer a similar experience to the end users compared to what they were used to when we were on-premises. Part of it is the ability to join our Event table to various business tables, stored in a different database.

Our business tables are stored in a PostgreSQL database, which has also been migrated to Google Cloud SQL.
Hence, to always have fresh data to join on in BigQuery, we need to export our business tables to BigQuery on a regular basis.

Unfortunately, Google currently doesn’t offer the ability to sync Cloud SQL tables to BigQuery.

At first, as we are power users of streaming solutions (Akka Streams, Kafka Streams, Google Dataflow), we investigated Debezium to create a stream of changes that would land in Kafka, from which a streaming job could apply the modifications in real time. Yet, BigQuery doesn’t (really) offer mutability.

So we had to go down the export/import road. To do this, we decided to leverage Spark’s existing connectors and build a command-line utility on top of them, written in Scala.
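
Under the hood, the idea boils down to reading each table over JDBC with Spark and dumping it to Google Cloud Storage as Avro. What follows is only a minimal sketch of that flow, not pg2bq’s actual code: it assumes Spark 2.4+ with the spark-avro module and the PostgreSQL JDBC driver on the classpath, plus a configured GCS connector; the object name is hypothetical, and the table and bucket names are the ones from the sample configuration further down.

import org.apache.spark.sql.SparkSession

object ExportTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pg-to-avro-sketch")
      .getOrCreate()

    // Read the "campaign" table straight from PostgreSQL over JDBC
    // (requires the org.postgresql JDBC driver on the classpath).
    val campaign = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://mypg:5432/mydb")
      .option("dbtable", "campaign")
      .option("user", "myuser")
      .option("password", "mypwd")
      .load()

    // Dump it to a temporary GCS bucket as Avro; Avro keeps the column types,
    // which is what makes the later BigQuery load painless.
    campaign.write
      .format("avro")
      .save("gs://pg-export-tmp/campaign")

    spark.stop()
  }
}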

You can download it and read the source code right here: https://github.com/Powerspace/pg2bq

The usage is quite straightforward:

./bin/pg2bq -Dconfig.file=configuration.conf

In the configuration, you define the connection to your database and your Google Cloud settings, and explicitly list the tables to include in the import:

jdbc {
  url = "jdbc:postgresql://mypg:5432/mydb"
  user = "myuser"
  password = "mypwd"
  tables = [ "user", "campaign", "website" ]
}

gcloud {
  project = "gcloud-project-id"
  service-account-key-path = "/path/to/service-account-key.json"
  bq.dataset = "mypg"
  gcs.tmp-bucket = "pg-export-tmp"
}
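
This file is HOCON, so it can be read with the Typesafe Config library, which is exactly what the -Dconfig.file override above plugs into. The snippet below is a hypothetical sketch of how the values could be pulled out, not pg2bq’s internals; the field names simply mirror the sample above.

import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

object LoadConfigSketch {
  def main(args: Array[String]): Unit = {
    // ConfigFactory.load() honours -Dconfig.file=..., so the file passed on
    // the command line ends up here.
    val config = ConfigFactory.load()

    val jdbcUrl   = config.getString("jdbc.url")
    val tables    = config.getStringList("jdbc.tables").asScala.toList
    val dataset   = config.getString("gcloud.bq.dataset")
    val tmpBucket = config.getString("gcloud.gcs.tmp-bucket")

    println(s"Exporting ${tables.mkString(", ")} from $jdbcUrl to $dataset via gs://$tmpBucket")
  }
}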

The strength of this program is that it exports the tables in Avro format, preserving the type of each column, which eases the table creation and import processes. Also note that every time the program runs, it performs a “delete-and-import” kind of operation, as incremental updates are not well supported in BigQuery (no UPDATE/DELETE).
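
To illustrate that “delete-and-import” behaviour, here is a rough sketch of a truncate-and-load into BigQuery using the google-cloud-bigquery Java client. The dataset, table, and bucket names come from the sample configuration above; the rest is an assumption about the flow, not pg2bq’s actual code.

import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

object LoadIntoBigQuerySketch {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Load the Avro files dumped on GCS into mypg.campaign, replacing the
    // previous contents: WRITE_TRUNCATE is the "delete" in delete-and-import.
    // BigQuery infers the schema from the Avro files, so the column types
    // exported by Spark carry over.
    val loadConfig = LoadJobConfiguration
      .newBuilder(TableId.of("mypg", "campaign"), "gs://pg-export-tmp/campaign/*.avro")
      .setFormatOptions(FormatOptions.avro())
      .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
      .build()

    val job = bigquery.create(JobInfo.of(loadConfig))
    val completed = job.waitFor()

    if (completed == null || completed.getStatus.getError != null)
      sys.error(s"Load of mypg.campaign failed: ${Option(completed).map(_.getStatus.getError)}")
  }
}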

We will keep sharing our experience of migrating to the cloud, and open-sourcing components that can be useful to the Google Cloud community.

If you like playing with the likes of Scala, Google Cloud, Spark, and others, we have data-engineering positions currently open in Paris (jobs@powerspace.com).
