Hidden gems of BigQuery — Part 2 — Change tracking

hidden-gems-of-bigquery — part-2 — change-tracking

Hidden gems of BigQuery — Part 2 — Change tracking

Some time ago, I posted “Hidden gems of BigQuery” on the Google Cloud Blog. Since the time of this post, several things have changed, especially in the area of change tracking. So, it is time for an update…

Before we start, here is a quick update on Multi-statement transactions and Search Indexes — both features are now GA and ready to be used on production environments!

Now to the main part…

In my initial post, I discussed ways to detect what data was already processed and what was added after that. One way to handle it was using my favorite “hidden gem” — AUTO:

BigQuery AUTO feature

However, one of the main limitation of AUTO was that it is only supported by Streaming API, and Streaming API is now marked as legacy (and Storage Write API should be used instead). I checked with the BigQuery Team, and they confirmed that AUTO would not be supported by Storage Write API.

However, now there are two new and even better ways to do it!

DEFAULT values

There is a new feature (currently still in Preview) called DEFAULT values, and it works exactly as you would expect — you can set a default value for a column (when defining a table or later using ALTER statement) and when a new record is added (or updated), and value for this column is not provided — it will be populated with a specified default value.

BigQuery DEFAULT values feature

DEFAULT feature supports literals and a set of functions, including very important for change tracking functions like CURRENT_DATETIME and CURRENT_TIMESTAMP (the full list can be found here).

This means that this feature can be used as a replacement for AUTO for detecting when new data is added to the BigQuery table (but opposite to AUTO, DEFAULTS are supported by all ways to insert data into BigQuery).

The column data will be populated with a default value if a column data is not provided OR “DEFAULT” is provided as a value. “DEFAULT” can even be used in UPDATE statement.

There are some limitations to this feature, but none of it really limits its usability (and some of them probably would be removed by the time feature goes to GA).

BigQuery DEFAULT feature limitations

There is more about DEFAULT values, but there is no point in repeating here full documentation of this feature that can be found here.

Change History

Another amazing new feature (also still in Preview currently) is Change History.

There is no way to describe it better than official documentation:

BigQuery change history lets you track the history of changes to a BigQuery table. The change history for a table is exposed as a SQL table-valued function (TVF) that shows you particular types of changes made during a specified time range. This feature lets you process incremental changes made to a table.

BigQuery Change History

APPENDS function returns the table with all the columns of the original table, plus two additional columns — _CHANGE_TIMESTAMP (which can be used to track when data was added) and _CHANGE_TYPE (which is kind of useless now since it can only have one value — INSERT).

Opposite to the DEFAULT feature, Change History has one significant limitation — currently, it only works for appending data (and not for modifying data). But the existence of _CHANGE_TYPE column that currently supports only one value INSERT hints that this limitation is temporary and functionality will be extended later.

Another limitation — it works within a time-travel window (which is seven days by default), and there is another Preview feature that allows you to change it.

There is more about Change History, but there is no point in repeating here full documentation of this feature that can be found here.

Experiments

I was curious to get answers to the following questions:

  • If a table has two columns and both of them have default values set to CURRENT_TIMESTAMP() — would the values in those columns be the same or not?
  • Would the value of the column with a default set to CURRENT_TIMESTAMP() be equal to _CHANGE_TIMESTAMP value from the APPENDS function?

Let’s do a simple test:

CREATE TABLE dataset.NewTable
(
InsertTimestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
InsertTimestamp2 TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
);

INSERT INTO dataset.NewTable (InsertTimestamp, InsertTimestamp2)
VALUES (DEFAULT, DEFAULT);

SELECT
_CHANGE_TYPE AS change_type,
_CHANGE_TIMESTAMP AS change_time,
* FROM
APPENDS(TABLE dataset.NewTable, NULL, NULL);

This simple experiment showed that:

  • If a table has two columns and both of them have default values set to CURRENT_TIMESTAMP() — values would be the same.
  • The value created by DEFAULT set to CURRENT_TIMESTAMP() is smaller than the value returned by _CHANGE_TIMESTAMP (and the difference can be significant — in my test, it was around 1.5 seconds).

Summary

BigQuery is amazing! Thanks to the Team for making it, and thanks to Googlers who, after my initial post, pointed me to those new features!

Links to docs about mentioned features:


Hidden gems of BigQuery — Part 2 — Change tracking was originally published in Google Developers Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

Total
1
Shares
Leave a Reply

Your email address will not be published. Required fields are marked *

Previous Post
most-simple-and-clean-designer-portfolios

Most Simple And Clean Designer Portfolios

Next Post
how-to-host-your-flask-app-free-online

How to host your Flask app free online

Related Posts