BigQuery is unlike other SQL databases in that one key thing does not exist... indexes!
Instead, we have clustering to work with to improve not just performance but cost.
BigQuery's pricing model is based on querying costs, and those costs are proportional to how much data each query reads. As such, a standard SELECT * FROM dataset.table
can become very expensive once that table reaches hundreds of GBs. Using clustering to reduce the amount of data being read is essential.
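As a quick illustration (the table and column names here are hypothetical), filtering on a clustered column lets BigQuery prune storage blocks so you are billed for only a fraction of the table:

-- Hypothetical table clustered on customer_id.
-- Scans, and bills for, the entire table:
SELECT * FROM dataset.orders;

-- Prunes blocks via the clustering column, reading only matching blocks:
SELECT order_id, order_total
FROM dataset.orders
WHERE customer_id = 'CUST-123';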
Clustering
I'm not going to re-explain what BigQuery clustering is, as I believe the docs do it quite well. For a very basic use-case, that is.
The moment you have a requirement to filter a clustered column on, say, subquery results, it all falls apart. I do hope this is something Google fixes, or at least documents more clearly, because one of my main findings is that clustering is applied very inconsistently.
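To make that concrete, here is the shape of query I mean (names are illustrative, matching the examples later in this post): the clustered column is filtered on the results of a subquery rather than a static value:

-- The clustered column is filtered on subquery results,
-- so block pruning is not applied and the full table may be scanned.
SELECT column1
FROM dataset.clusteredTable
WHERE clusteredColumn IN (
  SELECT joinColumn
  FROM dataset.filterTable
  WHERE filterColumn = 'Some Value'
);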
Execution Plan
When we came across a single query that cost $30 (and needed to be run regularly!), I soon scrambled for a way to find out why exactly my supposed clustering optimisation was not working.
Take a look at these two similar but oh so different execution stages:
$1:Column1, $2:ClusteredColumn, $3:Column2
FROM project.dataset.clusteredTable
WHERE in_array($2, ARRAY<...>)
$1:Column1, $2:ClusteredColumn, $3:Column2, $4:FilterColumn
FROM project.dataset.clusteredTable
SEMI HASH JOIN project.dataset.filterTable EACH WITH ALL ON $4 = $2
Although they seemingly perform the same job, some mysterious magic in BigQuery's back-end means clustering is only used in the first case. The new goal, therefore, is to write queries that consistently produce the first execution plan stage.
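If you want to check which plan you got, the Execution details tab in the console shows these stages. Assuming the standard INFORMATION_SCHEMA job views are available in your region, you can also compare how much data recent queries actually read:

-- Compare bytes read across recent queries
-- (adjust the region qualifier to your dataset's location).
SELECT job_id, total_bytes_processed, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC;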
Dynamic Queries
If you want to take advantage of clustering with dynamic results in a single query, the unfortunate answer is: I don't think it's possible. To produce the WHERE
clause in the execution plan, the clustered column must be filtered on a static value.
Luckily, we have multi-statement queries to save the day!
My specific use-case with BigQuery involves a lot of views over external query connections and a few large base tables. This is my company's current hybrid setup, migrated from a previously all-MySQL solution, to get the best of both worlds: MySQL's transactional nature and BigQuery's big-data storage. As a result, many of our queries depend on first filtering the results of these external views before joining to our base tables.
Here's how I ensured that clustering is always used in the execution plan.
Non-Optimised for Clustering:
SELECT column1, column2
FROM dataset.clusteredTable c
JOIN dataset.filteredTable f ON c.clusteredcolumn = f.joinColumn
WHERE f.filterColumn = 'Some Value'
Optimised for Clustering:
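-- Step 1: resolve the dynamic filter values into a variable up front.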
DECLARE filters_variable ARRAY<STRING> DEFAULT (
SELECT ARRAY_AGG(joinColumn)
FROM dataset.filteredTable
WHERE filterColumn = 'Some Value'
);
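-- Step 2: filter the clustered column on the now-static array value,
-- which allows BigQuery to prune blocks.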
SELECT column1, column2
FROM dataset.clusteredTable c
WHERE c.clusteredcolumn IN UNNEST(filters_variable)
Although not the prettiest query, I found that declared variables, as in the optimised version, were the only way to consistently keep query costs low. Using CTEs or subqueries did not have the same effect.
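For reference, this is the kind of CTE rewrite that looks equivalent but still produced the hash-join plan stage and a full scan:

-- Looks similar, but the filter is still a join/subquery at plan time,
-- so the clustered column is never compared against a static value.
WITH filters AS (
  SELECT joinColumn
  FROM dataset.filteredTable
  WHERE filterColumn = 'Some Value'
)
SELECT column1, column2
FROM dataset.clusteredTable c
WHERE c.clusteredcolumn IN (SELECT joinColumn FROM filters);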
Limitations
Of course, nothing is perfect, and there are some limitations to using DECLARE
to create array variables each time you need to filter on your clustered column.
- Unreadable queries. Especially when you start to need multiple variables, things can soon get out of hand.
- Size limitations. If your filter array has too many values, you can hit a BigQuery limit:
Script expression exceeded evaluation limit of 1048576 bytes.
Ultimately, BigQuery was designed for denormalised data, where the values you need to filter by ideally live within the table itself; that is exactly the situation clustering was designed for. However, if like us you have migrated a relational data structure into BigQuery, things may not always be so simple without entirely re-designing your schemas.
Using this hack to ensure clustering is consistently applied can be a good cost-saving measure that forces BigQuery to play nice with normalised data. :)