I recently completed the final stage of a data migration project for the start-up I work at by rewriting a bunch of MySQL queries into BigQuery queries.
With all the different flavours of SQL (T-SQL, MySQL, Google SQL...) and the move from on-premise to cloud (or cloud to other cloud), I feel as if data migrations and the subsequent reworking of all its dependencies are extremely common right now. This is the 2nd migration project I have been involved in since starting my data career and perhaps the most finicky thing is the slight syntax differences between each version of SQL.
In doing this MySQL to BigQuery mass rewrite, I put together a list of all of the translations I encountered and little tricks for working with BigQuery.
I plan to write another post on using clustering and partioning with BigQuery, coming from a relational database design into BigQuery's preferred denormalised structure.
Let's get into it.
Functions
MySQL | BigQuery |
---|---|
NOW |
CURRENT_DATETIME |
CONCAT_WS |
STRING_AGG or ARRAY_TO_STRING
|
STR_TO_DATE('31/01/24','%d,%m,%y') |
PARSE_DATE(’%d,%m,%y’ , '31/01/24’) |
GROUP_CONCAT |
ARRAY_AGG |
LOCATE(substring, string) |
INSTR(string, substring) |
MD5() |
TO_HEX(MD5()) |
SUBSTRING_INDEX(substring, delim, index) |
SPLIT(substring, delim)[SAFE_OFFSET(index-1)] * |
* indexing in MySQL is 1-based, BigQuery is 0-based
Key Words
MySQL | BigQuery |
---|---|
UNION |
UNION ALL |
SET @variable = value; |
DECLARE variable {TYPE} DEFAULT (value) |
Something AS 'column' |
Something AS column (no quotes) |
Data Types
MySQL | BigQuery |
---|---|
SIGNED |
INT64 |
BOOL |
INT64 |
DECIMAL |
NUMERIC |
Other Tips & Tricks
- In MySQL
ORDER BY NULL
which is used for performance improvement with use ofGROUP BY
has no BigQuery equivalent - MySQL allows subqueries in join predicates unlike BigQuery. Instead you should define the subquery as the joined table instead.
-- MYSQL:
JOIN database.joined_table b
ON (
a.column1 = b.column1
AND a.column2 = (
SELECT
c.column2
FROM
database.inner_joined_table c
WHERE
c.column3 = b.column3
)
)
-- BigQuery:
JOIN (
SELECT
b.column1, c.column3
FROM
database.joined_table b
INNER JOIN
database.inner_joined_table c
ON
c.column3 = b.column3
) b
ON
a.column1 = b.column1 AND b.column3 = a.column2
-
GROUP BY
in MySQL does not need to specify all non-aggregate columns and instead selects a random value. In Bigquery you must either add non-aggregate columns to theGROUP BY
clause or aggregate them. - Using
JSON_VALUE
with an iterator in MySQL does not work in BigQuery. Instead you must use CTEs and theUNNEST
function to achieve the same behaviour.
-- BigQuery:
WITH json_row_data AS (
SELECT
id
,JSON_EXTRACT_ARRAY(json_data, '$') AS r
FROM
database.json_data
),
unnested_rows AS (
SELECT
id
,unnested_row AS json_data
FROM json_row_data, UNNEST(r) unnested_row
)
SELECT
id,
JSON_VALUE(ajd.json_data, '$.myColumn') AS `myColumn`
FROM unnested_rows ajd
GROUP BY id
- In general, using BigQuery window functions is preferred over recreating the login in a custom
GROUP BY
as the window functions are better optimised. - The above point also applies for cases when tables are joined to themselves to access previous rows. Look into if using
LEAD
andROW_NUMBER
could satisfy your requirements instead. This also results in more cost-efficient queries as BigQuery costs are directly related to how much data is read from each table (no point to access the same table twice if unneeded!)
This is definitely not a comphrensive list of all translations between the two different syntaxes but is a bunch of the things I came across in my migration task so hopefully it may help someone else who is similarly going through a process of rewriting queries into BigQuery!