Understanding Denormalization for BigQuery
A long time ago in a galaxy far, far away...
To understand denormalization, we need to take a trip back in time, back to the last century. This was a time when CPU speeds were measured in megahertz and hard drives were sold by the megabyte. Passing ruffians were sometimes seen saying “Ni!” to old ladies, and modems made funny noises when connecting to online services. Oh, these were dark days.
In these ancient times we normalized our databases. But why? It was simple, really. Hard drive space was expensive and computers were slow. When saving data, we wanted to use as little space as possible, and data retrieval had to use as little compute power as possible. Normalization saved space. Data separated into many tables could be combined in different ways for flexible data retrieval. Indexes made querying across multiple tables fast and efficient.
It was, however, complicated. Sometimes databases were poorly designed. Other times, data requirements changed over time, causing a good design to deteriorate. Sometimes there were conflicting data requirements. A proper design for one use case might be a poor design for a different use case. And what about when you wanted to combine data from different databases or data sources that were not relational? Oh the humanity...
Neanderthals developed tools...
Then Google said, “Go west young man and throw hardware at the problem.”
“What do you mean?” asked the young prince.
“If hard drives are slow, just connect a lot of them together and, combined, they will be faster. And don’t worry about indexes. Just dump the data in files and read the files with a few thousand computers.”
And the road to BigQuery was paved...
Run, Forrest, Run!
When data is put in BigQuery, each field is stored in a separate file on a massively distributed file system. Storage is cheap; only a couple of cents per GB per month. Storage is plentiful; there is no limit to the amount of data that can be put into BigQuery. Storage is fast, super fast! Terabytes can be read in seconds.
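That column-per-file idea is what makes full-column scans so fast. Here is a toy sketch in plain Python (not BigQuery’s actual storage format; all the field names and rows are made up) showing why a columnar layout lets you scan one field without touching the rest of each row:

```python
# Toy model: the same three rows stored row-wise versus column-wise.
rows = [
    {"name": "Ada", "city": "London", "age": 36},
    {"name": "Grace", "city": "Arlington", "age": 85},
    {"name": "Linus", "city": "Helsinki", "age": 54},
]

# Column-oriented layout: one "file" (here, a list) per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Row store: answering "all the ages" means reading every whole row.
ages_row_scan = [row["age"] for row in rows]

# Column store: the same answer comes from reading only the "age" column,
# leaving the name and city data untouched on disk.
ages_col_scan = columns["age"]

print(ages_row_scan == ages_col_scan)
```

Same answer either way; the columnar scan just reads a fraction of the bytes, and that fraction shrinks as tables grow wider.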
Data processing is done on a massive cluster of computers that is separate from storage. Storage and compute are connected with a petabit network. Processing is fast and plentiful. If the job is big, just use more machines!
Danger, Will Robinson!
Ah, but there is a caveat: joins are expensive in BigQuery. That doesn’t mean you can’t do a join; it just means there might be a more efficient way.
Denormalize the data! (said in a loud, deep voice with a little echo)
BigQuery tables support nested, hierarchical data. So, in addition to the usual data types like strings, numbers, and booleans, fields can be records, which are composite types made up of multiple fields. Fields can also be repeated, like an array. Thus, a field can be an array of a complex type. So, you don’t need two tables for a one-to-many relationship. Mash it together into one table.
Don’t store your orders in one table and order details in a different table. In the Orders table, create a field called Details, which is an array of complex records containing all the information about each item ordered. Now there is no need for a join, which means we can run an efficient query without an index.
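The denormalized shape described above can be sketched with plain Python dicts standing in for BigQuery RECORD fields in REPEATED mode (the field names, SKUs, and prices here are illustrative, not from any real schema):

```python
# Each order row carries its line items nested inside it -- the
# "Details" array of records from the text, rather than a separate table.
orders = [
    {
        "order_id": 1001,
        "customer": "Leia",
        "details": [  # repeated record: one entry per item ordered
            {"sku": "X-WING", "qty": 1, "price": 149.99},
            {"sku": "DROID", "qty": 2, "price": 29.99},
        ],
    },
    {
        "order_id": 1002,
        "customer": "Han",
        "details": [{"sku": "FALCON", "qty": 1, "price": 999.99}],
    },
]

# Total value of each order: no join against an OrderDetails table,
# just a walk over the array nested inside each row.
totals = {
    o["order_id"]: sum(d["qty"] * d["price"] for d in o["details"])
    for o in orders
}
```

In BigQuery SQL the same walk over the nested array is done with UNNEST, but the point stands: every line item travels with its order, so the query never has to match rows across tables.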
“But doesn’t this make querying less flexible?” asked the young boy in the third row with the glasses and the Steph Curry T-shirt.
Yes, I guess that’s true. But storage is cheap and plentiful. So, just store the data multiple times in different ways when you want to query it in different ways.
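Here is a small sketch of that “store it twice, shaped for each query” idea, again in plain Python with made-up names: the same line items materialized once grouped by order and once grouped by product.

```python
# The raw facts: (order_id, sku, qty) line items.
line_items = [
    ("1001", "X-WING", 1),
    ("1001", "DROID", 2),
    ("1002", "FALCON", 1),
]

# Copy 1: nested under orders, for "what is in order N?" queries.
by_order = {}
for order_id, sku, qty in line_items:
    by_order.setdefault(order_id, []).append({"sku": sku, "qty": qty})

# Copy 2: nested under SKUs, for "who ordered product X?" queries.
by_sku = {}
for order_id, sku, qty in line_items:
    by_sku.setdefault(sku, []).append({"order_id": order_id, "qty": qty})

# Each question is now a single lookup in the copy shaped for it,
# at the price of storing the data twice.
```

Wasteful by last-century standards, but when storage is a couple of cents per GB, two copies shaped for two query patterns beat one copy and an expensive join.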
“Heresy!” screamed the old man as he was dragged away, clinging to his ergonomic keyboard and trackball.