In this article we will cover the key factors in reducing BigQuery costs, from writing queries properly to building a data architecture that supports one goal: keeping costs down.
Queries in BQ Cost Money
Pricing in BQ follows a pay-as-you-go model: $5 per 1 TB of data scanned when executing your query.
Rule of thumb – in most cases, if a single query costs more than $1, you are probably doing something wrong. At $5 per TB, that means scanning roughly 200 GB in one query (200 / 1,024 × $5 ≈ $0.98).
Be in control of your cost – this tool can help you dramatically with that:
- BigQuery Mate (free Chrome extension)
Do’s and Don'ts
- Try to avoid using “SELECT *”. This way you avoid paying to scan columns you don’t actually need in your query result.
- Using a “LIMIT” clause on a query does not reduce the amount of data scanned (or billed).
- Use the “Table Preview” option to see a few rows of a table (instead of using LIMIT) – this is free.
- There is also a more extreme protection level under the Query settings – Advanced options section: “Maximum bytes billed”. This caps the bytes billed for the query; if the query would bill beyond the limit, it fails without incurring a charge (see the first sketch after this list).
- Use the default table expiration time to remove data when it’s no longer needed – this saves storage costs (also sketched after this list).
- Use streaming inserts only if your data must be immediately available for querying. In other words – don’t use streaming inserts unless you have to! They cost money, while batch load jobs are free (a batch-load alternative is sketched after this list).
- Quotas – you can define a hard limit on daily data scanned at the project level, or even per user per day. It’s good practice to set a daily budget that keeps you in the safe zone in terms of cost.
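To make these rules concrete, here is a minimal sketch using the google-cloud-bigquery Python client (the project, dataset, table and column names are all hypothetical). It first estimates a query’s scan with a free dry run, then executes it with a “maximum bytes billed” safety net:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT user_id, event_name          -- only the columns we need, not SELECT *
    FROM `my_project.my_dataset.events`
"""

# Dry run: estimates bytes scanned without running (or paying for) the query.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=dry_cfg)
tb_scanned = job.total_bytes_processed / 1024 ** 4
print(f"Would scan {job.total_bytes_processed:,} bytes (~${tb_scanned * 5:.4f})")

# Hard safety net: the query fails, free of charge, if it would bill > 1 GB.
cfg = bigquery.QueryJobConfig(maximum_bytes_billed=10 ** 9)
rows = client.query(sql, job_config=cfg).result()
```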
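Setting a default table expiration on a dataset is a one-liner with the same client; the dataset name here is hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Every new table in this dataset will expire after 30 days,
# unless an explicit expiration is set on the table itself.
dataset = client.get_dataset("my_project.staging")
dataset.default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])
```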
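And instead of streaming inserts, a free batch load job from Cloud Storage covers most ingestion that doesn’t need to be real-time; the bucket and table below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Batch loads from Cloud Storage are free; streaming inserts are not.
load_cfg = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
job = client.load_table_from_uri(
    "gs://my-bucket/events/2020-01-01/*.json",
    "my_project.my_dataset.events",
    job_config=load_cfg,
)
job.result()  # wait for the load to finish
```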
Performance (& Cost) Key Factors
- Partition your tables – especially the significantly big ones! This improves both query performance and cost.
- There are several ways to partition tables; the most common, and usually also the most cost- and performance-effective, is the time-partitioned tables method (see the partitioning sketch after this list).
- De-normalise when possible. BigQuery is by nature designed for huge, wide tables; implementing normalised table relations results in many joins and overhead in terms of both performance and cost (see the nested-fields sketch after this list).
- Use “Order By” only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.
- Avoid self-joins; use a window function instead (see the window-function sketch after this list).
- The same applies to cross-joins.
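A minimal partitioning sketch, assuming a hypothetical events table partitioned by day on an `event_date` column. Note how the query filters on the partitioning column, so only one day of data is scanned and billed:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned by day on the `event_date` column.
table = bigquery.Table(
    "my_project.my_dataset.events_partitioned",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
client.create_table(table)

# Filtering on the partitioning column prunes partitions:
# only the 2020-01-01 partition is scanned.
sql = """
    SELECT user_id, event_name
    FROM `my_project.my_dataset.events_partitioned`
    WHERE event_date = '2020-01-01'
"""
rows = client.query(sql).result()
```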
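One common way to de-normalise in BigQuery is nested and repeated fields. A sketch, with hypothetical orders / order-items entities folded into a single table so no join is needed at query time:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Instead of separate `orders` and `order_items` tables joined at query
# time, each order row carries its items as a repeated record.
table = bigquery.Table(
    "my_project.my_dataset.orders",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField(
            "items", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
                bigquery.SchemaField("price", "NUMERIC"),
            ],
        ),
    ],
)
client.create_table(table)
```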
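And a hedged sketch of replacing a self-join (here, a “latest event per user” pattern) with a window function; the table and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Latest event per user via ROW_NUMBER() -- no self-join needed.
sql = """
    SELECT user_id, event_name, event_ts
    FROM (
        SELECT
            user_id,
            event_name,
            event_ts,
            ROW_NUMBER() OVER (
                PARTITION BY user_id ORDER BY event_ts DESC
            ) AS rn
        FROM `my_project.my_dataset.events`
    )
    WHERE rn = 1
"""
rows = client.query(sql).result()
```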
Cost-Reducing Data Architecture Guidelines
- As mentioned at the top of the article – the key principle that guides us when designing a cost-effective architecture is to reduce the amount of data scanned by the end user.
- In most cases the end user will be an analyst or a business user working through a BI platform.
- We want to reduce the amount of data scanned both in terms of cost and, let’s not forget, performance. You don’t want your CEO waiting more than a few seconds for a dashboard to be presented with fresh data.
- The way to reduce the data is a simple layering method that starts with the raw data (the ingestion layer) and ends with a very small portion of it, designed and grouped specifically for the BI tool of your choice (the presentation layer).
- In between those layers you should do all the cleansing / transforming / joining / aggregating needed so that the top layer is optimised for fast and cheap queries from your BI visualisations.
- We strongly recommend decoupling your compute and storage.
A common way to ensure this is to use Airflow to orchestrate all of your data pipelines.
- The data operations at each layer can be executed by a tool of your choice. To simplify the process, a good practice is to start by implementing the layers solely with internal views (see the sketch below).
- This is how it looks eventually:
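Below is a minimal sketch of that starting point, assuming hypothetical `raw`, `cleansed` and `presentation` datasets, where each layer above the raw data begins life as a simple view:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Cleansed layer: a view that filters and shapes the raw ingestion table.
cleansed = bigquery.Table("my_project.cleansed.events")
cleansed.view_query = """
    SELECT user_id, event_name, DATE(event_ts) AS event_date
    FROM `my_project.raw.events`
    WHERE user_id IS NOT NULL
"""
client.create_table(cleansed)

# Presentation layer: a small, pre-aggregated view for the BI tool.
presentation = bigquery.Table("my_project.presentation.daily_active_users")
presentation.view_query = """
    SELECT event_date, COUNT(DISTINCT user_id) AS dau
    FROM `my_project.cleansed.events`
    GROUP BY event_date
"""
client.create_table(presentation)
```

Keep in mind that a plain view re-scans its underlying tables on every query; once the logic stabilises, materialising each layer into a table on a schedule (for example via the Airflow pipelines mentioned above) is what actually captures the cost savings.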