Query Optimization Techniques on BigQuery

Overview

With the growth of big data, data pipelines must handle enormous loads, and it is increasingly easy for costs to spiral out of control. Writing a query that simply runs is no longer enough. It must also be fast and cost-effective.

What is BigQuery?

BigQuery is a fully managed enterprise data warehouse that offers built-in features such as machine learning, geospatial analysis, and business intelligence to help you manage and analyze your data. Because BigQuery is serverless, you can run SQL queries to answer your business’s most pressing questions without administering any infrastructure. Its scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.

BigQuery Pricing

BigQuery pricing has two main components:

  • Analysis Pricing – The price of processing queries, such as SQL queries, user-defined functions, scripts, and specific data definition language (DDL) and data manipulation language (DML) table-scanning statements.
  • Storage Pricing – The cost to store the data loaded into BigQuery.

Analysis Pricing Models

BigQuery provides two pricing options for executing queries –

  • On-demand Pricing – With this price structure, you are charged based on how many bytes each query processes. The first 1 TB of processed query data each month is free.
  • Flat-rate Pricing – With this pricing model, you purchase slots, which are virtual CPUs. Buying slots gives you a fixed amount of processing capacity for executing queries. Several commitment plans are available:
    1. Flex Slots – You commit to an initial 60 seconds.
    2. Monthly – You commit to an initial 30 days.
    3. Annual – You commit to 365 days.

Note – The two pricing models can be combined as required. With on-demand pricing, you pay for what you use, but performance can vary because your queries draw on a shared pool of slots. With flat-rate pricing, you buy guaranteed capacity, and a longer-term commitment lowers the cost.
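As a rough illustration of the on-demand model, a month's charge can be estimated from the total bytes processed. This is only a sketch: `price_per_tib` is a placeholder to be filled in from the current price list, and treating the free 1 TB as a binary tebibyte (2^40 bytes) is an assumption.

```python
TIB = 2 ** 40  # bytes in one tebibyte (assumed unit for the free allowance)


def on_demand_cost(bytes_processed, price_per_tib, free_tib=1.0):
    """Estimate one month's on-demand query charge.

    bytes_processed: total bytes processed by all queries in the month.
    price_per_tib:   hypothetical price per TiB; look up the real figure.
    free_tib:        monthly free allowance (1 TB according to the text above).
    """
    billable_tib = max(0.0, bytes_processed / TIB - free_tib)
    return billable_tib * price_per_tib


# Made-up price of 5.0 per TiB: 3 TiB processed, of which 1 TiB is free.
print(on_demand_cost(3 * TIB, price_per_tib=5.0))  # 10.0
```

Because the charge depends only on bytes processed, every technique that reduces scanned bytes reduces the on-demand bill directly.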

Query Optimization Techniques

1. Selecting only required columns – Limit the number of bytes scanned by avoiding SELECT *. Instead of selecting every column in the table, select only the columns you need. You can also use SELECT * EXCEPT to leave out the columns you do not need. Because BigQuery uses columnar storage, the amount of data scanned grows with the number and size of the columns the query reads.
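The effect of column selection can be sketched with a toy model of columnar storage. The table `mydataset.orders`, its columns, and the per-column sizes below are all hypothetical; the point is only that the bytes scanned are the sum over the columns a query touches.

```python
# Hypothetical per-column storage sizes, in bytes, for an illustrative table.
COLUMN_BYTES = {
    "order_id": 8_000_000,
    "customer_id": 8_000_000,
    "order_date": 4_000_000,
    "notes": 500_000_000,  # one wide free-text column dominates the table
}

# BigQuery SQL shown as strings; the table name is made up.
SELECT_ALL = "SELECT * FROM mydataset.orders"
SELECT_NEEDED = "SELECT order_id, order_date FROM mydataset.orders"
SELECT_EXCEPT = "SELECT * EXCEPT (notes) FROM mydataset.orders"


def bytes_scanned(columns):
    """In a columnar store, only the referenced columns are read."""
    return sum(COLUMN_BYTES[name] for name in columns)


all_columns = list(COLUMN_BYTES)
print(bytes_scanned(all_columns))                               # SELECT *
print(bytes_scanned(["order_id", "order_date"]))                # needed columns
print(bytes_scanned([c for c in all_columns if c != "notes"]))  # * EXCEPT (notes)
```

In this model, dropping the single wide `notes` column removes most of the scanned bytes, which is the typical reason SELECT * EXCEPT pays off.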

2. Make use of partitioning and clustering – Use partitioned and clustered tables so that BigQuery can skip data the query does not need. For this to work, the WHERE clause must filter on the columns used to partition or cluster the table.
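A minimal sketch of partition pruning follows. The DDL and query are BigQuery SQL held in strings; `mydataset.events`, its columns, and the partition sizes are hypothetical. The small Python function models only the idea that a filter on the partitioning column limits which partitions are read.

```python
from datetime import date

# Hypothetical table, partitioned by day and clustered by customer.
CREATE_TABLE = """
CREATE TABLE mydataset.events (
  event_ts TIMESTAMP,
  customer_id STRING,
  amount NUMERIC
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id
"""

# The WHERE clause filters on both the partitioning and clustering columns.
PRUNED_QUERY = """
SELECT customer_id, SUM(amount) AS total
FROM mydataset.events
WHERE DATE(event_ts) = '2023-01-02'
  AND customer_id = 'C42'
GROUP BY customer_id
"""

# Toy model: made-up size of each daily partition, in bytes.
PARTITION_BYTES = {
    date(2023, 1, 1): 100,
    date(2023, 1, 2): 120,
    date(2023, 1, 3): 90,
}


def partition_bytes_scanned(first_day=None, last_day=None):
    """Sum the partitions that fall inside the date filter (None = no bound)."""
    return sum(
        size
        for day, size in PARTITION_BYTES.items()
        if (first_day is None or day >= first_day)
        and (last_day is None or day <= last_day)
    )


print(partition_bytes_scanned())                                    # no filter
print(partition_bytes_scanned(date(2023, 1, 2), date(2023, 1, 2)))  # one day
```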

3. Using aggregation – The data is first read from storage, and a partial GROUP BY is performed on each individual slot. The partial results are then bucketed with a hash function so that rows with the same key end up on the same slot; this step is called shuffling. Further aggregation happens there, and the result is passed to a final slot, which applies any limit to the results. Because GROUP BY requires several aggregation steps, it can be costly. For this reason, it is better to aggregate as late as possible.
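The stages described above can be sketched in plain Python. This is a simplified model, not BigQuery's actual executor: the slot count, the sample keys, and the use of CRC32 as the hash function are assumptions made for illustration.

```python
import zlib
from collections import Counter


def local_aggregate(rows):
    """Stage 1: each slot counts the keys in its own share of the data."""
    return Counter(rows)


def shuffle(partials, num_slots):
    """Stage 2: route every key to one slot chosen by a hash of the key."""
    buckets = [Counter() for _ in range(num_slots)]
    for partial in partials:
        for key, count in partial.items():
            target = zlib.crc32(key.encode("utf-8")) % num_slots
            buckets[target][key] += count
    return buckets


def final_aggregate(buckets):
    """Stage 3: combine the per-slot results into the final answer."""
    totals = Counter()
    for bucket in buckets:
        totals.update(bucket)
    return totals


# Three slots, each reading a different part of the table.
slot_inputs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
partials = [local_aggregate(rows) for rows in slot_inputs]
buckets = shuffle(partials, num_slots=2)
totals = final_aggregate(buckets)
print(totals)
```

Each GROUP BY in a query repeats this read, shuffle, and combine cycle, which is why fewer and later aggregations tend to be cheaper.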

Note – Another way to avoid a GROUP BY is to store the repeated data in nested (repeated) fields.
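To illustrate the note above, the sketch below compares a flat layout, which needs a grouping step to count items per order, with a nested layout in which each order row already carries its items. The tables `mydataset.order_items` and `mydataset.orders_nested` and the sample rows are hypothetical.

```python
from collections import defaultdict

# BigQuery SQL shown as strings; table and column names are made up.
FLAT_SQL = """
SELECT order_id, COUNT(*) AS n_items
FROM mydataset.order_items
GROUP BY order_id
"""
NESTED_SQL = """
SELECT order_id, ARRAY_LENGTH(items) AS n_items
FROM mydataset.orders_nested
"""

# Flat layout: one row per (order, item), so counting needs a GROUP BY.
flat_rows = [("o1", "pen"), ("o1", "ink"), ("o2", "pad")]
counts_flat = defaultdict(int)
for order_id, _item in flat_rows:
    counts_flat[order_id] += 1

# Nested layout: one row per order with a repeated `items` field.
nested_rows = [
    {"order_id": "o1", "items": ["pen", "ink"]},
    {"order_id": "o2", "items": ["pad"]},
]
counts_nested = {row["order_id"]: len(row["items"]) for row in nested_rows}

print(dict(counts_flat) == counts_nested)  # True
```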

4. Optimizing join queries – Put the largest table first in the join, followed by the smallest, and then the remaining tables in decreasing order of size.
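The ordering rule can be written as a small helper. This is a sketch with a hypothetical function name and made-up row counts; in practice the sizes would come from table metadata.

```python
def join_order(table_rows):
    """Order tables as: largest, then smallest, then the rest by decreasing size.

    table_rows maps a table name to its (approximate) row count.
    """
    by_size = sorted(table_rows, key=table_rows.get, reverse=True)
    if len(by_size) <= 2:
        return by_size
    largest, smallest = by_size[0], by_size[-1]
    return [largest, smallest] + by_size[1:-1]


# Made-up row counts for four hypothetical tables.
sizes = {"clicks": 1_000_000, "countries": 200, "users": 50_000, "campaigns": 3_000}
print(join_order(sizes))  # ['clicks', 'countries', 'users', 'campaigns']
```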

5. Filters before Joins – Apply WHERE filters as early as possible so that the slots performing the join work with the least data. Where possible, filter both tables before joining them.
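A toy comparison shows why this helps: filtering first gives the same result while feeding far fewer rows into the join. The `orders` and `customers` data and the simple in-memory hash join below are illustrative assumptions, not a description of BigQuery internals.

```python
# Made-up data: 1,000 orders spread over 100 customers.
orders = [
    {"order_id": i, "customer_id": i % 100, "year": 2022 + (i // 100) % 2}
    for i in range(1000)
]
customers = [
    {"customer_id": c, "country": "IN" if c % 4 == 0 else "US"}
    for c in range(100)
]


def hash_join(left, right, key):
    """Inner join: index the right side by key, then probe with the left side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Join everything, then filter: all 1,000 + 100 rows enter the join.
late = [
    row
    for row in hash_join(orders, customers, "customer_id")
    if row["year"] == 2023 and row["country"] == "IN"
]

# Filter both sides, then join: only 500 + 25 rows enter the join.
orders_2023 = [o for o in orders if o["year"] == 2023]
customers_in = [c for c in customers if c["country"] == "IN"]
early = hash_join(orders_2023, customers_in, "customer_id")

print(len(orders_2023), len(customers_in), early == late)
```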

6. Expression order matters – The first expression in the WHERE clause should always be the filter that eliminates the most data.
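The benefit can be seen in a simplified model that assumes predicates are evaluated in the order written and that evaluation stops at the first one that fails. The predicates and data are made up, and the model is an assumption for illustration rather than a statement about how BigQuery evaluates expressions.

```python
def filter_rows(rows, predicates):
    """Apply predicates in order with short-circuiting; count evaluations."""
    kept, evaluations = [], 0
    for row in rows:
        for predicate in predicates:
            evaluations += 1
            if not predicate(row):
                break
        else:
            kept.append(row)
    return kept, evaluations


def selective(n):
    return n % 100 == 0  # passes 1% of rows


def broad(n):
    return n % 2 == 0  # passes 50% of rows


rows = range(1000)
kept_a, cost_selective_first = filter_rows(rows, [selective, broad])
kept_b, cost_broad_first = filter_rows(rows, [broad, selective])

print(cost_selective_first, cost_broad_first, kept_a == kept_b)
```

Both orderings keep the same rows, but putting the most selective filter first does less work in this model.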

7. ORDER BY with LIMIT – If you use an ORDER BY clause on a large result, you may run into a "resources exceeded" error, because the final sort of the query must be done on a single slot. To avoid this, add a LIMIT clause so that the sorted result set stays manageable.
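The idea behind pairing ORDER BY with LIMIT can be sketched as a top-N selection: when only N rows are wanted, a bounded heap returns the same answer as a full sort while holding far fewer rows at once. The query string uses a hypothetical `mydataset.scores` table, and the data is randomly generated for illustration.

```python
import heapq
import random

# BigQuery SQL shown as a string; the table and columns are made up.
TOP_N_QUERY = """
SELECT user_id, score
FROM mydataset.scores
ORDER BY score DESC
LIMIT 10
"""

random.seed(0)  # fixed seed so the sample data is reproducible
scores = [random.randint(0, 1_000_000) for _ in range(10_000)]

# Without LIMIT: every row must be sorted before any result is returned.
top_by_full_sort = sorted(scores, reverse=True)[:10]

# With LIMIT: a top-N selection only tracks the 10 best rows seen so far.
top_by_heap = heapq.nlargest(10, scores)

print(top_by_heap == top_by_full_sort)  # True
```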

Conclusion

The query optimization techniques discussed above can be applied when performing data analytics on BigQuery to reduce the cost incurred.

FAQs

1. What is the most suitable use case for flat-rate pricing?

ANS: – Flat-rate pricing is appropriate if you need predictable monthly spending on analytics. When you sign up for flat-rate pricing, you purchase dedicated query processing capacity, measured in BigQuery slots. The monthly flat rate covers the cost of all bytes processed. If your queries exceed your flat-rate capacity, they are queued until flat-rate resources become available.

2. What are slots in BigQuery?

ANS: – BigQuery divides the processing power needed to run SQL queries into discrete units called slots.

3. What are clustered tables?

ANS: – Clustered tables in BigQuery are tables whose data is sorted according to one or more user-defined clustering columns. Clustered tables can reduce query costs and improve query performance.

WRITTEN BY Sahil Kumar

Sahil Kumar works as a Subject Matter Expert - Data and AI/ML at CloudThat. He is a certified Google Cloud Professional Data Engineer. He has a great enthusiasm for cloud computing and a strong desire to learn new technologies continuously.
