Voiced by Amazon Polly |
Introduction
Recently in AWS Re:Invent 2022, most of the featured announcements are related to data analytics, and one that might attract data analysts among them is that Amazon Redshift now has the capability of supporting new SQL functionalities namely, MERGE, ROLLUP, CUBE, and GROUPING SETS. Though developers working with analytical applications based out of Snowflake might already have some knowledge regarding these SQL features, let’s see how they work and help build data-driven applications and manage fast-changing data in Redshift.
MERGE
The ‘MERGE’ command is used to merge rows from a source table to a target table based on a condition, which means it inserts, updates, and deletes values based on values in a second table. Traditionally, this might take writing multiple insert, update, and delete statements separately to happen. This is useful when data in the target table is subjected to change more frequently. For every load, the target table may have to insert new rows that are new, update a few rows that are modified in the source, and/or delete a few rows that are marked as non-existent or not useful in the source table.
Syntax:
1 2 3 4 5 6 7 |
MERGE INTO target_table USING source_table [ [ AS ] alias ] ON match_condition WHEN MATCHED THEN { UPDATE SET col_name = { expr } [,...] | DELETE } WHEN NOT MATCHED THEN INSERT [ ( col_name [,...] ) ] VALUES ( { expr } [,...] ) |
Parameters:
target_table = permanent table that the rows merge into.
source_table = the table that supplies rows to merge into target_table. This cannot be a view or subquery.
alias = temporary alternative name. This is optional with the preceding AS.
match_condition = specifies the expression on which the target and source table rows will be compared. This can be single or multiple conditions.
WHEN MATCHED = specify the action to be taken when the match condition is evaluated to TRUE. We can perform either UPDATE or DELETE here.
UPDATE = updates the matching rows in the target table. Only the columns we specify will be updated.
DELETE = deletes the matching rows in the target table.
WHEN NOT MATCHED = specify the action that needs to be taken when the match condition is evaluated as FALSE. We can only perform INSERT here.
INSERT = inserts a new row into a target table. If we don’t specify the column names, the default order is the columns declared order.
col_name = one or more column names that we want to be modified.
expr = the expression that defines the new values to be inserted/updated/deleted for the above defined column names.
Example:
Let’s demonstrate a simple Merge with a simple example:
- First, we create two tables named target_table and source_table with some values inserted.
2. Next, we perform a merge operation to merge rows from source_table into target_table. When the matching condition is evaluated to true, it will update the row in target_table with source_table row values. When the matching condition is evaluated as false, it inserts the rows into target_table from source_table.
3. If we observe the record in target_table, it will be updated with the new value.
Points to remember:
- The source table and target table should not be the same.
- The target table should not be a system table or an external table.
- The source table should not be a view or subquery.
- Based on the matching conditions, merge behavior can be classified into two categories:
- Non-deterministic behavior: When multiple rows are matched between the source table and target table.
- Deterministic behavior: When there is a single exact match between the source and target tables.
5. Redshift only encourages deterministic behavior of merge. It doesn’t support any functionality to handle non-deterministic behavior yet.
6. Also, one observation while using the merge operation in Redshift is that you should always write both matched and not matched clauses. Both are mandatory in Redshift, unlike other data platforms where you can define any one clause based on the requirement.
Customized Cloud Solutions to Drive your Business Success
- Cloud Migration
- Devops
- AIML & IoT
Aggregation Extensions
Amazon Redshift now supports some aggregation extensions to perform multiple GROUP BY operations in a single statement. Those aggregation extensions are GROUPING SETS, ROLLUP, and CUBE.
Let’s look at each of them one by one.
GROUPING SETS
- This is one of the powerful extensions of the GROUP BY clause that allows us to compute multiple clauses in a single statement.
- A ‘Grouping Set’ is a set of single GROUP BY clause. We can group a query’s result by a set of 0 or more columns using GROUPING SETS.
- Grouping by a GROUPING SETS is equivalent to UNION ALL query on a result set grouped by multiple columns.
- GROUP BY GROUPING SETS((a)) is equivalent to the single grouping set operation GROUP BY a.
- GROUP BY GROUPING SETS((a), (b)) is equivalent to GROUP BY a UNION ALL GROUP BY b.
4. The output of a GROUPING SETS typically involves NULLs because it implicitly involves a UNION operation on the result set.
Syntax:
1 2 3 4 5 6 7 8 9 |
SELECT ... FROM ... [ ... ] GROUP BY GROUPING SETS ( groupSet [ , groupSet [ , ... ] ] ) [ ... ] |
Parameter:
groupSet = column alias or position of columns or expression.
Example:
Let’s see a simple demonstration of GROUPING SETS in 3 simple steps. For this, we are using a table of information about nurses. All nurses have a license as a ‘Registered Nurse’ (RN) and an additional license in a disaster-related specialty such as search and rescue, radio communications like Technician, or General called ‘Licensed Vocational Nurse’ (LVN).
- Creating and loading the data into a Redshift table.
2. Performing a simple GROUP BY GROUPING SETS.
3. Analyzing the output.
- The first two rows show the count of LVNs and RNs, and NULLs in the radio_license column are expected because the query grouped all of the medical_license values regardless of the radio_license values
- The last line shows NULLs in both columns. The medical_license is NULL because no one value of the medical_license column necessarily applies to all members of the radio_license column and vice versa.
ROLLUP
- ROLLUP is another extension of the GROUP BY clause, which produces sub-total and grouped rows. Sub total rows are those rows that further aggregate to compute the values by applying the same aggregate functions to the grouped rows.
- It assumes a hierarchy where preceding columns are considered parents of subsequent columns.
- We can think of GROUP BY ROLLUP as a series of GROUP BY GROUPING SETS or as generating multiple results sets, each of which is the aggregate of the previous result set.
- GROUP BY ROLLUP((a), (b)) is equivalent to GROUP BY GROUPING SETS ((a, b), (a), ()).
- If there are ‘N’ elements in a ROLLUP, it is equivalent to ‘N+1’ GROUPING SETS.
Syntax:
1 2 3 4 5 6 7 8 9 |
SELECT ... FROM ... [ ... ] GROUP BY ROLLUP ( groupRollup [ , groupRollup [ , ... ] ] ) [ ... ] |
Parameters:
groupRollup = column alias or position of columns or expression.
Example:
Let’s see a simple demonstration of how GROUP BY ROLLUP works. For this, we use sample data that includes information regarding store sales with branches in different cities, states, and territories in the sales table and product information in the product table.
- Creating & loading data into sales and product tables.
2. Performing a simple GROUP BY ROLLUP operation that shows profit by city, state, and total across all states.
3. Analyzing the result.
- Some rows of ROLLUP results contain NULLs, just like GROUPING SETS results. Here the last row in the result contains both state and city as NULL because, in that row, the profit is calculated for entire states and cities.
4. Comparing the result of GROUP BY ROLLUP with GROUP BY GROUPING SETS.
CUBE
- CUBE returns all the rows that ROLLUP will return, but the result of CUBE includes sub-total rows for every combination of grouping columns that ROLLUP does not cover. These additional rows can also be called ‘cross-tabulations’ rows.
- For example, GROUP BY CUBE((a), (b)) returns a result set that was first grouped by a and then by b while assuming b is a subsection of a and then by b alone.
- GROUP BY CUBE((a), (b)) is equivalent to GROUP BY GROUPING SETS ((a, b), (a), (b), ()).
- If there are ‘N’ elements in CUBE, it is equivalent to 2^N GROUPING SETS.
Syntax:
1 2 3 4 5 6 7 8 9 |
SELECT ... FROM ... [ ... ] GROUP BY CUBE ( groupCube [ , groupCube [ , ... ] ] ) [ ... ] |
Parameters:
groupCube = column alias or position of columns or expression.
Example:
Let’s see a simple CUBE operation for the same sales and product data we used for ROLLUP.
- Use GROUP BY ROLLUP to analyze hierarchical data and GROUP BY CUBE to analyze data across independent axes.
Conclusion
All the above features are currently in the preview version and strongly recommended by AWS to be used only for testing, not in production until those are live. Although we don’t use these features in daily life, these will come in handy to data analysts and save so much time during analysis.
Get your new hires billable within 1-60 days. Experience our Capability Development Framework today.
- Cloud Training
- Customized Training
- Experiential Learning
About CloudThat
CloudThat is also the official AWS (Amazon Web Services) Advanced Consulting Partner and Training partner and Microsoft gold partner, helping people develop knowledge of the cloud and help their businesses aim for higher goals using best in industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Drop a query if you have any questions regarding Amazon Redshift and I will get back to you quickly.
To get started, go through our Consultancy page and Managed Services Package that is CloudThat’s offerings.
FAQs
1. How to use the above mentioned SQL functionalities with our data workloads?
ANS: – To integrate these SQL functionalities with our data workloads, we must use the ‘Preview Cluster’ AWS offers to use preview features. The steps to create a preview cluster are below: a) Open the Amazon Redshift console in your AWS Management Console or use the link: https://console.aws.amazon.com/redshift/. b) Choose Clusters in the Provisioned Clusters dashboard. c) A banner displays ‘Preview’ on the clusters list page. Click the ‘Create Preview Cluster’ button to open the create cluster page. d) Enter properties for your cluster, choose the preview track containing the features you want to test and click on ‘Create Preview Cluster’.
2. When will the public preview end?
ANS: – AWS public preview for the above functionalities will end on ‘February 28, 2023’. So, using these functionalities with production workloads is not preferred until then.
3. How many GROUPING SETS can the Amazon Redshift GROUP BY clause support?
ANS: – Currently, Amazon Redshift supports ’64 GROUPING SETS’ in the GROUP BY clause. The number is subjected to change in the future.
WRITTEN BY Yaswanth Tippa
Yaswanth Tippa is working as a Research Associate - Data and AIoT at CloudThat. He is a highly passionate and self-motivated individual with experience in data engineering and cloud computing with substantial expertise in building solutions for complex business problems involving large-scale data warehousing and reporting.
Click to Comment