Cloud Computing, Data Analytics

4 Mins Read

Implementing SCD Type 2 and Type 3 Using PySpark

Voiced by Amazon Polly

Overview

In the world of data warehousing, one of the fundamental challenges is handling changes in dimension data over time. The concept of Slowly Changing Dimensions (SCD) addresses this challenge, allowing businesses to track and store historical data as it evolves. SCD refers to how data in dimensional tables changes slowly and gradually rather than frequently or unpredictably. There are different strategies (or types) for managing these changes, and in this blog, we will focus on SCD Type 2 and SCD Type 3, explaining their concepts and showing how they can be implemented using PySpark.

Pioneers in Cloud Consulting & Migration Services

  • Reduced infrastructural costs
  • Accelerated application deployment
Get Started

Slowly Changing Dimension (SCD)

A dimension in a data warehouse refers to descriptive information (like a customer’s name, address, product category, etc.) that categorizes facts and measures to enable meaningful analysis. Slowly Changing Dimensions are dimensions that change over time but much slower than transactional data.

For example:

  • A customer’s address or marital status might change, but these changes are infrequent.
  • A product’s category could change if the business redefines its taxonomy.

Types of Slowly Changing Dimensions

Multiple types of SCDs define how to store historical changes in dimension data:

table1

SCD Type 2: Full Historical Tracking

SCD Type 2 is used when we want to preserve the entire history of changes in dimension attributes. This means that every time a change occurs in a dimension record, a new record is inserted into the dimension table, and the old record is maintained with an indicator of whether it is the current record. This allows businesses to see exactly how a dimension has evolved.

For example, consider a customer dimension table:

table2

  • The first record represents the customer’s address before the change.
  • The second record captures the updated address and has a Current_Flag of ‘Y’, indicating that this is the most recent address.

Implementation of SCD Type 2 Using PySpark

In PySpark, implementing SCD Type 2 involves merging incoming data with the existing dimension data and creating new records for updates. Below is a simplified implementation:

Pyspark code:

In this code:

  1. We compare the existing data (existing_df) with the incoming new data (new_df).
  2. For any changed records, we “close” the old records by updating the End_Date and setting the Current_Flag to ‘N’.
  3. New records are added with the updated information and a Current_Flag of ‘Y’.

Benefits:

  • Complete historical tracking.
  • Allows analysis of data as it existed at any point in time.
  • It is useful when business processes need to be tracked over time, such as customer activity, product changes, etc.

SCD Type 3: Storing Previous and Current Values

SCD Type 3 tracks limited history by storing only the current and previous values. It’s suitable for infrequent changes where full historical tracking isn’t needed.

Example:
If a customer’s address changes, the table keeps both current and previous addresses:

table3

Implementation using PySpark:

Benefits:

  • Tracks current and previous states.
  • Saves storage compared to Type 2.
  • Suitable for scenarios where limited history is sufficient.

Conclusion

Slowly Changing Dimensions, particularly SCD Type 2 and Type 3, are pivotal in ensuring the relevance and accuracy of the data used for analysis in a business intelligence environment. Understanding these strategies allows organizations to choose the right approach based on their needs and requirements. Incorporating PySpark into your data processing framework enables efficient handling of large datasets and simplifies the implementation of SCDs, ultimately leading to more insightful analysis.

So, whether you are developing a system for customer analytics or managing any business process with changing attributes, the strategic implementation of SCDs can provide deeper insights and answer questions about your operations over time.

Drop a query if you have any questions regarding SCD and we will get back to you quickly.

Empowering organizations to become ‘data driven’ enterprises with our Cloud experts.

  • Reduced infrastructure costs
  • Timely data-driven decisions
Get Started

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. What is the Difference between SCD Type 2 and Type 3?

ANS: –

  • Type 2: Creates a new record for every change, preserving full history with Start_Date, End_Date, and Current_Flag.
  • Type 3: Stores only current and previous values, tracking limited history for recent changes.

2. When to use Type 2 vs. Type 3?

ANS: –

  • Use Type 2 to track full history (e.g., address changes).
  • Use Type 3 for limited changes (e.g., last two addresses).

WRITTEN BY Aiswarya Sahoo

Share

Comments

    Click to Comment

Get The Most Out Of Us

Our support doesn't end here. We have monthly newsletters, study guides, practice questions, and more to assist you in upgrading your cloud career. Subscribe to get them all!