Implementing SCD Type 2 and Type 3 Using PySpark

Overview

In the world of data warehousing, one of the fundamental challenges is handling changes in dimension data over time. The concept of Slowly Changing Dimensions (SCD) addresses this challenge, allowing businesses to track and store historical data as it evolves. SCD refers to how data in dimensional tables changes slowly and gradually rather than frequently or unpredictably. There are different strategies (or types) for managing these changes, and in this blog, we will focus on SCD Type 2 and SCD Type 3, explaining their concepts and showing how they can be implemented using PySpark.

Pioneers in Cloud Consulting & Migration Services

Reduced infrastructural costs
Accelerated application deployment

Get Started

Slowly Changing Dimension (SCD)

A dimension in a data warehouse refers to descriptive information (like a customer’s name, address, product category, etc.) that categorizes facts and measures to enable meaningful analysis. Slowly Changing Dimensions are dimensions that change over time but much slower than transactional data.

For example:

A customer’s address or marital status might change, but these changes are infrequent.
A product’s category could change if the business redefines its taxonomy.

Types of Slowly Changing Dimensions

Multiple types of SCDs define how to store historical changes in dimension data:

table1

SCD Type 2: Full Historical Tracking

SCD Type 2 is used when we want to preserve the entire history of changes in dimension attributes. This means that every time a change occurs in a dimension record, a new record is inserted into the dimension table, and the old record is maintained with an indicator of whether it is the current record. This allows businesses to see exactly how a dimension has evolved.

For example, consider a customer dimension table:

table2

The first record represents the customer’s address before the change.
The second record captures the updated address and has a Current_Flag of ‘Y’, indicating that this is the most recent address.

Implementation of SCD Type 2 Using PySpark

In PySpark, implementing SCD Type 2 involves merging incoming data with the existing dimension data and creating new records for updates. Below is a simplified implementation:

Pyspark code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, lit, when

# Initialize Spark session
spark = SparkSession.builder.appName("SCDType2").getOrCreate()

# Sample DataFrames
existing_df = spark.createDataFrame([
    (1, 'Aiswarya', 'Newtown', '2020-01-01', '2023-01-01', 'N'),
], ['Customer_ID', 'Customer_Name', 'Address', 'Start_Date', 'End_Date', 'Current_Flag'])

new_df = spark.createDataFrame([
    (1, 'Aiswarya', 'Saltlake', '2023-01-02'),
], ['Customer_ID', 'Customer_Name', 'Address', 'Start_Date'])

# Step 1: Identify changes
change_df = existing_df.alias('existing').join(
    new_df.alias('new'),
    on=['Customer_ID'],
    how='inner'
). filter(
    (col('existing.Address') != col('new.Address'))
)

# Step 2: Close out old records (set end date and flag them as 'N')
updated_existing_df = existing_df.alias('existing').join(
    change_df.select('existing.Customer_ID'),
    on=['Customer_ID'],
    how='inner'
). withColumn(
    'End_Date', current_date()
).withColumn(
    'Current_Flag', lit('N')
)

# Step 3: Insert new records for changed data
new_records_df = change_df.withColumn(
    'End_Date', lit (None)
).withColumn(
    'Current_Flag', lit('Y')
)

# Combine the new and old records
final_df = updated_existing_df.union(new_records_df)

final_df.show()

from pyspark.sql import SparkSession

from pyspark.sql.functions import col, current_date, lit, when

# Initialize Spark session

spark = SparkSession.builder.appName("SCDType2").getOrCreate()

# Sample DataFrames

existing_df = spark.createDataFrame([

(1, 'Aiswarya', 'Newtown', '2020-01-01', '2023-01-01', 'N'),

], ['Customer_ID', 'Customer_Name', 'Address', 'Start_Date', 'End_Date', 'Current_Flag'])

new_df = spark.createDataFrame([

(1, 'Aiswarya', 'Saltlake', '2023-01-02'),

], ['Customer_ID', 'Customer_Name', 'Address', 'Start_Date'])

# Step 1: Identify changes

change_df = existing_df.alias('existing').join(

new_df.alias('new'),

on=['Customer_ID'],

how='inner'

). filter(

(col('existing.Address') != col('new.Address'))

)

# Step 2: Close out old records (set end date and flag them as 'N')

updated_existing_df = existing_df.alias('existing').join(

change_df.select('existing.Customer_ID'),

on=['Customer_ID'],

how='inner'

). withColumn(

'End_Date', current_date()

).withColumn(

'Current_Flag', lit('N')

)

# Step 3: Insert new records for changed data

new_records_df = change_df.withColumn(

'End_Date', lit (None)

).withColumn(

'Current_Flag', lit('Y')

)

# Combine the new and old records

final_df = updated_existing_df.union(new_records_df)

final_df.show()

In this code:

We compare the existing data (existing_df) with the incoming new data (new_df).
For any changed records, we “close” the old records by updating the End_Date and setting the Current_Flag to ‘N’.
New records are added with the updated information and a Current_Flag of ‘Y’.

Benefits:

Complete historical tracking.
Allows analysis of data as it existed at any point in time.
It is useful when business processes need to be tracked over time, such as customer activity, product changes, etc.

SCD Type 3: Storing Previous and Current Values

SCD Type 3 tracks limited history by storing only the current and previous values. It’s suitable for infrequent changes where full historical tracking isn’t needed.

Example:
If a customer’s address changes, the table keeps both current and previous addresses:

table3

Implementation using PySpark:

# Existing and new data
existing_df = spark.createDataFrame([(1, 'Aiswarya', 'Newtown', 'Saltlake')],
                                    ['Customer_ID', 'Customer_Name', 'Current_Address', 'Previous_Address'])

new_df = spark.createDataFrame([(1, 'Aiswarya', 'Saltlake')],
                                ['Customer_ID', 'Customer_Name', 'Current_Address'])

# Step 1: Identify changes
change_df = existing_df.join(new_df, on='Customer_ID').filter(
    col('Current_Address') != col('new.Current_Address')
)

# Step 2: Update current and previous addresses
updated_df = existing_df.withColumn('Previous_Address', col('Current_Address')).withColumn(
    'Current_Address', col('new.Current_Address'))
updated_df.show()

# Existing and new data

existing_df = spark.createDataFrame([(1, 'Aiswarya', 'Newtown', 'Saltlake')],

['Customer_ID', 'Customer_Name', 'Current_Address', 'Previous_Address'])

new_df = spark.createDataFrame([(1, 'Aiswarya', 'Saltlake')],

['Customer_ID', 'Customer_Name', 'Current_Address'])

# Step 1: Identify changes

change_df = existing_df.join(new_df, on='Customer_ID').filter(

col('Current_Address') != col('new.Current_Address')

)

# Step 2: Update current and previous addresses

updated_df = existing_df.withColumn('Previous_Address', col('Current_Address')).withColumn(

'Current_Address', col('new.Current_Address'))

updated_df.show()

Benefits:

Tracks current and previous states.
Saves storage compared to Type 2.
Suitable for scenarios where limited history is sufficient.

Conclusion

Slowly Changing Dimensions, particularly SCD Type 2 and Type 3, are pivotal in ensuring the relevance and accuracy of the data used for analysis in a business intelligence environment. Understanding these strategies allows organizations to choose the right approach based on their needs and requirements. Incorporating PySpark into your data processing framework enables efficient handling of large datasets and simplifies the implementation of SCDs, ultimately leading to more insightful analysis.

So, whether you are developing a system for customer analytics or managing any business process with changing attributes, the strategic implementation of SCDs can provide deeper insights and answer questions about your operations over time.

Drop a query if you have any questions regarding SCD and we will get back to you quickly.

Empowering organizations to become ‘data driven’ enterprises with our Cloud experts.

Reduced infrastructure costs
Timely data-driven decisions

Get Started

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. What is the Difference between SCD Type 2 and Type 3?

ANS: –

Type 2: Creates a new record for every change, preserving full history with Start_Date, End_Date, and Current_Flag.
Type 3: Stores only current and previous values, tracking limited history for recent changes.

2. When to use Type 2 vs. Type 3?

ANS: –

Use Type 2 to track full history (e.g., address changes).
Use Type 3 for limited changes (e.g., last two addresses).

WRITTEN BY Aiswarya Sahoo

Aiswarya is a Data Engineer at CloudThat, with a strong focus on designing and building scalable data pipelines and cloud-based solutions. He is skilled in working with big data tools and technologies such as PySpark, AWS Glue, AWS Lambda, Amazon S3, and Amazon RDS. Aiswarya has a solid understanding of data processing, ETL workflows, and optimizing data systems for performance and reliability. In his free time, he enjoys exploring advancements in cloud computing, experimenting with new data tools, and staying updated with industry trends.