Understanding Data Skew
In Apache Spark, data skew occurs when certain keys in your dataset are significantly more frequent than others.
For Spark to run efficiently, data should be distributed across executors in roughly equal portions. Now consider joining two datasets on customer_id. If customer_id = 123 appears in 80% of the records, all of those records are hashed to the same partition. One executor gets overloaded while the others sit idle.
Impact of data skew:
- Straggler tasks that run much longer than others.
- Executors suffering from out-of-memory (OOM) errors.
- Overall poor Spark job performance, even if cluster resources look sufficient.
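To see why one hot key overloads a single executor, here is a minimal pure-Python sketch (not PySpark) of hash partitioning, the same idea Spark's shuffle uses. The dataset and the 80/20 split are illustrative assumptions, and `hash(k) % num_partitions` is a simplification of Spark's actual hash partitioner.

```python
from collections import Counter

# Hypothetical dataset: 80 of 100 rows carry the hot key 123
records = [123] * 80 + list(range(1, 21))

# Hash-partition by key across 5 partitions, as a shuffle would
num_partitions = 5
loads = Counter(hash(k) % num_partitions for k in records)

# The partition that receives key 123 holds at least 80 of the 100 rows
print(max(loads.values()))
```

Whichever partition key 123 lands in carries at least 80% of the data, which is exactly the straggler-task pattern described above.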
Techniques to Overcome Data Skew
Spark provides multiple ways to deal with data skew:
- Repartitioning / Coalesce – distribute data more evenly by increasing or reducing the number of partitions.
- Broadcast Joins – send a small dataset to all executors so the large side never has to shuffle.
- Adaptive Skew-Join Handling – Spark 3.x's Adaptive Query Execution (AQE) can detect and split skewed partitions automatically (spark.sql.adaptive.skewJoin.enabled); some platforms, such as Databricks, also offer an explicit SKEW join hint.
- Salting – artificially split hot keys into multiple sub-keys so that executors share the load.
Each of these helps, but salting in Spark is the most direct way to fix extreme key skew.
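Before turning to salting, the broadcast-join idea is worth a quick sketch. This is a toy pure-Python model, not the PySpark API: the small table is copied ("broadcast") to every executor, and each executor joins its local slice of the big table against that copy, so no shuffle of the big side is needed. The table contents and names are illustrative assumptions.

```python
# Small dimension table, broadcast to every executor
small = {123: "Alice", 456: "Bob"}

# Big fact table, already split across 2 "executors"
big_partitions = [
    [(123, 10), (456, 5)],
    [(123, 7)],
]

# Each executor joins its own partition against the broadcast copy;
# the big side never moves, so the hot key causes no shuffle skew
joined = [(cid, amt, small[cid])
          for part in big_partitions
          for cid, amt in part if cid in small]
print(joined)
```

Broadcasting only works while the small side fits in memory; when both sides are large and one key dominates, salting is the more direct fix.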
Salting Technique with Example
Let’s understand skew and salting with the example below.
The Problem
Imagine two datasets:
- df1 → very skewed on customer_id = 123.
- df2 → balanced data.
The usual join is:
df1.join(df2, "customer_id")
With this join, all customer_id = 123 records end up in a single executor's partition, creating skew.
The Solution: Salting
We add a salt column to the skewed dataset (df1) to distribute heavily repeated keys across multiple partitions. At the same time, we expand the smaller dataset (df2) by duplicating each row for every possible salt value. This ensures that when the join is performed on (customer_id, salt), the salted keys from df1 still have corresponding matches in df2, while the workload is evenly balanced across executors.
```python
from pyspark.sql.functions import lit, rand, floor, explode, array

# Add salt to the skewed dataset (df1).
# floor(rand() * 5) yields 5 buckets (0-4), spreading the hot key
# across more partitions.
df1_salted = df1.withColumn("salt", floor(rand() * 5))

# Expand df2 to replicate every row once per salt value,
# so each salted key in df1 still finds its match.
salt_values = [lit(i) for i in range(5)]
df2_expanded = df2.withColumn("salt", explode(array(*salt_values)))

# Join on (key + salt)
df_joined = df1_salted.join(df2_expanded, ["customer_id", "salt"])
```
How This Works
- df1 → adds a random salt (0–4) to each row.
- df2 → replicates each row across all salt values, so that salted keys still match.
- Join happens on (customer_id, salt) → spreading skewed keys across 5 executors instead of 1.
- After join, you can drop the salt column if not needed.
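The four steps above can be traced end to end in plain Python, with lists standing in for DataFrames. This is a simulation of the salting logic only, not Spark itself; the order IDs, the customer name, and the fixed seed are illustrative assumptions.

```python
import random

NUM_SALTS = 5
random.seed(0)  # fixed seed so the example is deterministic

# df1: skewed side - 100 rows, all with the hot key customer_id = 123
df1 = [(123, f"order-{i}") for i in range(100)]
# df2: balanced side - one row per customer
df2 = [(123, "Alice")]

# Step 1: salt df1 - every row gets a random salt bucket in 0..4
df1_salted = [((cid, random.randrange(NUM_SALTS)), order)
              for cid, order in df1]

# Step 2: expand df2 - one copy per salt value, so every salted key matches
df2_expanded = {(cid, s): name
                for cid, name in df2 for s in range(NUM_SALTS)}

# Step 3: join on the composite (customer_id, salt) key
joined = [(key[0], order, df2_expanded[key])
          for key, order in df1_salted if key in df2_expanded]

# Step 4: "drop the salt" - the output no longer carries it
print(len(joined))  # 100 - same rows as the unsalted join would produce
```

Because df2 was expanded with every salt value, the salted join loses no rows; it only changes where those rows are processed.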
Executor-Level Behavior
To make the salting technique concrete, consider how it behaves inside the executors. Assume there are 5 executors for the example above.
- Without salting:
- Executor-1: 100 records of customer_id = 123
- Executors 2–5: almost idle
- With salting (5 buckets):
- Executor-1: 20 records (customer_id = 123, salt = 0)
- Executor-2: 20 records (customer_id = 123, salt = 1)
- Executor-3: 20 records (customer_id = 123, salt = 2)
- Executor-4: 20 records (customer_id = 123, salt = 3)
- Executor-5: 20 records (customer_id = 123, salt = 4)
The load is evenly balanced, all executors work in parallel, and the job finishes faster.
Conclusion
Data skew is one of the most common performance bottlenecks in Spark jobs. While Spark provides multiple tools—broadcast joins, repartitioning, and skew hints—salting in Spark remains a powerful, hands-on technique for balancing workloads when a few keys dominate your dataset.
By understanding how salting in Spark works at the executor level, data engineers can make their pipelines faster, more efficient, and more reliable—whether on open-source Spark or platforms like Azure Databricks.
Interested in learning Spark and Azure Databricks? Visit the CloudThat website for our customized trainings: https://www.cloudthat.com/training/databricks .
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
WRITTEN BY G R Deeba Lakshmi