Introduction
When you work with real production data in PySpark, maps show up more often than you might expect. Event attributes, feature stores, JSON payloads, configuration blobs, metadata columns. All of these commonly land in Spark as MapType columns.
The mistake many teams make is treating maps like opaque JSON objects. They explode them, flatten them, or push the logic downstream into Python UDFs. That works at a small scale, but performance drops quickly once data grows.
Spark SQL already provides powerful, native map transformation functions that run within the Catalyst optimizer and Tungsten engine. They execute as native expressions inside the JVM, run in parallel across partitions, and are typically far faster than any UDF-based approach.
In this post, we will focus on four such functions that are both expressive and high-performance:
- transform_keys
- transform_values
- map_filter
- map_contains_key
We will look at what they do, why they are fast, and how to use them effectively with small, clear examples.
Why native map transformations matter
- They run entirely inside Spark’s execution engine.
- Catalyst optimizes them.
- They avoid serialization overhead.
- They scale cleanly across large datasets.
If performance matters, and it usually does, native functions should always be your first choice.
Setup: A simple map column
Let’s assume we have a DataFrame like this:
```python
from pyspark.sql import functions as F

# Assumes an existing SparkSession named `spark`.
data = [
    (1, {"cpu": 80, "memory": 70, "disk": 90}),
    (2, {"cpu": 40, "memory": 60}),
    (3, {"disk": 50}),
]

df = spark.createDataFrame(data, ["id", "metrics"])
df.show(truncate=False)
```
The metrics column is a MapType(StringType, IntegerType) that represents system metrics per record.
1. transform_keys: changing map keys without reshaping data
transform_keys allows you to modify map keys while keeping values intact. This is extremely useful when normalizing schemas, standardizing naming conventions, or aligning data from multiple sources.
Example: prefix all keys
```python
df_transformed = df.withColumn(
    "metrics_prefixed",
    F.transform_keys("metrics", lambda k, v: F.concat(F.lit("sys_"), k))
)
df_transformed.show(truncate=False)
```
Output for the first row:
{"sys_cpu": 80, "sys_memory": 70, "sys_disk": 90}
Why is this powerful?
- No explode required
- No regrouping
- No shuffle
- No UDF
Spark applies this transformation directly at the map level, which keeps execution fast and memory efficient.
2. transform_values: modifying values while preserving structure
transform_values lets you update map values without touching keys. This is perfect for scaling, thresholding, or cleaning numeric data.
Example: cap values at 75
```python
df_capped = df.withColumn(
    "metrics_capped",
    F.transform_values(
        "metrics",
        lambda k, v: F.when(v > 75, 75).otherwise(v)
    )
)
df_capped.show(truncate=False)
```
Use cases in real pipelines
- Normalizing scores
- Applying unit conversions
- Masking sensitive values
- Filling defaults conditionally
Again, this happens natively inside Spark’s engine.
3. map_filter: removing unwanted entries efficiently
Often, you want to drop certain key-value pairs based on business logic. map_filter lets you do exactly that.
Example: keep only metrics above 60
```python
df_filtered = df.withColumn(
    "metrics_filtered",
    F.map_filter("metrics", lambda k, v: v > 60)
)
df_filtered.show(truncate=False)
```
Why does this beat explode + filter?
Exploding a map multiplies rows. That increases data size, memory usage, and shuffle cost. map_filter avoids all of that by operating directly on the map structure.
This is one of the most underrated performance wins in Spark SQL.
4. map_contains_key: fast existence checks
Sometimes you only need to know whether a key exists. map_contains_key (available since Spark 3.4) is built exactly for that.
Example: check if CPU metric exists
```python
df_flagged = df.withColumn(
    "has_cpu",
    F.map_contains_key("metrics", F.lit("cpu"))
)
df_flagged.show(truncate=False)
```
This is far cleaner and faster than checking metrics['cpu'] IS NOT NULL.
Practical uses
- Feature presence validation
- Conditional transformations
- Data quality checks
- Routing logic in pipelines
Using map_contains_key with filtering
A very common pattern is filtering rows based on whether a key exists.
```python
df_cpu_only = df.filter(
    F.map_contains_key("metrics", F.lit("cpu"))
)
```
This avoids exceptions and safely handles sparse maps.
Performance considerations and best practices
To get the most out of these functions, keep a few rules in mind:
- Avoid UDFs whenever possible. Native map functions are almost always faster than Python UDFs.
- Do not explode unless you truly need rows. Explode is expensive and often unnecessary.
- Chain transformations thoughtfully. Spark can optimize chained expressions better than multiple intermediate columns.
- Prefer SQL expressions for complex logic. These functions integrate cleanly with when, case, and other SQL expressions.
- Keep maps reasonably sized. Maps are great for semi-structured data, but extremely large maps can still impact memory.
When are maps the right choice?
Maps are ideal when:
- Keys are dynamic or sparse
- Schema evolution is expected
- You want to preserve structure
- You need high-performance transformations
They are less ideal when you need heavy aggregation across keys, in which case a normalized table may be better.
Conclusion
High-performing PySpark pipelines are built by working with Spark’s strengths, not around them. Native map transformation functions do exactly that. They let you express rich logic in a way Spark can optimize and scale efficiently.
Map columns do not need to be flattened to be useful. When treated as structured data and transformed natively, they give you flexibility without sacrificing performance. Mastering these functions is a small investment that pays off every time your pipeline runs in production.
Drop a query if you have any questions regarding PySpark and we will get back to you quickly.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft's Global Top 100 and 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. Can these transformations be written in pure SQL instead of PySpark?
ANS: – Yes. Every function discussed also works in Spark SQL. This is useful when building views, running transformations inside Databricks SQL, or maintaining logic in .sql files.
2. Can I use these functions with nested maps?
ANS: – Yes, but only at one level at a time. If you have a map whose values are themselves maps, you can apply these functions recursively by nesting expressions. Spark does not automatically traverse nested structures, so each level must be handled explicitly.
3. Do these functions cause data shuffles?
ANS: – No. These transformations operate within each row and do not trigger shuffles on their own.
WRITTEN BY Aehteshaam Shaikh
Aehteshaam works as a SME at CloudThat, specializing in AWS, Python, SQL, and data analytics. He has built end-to-end data pipelines, interactive dashboards, and optimized cloud-based analytics solutions. Passionate about analytics, ML, generative AI, and cloud computing, he loves turning complex data into actionable insights and is always eager to learn new technologies.
March 16, 2026