Handling Unexpected Data Issues in Spark Ingestion

Introduction

When working with Apache Spark, one of the most frustrating scenarios for a data engineer is encountering corrupt records during data ingestion. They silently creep into your pipelines, break transformations, or slip into downstream systems without anyone noticing.

Corrupt records are like odd puzzle pieces that don’t quite fit: something is off, but it takes time to figure out where. If you’ve ever faced weird parsing errors, unexpected nulls, or misaligned columns while reading CSV, JSON, or other semi-structured formats in Spark, chances are you’ve already met them.

Spark Corrupt Records

When reading data, a corrupt record is any row that Spark cannot map into your defined schema.
This often happens when the row structure deviates from the expected format: missing commas, broken quotes, incomplete JSON, or invalid data types.

Example: Expected CSV
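
As an illustration, assume a hypothetical three-column file with id, name, and age:

id,name,age
1,Alice,29
2,Bob,34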

Corrupt CSV Example
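
A corrupt version of the same hypothetical file might look like this:

id,name,age
3,"Charlie,41
4,Dave,abc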

  • First row: Unclosed quote
  • Second row: String “abc” instead of an integer

Depending on your configuration, Spark may skip these rows, fail immediately, or mark them for inspection.

Why Do Spark Corrupt Records Happen?

From real-world experience, corrupt records often appear due to:

  • Human entry errors (typos, missing values, mismatched quotes)
  • Encoding mismatches (UTF-8 vs ISO-8859-1)
  • Schema drift, where columns are added or removed without updating the schema in Spark
  • Partial writes from failed ETL jobs
  • Upstream bugs in APIs generating malformed JSON or CSV

If left unchecked, these can cause incorrect analytics, failed pipelines, and unreliable dashboards.

Catching Corrupt Records in Spark

Using Read Modes

When loading data, Spark offers three read modes:

  • PERMISSIVE (default) – Reads valid fields, stores corrupted ones in a special column
  • DROPMALFORMED – Silently removes problematic rows
  • FAILFAST – Stops execution immediately on encountering a corrupted record

Example with CSV:
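
A minimal PySpark sketch; the schema, column names, and file path are illustrative placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("corrupt-records-demo").getOrCreate()

# Explicit schema, including a column to capture the raw text of corrupt rows
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("s3://my-bucket/raw/data.csv"))  # placeholder path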

This adds a _corrupt_record column containing the raw data for any problematic row.

JSON files work similarly:
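
A comparable sketch for JSON (the path is again a placeholder):

df_json = (spark.read
           .option("mode", "PERMISSIVE")
           .option("columnNameOfCorruptRecord", "_corrupt_record")
           .json("s3://my-bucket/raw/data.json"))  # placeholder path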

Malformed lines will appear in _corrupt_record, making them easy to isolate.

Separating Corrupted and Non-Corrupted Data
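
One way to split the two, reusing the DataFrame from the CSV sketch above (a cache() is included because some Spark versions disallow queries that reference only the corrupt-record column on an uncached DataFrame):

from pyspark.sql.functions import col

df.cache()  # materialize so _corrupt_record can be filtered reliably

corrupt_df = df.filter(col("_corrupt_record").isNotNull())
clean_df = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")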

You now have two cleanly separated datasets: one for downstream use and one for debugging.

Logging Corrupt Records for Investigation

Simply dropping corrupted rows is risky: you might lose valuable insights into upstream issues.
Instead, log and store them.
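
For example, quarantined rows could be written out for later inspection (the bucket path and the ingest_date partition column are hypothetical):

from pyspark.sql.functions import current_date

(corrupt_df
 .withColumn("ingest_date", current_date())
 .write
 .mode("append")
 .partitionBy("ingest_date")
 .json("s3://my-bucket/quarantine/corrupt_records/"))  # placeholder path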

If you use Kafka, Kinesis, or ELK, send corrupted records there for alerts.

Fixing Corrupt Records

Adjusting the Schema
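
One approach, sketched against the hypothetical id/name/age layout, is to declare troublesome fields as strings and convert them later:

relaxed_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),  # read as string first, cast later
])

df_relaxed = (spark.read
              .option("header", "true")
              .schema(relaxed_schema)
              .csv("s3://my-bucket/raw/data.csv"))  # placeholder path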

This lets you load data without type mismatches.

Replacing Invalid Values
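
Continuing the sketch, a plain cast converts unparseable values to NULL (with ANSI mode enabled you would reach for try_cast instead):

df_fixed = df_relaxed.withColumn("age", col("age").cast("int"))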

Invalid integers become NULL instead of breaking the pipeline.

Using corruptedRecordsPath

Spark can automatically save corrupted rows to a location:
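
A minimal sketch of this pattern; note that this capability is exposed as the badRecordsPath option on Databricks, so the exact option name may differ in your environment (paths are placeholders):

strict_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df_checked = (spark.read
              .option("header", "true")
              .option("badRecordsPath", "s3://my-bucket/bad_records/")  # Databricks spelling; placeholder path
              .schema(strict_schema)
              .csv("s3://my-bucket/raw/data.csv"))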

This is a production-friendly safety net.

Best Practices to Prevent Corrupt Records

  • Validate data before ingestion
  • Monitor trends in _corrupt_record counts
  • Keep schema definitions version-controlled
  • Test ingestion jobs with small sample files
  • Coordinate schema changes with upstream teams

Real-World Workflow Example

A corrupt record handling process might look like this (a consolidated sketch follows the list):

  1. Ingest data using PERMISSIVE mode with _corrupt_record and corruptedRecordsPath
  2. Split good and corrupted data into separate DataFrames
  3. Store corrupted data in Amazon S3 with partitioning
  4. Send alerts if corrupted record counts exceed thresholds
  5. Repair and reprocess when possible
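
Putting the steps together, a minimal end-to-end sketch using the same hypothetical schema, paths, and threshold (the alerting hook is a placeholder):

from pyspark.sql.functions import col, current_date

# 1. Ingest in PERMISSIVE mode, capturing raw corrupt rows
df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)               # schema that includes _corrupt_record
      .csv("s3://my-bucket/raw/"))  # placeholder path
df.cache()

# 2. Split good and corrupted data
clean_df = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
corrupt_df = df.filter(col("_corrupt_record").isNotNull())

# 3. Store corrupted data in Amazon S3 with partitioning
(corrupt_df.withColumn("ingest_date", current_date())
 .write.mode("append").partitionBy("ingest_date")
 .json("s3://my-bucket/quarantine/"))  # placeholder path

# 4. Alert if corrupted record counts exceed a threshold
if corrupt_df.count() > 100:  # hypothetical threshold
    print("ALERT: corrupt record count exceeded threshold")  # swap in SNS, Slack, etc.

# 5. Continue with the clean data; repair and reprocess quarantined rows when possible
clean_df.write.mode("append").parquet("s3://my-bucket/curated/")  # placeholder path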

Conclusion

Corrupt records in Apache Spark are inevitable in messy, real-world datasets. The difference between a fragile and resilient pipeline lies in how you catch, log, and fix them.

You can prevent corrupted data from silently creeping into your downstream systems by using read modes like PERMISSIVE, tracking _corrupt_record, and maintaining a structured logging and repair process.

With these practices in place, you’ll spend less time firefighting and more time delivering reliable analytics.

Drop a query if you have any questions regarding Apache Spark, and we will get back to you quickly.

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. How do I know if my Spark job fails because of corrupt records or other issues?

ANS: – Check the error messages in your Spark logs. Errors caused by corrupt records usually point to parsing issues (e.g., malformed JSON, unexpected token, type mismatch). If the errors are more about memory, shuffles, or missing files, it’s likely unrelated to corrupt records.

2. What’s the performance impact of enabling _corrupt_record?

ANS: – Minimal for small datasets, but maintaining and writing the extra column can increase I/O in high-volume streaming or batch pipelines. If performance is a concern, consider logging only corrupted records separately instead of keeping the column in every DataFrame.

3. What happens if my data schema evolves frequently?

ANS: – Frequent schema changes increase the chance of corrupt records. Use schema-on-read approaches (like Delta Lake with evolving schemas) or maintain schema registries. This prevents Spark from interpreting legitimate new fields as corruption.

WRITTEN BY Aehteshaam Shaikh

Aehteshaam works as an SME at CloudThat, specializing in AWS, Python, SQL, and data analytics. He has built end-to-end data pipelines, interactive dashboards, and optimized cloud-based analytics solutions. Passionate about analytics, ML, generative AI, and cloud computing, he loves turning complex data into actionable insights and is always eager to learn new technologies.
