In “Introduction to Spark – Part 1”, we discussed the fundamentals of Spark and its basic abstraction, Resilient Distributed Datasets (RDDs), along with their transformations. In this blog, we will look at DataFrames, a data structure provided by Spark for processing structured data. A DataFrame holds organized information about the data, such as its features, and the data itself is distributed in the form of a table.
What is a DataFrame?
A Spark DataFrame is a collection of data organized into rows and named columns, where each column can hold feature data. DataFrames are akin to database tables or Pandas DataFrames but come with richer optimizations, making them well suited to large-scale datasets for machine learning applications and algorithms. Fundamentally, a DataFrame is a Dataset organized into named columns. Spark DataFrames can be created from various sources such as log tables, databases, Hive tables, or RDDs. Spark provides multiple ways to interact with DataFrames, such as Spark SQL, plain SQL, or the Dataset API. The DataFrame API is available in Java, Scala, Python, and R. Each DataFrame is structured by a schema, where each column has a name, a data type, and a nullable property (which can be set to true or false).
Spark abstractions run across multiple nodes in a cluster to provide faster data processing: Spark DataFrames are distributed across the nodes and are optimized with Catalyst. The Catalyst optimizer takes in, for example, SQL queries, starts parallel computation jobs, and converts them into RDD abstractions.
It works in the following steps:
- The DataFrames are built through certain operations, and these operations need to be compiled by the Catalyst optimizer into a “physical plan” for execution. To apply optimizations, Catalyst tries to understand the operations that need to be performed on the data and then makes smart selections to speed up the computation.
- Catalyst applies logical optimizations like predicate pushdown. The optimizer pushes filter predicates down into the data source, so the physical execution can skip irrelevant data. In the case of Parquet files, entire blocks are often skipped, and comparisons on strings are often turned into cheaper integer comparisons through dictionary encoding. In the case of relational databases, predicates are pushed down into the external database itself, reducing the traffic between Spark and the database.
- Catalyst then compiles the operations and generates JVM bytecode for execution. This bytecode is often optimized; for example, Catalyst can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform various lower-level optimizations, such as removing expensive object allocations and reducing the number of function calls. The generated JVM bytecode delivers the same performance for Python and Java/Scala users.
Features of Spark DataFrames
Multiple Data Formats – DataFrames support CSV, JSON, Hive, Cassandra, and Parquet.
Programming API support – DataFrames support APIs for Python, R, Java, and Scala.
Optimizations – Spark provides customized memory management to reduce overhead and improve performance, and the Catalyst optimizer makes data processing efficient across the supported languages. This offers a huge performance improvement over RDDs.
Schema support – Spark DataFrames are defined around schemas that structure the data, providing robustness and enabling the use of SQL on DataFrames.
Exploring the Basics of DataFrames in PySpark
We shall be exploring Spark DataFrames using PySpark. To create a DataFrame, we require the following classes from PySpark:
Then we shall create SparkSession(using its builder method) to build a Spark object that will help us create DataFrames:
Here we have created a new instance of SparkSession called “DataFrame”
As we saw above in the blog, Spark creates a schema for a DataFrame, storing the column name, its data type, and its nullable property. We can infer a schema or provide a schema for a Dataframe:
Here we are reading a CSV file and telling Spark to automatically infer the schema from the CSV file (provided the CSV has headers). It will make a reasonable inference about the columns and their associated data types. We can also supply the schema manually:
In this case, we would need to import StructType, StructField, and other data types from “pyspark.sql.types”:
The DataFrame looks something like this:
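The original screenshot of the output is not reproduced here; for a hypothetical three-row dataset with name, age, and city columns, df.show() would print a table along these lines:

```
+-----+---+------+
| name|age|  city|
+-----+---+------+
|Alice| 34| Delhi|
|  Bob| 45|Mumbai|
|Cathy| 29|  Pune|
+-----+---+------+
```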
There are two ways to manipulate data frames:
- Domain-specific language such as Python
PySpark provides built-in methods for selecting and sorting data. For example, we can combine the filter method with the sort method.
- SQL syntax
We can run SQL queries on a DataFrame by creating a view with “createOrReplaceTempView” and then using Spark’s sql method on that view:
As we can see, the manipulation semantics are similar; they differ only in syntax and domain.
Spark DataFrames are powerful data structures that are highly optimized and can be manipulated through both a domain-specific language and SQL. They are similar to Pandas DataFrames and database tables, but unlike those, they utilize the distributed computation and optimizations provided by Spark, making them a good fit for large-scale datasets in machine learning applications.
CloudThat is also an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Drop a query if you have any questions regarding Spark DataFrame and I will get back to you quickly.
1. What is the difference between Pandas DataFrames and Spark DataFrame?
ANS: – Pandas DataFrames and Spark DataFrames are similar in their visual representation as tables, but they differ in how they are stored and implemented and in the methods they support. The key difference is parallelization: Pandas runs on a single machine, while Spark DataFrames are distributed across multiple nodes/machines (a cluster), which makes Spark much faster on large datasets.
2. What is the language support for Spark?
ANS: – Spark can be used or leveraged with Python, Scala, Java, and R, whereas the Pandas library is strictly made for Python. This also affects the learning curve if you are only accustomed to one programming language.
3. When should we use Spark?
ANS: – Ideally, if you have a large dataset with many features that needs to be processed, Spark is the right choice. If the dataset is smaller, Spark's additional overhead can make processing slower, and in that case one should use Pandas. It should be noted that Spark is mostly used for processing structured and semi-structured datasets.
WRITTEN BY Mohmmad Shahnawaz Ahangar
Shahnawaz is a Research Associate at CloudThat. He is certified as a Microsoft Azure Administrator. He has experience working on Data Analytics, Machine Learning, and AI project migrations on the cloud for clients from various industry domains. He is interested in learning new technologies and writing blogs on advanced tech topics.