In “Introduction to Spark – Part 1”, we discussed the fundamentals of Spark and its basic abstraction, Resilient Distributed Datasets (RDDs), along with their transformations. In this blog, we will look at DataFrames, a data structure provided by Spark for processing structured data. A DataFrame holds organized information about the data, such as its features, and the data itself is distributed in the form of a table.
What is a DataFrame?
A Spark DataFrame is a collection of data organized into rows and named columns, where each column can hold feature data. DataFrames are akin to database tables or Pandas DataFrames but come with richer optimizations, making them well suited to large-scale datasets for machine learning applications and algorithms. Fundamentally, a DataFrame is a Dataset organized into named columns. Spark DataFrames can be created from various sources such as log tables, databases, Hive tables, or RDDs. Spark provides multiple ways to interact with DataFrames, such as Spark SQL, plain SQL, or the Dataset API. The DataFrame API is available in Java, Scala, Python, and R. Each DataFrame is structured by a schema, where each column has a name, a data type, and a nullable property (which can be set to true or false).
Spark abstractions run across multiple nodes in a cluster to provide faster data processing: Spark DataFrames are distributed across the nodes and are optimized with Catalyst. The Catalyst optimizer takes in, for example, SQL queries, starts parallel computation jobs, and converts them into RDD abstractions.
It works in the following steps:
- The DataFrames are built through certain operations, and these operations need to be compiled by the Catalyst optimizer into a “physical plan” for execution. To apply optimizations, Catalyst tries to understand the operations that need to be performed on the data and then makes smart selections to speed up the computation.
- Catalyst applies logical optimizations like predicate pushdown. The optimizer pushes filter predicates down into the data source, so the physical execution can skip irrelevant data. In the case of Parquet files, entire blocks are often skipped, and comparisons on strings are often turned into cheaper integer comparisons through dictionary encoding. In the case of relational databases, predicates are pushed down into the external database itself, reducing the traffic between Spark and the database.
- Catalyst then compiles the operations and generates JVM bytecode for execution. This bytecode is often optimized; for example, Catalyst can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform various lower-level optimizations, such as removing expensive object allocations and reducing the number of function calls. The generated JVM bytecode delivers the same performance for Python and Java/Scala users.
Features of Spark DataFrames
Multiple Data Formats – DataFrames support CSV, JSON, Hive, Cassandra, and Parquet.
Programming API support – DataFrames support APIs for Python, R, Java, and Scala.
Optimizations – Spark provides customized memory management to reduce overhead and improve performance, and the Catalyst optimizer makes data processing efficient across the supported languages. This offers a huge performance improvement over RDDs.
Schema support – Spark DataFrames are defined around schemas that structure the data, providing robustness and enabling the use of SQL on DataFrames.
Exploring the Basics of DataFrames in PySpark
We shall be exploring Spark DataFrames using PySpark. To create a DataFrame, we require the following classes from PySpark:
Then we shall create SparkSession(using its builder method) to build a Spark object that will help us create DataFrames:
Here we have created a new instance of SparkSession called “DataFrame”
As we saw above in the blog, Spark creates a schema for a DataFrame, storing the column name, its data type, and its nullable property. We can infer a schema or provide a schema for a Dataframe:
Here we are reading a CSV file and telling Spark to automatically infer the schema from the CSV file (provided the CSV has headers). It will make a reasonable inference about the columns and their associated data types. We can also supply the schema manually:
In this case, we would need to import StructType, StructField, and other data types from “pyspark.sql.types”:
The DataFrame looks something like this:
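The original screenshot of the output is not reproduced here; for a hypothetical three-row dataset with name, age, and city columns, df.show() would print a table along these lines:

```
+-----+---+------+
| name|age|  city|
+-----+---+------+
|Alice| 34| Delhi|
|  Bob| 45|Mumbai|
|Cathy| 29|  Pune|
+-----+---+------+
```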
There are two ways to manipulate data frames:
- Domain-specific language such as Python
PySpark provides built-in methods for selecting and sorting data. For example, we can combine the filter method with the sort method.
- SQL syntax
We can run SQL queries on a DataFrame by creating a view with “createOrReplaceTempView” and then using Spark’s sql method on that view:
As we can see, the manipulation semantics are similar; they differ only in syntax and domain.
Spark DataFrames are powerful data structures that are highly optimized and can be manipulated through both a domain-specific language and SQL. They are similar to Pandas DataFrames and database tables, but unlike those, they utilize the distributed computation and optimizations provided by Spark, making them a good fit for large-scale datasets in machine learning applications.
CloudThat is also an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Drop a query if you have any questions regarding Spark DataFrame and I will get back to you quickly.
1. What is the difference between Pandas DataFrames and Spark DataFrame?
ANS: – Pandas DataFrames and Spark DataFrames are similar in their visual representation as tables, but they differ in how they are stored and implemented and in the methods they support. The key difference is parallelization: Pandas runs on a single machine, while Spark DataFrames are distributed across multiple nodes/machines (a cluster), which makes Spark much faster on large datasets.
2. What is the language support for Spark?
ANS: – Spark can be used or leveraged with Python, Scala, Java, and R, whereas the Pandas library is strictly made for Python. This also affects the learning curve if you are only accustomed to one programming language.
3. When should we use Spark?
ANS: – Ideally, if you have a large dataset with many features that needs to be processed, Spark is the right choice. If the dataset is smaller, Spark's additional overhead can make processing slower, and in that case one should use Pandas. It should be noted that Spark is mostly used for processing structured and semi-structured datasets.
WRITTEN BY Mohmmad Shahnawaz Ahangar
Shahnawaz is a Research Associate at CloudThat. He is certified as a Microsoft Azure Administrator. He has experience working on Data Analytics, Machine Learning, and AI project migrations on the cloud for clients from various industry domains. He is interested in learning new technologies and writing blogs on advanced tech topics.