Detailed Guide to Accelerate Data Analytics Engine with Apache Spark

Introducing Apache Spark

Apache Spark, an open-source cluster framework, is an improvement over Hadoop. It leverages the use of in-memory computations to achieve high-speed data processing compared to Hadoop (MapReduce), which is slower as the I/O operations are done in HDFS (Hadoop Distributed File System). In Hadoop, the I/O operations in HDFS are necessarily required for coordination and fault tolerance, which are intermediate writes to HDFS. As the task grow and are coordinated (Via MapReduce) to achieve the results, the intermediate data must be written to the HDFS, which may not be required by the user and hence creating additional costs in terms of disk usage.

Spark addresses these issues using an abstraction known as “Resilient Distributed datasets” (RDDs) — data is distributed in the cluster and is stored and computed in the memory (RAM) of the nodes of that cluster, thus eliminating the intermediary disk writes. The total execution or the data flow takes the shape of a Directed Acyclic Graph (DAG), where each execution step represents a node and the edge of its transformation/action, resulting in the next execution step. As a result, through lazy evaluation, we can parallelize and optimize computations, particularly iterative calculations. In the following visualization, we can see a lineage or step of transformation done in stages using DAG, which makes it easier to track all the steps, and in case of a failure, we can roll back.

pyspark1

Freedom Month Sale — Upgrade Your Skills, Save Big!

Up to 80% OFF AWS Courses
Up to 30% OFF Microsoft Certs

Act Fast!

Apache Spark Ecosystem

Spark Core

Foundation of the Spark, it provides distributed task dispatching, scheduling, and all the basic I/O operations, abstractions, and API for different languages such as Java, Python, Scala, .NET, and R.

Spark SQL

It is a component built on top of the Spark core itself. An abstraction known as DataFrames is provided for the same to run Domain-specific language in Java, Scala, Python, and . NET. It provides full SQL language support.

Spark Streaming

Due to the fast-scheduling ability of the Spark Engine, streaming analytics is possible. The data is ingested in mini-batches, and RDD transformations are performed on those mini-batches.

Spark MLlib

Spark provides a machine learning library on top of its core and due to its fast computation processing, achieved through distributed in-memory architecture, it is much faster. It can run many ML algorithms such as logistic regression, linear regression, Random Forest, etc.

GraphX

It is a component built on top of the core for faster-distributed graph processing. It provides two APIs for implementing algorithms such as PageRank through an abstraction known as Pregel abstraction.

Resilient Distributed Datasets

Spark offers an essential abstraction, a collection of fault-tolerant and immutable objects. The records stored in an RDD are divided into partitions for parallel computations in the Spark cluster. The objects are distributed to each node of the Spark cluster for processing. Spark loads the data from a disk and keeps it in memory where it is processed, thereby achieving caching/persistence only to be reused later if needed. This makes data processing faster in Spark than in Hadoop. Spark RDD performs immutability in that when new transformations are applied on that RDD, Spark creates a new RDD and maintains the RDD Lineage and, in-effect achieving fault tolerance. In this blog, we will explore this abstraction.

Getting started with PySpark

PySpark, an open-source framework, is the python API for Apache Spark. If you are familiar with python, it makes Spark easy to use.

I will use PySpark Databricks Community Edition as it is free to use. We will explore all some common transformations on RDDs.

Steps to launch your Spark Cluster and Notebook

Creating a Spark Cluster and attaching it to a Notebook
a) Login to your Databricks Community Edition
b) Click on Create >> Cluster

pyspark2

c) You will be prompted with the following page:

pyspark3

Here, you need to name your cluster and provide your runtime. Databricks Community editions provide a community driver which provides 15 GBs of Ram and 2 Cores. Click on “Create Cluster” and your Spark Cluster will be created in 5-10 mins.
d) Click create >> notebook and you will be prompted with the following:

pyspark4

You can see that any task run or executed on this notebook will be executed inside the cluster that we have created

2. Importing PySpark Libraries To create an RDD, we need to import the following classes: SparkContext and SparkConfSparkContext is like entry point to use Spark transformations or functionalities. It connects you to a Spark Cluster and drives its functionality.SparkConf controls the configuration settings for that application (that is, for a particular Spark Context). It will load the values from Spark itself and the arguments passed to it. For example:a) setAppName() — name of the application. Serves as a name for the context as there can be many.
b) setMaster() — sets the URL of the master node or threads

pyspark5

setMaster() will create a thread locally — or we can provide a URL to the master node which can do it for us. Next, we pass the configuration to the context and are ready to create the RDDs.Working with RDD abstractionTo create an RDD, we will use parallelize() method of the Spark Context we created. It will load data — a python list — into the cluster and distribute it on the nodes.

pyspark6

We can even create RDD by reading the data from a text file using the textFile() method of the SparkContext.

pyspark7

To convert the RDD back to a python list, we can use the collect() method on that RDD

pyspark8

Transformations on RDD Abstractions

RDDs transformations are a way to transform or update one RDD to another. Because of the immutable nature of the RDDs, transformations applied always create a new RDD without updating an existing one, and hence a lineage is created. They are also “Lazy evaluations” meaning that no transformations get executed until an action is called on RDD itself.

There are a couple of transformations that Spark provides. For example:

Map() – applies a function on the dataset and returns the transformation of the same dataset

pyspark9

FlatMap() – if the dataset contains an array, it will convert each element in that array as one, a long row

pyspark10

Filter() – returns a new RDD after applying a function to filter on a dataset

pyspark11

Union() – Similar to set operation, it combines two datasets together

pyspark13

Intersection — Like set operation, it returns a dataset with similar elements

pyspark14

distinct() – it removes any duplicated elements in the dataset

pyspark15

Final Thoughts

Spark is an excellent enhancement over previous data analytics engines for large-scale data processing. It guarantees speed and fault tolerance, and its services can be used with Python, Scala, and Java. Spark provides both streaming services for streaming analytics and has built-in support for Kafka, Flume, and Kinesis. It includes Machine Learning libraries on top of its core, and since it runs on distributed computing, making the machine learning algorithm faster. Hence, it perfectly uses data streaming, analytics, and machine learning for large-scale datasets.

Freedom Month Sale — Discounts That Set You Free!

Up to 80% OFF AWS Courses
Up to 30% OFF Microsoft Certs

Act Fast!

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

WRITTEN BY Mohmmad Shahnawaz Ahangar

Shahnawaz is a Research Associate at CloudThat. He is certified as a Microsoft Azure Administrator. He has experience working on Data Analytics, Machine Learning, and AI project migrations on the cloud for clients from various industry domains. He is interested to learn new technologies and write blogs on advanced tech topics.