{"id":13880,"date":"2022-08-09T10:11:45","date_gmt":"2022-08-09T10:11:45","guid":{"rendered":"https:\/\/blog.cloudthat.com\/?p=13880"},"modified":"2024-06-25T10:55:52","modified_gmt":"2024-06-25T10:55:52","slug":"detailed-guide-accelerate-data-analytics-engine-with-apache-spark-part-1","status":"publish","type":"blog","link":"https:\/\/www.cloudthat.com\/resources\/blog\/detailed-guide-to-accelerate-data-analytics-engine-with-apache-spark-part-1","title":{"rendered":"Detailed Guide to Accelerate Data Analytics Engine with Apache Spark \u2013 Part 1"},"content":{"rendered":"<table border=\"0\">\n<tbody>\n<tr>\n<td>\n<h2><span style=\"color: #000080;\"><strong>TABLE OF CONTENT<\/strong><\/span><\/h2>\n<\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#introduction\">1. Introducing Apache Spark<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#apachespark\">2. Apache Spark Ecosystem<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#resilient\">3. Resilient Distributed Datasets<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#gettingstarted\">4. Getting started with PySpark <\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#stepstolaunch\">5. Steps to launch your Spark Cluster and Notebook<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#transformations\">6. Transformations on RDD Abstractions<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#finalthoughts\">7. Final Thoughts<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#aboutcloudthat\">8. About CloudThat <\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2 id=\"introduction\"><strong><span style=\"color: #000080;\">Introducing Apache Spark<\/span><\/strong><\/h2>\n<p><span style=\"color: #000000;\">Apache Spark, an open-source cluster framework, is an improvement over Hadoop. 
It leverages in-memory computation to achieve high-speed data processing, whereas Hadoop (MapReduce) is slower because its I\/O operations are done in HDFS (Hadoop Distributed File System). In Hadoop, intermediate writes to HDFS are required for coordination and fault tolerance. As tasks grow and are coordinated (via MapReduce) to achieve the results, the intermediate data must be written to HDFS even when the user does not need it, creating additional costs in terms of disk usage.<\/span><\/p>\n<p><span style=\"color: #000000;\">Spark addresses these issues using an abstraction known as \u201cResilient Distributed Datasets\u201d (RDDs): data is distributed across the cluster and is stored and computed in the memory (RAM) of the nodes of that cluster, thus eliminating the intermediary disk writes. The total execution, or data flow, takes the shape of a Directed Acyclic Graph (DAG), where each execution step is a node and each transformation\/action is an edge leading to the next execution step. As a result, through lazy evaluation, we can parallelize and optimize computations, particularly iterative calculations. 
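<\/span><\/p>\n<p><span style=\"color: #000000;\">As a rough analogy for this lazy evaluation, here is a sketch using plain Python generators (an illustration only, not the Spark API): chaining transformations builds up a plan, and nothing runs until a result is actually requested.<\/span><\/p>

```python
# Lazy pipeline: nothing executes until list() forces evaluation,
# much like Spark defers all work until an action is called on an RDD.
data = range(1, 6)

squared = (x * x for x in data)             # "transformation": builds a plan, runs nothing
evens = (x for x in squared if x % 2 == 0)  # chains another step onto the plan

result = list(evens)                        # "action": the whole chain runs now
print(result)                               # [4, 16]
```

<p><span style=\"color: #000000;\">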
In the following visualization, we can see a lineage of transformations executed in stages as a DAG, which makes it easy to track all the steps and, in case of a failure, to roll back.<\/span><\/p>\n<p><a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13886\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark1.png\" alt=\"PySpark\" width=\"604\" height=\"228\" \/><\/a><\/p>\n<h2 id=\"apachespark\"><strong><span style=\"color: #000080;\">Apache Spark Ecosystem<\/span><\/strong><\/h2>\n<ul>\n<li><span style=\"color: #000000;\"><strong>Spark Core <\/strong><\/span><\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">The foundation of Spark, it provides distributed task dispatching, scheduling, and all the basic I\/O operations, abstractions, and APIs for different languages such as Java, Python, Scala, .NET, and R.<\/span><\/p>\n<ul>\n<li><span style=\"color: #000000;\"><strong>Spark SQL <\/strong><\/span><\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">It is a component built on top of the Spark core itself. It provides an abstraction known as DataFrames, a domain-specific language to work with them in Java, Scala, Python, and .NET, and full SQL language support.<\/span><\/p>\n<ul>\n<li><span style=\"color: #000000;\"><strong>Spark Streaming <\/strong><\/span><\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">Due to the fast-scheduling ability of the Spark Engine, streaming analytics is possible. 
The data is ingested in mini-batches, and RDD transformations are performed on those mini-batches.<\/span><\/p>\n<ul>\n<li><span style=\"color: #000000;\"><strong>Spark MLlib <\/strong><\/span><\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">Spark provides a machine learning library on top of its core, and thanks to the fast computation achieved through the distributed in-memory architecture, it runs much faster. It can run many ML algorithms such as logistic regression, linear regression, random forest, etc.<\/span><\/p>\n<ul>\n<li><span style=\"color: #000000;\"><strong>GraphX <\/strong><\/span><\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">It is a component built on top of the core for faster distributed graph processing. It provides APIs for implementing algorithms such as PageRank through an abstraction known as the Pregel abstraction.<\/span><\/p>\n<h2 id=\"resilient\"><strong><span style=\"color: #000080;\">Resilient Distributed Datasets<\/span><\/strong><\/h2>\n<p><span style=\"color: #000000;\">Spark offers an essential abstraction: a fault-tolerant, immutable collection of objects. The records stored in an RDD are divided into partitions for parallel computation in the Spark cluster. The objects are distributed to each node of the Spark cluster for processing. Spark loads the data from disk and keeps it in memory where it is processed, thereby achieving caching\/persistence so the data can be reused later if needed. This makes data processing faster in Spark than in Hadoop. RDDs enforce immutability: when a new transformation is applied to an RDD, Spark creates a new RDD and maintains the RDD lineage, in effect achieving fault tolerance. In this blog, we will explore this abstraction.<\/span><\/p>\n<h2 id=\"gettingstarted\"><strong><span style=\"color: #000080;\">Getting started with PySpark<\/span><\/strong><\/h2>\n<p><span style=\"color: #000000;\">PySpark, an open-source framework, is the Python API for Apache Spark. 
If you are familiar with Python, it makes Spark easy to use.<\/span><\/p>\n<p><span style=\"color: #000000;\">I will use PySpark on Databricks Community Edition, as it is free to use. We will explore some common transformations on RDDs.<\/span><\/p>\n<h2 id=\"stepstolaunch\"><strong><span style=\"color: #000080;\">Steps to launch your Spark Cluster and Notebook<\/span><\/strong><\/h2>\n<ol>\n<li><span style=\"color: #000000;\"><strong>Creating a Spark Cluster and attaching it to a Notebook<br \/>\n<\/strong>a) Log in to your Databricks Community Edition<\/span><br \/>\n<span style=\"color: #000000;\"> b) Click on <strong>Create &gt;&gt; Cluster<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13887\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark2.png\" alt=\"PySpark\" width=\"457\" height=\"346\" \/><\/a><br \/>\n<\/strong>c) You will be prompted with the following page:<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13888\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark3.png\" alt=\"PySpark\" width=\"606\" height=\"368\" \/><\/a><\/span><br \/>\n<span style=\"color: #000000;\"> Here, you need to name your cluster and provide your runtime. The Databricks Community Edition provides a community driver with 15 GB of RAM and 2 cores. 
Click on \u201cCreate Cluster\u201d and your Spark Cluster will be created in 5-10 minutes.<br \/>\nd) Click <strong>Create &gt;&gt; Notebook<\/strong> and you will be prompted with the following:<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13889\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark4.png\" alt=\"PySpark\" width=\"604\" height=\"367\" \/><\/a><\/span><br \/>\n<span style=\"color: #000000;\"> Any task run or executed in this notebook will be executed inside the cluster that we have created.<\/span><\/li>\n<li><span style=\"color: #000000;\"><strong>Importing PySpark Libraries\u00a0<\/strong>To create an RDD, we need to import the following classes: <em>SparkContext<\/em> and <em>SparkConf<\/em>. <em>SparkContext<\/em> is the entry point for using Spark transformations and functionalities. It connects you to a Spark Cluster and drives its functionality.<\/span><span style=\"color: #000000;\"><em>SparkConf<\/em> controls the configuration settings for that application (that is, for a particular SparkContext). It loads values from Spark itself and from the arguments passed to it. For example:<\/span><span style=\"color: #000000;\">a) <em>setAppName()<\/em> \u2014 sets the name of the application. 
It serves as a name for the context, as there can be many.<\/span><br \/>\n<span style=\"color: #000000;\"> b) <em>setMaster()<\/em> \u2014 sets the URL of the master node, or the local threads to use<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13890\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark5.png\" alt=\"PySpark\" width=\"604\" height=\"167\" \/><\/a><\/span><br \/>\n<span style=\"color: #000000;\"> setMaster() can create threads locally, or we can provide the URL of a master node, which will do it for us. Next, we pass the configuration to the context and are ready to create the RDDs.<\/span><span style=\"color: #000000;\"><strong>Working with the RDD abstraction<\/strong><\/span><span style=\"color: #000000;\">To create an RDD, we will use the parallelize() method of the SparkContext we created. It loads data (here, a Python list) into the cluster and distributes it across the nodes.<\/span><a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark6.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13891\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark6.png\" alt=\"PySpark\" width=\"573\" height=\"97\" \/><br \/>\n<\/a><span style=\"color: #000000;\">We can also create an RDD by reading the data from a text file using the textFile() method of the SparkContext.<\/span><a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark7.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13892\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark7.png\" alt=\"PySpark\" width=\"573\" height=\"111\" \/><\/a><span style=\"color: #000000;\">To convert the 
RDD back to a Python list, we can use the collect() method on that RDD.<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark8.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13893\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark8.png\" alt=\"PySpark\" width=\"604\" height=\"188\" \/><\/a><br \/>\n<\/span><\/li>\n<\/ol>\n<h2 id=\"transformations\"><span style=\"color: #000080;\"><strong>Transformations on RDD Abstractions<\/strong><\/span><\/h2>\n<p><span style=\"color: #000000;\">RDD transformations are a way to derive one RDD from another. Because of the immutable nature of RDDs, a transformation always creates a new RDD without updating the existing one, and hence a lineage is created. They are also \u201clazy\u201d, meaning that no transformation gets executed until an action is called on the RDD itself.<\/span><\/p>\n<p><span style=\"color: #000000;\">Spark provides a number of transformations. 
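<\/span><\/p>\n<p><span style=\"color: #000000;\">The semantics of the most common transformations can be sketched with plain Python built-ins (an analogy for illustration only, not the Spark API; the actual PySpark calls appear in the screenshots that follow, and the rdd names in the comments are hypothetical):<\/span><\/p>

```python
# Plain-Python equivalents of common RDD transformations (illustration only).
nums = [1, 2, 2, 3, 4]
words = [["hello", "world"], ["spark"]]

mapped = [x * 10 for x in nums]             # rdd.map(lambda x: x * 10)
flat = [w for row in words for w in row]    # rdd.flatMap(lambda row: row)
filtered = [x for x in nums if x % 2 == 0]  # rdd.filter(lambda x: x % 2 == 0)
union = nums + [4, 5]                       # rdd1.union(rdd2) keeps duplicates
inter = sorted(set(nums) & {2, 3, 9})       # rdd1.intersection(rdd2)
uniq = sorted(set(nums))                    # rdd.distinct()

print(mapped)    # [10, 20, 20, 30, 40]
print(flat)      # ['hello', 'world', 'spark']
print(filtered)  # [2, 2, 4]
print(union)     # [1, 2, 2, 3, 4, 4, 5]
print(inter)     # [2, 3]
print(uniq)      # [1, 2, 3, 4]
```

<p><span style=\"color: #000000;\">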
For example:<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong><em>Map()<\/em><\/strong> \u2013 applies a function to each element of the dataset and returns a new dataset of the results<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark9.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13894\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark9.png\" alt=\"PySpark\" width=\"604\" height=\"225\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong><em>FlatMap()<\/em><\/strong> \u2013 applies a function to each element and flattens the results, so the elements of any nested arrays end up in one long row<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark10.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13895\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark10.png\" alt=\"PySpark\" width=\"604\" height=\"199\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong><em>Filter()<\/em><\/strong> \u2013 returns a new RDD containing only the elements that satisfy the given filter function<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark11.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13896\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark11.png\" alt=\"PySpark\" width=\"604\" height=\"243\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong><em>Union()<\/em><\/strong> \u2013 similar to the set operation, it combines two datasets together<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark13.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13897\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark13.png\" alt=\"PySpark\" width=\"604\" height=\"235\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong><em>Intersection()<\/em><\/strong> \u2013 like the set operation, it returns a dataset containing the elements common to both<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark14.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13898\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark14.png\" alt=\"PySpark\" width=\"604\" height=\"258\" \/><\/a><br \/>\n<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong><em>distinct()<\/em><\/strong> \u2013 removes any duplicated elements from the dataset<br \/>\n<a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark15.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13899\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/pyspark15.png\" alt=\"PySpark\" width=\"604\" height=\"258\" \/><\/a><br \/>\n<\/span><\/p>\n<h2 id=\"finalthoughts\"><strong><span style=\"color: #000080;\">Final Thoughts<\/span><\/strong><\/h2>\n<p><span style=\"color: #000000;\">Spark is an excellent enhancement over previous data analytics engines for large-scale data processing. It delivers speed and fault tolerance, and its services can be used from Python, Scala, and Java. Spark provides streaming services for streaming analytics, with built-in support for Kafka, Flume, and Kinesis. It includes machine learning libraries on top of its core, and since it runs on distributed in-memory computing, machine learning algorithms execute much faster. 
Hence, it is an excellent fit for data streaming, analytics, and machine learning on large-scale datasets.<\/span><\/p>\n<h2 id=\"aboutcloudthat\"><strong><span style=\"color: #000080;\">About CloudThat<\/span><\/strong><\/h2>\n<p id=\"About CloudThat\"><span style=\"color: #000000;\"><a href=\"https:\/\/www.cloudthat.com\/\" target=\"_blank\" rel=\"noopener\"><strong>CloudThat<\/strong><\/a>\u00a0is\u00a0the official AWS (Amazon Web Services) Advanced Consulting Partner, Microsoft Gold Partner, Google Cloud Partner, and Training Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build\u00a0a robust\u00a0cloud computing ecosystem by disseminating\u00a0knowledge on technological intricacies within the cloud space. Explore our\u00a0<strong><a href=\"https:\/\/www.cloudthat.com\/consulting\/\" target=\"_blank\" rel=\"noopener\">consulting here<\/a>.<\/strong><\/span><\/p>\n<p><span style=\"color: #000000;\">If you have any queries regarding Apache Spark, PySpark, or Python programming, drop a line in the comments section below. 
I will get back to you at the earliest.\u00a0<\/span><\/p>\n","protected":false},"author":318,"featured_media":13933,"parent":0,"comment_status":"open","ping_status":"open","template":"","blog_category":[3607,3640],"user_email":"mohmmad.a@cloudthat.com","published_by":"324","primary-authors":"","secondary-authors":"","acf":[],"_links":{"self":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog\/13880"}],"collection":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/users\/318"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/comments?post=13880"}],"version-history":[{"count":1,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog\/13880\/revisions"}],"predecessor-version":[{"id":45623,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog\/13880\/revisions\/45623"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/"}],"wp:attachment":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/media?parent=13880"}],"wp:term":[{"taxonomy":"blog_category","embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog_category?post=13880"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}