Overview
In the dynamic landscape of data processing and analysis, Directed Acyclic Graphs (DAGs) stand as a cornerstone of efficiency and clarity. This blog takes you through the world of DAGs, focusing on their role within PySpark, a leading framework for big data processing. Discover how PySpark and DAGs intertwine to provide scalable, fault-tolerant, and versatile solutions for your data-driven challenges.
Introduction to DAG
A DAG (Directed Acyclic Graph) organizes a Spark job into stages of tasks. This implies that each subsequent stage relies on the completion of the preceding one, while tasks within a stage can run independently of one another.
The structure comprises vertices and edges: the vertices represent RDDs, and the edges represent the operations to be applied to those RDDs. Within a Spark DAG, every edge points from an earlier RDD to a later one. When an Action is invoked, the generated DAG is handed over to the DAG Scheduler, which splits the graph into stages of tasks.
Lazy Evaluation
Lazy Evaluation in Spark is simply a strategy that delays evaluating an expression or a series of transformations until an action is called. In simpler words, transformations do not touch the data in any way until an action is called.
Transformations are the instructions we use to reshape the data into the form we need for our purpose. We can chain any number of them, such as map, filter, and join, until the data takes the shape we want.
An action is a statement that asks for a value in return, which must be computed at that very instant. Some examples of actions are show, count, collect, and save.
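To make the distinction concrete, below is a minimal PySpark sketch of lazy evaluation. The application name, sample numbers, and variable names are illustrative assumptions rather than code from this post; the point is only that the filter and map transformations are merely recorded until the count action triggers the actual computation.

from pyspark.sql import SparkSession

# Illustrative session; an existing SparkSession or the PySpark shell works just as well.
spark = SparkSession.builder.appName("lazy-evaluation-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1000001))

# Transformations: Spark only records these in the DAG, nothing runs yet.
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Action: this is the moment Spark builds a plan and actually executes the job.
print(squares.count())  # 500000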
Here are some benefits of Lazy Evaluation:
- Efficient Execution: Lazy evaluation postpones the actual computation until an action is called. This enables Spark to optimize and combine transformations, avoiding unnecessary computations and reducing overhead.
- Optimization: Spark’s Catalyst optimizer can analyze the sequence of transformations and optimize the execution plan based on available information.
- Reduced Data Movement: Lazy evaluation allows Spark to eliminate or minimize intermediate data movements.
- Resource Management: By delaying execution, Spark can better manage resources.
Resilient Distributed Dataset
RDD stands for Resilient Distributed Dataset, a fundamental data structure in Apache Spark, a popular distributed computing framework.
RDDs are the main logical data units in Spark. They are a distributed collection of objects stored in memory or on the disks of different machines in a cluster. A single RDD can be divided into multiple logical partitions, so that these partitions can be stored and processed on different machines.
RDDs are immutable (read-only) in nature. You cannot modify an existing RDD; instead, you create new RDDs by applying coarse-grained operations, such as transformations, to it.
Now, we will discuss some of the key characteristics of RDDs:
- Resilient: RDDs are fault-tolerant, meaning they can recover from node failures.
- Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel processing.
- Immutable: RDDs are read-only and cannot be changed after creation. Any transformation applied to an RDD results in a new RDD.
- In-Memory Processing: RDDs are kept in memory wherever possible, enabling faster data processing than traditional disk-based processing.
There are multiple ways to create RDDs. They can be created by loading data from an external source or by applying transformations such as map, filter, and join to existing RDDs.
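As a rough sketch of these creation paths, the snippet below builds one RDD from an in-memory list, one (commented out) from an external file whose path is purely a hypothetical placeholder, and one by transforming an existing RDD. The variable names are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation-demo").getOrCreate()
sc = spark.sparkContext

# 1. From an in-memory collection
words = sc.parallelize(["spark", "hadoop", "pyspark"])

# 2. From external data (the path below is a hypothetical placeholder)
# lines = sc.textFile("data/sample.txt")

# 3. By transforming an existing RDD into a new, immutable RDD
upper_words = words.map(lambda w: w.upper())

print(upper_words.collect())  # ['SPARK', 'HADOOP', 'PYSPARK']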
Now that we have an intuition of the RDD and the concept of Lazy evaluation in Spark, we can return to the DAG and understand how it works.
Working of DAG
While executing various transformations, Spark retains them within a Directed Acyclic Graph (DAG). You can visualize this DAG in the Spark UI interface.
Within Spark, the responsibility of the DAG Scheduler is to convert a series of RDD transformations and actions into a directed acyclic graph (DAG) containing stages and tasks. This DAG can then run concurrently on a cluster of machines. The DAG Scheduler is a pivotal element of the Spark execution engine, influencing the efficiency of Spark jobs.
There are two types of transformations: narrow and wide. A new stage begins whenever a wide transformation is applied to the data. The Spark UI lets us take a deep-dive view of all the stages, and within each stage, we can see all the RDDs belonging to that stage.
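The sketch below, with assumed variable names, contrasts the two kinds of transformations: map is a narrow transformation that works partition by partition, while reduceByKey is a wide transformation that forces a shuffle and therefore a new stage. The RDD method toDebugString() prints the lineage, where the indentation marks the shuffle boundary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-boundary-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "spark", "pyspark"])

# Narrow transformation: each output partition depends on a single input partition.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: data is shuffled across partitions, so the
# DAG Scheduler begins a new stage here.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Print the lineage; the indented block corresponds to the stage before the shuffle.
print(counts.toDebugString())
print(counts.collect())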
Once this logical plan is created, the DAG Optimizer comes into play. It refines the plan so that it can be executed as efficiently as possible.
It detects consecutive operations or transformations that can be merged into a single task, thereby decreasing the need for data shuffling and enhancing performance.
It examines task dependencies and spots opportunities for pipelining. This entails running subsequent tasks as soon as their input data is ready, without waiting for preceding tasks to finish. Reducing inter-stage latency in this way enhances data processing speed and overall throughput.
It identifies self-contained stages in a job that can run simultaneously. Stages without data interdependencies can be executed concurrently, making optimal use of the cluster’s resources. It employs multiple methods to enhance the efficiency of data shuffling, a frequently resource-intensive operation in distributed data processing.
It considers data locality when arranging tasks, striving to allocate tasks on nodes where the data is already available or being processed. This strategy minimizes data transfer across the network, reducing network strain and enhancing performance.
Code Implementation
Here, we will apply a series of transformations to an RDD and examine the resulting DAG. The transformations are deliberately simple, so they give a basic intuition for the DAG visualization.
In the shell, write the following code:
1. We have created a list that contains some tech names.
>>> tech = ['spark','hadoop','pyspark','hadoop','spark','pyspark','pyspark','hadoop','spark']
2. Then use the SparkContext's parallelize method to convert the list into an RDD.
>>> myrdd = spark.sparkContext.parallelize(tech)
3. Apply the map function to pair each tech name with the integer 1.
>>> pairrdd = myrdd.map(lambda x: (x, 1))
4. Next, we will apply the reduceByKey to get the frequency of each technology name present in the RDD.
>>> rddcount = pairrdd.reduceByKey(lambda x, y: x + y)
5. Finally, we call the action using collect.
>>> rddcount.collect()
6. Output
[('hadoop', 3), ('spark', 3), ('pyspark', 3)]
The DAG for the above transformations, as visualized in the Spark UI, is shown below:
Conclusion
Through this blog, we have understood what a DAG is and how it works. Further, we can say that using DAGs within Apache Spark leads to enhanced execution, robust fault tolerance, and proficient task scheduling. These factors collectively yield swifter data processing and elevated overall performance.
Drop a query if you have any questions regarding DAGs, and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 850k+ professionals in 600+ cloud certifications and completed 500+ consulting projects globally, CloudThat is an official AWS Premier Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR and many more.
FAQs
1. What is a DAG in Apache Spark?
ANS: – A DAG (Directed Acyclic Graph) in Apache Spark represents the sequence of transformations and actions applied to data. It organizes computations into stages and tasks for efficient parallel execution.
2. What is a Narrow Transformation?
ANS: – It is a type of transformation applied to an RDD that doesn’t require data shuffling across partitions. For example: select, filter, map, union, etc.
3. What is a Wide Transformation?
ANS: – Wide Transformation refers to a type of transformation applied to an RDD that may require shuffling or redistributing data across partitions. For example: groupBy, reduceByKey, join, etc.
WRITTEN BY Parth Sharma