Accelerate Data Analysis with Amazon Athena and Apache Spark

Overview

Amazon Athena now supports Apache Spark, a popular open-source distributed processing system designed for fast analytics workloads on data of any size.

Introduction

Amazon Athena is an interactive, serverless query service that can run complex queries in a short amount of time. Because it is serverless, there is no infrastructure to set up or manage, and you pay only for the queries you run. Point Athena at your data in Amazon S3, define the schema, and you are ready to start querying with standard SQL.

Amazon Athena scales automatically by running queries in parallel, so results return quickly even with large datasets and complex queries.

Using Apache Spark with Athena

Amazon Athena enables interactive data analytics and exploration with Apache Spark without the need for resource planning, configuration, or management. Running an Apache Spark application on Athena means submitting Spark code for processing and receiving the results, with no additional configuration. You can use the simplified notebook experience in the Amazon Athena console to build Apache Spark applications in Python, or use the Athena notebook APIs. Apache Spark on Amazon Athena is serverless and scales dynamically on demand to match changing data volumes and processing requirements.
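As a rough illustration of the "submit code, receive results" flow through the API rather than the console, the sketch below submits a block of PySpark code to an already running session and polls until it finishes. It assumes a boto3 Athena client is passed in as athena; the function name run_spark_code, the polling interval, and the session ID are made up for this example.

```python
import time

# States after which a calculation will not change any more.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def run_spark_code(athena, session_id, code, poll_seconds=2.0, sleep=time.sleep):
    """Submit PySpark code to an existing session and wait for it to finish.

    `athena` is expected to be a boto3 Athena client (or any object exposing
    the same two methods). It is passed in, so nothing talks to AWS on import.
    Returns a (calculation_id, final_state) tuple.
    """
    started = athena.start_calculation_execution(
        SessionId=session_id,
        CodeBlock=code,
    )
    calculation_id = started["CalculationExecutionId"]

    while True:
        status = athena.get_calculation_execution_status(
            CalculationExecutionId=calculation_id
        )
        state = status["Status"]["State"]
        if state in TERMINAL_STATES:
            return calculation_id, state
        # Wait before asking again (hypothetical interval, tune as needed).
        sleep(poll_seconds)
```

With boto3 installed and credentials configured, a call might look like run_spark_code(boto3.client("athena"), "your-session-id", "print(spark.version)"). The client is injected instead of created inside the function so that the polling logic can be exercised with a stand-in object.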

AWS Regions where Amazon Athena for Apache Spark is available at the time of writing:

Asia Pacific (Tokyo)
Europe (Ireland)
US East (N. Virginia)
US East (Ohio)
US West (Oregon)

The following Python libraries are preinstalled and can be used directly:

boto3==1.24.31

botocore==1.27.31

certifi==2022.6.15

charset-normalizer==2.1.0

cycler==0.11.0

cython==0.29.30

docutils==0.19

fonttools==4.34.4

idna==3.3

jmespath==1.0.1

joblib==1.1.0

kiwisolver==1.4.4

matplotlib==3.5.2

mpmath==1.2.1

numpy==1.23.1

packaging==21.3

pandas==1.4.3

patsy==0.5.2

pillow==9.2.0

plotly==5.9.0

pmdarima==1.8.5

pyathena==2.9.6

pyparsing==3.0.9

python-dateutil==2.8.2

pytz==2022.1

requests==2.28.1

s3transfer==0.6.0

scikit-learn==1.1.1

scipy==1.8.1

seaborn==0.11.2

six==1.16.0

statsmodels==0.13.2

sympy==1.10.1

tenacity==8.0.1

threadpoolctl==3.1.0

urllib3==1.26.10

pyarrow==9.0.0
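If your code depends on specific versions, it may be worth comparing a list like the one above with what the session actually has installed, since the preinstalled set can change over time. The following is a minimal sketch using only the Python standard library; the helper names parse_pins and compare_with_installed are invented for this example.

```python
from importlib import metadata


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict, skipping blanks."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def compare_with_installed(pins, get_version=metadata.version):
    """Return {name: (pinned, installed)} for packages that differ or are missing.

    `installed` is None when the package is not found in the environment.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

In a notebook cell, compare_with_installed(parse_pins(["numpy==1.23.1", "pandas==1.4.3"])) returns an empty dictionary when both versions match the environment.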

Setting up Apache Spark on Amazon Athena

To begin using Apache Spark on Amazon Athena, you must first create a Spark-enabled workgroup. After switching to that workgroup, you can create a new notebook or open an existing one. When you open a notebook in Athena, a new session starts automatically, and you can work directly in the Athena notebook editor.

Steps to create a Spark enabled workgroup in Athena

Open the Athena console at https://console.aws.amazon.com/athena/
In the navigation pane, choose Workgroups, choose Create workgroup, and enter a name for the workgroup.
For Analytics engine, choose Apache Spark.
To use the example notebook in this tutorial, select Turn on example notebook. This optional feature adds an example notebook named example-notebook-<random string> to your workgroup, along with AWS Glue permissions that the notebook uses to create, display, and delete databases and tables in your account, and read permissions in Amazon S3 for the example dataset.
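The console steps above can also be approximated through the API. The sketch below only builds the arguments for a CreateWorkGroup call; the engine version string "PySpark engine version 3", the role ARN, and the S3 location are assumptions for illustration and should be checked against your account before use. The helper name spark_workgroup_request is hypothetical.

```python
def spark_workgroup_request(name, execution_role_arn, output_location,
                            publish_metrics=True):
    """Build keyword arguments for the Athena CreateWorkGroup API call."""
    return {
        "Name": name,
        "Description": "Spark-enabled workgroup",
        "Configuration": {
            # Selects the Spark engine instead of the default SQL engine.
            # The exact version string is an assumption; verify it first.
            "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
            # IAM role that Athena uses to run Spark code (hypothetical ARN).
            "ExecutionRole": execution_role_arn,
            # S3 location where calculation results are written.
            "ResultConfiguration": {"OutputLocation": output_location},
            # Corresponds to the Publish CloudWatch metrics option.
            "PublishCloudWatchMetricsEnabled": publish_metrics,
        },
    }


request = spark_workgroup_request(
    name="spark-demo",
    execution_role_arn="arn:aws:iam::123456789012:role/example-athena-spark-role",
    output_location="s3://example-bucket/athena-spark-results/",
)
```

With boto3 installed, the request could then be sent with boto3.client("athena").create_work_group(**request). Unlike the console option, this sketch does not add the example notebook or its permissions.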

Note – Running the example notebook may incur some additional costs.

Switching workgroups and opening notebook explorer

On the Workgroups page of the Athena console, select the option button next to the Spark-enabled workgroup you just created.
Choose Actions -> Switch workgroup. (You will be notified by the console that you have changed to the new workgroup.)
Choose Notebook explorer from the console navigation pane.

The Notebook explorer can be used in multiple ways –

Select the linked name of a notebook to open it in a new session.
Use the Actions menu to rename, delete, or export the notebook.
Choose Import file to import the notebook.
Click on Create Notebook to create a new notebook.

Running the example notebook

A dataset of New York City taxi trips is queried in the example notebook.

To run the example notebook

From Notebook explorer, select the linked name of the example notebook. This opens the notebook in the notebook editor and initiates a notebook session with the default settings. You are notified that a new Apache Spark session has been launched using the default settings (20 maximum DPUs).

Select the Run button once for each cell in the notebook to run the cells sequentially and view the results.

Scroll down to each cell to see the results and bring new cells into view.
Cells that run calculations show a progress bar with the percentage complete, the elapsed time, and the estimated time remaining.

Terminating a Session

Choose the session menu from the notebook editor and click on Terminate. A Confirm Session Termination prompt will pop up to confirm. Choose Confirm, and you will return to the notebook editor. Your notebook will be saved as well.
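For sessions started outside the console, termination can be sketched through the API as well. The example below assumes a boto3 Athena client is passed in as athena; the function name terminate_and_wait, the polling interval, and the poll limit are choices made for this illustration.

```python
import time


def terminate_and_wait(athena, session_id, poll_seconds=2.0, max_polls=30,
                       sleep=time.sleep):
    """Request session termination, then poll until the session has stopped.

    `athena` is expected to be a boto3 Athena client (or a stand-in with the
    same methods). Returns the final state, or raises TimeoutError if the
    session has not stopped after `max_polls` checks.
    """
    athena.terminate_session(SessionId=session_id)

    for _ in range(max_polls):
        status = athena.get_session_status(SessionId=session_id)
        state = status["Status"]["State"]
        if state in ("TERMINATED", "FAILED"):
            return state
        sleep(poll_seconds)

    raise TimeoutError(f"session {session_id} did not terminate in time")
```

The poll limit keeps a forgotten script from waiting indefinitely if the session never reports a final state.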

Creating your notebook

1. From the console navigation pane, choose Notebook explorer or Notebook editor.

2. Do one of the following:

In Notebook explorer, choose Create notebook.
In Notebook editor, choose Create notebook or click the plus (+) button to add a notebook.

3. Enter a name for the notebook in the Create notebook dialog box.

4. Optionally, expand Session parameters and fill in values for the optional parameters.

5. Click Create.
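The optional session parameters correspond roughly to the engine configuration of the StartSession API. The sketch below builds such a request, using the default of 20 maximum DPUs mentioned earlier. The coordinator and executor sizes of 1 DPU, the helper name session_request, and the sanity check are assumptions of this example rather than documented rules.

```python
def session_request(workgroup, max_concurrent_dpus=20, coordinator_dpu_size=1,
                    default_executor_dpu_size=1, idle_timeout_minutes=None):
    """Build keyword arguments for the Athena StartSession API call."""
    # Sanity check of this sketch (not a documented API rule): leave room
    # for the coordinator plus at least one executor.
    if max_concurrent_dpus < coordinator_dpu_size + default_executor_dpu_size:
        raise ValueError(
            "max_concurrent_dpus is too small for a coordinator and one executor"
        )

    request = {
        "WorkGroup": workgroup,
        "EngineConfiguration": {
            "CoordinatorDpuSize": coordinator_dpu_size,
            "MaxConcurrentDpus": max_concurrent_dpus,
            "DefaultExecutorDpuSize": default_executor_dpu_size,
        },
    }
    # Only send the idle timeout when the caller sets one explicitly.
    if idle_timeout_minutes is not None:
        request["SessionIdleTimeoutInMinutes"] = idle_timeout_minutes
    return request
```

With boto3 installed, boto3.client("athena").start_session(**session_request("spark-demo")) would start a session in a workgroup named spark-demo, which is a placeholder name.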

Supported data and storage formats

The natively supported formats are shown in the following table. For more details on Spark data sources, see Data Sources in the Apache Spark documentation.

table

Monitoring Apache Spark calculations with CloudWatch metrics

When the Publish CloudWatch metrics option is selected for your Spark-enabled workgroup, Athena publishes calculation-related metrics to Amazon CloudWatch. In the CloudWatch console, you can build custom dashboards and configure alarms and triggers on these metrics.

Athena publishes the following metric to CloudWatch under the AmazonAthenaForApacheSpark namespace:

DPUCount: the number of DPUs used during the session to run calculations.

The DPUCount metric has the following dimensions:

SessionId: the ID of the session in which the calculations are submitted.
Workgroup: the name of the workgroup.
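To read the DPU metric for one session programmatically, a request for the CloudWatch GetMetricStatistics API can be sketched as follows. The metric name DPUCount and the dimension spelling WorkGroup are assumptions here; dimension names are case-sensitive, so compare them with what your account actually reports before relying on them. The helper name dpu_count_request is hypothetical.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "AmazonAthenaForApacheSpark"


def dpu_count_request(session_id, workgroup, end=None, hours=1, period_seconds=60):
    """Build keyword arguments for the CloudWatch GetMetricStatistics API call."""
    end = end or datetime.now(timezone.utc)
    return {
        "Namespace": NAMESPACE,
        # Assumed metric name; confirm it in the CloudWatch console.
        "MetricName": "DPUCount",
        "Dimensions": [
            {"Name": "SessionId", "Value": session_id},
            # Assumed spelling; dimension names are case-sensitive.
            {"Name": "WorkGroup", "Value": workgroup},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        # Peak number of DPUs within each period.
        "Statistics": ["Maximum"],
    }
```

With boto3 installed, boto3.client("cloudwatch").get_metric_statistics(**dpu_count_request("your-session-id", "spark-demo")) returns the datapoints for the last hour.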

To view metrics for Spark-enabled workgroups in the CloudWatch console:

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/
In the navigation pane, choose Metrics -> All metrics, and then select the AmazonAthenaForApacheSpark namespace from the list.

To view the metrics using the AWS CLI:

aws cloudwatch list-metrics --namespace "AmazonAthenaForApacheSpark"

Conclusion

With Amazon Athena, you can run Apache Spark data analytics and exploration interactively without having to plan, set up, or manage resources. Running an Apache Spark application on Athena comes down to submitting Spark code for processing and receiving the results, with no further configuration.

FAQs

1. What is DPUCount?

ANS: – DPUCount is the number of DPUs used by the calculations in a session. A DPU is a relative measure of processing power that consists of 4 virtual CPUs of compute capacity and 16 GB of memory.
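As a quick worked example of what that means for capacity, the snippet below converts a DPU count into virtual CPUs and memory using the figures above (4 virtual CPUs and 16 GB per DPU). The function name dpu_capacity is made up for this illustration.

```python
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def dpu_capacity(dpus):
    """Return the total virtual CPUs and memory (GB) for a number of DPUs."""
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }


# The default session limit of 20 DPUs mentioned earlier:
print(dpu_capacity(20))  # {'vcpus': 80, 'memory_gb': 320}
```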

2. How will the session be managed if you need to work on multiple projects simultaneously?

ANS: – You can create a separate session for each project you are working on, and the sessions run independently of one another.

3. What are magic commands, and how are they used?

ANS: – Magic commands, or magics, are special commands that you can run in a notebook cell. For instance, %env displays the environment variables in a notebook session. A single percent sign (%) marks a line magic, which applies to one line. Cell magics are written across multiple lines and are prefixed with a double percent sign (%%).

WRITTEN BY Sahil Kumar

Sahil Kumar works as a Subject Matter Expert - Data and AI/ML at CloudThat. He is a certified Google Cloud Professional Data Engineer. He has a great enthusiasm for cloud computing and a strong desire to learn new technologies continuously.