
Accelerate Data Analysis with Amazon Athena and Apache Spark

Overview

Amazon Athena now supports Apache Spark, a popular open-source distributed processing engine designed for fast analytics workloads on data of any size.

Introduction

Amazon Athena is an interactive query service that can run complex queries quickly. Because it is serverless, there is no infrastructure to set up or manage, and you pay only for the queries you run. Point Athena at your data in Amazon S3, define the schema, and you are ready to start querying with standard SQL.

Amazon Athena scales automatically by running queries in parallel, so results come back quickly even with large datasets and complex queries.

Using Apache Spark with Athena

Amazon Athena enables interactive data analytics and exploration with Apache Spark without any resource planning, configuration, or management. To run a Spark application on Athena, you submit Spark code for processing and receive the results directly, with no additional configuration. You can use the simplified notebook experience in the Athena console to develop Apache Spark applications in Python, or you can use the Athena notebook APIs. Apache Spark on Amazon Athena is serverless and scales dynamically, on demand, to match changing data volumes and processing requirements.
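As a rough illustration of the notebook APIs mentioned above, here is a minimal sketch of submitting Spark code to a session from Python with boto3. The workgroup name my-spark-workgroup and the helper names are hypothetical, and the demo function is deliberately not called because it needs boto3, AWS credentials, and an existing Spark-enabled workgroup.

```python
import time

TERMINAL_STATES = ("COMPLETED", "FAILED", "CANCELED")


def build_session_request(workgroup, max_dpus=20):
    """Keyword arguments for the Athena StartSession call (helper name is hypothetical)."""
    return {
        "WorkGroup": workgroup,
        "EngineConfiguration": {"MaxConcurrentDpus": max_dpus},
    }


def run_code_block(athena, session_id, code, poll_seconds=2.0):
    """Submit Spark code to a session and poll until it reaches a terminal state."""
    calc = athena.start_calculation_execution(SessionId=session_id, CodeBlock=code)
    calc_id = calc["CalculationExecutionId"]
    while True:
        response = athena.get_calculation_execution(CalculationExecutionId=calc_id)
        state = response["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


def demo():
    """Not called automatically: needs boto3, credentials, and a Spark-enabled workgroup."""
    import boto3

    athena = boto3.client("athena")
    # "my-spark-workgroup" is a placeholder name, not one from the article.
    session = athena.start_session(**build_session_request("my-spark-workgroup"))
    session_id = session["SessionId"]
    # Wait until the session is ready to accept calculations.
    while True:
        state = athena.get_session_status(SessionId=session_id)["Status"]["State"]
        if state == "IDLE":
            break
        if state in ("FAILED", "TERMINATED", "DEGRADED"):
            raise RuntimeError(f"session ended in state {state}")
        time.sleep(2)
    print(run_code_block(athena, session_id, "print(spark.version)"))
```

Keeping the request-building and polling logic in small functions makes it possible to exercise them with a stub client before pointing them at a real account.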

At the time of writing, Amazon Athena for Apache Spark is available in the following AWS Regions –

  1. Asia Pacific (Tokyo)
  2. Europe (Ireland)
  3. US East (N. Virginia)
  4. US East (Ohio)
  5. US West (Oregon)

Athena for Apache Spark also comes with a set of preinstalled Python libraries that notebooks can import directly (the list is not reproduced here).

Setting up Apache Spark on Amazon Athena

To begin using Apache Spark on Amazon Athena, you must first create a Spark-enabled workgroup. After switching to the workgroup, you can create a new notebook or open an existing one. When you open a notebook in Athena, a new session starts automatically, and you can work directly in the Athena notebook editor.

Steps to create a Spark enabled workgroup in Athena

  1. Open the Athena console at https://console.aws.amazon.com/athena/
  2. In the navigation pane, choose Workgroups, click Create workgroup, and enter a name for the workgroup.
  3. For Analytics engine, choose Apache Spark.
  4. To follow this tutorial with the example notebook, select Turn on example notebook. This optional feature adds an example notebook named example-notebook-<random string> to your workgroup, along with AWS Glue permissions that the notebook uses to create, show, and delete databases and tables in your account, and read permissions in Amazon S3 for the example dataset.
Note – Running the example notebook may incur additional costs.
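The console steps above can also be approximated in code. The sketch below builds the request for the boto3 Athena create_work_group call. The helper names, role ARN, and bucket are hypothetical, and the engine version string "PySpark engine version 3" is an assumption to check against the current Athena documentation. The sketch does not turn on the example notebook.

```python
def spark_workgroup_request(name, execution_role_arn, output_location):
    """Build keyword arguments for athena.create_work_group (helper name is hypothetical)."""
    return {
        "Name": name,
        "Configuration": {
            # Assumed engine version string; verify against the Athena documentation.
            "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
            # IAM role that Athena assumes to run Spark calculations.
            "ExecutionRole": execution_role_arn,
            # S3 location where calculation results are written.
            "ResultConfiguration": {"OutputLocation": output_location},
        },
    }


def create_spark_workgroup(athena, name, execution_role_arn, output_location):
    """Create the workgroup using a boto3 Athena client passed in by the caller."""
    request = spark_workgroup_request(name, execution_role_arn, output_location)
    return athena.create_work_group(**request)
```

A caller would pass boto3.client("athena") as the first argument, together with a role and an S3 location that already exist in the account.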

Switching workgroups and opening notebook explorer

  1. On the Workgroups page of the Athena console, select the option button next to the Spark-enabled workgroup you just created.
  2. Choose Actions -> Switch workgroup. (The console notifies you that you have switched to the new workgroup.)
  3. Choose Notebook explorer from the console navigation pane.

The Notebook explorer can be used in multiple ways –

  • Select a notebook's linked name to open it in a new session.
  • Use the Action menu to rename, delete, or export a notebook.
  • Choose Import file to import a notebook.
  • Choose Create notebook to create a new notebook.

Running the example notebook

A dataset of New York City taxi trips is queried in the example notebook.

To run the example notebook

  1. In Notebook explorer, select the linked name of the example notebook. This opens the notebook in the notebook editor and starts a notebook session with the default settings (a maximum of 20 DPUs). The console notifies you that a new Apache Spark session has started.
  2. Choose Run on each cell in turn to run the cells sequentially and view the results.
  • Scroll down to see each cell's results and to bring new cells into view.
  • Cells that perform calculations show a progress bar with the percentage complete, the elapsed time, and the estimated time remaining.
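To give a feel for what a notebook cell might contain, here is a small sketch in the spirit of the taxi-trip example. It is not the code from the example notebook. The S3 path is left to the caller, the column name tpep_pickup_datetime is an assumption about the dataset's schema, and the function takes the notebook's predefined spark session as a parameter so that nothing needs to be imported.

```python
def busiest_pickup_hours(spark, path, top_n=5):
    """Count trips per pickup hour and return the busiest hours.

    `spark` is the SparkSession that an Athena notebook session provides.
    The column name tpep_pickup_datetime is an assumed schema detail.
    """
    trips = spark.read.parquet(path)
    # Register the DataFrame so it can be queried with Spark SQL.
    trips.createOrReplaceTempView("trips")
    return spark.sql(
        "SELECT hour(tpep_pickup_datetime) AS pickup_hour, count(*) AS trip_count "
        "FROM trips GROUP BY 1 ORDER BY trip_count DESC "
        f"LIMIT {int(top_n)}"
    )
```

In a notebook cell, the call would look like busiest_pickup_hours(spark, "<s3 path to the dataset>").show(), with the path replaced by the location of your own data.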

Terminating a Session

In the notebook editor, open the Session menu and choose Terminate. When the Confirm session termination prompt appears, choose Confirm. The session ends, your notebook is saved, and you return to the notebook editor.
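Outside the console, a session can also be ended through the API. The following is a minimal sketch using the boto3 Athena calls terminate_session and get_session_status. The function name, polling interval, and retry limit are illustrative choices, not values from the article.

```python
import time


def terminate_and_wait(athena, session_id, poll_seconds=2.0, max_polls=30):
    """Request termination, then poll until the session reports a final state."""
    athena.terminate_session(SessionId=session_id)
    state = "TERMINATING"
    for _ in range(max_polls):
        state = athena.get_session_status(SessionId=session_id)["Status"]["State"]
        if state in ("TERMINATED", "FAILED"):
            break
        time.sleep(poll_seconds)
    # Returns the last state seen, which may still be TERMINATING after max_polls.
    return state
```

Here athena is a boto3 Athena client, and session_id is the ID returned when the session was started.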

Creating your notebook

  1. From the console navigation pane, choose Notebook explorer or Notebook editor.
  2. Do one of the following –
  • In Notebook explorer, choose Create notebook.
  • In Notebook editor, choose Create notebook or click the (+) button to add a notebook.
  3. Enter a name for the notebook in the Create notebook dialog box.
  4. (Optional) Expand Session parameters and enter values for any parameters you want to set.
  5. Choose Create.

Supported data and storage formats

The following table shows the natively supported formats. For more details on Spark data sources, see Data Sources in the Apache Spark documentation.

 

[Table: natively supported data and storage formats]

Monitoring Apache Spark calculations with CloudWatch metrics

When the Publish CloudWatch metrics option is enabled for your Spark-enabled workgroup, Athena publishes calculation-related metrics to Amazon CloudWatch. In the CloudWatch console, you can build custom dashboards and set alarms and triggers on these metrics.

Athena publishes the following metric to CloudWatch under the AmazonAthenaForApacheSpark namespace:

  • DPUCount – the number of DPUs consumed during the session to run the calculations.

The DPUCount metric has the following dimensions –

  1. SessionId – the ID of the session in which the calculations are submitted.
  2. Workgroup – the name of the workgroup.

To view metrics for Spark-enabled workgroups in the Amazon CloudWatch console –

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/
  2. In the navigation pane, choose Metrics -> All metrics, and then select the AmazonAthenaForApacheSpark namespace from the list.

To view the metrics using the AWS CLI –

>> aws cloudwatch list-metrics --namespace "AmazonAthenaForApacheSpark"
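The same listing can be done from Python. This is a minimal sketch that mirrors the CLI command above using the boto3 CloudWatch list_metrics paginator. The function name is hypothetical, and the caller supplies the client, for example boto3.client("cloudwatch").

```python
NAMESPACE = "AmazonAthenaForApacheSpark"


def list_spark_metrics(cloudwatch):
    """Return (metric name, dimensions) pairs published under the Athena Spark namespace."""
    found = []
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace=NAMESPACE):
        for metric in page["Metrics"]:
            # Flatten the dimension list into a name -> value mapping.
            dimensions = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
            found.append((metric["MetricName"], dimensions))
    return found
```

Reading the dimension names from the response, rather than hardcoding them, avoids depending on their exact spelling and capitalization.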

Conclusion

With Amazon Athena, you can run Apache Spark data analytics and exploration interactively without having to plan, set up, or manage resources. Running a Spark application on Athena comes down to submitting Spark code for processing and receiving the results, with no additional configuration.

About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner. We help people build cloud knowledge and help businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by sharing knowledge of the technical intricacies of the cloud. Our blogs, webinars, case studies, and white papers serve stakeholders across the cloud computing sphere.

FAQs

1. What is DPUCount?

ANS: – DPUCount is the number of DPUs (data processing units) used by a session. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.
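As a quick worked example of the figures in this answer, the sketch below converts a DPU count into total vCPUs and memory. The function name is illustrative. The value 20 is the default session maximum mentioned earlier in the article.

```python
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def dpu_capacity(dpus):
    """Translate a number of DPUs into total vCPUs and memory in GB."""
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }


# A session capped at 20 DPUs can use up to 80 vCPUs and 320 GB of memory.
print(dpu_capacity(20))
```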

2. How are sessions managed if you need to work on multiple projects simultaneously?

ANS: – You can create a separate session for each project you are working on at the same time. The sessions are independent of one another.

3. What are magic commands, and how do you use them?

ANS: – Magic commands, or magics, are special commands that you can run in a notebook cell. For instance, %env displays the environment variables in a notebook session. Line magics are prefixed with a single percent sign (%) and apply to one line. Cell magics are prefixed with a double percent sign (%%) and can span multiple lines of a cell.

WRITTEN BY Sahil Kumar

Sahil Kumar works as a Subject Matter Expert - Data and AI/ML at CloudThat. He is a certified Google Cloud Professional Data Engineer. He has a great enthusiasm for cloud computing and a strong desire to learn new technologies continuously.
