Query Optimization Techniques with Amazon Athena's Partition Projection

Overview

In the era of data-driven decision-making, the ability to efficiently query and analyze vast amounts of data has become a critical asset for businesses. Amazon Athena, a serverless interactive query service provided by Amazon Web Services (AWS), empowers organizations to analyze data stored in Amazon S3 using standard SQL queries without complex infrastructure management. One of the key features that contribute to Amazon Athena’s exceptional performance is partitioning. In this blog, we’ll dive into Amazon Athena partitioning, exploring its benefits, implementation, and best practices for optimizing Data Analysis.

Pioneers in Cloud Consulting & Migration Services

Reduced infrastructural costs
Accelerated application deployment

Get Started

Introduction

Partitioning is a data organization technique that divides a large dataset into smaller, more manageable subsets or partitions based on specific criteria. These criteria typically align with the natural structure of the data, such as dates, categories, or geographical locations. Partitioning enhances query performance by reducing the amount of data that needs to be scanned when executing queries.

Use Cases

Customer Segmentation: Analyzing data on specific attributes(e.g., region, time frame) to enable targeted analysis.
Multi-Account Organization: Organization structure where each Account has data such as events or application logs. Partition data by Account ID to keep data isolated, secure, and easy to query.

Benefits of Amazon Athena Partitioning

Improved Query Performance: Partitioning allows Amazon Athena to skip scanning irrelevant partitions when executing queries. This optimization significantly reduces query execution time and improves overall performance, enabling faster and more responsive data analysis.
Cost Efficiency: By scanning only the relevant partitions, Amazon Athena reduces the data processed, leading to cost savings. As you pay based on the amount of data scanned, partitioning can help optimize your query costs.
Simplified Data Management: Partitioning aligns data storage with the logical structure of the dataset. This organization simplifies data management, making it easier to locate, query, and maintain specific subsets of data.
Enhanced Data Exploration: Partitioning enables data analysts and scientists to focus on specific dataset segments during exploratory analysis. This targeted approach streamlines the process of uncovering insights and trends within the data.

Implementing Partitioning in Amazon Athena

To harness the benefits of partitioning in Amazon Athena, follow these key steps:

Data Preparation: Before partitioning, ensure that your data is appropriately structured. Partition keys should be carefully chosen based on the nature of your data and the type of queries you plan to run. For instance, partitioning by date can be a logical choice if your dataset contains sales data.
Partitioning Columns: Define the columns acting as partition keys when creating your Amazon Athena table. These partitioning columns determine how your data is divided into partitions. Use the PARTITIONED BY clause when creating your table, specifying the partitioning columns.
Directory Structure: In Amazon S3, each partition is represented by a separate subdirectory. The directory structure should mirror the partition keys, helping Amazon Athena efficiently navigate and scan the relevant data during query execution.
Query Optimization: Leverage partition elimination in your SQL queries by specifying the partition key values in the WHERE This instructs Amazon Athena to scan only the relevant partitions, enhancing query performance.

Steps to set Partitioning in Amazon Athena

Before you begin creating tables, you should understand more about the requirements for partitioning tables. This can help you create the tables that you need.

AWS CloudTrail service records all AWS activities and delivers them to Amazon S3. Several events occur daily in AWS CloudTrail, so it’s difficult to find out or time consuming to query from all the events logs at the time of incident. So, the demonstration below will show you how you can set up partitioning for AWS CloudTrail logs. Similarly, this can be implemented to other logs in the stored Amazon S3 bucket with the structure described below in step 1.

Step 1: Access the AWS console, locate the Amazon S3 bucket, and identify the folder. Note the folder’s path, which can serve as a partitioning key.

step1

Step 2: Create a Table using partitioning

Modify the LOCATION clause within the CREATE TABLE statement.

Format:

LOCATION 's3://cloudtrail_bucket_name/AWSLogs/<Account_ID>/CloudTrail/'

1	LOCATION 's3://cloudtrail_bucket_name/AWSLogs/<Account_ID>/CloudTrail/'

Example:

LOCATION ‘s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail

1	LOCATION ‘s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail

In the PARTITIONED BY clause, add an entry for the timestamp and region as a string, as in the following example:

PARTITIONED BY (timestamp string, region string)

1	PARTITIONED BY (timestamp string, region string)

In the TBLProperties clause, define partition column types and Amazon S3 location with a variable that acts as dynamic input for forecasting.

TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.timestamp.format'='yyyy/MM/dd',
  'projection.timestamp.interval'='1',
  'projection.timestamp.interval.unit'='DAYS', 
  'projection.timestamp.range'='2020/01/01,NOW',
  'projection.timestamp.type'='date',
  'projection.region.type'='injected',
  'storage.location.template'='s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/${region}/${timestamp}')

TBLPROPERTIES (

'projection.enabled'='true',

'projection.timestamp.format'='yyyy/MM/dd',

'projection.timestamp.interval'='1',

'projection.timestamp.interval.unit'='DAYS',

'projection.timestamp.range'='2020/01/01,NOW',

'projection.timestamp.type'='date',

'projection.region.type'='injected',

'storage.location.template'='s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/${region}/${timestamp}')

Here is the Complete DDL statement(sample) with partitioning schema

CREATE EXTERNAL TABLE demo_cloudtrail_partition(
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
            ec2RoleDelivery:string,
            webIdFederationData:map<string,string>
        >
    >,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
        tlsVersion:string,
        cipherSuite:string,
        clientProvidedHostHeader:string>
  )
PARTITIONED BY (
   `timestamp` string,
   `region` string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.timestamp.format'='yyyy/MM/dd',
  'projection.timestamp.interval'='1',
  'projection.timestamp.interval.unit'='DAYS', 
  'projection.timestamp.range'='2020/01/01,NOW',
  'projection.timestamp.type'='date',
  'projection.region.type'='injected',
  'storage.location.template'='s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/${region}/${timestamp}')

CREATE EXTERNAL TABLE demo_cloudtrail_partition(

eventVersion STRING,

userIdentity STRUCT<

type: STRING,

principalId: STRING,

arn: STRING,

accountId: STRING,

invokedBy: STRING,

accessKeyId: STRING,

userName: STRING,

sessionContext: STRUCT<

attributes: STRUCT<

mfaAuthenticated: STRING,

creationDate: STRING>,

sessionIssuer: STRUCT<

type: STRING,

principalId: STRING,

arn: STRING,

accountId: STRING,

userName: STRING>,

ec2RoleDelivery:string,

webIdFederationData:map<string,string>

eventTime STRING,

eventSource STRING,

eventName STRING,

awsRegion STRING,

sourceIpAddress STRING,

userAgent STRING,

errorCode STRING,

errorMessage STRING,

requestparameters STRING,

responseelements STRING,

additionaleventdata STRING,

requestId STRING,

eventId STRING,

readOnly STRING,

resources ARRAY<STRUCT<

arn: STRING,

accountId: STRING,

type: STRING>>,

eventType STRING,

apiVersion STRING,

recipientAccountId STRING,

serviceEventDetails STRING,

sharedEventID STRING,

vpcendpointid STRING,

tlsDetails struct<

tlsVersion:string,

cipherSuite:string,

clientProvidedHostHeader:string>

)

PARTITIONED BY (

`timestamp` string,

`region` string)

ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'

STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

's3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail'

TBLPROPERTIES (

'projection.enabled'='true',

'projection.timestamp.format'='yyyy/MM/dd',

'projection.timestamp.interval'='1',

'projection.timestamp.interval.unit'='DAYS',

'projection.timestamp.range'='2020/01/01,NOW',

'projection.timestamp.type'='date',

'projection.region.type'='injected',

'storage.location.template'='s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/${region}/${timestamp}')

Run the Queries

To use partition, you are required to give additional parameters (region and timestamp) in where condition in query as shown below

Example query:

SELECT * FROM "<db name>"."<table name> " where region = '<aws-region>' and timestamp = '<yyyy/mm/dd>' limit 10;

1	SELECT * FROM "<db name>"."<table name> " where region = '<aws-region>' and timestamp = '<yyyy/mm/dd>' limit 10;

step2

Conclusion

Amazon Athena’s partitioning feature is powerful for optimizing data analysis and query performance.

By strategically organizing data into partitions based on meaningful criteria, organizations can unlock faster query execution, reduce costs, and simplify data management. As businesses grapple with ever-growing volumes of data, leveraging Amazon Athena partitioning becomes a crucial step toward efficient and insightful data-driven decision-making.

Whether analyzing sales trends, user behavior, or any other dataset, Amazon Athena partitioning empowers you to explore and extract value from your data quickly and precisely.

Drop a query if you have any questions regarding Amazon Athena and we will get back to you quickly.

Making IT Networks Enterprise-ready – Cloud Management Services

Accelerated cloud migration
End-to-end view of the cloud environment

Get Started

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. Can I partition data stored in other AWS services, like Amazon Redshift or Amazon RDS, with Amazon Athena?

ANS: – No, Amazon Athena partitions data stored in Amazon S3. If you have data in other AWS services, you may need to transfer or copy it to Amazon S3 before partitioning it with Amazon Athena.

2. What column should I choose for partitioning?

ANS: – The choice of partitioning column depends on your stored data structure and query patterns. Typically, it’s best to choose a column that is frequently used in WHERE clauses of your queries.

3. Can I change the partitioning column later without affecting data?

ANS: – Yes, you can change the partitioning column. It may involve recreating your table.

WRITTEN BY Kashyap Nitinbhai Shani

Kashyap Nitinbhai Shani is a Research Associate at CloudThat. He is interested to learn advanced technologies and gain insights into new and upcoming cloud services. He likes writing tech blogs and learning new languages.