Seamless Data Lake Management with Apache Iceberg, AWS Glue, and Amazon Athena

Overview

Apache Iceberg is an open table format for large analytic datasets that supports partition evolution, time travel, and atomic commits. Combining Iceberg with the AWS Glue Data Catalog and Amazon Athena simplifies data lake workflows on AWS, because both services can read and write the same tables through a shared catalog.

In this blog, we will explore:

1. Registering Iceberg tables in the AWS Glue Catalog using Amazon Athena and AWS Glue, for both partitioned and unpartitioned tables.

2. Performing UPSERT operations with AWS Glue and Amazon Athena.

3. Enabling and leveraging Time Travel in Iceberg tables.

Registering Iceberg Tables in the AWS Glue Catalog

Iceberg tables can be partitioned or unpartitioned. Registering them in the AWS Glue Data Catalog allows both Amazon Athena and AWS Glue ETL jobs to query and modify them.

a. Registering Unpartitioned Iceberg Tables

To register an unpartitioned Iceberg table in the AWS Glue Catalog, follow these steps:

Step 1: Create the Table in Amazon Athena

Iceberg tables can be created using Amazon Athena’s SQL interface:

CREATE TABLE glue_catalog.database_name.unpartitioned_table (
    id BIGINT,
    name STRING,
    age INT
)
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

This command:

Creates an unpartitioned Iceberg table in the AWS Glue Catalog.
Marks the table as an Iceberg table through the table_type property, with its data and metadata stored under the given Amazon S3 location.

Step 2: Verify Registration

Confirm that the table appears under the appropriate database in the AWS Glue console.
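The same check can be scripted. The following is a minimal sketch, assuming the Glue entry for an Iceberg table carries a table_type parameter of ICEBERG and a metadata_location pointer. The helper name is_iceberg_table is hypothetical, and the sample response is hand-written to mirror the shape returned by boto3's glue.get_table call rather than captured from a real account.

```python
from typing import Any, Mapping


def is_iceberg_table(get_table_response: Mapping[str, Any]) -> bool:
    """Return True if a Glue get_table response describes an Iceberg table."""
    params = get_table_response.get("Table", {}).get("Parameters", {})
    # Iceberg tables carry table_type=ICEBERG (case may vary) plus a pointer
    # to the current metadata file.
    return (
        params.get("table_type", "").upper() == "ICEBERG"
        and bool(params.get("metadata_location"))
    )


# Hand-written sample shaped like a boto3 glue.get_table(...) response.
# The metadata file name is a placeholder, not a real object.
sample_response = {
    "Table": {
        "Name": "unpartitioned_table",
        "DatabaseName": "database_name",
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": "s3://amzn-s3-demo-bucket/your-folder/metadata/00000-example.metadata.json",
        },
    }
}

print(is_iceberg_table(sample_response))
```

In a real job, the response would come from boto3.client("glue").get_table(DatabaseName="database_name", Name="unpartitioned_table").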

Step 3: Query the Table with Amazon Athena

Test the table with simple queries in Athena:

SELECT * FROM glue_catalog.database_name.unpartitioned_table;

b. Registering Partitioned Iceberg Tables

Partitioning lets queries that filter on the partition column skip unrelated data files, which reduces the amount of data scanned. To register a partitioned Iceberg table:

Step 1: Create the Partitioned Table

CREATE TABLE glue_catalog.database_name.partitioned_table (
    id BIGINT,
    name STRING,
    age INT
)
PARTITIONED BY (age)
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

The PARTITIONED BY clause defines the partition key (age in this example).

Step 2: Load Data

Data can be inserted using Amazon Athena’s SQL:

INSERT INTO glue_catalog.database_name.partitioned_table VALUES
(1, 'Alice', 25),
(2, 'Bob', 30);

Step 3: Verify Partitions

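Iceberg keeps partition information in its own metadata rather than in the Glue partition list, so SHOW PARTITIONS is not a reliable way to inspect it. In Amazon Athena, query the table's $partitions metadata table instead:

SELECT * FROM glue_catalog.database_name."partitioned_table$partitions";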

c. Registering Iceberg Tables via AWS Glue ETL Jobs

For AWS Glue ETL jobs to manage Iceberg tables:

Use AWS Glue version 3.0 or later.
Set the --datalake-formats job parameter to iceberg to enable the built-in Iceberg libraries, or add the Iceberg connector jar (aws-glue-iceberg.jar) to the job if required.
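These job-level settings can be sketched as Glue job parameters. This is a minimal example, assuming the native Iceberg support that the --datalake-formats parameter enables on Glue 3.0 and later. The helper build_glue_iceberg_args is hypothetical, and the catalog name and bucket path are the placeholders used elsewhere in this post. Glue accepts a single --conf key, so additional Spark settings are chained inside its value.

```python
def build_glue_iceberg_args(catalog: str, warehouse: str) -> dict:
    """Build default arguments for a Glue job that works with Iceberg tables."""
    spark_conf = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
        f"spark.sql.catalog.{catalog}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    }
    pairs = [f"{key}={value}" for key, value in spark_conf.items()]
    # Glue takes one --conf argument; further settings are appended as
    # " --conf key=value" inside the same string.
    return {
        "--datalake-formats": "iceberg",
        "--conf": " --conf ".join(pairs),
    }


job_args = build_glue_iceberg_args("glue_catalog", "s3://your-bucket/path/")
for name, value in job_args.items():
    print(name, value)
```

The resulting dictionary has the shape expected for a job's default arguments, whether they are entered in the console or passed through an API or infrastructure template.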

Example PySpark Script

from pyspark.sql import SparkSession

# Configure a Spark catalog named glue_catalog backed by the AWS Glue Data Catalog
spark = (
    SparkSession.builder
    .appName("Glue-Iceberg-Table")
    # The Iceberg SQL extensions are required for MERGE INTO and other row-level commands
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/path/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .getOrCreate()
)

# Create a partitioned Iceberg table; format version 2 enables row-level deletes
spark.sql("""
CREATE TABLE glue_catalog.database_name.etl_table (
    id BIGINT,
    name STRING,
    age INT
)
USING iceberg
PARTITIONED BY (age)
LOCATION 's3://your-bucket/path/'
TBLPROPERTIES ('format-version' = '2')
""")

Performing UPSERT Operations

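Iceberg supports upserts through MERGE INTO, which applies updates and inserts in a single atomic commit. On format version 2 tables, the merge can run in copy-on-write mode (the default) or merge-on-read mode, controlled by the write.merge.mode table property.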

Using Glue ETL for UPSERTs

Step 1: Load the Delta Data

Delta data (new or updated records) can be loaded into a Spark DataFrame.
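One detail is worth handling at this step: Iceberg's MERGE INTO raises an error when more than one source row matches the same target row, so the delta data should hold at most one record per key. The sketch below shows that rule in plain Python so the logic is easy to follow. The function name latest_per_key and the updated_at column are assumptions for illustration; in a Glue job the same rule would normally be applied to the DataFrame, for example with a window function.

```python
from typing import Dict, Iterable, List


def latest_per_key(records: Iterable[dict], key: str = "id", order_by: str = "updated_at") -> List[dict]:
    """Keep only the most recent record for each key value."""
    latest: Dict[object, dict] = {}
    for record in records:
        current = latest.get(record[key])
        # Replace the stored record only when the new one is more recent.
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())


# Illustrative delta batch with two versions of the record for id 1.
delta_records = [
    {"id": 1, "name": "Alice", "updated_at": "2024-12-01 09:00:00"},
    {"id": 1, "name": "Alicia", "updated_at": "2024-12-01 10:30:00"},
    {"id": 2, "name": "Bob", "updated_at": "2024-12-01 08:15:00"},
]

for row in latest_per_key(delta_records):
    print(row)
```

Timestamps are compared as strings here, which works only because they share one fixed, sortable format.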

Step 2: Perform the Merge

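With the Iceberg SQL extensions enabled, Spark SQL provides the MERGE INTO command for upserts. The source of the merge (delta_table in this example) can be a catalog table or a temporary view created from the DataFrame.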

spark.sql("""
MERGE INTO glue_catalog.database_name.target_table t
USING glue_catalog.database_name.delta_table d
ON t.id = d.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")

WHEN MATCHED: Updates existing records.
WHEN NOT MATCHED: Inserts new records.
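UPDATE SET * and INSERT * rely on the source and target sharing the same column names. When they differ, or when only some columns should change, the statement can list columns explicitly. The helper below is a minimal sketch of generating such a statement; build_merge_sql and the delta_updates view name are hypothetical, and the output is intended for spark.sql in a session configured with the Iceberg extensions.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, keys: Sequence[str], columns: Sequence[str]) -> str:
    """Build a MERGE INTO statement that upserts `columns`, matching rows on `keys`."""
    non_keys = [c for c in columns if c not in keys]
    if not keys or not non_keys:
        raise ValueError("need at least one key column and one non-key column")
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in non_keys)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


merge_sql = build_merge_sql(
    "glue_catalog.database_name.target_table",
    "delta_updates",  # hypothetical temp view holding the delta data
    keys=["id"],
    columns=["id", "name", "age"],
)
print(merge_sql)
```

Table and column names are interpolated directly, so they should come from trusted configuration rather than user input.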

Using Amazon Athena for UPSERTs

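1. Create the Iceberg Table (if not already created). Athena expects a LOCATION and the table_type property rather than a USING clause; the S3 path below is a placeholder:

CREATE TABLE glue_catalog.database_name.iceberg_table (
    id INT,
    name STRING,
    updated_at TIMESTAMP
)
LOCATION 's3://your-bucket/iceberg-table/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );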

2. Load the New Data into a Staging Table (Optional): If your new data arrives as Parquet files in Amazon S3, create an external table over it:

CREATE EXTERNAL TABLE glue_catalog.database_name.staging_table (
    id INT,
    name STRING,
    updated_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://your-bucket/new-data/';

3. Execute MERGE INTO to Perform UPSERT:

MERGE INTO glue_catalog.database_name.iceberg_table AS target
USING glue_catalog.database_name.staging_table AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET name = source.name,
               updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (id, name, updated_at) VALUES (source.id, source.name, source.updated_at);
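When the merge needs to run on a schedule rather than from the console, it can be submitted through the Athena API. The following is a minimal sketch built on the start_query_execution and get_query_execution calls of the boto3 Athena client. The function name run_athena_statement, the results location, and the FakeAthena stand-in are assumptions added so the example runs without AWS credentials; a real job would pass boto3.client("athena") instead.

```python
import time


def run_athena_statement(athena, sql, database, output_location, poll_seconds=2.0, max_polls=150):
    """Submit a statement to Athena and wait until it finishes."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "")
            raise RuntimeError(f"Athena query {query_id} {state}: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Athena query {query_id} did not finish in time")


class FakeAthena:
    """Stand-in client that replays a fixed sequence of query states."""

    def __init__(self):
        self.states = ["QUEUED", "RUNNING", "SUCCEEDED"]
        self.submitted = []

    def start_query_execution(self, **kwargs):
        self.submitted.append(kwargs)
        return {"QueryExecutionId": "example-query-id"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": self.states.pop(0)}}}


fake = FakeAthena()
finished_id = run_athena_statement(
    fake, "SELECT 1", "database_name", "s3://your-bucket/athena-results/", poll_seconds=0
)
print(finished_id)
```

Raising on FAILED or CANCELLED matters for upserts, because a scheduler that ignores the final state would treat a rejected MERGE as a successful load.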

Using Time Travel

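Every commit to an Iceberg table produces a snapshot, and time travel lets you query the table as it existed at an earlier snapshot or point in time. The examples in this section use the syntax accepted by Spark SQL with the Iceberg extensions; Athena engine version 3 documents FOR VERSION AS OF and FOR TIMESTAMP AS OF for the same purpose.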

a. Querying Historical Snapshots

1. Version-based queries

Find the snapshot ID using the Iceberg metadata table:

SELECT * FROM glue_catalog.database_name.target_table.snapshots;
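The snapshots metadata table returns one row per commit, including committed_at and snapshot_id columns. A timestamp-based query resolves to the latest snapshot committed at or before the requested time, and the sketch below reproduces that selection on rows that have already been fetched. The helper name snapshot_as_of and the snapshot IDs are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple


def snapshot_as_of(snapshots: Iterable[Tuple[datetime, int]], as_of: datetime) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before `as_of`."""
    eligible = [(committed_at, snapshot_id) for committed_at, snapshot_id in snapshots if committed_at <= as_of]
    if not eligible:
        return None  # the table had no snapshot yet at that time
    # Tuples sort by committed_at first, so max() picks the most recent one.
    return max(eligible)[1]


# (committed_at, snapshot_id) pairs; the IDs are invented sample values.
rows = [
    (datetime(2024, 11, 28, 8, 0, tzinfo=timezone.utc), 1001),
    (datetime(2024, 11, 30, 22, 15, tzinfo=timezone.utc), 1002),
    (datetime(2024, 12, 2, 6, 45, tzinfo=timezone.utc), 1003),
]

print(snapshot_as_of(rows, datetime(2024, 12, 1, tzinfo=timezone.utc)))
```

A result of None corresponds to asking for a time before the first commit, which a time travel query would reject.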

Query a snapshot using its ID, replacing snapshot-id with a snapshot_id value returned by the query above:

SELECT * FROM glue_catalog.database_name.target_table
FOR SYSTEM_VERSION AS OF 'snapshot-id';

2. Timestamp-based queries

SELECT * FROM glue_catalog.database_name.target_table
  FOR SYSTEM_TIME AS OF TIMESTAMP '2024-12-01 00:00:00';

b. Using AWS Glue ETL for Time Travel

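In AWS Glue ETL jobs, time travel queries run through Spark SQL with the same clauses, provided the Spark session is configured with the Iceberg catalog as in the earlier script.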

Example PySpark script:

# Query a historical snapshot
spark.sql("""
SELECT * FROM glue_catalog.database_name.target_table 
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-12-01 00:00:00'
""").show()
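Time travel is also available through the DataFrame reader, where Iceberg accepts an as-of-timestamp read option expressed in epoch milliseconds (a snapshot-id option works the same way for a specific snapshot). The sketch below assumes the timestamp string is in UTC; the helper names to_epoch_millis and read_table_as_of are hypothetical, and read_table_as_of expects a Spark session configured like the one in the earlier script.

```python
from datetime import datetime, timezone


def to_epoch_millis(timestamp: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' (interpreted as UTC) to epoch milliseconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def read_table_as_of(spark, table: str, timestamp: str):
    """Load an Iceberg table as it existed at `timestamp` using read options."""
    return (
        spark.read
        .option("as-of-timestamp", str(to_epoch_millis(timestamp)))
        .format("iceberg")
        .load(table)
    )


print(to_epoch_millis("2024-12-01 00:00:00"))
```

Inside a Glue job, the call would look like read_table_as_of(spark, "glue_catalog.database_name.target_table", "2024-12-01 00:00:00").show(). A SQL timestamp literal is interpreted in the session time zone, so the two approaches agree only when that zone is UTC.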

Conclusion

By leveraging Apache Iceberg with AWS Glue and Amazon Athena, you can efficiently manage your data lake, enabling advanced capabilities like partition evolution, atomic upserts, and time travel. AWS Glue’s ETL jobs provide seamless integration for processing and managing Iceberg tables, while Amazon Athena’s SQL interface simplifies querying.

Iceberg’s rich feature set and AWS’s powerful ecosystem empower modern data workflows to achieve scalability, consistency, and query performance.

Drop a query if you have any questions regarding Apache Iceberg, AWS Glue or Amazon Athena and we will get back to you quickly.

FAQs

1. How do I register an Iceberg table in the AWS Glue Catalog?

ANS: – Use Amazon Athena’s CREATE TABLE command to register the table, then verify in the AWS Glue console.

2. Can I perform UPSERT operations on Iceberg tables?

ANS: – Yes. The MERGE INTO command performs UPSERTs both in Spark (AWS Glue ETL jobs) and in Amazon Athena, optionally with a staging table that holds the new data.

WRITTEN BY Rishi Raj Saikia

Rishi works as an Associate Architect. He is a dynamic professional with a strong background in data and IoT solutions, helping businesses transform raw information into meaningful insights. He has experience in designing smart systems that seamlessly connect devices and streamline data flow. Skilled in addressing real-world challenges by combining technology with practical thinking, Rishi is passionate about creating efficient, impactful solutions that drive measurable results.