AWS Kinesis Firehose – How to Stream Data Pipeline to DynamoDB

1. Introduction

Online streaming has become part and parcel of how information is consumed today. However, building live, real-time systems remains a niche skill in a world of cross-platform integration, subscriptions, and instant notifications. The core of a real-time system is the continuous streaming of data from one application to another. Various tools provide this ability, including RabbitMQ, Apache Kafka, and Amazon Kinesis, and each has its own advantages and disadvantages. Today we are going to focus on Amazon Kinesis.

2. Amazon Kinesis Data Streams

Amazon Kinesis Data Streams is a service for collecting and processing streaming data in real time. In this guide it is used to capture item-level modifications of a DynamoDB table, so that our applications can read the stream and see changes in near real time. A data stream can continuously capture and store terabytes of data per hour, and its configurable retention period keeps change records available for longer, which helps with auditing and security transparency. Kinesis Data Streams can also be combined with Kinesis Data Firehose, a delivery stream service, and with Amazon QuickSight, where we can build real-time dashboards, generate alerts, and so on.
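To give a rough idea of what a consumer sees on the stream, the sketch below summarises a single change record. The field names (`eventName`, `tableName`, `dynamodb.Keys`, `dynamodb.NewImage`) are an assumption about the JSON shape DynamoDB writes to the stream, and the sample payload and the helper name `summarise_change` are purely illustrative, so check them against your own records.

```python
import json


def summarise_change(raw: bytes) -> dict:
    """Return the event type, table name and key of one change record."""
    record = json.loads(raw)
    change = record.get("dynamodb", {})
    return {
        "event": record.get("eventName"),      # e.g. INSERT, MODIFY, REMOVE
        "table": record.get("tableName"),
        "keys": change.get("Keys", {}),
        "has_new_image": "NewImage" in change,  # absent for deletions
    }


# Hypothetical payload, shaped like the records assumed above.
sample = json.dumps({
    "eventName": "INSERT",
    "tableName": "Orders",
    "dynamodb": {
        "Keys": {"order_id": {"S": "42"}},
        "NewImage": {"order_id": {"S": "42"}, "status": {"S": "NEW"}},
    },
}).encode("utf-8")

summary = summarise_change(sample)
print(summary)
```

A real consumer would receive `raw` from the Kinesis API or from a Lambda trigger; only the parsing step is shown here.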

3. Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is a fully managed ETL service for reliably loading streaming data into data stores, data lakes, and analytics services. It can capture, transform, and deliver streaming data to S3 and to other destinations such as Redshift, OpenSearch, and Datadog. Firehose scales automatically to match the throughput of the data, and it can batch, compress, transform, and encrypt the data before delivery, which reduces the storage used and improves security.
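The "transform" part is usually done by a Lambda function that Firehose invokes with a batch of base64-encoded records. As a minimal sketch, assuming the incoming records are JSON, the handler below re-serialises each record compactly and appends a newline so that objects written to S3 contain one JSON document per line. The newline convention is our own choice for illustration, not something the service requires.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: emit one compact JSON document per line."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            line = json.dumps(payload, separators=(",", ":")) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("ascii"),
            })
        except ValueError:
            # Not valid base64 or JSON: hand the record back unchanged and
            # mark it as failed so Firehose treats it as a processing error.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Each returned entry must carry the same `recordId` it arrived with, together with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`.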

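4. High-Level Architecture Diagram

[Diagram: DynamoDB table → Kinesis Data Stream → Kinesis Data Firehose delivery stream → S3 bucket (data lake)]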

5. Step-by-Step Data Lake implementation guide for DynamoDB tables using Kinesis Streams

I will use Amazon Kinesis Data Streams together with Kinesis Data Firehose to store DynamoDB table data in S3, which serves as the data lake.

Step-1:

Create a Kinesis data stream and choose its capacity mode: either on-demand or provisioned. In provisioned mode, you also specify the number of shards.
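For readers who prefer scripting over the console, Step-1 could look roughly like the following with boto3. This is a sketch under assumptions: the stream name is hypothetical, `stream_request` and `create_stream` are helper names introduced here, and AWS credentials and a default region are expected to be configured already.

```python
def stream_request(name, shard_count=None):
    """Build create_stream arguments: on-demand unless shard_count is given."""
    if shard_count is None:
        return {"StreamName": name,
                "StreamModeDetails": {"StreamMode": "ON_DEMAND"}}
    return {"StreamName": name,
            "ShardCount": shard_count,
            "StreamModeDetails": {"StreamMode": "PROVISIONED"}}


def create_stream(name, shard_count=None):
    """Create the stream, wait until it is active, and return its ARN."""
    import boto3  # imported lazily so the builder above works without AWS access

    kinesis = boto3.client("kinesis")
    kinesis.create_stream(**stream_request(name, shard_count))
    kinesis.get_waiter("stream_exists").wait(StreamName=name)
    summary = kinesis.describe_stream_summary(StreamName=name)
    return summary["StreamDescriptionSummary"]["StreamARN"]


# Hypothetical stream name, used only for illustration.
print(stream_request("orders-changes"))
print(stream_request("orders-changes", shard_count=2))
```

Calling `create_stream("orders-changes")` would create an on-demand stream; passing `shard_count` switches to provisioned mode.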

Step-2:

Create a delivery stream, which sends the streamed data to the S3 bucket.
Choose Kinesis Data Streams as the source and an S3 bucket as the destination.
Under Source settings, select the data stream created in Step-1.
Optionally transform the records before delivery. There are two options: a Lambda function can transform the source records (for example, to convert non-JSON data to JSON), and record format conversion can convert JSON records to Apache Parquet or Apache ORC using a table schema defined in AWS Glue, which makes querying more efficient. Alternatively, we can send the raw data directly to S3.
Under Destination settings, select the S3 bucket where the streamed data is to be stored. Optionally set a custom S3 bucket prefix for the data and an error output prefix under which records that fail processing are written.
Dynamic partitioning is a feature in the Destination settings that partitions the streaming data into multiple S3 folders according to our requirements. It can be enabled only when creating a delivery stream and cannot be enabled on an existing one.
We can set the S3 buffer limits with a buffer size and a buffer interval. Compression and encryption (for data records and server-side encryption) can also be enabled to reduce storage size and provide additional security.
After selecting all the required settings, create the delivery stream. Its status becomes Active once creation completes.
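The same delivery stream settings can be expressed as a boto3 request. The sketch below only builds the arguments for `create_delivery_stream`; the ARNs, prefixes, and the helper name `delivery_stream_request` are placeholders and assumptions, a single IAM role is reused for reading the stream and writing to S3 purely for brevity, and dynamic partitioning and format conversion are left out to keep it short.

```python
def delivery_stream_request(name, stream_arn, bucket_arn, role_arn,
                            prefix="dynamodb/", error_prefix="errors/",
                            buffer_mb=5, buffer_seconds=300):
    """Build create_delivery_stream arguments for a Kinesis-to-S3 stream."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "Prefix": prefix,                   # where delivered data lands
            "ErrorOutputPrefix": error_prefix,  # where failed records land
            "BufferingHints": {
                "SizeInMBs": buffer_mb,
                "IntervalInSeconds": buffer_seconds,
            },
            "CompressionFormat": "GZIP",
        },
    }


# Placeholder ARNs; substitute the values from your own account.
request = delivery_stream_request(
    name="orders-to-s3",
    stream_arn="arn:aws:kinesis:us-east-1:111122223333:stream/orders-changes",
    bucket_arn="arn:aws:s3:::example-data-lake-bucket",
    role_arn="arn:aws:iam::111122223333:role/example-firehose-role",
)
print(request["ExtendedS3DestinationConfiguration"]["BufferingHints"])
```

With credentials in place, the request could then be sent with `boto3.client("firehose").create_delivery_stream(**request)`.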

Step-3:

Now go to the DynamoDB console and enable Kinesis Data Streams for the required tables, selecting the data stream created in Step-1.
Any item modifications made to the DynamoDB table from now on are captured and stored in S3.
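Step-3 can also be done programmatically. The following is a minimal sketch using the DynamoDB client's `enable_kinesis_streaming_destination` call; the table name and stream ARN are placeholders, and the optional `client` parameter of the hypothetical helper `enable_streaming` exists only so the function can be tried out without AWS access.

```python
def enable_streaming(table_name, stream_arn, client=None):
    """Enable Kinesis streaming on a table and return the destination statuses."""
    if client is None:
        import boto3  # imported lazily; needs configured AWS credentials
        client = boto3.client("dynamodb")

    client.enable_kinesis_streaming_destination(
        TableName=table_name,
        StreamArn=stream_arn,
    )
    response = client.describe_kinesis_streaming_destination(TableName=table_name)
    return [d["DestinationStatus"]
            for d in response["KinesisDataStreamDestinations"]]
```

For example, `enable_streaming("Orders", "arn:aws:kinesis:us-east-1:111122223333:stream/orders-changes")` would return the list of destination statuses for the table once the call has been accepted.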

6. Conclusion

AWS Kinesis Data Streams and Data Firehose together offer an efficient way to build a centralized data lake for advanced analytics, or to send the data to Redshift for optimized querying. In addition, the data in S3 can be queried with Athena and visualized in QuickSight dashboards.

Because Kinesis is a managed service, AWS handles most of the administration, so developers can focus on their code instead of managing the underlying system. I hope this step-by-step guide has been useful to you.
