Streamlining Data Ingestion for Amazon EMR Using Amazon Kinesis

Overview

In today’s data-driven landscape, organizations often face the challenge of efficiently processing massive amounts of real-time data. This blog delves into integrating Amazon Kinesis and Amazon EMR to streamline real-time data ingestion and processing. It highlights the benefits of leveraging Kinesis for scalable, durable, and near real-time data streaming while utilizing Amazon EMR for powerful big data analytics. With a focus on use cases such as real-time analytics, data transformation, and log processing, the blog provides a detailed step-by-step guide to set up and optimize this robust data pipeline.

Why Use Amazon Kinesis with Amazon EMR?

Amazon Kinesis is a scalable and durable real-time data streaming service. Integrating it with Amazon EMR allows organizations to process streaming data in near real-time, making it ideal for use cases like:

Real-time analytics: Monitoring website activity, processing IoT sensor data, or financial transactions.
Data transformation: Converting raw data into structured formats for downstream analysis.
Log processing: Streaming logs from various applications and processing them on the fly.

By leveraging Amazon Kinesis, you can decouple data producers from consumers, so Amazon EMR processes data on its own schedule rather than at the rate producers send it.

Components of the Pipeline

Amazon Kinesis Data Streams: This is the primary channel for ingesting real-time data from producers.
Amazon Kinesis Data Firehose: Delivers data from streams to destinations like Amazon S3, enabling Amazon EMR to process data in batches.
Amazon EMR: Processes the ingested data using frameworks like Apache Spark, Hive, or Presto.

Below is a detailed step-by-step guide to setting up the pipeline.

Step-by-Step Guide

Step 1: Create an Amazon Kinesis Data Stream

Log in to the AWS Management Console and navigate to Amazon Kinesis Data Streams.
Create a new stream:
1. Name your stream (e.g., real-time-stream).
2. Define the number of shards based on your expected data throughput. Each shard supports writes of up to 1 MB/sec (or 1,000 records/sec) and reads of up to 2 MB/sec.
Click Create Stream.
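
The same console steps can also be scripted. The following is a minimal sketch, not a definitive implementation: the helper name create_stream_and_wait is hypothetical, and the client argument is assumed to behave like boto3.client("kinesis"), whose create_stream, get_waiter, and describe_stream_summary calls are used here.

```python
def create_stream_and_wait(client, stream_name, shard_count):
    """Create a provisioned stream, wait until it exists, and return its status.

    `client` is assumed to behave like boto3.client("kinesis").
    """
    client.create_stream(StreamName=stream_name, ShardCount=shard_count)
    # The "stream_exists" waiter polls until the stream reports ACTIVE.
    client.get_waiter("stream_exists").wait(StreamName=stream_name)
    summary = client.describe_stream_summary(StreamName=stream_name)
    return summary["StreamDescriptionSummary"]["StreamStatus"]


# Typical call (requires boto3 and AWS credentials):
#   import boto3
#   create_stream_and_wait(boto3.client("kinesis"), "real-time-stream", 2)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.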

Configuring Data Producers

You can use an AWS SDK, such as boto3 for Python, to send data to the stream. For example:

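import json

import boto3


def send_to_kinesis(stream_name, data, partition_key):
    """Send one JSON-encoded record to a Kinesis data stream."""
    kinesis_client = boto3.client('kinesis')
    response = kinesis_client.put_record(
        StreamName=stream_name,
        # Newline-terminated so records stay separable after Firehose
        # concatenates them into a single S3 object.
        Data=(json.dumps(data) + "\n").encode("utf-8"),
        # A varying key (here, the user ID) spreads records across shards;
        # a constant key would route every record to the same shard.
        PartitionKey=partition_key
    )
    return response


# Example usage
event = {"event": "page_view", "user": "1234"}
send_to_kinesis("real-time-stream", event, partition_key=event["user"])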
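
For higher volumes, producers usually send records in batches rather than one at a time. The sketch below only prepares the batches and makes no AWS calls: the helper name build_put_records_batches and the choice of the "user" field as partition key are assumptions for illustration, and the 500-record cap reflects the PutRecords per-request limit.

```python
import json

MAX_RECORDS_PER_REQUEST = 500  # PutRecords accepts at most 500 records per call


def build_put_records_batches(events, key_field="user"):
    """Turn event dicts into lists of PutRecords entries, at most 500 per list."""
    entries = [
        {
            # Newline-terminated JSON keeps records separable once delivered to S3.
            "Data": (json.dumps(event) + "\n").encode("utf-8"),
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]
    return [
        entries[i:i + MAX_RECORDS_PER_REQUEST]
        for i in range(0, len(entries), MAX_RECORDS_PER_REQUEST)
    ]


events = [{"event": "page_view", "user": str(n)} for n in range(1200)]
batches = build_put_records_batches(events)
print([len(batch) for batch in batches])  # [500, 500, 200]
```

Each batch can then be passed to the boto3 Kinesis client as put_records(StreamName="real-time-stream", Records=batch). The response includes a FailedRecordCount field, so a real producer would check it and retry any failed entries.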

Step 2: Configure Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose acts as a bridge between Kinesis Data Streams and Amazon S3.

Navigate to Amazon Kinesis Data Firehose in the AWS Management Console.
Create a Delivery Stream:
1. Choose Source as Amazon Kinesis Data Stream and select your stream (real-time-stream).
2. Choose Destination as Amazon S3.
Configure the destination:
1. Create or specify an Amazon S3 bucket (e.g., emr-data-bucket).
2. Optionally enable data transformation using an AWS Lambda function.
Define the buffer size and interval:
1. For example, 5 MB or 60 seconds (whichever is reached first).
Click Create Delivery Stream.
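
The delivery stream settings above map onto a single API request. As a rough sketch, the helper below (build_delivery_stream_config, a hypothetical name) assembles the keyword arguments for the boto3 Firehose client's create_delivery_stream call. The ARNs, account ID, and role name in the example are placeholders, not values from this guide.

```python
def build_delivery_stream_config(name, stream_arn, bucket_arn, role_arn,
                                 size_mb=5, interval_seconds=60):
    """Return keyword arguments for firehose.create_delivery_stream."""
    return {
        "DeliveryStreamName": name,
        # Read from an existing Kinesis data stream instead of direct PUTs.
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            # Firehose flushes to S3 when either threshold is reached first.
            "BufferingHints": {
                "SizeInMBs": size_mb,
                "IntervalInSeconds": interval_seconds,
            },
        },
    }


# Placeholder ARNs for illustration only.
config = build_delivery_stream_config(
    name="real-time-delivery",
    stream_arn="arn:aws:kinesis:us-east-1:123456789012:stream/real-time-stream",
    bucket_arn="arn:aws:s3:::emr-data-bucket",
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
)
```

With credentials in place, the request would be sent as boto3.client("firehose").create_delivery_stream(**config).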

Step 3: Set Up an Amazon EMR Cluster

Navigate to the Amazon EMR Console and create a new cluster:
1. Choose the desired Amazon EMR version (e.g., emr-6.x) with Apache Spark.
2. Select an instance type like m5.xlarge based on your workload.
Configure Amazon S3 Input and Output Paths:
1. Input Path: s3://emr-data-bucket/
2. Output Path: s3://emr-output-bucket/
Enable permissions:
1. Attach AWS IAM roles allowing Amazon EMR to access Amazon S3 and Amazon Kinesis.
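
For repeatable setups, the cluster can be described in code instead of the console. This is a sketch under stated assumptions: build_cluster_request is a hypothetical helper, the release label "emr-6.15.0" is only an example of an emr-6.x version, and the role names are the AWS default EMR roles, which must already exist in the account and carry the Amazon S3 and Amazon Kinesis permissions mentioned above.

```python
def build_cluster_request(name, release_label, instance_type="m5.xlarge",
                          core_count=2, log_uri=None):
    """Return keyword arguments for emr.run_job_flow with Spark installed."""
    request = {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Keep the cluster running so steps can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default roles; replace with your own if they differ.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
    if log_uri:
        request["LogUri"] = log_uri
    return request


request = build_cluster_request("kinesis-emr-cluster", "emr-6.15.0")
```

The resulting dictionary would be passed as boto3.client("emr").run_job_flow(**request), which returns the new cluster's JobFlowId.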

Step 4: Process Data on Amazon EMR Using Apache Spark

Apache Spark on Amazon EMR can process the data that Amazon Kinesis Data Firehose delivers to the Amazon S3 bucket.

Example Spark Job

Save the following Spark job as process_stream.py:

from pyspark.sql import SparkSession
 
# Initialize Spark Session
spark = SparkSession.builder.appName("KinesisEMRProcessing").getOrCreate()
 
# Read data from S3
input_path = "s3://emr-data-bucket/"
data = spark.read.json(input_path)
 
# Perform transformations
data_transformed = data.withColumnRenamed("event", "event_type")
 
# Write output to S3
output_path = "s3://emr-output-bucket/processed/"
data_transformed.write.mode("overwrite").json(output_path)
 
print("Data processing complete!")
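
Before running the job on a cluster, it can help to check the expected output shape locally. The following plain-Python sketch mirrors the rename step on newline-delimited JSON. It is an illustration only and does not use Spark; the function name rename_event_field and the sample records are assumptions.

```python
import json


def rename_event_field(json_lines):
    """Rename "event" to "event_type" in newline-delimited JSON text."""
    output = []
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if "event" in record:
            record["event_type"] = record.pop("event")
        output.append(json.dumps(record, sort_keys=True))
    return "\n".join(output)


sample = (
    '{"event": "page_view", "user": "1234"}\n'
    '{"event": "click", "user": "5678"}\n'
)
result = rename_event_field(sample)
print(result)
```

Each output line carries event_type in place of event, which is the shape the Spark job is expected to write for the same input.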

Submit the Job

Use the following command to submit the Spark job to your Amazon EMR cluster:

spark-submit --deploy-mode cluster --master yarn process_stream.py
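
As an alternative to running the command on the primary node, the job can be submitted as an Amazon EMR step. The sketch below builds the step definition only. The helper name build_spark_step and the script location s3://emr-scripts-bucket/process_stream.py are hypothetical, and the script is assumed to have been uploaded to Amazon S3 first.

```python
def build_spark_step(script_s3_uri, name="process-stream"):
    """Return an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        # Leave the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar executes the given command on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }


# Hypothetical script location for illustration.
step = build_spark_step("s3://emr-scripts-bucket/process_stream.py")
```

The step would then be attached with boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step]), where cluster_id is the ID of the running cluster.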

Step 5: Monitor the Pipeline

Monitor Amazon Kinesis Data Streams:
1. Use the Amazon Kinesis monitoring dashboard to track incoming data rate and shard utilization.
Monitor Firehose:
1. Check the delivery stream metrics for data delivery success/failure rates.
Monitor Amazon EMR:
1. Use Amazon CloudWatch to monitor Amazon EMR cluster health and Spark job performance.
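
The incoming data rate can also be read programmatically from Amazon CloudWatch. As a sketch, the hypothetical helper build_incoming_bytes_query prepares the arguments for the CloudWatch get_metric_statistics call against the IncomingBytes metric in the AWS/Kinesis namespace. The one-hour window and five-minute period are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_incoming_bytes_query(stream_name, now=None, window_minutes=60):
    """Return keyword arguments for cloudwatch.get_metric_statistics."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "IncomingBytes",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": now - timedelta(minutes=window_minutes),
        "EndTime": now,
        "Period": 300,  # seconds per datapoint
        "Statistics": ["Sum"],
    }


query = build_incoming_bytes_query("real-time-stream")
```

Calling boto3.client("cloudwatch").get_metric_statistics(**query) returns a Datapoints list. Dividing each Sum by the 300-second period gives an average bytes-per-second figure to compare against the per-shard write limit.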

Best Practices

Optimize Shard Count: Adjust the shard count dynamically using the Amazon Kinesis Scaling Utility or AWS Application Auto Scaling.
Enable Compression: Configure Firehose to compress data (e.g., using GZIP or Snappy) to reduce storage costs and speed up processing.
Use Partitioning: Partition data in Amazon S3 by timestamp or other meaningful keys to improve downstream query performance.
Secure Your Pipeline:
- Encrypt data at rest using Amazon S3 server-side encryption (SSE-S3 or SSE-KMS).
- Encrypt data in transit using SSL/TLS.
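
To illustrate timestamp partitioning: when no custom prefix is configured, Firehose writes objects under a UTC YYYY/MM/dd/HH prefix, so a job can read a single hour instead of the whole bucket. The helper below (hourly_input_path, a hypothetical name) is a minimal sketch that assumes this default layout.

```python
from datetime import datetime, timezone


def hourly_input_path(bucket, when):
    """Build the S3 path for one hour of Firehose output (default UTC layout)."""
    when_utc = when.astimezone(timezone.utc)
    return f"s3://{bucket}/{when_utc:%Y/%m/%d/%H}/"


path = hourly_input_path(
    "emr-data-bucket",
    datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(path)  # s3://emr-data-bucket/2024/01/15/09/
```

Passing such a path to spark.read.json in place of the bucket root limits the read to that hour's objects.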

Conclusion

Integrating Amazon Kinesis with Amazon EMR offers a scalable, efficient, and reliable way to process real-time data streams. By following this guide, you can build a robust pipeline for diverse real-time analytics and big data processing use cases.

With proper monitoring and optimizations, this architecture can deliver high performance while remaining cost-effective.

Drop a query if you have any questions regarding Amazon Kinesis or Amazon EMR and we will get back to you quickly.

FAQs

1. Can I use Amazon Kinesis Data Analytics instead of Amazon Kinesis Data Firehose in this setup?

ANS: – Yes, you can use Amazon Kinesis Data Analytics to process streaming data in real-time before sending it to an Amazon S3 bucket or other destinations. Amazon Kinesis Data Analytics allows for SQL-based transformations on streaming data, which can be beneficial if you need pre-processing before ingesting data into Amazon EMR.

2. How do I determine the optimal number of shards for my Kinesis Data Stream?

ANS: – The number of shards depends on your data’s throughput requirements:

Each shard supports up to 1 MB/second write and 2 MB/second read.
Calculate your expected data ingestion rate and divide it by these limits to determine the required number of shards. Using the Kinesis Scaling Utility or AWS Application Auto Scaling, you can also scale dynamically.
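
The calculation can be sketched as follows, using the per-shard limits quoted in this post. The function name estimate_shards and the example rates are assumptions. Per-record limits and headroom for traffic spikes are not modeled here.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # per-shard write limit quoted in this post
READ_MB_PER_SHARD = 2.0   # per-shard read limit quoted in this post


def estimate_shards(write_mb_per_sec, read_mb_per_sec):
    """Return the smallest shard count that covers both write and read rates."""
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(1, by_write, by_read)


print(estimate_shards(4.5, 9.0))  # 5
```

Whichever of the two limits demands more shards determines the result, so a read-heavy workload can need more shards than its write rate alone suggests.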

WRITTEN BY Deepak Kumar Manjhi

Deepak Kumar Manjhi works as a Research Associate (Data & AIoT) at CloudThat, specializing in AWS Data Engineering. With a strong focus on cloud-based data solutions, Deepak is building hands-on expertise in designing and implementing scalable data pipelines and analytics workflows on AWS. He is committed to continuously enhancing his knowledge of cloud computing and data engineering and is passionate about exploring emerging technologies to broaden his skill set.