Overview
Real-time data streaming is crucial for building scalable, responsive, and intelligent applications in today’s data-driven world. Apache Kafka, a powerful distributed streaming platform, enables organizations to process massive volumes of data in real time. However, integrating Kafka with systems such as Amazon S3 often requires considerable setup, custom coding, and infrastructure management.
Introduction
This is where Amazon MSK Connect steps in: a fully managed feature of Amazon MSK that simplifies running Kafka Connect. Its Amazon S3 Sink Connector enables seamless data streaming from Kafka topics to Amazon S3. This blog discusses the connector's benefits and walks through a practical, step-by-step implementation using the AWS Management Console.
Amazon MSK Connect
Amazon MSK Connect is a managed service that allows developers to run Kafka Connect clusters directly in AWS with minimal operational overhead. Kafka Connect is an open-source framework that simplifies moving large-scale data between Apache Kafka and other systems.
With Amazon MSK Connect, you can easily deploy source connectors to bring data into Kafka topics or sink connectors to push data from Kafka topics to external destinations like Amazon S3, Amazon OpenSearch Service, and databases.
Key Features of Amazon MSK Connect:
- Managed Infrastructure: Automatically handles provisioning, patching, and scaling of Connect clusters.
- Auto-Scaling: Supports horizontal and vertical scaling of connector tasks.
- Resiliency: Automatically restarts failed tasks to maintain stream continuity.
- VPC Connectivity: Integrates with AWS PrivateLink for secure, private data transfer.
- Supports Open Source Plugins: You can use existing Kafka Connect-compatible plugins or build custom ones.
Why Use the Amazon S3 Sink Connector?
The Amazon S3 Sink Connector is a plugin that exports Kafka topic data to Amazon S3. This is especially valuable for use cases such as:
- Long-term archival and backup
- Offline analytics and reporting
- Feeding data lakes or machine learning pipelines
- Storing raw logs or structured event data
Amazon S3 is an ideal destination because of its scalability, cost-efficiency, and durability. With the Amazon S3 Sink Connector, you can write data from Kafka to Amazon S3 in JSON, Avro, or Parquet formats, organized using various partitioning strategies.
Step-by-Step: Sending Kafka Data to Amazon S3
Below is a practical guide to configuring MSK Connect with an Amazon S3 Sink Connector using the AWS Console.
Step 1: Set Up MSK and Required Resources
- Create an Amazon MSK cluster if you haven't already.
- Set up a client Amazon EC2 instance to send messages to a Kafka topic.
- Prepare an Amazon S3 bucket to receive data exported by the connector.
- Create an AWS IAM role (e.g., mkc-tutorial-role) with Amazon MSK Connect and S3 permissions, as sketched below.
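As a rough sketch of what this role could look like from the AWS CLI, assuming a hypothetical bucket named mkc-tutorial-bucket, the commands below create the role with a trust policy for the MSK Connect service principal and attach the S3 permissions the Confluent S3 Sink Connector documentation calls for:

```bash
# Trust policy letting MSK Connect assume the role.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "kafkaconnect.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name mkc-tutorial-role \
  --assume-role-policy-document file://trust-policy.json

# Inline policy granting the S3 permissions the sink connector needs
# (bucket name is illustrative; scope it to your own bucket).
cat > s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:PutObject",
      "s3:GetObject",
      "s3:AbortMultipartUpload",
      "s3:ListBucket",
      "s3:GetBucketLocation",
      "s3:ListBucketMultipartUploads",
      "s3:ListMultipartUploadParts"
    ],
    "Resource": [
      "arn:aws:s3:::mkc-tutorial-bucket",
      "arn:aws:s3:::mkc-tutorial-bucket/*"
    ]
  }]
}
EOF
aws iam put-role-policy --role-name mkc-tutorial-role \
  --policy-name s3-sink-access --policy-document file://s3-policy.json
```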
Step 2: Upload the Amazon S3 Sink Connector as a Plugin
- Download the Confluent S3 Sink Connector (usually a .zip file).
- Upload the .zip to an accessible Amazon S3 bucket.
- In the AWS Console, navigate to Amazon MSK Connect > Custom plugins.
- Choose Create custom plugin.
- Browse to your uploaded ZIP file in Amazon S3 and select it.
- Name the plugin (e.g., mkc-tutorial-plugin) and complete creation.
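If you prefer the CLI for the upload step, a minimal sketch follows; the connector version and bucket name are illustrative, so substitute whatever version you downloaded from Confluent Hub:

```bash
# Copy the downloaded connector ZIP to a bucket MSK Connect can read from.
aws s3 cp confluentinc-kafka-connect-s3-10.5.0.zip \
  s3://mkc-tutorial-bucket/plugins/confluentinc-kafka-connect-s3-10.5.0.zip
```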
Step 3: Create the Kafka Topic
Use your Amazon EC2 client machine to create a Kafka topic for the connector to consume:
```bash
kafka-topics.sh --create --topic mkc-tutorial-topic \
  --bootstrap-server <broker-endpoint> \
  --partitions 1 --replication-factor 1
```
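With the topic in place, you can publish a few test records using the console producer that ships with Apache Kafka; each line you type becomes one Kafka message for the connector to pick up:

```bash
# Start an interactive producer session against the same broker endpoint.
kafka-console-producer.sh --topic mkc-tutorial-topic \
  --bootstrap-server <broker-endpoint>
```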
Step 4: Create the Connector
- Go to Amazon MSK Connect > Connectors and click Create connector.
- Choose the plugin you just created.
- Name the connector (e.g., mkc-tutorial-connector).
- Select your Amazon MSK cluster.
- Paste the following example configuration (update region and bucket name):
```properties
connector.class=io.confluent.connect.s3.S3SinkConnector
s3.region=us-east-1
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1
schema.compatibility=NONE
tasks.max=2
topics=mkc-tutorial-topic
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
storage.class=io.confluent.connect.s3.storage.S3Storage
s3.bucket.name=<your-s3-bucket-name>
topics.dir=tutorial
```
- Assign the AWS IAM role mkc-tutorial-role.
- Review and create the connector.
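Once the connector reaches the Running state and messages are flowing, you can verify the export from the CLI. Because flush.size=1 in the configuration above, each Kafka record should land as its own object under the topics.dir prefix:

```bash
# List exported objects; DefaultPartitioner writes under <topics.dir>/<topic>/partition=<n>/.
aws s3 ls s3://<your-s3-bucket-name>/tutorial/mkc-tutorial-topic/partition=0/ --recursive
```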
Partitioning Strategies for Amazon S3 Sink Connector
The way your data is organized in Amazon S3 can be customized using different partitioners:
- DefaultPartitioner
Stores each topic and its partitions under a simple directory structure:

```properties
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
topics.dir=events-environment/development/frontend
```

Example Amazon S3 path (for a topic named events):
s3://bucketname/events-environment/development/frontend/events/partition=0/
- TimeBasedPartitioner
Organizes data by record timestamp, which is helpful for time-series data. Note that this partitioner also requires locale and timezone settings:

```properties
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
path.format=YYYY/MM/dd/HH
timestamp.extractor=Record
partition.duration.ms=3600000
locale=en-US
timezone=UTC
```

Example Amazon S3 path (again for a topic named events):
s3://bucketname/events-environment/development/frontend/events/2025/05/14/13/
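The same plugin also ships a FieldPartitioner, worth a mention when objects should be grouped by a value inside each record rather than by time; the field name below is hypothetical and must exist in your record schema:

```properties
# Partition S3 objects by a field present in each record (requires structured data).
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=country
```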
Benefits of Using MSK Connect with Amazon S3 Sink
- No Code Changes: Deploy existing Kafka Connect plugins as-is.
- Managed Experience: Offloads operational complexity.
- Elastic Scaling: Scales automatically based on throughput.
- High Availability: Automatically recovers from task failures.
- Security & Privacy: Uses private networking via AWS PrivateLink.
Conclusion
Whether building a real-time data lake, enabling historical analytics, or simply archiving event logs, the S3 Sink Connector offers a reliable and scalable path from streaming data to long-term storage.
By leveraging Amazon MSK Connect, you can focus more on your application’s core functionality and less on infrastructure management, a key win for modern data engineering teams.
Drop a query if you have any questions regarding Amazon MSK and we will get back to you quickly.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft's Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. Can I use custom connectors with Amazon MSK Connect?
ANS: – Yes, Amazon MSK Connect allows you to upload and use custom connector plugins. You can package your connector as a ZIP file, upload it to Amazon S3, and register it as a custom plugin in the Amazon MSK Connect console.
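For teams automating this, the same registration can be scripted; here is a sketch using the AWS CLI, with illustrative names and an assumed plugin ZIP already uploaded to Amazon S3:

```bash
# Register an uploaded ZIP as an MSK Connect custom plugin.
aws kafkaconnect create-custom-plugin \
  --name mkc-tutorial-plugin \
  --content-type ZIP \
  --location '{"s3Location":{"bucketArn":"arn:aws:s3:::mkc-tutorial-bucket","fileKey":"plugins/confluentinc-kafka-connect-s3-10.5.0.zip"}}'
```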
2. What data formats are supported by the Amazon S3 Sink Connector?
ANS: – The connector supports multiple formats, including JSON, Avro, and Parquet. You can specify the format using the format.class configuration property.
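For example, switching the connector from JSON to Parquet output should only require changing format.class (note that Parquet also expects schema-aware data, e.g. records produced with the Avro converter and a schema registry):

```properties
# Write Parquet files instead of JSON (class name from the Confluent S3 sink plugin).
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
```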

WRITTEN BY Suresh Kumar Reddy
Yerraballi Suresh Kumar Reddy works as a Research Associate - Data and AI/ML at CloudThat. He is a self-motivated, hard-working cloud data science aspirant, adept at using analytical tools to extract meaningful insights from data.