Databricks Data Lakehouse: Migrating Data from AWS Aurora to Delta Lake Part 1

Overview

Any BI solution needs its data to be clean, processed, and formatted based on the company’s business requirements. To transform raw data into valuable information, we need a platform that can store, transform, and update the data regularly and keep it ready for the visualization tool.

Databricks is a platform for all the Data Engineering and AI workloads where we can automate the ETL jobs and build efficient ML pipelines at an enterprise level. As a company’s data could be very rapidly updated from time to time, we need to set up a pipeline between the firm’s Database and Databricks to get the updated data and send it to the BI. In this blog, we will learn how to migrate Full Load and CDC data from an AWS Aurora DB cluster to Databricks Data Lakehouse Platform.

Freedom Month Sale — Upgrade Your Skills, Save Big!

Up to 80% OFF AWS Courses
Up to 30% OFF Microsoft Certs

Act Fast!

Data Migration Workflow

aurora1

The above diagram describes the flow of data during the migration process. We use AWS Database Migration Service to migrate data from a Database server to Databricks. The DMS pulls data from the DB server and stores it in the target S3 bucket. Using DMS, we can migrate the whole Database using a Full Load operation or replicate the changes happening in real-time to the Database using the Ongoing Replication operation.

To create a DMS task and replicate the data, firstly, we need a Database server configured with some data. For this blog, I have used an Aurora RDS with PostgreSQL 13 server to replicate the data to S3.

aurora2

Step-by-Step Guide for Data Migration

Prepare the Aurora RDS cluster for the migration process. Create a new Cluster Parameter Group with the following parameter values and attach it to the central cluster. Make sure you select the correct DB Family.

aurora3

For Aurora RDS with PostgreSQL: To replicate data from a PostgreSQL DB, we use the Logical Replication method, which provides fine-grained control over replicating and synchronizing parts of a database. To enable Logical Replication for a PostgreSQL DB, we need to make the following changes to the cluster parameter group of the Aurora cluster.

rds.logical_replication :1.
shared_preload_libraries : pglogical.
idle_in_transaction_session_timeout : 0.
wal_sender_timeout parameter : 0.

rds.logical_replication :1.

shared_preload_libraries : pglogical.

idle_in_transaction_session_timeout : 0.

wal_sender_timeout parameter : 0.

Ensure the value of the max_worker_processes parameter in your DB Cluster Parameter Group is equal to, or higher than the total combined values of

max_logical_replication_workers, autovacuum_max_workers, and max_parallel_workers

1	max_logical_replication_workers, autovacuum_max_workers, and max_parallel_workers

• track_commit_timestamp : 1.

1	• track_commit_timestamp : 1.

For Aurora RDS with MySQL: To migrate data from a MySQL /MariaDB server, the following cluster parameters should be modified to the mentioned values:

•	binlog_format : ROW
•	binlog_row_image: Full
•	binlog_checksum: NONE

• binlog_format : ROW

• binlog_row_image: Full

• binlog_checksum: NONE

Additionally, for RDS with MySQL, increase the Binary log retention time so that DMS could access it for CDC tasks using the command:

call mysql.rds_set_configuration('binlog retention hours', 24);

1	call mysql.rds_set_configuration('binlog retention hours', 24);

After creating the cluster parameter group, go to the dm instance and modify the cluster parameter group by assigning the new parameter group.

aurora4

A restart is required after attaching the new parameter group to the cluster.

2. Create a DMS replication instance that connects to the RDS and runs replication tasks. Ensure the replication instance is created in the same VPC as the RDS cluster.

aurora6

You can white-list the Public IP address of the Replication instance in the RDS security and RDS Public/Private IP in the DMS security group to allow communication between them.

3. Create an S3 bucket where the Migrated data will be stored with block all public access checked. Create an IAM role with S3 Full access for DMS so that it can access the S3 bucket. Copy the role ARN.

aurora7

4. In the Endpoints section of DMS, create a Source endpoint using Aurora RDS endpoint credentials and a Target endpoint using the IAM role created in the previous step and S3 bucket configurations. We can also use AWS Secrets Manager to manage the endpoint credentials.

aurora8

The following image shows us the Target Endpoint settings required by the DMS task to create a continuous replication task. For more source and target endpoints and their settings:
Refer to:
a. Sources for data migration
b. Targets for data migration

aurora9

5. After creating the endpoints, go to the Database Migration Tasks section of DMS and complete a replication task for our table. In the table mappings section, mention the schema and name of the tables we want to include/exclude for our migration task. We can also set custom filters to replicate specific columns from a table.
Using DMS, we can replicate data in 3 different ways,

Migrate existing data: This performs a one-time Full Load operation which migrates the entire data till that particular instant of time from Source to Target.
Migrate existing data and replicate ongoing changes: This performs a one-time Full Load Task for the first time and then continues replicating the data changes from Source to Target.
Replicate data changes only: This kind of task only replicates the continuous changes from Source to Target.
Here, I have created a task to migrate existing data and replicate ongoing changes

aurora10

aurora11

aurora12

6. After creating the Full Load with Ongoing Replication task, it will start saving the table data in the specified s3 bucket folder. The Full Load data is stored as a Load file, and the ongoing changes to the table will be saved as separate files for everyone minute or specified time interval.

aurora13

Final Thoughts

As we have set up the DMS tasks for Data Migration, we will get the continuous changes made to the Database as separate files and timestamps at the target s3 location. In the next blog, we will learn about importing the data to Databricks workspace and setting up the CDC Jobs on the changing data.

There are other options for Cloud Data Migration. Here is a blog to Simplify Cloud Data Migration with AWS DataSync – Amazon’s Online Data Transfer Service.

Freedom Month Sale — Discounts That Set You Free!

Up to 80% OFF AWS Courses
Up to 30% OFF Microsoft Certs

Act Fast!

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. Does AWS DMS support the migration of data from another type of Database?

ANS: – AWS DMS supports migration from a wide range of Databases. To learn more about migrating data from multiple source databases on Aurora RDS using DMS, refer to Database Migration Step-by-Step Walkthroughs.

2. What are file types supported by AWS DMS for target files in S3?

ANS: – AWS DMS currently supports “csv” or “parquet” target file types.