Any BI solution needs its data to be clean, processed, and formatted based on the company’s business requirements. To transform raw data into valuable information, we need a platform that can store, transform, and update the data regularly and keep it ready for the visualization tool.
Databricks is a platform for all the Data Engineering and AI workloads where we can automate the ETL jobs and build efficient ML pipelines at an enterprise level. As a company’s data could be very rapidly updated from time to time, we need to set up a pipeline between the firm’s Database and Databricks to get the updated data and send it to the BI. In this blog, we will learn how to migrate Full Load and CDC data from an AWS Aurora DB cluster to Databricks Data Lakehouse Platform.
Data Migration Workflow:
The above diagram describes the flow of data during the migration process. We use AWS Database Migration Service to migrate data from a Database server to Databricks. The DMS pulls data from the DB server and stores it in the target S3 bucket. Using DMS, we can migrate the whole Database using a Full Load operation or replicate the changes happening in real-time to the Database using the Ongoing Replication operation.
To create a DMS task and replicate the data, firstly, we need a Database server configured with some data. For this blog, I have used an Aurora RDS with PostgreSQL 13 server to replicate the data to S3.
Step-by-Step Guide for Data Migration
Prepare the Aurora RDS cluster for the migration process. Create a new Cluster Parameter Group with the following parameter values and attach it to the central cluster. Make sure you select the correct DB Family. For Aurora RDS with PostgreSQL: To replicate data from a PostgreSQL DB, we use the Logical Replication method, which provides fine-grained control over replicating and synchronizing parts of a database. To enable Logical Replication for a PostgreSQL DB, we need to make the following changes to the cluster parameter group of the Aurora cluster.
Ensure the value of the max_worker_processes parameter in your DB Cluster Parameter Group is equal to, or higher than the total combined values of
After creating the cluster parameter group, go to the dm instance and modify the cluster parameter group by assigning the new parameter group.
A restart is required after attaching the new parameter group to the cluster.
Create a DMS replication instance that connects to the RDS and runs replication tasks. Ensure the replication instance is created in the same VPC as the RDS cluster.
You can white-list the Public IP address of the Replication instance in the RDS security and RDS Public/Private IP in the DMS security group to allow communication between them.
Create an S3 bucket where the Migrated data will be stored with block all public access checked. Create an IAM role with S3 Full access for DMS so that it can access the S3 bucket. Copy the role ARN.
In the Endpoints section of DMS, create a Source endpoint using Aurora RDS endpoint credentials and a Target endpoint using the IAM role created in the previous step and S3 bucket configurations. We can also use AWS Secrets Manager to manage the endpoint credentials.
The following image shows us the Target Endpoint settings required by the DMS task to create a continuous replication task. For more source and target endpoints and their settings:
Refer to: a. Sources for data migration b. Targets for data migration
After creating the endpoints, go to the Database Migration Tasks section of DMS and complete a replication task for our table. In the table mappings section, mention the schema and name of the tables we want to include/exclude for our migration task. We can also set custom filters to replicate specific columns from a table.
Using DMS, we can replicate data in 3 different ways,
Migrate existing data: This performs a one-time Full Load operation which migrates the entire data till that particular instant of time from Source to Target.
Migrate existing data and replicate ongoing changes: This performs a one-time Full Load Task for the first time and then continues replicating the data changes from Source to Target.
Replicate data changes only: This kind of task only replicates the continuous changes from Source to Target.
Here, I have created a task to migrate existing data and replicate ongoing changes
After creating the Full Load with Ongoing Replication task, it will start saving the table data in the specified s3 bucket folder. The Full Load data is stored as a Load file, and the ongoing changes to the table will be saved as separate files for every one minute or specified time interval.
As we have set up the DMS tasks for Data Migration, we will get the continuous changes made to the Database as separate files and timestamps at the target s3 location. In the next blog, we will learn about importing the data to Databricks workspace and setting up the CDC Jobs on the changing data.
CloudThat is the Databricks Partner, AWS Advanced Consulting Partner, Microsoft Gold Partner, and Google Cloud Partner helping people develop knowledge on the cloud and help their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
If you have any queries about or anything related to Cloud data Migration, Databricks Data Lakehouse, or Amazon Aurora, feel free to drop in a comment. We will get back to you quickly. Visit our Consulting Page for more updates on our customer offerings, expertise, and cloud services.
1. Does AWS DMS support the migration of data from another type of Database?