Migrating Data from AWS Aurora to Delta Lake: Databricks Data Lakehouse - Part 2

Overview

In our previous blog, we discussed how to migrate data from a database table using Full Load and Continuous Replication operations on Amazon Database Migration Service from an Amazon Aurora RDS instance to an S3 Bucket. In this blog, we will see how we can replicate the continuous data coming from Aurora RDS as Delta Tables on the Databricks Lakehouse Platform to use it for BI and ML workloads.

DataLake1

Freedom Month Sale — Upgrade Your Skills, Save Big!

Up to 80% OFF AWS Courses
Up to 30% OFF Microsoft Certs

Act Fast!

Creating a Databricks Workspace

First, let us create a Databricks account on Databricks – Sign in and launch a workspace. Databricks provides a 14-day free trial for first-time customers to try out its features. Remember, to create a Databricks workspace we need an Account in either AWS or Azure, or GCP Cloud Platform to manage the underlying infrastructure. Databricks provides us with the software which enables us to develop efficient ETL, ML, and Data Pipelines with the added advantage of Spark to handle huge loads of data. To run the software, we need a cloud account that would deploy and manage the underlying hardware.

Note: Databricks charges only for the platform as per the DBU usage of your workspace clusters. The underlying infrastructure is billed separately by the Cloud provider based on the resources launched.

Databricks provides us with a QuickStart option to launch all the required resources on an AWS account using the CloudFormation template. This will configure the workspace for you without letting you worry much about the technicalities of the cloud infrastructure.

DataLake2

Once you fill all the required data in the CloudFormation template and start it, the following resources will be created on your AWS account:

A Cross-Account IAM role which will create and manage your AWS infra for your Workspace.
A VPC configured spanning 3 Availability Zones, with Public and Private Subnets, Internet Gateway, NAT Gateway, and VPC endpoint to access S3 buckets and Security Groups for your Inbound and Outbound traffic.
An S3 bucket to store your Workspace Notebooks, Delta Tables, Logs, etc., and an IAM role to access your S3 bucket from your Workspace.
CloudWatch to monitor your Cluster EC2 instances.
AWS Security Token Service (AWS STS), to create access for multiple users on your Workspace.
Optionally you can also Encrypt your Notebooks using AWS KMS service.

DataLake3

Image Source: AWS

Creating a Workspace Cluster

Once the Workspace is ready, we need to create a cluster that runs our Workspace notebooks. To create a Cluster, hover over the left side pane of your Workspace and click on Cluster. Click on Create Cluster to start configuring your cluster. There are 3 modes of clusters,

Standard/Multi-Node Cluster: These clusters are made of at least 2 instances. One is the Driver node and the other is the Worker node. There will be exactly 1 Driver node and the worker nodes can be scaled based on the workload in these clusters. A Standard cluster can run code in any language (Python, Scala, SQL, R). This is recommended for a single-user Workspace.
High Concurrency Cluster: These clusters provide fine-grained access control for multiple user Workspace Environments. The notebooks and delta table’s access can be isolated and the resources are managed for optimum security, performance, and latency. A High Concurrency cluster doesn’t support Scala.
Single Node Cluster: These clusters don’t have any worker nodes. All of the jobs and notebooks run on the driver node.

We can also change our Databricks runtime environment and use Photon acceleration which accelerates your Spark performance at least 3x times with an increased DBU usage.

For now, let us create a Standard cluster with m5a.large EC2 instance type. Select your AWS role to access S3 in the instance profile section.

DataLake4

Additionally, in the advanced options, you can also

Select whether your instances should be On-Demand or Spot,
Set Spark Configurations,
Environmental variables to use in your notebooks,
Initialization scripts and
The number of EBS volumes that should be attached to your EC2 instance etc.

DataLake5

Mounting an S3 bucket to the Cluster

Create a new notebook by clicking on the (+) icon on the Workspace. Name the notebook and choose Python as the default language. In our previous blog, we migrated our table data to an S3 bucket as Parquet files. Databricks allows us to mount an S3 bucket as local storage using “dbfs” commands to access the folders and objects in it.

DataLake6

Use the following command to mount your S3 bucket to your Workspace,

dbutils.fs.mount(“s3a://%s”<s3-bucket-name>,”/mnt/%s”<mount-name>

1	dbutils.fs.mount(“s3a://%s”<s3-bucket-name>,”/mnt/%s”<mount-name>

Creating Delta table using Parquet data

Delta table is an open-source framework used by Databricks to simplify ETL workloads and build reliable and scalable data pipelines for multiple workloads.

To create a delta table, first, we need to create a database. Databricks gives us a “default” database along with the workspace. Databricks SQL is almost the same as normal SQL except for a few syntactical differences, with the added advantage of the Spark framework. We can also use Pyspark, Scala or R commands to manipulate and filter the tables.

We can create a delta table using Pyspark as follows,

read_format = 'parquet'
write_format = 'delta'
load_path = '/mnt/dms-s3-bucket2022/dms-migrated-data/public/customer/LOAD00000001.parquet'
table_name = 'public.customer'
df=spark.read.format(read_format).load(load_path)
df.write.format(write_format).mode("overwrite").saveAsTable(table_name)

read_format = 'parquet'

write_format = 'delta'

load_path = '/mnt/dms-s3-bucket2022/dms-migrated-data/public/customer/LOAD00000001.parquet'

table_name = 'public.customer'

df=spark.read.format(read_format).load(load_path)

df.write.format(write_format).mode("overwrite").saveAsTable(table_name)

DataLake7

Along with the Full Load data, we also have Continuous Replication data from our RDS table. The replication data is accompanied by 2 other columns, one “Op” which represents the type of Operation such as Insert, Delete, or Update performed on the data, and a “TIMESTAMP” column which gives us the time at which the operation was performed. Based on the above information we can Insert, Update or Delete the respective columns on our delta table by building efficient data pipelines. For now, we can use simple SQL queries to apply the changes to our database.

DataLake8

%sql
MERGE INTO public.customer f
USING insert_table i
ON f.CUST_CODE = i.CUST_CODE
WHEN MATCHED
  THEN UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
MERGE INTO public.customer f
USING update_table u
ON f.CUST_CODE = u.CUST_CODE
WHEN MATCHED
  THEN UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
DELETE FROM public.customer f WHERE EXISTS(SELECT 1 FROM delete_table d
where f.CUST_CODE = d.CUST_CODE)

%sql

MERGE INTO public.customer f

USING insert_table i

ON f.CUST_CODE = i.CUST_CODE

WHEN MATCHED

THEN UPDATE SET *

WHEN NOT MATCHED

THEN INSERT *

MERGE INTO public.customer f

USING update_table u

ON f.CUST_CODE = u.CUST_CODE

WHEN MATCHED

THEN UPDATE SET *

WHEN NOT MATCHED

THEN INSERT *

DELETE FROM public.customer f WHERE EXISTS(SELECT 1 FROM delete_table d

where f.CUST_CODE = d.CUST_CODE)

To know more about Databricks SQL visit, Databricks SQL guide | Databricks on AWS

Scheduling a Workspace notebook

Databricks allows us to schedule a notebook as a Job. A job can be run at a specific time or repeated times in a day. A job can also be scheduled in a Jobs cluster which is specifically designed to run a job at the scheduled time.

DataLake9

Conclusion

Thus, we are done with setting up the Data Pipeline between Amazon RDS and Databricks Lakehouse platform which can continuously replicate the data for processing. The Lakehouse platform built using the Delta Lake framework enables us to have a broader view of our data. The integrated workspace along with the added advantage of Apache Spark, Data Engineering, and ML capabilities of Databricks enables us to perform Complex transformations, Analytic Processing, ML modeling, and BI Reporting on large chunks of data within minutes.

Freedom Month Sale — Discounts That Set You Free!

Up to 80% OFF AWS Courses
Up to 30% OFF Microsoft Certs

Act Fast!

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. Does Databricks support other cloud platforms?

ANS: – Yes. Databricks Workspace can be currently hosted on AWS, Microsoft Azure, and Google Cloud Platform. It can also be connected to multiple cloud storage platforms. Visit Databricks integrations overview | Databricks on AWS for more info.

2. What programming languages are supported by databricks?

ANS: – Databricks supports Python, Scala, R, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, scikit-learn, etc. Users can also download their custom Python, R, or Scala Packages from the repository.

3. Can Databricks be connected to other Data and Analytic tools?

ANS: – Yes. Databricks partners with several other Data, AI, and Analytics tools and services for better integration and application development. Visit Partner Connect – Databricks for more info.

4. Can we customize and manage the Databricks Workspace setup to be in our desired Private Network?

ANS: – Yes. We can control and manage the underlying infrastructure of Databricks according to our requirements. Though this is not available in the Standard tier of Databricks, Premium and Enterprise tiers will offer much more granular controls over the infrastructure and security components of Databricks. Visit Databricks Pricing – Schedule a Demo Now! To check the various offerings provided by Databricks.