Overview
In our previous blog, we discussed how to migrate data from a database table using Full Load and Continuous Replication operations on AWS Database Migration Service (AWS DMS), from an Amazon Aurora RDS instance to an S3 bucket. In this blog, we will see how to replicate the continuous data coming from Aurora RDS as Delta tables on the Databricks Lakehouse Platform, so it can be used for BI and ML workloads.
Creating a Databricks Workspace
First, let us create a Databricks account at Databricks – Sign in and launch a workspace. Databricks provides a 14-day free trial for first-time customers to try out its features. Remember, to create a Databricks workspace we need an account on AWS, Azure, or GCP to manage the underlying infrastructure. Databricks provides the software that lets us develop efficient ETL, ML, and data pipelines, with the added advantage of Spark to handle huge volumes of data. To run this software, we need a cloud account that deploys and manages the underlying hardware.
Note: Databricks charges only for the platform as per the DBU usage of your workspace clusters. The underlying infrastructure is billed separately by the Cloud provider based on the resources launched.
Databricks provides us with a QuickStart option to launch all the required resources on an AWS account using a CloudFormation template. This configures the workspace for you without making you worry much about the technicalities of the cloud infrastructure.
Once you fill in all the required data in the CloudFormation template and launch it, the following resources will be created in your AWS account:
- A Cross-Account IAM role that will create and manage the AWS infrastructure for your Workspace.
- A VPC spanning 3 Availability Zones, configured with Public and Private Subnets, an Internet Gateway, a NAT Gateway, a VPC endpoint to access S3 buckets, and Security Groups for your Inbound and Outbound traffic.
- An S3 bucket to store your Workspace Notebooks, Delta Tables, Logs, etc., and an IAM role to access your S3 bucket from your Workspace.
- CloudWatch to monitor your Cluster EC2 instances.
- AWS Security Token Service (AWS STS) to create access for multiple users on your Workspace.
- Optionally, you can also encrypt your Notebooks using the AWS KMS service.
Image Source: AWS
Creating a Workspace Cluster
Once the Workspace is ready, we need to create a cluster that runs our Workspace notebooks. To create a cluster, hover over the left-side pane of your Workspace and click on Cluster, then click on Create Cluster to start configuring it. There are three cluster modes:
- Standard/Multi-Node Cluster: These clusters are made up of at least 2 instances: one Driver node and one or more Worker nodes. There is always exactly 1 Driver node, while the Worker nodes can be scaled based on the workload. A Standard cluster can run code in any language (Python, Scala, SQL, R). This is recommended for a single-user Workspace.
- High Concurrency Cluster: These clusters provide fine-grained access control for multi-user Workspace environments. Access to notebooks and Delta tables can be isolated, and resources are managed for optimum security, performance, and latency. A High Concurrency cluster doesn’t support Scala.
- Single Node Cluster: These clusters don’t have any worker nodes. All of the jobs and notebooks run on the driver node.
We can also change our Databricks runtime environment and use Photon acceleration, which accelerates Spark performance by at least 3x at the cost of increased DBU usage.
For now, let us create a Standard cluster with the m5a.large EC2 instance type. In the instance profile section, select the AWS role that has access to S3.
Additionally, in the advanced options (sketched programmatically below), you can also:
- Select whether your instances should be On-Demand or Spot,
- Set Spark configurations,
- Define environment variables to use in your notebooks,
- Add initialization scripts, and
- Set the number of EBS volumes that should be attached to your EC2 instances.
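If you prefer to script this instead of using the UI, the same cluster can be created through the Databricks Clusters API. The sketch below is illustrative only: the workspace URL, personal access token, runtime version, and instance profile ARN are placeholders you would replace with your own values.

import requests

# Placeholders: replace with your workspace URL and a personal access token
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "dms-replication-cluster",
    "spark_version": "11.3.x-scala2.12",          # example Databricks runtime version
    "node_type_id": "m5a.large",                  # instance type chosen above
    "num_workers": 2,                             # Standard/Multi-Node cluster
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",     # Spot instances, falling back to On-Demand
        "instance_profile_arn": "<arn-of-your-s3-access-role>",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,                   # size in GB
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "8"},   # example Spark configuration
    "spark_env_vars": {"ENVIRONMENT": "dev"},               # example environment variable
}

# Create the cluster; the response contains the new cluster_id on success
response = requests.post(
    DATABRICKS_HOST + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + DATABRICKS_TOKEN},
    json=cluster_spec,
)
print(response.json())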
Mounting an S3 bucket to the Cluster
Create a new notebook by clicking on the (+) icon in the Workspace. Name the notebook and choose Python as the default language. In our previous blog, we migrated our table data to an S3 bucket as Parquet files. Databricks allows us to mount an S3 bucket as local storage on the Databricks File System (DBFS) using the dbutils.fs utilities, so we can access the folders and objects in it.
Use the following command to mount your S3 bucket to your Workspace,
# Mount the S3 bucket at /mnt/<mount-name> so notebooks can read it like local storage
aws_bucket_name = "<s3-bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)
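Once the bucket is mounted, we can browse it like a local folder. Here is a quick sanity check; the mount name is a placeholder, and the folder path assumes the DMS output structure from the previous blog:

# List the migrated Parquet files through the mount point to confirm access
display(dbutils.fs.ls("/mnt/<mount-name>/dms-migrated-data/public/customer/"))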
Creating a Delta table using Parquet data
Delta Lake is an open-source framework used by Databricks to simplify ETL workloads and build reliable, scalable data pipelines for multiple workloads; a Delta table is a table stored in this format.
To create a Delta table, we first need a database. Databricks gives us a “default” database along with the workspace. Databricks SQL is almost the same as standard SQL except for a few syntactical differences, with the added advantage of the Spark framework. We can also use PySpark, Scala, or R commands to manipulate and filter the tables.
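For instance, since the table created in the next step is named public.customer, we first make sure a database called public exists. A minimal sketch; the name simply mirrors the source schema migrated by AWS DMS:

# Create the target database if it does not already exist
spark.sql("CREATE DATABASE IF NOT EXISTS public")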
We can create a Delta table using PySpark as follows,
# Read the Full Load Parquet file from the mounted bucket and save it as a Delta table
read_format = 'parquet'
write_format = 'delta'
load_path = '/mnt/dms-s3-bucket2022/dms-migrated-data/public/customer/LOAD00000001.parquet'
table_name = 'public.customer'

df = spark.read.format(read_format).load(load_path)
df.write.format(write_format).mode("overwrite").saveAsTable(table_name)
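Once the table is registered in the metastore, it can be queried like any other table; for example, a quick row-count check to confirm the Full Load data landed:

# Sanity check on the newly created Delta table
display(spark.sql("SELECT COUNT(*) AS row_count FROM public.customer"))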
Along with the Full Load data, we also have Continuous Replication data from our RDS table. The replication data comes with 2 extra columns: an “Op” column, which represents the type of operation (Insert, Update, or Delete) performed on the record, and a “TIMESTAMP” column, which gives us the time at which the operation was performed. Based on this information, we can insert, update, or delete the respective rows in our Delta table by building efficient data pipelines. For now, we can use simple SQL queries to apply the changes to our table.
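The MERGE and DELETE statements below reference insert_table, update_table, and delete_table. One way to build these, shown as a sketch only (the CDC file pattern is an assumption; adjust it to wherever your DMS task writes its ongoing changes), is to read the replication files and split them by the Op column, dropping the helper columns so the source schema matches the target table:

# Read the DMS CDC output (hypothetical file pattern; adjust to your folder layout)
cdc_path = "/mnt/dms-s3-bucket2022/dms-migrated-data/public/customer/2*.parquet"
cdc_df = spark.read.parquet(cdc_path)

# Split the changes by operation type and drop the DMS helper columns
# so the remaining schema matches public.customer
inserts = cdc_df.filter(cdc_df.Op == "I").drop("Op", "TIMESTAMP")
updates = cdc_df.filter(cdc_df.Op == "U").drop("Op", "TIMESTAMP")
deletes = cdc_df.filter(cdc_df.Op == "D").drop("Op", "TIMESTAMP")

# Register them as temporary views for the SQL statements below
inserts.createOrReplaceTempView("insert_table")
updates.createOrReplaceTempView("update_table")
deletes.createOrReplaceTempView("delete_table")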
%sql
MERGE INTO public.customer f
USING insert_table i
ON f.CUST_CODE = i.CUST_CODE
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

MERGE INTO public.customer f
USING update_table u
ON f.CUST_CODE = u.CUST_CODE
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

DELETE FROM public.customer f
WHERE EXISTS (SELECT 1 FROM delete_table d WHERE f.CUST_CODE = d.CUST_CODE);
To know more about Databricks SQL, visit the Databricks SQL guide | Databricks on AWS.
Scheduling a Workspace notebook
Databricks allows us to schedule a notebook as a Job. A job can run at a specific time or on a recurring schedule throughout the day. A job can also be scheduled on a Jobs cluster, which is provisioned specifically to run the job at the scheduled time.
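As a rough sketch of what this looks like programmatically, the notebook can be registered as a scheduled job through the Databricks Jobs API; the workspace URL, token, notebook path, cluster ID, and cron expression below are all placeholders:

import requests

# Placeholders: replace with your workspace URL and a personal access token
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"

job_spec = {
    "name": "apply-dms-cdc-changes",
    "tasks": [
        {
            "task_key": "merge_cdc",
            "notebook_task": {"notebook_path": "/Users/<your-user>/dms-cdc-merge"},
            "existing_cluster_id": "<cluster-id>",   # or supply a new_cluster block to use a Jobs cluster
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",     # every hour, on the hour
        "timezone_id": "UTC",
    },
}

# Create the scheduled job; the response contains the job_id on success
response = requests.post(
    DATABRICKS_HOST + "/api/2.1/jobs/create",
    headers={"Authorization": "Bearer " + DATABRICKS_TOKEN},
    json=job_spec,
)
print(response.json())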
Conclusion
With this, we have set up a data pipeline between Amazon RDS and the Databricks Lakehouse Platform that can continuously replicate data for processing. The Lakehouse platform built on the Delta Lake framework gives us a broader view of our data. The integrated workspace, along with the added advantages of Apache Spark and the Data Engineering and ML capabilities of Databricks, enables us to perform complex transformations, analytical processing, ML modeling, and BI reporting on large volumes of data within minutes.
FAQs
1. Does Databricks support other cloud platforms?
ANS: – Yes. Databricks Workspace can be currently hosted on AWS, Microsoft Azure, and Google Cloud Platform. It can also be connected to multiple cloud storage platforms. Visit Databricks integrations overview | Databricks on AWS for more info.
2. What programming languages are supported by Databricks?
ANS: – Databricks supports Python, Scala, R, and SQL, as well as data science frameworks and libraries such as TensorFlow, PyTorch, and scikit-learn. Users can also install their own custom Python, R, or Scala packages from a package repository.
3. Can Databricks be connected to other Data and Analytic tools?
ANS: – Yes. Databricks partners with several other Data, AI, and Analytics tools and services for better integration and application development. Visit Partner Connect – Databricks for more info.
4. Can we customize and manage the Databricks Workspace setup to be in our desired Private Network?
ANS: – Yes. We can control and manage the underlying infrastructure of Databricks according to our requirements. Though this is not available in the Standard tier of Databricks, the Premium and Enterprise tiers offer much more granular control over the infrastructure and security components of Databricks. Visit Databricks Pricing – Schedule a Demo Now! to check the various offerings provided by Databricks.
WRITTEN BY Sai Pratheek