Overview
Databricks has emerged as a powerful unified data platform that simplifies big data processing and advanced analytics. Its support for Apache Spark and its seamless integration with cloud platforms like AWS, Azure, and Google Cloud empower data engineers, analysts, and scientists to process large datasets efficiently. One of the most crucial steps in getting the most out of Databricks is connecting your external databases, whether PostgreSQL, MySQL, SQL Server, or another RDBMS, so that you can read, transform, and analyze data in a collaborative environment.
In this blog post, we will dive deep into how to mount your database to Databricks, including step-by-step instructions, key considerations, and best practices to ensure secure and scalable access.
Why Mount a Database to Databricks?
Databricks allows you to work with various data sources, including file systems like Amazon S3 and Azure Data Lake, as well as databases via JDBC. While mounting is typically associated with cloud storage systems, connecting to a database (often loosely called “mounting”) means creating a live, reusable connection between Databricks and your relational or NoSQL data source.
Mounting or connecting your database to Databricks provides several advantages:
- Centralized Access: Collaborate across teams on a common workspace.
- ETL Simplification: Perform transformations and data movement using Spark SQL or PySpark.
- Scalability: Leverage Databricks’ distributed processing for querying large tables.
- Machine Learning: Prepare data easily for modeling using MLlib or integration with MLflow.
- Automation: Automate reporting, alerts, and workflows using Databricks Jobs and notebooks.
Understanding the Architecture
Before proceeding, it’s important to understand what “mounting” means in the Databricks context:
- Mounting File Systems: Refers to linking external storage (e.g., AWS S3, Azure Blob, GCS) to Databricks DBFS.
- Connecting Databases: Involves using JDBC drivers to connect databases to notebooks or clusters securely, often referred to as “mounting” in colloquial usage. However, it isn’t a mount in the file system sense.
So, while you can’t directly mount a database like you would Amazon S3, you connect to it using a JDBC connection string and optionally register it as a permanent table or catalog entry.
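To make the distinction concrete, the sketch below contrasts the two in PySpark. It is illustrative only: the bucket, mount point, hostname, and table names are placeholders, and the storage mount assumes the cluster already has credentials (for example, an instance profile) for the bucket.

# A true mount links external object storage to DBFS under a /mnt path.
dbutils.fs.mount(
    source="s3a://your-example-bucket",      # placeholder bucket
    mount_point="/mnt/example-data"          # placeholder mount point
)

# A database "mount" is really just a reusable JDBC connection, not a file system path.
orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://your-db-hostname:5432/your-database")
    .option("dbtable", "public.orders")      # placeholder table
    .option("user", "your_user")
    .option("password", "your_password")
    .load()
)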
Prerequisites
To connect your database to Databricks, you’ll need:
- A running Databricks workspace.
- A cluster with appropriate libraries installed (e.g., JDBC drivers).
- Database credentials (host, port, database name, username, and password).
- Proper network access (i.e., your Databricks cluster can reach your database; a quick reachability check is sketched after this list).
- Optional: A secure place to store secrets (Databricks Secret Scope, Azure Key Vault, or AWS Secrets Manager).
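Because the network-access prerequisite is the one that most often fails silently, here is a minimal sketch (not part of the official setup) that checks from a notebook whether the cluster can open a TCP connection to the database host. The hostname and port are placeholders matching the PostgreSQL example used later.

# Quick reachability check: can the cluster reach the database host and port?
import socket

db_host = "your-db-hostname"   # placeholder, same value as jdbc_hostname below
db_port = 5432                 # PostgreSQL default port

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    result = s.connect_ex((db_host, db_port))

print("Reachable" if result == 0 else f"Not reachable (connect_ex returned {result})")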
Supported Databases
Databricks supports a wide range of data sources, including but not limited to:
- PostgreSQL
- MySQL
- Microsoft SQL Server
- Oracle
- Snowflake
- Amazon Redshift
- MongoDB (via Spark Connector)
- Google BigQuery
- Databricks SQL (internal)
Each requires a specific JDBC driver that can be installed via a Maven coordinate or uploaded manually.
Step-by-Step: Connecting a PostgreSQL Database to Databricks
We will use PostgreSQL for demonstration purposes, but the process is similar for other databases.
Step 1: Install the JDBC Driver
Navigate to Clusters > Libraries > Install New > Maven and use:
For PostgreSQL (used in this walkthrough):

org.postgresql:postgresql:42.2.20

For MySQL:

mysql:mysql-connector-java:8.0.23
This installs the JDBC driver on your cluster.
Step 2: Store Your Credentials Securely
Option A: Use Databricks Secrets
Create a secret scope:
databricks secrets create-scope --scope db-creds
Then store credentials:
databricks secrets put --scope db-creds --key pg-username
databricks secrets put --scope db-creds --key pg-password
Access them in a notebook:
username = dbutils.secrets.get(scope="db-creds", key="pg-username")
password = dbutils.secrets.get(scope="db-creds", key="pg-password")
Option B: Use a config file (not recommended for production)
username = "your_user"
password = "your_password"
Step 3: Define JDBC Connection Parameters
jdbc_hostname = "your-db-hostname"
jdbc_port = 5432
jdbc_database = "your-database"

jdbc_url = f"jdbc:postgresql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}"

connection_properties = {
    "user": username,
    "password": password,
    "driver": "org.postgresql.Driver"
}
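Before reading a full table, it can be worth validating the URL and credentials with a one-row query. This optional check is not part of the original steps; the subquery alias is arbitrary, and it relies on Spark's JDBC reader accepting a parenthesized subquery in place of a table name.

# Optional sanity check: read a single constant row through the JDBC connection.
# If this fails, recheck the hostname, port, credentials, and network access.
check_df = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT 1 AS ok) AS connectivity_check",
    properties=connection_properties
)
check_df.show()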
Step 4: Read Data into a Spark DataFrame
df = spark.read.jdbc(
    url=jdbc_url,
    table="public.your_table",
    properties=connection_properties
)

df.show()
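For large tables, you can ask Spark to parallelize the read so the scalability benefit mentioned earlier actually materializes. The variant below is a hedged sketch: it assumes the table has a roughly evenly distributed numeric column (a hypothetical id), and the bounds and partition count are illustrative values you would tune to your data.

# Partitioned read: Spark issues numPartitions parallel queries, splitting the
# id range between lowerBound and upperBound across executors.
df_parallel = spark.read.jdbc(
    url=jdbc_url,
    table="public.your_table",
    column="id",            # hypothetical numeric partition column
    lowerBound=1,
    upperBound=1000000,     # rough maximum of the id column
    numPartitions=8,
    properties=connection_properties
)
print(df_parallel.rdd.getNumPartitions())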
Step 5: Create a Temp or Permanent Table
df.createOrReplaceTempView("your_table_temp")
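The temp view above is scoped to the current Spark session. To cover the “permanent” half of this step, you can also persist the DataFrame as a managed table that other notebooks and jobs can query; the database and table names below are placeholders.

# Query the session-scoped temp view with Spark SQL.
spark.sql("SELECT COUNT(*) AS row_count FROM your_table_temp").show()

# Persist a copy as a managed table for reuse across notebooks and jobs.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")                  # placeholder database
df.write.mode("overwrite").saveAsTable("analytics.your_table_copy")   # placeholder table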
Secure Practices
When working with databases in Databricks, security must be a top priority:
- Avoid hardcoding credentials in notebooks.
- Use Secret Scopes or Azure Key Vault-backed scopes for credential management.
- Enable cluster-level AWS IAM roles or private endpoints for secure access.
- Use network policies or Amazon VPC peering to restrict traffic to the DB.
- Use parameterized queries or safe read/write methods to avoid SQL injection.
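On the last point, one safe read pattern is to avoid building SQL strings by hand and instead filter the DataFrame, letting Spark push simple predicates down to the database. A brief sketch, assuming a hypothetical region column and an externally supplied value:

# Filter via the DataFrame API instead of concatenating user input into SQL;
# Spark's JDBC source pushes simple predicates down to PostgreSQL.
from pyspark.sql import functions as F

region = "EMEA"   # illustrative value, e.g. from a widget or job parameter
filtered_df = (
    spark.read.jdbc(
        url=jdbc_url,
        table="public.your_table",
        properties=connection_properties
    )
    .filter(F.col("region") == region)   # hypothetical column
)
filtered_df.show()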
Conclusion
Although a database cannot be mounted the way cloud storage can, a JDBC connection gives you the same practical benefit: live, reusable access to your relational data for ETL, analytics, and machine learning in Databricks. With secure practices, performance optimization, and automation in place, your data platform can deliver consistent and reliable value across teams.
Drop a query if you have any questions regarding Databricks and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR and many more.
FAQs
1. Can I mount a database to Databricks like I mount cloud storage (e.g., S3 or ADLS)?
ANS: – No, databases cannot be “mounted” in the traditional file system sense like Amazon S3 or ADLS. Instead, you connect to them using JDBC within your Databricks notebook or cluster. The term “mounting a database” is often used informally to refer to setting up a persistent and reusable database connection.
2. What JDBC drivers are supported in Databricks, and how do I install them?
ANS: – Databricks supports many JDBC drivers, including PostgreSQL, MySQL, SQL Server, Oracle, and more. You can install them from the Maven repository via the Clusters > Libraries > Install New > Maven interface. For example, PostgreSQL uses org.postgresql:postgresql:42.2.20.
WRITTEN BY Vinay Lanjewar