Init Scripts in Databricks for Consistent Environments and Error-Free Deployment

Introduction

Init scripts (a.k.a. initialization scripts) are shell scripts that run during the boot process to start required processes. In other words, init scripts are like a set of instructions that a computer follows when it starts up. They perform tasks such as checking hardware components, loading essential software, configuring network settings, and starting important services or programs.

In the context of Databricks, an init script is a shell script that runs during the startup of each cluster node, before the Apache Spark driver or worker JVM starts. When you work with data in Databricks, you often need to set environment variables, install specific libraries, and so on. An init script helps in such cases by executing a series of steps each time you start your Databricks cluster. In short, it is an automated script that prepares your computing environment before you start your data analysis and machine learning tasks.

Types of Init Script in Databricks

Databricks officially supports two kinds of init scripts:

1. Cluster-level Init Scripts:

These init scripts are called ‘Cluster-scoped Init Scripts’.
They run on every cluster configured with the script.
Databricks recommends this way of running init scripts.
Cluster-level init scripts help you standardize the setup across multiple clusters in the workspace.

2. Workspace-level Init Scripts:

These init scripts are called ‘Global Init Scripts’.
They run on every cluster available in the workspace.
These init scripts can ensure that a specific cluster configuration is enforced consistently across the workspace.

When you configure the above two types of init scripts in your workspace, Databricks follows a specific order of execution while running init scripts. The order of execution will be:

1. Global init script
2. Cluster-scoped init script

Remember that each time you create a new init script or modify an existing one, you must restart the clusters that use it for the change to take effect.

Environment Variables

Cluster-scoped init scripts and global init scripts support the following Databricks environment variables:

DB_CLUSTER_ID: This variable returns the ID of the cluster on which the init script is currently running.
DB_CONTAINER_IP: This variable returns the private IP address of the container in which Spark runs. The init script runs inside this container.
DB_IS_DRIVER: This variable indicates whether the init script is running on a driver node.
DB_DRIVER_IP: This variable returns the IP address of the driver node.
DB_INSTANCE_TYPE: This variable returns the instance type of the host virtual machine.
DB_CLUSTER_NAME: This variable returns the cluster name on which the init script executes.
DB_IS_JOB_CLUSTER: This variable returns a Boolean value based on whether the cluster was created to run a job.
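
As a rough illustration of how these variables might be used, the sketch below branches on DB_IS_DRIVER and logs the cluster ID. It assumes DB_IS_DRIVER holds the string TRUE on the driver node, and `node_role` is a hypothetical helper name chosen for this example, not something Databricks provides.

```shell
#!/bin/bash
# Hypothetical helper: report whether this node is the driver or a worker.
# Assumption: DB_IS_DRIVER is the string "TRUE" on the driver node.
node_role() {
  if [ "${DB_IS_DRIVER:-}" = "TRUE" ]; then
    echo "driver"
  else
    echo "worker"
  fi
}

# Log where the script is running; fall back to "unknown" outside Databricks.
echo "Init script running on cluster ${DB_CLUSTER_ID:-unknown} ($(node_role) node)"

if [ "$(node_role)" = "driver" ]; then
  : # driver-only setup steps would go here
fi
```

Wrapping the check in a small function keeps the driver-only and worker-only sections of a longer script easy to spot.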

Use Cases

Init scripts in Databricks offer several use cases that can enhance your workflow and streamline your data analysis processes. Below are some of the most common scenarios where the usage of init scripts can be beneficial:

Library installations: With init scripts, we can install libraries and their dependencies not included in the Databricks runtime. This ensures that all the required software components are readily available when you start your Databricks workspace or cluster, saving you the time and effort of manually installing them each time.
Configuring artifact repository: The required libraries may sometimes reside in a private artifact repository such as Artifactory. To comply with your organization’s security policies, you may be required to install libraries only from that repository. Init scripts can help in such cases by automating the repository configuration, such as setting the repository paths, passing credentials, and retrieving access tokens.
Data Preprocessing: If you need to perform certain data preprocessing tasks before you start your data analysis, then init scripts can help. For example, you can use init scripts to download, prepare datasets, clean data, or transform data into a suitable format ensuring your data is ready for analysis before you start your work.
Configuring custom SSL certificate authority: To avoid connection errors to your endpoints, you may have to import custom CA certificates, which must be loaded into ‘/etc/ssl/certs’ for Databricks to verify them. In this case, you can use an init script to load them from their source path into the Databricks recommended path every time you run your cluster.
Configuring third-party observability tools such as Datadog or Amazon CloudWatch.
Configuring third-party governance tools such as Immuta or Protegrity.
Configuring External Hive Metastore: In Databricks, when you work with large amounts of data, you may have multiple datasets or tables stored in various formats like Parquet, CSV, or JSON. The external Hive Metastore is a centralized catalog that stores information about those datasets, including their location, structure, and metadata. Instead of manually specifying the details of each dataset, you can register the datasets with the Hive Metastore with the help of an init script. The registration process involves specifying the location of the datasets, such as Azure Blob Storage, Amazon S3, etc.
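
To make the artifact repository use case more concrete, here is a minimal sketch that points pip at a private package index by writing a pip configuration file. The function name `write_pip_conf` and the repository URL are hypothetical, and a real setup would likely also need credentials, which this sketch leaves out.

```shell
#!/bin/bash
# Hypothetical helper: write a pip configuration file that sets the package index.
#   $1 = index URL of the private repository (assumed, not a real endpoint)
#   $2 = path of the configuration file to write
write_pip_conf() {
  mkdir -p "$(dirname "$2")"
  printf '[global]\nindex-url = %s\n' "$1" > "$2"
}

# In an init script the call might look like this (placeholder URL):
# write_pip_conf "https://artifactory.example.com/api/pypi/pypi-remote/simple" /etc/pip.conf
```

The call is left commented out so the sketch has no side effects until you supply your own repository URL and target path.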

Example

The shell script below is a sample global init script. Each time the cluster starts, it copies a custom CA certificate called ‘MyCA.pem’ from the ‘/dbfs/user/user_name/’ path in DBFS into the ‘/etc/ssl/certs’ directory on the node’s local file system.

#!/bin/bash
# Copy the custom CA certificate from DBFS into the node's local certificate directory.
cp /dbfs/user/user_name/MyCA.pem /etc/ssl/certs/
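
A slightly more defensive variant of the same idea is sketched below. It checks that the certificate exists before copying and returns a non-zero status otherwise, so a missing file is reported rather than silently ignored. The function name `copy_ca_cert` is hypothetical; the paths in the commented call are the ones from the example above.

```shell
#!/bin/bash
# Hypothetical helper: copy a CA certificate into a target directory.
#   $1 = source certificate file
#   $2 = destination directory
# Returns 1 and prints a message to stderr if the source file is missing.
copy_ca_cert() {
  if [ ! -f "$1" ]; then
    echo "CA certificate not found: $1" >&2
    return 1
  fi
  cp "$1" "$2/"
}

# In the init script, the call would use the paths from the example above:
# copy_ca_cert /dbfs/user/user_name/MyCA.pem /etc/ssl/certs || exit 1
```

Keep in mind that an init script that exits with a non-zero status generally prevents the cluster from starting, so decide whether a missing certificate should stop the launch or only be logged.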

Conclusion

Init scripts allow you to standardize the setup process across teams and projects by ensuring consistent environments and reducing the chances of discrepancies or errors.

You can easily share and reproduce your workspace or cluster configurations with colleagues or other teams by automating the setup steps.

FAQs

1. Are there any other types of init scripts that Databricks has stopped supporting?

ANS: – Yes. Databricks has stopped supporting ‘Legacy Global Init Scripts’ and ‘Cluster-named Init Scripts’. Both are deprecated and cannot be used in new workspaces.

2. How can I check if my Databricks workspace still contains ‘Legacy Global Init Scripts’?

ANS: – Check ‘/databricks/init’, which is the reserved location for legacy global init scripts in every Databricks workspace.

3. Who can create Global Init Scripts?

ANS: – Only ‘Workspace Admins’ can create global init scripts in a Databricks workspace.

WRITTEN BY Yaswanth Tippa

Yaswanth Tippa is working as a Research Associate - Data and AIoT at CloudThat. He is a highly passionate and self-motivated individual with experience in data engineering and cloud computing with substantial expertise in building solutions for complex business problems involving large-scale data warehousing and reporting.