Init Scripts (a.k.a Initialization Scripts) are shell scripts that run to start required processes as part of the boot process. In other words, Init Scripts are like a set of instructions that a computer follows when it starts up. The init scripts perform tasks such as checking hardware components, loading essential software, configuring network settings, and starting important services or programs.
In the context of Databricks, an Init Script is a shell script that runs during the startup of each cluster node before the Apache Spark driver or worker JVM starts. When you work with data in Databricks, you often need to set environment variables, install specific libraries, and so on. An init script can help in such cases by executing a series of steps each time you start your Databricks cluster. So, an init script in Databricks is an automated script that prepares your computing environment before you start your data analysis and machine learning tasks.
Types of Init Script in Databricks
Databricks officially supports two kinds of init scripts:
1. Cluster-level Init Scripts:
- These init scripts are called ‘Cluster-scoped Init Scripts’.
- They run on every cluster configured with the script.
- Databricks recommends this way of running init scripts.
- Cluster-level init scripts help you standardize the setup across multiple clusters in the workspace.
2. Workspace-level Init Scripts:
- These init scripts are called ‘Global Init Scripts’.
- They run on every cluster available in the workspace.
- These init scripts can ensure that a specific cluster configuration is enforced consistently across the workspace.
When you configure the above two types of init scripts in your workspace, Databricks follows a specific order of execution while running init scripts. The order of execution will be:
1. Global init script
2. Cluster-scoped init script
Remember that each time you create a new init script or modify the existing init script, you must restart the cluster it is executing on.
Cluster-scoped init scripts and global init scripts support the following Databricks environment variables:
- DB_CLUSTER_ID: This variable returns the ID of the cluster on which the init script is currently running.
- DB_CONTAINER_IP: This variable returns the private IP address of the container in which Spark runs. The init script runs inside this container.
- DB_IS_DRIVER: This variable returns a Boolean value based on whether the init script runs on a driver node.
- DB_DRIVER_IP: This variable returns the IP address of the driver node.
- DB_INSTANCE_TYPE: This variable returns the instance type of the host virtual machine.
- DB_CLUSTER_NAME: This variable returns the cluster name on which the init script executes.
- DB_IS_JOB_CLUSTER: This variable returns a Boolean value based on whether the cluster was created to run a job.
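As a minimal sketch of how these variables might be used, the script below branches on DB_IS_DRIVER so that part of the setup can run only on the driver. It assumes the variable holds the string "TRUE" on the driver node; node_role is a hypothetical helper name used only for illustration, not something Databricks provides.

```shell
#!/bin/sh
# Sketch: branch on Databricks-provided environment variables.
# node_role is a hypothetical helper, not part of Databricks.
# Assumption: DB_IS_DRIVER is the string "TRUE" on the driver node.
node_role() {
  if [ "${DB_IS_DRIVER:-}" = "TRUE" ]; then
    echo "driver"
  else
    echo "worker"
  fi
}

# Log which cluster and node type the script is running on.
# Outside Databricks the variables are unset, so fall back to "unknown".
echo "cluster=${DB_CLUSTER_ID:-unknown} role=$(node_role)"
```

The fallback values keep the script from failing when it is tried out on a machine where these variables are not set.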
Init scripts in Databricks offer several use cases that can enhance your workflow and streamline your data analysis processes. Below are some of the most common scenarios where the usage of init scripts can be beneficial:
- Library installations: With init scripts, we can install libraries and their dependencies not included in the Databricks runtime. This ensures that all the required software components are readily available when you start your Databricks workspace or cluster, saving you the time and effort of manually installing them each time.
- Configuring artifact repository: The required libraries may sometimes reside in a private artifact repository. To comply with the organization’s security policies, you may be allowed to install libraries only from that repository. Init scripts can help in such cases by automating the repository configuration, such as setting the repository paths, passing credentials, and retrieving access tokens.
- Data Preprocessing: If you need to perform certain data preprocessing tasks before you start your data analysis, then init scripts can help. For example, you can use init scripts to download, prepare datasets, clean data, or transform data into a suitable format ensuring your data is ready for analysis before you start your work.
- Configuring custom SSL certificate authority: To avoid connection errors to your endpoints, you may have to import custom CA certificates, which must be loaded into ‘/etc/ssl/certs’ for Databricks to verify them. In this case, you can use an init script to load them from their source path into the Databricks recommended path every time you run your cluster.
- Configuring 3rd party observability tools such as Datadog/Amazon Cloudwatch etc.
- Configuring 3rd party governance tools such as Immuta/Protegrity etc.
- Configuring External Hive Metastore: In Databricks, when you work with large amounts of data, you may have multiple datasets or tables stored in various formats like Parquet, CSV, or JSON. The external Hive Metastore is a centralized catalog that stores information about those datasets, including their location, structure, and metadata. Instead of manually specifying the details of each dataset, you can register the datasets with the Hive Metastore with the help of an init script. The registration process involves specifying the location of the datasets, such as Azure Blob Storage, Amazon S3, etc.
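To make the artifact repository use case above more concrete, here is a rough sketch of an init script fragment that points pip at a private package index by writing a pip configuration file. The helper name write_pip_conf and the example URL are hypothetical, and a real setup would also need to supply credentials in whatever way your organization's policies allow.

```shell
#!/bin/sh
# Sketch: point pip at a private package index from an init script.
# write_pip_conf and the example URL below are hypothetical.
write_pip_conf() {
  index_url="$1"   # URL of the private package index
  conf_file="$2"   # pip configuration file to write, e.g. /etc/pip.conf
  # Create the parent directory if it does not exist yet.
  mkdir -p "$(dirname "$conf_file")" || return 1
  # pip reads the index location from the [global] section.
  printf '[global]\nindex-url = %s\n' "$index_url" > "$conf_file"
}

# On a cluster node this might be called as (placeholder URL):
#   write_pip_conf "https://artifactory.example.com/api/pypi/pypi/simple" /etc/pip.conf
```

Taking the target file as a parameter makes it possible to try the function against a temporary path before using it on a cluster.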
The shell script below is a sample global init script that copies a custom CA certificate called ‘MyCA.pem’ from the ‘/dbfs/user/user_name/’ path into the ‘/etc/ssl/certs’ directory on each cluster node every time the cluster starts.
#!/bin/bash
cp /dbfs/user/user_name/MyCA.pem /etc/ssl/certs/
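A slightly more defensive variant of the same idea is sketched below. It checks that the certificate exists before copying, so that a missing file produces a clear error message instead of a bare cp failure. The function name install_ca_cert is hypothetical, and the paths are the ones from the example above.

```shell
#!/bin/sh
# Sketch: copy a CA certificate only if the source file exists.
# install_ca_cert is a hypothetical helper, not a Databricks command.
install_ca_cert() {
  src="$1"        # certificate to copy, e.g. /dbfs/user/user_name/MyCA.pem
  dest_dir="$2"   # target directory, e.g. /etc/ssl/certs
  if [ ! -f "$src" ]; then
    echo "CA certificate not found: $src" >&2
    return 1
  fi
  # Make sure the target directory exists, then copy the certificate.
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# In the init script this might be called as:
#   install_ca_cert /dbfs/user/user_name/MyCA.pem /etc/ssl/certs
```

Whether a missing certificate should stop the cluster from starting is a design choice: returning a non-zero status lets the calling script decide.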
By automating the setup steps with init scripts, you can easily share and reproduce your workspace or cluster configurations with colleagues or other teams.
Drop a query if you have any questions regarding init scripts or Databricks, and I will get back to you quickly.
1. Are there any other types of Init Scripts that Databricks has stopped supporting?
ANS: – Yes. Databricks has stopped supporting ‘Legacy Global Init Scripts’ and ‘Cluster-named Init Scripts’. Both are deprecated and cannot be used on new workspaces.
2. How can I check if my Databricks workspace still contains ‘Legacy Global Init Scripts’?
ANS: – ‘/databricks/init’ is the reserved location for legacy global init scripts in every Databricks workspace. If that location still contains scripts, your workspace still has legacy global init scripts.
3. Who can create Global Init Scripts?
ANS: – Only ‘Workspace Admins’ can create global init scripts in a Databricks workspace.
WRITTEN BY Yaswanth Tippa
Yaswanth Tippa is working as a Research Associate - Data and AIoT at CloudThat. He is a highly passionate and self-motivated individual with experience in data engineering and cloud computing with substantial expertise in building solutions for complex business problems involving large-scale data warehousing and reporting.