Significance of Data Build Tool in Modern Data Stack

Introduction

Data is essential for a business, whether it is decision-making, understanding performance, improving processes, or understanding consumer behavior. As the volume of the data increases, so does the need for quality and making it available for everyone in the organization to use. Data Build Tool, popularly known as dbt, helps data teams achieve data engineering activities accessible to people with analyst skills to perform transformations. To understand dbt, let’s first understand how traditional data teams work and what steps are included in the traditional data processing.

Customized Cloud Solutions to Drive your Business Success

Cloud Migration
Devops
AIML & IoT

Know More

Traditional Data Teams

A traditional data team consists of Data Engineers and Data Analysts.

Typical data engineer’s work includes building the infrastructure to host the data like data warehouses, databases, tables, etc., and taking care of ETL processes to send the data from a source to where it needs to be after manipulating if any transformations are required, whereas data analyst work includes querying the tables that data engineers have built to serve data, building dashboards and reports that help businesses in taking better decisions.

The data analyst knows what needs to be built, and the data engineer knows how it needs to be built & put into production.

Traditional Data Processing

A typical data processing architecture includes the following steps:

Extracting data from a source or downloading it in CSV, or Excel files, typically from databases.
Manipulating(transforming) the data by using any third-party machine.
Loading that transformed data into a database so someone can use it.

These three steps can be collectively called as ETL (Extract Transform Load) process.

But with the advent of cloud-based data warehousing solutions, the entire game of analytics has changed. The storage and computing power can be purchased with the introduction of cloud platforms like Snowflake, Amazon Redshift, Azure Synapse Analytics, Big Query, and Databricks. It can scale them as needed whenever organizations want to. With these, we can take the raw data, load into a database and then transform it from there. There is no need to extract and load data repeatedly. So, the term ETL has now been redefined as ELT (Extract Load Transform).

Role of Analytics Engineer

The cloud-based data warehouses transitioned ETL into ELT, and with this change, a new role called ‘Analytics Engineer’ emerged. So, the sole responsibility of an analytics engineer is to take the raw data that was extracted & already loaded into a database and transform it into clean data, which analysts can use to serve the business needs.

To conclude, analytics engineers are in charge of ‘T’, which stands for Transform in ELT. Introducing this role into data teams can free data engineers and analysts from doing unnecessary tasks. Data Engineers can focus more on ‘E’ and ‘L’ in ELT or on maintaining infrastructure, whereas Data Analysts can work closely with Analytics Engineers to deliver the transformed data to a BI tool for fast and accurate reporting.

Ultimately, modern data teams can consist of these three roles:

Data Engineer
Analytics Engineer
Data Analyst

In small organizations, a team of one can serve all the above purposes, but as the organization grows, so is the data, and there may be a need for all these roles.

With this enough backstory, now we can talk about what is dbt and where it can fit in modern data stack.

dbt in Modern Data Stack

The modern data stack mainly consists of the following:

Data Sources: Specifies where our raw data is. It can be CSV, Excel, IoT devices, Cloud storage, etc.
Loaders: Tools that can load the raw data into our required destinations, like Azure Data Factory, AWS Glue, etc.
Data Platforms: Where our data (both raw and transformed) resides, like Snowflake, Amazon Redshift, Azure Synapse Analytics, Big Query, etc.

So, here dbt fits in near the Data Platforms. dbt works directly with Data Platforms to manage transformations, test them, and document them. It doesn’t extract or load the data but helps us transform the already loaded data into a data warehouse.

dbt

The main reason one should use dbt in their project is that it combines our analytics code (SQL) with software engineering principles & best practices to make our analytics collaborative.

Advantages of using dbt

Modularization:

In simple terms, Modularity means building a final product piece by piece rather than all at once. With dbt, we can write modular simple SQL SELECT statements (CTEs) and reuse them throughout the project to avoid repeating the logic in another long SQL query.

Version Control:

Analytics code, whether Python, SQL, Java, or Scala, should be version controlled because as data grows and business evolves, it is essential to know who changed what and when. dbt can be configured with CI/CD pipelines and supports most cloud data platforms such as Snowflake, Amazon Redshift, Databricks, Azure synapse Analytics, Google Big Query, etc.

Materialization:

In dbt, we store SQL codes within ‘.sql’ files, also called dbt models. While writing these SQL codes, we can avoid boilerplate DDL and DML to manage transactions, schema changes, or creating/dropping tables. We have to write the business logic in a simple SQL SELECT statement that returns the data we need, and dbt takes care of the materialization.

Ref Function:

With dbt, we can implement our transformations in a stage wise manner. For this, dbt provides a ‘ref’ function, and with the help of this ‘ref’ function, instead of selecting data from a table or view in the warehouse, we can select that data from another dbt model.

Jinjas and Macros:

‘Jinjas’ means lightweight template language. Using SQL, we can implement Jinja functions in dbt to use control structures like if, if-else, for loop, renaming column names, schema changes, etc. Also, if we have any SQL code that we use repeatedly, we can store it as ‘Macros’ and use it in whichever model we want. We can write DRY (Don’t repeat yourself) code by leveraging these.

Incremental Models:

With dbt, we can optimize the long running queries by leveraging metadata with the help of ‘Incremental Models’, which are easy to configure and use.

Testing:

To improve the integrity, dbt provides a way to write assertions, i.e., Tests to test our model each time our project runs, like ensuring a key is unique or not null before that code has run in production.

Documentation:

dbt provides a mechanism to write manually, or we can auto-generate the documentation for our dbt models when our project runs.

Package Management:

Like how software engineers modularize code into libraries, we can create libraries in dbt for the required models/macros and publish those libraries that others can reference. These libraries in dbt are called ‘Packages’.

10. Seeds:

So often in analytics, we need to map the raw values to a more readable format, like converting a country code to a country value. In such cases, we can save these data sources as ‘Seeds’, which are nothing but CSV files, and dbt can run these seed files each time our project runs.

11. Snapshots:

Data in data sources is mutable and often changes over time. If we want to load any historical data, dbt provides ‘Snapshots’ to load raw data for a time.

Conclusion

dbt has the potential to excel as the best transformation tool in modern data analytics because it eliminates the common problems that data engineers and data analysts face while also providing an easy and reliable workflow for collaboration and knowledge exchange.

In this blog, we saw what is dbt and why we should use it. In part-2 of this dbt series, we will see how to use dbt within our projects, configurations with different data platforms, and some of the use cases.

Get your new hires billable within 1-60 days. Experience our Capability Development Framework today.

Cloud Training
Customized Training
Experiential Learning

About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

FAQs

1. What is a DRY code?

ANS: – DRY is a software development principle that stands for ‘Don’t Repeat Yourself’. Following this principle means we aim to reduce repetitive patterns and duplicate code and logic in favor of modular and referenceable code.

2. How to connect to dbt?

ANS: – We can access dbt in our project in 2 ways:

dbt core: This open-source tool must be installed and used through the command line.
dbt cloud: This was built around dbt core, but it also provides web-based UI to be a more accessible, hosted environment, so it is faster to get up and to run, also some other features like observability on metadata, in-app job scheduler, integrations with other tools.

3. What are the supported Data Platforms?

ANS: – dbt uses open-source python modules called ‘Plugins’ to connect to our data platforms. The dbt supported data platforms are:

AlloyDB
Azure Synapse
BigQuery
Databricks
Dremio
Postgres
Redshift
Snowflake
Spark
Starburst & Trino

WRITTEN BY Yaswanth Tippa

Yaswanth Tippa is working as a Research Associate - Data and AIoT at CloudThat. He is a highly passionate and self-motivated individual with experience in data engineering and cloud computing with substantial expertise in building solutions for complex business problems involving large-scale data warehousing and reporting.