

A Guide to Building a Simple ETL Pipeline with Google Cloud Platform and Python


Introduction

In today’s data-driven world, efficiently extracting, transforming, and loading (ETL) data is paramount.

ETL pipelines enable organizations to process and integrate data from various sources, making it readily available for analysis and decision-making.

In this blog, we will explore how to build a simple ETL pipeline using Python and leverage the power of Google Cloud Platform (GCP) services to streamline the process.


Steps to Implement ETL Pipeline

Step 1: Set Up a Google Cloud Platform Project

  • Sign in to the Google Cloud Console (console.cloud.google.com) and create a new project.
  • Enable the necessary APIs for your project, including Compute Engine, Cloud Storage, BigQuery, Cloud Functions, and Cloud Scheduler.

Step 2: Set Up Google Cloud Storage

  • Create a new bucket in Google Cloud Storage to store input and output data (a short snippet for creating it from Python follows this list).
  • Note the bucket name, as it will be required in the Python code.
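If you prefer to create the bucket from code rather than the console, a minimal sketch using the google-cloud-storage client library looks like this; the bucket name is a placeholder and must be globally unique:

```python
from google.cloud import storage

# Placeholder bucket name; bucket names must be globally unique.
client = storage.Client()
bucket = client.create_bucket("my-etl-pipeline-bucket", location="US")
print(f"Created bucket {bucket.name}")
```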

Step 3: Develop the Python ETL Script

  • Write a Python script that handles the ETL process. This script should include code for extracting data from a source, performing required transformations, and loading the transformed data into Google Cloud Storage.
  • Utilize Python libraries such as Pandas, NumPy, or database connectors to facilitate data extraction and transformation tasks.
  • Use the Google Cloud Storage client library to interact with the storage service. Install the required libraries using pip (for example, pip install google-cloud-storage pandas).

Here is an example code snippet that demonstrates a simple ETL process using Python and Google Cloud Storage.
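Since the exact snippet depends on your data source, the version below is a minimal sketch: it reads a local CSV (a placeholder file name), applies a simple cleanup as the transformation, and uploads the result to the Cloud Storage bucket created in Step 2 (also a placeholder name).

```python
import pandas as pd
from google.cloud import storage

BUCKET_NAME = "my-etl-pipeline-bucket"  # placeholder; use your bucket from Step 2


def extract(source_path: str) -> pd.DataFrame:
    # Extract: read raw data from a CSV file.
    return pd.read_csv(source_path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: drop fully empty rows and normalize column names (example logic).
    df = df.dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


def load(df: pd.DataFrame, destination_blob: str) -> None:
    # Load: write the transformed data to Google Cloud Storage as a CSV object.
    client = storage.Client()
    blob = client.bucket(BUCKET_NAME).blob(destination_blob)
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")


if __name__ == "__main__":
    data = extract("input_data.csv")          # placeholder input file
    cleaned = transform(data)
    load(cleaned, "output/cleaned_data.csv")  # placeholder output path
```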

Step 4: Authenticate the Python Script with Google Cloud Platform

  • Create a service account in the Google Cloud Console, granting it the necessary permissions to access relevant resources.
  • Download the service account key as a JSON file and securely store it.
  • Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the downloaded service account key file before running your Python script, as shown in the snippet below.
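If you would rather set the variable from inside the script than in the shell, a minimal example (with a placeholder key path) is:

```python
import os

# Placeholder path; point this at the JSON key downloaded for your service account.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```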

Step 5: Test the Python ETL Script Locally

  • Run the Python script locally to validate that the ETL process functions as expected.
  • Ensure that the script successfully retrieves the input data, performs the required transformations, and stores the transformed data in the desired format in Google Cloud Storage.

Step 6: Set up Cloud Scheduler and Cloud Functions

  • Go to the Google Cloud Console and navigate to the Cloud Scheduler service.
  • Create a new job with the desired schedule (e.g., daily, hourly) and configure it to trigger a Cloud Function.
  • Deploy a Cloud Function that will execute your Python ETL script, and ensure the function is triggered by the Cloud Scheduler job; a minimal function skeleton is sketched after this list.
  • Set up the Cloud Function to access the necessary resources, such as Cloud Storage and any specific dependencies your script requires.
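As a rough sketch, an HTTP-triggered Cloud Function (which a Cloud Scheduler job can call) could wrap the ETL steps like this. Here, etl_pipeline is an assumed module name for the Step 3 script, the gs:// path is a placeholder, and reading gs:// paths with pandas also requires the gcsfs package:

```python
import functions_framework

# Assumed module name for the Step 3 script; adjust to your project layout.
from etl_pipeline import extract, transform, load


@functions_framework.http
def run_etl(request):
    # Entry point invoked by the Cloud Scheduler job over HTTP.
    df = extract("gs://my-etl-pipeline-bucket/input/input_data.csv")  # placeholder
    load(transform(df), "output/cleaned_data.csv")
    return "ETL run completed", 200
```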

Step 7: Set up Google BigQuery

  • Create a BigQuery dataset to store the processed data.
  • Define your data schema in BigQuery, specifying the table structure and column types (see the sketch after this list).
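For illustration, the dataset and table can also be created with the google-cloud-bigquery client library; the dataset, table, and column names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create the dataset (placeholder ID).
client.create_dataset("etl_dataset", exists_ok=True)

# Define a table with an explicit schema (placeholder columns).
schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("purchase_amount", "FLOAT"),
    bigquery.SchemaField("purchase_date", "DATE"),
]
table = bigquery.Table(f"{client.project}.etl_dataset.cleaned_data", schema=schema)
client.create_table(table, exists_ok=True)
```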

Step 8: Modify the Python ETL Script to Load Data into BigQuery

  • Enhance your Python ETL script to include the code necessary to load the processed data into BigQuery.
  • Use the BigQuery client library to interact with the BigQuery service and load the transformed data into the designated table within the created dataset, as shown in the snippet below.
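One way to do this, assuming the transformed CSV already sits in Cloud Storage, is to load it straight from the bucket into the table defined in Step 7; the URI and table ID below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = f"{client.project}.etl_dataset.cleaned_data"  # placeholder table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load the transformed CSV produced by the ETL script into BigQuery.
load_job = client.load_table_from_uri(
    "gs://my-etl-pipeline-bucket/output/cleaned_data.csv",  # placeholder URI
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```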

Conclusion

By combining the power of Python, Google Cloud Platform, and Cloud Scheduler, you can build a simple yet efficient ETL pipeline. GCP services such as Cloud Storage, Cloud Functions, Cloud Scheduler, and BigQuery provide the storage, compute, scheduling, and analytics capabilities the pipeline needs.



FAQs

1. What is an ETL pipeline?

ANS: – ETL stands for Extract, Transform, Load. An ETL pipeline is a set of processes and workflows to extract data from various sources, transform it into a desired format, and load it into a target system or database for analysis or further processing.

2. Why use Google Cloud Platform for building an ETL pipeline?

ANS: – Google Cloud Platform (GCP) offers a comprehensive suite of cloud services, including scalable storage, computing power, and scheduling capabilities. By leveraging GCP services, you can easily build a reliable and scalable ETL pipeline.

3. What is Google Cloud Storage, and why is it used in an ETL pipeline?

ANS: – Google Cloud Storage is an object storage service that allows you to store and retrieve data in the cloud. It is widely used in ETL pipelines to store both input and output data. It provides durability, scalability, and easy integration with other GCP services.

WRITTEN BY Hariprasad Kulkarni
