Conversion of CSV Files to Parquet Format using AWS Glue Job

Overview

AWS Glue is a serverless AWS data integration service that facilitates the discovery, preparation, movement, and integration of data from many sources for analytics users. It can be used for application development, machine learning, and analytics. Additional productivity and data operations functionality is also included for authoring, running jobs, and implementing business workflows.

You can find and connect to more than 70 data sources using AWS Glue and manage your data in a centralized data catalog. You can graphically build, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. You may also rapidly search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

AWS Glue combines significant data integration capabilities into a single service, including centralized data discovery, cleaning, modern ETL, and cataloging. Additionally, because it is serverless, there is no infrastructure to maintain. AWS Glue supports all types of users and workloads, including ETL, ELT, and streaming, in a single service.

Now we will convert some CSV files to Parquet format without writing any program code.


Step-by-Step Guide

Step 1: The CSV files to be converted to Parquet format must be available in an Amazon S3 bucket.

Go to Amazon S3 from the AWS Management Console and confirm that the CSV files are present in the bucket.
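If the CSV files are not already in the bucket, they can be uploaded from the console or programmatically. Below is a minimal boto3 sketch, assuming hypothetical bucket, prefix, and file names; replace them with your own values.

```python
# Minimal sketch: upload local CSV files to an S3 prefix for the Glue job to read.
# The bucket name, prefix, and file names below are placeholders.
import boto3

s3 = boto3.client("s3")

bucket = "my-demo-bucket"      # placeholder bucket
prefix = "csv-input/"          # prefix the Glue data source will point at

for filename in ["sales_2023.csv", "sales_2024.csv"]:
    s3.upload_file(filename, bucket, prefix + filename)
    print(f"Uploaded {filename} to s3://{bucket}/{prefix}{filename}")
```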

Step 2: Go to the AWS Glue console and select ETL Jobs from the left menu pane.

Select the “Visual with a source and target” option, then click “Create”.

Select the Data source node, edit the “Data source properties – S3” tab, and provide the S3 bucket and the path in which the CSV files reside.

Select CSV as the Data format.

Now select the “Data target – S3 bucket” node, set the Format to Parquet and the Compression Type to Uncompressed.

Now select the Output file location.

Save the job we just created.

Now click “Run” to start the job.
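Although this job requires no hand-written code, AWS Glue Studio generates a PySpark script from the visual nodes behind the scenes. The sketch below approximates what such a generated script looks like, assuming placeholder S3 paths; the exact script produced for your job will differ and can be viewed on the job’s Script tab.

```python
# Approximate sketch of the script Glue Studio generates for a CSV -> Parquet job.
# S3 paths are placeholders; the real generated script will differ in detail.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source node: read the CSV files from the input prefix
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-demo-bucket/csv-input/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)

# Data target node: write the records back to S3 as uncompressed Parquet
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-demo-bucket/parquet-output/"},
    format="parquet",
    format_options={"compression": "uncompressed"},
)

job.commit()
```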

Step 3: You can see that the job is in the Running state.

Once the job is complete, you can view the Parquet-formatted files in your target path in the Amazon S3 bucket.
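The same check can also be scripted. Below is a small boto3 sketch, assuming a placeholder bucket and output prefix, that lists the objects the job wrote to the target path.

```python
# Minimal sketch: list the Parquet objects written to the job's target prefix.
# Bucket and prefix are placeholders; use the output location configured in the job.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-demo-bucket", Prefix="parquet-output/")

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```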

Conclusion

We have now easily converted CSV files to Parquet format without writing any program code. The nodes for your job are configured using the visual job editor. Each node represents a specific action, such as reading data from the source location or transforming the data, and each node you include in your job has properties that provide details about the transform or the location of the data.


About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 850k+ professionals in 600+ cloud certifications and completed 500+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partner, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR, and many more.

FAQs

1. When does billing for my AWS Glue jobs start and stop?

ANS: – Billing begins as soon as the job is scheduled to run and continues until the entire job is completed. With AWS Glue, you pay only for the time your job runs, not for environment provisioning or termination time.
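As an illustrative calculation (the per-DPU-hour rate varies by Region, so check the AWS Glue pricing page), a job that runs for 15 minutes on 10 DPUs at an assumed rate of $0.44 per DPU-hour would cost roughly 10 × 0.25 × $0.44 ≈ $1.10.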

2. What programming language can I use to develop AWS Glue ETL code?

ANS: – You have the option of using Scala or Python.

3. Can I include custom libraries in my ETL script?

ANS: – Yes. Custom Python libraries and Jar files can be imported into your AWS Glue ETL process.
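Custom libraries are typically attached through the special job parameters --extra-py-files (Python libraries) and --extra-jars (Jar files). Below is a minimal boto3 sketch of one way to set these when creating a job; the job name, role ARN, script location, and library paths are placeholders.

```python
# Hypothetical sketch: create a Glue ETL job that loads custom Python libraries and Jars.
# The role ARN, script location, and library S3 paths are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="csv-to-parquet-with-libs",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",        # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-demo-bucket/scripts/job.py",     # placeholder script
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        "--extra-py-files": "s3://my-demo-bucket/libs/my_helpers.zip",  # custom Python library
        "--extra-jars": "s3://my-demo-bucket/libs/my-connector.jar",    # custom Jar file
    },
)
```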

WRITTEN BY Deepak Surendran
