Conversion of CSV Files to Parquet Format using AWS Glue Job

Overview

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, move, and integrate data from many sources for analytics, machine learning, and application development. It also includes additional productivity and data-operations tooling for authoring and running jobs and for implementing business workflows.

With AWS Glue, you can discover and connect to more than 70 data sources and manage your data in a centralized data catalog. You can visually build, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes, and you can quickly search and query the cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

AWS Glue consolidates major data integration capabilities into a single service, including data discovery, cleansing, modern ETL, and cataloging. Because it is serverless, there is no infrastructure to maintain, and it flexibly supports different types of users and workloads, such as ETL, ELT, and streaming, in one service.

In this blog, we will convert CSV files to Parquet format without writing any code.

Step-by-Step Guide

Step 1: Locate the CSV files in the Amazon S3 bucket that need to be converted to Parquet format.

Go to the Amazon S3 bucket from the AWS Management Console.

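If the CSV files are not yet in S3, they can be uploaded with the AWS SDK for Python (boto3). The sketch below is only illustrative; the bucket name, prefix, and file name are placeholders rather than values from this walkthrough.

```python
import boto3

# Placeholder bucket, prefix, and file name - replace with your own values
BUCKET = "my-glue-demo-bucket"
PREFIX = "input/csv/"

s3 = boto3.client("s3")

# Upload a local CSV file to the path the Glue job will later read from
s3.upload_file("sales_2023.csv", BUCKET, PREFIX + "sales_2023.csv")

# List the prefix to confirm the CSV files are in place
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get("Contents", []):
    print(obj["Key"])
```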

Step 2: Go to the AWS Glue console and select ETL Jobs from the left navigation pane.

Select the “Visual with a source and target” option, then click “Create.”


Select the Data source node, edit the “Data source properties – S3” settings, and provide the S3 bucket and path where the CSV files reside.

Select CSV as the Data format.


Now select the “Data target – S3 bucket” node, then set the Format to Parquet and the Compression Type to Uncompressed.


Now select the output file location.

Save the job we just created.

Then click “Run” to start the job.

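For reference, the script that AWS Glue Studio generates behind this visual job is roughly equivalent to the PySpark sketch below. This is only an approximation: the bucket paths are placeholders and the actual generated code will include transformation contexts and other details, but the source-to-target mapping (CSV in, uncompressed Parquet out) is the same.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source - S3 path that holds the CSV files (placeholder path)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-glue-demo-bucket/input/csv/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Data target - write Parquet to the output path (placeholder); the
# "uncompressed" codec mirrors the Compression Type chosen in the console
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-glue-demo-bucket/output/parquet/"},
    format="parquet",
    format_options={"compression": "uncompressed"},
)

job.commit()
```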

Step 3: Monitor the job and verify the output.

You can see that the job is in the Running state.

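The run can also be started and monitored outside the console with boto3, as a rough sketch; the job name “csv-to-parquet” is a placeholder for whatever name you gave the job when saving it.

```python
import time

import boto3

glue = boto3.client("glue")

# Placeholder job name - use the name you saved the job under
JOB_NAME = "csv-to-parquet"

# Start a run (the API equivalent of clicking "Run" in the console)
run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]

# Poll the run state until it finishes
while True:
    state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state not in ("STARTING", "RUNNING", "STOPPING"):
        break
    time.sleep(30)
```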

Once the job is complete, you can see the Parquet-formatted files in the target path of the Amazon S3 bucket.

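To double-check the result programmatically, the output can be read back with the AWS SDK for pandas (awswrangler), assuming it is installed; the target path below is the same placeholder used earlier.

```python
import awswrangler as wr

# Placeholder target path - the S3 location chosen as the job's data target
df = wr.s3.read_parquet(path="s3://my-glue-demo-bucket/output/parquet/")

print(df.head())
print(f"Read back {len(df)} rows from the Parquet output")
```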

Conclusion

We have now converted CSV files to Parquet format without writing any code. The job’s nodes are configured in the visual job editor: each node represents a specific action, such as reading data from the source location or transforming the data, and each node you add to the job has properties that describe the transform or the location of the data.

About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner, and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

Drop a query if you have any questions regarding AWS Glue jobs, and I will get back to you quickly.

To get started, go through our Consultancy page and Managed Services Package to explore CloudThat’s offerings.

FAQs

1. When does billing for my AWS Glue jobs start and stop?

ANS: – Billing begins as soon as the job is scheduled to run and continues until the entire job completes. With AWS Glue, you pay only for the time your job runs, not for environment provisioning or shutdown time.

2. What programming language can I use to develop AWS Glue ETL code?

ANS: – You have the option of using Scala or Python.

3. Can I include custom libraries in my ETL script?

ANS: – Yes. You can import custom Python libraries and JAR files into your AWS Glue ETL job.
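For example, extra dependencies are typically passed through the standard Glue job parameters --extra-py-files (Python files) and --extra-jars (JAR files). The sketch below supplies them as run-time arguments with boto3; the job name and S3 paths are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholder job name and dependency locations in S3
glue.start_job_run(
    JobName="csv-to-parquet",
    Arguments={
        "--extra-py-files": "s3://my-glue-demo-bucket/libs/my_helpers.py",
        "--extra-jars": "s3://my-glue-demo-bucket/libs/my-connector.jar",
    },
)
```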

WRITTEN BY Deepak Surendran
