Conversion of CSV Files to Parquet Format using AWS Glue Job

Overview

AWS Glue is a serverless AWS data integration service that facilitates the discovery, preparation, movement, and integration of data from many sources for analytics users. It can be used for application development, machine learning, and analytics. Additional productivity and data operations functionality is also included for authoring, running jobs, and implementing business workflows.

You can find and connect to more than 70 data sources using AWS Glue and manage your data in a centralized data catalog. You can graphically build, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. You may also rapidly search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

AWS Glue combines significant data integration capabilities into a single service, including centralized data discovery, cleaning, modern ETL, and cataloging. Additionally, because it is serverless, there is no infrastructure to maintain. AWS Glue supports all types of users and workloads, including ETL, ELT, and streaming, in a single service.

Now we will convert some CSV files to Parquet format without writing any program code.


Step-by-Step Guide

Step 1: The CSV files to be converted to Parquet format must be available in an Amazon S3 bucket.

Go to Amazon S3 from the AWS Management Console and confirm that the CSV files are present in the bucket.
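If the CSV files are not already in the bucket, they can be uploaded from the console or programmatically. Below is a minimal boto3 sketch, assuming hypothetical bucket, prefix, and file names; replace them with your own values.

```python
# Minimal sketch: upload local CSV files to an S3 prefix for the Glue job to read.
# The bucket name, prefix, and file names below are placeholders.
import boto3

s3 = boto3.client("s3")

bucket = "my-demo-bucket"      # placeholder bucket
prefix = "csv-input/"          # prefix the Glue data source will point at

for filename in ["sales_2023.csv", "sales_2024.csv"]:
    s3.upload_file(filename, bucket, prefix + filename)
    print(f"Uploaded {filename} to s3://{bucket}/{prefix}{filename}")
```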

Step 2: Go to the AWS Glue console and select ETL Jobs from the left menu pane.

Select the “Visual with a source and target” option, then click “Create”.

Select the Data source node, edit the “Data source properties – S3” tab, and provide the S3 bucket and the path in which the CSV files reside.

Select CSV as the Data format.

Now select the “Data target – S3 bucket” node, set the Format to Parquet and the Compression Type to Uncompressed.

Now select the Output file location.

Save the job we just created.

Now click “Run” to start the job.
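Although this job requires no hand-written code, AWS Glue Studio generates a PySpark script from the visual nodes behind the scenes. The sketch below approximates what such a generated script looks like, assuming placeholder S3 paths; the exact script produced for your job will differ and can be viewed on the job’s Script tab.

```python
# Approximate sketch of the script Glue Studio generates for a CSV -> Parquet job.
# S3 paths are placeholders; the real generated script will differ in detail.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source node: read the CSV files from the input prefix
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-demo-bucket/csv-input/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)

# Data target node: write the records back to S3 as uncompressed Parquet
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-demo-bucket/parquet-output/"},
    format="parquet",
    format_options={"compression": "uncompressed"},
)

job.commit()
```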

Step 3: You can see that the job is in the Running state.

Once the job is complete, you can view the Parquet-formatted files in your target path in the Amazon S3 bucket.
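The same check can also be scripted. Below is a small boto3 sketch, assuming a placeholder bucket and output prefix, that lists the objects the job wrote to the target path.

```python
# Minimal sketch: list the Parquet objects written to the job's target prefix.
# Bucket and prefix are placeholders; use the output location configured in the job.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-demo-bucket", Prefix="parquet-output/")

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```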

Conclusion

We have now easily converted CSV files to Parquet format without writing any program code. The nodes for your job are configured using the visual job editor. Each node represents a specific action, such as reading data from the source location or transforming the data, and each node you include in your job has properties that provide details about the transform or the location of the data.


About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 850k+ professionals in 600+ cloud certifications and completed 500+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partner, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR, and many more.

FAQs

1. When does billing for my AWS Glue jobs start and stop?

ANS: – Billing begins as soon as the job is scheduled to run and continues until the entire job is completed. With AWS Glue, you pay only for the time your job runs, not for environment provisioning or termination time.
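As an illustrative calculation (the per-DPU-hour rate varies by Region, so check the AWS Glue pricing page), a job that runs for 15 minutes on 10 DPUs at an assumed rate of $0.44 per DPU-hour would cost roughly 10 × 0.25 × $0.44 ≈ $1.10.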

2. What programming language can I use to develop AWS Glue ETL code?

ANS: – You have the option of using Scala or Python.

3. Can I include custom libraries in my ETL script?

ANS: – Yes. Custom Python libraries and Jar files can be imported into your AWS Glue ETL process.
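Custom libraries are typically attached through the special job parameters --extra-py-files (Python libraries) and --extra-jars (Jar files). Below is a minimal boto3 sketch of one way to set these when creating a job; the job name, role ARN, script location, and library paths are placeholders.

```python
# Hypothetical sketch: create a Glue ETL job that loads custom Python libraries and Jars.
# The role ARN, script location, and library S3 paths are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="csv-to-parquet-with-libs",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",        # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-demo-bucket/scripts/job.py",     # placeholder script
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        "--extra-py-files": "s3://my-demo-bucket/libs/my_helpers.zip",  # custom Python library
        "--extra-jars": "s3://my-demo-bucket/libs/my-connector.jar",    # custom Jar file
    },
)
```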

WRITTEN BY Deepak Surendran
