Case Study

Achieving a 60% Efficiency Boost with AWS Glue Jobs for a Software Company

Industry

Software Industry

Expertise

AWS Glue, AWS Lambda, Amazon S3, Amazon MSK, Amazon Athena

Offerings/Solutions

Seamless data processing and analysis across regions with our AWS-based solution, leveraging dynamic AWS Glue jobs, AWS Lambda-driven crawlers, and Amazon Athena for efficient dataset creation and querying.

About the Client

CustomFit.ai is an AI-powered platform launched in 2019 that specializes in precise personalization for B2B websites without requiring any coding. It utilizes AI to understand individual visitors and dynamically adjust website content based on their preferences and clickstream data.

Highlights

60%

Efficiency Boost with AWS Glue Jobs

40%

Reduced Processing Time

70%

Cost Reduction via Parquet Files

The Challenge

CustomFit.ai faced data management challenges due to limitations in its existing setup. The company used Amazon MSK to ingest and store streaming data in JSON format, amounting to approximately 3 GB per day. However, complex filtering and joins across tables were difficult in its NoSQL database, Cassandra. Although Cassandra had been reliable, its constraints impeded advanced filtering, complex joins, and the use of aggregate functions, which hindered CustomFit.ai’s data analysis and processing capabilities.

Solutions

  • The solution is deployed in the Oregon region, with Amazon S3 buckets replicated to the Singapore region.
  • Data is sourced from Amazon MSK via an MSK sink connector for Amazon S3 and stored as a single file in a day-wise partition in a staging bucket.
  • An AWS Glue crawler runs on this bucket, creating a raw table in the database.
  • Five Glue jobs run on top of the raw table every day, extracting data and storing files in five separate partitions in a processed data bucket. The extraction is based on keys taken from the payload.
  • Sub-partitions are created within the five main partitions, and these are further partitioned on a day-wise basis.
  • Crawlers run on the main partitions every day, populating the Glue Data Catalog.
  • The crawlers are created dynamically by an AWS Lambda function.
  • Event notifications from the processed data bucket trigger the AWS Lambda function, which creates the crawlers if required.
  • Amazon Athena is used to create datasets from the tables in the Glue Data Catalog, one per main data partition key (that is, n tables for n partitions).
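
The dynamic crawler creation described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the client's actual code: the database name, IAM role ARN, crawler naming scheme, and the idea that the first path segment of the object key is the main partition are all hypothetical. The boto3 Glue calls used (get_crawler, create_crawler) are real, and the Glue client is passed in so the logic can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

# Hypothetical values: the case study does not name the database or the IAM role.
DATABASE = "processed_db"
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/glue-crawler-role"


def partition_prefix(object_key: str) -> str:
    """Return the first path segment of an object key (assumed to be the main partition)."""
    return object_key.split("/", 1)[0]


def ensure_crawler(glue, bucket: str, main_partition: str) -> bool:
    """Create a crawler for the main partition if none exists. Returns True if one was created."""
    name = f"{main_partition}-crawler"  # hypothetical naming scheme
    try:
        glue.get_crawler(Name=name)
        return False  # crawler already exists, nothing to do
    except glue.exceptions.EntityNotFoundException:
        glue.create_crawler(
            Name=name,
            Role=CRAWLER_ROLE,
            DatabaseName=DATABASE,
            Targets={"S3Targets": [{"Path": f"s3://{bucket}/{main_partition}/"}]},
        )
        return True


def lambda_handler(event, context):
    """Entry point for the S3 event notification from the processed data bucket."""
    import boto3  # imported here so the helpers above can be tested without boto3

    glue = boto3.client("glue")
    created = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        partition = partition_prefix(key)
        if ensure_crawler(glue, bucket, partition):
            created.append(partition)
    return {"created": created}
```

Checking for an existing crawler before creating one keeps the function safe to invoke on every new object, which matters because the processed data bucket emits a notification for each file written.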

The Results

Scheduled AWS Glue jobs increased efficiency by 60%, enabled fine-grained data searching, reduced Amazon Athena processing time by 40%, and decreased data size by 70% by storing data as Parquet files.
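
As a rough illustration of how day-wise partitions support fine-grained searching in Amazon Athena, the sketch below builds a query that reads a single day's partition. The table name, column names, and the partition column dt are hypothetical; the case study does not state them, and a Glue crawler names partition columns partition_0, partition_1, and so on when the S3 paths are not in key=value form.

```python
from datetime import date


def _check_identifier(name: str) -> str:
    """Allow only simple identifiers so the generated SQL stays well-formed."""
    if not name.isidentifier():
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def build_daily_query(table: str, day: date, columns: list[str]) -> str:
    """Build an Athena SQL query restricted to one day-wise partition.

    Assumes a string partition column named `dt` holding ISO dates (hypothetical).
    """
    select_list = ", ".join(f'"{_check_identifier(c)}"' for c in columns)
    return (
        f'SELECT {select_list} FROM "{_check_identifier(table)}" '
        f"WHERE dt = '{day.isoformat()}'"
    )


# Example with a hypothetical table and columns
print(build_daily_query("page_view", date(2024, 1, 15), ["visitor_id", "url"]))
```

Filtering on the partition column lets Athena skip the other days' files, and selecting only the needed columns works well with a columnar format such as Parquet.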
