Data Pre-Processing Using SageMaker Data Wrangler – Part 1

Introduction

As more data of many different kinds arrives from multiple sources, the preprocessing steps needed to manage it inside ML pipelines become harder to build and maintain. To handle these steps, Amazon SageMaker provides a data preparation feature called SageMaker Data Wrangler. With Data Wrangler, we can preprocess large amounts of data within the pipeline itself; we only need to set up the flow of preprocessing steps in the Data Wrangler service.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler (Data Wrangler) is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. We can integrate a Data Wrangler data preparation flow into our machine learning (ML) workflows to simplify and streamline data pre-processing and feature engineering with little to no coding. We can also add our own Python scripts and transformations to customize workflows.

Amazon SageMaker Data Wrangler Core Functionalities

Import – We can connect to and import data from multiple sources such as Amazon Simple Storage Service (Amazon S3), Amazon Athena (Athena), Amazon Redshift, Snowflake, and Databricks.
Data Flow – We can create a data flow to define a series of ML data prep steps. We can use a flow to combine datasets from different data sources, identify the number and types of transformations we want to apply to the datasets, and define a data prep workflow that can be integrated into an ML pipeline.
Transform – We can clean and transform our dataset using standard transforms such as string, vector, and numeric data formatting tools. We can also featurize our data using transforms such as text and date/time embedding and categorical encoding.
Generate Data Insights – We can automatically verify data quality and detect anomalies in our data with the Data Wrangler Data Insights and Quality Report.
Analyze – We can analyze features in our dataset at any point in the flow. Data Wrangler includes built-in data visualization tools such as scatter plots and histograms, as well as data analysis tools such as target leakage analysis and quick modeling to understand feature correlation.
Export – We can export our data preparation workflow to a different location. The following are example locations:
- Amazon Simple Storage Service (Amazon S3) bucket
- Amazon SageMaker Model Building Pipelines – Use SageMaker Pipelines to automate model deployment. You can export the data that you’ve transformed directly to the pipelines.
- Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
- Python script – Store the data and their transformations in a Python script for your custom workflows.
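The Data Flow idea above, an ordered series of data prep steps applied to a dataset, can be sketched locally with pandas. This is only a conceptual illustration and not the Data Wrangler API: the `run_flow` helper, the step functions, and the column names are all hypothetical and exist only for this example.

```python
from typing import Callable, List

import pandas as pd

# A "step" is any function that takes a DataFrame and returns a new one.
Step = Callable[[pd.DataFrame], pd.DataFrame]


def drop_missing_age(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: drop rows where 'age' is missing."""
    return df.dropna(subset=["age"])


def add_age_in_months(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature step: derive a new column from 'age'."""
    return df.assign(age_months=df["age"] * 12)


def run_flow(df: pd.DataFrame, steps: List[Step]) -> pd.DataFrame:
    """Apply each step in order, like the nodes of a data flow."""
    for step in steps:
        df = step(df)
    return df


# Made-up input data standing in for an imported dataset.
raw = pd.DataFrame({"name": ["a", "b", "c"], "age": [30.0, None, 41.0]})
prepared = run_flow(raw, [drop_missing_age, add_age_in_months])
print(prepared)
```

Keeping each step as a small function that returns a new DataFrame leaves the input untouched, which loosely mirrors how a flow lets us inspect the data after any step.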

Data Transformation using SageMaker Data Wrangler

The SageMaker Data Wrangler Transform feature provides multiple functions to transform data. Here are some examples:

Join Datasets – We can join multiple datasets using the join operation.
Balance Data – We can handle imbalanced datasets using different sampling techniques.
Custom Transforms – We can use Python (User-Defined Function), PySpark, pandas, or PySpark (SQL) to define custom transformations.
Custom Formula – We can use a custom formula to define a new column using a Spark SQL expression that queries data in the current data frame.
Encode Categorical – We can encode categorical features in the flow.
Featurize Text – We can use the Featurize Text transform group to inspect string-typed columns and use text embedding to featurize these columns.
Transform Time Series – We can transform time series data in the flow.
Handle Outliers – We can detect and handle outliers in the flow.
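Data Wrangler applies these transforms through its own interface. To show what a few of them do conceptually, here is a minimal pandas sketch of joining, outlier handling, categorical encoding, and balancing. The datasets, column names, the 1.5 × IQR clipping rule, and the choice of random oversampling are all assumptions made for this example, not Data Wrangler's internal implementation.

```python
import pandas as pd

# Two small made-up datasets that share a key column.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "plan": ["basic", "pro", "basic", "basic", "pro", "basic"],
    "churned": [0, 0, 0, 0, 1, 0],
})
spend = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "monthly_spend": [20.0, 55.0, 22.0, 21.0, 60.0, 900.0],
})

# Join Datasets: inner join on the shared key.
joined = customers.merge(spend, on="customer_id", how="inner")

# Handle Outliers: clip values outside 1.5 * IQR of the quartiles.
q1 = joined["monthly_spend"].quantile(0.25)
q3 = joined["monthly_spend"].quantile(0.75)
iqr = q3 - q1
clipped = joined.assign(
    monthly_spend=joined["monthly_spend"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
)

# Encode Categorical: one-hot encode the 'plan' column.
encoded = pd.get_dummies(clipped, columns=["plan"])

# Balance Data: randomly oversample smaller classes up to the majority size.
majority_size = encoded["churned"].value_counts().max()
parts = []
for _, group in encoded.groupby("churned"):
    extra = majority_size - len(group)
    if extra > 0:
        # Draw extra rows with replacement from the under-represented class.
        resampled = group.sample(n=extra, replace=True, random_state=0)
        group = pd.concat([group, resampled])
    parts.append(group)
balanced = pd.concat(parts, ignore_index=True)

print(balanced["churned"].value_counts())
```

Random oversampling is only one of several possible sampling techniques; it duplicates existing minority rows rather than creating new ones, so it is best treated as a simple baseline.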

Conclusion

Amazon SageMaker Data Wrangler helps us preprocess data within the pipeline. It combines data transformation with multiple feature engineering steps, such as handling missing values, dealing with imbalanced data, and handling outliers, in the pipeline itself while maintaining data integrity. Before Data Wrangler, it was harder to find a single service that covered all of this. The feature is available in Amazon SageMaker Studio, and we can use it in real-world MLOps projects for the preprocessing stage and for loading the prepared data into a data warehouse.

Drop a query if you have any questions regarding Amazon SageMaker Data Wrangler and I will get back to you quickly.

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.