Data Pre-Processing using SageMaker Data Wrangler – Part 2

Introduction to SageMaker Data Wrangler

As more data of more kinds arrives from more sources, the preprocessing steps inside a machine learning pipeline become harder to manage. Amazon SageMaker addresses this with a data preparation feature called SageMaker Data Wrangler. With Data Wrangler, we can handle large volumes of data within the pipeline itself; we only need to set up the flow of preprocessing steps inside the Data Wrangler service.


Implementing Data Wrangler Flow

We use an Amazon SageMaker Data Wrangler flow, or data flow, to create and modify a data preparation pipeline. The data flow connects the datasets, transformations, and analyses (the steps) that we create, and it can be used to define the pipeline. Each Data Wrangler flow has an Amazon EC2 instance associated with it.

Navigate to the Amazon SageMaker Studio console and create a new flow under SageMaker Data Wrangler.

sm1

Now select an instance type that suits the preprocessing steps required in the pipeline.

sm2

After we click Save, the selected instance is used for the Data Wrangler flow.

Data Flow UI

When we import a dataset, it appears as the source in the Data Flow UI. Data Wrangler automatically infers the type of each column in the dataset and creates a new data frame named Data types. We can select this frame to update the inferred data types.
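To make the idea of inferred and then corrected column types concrete, here is a minimal plain-Python sketch. It is not Data Wrangler's actual inference logic; the `infer_type` helper, the type labels, and the sample columns are all assumptions made for illustration.

```python
from datetime import datetime

def infer_type(values):
    """Return a simple type label for a column of raw string values."""
    def all_parse(parser):
        # True only if every value in the column parses without error.
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "long"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"

# Hypothetical sample dataset, stored column by column.
columns = {
    "order_id": ["1001", "1002", "1003"],
    "amount": ["19.99", "5", "42.5"],
    "order_date": ["2023-01-05", "2023-01-06", "2023-02-11"],
    "zip_code": ["02139", "10001", "94105"],
}

# Step 1: infer a type for every column.
inferred = {name: infer_type(vals) for name, vals in columns.items()}

# Step 2: override an inferred type by hand. ZIP codes parse as integers,
# but they should stay strings so that leading zeros are preserved.
data_types = dict(inferred, zip_code="string")
```

The `zip_code` column shows why the inferred types are worth reviewing: a value can parse cleanly as a number and still be better treated as text.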

sm3

Each time we perform a transform step, we create a new data frame. When multiple transform steps (other than Join or Concatenate) are added to the same dataset, they are stacked.
Join and Concatenate create standalone steps that contain the new joined or concatenated dataset. The following diagram shows a data flow with a join between two datasets, as well as two stacks of steps.
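The stacking behaviour can be sketched with a small graph model. This is only an illustration of the idea, not the Data Wrangler flow file format: the node names, the `(kind, inputs)` layout, and the choice to treat the Data types node as standalone are all assumptions.

```python
# Each node maps to (kind, list of input node names). All names are hypothetical.
flow = {
    "orders_source": ("source", []),
    "orders_types": ("data_types", ["orders_source"]),
    "drop_nulls": ("transform", ["orders_types"]),
    "scale_amount": ("transform", ["drop_nulls"]),
    "customers_source": ("source", []),
    "customers_types": ("data_types", ["customers_source"]),
    "encode_region": ("transform", ["customers_types"]),
    "join_on_customer": ("join", ["scale_amount", "encode_region"]),
}

# Node kinds that always stand alone in this sketch (an assumption).
STANDALONE = {"source", "data_types", "join", "concatenate"}

def group_into_stacks(flow):
    """Group consecutive transform steps into stacks; other nodes stay alone."""
    groups = []
    stack_of = {}  # transform name -> the stack (list) it belongs to
    # Dicts keep insertion order, and nodes are listed after their inputs.
    for name, (kind, inputs) in flow.items():
        parent = inputs[0] if len(inputs) == 1 else None
        if kind not in STANDALONE and parent in stack_of:
            # Continue the stack that the parent transform already belongs to.
            stack_of[parent].append(name)
            stack_of[name] = stack_of[parent]
        else:
            group = [name]
            groups.append(group)
            if kind not in STANDALONE:
                stack_of[name] = group
    return groups

groups = group_into_stacks(flow)
# Keep only the groups that are stacks of transform steps.
stacks = [g for g in groups if flow[g[0]][0] == "transform"]
```

In this model the two transform chains collapse into two stacks, while the join remains a separate node with two inputs, which mirrors the layout described above.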

sm4

Adding Step in Data Flow

We can choose Edit data types to change the data types of the columns in the data frame.
We can choose Add transform to transform the columns that are present in the pipeline.
We can choose Add analysis to analyze our data at any point in the data flow.
We can join two datasets using the Join functionality inside the flow.
We can concatenate two datasets to form a new dataset as a step in the data flow.
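As a rough illustration of what the Join and Concatenate steps in the list above do to the data, the sketch below applies both to small in-memory tables. The helper functions and sample rows are hypothetical, and only an inner join is shown.

```python
# Hypothetical sample tables, each a list of row dictionaries.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 3, "amount": 12.0},
]
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]

def inner_join(left, right, key):
    """Inner join: keep left rows whose key also appears in right."""
    # Assumes the key is unique in the right table.
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def concatenate(first, second):
    """Concatenate: append the rows of second after the rows of first."""
    return [dict(r) for r in first] + [dict(r) for r in second]

# Join combines columns from two datasets on a shared key.
joined = inner_join(orders, customers, "customer_id")

# Concatenate stacks rows from two datasets with the same columns.
more_orders = [{"customer_id": 2, "amount": 8.25}]
all_orders = concatenate(orders, more_orders)
```

A join widens the dataset by adding columns from the second table, whereas a concatenation lengthens it by adding rows.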

Deleting Step from Data Flow

We can delete an individual step for nodes in the data flow that have a single input.
We can't delete individual steps for source, join, and concatenate nodes.
We can use the following procedure to delete a step in the Data Wrangler flow.

Choose the group of steps that contains the step we want to delete.
Choose the icon next to the step.
Choose Delete.
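The deletion rule can be expressed as a small check on a graph model of the flow. This is a sketch under stated assumptions, not the service's implementation: the flow layout and node names are hypothetical, and reconnecting the following steps to the deleted step's input is an assumption about how the remaining steps stay valid.

```python
def delete_step(flow, name):
    """Return a new flow without `name`, reconnecting its children to its input."""
    kind, inputs = flow[name]
    # Source, join, and concatenate nodes cannot be deleted individually.
    if kind in {"source", "join", "concatenate"} or len(inputs) != 1:
        raise ValueError(f"cannot delete {kind} node '{name}' individually")
    parent = inputs[0]
    return {
        node: (k, [parent if i == name else i for i in ins])
        for node, (k, ins) in flow.items()
        if node != name
    }

# Hypothetical flow: each node maps to (kind, list of input node names).
flow = {
    "orders_source": ("source", []),
    "drop_nulls": ("transform", ["orders_source"]),
    "scale_amount": ("transform", ["drop_nulls"]),
    "customers_source": ("source", []),
    "join_on_customer": ("join", ["scale_amount", "customers_source"]),
}

# Deleting a single-input step keeps the steps that follow it.
smaller = delete_step(flow, "drop_nulls")

# Deleting a join node individually is rejected.
try:
    delete_step(flow, "join_on_customer")
    join_deleted = True
except ValueError:
    join_deleted = False
```

The function returns a new dictionary rather than mutating the input, so the original flow stays available for comparison.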

sm5

Conclusion

Amazon SageMaker Data Wrangler helps us preprocess data within the pipeline. It maintains data integrity during preprocessing and provides transformations along with feature engineering steps such as handling missing values, dealing with imbalanced data, and handling outliers, all in the pipeline itself. These features are available through SageMaker Studio, and we can use them in real-world MLOps projects for the preprocessing stage and for loading the prepared data into a data warehouse.


About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. How is code secured with Amazon SageMaker?

ANS: – Amazon SageMaker stores code in ML storage volumes, which are secured by security groups and can optionally be encrypted at rest.

2. What safety measures are SageMaker packed with?

ANS: – SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest. Encrypted Amazon S3 buckets can be used for model artifacts and data. An AWS Key Management Service (KMS) key can be passed to SageMaker notebooks, training jobs, and endpoints to encrypt the attached ML storage volume. Requests to the SageMaker API and console are made over a secure (SSL) connection.