Simplify Text using AWS Mphasis DeepInsights Text Summarizer

Overview

In Part 1, we saw what the Mphasis DeepInsights Text Summarizer is and how it is applied in real-world scenarios. In this part, we implement the algorithm using the steps below.

To run the Text Summarizer algorithm, we need the following on AWS:

Access to AWS SageMaker and the model package.
An S3 bucket to specify input/output.
A role for AWS SageMaker to access input/output from S3.

Implementation of Text Summarizer Algorithm

Usage Information

Usage Methodology for the algorithm:

Input should have a ‘.txt’ extension with ‘utf-8’ encoding.
Note: Model execution will be interrupted if the ‘.txt’ file is not ‘utf-8’ encoded.
To ensure that the input data is ‘UTF-8’ encoded, use ‘Save As’ and select ‘UTF-8’ as the encoding.
The input can have a maximum of 512 words (the SageMaker limit).
The input should contain a minimum of 3 sentences (model restriction).
Supported content type: text/plain.
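The usage rules above can be checked locally before a file is sent to the model. The following is a minimal sketch, not part of the model package: the function name check_input is hypothetical, and it assumes that words are whitespace-separated tokens and that a sentence ends with '.', '!' or '?'. The model's own counting may differ.

```python
import re

MAX_WORDS = 512      # word limit stated in the usage notes
MIN_SENTENCES = 3    # minimum sentence count stated in the usage notes


def check_input(raw: bytes) -> list:
    """Return a list of problems found in the raw bytes of a .txt input file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["input is not valid UTF-8"]

    problems = []
    # Assumption: words are whitespace-separated tokens.
    if len(text.split()) > MAX_WORDS:
        problems.append("input has more than %d words" % MAX_WORDS)
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < MIN_SENTENCES:
        problems.append("input has fewer than %d sentences" % MIN_SENTENCES)
    return problems
```

An empty list means the file passed all three checks. To check a file, read it in binary mode and pass the bytes, for example check_input(open('./self_driving_test.txt', 'rb').read()).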

Invoking Endpoint

Python

Batch transform:
# Placeholder: replace with the S3 location of the input text file
sample = 'input text file url location'
transformer_x = model.transformer(1, 'ml.m5.xlarge')
transformer_x.transform(sample, content_type="text/plain")
transformer_x.wait()
print("Batch Transform final output " + transformer_x.output_path)

Set up the Environment

Update Boto Client and AWS SDK
First, update the Boto client and the AWS SDK in the SageMaker notebook. The cells that follow then set them up to invoke the launched API.

Private Beta Setup

The private beta is limited to the us-east-2 region, so the client we set up is hard-coded to the us-east-2 endpoint.
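As a rough illustration of what hard-coding the region can look like, the sketch below builds the settings for a region-pinned client. It assumes the client in question is the SageMaker runtime client, which the article does not state; the helper name runtime_client_kwargs is hypothetical.

```python
REGION = "us-east-2"  # the only region named for the private beta


def runtime_client_kwargs(region: str = REGION) -> dict:
    """Build keyword arguments for a region-pinned SageMaker runtime client."""
    return {
        "region_name": region,
        # Regional SageMaker runtime endpoint, written out explicitly only to
        # make the hard-coding visible.
        "endpoint_url": "https://runtime.sagemaker.%s.amazonaws.com" % region,
    }
```

With boto3 installed, the dictionary can be passed as boto3.client("sagemaker-runtime", **runtime_client_kwargs()).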

Sample input data

with open('./self_driving_test.txt', 'rb') as file_stream:
    input_text = file_stream.read().decode('utf-8')
print(input_text)

Output:

output1

output2

Create the session

The session remembers our connection parameters to SageMaker. We will use it to perform all of our SageMaker operations.

import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

Create Model

Now we use the model package to create a model.

model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/marketplace-text-summarizer-11-4'
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

Input File

Now we pull a sample input file for testing the model.

sample_txt="s3://aws-marketplace-mphasis-assets/Text Summarizer/self_driving.txt"

Batch Transform Job

Now let’s use the model we created to run a batch inference job and confirm that it works.

import json 
import uuid

transformer = model.transformer(1, 'ml.m5.xlarge')
transformer.transform(sample_txt, content_type='text/plain')
transformer.wait()
#transformer.output_path
print("Batch Transform complete")

Output from Batch Transform

Note: The following code requires the boto3 package to be installed on the system.

import boto3

print(transformer.output_path)
# output_path has the form s3://<bucket>/<job folder>, so index 3 of the '/' split is the job folder
bucketFolder = transformer.output_path.rsplit('/')[3]
#print(s3bucket,s3prefix)
s3_conn = boto3.client("s3")
bucket_name="sagemaker-us-east-2-786796469737"
# Batch transform writes the result as <input file name>.out
with open('result.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,bucketFolder+'/self_driving.txt.out', f)
print("Output file loaded from bucket")

Output:

s3://sagemaker-us-east-2-786796469737/marketplace-text-summarizer-11-4-2020-0-2020-04-11-17-47-35-070

Output file loaded from bucket
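The download code hard-codes the bucket name and picks the job folder by position. As an alternative, both parts can be derived from the output path itself. The sketch below is one possible way to do that with the standard library; the helper names split_s3_uri and output_key are hypothetical, and it assumes the output path is a plain s3://bucket/prefix URI like the one printed above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def output_key(prefix: str, input_name: str) -> str:
    """Key of a batch transform result: the input file name plus '.out'."""
    return "%s/%s.out" % (prefix.rstrip("/"), input_name)
```

With these helpers, bucket_name and the object key passed to download_fileobj could be computed from transformer.output_path instead of being typed in by hand.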

with open('./result.txt', 'rb') as file_stream:
    output_text = file_stream.read().decode('utf-8')
print(output_text)

Output:

output3

Invoking through Endpoint

This is another way of deploying the model, one that returns results as real-time inference. Here is a sample endpoint for reference.

import sagemaker as sage
from sagemaker import ModelPackage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

content_type='text/plain'
model_name='summarizer-model'
real_time_inference_instance_type='ml.c4.2xlarge'

model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/marketplace-text-summarizer-11-4'

def predict_wrapper(endpoint, session):
    # RealTimePredictor is the predictor class in version 1.x of the SageMaker Python SDK
    return sage.RealTimePredictor(endpoint, session,content_type=content_type)

#create a deployable model from the model package.
model = ModelPackage(role=role,
                    model_package_arn=model_package_arn,
                    sagemaker_session=sagemaker_session,
                    predictor_cls=predict_wrapper)

#deploy the model to a real-time endpoint on one instance
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

Invoking endpoint result through python code

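from io import StringIO

# Read the sample input; the context manager closes the file afterwards
with open('./self_driving_test.txt', mode='r', encoding='utf-8') as f:
    data=f.read()

prediction = predictor.predict(data)

# The prediction is returned as bytes, so decode it before printing
s=str(prediction,'utf-8')
data = StringIO(s)
print(data.read())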

Output:

output4
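Once the endpoint is deployed, it can also be called without the SageMaker SDK predictor, through the invoke_endpoint call of the boto3 "sagemaker-runtime" client. The sketch below is an illustration under assumptions: the function name summarize is hypothetical, the client is passed in by the caller, and the response is assumed to be UTF-8 text, as in the predictor example above.

```python
def summarize(runtime_client, endpoint_name: str, text: str) -> str:
    """Send text to a deployed endpoint and return the decoded response body."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/plain",      # the content type the model supports
        Body=text.encode("utf-8"),     # the model expects UTF-8 input
    )
    # boto3 returns the payload as a streaming body; read() yields bytes.
    return response["Body"].read().decode("utf-8")
```

A call could look like summarize(boto3.client("sagemaker-runtime", region_name="us-east-2"), "summarizer-model", data), where "summarizer-model" is the endpoint name used in the deployment step.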

Conclusion

Thus, we have seen how to extract useful information from a long text using the AWS Text Summarizer. The intention is to produce a coherent and fluent summary containing only the main points outlined in the document. Furthermore, applying text summarization reduces reading time, accelerates the process of searching for information, and increases the amount of information that can fit in a given space. The basic idea is to count the frequency of the words occurring in the text, assume that the most frequently occurring words are important given a frequency threshold, and summarize the text on that basis.

FAQs

1. Which algorithm is used in text summarization?

ANS: – Text summarization using the frequency method. In this method, we find the frequency of all the words in our text data and store each word and its frequency in a dictionary. After that, we tokenize our text data into sentences. The sentences that contain more high-frequency words are kept in our final summary.

# Requires the NLTK tokenizer and stopword data to be downloaded beforehand
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

data = "my name is neetika gupta.  It's my pleasure to got occasion to write composition for  abc related to nlp"

def solve(text):
    stopwords1 = set(stopwords.words("english"))
    words = word_tokenize(text)

    # Count how often each non-stopword occurs in the text
    freqTable = {}
    for word in words:
        word = word.lower()
        if word in stopwords1:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    # Score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    sentenceValue = {}
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq

    # Nothing to summarize if no sentence received a score
    if not sentenceValue:
        return ''

    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence]
    average = int(sumValues / len(sentenceValue))

    # Keep the sentences that score well above the average
    summary = ''
    for sentence in sentences:
        if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
            summary += " " + sentence
    return summary
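For readers without NLTK installed, the same frequency idea can be sketched with the standard library only. This is a variant, not the code above: it uses regular expressions instead of NLTK tokenizers, a tiny illustrative stopword set instead of NLTK's English list, whole-word matching instead of substring matching, and it does not round the average. The name summarize_by_frequency is hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword set, not NLTK's English list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def summarize_by_frequency(text: str, threshold: float = 1.2) -> str:
    """Keep sentences whose word-frequency score exceeds threshold * average."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words if w not in STOPWORDS)

    # Score each sentence by the text-wide frequency of each distinct word it
    # contains; stopwords are absent from the counter and contribute 0.
    scores = [sum(freq[w] for w in set(words)) for words in tokenized]
    if not scores:
        return ""
    average = sum(scores) / len(scores)
    kept = [s for s, score in zip(sentences, scores) if score > threshold * average]
    return " ".join(kept)
```

The threshold of 1.2 mirrors the factor used in the NLTK version; a lower value keeps more sentences and a higher value keeps fewer.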

2. How is automatic summarization of text helpful?

ANS: – Automatic text summarization is an exciting research area with several applications in industry. By condensing large amounts of information into a short form, summarization can support numerous downstream applications, such as creating news digests, report generation, news summarization, and headline generation. Summarization is the task of compressing text into a shorter version, reducing the size of the source text while preserving the important elements of the information and the meaning of the content. Since manual text summarization is time-consuming and often tedious, automating the task is gaining popularity, which provides a strong impetus for academic research.

Text summarization has important uses in various NLP-related tasks, such as text classification, question answering, legal text summarization, news summarization, and headline generation. Furthermore, summary generation can be integrated into these systems as an intermediate step, which helps reduce document length. In the era of big data, the amount of text data from various sources is exploding. This text is an invaluable source of information that must be effectively summarized to be useful. The increase in document availability calls for extensive research in the field of NLP on automatic text summarization.

Automatic text summarization means creating concise and fluent summaries without human intervention while preserving the meaning of the original text document. This is very challenging because, as humans, when we summarize a text, we usually read it in its entirety to deepen our understanding and then write a key-point summary. Since computers lack human knowledge and language skills, automatic text summarization is a difficult and non-trivial task. Various machine learning-based models have been proposed for it. Most of these methods model the problem as a classification problem, returning whether a sentence should be included in the summary. Other methods use topic information, Latent Semantic Analysis (LSA), sequence-to-sequence models, reinforcement learning, and adversarial processes.

WRITTEN BY Neetika Gupta

Neetika Gupta works as a Senior Research Associate at CloudThat and has experience deploying multiple data science projects to multiple cloud frameworks. She has deployed end-to-end AI applications for business requirements on cloud frameworks like AWS, Azure, and GCP, and has deployed scalable applications using CI/CD pipelines.