Extract Data from an Image Using AWS Textract

Overview

Many businesses and government organizations still extract data manually from scanned documents such as PDFs, tables, and forms, a process that is slow, expensive, and prone to errors. Modern technology has solved this problem to a large extent for structured forms, where data can be extracted without human intervention. In other cases, however, data arrives in a wide variety of unstructured documents with no consistent layout for the information. Amazon Textract uses machine learning to process many types of documents and accurately extract text, forms, and tables without manual templates or custom extraction code.

About Amazon Textract

Amazon Textract is a highly scalable machine learning (ML) service that automatically detects and extracts text, handwriting, and data from scanned documents such as images and PDFs. Beyond plain text detection, it can analyze a document for relationships between detected items, including tables, key-value pairs, and selection elements.

When an Amazon Textract operation processes a document, it returns the results as an array of Block objects or an array of ExpenseDocument objects. Both contain information about the detected items, including their location in the document and their relationships to other items.
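
To make the Block structure more concrete, here is a minimal sketch that follows the CHILD relationships from each LINE block to its WORD blocks. The sample_response dictionary is hand-written and heavily trimmed for illustration (real responses also carry fields such as Geometry and Confidence), and the helper name words_by_line is our own.

```python
# A trimmed, hand-written stand-in for a Textract response (illustrative only).
sample_response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Invoice 42",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "42"},
    ]
}


def words_by_line(response):
    """Map each LINE block's text to the texts of its child WORD blocks."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # A block lists related blocks by Id; CHILD links point to its words.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        result[block["Text"]] = [blocks_by_id[i]["Text"] for i in child_ids]
    return result


print(words_by_line(sample_response))
```

Looking blocks up by Id first keeps the traversal simple, since relationships refer to other blocks only by their identifiers.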

Use Cases

Importing documents and forms into business applications
Creating smart search indexes
Creating automated workflows for document processing
Maintaining compliance in document archives
Extracting text for natural language processing (NLP)
Extracting text for document classification

Architecture Diagram

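[Figure: architecture diagram. As described in the steps below, an image uploaded to the S3 bucket triggers the Lambda function, which calls Amazon Textract and writes the extracted text to CloudWatch Logs.]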

Steps to Set Up Amazon S3

Step 1: Open the Amazon S3 console.

Step 2: Click Create bucket. Enter a bucket name (e.g., data-extract-from-image) and select the AWS Region in which you want to create the bucket.

Step 3: Click Create bucket again to confirm and create the bucket.
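
The same bucket can also be created from code. The following is a minimal sketch using boto3, not part of the console walkthrough: the helper names are our own, the Region in the example comment is only a placeholder, and the call needs valid AWS credentials. One detail worth knowing is that us-east-1 must not be passed as a location constraint.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region):
    """Create the bucket in the given Region (requires AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Example (bucket names must be globally unique, so yours will differ):
# create_bucket("data-extract-from-image", "ap-south-1")
```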

Steps to Set Up AWS Lambda

Step 1: Open the AWS Lambda console.

Step 2: Click Create function and enter a function name (e.g., textract-lambda). Then select Python 3.9 as the runtime.

Step 3: Select an execution role, which defines the permissions of your Lambda function. Choose the option to create a new role with basic Lambda permissions, then click Create function.

Step 4: Open the Lambda function and go to the Configuration tab. Click Permissions, then click the role name to open the execution role.

Step 5: Attach the AmazonTextractFullAccess and AWSLambdaExecute policies to the Lambda execution role.
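
For readers who prefer scripting this step, the sketch below attaches the same two AWS managed policies through the IAM API. The function name attach_policies and the role name in the example comment are placeholders of our own; the role name is whatever the console generated for your function.

```python
# ARNs of the two AWS managed policies named in Step 5.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonTextractFullAccess",
    "arn:aws:iam::aws:policy/AWSLambdaExecute",
]


def attach_policies(iam_client, role_name, policy_arns=POLICY_ARNS):
    """Attach each managed policy to the role and return the ARNs attached."""
    for arn in policy_arns:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return list(policy_arns)


# Example (requires AWS credentials with IAM permissions):
# import boto3
# attach_policies(boto3.client("iam"), "<your-lambda-role-name>")
```

AmazonTextractFullAccess is convenient for a demo. For anything longer-lived, a narrower policy that allows only the Textract actions the function actually calls would be a safer choice.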

Step 6: Add the S3 bucket as a trigger for the Lambda function.
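
Behind the scenes, an S3 trigger is a bucket notification configuration that points at the function. The sketch below builds such a configuration as a plain dictionary; the helper name s3_trigger_config, the optional suffix filter, and the ARN placeholder are our own assumptions, not settings taken from the walkthrough.

```python
def s3_trigger_config(function_arn, suffix=None):
    """Build a NotificationConfiguration that invokes Lambda on new objects."""
    rule = {
        "LambdaFunctionArn": function_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if suffix:
        # Optional: only fire for object keys ending in the given suffix.
        rule["Filter"] = {
            "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
        }
    return {"LambdaFunctionConfigurations": [rule]}


# Example (requires AWS credentials; the ARN is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="data-extract-from-image",
#     NotificationConfiguration=s3_trigger_config("<function-arn>", ".png"),
# )
```

When the trigger is added in the console, the permission that lets S3 invoke the function is set up for you. When scripting it, that permission has to be granted separately on the Lambda side before S3 will accept the configuration.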

Step 7: Add the code to the Lambda function. The code uses the boto3 detect_document_text API, which detects text in the input document and returns it as machine-readable text. After adding the code, click the Deploy button to save and deploy it. (GitHub Link)
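
The handler below is a minimal sketch of what such a function might look like. It is not the code behind the GitHub link. It assumes the function is invoked by an S3 object-created event and that only LINE blocks are of interest; the helper extract_lines and the optional textract_client parameter (added so the handler can be exercised without AWS access) are our own.

```python
import urllib.parse


def extract_lines(textract_client, bucket, key):
    """Run DetectDocumentText on an S3 object and return its LINE texts."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def lambda_handler(event, context, textract_client=None):
    if textract_client is None:
        import boto3  # available by default in the Lambda Python runtime

        textract_client = boto3.client("textract")

    # Read the bucket and object key from the first record of the S3 event.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])

    lines = extract_lines(textract_client, bucket, key)
    for line in lines:
        print(line)  # printed output ends up in CloudWatch Logs
    return {"statusCode": 200, "lines": lines}
```

Decoding the key matters in practice: a file uploaded as "my invoice.png" appears in the event as "my+invoice.png", and passing the encoded form to Textract would point at an object that does not exist.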

Step 8: Upload an invoice image to the data-extract-from-image bucket.
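
The upload can be scripted as well. This sketch wraps the boto3 upload_file call with a basic file-extension check; the function name upload_document and the mapping from the supported formats to file extensions are our own assumptions.

```python
import os

# Assumed extensions for the formats Textract accepts (PNG, JPEG, TIFF, PDF).
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def upload_document(s3_client, path, bucket, key=None):
    """Upload a local file to S3 after a basic file-extension check."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or 'none'}")
    # Default the object key to the file name without its directory.
    key = key or os.path.basename(path)
    s3_client.upload_file(path, bucket, key)
    return key


# Example (requires AWS credentials; the local file name is a placeholder):
# import boto3
# upload_document(boto3.client("s3"), "invoice.png", "data-extract-from-image")
```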

Step 9: Check the function's log group in CloudWatch. The log events contain the text extracted from your image.
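
The same log events can be read programmatically. The sketch below assumes the default Lambda log group naming (/aws/lambda/ followed by the function name) and fetches events from the last few minutes; the function name log_messages and its default values are our own.

```python
import time


def log_messages(logs_client, function_name, minutes=15, limit=50):
    """Return log messages written by the function in the last `minutes`."""
    start_ms = int((time.time() - minutes * 60) * 1000)  # epoch milliseconds
    response = logs_client.filter_log_events(
        logGroupName=f"/aws/lambda/{function_name}",
        startTime=start_ms,
        limit=limit,
    )
    return [event["message"].rstrip() for event in response["events"]]


# Example (requires AWS credentials):
# import boto3
# for message in log_messages(boto3.client("logs"), "textract-lambda"):
#     print(message)
```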

Conclusion

In this blog, we learned how to use the Amazon Textract API to extract data from an image without any ML experience. This approach can speed up decision-making and applies to any industry that handles physical or scanned documents, such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders. We will discuss more use cases of other AWS services in our upcoming blogs.

FAQs

1. What document formats does Amazon Textract support?

ANS: – Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats. With the synchronous APIs, you can send a document either as an S3 object or as a byte array. With the asynchronous APIs, you send documents as S3 objects. If your document is already in one of the supported file formats (PDF, TIFF, JPEG, PNG), do not convert or resample it before uploading it to Amazon Textract.
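
As a small illustration of the two input styles, the sketch below builds the Document argument either from an S3 location or from a local file's bytes. The helper names are our own; boto3 accepts raw bytes here and handles the encoding itself.

```python
def document_from_s3(bucket, key):
    """Document argument that points at an object stored in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def document_from_file(path):
    """Document argument that carries the raw bytes of a local file."""
    with open(path, "rb") as handle:
        return {"Bytes": handle.read()}


# Example (requires AWS credentials; the file name is a placeholder):
# import boto3
# textract = boto3.client("textract")
# textract.detect_document_text(Document=document_from_file("invoice.png"))
# textract.detect_document_text(
#     Document=document_from_s3("data-extract-from-image", "invoice.png")
# )
```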

2. In which AWS Regions is Amazon Textract available?

ANS: – Amazon Textract is currently available in the US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai) Regions.

3. Are there any limits on the number of questions I can ask per document?

ANS: – Queries are processed on a per-page basis, and information can be extracted using queries through synchronous or asynchronous operations. A maximum of 15 queries per page is supported for synchronous operations. A maximum of 30 queries per page is supported for asynchronous operations.
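
To show how queries are passed to the API, here is a minimal sketch that builds a QueriesConfig for a synchronous AnalyzeDocument call and enforces the 15-query limit quoted above. The helper name build_queries_config and the sample questions are our own, and the check assumes a single-page document.

```python
MAX_SYNC_QUERIES_PER_PAGE = 15  # synchronous limit quoted in the answer above


def build_queries_config(questions):
    """Turn an {alias: question} mapping into a QueriesConfig dictionary."""
    if len(questions) > MAX_SYNC_QUERIES_PER_PAGE:
        raise ValueError(
            f"{len(questions)} queries exceed the synchronous limit of "
            f"{MAX_SYNC_QUERIES_PER_PAGE} per page"
        )
    return {
        "Queries": [
            {"Text": text, "Alias": alias} for alias, text in questions.items()
        ]
    }


# Example (requires AWS credentials; the questions are illustrative):
# import boto3
# boto3.client("textract").analyze_document(
#     Document={"S3Object": {"Bucket": "data-extract-from-image",
#                            "Name": "invoice.png"}},
#     FeatureTypes=["QUERIES"],
#     QueriesConfig=build_queries_config({
#         "TOTAL": "What is the invoice total?",
#         "DATE": "What is the invoice date?",
#     }),
# )
```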

WRITTEN BY Modi Shubham Rajeshbhai

Shubham Modi works as a Research Associate - Data and AI/ML at CloudThat. He is a focused and enthusiastic person, keen to learn new things in Data Science on the Cloud. He has worked on AWS, Azure, Machine Learning, and many more technologies.