Overview
Extracting data from documents has long been a manual task. Modern technology has largely solved this problem for structured forms, where data can be extracted without human intervention. In other cases, however, data arrives in a wide variety of unstructured documents that present information with no consistent layout. Many businesses and government organizations still extract data manually from scanned documents such as PDFs, tables, and forms, a process that is slow, expensive, and prone to errors. Amazon Textract uses machine learning to process a wide range of documents in real time, extracting text, forms, and tables without manual configuration or custom code.
About Amazon Textract
Amazon Textract is a highly scalable machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents such as images and PDFs. Beyond plain text detection, it can analyze a document for relationships between detected items, including related text, tables, key-value pairs, and selection elements.
When an Amazon Textract operation processes a document, it returns the results as an array of Block objects or an array of ExpenseDocument objects. Both contain information about the detected items, including their location in the document and their relationships to other items in the document.
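As a rough illustration of what working with Block objects can look like, the sketch below walks a response and collects each LINE block together with its confidence, bounding box, and the Ids of its child WORD blocks. The helper name lines_from_blocks and the sample response are hypothetical; the sample is hand-written to mimic the shape of a text detection response and is not real Textract output.

```python
from typing import Any, Dict, List


def lines_from_blocks(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect text, confidence, location, and child Ids for every LINE block."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and any other block types
        lines.append({
            "text": block.get("Text", ""),
            "confidence": block.get("Confidence"),
            # Location of the line, as ratios of the page width and height
            "box": block.get("Geometry", {}).get("BoundingBox"),
            # Relationship to other items: Ids of the WORD blocks in this line
            "child_ids": [
                block_id
                for rel in block.get("Relationships", [])
                if rel.get("Type") == "CHILD"
                for block_id in rel.get("Ids", [])
            ],
        })
    return lines


# Hand-written sample shaped like a text detection response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 001", "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1,
                                      "Width": 0.3, "Height": 0.05}},
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "001", "Confidence": 98.9},
    ]
}

for line in lines_from_blocks(sample_response):
    print(line["text"], line["child_ids"])
```

The same loop can be pointed at other block types by changing the BlockType filter.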
Use Cases
- Importing documents and forms into business applications
- Creating smart search indexes
- Creating automated workflows for document processing
- Maintaining compliance in document archives
- Text extraction for natural language processing (NLP)
- Text extraction for document classification
Architecture Diagram
Steps to Set Up Amazon S3
Step 1: Open the Amazon S3 console.
Step 2: Click Create bucket. Enter a bucket name (e.g., data-extract-from-image) and select the AWS Region in which you want to create the bucket.
Step 3: Click Create bucket to confirm.
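The same bucket could also be created from a script instead of the console. The sketch below is one possible way to do it with boto3, assuming AWS credentials are already configured; the helper names create_bucket_kwargs and create_bucket are hypothetical. Bucket names are globally unique, so the example name may already be taken and you may need to pick another.

```python
from typing import Any, Dict


def create_bucket_kwargs(bucket_name: str, region: str) -> Dict[str, Any]:
    """Build the arguments for the S3 create_bucket call."""
    kwargs: Dict[str, Any] = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint;
    # every other Region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name: str, region: str) -> None:
    """Create the bucket (requires boto3 and configured AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Example call (commented out because it changes your AWS account):
# create_bucket("data-extract-from-image", "ap-south-1")
```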
Steps to Set Up AWS Lambda
Step 1: Open the AWS Lambda console.
Step 2: Click Create function and enter a function name (e.g., textract-lambda). Then select Python 3.9 as the runtime.
Step 3: Choose the execution role that defines the permissions of your Lambda function. Select the option to create a new role with basic Lambda permissions, then click Create function.
Step 4: In the Lambda function, open the Configuration tab and click Permissions. Then click the role name.
Step 5: Attach the AmazonTextractFullAccess and AWSLambdaExecute policies to the Lambda execution role.
Step 6: Add the S3 bucket as a trigger for the Lambda function.
Step 7: Add the code to the Lambda function. The code uses the boto3 detect_document_text API, which detects text in the input document and returns it as machine-readable text. After adding the code, save it and click the Deploy button. (GitHub Link)
Step 8: Upload an invoice image to the data-extract-from-image bucket.
Step 9: Check the CloudWatch log groups. The log events for the function contain the data extracted from your image.
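The code behind the GitHub link is not reproduced here, so the following is only a minimal sketch of what a handler for Steps 6 to 9 might look like, not the original implementation. It reads the bucket and object key from the S3 trigger event, calls detect_document_text on that object, and prints each detected line so that it shows up in CloudWatch Logs. The helper names s3_objects_from_event and extract_lines are hypothetical.

```python
import urllib.parse
from typing import Any, Dict, List, Tuple


def s3_objects_from_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for every record in an S3 trigger event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+")
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def extract_lines(textract: Any, bucket: str, key: str) -> List[str]:
    """Run text detection on one S3 object and return its LINE texts."""
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def lambda_handler(event, context):
    """Entry point invoked by Lambda when an object is uploaded to the bucket."""
    import boto3  # bundled with the Lambda Python runtime

    textract = boto3.client("textract")
    results = {}
    for bucket, key in s3_objects_from_event(event):
        lines = extract_lines(textract, bucket, key)
        for line in lines:
            print(line)  # print output goes to the function's CloudWatch logs
        results[f"{bucket}/{key}"] = lines
    return results
```

Passing the Textract client into extract_lines keeps that function easy to exercise locally with a stand-in client before deploying.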
Conclusion
In this blog, we learned how to use the Amazon Textract API to extract data from an image without any ML experience. This solution can speed up decision-making and can be applied in any industry that handles physical or scanned documents such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders. We will discuss more use cases of other AWS services in our upcoming blogs.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. What document formats does Amazon Textract support?
ANS: – Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats. With synchronous APIs, you can send images either as an S3 object or as a byte array. For the asynchronous API, you can send S3 objects. If your document is already in one of the file formats that Amazon Textract supports (PDF, TIFF, JPG, PNG), do not convert or resample it before uploading it to Amazon Textract.
2. In which AWS Regions is Amazon Textract available?
ANS: – Amazon Textract is currently available in the following Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai).
3. Are there any limits on the number of questions I can ask per document?
ANS: – Queries are processed on a per-page basis, and information can be extracted using queries through synchronous or asynchronous operations. A maximum of 15 queries per page is supported for synchronous operations. A maximum of 30 queries per page is supported for asynchronous operations.
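To make the query limits concrete, here is a small sketch that checks a list of questions against the per-page limits quoted above before building a QueriesConfig value for the analyze_document API. The helper name build_queries_config is hypothetical, and the sketch assumes that all questions target the same page.

```python
from typing import Any, Dict, List

# Per-page limits as quoted in the FAQ answer
MAX_SYNC_QUERIES_PER_PAGE = 15
MAX_ASYNC_QUERIES_PER_PAGE = 30


def build_queries_config(questions: List[str],
                         asynchronous: bool = False) -> Dict[str, Any]:
    """Validate the number of questions and build a QueriesConfig dict."""
    limit = MAX_ASYNC_QUERIES_PER_PAGE if asynchronous else MAX_SYNC_QUERIES_PER_PAGE
    if len(questions) > limit:
        raise ValueError(
            f"{len(questions)} queries exceed the per-page limit of {limit}"
        )
    return {"Queries": [{"Text": question} for question in questions]}


config = build_queries_config(["What is the invoice total?"])
print(config)

# The config would then be passed to a synchronous call roughly like this
# (commented out because it needs boto3 and AWS credentials):
# textract.analyze_document(
#     Document={"S3Object": {"Bucket": "data-extract-from-image", "Name": "invoice.png"}},
#     FeatureTypes=["QUERIES"],
#     QueriesConfig=config,
# )
```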

WRITTEN BY Modi Shubham Rajeshbhai
Shubham Modi works as a Research Associate - Data and AI/ML at CloudThat. He is a focused and enthusiastic person, keen to learn new things in Data Science on the Cloud. He has worked on AWS, Azure, Machine Learning, and many more technologies.