Modern technology has largely solved the problem of manual data entry, and data can now be extracted from structured forms without human intervention. In other cases, however, data arrives in a wide variety of unstructured documents with no consistent layout for the information they contain. Many businesses and government organizations still extract data manually from scanned documents such as PDFs, tables, and forms, a process that is slow, expensive, and prone to errors. Amazon Textract uses machine learning to process a wide range of documents, accurately extracting text, forms, and tables without manual configuration or custom code.
About Amazon Textract
Amazon Textract is a highly scalable machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents such as images and PDFs. Beyond detecting plain text, it can analyze a document for relationships between items, including related text, tables, key-value pairs, and selection elements such as checkboxes.
When an Amazon Textract operation processes a document, it returns the results as an array of Block objects or, for expense analysis, an array of ExpenseDocument objects. Both contain information about the detected items, including their location in the document and their relationships to other items in the document.
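To make the Block structure more concrete, here is a minimal Python sketch of walking a text-detection response. The `sample_response` dictionary is a hand-written, heavily abbreviated illustration (real responses carry more fields and many more blocks), and `lines_with_words` is a hypothetical helper name, not part of any AWS SDK.

```python
def lines_with_words(response):
    """Return (line_text, top, word_texts) for each LINE block in a response."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    results = []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        # CHILD relationships of a LINE block point at its WORD blocks.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        words = [by_id[i]["Text"] for i in child_ids if i in by_id]
        # Bounding box values are ratios of the page size, not pixels.
        top = block["Geometry"]["BoundingBox"]["Top"]
        results.append((block["Text"], top, words))
    return results


# Hand-written, abbreviated sample for illustration only.
sample_response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Invoice 1001",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.3, "Height": 0.02}},
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "1001"},
    ]
}

for text, top, words in lines_with_words(sample_response):
    print(text, top, words)
```

The location lives under `Geometry`, and the relationship to other items lives under `Relationships`, which is how a line can be tied back to the individual words it contains.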
Common use cases for Amazon Textract include:
- Importing documents and forms into business applications
- Creating smart search indexes
- Creating automated workflows for document processing
- Maintaining compliance in document archives
- Extracting text for Natural Language Processing (NLP)
- Extracting text for document classification
Steps to Set Up Amazon S3
Step 1: Open the Amazon S3 console.
Step 2: Click Create bucket. Enter a bucket name (for example, data-extract-from-image) and select the AWS Region in which you want to work.
Step 3: Review the settings and click Create bucket to confirm.
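If you prefer to script the bucket creation instead of using the console, the steps above can be sketched with boto3 as follows. This is only an illustration: `create_bucket_kwargs` and `create_bucket` are hypothetical helper names, and the S3 client is passed in so that the region logic can be checked without AWS credentials.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 CreateBucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a
    # LocationConstraint; every other Region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(s3_client, bucket_name, region):
    """Create the bucket using an already configured boto3 S3 client."""
    return s3_client.create_bucket(**create_bucket_kwargs(bucket_name, region))
```

With boto3 installed and credentials configured, a call might look like `create_bucket(boto3.client("s3", region_name="ap-south-1"), "data-extract-from-image", "ap-south-1")`, where the Region is an assumption chosen for the example. Bucket names are globally unique, so that exact name may already be taken.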
Steps to Set Up AWS Lambda
Step 1: Open the AWS Lambda console.
Step 2: Click Create function and enter a function name (for example, textract-lambda). Then select Python 3.9 as the runtime.
Step 3: Choose an execution role, which defines the permissions of your Lambda function. Select the option to create a new role with basic Lambda permissions and click Create function.
Step 4: Inside the Lambda function, open the Configuration tab and click Permissions. Then click the role name.
Step 5: Attach the AmazonTextractFullAccess and AWSLambdaExecute policies to the Lambda execution role.
Step 6: Add the S3 bucket as a trigger for the Lambda function.
Step 7: Add the code to the Lambda function. The code uses the boto3 detect_document_text API, which detects text in the input document and returns it as machine-readable text. After adding the code, click the Deploy button to save and deploy it. (GitHub Link)
Step 8: Upload an invoice image to the data-extract-from-image bucket.
Step 9: Check the CloudWatch log groups. The log events for the function contain all the data extracted from the image.
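For readers who want a picture of what the function in Step 7 might look like, here is a minimal sketch. It is not the code from the linked repository: the helper names `s3_objects_from_event` and `extract_lines` are hypothetical, and the handler assumes the trigger is a standard S3 event notification and that the uploaded file is an image the synchronous API accepts.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def extract_lines(textract_client, bucket, key):
    """Run DetectDocumentText on one S3 object and return its LINE texts."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without the AWS SDK;
    # in a real function a module-level import is the more common choice.
    import boto3

    textract = boto3.client("textract")
    results = {}
    for bucket, key in s3_objects_from_event(event):
        lines = extract_lines(textract, bucket, key)
        # Anything printed here ends up in the function's CloudWatch log group.
        print(json.dumps({"bucket": bucket, "key": key, "lines": lines}))
        results[key] = lines
    return {"statusCode": 200, "body": json.dumps(results)}
```

Keeping the Textract call in its own helper that receives the client makes it straightforward to test with a stub before deploying, and the printed JSON is what you would look for in the log events in Step 9.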
In this blog, we learned how to use the Amazon Textract API to extract data from an image without any ML experience. This solution can improve decision-making efficiency by reducing manual data entry, and it can be applied in any industry that handles physical or scanned documents such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders. We will discuss more use cases of other AWS services in our upcoming blogs.
Drop a query if you have any questions regarding Textract and I will get back to you quickly.
1. What document formats does Amazon Textract support?
ANS: – Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats. With the synchronous APIs, you can send images either as an S3 object or as a byte array. With the asynchronous APIs, you can send S3 objects only. If your document is already in one of the supported formats, do not convert or resample it before uploading it to Amazon Textract.
2. In which AWS Regions is Amazon Textract available?
ANS: – Amazon Textract is currently available in the following Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai).
3. Are there any limits on the number of questions I can ask per document?
ANS: – Queries are processed on a per-page basis, and information can be extracted using queries through synchronous or asynchronous operations. A maximum of 15 queries per page is supported for synchronous operations. A maximum of 30 queries per page is supported for asynchronous operations.
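As a rough illustration of the query limits in the last answer, the sketch below builds the `QueriesConfig` argument for the synchronous `analyze_document` call and rejects a question list that is too long. The limits are simply the numbers quoted above, the helper names are hypothetical, and the check assumes that every question targets the same page.

```python
# Per-page limits as quoted in the FAQ answer above.
MAX_QUERIES_PER_PAGE = {"sync": 15, "async": 30}


def build_queries_config(questions, mode="sync"):
    """Build a QueriesConfig dict, rejecting lists over the quoted limit."""
    limit = MAX_QUERIES_PER_PAGE[mode]
    if len(questions) > limit:
        raise ValueError(
            f"{len(questions)} queries exceed the {mode} limit of {limit} per page"
        )
    return {"Queries": [{"Text": question} for question in questions]}


def analyze_with_queries(textract_client, bucket, key, questions):
    """Call the synchronous AnalyzeDocument API with the QUERIES feature."""
    return textract_client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["QUERIES"],
        QueriesConfig=build_queries_config(questions, mode="sync"),
    )
```

For example, `build_queries_config(["What is the invoice number?"])` returns a single-entry configuration, while passing 16 questions in `sync` mode raises a `ValueError` before any request is sent.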
WRITTEN BY Modi Shubham Rajeshbhai
Shubham Modi works as a Research Associate - Data and AI/ML at CloudThat. He is a focused and very enthusiastic person, keen to learn new things in Data Science on the Cloud. He has worked on AWS, Azure, Machine Learning, and many more technologies.