Building an Intelligent Document Extraction Pipeline with Amazon Bedrock and Amazon Textract

Overview

Organizations handle diverse document types such as contract notes, identity proofs, resumes, and invoices, each with varying structures and complexity. While some documents follow fixed schemas, others are highly unstructured, and scanned files add further challenges due to the lack of machine-readable text. Managing separate extraction pipelines for each type increases complexity, costs, and inconsistency. This creates a need for a unified intelligent system that can automatically classify documents, apply the appropriate extraction method, and generate structured outputs efficiently without manual intervention.

Pioneers in Cloud Consulting & Migration Services

Reduced infrastructural costs
Accelerated application deployment

Get Started

How the System Works?

Path 1: Contract Notes (Digital PDFs)

When a document is identified as a contract note based on specific keywords such as trade date, settlement number, or brokerage, it is processed using a schema-driven approach.

Extraction Tool: pdfplumber
Strategy: Structured extraction using predefined mapping logic
Output: Header, trades, and levies in a fixed schema

This ensures high accuracy and consistency for financial documents.

Path 2: General Documents (Digital PDFs)

If the document contains readable text but does not match contract note patterns, it is treated as a general document.

Extraction Tool: pdfplumber
Strategy: Open-ended extraction using LLM
Output: Flexible key-value pairs

This approach allows the system to adapt to resumes, invoices, and other unstructured documents without predefined schemas.

Path 3: Scanned Documents

When the system detects very low text content (e.g., fewer than 50 characters per page), it classifies the document as scanned.

Extraction Tool: Amazon Textract
Strategy: OCR + LLM-based structuring
Output: Dynamic key-value pairs

This fallback mechanism ensures that even image-based documents are processed effectively.

Key Components of the System

Digital vs Scanned Detection

The first step is identifying whether a document is digital or scanned. Instead of relying on metadata, we use a simple yet effective technique, character count per page.

If the extracted text is below a certain threshold, the document is classified as scanned. This method is lightweight, fast, and highly reliable in real-world scenarios.

Document Type Classification

Once the document is confirmed as digital, the system checks for domain-specific keywords.

For example:

Trade Date
Settlement No
Brokerage

If these are present, the document is routed to the contract note pipeline. Otherwise, it is treated as a general document.

This rule-based classification provides a strong baseline without requiring complex machine learning models.

Text Extraction with pdfplumber

For digital PDFs, pdfplumber plays a crucial role. It helps extract:

Raw text
Tables
Structured content

This extracted data is then passed to the LLM for further processing and structuring.

OCR with Amazon Textract

For scanned documents, Amazon Textract acts as the OCR engine. It converts images into machine-readable text while preserving layout information.

Amazon Textract is particularly useful for:

Identity documents
Signed forms
Low-quality scans

This ensures that no document is left unprocessed due to format limitations.

LLM-Based Structuring with Amazon Bedrock

Once text is extracted, the next step is structuring it. This is handled using Amazon Bedrock with powerful models like Qwen.

We use two different prompting strategies:

Schema-Driven Prompt (for Contract Notes)

Enforces a strict structure
Maps fields like UCC, trade number, and charges
Ensures consistency and accuracy

Open-Ended Prompt (for General Documents)

Extracts all relevant information
Adapts to different document formats
Produces flexible key-value outputs

This dual approach allows the system to handle both structured and unstructured data efficiently.

Architecture Flow

The document extraction pipeline has a smart, modular flow that ensures each document is processed in the best possible way. The design of the architecture makes it scalable, fault-tolerant, and able to work with different types of documents.

Step-by-Step Guide

Step 1: Document Digestion

The process starts when a user uploads a document using a frontend app, an API, or a batch processing system.

PDF (digital or scanned) is a supported format.
The document is kept for a short time, as in object storage such as Amazon S3.
The backend service gets a request to process

This step ensures that all incoming documents are the same before processing begins.

Step 2: Initial Text Extraction

The system tries to use pdfplumber to extract text from the file.

Gets tables and plain text from the PDF
Works well for PDFs that were made on a computer
Offers a quick way to check the content of a document

This step serves as both a data extraction and validation layer to determine what kind of document it is.

Step 3: Digital vs Scanned Classification

After the text is taken out, the system counts the characters.

If there is more text on each page than the limit, the document is classified as Digital.
If the amount of text on a page is less than a certain amount, the document is called Scanned.

This simple heuristic doesn’t rely on complex ML models, yet it still works very well in real life.

Step 4: OCR Processing

If the document is marked as scanned:

Amazon Textract gets the document.
OCR extracts text without altering the layout or structure.
Output includes forms, tables, and text blocks that were found

This makes sure that even documents that are just pictures or of poor quality can be read by machines.

Step 5: Classification by Document Type

For digital documents (and/or OCR-generated text), the following keyword-based classification technique is employed.

Search for the presence of certain industry keywords:

Trade Date
Settlement Number
Brokerage

If found, it is classified as a Contract Note

Otherwise, it is classified as a General Document

The above methodology is quick and efficient, requiring little computation.

Step 6: Intelligent Routing Layer

Using the classification result, the document will be channeled into one of three pipelines:

Contract Note Pipeline (structured documents)
General Documents Pipeline (unstructured digital documents)
Scanned Documents Pipeline (OCR-based processing)

This step guarantees that each document is processed using the most appropriate extraction method.

Step 7: LLM-Based Structuring (Amazon Bedrock)

The text is then sent to the LLM using Amazon Bedrock to structure the information.

Schema-Driven Prompt (Contract Notes)

Fulfills a specific schema
Extracts values like trade, charges, and identifiers
Produces accurate and consistent results

Unstructured Prompt (General and Scanned Documents)

Automatically structures the key-value pairs
Flexible to handle multiple document types

Step 8: Output Structuring & Validation

The response from LLM is:

Converted to JSON format
Validated against schema (as per requirement)
Cleansed and normalized for consumption

Step 9: Response Delivery

The structured response will now be:

Passed as an API response
Persisted to the database/data lake
Passed to downstream systems

This facilitates smooth integration with enterprise systems.

Production Considerations

For real-world deployment, a few factors are important:

Optimize OCR usage to control cost
Monitor token usage in LLM calls
Add validation layers for structured outputs
Handle edge cases like mixed-content PDFs
Implement logging and traceability for debugging

These considerations ensure the system remains reliable and cost-efficient at scale.

Conclusion

Document processing is no longer just about extracting text, it’s about understanding context and producing structured, meaningful outputs.

By combining intelligent routing, Amazon Textract OCR, and LLM-powered structuring via Amazon Bedrock, we can build a unified pipeline that handles any document type seamlessly.

This approach eliminates the need for multiple pipelines, reduces operational complexity, and enables scalable automation across use cases.

Drop a query if you have any questions regarding Document processing and we will get back to you quickly.

Empowering organizations to become ‘data driven’ enterprises with our Cloud experts.

Reduced infrastructure costs
Timely data-driven decisions

Get Started

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As an AWS Premier Tier Services Partner, AWS Advanced Training Partner, Microsoft Solutions Partner, and Google Cloud Platform Partner, CloudThat has empowered over 1.1 million professionals through 1000+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 14 awards in the last 9 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, Security, IoT, and advanced technologies like Gen AI & AI/ML. It has delivered over 750 consulting projects for 850+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. How does the system automatically detect whether a document is scanned or digital?

ANS: – The system uses a simple character-count threshold after extracting text with pdfplumber. If the extracted text is very low (e.g., less than 50 characters per page), the document is classified as scanned. In such cases, OCR is performed using Amazon Textract to extract readable content.

2. How does the pipeline differentiate between contract notes and general documents?

ANS: – The system uses keyword-based detection for classification. If specific financial keywords such as “Trade Date,” “Settlement Number,” or “Brokerage” are present, the document is treated as a contract note and processed using a schema-driven approach. Otherwise, it is handled as a general document with an open-ended extraction strategy using Amazon Bedrock.

3. Why are two different prompt strategies used in the system?

ANS: – Two prompt strategies ensure flexibility and accuracy. Schema-driven prompts are used for structured documents, such as contract notes, to maintain consistency and precise field mapping. Open-ended prompts are used for general or unknown document types to extract all possible information dynamically. This combination allows the system to handle both fixed and flexible document formats effectively.

WRITTEN BY Balaji M

Balaji works as a Research Associate in Data and AIoT at CloudThat, specializing in cloud computing and artificial intelligence–driven solutions. He is committed to utilizing advanced technologies to address complex challenges and drive innovation in the field.