Extracting Invoice and Receipt Data as JSON Using Fine-Tuned VLMs

Introduction

Extracting structured data from documents such as invoices, receipts, and contracts remains a persistent challenge for many organizations. Layout variability, vendor-specific formatting, and noisy scans make traditional rule-based OCR pipelines unreliable. Vision-Language Models (VLMs), which combine image understanding with natural-language reasoning, offer a more resilient alternative.

By fine-tuning a VLM on annotated documents and their corresponding JSON outputs, multipage PDFs or image files can be transformed into clean, structured JSON with high accuracy and schema conformity. Using Amazon SageMaker AI and the Swift fine-tuning framework, enterprises can now create scalable, production-grade document-to-JSON pipelines.
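
For instance, a single invoice might be reduced to a target record like the following (the keys shown are illustrative; real schemas are defined per use case):

{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "2024-03-15",
  "vendor_name": "Acme Supplies Ltd.",
  "total_amount": 45.00,
  "purchase_order": null
}

Note that absent fields (here, purchase_order) are emitted as null rather than omitted, keeping every output schema-consistent.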

Why Does Fine-Tuning Outperform Prompting?

Several methods exist for intelligent document processing:

  • Zero-shot prompting: Passing a document and a textual instruction to a foundation model to generate JSON. It works for general cases but struggles with schema consistency and layout variations.
  • Few-shot prompting: Adding a few example documents and their corresponding outputs improves accuracy but still fails when layouts differ widely.
  • Retrieval-augmented few-shot prompting (RAG): Dynamically fetching similar examples to include in the prompt improves results further but increases latency and cost because prompts grow larger.
  • Fine-tuning a VLM: Training the model on labeled document images and expected JSON outputs allows it to internalize the schema and relationships, ensuring consistent extraction without prompt complexity.

Fine-tuning moves intelligence into the model’s parameters, reducing reliance on prompt engineering. Once trained, inference becomes faster, cheaper, and more reliable.

Architecture and Tooling Overview

Core Components

  • Amazon SageMaker AI provides a managed training and deployment environment, enabling distributed training, job tracking, and endpoint hosting.
  • Swift Fine-Tuning Framework (ModelScope's ms-swift) simplifies the orchestration of fine-tuning workflows and supports parameter-efficient techniques such as LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation).
  • Vision-Language Base Model, such as Qwen2.5-VL or similar architectures, serves as the foundation. These models understand both visual structure and textual semantics, making them ideal for document interpretation.

Preparing the Training Dataset

The dataset forms the foundation for fine-tuning. Proper preparation ensures robustness and schema consistency.

1. Source Dataset

A diverse dataset of invoices or similar documents with annotated fields is essential. The dataset should cover variations in layout, font, and field placement.

2. Format Conversion

Documents, often available as PDFs, should be converted into page-wise images. The corresponding annotations must be standardized into a uniform JSON format. Missing keys should be included with null values to maintain schema uniformity.
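
A minimal sketch of this step, assuming pdf2image (which wraps Poppler) is installed; the schema keys and annotation shape are placeholders:

import json
from pathlib import Path

from pdf2image import convert_from_path  # requires the Poppler utilities

SCHEMA_KEYS = ["invoice_number", "invoice_date", "vendor_name", "total_amount"]

def prepare_document(pdf_path: str, raw_annotation: dict, out_dir: str) -> dict:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Render each PDF page to a PNG so the VLM receives page-wise images.
    image_paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=200), start=1):
        path = out / f"{Path(pdf_path).stem}_page{i}.png"
        page.save(path)
        image_paths.append(str(path))
    # Normalize the annotation: every schema key is present; missing ones are null.
    label = {key: raw_annotation.get(key) for key in SCHEMA_KEYS}
    return {"images": image_paths, "label": json.dumps(label)}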

3. Prompt Template and Key Definition

Each training sample includes a prompt specifying the expected JSON keys. Multipage documents are represented using sequential <image> tokens to preserve page order.
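
In ms-swift's conversational dataset format, a two-page training record could look like the following JSONL line (the field names follow Swift's custom-dataset convention; the key list and file names are illustrative):

{"messages": [{"role": "user", "content": "<image><image>Extract the following fields as JSON: invoice_number, invoice_date, vendor_name, total_amount."}, {"role": "assistant", "content": "{\"invoice_number\": \"INV-2024-0042\", \"invoice_date\": \"2024-03-15\", \"vendor_name\": \"Acme Supplies Ltd.\", \"total_amount\": 45.00}"}], "images": ["doc42_page1.png", "doc42_page2.png"]}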

4. Train-Validation Split

A standard data split (for instance, 80/10/10) helps track generalization and prevents overfitting. Validation accuracy guides hyperparameter tuning.
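
A simple, reproducible split at those ratios can be done in a few lines:

import random

def split_dataset(samples: list, seed: int = 42):
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],                 # 80% train
            shuffled[n_train:n_train + n_val],  # 10% validation
            shuffled[n_train + n_val:])         # 10% test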

Fine-Tuning Workflow

1. Choose the Fine-Tuning Strategy

  • Full Fine-Tuning: Updates all model weights, offering maximum flexibility but at high computational cost.
  • PEFT (LoRA / DoRA): Updates a small subset of parameters via adapter layers, greatly reducing compute needs while maintaining strong accuracy.

2. Configure the Training Job

Using Amazon SageMaker AI, define the compute instance type, Amazon S3 data paths, model ID, and LoRA configuration. Training can be initiated as a remote function running on GPU-backed instances such as ml.g6.8xlarge.
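
A hedged sketch of launching the job with SageMaker's @remote decorator; the model ID, dataset paths, and ms-swift CLI flags below are illustrative and vary by version:

import subprocess

from sagemaker.remote_function import remote

@remote(instance_type="ml.g6.8xlarge")
def fine_tune(train_jsonl: str, val_jsonl: str) -> None:
    # Shell out to the ms-swift CLI; verify flag names against your installed version.
    subprocess.run(
        ["swift", "sft",
         "--model", "Qwen/Qwen2.5-VL-7B-Instruct",
         "--train_type", "lora",            # PEFT via LoRA adapters
         "--lora_rank", "8",
         "--dataset", train_jsonl,
         "--val_dataset", val_jsonl,
         "--num_train_epochs", "3",
         "--output_dir", "/opt/ml/model"],  # SageMaker uploads this directory to S3
        check=True,
    )

# fine_tune("train.jsonl", "val.jsonl")  # paths are placeholders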

3. Optimize Cost and Efficiency

Spot Instances offer significantly lower costs during experimentation, as shown in the sketch below. Because Spot capacity can be reclaimed at any time, early stopping and checkpointing should be enabled to conserve resources and avoid losing progress.
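
With the classic Estimator API, Spot usage and checkpointing are opt-in flags (the image URI, role, and bucket below are placeholders):

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                 # placeholder
    role="<execution-role-arn>",                      # placeholder
    instance_count=1,
    instance_type="ml.g6.8xlarge",
    use_spot_instances=True,                          # request Spot capacity
    max_run=36000,                                    # cap on training seconds
    max_wait=72000,                                   # must be >= max_run for Spot
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume after interruption
)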

4. Monitor and Log Metrics

Training logs should capture loss, validation accuracy, and field-wise errors. Consistent monitoring helps detect underfitting or overfitting early.

Model Evaluation

Evaluating fine-tuned models requires metrics tailored for structured output:

  • Exact Match (EM): Measures the proportion of fields where the extracted text matches the ground truth exactly.
  • Character Error Rate (CER): Computes normalized edit distance, indicating minor textual inaccuracies.
  • Schema Compliance: Ensures that every output JSON adheres to the predefined key structure.
  • Visual-Layout Accuracy: Assesses how well the model captures spatially dependent fields like “Invoice Total” or “Address.”

Comparing the baseline (untuned) performance against the fine-tuned version typically shows a notable boost in key-level accuracy and a reduction in schema violations.
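
These field-level checks are straightforward to script; a minimal sketch, with CER computed as a plain normalized edit distance and key names as placeholders:

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def field_metrics(pred: dict, gold: dict) -> dict:
    keys = list(gold)
    em = sum(str(pred.get(k)) == str(gold[k]) for k in keys) / len(keys)
    cer = sum(edit_distance(str(pred.get(k, "")), str(gold[k]))
              / max(len(str(gold[k])), 1) for k in keys) / len(keys)
    schema_ok = set(pred) == set(gold)   # schema compliance: exact key match
    return {"exact_match": em, "cer": cer, "schema_compliant": schema_ok}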

Deployment Strategies with Amazon SageMaker AI

After fine-tuning, deployment can be tailored to different operational needs.

1. Real-Time Inference Endpoints

The fine-tuned model can be deployed as an API endpoint on Amazon SageMaker, serving document-to-JSON conversions in real time.
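
A hedged deployment sketch using the generic SageMaker Model class; the serving image, artifact path, and role are placeholders:

from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-uri>",          # placeholder serving container
    model_data="s3://my-bucket/model.tar.gz",   # fine-tuned weights (placeholder)
    role="<execution-role-arn>",
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.8xlarge",
)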

2. Batch Inference Pipelines

For large document sets, Amazon SageMaker batch transform or asynchronous inference provides scalable, cost-effective processing.
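
The same Model object from the deployment sketch above can drive an offline batch transform job (bucket paths are placeholders):

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.g6.8xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(
    data="s3://my-bucket/batch-input/",   # prefix holding serialized requests
    content_type="application/json",
)
transformer.wait()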

3. Inference Components and Pipelines

Combining preprocessing (e.g., PDF to image), the fine-tuned VLM, and post-processing (e.g., JSON validation) into a pipeline ensures end-to-end automation.

4. Resource Management

Decommissioning endpoints and cleaning up Amazon S3 or Amazon ECR artifacts prevents unnecessary costs and maintains compliance.
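
Endpoint teardown can be scripted as part of the cleanup step, using the predictor from the deployment sketch above:

# Delete the endpoint, its configuration, and the model to stop per-hour charges.
predictor.delete_endpoint(delete_endpoint_config=True)
predictor.delete_model()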

Best Practices and Optimization Tips

  • Maintain consistent JSON schemas across all training samples.
  • Include documents with diverse layouts, fonts, and resolutions to improve generalization.
  • Use PEFT for quick iterations and reduced compute cost.
  • Evaluate models on a field-wise basis rather than overall averages to pinpoint weaknesses.
  • Explicitly handle multi-page documents with ordered <image> tokens.
  • Integrate automated cleanup and monitoring scripts for operational stability.

Conclusion

Fine-tuning vision-language models with Amazon SageMaker AI and the Swift framework enables accurate, schema-compliant, and scalable document-to-JSON extraction. This approach eliminates dependence on fragile OCR systems and manual rule-sets.

Proper dataset preparation, efficient tuning strategies, and comprehensive evaluation enable document processing to become a fully automated, production-grade workflow. Businesses gain reliable, structured data extraction that scales effortlessly while reducing cost and latency.

Drop a query if you have any questions regarding Amazon SageMaker AI and we will get back to you quickly.



FAQs

1. What volume of labeled data is required for effective fine-tuning?

ANS: – A few hundred annotated examples, typically between 200 and 500, are often sufficient when using parameter-efficient fine-tuning. Broader layout diversity improves generalization.

2. How does PEFT compare to full fine-tuning?

ANS: – Parameter-efficient fine-tuning (LoRA, DoRA) modifies only adapter layers rather than all model parameters. It significantly reduces GPU memory usage and training costs while achieving similar performance, particularly on limited datasets.

3. Can this approach handle handwritten or noisy scanned documents?

ANS: – Yes, provided the training set includes examples of such noise. Since VLMs analyze image pixels directly, they can learn to handle imperfections when properly exposed during fine-tuning.

WRITTEN BY Daniya Muzammil

Daniya works as a Research Associate at CloudThat, specializing in backend development and cloud-native architectures. She designs scalable solutions leveraging AWS services with expertise in Amazon CloudWatch for monitoring and AWS CloudFormation for automation. Skilled in Python, React, HTML, and CSS, Daniya also experiments with IoT and Raspberry Pi projects, integrating edge devices with modern cloud systems.
