AWS, Cloud Computing, Data Analytics

Bridging Natural Language and Complex SQL in Multi-Database Environments

Introduction

Enterprises often store hundreds or thousands of complex, domain-specific tables across multiple environments, such as Amazon Aurora, Oracle, and Teradata, covering a broad spectrum of domains, such as security, finance, and human resources. Extracting this data requires domain knowledge and complex queries. Large language models (LLMs) and generative artificial intelligence have advanced natural language to SQL (NL2SQL) technology.

However, translating complex queries at the enterprise level remains challenging, especially when handling nested structures such as XML strings. This article illustrates a scalable solution developed by AWS and Cisco that reduces the complexity of SQL generation by utilizing simpler, more cost-effective models.


Specific Challenges for Enterprise-Scale NL2SQL

Accuracy in NL2SQL is critical, since erroneous queries can leak confidential information or lead to poor decisions. The main challenges include:

  • Sophisticated Schemas: Storage-optimized databases often use nested tables and multi-dimensional structures that require complex querying.
  • Varied Queries: User queries typically span entities, date ranges, or multiple metrics, resulting in longer SQL queries.
  • LLM Knowledge Gap: LLMs are trained mostly on simple, public schemas and struggle with sophisticated production databases.
  • LLM Attention Burden: Processing multi-dimensional data puts a heavier load on the model's attention, resulting in slower, less stable query generation.

Solution Design and Methodology

The solution improves NL2SQL accuracy by reducing the LLM's burden and concentrating generation on targeted data domains.

  1. Restricting Scope to Data Domains
  • Query Mapping: Categorize queries into target domains to focus prompts.
  • Schema Scoping: Eliminate irrelevant schemas, enabling richer few-shot examples.
  • Metadata Enrichment: Add table/column descriptions to provide better SQL direction.
  • Connection Info: Extract SQL dialects and multi-cloud information for scale-out support.
  2. Handling Identifiers

Entities specified in queries are resolved to unique IDs through NLP-based NER. Rather than embedding IDs into SQL directly, they are stored temporarily to ease SQL generation through simple set operations such as IN.
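
A minimal sketch of this staging idea, using SQLite and an illustrative athlete schema (the table and column names are assumptions, not the solution's actual schema):

```python
import sqlite3

# Hypothetical schema: resolved entity IDs are staged in a temporary table so
# the generated SQL only needs a simple IN subquery instead of embedded IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athlete (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO athlete VALUES (?, ?)",
                 [(1, "Isabelle Werth"), (2, "Nedo Nadi"), (3, "Allyson Felix")])

# IDs produced by the (not shown) NER and lookup step:
resolved_ids = [1, 3]
conn.execute("CREATE TEMP TABLE resolved_athlete (id INTEGER)")
conn.executemany("INSERT INTO resolved_athlete VALUES (?)",
                 [(i,) for i in resolved_ids])

# The LLM only has to emit a simple set operation:
rows = conn.execute(
    "SELECT name FROM athlete "
    "WHERE id IN (SELECT id FROM resolved_athlete) ORDER BY id"
).fetchall()
print([r[0] for r in rows])  # ['Isabelle Werth', 'Allyson Felix']
```

Keeping the IDs out of the prompt also means the generated SQL stays short regardless of how many entities the user names.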

  3. Dealing with Complex Data Structures

Temporary tables or views encode complex joins and nested data to provide easier and more correct SQL output.
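
As a sketch of this technique, a temporary view can pre-join related tables so the LLM targets one flat relation (again with assumed, illustrative table names):

```python
import sqlite3

# Illustrative schema: a temp view hides the join so generated SQL stays flat.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE athlete (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE result  (athlete_id INTEGER, games TEXT, medal TEXT);
INSERT INTO athlete VALUES (1, 'Allyson Felix');
INSERT INTO result  VALUES (1, '2012 Summer', 'Gold');
CREATE TEMP VIEW athlete_results AS
  SELECT a.name, r.games, r.medal
  FROM athlete a JOIN result r ON r.athlete_id = a.id;
""")

# Generated SQL can now query the view directly, with no join logic:
games = conn.execute(
    "SELECT games FROM athlete_results WHERE name = 'Allyson Felix'"
).fetchall()
print(games)  # [('2012 Summer',)]
```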

  4. Enriching Data with Data Definitions for Prompt Construction
  • Standard views and tables are created with clear descriptive comments.
  • Join hints define when to use joins, such as outer joins.
  • Exclusions are governed by domain-specific rules (e.g., columns not to include in SELECT).
  • Few-shot examples translate user queries into SQL.
  • Temporary tables/views are pre-defined for convenience.
  • Domain-specific prompts define the LLM role, SQL dialect, and scope.
  • SQL scripts are set up to develop static or dynamic temporary structures.
  • Dynamic prompt building utilizes this “data domain context” per query.


Bringing It All Together: The Execution Flow

This section explains the solution’s execution flow with the help of an example database containing Olympics statistics and an employee vacation schedule. The flow is demonstrated with the query, “In what games competed Isabelle Werth, Nedo Nadi, and Allyson Felix?”, showing how inputs are processed and outputs are generated across the solution steps.


Preprocess the Request

The initial step of the NL2SQL flow categorizes the user query into a particular data domain, reducing the scope for SQL generation. It also identifies named resources from the query, which are later utilized to obtain equivalent database identifiers in the subsequent step. Example inputs and outputs of this step are as follows:
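
The original example output is not shown here; the following toy stand-in (keyword matching plus a regex, in place of the solution's LLM-based classifier and NER) illustrates the shape of this step's input and output:

```python
import re

# Assumed domain labels and keywords, for illustration only.
DOMAIN_KEYWORDS = {
    "olympics": ["games", "medal", "competed"],
    "vacation": ["vacation", "leave", "holiday"],
}

def preprocess(query: str) -> dict:
    """Toy classifier/NER: pick the domain with the most keyword hits and
    pull capitalized two-word spans as candidate named resources."""
    q = query.lower()
    domain = max(DOMAIN_KEYWORDS,
                 key=lambda d: sum(k in q for k in DOMAIN_KEYWORDS[d]))
    names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", query)
    return {"domain": domain, "named_resources": names}

print(preprocess(
    "In what games competed Isabelle Werth, Nedo Nadi, and Allyson Felix?"))
# {'domain': 'olympics',
#  'named_resources': ['Isabelle Werth', 'Nedo Nadi', 'Allyson Felix']}
```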

Resolve Identifiers (to Database IDs)

This step maps extracted named resources to their corresponding database identifiers using solution-specific methods like database lookups or an ID service.

The following code shows the execution of this step in our running example:
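
The original listing is not reproduced here; a minimal sketch of a database lookup (the athlete table and IDs are invented for illustration) might look like:

```python
import sqlite3

# Hypothetical lookup table; a production system might use an ID service.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athlete (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO athlete VALUES (?, ?)",
                 [(11, "Isabelle Werth"), (22, "Nedo Nadi"), (33, "Allyson Felix")])

def resolve(names):
    """Map extracted names to database IDs via a parameterized lookup."""
    placeholders = ",".join("?" for _ in names)
    rows = conn.execute(
        f"SELECT name, id FROM athlete WHERE name IN ({placeholders})", names)
    return dict(rows)

print(resolve(["Isabelle Werth", "Nedo Nadi", "Allyson Felix"]))
# {'Isabelle Werth': 11, 'Nedo Nadi': 22, 'Allyson Felix': 33}
```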

Prepare the Request

This fundamental step produces two very important components:

  • An instruction prompt for the LLM to formulate the SQL query from the user query and the domain context.
  • An SQL script (preamble) that establishes the domain schema, including temporary tables and views.
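
A sketch of how these two components might be assembled (the template wording, context keys, and table names are assumptions, not the solution's actual prompt):

```python
def build_request(domain_ctx, user_query, resolved_ids):
    """Assemble the LLM instruction and the SQL preamble from the
    per-domain context gathered in the earlier steps."""
    prompt = (
        f"You are a {domain_ctx['dialect']} expert. Using only the tables "
        f"below, write one SQL query that answers the question.\n"
        f"Schema:\n{domain_ctx['schema_ddl']}\n"
        f"Rules: {'; '.join(domain_ctx['rules'])}\n"
        f"Question: {user_query}"
    )
    # Preamble staging resolved IDs in a temporary table:
    preamble = "CREATE TEMP TABLE resolved_athlete (id INTEGER);\n" + "\n".join(
        f"INSERT INTO resolved_athlete VALUES ({i});" for i in resolved_ids)
    return prompt, preamble

ctx = {"dialect": "SQLite",
       "schema_ddl": "CREATE TABLE athlete (id INTEGER, name TEXT); -- one row per athlete",
       "rules": ["never select the internal id column"]}
prompt, preamble = build_request(ctx, "In what games competed Allyson Felix?", [33])
```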

Generate SQL

In this step, the prepared prompt and context are provided to the SQL-generating LLM, which outputs the SQL query with the correct joins, filters, and columns. See the following code:
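
The original listing is not shown; the sketch below stubs out the model call (in the real solution this is an Amazon Bedrock invocation) and shows a typical post-processing step of stripping a markdown fence from the model's answer:

```python
def generate_sql(prompt: str, call_llm) -> str:
    """Send the prepared prompt to the SQL-generating LLM (injected here as
    a callable) and strip any markdown code fence from the reply."""
    raw = call_llm(prompt).strip()
    return raw.removeprefix("```sql").removesuffix("```").strip()

def fake_llm(prompt: str) -> str:
    # Stand-in for the real model call, for illustration only.
    return ("```sql\nSELECT games FROM athlete_results "
            "WHERE id IN (SELECT id FROM resolved_athlete)\n```")

sql = generate_sql("...prepared prompt...", fake_llm)
print(sql)
# SELECT games FROM athlete_results WHERE id IN (SELECT id FROM resolved_athlete)
```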

Execute the SQL

The SQL preamble and generated SQL are combined into a complete script, which is then executed against the data store. The response is fetched and returned to the client or end-user. See the following code:
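
A minimal sketch of this final step against SQLite, with an invented schema and data:

```python
import sqlite3

# Illustrative base tables standing in for the real data store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE athlete (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE result  (athlete_id INTEGER, games TEXT);
INSERT INTO athlete VALUES (1, 'Isabelle Werth');
INSERT INTO result  VALUES (1, '1992 Summer');
""")

# Preamble (temporary structures) produced by the preparation step:
preamble = """
CREATE TEMP TABLE resolved_athlete (id INTEGER);
INSERT INTO resolved_athlete VALUES (1);
"""
# SQL produced by the generation step:
generated_sql = ("SELECT a.name, r.games FROM athlete a "
                 "JOIN result r ON r.athlete_id = a.id "
                 "WHERE a.id IN (SELECT id FROM resolved_athlete)")

conn.executescript(preamble)              # set up temporary structures
rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('Isabelle Werth', '1992 Summer')]
```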

Solution Benefits

The solution achieves more than 95% accuracy across three data domains. Lightweight LLMs such as Meta's Code Llama 13B and Anthropic's Claude 3 Haiku produce SQL queries within 1–3 seconds. It efficiently processes large datasets (200 resources, 10,000 rows) with 2–5 second latency. Data abstractions hide complex queries, enabling correct SQL generation and a lower LLM workload.

AWS Solution Architecture

Users query via a REST API on Amazon API Gateway, protected by Amazon Cognito. The API calls an AWS Lambda function that invokes Anthropic's Claude 3 Haiku on Amazon Bedrock. The steps are:

  1. Domain classification.
  2. Named resource extraction.
  3. Identity lookup.
  4. Prompt creation and temporary structure definition.
  5. SQL generation.
  6. SQL execution against the database (e.g., SQLite).

AWS Lambda executes these steps and returns the result to the user.
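
The steps above could be tied together in a Lambda handler roughly as follows; every helper here is a trivial stub standing in for the components described earlier, so this is an orchestration sketch, not the solution's actual handler:

```python
import json

# Stubs for the real pipeline components, for illustration only.
def preprocess(q):            return {"domain": "olympics", "named_resources": ["Allyson Felix"]}
def resolve(names):           return [33]
def build_request(pre, ids):  return "prompt...", "CREATE TEMP TABLE resolved_athlete (id INTEGER);"
def generate_sql(prompt):     return "SELECT 1"
def run_sql(preamble, sql):   return [[1]]

def lambda_handler(event, context=None):
    query = json.loads(event["body"])["query"]
    pre = preprocess(query)                      # steps 1-2: domain + entities
    ids = resolve(pre["named_resources"])        # step 3: identity lookup
    prompt, preamble = build_request(pre, ids)   # step 4: prompt + temp DDL
    sql = generate_sql(prompt)                   # step 5: SQL generation
    rows = run_sql(preamble, sql)                # step 6: execution
    return {"statusCode": 200, "body": json.dumps({"rows": rows})}

resp = lambda_handler({"body": json.dumps(
    {"query": "In what games competed Allyson Felix?"})})
print(resp["statusCode"])  # 200
```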


Conclusion

AWS and Cisco demonstrated how to make enterprise-grade SQL generation simpler, more precise, and faster. Although the demo used Olympic athlete data, the method applies to many business environments.

Drop a query if you have any questions regarding LLMs, and we will get back to you quickly.


About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partner, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR, and many more.

FAQs

1. How does the solution achieve high SQL generation accuracy?

ANS: – It focuses LLM attention on specific data, uses metadata enrichment, and leverages domain-specific prompts.

2. What SQL generation models are used?

ANS: – Lightweight models such as Meta’s Code Llama 13B and Anthropic’s Claude 3 Haiku are used for SQL generation.

WRITTEN BY Daniya Muzammil

Daniya Muzammil works as a Research Intern at CloudThat and is passionate about learning new and emerging technologies.
