Introduction
However, translating complex natural-language queries into SQL at enterprise scale remains challenging, especially when handling nested structures such as XML strings. This article illustrates a scalable solution developed by AWS and Cisco that reduces the complexity of SQL generation by utilizing simpler, more cost-effective models.
Specific Challenges for Enterprise-Scale NL2SQL
Accuracy in NL2SQL is critical, since erroneous queries can leak confidential information or lead to poor decisions. The main challenges are:
- Sophisticated Schemas: Production databases are often storage-optimized, with nested tables and multi-dimensional structures that require complex queries.
- Varied Queries: User queries often span multiple entities, date ranges, or metrics, resulting in longer SQL queries.
- LLM Knowledge Gap: LLMs are typically trained on simple, public schemas and struggle with sophisticated production databases.
- LLM Attention Burden: Processing multi-dimensional data puts a heavier load on the model's attention, resulting in slower, less reliable query generation.
Solution Design and Methodology
The solution enhances NL2SQL precision by lowering the LLM’s burden and concentrating generation steps in targeted data domains.
- Restricting Scope to Data Domains
- Query Mapping: Classify each query into a data domain so prompts can be narrowly targeted.
- Schema Scoping: Exclude irrelevant schemas, leaving room for richer few-shot examples.
- Metadata Enrichment: Add table/column descriptions to provide better SQL direction.
- Connection Info: Extract SQL dialects and multi-cloud information for scale-out support.
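As an illustration, schema scoping can be as simple as keeping a per-domain registry of DDL statements and emitting only the relevant ones into the prompt. This is a minimal sketch; the `SCHEMAS` registry, table names, and `scoped_schema` helper are illustrative assumptions, not the solution's actual code:

```python
# Hypothetical per-domain registry of DDL statements (illustrative only).
SCHEMAS = {
    "olympics": [
        "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);",
        "CREATE TABLE games_competitor (games_id INTEGER, person_id INTEGER);",
    ],
    "vacations": [
        "CREATE TABLE employee_vacations (employee_id INTEGER, start_date TEXT, end_date TEXT);",
    ],
}

def scoped_schema(domain: str) -> str:
    """Return only the DDL statements relevant to one data domain."""
    return "\n".join(SCHEMAS[domain])

# Only olympics tables reach the prompt; vacation tables never compete for attention.
prompt_schema = scoped_schema("olympics")
```

Because the prompt now carries a fraction of the full schema, the freed-up context window can hold more few-shot examples.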
- Handling Identifiers
Entities specified in queries are resolved to unique IDs through NLP-based NER. Rather than embedding IDs into SQL directly, they are stored temporarily to ease SQL generation through simple set operations such as IN.
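The effect of staging resolved IDs can be sketched with SQLite (an assumed, simplified schema; not the solution's actual code): the IDs land in a temporary table, so the generated SQL only needs a simple IN sub-select instead of embedding literal IDs.

```python
import sqlite3

# Toy base table standing in for a production athletes table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athletes (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.executemany("INSERT INTO athletes VALUES (?, ?)",
                 [(34551, "allyson felix"), (84026, "nedo nadi"), (999, "someone else")])

# IDs resolved by the NER/identifier step (values from the running example).
resolved_ids = [34551, 84026]
conn.execute("CREATE TEMP TABLE ids_in_focus (id INTEGER)")
conn.executemany("INSERT INTO ids_in_focus VALUES (?)", [(i,) for i in resolved_ids])

# The LLM never sees the raw IDs; it can simply generate a set operation:
rows = conn.execute(
    "SELECT full_name FROM athletes WHERE id IN (SELECT id FROM ids_in_focus)"
).fetchall()
```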
- Dealing with Complex Data Structures
Temporary tables or views encode complex joins and nested data to provide easier and more correct SQL output.
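A minimal sketch of the view idea, again with SQLite and an assumed, simplified schema: a temporary view defined in the preamble hides a multi-table join, so the generated SQL reads from one flat relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE games (id INTEGER PRIMARY KEY, games_name TEXT, games_year INTEGER);
CREATE TABLE games_competitor (games_id INTEGER, person_id INTEGER);
INSERT INTO person VALUES (1, 'nedo nadi');
INSERT INTO games VALUES (10, '1920 Summer', 1920);
INSERT INTO games_competitor VALUES (10, 1);

-- The preamble defines the view once; the LLM targets the flat shape.
CREATE TEMP VIEW athlete_games AS
SELECT p.full_name, g.games_name, g.games_year
FROM person p
JOIN games_competitor gc ON gc.person_id = p.id
JOIN games g ON g.id = gc.games_id;
""")

# The generated SQL is now a single-table query instead of a three-way join.
rows = conn.execute(
    "SELECT games_name FROM athlete_games WHERE full_name = 'nedo nadi'"
).fetchall()
```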
- Enriching Data with Data Definitions for Prompt Construction
- Standard views and tables are created with clear descriptive comments.
- Join hints specify when particular joins, such as outer joins, should be used.
- Domain-specific rules exclude certain elements (e.g., columns that should not appear in SELECT).
- Few-shot examples translate user queries into SQL.
- Temporary tables/views are pre-defined for convenience.
- Domain-specific prompts define the LLM role, SQL dialect, and scope.
- SQL scripts are set up to develop static or dynamic temporary structures.
- Dynamic prompt building utilizes this “data domain context” per query.
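The dynamic prompt building described above can be sketched as assembling the pieces of the data domain context into one prompt string. All keys and the template wording here are assumptions for illustration, not the solution's exact format:

```python
# Assumed shape of a "data domain context" (illustrative keys and values).
domain_context = {
    "dialect": "SQLite",
    "schema": "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);",
    "rules": "Do not include internal id columns in SELECT unless asked.",
    "few_shots": '<example>question: ...\nanswer: `{"sql": "..."}`</example>',
}

def build_prompt(user_query: str, ctx: dict) -> str:
    """Assemble role, dialect, scoped schema, rules, and few-shots per query."""
    return (
        f"You are an expert at {ctx['dialect']} SQL.\n"
        f"Schema:\n{ctx['schema']}\n"
        f"Rules: {ctx['rules']}\n"
        f"{ctx['few_shots']}\n"
        f"question: {user_query}\n"
        "answer:"
    )

prompt = build_prompt("In what games competed Nedo Nadi?", domain_context)
```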
Bringing It All Together: The Execution Flow
This section explains the solution’s execution flow with the help of an example database containing Olympics statistics and an employee vacation schedule. The flow is demonstrated with the query, “In what games competed Isabelle Werth, Nedo Nadi, and Allyson Felix?”, showing how inputs are processed and outputs are generated across the solution steps.
Preprocess the Request
The initial step of the NL2SQL flow categorizes the user query into a particular data domain, reducing the scope for SQL generation. It also identifies named resources from the query, which are later utilized to obtain equivalent database identifiers in the subsequent step. Example inputs and outputs of this step are as follows:
user_query = "In what games competed Isabelle Werth, Nedo Nadi and Allyson Felix?"
pre_processed_request = request_pre_processor.run(user_query)
domain = pre_processed_request[app_consts.DOMAIN]

# Output pre_processed_request:
# {'user_query': 'In what games competed Isabelle Werth, Nedo Nadi and Allyson Felix?',
#  'domain': 'olympics',
#  'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'}}
Resolve Identifiers (to Database IDs)
This step maps extracted named resources to their corresponding database identifiers using solution-specific methods like database lookups or an ID service.
The following code shows the execution of this step in our running example:
named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
if len(named_resources) > 0:
    identifiers = id_service_facade.resolve(named_resources)
    # add identifiers to the pre_processed_request object
    pre_processed_request[app_consts.IDENTIFIERS] = identifiers
else:
    pre_processed_request[app_consts.IDENTIFIERS] = []

# Output pre_processed_request:
# {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
#  'domain': 'olympics',
#  'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'},
#  'identifiers': [
#    {'id': 34551, 'role': 32, 'name': 'allyson felix'},
#    {'id': 129726, 'role': 32, 'name': 'isabelle werth'},
#    {'id': 84026, 'role': 32, 'name': 'nedo nadi'}]}
Prepare the Request
This step produces two key components:
- An instruction prompt for the LLM to formulate the SQL query from the user query and the domain context.
- An SQL script that establishes the domain schema (temporary tables and views).
prepared_request = request_preparer.run(pre_processed_request)

# Output prepared_request:
# {'llm_prompt': 'You are an expert at SQL. Considering the following SQL definitions of tables,...
#    CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER,...);
#    <example>
#    question: How many gold medals has Yukio Endo won?
#    answer: `{"sql": "SELECT a.id, count(m.medal_name) as \"count\"
#             FROM athletes_in_focus a INNER JOIN games_competitor gc ....
#             WHERE m.medal_name = 'Gold' GROUP BY a.id;"}`
#    </example>',
#  'sql_preamble': [
#    'CREATE temp TABLE athletes_in_focus (row_id INTEGER PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);',
#    "INSERT INTO athletes_in_focus VALUES (1,84026,'nedo nadi'), (2,34551,'allyson felix'), (3,129726,'isabelle werth');"]}
Generate SQL
In this step, the prepared prompt and context are provided to the SQL-generating LLM, which outputs the SQL query with the correct joins, filters, and columns. See the following code:
llm_response = llm_service_facade.invoke(prepared_request['llm_prompt'])
generated_sql = llm_response['llm_output']

# Output generated_sql:
# {'sql': 'SELECT g.games_name, g.games_year
#          FROM athletes_in_focus a
#          JOIN games_competitor gc ON gc.person_id = a.id
#          JOIN games g ON gc.games_id = g.id;'}
Execute the SQL
The SQL preamble and generated SQL are combined into a complete script, which is then executed against the data store. The response is fetched and returned to the client or end-user. See the following code:
sql_script = prepared_request['sql_preamble'] + [generated_sql['sql']]
database = app_consts.get_database_for_domain(domain)
results = rdbms_service_facade.execute_sql(database, sql_script)

# Output results:
# {'rdbms_output': [
#    ('games_name', 'games_year'),
#    ('2004 Summer', 2004),
#    ...
#    ('2016 Summer', 2016)],
#  'processing_status': 'success'}
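The combine-and-execute step can be reproduced end to end with SQLite. This is a runnable sketch using the running example's preamble; the base-table rows are made up for illustration and the join path is simplified:

```python
import sqlite3

# Toy base tables standing in for the Olympics database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);
CREATE TABLE games_competitor (games_id INTEGER, person_id INTEGER);
INSERT INTO games VALUES (1, 2004, '2004 Summer'), (2, 2016, '2016 Summer');
INSERT INTO games_competitor VALUES (1, 34551), (2, 34551);
""")

# Preamble from the prepared request, then the LLM-generated query.
sql_preamble = [
    "CREATE temp TABLE athletes_in_focus (row_id INTEGER PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);",
    "INSERT INTO athletes_in_focus VALUES (1,34551,'allyson felix');",
]
generated_sql = ("SELECT g.games_name, g.games_year FROM athletes_in_focus a "
                 "JOIN games_competitor gc ON gc.person_id = a.id "
                 "JOIN games g ON gc.games_id = g.id;")

for stmt in sql_preamble:          # stage the temporary structures first
    conn.execute(stmt)
results = conn.execute(generated_sql).fetchall()
```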
Solution Benefits
The solution achieves more than 95% accuracy across three data domains. Lightweight LLMs such as Meta’s Code Llama 13B and Anthropic’s Claude Haiku 3 produce SQL queries within 1–3 seconds, and larger requests (200 resources, 10,000 rows) complete with 2–5 second latency. Complex query structures are hidden behind data abstractions, enabling correct SQL generation with a lower LLM workload.
AWS Solution Architecture
Users query via a REST API on Amazon API Gateway, protected by Amazon Cognito. The API calls an AWS Lambda function that calls Anthropic’s Claude Haiku 3 on Amazon Bedrock. Steps are:
- Domain classification.
- Named resource extraction.
- Identifier lookup.
- Prompt creation and temporary structure definition.
- SQL generation.
- SQL execution against the database (e.g., SQLite).
AWS Lambda executes and returns the result to the user seamlessly.
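The Lambda orchestration of the steps above could be sketched as follows. Everything here is a hypothetical stand-in: the real solution calls Anthropic’s Claude Haiku 3 through Amazon Bedrock, whereas this sketch injects the LLM and database calls as parameters so the control flow is visible (and testable) in isolation:

```python
import json

def handler(event, context=None, llm_invoke=None, run_sql=None):
    """Hypothetical handler: classify/prepare, generate SQL, execute, respond.

    llm_invoke and run_sql are injected stand-ins; in the described architecture
    the former would call a Bedrock-hosted model and the latter the data store.
    """
    user_query = json.loads(event["body"])["query"]
    prompt = f"Generate SQL for: {user_query}"   # stands in for full prompt preparation
    generated_sql = llm_invoke(prompt)           # SQL generation via the LLM
    rows = run_sql(generated_sql)                # execution against the database
    return {"statusCode": 200, "body": json.dumps({"rows": rows})}

# Example invocation with stubbed dependencies:
resp = handler({"body": json.dumps({"query": "how many games?"})},
               llm_invoke=lambda p: "SELECT 1;",
               run_sql=lambda s: [[1]])
```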
Conclusion
AWS and Cisco demonstrated how enterprise-grade SQL generation can be made simpler, more precise, and faster. Although the demo used Olympic athlete data, the method applies to many business environments.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR and many more.
FAQs
1. How can the solution provide high SQL generation accuracy?
ANS: – It focuses LLM attention on specific data, uses metadata enrichment, and leverages domain-specific prompts.
2. What SQL generation models are used?
ANS: – Lightweight models such as Meta’s Code Llama 13B and Anthropic’s Claude Haiku 3 are used in SQL generation.
WRITTEN BY Daniya Muzammil
Daniya Muzammil works as a Research Intern at CloudThat and is passionate about learning new and emerging technologies.