Introduction
However, translating complex natural-language queries into SQL at enterprise scale remains challenging, especially when handling nested structures such as XML strings. This article illustrates a scalable solution developed by AWS and Cisco that reduces the complexity of SQL generation by utilizing simpler, more cost-effective models.
Specific Challenges for Enterprise-Scale NL2SQL
Accuracy in NL2SQL is critical, since erroneous queries can leak confidential information or lead to poor decisions. The main challenges are:
- Sophisticated Schemas: Production databases are often storage-optimized, with nested tables and multi-dimensional structures that require complex queries.
- Varied Queries: User queries often span multiple entities, date ranges, or metrics, resulting in longer SQL queries.
- LLM Knowledge Gap: LLMs are typically trained on simple, public schemas and struggle with sophisticated production databases.
- LLM Attention Burden: Processing multi-dimensional data puts a heavier load on the model's attention, resulting in slower, less reliable query generation.
Solution Design and Methodology
The solution enhances NL2SQL precision by lowering the LLM’s burden and concentrating generation steps in targeted data domains.
- Restricting Scope to Data Domains
- Query Mapping: Classify each query into a data domain so prompts can be narrowly targeted.
- Schema Scoping: Exclude irrelevant schemas, leaving room for richer few-shot examples.
- Metadata Enrichment: Add table/column descriptions to provide better SQL direction.
- Connection Info: Extract SQL dialects and multi-cloud information for scale-out support.
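As an illustration, schema scoping can be as simple as keeping a per-domain registry of DDL statements and emitting only the relevant ones into the prompt. This is a minimal sketch; the `SCHEMAS` registry, table names, and `scoped_schema` helper are illustrative assumptions, not the solution's actual code:

```python
# Hypothetical per-domain registry of DDL statements (illustrative only).
SCHEMAS = {
    "olympics": [
        "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);",
        "CREATE TABLE games_competitor (games_id INTEGER, person_id INTEGER);",
    ],
    "vacations": [
        "CREATE TABLE employee_vacations (employee_id INTEGER, start_date TEXT, end_date TEXT);",
    ],
}

def scoped_schema(domain: str) -> str:
    """Return only the DDL statements relevant to one data domain."""
    return "\n".join(SCHEMAS[domain])

# Only olympics tables reach the prompt; vacation tables never compete for attention.
prompt_schema = scoped_schema("olympics")
```

Because the prompt now carries a fraction of the full schema, the freed-up context window can hold more few-shot examples.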
- Handling Identifiers
Entities specified in queries are resolved to unique IDs through NLP-based NER. Rather than embedding IDs into SQL directly, they are stored temporarily to ease SQL generation through simple set operations such as IN.
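The effect of staging resolved IDs can be sketched with SQLite (an assumed, simplified schema; not the solution's actual code): the IDs land in a temporary table, so the generated SQL only needs a simple IN sub-select instead of embedding literal IDs.

```python
import sqlite3

# Toy base table standing in for a production athletes table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athletes (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.executemany("INSERT INTO athletes VALUES (?, ?)",
                 [(34551, "allyson felix"), (84026, "nedo nadi"), (999, "someone else")])

# IDs resolved by the NER/identifier step (values from the running example).
resolved_ids = [34551, 84026]
conn.execute("CREATE TEMP TABLE ids_in_focus (id INTEGER)")
conn.executemany("INSERT INTO ids_in_focus VALUES (?)", [(i,) for i in resolved_ids])

# The LLM never sees the raw IDs; it can simply generate a set operation:
rows = conn.execute(
    "SELECT full_name FROM athletes WHERE id IN (SELECT id FROM ids_in_focus)"
).fetchall()
```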
- Dealing with Complex Data Structures
Temporary tables or views encode complex joins and nested data to provide easier and more correct SQL output.
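A minimal sketch of the view idea, again with SQLite and an assumed, simplified schema: a temporary view defined in the preamble hides a multi-table join, so the generated SQL reads from one flat relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE games (id INTEGER PRIMARY KEY, games_name TEXT, games_year INTEGER);
CREATE TABLE games_competitor (games_id INTEGER, person_id INTEGER);
INSERT INTO person VALUES (1, 'nedo nadi');
INSERT INTO games VALUES (10, '1920 Summer', 1920);
INSERT INTO games_competitor VALUES (10, 1);

-- The preamble defines the view once; the LLM targets the flat shape.
CREATE TEMP VIEW athlete_games AS
SELECT p.full_name, g.games_name, g.games_year
FROM person p
JOIN games_competitor gc ON gc.person_id = p.id
JOIN games g ON g.id = gc.games_id;
""")

# The generated SQL is now a single-table query instead of a three-way join.
rows = conn.execute(
    "SELECT games_name FROM athlete_games WHERE full_name = 'nedo nadi'"
).fetchall()
```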
- Enriching Data with Data Definitions for Prompt Construction
- Standard views and tables are created with clear descriptive comments.
- Join hints specify when particular joins, such as outer joins, should be used.
- Domain-specific rules exclude certain elements (e.g., columns that should not appear in SELECT).
- Few-shot examples translate user queries into SQL.
- Temporary tables/views are pre-defined for convenience.
- Domain-specific prompts define the LLM role, SQL dialect, and scope.
- SQL scripts are set up to develop static or dynamic temporary structures.
- Dynamic prompt building utilizes this “data domain context” per query.
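The dynamic prompt building described above can be sketched as assembling the pieces of the data domain context into one prompt string. All keys and the template wording here are assumptions for illustration, not the solution's exact format:

```python
# Assumed shape of a "data domain context" (illustrative keys and values).
domain_context = {
    "dialect": "SQLite",
    "schema": "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);",
    "rules": "Do not include internal id columns in SELECT unless asked.",
    "few_shots": '<example>question: ...\nanswer: `{"sql": "..."}`</example>',
}

def build_prompt(user_query: str, ctx: dict) -> str:
    """Assemble role, dialect, scoped schema, rules, and few-shots per query."""
    return (
        f"You are an expert at {ctx['dialect']} SQL.\n"
        f"Schema:\n{ctx['schema']}\n"
        f"Rules: {ctx['rules']}\n"
        f"{ctx['few_shots']}\n"
        f"question: {user_query}\n"
        "answer:"
    )

prompt = build_prompt("In what games competed Nedo Nadi?", domain_context)
```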
Bringing It All Together: The Execution Flow
This section explains the solution’s execution flow with the help of an example database containing Olympics statistics and an employee vacation schedule. The flow is demonstrated with the query, “In what games competed Isabelle Werth, Nedo Nadi, and Allyson Felix?”, showing how inputs are processed and outputs are generated across the solution steps.
Preprocess the Request
The initial step of the NL2SQL flow categorizes the user query into a particular data domain, reducing the scope for SQL generation. It also identifies named resources from the query, which are later utilized to obtain equivalent database identifiers in the subsequent step. Example inputs and outputs of this step are as follows:
user_query = "In what games competed Isabelle Werth, Nedo Nadi and Allyson Felix?"
pre_processed_request = request_pre_processor.run(user_query)
domain = pre_processed_request[app_consts.DOMAIN]

# Output pre_processed_request:
# {'user_query': 'In what games competed Isabelle Werth, Nedo Nadi and Allyson Felix?',
#  'domain': 'olympics',
#  'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'}}
Resolve Identifiers (to Database IDs)
This step maps extracted named resources to their corresponding database identifiers using solution-specific methods like database lookups or an ID service.
The following code shows the execution of this step in our running example:
named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
if len(named_resources) > 0:
    identifiers = id_service_facade.resolve(named_resources)
    # add identifiers to the pre_processed_request object
    pre_processed_request[app_consts.IDENTIFIERS] = identifiers
else:
    pre_processed_request[app_consts.IDENTIFIERS] = []

# Output pre_processed_request:
# {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
#  'domain': 'olympics',
#  'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'},
#  'identifiers': [
#    {'id': 34551, 'role': 32, 'name': 'allyson felix'},
#    {'id': 129726, 'role': 32, 'name': 'isabelle werth'},
#    {'id': 84026, 'role': 32, 'name': 'nedo nadi'}]}
Prepare the Request
This step produces two key components:
- An instruction prompt for the LLM to formulate the SQL query from the user query and the domain context.
- An SQL script that establishes the domain schema (temporary tables and views).
prepared_request = request_preparer.run(pre_processed_request)

# Output prepared_request:
# {'llm_prompt': 'You are an expert at SQL. Considering the following SQL definitions of tables,...
#    CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER,...);
#    <example>
#    question: How many gold medals has Yukio Endo won?
#    answer: `{"sql": "SELECT a.id, count(m.medal_name) as \"count\"
#             FROM athletes_in_focus a INNER JOIN games_competitor gc ....
#             WHERE m.medal_name = 'Gold' GROUP BY a.id;"}`
#    </example>',
#  'sql_preamble': [
#    'CREATE temp TABLE athletes_in_focus (row_id INTEGER PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);',
#    "INSERT INTO athletes_in_focus VALUES (1,84026,'nedo nadi'), (2,34551,'allyson felix'), (3,129726,'isabelle werth');"]}
Generate SQL
In this step, the prepared prompt and context are provided to the SQL-generating LLM, which outputs the SQL query with the correct joins, filters, and columns. See the following code:
llm_response = llm_service_facade.invoke(prepared_request['llm_prompt'])
generated_sql = llm_response['llm_output']

# Output generated_sql:
# {'sql': 'SELECT g.games_name, g.games_year
#          FROM athletes_in_focus a
#          JOIN games_competitor gc ON gc.person_id = a.id
#          JOIN games g ON gc.games_id = g.id;'}
Execute the SQL
The SQL preamble and generated SQL are combined into a complete script, which is then executed against the data store. The response is fetched and returned to the client or end-user. See the following code:
sql_script = prepared_request['sql_preamble'] + [generated_sql['sql']]
database = app_consts.get_database_for_domain(domain)
results = rdbms_service_facade.execute_sql(database, sql_script)

# Output results:
# {'rdbms_output': [
#    ('games_name', 'games_year'),
#    ('2004 Summer', 2004),
#    ...
#    ('2016 Summer', 2016)],
#  'processing_status': 'success'}
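The combine-and-execute step can be reproduced end to end with SQLite. This is a runnable sketch using the running example's preamble; the base-table rows are made up for illustration and the join path is simplified:

```python
import sqlite3

# Toy base tables standing in for the Olympics database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);
CREATE TABLE games_competitor (games_id INTEGER, person_id INTEGER);
INSERT INTO games VALUES (1, 2004, '2004 Summer'), (2, 2016, '2016 Summer');
INSERT INTO games_competitor VALUES (1, 34551), (2, 34551);
""")

# Preamble from the prepared request, then the LLM-generated query.
sql_preamble = [
    "CREATE temp TABLE athletes_in_focus (row_id INTEGER PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);",
    "INSERT INTO athletes_in_focus VALUES (1,34551,'allyson felix');",
]
generated_sql = ("SELECT g.games_name, g.games_year FROM athletes_in_focus a "
                 "JOIN games_competitor gc ON gc.person_id = a.id "
                 "JOIN games g ON gc.games_id = g.id;")

for stmt in sql_preamble:          # stage the temporary structures first
    conn.execute(stmt)
results = conn.execute(generated_sql).fetchall()
```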
Solution Benefits
The solution achieves more than 95% accuracy across three data domains. Lightweight LLMs such as Meta’s Code Llama 13B and Anthropic’s Claude Haiku 3 produce SQL queries within 1–3 seconds, and larger requests (200 resources, 10,000 rows) complete with 2–5 second latency. Complex query structures are hidden behind data abstractions, enabling correct SQL generation with a lower LLM workload.
AWS Solution Architecture
Users query via a REST API on Amazon API Gateway, protected by Amazon Cognito. The API calls an AWS Lambda function that calls Anthropic’s Claude Haiku 3 on Amazon Bedrock. Steps are:
- Domain classification.
- Named resource extraction.
- Identifier lookup.
- Prompt creation and temporary structure definition.
- SQL generation.
- SQL execution against the database (e.g., SQLite).
AWS Lambda executes and returns the result to the user seamlessly.
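The Lambda orchestration of the steps above could be sketched as follows. Everything here is a hypothetical stand-in: the real solution calls Anthropic’s Claude Haiku 3 through Amazon Bedrock, whereas this sketch injects the LLM and database calls as parameters so the control flow is visible (and testable) in isolation:

```python
import json

def handler(event, context=None, llm_invoke=None, run_sql=None):
    """Hypothetical handler: classify/prepare, generate SQL, execute, respond.

    llm_invoke and run_sql are injected stand-ins; in the described architecture
    the former would call a Bedrock-hosted model and the latter the data store.
    """
    user_query = json.loads(event["body"])["query"]
    prompt = f"Generate SQL for: {user_query}"   # stands in for full prompt preparation
    generated_sql = llm_invoke(prompt)           # SQL generation via the LLM
    rows = run_sql(generated_sql)                # execution against the database
    return {"statusCode": 200, "body": json.dumps({"rows": rows})}

# Example invocation with stubbed dependencies:
resp = handler({"body": json.dumps({"query": "how many games?"})},
               llm_invoke=lambda p: "SELECT 1;",
               run_sql=lambda s: [[1]])
```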
Conclusion
AWS and Cisco demonstrated how enterprise-grade SQL generation can be made simpler, more precise, and faster. Although the demo used Olympic athlete data, the method applies to many business environments.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR and many more.
FAQs
1. How can the solution provide high SQL generation accuracy?
ANS: – It focuses LLM attention on specific data, uses metadata enrichment, and leverages domain-specific prompts.
2. What SQL generation models are used?
ANS: – Lightweight models such as Meta’s Code Llama 13B and Anthropic’s Claude Haiku 3 are used in SQL generation.
WRITTEN BY Daniya Muzammil
Daniya Muzammil works as a Research Intern at CloudThat and is passionate about learning new and emerging technologies.