Introduction
Amazon Nova Sonic is a speech-to-speech foundation model on Amazon Bedrock. It takes audio in and produces audio out over a single bidirectional stream, with no separate speech-to-text or text-to-speech pipeline required. That makes it ideal for real-time voice assistants.
But a model that speaks only from its training data will hallucinate. Ask it for hotel availability or flight schedules, and it will invent answers that sound convincing but are wrong. On a live phone call, callers trust what they hear; there is no way to cross-check. We solved this by implementing Retrieval-Augmented Generation (RAG) as a native tool inside Amazon Nova Sonic. The model calls our RAG tool mid-conversation, retrieves verified data from a knowledge base, and speaks the answer back to the caller. This post covers the major parts of that implementation.
Why RAG Is Essential for Voice AI
LLMs have a knowledge cutoff. Domain-specific facts, such as product catalogs, schedules, and pricing, change constantly. RAG decouples the model’s conversational ability from the data it references. Your domain team maintains a knowledge base with up-to-date information, and the model retrieves it at query time rather than guessing. For voice assistants, this is not optional; it is the difference between a useful product and a liability.
How Tool Use Works in Amazon Nova Sonic
Amazon Nova Sonic natively supports tool calling. During session setup, you pass an array of tool specifications in the promptStart event. Each spec has a name, a description (which the model uses to decide when to call it), and a JSON Schema for the input parameters. When the model decides it needs external data, it emits a toolUse event on the stream. Your application executes the function and sends the result back as a toolResult event. Amazon Nova Sonic then speaks the answer using that result.
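To make the flow concrete, here is roughly what a toolUse event looks like as it is read off the stream. This is a sketch, not the authoritative event schema; the field names match the ones we consume in Step 3 below, and the argument values are made up:

# Illustrative shape of a toolUse event (field names as consumed in Step 3)
evt = {
    "toolUse": {
        "toolName": "ragSearchTool",
        "toolUseId": "1a2b3c4d-...",  # correlates with the toolResult we send back
        # Arguments arrive as a JSON string matching the tool's input schema
        "content": '{"query": "pet-friendly hotels", "category": "hotels"}',
    }
}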
Step-by-Step Guide
Step 1: Defining the RAG Tool Specification
We defined a tool spec that tells Amazon Nova Sonic when and how to call our knowledge base. The description is precise, so the model only invokes it for domain-specific questions, not general conversation. An optional category parameter with an enum lets the model filter results at retrieval time:
import json

rag_search_schema = json.dumps({
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "description": "The query to search in the knowledge base",
        },
        "category": {
            "type": "string",
            "enum": ["hotels", "departures", "itineraries", "flights", "company_info"],
            "description": "Category filter to narrow results",
        },
    },
    "required": ["query"],
})

available_tools = [
    {
        "toolSpec": {
            "name": "ragSearchTool",
            "description": "Search the knowledge base for domain-"
                           "specific data: packages, schedules, hotels, "
                           "itineraries. Do NOT use for general questions.",
            "inputSchema": {"json": rag_search_schema},
        }
    },
]
Step 2: Registering Tools at Session Start
The tool specs are passed to Amazon Nova Sonic in the promptStart event, along with audio and text output configuration. Once sent, the model knows about every tool available for the session:
await send_event(session_id, {
    "event": {
        "promptStart": {
            "promptName": session.prompt_name,
            "audioOutputConfiguration": {
                "audioType": "SPEECH",
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 8000,
                "voiceId": "tiffany",
            },
            "toolUseOutputConfiguration": {
                "mediaType": "application/json"
            },
            "toolConfiguration": {
                "tools": available_tools  # <-- RAG tool registered here
            },
        }
    }
})
Step 3: Handling Tool Calls in the Response Stream
As the bidirectional stream runs, we watch for toolUse events. When one arrives, we capture the tool name and arguments. A contentEnd event with type TOOL signals that the model is waiting for the result. We execute the RAG search and send the result back using a three-event sequence: contentStart, toolResult, contentEnd:
import json
from uuid import uuid4

# Detecting the tool call:
if evt.get("toolUse"):
    session.tool_name = evt["toolUse"]["toolName"]
    session.tool_use_id = evt["toolUse"]["toolUseId"]
    session.tool_args = evt["toolUse"].get("content", "{}")
elif evt.get("contentEnd") and evt["contentEnd"]["type"] == "TOOL":
    result = await tool_processor(session.tool_name, session.tool_args)
    await send_tool_result(session_id, session.tool_use_id, result)

# Sending the result back (three-event sequence).
# `session` is the per-call state object our server tracks for each stream.
async def send_tool_result(session_id, tool_use_id, result):
    content_id = str(uuid4())
    await send_event(session_id, {"event": {"contentStart": {
        "promptName": session.prompt_name,
        "contentName": content_id,
        "type": "TOOL",
        "toolResultInputConfiguration": {
            "toolUseId": tool_use_id,
            "type": "TEXT",
            "textInputConfiguration": {"mediaType": "text/plain"},
        },
    }}})
    await send_event(session_id, {"event": {"toolResult": {
        "promptName": session.prompt_name,
        "contentName": content_id,
        "content": json.dumps(result),
        "role": "TOOL",
    }}})
    await send_event(session_id, {"event": {"contentEnd": {
        "promptName": session.prompt_name,
        "contentName": content_id,
    }}})
Step 4: The RAG Search Function
The actual retrieval uses Amazon Bedrock Knowledge Bases. The function takes the query and optional category filter, runs a semantic search, and returns the top results as plain text that Nova Sonic can speak aloud:
import boto3

# Amazon Bedrock Knowledge Bases runtime client; KB_ID holds the knowledge base ID
kb_client = boto3.client("bedrock-agent-runtime")

def rag_search(args: dict) -> dict:
    query = args.get("query", "")
    category = args.get("category", "")

    config = {"vectorSearchConfiguration": {
        "numberOfResults": 5,
        "overrideSearchType": "SEMANTIC",
    }}
    if category:
        config["vectorSearchConfiguration"]["filter"] = {
            "equals": {"key": "category", "value": category}
        }

    response = kb_client.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalConfiguration=config,
        retrievalQuery={"text": query},
    )
    results = [r["content"]["text"] for r in response["retrievalResults"]]
    return {"answer": "\n\n".join(results)}
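The tool_processor referenced in Step 3 is a thin dispatcher that connects the toolUse event to this function. A minimal sketch, assuming only the ragSearchTool defined in Step 1 (the JSON-decoding fallback and the not-found message are illustrative):

import asyncio
import json

async def tool_processor(tool_name: str, tool_args: str) -> dict:
    # tool_args arrives as a JSON string on the toolUse event
    try:
        args = json.loads(tool_args) if tool_args else {}
    except json.JSONDecodeError:
        args = {}

    if tool_name == "ragSearchTool":
        # rag_search is a blocking boto3 call; keep it off the event loop
        return await asyncio.to_thread(rag_search, args)
    return {"answer": "Sorry, I could not find that information."}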
Handling Latency with Filler Sentences
A knowledge base lookup takes 2 to 5 seconds. On a phone call, that silence is unacceptable. The moment a tool call is detected, we inject a filler phrase using Nova Sonic’s cross-modal text input. The model speaks the filler while the RAG search runs in parallel:
import random

FILLERS = [
    "One moment, let me check that for you.",
    "Sure, let me look into that.",
    "Let me pull that up real quick.",
]

# On toolUse event, before executing the tool:
filler = random.choice(FILLERS)
await send_cross_modal_text(session_id, f"[Say to the caller: '{filler}']")
# Tool executes in parallel; result is sent when ready
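Putting it together, a toolUse handler can send the filler first and then run the retrieval, so the model speaks while the lookup is in flight. A sketch reusing the helpers above (handle_tool_use is our own wrapper name; send_cross_modal_text is our helper for Nova Sonic’s text input):

async def handle_tool_use(session_id, session):
    # Speak a filler immediately so the caller never hears dead air
    filler = random.choice(FILLERS)
    await send_cross_modal_text(session_id, f"[Say to the caller: '{filler}']")

    # The model speaks the filler while we await the retrieval
    result = await tool_processor(session.tool_name, session.tool_args)
    await send_tool_result(session_id, session.tool_use_id, result)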
Conclusion
Amazon Nova Sonic’s native tool calling makes RAG a first-class part of the conversation: register the tool at session start, watch the stream for toolUse events, and return retrieved facts as toolResult events. Add filler sentences to cover latency, and you have a voice bot that sounds natural and answers accurately, exactly what callers expect on a live phone call.
Drop a query if you have any questions regarding Amazon Nova Sonic, and we will get back to you quickly.
FAQs
1. Does Amazon Nova Sonic support tool calling natively, or do I need a separate orchestration layer?
ANS: – Amazon Nova Sonic supports tool calling natively within its bidirectional streaming protocol. You define tools in the promptStart event, and the model emits toolUse events on the same stream. No external orchestration framework is needed.
2. Can I register multiple tools in a single session?
ANS: – Yes. The toolConfiguration.tools array accepts multiple tool specs. The model uses each tool’s description to decide which one to call for a given query. You can have a RAG tool, a web search tool, a transfer-to-agent tool, and more, all in the same session.
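For example, adding a second tool is just another entry in the same array. The transferToAgentTool spec below is hypothetical, included only to show the shape:

available_tools = [
    {"toolSpec": {
        "name": "ragSearchTool",
        "description": "Search the knowledge base for domain-specific data.",
        "inputSchema": {"json": rag_search_schema},
    }},
    {"toolSpec": {  # hypothetical second tool for escalation to a human
        "name": "transferToAgentTool",
        "description": "Transfer the caller to a human agent when they ask for one.",
        "inputSchema": {"json": json.dumps({
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        })},
    }},
]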
WRITTEN BY Ahmad Wani
Ahmad works as a Research Associate in the Data and AIoT Department at CloudThat. He specializes in Generative AI, Machine Learning, and Deep Learning, with hands-on experience in building intelligent solutions that leverage advanced AI technologies. Alongside his AI expertise, Ahmad also has a solid understanding of front-end development, working with technologies such as React.js, HTML, and CSS to create seamless and interactive user experiences. In his free time, Ahmad enjoys exploring emerging technologies, playing football, and continuously learning to expand his expertise.