Introduction
Amazon Nova Sonic is a speech-to-speech foundation model on Amazon Bedrock. It takes audio in and produces audio out over a single bidirectional stream, with no separate speech-to-text or text-to-speech pipeline required. That makes it ideal for real-time voice assistants.
But a model that speaks only from its training data will hallucinate. Ask it for hotel availability or flight schedules, and it will invent answers that sound convincing but are wrong. On a live phone call, callers trust what they hear; there is no way to cross-check. We solved this by implementing Retrieval-Augmented Generation (RAG) as a native tool inside Amazon Nova Sonic. The model calls our RAG tool mid-conversation, retrieves verified data from a knowledge base, and speaks the answer back to the caller. This post covers the major parts of that implementation.
Why RAG Is Essential for Voice AI
LLMs have a knowledge cutoff. Domain-specific facts, such as product catalogs, schedules, and pricing, change constantly. RAG decouples the model’s conversational ability from the data it references. Your domain team maintains a knowledge base with up-to-date information, and the model retrieves it at query time rather than guessing. For voice assistants, this is not optional; it is the difference between a useful product and a liability.
How Tool Use Works in Amazon Nova Sonic
Amazon Nova Sonic natively supports tool calling. During session setup, you pass an array of tool specifications in the promptStart event. Each spec has a name, a description (which the model uses to decide when to call it), and a JSON Schema for the input parameters. When the model decides it needs external data, it emits a toolUse event on the stream. Your application executes the function and sends the result back as a toolResult event. Amazon Nova Sonic then speaks the answer using that result.
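To make the flow concrete, here is roughly what a toolUse event looks like as it is read off the stream. This is a sketch, not the authoritative event schema; the field names match the ones we consume in Step 3 below, and the argument values are made up:

# Illustrative shape of a toolUse event (field names as consumed in Step 3)
evt = {
    "toolUse": {
        "toolName": "ragSearchTool",
        "toolUseId": "1a2b3c4d-...",  # correlates with the toolResult we send back
        # Arguments arrive as a JSON string matching the tool's input schema
        "content": '{"query": "pet-friendly hotels", "category": "hotels"}',
    }
}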
Step-by-Step Guide
Step 1: Defining the RAG Tool Specification
We defined a tool spec that tells Amazon Nova Sonic when and how to call our knowledge base. The description is precise, so the model only invokes it for domain-specific questions, not general conversation. An optional category parameter with an enum lets the model filter results at retrieval time:
import json

rag_search_schema = json.dumps({
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "description": "The query to search in the knowledge base",
        },
        "category": {
            "type": "string",
            "enum": ["hotels", "departures", "itineraries", "flights", "company_info"],
            "description": "Category filter to narrow results",
        },
    },
    "required": ["query"],
})

available_tools = [
    {
        "toolSpec": {
            "name": "ragSearchTool",
            "description": "Search the knowledge base for domain-"
                           "specific data: packages, schedules, hotels, "
                           "itineraries. Do NOT use for general questions.",
            "inputSchema": {"json": rag_search_schema},
        }
    },
]
Step 2: Registering Tools at Session Start
The tool specs are passed to Amazon Nova Sonic in the promptStart event, along with audio and text output configuration. Once sent, the model knows about every tool available for the session:
await send_event(session_id, {
    "event": {
        "promptStart": {
            "promptName": session.prompt_name,
            "audioOutputConfiguration": {
                "audioType": "SPEECH",
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 8000,
                "voiceId": "tiffany",
            },
            "toolUseOutputConfiguration": {
                "mediaType": "application/json"
            },
            "toolConfiguration": {
                "tools": available_tools  # <-- RAG tool registered here
            },
        }
    }
})
Step 3: Handling Tool Calls in the Response Stream
As the bidirectional stream runs, we watch for toolUse events. When one arrives, we capture the tool name and arguments. A contentEnd event with type TOOL signals that the model is waiting for the result. We execute the RAG search and send the result back using a three-event sequence: contentStart, toolResult, contentEnd:
import json
from uuid import uuid4

# Detecting the tool call:
if evt.get("toolUse"):
    session.tool_name = evt["toolUse"]["toolName"]
    session.tool_use_id = evt["toolUse"]["toolUseId"]
    session.tool_args = evt["toolUse"].get("content", "{}")
elif evt.get("contentEnd") and evt["contentEnd"]["type"] == "TOOL":
    result = await tool_processor(session.tool_name, session.tool_args)
    await send_tool_result(session_id, session.tool_use_id, result)

# Sending the result back (three-event sequence).
# `session` is the per-call state object our server tracks for each stream.
async def send_tool_result(session_id, tool_use_id, result):
    content_id = str(uuid4())
    await send_event(session_id, {"event": {"contentStart": {
        "promptName": session.prompt_name,
        "contentName": content_id,
        "type": "TOOL",
        "toolResultInputConfiguration": {
            "toolUseId": tool_use_id,
            "type": "TEXT",
            "textInputConfiguration": {"mediaType": "text/plain"},
        },
    }}})
    await send_event(session_id, {"event": {"toolResult": {
        "promptName": session.prompt_name,
        "contentName": content_id,
        "content": json.dumps(result),
        "role": "TOOL",
    }}})
    await send_event(session_id, {"event": {"contentEnd": {
        "promptName": session.prompt_name,
        "contentName": content_id,
    }}})
Step 4: The RAG Search Function
The actual retrieval uses Amazon Bedrock Knowledge Bases. The function takes the query and optional category filter, runs a semantic search, and returns the top results as plain text that Nova Sonic can speak aloud:
import boto3

# Amazon Bedrock Knowledge Bases runtime client; KB_ID holds the knowledge base ID
kb_client = boto3.client("bedrock-agent-runtime")

def rag_search(args: dict) -> dict:
    query = args.get("query", "")
    category = args.get("category", "")

    config = {"vectorSearchConfiguration": {
        "numberOfResults": 5,
        "overrideSearchType": "SEMANTIC",
    }}
    if category:
        config["vectorSearchConfiguration"]["filter"] = {
            "equals": {"key": "category", "value": category}
        }

    response = kb_client.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalConfiguration=config,
        retrievalQuery={"text": query},
    )
    results = [r["content"]["text"] for r in response["retrievalResults"]]
    return {"answer": "\n\n".join(results)}
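The tool_processor referenced in Step 3 is a thin dispatcher that connects the toolUse event to this function. A minimal sketch, assuming only the ragSearchTool defined in Step 1 (the JSON-decoding fallback and the not-found message are illustrative):

import asyncio
import json

async def tool_processor(tool_name: str, tool_args: str) -> dict:
    # tool_args arrives as a JSON string on the toolUse event
    try:
        args = json.loads(tool_args) if tool_args else {}
    except json.JSONDecodeError:
        args = {}

    if tool_name == "ragSearchTool":
        # rag_search is a blocking boto3 call; keep it off the event loop
        return await asyncio.to_thread(rag_search, args)
    return {"answer": "Sorry, I could not find that information."}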
Handling Latency with Filler Sentences
A knowledge base lookup takes 2 to 5 seconds. On a phone call, that silence is unacceptable. The moment a tool call is detected, we inject a filler phrase using Nova Sonic’s cross-modal text input. The model speaks the filler while the RAG search runs in parallel:
import random

FILLERS = [
    "One moment, let me check that for you.",
    "Sure, let me look into that.",
    "Let me pull that up real quick.",
]

# On toolUse event, before executing the tool:
filler = random.choice(FILLERS)
await send_cross_modal_text(session_id, f"[Say to the caller: '{filler}']")
# Tool executes in parallel; result is sent when ready
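Putting it together, a toolUse handler can send the filler first and then run the retrieval, so the model speaks while the lookup is in flight. A sketch reusing the helpers above (handle_tool_use is our own wrapper name; send_cross_modal_text is our helper for Nova Sonic’s text input):

async def handle_tool_use(session_id, session):
    # Speak a filler immediately so the caller never hears dead air
    filler = random.choice(FILLERS)
    await send_cross_modal_text(session_id, f"[Say to the caller: '{filler}']")

    # The model speaks the filler while we await the retrieval
    result = await tool_processor(session.tool_name, session.tool_args)
    await send_tool_result(session_id, session.tool_use_id, result)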
Conclusion
Amazon Nova Sonic’s native tool calling makes RAG a first-class part of the conversation: register the tool at session start, watch the stream for toolUse events, and return retrieved facts as toolResult events. Add filler sentences to cover latency, and you have a voice bot that sounds natural and answers accurately, exactly what callers expect on a live phone call.
Drop a query if you have any questions regarding Amazon Nova Sonic, and we will get back to you quickly.
FAQs
1. Does Amazon Nova Sonic support tool calling natively, or do I need a separate orchestration layer?
ANS: – Amazon Nova Sonic supports tool calling natively within its bidirectional streaming protocol. You define tools in the promptStart event, and the model emits toolUse events on the same stream. No external orchestration framework is needed.
2. Can I register multiple tools in a single session?
ANS: – Yes. The toolConfiguration.tools array accepts multiple tool specs. The model uses each tool’s description to decide which one to call for a given query. You can have a RAG tool, a web search tool, a transfer-to-agent tool, and more, all in the same session.
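For example, adding a second tool is just another entry in the same array. The transferToAgentTool spec below is hypothetical, included only to show the shape:

available_tools = [
    {"toolSpec": {
        "name": "ragSearchTool",
        "description": "Search the knowledge base for domain-specific data.",
        "inputSchema": {"json": rag_search_schema},
    }},
    {"toolSpec": {  # hypothetical second tool for escalation to a human
        "name": "transferToAgentTool",
        "description": "Transfer the caller to a human agent when they ask for one.",
        "inputSchema": {"json": json.dumps({
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        })},
    }},
]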
WRITTEN BY Ahmad Wani
Ahmad works as a Research Associate in the Data and AIoT Department at CloudThat. He specializes in Generative AI, Machine Learning, and Deep Learning, with hands-on experience in building intelligent solutions that leverage advanced AI technologies. Alongside his AI expertise, Ahmad also has a solid understanding of front-end development, working with technologies such as React.js, HTML, and CSS to create seamless and interactive user experiences. In his free time, Ahmad enjoys exploring emerging technologies, playing football, and continuously learning to expand his expertise.