Creating a Real Time AI Voice Agent with Exotel and Amazon Nova Sonic

Overview

Imagine picking up the phone and talking to an AI assistant as naturally as you would to a person: it understands what you say and replies in speech, with no irritating “press 1 for sales” prompts. This blog shows how to build that experience by combining Exotel’s cloud telephony services with Amazon Nova Sonic.

Introduction

Voice has always been the most natural form of communication; however, most automated telephone systems aren’t exactly natural. The conventional technique combines three distinct services: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). With each hop, latency increases, leading to awkward pauses between what the user says and the automated response.

Amazon Nova Sonic takes a different approach: it is a native speech-to-speech model that takes raw audio as input and generates raw audio as output over a single bidirectional stream, with no text intermediary. The model can listen and respond to the user at the same time, much as a person does.

Exotel is a cloud-based telephony platform that provides programmable voice APIs and call routing features. Most importantly, its Voice Bot Applet streams call audio over WebSockets, making it a good fit for building voice bots on top of Amazon Nova Sonic.

Combining these two platforms with a minimal Python server yields an AI voice agent that can hold meaningful phone conversations.

Architecture Flow

The architecture is simple. A FastAPI server sits between Exotel and Amazon Bedrock, acting as both an audio bridge and an orchestrator.

Exotel streams the incoming call’s audio to the FastAPI server over a WebSocket. The server forwards the raw PCM audio to Amazon Nova Sonic’s bidirectional streaming API via Amazon Bedrock. The model recognizes the caller’s speech and, where needed, triggers tools such as a knowledge base lookup or a web search. Its spoken response is then streamed back to the FastAPI server, which relays it to the caller through Exotel.
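The bridging role can be illustrated with a minimal sketch. It is not the post's actual server code: `asyncio.Queue` objects stand in for the Exotel WebSocket and the Amazon Bedrock stream, and the names `pump`, `bridge`, and `END` are hypothetical. The point is only that both directions run concurrently, so audio can flow to the model while the model's audio flows back.

```python
import asyncio

END = None  # sentinel marking end-of-stream in this sketch (an assumption)


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> int:
    """Copy audio chunks from source to sink until END; return bytes forwarded."""
    forwarded = 0
    while True:
        chunk = await source.get()
        if chunk is END:
            await sink.put(END)  # propagate end-of-stream downstream
            return forwarded
        await sink.put(chunk)
        forwarded += len(chunk)


async def bridge(caller_in, model_in, model_out, caller_out):
    """Run both directions at once so the model can listen while it speaks."""
    return await asyncio.gather(
        pump(caller_in, model_in),    # caller audio -> model
        pump(model_out, caller_out),  # model audio  -> caller
    )
```

In a real server the queues would be replaced by the WebSocket receive/send calls and the model's input/output stream, but the two-coroutine shape would likely stay the same.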

How a Call Flows Through the System

  1. Call received: Exotel handles the incoming call and sends an HTTP request to the server’s /incoming-call endpoint. The server then starts an Amazon Nova Sonic session through Amazon Bedrock and returns a WebSocket URL for Exotel to connect to.
  2. Start audio streaming: Exotel establishes the WebSocket connection and starts streaming raw PCM audio frames (16 bits per sample, mono, 8 kHz). The server passes them directly into the bidirectional stream to Amazon Nova Sonic.
  3. Speech processing with Amazon Nova Sonic: The model analyzes the speech, recognizes when the user has finished talking (turn detection), understands the intent, and speaks its answer back in the same session. No separate calls are made to STT or TTS services.
  4. Calls to tools: If the request requires external information, Amazon Nova Sonic raises a tool use event. The server invokes the relevant tool (e.g., querying an Amazon Bedrock Knowledge Base or calling a search API) and pushes the result back through the stream.
  5. Call ended: When the caller hangs up or the WebSocket disconnects, the server saves the call transcript to Amazon DynamoDB and closes the session.
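Step 4 can be sketched as a small dispatch table. This is an illustration under assumptions, not Amazon Nova Sonic's actual event schema: the event keys `tool_use_id`, `tool_name`, and `arguments`, the result shape, and the stub `search_knowledge_base` are all hypothetical. What it shows is the pattern of mapping a tool name to a handler and always returning a result, even on failure, so the call is not left hanging.

```python
import json


def search_knowledge_base(query: str) -> dict:
    # Stub: a real handler would query an Amazon Bedrock Knowledge Base.
    return {"answer": f"(stub) no documents indexed for: {query}"}


# Hypothetical registry mapping tool names to handlers.
TOOLS = {"search_knowledge_base": search_knowledge_base}


def handle_tool_use(event: dict) -> dict:
    """Run the tool named in a tool-use event and build the result message."""
    name = event["tool_name"]
    handler = TOOLS.get(name)
    if handler is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = handler(**event["arguments"])
        except Exception as exc:  # report failures instead of crashing the call
            result = {"error": str(exc)}
    return {"tool_use_id": event["tool_use_id"], "content": json.dumps(result)}
```

Returning an error payload rather than raising lets the model explain the failure to the caller in its own words.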

Key Implementation Details

  1. Pre-warming Sessions to Minimize Initial Response Delay

Establishing the bidirectional stream with Amazon Bedrock takes 1-2 seconds. Instead of keeping the caller waiting, the server initializes the Amazon Nova Sonic session as soon as it receives the initial HTTP request, before the WebSocket connection is established. By the time audio starts streaming, the model is already warmed up and listening.

  2. Audio Format Compatibility Without Conversion Costs

Conveniently, Exotel’s Voice Bot Applet and Amazon Nova Sonic use exactly the same audio encoding in this setup: 16-bit signed little-endian PCM, mono, at an 8 kHz sampling rate. The audio conversion module therefore becomes a simple pass-through that requires no resampling or transcoding.
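The arithmetic behind this format is worth making explicit. The sketch below derives frame sizes from the parameters stated above; the function names and the sanity check are illustrative assumptions, not part of either platform's API.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, as described above
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # mono


def frame_bytes(duration_ms: int) -> int:
    """Size in bytes of a PCM frame of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000


def passthrough(chunk: bytes) -> bytes:
    """Forward a chunk unchanged, rejecting chunks that split a sample."""
    if len(chunk) % (BYTES_PER_SAMPLE * CHANNELS) != 0:
        raise ValueError("chunk does not contain a whole number of samples")
    return chunk


def first_sample(chunk: bytes) -> int:
    """Decode the first sample as a signed little-endian 16-bit integer."""
    return int.from_bytes(chunk[:BYTES_PER_SAMPLE], "little", signed=True)
```

At these settings one second of audio is 16,000 bytes and a 20 ms frame is 320 bytes, which is useful when sizing buffers or checking that incoming chunks look plausible.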

  3. Silence Detection and Follow-up Prompt

An idle detection routine tracks how long the caller has been silent. Once a predefined silence threshold is exceeded, the server sends Amazon Nova Sonic a text prompt and waits for the model’s response, so the caller hears a follow-up instead of dead air.
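One way such an idle detector might look is sketched below. The class name `IdleMonitor`, its methods, and the choice to prompt only once per silent stretch are assumptions made for illustration; the post does not specify how the server detects caller speech or what threshold it uses. The clock is injectable so the logic can be tested without real waiting.

```python
import time


class IdleMonitor:
    """Tracks how long the caller has been silent."""

    def __init__(self, threshold_s: float, clock=time.monotonic):
        self.threshold_s = threshold_s
        self._clock = clock
        self._last_speech = clock()
        self._prompted = False

    def on_caller_speech(self) -> None:
        """Call whenever caller speech is detected; resets the timer."""
        self._last_speech = self._clock()
        self._prompted = False

    def should_prompt(self) -> bool:
        """True once per silent stretch, after the threshold is exceeded."""
        if self._prompted:
            return False
        if self._clock() - self._last_speech > self.threshold_s:
            self._prompted = True
            return True
        return False
```

A server loop could poll `should_prompt()` periodically and, when it returns true, send the follow-up text prompt to the model.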

  4. Preventing Unnecessary Pauses While Interacting with Tools

Calls to knowledge bases, web search engines, and similar tools may take several seconds. To avoid an awkward pause, the server injects a filler phrase into the Amazon Nova Sonic session while the tool runs.
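A hedged sketch of one way to time the filler: start the tool call, wait briefly, and trigger the filler only if the tool has not finished. The function name, the `send_filler` callback, the half-second default, and the phrase itself are all hypothetical; the post does not say whether the filler is sent on every tool call or only on slow ones.

```python
import asyncio


async def run_tool_with_filler(tool_coro, send_filler, delay_s: float = 0.5):
    """Await a tool call; if it outlasts delay_s, trigger a filler phrase once."""
    task = asyncio.ensure_future(tool_coro)
    # asyncio.wait returns after the timeout without cancelling the task.
    done, _ = await asyncio.wait({task}, timeout=delay_s)
    if not done:
        await send_filler("One moment while I look that up.")
    return await task
```

Fast tools return without any filler, so short lookups do not produce an unnecessary "one moment" before the answer.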

Conclusion

Exotel and Amazon Nova Sonic together make it possible to build AI voice agents that hold realistic telephone conversations. Exotel manages the telephony layer, including calls and audio streams, while Amazon Nova Sonic handles speech recognition, reasoning, and the spoken response. Between them sits a lightweight FastAPI server that bridges the audio streams and invokes external tools. Any business considering automating telephone interactions can benefit from this setup.

Drop a query if you have any questions regarding Amazon Nova Sonic and we will get back to you quickly.

FAQs

1. What is Amazon Nova Sonic, and how does it differ from conventional voice AI?

ANS: – Amazon Nova Sonic is a speech-to-speech foundation model available through Amazon Bedrock. Conventional voice AI pipelines chain together Speech-to-Text, LLM, and Text-to-Speech APIs. Amazon Nova Sonic replaces those three APIs with a single bidirectional stream for speech in and speech out, lowering latency and simplifying the architecture.

2. What makes Exotel suitable for this integration?

ANS: – Exotel offers a Voice Bot Applet that streams raw PCM audio over WebSockets in real time, which is exactly what is needed to connect a phone call to Amazon Nova Sonic. Exotel manages call routing, number management, and telecommunications compliance so that you can focus on the AI portion of the application.

WRITTEN BY Nekkanti Bindu

Nekkanti Bindu works as a Research Associate at CloudThat, where she channels her passion for cloud computing into meaningful work every day. Fascinated by the endless possibilities of the cloud, Bindu has established herself as an AWS consultant, helping organizations harness the full potential of AWS technologies. A firm believer in continuous learning, she stays at the forefront of industry trends and evolving cloud innovations. With a strong commitment to making a lasting impact, Bindu is driven to empower businesses to thrive in a cloud-first world.
