Overview
Creating a conversational avatar that can listen, respond, and animate in real time is becoming increasingly common in modern web applications. With advances in browser APIs and AI models, it is now possible to build a fully interactive avatar experience directly in the browser.
In this article, we will explore how to build a real-time conversational avatar using the following stack:
- React – Application UI and state management
- Three.js – 3D rendering engine
- React Three Fiber – React renderer for Three.js
- Ready Player Me – Avatar creation platform
- Amazon Nova Sonic – Voice-to-voice conversational AI
- WebSockets + Web Audio API – Real-time audio streaming and playback
The goal is to stream microphone audio to an AI model, receive audio responses over WebSockets, and animate a 3D avatar’s mouth to appear to speak the response.
Architecture Overview
The system operates through a streaming pipeline in which both input and output audio travel over a WebSocket connection.
```
React UI
   │ microphone input
   ▼
Audio streaming (WebSocket)
   ▼
Amazon Nova Sonic
   │ audio response
   ▼
Web Audio API
   ▼
Lip Sync Engine
   ▼
React Three Fiber Avatar
```
This architecture enables low-latency conversational interaction, allowing the avatar to respond almost immediately after the user speaks.
Why This Stack Works Well
Let’s quickly understand why each technology in this stack is a good fit.
React
React manages the UI and application state. It also provides an ecosystem of hooks and component abstractions that make handling streams and events easier.
Three.js
Three.js is the most widely used WebGL rendering engine for 3D graphics in browsers.
React Three Fiber
Instead of manually controlling Three.js scenes, React Three Fiber lets you use React components to build and control a 3D scene.
Ready Player Me
Ready Player Me enables developers to generate high-quality avatars with built-in facial blend shapes that support lip-sync animation.
Amazon Nova Sonic
Amazon Nova Sonic is a voice-to-voice conversational AI model designed for real-time dialogue.
Capturing Microphone Audio
The first step is capturing audio from the user’s microphone using the browser’s MediaDevices API.
```javascript
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const audioContext = new AudioContext()
const source = audioContext.createMediaStreamSource(stream)
```
To send audio to the backend, you convert microphone audio into PCM chunks and send them over WebSocket.
```javascript
const ws = new WebSocket("wss://your-server")

// A ScriptProcessorNode exposes raw sample buffers as they arrive.
// (It is deprecated in favor of AudioWorklet, but is simpler to show here.)
const processor = audioContext.createScriptProcessor(4096, 1, 1)
source.connect(processor)
processor.connect(audioContext.destination)

processor.onaudioprocess = (event) => {
  const input = event.inputBuffer.getChannelData(0)
  ws.send(input.buffer)
}
```
Now the user’s speech is continuously streamed to the AI backend.
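The snippet above sends raw 32-bit float samples. Speech models, Amazon Nova Sonic included, typically expect 16-bit PCM, so a conversion step is usually needed before sending. A minimal sketch (sample-rate conversion is omitted for brevity):

```javascript
// Convert Float32 samples (range -1..1) from the Web Audio API
// into 16-bit signed PCM, the format speech APIs commonly expect.
function float32ToInt16(float32Array) {
  const int16 = new Int16Array(float32Array.length)
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1] before scaling to avoid integer overflow.
    const s = Math.max(-1, Math.min(1, float32Array[i]))
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return int16
}
```

Inside `onaudioprocess`, you would then call `ws.send(float32ToInt16(input).buffer)` instead of sending the float buffer directly.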
Streaming Audio to Amazon Nova Sonic
Once the audio reaches the backend, it is forwarded to the conversational AI model.
The backend pipeline typically looks like this:
```
WebSocket Server
   │
   ▼
Amazon Nova Sonic
   │
   ▼
Audio Response Stream
```
The model processes the incoming speech and produces an audio response stream.
Receiving Streaming Audio in React
The frontend receives audio chunks through the WebSocket connection.
```javascript
ws.onmessage = (event) => {
  const audioChunk = event.data
  playAudio(audioChunk)
}
```
Since the response is streamed, the audio needs to be appended to a playback buffer.
Playing Audio Using the Web Audio API
The Web Audio API provides low-latency audio playback and analysis.
```javascript
const audioContext = new AudioContext()

async function playAudio(chunk) {
  // decodeAudioData expects an ArrayBuffer, so set
  // ws.binaryType = "arraybuffer" on the WebSocket.
  const buffer = await audioContext.decodeAudioData(chunk)
  const source = audioContext.createBufferSource()
  source.buffer = buffer
  source.connect(audioContext.destination)
  source.start()
}
```
This allows the avatar’s voice response to play almost immediately after each chunk arrives.
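Calling `source.start()` the instant each chunk decodes can produce gaps or overlaps between chunks. A small bookkeeping class that schedules each chunk exactly when the previous one ends keeps playback seamless. This is a sketch; the class name is our own:

```javascript
// Tracks when the next audio chunk should start so that chunks
// play back-to-back without gaps or overlap.
class PlaybackClock {
  constructor() {
    this.nextStartTime = 0
  }

  // currentTime: the AudioContext clock, in seconds.
  // duration: length of the decoded chunk, in seconds.
  // Returns the time at which the chunk should be scheduled.
  schedule(currentTime, duration) {
    // Never schedule in the past; otherwise chain after the last chunk.
    const startAt = Math.max(this.nextStartTime, currentTime)
    this.nextStartTime = startAt + duration
    return startAt
  }
}
```

In `playAudio`, you would replace `source.start()` with `source.start(clock.schedule(audioContext.currentTime, buffer.duration))`.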
Implementing the Lip Sync Engine
Since the WebSocket response only provides audio, we can approximate lip movements using audio energy.
The Web Audio API provides an AnalyserNode that lets us measure audio amplitude in real time.
```javascript
const analyser = audioContext.createAnalyser()
// In playAudio, also connect each source to the analyser
// so it can measure the audio being played:
//   source.connect(analyser)

function updateLipSync() {
  const data = new Uint8Array(analyser.frequencyBinCount)
  analyser.getByteFrequencyData(data)

  // Average the frequency bins to get an overall loudness value.
  const volume = data.reduce((a, b) => a + b, 0) / data.length

  // Normalize to 0–1 and drive the mouth (avatar is your model wrapper).
  avatar.mouthOpen = volume / 255

  requestAnimationFrame(updateLipSync)
}
```
This approach makes the avatar’s mouth open and close depending on the speech volume.
While it is not phoneme-accurate lip synchronization, it provides a convincing talking animation for conversational avatars.
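One practical refinement: raw per-frame amplitude fluctuates rapidly, which makes the mouth flicker. Smoothing the value with an exponential moving average before applying it gives a more natural motion. A sketch (the factory-function shape and default `alpha` are our choices):

```javascript
// Exponential smoothing keeps the mouth from flickering frame to frame.
// alpha closer to 1 reacts faster; closer to 0 moves more smoothly.
function createSmoother(alpha = 0.3) {
  let value = 0
  return (sample) => {
    value = value + alpha * (sample - value)
    return value
  }
}
```

In `updateLipSync`, you would write `avatar.mouthOpen = smooth(volume / 255)` where `smooth` was created once outside the loop.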
Rendering the Avatar with React Three Fiber
Now we integrate the avatar into the React application.
Install dependencies:
```
npm install three @react-three/fiber @react-three/drei
```
Then create a 3D scene:
```jsx
<Canvas>
  <ambientLight />
  <Avatar />
</Canvas>
```
The avatar component loads a Ready Player Me model.
```jsx
import { useGLTF } from "@react-three/drei"

function Avatar() {
  const { scene } = useGLTF("/avatar.glb")
  return <primitive object={scene} />
}
```
Ready Player Me avatars include blendshapes, which allow the mouth to animate smoothly.
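To drive a blendshape you first need to locate the mesh that owns it. Meshes loaded from glTF expose a `morphTargetDictionary` mapping blendshape names to indices, so a small traversal helper can find the target by name. A sketch (the helper name and return shape are our own):

```javascript
// Searches a loaded glTF scene graph for a mesh that exposes a
// named morph target (e.g. "jawOpen") and returns the mesh plus
// the index into its morphTargetInfluences array.
function findMorphTarget(scene, name) {
  let result = null
  scene.traverse((node) => {
    if (!result && node.morphTargetDictionary && name in node.morphTargetDictionary) {
      result = { mesh: node, index: node.morphTargetDictionary[name] }
    }
  })
  return result
}
```

For example, `findMorphTarget(scene, "jawOpen")` gives you the mesh and index to animate each frame.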
Connecting Lip Sync to the Avatar
Blendshapes control the avatar’s facial expressions.
For example:
- viseme_aa
- viseme_oo
- viseme_ff
For a simple lip sync approximation, you can map audio amplitude to the jawOpen morph target.
```javascript
avatar.morphTargetInfluences[jawIndex] = mouthOpenValue
```
As the audio gets louder, the mouth opens wider.
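In practice the raw 0–1 amplitude benefits from two adjustments: a noise floor so background hiss keeps the mouth closed, and a gain so normal speech opens it visibly. A sketch with illustrative (not tuned) thresholds:

```javascript
// Maps a raw 0..1 amplitude to a jaw-open value.
// floor: amplitudes below this count as silence.
// gain: how strongly loudness opens the mouth.
function amplitudeToJaw(volume, { floor = 0.05, gain = 2.0 } = {}) {
  const v = Math.max(0, volume - floor) * gain
  return Math.min(1, v) // clamp so the jaw never over-rotates
}
```

The result is what you assign to `morphTargetInfluences[jawIndex]` each frame.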
Handling Streaming Synchronization
Streaming introduces challenges such as:
- network latency
- audio buffering
- WebSocket jitter
A good practice is to buffer 100–300 ms of audio before playback to ensure smooth streaming.
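That guideline can be expressed as a simple readiness check: accumulate incoming chunk durations and only begin playback once the total crosses the threshold. A minimal sketch (function name and default are our own):

```javascript
// Decide when enough audio has been buffered to begin playback.
// chunkDurations: durations (in seconds) of the chunks queued so far.
// minBufferSec: target pre-buffer, e.g. 0.1–0.3 s per the guideline above.
function isReadyToPlay(chunkDurations, minBufferSec = 0.2) {
  const total = chunkDurations.reduce((sum, d) => sum + d, 0)
  return total >= minBufferSec
}
```

Once `isReadyToPlay` returns true, drain the queue through `playAudio`; afterwards, chunks can be scheduled as they arrive.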
Performance Considerations
When building conversational avatars in the browser, performance matters.
Key tips:
- Use compressed GLB avatars
- Limit scene complexity
- Avoid large texture sizes
- Keep WebSocket messages small
React Three Fiber is efficient because it only updates the scene when necessary.
Final Result
With this architecture, the user experiences a seamless conversational loop:
- The user speaks into the microphone
- Audio is streamed via WebSocket
- The AI generates a spoken response
- The response is streamed back
- The avatar speaks and animates in real time
The system feels like interacting with a live digital assistant rather than a traditional chatbot.
Conclusion
By combining React, WebSockets, Web Audio API, React Three Fiber, and Ready Player Me, developers can build powerful browser-based conversational avatars.
Although frontend-only lip sync based on audio amplitude is not perfectly accurate, it is sufficient for many real-time conversational experiences.
As AI voice models and speech engines continue to improve, integrating phoneme-based lip sync will make these avatars even more realistic.
For now, this approach offers a practical and production-ready way to bring interactive AI avatars to the web.
Drop a query if you have any questions regarding WebSockets and we will get back to you quickly.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. Can accurate lip sync be achieved if the backend only returns audio?
ANS: – If the backend only returns audio chunks, the frontend cannot determine exact phoneme timings. In this case, lip movement is typically approximated using audio amplitude analysis through the Web Audio API. This approach creates a convincing talking effect but does not produce perfectly accurate phoneme-based lip sync.
2. Why use React Three Fiber instead of plain Three.js?
ANS: – React Three Fiber provides a React-based abstraction layer over Three.js, making it easier to manage 3D scenes using React components, hooks, and state. This improves maintainability and allows developers to integrate complex 3D rendering directly into React applications without manually managing the entire Three.js lifecycle.
3. Is WebSocket streaming necessary for conversational avatars?
ANS: – Yes, WebSockets are typically used for conversational avatars because they allow low-latency bidirectional communication. Instead of waiting for the full audio response, the client can receive and play audio chunks as they arrive, making the avatar feel much more responsive and interactive.
WRITTEN BY Rishav Mehta
Rishav is a skilled Frontend Developer with a passion for crafting visually appealing and intuitive websites. Proficient in HTML, CSS, JavaScript, and frameworks such as ReactJS, he combines technical expertise with a strong understanding of web development principles to deliver responsive, user-friendly designs. Dedicated to continuous learning, Rishav stays updated on the latest industry trends and enjoys experimenting with emerging technologies in his free time.
March 18, 2026