Overview
Creating a conversational avatar that can listen, respond, and animate in real time is becoming increasingly common in modern web applications. With advances in browser APIs and AI models, it is now possible to build a fully interactive avatar experience directly in the browser.
In this article, we will explore how to build a real-time conversational avatar using the following stack:
- React – Application UI and state management
- Three.js – 3D rendering engine
- React Three Fiber – React renderer for Three.js
- Ready Player Me – Avatar creation platform
- Amazon Nova Sonic – Voice-to-voice conversational AI
- WebSockets + Web Audio API – Real-time audio streaming and playback
The goal is to stream microphone audio to an AI model, receive audio responses over WebSockets, and animate a 3D avatar’s mouth to appear to speak the response.
Architecture Overview
The system operates through a streaming pipeline in which both input and output audio travel over a WebSocket connection.
```
React UI
   │ microphone input
   ▼
Audio streaming (WebSocket)
   ▼
Amazon Nova Sonic
   │ audio response
   ▼
Web Audio API
   ▼
Lip Sync Engine
   ▼
React Three Fiber Avatar
```
This architecture enables low-latency conversational interaction, allowing the avatar to respond almost immediately after the user speaks.
Why This Stack Works Well
Let’s quickly understand why each technology in this stack is a good fit.
React
React manages the UI and application state. It also provides an ecosystem of hooks and component abstractions that make handling streams and events easier.
Three.js
Three.js is the most widely used WebGL rendering engine for 3D graphics in browsers.
React Three Fiber
Instead of manually controlling Three.js scenes, React Three Fiber lets you use React components to build and control a 3D scene.
Ready Player Me
Ready Player Me enables developers to generate high-quality avatars with built-in facial blend shapes that support lip-sync animation.
Amazon Nova Sonic
Amazon Nova Sonic is a voice-to-voice conversational AI model designed for real-time dialogue.
Capturing Microphone Audio
The first step is capturing audio from the user’s microphone using the browser’s MediaDevices API.
```javascript
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const audioContext = new AudioContext()
const source = audioContext.createMediaStreamSource(stream)
```
To send audio to the backend, you convert microphone audio into PCM chunks and send them over WebSocket.
```javascript
const ws = new WebSocket("wss://your-server")

// A ScriptProcessorNode exposes raw sample buffers as they arrive.
// (It is deprecated in favor of AudioWorklet, but is simpler to show here.)
const processor = audioContext.createScriptProcessor(4096, 1, 1)
source.connect(processor)
processor.connect(audioContext.destination)

processor.onaudioprocess = (event) => {
  const input = event.inputBuffer.getChannelData(0)
  ws.send(input.buffer)
}
```
Now the user’s speech is continuously streamed to the AI backend.
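The snippet above sends raw 32-bit float samples. Speech models, Amazon Nova Sonic included, typically expect 16-bit PCM, so a conversion step is usually needed before sending. A minimal sketch (sample-rate conversion is omitted for brevity):

```javascript
// Convert Float32 samples (range -1..1) from the Web Audio API
// into 16-bit signed PCM, the format speech APIs commonly expect.
function float32ToInt16(float32Array) {
  const int16 = new Int16Array(float32Array.length)
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1] before scaling to avoid integer overflow.
    const s = Math.max(-1, Math.min(1, float32Array[i]))
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return int16
}
```

Inside `onaudioprocess`, you would then call `ws.send(float32ToInt16(input).buffer)` instead of sending the float buffer directly.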
Streaming Audio to Amazon Nova Sonic
Once the audio reaches the backend, it is forwarded to the conversational AI model.
The backend pipeline typically looks like this:
```
WebSocket Server
   │
   ▼
Amazon Nova Sonic
   │
   ▼
Audio Response Stream
```
The model processes the incoming speech and produces an audio response stream.
Receiving Streaming Audio in React
The frontend receives audio chunks through the WebSocket connection.
```javascript
ws.onmessage = (event) => {
  const audioChunk = event.data
  playAudio(audioChunk)
}
```
Since the response is streamed, the audio needs to be appended to a playback buffer.
Playing Audio Using the Web Audio API
The Web Audio API provides low-latency audio playback and analysis.
```javascript
const audioContext = new AudioContext()

async function playAudio(chunk) {
  // decodeAudioData expects an ArrayBuffer, so set
  // ws.binaryType = "arraybuffer" on the WebSocket.
  const buffer = await audioContext.decodeAudioData(chunk)
  const source = audioContext.createBufferSource()
  source.buffer = buffer
  source.connect(audioContext.destination)
  source.start()
}
```
This allows the avatar’s voice response to play almost immediately after each chunk arrives.
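Calling `source.start()` the instant each chunk decodes can produce gaps or overlaps between chunks. A small bookkeeping class that schedules each chunk exactly when the previous one ends keeps playback seamless. This is a sketch; the class name is our own:

```javascript
// Tracks when the next audio chunk should start so that chunks
// play back-to-back without gaps or overlap.
class PlaybackClock {
  constructor() {
    this.nextStartTime = 0
  }

  // currentTime: the AudioContext clock, in seconds.
  // duration: length of the decoded chunk, in seconds.
  // Returns the time at which the chunk should be scheduled.
  schedule(currentTime, duration) {
    // Never schedule in the past; otherwise chain after the last chunk.
    const startAt = Math.max(this.nextStartTime, currentTime)
    this.nextStartTime = startAt + duration
    return startAt
  }
}
```

In `playAudio`, you would replace `source.start()` with `source.start(clock.schedule(audioContext.currentTime, buffer.duration))`.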
Implementing the Lip Sync Engine
Since the WebSocket response only provides audio, we can approximate lip movements using audio energy.
The Web Audio API provides an AnalyserNode that lets us measure audio amplitude in real time.
```javascript
const analyser = audioContext.createAnalyser()
// In playAudio, also connect each source to the analyser
// so it can measure the audio being played:
//   source.connect(analyser)

function updateLipSync() {
  const data = new Uint8Array(analyser.frequencyBinCount)
  analyser.getByteFrequencyData(data)

  // Average the frequency bins to get an overall loudness value.
  const volume = data.reduce((a, b) => a + b, 0) / data.length

  // Normalize to 0–1 and drive the mouth (avatar is your model wrapper).
  avatar.mouthOpen = volume / 255

  requestAnimationFrame(updateLipSync)
}
```
This approach makes the avatar’s mouth open and close depending on the speech volume.
While it is not phoneme-accurate lip synchronization, it provides a convincing talking animation for conversational avatars.
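One practical refinement: raw per-frame amplitude fluctuates rapidly, which makes the mouth flicker. Smoothing the value with an exponential moving average before applying it gives a more natural motion. A sketch (the factory-function shape and default `alpha` are our choices):

```javascript
// Exponential smoothing keeps the mouth from flickering frame to frame.
// alpha closer to 1 reacts faster; closer to 0 moves more smoothly.
function createSmoother(alpha = 0.3) {
  let value = 0
  return (sample) => {
    value = value + alpha * (sample - value)
    return value
  }
}
```

In `updateLipSync`, you would write `avatar.mouthOpen = smooth(volume / 255)` where `smooth` was created once outside the loop.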
Rendering the Avatar with React Three Fiber
Now we integrate the avatar into the React application.
Install dependencies:
```
npm install three @react-three/fiber @react-three/drei
```
Then create a 3D scene:
```jsx
<Canvas>
  <ambientLight />
  <Avatar />
</Canvas>
```
The avatar component loads a Ready Player Me model.
```jsx
import { useGLTF } from "@react-three/drei"

function Avatar() {
  const { scene } = useGLTF("/avatar.glb")
  return <primitive object={scene} />
}
```
Ready Player Me avatars include blendshapes, which allow the mouth to animate smoothly.
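To drive a blendshape you first need to locate the mesh that owns it. Meshes loaded from glTF expose a `morphTargetDictionary` mapping blendshape names to indices, so a small traversal helper can find the target by name. A sketch (the helper name and return shape are our own):

```javascript
// Searches a loaded glTF scene graph for a mesh that exposes a
// named morph target (e.g. "jawOpen") and returns the mesh plus
// the index into its morphTargetInfluences array.
function findMorphTarget(scene, name) {
  let result = null
  scene.traverse((node) => {
    if (!result && node.morphTargetDictionary && name in node.morphTargetDictionary) {
      result = { mesh: node, index: node.morphTargetDictionary[name] }
    }
  })
  return result
}
```

For example, `findMorphTarget(scene, "jawOpen")` gives you the mesh and index to animate each frame.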
Connecting Lip Sync to the Avatar
Blendshapes control the avatar’s facial expressions.
For example:
- viseme_aa
- viseme_oo
- viseme_ff
For a simple lip sync approximation, you can map audio amplitude to the jawOpen morph target.
```javascript
avatar.morphTargetInfluences[jawIndex] = mouthOpenValue
```
As the audio gets louder, the mouth opens wider.
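In practice the raw 0–1 amplitude benefits from two adjustments: a noise floor so background hiss keeps the mouth closed, and a gain so normal speech opens it visibly. A sketch with illustrative (not tuned) thresholds:

```javascript
// Maps a raw 0..1 amplitude to a jaw-open value.
// floor: amplitudes below this count as silence.
// gain: how strongly loudness opens the mouth.
function amplitudeToJaw(volume, { floor = 0.05, gain = 2.0 } = {}) {
  const v = Math.max(0, volume - floor) * gain
  return Math.min(1, v) // clamp so the jaw never over-rotates
}
```

The result is what you assign to `morphTargetInfluences[jawIndex]` each frame.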
Handling Streaming Synchronization
Streaming introduces challenges such as:
- network latency
- audio buffering
- WebSocket jitter
A good practice is to buffer 100–300 ms of audio before playback to ensure smooth streaming.
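That guideline can be expressed as a simple readiness check: accumulate incoming chunk durations and only begin playback once the total crosses the threshold. A minimal sketch (function name and default are our own):

```javascript
// Decide when enough audio has been buffered to begin playback.
// chunkDurations: durations (in seconds) of the chunks queued so far.
// minBufferSec: target pre-buffer, e.g. 0.1–0.3 s per the guideline above.
function isReadyToPlay(chunkDurations, minBufferSec = 0.2) {
  const total = chunkDurations.reduce((sum, d) => sum + d, 0)
  return total >= minBufferSec
}
```

Once `isReadyToPlay` returns true, drain the queue through `playAudio`; afterwards, chunks can be scheduled as they arrive.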
Performance Considerations
When building conversational avatars in the browser, performance matters.
Key tips:
- Use compressed GLB avatars
- Limit scene complexity
- Avoid large texture sizes
- Keep WebSocket messages small
React Three Fiber is efficient because it only updates the scene when necessary.
Final Result
With this architecture, the user experiences a seamless conversational loop:
- The user speaks into the microphone
- Audio is streamed via WebSocket
- The AI generates a spoken response
- The response is streamed back
- The avatar speaks and animates in real time
The system feels like interacting with a live digital assistant rather than a traditional chatbot.
Conclusion
By combining React, WebSockets, Web Audio API, React Three Fiber, and Ready Player Me, developers can build powerful browser-based conversational avatars.
Although frontend-only lip sync based on audio amplitude is not perfectly accurate, it is sufficient for many real-time conversational experiences.
As AI voice models and speech engines continue to improve, integrating phoneme-based lip sync will make these avatars even more realistic.
For now, this approach offers a practical and production-ready way to bring interactive AI avatars to the web.
Drop a query if you have any questions regarding WebSockets and we will get back to you quickly.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. Can accurate lip sync be achieved if the backend only returns audio?
ANS: – If the backend only returns audio chunks, the frontend cannot determine exact phoneme timings. In this case, lip movement is typically approximated using audio amplitude analysis through the Web Audio API. This approach creates a convincing talking effect but does not produce perfectly accurate phoneme-based lip sync.
2. Why use React Three Fiber instead of plain Three.js?
ANS: – React Three Fiber provides a React-based abstraction layer over Three.js, making it easier to manage 3D scenes using React components, hooks, and state. This improves maintainability and allows developers to integrate complex 3D rendering directly into React applications without manually managing the entire Three.js lifecycle.
3. Is WebSocket streaming necessary for conversational avatars?
ANS: – Yes, WebSockets are typically used for conversational avatars because they allow low-latency bidirectional communication. Instead of waiting for the full audio response, the client can receive and play audio chunks as they arrive, making the avatar feel much more responsive and interactive.
WRITTEN BY Rishav Mehta
Rishav is a skilled Frontend Developer with a passion for crafting visually appealing and intuitive websites. Proficient in HTML, CSS, JavaScript, and frameworks such as ReactJS, he combines technical expertise with a strong understanding of web development principles to deliver responsive, user-friendly designs. Dedicated to continuous learning, Rishav stays updated on the latest industry trends and enjoys experimenting with emerging technologies in his free time.
March 18, 2026