The companion has a face. It's consistent with the visual character you defined, and it speaks with synchronized lip movement and expression — in real time or generated on demand.
Visual Consistency
The avatar's appearance is derived from the companion's visual definition. The same character that appears in generated images appears in the avatar — we maintain that consistency automatically. You don't manage separate assets for different modalities.
Two Modes
Real-Time
A streaming talking head over WebRTC via LiveKit. The companion appears live, reacts to speech, and maintains eye contact. Suitable for:
- Video-call style companion interfaces
- Live interactive sessions where presence matters
- Experiences where the companion needs to feel physically present
This mode shares session infrastructure with Voice. Audio and video are delivered together through the same connection.
On-Demand (Synchronous)
A POST request returns a rendered MP4 — the companion speaking with synchronized audio — with no WebRTC session required. Suitable for:
- Turn-based conversation, where the user speaks or types and your app fetches a response video and plays it back
- Ambient companion experiences triggered by events or schedules
- Environments where a persistent WebRTC connection isn't practical
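A single on-demand turn can be sketched with the standard library alone. The endpoint path, payload fields, and auth header below are assumptions for illustration, not the documented API surface — check the API Reference for the real generate endpoint.

```python
# Sketch of one on-demand turn: POST the companion's response text,
# receive a rendered MP4. All names here are hypothetical.
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # hypothetical base URL


def build_generate_request(session_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated JSON POST for the (assumed) generate endpoint."""
    body = json.dumps({"session_id": session_id, "text": text}).encode()
    return urllib.request.Request(
        f"{API_BASE}/avatar/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_response_video(session_id: str, text: str, api_key: str) -> bytes:
    """Return the rendered MP4 bytes for one conversational turn."""
    req = build_generate_request(session_id, text, api_key)
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw MP4, ready to write to disk or hand to a player
```

Because the response is a plain MP4, playback needs nothing beyond whatever video element or media player your platform already uses.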
The synchronous mode is also the foundation for always-on local companions — where a device handles wake word detection and local transcription, posts to the API, and plays back the response.
How to Start a Session
1. Create a session: Your backend creates an avatar session for a companion and user. For the synchronous mode, this establishes the rendering context.
2. Get a token (real-time only): Your backend requests a short-lived LiveKit token. The client connects to the room and the avatar stream begins.
3. Generate a response (on-demand): Your backend posts to the generate endpoint with the companion's response text. An MP4 is returned, ready to play.
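Taken together, the steps might look like this from a backend. Every endpoint path, payload field, and the `API_BASE`/`AVATAR_API_KEY` names are illustrative assumptions; the real routes are in the API Reference.

```python
# End-to-end sketch of the session flow. Endpoint paths and payload
# shapes are assumptions, not the documented API.
import json
import os
import urllib.request

API_BASE = "https://api.example.com/v1"          # hypothetical
API_KEY = os.environ.get("AVATAR_API_KEY", "")   # hypothetical env var


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request against the API."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_session(companion_id: str, user_id: str) -> urllib.request.Request:
    # Step 1: establish the avatar session (and, for synchronous mode,
    # the rendering context).
    return _post("/avatar/sessions", {"companion_id": companion_id, "user_id": user_id})


def get_livekit_token(session_id: str) -> urllib.request.Request:
    # Step 2 (real-time only): short-lived LiveKit token for the client,
    # which then joins the room with a LiveKit client SDK.
    return _post(f"/avatar/sessions/{session_id}/token", {})


def generate_response(session_id: str, text: str) -> urllib.request.Request:
    # Step 3 (on-demand): the response body is a rendered MP4 of the
    # companion speaking the given text.
    return _post(f"/avatar/sessions/{session_id}/generate", {"text": text})
```

Each helper only builds the request; send it with `urllib.request.urlopen(...)` and, for the generate endpoint, read the response body as the MP4.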
Related APIs
See the API Reference for avatar session, token, and generate endpoints.

