The companion has a face. It's consistent with the visual character you defined, and it speaks with synchronized lip movement and expression — in real time or generated on demand.
Visual Consistency
The avatar's appearance is derived from the companion's visual definition. The same character that appears in generated images appears in the avatar — we maintain that consistency automatically. You don't manage separate assets for different modalities.
Two Modes
Real-Time
A streaming talking head over WebRTC via LiveKit. The companion appears live, reacts to speech, and maintains eye contact. Suitable for:
- Video-call style companion interfaces
- Live interactive sessions where presence matters
- Experiences where the companion needs to feel physically present
This mode shares session infrastructure with Voice. Audio and video are delivered together through the same connection.
On-Demand (Synchronous)
A POST request returns a rendered MP4 — the companion speaking with synchronized audio — with no WebRTC session required. Suitable for:
- Turn-based conversation, where the user speaks or types and your app fetches a response video and plays it back
- Ambient companion experiences triggered by events or schedules
- Environments where a persistent WebRTC connection isn't practical
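A single on-demand turn can be sketched with the standard library alone. The endpoint path, payload fields, and auth header below are assumptions for illustration, not the documented API surface — check the API Reference for the real generate endpoint.

```python
# Sketch of one on-demand turn: POST the companion's response text,
# receive a rendered MP4. All names here are hypothetical.
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # hypothetical base URL


def build_generate_request(session_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated JSON POST for the (assumed) generate endpoint."""
    body = json.dumps({"session_id": session_id, "text": text}).encode()
    return urllib.request.Request(
        f"{API_BASE}/avatar/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_response_video(session_id: str, text: str, api_key: str) -> bytes:
    """Return the rendered MP4 bytes for one conversational turn."""
    req = build_generate_request(session_id, text, api_key)
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw MP4, ready to write to disk or hand to a player
```

Because the response is a plain MP4, playback needs nothing beyond whatever video element or media player your platform already uses.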
The synchronous mode is also the foundation for always-on local companions — where a device handles wake word detection and local transcription, posts to the API, and plays back the response.
How to Start a Session
1. Create a session: Your backend creates an avatar session for a companion and user. For the synchronous mode, this establishes the rendering context.
2. Get a token (real-time only): Your backend requests a short-lived LiveKit token. The client connects to the room and the avatar stream begins.
3. Generate a response (on-demand): Your backend posts to the generate endpoint with the companion's response text. An MP4 is returned, ready to play.
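Taken together, the steps might look like this from a backend. Every endpoint path, payload field, and the `API_BASE`/`AVATAR_API_KEY` names are illustrative assumptions; the real routes are in the API Reference.

```python
# End-to-end sketch of the session flow. Endpoint paths and payload
# shapes are assumptions, not the documented API.
import json
import os
import urllib.request

API_BASE = "https://api.example.com/v1"          # hypothetical
API_KEY = os.environ.get("AVATAR_API_KEY", "")   # hypothetical env var


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request against the API."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_session(companion_id: str, user_id: str) -> urllib.request.Request:
    # Step 1: establish the avatar session (and, for synchronous mode,
    # the rendering context).
    return _post("/avatar/sessions", {"companion_id": companion_id, "user_id": user_id})


def get_livekit_token(session_id: str) -> urllib.request.Request:
    # Step 2 (real-time only): short-lived LiveKit token for the client,
    # which then joins the room with a LiveKit client SDK.
    return _post(f"/avatar/sessions/{session_id}/token", {})


def generate_response(session_id: str, text: str) -> urllib.request.Request:
    # Step 3 (on-demand): the response body is a rendered MP4 of the
    # companion speaking the given text.
    return _post(f"/avatar/sessions/{session_id}/generate", {"text": text})
```

Each helper only builds the request; send it with `urllib.request.urlopen(...)` and, for the generate endpoint, read the response body as the MP4.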
Related APIs
See the API Reference for avatar session, token, and generate endpoints.

