Voice sessions are managed end-to-end. You start a session and your client connects — we handle everything else.
What We Manage
The session runs over WebRTC via LiveKit. Inside the session:
- Voice activity detection — we detect when the user starts and stops speaking
- Transcription — speech is converted to text using the configured STT provider
- Inference — the companion generates a response, with full memory and personality context applied
- Synthesis — the response is spoken back in the companion's voice
All of this happens within the session. Your client does not handle any of these steps — it connects to the room and the conversation begins.
Voice Character
Each companion has a configured voice. The voice is part of the companion's identity — consistent across sessions and across users.
We default to Cartesia for voice synthesis, which delivers expressive, natural-sounding speech. For latency-sensitive or high-volume deployments, Kotoro offers an ultra-low-latency alternative at significantly lower cost — frequently good enough for turn-based conversation at scale.
How to Start a Session
1. Create a session. Your backend creates a voice session for a companion and user via the API.
2. Get a token. Your backend requests a short-lived LiveKit access token for the user from the session token endpoint.
3. Connect the client. Your client connects to the LiveKit room using the standard LiveKit SDK and the token. No additional Spike SDK is required on the client.
4. Conversation begins. The companion listens. The user speaks. The companion responds in its own voice.
Related APIs
See the API Reference for session creation and token endpoints.

