The Main Entities
User and API key
Every API request is authenticated with the `X-API-Key` header. In self-hosted deployments, you create users and mint tokens via the admin API:
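A sketch of that flow (the admin routes and the `X-Admin-API-Key` header below are assumptions; check your deployment's admin docs for the exact paths):

```typescript
// Sketch only: route names and the X-Admin-API-Key header are assumptions
// drawn from typical self-hosted setups, not the authoritative API reference.
const ADMIN_BASE = "http://localhost:8056"; // hypothetical admin API address

async function createUserAndToken(email: string): Promise<string> {
  const headers = {
    "Content-Type": "application/json",
    "X-Admin-API-Key": process.env.ADMIN_API_KEY ?? "",
  };

  // 1. Create the user (hypothetical route).
  const userRes = await fetch(`${ADMIN_BASE}/admin/users`, {
    method: "POST",
    headers,
    body: JSON.stringify({ email }),
  });
  const user = await userRes.json();

  // 2. Mint an API token for that user (hypothetical route).
  const tokenRes = await fetch(`${ADMIN_BASE}/admin/users/${user.id}/tokens`, {
    method: "POST",
    headers,
  });
  const { token } = await tokenRes.json();
  return token; // send as X-API-Key on regular API requests
}
```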
Meeting
A meeting is the top-level record created when you request a bot:
- `platform`: `google_meet` | `teams` | `zoom`
- `native_meeting_id`: what you extracted from the meeting URL
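For example, Google Meet URLs carry the ID as the path (`https://meet.google.com/abc-defg-hij`); a minimal extraction sketch:

```typescript
// Minimal sketch: pull the native meeting ID out of a Google Meet URL.
// Meet codes look like "abc-defg-hij"; other platforms differ
// (e.g. Zoom puts a numeric ID in the /j/ path).
function googleMeetId(url: string): string | null {
  const match = new URL(url).pathname.match(/^\/([a-z]{3}-[a-z]{4}-[a-z]{3})$/);
  return match ? match[1] : null;
}

googleMeetId("https://meet.google.com/abc-defg-hij"); // "abc-defg-hij"
```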
Bot session (`session_uid`)
Each bot run has a session UID (think: one attempt / one join). It links:
- transcript segments
- recordings and media files
Transcript Segments and Timing
Transcripts are stored as segments. Each segment has:
- `text` (the transcript string)
- `speaker` (best-effort attribution)
- `start_time` / `end_time` (seconds)
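Illustratively, a segment can be modeled like this (field names from the list above; the full wire format may carry more):

```typescript
// Shape inferred from the fields listed above; treat as a sketch,
// not the complete wire format.
interface TranscriptSegment {
  text: string;        // the transcript string
  speaker: string;     // best-effort attribution
  start_time: number;  // seconds from the start of audio capture
  end_time: number;    // seconds from the start of audio capture
}
```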
What start_time means
`start_time` and `end_time` are relative seconds from the start of the bot’s audio capture pipeline.
This is why post-meeting playback can be aligned without additional offset math:
- audio player time (`currentTime`) maps directly to segment `start_time` / `end_time`
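A minimal sketch of that lookup, using the segment shape modeled above:

```typescript
// Because segment times share the audio capture's clock, finding the
// currently playing segment is a plain range lookup on currentTime.
function activeSegment(
  segments: TranscriptSegment[],
  currentTime: number,
): TranscriptSegment | undefined {
  return segments.find(
    (s) => currentTime >= s.start_time && currentTime < s.end_time,
  );
}

// e.g. audio.addEventListener("timeupdate", () =>
//   highlight(activeSegment(segments, audio.currentTime)));
```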
Word-level timestamps
As of v0.8, Vexa focuses on segment-level timing. Word-level timing is not exposed in the API, so UIs should highlight segments, not individual words.
Recording vs Capture (important)
Vexa always needs to capture audio to transcribe. Recording persistence is separate and controlled by flags:
- `recording_enabled`: whether to persist recording artifacts (audio files) to storage
- `transcribe_enabled`: whether to run/store transcription output
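A sketch of setting these flags when requesting a bot (the `POST /bots` route and body shape are assumptions; check the API reference):

```typescript
const API_KEY = process.env.VEXA_API_KEY ?? ""; // your minted X-API-Key token

// Sketch: request a bot that transcribes but does not persist audio.
await fetch("https://api.example.com/bots", {
  method: "POST",
  headers: { "Content-Type": "application/json", "X-API-Key": API_KEY },
  body: JSON.stringify({
    platform: "google_meet",
    native_meeting_id: "abc-defg-hij",
    recording_enabled: false, // capture still happens; artifacts are not stored
    transcribe_enabled: true, // run and store transcription output
  }),
});
```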
Transcription Tiers
Vexa supports a `transcription_tier` per meeting:
- `realtime` (default): best-effort low latency
- `deferred`: lower priority; useful for cost/throughput control
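The tier rides along per meeting, e.g. in the same bot request body sketched above (field placement is an assumption):

```typescript
// Per-meeting tier selection; merged into the POST /bots body sketched above.
const botRequest = {
  platform: "google_meet",
  native_meeting_id: "abc-defg-hij",
  transcription_tier: "deferred", // or "realtime" (the default)
};
```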
Meeting Lifecycle States (high-level)
Common states you will see in logs/UI:
- `joining`: bot launched and navigating to the meeting
- `awaiting_admission`: waiting room / lobby (host needs to admit)
- `active`: bot is in the meeting
- `stopped`: stop requested
- `completed`: finished processing (post-meeting artifacts should become available)
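A polling sketch for waiting out the lifecycle (the `GET /meetings` listing route and the `status` field name are assumptions; map them to your deployment's API):

```typescript
const API_KEY = process.env.VEXA_API_KEY ?? "";

// Poll until the meeting reaches its terminal "completed" state.
async function waitForCompletion(platform: string, nativeId: string) {
  for (;;) {
    const res = await fetch("https://api.example.com/meetings", {
      headers: { "X-API-Key": API_KEY },
    });
    const { meetings } = await res.json();
    const meeting = meetings.find(
      (m: any) => m.platform === platform && m.native_meeting_id === nativeId,
    );
    if (meeting?.status === "completed") return meeting; // artifacts ready
    await new Promise((r) => setTimeout(r, 5000)); // wait 5s between polls
  }
}
```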
Recordings and Media Files
If `recording_enabled=true` and Vexa captured audio, post-meeting APIs include `recordings` with `media_files`.
Typical audio flow:
- Call `GET /transcripts/{platform}/{native_meeting_id}`
- Read `recordings[0].id` and `recordings[0].media_files[0].id`
- Stream bytes: `GET /recordings/{recording_id}/media/{media_file_id}/raw`
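Put together, using the documented endpoints above:

```typescript
const API_KEY = process.env.VEXA_API_KEY ?? "";
const BASE = "https://api.example.com"; // your Vexa API base URL (assumption)

// Walk the flow: transcript response -> recording/media IDs -> raw audio URL.
async function rawAudioUrl(platform: string, nativeId: string): Promise<string> {
  const res = await fetch(`${BASE}/transcripts/${platform}/${nativeId}`, {
    headers: { "X-API-Key": API_KEY },
  });
  const data = await res.json();
  const recording = data.recordings[0];
  const media = recording.media_files[0];
  return `${BASE}/recordings/${recording.id}/media/${media.id}/raw`;
}
```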
The `/raw` endpoint is designed for browser playback:
- `Content-Disposition: inline`
- `Range` requests (206 Partial Content) for seeking
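So the raw URL can be handed straight to an `<audio>` element (using `rawAudioUrl` from the sketch above); note that an `<audio>` element cannot attach an `X-API-Key` header, so how `/raw` is authorized in the browser is a deployment detail:

```typescript
// Browser-side playback: no manual byte handling needed. Seeking makes the
// browser issue Range requests, answered with 206 Partial Content.
const audio = new Audio(await rawAudioUrl("google_meet", "abc-defg-hij"));
audio.play();
```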
Interactive Bots (Meeting Interaction)
By default (`voice_agent_enabled: true`), every bot is an interactive meeting participant. An external agent can instruct it to:
- Speak via text-to-speech (OpenAI TTS → PulseAudio → meeting mic)
- Read/write chat messages in the meeting
- Share screen content (images, URLs, video)
Commands are addressed to a specific bot run via its `session_uid`. Chat messages captured by the bot can optionally be injected into the transcription stream.
Full details: Interactive Bots Guide
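A purely hypothetical sketch of driving a bot (the `/speak` route and payload below are invented for illustration; the Interactive Bots Guide is authoritative):

```typescript
// Hypothetical: the route and field names below are illustrative only.
// Commands target one bot run via its session_uid.
const BASE = "https://api.example.com";
const API_KEY = process.env.VEXA_API_KEY ?? "";
const sessionUid = "..."; // the bot run you want to drive

await fetch(`${BASE}/bots/${sessionUid}/speak`, {
  method: "POST",
  headers: { "Content-Type": "application/json", "X-API-Key": API_KEY },
  body: JSON.stringify({ text: "Hi, I'm the meeting assistant." }),
});
```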