The Interactive Bots feature transforms the Vexa bot from a passive transcription observer into a fully interactive meeting participant. An external agent or application controls the bot via REST API to speak, read/write chat, and share visual content during a live meeting.

Capabilities

| Capability | Description | Status |
| --- | --- | --- |
| Speak | Text-to-speech or raw audio playback into the meeting | Working |
| Chat write | Send messages to the meeting chat | Working |
| Chat read | Capture messages from the meeting chat | Working |
| Screen share | Display images, URLs, or video via screen share | Working |
| Virtual camera | Show avatar/content via the bot’s camera feed | Experimental |

Quick Start

1. Send a bot to a meeting

Opt in to interactive audio with voice_agent_enabled: true. Recorder-only bots leave this disabled.
curl -X POST "$API_BASE/bots" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{
    "platform": "google_meet",
    "native_meeting_id": "abc-defg-hij",
    "bot_name": "AI Assistant",
    "voice_agent_enabled": true
  }'
2. Make the bot speak

curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/speak" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"text": "Hello everyone, I am the meeting assistant.", "provider": "piper", "voice": "auto"}'
3. Play a pre-rendered audio file

AUDIO_BASE64="$(base64 < greeting.wav | tr -d '\n')"
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/speak" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d "{\"audio_base64\":\"$AUDIO_BASE64\",\"format\":\"wav\"}"
4. Send a chat message

curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/chat" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"text": "Meeting summary: 3 action items identified."}'
5. Share visual content

curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"type": "image", "url": "https://example.com/quarterly-chart.png"}'
For the full endpoint reference, see Interactive Bots API.
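The curl calls above share one shape: a base URL, an X-API-Key header, and a JSON body sent to /bots or to /bots/{platform}/{native_meeting_id}/{action}. The JavaScript sketch below assembles that shape for fetch without sending anything. buildBotRequest and meetingPath are hypothetical helper names rather than part of any Vexa SDK, and the base URL is a placeholder.

```javascript
// Hypothetical helper: assembles fetch() arguments for the bot endpoints shown above.
function buildBotRequest({ apiBase, apiKey, method = "POST", path = "/bots", body }) {
  const headers = { "X-API-Key": apiKey };
  const init = { method, headers };
  if (body !== undefined) {
    headers["Content-Type"] = "application/json";
    init.body = JSON.stringify(body);
  }
  // Trim trailing slashes so "https://host/" and "https://host" behave the same.
  return { url: apiBase.replace(/\/+$/, "") + path, init };
}

// Path for a per-meeting action such as "speak", "chat", or "screen".
function meetingPath(platform, nativeMeetingId, action) {
  return `/bots/${encodeURIComponent(platform)}/${encodeURIComponent(nativeMeetingId)}/${action}`;
}

// Example: the "Make the bot speak" step, built but not sent.
const speakRequest = buildBotRequest({
  apiBase: "https://api.example.com", // assumption: replace with your API base URL
  apiKey: "YOUR_API_KEY",             // placeholder
  path: meetingPath("google_meet", "abc-defg-hij", "speak"),
  body: { text: "Hello everyone, I am the meeting assistant.", provider: "piper", voice: "auto" },
});
// To send it: await fetch(speakRequest.url, speakRequest.init)
```

Keeping request construction separate from the network call makes the payloads easy to inspect and unit test before a bot is in a live meeting.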

Output Formats

For interactive command endpoints (speak, chat write, screen, avatar), the API returns:
{
  "message": "Speak command sent",
  "meeting_id": 227
}
The message value varies by endpoint (Speak stop command sent, Chat message sent, Screen content command sent, Avatar set command sent, etc.). Chat read returns:
{
  "messages": [],
  "meeting_id": 227
}
When the meeting is not active, interactive endpoints return:
{
  "detail": "No active meeting found for google_meet/abc-defg-hij"
}
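The three response shapes above can be told apart by their keys alone. The sketch below classifies an already-parsed JSON body; interpretBotResponse is a hypothetical name, and the function deliberately does not rely on specific HTTP status codes because this page does not list them.

```javascript
// Hypothetical helper: classify a parsed JSON body from an interactive endpoint.
function interpretBotResponse(body) {
  if (body && typeof body.detail === "string") {
    // Error shape, e.g. "No active meeting found for google_meet/abc-defg-hij".
    return { ok: false, error: body.detail };
  }
  if (body && Array.isArray(body.messages)) {
    // Chat read shape: { messages: [...], meeting_id }.
    return { ok: true, kind: "chat_read", meetingId: body.meeting_id, messages: body.messages };
  }
  if (body && typeof body.message === "string") {
    // Command acknowledgement shape: { message, meeting_id }.
    return { ok: true, kind: "command", meetingId: body.meeting_id, message: body.message };
  }
  return { ok: false, error: "Unrecognized response shape" };
}
```

A caller would typically pass in the result of `await response.json()` and branch on `ok` before reading any other field.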

How It Works

When voice_agent_enabled is set, the bot reads from a PulseAudio virtual microphone. The bot starts muted, unmutes only while playing speech/audio, and re-mutes when playback finishes or is interrupted.

Audio Pipeline

POST /speak {"text": "...", "provider": "piper", "voice": "auto"}
  -> Meeting API publishes Redis speak command
  -> bot calls local Vexa TTS service (/v1/audio/speech)
  -> Piper returns PCM stream (24 kHz, mono)
  -> PulseAudio tts_sink -> virtual_mic
  -> Chromium microphone -> WebRTC -> meeting participants hear speech
Pre-rendered audio uses the same meeting microphone path but skips synthesis:
POST /speak {"audio_url": "..."} or {"audio_base64": "...", "format": "wav"}
  -> Meeting API publishes Redis speak_audio command
  -> bot downloads/decodes the audio with ffmpeg
  -> PulseAudio tts_sink -> virtual_mic
  -> Chromium microphone -> WebRTC -> meeting participants hear the file
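Both pipelines start from the same /speak endpoint and differ only in the request body. The sketch below picks the body shape from the input. buildSpeakBody and its input names (audioUrl, audioBytes) are hypothetical, the piper and auto defaults mirror the ones described on this page, and Buffer assumes a Node.js runtime.

```javascript
// Hypothetical helper: choose the /speak body for text synthesis or pre-rendered audio.
function buildSpeakBody(input) {
  if (typeof input.text === "string") {
    // Text path: synthesized by the local TTS service.
    return { text: input.text, provider: input.provider ?? "piper", voice: input.voice ?? "auto" };
  }
  if (typeof input.audioUrl === "string") {
    // File path: the bot downloads and decodes the audio itself.
    return { audio_url: input.audioUrl };
  }
  if (input.audioBytes) {
    // Inline path: raw bytes are base64-encoded, like the base64 CLI step in the Quick Start.
    return {
      audio_base64: Buffer.from(input.audioBytes).toString("base64"),
      format: input.format ?? "wav",
    };
  }
  throw new TypeError("Provide text, audioUrl, or audioBytes");
}
```

In Node.js, `buildSpeakBody({ audioBytes: fs.readFileSync("greeting.wav") })` would produce the same kind of payload as the base64 curl example.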

Screen Content Pipeline

API request (image/url/video)
  -> Playwright renders content on Xvfb (1920x1080)
  -> Content displayed fullscreen
  -> Bot clicks "Present now" in meeting UI
  -> Participants see shared screen with content

Chat Pipeline

The bot interacts with the meeting’s native chat UI via DOM automation:
  • Write: opens the chat panel, types the message, and sends it
  • Read: captures messages from the chat panel (sender, text, timestamp)

Chat Read & Write

Chat enables two-way text communication in the meeting alongside voice. Write a message:
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/chat" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"text": "Here is the meeting summary so far."}'
Read all messages:
curl "$API_BASE/bots/google_meet/abc-defg-hij/chat" \
  -H "X-API-Key: $API_KEY"
Returns an object with messages and meeting_id. Each message has sender, text, timestamp (Unix milliseconds), and is_from_bot fields. Real-time chat events are also available via WebSocket (chat.received, chat.sent).
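Because the read endpoint returns every captured message, including the bot's own, an agent usually wants only participant messages it has not yet seen. The sketch below filters on is_from_bot and a timestamp cursor; newParticipantMessages and the sinceMs cursor are hypothetical names for this example.

```javascript
// Hypothetical helper: keep participant messages newer than a cursor, oldest first.
function newParticipantMessages(messages, sinceMs = 0) {
  return messages
    .filter((m) => !m.is_from_bot && m.timestamp > sinceMs)
    .sort((a, b) => a.timestamp - b.timestamp)
    .map((m) => ({
      sender: m.sender,
      text: m.text,
      timestamp: m.timestamp,
      // timestamp is Unix milliseconds, so it can be passed to Date directly.
      at: new Date(m.timestamp).toISOString(),
    }));
}
```

After each poll, store the largest timestamp returned and pass it as sinceMs on the next call so each message is handled once.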

Screen Share (Showing Images & Content)

Display visual content to meeting participants via the bot’s screen share. Three content types are supported:
| Type | Description |
| --- | --- |
| image | Renders an image fullscreen on a black background |
| url | Opens a URL in a browser window (e.g., Google Slides, dashboards) |
| video | Plays video fullscreen with autoplay |
# Share an image
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"type": "image", "url": "https://example.com/chart.png"}'

# Share a Google Slides presentation
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"type": "url", "url": "https://docs.google.com/presentation/d/..."}'

# Stop sharing
curl -X DELETE "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
  -H "X-API-Key: $API_KEY"
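Since only three content types are accepted, a client can reject bad input before calling the API. The sketch below validates the type and URL and returns the request body. buildScreenBody is a hypothetical name, and restricting URLs to http and https is a conservative assumption of this example, not a documented rule.

```javascript
// Content types listed in the table above.
const SCREEN_TYPES = new Set(["image", "url", "video"]);

// Hypothetical helper: validate input and build the POST /screen body.
function buildScreenBody(type, url) {
  if (!SCREEN_TYPES.has(type)) {
    throw new RangeError(`Unsupported screen content type: ${type}`);
  }
  const parsed = new URL(url); // throws TypeError if the string is not a valid URL
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    // Assumption: only web URLs make sense for the bot's browser to load.
    throw new RangeError(`Unsupported URL scheme: ${parsed.protocol}`);
  }
  return { type, url };
}
```

The returned object can be serialized with JSON.stringify and sent as the body of the POST request shown above.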

Avatar (Virtual Camera)

The virtual camera feature is experimental. On Google Meet it works only intermittently because WebRTC replaceTrack does not reliably swap in the replacement camera track. To display visual content to participants, use screen share, which is more reliable.
The virtual camera uses a canvas-based approach to replace the bot’s camera feed with custom content (e.g., an avatar image or animation). When working, participants see the avatar in the bot’s video tile instead of a blank camera. You can set or reset the avatar at any time via the API:
# Set a custom avatar
curl -X PUT "$API_BASE/bots/google_meet/abc-defg-hij/avatar" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{"url": "https://example.com/avatar.png"}'

# Reset to default Vexa logo
curl -X DELETE "$API_BASE/bots/google_meet/abc-defg-hij/avatar" \
  -H "X-API-Key: $API_KEY"
See the Avatar API reference for full details.

Current limitations:
  • Only tested on Google Meet
  • replaceTrack into WebRTC works intermittently
  • Screen share is the recommended alternative for displaying images and content

WebSocket Events

When interactive bot mode is enabled, additional events are published on the WebSocket connection:
| Event | Payload | Description |
| --- | --- | --- |
| speak.started | {"text": "..."} | Bot started speaking |
| speak.completed | | Speech playback finished |
| speak.interrupted | | Speech was interrupted via API |
| chat.received | {"sender": "John", "text": "...", "timestamp": 1234} | Chat message captured from a participant |
| chat.sent | {"text": "..."} | Bot sent a chat message |
| screen.sharing_started | {"content_type": "image"} | Screen sharing started |
| screen.sharing_stopped | | Screen sharing stopped |
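One way to consume these events is a small router keyed by event name. The sketch below is transport-agnostic: createEventRouter is a hypothetical name, and how the event name and payload are framed inside each WebSocket message is not described on this page, so the caller is assumed to extract both before calling the router.

```javascript
// Hypothetical helper: dispatch interactive-bot events to handlers by name.
function createEventRouter(handlers) {
  return function route(eventName, payload) {
    const handler = handlers[eventName];
    if (typeof handler !== "function") {
      return false; // unknown or unhandled event; let the caller decide what to do
    }
    // Events such as speak.completed carry no payload, so default to an empty object.
    handler(payload ?? {});
    return true;
  };
}

// Example wiring with two of the events from the table above.
const seen = [];
const route = createEventRouter({
  "chat.received": (p) => seen.push(`${p.sender}: ${p.text}`),
  "speak.completed": () => seen.push("speech finished"),
});
route("chat.received", { sender: "John", text: "Can you share the chart?", timestamp: 1234 });
route("speak.completed");
```

Returning false for unhandled names keeps the router safe to use when the connection also carries events outside the interactive set.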

Platform Support

| Feature | Google Meet | Teams | Zoom |
| --- | --- | --- | --- |
| Speak (TTS) | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Chat write | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Chat read | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Screen share | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Virtual camera | Experimental | | |

Prerequisites

TTS_SERVICE_URL (string): Base URL for the local Vexa TTS service. Compose and Lite wire this automatically.
No OpenAI key is required for default speech. The default provider is piper, and the default voice is auto, which chooses a prepared Piper voice from the input language. PulseAudio is already configured in the bot container (entrypoint.sh). No manual setup is needed.

Direct browser control (CDP / Playwright)

Beyond the REST speak/chat/screen commands, you can attach a real Chrome DevTools Protocol (CDP) client — e.g. Playwright — directly to a running bot’s browser and drive it like any page: click, type, navigate, read the DOM. This is the lowest level of control, and the escape hatch for situations the REST API doesn’t cover — most importantly clearing a join blocker (a captcha or a “verify you’re human” challenge), either by a human over VNC or by an AI agent over CDP. Every meeting bot exposes a CDP endpoint through the gateway, keyed by meeting_id:
{GATEWAY}/b/{meeting_id}/cdp
This endpoint is access-controlled — meeting_id alone is NOT access. meeting_id is a short, guessable integer, so it can never be the only gate. Every CDP request (HTTP and WebSocket) requires a valid X-API-Key whose user owns that meeting. Anything else — no key, or a key belonging to a different user — is rejected with 403 Forbidden. There is no way to attach to someone else’s bot.
Attach with Playwright (pass your API key as a header):
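import { chromium } from 'playwright-core';

// Placeholders for your deployment; these environment variable names are illustrative.
const GATEWAY = process.env.VEXA_GATEWAY;      // gateway base URL
const meetingId = process.env.VEXA_MEETING_ID; // meeting_id of a bot you own

const browser = await chromium.connectOverCDP(
  `${GATEWAY}/b/${meetingId}/cdp`,
  { headers: { 'X-API-Key': process.env.VEXA_API_KEY } }  // owning user's key (required)
);

// The bot's existing browser context and its meeting tab.
const page = browser.contexts()[0].pages()[0];
console.log(await page.title());          // read the live meeting tab
// Drive it directly, e.g. solve a challenge or click a control:
await page.getByRole('button', { name: 'Ask to join' }).click();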
When a bot hits a blocking state it can’t clear on its own, it escalates to needs_human_help and brings up its remote-browser surface (see Concepts); a human can take over via VNC, or an agent can attach via the CDP endpoint above and resolve it programmatically. Because the bot is a real, self-hostable browser, both paths stay inside your infrastructure and under your API key — there is no third party in the loop.
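An agent that attaches on escalation needs the endpoint URL and the auth header for the meeting in question. The sketch below builds both from the {GATEWAY}/b/{meeting_id}/cdp template. cdpEndpoint and cdpConnectOptions are hypothetical names, and treating meeting_id as a positive integer is an assumption based on its description as a short integer.

```javascript
// Hypothetical helper: build the CDP endpoint URL for one meeting.
function cdpEndpoint(gateway, meetingId) {
  const id = Number(meetingId);
  // Assumption: meeting_id is a positive integer such as 227.
  if (!Number.isInteger(id) || id <= 0) {
    throw new RangeError(`meeting_id must be a positive integer, got ${meetingId}`);
  }
  return `${gateway.replace(/\/+$/, "")}/b/${id}/cdp`;
}

// Hypothetical helper: options object in the shape passed to connectOverCDP above.
function cdpConnectOptions(apiKey) {
  if (!apiKey) {
    // Without the owning user's key the gateway rejects the request with 403.
    throw new Error("An API key for the meeting's owner is required");
  }
  return { headers: { "X-API-Key": apiKey } };
}
```

With these, the attach call becomes `chromium.connectOverCDP(cdpEndpoint(GATEWAY, meetingId), cdpConnectOptions(apiKey))`, and malformed IDs fail locally instead of producing a confusing gateway error.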

Known Limitations

  1. Virtual camera is experimental — the canvas-based virtual camera works intermittently on Google Meet. Screen share is more reliable for displaying visual content.
  2. Text and file playback are separate paths — text requests synthesize through local Piper; audio_url and audio_base64 play the supplied file directly through ffmpeg/PulseAudio.
  3. Zoom requires native SDK artifacts — without Zoom Meeting SDK binaries, Zoom joins fail during startup.
  4. No speech queue — rapid speak commands may overlap. Wait for the speak.completed WebSocket event before sending the next command, or use DELETE /speak to interrupt.
  5. Teams avatar not visible — Teams SFU returns a=inactive for video from anonymous guests, so the bot’s virtual camera/avatar is never visible to other Teams participants. Use screen share instead. (#124)
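Limitation 4 can be worked around on the client by sending one speak command at a time and releasing the next only after speak.completed or speak.interrupted arrives. The sketch below shows one such queue. SpeakQueue is a hypothetical class, sendSpeak is whatever function you use to POST to /speak, and handleEvent is expected to be called from your WebSocket listener with the event name.

```javascript
// Hypothetical client-side queue: at most one speak command in flight.
class SpeakQueue {
  constructor(sendSpeak) {
    this.sendSpeak = sendSpeak; // function(text) that issues POST /speak; may return a Promise
    this.pending = [];
    this.busy = false;
  }

  // Add text to the queue and send it immediately if nothing is playing.
  enqueue(text) {
    this.pending.push(text);
    this._drain();
  }

  // Call this for every WebSocket event name received.
  handleEvent(eventName) {
    if (eventName === "speak.completed" || eventName === "speak.interrupted") {
      this.busy = false;
      this._drain();
    }
  }

  _drain() {
    if (this.busy || this.pending.length === 0) return;
    this.busy = true;
    const result = this.sendSpeak(this.pending.shift());
    // If the request itself fails, no completion event will come, so move on.
    Promise.resolve(result).catch(() => {
      this.busy = false;
      this._drain();
    });
  }
}
```

This sketch drops an utterance whose request fails and has no timeout for a missing completion event; a production version would likely add both.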