The Interactive Bots feature transforms the Vexa bot from a passive transcription observer into a fully interactive meeting participant. An external agent or application controls the bot via REST API to speak, read/write chat, and share visual content during a live meeting.
Capabilities
| Capability | Description | Status |
|---|---|---|
| Speak | Text-to-speech or raw audio playback into the meeting | Working |
| Chat write | Send messages to the meeting chat | Working |
| Chat read | Capture messages from the meeting chat | Working |
| Screen share | Display images, URLs, or video via screen share | Working |
| Virtual camera | Show avatar/content via the bot’s camera feed | Experimental |
Quick Start
Send a bot to a meeting
Opt in to interactive audio with voice_agent_enabled: true. Recorder-only bots leave this disabled.
curl -X POST "$API_BASE/bots" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{
"platform": "google_meet",
"native_meeting_id": "abc-defg-hij",
"bot_name": "AI Assistant",
"voice_agent_enabled": true
}'
Make the bot speak
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/speak" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"text": "Hello everyone, I am the meeting assistant.", "provider": "piper", "voice": "auto"}'
Play a pre-rendered audio file
AUDIO_BASE64="$(base64 < greeting.wav | tr -d '\n')"
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/speak" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d "{\"audio_base64\":\"$AUDIO_BASE64\",\"format\":\"wav\"}"
Send a chat message
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/chat" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"text": "Meeting summary: 3 action items identified."}'
Share visual content
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"type": "image", "url": "https://example.com/quarterly-chart.png"}'
For the full endpoint reference, see Interactive Bots API.
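For callers working in JavaScript rather than curl, the requests above can be assembled with a small helper. This is a sketch, not an official client: `buildBotRequest` and its parameter names are hypothetical, and it only mirrors the URL pattern and headers shown in the curl examples. The returned `init` object is shaped for the standard `fetch` API.

```javascript
// Hypothetical helper (not part of Vexa): builds fetch() arguments that mirror
// the curl examples above. Leave `action` as null to target POST /bots itself.
function buildBotRequest(apiBase, apiKey, { platform, meetingId, action = null, method = 'POST', body = undefined }) {
  const base = apiBase.replace(/\/+$/, ''); // tolerate a trailing slash on the base URL
  const url = action === null
    ? `${base}/bots`
    : `${base}/bots/${encodeURIComponent(platform)}/${encodeURIComponent(meetingId)}/${action}`;
  const headers = { 'X-API-Key': apiKey };
  const init = { method, headers };
  if (body !== undefined) {
    // Only send a JSON content type when there is a JSON body (DELETE calls have none).
    headers['Content-Type'] = 'application/json';
    init.body = JSON.stringify(body);
  }
  return { url, init };
}
```

A call such as `buildBotRequest(apiBase, apiKey, { platform: 'google_meet', meetingId: 'abc-defg-hij', action: 'chat', body: { text: 'Hello' } })` yields a `url` and `init` that can be passed straight to `fetch(url, init)`.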
For interactive command endpoints (speak, chat write, screen, avatar), the API returns:
{
"message": "Speak command sent",
"meeting_id": 227
}
message varies by endpoint (Speak stop command sent, Chat message sent, Screen content command sent, Avatar set command sent, etc.).
Chat read returns:
{
"messages": [],
"meeting_id": 227
}
When the meeting is not active, interactive endpoints return:
{
"detail": "No active meeting found for google_meet/abc-defg-hij"
}
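A client usually needs to tell these three response shapes apart. The sketch below does that on an already-parsed JSON body. `classifyResponse` and the `kind` labels it returns are hypothetical names chosen for illustration; only the `message`, `messages`, `meeting_id`, and `detail` fields come from the examples above.

```javascript
// Hypothetical helper: classify a parsed JSON body from an interactive endpoint.
function classifyResponse(body) {
  if (body === null || typeof body !== 'object') {
    return { kind: 'unknown' };
  }
  if (typeof body.detail === 'string') {
    // e.g. "No active meeting found for google_meet/abc-defg-hij"
    return { kind: 'error', detail: body.detail };
  }
  if (Array.isArray(body.messages)) {
    return { kind: 'chat_read', meetingId: body.meeting_id, messages: body.messages };
  }
  if (typeof body.message === 'string') {
    // Acknowledgement for speak, chat write, screen, or avatar commands.
    return { kind: 'command_ack', meetingId: body.meeting_id, message: body.message };
  }
  return { kind: 'unknown' };
}
```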
How It Works
When voice_agent_enabled is set, the bot reads from a PulseAudio virtual microphone. The bot starts muted, unmutes only while playing speech/audio, and re-mutes when playback finishes or is interrupted.
Audio Pipeline
POST /speak {"text": "...", "provider": "piper", "voice": "auto"}
-> Meeting API publishes Redis speak command
-> bot calls local Vexa TTS service (/v1/audio/speech)
-> Piper returns PCM stream (24 kHz, mono)
-> PulseAudio tts_sink -> virtual_mic
-> Chromium microphone -> WebRTC -> meeting participants hear speech
Pre-rendered audio uses the same meeting microphone path but skips synthesis:
POST /speak {"audio_url": "..."} or {"audio_base64": "...", "format": "wav"}
-> Meeting API publishes Redis speak_audio command
-> bot downloads/decodes the audio with ffmpeg
-> PulseAudio tts_sink -> virtual_mic
-> Chromium microphone -> WebRTC -> meeting participants hear the file
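The two pipelines are selected by the shape of the request body. As a rough illustration, the sketch below builds one body per path. `buildSpeakBody` is a hypothetical helper, and its rule that exactly one of `text`, `audioUrl`, or `audioBase64` must be given is this sketch's own guard, not documented API behavior. The field names and the `piper` / `auto` values are taken from the examples above.

```javascript
// Hypothetical helper: build a /speak request body for either pipeline.
function buildSpeakBody({ text, audioUrl, audioBase64, format = 'wav', provider = 'piper', voice = 'auto' }) {
  const given = [text, audioUrl, audioBase64].filter((v) => v !== undefined);
  if (given.length !== 1) {
    // Sketch-level guard (an assumption): avoid ambiguous requests.
    throw new Error('Provide exactly one of text, audioUrl, or audioBase64');
  }
  if (text !== undefined) {
    return { text, provider, voice };           // synthesis path (speak)
  }
  if (audioUrl !== undefined) {
    return { audio_url: audioUrl };             // pre-rendered path, fetched by the bot
  }
  return { audio_base64: audioBase64, format }; // pre-rendered path, inline audio
}
```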
Screen Content Pipeline
API request (image/url/video)
-> Playwright renders content on Xvfb (1920x1080)
-> Content displayed fullscreen
-> Bot clicks "Present now" in meeting UI
-> Participants see shared screen with content
Chat Pipeline
The bot interacts with the meeting’s native chat UI via DOM automation:
- Write: opens the chat panel, types the message, and sends it
- Read: captures messages from the chat panel (sender, text, timestamp)
Chat Read & Write
Chat enables two-way text communication in the meeting alongside voice.
Write a message:
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/chat" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"text": "Here is the meeting summary so far."}'
Read all messages:
curl "$API_BASE/bots/google_meet/abc-defg-hij/chat" \
-H "X-API-Key: $API_KEY"
Returns an object with messages and meeting_id. Each message has sender, text, timestamp (Unix milliseconds), and is_from_bot fields. Real-time chat events are also available via WebSocket (chat.received, chat.sent).
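Because chat read returns every captured message, an agent that polls it typically wants only participant messages it has not seen yet. A minimal sketch, assuming the response shape described above; `newParticipantMessages` and the `at` field are hypothetical names:

```javascript
// Hypothetical helper: participant messages newer than a cutoff, oldest first.
function newParticipantMessages(chatResponse, sinceMs = 0) {
  return chatResponse.messages
    .filter((m) => !m.is_from_bot && m.timestamp > sinceMs)
    .sort((a, b) => a.timestamp - b.timestamp)
    .map((m) => ({
      sender: m.sender,
      text: m.text,
      at: new Date(m.timestamp).toISOString(), // timestamp is Unix milliseconds
    }));
}
```

Passing the largest `timestamp` already processed as `sinceMs` on the next poll keeps the agent from reacting to the same message twice.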
Screen Share (Showing Images & Content)
Display visual content to meeting participants via the bot’s screen share. Three content types are supported:
| Type | Description |
|---|---|
| image | Renders an image fullscreen on a black background |
| url | Opens a URL in a browser window (e.g., Google Slides, dashboards) |
| video | Plays video fullscreen with autoplay |
# Share an image
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"type": "image", "url": "https://example.com/chart.png"}'
# Share a Google Slides presentation
curl -X POST "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"type": "url", "url": "https://docs.google.com/presentation/d/..."}'
# Stop sharing
curl -X DELETE "$API_BASE/bots/google_meet/abc-defg-hij/screen" \
-H "X-API-Key: $API_KEY"
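A caller may want to reject bad input before it reaches the bot. The sketch below validates the content type against the three supported values and checks that the URL parses. `buildScreenBody` is a hypothetical helper, and the restriction to http/https URLs is an assumption of this sketch rather than a documented rule.

```javascript
// The three content types listed in the table above.
const SCREEN_TYPES = ['image', 'url', 'video'];

// Hypothetical helper: validate input and build a /screen request body.
function buildScreenBody(type, url) {
  if (!SCREEN_TYPES.includes(type)) {
    throw new Error(`Unsupported screen content type: ${type}`);
  }
  const parsed = new URL(url); // throws a TypeError if the URL is malformed
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    // Assumption: only web URLs make sense for content the bot must load.
    throw new Error(`Unsupported URL scheme: ${parsed.protocol}`);
  }
  return { type, url };
}
```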
Avatar (Virtual Camera)
The virtual camera feature is experimental. It works intermittently on Google Meet due to WebRTC replaceTrack reliability. For displaying visual content to participants, screen share is recommended as the more reliable approach.
The virtual camera uses a canvas-based approach to replace the bot’s camera feed with custom content (e.g., an avatar image or animation). When working, participants see the avatar in the bot’s video tile instead of a blank camera.
You can set or reset the avatar at any time via the API:
# Set a custom avatar
curl -X PUT "$API_BASE/bots/google_meet/abc-defg-hij/avatar" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{"url": "https://example.com/avatar.png"}'
# Reset to default Vexa logo
curl -X DELETE "$API_BASE/bots/google_meet/abc-defg-hij/avatar" \
-H "X-API-Key: $API_KEY"
See the Avatar API reference for full details.
Current limitations:
- Only tested on Google Meet
- replaceTrack into WebRTC works intermittently
- Screen share is the recommended alternative for displaying images and content
WebSocket Events
When interactive bot mode is enabled, additional events are published on the WebSocket connection:
| Event | Payload | Description |
|---|---|---|
| speak.started | {"text": "..."} | Bot started speaking |
| speak.completed | (none) | Speech playback finished |
| speak.interrupted | (none) | Speech was interrupted via API |
| chat.received | {"sender": "John", "text": "...", "timestamp": 1234} | Chat message captured from a participant |
| chat.sent | {"text": "..."} | Bot sent a chat message |
| screen.sharing_started | {"content_type": "image"} | Screen sharing started |
| screen.sharing_stopped | (none) | Screen sharing stopped |
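One way to consume these events is a small router that maps event names to handlers. The sketch below is illustrative only: it assumes each WebSocket frame is a JSON object of the form `{"type": "<event name>", "payload": {...}}`, which this page does not specify, and `createEventRouter` is a hypothetical name. Adjust the field names to the actual frame format.

```javascript
// ASSUMPTION: frames look like { "type": "chat.received", "payload": { ... } }.
// Returns a function that dispatches one raw frame and reports whether it was handled.
function createEventRouter(handlers) {
  return function route(rawFrame) {
    let frame;
    try {
      frame = JSON.parse(rawFrame);
    } catch (err) {
      return false; // not JSON, ignore
    }
    if (frame === null || typeof frame !== 'object') return false;
    // Own-property check so names like "constructor" never resolve to built-ins.
    if (!Object.prototype.hasOwnProperty.call(handlers, frame.type)) return false;
    const payload = frame.payload === undefined ? null : frame.payload;
    handlers[frame.type](payload);
    return true;
  };
}
```

The returned function can be called from a WebSocket `message` listener with the raw frame text.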
Platform Support
| Feature | Google Meet | Teams | Zoom |
|---|---|---|---|
| Speak (TTS) | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Chat write | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Chat read | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Screen share | Supported | Beta (requires M365 Business Basic) | Requires Zoom SDK setup |
| Virtual camera | Experimental | — | — |
Prerequisites
Base URL for the local Vexa TTS service. Compose and Lite wire this automatically.
No OpenAI key is required for default speech. The default provider is piper, and the default voice is auto, which chooses a prepared Piper voice from the input language.
PulseAudio is already configured in the bot container (entrypoint.sh). No manual setup is needed.
Direct browser control (CDP / Playwright)
Beyond the REST speak/chat/screen commands, you can attach a real
Chrome DevTools Protocol (CDP) client — e.g. Playwright
— directly to a running bot’s browser and drive it like any page: click, type,
navigate, read the DOM. This is the lowest level of control, and the escape hatch
for situations the REST API doesn’t cover — most importantly clearing a join
blocker (a captcha or a “verify you’re human” challenge), either by a human over
VNC or by an AI agent over CDP.
Every meeting bot exposes a CDP endpoint through the gateway, keyed by meeting_id:
{GATEWAY}/b/{meeting_id}/cdp
This endpoint is access-controlled — meeting_id alone is NOT access.
meeting_id is a short, guessable integer, so it can never be the only gate.
Every CDP request (HTTP and WebSocket) requires a valid X-API-Key whose
user owns that meeting. Anything else — no key, or a key belonging to a
different user — is rejected with 403 Forbidden. There is no way to attach to
someone else’s bot.
Attach with Playwright (pass your API key as a header):
import { chromium } from 'playwright-core';
const browser = await chromium.connectOverCDP(
`${GATEWAY}/b/${meetingId}/cdp`,
{ headers: { 'X-API-Key': process.env.VEXA_API_KEY } } // owning user's key — required
);
const page = browser.contexts()[0].pages()[0];
console.log(await page.title()); // read the live meeting tab
// drive it directly — e.g. solve a challenge, click a control:
await page.getByRole('button', { name: 'Ask to join' }).click();
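The snippet above expects `GATEWAY` and `meetingId` to be in scope. A sketch of a helper that assembles both `connectOverCDP` arguments and fails early when the key is missing, so the problem shows up locally instead of as a 403 from the gateway. `cdpConnectArgs` is a hypothetical name; the URL pattern and header come from the text above.

```javascript
// Hypothetical helper: build the endpoint URL and options for chromium.connectOverCDP.
function cdpConnectArgs(gateway, meetingId, apiKey) {
  if (!Number.isInteger(meetingId) || meetingId <= 0) {
    throw new Error('meetingId must be a positive integer');
  }
  if (typeof apiKey !== 'string' || apiKey.length === 0) {
    throw new Error('An API key belonging to the meeting owner is required');
  }
  const endpoint = `${gateway.replace(/\/+$/, '')}/b/${meetingId}/cdp`;
  return { endpoint, options: { headers: { 'X-API-Key': apiKey } } };
}

// Usage with the Playwright snippet above:
//   const { endpoint, options } = cdpConnectArgs(GATEWAY, meetingId, process.env.VEXA_API_KEY);
//   const browser = await chromium.connectOverCDP(endpoint, options);
```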
When a bot hits a blocking state it can’t clear on its own, it escalates to
needs_human_help and brings up its remote-browser surface (see
Concepts); a human can take over via VNC, or an agent can attach via
the CDP endpoint above and resolve it programmatically. Because the bot is a real,
self-hostable browser, both paths stay inside your infrastructure and under
your API key — there is no third party in the loop.
Known Limitations
- Virtual camera is experimental: the canvas-based virtual camera works intermittently on Google Meet. Screen share is more reliable for displaying visual content.
- Text and file playback are separate paths: text requests synthesize through local Piper; audio_url and audio_base64 play the supplied file directly through ffmpeg/PulseAudio.
- Zoom requires native SDK artifacts: without Zoom Meeting SDK binaries, Zoom joins fail during startup.
- No speech queue: rapid speak commands may overlap. Wait for the speak.completed WebSocket event before sending the next command, or use DELETE /speak to interrupt.
- Teams avatar not visible: the Teams SFU returns a=inactive for video from anonymous guests, so the bot’s virtual camera/avatar is never visible to other Teams participants. Use screen share instead. (#124)
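Since the bot keeps no speech queue, one workaround is to queue on the client: send one speak command at a time and release the next only after speak.completed or speak.interrupted arrives. The sketch below shows that idea. `SpeakQueue` is a hypothetical class, `send` is a caller-supplied function that issues the POST /speak request, and `onEvent` must be wired to the WebSocket event stream by the caller.

```javascript
// Hypothetical client-side queue: at most one speak command in flight.
class SpeakQueue {
  constructor(send) {
    this.send = send;   // function(text) that issues POST /speak
    this.pending = [];  // texts waiting their turn
    this.busy = false;  // true while the bot is expected to be speaking
  }

  enqueue(text) {
    this.pending.push(text);
    this._drain();
  }

  // Call with the event name for every WebSocket event received.
  onEvent(type) {
    if (type === 'speak.completed' || type === 'speak.interrupted') {
      this.busy = false;
      this._drain();
    }
  }

  _drain() {
    if (this.busy || this.pending.length === 0) return;
    this.busy = true;
    this.send(this.pending.shift());
  }
}
```

This sketch has no timeout: if a completion event is lost, the queue stalls. A production version would likely add one.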