How It Works
Vexa uses DOM-based speaker correlation, not audio-based diarization. The bot observes the meeting platform's UI to detect who is currently speaking, then correlates that with the audio transcription stream.

Detection by Platform
| Platform | Method | Accuracy |
|---|---|---|
| Google Meet | Speaking indicator query via injected JavaScript. Detects which participant tile shows the “speaking” animation. | High (when visible) |
| Microsoft Teams | DOM traversal from media elements to participant display name containers. | High (when visible) |
| Zoom | Dual-path: active speaker bar detection + DOM traversal with audio activity correlation within 500ms window. | Moderate |
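The Zoom correlation step described in the table can be sketched as a pure function: given recent active-speaker events observed in the DOM, attribute an audio segment to the most recently active speaker within the 500 ms window. This is an illustrative sketch under that assumption, not Vexa's actual code; `SpeakerEvent` and `attributeSegment` are hypothetical names.

```typescript
// One active-speaker observation from the DOM (hypothetical shape).
interface SpeakerEvent {
  name: string;        // participant display name
  timestampMs: number; // when the speaking indicator was seen
}

// Correlation window mentioned in the table above.
const AUDIO_ACTIVITY_WINDOW_MS = 500;

// Attribute an audio segment to the most recently active speaker whose
// indicator fired within the window; return null if no one qualifies.
function attributeSegment(
  audioTimestampMs: number,
  recentEvents: SpeakerEvent[],
): string | null {
  const inWindow = recentEvents.filter(
    (e) => Math.abs(e.timestampMs - audioTimestampMs) <= AUDIO_ACTIVITY_WINDOW_MS,
  );
  if (inWindow.length === 0) return null;
  // The most recently active speaker wins the attribution.
  return inWindow.reduce((a, b) => (a.timestampMs >= b.timestampMs ? a : b)).name;
}
```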
Voting and Locking
Vexa uses a voting system to build confidence before committing a speaker identity:

- Each time a participant is detected as speaking, they get a vote (1.0 for a single speaker, 0.5 each for simultaneous speakers)
- Once a speaker reaches 2 votes at 70%+ confidence, the mapping is permanently locked
- Locked mappings are never re-evaluated — this prevents flip-flopping
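The vote-and-lock scheme above can be sketched as follows. This is an assumed interpretation of the stated rules (2 votes minimum, 70%+ share of all votes cast, locks are final); the class and method names are illustrative, not Vexa's actual API.

```typescript
type VoteState = { votes: number; locked: boolean };

// Minimal sketch of vote accumulation and permanent locking.
class SpeakerVoteTracker {
  private states = new Map<string, VoteState>();
  private totalVotes = 0;

  // Record one speaking detection. `simultaneous` lists everyone detected
  // as speaking at that moment: a lone speaker earns 1.0, others 0.5 each.
  recordSpeaking(simultaneous: string[]): void {
    const weight = simultaneous.length === 1 ? 1.0 : 0.5;
    for (const name of simultaneous) {
      const s = this.states.get(name) ?? { votes: 0, locked: false };
      if (!s.locked) s.votes += weight; // locked mappings are never re-evaluated
      this.states.set(name, s);
    }
    this.totalVotes += weight * simultaneous.length;
    this.tryLock();
  }

  private tryLock(): void {
    for (const s of this.states.values()) {
      // Lock once a speaker has >= 2 votes and >= 70% of all votes cast.
      if (!s.locked && s.votes >= 2 && s.votes / this.totalVotes >= 0.7) {
        s.locked = true;
      }
    }
  }

  isLocked(name: string): boolean {
    return this.states.get(name)?.locked ?? false;
  }
}
```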
What You Get
Transcript segments include a `speaker` field:
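An illustrative segment is shown below. Only the `speaker` field is confirmed by this document; the other field names and the timestamp format are assumptions for the sake of the example.

```json
{
  "speaker": "Alice Example",
  "text": "Let's review the quarterly numbers.",
  "start": 12.4,
  "end": 15.1
}
```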
Known Limitations
Screen sharing
During screen sharing, the speaking indicators may be hidden or reduced, depending on the platform. On Google Meet, speaker detection keeps working with 3+ participants because the speaking indicator still appears in the speaker bar. With only 2 participants during screen share, detection degrades.

Single device, multiple speakers
If multiple people speak through a single microphone (e.g., a conference room), Vexa sees them as one speaker: the display name of that device's participant. Audio-based diarization (separating speakers from a mixed audio stream) is not currently implemented. It was explored early in Vexa's development (e.g., via pyannote) but did not meet real-time latency requirements; it may be revisited in future versions.
Teams DOM changes
Speaker detection on Teams relies on specific DOM selectors. Microsoft may change these across Teams versions, which can temporarily break detection until the selectors are updated.

Speaker labels
If a participant joins without setting a display name, their segments may be attributed to "Unknown" or a generic label.
Configuration
Speaker detection is enabled by default and requires no configuration. The bot automatically uses the appropriate platform-specific detection method. Detection timing can be tuned in the bot's configuration:

- Zoom audio activity window: 500ms (speech is attributed only to the most recently active speaker within this window)
- Vote threshold: 2 votes at 70% confidence ratio for permanent lock
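Expressed as a configuration fragment, the tunables above might look like the following. The key names here are hypothetical, chosen for readability; they are not Vexa's actual configuration schema.

```yaml
speaker_detection:
  zoom_audio_activity_window_ms: 500  # correlation window for Zoom attribution
  vote_lock_threshold: 2              # votes required before locking a mapping
  vote_confidence_ratio: 0.7          # share of votes required at lock time
```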
Tips for Best Results
- Ask participants to set display names — speaker attribution is only as good as the names the platform provides
- Avoid single-device multi-speaker setups — use individual devices per speaker for accurate attribution
- Minimize screen sharing during critical attribution — speaking indicators are more reliable without screen share active
- 3+ participants on Google Meet — speaker bar appears with 3+ participants, improving detection during screen share