Vexa uses OpenAI’s Whisper model family for speech-to-text. This page covers engine selection, language support, quality tuning, and known issues.

Transcription Engines

| Engine | Config value | GPU required | Best for |
|---|---|---|---|
| Vexa remote (default) | `WHISPER_BACKEND=remote` | No | Production — uses Vexa’s hosted transcription service |
| faster-whisper | `WHISPER_BACKEND=faster_whisper` | Recommended | Self-hosted with GPU — runs locally via CTranslate2 |

Vexa remote transcription

The default and recommended option. Audio is sent to Vexa’s transcription service, which runs optimized Whisper inference. No GPU needed on your side.
TRANSCRIPTION_SERVICE_URL="https://transcription.vexa.ai/v1/audio/transcriptions"
TRANSCRIPTION_SERVICE_TOKEN="your-api-key"
Get a transcription API key at vexa.ai/dashboard/api-keys.
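As a sketch of what a direct call to the hosted endpoint could look like, assuming the endpoint follows the OpenAI-style `/v1/audio/transcriptions` convention (the `file` field name and `audio/wav` content type are assumptions, not confirmed details of Vexa’s API):

```python
import io
import os
import urllib.request
import uuid

def build_transcription_request(audio_path: str) -> urllib.request.Request:
    """Build a multipart/form-data POST for the transcription endpoint.

    Reads TRANSCRIPTION_SERVICE_URL and TRANSCRIPTION_SERVICE_TOKEN from
    the environment, matching the configuration shown above.
    """
    url = os.environ["TRANSCRIPTION_SERVICE_URL"]
    token = os.environ["TRANSCRIPTION_SERVICE_TOKEN"]
    boundary = uuid.uuid4().hex
    with open(audio_path, "rb") as f:
        audio = f.read()

    # Assemble the multipart body by hand so only the stdlib is needed.
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    disposition = (
        'Content-Disposition: form-data; name="file"; '
        f'filename="{os.path.basename(audio_path)}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    )
    body.write(disposition.encode())
    body.write(audio)
    body.write(f"\r\n--{boundary}--\r\n".encode())

    req = urllib.request.Request(url, data=body.getvalue(), method="POST")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req  # send with urllib.request.urlopen(req)
```

In normal operation the bot handles this for you; direct calls like this are only useful for smoke-testing credentials.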

Self-hosted faster-whisper

For full data sovereignty, run services/transcription-service/ on your own GPU:
| Model | VRAM (INT8) | Quality | Speed |
|---|---|---|---|
| large-v3-turbo (default) | ~2.1 GB | Excellent | Very fast |
| medium | ~1.5 GB | Good | Fast |
| small | ~0.5 GB | Moderate | Very fast |
| base | ~150 MB | Basic | Instant |
| tiny | ~75 MB | Low | Instant |
Configure via environment variables:
WHISPER_MODEL_SIZE=large-v3-turbo  # Model selection
WHISPER_COMPUTE_TYPE=int8          # int8 (default), float16, float32
WHISPER_DEVICE=cuda                # cuda or cpu
A single GPU handles approximately 2 concurrent meetings with large-v3-turbo. Beyond that, requests queue and latency increases.
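For rough capacity planning based on that ~2-meetings-per-GPU figure (the real ratio depends on model size, compute type, and hardware):

```python
import math

def gpus_needed(concurrent_meetings: int, meetings_per_gpu: int = 2) -> int:
    """Rough GPU count for a target number of concurrent meetings."""
    return max(1, math.ceil(concurrent_meetings / meetings_per_gpu))
```

For example, `gpus_needed(5)` suggests provisioning 3 GPUs to keep latency stable.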

Language Support

Whisper models are multilingual and support 99+ languages automatically. Language detection happens per audio segment — no configuration needed for most use cases.

Hallucination filtering by language

Vexa includes phrase-based hallucination filtering for these languages:
| Language | Phrases | File |
|---|---|---|
| English | 135 | hallucinations/en.txt |
| Spanish | 26 | hallucinations/es.txt |
| Portuguese | 13 | hallucinations/pt.txt |
| Russian | 13 | hallucinations/ru.txt |
Other languages are still transcribed, but without dedicated hallucination filtering. Community contributions for additional language lists are welcome — see the collection script at services/WhisperLive/hallucinations/collect_hallucinations.py.

Hallucination Filtering

Whisper can produce phantom text during silence or low-level noise. Vexa filters these at three points in the pipeline:

1. Phrase database

Known hallucination phrases (e.g., “Thank you for watching”, “Abonnez-vous”) are matched and removed. Matching is case-insensitive with punctuation normalization.
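A minimal sketch of this kind of normalized phrase matching (illustrative only; Vexa’s actual matcher lives in services/WhisperLive):

```python
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so matching is insensitive to both."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def is_known_hallucination(segment: str, phrases: set[str]) -> bool:
    """True if the whole segment matches a known hallucination phrase."""
    return normalize(segment) in {normalize(p) for p in phrases}
```

With this, “Thank you for watching!” matches the list entry “thank you for watching” despite case and punctuation differences.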

2. Repetition detection

If the same 3-6 word phrase repeats 3+ times in a row, the segment is filtered as a hallucination loop.
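The loop check can be sketched as a scan for a 3-6 word phrase occurring three or more times consecutively (a simplified version of the idea, not Vexa’s exact code):

```python
def is_repetition_loop(text: str, min_len: int = 3, max_len: int = 6,
                       min_repeats: int = 3) -> bool:
    """True if a 3-6 word phrase repeats 3+ times back to back."""
    words = text.split()
    for n in range(min_len, max_len + 1):
        for start in range(len(words) - n * min_repeats + 1):
            phrase = words[start:start + n]
            repeats = 1
            # Count how many exact copies of the phrase follow immediately.
            while words[start + repeats * n: start + (repeats + 1) * n] == phrase:
                repeats += 1
            if repeats >= min_repeats:
                return True
    return False
```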

3. Single-word heuristic

Single words under 10 characters that appear as standalone segments are filtered (commonly produced during silence).
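This heuristic is simple enough to state directly (illustrative sketch, not Vexa’s actual implementation):

```python
def is_silence_artifact(segment: str, max_chars: int = 10) -> bool:
    """Flag standalone single words under 10 characters as likely silence output."""
    words = segment.strip().split()
    return len(words) == 1 and len(words[0]) < max_chars
```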

Known Issues

Silence hallucinations

During extended silence, Whisper may generate repetitive or nonsensical text. The hallucination filter catches most of these, but some may slip through. If you notice recurring phantom phrases in a specific language, report them so they can be added to the filter list.

Timestamp shifting

When silence is removed from recordings, transcript timestamps can appear shifted. Timestamps are relative to the start of audio capture, not wall-clock time. During silence gaps, timestamps may not advance linearly.
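If you need wall-clock times, record the capture start time on your side and offset each segment timestamp from it; a minimal sketch:

```python
from datetime import datetime, timedelta

def to_wall_clock(capture_start: datetime, segment_offset_sec: float) -> datetime:
    """Convert a capture-relative transcript timestamp to wall-clock time."""
    return capture_start + timedelta(seconds=segment_offset_sec)
```

For example, a segment at offset 90 s in a capture started at 10:00:00 UTC maps to 10:01:30 UTC.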

Dashboard transcript merging

The dashboard UI may merge transcript segments from adjacent time ranges when there are silence gaps between them. This is a display issue, not a data issue — the underlying segments retain correct timestamps.

Tuning (Advanced)

The transcription pipeline has configurable voice activity detection (VAD) parameters:
| Parameter | Default | Effect |
|---|---|---|
| `minSilenceDurationMs` | 160 ms | Minimum silence to split segments. Increase for fewer, longer segments. |
| `maxSpeechDurationSec` | 15 s | Maximum segment length before forced boundary. |
| `minAudioDuration` | 2 s | Minimum audio before submitting to Whisper. |
| `idleTimeoutSec` | 15 s | Seconds of silence before final submission and buffer reset. |
These are configurable in the bot’s transcription client. For most use cases, defaults work well.
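As an illustration, a hypothetical set of overrides favoring fewer, longer segments might look like this (the key names mirror the parameter table above, but the exact configuration surface of the bot’s transcription client may differ):

```python
# Hypothetical VAD overrides for the bot's transcription client.
vad_overrides = {
    "minSilenceDurationMs": 300,  # default 160: split segments less eagerly
    "maxSpeechDurationSec": 15,   # keep the default forced boundary
    "minAudioDuration": 2,        # default: 2 s of audio before submission
    "idleTimeoutSec": 30,         # default 15: wait longer before buffer reset
}
```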