Transcription Engines
| Engine | Config value | GPU required | Best for |
|---|---|---|---|
| Vexa remote (default) | WHISPER_BACKEND=remote | No | Production — uses Vexa’s hosted transcription service |
| faster-whisper | WHISPER_BACKEND=faster_whisper | Recommended | Self-hosted with GPU — runs locally via CTranslate2 |
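
A minimal sketch of how a service might resolve this setting, assuming it is read from an environment variable named `WHISPER_BACKEND`. The two accepted values come from the table above; the helper `resolve_backend` and its validation behavior are hypothetical and are not taken from the Vexa codebase.

```python
import os

# Values from the engine table; everything else here is an illustrative assumption.
VALID_BACKENDS = {"remote", "faster_whisper"}


def resolve_backend(env=None):
    """Return the configured backend name, falling back to the 'remote' default."""
    env = os.environ if env is None else env
    value = env.get("WHISPER_BACKEND", "remote").strip()
    if value not in VALID_BACKENDS:
        raise ValueError(f"Unsupported WHISPER_BACKEND: {value!r}")
    return value
```
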
Vexa remote transcription
The default and recommended option. Audio is sent to Vexa’s transcription service, which runs optimized Whisper inference. No GPU needed on your side.
Self-hosted faster-whisper
For full data sovereignty, run services/transcription-service/ on your own GPU:
| Model | VRAM (INT8) | Quality | Speed |
|---|---|---|---|
| large-v3-turbo (default) | ~2.1 GB | Excellent | Very fast |
| medium | ~1.5 GB | Good | Fast |
| small | ~0.5 GB | Moderate | Very fast |
| base | ~150 MB | Basic | Instant |
| tiny | ~75 MB | Low | Instant |
large-v3-turbo. Beyond that, requests queue and latency increases.
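
The VRAM figures in the model table can drive a simple selection rule. The sketch below picks the highest-quality model that fits a given amount of free GPU memory. The `pick_model` helper and its `headroom_gb` safety margin are hypothetical; the footprints are the approximate INT8 values from the table, and real usage may vary with batch size and audio length.

```python
# Approximate INT8 VRAM footprints in GB, copied from the model table,
# ordered from highest to lowest listed quality.
MODEL_VRAM_GB = [
    ("large-v3-turbo", 2.1),
    ("medium", 1.5),
    ("small", 0.5),
    ("base", 0.15),
    ("tiny", 0.075),
]


def pick_model(free_vram_gb, headroom_gb=0.5):
    """Return the first (best) model whose footprint plus headroom fits, else None.

    headroom_gb is an assumed safety margin, not a documented requirement.
    """
    for name, vram_gb in MODEL_VRAM_GB:
        if vram_gb + headroom_gb <= free_vram_gb:
            return name
    return None
```
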
Language Support
Whisper models are multilingual and support 99+ languages. Language detection happens per audio segment, so most use cases need no configuration.
Hallucination filtering by language
Vexa includes phrase-based hallucination filtering for these languages:
| Language | Phrases | File |
|---|---|---|
| English | 135 | hallucinations/en.txt |
| Spanish | 26 | hallucinations/es.txt |
| Portuguese | 13 | hallucinations/pt.txt |
| Russian | 13 | hallucinations/ru.txt |
services/WhisperLive/hallucinations/collect_hallucinations.py.
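
A minimal sketch of a phrase-based filter, using the case-insensitive, punctuation-normalized matching this page describes. It assumes each phrase file holds one phrase per line and that a segment is dropped only when the whole normalized segment equals a known phrase; both are assumptions, and the function names are hypothetical rather than Vexa's actual implementation.

```python
import unicodedata


def normalize(text):
    """Casefold, strip Unicode punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    # Unicode general categories starting with "P" are punctuation.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def load_phrases(lines):
    """Build a lookup set from an iterable of phrase lines (assumed one per line)."""
    return {normalize(line) for line in lines if line.strip()}


def is_hallucination(segment, phrases):
    """True if the normalized segment exactly matches a known phrase."""
    return normalize(segment) in phrases


# Usage with a phrase file (path relative to the hallucinations directory):
#   with open("hallucinations/en.txt", encoding="utf-8") as f:
#       phrases = load_phrases(f)
```
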
Hallucination Filtering
Whisper can produce phantom text during silence or low-level noise. Vexa filters these at three points in the pipeline:
1. Phrase database
Known hallucination phrases (e.g., “Thank you for watching”, “Abonnez-vous”) are matched and removed. Matching is case-insensitive with punctuation normalization.
2. Repetition detection
If the same 3-6 word phrase repeats 3+ times in a row, the segment is filtered as a hallucination loop.
3. Single-word heuristic
Single words under 10 characters that appear as standalone segments are filtered (commonly produced during silence).
Known Issues
Silence hallucinations
During extended silence, Whisper may generate repetitive or nonsensical text. The hallucination filter catches most of these, but some may slip through. If you notice recurring phantom phrases in a specific language, report them so they can be added to the filter list.
Timestamp shifting
When silence is removed from recordings, transcript timestamps can appear shifted. Timestamps are relative to the start of audio capture, not wall-clock time. During silence gaps, timestamps may not advance linearly.
Dashboard transcript merging
The dashboard UI may merge transcript segments from adjacent time ranges when there are silence gaps between them. This is a display issue, not a data issue: the underlying segments retain correct timestamps.
Tuning (Advanced)
The transcription pipeline has configurable voice activity detection (VAD) parameters:
| Parameter | Default | Effect |
|---|---|---|
| minSilenceDurationMs | 160 ms | Minimum silence to split segments. Increase for fewer, longer segments. |
| maxSpeechDurationSec | 15 s | Maximum segment length before forced boundary. |
| minAudioDuration | 2 s | Minimum audio before submitting to Whisper. |
| idleTimeoutSec | 15 s | Seconds of silence before final submission and buffer reset. |
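
To make the interaction between these parameters concrete, here is a rough sketch of a buffering decision that uses the defaults from the table. The `VadConfig` class, the `next_action` function, the order of the checks, and the action names are all hypothetical; the real pipeline may combine these thresholds differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadConfig:
    # Defaults mirror the table; the field names are illustrative snake_case versions.
    min_silence_duration_ms: int = 160
    max_speech_duration_sec: float = 15.0
    min_audio_duration_sec: float = 2.0
    idle_timeout_sec: float = 15.0


def next_action(cfg, buffered_sec, trailing_silence_ms):
    """Decide what to do with the current audio buffer (assumed logic)."""
    if trailing_silence_ms >= cfg.idle_timeout_sec * 1000:
        return "flush_and_reset"  # long silence: final submission, then reset
    if buffered_sec >= cfg.max_speech_duration_sec:
        return "force_boundary"  # segment reached the maximum length
    if (trailing_silence_ms >= cfg.min_silence_duration_ms
            and buffered_sec >= cfg.min_audio_duration_sec):
        return "submit"  # enough audio, and a pause long enough to split on
    return "wait"
```
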