This page covers Vexa’s architecture, resource requirements per bot, and how to scale for concurrent meetings.
Architecture Overview
Vexa follows a one-browser-per-bot model. Each meeting bot runs as an isolated container with its own Chromium instance:
Meeting Bot (per meeting)    Shared Services
┌───────────────────────┐    ┌─────────────────────────────┐
│ Chromium (Playwright) │    │ API Gateway (port 8056)     │
│ Audio capture         │───>│ Meeting API (port 8080)     │
│ Speaker detection     │    │ Runtime API (port 8090)     │
│ Transcription client  │    │ Transcription Service (GPU) │
└───────────────────────┘    │ Redis, PostgreSQL           │
                             └─────────────────────────────┘
Bot containers are ephemeral — they are created when you request a bot and destroyed after the meeting ends (or after an idle timeout).
Resource Requirements Per Bot
| Resource | Request (steady-state) | Limit (peak) |
|---|---|---|
| CPU | 250m | 1000m |
| Memory | 600 Mi | 1 Gi |
| Shared memory | 2 GB (/dev/shm) | 2 GB |
These numbers were measured on production workloads (March 2026). The 2 GB shared memory is required by Chromium for canvas and media operations.
Estimating capacity
On a node with 4 CPU cores and 8 GB RAM:
| Constraint | Concurrent bots |
|---|---|
| CPU (by limit, worst case) | 4 |
| CPU (by request, typical) | 16 |
| Memory (by limit) | 8 |
| Practical recommendation | 4-8 |
Scale horizontally by adding more nodes rather than increasing node size.
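The capacity table above is plain division of node resources by per-bot figures. A minimal sketch of that arithmetic follows; the names `BotResources`, `estimate_capacity`, and `nodes_for` are hypothetical, the node's 8 GB is treated as 8 GiB, and no headroom is reserved for the operating system or shared services.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BotResources:
    """Per-bot figures from the resource table (hypothetical container type)."""
    cpu_request_m: int = 250    # steady-state CPU request, millicores
    cpu_limit_m: int = 1000     # peak CPU limit, millicores
    mem_limit_mib: int = 1024   # peak memory limit (1 Gi)


def estimate_capacity(node_cpu_cores: int, node_mem_gib: int,
                      bot: BotResources = BotResources()) -> dict:
    """Return how many bots fit on one node under each constraint."""
    node_cpu_m = node_cpu_cores * 1000
    node_mem_mib = node_mem_gib * 1024
    return {
        "cpu_by_limit": node_cpu_m // bot.cpu_limit_m,
        "cpu_by_request": node_cpu_m // bot.cpu_request_m,
        "memory_by_limit": node_mem_mib // bot.mem_limit_mib,
    }


def nodes_for(target_bots: int, bots_per_node: int) -> int:
    """Number of identical nodes needed to host target_bots."""
    return math.ceil(target_bots / bots_per_node)


if __name__ == "__main__":
    print(estimate_capacity(4, 8))
    print(nodes_for(20, 4))
```

For the 4-core, 8 GB node this reproduces the 4 / 16 / 8 figures in the table. Choosing a value from the conservative end of that range as `bots_per_node` gives a node count for horizontal scaling.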
Orchestration Backends
Vexa’s Runtime API supports three container backends, configured via ORCHESTRATOR_BACKEND:
| Backend | Value | CPU limits | Memory limits | Best for |
|---|---|---|---|---|
| Kubernetes | kubernetes | Enforced (pod limits) | Enforced (OOMKill) | Production |
| Docker | docker | Not enforced | Enforced (cgroups) | Single-host, dev |
| Process | process | Not enforced | Best-effort | Vexa Lite, lightweight dev |
The Docker backend silently ignores CPU limits, so bot containers get unrestricted CPU access. Use Kubernetes for production workloads where resource isolation matters.
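As an illustration of the backend table, a deployment script could check `ORCHESTRATOR_BACKEND` before start-up and report which limits will not be enforced. This is only a sketch: `BACKEND_ENFORCEMENT` and `backend_warnings` are hypothetical names that are not part of Vexa, and treating an unset variable as unknown is an assumption.

```python
import os
from typing import List, Mapping

# Enforcement behaviour copied from the backend table above.
BACKEND_ENFORCEMENT = {
    "kubernetes": {"cpu": "enforced", "memory": "enforced"},
    "docker": {"cpu": "not enforced", "memory": "enforced"},
    "process": {"cpu": "not enforced", "memory": "best-effort"},
}


def backend_warnings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable warnings for the configured backend."""
    backend = env.get("ORCHESTRATOR_BACKEND", "")
    if backend not in BACKEND_ENFORCEMENT:
        # Assumption: an unset or unrecognised value is reported, not defaulted.
        return [f"unknown ORCHESTRATOR_BACKEND: {backend!r}"]
    enforcement = BACKEND_ENFORCEMENT[backend]
    warnings = []
    if enforcement["cpu"] != "enforced":
        warnings.append(f"{backend}: CPU limits are not enforced")
    if enforcement["memory"] != "enforced":
        warnings.append(f"{backend}: memory limits are best-effort")
    return warnings


if __name__ == "__main__":
    for line in backend_warnings():
        print("WARNING:", line)
```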
Kubernetes deployment
Vexa provides Helm charts at deploy/helm/:
# Install
helm install vexa deploy/helm/charts/vexa \
  -f your-values.yaml \
  --namespace vexa --create-namespace

# Upgrade
helm upgrade vexa deploy/helm/charts/vexa \
  -f your-values.yaml \
  --namespace vexa
Service resource allocations in the Helm chart:
| Service | CPU request | Memory limit |
|---|---|---|
| api-gateway | 100m | 512 Mi |
| meeting-api | 200m | 1 Gi |
| runtime-api | 100m | 512 Mi |
| redis | 100m | 1 Gi |
| postgres | 200m | 4 Gi |
Bot containers are dynamically created as Kubernetes pods by the Runtime API — they are not part of the Helm release.
Docker Compose deployment
Docker Compose is intended for single-host development and testing.
Memory limits per service are defined in deploy/compose/docker-compose.yml. See Docker Compose Deployment for the full guide.
Bot Lifecycle and Cleanup
Bot containers have automatic timeouts:
| Timeout | Default | Description |
|---|---|---|
| Waiting room | 15 min | Bot leaves if not admitted within 15 minutes |
| Everyone left | 15 min | Bot leaves 15 minutes after last participant leaves |
| No one joined | 2 min | Bot leaves if no participant joins within 2 minutes |
| Idle TTL | 5-60 min | Container removed after idle timeout (configurable per profile) |
Containers are automatically cleaned up after meetings end. The Runtime API uses Redis-backed heartbeats to track liveness.
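The three meeting-level timeouts can be read as a small decision rule. The sketch below is not Vexa's implementation: `BotState`, its fields, and `leave_reason` are hypothetical, and it assumes the "no one joined" timer starts when the bot is admitted. The Idle TTL is left out because it applies to container removal rather than to the bot leaving a meeting.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults from the timeout table, in minutes.
WAITING_ROOM_TIMEOUT_MIN = 15
EVERYONE_LEFT_TIMEOUT_MIN = 15
NO_ONE_JOINED_TIMEOUT_MIN = 2


@dataclass
class BotState:
    """Hypothetical snapshot of a bot's view of the meeting."""
    admitted: bool = False
    minutes_since_join_request: float = 0.0
    minutes_since_admitted: Optional[float] = None
    participants_seen: bool = False
    minutes_since_last_participant_left: Optional[float] = None


def leave_reason(state: BotState) -> Optional[str]:
    """Return which timeout has fired, or None if the bot should stay."""
    if not state.admitted:
        if state.minutes_since_join_request >= WAITING_ROOM_TIMEOUT_MIN:
            return "waiting_room"
        return None
    if not state.participants_seen:
        waited = state.minutes_since_admitted
        if waited is not None and waited >= NO_ONE_JOINED_TIMEOUT_MIN:
            return "no_one_joined"
        return None
    empty_for = state.minutes_since_last_participant_left
    if empty_for is not None and empty_for >= EVERYONE_LEFT_TIMEOUT_MIN:
        return "everyone_left"
    return None
```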
Per-User Concurrency Limits
The Admin API supports per-user bot limits:
curl -X POST "$API_BASE/admin/users" \
  -H "X-Admin-API-Key: $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com", "max_concurrent_bots": 5}'
Set max_concurrent_bots to limit how many simultaneous bots a user can run.
Transcription Service Scaling
If self-hosting transcription, a single GPU handles approximately 2 concurrent meetings with large-v3-turbo. The service returns 503 when the queue is full.
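Because the service answers 503 when its queue is full, a caller may want to retry with backoff instead of failing immediately. The following is a minimal sketch of exponential backoff, not the behavior of Vexa's own transcription client: `send_with_backoff` is a hypothetical helper, `send` stands in for whatever function performs the request and returns an HTTP status code, and the attempt count and delays are illustrative values.

```python
import time
from typing import Callable


def send_with_backoff(send: Callable[[], int],
                      max_attempts: int = 5,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns something other than 503.

    Waits base_delay_s, then 2x, 4x, ... between attempts. Returns the
    last status code seen, which is 503 if every attempt was rejected.
    """
    status = 503
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)
    return status
```

Passing `sleep` as a parameter keeps the helper testable without real delays.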
For higher concurrency:
- Run multiple transcription service replicas behind a load balancer
- Use smaller models (small, base) for higher throughput at lower quality
- Use INT8 compute type (default) for 50-60% VRAM reduction
See Transcription Quality for model selection details.
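For rough sizing, the figure of about 2 concurrent meetings per GPU with large-v3-turbo turns into a replica count by rounding up. A minimal sketch, where `replicas_needed` is a hypothetical helper and one GPU per replica is an assumption:

```python
import math


def replicas_needed(concurrent_meetings: int, meetings_per_gpu: int = 2) -> int:
    """Transcription replicas (one GPU each, by assumption) for a target load."""
    if concurrent_meetings <= 0:
        return 0
    return math.ceil(concurrent_meetings / meetings_per_gpu)


if __name__ == "__main__":
    print(replicas_needed(10))
```

Smaller models raise `meetings_per_gpu`, but the source gives no per-model figures, so any other value has to be measured.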
Deployment Options Summary
| Option | Bots | Scaling | Complexity |
|---|---|---|---|
| Vexa Lite | Process-based (in-container) | Vertical only | Lowest |
| Docker Compose | Docker containers | Single-host | Low |
| Helm / Kubernetes | K8s pods | Horizontal | Medium |