Test TTS & STT models in your browser. No server required.
TTSLab is a browser-based application that enables local execution and comparison of text-to-speech (TTS) and speech-to-text (STT) models without relying on remote servers, API keys, or cloud infrastructure. It leverages WebGPU and WebAssembly (WASM) to perform on-device inference, ensuring full data privacy and low-latency interaction. The tool is designed for developers evaluating model performance, researchers conducting reproducible benchmarks, and product teams comparing voice characteristics across models.
The application supports multiple open-source models—including Kokoro 82M, Whisper Base and Small, Moonshine Base, and Supertonic 2—with each model downloaded once and cached locally in the browser. Users can run side-by-side voice comparisons, execute inference benchmarks, or interact with a fully client-side Voice Agent—all without transmitting audio or text externally.
TTSLab operates through a three-stage workflow. First, users select one or more models from the integrated directory—each labeled by type (TTS or STT), architecture, parameter count, and size. Upon selection, model weights are fetched over HTTPS and stored in the browser’s cache (e.g., IndexedDB or Cache API); subsequent use loads weights directly from local storage without re-downloading.
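The download-once, cache-forever behavior can be sketched as a small loader with an injectable store. The names (`loadWeights`, `WeightStore`) are illustrative, not TTSLab's actual API; in a browser build the store would be backed by the Cache API (`caches.open(...)`) or IndexedDB:

```typescript
// Illustrative store interface: anything that can persist weight bytes by key.
interface WeightStore {
  get(key: string): Promise<ArrayBuffer | undefined>;
  put(key: string, data: ArrayBuffer): Promise<void>;
}

// Fetch weights over HTTPS on first use; afterwards serve them from the
// local store so the model is never re-downloaded.
async function loadWeights(
  url: string,
  store: WeightStore,
  fetchFn: (url: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer> {
  const cached = await store.get(url);
  if (cached !== undefined) return cached; // cache hit: no network request
  const fresh = await fetchFn(url);        // cache miss: download once
  await store.put(url, fresh);             // persist for subsequent sessions
  return fresh;
}
```

Injecting the store and fetch function keeps the caching logic independent of any particular browser storage API.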
Second, inference is executed entirely within the browser context. WebGPU provides hardware-accelerated tensor computation where supported; otherwise, WASM serves as a portable fallback runtime. Input text is synthesized into speech (for TTS), or audio is transcribed into text (for STT), with all intermediate data remaining in memory.
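The WebGPU-first, WASM-fallback selection amounts to a capability check. A minimal sketch, with an illustrative function name; in a real build the flags would come from probing `navigator.gpu` (e.g., whether `requestAdapter()` returns an adapter) and `WebAssembly` support:

```typescript
type Backend = "webgpu" | "wasm";

// Prefer hardware-accelerated WebGPU when available; otherwise fall back
// to the portable WASM runtime. Throws only if neither runtime exists.
function pickBackend(hasWebGPU: boolean, hasWasm: boolean): Backend {
  if (hasWebGPU) return "webgpu";
  if (hasWasm) return "wasm";
  throw new Error("No supported inference runtime in this browser");
}
```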
Third, results are rendered in the UI—audio playback for TTS, transcribed text for STT, or conversational responses for the Voice Agent. No network requests occur during inference, and no persistent identifiers or usage analytics are collected.
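To play synthesized speech without any network round trip, the raw PCM samples a TTS model emits can be wrapped in a WAV container and handed to an `<audio>` element via a local object URL. A minimal mono 16-bit encoder, as an illustrative sketch rather than TTSLab's actual rendering path:

```typescript
// Wrap mono float PCM samples in a 44-byte RIFF/WAV header so the browser
// can play them directly, with no server involvement.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const dataSize = samples.length * 2; // 16-bit PCM: 2 bytes per sample
  const buf = new ArrayBuffer(44 + dataSize);
  const v = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  v.setUint32(4, 36 + dataSize, true);
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  v.setUint32(16, 16, true);             // fmt chunk size
  v.setUint16(20, 1, true);              // audio format: PCM
  v.setUint16(22, 1, true);              // channels: mono
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, sampleRate * 2, true); // byte rate
  v.setUint16(32, 2, true);              // block align
  v.setUint16(34, 16, true);             // bits per sample
  writeStr(36, "data");
  v.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    v.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buf);
}
```

In the browser, `URL.createObjectURL(new Blob([wav], { type: "audio/wav" }))` then yields a playable source for an `<audio>` tag, keeping the audio entirely on-device.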
TTSLab enables privacy-sensitive evaluation of speech AI, particularly valuable for domains with strict compliance requirements—such as healthcare documentation, legal transcription, or internal enterprise communications—where sending sensitive content to third-party APIs is prohibited. Researchers benefit from standardized, reproducible inference environments across hardware configurations, facilitating fair model comparisons and benchmark reporting. Developers use it to prototype voice interfaces, validate model behavior before backend integration, or test multilingual support without provisioning cloud resources. Product teams leverage side-by-side voice previews to assess naturalness, latency, and language coverage when selecting TTS voices for end-user applications.