Real-Time Audio Transcription
Browser speech to text in near real time, powered by OpenAI Whisper.
Overview
Existing speech-to-text tools either charge per API call or require sending complete audio files and waiting for a batch response. For live use cases such as accessibility tools, meeting notes, and live captions, that latency is unacceptable. This project streams chunked browser audio over WebSockets to a Flask-SocketIO server, runs Whisper inference on each chunk, and pushes partial transcriptions back to the UI in near real time.
The Problem
Paid STT APIs are expensive and require data to leave the device. Offline alternatives are desktop-only. Browser-based demos that upload whole files introduce multi-second delays that make them useless for live transcription scenarios.
The Solution
The project implements a WebSocket streaming pipeline. The browser captures microphone audio via the Web Audio API, splits it into configurable chunks, and streams them to Flask-SocketIO. The server normalises each chunk to 16 kHz mono, runs Whisper base-model inference, and emits partial transcription tokens back through the socket. The frontend appends text as it arrives, giving a real-time typewriter effect.
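The server-side normalisation step can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code. It assumes the browser sends interleaved 16-bit little-endian PCM (the wire format is not stated here), and the function names `pcm16_to_mono_float` and `resample_linear` are hypothetical. Linear interpolation applies no anti-aliasing filter, so a production pipeline would more likely use a dedicated resampling library.

```python
import sys
from array import array

TARGET_RATE = 16_000  # sample rate (Hz) that Whisper models expect


def pcm16_to_mono_float(data: bytes, channels: int) -> list[float]:
    """Decode interleaved little-endian int16 PCM and downmix to mono floats in [-1, 1)."""
    samples = array("h")
    samples.frombytes(data)  # raises ValueError if len(data) is odd
    if sys.byteorder == "big":
        samples.byteswap()  # the assumed wire format is little-endian
    frames = len(samples) // channels
    scale = channels * 32768.0  # averages the channels and scales to [-1, 1)
    return [
        sum(samples[i * channels + c] for c in range(channels)) / scale
        for i in range(frames)
    ]


def resample_linear(
    samples: list[float], src_rate: int, dst_rate: int = TARGET_RATE
) -> list[float]:
    """Resample by linear interpolation (illustrative only, no low-pass filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for n in range(out_len):
        pos = n * step
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, last)]  # clamp at the final sample
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```

For example, a 10 ms stereo chunk captured at 48 kHz (480 frames) would come out as 160 mono samples at 16 kHz, ready to hand to the model.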
Key Features
- Live microphone capture using the Web Audio API with configurable chunk size
- WebSocket bidirectional streaming via Flask-SocketIO
- OpenAI Whisper base model for high-accuracy multilingual transcription
- Voice Activity Detection (VAD) to skip silent audio segments
- Overlapping audio window to prevent word-boundary split errors
- Real-time text display that appends tokens as they are recognised
- Session-scoped audio buffer management to prevent memory leaks
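The Voice Activity Detection feature listed above could, in its simplest form, be an energy gate. The sketch below is an assumption rather than the project's method: the text does not say which VAD technique is used, the names `rms` and `is_speech` are hypothetical, and the `0.01` threshold is an arbitrary placeholder that would need tuning against real microphone noise. It expects samples already normalised to floats in [-1, 1].

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a chunk; 0.0 for an empty chunk."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples: list[float], threshold: float = 0.01) -> bool:
    """Treat a chunk as speech when its RMS level reaches the threshold.

    The threshold is a hypothetical default, not a value from the project.
    """
    return rms(samples) >= threshold
```

A server would call `is_speech` before running inference and simply drop chunks that fail the check. An energy gate is cheap but is easily fooled by loud background noise, which is why model-based detectors are a common alternative.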
Challenges & Learnings
The trickiest issue was word-boundary splitting: if a word spans two consecutive chunks, both halves get misidentified. The fix was a 20% overlapping window, in which each new chunk includes the tail of the previous one, giving Whisper enough context to decode boundary words correctly. Browser audio compatibility was another hurdle, because Chrome and Firefox produce different PCM sample rates and bit depths. A normalisation step now converts all incoming audio to 16 kHz mono float32 before passing it to the model.
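The overlapping window, combined with per-session buffers, can be sketched as below. This is a minimal illustration under stated assumptions: the class `OverlapBuffer`, the helpers `window_for` and `close_session`, and the `sessions` dictionary are all hypothetical names, and the overlap is taken as 20% of the incoming chunk's length. In a Flask-SocketIO server the session key would typically be the connection's session ID, and `close_session` would run from the disconnect handler.

```python
class OverlapBuffer:
    """Keeps the tail of the previous chunk so it can be prepended to the next."""

    def __init__(self, overlap_ratio: float = 0.2) -> None:
        self.overlap_ratio = overlap_ratio
        self._tail: list[float] = []

    def extend(self, chunk: list[float]) -> list[float]:
        """Return previous tail + chunk, then remember this chunk's own tail."""
        window = self._tail + list(chunk)
        keep = int(len(chunk) * self.overlap_ratio)
        self._tail = list(chunk[-keep:]) if keep else []
        return window


# One buffer per client session (hypothetical registry keyed by session ID).
sessions: dict[str, OverlapBuffer] = {}


def window_for(session_id: str, chunk: list[float]) -> list[float]:
    """Build the inference window for a session, creating its buffer on first use."""
    return sessions.setdefault(session_id, OverlapBuffer()).extend(chunk)


def close_session(session_id: str) -> None:
    """Drop the session's buffer on disconnect so it cannot leak memory."""
    sessions.pop(session_id, None)
```

Because the overlapped audio is transcribed twice, the text returned for each window repeats some words from the previous one. Removing that repetition is a separate step that this sketch does not cover.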