Real-Time Audio Transcription

Browser speech-to-text in near real time, powered by OpenAI Whisper.

Overview

Existing speech-to-text tools either charge per API call or require uploading a complete audio file and waiting for a batch response. For live use cases such as accessibility tools, meeting notes, and live captions, that wait is unacceptable. This project streams chunked browser audio over WebSockets to a Flask-SocketIO server, runs Whisper inference on each chunk, and pushes partial transcriptions back to the UI in near real time.

The Problem

Paid STT APIs are expensive and require data to leave the device. Offline alternatives are desktop-only. Browser-based demos that upload whole files introduce multi-second delays that make them useless for live transcription scenarios.

The Solution

Implemented a WebSocket streaming pipeline. The browser captures microphone audio via the Web Audio API, splits it into configurable chunks, and streams them to Flask-SocketIO. The server normalises each chunk to 16 kHz mono, runs Whisper base-model inference, and emits partial transcription tokens back through the socket. The frontend appends text as it arrives, giving a real-time typewriter effect.
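The server-side normalisation step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the browser sends interleaved little-endian 16-bit PCM together with its sample rate and channel count, and the name `normalise_chunk` is hypothetical. It uses only the Python standard library so it runs as-is; with numpy (listed in the stack) the same steps would likely be vectorised.

```python
import struct
from array import array

TARGET_RATE = 16_000  # Whisper models take 16 kHz mono input


def normalise_chunk(pcm16: bytes, sample_rate: int, channels: int) -> array:
    """Convert interleaved int16 PCM to 16 kHz mono float32 in [-1.0, 1.0]."""
    # Ignore any trailing partial frame.
    frame_count = len(pcm16) // (2 * channels)
    ints = struct.unpack(
        f"<{frame_count * channels}h", pcm16[: frame_count * channels * 2]
    )

    # Downmix by averaging the channels of each frame, then scale to floats.
    mono = [
        sum(ints[i * channels : (i + 1) * channels]) / (channels * 32768.0)
        for i in range(frame_count)
    ]
    if sample_rate == TARGET_RATE or frame_count == 0:
        return array("f", mono)

    # Resample by linear interpolation between neighbouring input samples.
    out_len = round(frame_count * TARGET_RATE / sample_rate)
    step = sample_rate / TARGET_RATE
    resampled = []
    for n in range(out_len):
        pos = n * step
        left = min(int(pos), frame_count - 1)
        right = min(left + 1, frame_count - 1)
        frac = pos - left
        resampled.append(mono[left] * (1.0 - frac) + mono[right] * frac)
    return array("f", resampled)
```

Linear interpolation applies no anti-aliasing filter, so a production pipeline would probably use a dedicated resampler instead. The sketch is only meant to show the order of operations: decode, downmix, resample.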

Key Features

Live microphone capture using the Web Audio API with configurable chunk size
WebSocket bidirectional streaming via Flask-SocketIO
OpenAI Whisper base model for high-accuracy multilingual transcription
Voice Activity Detection (VAD) to skip silent audio segments
Overlapping audio window to prevent word-boundary split errors
Real-time text display that appends tokens as they are recognised
Session-scoped audio buffer management to prevent memory leaks
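The VAD feature in the list above could be approximated with a simple energy gate. The project does not say which VAD method it uses, so this is a stand-in technique: the function name `is_speech` and the -40 dBFS threshold are assumptions chosen for illustration, and the input is taken to be normalised float samples in [-1.0, 1.0].

```python
import math
from typing import Sequence


def is_speech(samples: Sequence[float], threshold_db: float = -40.0) -> bool:
    """Return True when the chunk's RMS level exceeds threshold_db (dBFS)."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return False  # digital silence; avoids log10(0)
    return 20.0 * math.log10(rms) > threshold_db
```

A server handler would call this before inference and skip the chunk when it returns False. An energy gate is cheap but lets through loud non-speech noise, which is why dedicated VAD models are often preferred.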

Challenges & Learnings

The trickiest issue was word-boundary splitting: if a word spans two consecutive chunks, both halves get misidentified. Solved by implementing a 20% overlapping window, where each new chunk includes the tail of the previous one, giving Whisper enough context to decode boundary words correctly.

Browser audio compatibility was another hurdle, since Chrome and Firefox produce different PCM sample rates and bit depths. Added a normalisation step that converts all incoming audio to 16 kHz mono float32 before passing it to the model.
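The 20% overlapping window can be sketched as a small per-session buffer. This is an illustration under assumptions, not the project's code: the class name `OverlapBuffers` and its methods are hypothetical, the overlap is measured as 20% of the incoming chunk's length (the text does not say what the 20% is relative to), and chunks are plain lists of float samples.

```python
from typing import Dict, List

OVERLAP_RATIO = 0.2  # 20% overlap, taken relative to the chunk length


class OverlapBuffers:
    """Keeps the tail of each session's last chunk to prefix the next one."""

    def __init__(self, overlap_ratio: float = OVERLAP_RATIO) -> None:
        self.overlap_ratio = overlap_ratio
        self._tails: Dict[str, List[float]] = {}

    def next_window(self, session_id: str, chunk: List[float]) -> List[float]:
        """Return the stored tail plus the new chunk, then store the new tail."""
        window = self._tails.get(session_id, []) + chunk
        tail_len = int(len(chunk) * self.overlap_ratio)
        self._tails[session_id] = chunk[-tail_len:] if tail_len else []
        return window

    def end_session(self, session_id: str) -> None:
        """Drop the stored tail when a client disconnects."""
        self._tails.pop(session_id, None)
```

Each session holds at most one tail, and calling `end_session` on disconnect releases it, which is one way to keep buffers session-scoped. Because consecutive windows share audio, the transcripts of adjacent windows can repeat words at the seam, so the caller would still need to merge them.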

Tech Stack

Backend

Python, Flask, Flask-SocketIO, soundfile, numpy

AI / ML

Hugging Face Whisper, OpenAI Whisper base model, PyTorch

Frontend

HTML, CSS, JavaScript, Web Audio API

Protocol

WebSocket, Socket.IO
