Two open speech-to-text projects, both with German support. Where each one wins, where each one loses, what to pick.

On clean studio German, both models reach the 'good enough that you read the transcript end to end without getting jolted' threshold. The differences live at the edges.
Vosk is a Kaldi-based system: traditional acoustic models + a language model, optimised to run cheaply on CPU. Whisper is a transformer-based seq2seq model: heavier, larger, designed to run on GPU but tractable on CPU for shorter clips. The 'shape' of each model's errors is different too - Vosk tends to hear plausible-but-wrong words, Whisper tends to hallucinate fluent prose.
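One way to surface Vosk's plausible-but-wrong words is to look at per-word confidence. The sketch below assumes the recogniser was set up with `SetWords(True)`, in which case Vosk's result JSON carries a `result` list with `word`, `conf`, `start` and `end` for each entry. The function name, the 0.6 threshold and the sample JSON are illustrative assumptions, not tuned values or real output.

```python
import json


def low_confidence_words(result_json: str, threshold: float = 0.6):
    """Return (word, conf, start) for words the recogniser was unsure about.

    Assumes the JSON shape Vosk emits when word output is enabled:
    {"result": [{"word": ..., "conf": ..., "start": ..., "end": ...}], "text": ...}
    The threshold is a hypothetical starting point, not a tuned value.
    """
    payload = json.loads(result_json)
    flagged = []
    # "result" is absent when word output is off or nothing was recognised.
    for entry in payload.get("result", []):
        if entry["conf"] < threshold:
            flagged.append((entry["word"], entry["conf"], entry["start"]))
    return flagged


# Hand-written sample in the shape of a Vosk result (not real model output).
SAMPLE = json.dumps({
    "result": [
        {"word": "wir", "conf": 0.98, "start": 0.00, "end": 0.21},
        {"word": "nutzen", "conf": 0.95, "start": 0.21, "end": 0.60},
        {"word": "kuba", "conf": 0.41, "start": 0.60, "end": 1.02},
    ],
    "text": "wir nutzen kuba",
})

print(low_confidence_words(SAMPLE))
```

Flagged words can be highlighted for a human to check. A Whisper hallucination is harder to catch this way, because the fabricated text tends to be fluent across a whole segment.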
We ran both against a 90-minute studio recording of two native German speakers in conversation. Vosk Large: 96.2 % word-level accuracy. Whisper Large-v3: 97.8 %. The 1.6-percentage-point gap shows up almost entirely in three kinds of material: rare proper nouns (Whisper does better), English loan words pronounced German-style (Whisper does better), and conversational filler / disfluencies (Vosk drops them more cleanly).
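For readers who want to reproduce this kind of number: word-level accuracy is commonly computed as one minus the word error rate, where errors are the substitutions, deletions and insertions needed to turn the model output into a reference transcript. The sketch below is a minimal version of that calculation, assuming lower-casing and whitespace tokenisation as the only normalisation. It is not necessarily the exact scoring script behind the figures above, and the function names are our own.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over words: substitutions + deletions + insertions."""
    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, using lower-cased, whitespace-split words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(ref, hyp) / len(ref)


# One substituted word out of four reference words.
print(word_accuracy("wir treffen uns morgen", "wir treffen und morgen"))  # 0.75
```

Because insertions count as errors, this figure can go negative for a transcript that adds many words that were never spoken.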
On a noisy 1-on-1 phone call we recorded for a different test, Vosk Large held 86 %; Whisper Large-v3 hallucinated. Whisper's seq2seq decoder is generative - when the audio is noisier than anything it saw in training, it'll happily produce confident, fluent, completely fabricated sentences. Vosk's Kaldi decoder is more honest: it gives up.
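A common partial mitigation is to filter Whisper's output using the per-segment statistics it reports. The sketch below assumes segments shaped like the dictionaries in the `segments` list returned by the open-source `whisper` package's `transcribe()`, with the keys `no_speech_prob`, `avg_logprob` and `compression_ratio`. The thresholds mirror what we understand to be that package's defaults, the helper names are hypothetical, and the sample segments are hand-written rather than real model output.

```python
def looks_hallucinated(segment: dict,
                       no_speech_threshold: float = 0.6,
                       logprob_threshold: float = -1.0,
                       compression_ratio_threshold: float = 2.4) -> bool:
    """Heuristic check on one Whisper-style segment dict."""
    # The model suspects silence or noise, yet emitted low-confidence text.
    silent = (segment["no_speech_prob"] > no_speech_threshold
              and segment["avg_logprob"] < logprob_threshold)
    # Looping, repetitive text compresses well, so its ratio is unusually high.
    repetitive = segment["compression_ratio"] > compression_ratio_threshold
    return silent or repetitive


def keep_trustworthy(segments: list[dict]) -> list[dict]:
    """Drop segments that trip either heuristic."""
    return [s for s in segments if not looks_hallucinated(s)]


# Hand-written segments in the shape of Whisper output (not real results).
SEGMENTS = [
    {"text": "Guten Tag, hier ist Meier.",
     "no_speech_prob": 0.02, "avg_logprob": -0.21, "compression_ratio": 1.3},
    {"text": "Vielen Dank.",
     "no_speech_prob": 0.83, "avg_logprob": -1.40, "compression_ratio": 1.1},
    {"text": "ja ja ja ja ja ja ja ja",
     "no_speech_prob": 0.10, "avg_logprob": -0.50, "compression_ratio": 3.1},
]

for segment in keep_trustworthy(SEGMENTS):
    print(segment["text"])
```

This only catches hallucinations that the model itself scores as doubtful. A fabricated sentence with a high average log-probability passes straight through, which is why a filter like this narrows the problem without removing it.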
Vosk runs in production on a single Hetzner CPU box without breaking a sweat. Whisper Large needs a GPU to be responsive, which more than triples the hosting cost and pulls the workload into a much smaller pool of providers - most of them US-based. For a service that promises 'audio doesn't leave the EU', that's not a neutral choice.
We picked Vosk, for three reasons. One, the accuracy gap on the kind of audio our users actually record (mostly clean speech, mostly under five minutes) is small enough that it's worth trading for the other two. Two, Vosk runs comfortably on EU CPU boxes; Whisper Large doesn't. Three, Vosk's error mode is admission instead of hallucination, and we think that's the safer default for a transcription tool.
Free plan, no credit card. We host in Germany. You can export and delete everything self-serve.
Read next
Sprachmemo vs OpenAI Whisper API
Open-weights Whisper, the hosted API, and Vosk on your own servers - which is the right shape for what.
When to use Vosk Small vs Vosk Large
Fast and good-enough, or slow and excellent.
How to transcribe a long German interview accurately
From microphone to publishable transcript.