Vosk vs Whisper on German: an honest field comparison

Two open speech-to-text projects, both with German support. Where each one wins, where each one loses, what to pick.

Finn Glas, Co-Founder + Engineering · May 16, 2026 · 2 min read

On clean studio German, both models reach the 'good enough that you read the transcript end to end without getting jolted' threshold. The differences live at the edges.

Architectural difference

Vosk is a Kaldi-based system: a traditional acoustic model plus a language model, optimised to run cheaply on CPU. Whisper is a transformer-based sequence-to-sequence model: larger and heavier to run, designed for GPU but tractable on CPU for shorter clips. The 'shape' of each model's errors is different too - Vosk tends to hear plausible-but-wrong words, Whisper tends to hallucinate fluent prose.
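One way to surface Vosk's plausible-but-wrong words is to look at per-word confidence. When word output is enabled on the recogniser (SetWords(True) in the Python bindings), each result is a JSON string whose "result" list carries a "conf" value per word. The sketch below is a minimal illustration, not the pipeline described in this post: the helper name, the 0.6 threshold, and the sample payload are all assumptions made for the example.

```python
import json


def low_confidence_words(result_json: str, threshold: float = 0.6) -> list[tuple[str, float]]:
    """Return (word, conf) pairs below `threshold` from one Vosk result string.

    `threshold` is a hypothetical cut-off; tune it on your own audio.
    """
    result = json.loads(result_json)
    # "result" is missing when word output is disabled or nothing was recognised.
    words = result.get("result", [])
    return [(w["word"], w["conf"]) for w in words if w["conf"] < threshold]


# Hand-written payload in the shape Vosk emits with word output enabled.
sample = json.dumps({
    "result": [
        {"conf": 0.98, "start": 0.10, "end": 0.42, "word": "wir"},
        {"conf": 0.41, "start": 0.42, "end": 0.95, "word": "meeting"},
        {"conf": 0.93, "start": 0.95, "end": 1.30, "word": "morgen"},
    ],
    "text": "wir meeting morgen",
})

print(low_confidence_words(sample))  # → [('meeting', 0.41)]
```

Flagged words can be highlighted for review instead of being silently trusted. A high confidence score does not guarantee a correct word, so this narrows the search rather than replacing proofreading.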

On clean German, both are good enough

We ran both against a 90-minute studio recording of two native German speakers in conversation. Vosk Large: 96.2 % word-level accuracy. Whisper Large-v3: 97.8 %. The gap of 1.6 percentage points shows up almost entirely in three kinds of errors: rare proper nouns (Whisper does better), English loan words pronounced German-style (Whisper does better), and conversational filler / disfluencies (Vosk drops them more cleanly).
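For readers who want to reproduce a number like this on their own recordings: a common convention is to report word-level accuracy as 1 minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the model output, divided by the reference length. The post does not spell out the exact scoring used, so treat the following as a sketch under that assumption. The function names and the lowercase-and-split normalisation are illustrative choices.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Minimum substitutions + insertions + deletions (Levenshtein over words)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """1 - WER, after a simple lowercase + whitespace-split normalisation."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(ref, hyp) / len(ref)


# One substituted word out of six.
accuracy = word_accuracy(
    "wir treffen uns morgen im Büro",
    "wir treffen uns morgen in Büro",
)
print(f"{accuracy:.1%}")  # → 83.3%
```

Normalisation matters more than it looks: whether punctuation, casing, and fillers count as errors can move the score by a point or more, so both models should be scored with the same rules.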

On noisy phone audio, things flip

On a noisy 1-on-1 phone call we recorded for a different test, Vosk Large held 86 %; Whisper Large-v3 hallucinated. Whisper's seq2seq decoder is generative - when the audio is noisier than anything its training data covered, it'll happily produce confident, fluent, completely fabricated sentences. Vosk's Kaldi decoder is more honest: it gives up.
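If you do run Whisper on noisy audio, the segments returned by the reference openai-whisper package carry three signals that can help spot likely fabrications: "no_speech_prob", "avg_logprob", and "compression_ratio". The sketch below flags segments using those fields. It is a heuristic, not a fix, and not something we benchmarked for this post. The default thresholds are borrowed from the package's own transcribe() defaults and should be treated as starting points; the function name, the reason labels, and the sample segments are invented for illustration.

```python
def suspicious_segments(
    segments,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    compression_ratio_threshold=2.4,
):
    """Return (text, reasons) for segments that look like possible hallucinations.

    `segments` is a list of dicts shaped like the "segments" entries that
    openai-whisper's transcribe() returns.
    """
    flagged = []
    for seg in segments:
        reasons = []
        if seg["no_speech_prob"] > no_speech_threshold:
            reasons.append("likely silence or noise")
        if seg["avg_logprob"] < logprob_threshold:
            reasons.append("low decoder confidence")
        if seg["compression_ratio"] > compression_ratio_threshold:
            reasons.append("repetitive output")
        if reasons:
            flagged.append((seg["text"], reasons))
    return flagged


# Hand-written segments for illustration only, not real model output.
sample_segments = [
    {"text": "Ich rufe wegen der Rechnung an.",
     "no_speech_prob": 0.02, "avg_logprob": -0.21, "compression_ratio": 1.4},
    {"text": "Vielen Dank fürs Zuschauen.",
     "no_speech_prob": 0.83, "avg_logprob": -1.35, "compression_ratio": 1.1},
    {"text": "Ja, ja, ja, ja, ja, ja, ja, ja.",
     "no_speech_prob": 0.10, "avg_logprob": -0.40, "compression_ratio": 3.1},
]

for text, reasons in suspicious_segments(sample_segments):
    print(f"{text!r}: {', '.join(reasons)}")
```

A filter like this catches some fabricated segments but not all: fluent hallucinations can score well on every one of these signals, which is part of why the failure mode is hard to manage.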

The deployment story matters

Vosk runs in production on a single Hetzner CPU box without breaking a sweat. Whisper Large needs a GPU to be responsive, which more than triples the hosting cost and pulls the workload into a much smaller pool of providers - most of them US-based. For a service that promises 'audio doesn't leave the EU', that's not a neutral choice.

Why we picked Vosk

Three reasons. One, the accuracy gap on the audio shapes our users actually record (mostly clean speech, mostly under five minutes) is small enough that it's worth trading for everything below. Two, Vosk runs comfortably on EU CPU boxes; Whisper Large doesn't. Three, Vosk's error mode is admission instead of hallucination, and we think that's the safer default for a transcription tool.

Written by Finn Glas, Co-Founder + Engineering

Finn is one of the Co-Founders. He owns the engineering side, the infrastructure, and most of the late-night fixes that ship before anyone notices.

finn.glas at aicuflow dot com