Will Sprachmemo ever offer Whisper as an option?

Possibly, in a way that keeps the EU-only-infra promise (e.g. EU-hosted Whisper via a vetted provider). The bar is: same datacentre, same audit story, same delete promise. We're not in a hurry.

Can I bring my own Whisper instance?

Not through the UI today. The architecture supports it (the engine layer on the backend is pluggable), but it's not a v1 feature. Self-hosters who want to swap in Whisper can edit be_modules/voice/engine.py directly.
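To make 'pluggable' concrete, here is a minimal sketch of what an engine layer of this kind could look like. It is not the contents of be_modules/voice/engine.py: the names TranscriptionEngine, register_engine, get_engine and FixedTextEngine are all hypothetical, chosen only to illustrate the pattern.

```python
from typing import Callable, Dict, Protocol


class TranscriptionEngine(Protocol):
    """Hypothetical interface: anything that turns audio bytes into text."""

    def transcribe(self, audio: bytes, language: str = "de") -> str: ...


# Hypothetical registry mapping an engine name to a factory function.
_ENGINES: Dict[str, Callable[[], TranscriptionEngine]] = {}


def register_engine(name: str, factory: Callable[[], TranscriptionEngine]) -> None:
    """Make an engine available under a short name."""
    _ENGINES[name] = factory


def get_engine(name: str) -> TranscriptionEngine:
    """Build the engine registered under `name`, or fail with a clear error."""
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown engine {name!r}; known: {sorted(_ENGINES)}") from None


class FixedTextEngine:
    """Placeholder for illustration; a real engine would wrap Vosk or Whisper."""

    def __init__(self, text: str) -> None:
        self._text = text

    def transcribe(self, audio: bytes, language: str = "de") -> str:
        # Ignores the audio entirely and returns the canned text.
        return self._text


register_engine("placeholder", lambda: FixedTextEngine("hallo welt"))
```

With a layout like this, swapping engines comes down to registering another class with the same transcribe method and changing the name passed to get_engine.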

Vosk vs Whisper on German: an honest field comparison

Two open speech-to-text projects, both with German support. Where each one wins, where each one loses, what to pick.

Finn Glas, Co-Founder + Engineering · May 16, 2026 · 2 min read

On clean studio German, both models reach the 'good enough that you read the transcript end to end without getting jolted' threshold. The differences live at the edges.

Architectural difference

Vosk is a Kaldi-based system: traditional acoustic models plus a language model, optimised to run cheaply on CPU. Whisper is a transformer-based seq2seq model: heavier, larger, designed to run on GPU but tractable on CPU for shorter clips. The 'shape' of each model's errors differs too: Vosk tends to hear plausible-but-wrong words, Whisper tends to hallucinate fluent prose.

On clean German, both are good enough

We ran both against a 90-minute studio recording of two native German speakers in conversation. Vosk Large: 96.2 % word-level accuracy. Whisper Large-v3: 97.8 %. The gap of 1.6 percentage points shows up almost entirely in three kinds of error: rare proper nouns (Whisper does better), English loan words pronounced German-style (Whisper does better), and conversational filler / disfluencies (Vosk drops them more cleanly).
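For readers who want to reproduce this kind of number on their own recordings, here is a minimal sketch of one common way to score a transcript. It assumes 'word-level accuracy' means 1 minus the word error rate (WER), where WER is the word-level edit distance divided by the number of reference words; that definition is our assumption here, and the lowercase-and-split-on-whitespace normalisation is deliberately naive.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: accuracy = 1 - WER. Can go negative on long insertions."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

As a small worked case, scoring the hypothesis 'das ist ein test' against the reference 'das ist ein kleiner test' counts one dropped word out of five, so the WER is 0.2. Real evaluations usually also normalise punctuation and number formats before comparing, which this sketch leaves out.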

On noisy phone audio, things flip

On a noisy one-on-one phone call we recorded for a different test, Vosk Large held 86 %; Whisper Large-v3 hallucinated. Whisper's seq2seq decoder is generative: when the signal-to-noise ratio drops below what the model saw in training, it'll happily produce confident, fluent, completely fabricated sentences. Vosk's Kaldi decoder is more honest: it gives up.
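A 'gives up' failure is easy to detect in code, which is part of its appeal. The sketch below assumes the JSON that Vosk's recognizer returns from Result(): a "text" field, plus a per-word "result" list with "conf" values when word output is enabled via SetWords(True). The function name summarise_vosk_result and the 0.6 threshold are hypothetical choices for illustration, not tuned values.

```python
import json
from typing import Optional


def summarise_vosk_result(result_json: str, min_mean_conf: float = 0.6) -> dict:
    """Classify one Vosk result as 'empty', 'low_confidence' or 'ok'."""
    result = json.loads(result_json)
    text = result.get("text", "").strip()

    # Per-word entries are only present when word output was enabled.
    words = result.get("result", [])
    confs = [word["conf"] for word in words if "conf" in word]
    mean_conf: Optional[float] = sum(confs) / len(confs) if confs else None

    if not text:
        status = "empty"  # the decoder produced nothing for this stretch
    elif mean_conf is not None and mean_conf < min_mean_conf:
        status = "low_confidence"  # threshold is an illustrative assumption
    else:
        status = "ok"
    return {"text": text, "mean_conf": mean_conf, "status": status}
```

A caller can show 'empty' and 'low_confidence' stretches as gaps to re-listen to rather than as text. A fabricated but fluent sentence carries no such marker in the text itself, which is why it is harder to catch.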

The deployment story matters

Vosk runs in production on a single Hetzner CPU box without breaking a sweat. Whisper Large needs a GPU to be responsive, which more than triples the hosting cost and pulls the workload into a much smaller pool of providers, most of them US-based. For a service that promises 'audio doesn't leave the EU', that's not a neutral choice. The deeper argument for keeping audio out of US infrastructure is laid out in 'Why your voice shouldn't transit a US cloud'.

Why we picked Vosk

Three reasons. One, the accuracy gap on the audio shapes our users actually record (mostly clean speech, mostly under five minutes) is small enough that it's worth trading for everything below. Two, Vosk runs comfortably on EU CPU boxes; Whisper Large doesn't. Three, Vosk's error mode is admission instead of hallucination, and we think that's the safer default for a transcription tool.

Written by Finn Glas, Co-Founder + Engineering

Finn is one of the Co-Founders. He owns the engineering side, the infrastructure, and most of the late-night fixes that ship before anyone notices.

finn.glas at aicuflow dot com