If you've used a dictation tool in the last decade, the audio almost certainly left your device the moment you started speaking. That was the easy way to ship — push the recording to a fast model in a fast data center, get text back, look like magic.
It was also the way that quietly built a few of the largest voice datasets on Earth. Audio you didn't intend to share. Audio your customers didn't know was being transcribed. Audio that, somewhere along the way, made its way into a training set.
We didn't want to build voibly on that foundation, so we didn't.
The cloud tax on your voice
Cloud-only dictation isn't just a privacy concern. It's a tax on every dimension that matters — latency, reliability, autonomy, cost. The bill comes due in small ways:
- Latency — every keystroke is a network round trip. The "speak fast, type fast" promise dies on flaky hotel Wi-Fi.
- Reliability — when the provider's region goes down, your dictation goes with it.
- Autonomy — pricing, retention, and acceptable-use policies live on someone else's runway.
- Cost — every minute of audio is a minute of GPU time, charged in perpetuity to your subscription.
"On-device first isn't an aesthetic. It's the only way to make the speed–privacy–price triangle actually work for the user."
The good news: on-device transcription is now genuinely possible. The same model architectures that run on rented H100s also run on the silicon already in your laptop. Apple's Neural Engine, modern AMD, and most Intel chips from the last three years can do real-time speech-to-text without breaking a sweat.
What "on-device first" means in practice
For voibly, on-device first is the default, not a marketing line. Concretely:
- When you press the hotkey, audio is captured into a memory buffer that never touches disk.
- The buffer is fed into our locally bundled Whisper-derived model, which produces a draft transcript on your device.
- A small grammar/punctuation pass runs locally to clean up the draft.
- The cleaned text is written into your active app at the cursor.
- The audio buffer is zeroed and the memory page is released.
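The steps above can be sketched as a single loop. Everything here is illustrative, not voibly's actual code: the capture, model, and text-injection calls are stubbed out, and the function names are ours.

```python
def capture_audio() -> bytearray:
    """Stand-in for microphone capture: raw PCM frames held in a
    mutable in-memory buffer that is never written to disk."""
    return bytearray(b"\x01\x02" * 8000)  # fake audio data

def transcribe_locally(buf: bytearray) -> str:
    """Stand-in for the bundled on-device speech model."""
    return "hello world"

def punctuate(draft: str) -> str:
    """Stand-in for the local grammar/punctuation pass."""
    return draft.capitalize() + "."

def insert_at_cursor(text: str) -> None:
    """Stand-in for writing text into the frontmost app."""
    print(text)

def zero_buffer(buf: bytearray) -> None:
    """Overwrite the audio with zeros before the memory is released,
    so the recording doesn't linger in freed pages."""
    buf[:] = bytes(len(buf))

def dictation_loop() -> str:
    buf = capture_audio()
    try:
        text = punctuate(transcribe_locally(buf))
        insert_at_cursor(text)
        return text
    finally:
        # Runs even if transcription fails: the buffer is always zeroed.
        zero_buffer(buf)
```

The `try/finally` shape matters: zeroing happens on every exit path, not only on success.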
That's it. No HTTP request, no third-party SDK, no telemetry payload with the transcript inside. You can put your laptop in airplane mode and the whole loop still works.
Verify it yourself. On macOS, run nettop while you dictate. You'll see voibly do exactly two things: occasionally check for app updates and occasionally fetch a license heartbeat. Nothing else.
The trade-offs we made
Shipping on-device first isn't free. Here are the trade-offs we made — and the ones we refuse to make.
Bigger app bundle
Our installer is around 180 MB. Most of that is the model itself. We could ship a tiny stub that downloads the model on first launch, but we'd rather pay the cost up front so you can use the app cold from a thumb drive.
Slower than the cloud — sometimes
For very long single recordings (5+ minutes), a tuned cloud model still has the edge. We're closing that gap each release, but we're honest about where we are.
Updates require a download
When we improve the model, you have to install it. We've made the auto-updater quiet and resumable, but it's still a real step we expect from you.
When the cloud still helps
Sometimes you genuinely want a beefier model — a careful summary of an hour-long recording, a translation pass, an exact transcript with speaker diarization. For those cases, voibly has an opt-in cloud polish mode you can flip on per workspace. When it's on, we send only the audio for that single request, process it on hardware we control, and discard it immediately.
Three things stay true:
- Cloud polish is opt-in, off by default, and clearly visible in the menubar.
- You can scope it to a single workspace — say, the one you use for podcasts — without affecting the rest.
- We never train on your audio, even when it does pass through our servers.
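The per-workspace scoping rule reduces to a small routing predicate. A minimal sketch, assuming a hypothetical config shape (the `Workspace` type and field names are ours, not voibly's real schema):

```python
from dataclasses import dataclass

@dataclass
class Workspace:
    name: str
    cloud_polish: bool = False  # off by default: opting in is explicit

def route_request(workspace: Workspace) -> str:
    """Decide where a single transcription request is processed.
    Only the workspace that opted in ever routes to the cloud."""
    return "cloud" if workspace.cloud_polish else "local"
```

Because the flag lives on the workspace rather than on the app, enabling cloud polish for a podcasts workspace leaves every other workspace routing locally.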
What this means for you
If you're a lawyer, doctor, journalist, or anyone whose first thought when pressing a microphone button is "who else is going to hear this?" — you can stop having that thought. The architecture answers it for you.
If you're a developer and you'd like to verify our claims, we publish the network traffic audit logs and the model SHA on every release. We'd love for you to break our assumptions.
And if you're someone who just wants their words on the page faster — welcome. The work has been waiting.