Product · 8 min read · 12 May 2026

Why we ship voibly on-device first — and why you should care

The dictation tools you grew up with leak audio to the cloud by default. Here's why we built voibly the other way around — and what it means for the way you work, the apps you trust, and the data you generate every day.

If you've used a dictation tool in the last decade, the audio almost certainly left your device the moment you started speaking. That was the easy way to ship — push the recording to a fast model in a fast data center, get text back, look like magic.

It was also the way that quietly built a few of the largest voice datasets on Earth. Audio you didn't intend to share. Audio your customers didn't know was being transcribed. Audio that, somewhere along the way, made its way into a training set.

We didn't want to build voibly on that foundation, so we didn't.

The cloud tax on your voice

Cloud-only dictation isn't just a privacy concern. It's a tax on every dimension that matters — latency, reliability, autonomy, cost. The bill comes due in small ways:

  • Latency — every keystroke is a network round trip. The "speak fast, type fast" promise dies on a flaky hotel WiFi.
  • Reliability — when the provider's region goes down, your dictation goes with it.
  • Autonomy — pricing, retention, and acceptable-use policies live on someone else's runway.
  • Cost — every minute of audio is a minute of GPU time, charged in perpetuity to your subscription.

"On-device first isn't an aesthetic. It's the only way to make the speed–privacy–price triangle actually work for the user."

The good news: on-device transcription is now genuinely possible. The same model architectures that run on rented H100s also run on the silicon already in your laptop. Apple silicon, modern AMD, and most Intel chips from the last three years can do real-time speech-to-text without breaking a sweat.

What "on-device first" means in practice

For voibly, on-device first is a default — not a marketing line. Concretely:

  1. When you press the hotkey, audio is captured into a memory buffer that never touches disk.
  2. The buffer is fed into our locally bundled Whisper-derived model, which produces a draft transcript on your device.
  3. A small grammar/punctuation pass runs locally to clean up the draft.
  4. The cleaned text is written into your active app at the cursor.
  5. The audio buffer is zeroed and the memory page is released.

That's it. No HTTP request, no third-party SDK, no telemetry payload with the transcript inside. You can put your laptop in airplane mode and the whole loop still works.
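The loop above can be sketched in a few lines of Python. Everything here is illustrative: transcribe_locally and punctuate are stand-ins for the bundled model and the grammar pass, which voibly does not expose as a Python API, and the "audio" is just a byte buffer.

```python
def transcribe_locally(audio: bytearray) -> str:
    # Stand-in for the bundled Whisper-derived model (hypothetical).
    return "raw draft text"

def punctuate(draft: str) -> str:
    # Stand-in for the local grammar/punctuation pass (hypothetical).
    return draft.capitalize() + "."

def dictate(audio: bytearray) -> str:
    """Run the on-device loop: transcribe, clean, then zero the buffer."""
    draft = transcribe_locally(audio)  # step 2: local model, local draft
    text = punctuate(draft)            # step 3: local cleanup
    # Step 5: zero the audio buffer in place before it is released,
    # so the recording never outlives the request.
    for i in range(len(audio)):
        audio[i] = 0
    return text  # step 4: the caller writes this at the cursor

buf = bytearray(b"\x01\x02\x03")  # step 1: captured in memory, never on disk
result = dictate(buf)
```

The point of the sketch is the shape of the loop: nothing in it opens a socket, and the buffer is dead by the time the text lands.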

Verify it yourself. On macOS, run nettop while you dictate. You'll see voibly do exactly two things: occasionally check for app updates and occasionally fetch a license heartbeat. Nothing else.

The trade-offs we made

Shipping on-device first isn't free. Here are the trade-offs we made — and the ones we refuse to make.

Bigger app bundle

Our installer is around 180 MB. Most of that is the model itself. We could ship a tiny stub that downloads the model on first launch, but we'd rather pay the cost up front so you can use the app cold from a thumb drive.

Slower than the cloud — sometimes

For very long single recordings (5+ minutes), a tuned cloud model still has the edge. We're closing that gap each release, but we're honest about where we are.

Updates require a download

When we improve the model, you have to install it. We've made the auto-updater quiet and resumable, but it's still a real step we expect from you.

When the cloud still helps

Sometimes you genuinely want a beefier model — a careful summary of an hour-long recording, a translation pass, an exact transcript with speaker diarization. For those cases, voibly has an opt-in cloud polish mode you can flip on per workspace. When it's on, we send only the audio for that single request, process it on hardware we control, and discard it immediately.

Three things stay true:

  • Cloud polish is opt-in, off by default, and clearly visible in the menubar.
  • You can scope it to a single workspace — say, the one you use for podcasts — without affecting the rest.
  • We never train on your audio, even when it does pass through our servers.
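The scoping rule amounts to a small routing check per request. The names below are illustrative, not voibly's actual settings schema:

```python
from dataclasses import dataclass

@dataclass
class Workspace:
    name: str
    cloud_polish: bool = False  # opt-in, so the default is off

def route(workspace: Workspace) -> str:
    """Decide where a single request is processed."""
    return "cloud" if workspace.cloud_polish else "on-device"

podcasts = Workspace("podcasts", cloud_polish=True)  # explicitly opted in
legal = Workspace("legal")                           # untouched default
```

Because the flag lives on the workspace, flipping it for podcasts never changes where the legal workspace's audio goes.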

What this means for you

If you're a lawyer, doctor, journalist, or anyone whose first thought when pressing a microphone button is "who else is going to hear this?" — you can stop having that thought. The architecture answers it for you.

If you're a developer and you'd like to verify our claims, we publish the network traffic audit logs and the model SHA on every release. We'd love for you to break our assumptions.
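Checking a published model digest takes only a few lines in any language; here is a Python sketch using the standard library. The file contents below are a tiny stand-in, not an actual voibly model or a digest we publish:

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large model file need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Write a tiny stand-in "model" file and verify its digest.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"model weights go here")
    path = f.name

digest = sha256_of(path)
expected = hashlib.sha256(b"model weights go here").hexdigest()
os.remove(path)
```

In practice you would compare the digest of the bundled model file against the value in the release notes; a mismatch means you are not running the model we shipped.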

And if you're someone who just wants their words on the page faster — welcome. The work has been waiting.


Jordan Mehta

Co-founder & CEO, voibly

Jordan has been building voice and speech tools since 2016. Before voibly, he led the dictation team at a large notes app, where he learned exactly which lines he didn't want to cross again.
