Building a Local Voice-to-Text App for macOS Without Accessibility Permissions
My organization disables the built-in macOS dictation feature, and I don’t have admin rights to enable it or install tools via Homebrew. I needed a way to speak instead of type — so I built one.
The Problem
- macOS voice-to-text is disabled by org policy
- No admin rights to change system settings
- No Homebrew (blocked)
- Accessibility permissions locked down (requires admin)
- Input Monitoring locked down (requires admin)
- HuggingFace blocked by network policy
The challenge: build a voice-to-text tool that works in any app without needing any of the permissions my org restricts.
The Key Insight
You don’t need accessibility permissions if you use the clipboard as the bridge:
- Record audio via microphone (standard per-app permission)
- Transcribe locally using an AI model
- Copy result to clipboard
- User pastes with Cmd+V
No keystroke injection, no accessibility hooks, no screen recording. Just microphone → clipboard.
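The last step of that flow can be sketched in a few lines of Swift. This is a minimal illustration, not the app's actual code: the `TextSink` protocol, `InMemorySink`, and `deliver` are hypothetical names introduced here so the logic can be exercised without touching the real clipboard. Only `NSPasteboard.general`, `clearContents()`, and `setString(_:forType:)` are real AppKit calls.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Anything that can receive the finished transcript (hypothetical abstraction).
protocol TextSink {
    func write(_ text: String) -> Bool
}

#if canImport(AppKit)
/// Writes to the system clipboard. No Accessibility permission is involved.
struct SystemClipboard: TextSink {
    func write(_ text: String) -> Bool {
        let pasteboard = NSPasteboard.general
        pasteboard.clearContents()  // clear old contents before writing
        return pasteboard.setString(text, forType: .string)
    }
}
#endif

/// Test double that remembers the last text written.
final class InMemorySink: TextSink {
    private(set) var last: String?
    func write(_ text: String) -> Bool {
        last = text
        return true
    }
}

/// Assumed policy: an empty transcript should not overwrite what the user
/// already has on the clipboard.
func deliver(_ transcript: String, to sink: TextSink) -> Bool {
    guard !transcript.isEmpty else { return false }
    return sink.write(transcript)
}
```

On macOS, `deliver("Hello from Whisper.", to: SystemClipboard())` would leave the text ready for Cmd+V in whatever app has focus.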
Architecture
┌───────────────────────────────────────────────────────┐
│                    VoiceToClip.app                    │
│                                                       │
│  ┌──────────┐    ┌─────────────┐    ┌──────────────┐  │
│  │ AVAudio  │───▶│ whisper.cpp │───▶│ NSPasteboard │  │
│  │ Engine   │    │ (C library) │    │ (clipboard)  │  │
│  │ 16kHz    │    │             │    │              │  │
│  │ mono WAV │    │ ggml model  │    │ Cmd+V paste  │  │
│  └──────────┘    └─────────────┘    └──────────────┘  │
│                                                       │
│  ┌─────────────────────────────────────────────────┐  │
│  │ UI: Menu Bar Icon + Floating Button + URL Scheme│  │
│  └─────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────┘
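One detail hides in the first arrow of the diagram: whisper.cpp's `whisper_full` takes mono 32-bit float samples at 16 kHz, so recorded PCM has to be converted before inference. The sketch below assumes the recording is signed 16-bit PCM, which the article does not state; the function names are illustrative.

```swift
/// Sample rate whisper.cpp expects, in Hz.
let whisperSampleRate = 16_000

/// Converts signed 16-bit PCM samples to Float32 in the range [-1.0, 1.0).
/// Assumes the input is already mono and already at 16 kHz.
func floatSamples(fromPCM16 pcm: [Int16]) -> [Float] {
    pcm.map { Float($0) / 32_768.0 }
}

/// Length of a recording in seconds, given its sample count.
func durationSeconds(sampleCount: Int) -> Double {
    Double(sampleCount) / Double(whisperSampleRate)
}
```

If the microphone delivers a different rate (44.1 kHz or 48 kHz is common), the audio would need resampling before this step, for example with `AVAudioConverter`.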
Components
1. OpenAI Whisper (the AI model)
OpenAI trained a speech recognition model on 680,000 hours of audio and released the weights publicly. The model comes in different sizes:
| Model | Size | Speed (30s audio) | Accuracy |
|---|---|---|---|
| tiny.en | 75MB | ~1s | Basic |
| base.en | 141MB | ~2s | Good |
| medium.en | 1.4GB | ~5s | Excellent |
| large-v3 | 3GB | ~15s | Best |
I use medium.en, which I found to be the sweet spot between accuracy and speed on Apple Silicon.
2. whisper.cpp (the inference engine)
The original Whisper runs in Python with PyTorch (~2GB runtime). whisper.cpp is a C/C++ reimplementation that:
- Needs no Python or PyTorch
- Uses Apple Metal GPU acceleration
- Compiles to a small static library
- Runs 5-10x faster than the Python version
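Calling the library from Swift might look roughly like the sketch below. The C function names follow `whisper.h` in recent whisper.cpp releases (older releases use different initializers). The module name `CWhisper`, the `TranscriptionError` type, and the `joinSegments` helper are assumptions made for this example. The whisper-dependent part is wrapped in `#if canImport` so the file still compiles where the module is missing.

```swift
import Foundation
#if canImport(CWhisper)
import CWhisper  // hypothetical module name exposing whisper.h
#endif

enum TranscriptionError: Error { case modelLoadFailed, inferenceFailed }

/// Joins per-segment text into one transcript and trims stray whitespace.
func joinSegments(_ segments: [String]) -> String {
    segments.joined().trimmingCharacters(in: .whitespacesAndNewlines)
}

#if canImport(CWhisper)
/// Runs whisper.cpp on 16 kHz mono float samples and returns the transcript.
func transcribe(samples: [Float], modelPath: String) throws -> String {
    guard let ctx = whisper_init_from_file_with_params(
        modelPath, whisper_context_default_params()) else {
        throw TranscriptionError.modelLoadFailed
    }
    defer { whisper_free(ctx) }  // release the model when done

    var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
    params.print_progress = false

    // whisper_full returns 0 on success.
    let status = samples.withUnsafeBufferPointer { buffer in
        whisper_full(ctx, params, buffer.baseAddress, Int32(buffer.count))
    }
    guard status == 0 else { throw TranscriptionError.inferenceFailed }

    var segments: [String] = []
    for index in 0..<whisper_full_n_segments(ctx) {
        if let cText = whisper_full_get_segment_text(ctx, index) {
            segments.append(String(cString: cText))
        }
    }
    return joinSegments(segments)
}
#endif
```

Loading the model on every call keeps the sketch short. Since the model takes noticeable time and memory to load, a real app would more likely keep one context alive for its lifetime.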
3. GGML format (model packaging)
The PyTorch .pt model file can’t be loaded by whisper.cpp directly. GGML is a lightweight binary format optimized for fast loading. A conversion script repacks the same weights into this format.
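A quick way to confirm that a conversion produced something whisper.cpp will accept is to check the file's first four bytes. The sketch assumes the header begins with the 32-bit magic number `0x67676d6c` ("ggml" in hex) written in little-endian order, as whisper.cpp's conversion script does. The function name is made up for this example.

```swift
import Foundation

/// Magic number at the start of a whisper.cpp GGML model file.
let ggmlMagic: UInt32 = 0x6767_6d6c

/// Returns true if `header` starts with the GGML magic (little-endian).
func looksLikeGGML(_ header: Data) -> Bool {
    guard header.count >= 4 else { return false }
    var magic: UInt32 = 0
    for (i, byte) in header.prefix(4).enumerated() {
        magic |= UInt32(byte) << (8 * UInt32(i))  // assemble little-endian
    }
    return magic == ggmlMagic
}
```

In practice the four bytes could come from `FileHandle(forReadingFrom:)` followed by `read(upToCount: 4)` on the converted model file. A passing check only says the header looks right, not that the weights are intact.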
4. Swift menu bar app (the UI)
A SwiftUI app with:
- Menu bar icon: shows recording state
- Floating button: always-on-top circle you click to toggle (blue → red → orange → blue)
- URL scheme: voicetoclip://start and voicetoclip://stop for external triggering
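Handling the URL scheme comes down to mapping an incoming URL to a command. The sketch below shows one way to do that. `VoiceCommand`, `voiceCommand(from:)`, and the `AppDelegate` wiring are illustrative names, while `application(_:open:)` is the real `NSApplicationDelegate` callback. The scheme also has to be declared under `CFBundleURLTypes` in the app's Info.plist before macOS will route these URLs to the app.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

enum VoiceCommand: Equatable { case start, stop }

/// Maps voicetoclip://start and voicetoclip://stop to commands.
/// Any other scheme or host yields nil.
func voiceCommand(from url: URL) -> VoiceCommand? {
    guard url.scheme?.lowercased() == "voicetoclip",
          let host = url.host?.lowercased() else { return nil }
    switch host {
    case "start": return .start
    case "stop": return .stop
    default: return nil
    }
}

#if canImport(AppKit)
final class AppDelegate: NSObject, NSApplicationDelegate {
    /// Set by the app to react to commands (hypothetical hook).
    var onCommand: (VoiceCommand) -> Void = { _ in }

    // Called by macOS when the app is asked to open one of its URLs.
    func application(_ application: NSApplication, open urls: [URL]) {
        urls.compactMap(voiceCommand(from:)).forEach(onCommand)
    }
}
#endif
```

Rejecting unknown hosts keeps a stray or malformed URL from toggling the recorder.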
5. Spotlight integration (the trigger)
Since keyboard shortcuts require Input Monitoring (locked down), I created tiny .app wrappers that call the URL scheme:
- Cmd+Space → type “startv” → Enter → recording starts
- Cmd+Space → type “stopv” → Enter → transcribes → clipboard ready
Permissions Required
| Permission | Needed? | Why |
|---|---|---|
| Microphone | ✅ | Standard per-app grant, no admin needed |
| Accessibility | ❌ | Clipboard approach avoids this entirely |
| Input Monitoring | ❌ | Spotlight trigger instead of hotkeys |
| Network | ❌ | Everything runs locally |
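The one permission the app does need can be requested at runtime. Below is a hedged sketch of that check: `MicrophoneDecision`, `decision(authorized:determined:)`, and `checkMicrophone(then:)` are names invented here, while `AVCaptureDevice.authorizationStatus(for:)` and `requestAccess(for:completionHandler:)` are the real AVFoundation calls. The app's Info.plist must also contain `NSMicrophoneUsageDescription`, or macOS terminates the app when it asks.

```swift
import Foundation
#if canImport(AVFoundation)
import AVFoundation
#endif

enum MicrophoneDecision: Equatable {
    case proceed           // already granted: start recording
    case askUser           // never asked: show the system prompt
    case showInstructions  // denied or restricted: point to System Settings
}

/// Pure mapping, kept separate so it can be tested without the system APIs.
func decision(authorized: Bool, determined: Bool) -> MicrophoneDecision {
    if !determined { return .askUser }
    return authorized ? .proceed : .showInstructions
}

#if canImport(AVFoundation)
/// Calls `handler(true)` once microphone access is available.
func checkMicrophone(then handler: @escaping (Bool) -> Void) {
    let status = AVCaptureDevice.authorizationStatus(for: .audio)
    switch decision(authorized: status == .authorized,
                    determined: status != .notDetermined) {
    case .proceed:
        handler(true)
    case .askUser:
        AVCaptureDevice.requestAccess(for: .audio) { granted in handler(granted) }
    case .showInstructions:
        handler(false)
    }
}
#endif
```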
The Build Process
Since Homebrew and HuggingFace were both blocked, the build required some creativity:
- CMake: downloaded manually from cmake.org (portable, no install needed)
- whisper.cpp: cloned from GitHub, built with CMake into a static library
- Model: downloaded medium.en.pt from OpenAI’s Azure CDN (openaipublic.azureedge.net), then converted to GGML format using a Python script in a temporary venv
- Swift app: built with Swift Package Manager, linked against the whisper.cpp static library
- App bundle: wrapped in a .app with LSUIElement=true for menu-bar-only behavior
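For the Swift Package Manager step, a manifest along the following lines is one plausible way to link a prebuilt static library. Treat it as a sketch under assumptions: the target names, the `Sources/CWhisper` module-map directory, and the library search path are all hypothetical. Depending on the whisper.cpp version, the separate ggml libraries produced by the CMake build may need to be linked as well.

```swift
// swift-tools-version:5.9
// Package.swift sketch: every path and target name below is hypothetical.
import PackageDescription

let package = Package(
    name: "VoiceToClip",
    platforms: [.macOS(.v13)],  // MenuBarExtra requires macOS 13 or later
    targets: [
        // Thin module exposing whisper.h; the directory needs a module.modulemap.
        .systemLibrary(name: "CWhisper", path: "Sources/CWhisper"),
        .executableTarget(
            name: "VoiceToClip",
            dependencies: ["CWhisper"],
            linkerSettings: [
                // Assumed location of the static library from the CMake build.
                .unsafeFlags(["-Lvendor/whisper.cpp/build/lib"]),
                .linkedLibrary("whisper"),
                .linkedLibrary("c++"),
                .linkedFramework("Accelerate"),
                .linkedFramework("Metal"),
            ]
        ),
    ]
)
```

`unsafeFlags` is acceptable for a root package built locally, which is the situation described here. SwiftPM rejects it in packages consumed as versioned dependencies.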
Visual Feedback
The floating button changes color to indicate state:
- 🔵 Blue — idle, ready to record
- 🔴 Red — recording in progress
- 🟠 Orange — transcribing (wait for blue before pasting)
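The three colors correspond to a small state machine, which could be modeled as below. The type and method names are illustrative, and ignoring clicks during transcription is an assumption about how the button behaves.

```swift
enum RecorderState: Equatable {
    case idle, recording, transcribing

    /// Color of the floating button in this state.
    var colorName: String {
        switch self {
        case .idle: return "blue"
        case .recording: return "red"
        case .transcribing: return "orange"
        }
    }

    /// State after the user clicks the button.
    /// Assumed: clicks are ignored while a transcription is running.
    func afterClick() -> RecorderState {
        switch self {
        case .idle: return .recording
        case .recording: return .transcribing
        case .transcribing: return .transcribing
        }
    }

    /// State after the transcript has been written to the clipboard.
    func afterTranscriptionFinished() -> RecorderState {
        self == .transcribing ? .idle : self
    }
}
```

Keeping the transitions in one place makes it harder for the menu bar icon and the floating button to disagree about the current state.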
Privacy
Everything runs on-device:
- Audio never leaves the machine
- No cloud APIs called
- No network requests whatsoever
- The WAV file is stored in /tmp and overwritten on each recording
Limitations
- Transcription takes a few seconds after stopping (not real-time streaming)
- Whisper processes audio in 30-second windows, so transcription time grows roughly in proportion to recording length
- Medium model uses ~1.5GB RAM while loaded
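A back-of-the-envelope estimate of the wait follows from the 30-second window size. The helper names below are made up, and the per-window cost is whatever you measure on your own machine; the ~5s figure for medium.en from the table above is used only as an example input.

```swift
/// Number of 30-second windows needed to cover a recording.
func windowCount(forSeconds duration: Double, windowSeconds: Double = 30) -> Int {
    guard duration > 0 else { return 0 }
    return Int((duration / windowSeconds).rounded(.up))
}

/// Rough transcription time: windows multiplied by measured cost per window.
func estimatedTranscriptionSeconds(audioSeconds: Double,
                                   secondsPerWindow: Double) -> Double {
    Double(windowCount(forSeconds: audioSeconds)) * secondsPerWindow
}
```

For a 95-second recording at about 5 seconds per window, this gives 4 windows and roughly 20 seconds of waiting. It is a rough upper-bound style estimate and ignores model load time.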
Tech Stack
- Language: Swift + SwiftUI
- Audio: AVAudioEngine (16kHz mono PCM)
- Transcription: whisper.cpp (C library, statically linked)
- Model: OpenAI Whisper medium.en (GGML format)
- UI: MenuBarExtra + NSPanel (floating)
- Trigger: Custom URL scheme + Spotlight-indexed wrapper apps
Takeaway
When your organization locks down the easy path, there’s usually a harder path that still works within the rules. The clipboard is the universal bridge — any app can write to it, any app can read from it, and no special permissions are needed for either operation.