Building a Local Voice-to-Text App for macOS Without Accessibility Permissions

May 23, 2026

My organization disables the built-in macOS dictation feature, and I don’t have admin rights to enable it or install tools via Homebrew. I needed a way to speak instead of type — so I built one.

The Problem

macOS voice-to-text is disabled by org policy
No admin rights to change system settings
No Homebrew (blocked)
Accessibility permissions locked down (requires admin)
Input Monitoring locked down (requires admin)
HuggingFace blocked by network policy

The challenge: build a voice-to-text tool that works in any app without needing any of the permissions my org restricts.

The Key Insight

You don’t need accessibility permissions if you use the clipboard as the bridge:

Record audio via microphone (standard per-app permission)
Transcribe locally using an AI model
Copy result to clipboard
User pastes with Cmd+V

No keystroke injection, no accessibility hooks, no screen recording. Just microphone → clipboard.
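
The last hop of that pipeline can be sketched in a few lines of Swift. This is a minimal illustration, not the app's actual code: the `TranscriptSink` protocol, `InMemorySink`, `tidyTranscript`, and `deliver` are hypothetical names introduced here, and the only system API used is `NSPasteboard.general`.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Hypothetical abstraction: anything that can receive the finished transcript.
protocol TranscriptSink {
    func accept(_ text: String) -> Bool
}

#if canImport(AppKit)
/// Writes to the system clipboard; the user then pastes with Cmd+V.
struct SystemClipboard: TranscriptSink {
    func accept(_ text: String) -> Bool {
        let pasteboard = NSPasteboard.general
        pasteboard.clearContents()  // drop whatever was on the clipboard before
        return pasteboard.setString(text, forType: .string)
    }
}
#endif

/// Stand-in sink for tests and non-macOS builds (an assumption, not part of the app).
final class InMemorySink: TranscriptSink {
    private(set) var lastText: String?
    func accept(_ text: String) -> Bool {
        lastText = text
        return true
    }
}

/// Collapses runs of whitespace and trims the ends before copying.
func tidyTranscript(_ raw: String) -> String {
    raw.split(whereSeparator: { $0.isWhitespace }).joined(separator: " ")
}

/// Tidies the transcript and hands it to the sink; empty transcripts are skipped.
func deliver(_ rawTranscript: String, to sink: TranscriptSink) -> Bool {
    let text = tidyTranscript(rawTranscript)
    guard !text.isEmpty else { return false }
    return sink.accept(text)
}
```

On macOS, `deliver(transcript, to: SystemClipboard())` would leave the text ready for Cmd+V. Skipping empty transcripts avoids wiping the clipboard when nothing was recognized.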

Architecture

┌─────────────────────────────────────────────────────┐
│                   VoiceToClip.app                     │
│                                                       │
│  ┌──────────┐    ┌────────────┐    ┌──────────────┐ │
│  │ AVAudio  │───▶│ whisper.cpp│───▶│ NSPasteboard │ │
│  │ Engine   │    │ (C library)│    │ (clipboard)  │ │
│  │ 16kHz    │    │            │    │              │ │
│  │ mono WAV │    │ ggml model │    │ Cmd+V paste  │ │
│  └──────────┘    └────────────┘    └──────────────┘ │
│                                                       │
│  ┌─────────────────────────────────────────────────┐ │
│  │ UI: Menu Bar Icon + Floating Button + URL Scheme│ │
│  └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
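
The first arrow in the diagram hides a format conversion: whisper.cpp's `whisper_full` takes 32-bit float samples at 16 kHz mono. Assuming the recording is held as 16-bit PCM (an assumption; the app may convert elsewhere, for example with `AVAudioConverter`), the hand-off could look roughly like this. The function names are illustrative.

```swift
/// Sample rate whisper.cpp expects (WHISPER_SAMPLE_RATE in whisper.h).
let whisperSampleRate = 16_000

/// Maps signed 16-bit PCM samples into the [-1.0, 1.0) float range.
func floatSamples(fromPCM16 samples: [Int16]) -> [Float] {
    samples.map { Float($0) / 32_768.0 }
}

/// Averages two channels into one, in case the input device delivers stereo.
func downmixToMono(left: [Float], right: [Float]) -> [Float] {
    zip(left, right).map { ($0 + $1) / 2 }
}
```

Dividing by 32,768 instead of 32,767 keeps the most negative sample at exactly -1.0.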

Components

1. OpenAI Whisper (the AI model)

OpenAI trained a speech recognition model on 680,000 hours of audio and released the weights publicly. The model comes in different sizes:

Model	Size	Speed (30s audio)	Accuracy
tiny.en	75MB	~1s	Basic
base.en	141MB	~2s	Good
medium.en	1.4GB	~5s	Excellent
large-v3	3GB	~15s	Best

I use medium.en: on Apple Silicon it is the sweet spot between accuracy and speed.

2. whisper.cpp (the inference engine)

The original Whisper runs in Python with PyTorch (~2GB runtime). whisper.cpp is a C/C++ reimplementation that:

Needs no Python or PyTorch
Uses Apple Metal GPU acceleration
Compiles to a small static library
Runs 5-10x faster than the Python version

3. GGML format (model packaging)

The PyTorch .pt model file can’t be loaded by whisper.cpp directly. GGML is a lightweight binary format optimized for fast loading. A conversion script repacks the same weights into this format.

4. VoiceToClip.app (the macOS app)

A SwiftUI app with:

Menu bar icon — shows recording state
Floating button — always-on-top circle you click to toggle (blue → red → orange → blue)
URL scheme — voicetoclip://start and voicetoclip://stop for external triggering
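
On the receiving side, the URL scheme boils down to mapping an incoming URL to a command. Here is a small sketch of that mapping. `VoiceCommand` and `parseCommand` are hypothetical names, and how the URL reaches this function (for example SwiftUI's `onOpenURL`) is left out.

```swift
import Foundation

/// Commands the URL scheme can carry.
enum VoiceCommand: Equatable {
    case start
    case stop
}

/// Returns the command for voicetoclip://start or voicetoclip://stop, else nil.
func parseCommand(from url: URL) -> VoiceCommand? {
    guard url.scheme?.lowercased() == "voicetoclip" else { return nil }
    switch url.host?.lowercased() {
    case "start": return VoiceCommand.start
    case "stop": return VoiceCommand.stop
    default: return nil
    }
}
```

Returning nil for anything unrecognized means a malformed or foreign URL is ignored without changing the recording state.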

5. Spotlight integration (the trigger)

Since global keyboard shortcuts require Input Monitoring (locked down), I created tiny .app wrappers that call the URL scheme:

Cmd+Space → type “startv” → Enter → recording starts
Cmd+Space → type “stopv” → Enter → transcribes → clipboard ready
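
Each wrapper only has to open one URL and quit. A wrapper could be as small as the sketch below. This is an assumption about how such a wrapper might be written, not the actual wrapper code; `triggerURL` and `fire` are illustrative names, and `NSWorkspace.shared.open` is the only system call involved.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Builds voicetoclip://<action>, e.g. "start" or "stop".
func triggerURL(action: String) -> URL? {
    var components = URLComponents()
    components.scheme = "voicetoclip"
    components.host = action
    return components.url
}

#if canImport(AppKit)
/// Asks macOS to route the URL to whichever app registered the scheme.
func fire(action: String) -> Bool {
    guard let url = triggerURL(action: action) else { return false }
    return NSWorkspace.shared.open(url)
}
#endif
```

A "startv" wrapper would call `fire(action: "start")` at launch and exit. Spotlight finds it like any other app, so no hotkey registration is involved.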

Permissions Required

Permission	Needed?	Why
Microphone	✅	Standard per-app grant, no admin needed
Accessibility	❌	Clipboard approach avoids this entirely
Input Monitoring	❌	Spotlight trigger instead of hotkeys
Network	❌	Everything runs locally

The Build Process

Since Homebrew and HuggingFace were both blocked, the build required some creativity:

CMake — downloaded manually from cmake.org (portable, no install needed)
whisper.cpp — cloned from GitHub, built with cmake into a static library
Model — downloaded medium.en.pt from OpenAI’s Azure CDN (openaipublic.azureedge.net), then converted to GGML format using a Python script with a temporary venv
Swift app — built with Swift Package Manager, linked against the whisper.cpp static library
App bundle — wrapped in a .app with LSUIElement=true for menu-bar-only behavior

Visual Feedback

The floating button changes color to indicate state:

🔵 Blue — idle, ready to record
🔴 Red — recording in progress
🟠 Orange — transcribing (wait for blue before pasting)
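
The color cycle is a three-state machine. A compact way to model it, using hypothetical type and function names, is:

```swift
/// The three states shown by the floating button.
enum RecorderState: Equatable {
    case idle          // blue
    case recording     // red
    case transcribing  // orange
}

/// Things that can move the recorder from one state to the next.
enum RecorderEvent {
    case buttonClicked
    case transcriptionFinished
}

/// Pure transition function; unexpected events leave the state unchanged.
func nextState(from state: RecorderState, on event: RecorderEvent) -> RecorderState {
    switch (state, event) {
    case (.idle, .buttonClicked): return .recording
    case (.recording, .buttonClicked): return .transcribing
    case (.transcribing, .transcriptionFinished): return .idle
    default: return state
    }
}

/// Color name for each state, matching the list above.
func colorName(for state: RecorderState) -> String {
    switch state {
    case .idle: return "blue"
    case .recording: return "red"
    case .transcribing: return "orange"
    }
}
```

Ignoring clicks while transcribing is a design choice in this sketch: it keeps a second click from starting a new recording before the clipboard is ready.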

Privacy

Everything runs on-device:

Audio never leaves the machine
No cloud APIs called
No network requests whatsoever
The WAV file is stored in /tmp and overwritten on each recording
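
For illustration, here is one way a 16 kHz mono recording could be written to a fixed path so that each new recording replaces the previous one. This is a sketch under assumptions: the file name, the 16-bit sample depth, and the helper names are not from the app.

```swift
import Foundation

/// Appends an integer to the buffer in little-endian byte order.
func appendLE<T: FixedWidthInteger>(_ value: T, to data: inout Data) {
    withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
}

/// Builds a canonical 44-byte WAV header followed by 16-bit mono PCM samples.
func wavData(samples: [Int16], sampleRate: UInt32 = 16_000) -> Data {
    let channels: UInt16 = 1
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * (bitsPerSample / 8)
    let byteRate = sampleRate * UInt32(blockAlign)
    let dataSize = UInt32(samples.count * 2)

    var data = Data()
    data.append(contentsOf: "RIFF".utf8)
    appendLE(36 + dataSize, to: &data)   // size of everything after this field
    data.append(contentsOf: "WAVE".utf8)
    data.append(contentsOf: "fmt ".utf8)
    appendLE(UInt32(16), to: &data)      // fmt chunk length
    appendLE(UInt16(1), to: &data)       // format tag 1 = linear PCM
    appendLE(channels, to: &data)
    appendLE(sampleRate, to: &data)
    appendLE(byteRate, to: &data)
    appendLE(blockAlign, to: &data)
    appendLE(bitsPerSample, to: &data)
    data.append(contentsOf: "data".utf8)
    appendLE(dataSize, to: &data)
    for sample in samples { appendLE(sample, to: &data) }
    return data
}

/// Writes to one fixed path; the previous recording is replaced.
/// The default file name is hypothetical.
func overwriteRecording(_ samples: [Int16],
                        at path: String = "/tmp/voicetoclip-recording.wav") throws {
    try wavData(samples: samples).write(to: URL(fileURLWithPath: path), options: .atomic)
}
```

Reusing a single path means at most one recording exists on disk at any time.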

Limitations

Transcription takes a few seconds after stopping (not real-time streaming)
Whisper processes in 30-second segments — longer recordings take proportionally longer
Medium model uses ~1.5GB RAM while loaded
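
As a rough guide to the second point, the number of 30-second windows grows with recording length. The helper below is only an estimate with a hypothetical name; whisper.cpp advances through the audio based on decoded timestamps, so the real count can differ slightly.

```swift
/// Estimates how many 30-second windows a recording spans.
func windowCount(forSeconds duration: Double, windowSeconds: Double = 30) -> Int {
    guard duration > 0, windowSeconds > 0 else { return 0 }
    return Int((duration / windowSeconds).rounded(.up))
}
```

By this estimate a 95-second recording spans four windows, so it should take roughly four times as long to transcribe as a clip that fits in one.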

Tech Stack

Language: Swift + SwiftUI
Audio: AVAudioEngine (16kHz mono PCM)
Transcription: whisper.cpp (C library, statically linked)
Model: OpenAI Whisper medium.en (GGML format)
UI: MenuBarExtra + NSPanel (floating)
Trigger: Custom URL scheme + Spotlight-indexed wrapper apps

Takeaway

When your organization locks down the easy path, there’s usually a harder path that still works within the rules. The clipboard is the universal bridge — any app can write to it, any app can read from it, and no special permissions are needed for either operation.

Naveen Gurram