My organization disables the built-in macOS dictation feature, and I don’t have admin rights to enable it or install tools via Homebrew. I needed a way to speak instead of type — so I built one.

The Problem

  • macOS voice-to-text is disabled by org policy
  • No admin rights to change system settings
  • No Homebrew (blocked)
  • Accessibility permissions locked down (requires admin)
  • Input Monitoring locked down (requires admin)
  • HuggingFace blocked by network policy

The challenge: build a voice-to-text tool that works in any app without needing any of the permissions my org restricts.

The Key Insight

You don’t need accessibility permissions if you use the clipboard as the bridge:

  1. Record audio via microphone (standard per-app permission)
  2. Transcribe locally using an AI model
  3. Copy result to clipboard
  4. User pastes with Cmd+V

No keystroke injection, no accessibility hooks, no screen recording. Just microphone → clipboard.
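
As a rough illustration of step 3, here is a minimal sketch of the clipboard write using AppKit's NSPasteboard. The function names are hypothetical, and the trimming step is an assumption (transcripts can come back with stray leading or trailing whitespace); this is not the app's actual code.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Trims the transcript and rejects empty results, so a failed
/// transcription does not wipe whatever is already on the clipboard.
/// (Hypothetical helper, not taken from the app.)
func prepareForClipboard(_ raw: String) -> String? {
    let trimmed = raw.trimmingCharacters(in: .whitespacesAndNewlines)
    return trimmed.isEmpty ? nil : trimmed
}

/// Writes text to the general pasteboard. Returns false when there is
/// nothing to copy or when AppKit is unavailable (non-macOS builds).
@discardableResult
func copyToClipboard(_ transcript: String) -> Bool {
    guard let text = prepareForClipboard(transcript) else { return false }
    #if canImport(AppKit)
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()  // required before writing new contents
    return pasteboard.setString(text, forType: .string)
    #else
    _ = text
    return false
    #endif
}
```

After a successful call, the text is available to Cmd+V in whichever app has focus.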

Architecture

┌─────────────────────────────────────────────────────┐
│                   VoiceToClip.app                     │
│                                                       │
│  ┌──────────┐    ┌────────────┐    ┌──────────────┐ │
│  │ AVAudio  │───▶│ whisper.cpp│───▶│ NSPasteboard │ │
│  │ Engine   │    │ (C library)│    │ (clipboard)  │ │
│  │ 16kHz    │    │            │    │              │ │
│  │ mono WAV │    │ ggml model │    │ Cmd+V paste  │ │
│  └──────────┘    └────────────┘    └──────────────┘ │
│                                                       │
│  ┌─────────────────────────────────────────────────┐ │
│  │ UI: Menu Bar Icon + Floating Button + URL Scheme│ │
│  └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘

Components

1. OpenAI Whisper (the AI model)

OpenAI trained a speech recognition model on 680,000 hours of audio and released the weights publicly. The model comes in different sizes:

Model       Size     Speed (30s audio)   Accuracy
tiny.en     75MB     ~1s                 Basic
base.en     141MB    ~2s                 Good
medium.en   1.4GB    ~5s                 Excellent
large-v3    3GB      ~15s                Best

I use medium.en, which I found to be the sweet spot between accuracy and speed on Apple Silicon.

2. whisper.cpp (the inference engine)

The original Whisper runs in Python with PyTorch (~2GB runtime). whisper.cpp is a C/C++ reimplementation that:

  • Needs no Python or PyTorch
  • Uses Apple Metal GPU acceleration
  • Compiles to a small static library
  • Runs 5-10x faster than the Python version
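
One practical detail when feeding audio to whisper.cpp: its `whisper_full` call takes 16kHz mono samples as 32-bit floats, while WAV files usually store 16-bit integers. The sketch below shows one way to do that conversion in plain Swift. The function names are hypothetical, and it assumes the input is raw little-endian 16-bit PCM with any WAV header already stripped.

```swift
import Foundation

/// Decodes raw little-endian 16-bit PCM bytes into samples.
/// A trailing odd byte, if any, is ignored.
func pcm16Samples(fromLittleEndian data: Data) -> [Int16] {
    var result: [Int16] = []
    result.reserveCapacity(data.count / 2)
    var index = data.startIndex
    while index + 1 < data.endIndex {
        let low = UInt16(data[index])
        let high = UInt16(data[index + 1])
        result.append(Int16(bitPattern: (high << 8) | low))
        index += 2
    }
    return result
}

/// Scales 16-bit samples to floats in the range [-1.0, 1.0).
func floatSamples(fromPCM16 samples: [Int16]) -> [Float] {
    samples.map { Float($0) / 32768.0 }
}
```

The resulting `[Float]` array is what would be handed to the C library, along with its element count.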

3. GGML format (model packaging)

The PyTorch .pt model file can’t be loaded by whisper.cpp directly. GGML is a lightweight binary format optimized for fast loading. A conversion script repacks the same weights into this format.
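
A quick sanity check after conversion is to look at the first four bytes of the output file. To my understanding, the whisper.cpp converter writes the 32-bit magic number 0x67676d6c ("ggml") at the start of the file, and the loader rejects files without it. The sketch below checks for that value; the function names are hypothetical, and the little-endian byte order is an assumption that holds on Apple Silicon and Intel Macs.

```swift
import Foundation

/// Magic number expected at the start of a whisper.cpp GGML model file:
/// the ASCII bytes "ggml" packed into one 32-bit value.
let ggmlMagic: UInt32 = 0x6767_6d6c

/// Returns true when the first four bytes, read little-endian, match the magic.
func looksLikeGGML(_ header: Data) -> Bool {
    guard header.count >= 4 else { return false }
    let bytes = [UInt8](header.prefix(4))
    let value = UInt32(bytes[0])
        | (UInt32(bytes[1]) << 8)
        | (UInt32(bytes[2]) << 16)
        | (UInt32(bytes[3]) << 24)
    return value == ggmlMagic
}

/// Reads only the first four bytes of the file at `path`.
func isGGMLModel(atPath path: String) -> Bool {
    guard let handle = FileHandle(forReadingAtPath: path) else { return false }
    defer { handle.closeFile() }
    return looksLikeGGML(handle.readData(ofLength: 4))
}
```

A check like this catches the common mistake of pointing the app at the original .pt file instead of the converted one.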

4. Swift menu bar app (the UI)

A SwiftUI app with:

  • Menu bar icon — shows recording state
  • Floating button — always-on-top circle you click to toggle (blue → red → orange → blue)
  • URL scheme: voicetoclip://start and voicetoclip://stop for external triggering
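
To make the URL scheme concrete, here is a minimal sketch of how an incoming URL could be mapped to a command. The `VoiceCommand` enum and `command(from:)` function are hypothetical names; how the app actually receives the URL (for example through SwiftUI's `onOpenURL`) is not shown and is an assumption on my part.

```swift
import Foundation

/// The two actions exposed through the custom URL scheme.
enum VoiceCommand: String {
    case start
    case stop
}

/// Maps voicetoclip://start and voicetoclip://stop to a command.
/// Any other scheme or action returns nil.
func command(from url: URL) -> VoiceCommand? {
    guard url.scheme?.lowercased() == "voicetoclip",
          let host = url.host?.lowercased() else { return nil }
    return VoiceCommand(rawValue: host)
}
```

Returning nil for anything unrecognized keeps the handler safe to expose, since any app on the machine can open a custom URL.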

5. Spotlight integration (the trigger)

Since global keyboard shortcuts require Input Monitoring (locked down), I created tiny .app wrappers that Spotlight can find by name and that simply call the URL scheme:

  • Cmd+Space → type “startv” → Enter → recording starts
  • Cmd+Space → type “stopv” → Enter → transcribes → clipboard ready
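
A wrapper like this only has to open one URL and exit. The sketch below shows one way to do that with `NSWorkspace`; the function names and the allow-list of actions are hypothetical, and the wrappers described above may well be built differently (for example as a shell script inside an .app bundle).

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Builds the trigger URL for a known action, or nil for anything else.
func triggerURL(for action: String) -> URL? {
    let allowed: Set<String> = ["start", "stop"]
    guard allowed.contains(action) else { return nil }
    return URL(string: "voicetoclip://\(action)")
}

/// Asks macOS to open the URL, which routes it to whichever app
/// registered the scheme. Returns false if the URL could not be opened.
@discardableResult
func trigger(_ action: String) -> Bool {
    guard let url = triggerURL(for: action) else { return false }
    #if canImport(AppKit)
    return NSWorkspace.shared.open(url)
    #else
    _ = url
    return false
    #endif
}
```

A "startv" wrapper would call `trigger("start")` and quit; a "stopv" wrapper would call `trigger("stop")`.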

Permissions Required

Permission         Needed?   Why
Microphone         Yes       Standard per-app grant, no admin needed
Accessibility      No        Clipboard approach avoids this entirely
Input Monitoring   No        Spotlight trigger instead of hotkeys
Network            No        Everything runs locally

The Build Process

Since Homebrew and HuggingFace were both blocked, the build required some creativity:

  1. CMake — downloaded manually from cmake.org (portable, no install needed)
  2. whisper.cpp — cloned from GitHub, built with cmake into a static library
  3. Model — downloaded medium.en.pt from OpenAI’s Azure CDN (openaipublic.azureedge.net), then converted to GGML format using a Python script with a temporary venv
  4. Swift app — built with Swift Package Manager, linked against the whisper.cpp static library
  5. App bundle — wrapped in a .app with LSUIElement=true for menu-bar-only behavior

Visual Feedback

The floating button changes color to indicate state:

  • 🔵 Blue — idle, ready to record
  • 🔴 Red — recording in progress
  • 🟠 Orange — transcribing (wait for blue before pasting)
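
The three states form a small state machine. Here is a minimal sketch of it, with hypothetical type and method names; ignoring clicks while transcribing is my assumption, not something the app is confirmed to do.

```swift
/// The three states shown by the floating button.
enum RecorderState: Equatable {
    case idle, recording, transcribing

    /// Color name matching the state list above.
    var colorName: String {
        switch self {
        case .idle: return "blue"
        case .recording: return "red"
        case .transcribing: return "orange"
        }
    }

    /// State after the user clicks the button.
    /// Assumption: clicks are ignored while a transcription is running.
    func afterClick() -> RecorderState {
        switch self {
        case .idle: return .recording
        case .recording: return .transcribing
        case .transcribing: return .transcribing
        }
    }

    /// State after the transcript has been copied to the clipboard.
    func afterTranscriptionFinished() -> RecorderState {
        self == .transcribing ? .idle : self
    }
}
```

Keeping the transitions in one place makes it harder for the menu bar icon and the floating button to drift out of sync.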

Privacy

Everything runs on-device:

  • Audio never leaves the machine
  • No cloud APIs called
  • No network requests whatsoever
  • The WAV file is stored in /tmp and overwritten on each recording
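
For readers curious what "overwritten on each recording" can look like in code, here is a minimal sketch that builds a 16kHz mono 16-bit WAV file in memory and writes it to a fixed location. The function names are hypothetical, the caller supplies the destination URL, and the app itself may record through a different API entirely.

```swift
import Foundation

/// Returns the bytes of an integer in little-endian order.
func littleEndianBytes<T: FixedWidthInteger>(_ value: T) -> [UInt8] {
    withUnsafeBytes(of: value.littleEndian) { Array($0) }
}

/// Builds a canonical 44-byte WAV header followed by 16-bit mono PCM data.
func wavFile(samples: [Int16], sampleRate: UInt32 = 16_000) -> Data {
    let channels: UInt16 = 1
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * bitsPerSample / 8
    let byteRate = sampleRate * UInt32(blockAlign)
    let dataSize = UInt32(samples.count) * UInt32(blockAlign)

    var data = Data()
    data.append(contentsOf: Array("RIFF".utf8))
    data.append(contentsOf: littleEndianBytes(36 + dataSize))  // remaining file size
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    data.append(contentsOf: littleEndianBytes(UInt32(16)))     // fmt chunk size
    data.append(contentsOf: littleEndianBytes(UInt16(1)))      // format 1 = PCM
    data.append(contentsOf: littleEndianBytes(channels))
    data.append(contentsOf: littleEndianBytes(sampleRate))
    data.append(contentsOf: littleEndianBytes(byteRate))
    data.append(contentsOf: littleEndianBytes(blockAlign))
    data.append(contentsOf: littleEndianBytes(bitsPerSample))
    data.append(contentsOf: Array("data".utf8))
    data.append(contentsOf: littleEndianBytes(dataSize))
    for sample in samples {
        data.append(contentsOf: littleEndianBytes(sample))
    }
    return data
}

/// Writes the recording to `url`, replacing any previous file at that path.
func saveRecording(_ samples: [Int16], to url: URL) throws {
    try wavFile(samples: samples).write(to: url, options: .atomic)
}
```

Because the path is fixed, at most one recording exists on disk at any time.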

Limitations

  • Transcription takes a few seconds after stopping (not real-time streaming)
  • Whisper processes audio in 30-second windows, so longer recordings take proportionally longer to transcribe
  • Medium model uses ~1.5GB RAM while loaded
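
To put rough numbers on the 30-second limitation: at 16kHz, one window is 480,000 samples, and a recording needs as many windows as it takes to cover its length. The sketch below turns that into a back-of-the-envelope estimate. The names are hypothetical, and the default of 5 seconds per window is the approximate medium.en figure from the model table, not a measured guarantee.

```swift
import Foundation

let sampleRate = 16_000    // samples per second
let windowSeconds = 30     // length of one Whisper window in seconds

/// Number of 30-second windows needed to cover `sampleCount` samples.
func windowCount(forSamples sampleCount: Int) -> Int {
    let samplesPerWindow = sampleRate * windowSeconds  // 480,000
    return (sampleCount + samplesPerWindow - 1) / samplesPerWindow
}

/// Rough transcription time, assuming a fixed cost per window.
func estimatedSeconds(forRecordingSeconds duration: Double,
                      secondsPerWindow: Double = 5) -> Double {
    let samples = Int((duration * Double(sampleRate)).rounded(.up))
    return Double(windowCount(forSamples: samples)) * secondsPerWindow
}
```

Under these assumptions, a 95-second recording spans four windows, so the estimate is about 20 seconds; actual times depend on the hardware and on how much of the audio is speech.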

Tech Stack

  • Language: Swift + SwiftUI
  • Audio: AVAudioEngine (16kHz mono PCM)
  • Transcription: whisper.cpp (C library, statically linked)
  • Model: OpenAI Whisper medium.en (GGML format)
  • UI: MenuBarExtra + NSPanel (floating)
  • Trigger: Custom URL scheme + Spotlight-indexed wrapper apps

Takeaway

When your organization locks down the easy path, there’s usually a harder path that still works within the rules. The clipboard is the universal bridge — any app can write to it, any app can read from it, and no special permissions are needed for either operation.