
Meeting Transcription App

Personal Project

Overview

Designed and developed a Claude Code Skill that takes macOS Voice Memos recordings, transcribes them locally with mlx-whisper on Apple Silicon, and generates meeting minutes with generative AI. The LLM backend switches between the cloud (Gemini API) and a self-hosted GPU server (Gemma 4 31B on DGX-Spark), so confidential meetings can be processed without sending any meeting content to external services.

Architecture

Adopted a Service Pair + Factory pattern: transcription and LLM services are each abstracted behind a `Protocol`, and `services/factory.py` swaps implementations at runtime. Transcription always runs locally on the Mac (mlx-whisper); only the LLM side toggles between gemini and gemma, turning the non-functional concern of whether data may leave the machine into a typed, switchable boundary.
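A minimal sketch of the pattern. The class and method names (`LLMService`, `generate_minutes`, `create_llm_service`) are assumptions for illustration; the real services call the Gemini API and the vLLM server rather than returning stubs.

```python
from typing import Protocol


class LLMService(Protocol):
    """The LLM boundary: any backend that can turn a transcript into minutes."""

    def generate_minutes(self, transcript: str) -> str: ...


class GeminiService:
    """Cloud backend (stubbed here; calls the Gemini API in practice)."""

    def generate_minutes(self, transcript: str) -> str:
        return f"[gemini minutes from {len(transcript)} chars]"


class GemmaService:
    """Self-hosted backend (stubbed here; calls vLLM on DGX-Spark in practice)."""

    def generate_minutes(self, transcript: str) -> str:
        return f"[gemma minutes from {len(transcript)} chars]"


def create_llm_service(backend: str) -> LLMService:
    """Factory: select the implementation by name at runtime."""
    services = {"gemini": GeminiService, "gemma": GemmaService}
    try:
        return services[backend]()
    except KeyError:
        raise ValueError(f"unknown LLM backend: {backend}") from None
```

Because callers depend only on the `Protocol`, swapping `gemini` for `gemma` changes nothing outside the factory.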

Local inference uses `mlx-community/whisper-large-v3-mlx`. Driving the Metal GPU directly through Apple's MLX framework yields a several-fold effective throughput gain over CPU inference while preserving a strict privacy boundary with no external API dependency.

The remote LLM backend runs on a self-owned DGX-Spark (NVIDIA GB10) where vLLM serves Gemma 4 31B via an OpenAI-compatible HTTP API. Reaching it with plain `urllib` over HTTP instead of SSH minimizes dependencies and keeps the cloud and self-hosted backends behind the same interface.
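A sketch of the stdlib-only request construction, assuming vLLM's standard OpenAI-compatible `/v1/chat/completions` endpoint; the host, model name, and temperature here are placeholders.

```python
import json
import urllib.request


def build_chat_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request with no third-party deps."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"http://{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a single `urllib.request.urlopen(req)`; because both backends speak the same chat-completions shape, the cloud and self-hosted paths share one response parser.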

  1. Transcription (Mac): mlx-whisper local inference on the Metal GPU
  2. LLM (gemini): Google Gemini 3.1 Pro Preview via cloud API
  3. LLM (gemma): Gemma 4 31B served by vLLM on DGX-Spark (OpenAI-compatible API)
  4. Factory: dynamic backend selection in `services/factory.py`
  5. Protocol abstractions in `services/protocols.py` invert dependencies and simplify unit testing
  6. Pydantic v2 Settings + `.env` externalize backend choice, model names, and hostnames
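The settings-driven backend choice can be sketched as follows; a stdlib dataclass stands in for the Pydantic v2 Settings class (which reads `.env` automatically), and the field and environment-variable names are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Stand-in for the real Pydantic v2 Settings; defaults are illustrative."""

    llm_backend: str = "gemini"
    gemma_host: str = "localhost:8000"
    whisper_model: str = "mlx-community/whisper-large-v3-mlx"

    @classmethod
    def from_env(cls) -> "Settings":
        # Pydantic Settings does this (plus .env parsing and validation) for free;
        # here we read the environment by hand to show the externalized knobs.
        return cls(
            llm_backend=os.environ.get("LLM_BACKEND", cls.llm_backend),
            gemma_host=os.environ.get("GEMMA_HOST", cls.gemma_host),
            whisper_model=os.environ.get("WHISPER_MODEL", cls.whisper_model),
        )
```

Switching the whole pipeline from cloud to self-hosted is then a one-line `.env` change, with no code edits.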

Quality Engineering

Injects past minutes as structured context and applies a two-stage consistency control: participant names, affiliations, honorifics (external "-sama" / internal bare name), and meeting-series types (client / internal) are kept aligned with history. After generation, a Read-based verification step against the most recent minutes auto-corrects the output when drift is detected.

Compresses Whisper segment output into a `MM:SS -> MM:SS: text` form before feeding it to the LLM — intentionally dropping sub-second precision roughly halves token cost, so meetings over an hour still fit within context limits.
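The compression step can be sketched as a pure function over Whisper's segment dicts (which carry float `start`/`end` seconds and a `text` field); truncating to whole seconds is exactly where the token savings come from.

```python
def compress_segments(segments: list[dict]) -> str:
    """Render Whisper segments as 'MM:SS -> MM:SS: text', dropping sub-second precision."""

    def mmss(seconds: float) -> str:
        s = int(seconds)  # truncate fractional seconds
        return f"{s // 60:02d}:{s % 60:02d}"

    return "\n".join(
        f"{mmss(seg['start'])} -> {mmss(seg['end'])}: {seg['text'].strip()}"
        for seg in segments
    )
```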

A BCP-47 (`ja-JP`) to Whisper code (`ja`) translation layer keeps configuration readable while supporting multiple languages. Participants, Action Items, and discussion points are emitted as Markdown tables and headings so downstream systems (e.g. Notion) can re-use the output.
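The translation layer reduces to taking the primary language subtag, since Whisper expects bare ISO 639-1 codes; this sketch is a simplification that ignores rarer tags where the mapping is not one-to-one.

```python
def to_whisper_lang(bcp47: str) -> str:
    """Map a BCP-47 tag like 'ja-JP' to Whisper's bare language code 'ja'."""
    return bcp47.split("-")[0].lower()
```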

Key Features

Scans macOS Voice Memos (`~/Library/Group Containers/.../Recordings`) and lists recordings with timestamp, duration, and size. m4a / mp4 / mov / wav are decoded directly by Whisper through ffmpeg, so users never need to pre-convert files.
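A sketch of the scan, assuming a flat recordings folder; duration extraction (which needs ffprobe or similar) is omitted, so only name, timestamp, and size are shown.

```python
from datetime import datetime
from pathlib import Path

AUDIO_EXTS = {".m4a", ".mp4", ".mov", ".wav"}


def list_recordings(folder: Path) -> list[dict]:
    """List supported recordings with modification time and size."""
    rows = []
    for p in sorted(folder.iterdir()):
        if p.suffix.lower() in AUDIO_EXTS:
            st = p.stat()
            rows.append({
                "name": p.name,
                "modified": datetime.fromtimestamp(st.st_mtime),
                "bytes": st.st_size,
            })
    return rows
```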

Rich and Questionary provide an interactive CLI for picking files and the LLM backend with arrow keys; a non-interactive mode is also exposed so other Claude Code agents can drive it from scripts.

Outputs `./biz/minutes/yyyymmdd-<title>.md` (minutes) and `transcript/*.txt` (raw transcript) into per-series subdirectories, using a naming scheme optimized for diff review and long-term archiving.
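A sketch of the naming scheme; the slug rule (lowercase, non-word runs collapsed to hyphens) is an assumption, but the `yyyymmdd-<title>.md` shape and per-series subdirectory follow the description above.

```python
import re
from datetime import date
from pathlib import Path


def minutes_path(series: str, title: str, day: date,
                 root: Path = Path("./biz/minutes")) -> Path:
    """Build the per-series yyyymmdd-<title>.md output path."""
    slug = re.sub(r"[^\w\-]+", "-", title).strip("-").lower()
    return root / series / f"{day:%Y%m%d}-{slug}.md"
```

Date-prefixed, slugged names sort chronologically in a directory listing, which is what makes diff review and archiving cheap.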

Development & Quality

Pins dependencies with Python 3.12 + uv and runs unit and async tests with `pytest`, `pytest-cov`, `pytest-mock`, and `pytest-asyncio`. Protocol-based dependency inversion lets both the LLM and Whisper be mocked, so regression runs in CI without a GPU. Google Python Style Guide docstrings and Pydantic v2 type validation keep the codebase maintainable.
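The GPU-free testing story rests on structural typing: a test double that merely has the right method satisfies the `Protocol`. This sketch uses assumed names (`TranscriptionService`, `transcribe`); the real suite drives it through pytest.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TranscriptionService(Protocol):
    def transcribe(self, path: str) -> str: ...


class FakeWhisper:
    """Test double: satisfies the Protocol structurally, no GPU or model download."""

    def transcribe(self, path: str) -> str:
        return "00:00 -> 00:03: hello world"


def summarize(transcriber: TranscriptionService, path: str) -> str:
    # Code under test depends only on the Protocol, so CI can inject the fake.
    return transcriber.transcribe(path).upper()
```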

Technologies

Python 3.12 · mlx-whisper · Apple Silicon (Metal) · Google Gemini · Gemma 4 31B · vLLM · DGX-Spark (NVIDIA GB10) · OpenAI-compatible API · Pydantic v2 · Rich · Questionary · pytest · uv · Claude Code Skill