---
name: add-voice-transcription
description: Add voice message transcription to NanoClaw using OpenAI's Whisper API. Automatically transcribes WhatsApp voice notes so the agent can read and respond to them.
---

# Add Voice Transcription

This skill adds automatic voice message transcription to NanoClaw's WhatsApp channel using OpenAI's Whisper API. When a voice note arrives, it is downloaded, transcribed, and delivered to the agent as `[Voice: <transcription>]`.

## Phase 1: Pre-flight

### Check if already applied

Read `.nanoclaw/state.yaml`. If `voice-transcription` is in `applied_skills`, skip to Phase 3 (Configure). The code changes are already in place.

### Ask the user

Use `AskUserQuestion` to collect information:

> AskUserQuestion: Do you have an OpenAI API key for Whisper transcription? If yes, collect it now. If no, direct them to create one at https://platform.openai.com/api-keys.

## Phase 2: Apply Code Changes

Run the skills engine to apply this skill's code package.

### Initialize skills system (if needed)

If the `.nanoclaw/` directory doesn't exist yet:

```bash
npx tsx scripts/apply-skill.ts --init
```

### Apply the skill

```bash
npx tsx scripts/apply-skill.ts .claude/skills/add-voice-transcription
```

This deterministically:

- Adds `src/transcription.ts` (voice transcription module using OpenAI Whisper)
- Three-way merges voice handling into `src/channels/whatsapp.ts` (isVoiceMessage check, transcribeAudioMessage call)
- Three-way merges transcription tests into `src/channels/whatsapp.test.ts` (mock + 3 test cases)
- Installs the `openai` npm dependency
- Updates `.env.example` with `OPENAI_API_KEY`
- Records the application in `.nanoclaw/state.yaml`

If the apply reports merge conflicts, read the intent files:

- `modify/src/channels/whatsapp.ts.intent.md` — what changed and invariants for whatsapp.ts
- `modify/src/channels/whatsapp.test.ts.intent.md` — what changed for whatsapp.test.ts

### Validate code changes

```bash
npm test
npm run build
```

All tests must pass
(including the 3 new voice transcription tests) and the build must be clean before proceeding.

## Phase 3: Configure

### Get OpenAI API key (if needed)

If the user doesn't have an API key:

> I need you to create an OpenAI API key:
>
> 1. Go to https://platform.openai.com/api-keys
> 2. Click "Create new secret key"
> 3. Give it a name (e.g., "NanoClaw Transcription")
> 4. Copy the key (starts with `sk-`)
>
> Cost: ~$0.006 per minute of audio (~$0.003 per typical 30-second voice note)

Wait for the user to provide the key.

### Add to environment

Add to `.env`:

```bash
OPENAI_API_KEY=
```

Sync to the container environment:

```bash
mkdir -p data/env && cp .env data/env/env
```

The container reads its environment from `data/env/env`, not `.env` directly.

### Build and restart

```bash
npm run build
launchctl kickstart -k gui/$(id -u)/com.nanoclaw  # macOS
# Linux: systemctl --user restart nanoclaw
```

## Phase 4: Verify

### Test with a voice note

Tell the user:

> Send a voice note in any registered WhatsApp chat. The agent should receive it as `[Voice: <transcription>]` and respond to its content.

### Check logs if needed

```bash
tail -f logs/nanoclaw.log | grep -i voice
```

Look for:

- `Transcribed voice message` — successful transcription with character count
- `OPENAI_API_KEY not set` — key missing from `.env`
- `OpenAI transcription failed` — API error (check key validity, billing)
- `Failed to download audio message` — media download issue

## Troubleshooting

### Voice notes show "[Voice Message - transcription unavailable]"

1. Check that `OPENAI_API_KEY` is set in `.env` AND synced to `data/env/env`
2. Verify the key works: `curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY" | head -c 200`
3. Check OpenAI billing — Whisper requires a funded account

### Voice notes show "[Voice Message - transcription failed]"

Check logs for the specific error.
Common causes:

- Network timeout — transient, will work on the next message
- Invalid API key — regenerate at https://platform.openai.com/api-keys
- Rate limiting — wait and retry

### Agent doesn't respond to voice notes

Verify the chat is registered and the agent is running. Voice transcription only runs for registered groups.