# NanoClaw Debug Checklist

## Known Issues (2026-02-08)

### 1. [FIXED] Resume branches from stale tree position

When an agent team spawns subagent CLI processes, they all write to the same session JSONL. On subsequent `query()` resumes, the CLI reads the JSONL but may pick a stale branch tip (from before the subagent activity), causing the agent's response to land on a branch the host never receives a `result` for. **Fix**: pass `resumeSessionAt` with the last assistant message UUID to explicitly anchor each resume.
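
The fix can be sketched in TypeScript. The helper below is illustrative (the function name and the shape of the `query()` options object are assumptions; only `resumeSessionAt` and the transcript's `type`/`uuid` fields come from the notes above):

```typescript
// Scan a session JSONL transcript and return the UUID of the last
// assistant message, suitable for passing as `resumeSessionAt`.
function lastAssistantMessageUuid(jsonl: string): string | undefined {
  let last: string | undefined;
  for (const line of jsonl.split('\n')) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (entry.type === 'assistant' && typeof entry.uuid === 'string') {
        last = entry.uuid;
      }
    } catch {
      // skip malformed lines
    }
  }
  return last;
}

// Hypothetical usage on the next resume (exact options shape may differ):
// query({ prompt, options: { resume: sessionId,
//                            resumeSessionAt: lastAssistantMessageUuid(transcript) } });
```

Anchoring each resume this way makes branch selection deterministic regardless of what subagents appended in between.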

### 2. IDLE_TIMEOUT == CONTAINER_TIMEOUT (both 30 min)

Both timers fire at the same time, so containers always exit via a hard SIGKILL (exit code 137) instead of the graceful `_close` sentinel shutdown. The idle timeout should be shorter (e.g., 5 min) so containers wind down between messages, while the container timeout stays at 30 min as a safety net for stuck agents.
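
A minimal sketch of the intended split, with illustrative constant and mode names (not the actual NanoClaw identifiers):

```typescript
const IDLE_TIMEOUT_MS = 5 * 60 * 1000;       // graceful wind-down between messages
const CONTAINER_TIMEOUT_MS = 30 * 60 * 1000; // hard safety net for stuck agents

type Shutdown = 'running' | 'graceful_close' | 'sigkill';

// Decide how a container should exit given how long it has been idle
// and how long it has been alive. The container timer wins only when
// the agent is genuinely stuck past the safety net.
function shutdownMode(msSinceLastMessage: number, msSinceSpawn: number): Shutdown {
  if (msSinceSpawn >= CONTAINER_TIMEOUT_MS) return 'sigkill';         // exit code 137
  if (msSinceLastMessage >= IDLE_TIMEOUT_MS) return 'graceful_close'; // `_close` sentinel
  return 'running';
}
```

With both constants equal, the idle branch can never win the race, which is exactly the bug described above.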

### 3. Cursor advanced before agent succeeds

`processGroupMessages` advances `lastAgentTimestamp` before the agent runs. If the container times out, retries find no messages (the cursor is already past them), so messages are permanently lost on timeout.
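
The fix is to move the cursor only after a successful run. A sketch with hypothetical names (the real `processGroupMessages` signature is not shown in these notes):

```typescript
interface Msg { timestamp: number; text: string; }

// Advance the cursor only when the agent reports success, so a timeout
// leaves the batch visible to the retry instead of silently skipping it.
async function processGroup(
  msgs: Msg[],
  cursor: number,
  runAgent: (batch: Msg[]) => Promise<boolean>, // false = timeout/failure
): Promise<number> {
  const batch = msgs.filter((m) => m.timestamp > cursor);
  if (batch.length === 0) return cursor;
  const ok = await runAgent(batch);
  // lastAgentTimestamp moves past the batch only on success.
  return ok ? batch[batch.length - 1].timestamp : cursor;
}
```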

### 4. Kubernetes image garbage collection deletes nanoclaw-agent image

**Symptoms**: `Container exited with code 125: pull access denied for nanoclaw-agent` — the container image disappears overnight or after a few hours, even though you just built it.

**Cause**: If your container runtime has Kubernetes enabled (Rancher Desktop enables it by default), the kubelet runs image garbage collection when disk usage exceeds 85%. NanoClaw containers are ephemeral (run and exit), so `nanoclaw-agent:latest` is never protected by a running container. The kubelet sees it as unused and deletes it — often overnight when no messages are being processed. Other images (docker-compose services) survive because long-running containers reference them.

**Fix**: Disable Kubernetes if you don't need it:

```bash
# Rancher Desktop
rdctl set --kubernetes-enabled=false

# Then rebuild the container image
./container/build.sh
```

**Diagnosis**: Check the k3s log for image GC activity:

```bash
grep -i "nanoclaw" ~/Library/Logs/rancher-desktop/k3s.log
# Look for: "Removing image to free bytes" with the nanoclaw-agent image ID
```

Check NanoClaw logs for image status:

```bash
grep -E "image found|image NOT found|image missing" logs/nanoclaw.log
```

If you need Kubernetes enabled, set `CONTAINER_IMAGE` to an image pushed to a registry (so the runtime can simply re-pull it after a GC pass), or raise the kubelet's image GC thresholds.
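
For the threshold route, these are the standard upstream `KubeletConfiguration` fields; how you feed them to Rancher Desktop's k3s (e.g. via `kubelet-arg`) depends on your setup:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 95  # start GC only above 95% disk usage (default 85)
imageGCLowThresholdPercent: 90   # free space down to 90% (default 80)
```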

## Quick Status Check

```bash
# 1. Is the service running?
launchctl list | grep nanoclaw
# Expected: PID 0 com.nanoclaw (PID = running, "-" = not running, non-zero exit = crashed)

# 2. Any running containers?
container ls --format '{{.Names}} {{.Status}}' 2>/dev/null | grep nanoclaw

# 3. Any stopped/orphaned containers?
container ls -a --format '{{.Names}} {{.Status}}' 2>/dev/null | grep nanoclaw

# 4. Recent errors in service log?
grep -E 'ERROR|WARN' logs/nanoclaw.log | tail -20

# 5. Is WhatsApp connected? (look for last connection event)
grep -E 'Connected to WhatsApp|Connection closed|connection.*close' logs/nanoclaw.log | tail -5

# 6. Are groups loaded?
grep 'groupCount' logs/nanoclaw.log | tail -3
```

## Session Transcript Branching

```bash
# Check for concurrent CLI processes in session debug logs
ls -la data/sessions/<group>/.claude/debug/

# Count unique SDK processes that handled messages
# Each .txt file = one CLI subprocess. Multiple = concurrent queries.

# Check parentUuid branching in transcript
python3 -c "
import json
lines = open('data/sessions/<group>/.claude/projects/-workspace-group/<session>.jsonl').read().strip().split('\n')
for i, line in enumerate(lines):
    try:
        d = json.loads(line)
        if d.get('type') == 'user' and d.get('message'):
            parent = (d.get('parentUuid') or 'ROOT')[:8]
            content = str(d['message'].get('content', ''))[:60]
            print(f'L{i+1} parent={parent} {content}')
    except Exception:
        pass
"
```

## Container Timeout Investigation

```bash
# Check for recent timeouts
grep -E 'Container timeout|timed out' logs/nanoclaw.log | tail -10

# Check container log files for the timed-out container
ls -lt groups/*/logs/container-*.log | head -10

# Read the most recent container log (replace path)
cat groups/<group>/logs/container-<timestamp>.log

# Check if retries were scheduled and what happened
grep -E 'Scheduling retry|retry|Max retries' logs/nanoclaw.log | tail -10
```

## Agent Not Responding

```bash
# Check if messages are being received from WhatsApp
grep 'New messages' logs/nanoclaw.log | tail -10

# Check if messages are being processed (container spawned)
grep -E 'Processing messages|Spawning container' logs/nanoclaw.log | tail -10

# Check if messages are being piped to an active container
grep -E 'Piped messages|sendMessage' logs/nanoclaw.log | tail -10

# Check the queue state — any active containers?
grep -E 'Starting container|Container active|concurrency limit' logs/nanoclaw.log | tail -10

# Check lastAgentTimestamp vs latest message timestamp
sqlite3 store/messages.db "SELECT chat_jid, MAX(timestamp) AS latest FROM messages GROUP BY chat_jid ORDER BY latest DESC LIMIT 5;"
```

## Container Mount Issues

```bash
# Check mount validation logs (shows on container spawn)
grep -E 'Mount validated|Mount.*REJECTED|mount' logs/nanoclaw.log | tail -10

# Verify the mount allowlist is readable
cat ~/.config/nanoclaw/mount-allowlist.json

# Check group's container_config in DB
sqlite3 store/messages.db "SELECT name, container_config FROM registered_groups;"

# Test-run a container to check mounts (dry run)
# Replace <group-folder> with the group's folder name
container run -i --rm --entrypoint ls nanoclaw-agent:latest /workspace/extra/
```

## WhatsApp Auth Issues

```bash
# Check if a QR code was requested (means auth expired)
grep -E 'QR|qr|authentication required' logs/nanoclaw.log | tail -5

# Check auth files exist
ls -la store/auth/

# Re-authenticate if needed
npm run auth
```

## Service Management

```bash
# Restart the service
launchctl kickstart -k gui/$(id -u)/com.nanoclaw

# View live logs
tail -f logs/nanoclaw.log

# Stop the service (careful — running containers are detached, not killed)
launchctl bootout gui/$(id -u)/com.nanoclaw

# Start the service
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.nanoclaw.plist

# Rebuild after code changes
npm run build && launchctl kickstart -k gui/$(id -u)/com.nanoclaw
```