Commit Graph

10 Commits

Author SHA1 Message Date
Gabi Simons
f794185c21 fix: atomic claim prevents scheduled tasks from executing twice (#657)
* fix: atomic claim prevents scheduled tasks from executing twice (#138)

Replace the two-phase getDueTasks() + deferred updateTaskAfterRun() with
an atomic SQLite transaction (claimDueTasks) that advances next_run
BEFORE dispatching tasks to the queue. This eliminates the race window
where subsequent scheduler polls re-discover in-progress tasks.

Key changes:
- claimDueTasks(): SELECT + UPDATE in a single db.transaction(), so no
  poll can read stale next_run values. Once-tasks get next_run=NULL;
  recurring tasks get next_run advanced to the future.
- computeNextRun(): anchors interval tasks to the scheduled time (not
  Date.now()) to prevent cumulative drift. Includes a while-loop to
  skip missed intervals and a guard against invalid interval values.
- updateTaskAfterRun(): simplified to only record last_run/last_result
  since next_run is already handled by the claim.

Closes #138, #211, #300, #578

Co-authored-by: @taslim (PR #601)
Co-authored-by: @baijunjie (Issue #138)
Co-authored-by: @Michaelliv (Issue #300)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply prettier formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: track running task ID in GroupQueue to prevent duplicate execution (#138)

Previous commits implemented an "atomic claim" approach (claimDueTasks)
that advanced next_run before execution. Per Gavriel's review, this
solved the symptom at the wrong layer and introduced crash-recovery
risks for once-tasks.

This commit reverts claimDueTasks and instead fixes the actual bug:
GroupQueue.enqueueTask() only checked pendingTasks for duplicates, but
running tasks had already been shifted out. Adding runningTaskId to
GroupState closes that gap with a 3-line fix at the correct layer.

The computeNextRun() drift fix is retained, applied post-execution
where it belongs.

Closes #138, #211, #300, #578

Co-authored-by: @taslim (PR #601)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add changelog entry for scheduler duplicate fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add contributors for scheduler race condition fix

Co-Authored-By: Taslim <9999802+taslim@users.noreply.github.com>
Co-Authored-By: BaiJunjie <7956480+baijunjie@users.noreply.github.com>
Co-Authored-By: Michael <13676242+Michaelliv@users.noreply.github.com>
Co-Authored-By: Kyle Zhike Chen <3477852+kk17@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: gavrielc <gabicohen22@yahoo.com>
Co-authored-by: Taslim <9999802+taslim@users.noreply.github.com>
Co-authored-by: BaiJunjie <7956480+baijunjie@users.noreply.github.com>
Co-authored-by: Michael <13676242+Michaelliv@users.noreply.github.com>
Co-authored-by: Kyle Zhike Chen <3477852+kk17@users.noreply.github.com>
2026-03-04 16:23:29 +02:00
Gabi Simons
11c201088b refactor: CI optimization, logging improvements, and codebase formatting (#456)
* fix(db): remove unique constraint on folder to support multi-channel agents

* ci: implement automated skill drift detection and self-healing PRs

* fix: align registration logic with Gavriel's feedback and fix build/test issues from Daniel Mi

* style: conform to prettier standards for CI validation

* test: fix branch naming inconsistency in CI (master vs main)

* fix(ci): robust module resolution by removing file extensions in scripts

* refactor(ci): simplify skill validation by removing redundant combination tests

* style: conform skills-engine to prettier, unify logging in index.ts and cleanup unused imports

* refactor: extract multi-channel DB changes to separate branch

Move channel column, folder suffix logic, and related migrations
to feat/multi-channel-db-v2 for independent review. This PR now
contains only CI/CD optimizations, Prettier formatting, and
logging improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 23:13:36 +02:00
gavrielc
5f58941db2 fix: add .catch() handlers to fire-and-forget async calls (#221) (#355)
Several async calls in the message loop and group queue are
fire-and-forget without .catch() handlers. When WhatsApp disconnects
or containers fail unexpectedly, these produce unhandled rejections
that can crash the process.

Add explicit .catch() at each call site so errors are logged with
full context (groupJid, taskId) instead of crashing:

- channel.setTyping() in message loop (adapted for channel abstraction)
- startMessageLoop() in main()
- runForGroup() and runTask() in group-queue (5 call sites)

Closes #221

Co-authored-by: Naveen Jain <1779929+naveenspark@users.noreply.github.com>
Co-authored-by: Skip Potter <skip.potter.va@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 23:22:51 +02:00
Gavriel Cohen
c6b69e87a9 fix: correctly trigger idle preemption in streaming input mode
The original notifyIdle condition (!result.result) never fired in
streaming input mode because every result has non-null text content.
This caused due tasks to wait up to 30 minutes for the idle timer.

- Call notifyIdle for ALL successful results (not just null ones)
- Add isTaskContainer flag so user messages queue instead of being
  forwarded to task containers (which blocked notifyIdle from the
  message container's onOutput path)
- Reset idleWaiting in sendMessage so containers aren't preempted
  while actively working on a new incoming message
- Replace 30-min IDLE_TIMEOUT with 10s close timer for task containers
  since they are single-turn and should exit promptly after their result

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 22:19:24 +02:00
gavrielc
93bb94ff55 fix: only preempt idle containers when scheduled tasks enqueue
Containers that finish work but stay alive in waitForIpcMessage() block
queued scheduled tasks. Previous approaches killed active containers
mid-work. This fix tracks idle state via the session-update marker
(status: success, result: null) and only preempts when the container
is idle-waiting, not actively working.

Closes #293

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 22:19:24 +02:00
gavrielc
6f02ee530b Adds Agent Swarms
* feat: streaming container mode, IPC messaging, agent teams support

Major architectural shift from single-shot container runs to long-lived
streaming containers with IPC-based message injection.

- Agent runner: query loop with AsyncIterable prompt to keep stdin open
  for agent teams (fixes isSingleUserTurn premature shutdown)
- New standalone stdio MCP server (ipc-mcp-stdio.ts) inheritable by
  subagents, with send_message and schedule_task tools
- Streaming output: parse OUTPUT_START/END markers in real-time, send
  results to WhatsApp as they arrive
- IPC file-based messaging: host writes to ipc/{group}/input/, agent
  polls for follow-up messages without respawning containers
- Per-group settings.json with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
- SDK bumped to 0.2.34 for TeamCreate tool support
- Container idle timeout (30min) with _close sentinel for shutdown
- Orphaned container cleanup on startup
- alwaysRespond flag for groups that skip trigger pattern check
- Uncaught exception/rejection handlers with timestamps in logger
- Combined SDK documentation into single deep dive reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove unused ipc-mcp.ts (replaced by ipc-mcp-stdio.ts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clarify agent communication model in docs and tool descriptions

- CLAUDE.md (main + global): split communication instructions into
  "responding to messages" vs "scheduled tasks" sections
- send_message tool: note that scheduled task output is not sent to user
- Remove structured output (outputFormat) — not needed with current flow
- Regular output is sent to WhatsApp; scheduled task output is only logged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: ignore dynamic group data while preserving base structure

Only track groups/main/CLAUDE.md and groups/global/CLAUDE.md. All other
group directories and files are ignored to prevent tracking user-specific
session data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve critical bugs in streaming container mode

Bug 1 (scheduled task hang): Task scheduler now passes onOutput callback
with idle timer that writes _close sentinel after IDLE_TIMEOUT, so
containers exit cleanly instead of blocking queue slots for 30 minutes.
Scheduled tasks stay alive for interactive follow-up via IPC.

Bug 2 (timeout disabled): Remove resetTimeout() from stderr handler.
SDK writes debug logs continuously, resetting the timer on every line.
Timeout now only resets on actual output markers in stdout.

Bug 3 (trigger bypass): Piped messages in startMessageLoop now check
trigger pattern for non-main groups. Non-trigger messages accumulate in
DB and are pulled as context via getMessagesSince when a trigger arrives.

Bug 7 (non-atomic IPC writes): GroupQueue.sendMessage uses temp file +
rename for atomic writes, matching ipc-mcp-stdio.ts pattern.

Also: flip isVerbose back to false (debug leftover), add isScheduledTask
to host-side ContainerInput interface.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: idle timer not starting + scheduled task groupFolder missing

Two bugs that prevented the scheduled task idle timeout fix from working:

1. onOutput was only called when parsed.result !== null, but session
   update markers have result: null. The idle timer never started for
   "silent" query completions, leaving containers parked at
   waitForIpcMessage until hard timeout.

2. Scheduler's onProcess callback didn't pass groupFolder to
   queue.registerProcess, so closeStdin no-oped (groupFolder was null).
   The _close sentinel was never written even when the idle timer fired.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: duplicate messages and timestamp rollback in piping path

Two bugs introduced by the trigger context accumulation change:

1. processGroupMessages didn't advance lastAgentTimestamp until after
   the container finished. The piping path's getMessagesSince(lastAgent
   Timestamp) re-fetched messages already sent as the initial prompt,
   causing duplicates.

2. processGroupMessages overwrote lastAgentTimestamp with the original
   batch timestamp on completion, rolling back any advancement made by
   the piping path while the container was running.

Fix: advance lastAgentTimestamp immediately after building the prompt,
before starting the container. This matches the piping path behavior
and eliminates both the overlap and the rollback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: container idles 30 extra minutes after _close during query

When _close was detected during pollIpcDuringQuery, it was consumed
(deleted) and stream.end() was called. But after runQuery returned,
main() still emitted a session-update marker (resetting the host's idle
timer) and called waitForIpcMessage (which polled forever since _close
was already gone). The container had to wait for a second _close.

Fix: runQuery now returns closedDuringQuery. When true, main() skips
the session-update marker and waitForIpcMessage, exiting immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resume branching, internal tags, and output forwarding

- Fix resume branching: pass resumeSessionAt with last assistant UUID
  to anchor each query loop resume to the correct conversation tree
  position. Prevents agent responses landing on invisible branches
  when agent teams subagents create parallel JSONL entries.

- Add <internal> tag stripping: agent can wrap internal reasoning in
  <internal> tags which are logged but not sent to WhatsApp. Prevents
  duplicate messages and internal monologue reaching users.

- Forward scheduled task output: scheduled tasks now send result text
  to WhatsApp (with <internal> stripping), matching regular message
  behavior. No more special-case instructions.

- Update Communication guidance in CLAUDE.md: simplified to "your
  output is sent to the user or group" with soft guidance on
  <internal> tags and send_message usage.

- Add messaging behavior docs to schedule_task tool: prompts the
  scheduling agent to include guidance on whether the task should
  always/conditionally/never message the user.

- Mount security: containerPath now optional, defaults to basename
  of hostPath.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cursor rollback on error, flush guard, verbose logging

- Roll back lastAgentTimestamp on container error so retries can
  re-process the messages instead of silently losing them.

- Add guard flag to flushOutgoingQueue to prevent duplicate sends
  from concurrent flushes during rapid WA reconnects.

- Revert isVerbose from hardcoded false back to env-based check
  (LOG_LEVEL=debug|trace).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: orphan container cleanup was silently failing

The startup cleanup used `container ls --format {{.Names}}` which is
Docker Go-template syntax. Apple Container only supports `--format json`
or `--format table`. The command errored with exit code 64, but the
catch block silently swallowed it — orphan containers were never cleaned
up on restart.

Fixed to use `--format json` and parse `configuration.id` from the
JSON output. Also filters by `status: running` and logs a warning on
failure instead of silently catching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add Discord badge and community section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: idle timer reset on null results and flush queue message loss

- Only reset idle timer on actual results (non-null), not session-update
  markers. Prevents containers staying alive 30 extra minutes after the
  agent finishes work.
- flushOutgoingQueue now uses shift() instead of splice(0) so unattempted
  messages stay in the queue if an unexpected error bails the loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add Agent Swarms to README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update Telegram skill for current architecture

Rewrite integration instructions to match the per-group queue/SQLite
architecture: remove onMessage callback pattern (store to DB, let
message loop pick up), fix startSchedulerLoop signature, add
TELEGRAM_ONLY service startup, SQLite registration, data/env/env sync,
@mention-to-trigger translation, and BotFather group privacy docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: Telegram skill message chunking, media placeholders, chat discovery

- Split long messages at Telegram's 4096 char limit to prevent silent
  send failures
- Store placeholder text for non-text messages (photos, voice, stickers,
  etc.) so the agent knows media was sent
- Update getAvailableGroups filter to include tg: chats so the agent can
  discover and register Telegram chats via IPC
- Fix removal step numbering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update REQUIREMENTS.md and SPEC.md for SQLite architecture

- Replace all registered_groups.json / sessions.json / router_state.json
  references with SQLite equivalents
- Fix CONTAINER_TIMEOUT default (300000 → 1800000)
- Add missing config exports (IDLE_TIMEOUT, MAX_CONCURRENT_CONTAINERS)
- Update folder structure: add missing src files (logger, group-queue,
  mount-security), remove non-existent utils.ts, list all skills
- Fix agent-runner entry (ipc-mcp.ts → ipc-mcp-stdio.ts)
- Update startup sequence to reflect per-group queue architecture
- Fix env mounting description (data/env/env, not extracted vars)
- Update troubleshooting to use sqlite3 commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: fix README architecture description, revert SPEC.md env error

- README: update architecture blurb to mention per-group queue, add
  group-queue.ts to key files, update file descriptions
- SPEC.md: restore correct credential filtering description (only auth
  vars are extracted from .env, not the full file)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 02:50:43 +02:00
gavrielc
44f0b3d99c fix: improve agent output schema, tool descriptions, and shutdown robustness
- Rename status→outputType, responded/silent→message/log for clarity
- Remove scheduled task special-casing: userMessage now sent for all contexts
- Update schema, tool, and CLAUDE.md descriptions to be clear and
  non-contradictory about communication mechanisms
- Use full tool name mcp__nanoclaw__send_message in docs
- Change schedule_task target_group to accept JID instead of folder name
- Only show target_group_jid parameter to main group agents
- Add defense-in-depth sanitization and error callback to exec() in shutdown
- Use "user or group" consistently (supports both 1:1 and group chats)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 20:22:45 +02:00
gavrielc
ae177156ec feat: per-group queue, SQLite state, graceful shutdown (#111)
* fix: wire up queue processMessagesFn before recovery to prevent silent message loss

recoverPendingMessages() was called after startMessageLoop(), which meant:
1. Recovery could race with the message loop's first iteration
2. processMessagesFn was set inside startMessageLoop, so recovery
   enqueues would fire runForGroup with processMessagesFn still null,
   silently skipping message processing

Move setProcessMessagesFn and recoverPendingMessages before startMessageLoop
so the queue is fully wired before any messages are enqueued.

https://claude.ai/code/session_01PCY8zNjDa2N29jvBAV5vfL

* feat: structured agent output to fix infinite retry on silent responses (#113)

Use Agent SDK's outputFormat with json_schema to get typed responses
from the agent. The agent now returns { status: 'responded' | 'silent',
userMessage?, internalLog? } instead of a plain string. This fixes a
critical bug where a null/empty agent response caused infinite 5-second
retry loops by conflating "nothing to say" with "error".

- Agent runner: add AGENT_RESPONSE_SCHEMA and parse structured_output
- Host: advance lastAgentTimestamp on both responded AND silent status
- GroupQueue: add exponential backoff (5s-80s) with max 5 retries for
  actual errors, replacing unbounded fixed-interval retries

https://claude.ai/code/session_014SLc8MxP9BYhEhDCLox9U8

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-06 18:54:26 +02:00
gavrielc
03df69e9b5 fix: address review feedback for per-group queue reliability
- Fix startup recovery running before WhatsApp connects, which could
  permanently lose agent responses by advancing lastAgentTimestamp
  before sock is initialized
- Add 5s retry on container failure so messages aren't silently dropped
  until a new message arrives for the group
- Use `container stop` in shutdown instead of raw SIGTERM to CLI wrapper,
  ensuring proper container cleanup
- Replace unnecessary dynamic imports with static imports in processTaskIpc
- Guard JSON.parse of DB-stored last_agent_timestamp against corruption
- Validate MAX_CONCURRENT_CONTAINERS (default 5, min 1, NaN-safe)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 16:45:00 +02:00
gavrielc
eac9a6acfd feat: per-group queue, SQLite state, graceful shutdown
Add per-group container locking with global concurrency limit to prevent
concurrent containers for the same group (#89) and cap total containers.
Fix message batching bug where lastAgentTimestamp advanced to trigger
message instead of latest in batch, causing redundant re-processing.
Move router state, sessions, and registered groups from JSON files to
SQLite with automatic one-time migration. Add SIGTERM/SIGINT handlers
with graceful shutdown (SIGTERM -> grace period -> SIGKILL). Add startup
recovery for messages missed during crash. Remove dead code: utils.ts,
Session type, isScheduledTask flag, ContainerConfig.env, getTaskRunLogs,
GroupQueue.isActive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 07:38:07 +02:00