# Phase 5 — AI Sessions Engine
Port the legacy Cial sessions system (`/app/server/src/core`) into `cial-core/back` and wire it through `@cial/protocol`, `@cial/sdk`, and `@cial/core-ui`. Goal: real Claude chat with streaming, persistence, and survival across server restarts.
Reference architecture deep-dive: see the internal scout report (legacy `/app/server/src/core/services/{claude,gemini,kimi}-process.ts` + `core/ws/{stream,chat-handlers,handler-registry,session-handlers}.ts`).
## Scope decisions (lock these before sub-phase 5a)
| Decision | Choice | Why |
|---|---|---|
| Providers in v1 | Claude only, but behind a `HarnessProvider` abstraction | Match legacy 80/20; Gemini/Kimi land in 5f/5g by implementing the same interface — no engine rewrite |
| Tool approval | Auto-approve (`--dangerously-skip-permissions`) | No approval UI in v1; matches the dev autologin trust model |
| Tools available | Built-in Claude CLI tools (Read/Write/Edit/Bash/Glob/Grep) | No MCP in v1 |
| Dataworlds / presets / agent teams / ghost / sandbox / voice / special commands | Deferred | Land the core engine first |
| FTS search | Deferred (LIKE fallback for now) | Schema slot reserved |
| Session groups | Deferred | Schema slot reserved |
| Process state file | `/core-data/.process-state.json` per tenant | Same volume as SQLite |
| CLI working dir | `/platform/` per tenant (matches Phase 8 design) | Native dev mode uses `process.cwd()` of dev-tenant |
| `claude` binary | Installed in the tenant Dockerfile as a dep (`npm i -g @anthropic-ai/claude-code` or equivalent); for `dev:tenant`, the host's `claude` is reused | Self-contained image, no PATH surprises |
| `~/.claude` state | Per-tenant volume mount (production); for `dev:tenant`, `scripts/dev-tenant.mjs` copies `~/.claude/` from the host into the dev-tenant data dir on first boot so Claude is already authed | No re-login per dev session; matches the legacy auth flow |
| Schema shape | Mirror legacy column names | Future-proofs migrating real legacy data; easier port |
| WS protocol | Reuse legacy message-type strings verbatim | Lets us copy client code with minimal renaming |
## Sub-phase split — each independently shippable
### 5a — Schema + REST CRUD

**Goal:** Sessions and messages persist in tenant SQLite, owned by the Better-Auth user.

**Deliverables**
- Drizzle migration creating two tables (legacy column names):
  - `sessions(id, user_id, name, claude_session_id, model, status, created_at, updated_at, streaming_state, archived_at, provider, parent_session_id, effort, cwd, group_id)` — plus reserved nullable columns for ghost/sandbox/team to avoid future migrations
  - `messages(id auto, session_id FK CASCADE, role, content, metadata JSON, created_at)`
- `cial-core/back/src/modules/sessions/`:
  - `repository.ts` — Drizzle queries
  - `service.ts` — auth-scoped CRUD
  - `index.ts` — Express router: `POST /`, `GET /`, `GET /:id`, `PATCH /:id` (rename), `DELETE /:id`, `GET /:id/messages?before=N&limit=N`
- All routes guarded by the Better-Auth session; users can only see their own sessions
- `@cial/protocol/src/sessions.ts` — full Zod schemas (Session, Message, Tool, TurnStats, TurnUsage, SubTurn) ported from legacy
**Done when**
- Curl can create → list → fetch messages → delete a session via SSO cookie
- Two users see disjoint session lists
- Vitest covers the repository
### 5b — WS handler registry + session handlers

**Goal:** Real WS at `/ws` with auth + multi-tab sync of the session list.

**Deliverables**
- `cial-core/back/src/infrastructure/ws/`:
  - `server.ts` — upgrade handler that authenticates the WS via the Better-Auth session cookie (reject anonymous)
  - `handler-registry.ts` — `register(type, handler)` / dispatch table, ported from the legacy pattern
  - `clients.ts` — per-user socket pool; `broadcastToUser(userId, msg)`, `sessionFocus` map
- `cial-core/back/src/modules/sessions/ws-handlers.ts`:
  - Inbound: `session.list`, `session.create`, `session.rename`, `session.delete`, `session.focus`, `session.watch`, `session.unwatch`, `session.history`
  - Outbound: `session.created`, `session.updated`, `session.deleted`, `session.history`, `session.list`
- WS message envelope: `{ type: string, payload: unknown, requestId?: string }` (matches legacy)
- Heartbeat / ping every 30s
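The registry plus envelope can be sketched as a plain dispatch table keyed by the envelope's `type` string. Handler signature and error behavior here are assumptions; the real module is ported from the legacy pattern.

```typescript
// Sketch of handler-registry.ts: a Map from message-type string to handler.
type Envelope = { type: string; payload: unknown; requestId?: string };
type Handler = (payload: unknown, requestId?: string) => unknown;

const handlers = new Map<string, Handler>();

export function register(type: string, handler: Handler): void {
  handlers.set(type, handler);
}

// Called with the raw text of an incoming WS frame.
export function dispatch(raw: string): unknown {
  const msg = JSON.parse(raw) as Envelope;
  const handler = handlers.get(msg.type);
  if (!handler) throw new Error(`unknown message type: ${msg.type}`);
  return handler(msg.payload, msg.requestId);
}
```

Because the legacy message-type strings are reused verbatim, the same table serves 5b's `session.*` handlers and 5d's `message`/`cancel`/`answer` handlers.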
**Done when**
- Two browser tabs open as same user → create in tab A → list updates in tab B within 100ms
- Disconnecting one tab doesn't kill events in the other
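The multi-tab behavior above falls out of the per-user socket pool in `clients.ts`; a minimal sketch (function names assumed) where one user maps to a set of sockets, so a broadcast fans out to every open tab and removing one socket leaves the others live:

```typescript
// Sketch of clients.ts: per-user socket pool with fan-out broadcast.
interface SocketLike { send(data: string): void }

const pool = new Map<string, Set<SocketLike>>();

export function addClient(userId: string, ws: SocketLike): void {
  if (!pool.has(userId)) pool.set(userId, new Set());
  pool.get(userId)!.add(ws);
}

export function removeClient(userId: string, ws: SocketLike): void {
  pool.get(userId)?.delete(ws); // other tabs keep receiving events
}

export function broadcastToUser(userId: string, msg: object): void {
  const data = JSON.stringify(msg);
  for (const ws of pool.get(userId) ?? []) ws.send(data);
}
```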
### 5c — ClaudeProcess (the engine)

**Goal:** Detached `claude` CLI processes that survive server restart, streaming events.

**Deliverables**
- `cial-core/back/src/modules/sessions/process/`:
  - `types.ts` — `CliProcessEvents` interface (init, text, thinking, tool_use, tool_result, usage, rate_limit, compact, question, result, cancelled, timeout, heartbeat, close, error) + provider-agnostic shapes (`Tool`, `TurnUsage`, `TurnStats`, `ProcessSnapshot`, `ToolConfigs`)
  - `harness-provider.ts` — abstract `HarnessProvider` interface: `name: 'claude' | 'gemini' | 'kimi'`, `buildSpawnArgs(opts) → { args, env, cwd }`, `parseLine(line, ctx) → CliEvent[]`, `hasLocalState(harnessSessionId) → boolean`, `homeDir(): string`. The tail loop, snapshot/restore, lifecycle, and kill flow all live in the generic engine and call into the provider only for these provider-specific bits.
  - `providers/claude.ts` — `ClaudeProvider implements HarnessProvider` (the only one wired in 5c; Gemini/Kimi later just drop in next to it)
  - `cli-process.ts` — generic class with `start(message, isResume, toolConfigs)`, `cancel()`, `_startTailing()`, `_processLine()`, snapshot/restore — provider-agnostic
  - `process-manager.ts` — `getOrCreate(sessionId, provider)`, `cancel(sessionId)`, `saveState()`, `restoreState()` — called from `instrumentation.ts` on boot
  - `state-file.ts` — atomic write/read of `/core-data/.process-state.json` (snapshot includes a `provider` field so restore picks the right `HarnessProvider` impl)
- Spawn args ported from legacy:
  - `--dangerously-skip-permissions`, `--output-format stream-json`, `--include-partial-messages`, `--verbose`
  - `--resume {claudeSessionId}` if local state exists, else `--session-id {claudeSessionId}`
  - `--model`, `--append-system-prompt` (single base prompt for v1)
  - `-p {message}` last
- Spawn opts: `detached: true`, `stdio: ['ignore', fd, fd]`, `unref()`
- Log file path: `/core-data/process-logs/{sessionId}-{ts}.jsonl`
- 500ms tail poll, heartbeat 5s, inactivity timeout 24h
- Cancel: `process.kill(-pid, 'SIGTERM')`, then `SIGKILL` after 5s
- Snapshot fields: pid, sessionId, claudeSessionId, model, logFile, byteOffset, accumulatedText, accumulatedThinking, collectedTools, latestUsage, turnStats, resultEmitted
- Restore on boot: `kill -0 pid` check → resume tailing OR drain remaining bytes for the trailing `result` event
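The spawn-arg rules above could be sketched as the `buildSpawnArgs` half of `ClaudeProvider`. The option-bag shape is an assumption; only the flags themselves and their ordering come from the plan:

```typescript
// Sketch of ClaudeProvider.buildSpawnArgs (opts shape assumed).
interface SpawnOpts {
  message: string;
  claudeSessionId: string;
  model: string;
  isResume: boolean;      // true when hasLocalState() found ~/.claude state
  systemPrompt?: string;  // single base prompt in v1
}

export function buildSpawnArgs(o: SpawnOpts): string[] {
  const args = [
    "--dangerously-skip-permissions",
    "--output-format", "stream-json",
    "--include-partial-messages",
    "--verbose",
  ];
  // Resume when harness-local state exists, else pin a fresh session id.
  args.push(o.isResume ? "--resume" : "--session-id", o.claudeSessionId);
  args.push("--model", o.model);
  if (o.systemPrompt) args.push("--append-system-prompt", o.systemPrompt);
  args.push("-p", o.message); // message goes last
  return args;
}
```

The generic engine would then pass these to `spawn` with `detached: true` and both stdio fds pointed at the JSONL log file, so the child keeps writing after the server dies.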
**Done when**

- Vitest spec spawns a real `claude` process with a stub message, captures `init` → `text` (partial + final) → `result` events
- Kill the test runner mid-stream, restart, recover the same process by PID, see the remaining events through `result`
### 5d — wireProcessToSocket + chat lifecycle

**Goal:** WS-driven chat — send a message, watch tokens stream in, get the final result persisted.

**Deliverables**
- `cial-core/back/src/modules/sessions/ws-stream.ts`:
  - `wireProcessToSocket(proc, ws, sessionId)` — subscribes to all CliProcessEvents, broadcasts them as `stream.*` messages to focused/watched sockets
  - `rewireProcessToSocket(proc, ws, sessionId)` — replays `accumulatedText`, `accumulatedThinking`, `collectedTools`, `latestUsage` to a newly connecting socket
  - 500ms debounced `streamingState` save to `sessions.streaming_state` for UI recovery
- `cial-core/back/src/modules/sessions/chat-handlers.ts`:
  - `message {sessionId, content}` — guard on busy, INSERT the user message, `processManager.getOrCreate`, `wireProcessToSocket`, `proc.start(...)`
  - `cancel {sessionId}` — `processManager.cancel`
  - `answer {sessionId, content}` — for interactive `question` events
- Outbound `stream.*` types verbatim from legacy: `stream.init`, `stream.text`, `stream.thinking`, `stream.tool_use`, `stream.tool_result`, `stream.usage`, `stream.rate_limit`, `stream.heartbeat`, `stream.compact`, `stream.done`, `stream.cancelled`, `stream.error`; plus `question` (interactive ask) and `session.status` (idle/busy/error broadcast)
- On `result`: INSERT the assistant message with `metadata = {tools, stats, thinking, turnHistory}`, clear `streaming_state`, `status = 'idle'`
- On `cancelled`: INSERT the assistant message with an `_(cancelled)_` suffix
- On `close` with a non-zero code and no prior `result`: status `error`, INSERT a crash placeholder
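The three terminal cases above reduce to a small pure function, sketched here (type and field names are assumptions) as the mapping from a terminal process event to the assistant row and session status to persist:

```typescript
// Sketch of the turn-finalization rules in chat-handlers.ts.
type Terminal =
  | { kind: "result"; text: string; tools: unknown[]; stats: unknown }
  | { kind: "cancelled"; text: string }
  | { kind: "close"; code: number };

export function finalizeTurn(ev: Terminal): {
  status: "idle" | "error";
  content: string;
  metadata: Record<string, unknown> | null;
} {
  switch (ev.kind) {
    case "result":
      // Happy path: persist the full turn, session returns to idle.
      return { status: "idle", content: ev.text, metadata: { tools: ev.tools, stats: ev.stats } };
    case "cancelled":
      // User-initiated stop: keep partial text with the legacy suffix.
      return { status: "idle", content: `${ev.text} _(cancelled)_`, metadata: null };
    case "close":
      // Non-zero exit without a prior result → crash placeholder.
      return {
        status: ev.code === 0 ? "idle" : "error",
        content: "_(process exited unexpectedly)_",
        metadata: null,
      };
  }
}
```

Keeping this pure makes the 5d "Done when" cases straightforward to unit-test without spawning a real process.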
**Done when**

- Send a message from a curl WS test, see streaming text events, get a final assistant message in the DB
- Cancel mid-stream → `_(cancelled)_` message persisted
- Restart the server mid-stream → reconnect WS → see the remaining events + the final result
### 5e — Wire UI + SDK

**Goal:** The polished AppShell talks to the real backend.

**Deliverables**
- `@cial/sdk/src/modules/chat.ts` — real `subscribe(sessionId, onMsg)` and `send(sessionId, content)` over WS
- `@cial/sdk/src/modules/sessions.ts` — REST helpers: `list()`, `create(name?)`, `delete(id)`, `rename(id, name)`, `history(id, opts?)`
- `@cial/sdk/src/ws-client.ts` — auto-reconnect with exponential backoff (max 10s), outbound queue during outage, replay on reconnect, typed message dispatch
- `@cial/core-ui/src/store.ts` — replace the localStorage mock with the real implementation:
  - On mount: `sdk.sessions.list()` → seed
  - On `session.focus(id)`: `sdk.sessions.history(id)` → load messages
  - WS dispatcher: `stream.text` → append delta to `streamingBySession[id]`, `stream.tool_use` → add to streaming.tools, `stream.done` → finalize message + clear streaming, `session.status` → update session row
  - `sendMessage(text)` → `sdk.chat.send(activeId, text)`
  - `cancel()` → new method, sends the `cancel` WS message
- `@cial/core-ui/src/SessionView.tsx` — render streaming text live (typing animation), render tool calls inline (collapsed, expandable), show a "stop" button while busy
- Reconnect indicator in the sidebar (small dot: green/yellow/red)
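The ws-client's reconnect delay can be sketched as a one-liner; the 10s cap comes from the plan, while the base delay and doubling factor are assumptions:

```typescript
// Sketch of the ws-client backoff: delay doubles per attempt, capped at 10s.
// baseMs is an assumed starting value, only the maxMs cap is from the plan.
export function reconnectDelayMs(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

While disconnected, outbound messages go into the queue; on reconnect the client replays the queue and the store re-fetches `session.history` to resync, so a dropped socket never loses user input.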
**Done when**

- `pnpm dev:tenant` → log in via autologin → type "Say hi in 3 words" → see streaming tokens → message persists → reload the page → message is still there
- Mid-stream, kill the dev-tenant process, restart → page reconnects → streaming continues from where it was
- Stop button cancels mid-stream cleanly
## Non-goals for Phase 5 (parking lot)
- Multi-provider (Gemini → 5f, Kimi → 5g)
- Tool approval UI (5h)
- MCP tool plumbing (Phase 4 work — vault feeds MCP env)
- Dataworld/preset injection (separate phase)
- Agent teams, ghost mode, sandboxing
- Voice input/transcribe
- Special commands (`/orchestrate`, `/crm`, etc.)
- Account rotation
- FTS5 + search
- Session groups (sidebar folders)
- Branching / forking sessions
- Cost dashboards (data captured, not displayed)
## Risks & mitigations
| Risk | Mitigation |
|---|---|
| `claude` binary not on PATH inside the container | 5c also lands the Dockerfile change in `cial-core/docker/` to install `@anthropic-ai/claude-code` globally; `dev:tenant` fails loud with install instructions if the host `claude` is missing |
| Per-tenant `~/.claude` state collisions across containers | Each tenant has its own `~/.claude` mounted from its volume; for `dev:tenant`, `scripts/dev-tenant.mjs` copies `~/.claude/` (auth tokens, settings) from the host into `./.dev-tenant/claude-home/` on first boot only (re-copy with `--reset-claude`) and exports `HOME` to point there for spawned children |
| Process state file race (two boots overlap) | Atomic write (`.tmp` + rename) + the read-then-delete-on-startup pattern from legacy |
| Tail loop CPU cost | 500ms poll matches legacy; can switch to `fs.watch` later if needed |
| WS auth on upgrade is awkward with Better-Auth | Read the session cookie from the upgrade headers, call `auth.api.getSession({ headers })` before accepting |
| Streaming-state debounce drops the final 500ms on crash | Same as legacy — the UI re-fetches `session.history` on reconnect and gets the authoritative DB state |
| `--dangerously-skip-permissions` running tools that modify the FS | v1 cwd is `/platform/`; that IS the editable tenant code by design (matches legacy, where Claude edits `/app`) |
## Working agreement (from PLAN.md)

Each sub-phase: implement → self-test against the "Done when" criteria → commit `phase(5x): <summary>` → push → notify Eliot for sign-off → next sub-phase.