cial/PHASE-5.md
Eliot M 4ec28d1b60 phase(5a): sessions schema + REST CRUD
- protocol: full Zod schemas for sessions/messages with reserved fields for upcoming sub-phases
- back: chat_session + chat_message tables (namespaced to avoid Better-Auth's session table)
- back: SessionsRepository + SessionsService + auth-scoped Express router (POST/GET/PATCH/DELETE /sessions)
- bootstrap: wire migrateSessions
- PHASE-5.md: locked plan for 5a-5e (HarnessProvider abstraction, Claude binary in tenant Dockerfile, dev:tenant copies ~/.claude)
2026-04-26 17:04:37 +00:00

Phase 5 — AI Sessions Engine

Port the legacy Cial sessions system (/app/server/src/core) into cial-core/back and wire it through @cial/protocol, @cial/sdk, and @cial/core-ui. Goal: real Claude chat with streaming, persistence, and survival across server restarts.

Reference architecture deep-dive: see internal scout report (legacy /app/server/src/core/services/{claude,gemini,kimi}-process.ts + core/ws/{stream,chat-handlers,handler-registry,session-handlers}.ts).


Scope decisions (lock these before sub-phase 5a)

| Decision | Choice | Why |
| --- | --- | --- |
| Providers in v1 | Claude only, but behind a HarnessProvider abstraction | Match legacy 80/20; Gemini/Kimi land in 5f/5g by implementing the same interface, with no engine rewrite |
| Tool approval | Auto-approve (--dangerously-skip-permissions) | No approval UI in v1; matches dev autologin trust model |
| Tools available | Built-in Claude CLI tools (Read/Write/Edit/Bash/Glob/Grep) | No MCP in v1 |
| Dataworlds / presets / agent teams / ghost / sandbox / voice / special commands | Deferred | Land core engine first |
| FTS search | Deferred (LIKE fallback for now) | Schema slot reserved |
| Session groups | Deferred | Schema slot reserved |
| Process state file | /core-data/.process-state.json per tenant | Same volume as SQLite |
| CLI working dir | /platform/ per tenant (matches Phase 8 design) | Native dev mode uses process.cwd() of dev-tenant |
| claude binary | Installed in the tenant Dockerfile as a dep (npm i -g @anthropic-ai/claude-code or equivalent); for dev:tenant, host's claude is reused | Self-contained image, no PATH surprises |
| ~/.claude state | Per-tenant volume mount (production); for dev:tenant, scripts/dev-tenant.mjs copies ~/.claude/ from host into the dev-tenant data dir on first boot so Claude is already authed | No re-login per dev session, matches legacy auth flow |
| Schema shape | Mirror legacy column names | Future-proof for migrating real legacy data, easier port |
| WS protocol | Reuse legacy message-type strings verbatim | Lets us copy client code with minimal renaming |

Sub-phase split — each independently shippable

5a — Schema + REST CRUD

Goal: Sessions and messages persist in tenant SQLite, owned by Better-Auth user.

Deliverables

  • Drizzle migration creating two tables (legacy column names):
    • sessions (id, user_id, name, claude_session_id, model, status, created_at, updated_at, streaming_state, archived_at, provider, parent_session_id, effort, cwd, group_id) — + reserved nullable columns for ghost/sandbox/team to avoid future migrations
    • messages (id auto, session_id FK CASCADE, role, content, metadata JSON, created_at)
  • cial-core/back/src/modules/sessions/
    • repository.ts — Drizzle queries
    • service.ts — auth-scoped CRUD
    • index.ts — Express router: POST /, GET /, GET /:id, PATCH /:id (rename), DELETE /:id, GET /:id/messages?before=N&limit=N
  • All routes guarded by Better-Auth session; users can only see their own sessions
  • @cial/protocol/src/sessions.ts — full Zod schemas (Session, Message, Tool, TurnStats, TurnUsage, SubTurn) ported from legacy
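The auth-scoping and pagination rules above can be illustrated with a minimal sketch. It assumes an in-memory store in place of the Drizzle repository, and it assumes `before` is a message-id cursor. `InMemorySessionsService`, its method names, and the row shapes are hypothetical, not the real module API.

```typescript
// Hypothetical row shapes; the real schemas live in @cial/protocol.
interface SessionRow { id: string; userId: string; name: string; }
interface MessageRow { id: number; sessionId: string; role: "user" | "assistant"; content: string; }

class InMemorySessionsService {
  private sessions = new Map<string, SessionRow>();
  private messages: MessageRow[] = [];
  private nextSessionId = 1;
  private nextMessageId = 1;

  create(userId: string, name = "New session"): SessionRow {
    const row: SessionRow = { id: `s${this.nextSessionId++}`, userId, name };
    this.sessions.set(row.id, row);
    return row;
  }

  // Every read is filtered by owner, so two users see disjoint lists.
  list(userId: string): SessionRow[] {
    return [...this.sessions.values()].filter((s) => s.userId === userId);
  }

  // Returns undefined for both "missing" and "not yours"; a router could map either to 404.
  get(userId: string, id: string): SessionRow | undefined {
    const row = this.sessions.get(id);
    return row && row.userId === userId ? row : undefined;
  }

  delete(userId: string, id: string): boolean {
    if (!this.get(userId, id)) return false;
    this.sessions.delete(id);
    this.messages = this.messages.filter((m) => m.sessionId !== id); // mimics FK CASCADE
    return true;
  }

  addMessage(userId: string, sessionId: string, role: MessageRow["role"], content: string): MessageRow | undefined {
    if (!this.get(userId, sessionId)) return undefined;
    const msg: MessageRow = { id: this.nextMessageId++, sessionId, role, content };
    this.messages.push(msg);
    return msg;
  }

  // Assumed semantics for ?before=N&limit=N: the newest `limit` messages with id < before, oldest first.
  history(userId: string, sessionId: string, opts: { before?: number; limit?: number } = {}): MessageRow[] {
    if (!this.get(userId, sessionId)) return [];
    const { before = Infinity, limit = 50 } = opts;
    return this.messages
      .filter((m) => m.sessionId === sessionId && m.id < before)
      .slice(-Math.max(1, limit));
  }
}
```

Putting the ownership check in one `get` helper means every other method inherits it, which is one way to keep the "two users see disjoint session lists" criterion from depending on each route remembering to filter.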

Done when

  • Curl can create → list → fetch messages → delete a session via SSO cookie
  • Two users see disjoint session lists
  • Vitest covers the repository

5b — WS handler registry + session handlers

Goal: Real WS at /ws with auth + multi-tab sync of session list.

Deliverables

  • cial-core/back/src/infrastructure/ws/
    • server.ts: upgrade handler that authenticates the WS via Better-Auth session cookie (reject anonymous)
    • handler-registry.ts: register(type, handler) / dispatch table, ported from the legacy pattern
    • clients.ts: per-user socket pool; broadcastToUser(userId, msg), sessionFocus map
  • cial-core/back/src/modules/sessions/ws-handlers.ts
    • Inbound: session.list, session.create, session.rename, session.delete, session.focus, session.watch, session.unwatch, session.history
    • Outbound: session.created, session.updated, session.deleted, session.history, session.list
  • WS message envelope: { type: string, payload: unknown, requestId?: string } (matches legacy)
  • Heartbeat / ping every 30s
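A minimal sketch of the register/dispatch pattern follows. The envelope shape comes from the plan; the `HandlerRegistry` class, the `Ctx` type, and the `error` reply type are hypothetical and may not match the legacy code.

```typescript
// Envelope shape taken from the plan; everything else here is an illustrative assumption.
interface Envelope { type: string; payload: unknown; requestId?: string; }
interface Ctx { userId: string; send: (msg: Envelope) => void; }
type Handler = (payload: unknown, ctx: Ctx, requestId?: string) => void | Promise<void>;

class HandlerRegistry {
  private handlers = new Map<string, Handler>();

  register(type: string, handler: Handler): void {
    if (this.handlers.has(type)) throw new Error(`duplicate handler: ${type}`);
    this.handlers.set(type, handler);
  }

  // Parses one raw frame and routes it; malformed or unknown frames get an error reply.
  dispatch(raw: string, ctx: Ctx): void | Promise<void> {
    let msg: Envelope;
    try {
      const parsed: unknown = JSON.parse(raw);
      if (typeof parsed !== "object" || parsed === null ||
          typeof (parsed as { type?: unknown }).type !== "string") {
        throw new Error("bad envelope");
      }
      msg = parsed as Envelope;
    } catch {
      ctx.send({ type: "error", payload: { message: "invalid message" } });
      return;
    }
    const handler = this.handlers.get(msg.type);
    if (!handler) {
      ctx.send({ type: "error", payload: { message: `unknown type: ${msg.type}` }, requestId: msg.requestId });
      return;
    }
    return handler(msg.payload, ctx, msg.requestId);
  }
}
```

A module would call `register("session.list", handler)` once at startup, and the WS server would call `dispatch(raw, ctx)` for every inbound frame. Echoing `requestId` back lets a client pair replies with requests.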

Done when

  • Two browser tabs open as same user → create in tab A → list updates in tab B within 100ms
  • Disconnecting one tab doesn't kill events in the other
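Both criteria depend on the per-user socket pool. A rough sketch, assuming a hypothetical `SocketLike` interface in place of a real WebSocket and a hypothetical `ClientPool` class name:

```typescript
// `SocketLike` is a stand-in for a WebSocket; only send() is needed for this sketch.
interface SocketLike { send(data: string): void; }

class ClientPool {
  private byUser = new Map<string, Set<SocketLike>>();

  add(userId: string, ws: SocketLike): void {
    let set = this.byUser.get(userId);
    if (!set) {
      set = new Set();
      this.byUser.set(userId, set);
    }
    set.add(ws);
  }

  remove(userId: string, ws: SocketLike): void {
    const set = this.byUser.get(userId);
    if (!set) return;
    set.delete(ws);
    if (set.size === 0) this.byUser.delete(userId);
  }

  // Sends to every socket of one user; a socket that throws is dropped without affecting the rest.
  broadcastToUser(userId: string, msg: { type: string; payload: unknown }): number {
    const set = this.byUser.get(userId);
    if (!set) return 0;
    const data = JSON.stringify(msg);
    let delivered = 0;
    for (const ws of [...set]) {
      try {
        ws.send(data);
        delivered++;
      } catch {
        this.remove(userId, ws);
      }
    }
    return delivered;
  }
}
```

Because each tab is a separate entry in the set, closing tab A only removes that entry. Tab B keeps receiving `session.created` and similar broadcasts.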

5c — ClaudeProcess (the engine)

Goal: Detached claude CLI processes that survive server restart, streaming events.

Deliverables

  • cial-core/back/src/modules/sessions/process/
    • types.ts: CliProcessEvents interface (init, text, thinking, tool_use, tool_result, usage, rate_limit, compact, question, result, cancelled, timeout, heartbeat, close, error) + provider-agnostic shapes (Tool, TurnUsage, TurnStats, ProcessSnapshot, ToolConfigs)
    • harness-provider.ts: abstract HarnessProvider interface with name: 'claude' | 'gemini' | 'kimi', buildSpawnArgs(opts) → { args, env, cwd }, parseLine(line, ctx) → CliEvent[], hasLocalState(harnessSessionId) → boolean, homeDir(): string. The tail loop, snapshot/restore, lifecycle, and kill flow all live in the generic engine, which calls into the provider only for these provider-specific bits.
    • providers/claude.ts: ClaudeProvider implements HarnessProvider (the only one wired in 5c; Gemini/Kimi later just drop in next to it)
    • cli-process.ts: generic, provider-agnostic class with start(message, isResume, toolConfigs), cancel(), _startTailing(), _processLine(), snapshot/restore
    • process-manager.ts: getOrCreate(sessionId, provider), cancel(sessionId), saveState(), restoreState(), called from instrumentation.ts on boot
    • state-file.ts: atomic write/read of /core-data/.process-state.json (snapshot includes provider field so restore picks the right HarnessProvider impl)
  • Spawn args ported from legacy:
    • --dangerously-skip-permissions, --output-format stream-json, --include-partial-messages, --verbose
    • --resume {claudeSessionId} if local state exists, else --session-id {claudeSessionId}
    • --model, --append-system-prompt (single base prompt for v1)
    • -p {message} last
  • Spawn opts: detached: true, stdio: ['ignore', fd, fd], unref()
  • Log file path: /core-data/process-logs/{sessionId}-{ts}.jsonl
  • 500ms tail poll, heartbeat 5s, inactivity timeout 24h
  • Cancel: process.kill(-pid, 'SIGTERM') then SIGKILL after 5s
  • Snapshot fields: pid, sessionId, claudeSessionId, model, logFile, byteOffset, accumulatedText, accumulatedThinking, collectedTools, latestUsage, turnStats, resultEmitted
  • Restore on boot: kill -0 pid check → resume tailing OR drain remaining bytes for trailing result event
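The spawn-argument rules listed above could be assembled along these lines. The flags are the ones named in the plan; the `SpawnOpts` field names and the `buildClaudeArgs` function are hypothetical and only illustrate the ordering.

```typescript
// Hypothetical option names; the flags themselves are the ones listed in the plan.
interface SpawnOpts {
  message: string;
  claudeSessionId: string;
  hasLocalState: boolean; // true when the provider reports existing local state for this id
  model?: string;
  systemPrompt?: string;
}

function buildClaudeArgs(opts: SpawnOpts): string[] {
  const args = [
    "--dangerously-skip-permissions",
    "--output-format", "stream-json",
    "--include-partial-messages",
    "--verbose",
    // Resume when local state exists, otherwise pin the id for a fresh session.
    opts.hasLocalState ? "--resume" : "--session-id", opts.claudeSessionId,
  ];
  if (opts.model) args.push("--model", opts.model);
  if (opts.systemPrompt) args.push("--append-system-prompt", opts.systemPrompt);
  args.push("-p", opts.message); // the prompt goes last, as in the plan
  return args;
}
```

The result is a plain string array, so it can be passed to `child_process.spawn` with the `detached: true` and `stdio: ['ignore', fd, fd]` options above, and the ordering can be unit-tested without launching the CLI.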

Done when

  • Vitest spec spawns a real claude process with a stub message and captures init → text (partial + final) → result events
  • Kill the test runner mid-stream, restart, recover the same process by PID, see remaining events through result
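Recovery depends on two small pieces: a liveness probe for the saved PID and a reader that resumes the log at the saved byte offset. A sketch using synchronous Node `fs` calls for brevity; `isAlive`, `readNewLines`, and the demo file names are hypothetical.

```typescript
import { appendFileSync, closeSync, fstatSync, mkdtempSync, openSync, readSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Signal 0 only probes for existence; EPERM means the process exists but belongs to another user.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Reads complete lines appended since byteOffset; a trailing partial line is left for the next poll.
function readNewLines(logFile: string, byteOffset: number): { lines: string[]; byteOffset: number } {
  const fd = openSync(logFile, "r");
  try {
    const size = fstatSync(fd).size;
    if (size <= byteOffset) return { lines: [], byteOffset };
    const buf = Buffer.alloc(size - byteOffset);
    const read = readSync(fd, buf, 0, buf.length, byteOffset);
    const chunk = buf.subarray(0, read);
    const lastNewline = chunk.lastIndexOf(0x0a);
    if (lastNewline === -1) return { lines: [], byteOffset };
    const lines = chunk.subarray(0, lastNewline).toString("utf8").split("\n").filter((l) => l.length > 0);
    return { lines, byteOffset: byteOffset + lastNewline + 1 };
  } finally {
    closeSync(fd);
  }
}

// Demo: the first poll sees one full line plus a partial one; the second poll completes it.
const demoDir = mkdtempSync(join(tmpdir(), "tail-demo-"));
const demoLog = join(demoDir, "session.jsonl");
writeFileSync(demoLog, '{"type":"init"}\n{"type":"te');
const firstPoll = readNewLines(demoLog, 0);
appendFileSync(demoLog, 'xt"}\n');
const secondPoll = readNewLines(demoLog, firstPoll.byteOffset);
rmSync(demoDir, { recursive: true, force: true });
```

The offset advances only past the last newline, so a snapshot taken mid-line never causes a half-written JSON record to be parsed after restore.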

5d — wireProcessToSocket + chat lifecycle

Goal: WS-driven chat — send a message, watch tokens stream in, get the final result persisted.

Deliverables

  • cial-core/back/src/modules/sessions/ws-stream.ts
    • wireProcessToSocket(proc, ws, sessionId) — subscribes to all CliProcessEvents, broadcasts as stream.* messages to focused/watched sockets
    • rewireProcessToSocket(proc, ws, sessionId) — replays accumulatedText, accumulatedThinking, collectedTools, latestUsage to a newly connecting socket
    • 500ms debounced streamingState save to sessions.streaming_state for UI recovery
  • cial-core/back/src/modules/sessions/chat-handlers.ts
    • message {sessionId, content} — guard on busy, INSERT user message, processManager.getOrCreate, wireProcessToSocket, proc.start(...)
    • cancel {sessionId} → processManager.cancel
    • answer {sessionId, content} — for interactive question events
  • Outbound stream.* types verbatim from legacy:
    • stream.init, stream.text, stream.thinking, stream.tool_use, stream.tool_result, stream.usage, stream.rate_limit, stream.heartbeat, stream.compact, stream.done, stream.cancelled, stream.error
    • question (interactive ask)
    • session.status (idle/busy/error broadcast)
  • On result: INSERT assistant message with metadata = {tools, stats, thinking, turnHistory}, clear streaming_state, status = 'idle'
  • On cancelled: INSERT assistant message with _(cancelled)_ suffix
  • On close with non-zero code without prior result: status error, INSERT crash placeholder
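The three terminal cases above can be expressed as one pure function that decides what to write, which keeps the persistence rules testable without a live process. The event shapes, the `finalizeTurn` name, the crash placeholder wording, and the `idle` status after cancel are assumptions for illustration only.

```typescript
// Hypothetical event shapes; field names follow the plan where it names them.
type TerminalEvent =
  | { kind: "result"; text: string; tools: unknown[]; stats: unknown; thinking: string; turnHistory: unknown[] }
  | { kind: "cancelled"; partialText: string }
  | { kind: "close"; code: number | null; resultEmitted: boolean };

interface Finalization {
  status: "idle" | "error";
  clearStreamingState: boolean;
  assistantMessage: { content: string; metadata?: Record<string, unknown> };
}

// Returns the DB writes for a terminal event, or undefined when nothing needs persisting.
function finalizeTurn(ev: TerminalEvent): Finalization | undefined {
  switch (ev.kind) {
    case "result":
      return {
        status: "idle",
        clearStreamingState: true,
        assistantMessage: {
          content: ev.text,
          metadata: { tools: ev.tools, stats: ev.stats, thinking: ev.thinking, turnHistory: ev.turnHistory },
        },
      };
    case "cancelled":
      // Assumption: a cancelled turn returns the session to idle.
      return {
        status: "idle",
        clearStreamingState: true,
        assistantMessage: { content: `${ev.partialText} _(cancelled)_`.trim() },
      };
    case "close":
      // A clean exit, or an exit after a result was already handled, needs no extra write.
      if (ev.resultEmitted || ev.code === 0) return undefined;
      return {
        status: "error",
        clearStreamingState: true,
        assistantMessage: { content: "_(process exited unexpectedly)_" }, // placeholder wording is hypothetical
      };
  }
}
```

A chat handler could apply the returned object inside one transaction: insert the assistant message, clear `streaming_state`, set the status, then broadcast `session.status`.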

Done when

  • Send a message from a curl WS test, see streaming text events, get a final assistant message in DB
  • Cancel mid-stream → _(cancelled)_ message persisted
  • Restart server mid-stream → reconnect WS → see remaining events + final result

5e — Wire UI + SDK

Goal: The polished AppShell talks to the real backend.

Deliverables

  • @cial/sdk/src/modules/chat.ts — real subscribe(sessionId, onMsg) and send(sessionId, content) over WS
  • @cial/sdk/src/modules/sessions.ts — REST helpers: list(), create(name?), delete(id), rename(id, name), history(id, opts?)
  • @cial/sdk/src/ws-client.ts — auto-reconnect with exponential backoff (max 10s), outbound queue during outage, replay on reconnect, typed message dispatch
  • @cial/core-ui/src/store.ts — replace localStorage mock with real implementation:
    • On mount: sdk.sessions.list() → seed
    • On session.focus(id): sdk.sessions.history(id) → load messages
    • WS dispatcher: stream.text → append delta to streamingBySession[id], stream.tool_use → add to streaming.tools, stream.done → finalize message + clear streaming, session.status → update session row
    • sendMessage(text) → sdk.chat.send(activeId, text)
    • cancel() → new method, sends cancel WS message
  • @cial/core-ui/src/SessionView.tsx — render streaming text live (typing animation), render tool calls inline (collapsed, expandable), show "stop" button while busy
  • Reconnect indicator in sidebar (small dot color: green/yellow/red)
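The reconnect behaviour of ws-client.ts has two parts: a capped exponential delay and a queue that holds frames while the socket is down. A sketch, assuming a 500ms base delay (the plan fixes only the 10s cap); `backoffDelay` and `OutboundQueue` are hypothetical names.

```typescript
// Delay before reconnect attempt n (0-based): doubles from baseMs and is capped at maxMs.
// The 500ms base is an assumption; the 10s cap comes from the plan.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

class OutboundQueue {
  private pending: string[] = [];
  private connected = false;
  private transmit: (frame: string) => void;

  constructor(transmit: (frame: string) => void) {
    this.transmit = transmit;
  }

  // While disconnected, frames are buffered instead of being dropped.
  send(frame: string): void {
    if (this.connected) this.transmit(frame);
    else this.pending.push(frame);
  }

  // On (re)connect, buffered frames are replayed in their original order.
  onOpen(): void {
    this.connected = true;
    const queued = this.pending;
    this.pending = [];
    for (const frame of queued) this.transmit(frame);
  }

  onClose(): void {
    this.connected = false;
  }
}
```

A real client would also reset the attempt counter after a successful open and might add jitter to the delay. Both are omitted here to keep the sketch short.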

Done when

  • pnpm dev:tenant → log in via autologin → type "Say hi in 3 words" → see streaming tokens → message persists → reload page → message is still there
  • Mid-stream, kill the dev-tenant process, restart → page reconnects → streaming continues from where it was
  • Stop button cancels mid-stream cleanly
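The store's WS dispatcher can be modelled as a pure reducer, which makes the streaming and finalize steps easy to check in isolation. The payload field names (`delta`, `tool`, `status`) and the state shape are assumptions; only the message-type strings come from the plan.

```typescript
// Hypothetical state and payload shapes for illustration.
interface ToolCall { id: string; name: string; }
interface StreamingState { text: string; tools: ToolCall[]; }
interface ChatMessage { role: "user" | "assistant"; content: string; }
type Status = "idle" | "busy" | "error";

interface StoreState {
  streamingBySession: Record<string, StreamingState>;
  messagesBySession: Record<string, ChatMessage[]>;
  statusBySession: Record<string, Status>;
}

type StreamMsg =
  | { type: "stream.text"; sessionId: string; delta: string }
  | { type: "stream.tool_use"; sessionId: string; tool: ToolCall }
  | { type: "stream.done"; sessionId: string }
  | { type: "session.status"; sessionId: string; status: Status };

function reduce(state: StoreState, msg: StreamMsg): StoreState {
  const id = msg.sessionId;
  const current: StreamingState = state.streamingBySession[id] ?? { text: "", tools: [] };
  switch (msg.type) {
    case "stream.text": // append the delta to the in-flight text
      return { ...state, streamingBySession: { ...state.streamingBySession, [id]: { ...current, text: current.text + msg.delta } } };
    case "stream.tool_use": // record the tool call alongside the text
      return { ...state, streamingBySession: { ...state.streamingBySession, [id]: { ...current, tools: [...current.tools, msg.tool] } } };
    case "stream.done": { // move the streamed text into the message list and clear streaming
      const { [id]: _finished, ...rest } = state.streamingBySession;
      const prior = state.messagesBySession[id] ?? [];
      const finalMessage: ChatMessage = { role: "assistant", content: current.text };
      return { ...state, streamingBySession: rest, messagesBySession: { ...state.messagesBySession, [id]: [...prior, finalMessage] } };
    }
    case "session.status":
      return { ...state, statusBySession: { ...state.statusBySession, [id]: msg.status } };
  }
}
```

Keying the streaming buffer by session id means background sessions keep accumulating text while another session is focused.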

Non-goals for Phase 5 (parking lot)

  • Multi-provider (Gemini → 5f, Kimi → 5g)
  • Tool approval UI (5h)
  • MCP tool plumbing (Phase 4 work — vault feeds MCP env)
  • Dataworld/preset injection (separate phase)
  • Agent teams, ghost mode, sandboxing
  • Voice input/transcribe
  • Special commands (/orchestrate, /crm, etc.)
  • Account rotation
  • FTS5 + search
  • Session groups (sidebar folders)
  • Branching / forking sessions
  • Cost dashboards (data captured, not displayed)

Risks & mitigations

| Risk | Mitigation |
| --- | --- |
| claude binary not on PATH inside container | 5c also lands the Dockerfile change in cial-core/docker/ to install @anthropic-ai/claude-code globally; dev:tenant fails loud with install instructions if host claude missing |
| Per-tenant ~/.claude state collisions across containers | Each tenant has its own ~/.claude mounted from its volume; for dev:tenant, scripts/dev-tenant.mjs copies ~/.claude/ (auth tokens, settings) from host into ./.dev-tenant/claude-home/ on first boot only (re-copy if --reset-claude flag) and exports HOME to point there for the spawned children |
| Process state file race (two boots overlap) | Atomic write (.tmp + rename) + read-then-delete-on-startup pattern from legacy |
| Tail loop CPU cost | 500ms poll matches legacy; can switch to fs.watch later if needed |
| WS auth on upgrade is awkward with Better-Auth | Read session cookie from upgrade headers, call auth.api.getSession({ headers }) before accepting |
| Streaming state debounce drops final 500ms on crash | Same as legacy: the UI re-fetches session.history on reconnect and gets authoritative DB state |
| --dangerously-skip-permissions running tools that modify the FS | v1 cwd is /platform/; that IS the editable tenant code by design (matches legacy where Claude edits /app) |
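The state-file mitigation (atomic write plus read-then-delete) might look roughly like this. It uses synchronous Node `fs` calls for brevity; `saveState`, `takeState`, and the demo snapshot fields are illustrative names, not the legacy implementation.

```typescript
import { mkdtempSync, readFileSync, renameSync, rmSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Write to a sibling .tmp file, then rename over the target so readers never see a partial file.
function saveState(path: string, snapshots: unknown[]): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(snapshots));
  renameSync(tmp, path); // rename within one directory is atomic on POSIX filesystems
}

// Read-then-delete: an overlapping second boot finds no file and restores nothing.
function takeState(path: string): unknown[] {
  let raw: string;
  try {
    raw = readFileSync(path, "utf8");
  } catch {
    return []; // no state file: nothing to restore
  }
  try {
    unlinkSync(path);
  } catch {
    // another boot may have removed it already; the content read above is still usable
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return []; // corrupt file: treat as empty rather than crash the boot
  }
}

// Demo in a temp dir standing in for /core-data.
const stateDir = mkdtempSync(join(tmpdir(), "state-demo-"));
const statePath = join(stateDir, ".process-state.json");
saveState(statePath, [{ pid: 1234, sessionId: "s1", provider: "claude" }]);
const firstBoot = takeState(statePath);
const secondBoot = takeState(statePath);
rmSync(stateDir, { recursive: true, force: true });
```

Read-then-delete narrows the race window but does not close it: two boots that both read before either unlinks would still both restore. A stricter variant could rename the file to a boot-specific name before reading.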

Working agreement (from PLAN.md)

Each sub-phase: implement → self-test against the "Done when" criteria → commit phase(5x): <summary> → push → notify Eliot for sign-off → next sub-phase.