cial/PHASE-6.md
Eliot M c50cc2b5fb refactor(layout): consolidate workspace under /cial — core/, platform/, app/
Reorganize the dev/prod tenant container so the agent runs in the monorepo
root with a clear, semantic directory tree:

  /cial/core/      — runtime (back, front, edge, ui, sdk, protocol, scripts,
                     docker). Locked down to the cial linux user (mode 0700
                     in prod; :ro bind mount in restricted dev).
  /cial/platform/  — agent-editable surface (back, front).
  /cial/app/       — App control plane sources, present in workspace but
                     never built or run inside the tenant container.
  /cial/docs/      — architecture + ops reference.
  /cial/.claude/   — project skills/agents/commands (symlinked into the
                     harness HOME by the dev entrypoint).
  /cial/data/      — persistent state (sqlite, deploy-logs, agent home).

Concrete changes:
- git mv cial-core → core, cial-platform → platform, cial-app → app,
  scripts → core/scripts.
- pnpm-workspace.yaml: packages now core/*, platform/*, app/*.
- Bulk path rewrites across 250+ source / docker / docs files.
- core/scripts/dev-tenant.mjs: ROOT path fix, rw mount of repo + ro
  overlay of /cial/core when --unrestricted is not set (FS-level
  trust boundary, defense in depth).
- core/edge/src/supervisor.{ts,dev.ts}: cwd + CLAUDE_HOME relocated to
  /cial/data/home; agent runs from /cial root so skill discovery picks
  up /cial/.claude/skills automatically.
- core/back providers/claude.ts: HOME defaults to /cial/data/home, cwd
  defaults to /cial.
- core/docker/{Dockerfile,Dockerfile.dev,dev-entrypoint.sh}: COPY +
  WORKDIR + ENTRYPOINT updated; .claude → harness symlink.
- app/docker/{Dockerfile,Dockerfile.router}: COPY core, COPY app
  (instead of cial-core / cial-app).
- New docs/file-structure.md — single canonical map of the runtime
  layout. cial:self-edit SKILL.md mandates reading it first.
- cial:build SKILL.md: scope notes updated to platform/* and core/*.
- root package.json: smoke / dev:tenant scripts now under core/scripts/.
- core/scripts/smoke.mjs: cial-core.db → cial.db.

Externals preserved as-is by intent:
- JWT issuer string 'cial-app' in core/back/src/modules/sso/index.ts +
  app/api/src/lib/sso.ts is an external contract — NOT renamed.
- @cial/back / @cial/edge / @cial/protocol / @cial/sdk / @cial/front
  package names kept stable to minimize blast radius.

Verified:
- pnpm install --prod=false → ok
- turbo run build for protocol, sdk, back, edge, front, platform-back,
  platform-front → all 7 successful (Next builds + tsc clean).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 13:04:45 +00:00


Phase 6 — Self-Editable Agent (Deploy Controller + Git Slice)

Make the in-container agent able to ship its own edits: build, restart, stream logs, and roll back. This is the same workflow that /app Cial uses on itself.

This phase implements PLAN.md §Phase 6 (deploy controller — fast vs stable) and pulls forward the commit / rollback slice of PLAN.md §Phase 4 (git engine), because rollback without commits per turn is useless.

Reference: legacy /app/server/src/api/deploy.ts + /app/server/src/core/lib/git/*.


What "self-editable" means today vs after this phase

| Capability | Today (post-5e) | After Phase 6 |
|---|---|---|
| Agent reads / writes files in `/platform/` | yes, via Claude built-in tools (cwd is `/platform/`) | yes |
| Agent runs build to surface compile errors | no | `POST /.cial/api/deploy` |
| Platform-front / platform-back pick up changes | cold restart of whole container | `POST /.cial/api/deploy/restart` (only platform children bounce) |
| Build/restart logs visible in the chat | no | WS `deploy.*` events |
| Per-turn auto-commit on `instance/<tenant>` branch | no | yes |
| One-click rollback | no | `POST /.cial/api/git/rollback` |
| Fast (HMR) mode for editing UX | no | opt-in per tenant |

Net: the agent gains a real edit → build → restart → verify loop inside the tenant container, with rollback as a safety net. Same shape as the /api/deploy + /api/deploy/restart pair on prod Cial.


Scope decisions (lock these before sub-phase 6a)

| Decision | Choice | Why |
|---|---|---|
| Default mode | stable (build + restart) | Matches prod / /app behaviour. Avoids holding a Turbopack dev server in memory in every tenant. |
| fast mode | Opt-in per-tenant flag (`tenant_settings.deploy_mode = 'fast' \| 'stable'`) | Same toggle PLAN.md §6 calls for. v1 wires the flag and a no-op fast path; real HMR keep-warm lands in 6f if we want it. |
| Restart mechanism | Supervisor IPC over Unix socket at `/run/cial-supervisor.sock` (mode 0660, group cial) | Lets Core Back (cial) ask the supervisor (PID 1, cial) to bounce a specific child without touching the others. Agent (agent) cannot reach the socket. |
| Build queue | Single in-flight build per tenant; new requests with same payload are coalesced, different requests queue (max depth 1) | Matches /app behaviour; avoids racing pnpm. |
| Build target | `pnpm --filter @cial/platform-front build && pnpm --filter @cial/platform-back build` | Same as legacy. Run from /cial (where the workspace lives). |
| Build artifacts | In-place at `/cial/platform/{front,back}/` (Next standalone output already configured). No atomic-swap for v1. | Bounce-in-place is fine because the supervisor restart drains old workers cleanly. |
| Git auto-commit granularity | Per turn: one commit per assistant message that mutated `/platform/` | Cleanest rollback UX. Matches /app flow. |
| Git branch | `instance/<tenant_slug>` on the existing /platform repo (initialised on first boot if not a repo) | Same convention as instance/eliot. No remote in v1; local commits only. |
| Rollback strategy | `git revert <sha>..HEAD` (adds revert commits instead of rewriting history), then redeploy | Non-destructive; preserves history. Matches /app deploy reference. |
| Auth | All deploy/git endpoints require Better-Auth tenant session and owner role | Same as Vault / DB-proxy in Phase 4. |
| Logging | Build stdout/stderr line-buffered → WS `deploy.log` events; full transcript also written to `/cial/data/deploy-logs/<deployId>.log` (rotated, last 50 kept) | Visible live in chat + auditable later. |
| Cancellation | `POST /.cial/api/deploy/:id/cancel` → `kill -SIGTERM` the build child group | Same pattern as session cancel from Phase 5c. |
| Concurrency with sessions | Build does not block chat. Restart bounces only platform-* children; Core Back keeps the WS open and emits `deploy.restart.done` so the UI knows when the platform endpoints are reachable again. | Mirrors prod where chat stays alive across deploys. |
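
The "Build queue" decision can be sketched as a small state machine. This is only an illustration: `BuildRequest`, `BuildQueue` and `SubmitOutcome` are hypothetical names, and the choice to let a newer, different request replace the single pending slot is an assumption, since the plan only states "max depth 1".

```typescript
// Hypothetical names throughout; none of these are taken from the codebase.
type BuildRequest = { mode: "fast" | "stable"; targets: string[] };
type SubmitOutcome = "started" | "coalesced" | "queued" | "replaced";

// Two requests carry the "same payload" when mode and target set match.
function samePayload(a: BuildRequest, b: BuildRequest): boolean {
  const key = (r: BuildRequest) => r.mode + ":" + [...r.targets].sort().join(",");
  return key(a) === key(b);
}

class BuildQueue {
  private running: BuildRequest | null = null;
  private pending: BuildRequest | null = null;

  submit(req: BuildRequest): SubmitOutcome {
    if (this.running === null) {
      this.running = req;
      return "started";
    }
    if (this.pending !== null) {
      if (samePayload(this.pending, req)) return "coalesced";
      this.pending = req; // assumption: the latest request wins the single slot
      return "replaced";
    }
    if (samePayload(this.running, req)) return "coalesced";
    this.pending = req;
    return "queued";
  }

  // Call when the running build exits; returns the next build to start, if any.
  finish(): BuildRequest | null {
    this.running = this.pending;
    this.pending = null;
    return this.running;
  }
}
```

Keeping the queue free of process handling makes the coalescing rules easy to unit-test separately from the pnpm child.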

Architecture deltas

New endpoints (Core Back, /.cial/api/)

| Verb | Path | Purpose |
|---|---|---|
| POST | `/deploy` | Kick a build. Returns `{ deployId }` immediately, streams logs over WS. |
| POST | `/deploy/restart` | Ask the supervisor to bounce platform-front + platform-back. |
| POST | `/deploy/:id/cancel` | Cancel an in-flight build. |
| GET | `/deploy/:id` | Status snapshot (queued / running / ok / error / cancelled, exit code, log tail). |
| GET / PATCH | `/deploy/mode` | Read / write the fast vs stable toggle. |
| GET | `/git/log?limit=N` | Recent commits on `instance/<slug>` with `{ sha, subject, turnId, createdAt }`. |
| POST | `/git/rollback` | Body `{ sha }` → revert to that sha + auto-redeploy. |

Supervisor IPC (new)

core/edge/src/supervisor.ts gains a JSONL line-protocol Unix socket:

{ type: 'restart', service: 'platform-front' | 'platform-back' | 'platform' }
{ type: 'status' }

Replies:

{ type: 'restart.ack', service, pid }
{ type: 'restart.done', service, pid, uptimeMs }
{ type: 'status', services: [{ name, pid, uptimeMs, restartCount }] }

The supervisor's child.on('exit') handler is updated: if the exit was requested via IPC the supervisor restarts that child instead of tearing the container down. Unrequested exits keep the existing crash-the-container behaviour, so genuine Platform crashes still reach Docker's restart policy.
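
A minimal sketch of how one incoming IPC line might be parsed and validated before the supervisor acts on it. The function and type names (`parseIpcLine`, `IpcRequest`, `ParseResult`) and the error strings are hypothetical; only the message shapes and the allowed service names come from this document.

```typescript
// Message shapes follow the protocol above; all identifiers are illustrative.
type Service = "platform-front" | "platform-back" | "platform";
type IpcRequest = { type: "restart"; service: Service } | { type: "status" };
type ParseResult =
  | { ok: true; request: IpcRequest }
  | { ok: false; error: string };

const RESTARTABLE: readonly string[] = ["platform-front", "platform-back", "platform"];

function parseIpcLine(line: string): ParseResult {
  let value: unknown;
  try {
    value = JSON.parse(line); // JSONL: exactly one JSON document per line
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (typeof value !== "object" || value === null) {
    return { ok: false, error: "expected an object" };
  }
  const msg = value as Record<string, unknown>;
  if (msg.type === "status") {
    return { ok: true, request: { type: "status" } };
  }
  if (msg.type === "restart") {
    // Only platform services may be restarted; everything else is refused.
    if (typeof msg.service === "string" && RESTARTABLE.includes(msg.service)) {
      return { ok: true, request: { type: "restart", service: msg.service as Service } };
    }
    return { ok: false, error: "service is not restartable" };
  }
  return { ok: false, error: "unknown message type" };
}
```

Because the socket speaks JSONL, each line must be valid JSON with double-quoted keys and strings; the single-quoted shapes above are type notation, not wire format.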

New WS message types (additive, no break to Phase 5)

  • deploy.start { deployId, mode, targets: ['front','back'] }
  • deploy.log { deployId, stream: 'stdout' | 'stderr', line }
  • deploy.done { deployId, ok, exitCode, durationMs, errorSummary? }
  • deploy.cancelled { deployId }
  • deploy.restart.start { service }
  • deploy.restart.done { service, durationMs }
  • git.commit { sha, subject, turnId }
  • git.rollback { fromSha, toSha }

All are broadcast to sockets focused on the originating session, plus any socket subscribed to a new deploy.watch channel (so a global header indicator works across sessions, like /app).
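
The fan-out rule can be sketched as a pure function over connected sockets. `SocketInfo` and `recipientsFor` are hypothetical names for illustration; the real connection registry may track this state differently.

```typescript
// Illustrative shape of what the server would need to know about each socket.
interface SocketInfo {
  id: string;
  focusedSessionId: string | null; // session the client currently has open
  watchesDeploys: boolean;         // true after an inbound deploy.watch
}

// Returns the ids of sockets that should receive a deploy.* or git.* event
// that originated from the given session.
function recipientsFor(sockets: SocketInfo[], originSessionId: string): string[] {
  return sockets
    .filter((s) => s.focusedSessionId === originSessionId || s.watchesDeploys)
    .map((s) => s.id);
}
```

A socket that matches both conditions is still listed once, because the filter checks each socket a single time.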

Filesystem boundary

| Path | Owner | Mode | Touched by |
|---|---|---|---|
| `/cial/platform/` | agent:agent | 0755 | Agent (writes), Core Back (reads to spawn pnpm) |
| `/cial/platform/.git/` | agent:agent | 0755 | Agent + Core Back's git engine (both run as their respective users; we use a small `gosu agent git ...` shim from Core Back) |
| `/run/cial-supervisor.sock` | cial:cial | 0660 | Core Back ↔ supervisor only (agent cannot reach it) |
| `/cial/data/deploy-logs/` | cial:cial | 0700 | Core Back only |

The agent cannot call deploy/restart directly via filesystem tricks; it must go through the Core Back HTTP API (which is what its system prompt tells it to do). This keeps the privilege boundary intact.


Sub-phase split — each independently shippable

6a — Supervisor selective-restart IPC

Goal: Supervisor can bounce a single child on request, without exiting.

Deliverables

  • core/edge/src/supervisor.ts: add Unix socket server at /run/cial-supervisor.sock, JSONL protocol above.
  • Track restartRequested: Set<string> so the existing on('exit') handler knows whether to respawn or shut the container down.
  • Per-child restartCount, lastStartedAt in memory; include in status.
  • Refuse restart for core-back, core-front, edge (only platform-* are restartable for v1).
  • Vitest: spawn a fake "platform-front" child that just sleeps; over the Unix socket, request a restart and assert the new child has a different pid.
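
The respawn-or-shut-down decision from the deliverables above could look roughly like this. `RestartTracker` and its method names are hypothetical, and expanding the `platform` target into both children is an assumption about how the combined restart would be tracked.

```typescript
type ExitAction = "respawn" | "shutdown";

// Restart targets accepted over IPC, mapped to the child processes they cover.
// core-back, core-front and edge are deliberately absent.
const CHILDREN_BY_TARGET = new Map<string, string[]>([
  ["platform-front", ["platform-front"]],
  ["platform-back", ["platform-back"]],
  ["platform", ["platform-front", "platform-back"]],
]);

class RestartTracker {
  private restartRequested = new Set<string>();
  private restartCount = new Map<string, number>();

  // Marks children as "exit expected"; returns [] when the target is refused.
  requestRestart(target: string): string[] {
    const children = CHILDREN_BY_TARGET.get(target) ?? [];
    for (const child of children) this.restartRequested.add(child);
    return children;
  }

  // Called from the exit handler: requested exits respawn, anything else
  // keeps the existing crash-the-container behaviour.
  onExit(child: string): ExitAction {
    if (!this.restartRequested.delete(child)) return "shutdown";
    this.restartCount.set(child, this.count(child) + 1);
    return "respawn";
  }

  count(child: string): number {
    return this.restartCount.get(child) ?? 0;
  }
}
```

Deleting the entry inside `onExit` means a second, unrequested exit of the same child is treated as a genuine crash.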

Done when

  • nc -U /run/cial-supervisor.sock → send {"type":"restart","service":"platform-front"} (valid JSON, one message per line) → see the child pid change in supervisor logs without the container dying.

6b — Deploy controller (build + restart + WS logs)

Goal: REST endpoints + WS event stream that build the platform and bounce it, gated by auth.

Deliverables

  • core/back/src/modules/deploy/
    • repository.ts: deploys table (id, tenant-implicit, mode, status, started_at, ended_at, exit_code, error_summary, log_path, requested_by_user_id)
    • runner.ts: BuildRunner class:
      • Single-flight queue (max depth 1)
      • Spawns pnpm with cwd = /cial, stdio: ['ignore','pipe','pipe'], detached: true so we can SIGTERM the process group
      • Line-buffered log → emits to BuildEvents listener + appends to file
      • Reports ok only if both platform-front and platform-back exit 0
    • supervisor-client.ts — tiny UDS client for the IPC defined in 6a
    • service.ts — orchestrates: insert deploy row → run build → on success, call supervisor restart → emit deploy.done
    • index.ts: Express router for the deploy endpoints listed above
    • ws-handlers.ts — inbound deploy.watch / deploy.unwatch; outbound is service calling clients.broadcastToUser
  • @cial/protocol/src/deploy.ts — Zod schemas for all WS payloads + REST DTOs
  • Drizzle migration adds deploys + tenant_settings.deploy_mode
  • Owner-role guard middleware (mirrors Vault pattern)
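
The runner's line-buffered log step can be illustrated with a small splitter that turns arbitrary stdout/stderr chunks into whole lines. `LineBuffer` is a hypothetical helper name; stripping a trailing carriage return is an added assumption, not something the plan specifies.

```typescript
// Accumulates stream chunks and releases only complete lines, so each
// deploy.log event carries exactly one line.
class LineBuffer {
  private partial = "";

  // Feed one decoded chunk; returns every line completed by it.
  push(chunk: string): string[] {
    const parts = (this.partial + chunk).split("\n");
    this.partial = parts.pop() ?? ""; // last piece has no newline yet
    return parts.map((line) => (line.endsWith("\r") ? line.slice(0, -1) : line));
  }

  // Call when the stream closes to emit a final unterminated line, if any.
  flush(): string[] {
    const rest = this.partial;
    this.partial = "";
    return rest === "" ? [] : [rest];
  }
}
```

One buffer per stream (stdout and stderr) keeps the two from interleaving inside a single line.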

Done when

  • curl -XPOST .../deploy → starts a build, WS subscriber sees deploy.start + N × deploy.log + deploy.done
  • After deploy.done ok=true, calling /deploy/restart produces deploy.restart.done and curl http://localhost:3000/ returns the new platform-front output (verified with a marker file edited mid-test)
  • Cancel mid-build → next build runs cleanly (no orphaned pnpm)

6c — Git engine slice (auto-commit per turn + rollback)

Goal: Every assistant turn that touched /platform/ becomes a commit on instance/<slug>; one call rolls back to any prior sha.

Deliverables

  • core/back/src/modules/git/
    • engine.ts:
      • ensureRepo() — on boot, git init + git checkout -b instance/<slug> if /platform/.git missing; configure user.email = agent@<slug>.cial.local
      • commitTurn({ turnId, subject, sessionId }): git add -A && git diff --cached --quiet || git commit -m "<subject>\n\nturn:<turnId>\nsession:<sessionId>"
      • log(limit): git log --pretty=format:%H%x00%s%x00%aI -n <limit> parsed
      • revertTo(sha): git revert --no-edit <sha>..HEAD (handles merges via -m 1); refuses if working tree dirty
      • All git invocations run as agent via gosu agent git ... so file ownership stays correct
    • service.ts — wraps engine with auth, returns Zod-validated DTOs
    • index.ts — Express router: GET /log, POST /rollback
  • Hook into Phase 5d: in chat-handlers.ts's result path, after the assistant message is persisted, if the turn used any of Write|Edit|MultiEdit|NotebookEdit|Bash, call git.commitTurn(...) and broadcast git.commit.
  • POST /git/rollback: revertTo → kick a deploy (mode = current tenant_settings.deploy_mode) → emit git.rollback.
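
A sketch of the string handling around the git engine: building the per-turn commit message and parsing the NUL-separated `git log` output. Function names are hypothetical. Note that the quoted format (`%H%x00%s%x00%aI`) carries only sha, subject and date, so recovering `turnId` for the `/git/log` response would presumably require reading the commit body as well; `extractTurnId` below assumes the body is available.

```typescript
interface CommitEntry {
  sha: string;
  subject: string;
  createdAt: string; // ISO 8601 author date, as printed by %aI
}

// Subject line, blank line, then the two trailer-style lines from the plan.
function buildCommitMessage(subject: string, turnId: string, sessionId: string): string {
  return `${subject}\n\nturn:${turnId}\nsession:${sessionId}`;
}

// Parses output of: git log --pretty=format:%H%x00%s%x00%aI -n <limit>
// One commit per line, fields separated by NUL; malformed lines are skipped.
function parseGitLog(output: string): CommitEntry[] {
  const entries: CommitEntry[] = [];
  for (const line of output.split("\n")) {
    const fields = line.split("\x00");
    if (fields.length !== 3) continue;
    const [sha, subject, createdAt] = fields;
    entries.push({ sha, subject, createdAt });
  }
  return entries;
}

// Assumes the caller has the commit body (for example from a %b field).
function extractTurnId(body: string): string | null {
  const match = /^turn:(.+)$/m.exec(body);
  return match ? match[1] : null;
}
```

NUL works as a field separator because a commit subject cannot contain it, which is not true of tabs or pipes.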

Done when

  • Send a chat message that creates a file → GET /git/log shows one new commit with the assistant's first-line summary as subject
  • POST /git/rollback {sha: <pre-edit>} → file disappears + auto-deploy completes + WS shows git.rollback then deploy.done
  • Two consecutive turns with no FS changes → no empty commits

6d — SDK + UI bindings

Goal: Chat shows a live deploy pill and a collapsible build-log panel, mirroring /app's UX.

Deliverables

  • @cial/sdk/src/modules/deploy.ts:
    • start(opts?), restart(), cancel(deployId), mode.get() / set('fast'|'stable'), subscribe(onEvent) (returns unsubscribe)
  • @cial/sdk/src/modules/git.ts:
    • log(limit?), rollback(sha)
  • @cial/core-ui/src/store.ts:
    • New slice deployBySession[id] = { current?: DeployRow, recent: DeployRow[] }
    • WS dispatcher handles deploy.* and git.*
  • New components in @cial/core-ui/src/MessageList/:
    • DeployPill.tsx — small inline pill: "Building…" (spinner) → "Build OK · 12s" (green) → "Restarting…" → "Live" (sparkles). Shown in the assistant message's ProcessDropdown row when that turn triggered a deploy.
    • DeployLogPanel.tsx — collapsible (closed by default), tail of last 200 lines, auto-scroll while live
    • RollbackButton.tsx — appears next to the git.commit row in ProcessDropdown for any turn whose commit is not the current HEAD; one click → calls sdk.git.rollback
  • A subtle global DeployStatusDot in the app header (bottom-left of the chat) — colour reflects the most recent deploy state across sessions.
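
The DeployPill states can be derived from the most recent WS event for a turn. This is a sketch only: `PillEvent` and `pillLabel` are hypothetical, the event payloads are trimmed to the fields the label needs, and the "Build failed" and "Cancelled" labels are assumptions, since the plan only names the success path.

```typescript
// Reduced versions of the deploy.* payloads; only label-relevant fields kept.
type PillEvent =
  | { type: "deploy.start" }
  | { type: "deploy.done"; ok: boolean; durationMs: number }
  | { type: "deploy.cancelled" }
  | { type: "deploy.restart.start" }
  | { type: "deploy.restart.done" };

// Maps the latest event seen for a turn to the pill text, or null when the
// turn triggered no deploy at all.
function pillLabel(events: PillEvent[]): string | null {
  if (events.length === 0) return null;
  const last = events[events.length - 1];
  switch (last.type) {
    case "deploy.start":
      return "Building…";
    case "deploy.done":
      return last.ok
        ? `Build OK · ${Math.round(last.durationMs / 1000)}s`
        : "Build failed"; // assumed label
    case "deploy.cancelled":
      return "Cancelled"; // assumed label
    case "deploy.restart.start":
      return "Restarting…";
    case "deploy.restart.done":
      return "Live";
  }
}
```

Deriving the label from events rather than storing it keeps the store slice limited to raw deploy rows.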

Done when

  • User edits a file via chat → in the same assistant message: tools timeline → "Build OK · 9s" pill → "Live" pill → answer
  • Click DeployLogPanel → see the tail of pnpm output streaming live
  • Rollback from a previous turn's pill → file reverts + chat shows a system-style "Rolled back to " entry

6e — Agent system prompt + behavioural loop

Goal: Claude actually calls deploy/restart after edits — without it the backend plumbing sits unused.

Deliverables

  • core/back/src/modules/sessions/process/providers/claude.ts: append a Phase-6 block to the base system prompt (kept short — every token costs):
    • "Your edits to /platform/** only become live after POST /.cial/api/deploy returns ok:true, then POST /.cial/api/deploy/restart."
    • "Always run a deploy after edits. If the build fails, fix the errors and redeploy until it passes. Never skip the build."
    • Curl examples that work via the Unix socket /run/cial-core.sock (already exists from Phase 4 plan; if not, fall back to http://localhost:4000)
  • Tools available to Claude don't change — it uses Bash to call the endpoints, same as /app Cial does.
  • Add a /.cial/api/internal/deploy* mirror bound only to the Unix socket so the agent doesn't need a session cookie to call it (the socket itself is the auth — only agent user can write to it).
  • Smoke test: ask the chat "create /platform/front/src/app/test/page.tsx saying hello" → assert that within 60s a git.commit lands and the URL http://localhost:3000/test returns 200 with "hello".
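
The behavioural loop the prompt asks for can be summarised as a decision on the deploy status returned by `GET /deploy/:id`. The status values come from the endpoint table; `NextStep`, `nextStep` and the choice to simply redeploy after a cancellation are assumptions made for illustration.

```typescript
type DeployStatus = "queued" | "running" | "ok" | "error" | "cancelled";
type NextStep = "poll-again" | "call-restart" | "fix-and-redeploy" | "redeploy";

// What the agent should do after checking on a deploy it started.
function nextStep(status: DeployStatus): NextStep {
  switch (status) {
    case "queued":
    case "running":
      return "poll-again"; // build not finished yet
    case "ok":
      return "call-restart"; // edits go live only after the restart
    case "error":
      return "fix-and-redeploy"; // read the log, fix, then deploy again
    case "cancelled":
      return "redeploy"; // assumption: a cancelled build is simply retried
  }
}
```

Spelling the loop out this way also gives the smoke test a checklist: every path except "call-restart" must eventually lead back to a new deploy.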

Done when

  • E2E smoke green: chat-driven file creation produces a working page with no human button-presses.
  • Inducing a TypeScript error on purpose → chat shows build errors → Claude fixes them → next deploy succeeds.

Non-goals for Phase 6 (parking lot)

  • Real Turbopack HMR keep-warm for fast mode (the flag exists; the implementation is a no-op deferring to a stable build for v1; real HMR is 6f if we decide we want it)
  • Rollback UI for individual files (only whole-turn rollback in v1)
  • Multi-tenant deploy concurrency / global queue (PLAN.md §13 territory)
  • Pushing to a remote repo (PLAN.md §13)
  • Build caching across deploys beyond what pnpm + Turbo already do
  • "Preview" deploys (build to a side dir, swap on accept)
  • Per-tenant resource caps on the build (CPU/mem) — needs orchestrator work

Risks & mitigations

| Risk | Mitigation |
|---|---|
| pnpm build OOMs in a small tenant container | Set `NODE_OPTIONS=--max-old-space-size=1024` for build child; surface OOM as a clear deploy.done error_summary |
| Supervisor restart leaves a port-bound zombie | Wait for child exit (SIGTERM → 5s grace → SIGKILL) before respawning; verify port-free with a connect probe before declaring restart.done |
| Auto-commit floods git history with trivial commits | `git diff --cached --quiet` short-circuits no-op turns. Squash UI ("squash this turn into the previous") is a v2 nicety. |
| Rollback after a schema migration corrupts data | Mark migration tool calls in `metadata.danger = 'schema'`; refuse revertTo past such a commit unless `force: true`. |
| Agent calls deploy on every tiny edit, wasting CPU | System prompt says "deploy after a logical change, not after every edit"; debouncer in BuildRunner queues at most 1 pending build, coalesces same-target requests within 2s |
| Build child outlives the request that started it (client disconnects) | Build is decoupled from the HTTP request lifecycle (returns deployId and runs detached); WS subscribers get events whenever they reconnect; full log on disk |
| /platform/.git written by agent but Core Back runs as cial | All git invocations go through `gosu agent git ...`; never read the working tree from Core Back directly |
| Stable mode bounce drops in-flight Platform requests | Edge proxy (already in core/edge) gets a brief drain mode: hold incoming requests for up to 2s while platform restarts, fail fast after; UI shows "Restarting…" pill |
| Supervisor IPC abused if socket leaks | Socket is 0660 cial:cial; agent is not in cial group. Sanity-check on every IPC line: only allow service ∈ {platform-front, platform-back, platform} |
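
The schema-migration guard could be checked before any `git revert` runs. In this sketch `CommitMeta`, `RevertCheck` and `checkRevert` are hypothetical names, and the history is assumed to be ordered newest first with the `danger` flag already attached to each commit.

```typescript
interface CommitMeta {
  sha: string;
  danger?: "schema"; // set when the turn ran a schema migration
}

type RevertCheck =
  | { allowed: true; reverted: string[] }
  | { allowed: false; reason: string };

// history is newest first. Reverting <target>..HEAD undoes every commit that
// is newer than the target, but not the target itself.
function checkRevert(history: CommitMeta[], targetSha: string, force = false): RevertCheck {
  const index = history.findIndex((c) => c.sha === targetSha);
  if (index === -1) return { allowed: false, reason: "unknown sha" };
  const toRevert = history.slice(0, index);
  const blocker = toRevert.find((c) => c.danger === "schema");
  if (blocker !== undefined && !force) {
    return { allowed: false, reason: `commit ${blocker.sha} contains a schema migration` };
  }
  return { allowed: true, reverted: toRevert.map((c) => c.sha) };
}
```

Returning the list of commits that would be reverted lets the UI show the user exactly what a rollback will undo.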

Working agreement (from PLAN.md)

Each sub-phase: implement → self-test against the "Done when" criteria → commit phase(6x): <summary> → push → notify Eliot for sign-off → next sub-phase. Order: 6a → 6b → 6c → 6d → 6e. 6a and 6c can be developed in parallel if useful, but 6b depends on 6a and 6d depends on 6b+6c.


Success metric (the one number)

Time from "user types 'add a page X'" to "X is reachable in the browser" on a warm container, measured end-to-end:

  • Target: < 25 s (build + restart + first response on the new route)
  • Stretch: < 8 s with fast mode (Phase 6f)

If 6e's smoke test consistently lands under target, Phase 6 ships.