Reorganize the dev/prod tenant container so the agent runs in the monorepo
root with a clear, semantic directory tree:
/cial/core/ — runtime (back, front, edge, ui, sdk, protocol, scripts,
docker). Locked down to the cial linux user (mode 0700
in prod; :ro bind mount in restricted dev).
/cial/platform/ — agent-editable surface (back, front).
/cial/app/ — App control plane sources, present in workspace but
never built or run inside the tenant container.
/cial/docs/ — architecture + ops reference.
/cial/.claude/ — project skills/agents/commands (symlinked into the
harness HOME by the dev entrypoint).
/cial/data/ — persistent state (sqlite, deploy-logs, agent home).
Concrete changes:
- git mv cial-core → core, cial-platform → platform, cial-app → app,
scripts → core/scripts.
- pnpm-workspace.yaml: packages now core/*, platform/*, app/*.
- Bulk path rewrites across 250+ source / docker / docs files.
- core/scripts/dev-tenant.mjs: ROOT path fix, rw mount of repo + ro
overlay of /cial/core when --unrestricted is not set (FS-level
trust boundary, defense in depth).
- core/edge/src/supervisor.{ts,dev.ts}: cwd + CLAUDE_HOME relocated to
/cial/data/home; agent runs from /cial root so skill discovery picks
up /cial/.claude/skills automatically.
- core/back providers/claude.ts: HOME defaults to /cial/data/home, cwd
defaults to /cial.
- core/docker/{Dockerfile,Dockerfile.dev,dev-entrypoint.sh}: COPY +
WORKDIR + ENTRYPOINT updated; .claude → harness symlink.
- app/docker/{Dockerfile,Dockerfile.router}: COPY core, COPY app
(instead of cial-core / cial-app).
- New docs/file-structure.md — single canonical map of the runtime
layout. cial:self-edit SKILL.md mandates reading it first.
- cial:build SKILL.md: scope notes updated to platform/* and core/*.
- root package.json: smoke / dev:tenant scripts now under core/scripts/.
- core/scripts/smoke.mjs: cial-core.db → cial.db.
Externals preserved as-is by intent:
- JWT issuer string 'cial-app' in core/back/src/modules/sso/index.ts +
app/api/src/lib/sso.ts is an external contract — NOT renamed.
- @cial/back / @cial/edge / @cial/protocol / @cial/sdk / @cial/front
package names kept stable to minimize blast radius.
Verified:
- pnpm install --prod=false → ok
- turbo run build for protocol, sdk, back, edge, front, platform-back,
platform-front → all 7 successful (Next builds + tsc clean).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 6 — Self-Editable Agent (Deploy Controller + Git Slice)
Make the in-container agent able to ship its own edits: build, restart, stream logs, and roll back. Same workflow as /app Cial uses on itself.
This phase implements PLAN.md §Phase 6 (deploy controller, fast vs stable) and pulls forward the commit / rollback slice of PLAN.md §Phase 4 (git engine), because rollback without commits per turn is useless.
Reference: legacy /app/server/src/api/deploy.ts + /app/server/src/core/lib/git/*.
What "self-editable" means today vs after this phase
| Capability | Today (post-5e) | After Phase 6 |
|---|---|---|
| Agent reads / writes files in /platform/ | ✅ via Claude built-in tools (cwd is /platform/) | ✅ |
| Agent runs build to surface compile errors | ❌ | ✅ POST /.cial/api/deploy |
| Platform-front / platform-back pick up changes | ❌ (cold restart of whole container) | ✅ POST /.cial/api/deploy/restart (only platform children bounce) |
| Build/restart logs visible in the chat | ❌ | ✅ WS deploy.* events |
| Per-turn auto-commit on instance/<tenant> branch | ❌ | ✅ |
| One-click rollback | ❌ | ✅ POST /.cial/api/git/rollback |
| Fast (HMR) mode for editing UX | ❌ | ✅ opt-in per tenant |
Net: the agent gains a real edit → build → restart → verify loop inside
the tenant container, with rollback as a safety net. Same shape as the
/api/deploy + /api/deploy/restart pair on prod Cial.
Scope decisions (lock these before sub-phase 6a)
| Decision | Choice | Why |
|---|---|---|
| Default mode | stable (build + restart) | Matches prod / /app behaviour. Avoids holding a Turbopack dev server in memory in every tenant. |
| fast mode | Opt-in per-tenant flag (tenant_settings.deploy_mode = 'fast' \| 'stable') | Same toggle PLAN.md §6 calls for. v1 wires the flag and a no-op fast path; real HMR keep-warm lands in 6f if we want it. |
| Restart mechanism | Supervisor IPC over Unix socket at /run/cial-supervisor.sock (mode 0660, group cial) | Lets Core Back (cial) ask the supervisor (PID 1, cial) to bounce a specific child without touching the others. Agent (agent) cannot reach the socket. |
| Build queue | Single in-flight build per tenant; new requests with same payload are coalesced, different requests queue (max depth 1) | Matches /app behaviour; avoids racing pnpm. |
| Build target | pnpm --filter @cial/platform-front build && pnpm --filter @cial/platform-back build | Same as legacy. Run from /cial (where the workspace lives). |
| Build artifacts | In-place at /cial/platform/{front,back}/ (Next standalone output already configured). No atomic-swap for v1. | Bounce-in-place is fine because the supervisor restart drains old workers cleanly. |
| Git auto-commit granularity | Per turn: one commit per assistant message that mutated /platform/ | Cleanest rollback UX. Matches /app flow. |
| Git branch | instance/<tenant_slug> on the existing /platform repo (initialised on first boot if not a repo) | Same convention as instance/eliot. No remote in v1; local commits only. |
| Rollback strategy | git revert <sha>..HEAD (creates new revert commits instead of rewriting history), then redeploy | Non-destructive; preserves history. Matches /app deploy reference. |
| Auth | All deploy/git endpoints require Better-Auth tenant session and owner role | Same as Vault / DB-proxy in Phase 4. |
| Logging | Build stdout/stderr line-buffered → WS deploy.log events; full transcript also written to /cial/data/deploy-logs/<deployId>.log (rotated, last 50 kept) | Visible live in chat + auditable later. |
| Cancellation | POST /.cial/api/deploy/:id/cancel: kill -SIGTERM the build child group | Same pattern as session cancel from Phase 5c. |
| Concurrency with sessions | Build does not block chat. Restart bounces only platform-* children; Core Back keeps the WS open and emits deploy.restart.done so the UI knows when the platform endpoints are reachable again. | Mirrors prod where chat stays alive across deploys. |
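
The "Build queue" row above can be read as a small admission rule. The sketch below is one possible interpretation, not the planned BuildRunner: the names QueueState, admit and finish are hypothetical, and rejecting a third, different request while one is already pending is an assumption (the table only says the queue depth is 1).

```typescript
// Hypothetical model of the single-flight queue: at most one running build
// and at most one pending build, each identified by a payload key.
type QueueState = { running?: string; pending?: string };
type Decision = "start" | "coalesce" | "queue" | "reject";

// Decide what to do with a new deploy request.
function admit(
  state: QueueState,
  payloadKey: string,
): { decision: Decision; next: QueueState } {
  if (state.running === undefined) {
    // Nothing in flight: start right away.
    return { decision: "start", next: { running: payloadKey } };
  }
  if (state.running === payloadKey || state.pending === payloadKey) {
    // Same payload as the running or pending build: share its result.
    return { decision: "coalesce", next: state };
  }
  if (state.pending === undefined) {
    // Different payload and the single queue slot is free.
    return { decision: "queue", next: { ...state, pending: payloadKey } };
  }
  // Assumption: the queue is full, so the request is refused.
  return { decision: "reject", next: state };
}

// Called when the running build ends: promote the pending build, if any.
function finish(state: QueueState): QueueState {
  return state.pending === undefined ? {} : { running: state.pending };
}
```

Keeping the rule as a pure function makes the coalescing behaviour easy to unit-test separately from the pnpm child process.
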
Architecture deltas
New endpoints (Core Back, /.cial/api/)
| Verb | Path | Purpose |
|---|---|---|
| POST | /deploy | Kick a build. Returns { deployId } immediately, streams logs over WS. |
| POST | /deploy/restart | Ask the supervisor to bounce platform-front + platform-back. |
| POST | /deploy/:id/cancel | Cancel an in-flight build. |
| GET | /deploy/:id | Status snapshot (queued / running / ok / error / cancelled, exit code, log tail). |
| GET / PATCH | /deploy/mode | Read / write the fast vs stable toggle. |
| GET | /git/log?limit=N | Recent commits on instance/<slug> with { sha, subject, turnId, createdAt }. |
| POST | /git/rollback | Body { sha } → revert to that sha + auto-redeploy. |
Supervisor IPC (new)
core/edge/src/supervisor.ts gains a JSONL line-protocol Unix socket:
{ type: 'restart', service: 'platform-front' | 'platform-back' | 'platform' }
{ type: 'status' }
Replies:
{ type: 'restart.ack', service, pid }
{ type: 'restart.done', service, pid, uptimeMs }
{ type: 'status', services: [{ name, pid, uptimeMs, restartCount }] }
The supervisor's child.on('exit') handler is updated: if the exit was
requested via IPC the supervisor restarts that child instead of tearing
the container down. Unrequested exits keep the existing crash-the-container
behaviour, so genuine Platform crashes still reach Docker's restart policy.
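
A minimal sketch of how one line of the IPC protocol above might be validated before the supervisor acts on it. It assumes each line is strict JSON (double-quoted keys); parseIpcLine and the error reply shape are hypothetical names, not part of the protocol listed above.

```typescript
// Services the supervisor may bounce in v1 (from the protocol above).
const RESTARTABLE = ["platform-front", "platform-back", "platform"] as const;
type RestartableService = (typeof RESTARTABLE)[number];

type IpcRequest =
  | { type: "restart"; service: RestartableService }
  | { type: "status" };

// Hypothetical error shape; the plan does not define an error reply.
type IpcError = { type: "error"; reason: string };

// Parse and validate a single JSONL line received on the Unix socket.
function parseIpcLine(line: string): IpcRequest | IpcError {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return { type: "error", reason: "invalid JSON" };
  }
  if (typeof raw !== "object" || raw === null) {
    return { type: "error", reason: "expected an object" };
  }
  const msg = raw as Record<string, unknown>;
  if (msg.type === "status") {
    return { type: "status" };
  }
  if (msg.type === "restart") {
    const service = msg.service;
    // Allowlist check: anything outside platform-* is refused.
    if (
      typeof service === "string" &&
      (RESTARTABLE as readonly string[]).indexOf(service) !== -1
    ) {
      return { type: "restart", service: service as RestartableService };
    }
    return { type: "error", reason: "service is not restartable" };
  }
  return { type: "error", reason: "unknown message type" };
}
```
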
New WS message types (additive, no break to Phase 5)
- deploy.start { deployId, mode, targets: ['front','back'] }
- deploy.log { deployId, stream: 'stdout' | 'stderr', line }
- deploy.done { deployId, ok, exitCode, durationMs, errorSummary? }
- deploy.cancelled { deployId }
- deploy.restart.start { service }
- deploy.restart.done { service, durationMs }
- git.commit { sha, subject, turnId }
- git.rollback { fromSha, toSha }
All broadcast to sockets focused on the originating session plus any
socket subscribed to a new deploy.watch channel (so a global header
indicator works across sessions, like /app).
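
The message list above maps naturally onto a discriminated union. This is a sketch under assumptions: the plan does not say how the message name travels on the wire, so the `type` field used as discriminator here is a guess, as are the primitive field types; summarize is a hypothetical helper showing exhaustive handling.

```typescript
// Assumed envelope: the message name is carried in a `type` field.
type DeployWsEvent =
  | { type: "deploy.start"; deployId: string; mode: "fast" | "stable"; targets: Array<"front" | "back"> }
  | { type: "deploy.log"; deployId: string; stream: "stdout" | "stderr"; line: string }
  | { type: "deploy.done"; deployId: string; ok: boolean; exitCode: number; durationMs: number; errorSummary?: string }
  | { type: "deploy.cancelled"; deployId: string }
  | { type: "deploy.restart.start"; service: string }
  | { type: "deploy.restart.done"; service: string; durationMs: number }
  | { type: "git.commit"; sha: string; subject: string; turnId: string }
  | { type: "git.rollback"; fromSha: string; toSha: string };

// One-line summary per event; the switch is exhaustive, so adding a new
// variant to the union without handling it here becomes a compile error.
function summarize(ev: DeployWsEvent): string {
  switch (ev.type) {
    case "deploy.start":
      return `deploy ${ev.deployId} started (${ev.mode}: ${ev.targets.join(", ")})`;
    case "deploy.log":
      return `[${ev.stream}] ${ev.line}`;
    case "deploy.done":
      return ev.ok
        ? `deploy ${ev.deployId} ok in ${Math.round(ev.durationMs / 1000)}s`
        : `deploy ${ev.deployId} failed (exit ${ev.exitCode})`;
    case "deploy.cancelled":
      return `deploy ${ev.deployId} cancelled`;
    case "deploy.restart.start":
      return `restarting ${ev.service}`;
    case "deploy.restart.done":
      return `${ev.service} restarted in ${ev.durationMs}ms`;
    case "git.commit":
      return `commit ${ev.sha}: ${ev.subject}`;
    case "git.rollback":
      return `rolled back ${ev.fromSha} -> ${ev.toSha}`;
  }
}
```
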
Filesystem boundary
| Path | Owner | Mode | Touched by |
|---|---|---|---|
| /cial/platform/ | agent:agent | 0755 | Agent (writes), Core Back (reads to spawn pnpm) |
| /cial/platform/.git/ | agent:agent | 0755 | Agent + Core Back's git engine (both run as their respective users; we use a small gosu agent git ... shim from Core Back) |
| /run/cial-supervisor.sock | cial:cial | 0660 | Core Back ↔ supervisor only (agent cannot reach it) |
| /cial/data/deploy-logs/ | cial:cial | 0700 | Core Back only |
The agent cannot call deploy/restart directly via filesystem tricks; it must go through the Core Back HTTP API (which is what its system prompt tells it to do). This keeps the privilege boundary intact.
Sub-phase split — each independently shippable
6a — Supervisor selective-restart IPC
Goal: Supervisor can bounce a single child on request, without exiting.
Deliverables
- core/edge/src/supervisor.ts: add Unix socket server at /run/cial-supervisor.sock, JSONL protocol above.
- Track restartRequested: Set<string> so the existing on('exit') handler knows whether to respawn or shut the container down.
- Per-child restartCount, lastStartedAt in memory; include in status.
- Refuse restart for core-back, core-front, edge (only platform-* are restartable for v1).
- Vitest: spawn a fake "platform-front" child that just sleeps; over a UDS socket, request restart, assert the new child has a different pid.
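
The restartRequested bookkeeping above boils down to one decision in the exit handler. A minimal sketch, assuming the handler receives the service name; onChildExit and ExitAction are hypothetical names.

```typescript
type ExitAction = "respawn" | "shutdown";

// Decide what the supervisor does when a child exits.
// A service is added to `restartRequested` when an IPC restart arrives;
// the flag is consumed here so a later unrequested exit of the same
// child still takes the container down.
function onChildExit(service: string, restartRequested: Set<string>): ExitAction {
  // Set.prototype.delete returns true only if the entry was present.
  if (restartRequested.delete(service)) {
    return "respawn";
  }
  return "shutdown";
}
```

Consuming the flag on exit means a requested restart followed by a genuine crash is still treated as a crash.
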
Done when
- nc -U /run/cial-supervisor.sock → send {type:'restart',service:'platform-front'} → see the child pid change in supervisor logs without the container dying.
6b — Deploy controller (build + restart + WS logs)
Goal: REST endpoints + WS event stream that build the platform and bounce it, gated by auth.
Deliverables
- core/back/src/modules/deploy/
  - repository.ts: deploys table (id, tenant-implicit, mode, status, started_at, ended_at, exit_code, error_summary, log_path, requested_by_user_id)
  - runner.ts: BuildRunner class, which:
    - Runs a single-flight queue (max depth 1)
    - Spawns pnpm with cwd = /cial, stdio: ['ignore','pipe','pipe'], detached: true so we can SIGTERM the process group
    - Line-buffers the log → emits to BuildEvents listener + appends to file
    - Reports ok only if both platform-front and platform-back exit 0
  - supervisor-client.ts: tiny UDS client for the IPC defined in 6a
  - service.ts: orchestrates insert deploy row → run build → on success, call supervisor restart → emit deploy.done
  - index.ts: Express router for the /deploy endpoints listed above
  - ws-handlers.ts: inbound deploy.watch / deploy.unwatch; outbound is service calling clients.broadcastToUser
- @cial/protocol/src/deploy.ts: Zod schemas for all WS payloads + REST DTOs
- Drizzle migration adds deploys + tenant_settings.deploy_mode
- Owner-role guard middleware (mirrors Vault pattern)
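
The "line-buffered log" step in runner.ts needs to turn arbitrary stdout/stderr chunks into whole lines before emitting deploy.log. A minimal sketch of that piece alone; LineBuffer is a hypothetical helper name, and stripping a trailing carriage return is an assumption about the desired output.

```typescript
// Accumulates stream chunks and yields only complete lines.
class LineBuffer {
  private partial = "";

  // Feed one decoded chunk; returns the lines completed by it.
  push(chunk: string): string[] {
    const parts = (this.partial + chunk).split("\n");
    // The last element is either "" (chunk ended on a newline) or an
    // unfinished line that must wait for the next chunk.
    this.partial = parts.pop() ?? "";
    return parts.map((l) => (l.endsWith("\r") ? l.slice(0, -1) : l));
  }

  // Call when the stream closes to emit a final line without a newline.
  flush(): string[] {
    const rest = this.partial;
    this.partial = "";
    return rest === "" ? [] : [rest];
  }
}
```

One buffer per stream (stdout and stderr) keeps partial lines from the two pipes from being interleaved.
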
Done when
- curl -XPOST .../deploy → starts a build, WS subscriber sees deploy.start + N × deploy.log + deploy.done
- After deploy.done ok=true, calling /deploy/restart produces deploy.restart.done and curl http://localhost:3000/ returns the new platform-front output (verified with a marker file edited mid-test)
- Cancel mid-build → next build runs cleanly (no orphaned pnpm)
6c — Git engine slice (auto-commit per turn + rollback)
Goal: Every assistant turn that touched /platform/ becomes a commit on
instance/<slug>; one call rolls back to any prior sha.
Deliverables
- core/back/src/modules/git/
  - engine.ts:
    - ensureRepo(): on boot, git init + git checkout -b instance/<slug> if /platform/.git missing; configure user.email = agent@<slug>.cial.local
    - commitTurn({ turnId, subject, sessionId }): git add -A && git diff --cached --quiet || git commit -m "<subject>\n\nturn:<turnId>\nsession:<sessionId>"
    - log(limit): git log --pretty=format:%H%x00%s%x00%aI -n <limit>, parsed
    - revertTo(sha): git revert --no-edit <sha>..HEAD (handles merges via -m 1); refuses if working tree dirty
    - All git invocations run as agent via gosu agent git ... so file ownership stays correct
  - service.ts: wraps engine with auth, returns Zod-validated DTOs
  - index.ts: Express router with GET /log, POST /rollback
- Hook into Phase 5d: in chat-handlers.ts's result path, after the assistant message is persisted, if the turn used any of Write|Edit|MultiEdit|NotebookEdit|Bash, call git.commitTurn(...) and broadcast git.commit.
- POST /git/rollback: revertTo → kick a deploy (mode = current tenant_settings.deploy_mode) → emit git.rollback.
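
The log(limit) deliverable above says the git output is "parsed". A minimal sketch of that parsing step, assuming the raw stdout of the git log command listed above is already available as a string; parseGitLog and CommitRow are hypothetical names.

```typescript
interface CommitRow {
  sha: string;
  subject: string;
  createdAt: string; // author date in strict ISO 8601, as produced by %aI
}

// Parse stdout of: git log --pretty=format:%H%x00%s%x00%aI -n <limit>
// Records are newline-separated; fields are separated by a NUL byte.
function parseGitLog(raw: string): CommitRow[] {
  const rows: CommitRow[] = [];
  for (const line of raw.split("\n")) {
    if (line === "") continue;
    const fields = line.split("\u0000");
    if (fields.length !== 3) continue; // skip anything malformed
    const [sha = "", subject = "", createdAt = ""] = fields;
    rows.push({ sha, subject, createdAt });
  }
  return rows;
}
```

The format string above carries only hash, subject and date, so this sketch does not recover the turnId that GET /git/log is meant to return. That would need the commit body as well (for example via %b), which is left out here.
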
Done when
- Send a chat message that creates a file → GET /git/log shows one new commit with the assistant's first-line summary as subject
- POST /git/rollback {sha: <pre-edit>} → file disappears + auto-deploy completes + WS shows git.rollback then deploy.done
- Two consecutive turns with no FS changes → no empty commits
6d — SDK + UI bindings
Goal: Chat shows a live deploy pill and a collapsible build-log panel,
mirroring /app's UX.
Deliverables
- @cial/sdk/src/modules/deploy.ts:
  - start(opts?), restart(), cancel(deployId), mode.get() / set('fast'|'stable'), subscribe(onEvent) (returns unsubscribe)
- @cial/sdk/src/modules/git.ts:
  - log(limit?), rollback(sha)
- @cial/core-ui/src/store.ts:
  - New slice deployBySession[id] = { current?: DeployRow, recent: DeployRow[] }
  - WS dispatcher handles deploy.* and git.*
- New components in @cial/core-ui/src/MessageList/:
  - DeployPill.tsx: small inline pill that moves through "Building…" (spinner) → "Build OK · 12s" (green) → "Restarting…" → "Live" (sparkles). Shown in the assistant message's ProcessDropdown row when that turn triggered a deploy.
  - DeployLogPanel.tsx: collapsible (closed by default), tail of last 200 lines, auto-scroll while live
  - RollbackButton.tsx: appears next to the git.commit row in ProcessDropdown for any turn whose commit is not the current HEAD; one click → calls sdk.git.rollback
- A subtle global DeployStatusDot in the app header (bottom-left of the chat); colour reflects the most recent deploy state across sessions.
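
The DeployPill label sequence above can be driven by a small reducer over the WS events. This is a sketch, not the store slice itself: PillState, PillEvent and reducePill are hypothetical names, the `type` discriminator is assumed, and the "Build failed" label is an assumption since the plan only lists the success path.

```typescript
type PillPhase = "idle" | "building" | "built" | "failed" | "restarting" | "live";

interface PillState {
  phase: PillPhase;
  label: string;
}

// Only the events the pill cares about, with the fields it needs.
type PillEvent =
  | { type: "deploy.start" }
  | { type: "deploy.done"; ok: boolean; durationMs: number }
  | { type: "deploy.restart.start" }
  | { type: "deploy.restart.done" };

function reducePill(state: PillState, ev: PillEvent): PillState {
  switch (ev.type) {
    case "deploy.start":
      return { phase: "building", label: "Building…" };
    case "deploy.done":
      return ev.ok
        ? { phase: "built", label: `Build OK · ${Math.round(ev.durationMs / 1000)}s` }
        : { phase: "failed", label: "Build failed" }; // assumed label
    case "deploy.restart.start":
      return { phase: "restarting", label: "Restarting…" };
    case "deploy.restart.done":
      // Ignore a stray "done" that does not follow a restart start.
      return state.phase === "restarting" ? { phase: "live", label: "Live" } : state;
  }
}
```
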
Done when
- User edits a file via chat → in the same assistant message: tools timeline → "Build OK · 9s" pill → "Live" pill → answer
- Click DeployLogPanel → see the tail of pnpm output streaming live
- Rollback from a previous turn's pill → file reverts + chat shows a system-style "Rolled back to " entry
6e — Agent system prompt + behavioural loop
Goal: Claude actually calls deploy/restart after edits — without it the backend plumbing sits unused.
Deliverables
- core/back/src/modules/sessions/process/providers/claude.ts: append a Phase-6 block to the base system prompt (kept short, since every token costs):
  - "Your edits to /platform/** only become live after POST /.cial/api/deploy returns ok:true, then POST /.cial/api/deploy/restart."
  - "Always run a deploy after edits. If the build fails, fix the errors and redeploy until it passes. Never skip the build."
  - Curl examples that work via the Unix socket /run/cial-core.sock (already exists from Phase 4 plan; if not, fall back to http://localhost:4000)
- Tools available to Claude don't change; it uses Bash to call the endpoints, same as /app Cial does.
- Add a /.cial/api/internal/deploy* mirror bound only to the Unix socket so the agent doesn't need a session cookie to call it (the socket itself is the auth: only the agent user can write to it).
- Smoke test: ask the chat "create /platform/front/src/app/test/page.tsx saying hello" → assert that within 60s a git.commit lands and the URL http://localhost:3000/test returns 200 with "hello".
Done when
- E2E smoke green: chat-driven file creation produces a working page with no human button-presses.
- Inducing a TypeScript error on purpose → chat shows build errors → Claude fixes them → next deploy succeeds.
Non-goals for Phase 6 (parking lot)
- Real Turbopack HMR keep-warm for fast mode (the flag exists; the implementation is a no-op deferring to a stable build for v1; real HMR is 6f if we decide we want it)
- Rollback UI for individual files (only whole-turn rollback in v1)
- Multi-tenant deploy concurrency / global queue (PLAN.md §13 territory)
- Pushing to a remote repo (PLAN.md §13)
- Build caching across deploys beyond what pnpm + Turbo already do
- "Preview" deploys (build to a side dir, swap on accept)
- Per-tenant resource caps on the build (CPU/mem) — needs orchestrator work
Risks & mitigations
| Risk | Mitigation |
|---|---|
| pnpm build OOMs in a small tenant container | Set NODE_OPTIONS=--max-old-space-size=1024 for build child; surface OOM as a clear deploy.done error_summary |
| Supervisor restart leaves a port-bound zombie | Wait for child exit (SIGTERM → 5s grace → SIGKILL) before respawning; verify port-free with a connect probe before declaring restart.done |
| Auto-commit floods git history with trivial commits | git diff --cached --quiet short-circuits no-op turns. Squash UI ("squash this turn into the previous") is a v2 nicety. |
| Rollback after a schema migration corrupts data | Mark migration tool calls in metadata.danger = 'schema'; refuse revertTo past such a commit unless force: true. |
| Agent calls deploy on every tiny edit, wasting CPU | System prompt says "deploy after a logical change, not after every edit"; debouncer in BuildRunner queues at most 1 pending build, coalesces same-target requests within 2s |
| Build child outlives the request that started it (client disconnects) | Build is decoupled from the HTTP request lifecycle (returns deployId and runs detached); WS subscribers get events whenever they reconnect; full log on disk |
| /platform/.git written by agent but Core Back runs as cial | All git invocations go through gosu agent git ...; never read the working tree from Core Back directly |
| Stable mode bounce drops in-flight Platform requests | Edge proxy (already in core/edge) gets a brief drain mode: hold incoming requests for up to 2s while platform restarts, fail fast after; UI shows "Restarting…" pill |
| Supervisor IPC abused if socket leaks | Socket is 0660 cial:cial; agent is not in cial group. Sanity-check on every IPC line: only allow service ∈ {platform-front, platform-back, platform} |
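
The schema-migration mitigation above implies a guard in front of revertTo. A sketch of one way to express it, assuming the commit history is available newest-first with the danger marker attached to each entry; TurnCommit, RevertPlan and planRevert are hypothetical names.

```typescript
interface TurnCommit {
  sha: string;
  danger?: "schema"; // set when the turn ran a schema migration
}

interface RevertPlan {
  ok: boolean;
  toRevert: string[]; // commits that `git revert <sha>..HEAD` would undo
  reason?: string;
}

// `history` is ordered newest first (HEAD at index 0).
function planRevert(history: TurnCommit[], targetSha: string, force = false): RevertPlan {
  const idx = history.map((c) => c.sha).indexOf(targetSha);
  if (idx === -1) {
    return { ok: false, toRevert: [], reason: "unknown sha" };
  }
  // <sha>..HEAD excludes the target itself and covers everything newer.
  const newer = history.slice(0, idx);
  const risky = newer.filter((c) => c.danger === "schema");
  if (risky.length > 0 && !force) {
    return {
      ok: false,
      toRevert: [],
      reason: `would revert schema migration ${risky.map((c) => c.sha).join(", ")}`,
    };
  }
  return { ok: true, toRevert: newer.map((c) => c.sha) };
}
```
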
Working agreement (from PLAN.md)
Each sub-phase: implement → self-test against the "Done when" criteria →
commit phase(6x): <summary> → push → notify Eliot for sign-off → next
sub-phase. Order: 6a → 6b → 6c → 6d → 6e. 6a and 6c can be developed in
parallel if useful, but 6b depends on 6a and 6d depends on 6b+6c.
Success metric (the one number)
Time from "user types 'add a page X'" to "X is reachable in the browser" on a warm container, measured end-to-end:
- Target: < 25 s (build + restart + first response on the new route)
- Stretch: < 8 s with fast mode (Phase 6f)
If 6e's smoke test consistently lands under target, Phase 6 ships.