Eliot M c50cc2b5fb refactor(layout): consolidate workspace under /cial — core/, platform/, app/
Reorganize the dev/prod tenant container so the agent runs in the monorepo
root with a clear, semantic directory tree:

  /cial/core/      — runtime (back, front, edge, ui, sdk, protocol, scripts,
                     docker). Locked down to the cial linux user (mode 0700
                     in prod; :ro bind mount in restricted dev).
  /cial/platform/  — agent-editable surface (back, front).
  /cial/app/       — App control plane sources, present in workspace but
                     never built or run inside the tenant container.
  /cial/docs/      — architecture + ops reference.
  /cial/.claude/   — project skills/agents/commands (symlinked into the
                     harness HOME by the dev entrypoint).
  /cial/data/      — persistent state (sqlite, deploy-logs, agent home).

Concrete changes:
- git mv cial-core → core, cial-platform → platform, cial-app → app,
  scripts → core/scripts.
- pnpm-workspace.yaml: packages now core/*, platform/*, app/*.
- Bulk path rewrites across 250+ source / docker / docs files.
- core/scripts/dev-tenant.mjs: ROOT path fix, rw mount of repo + ro
  overlay of /cial/core when --unrestricted is not set (FS-level
  trust boundary, defense in depth).
- core/edge/src/supervisor.{ts,dev.ts}: cwd + CLAUDE_HOME relocated to
  /cial/data/home; agent runs from /cial root so skill discovery picks
  up /cial/.claude/skills automatically.
- core/back providers/claude.ts: HOME defaults to /cial/data/home, cwd
  defaults to /cial.
- core/docker/{Dockerfile,Dockerfile.dev,dev-entrypoint.sh}: COPY +
  WORKDIR + ENTRYPOINT updated; .claude → harness symlink.
- app/docker/{Dockerfile,Dockerfile.router}: COPY core, COPY app
  (instead of cial-core / cial-app).
- New docs/file-structure.md — single canonical map of the runtime
  layout. cial:self-edit SKILL.md mandates reading it first.
- cial:build SKILL.md: scope notes updated to platform/* and core/*.
- root package.json: smoke / dev:tenant scripts now under core/scripts/.
- core/scripts/smoke.mjs: cial-core.db → cial.db.

Externals preserved as-is by intent:
- JWT issuer string 'cial-app' in core/back/src/modules/sso/index.ts +
  app/api/src/lib/sso.ts is an external contract — NOT renamed.
- @cial/back / @cial/edge / @cial/protocol / @cial/sdk / @cial/front
  package names kept stable to minimize blast radius.

Verified:
- pnpm install --prod=false → ok
- turbo run build for protocol, sdk, back, edge, front, platform-back,
  platform-front → all 7 successful (Next builds + tsc clean).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Local mode — full multi-tenant Cial on your laptop

Goal: pnpm local:up brings up the entire Cial platform locally with Docker standing in for Fly Machines. You sign up as the owner in the App admin, create a "client", a real Docker container is spawned for that tenant, you click the tenant URL, and you land in their per-tenant Cial — SSO'd in, already authenticated.

This document scopes "plumbing v1" — same code path as production, only the orchestrator's driver swaps (Docker ↔ Fly). It pulls Phase 2 of PLAN.md (real container entrypoint) and parts of Phases 11–13 (App auth, tenant CRUD, orchestrator, router) forward.


Decisions (locked in)

| decision | choice |
| --- | --- |
| Tenant URL | {slug}.localhost:8080 (modern browsers resolve *.localhost → 127.0.0.1; no /etc/hosts edits) |
| Signup | owner-only (you log into admin and create tenants) |
| Scope | plumbing v1 — spawn tenant, route to it, SSO works, tenant container responds with an identifiable placeholder |
| Orchestration | docker compose for App + Postgres + Router; the App spawns tenant containers via the host Docker socket (DooD pattern) |

Tech picks

| concern | choice | why |
| --- | --- | --- |
| App DB | postgres:16 in compose | matches prod |
| App ORM | drizzle-orm + drizzle-kit | already in deps, lightweight |
| App Auth | better-auth | locked-in choice from infra deck |
| Orchestrator driver | dockerode | typed, supports the full Docker API; no shelling out |
| Router proxy | tiny Node service using http-proxy (lives in app/router) | dedicated process, separates concerns, matches the @cial/app-router package we already scaffolded |
| SSO | HS256 JWT (shared secret in .env.local) for v1; bumps to RS256 in prod | smaller blast radius for a dev mode |
| Tenant data | named Docker volume per tenant: cial-tenant-{id}-data | survives docker stop, gone with the tenant on destroy |

Topology

                 host (your laptop)
 ┌──────────────────────────────────────────────────────────────────┐
 │                                                                  │
 │   browser → http://acme.localhost:8080                           │
 │                       │                                          │
 │                       ▼                                          │
 │   ┌────────────┐                                                 │
 │   │ @cial/app- │  reads tenants.{slug, container_port} from PG   │
 │   │  router    │  forwards to 172.x.x.x:port                     │
 │   │ (:8080)    │                                                 │
 │   └────────────┘                                                 │
 │                                                                  │
 │   ┌────────────┐    ┌────────────┐    ┌────────────────────┐     │
 │   │ @cial/app- │    │ postgres   │    │ tenant containers  │     │
 │   │  api       │◄──►│  :5432     │    │ (one per client)   │     │
 │   │ (:3100)    │    └────────────┘    │  cial-tenant:dev   │     │
 │   └────────────┘                      └────────────────────┘     │
 │        │                                                         │
 │        │  spawns/stops via /var/run/docker.sock (DooD)           │
 │        ▼                                                         │
 │   host Docker daemon                                             │
 │                                                                  │
 └──────────────────────────────────────────────────────────────────┘

Phasing

Each phase is independently runnable and ends with a verifiable check.

L0 — Tenant image runs the full Core + Platform stack

No scope reduction. The tenant image runs all four processes with proper user separation and a single exposed port. This derisks the hardest piece of the infra (multi-process container, internal edge, two-user model) before any of the orchestration / routing is built. Chat / agent / deploy logic stays as 501 stubs — those come back via PLAN.md phases — but the plumbing is real.

Container topology (single exposed port: 8080):

                  external :8080
                        │
                        ▼
              ┌──────────────────┐
              │ @cial/edge       │  PID 1 supervises and routes
              │ (Node, user cial)│
              └──────────────────┘
                  │     │     │     │
        /.cial/api/*    │     │     │
                  ▼     │     │     │
            @cial/back  │     │     │   user: cial   :4000
            (Express)   │     │     │   data: /cial/data (cial:0700)
                  /.cial/*    │     │
                        ▼     │     │
                @cial/front   │     │   user: cial   :4001
                (Next, basePath=/.cial)
                              │     │
                       /api/p/*    │     (internal — NOT exposed for v1)
                              ▼     │
                      @cial/platform-back :3001  user: agent
                                    │
                                    │  /* (everything else)
                                    ▼
                            @cial/platform-front :3000  user: agent
                                       (Next, basePath=/)
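
A minimal sketch of what the edge process does with this topology, assuming the http-proxy package named in the tech picks; the package layout, variable names and the 502 handling are illustrative, not decided:

// core/edge (sketch) — route by path prefix, most specific first; only process on :8080.
import http from 'node:http';
import httpProxy from 'http-proxy';

const proxy = httpProxy.createProxyServer({ xfwd: true });

// /api/p/* stays internal in v1: Platform Front reaches :3001 server-side, the edge never forwards it.
const routes: Array<[string, string]> = [
  ['/.cial/api', 'http://127.0.0.1:4000'], // @cial/back
  ['/.cial',     'http://127.0.0.1:4001'], // @cial/front (basePath=/.cial)
  ['/',          'http://127.0.0.1:3000'], // @cial/platform-front
];

const targetFor = (url = '/') => routes.find(([prefix]) => url.startsWith(prefix))![1];

const server = http.createServer((req, res) => {
  proxy.web(req, res, { target: targetFor(req.url) }, () => {
    res.writeHead(502).end('upstream unavailable');
  });
});

// WebSocket upgrades follow the same table.
server.on('upgrade', (req, socket, head) => {
  proxy.ws(req, socket, head, { target: targetFor(req.url) });
});

server.listen(8080);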

Key decisions (calling these now to avoid round-trips):

  • Supervisor: small Node script as PID 1 wrapped by tini. It spawns the four children, pipes their stdout/stderr with prefixes, and crashes the container if any child dies (so Docker restarts it). No s6/supervisord — keeps the runtime stack pure Node.
  • Internal edge: tiny Node http-proxy server in a new package core/edge (~100 LOC). Handles HTTP and WebSocket upgrade. The only process bound to :8080.
  • Two-user model preserved: cial user owns /cial/core (mode 0700), runs edge + back + front; agent user owns /cial/platform, runs platform-back + platform-front. Edge is cial-owned so the agent process can never bind the public port.
  • Platform Back exposure: internal-only for v1 — reachable from Platform Front server-side via http://localhost:3001, not from the browser. Avoids fighting Next.js for the /api/* namespace. We pick a public mount path later when we know the convention.
  • Next.js mode: production builds with output: 'standalone' so each Next app ships only what it needs (smaller image, faster cold start).
  • No watch / dev mode in the container — that's purely for local dev outside the container (where pnpm smoke already works).

Tasks:

  1. Add core/edge package: Node + http-proxy + ws upgrade handling.
  2. Add core/supervisor (or a small bin/supervisor.mjs inside edge): spawns the four children with the right user via process.setuid (or via su-exec in the entrypoint); sketched after this task list.
  3. Add output: 'standalone' + outputFileTracingRoot to both Next apps (@cial/front, @cial/platform-front) so monorepo symlinks resolve in the standalone bundle.
  4. Add a tiny placeholder route to each surface so we can prove routing:
    • @cial/platform-front: / → <h1>Platform · Tenant {TENANT_ID}</h1>
    • @cial/front: /.cial → <h1>Cial Core · Tenant {TENANT_ID}</h1>
    • @cial/back: /.cial/api/healthz already exists (the back's /healthz route is mounted there)
    • @cial/platform-back: /health already exists (internal only)
  5. Rewrite core/docker/Dockerfile as a real multi-stage build that produces the runtime image with all four services + the supervisor + correct ownership/permissions. Entry: tini -- node /opt/core/edge/bin/supervisor.mjs.
  6. Add core/docker/.dockerignore (already there).
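
Tasks 2 and 3 are the only ones with a non-obvious shape, so two hedged sketches. First the supervisor (task 2, referenced above); commands and paths are placeholders for whatever the Dockerfile actually lays down, not the final layout:

// core/edge/bin/supervisor.mjs (sketch) — run under tini as PID 1.
import { spawn } from 'node:child_process';

// Placeholder commands; the real image points these at the built bundles, and user
// separation (cial vs agent) comes from the uid/gid spawn options or su-exec.
const children = [
  { name: 'back',           cmd: ['node', 'core/back/dist/index.js'] },
  { name: 'front',          cmd: ['node', 'core/front/server.js'] },
  { name: 'platform-back',  cmd: ['node', 'platform/back/dist/index.js'] },
  { name: 'platform-front', cmd: ['node', 'platform/front/server.js'] },
];

for (const { name, cmd } of children) {
  const child = spawn(cmd[0], cmd.slice(1), { stdio: ['ignore', 'pipe', 'pipe'] });

  // Prefix each child's output so the container log stays readable.
  for (const stream of [child.stdout, child.stderr]) {
    stream.on('data', (chunk) =>
      process.stdout.write(String(chunk).replace(/^/gm, `[${name}] `)));
  }

  // If any child dies, take the whole container down so Docker restarts it.
  child.on('exit', (code) => {
    console.error(`[supervisor] ${name} exited with code ${code}`);
    process.exit(code ?? 1);
  });
}

And the Next config shape for task 3 (on older Next versions outputFileTracingRoot sits under experimental):

// next.config.mjs for @cial/front and @cial/platform-front (sketch).
import path from 'node:path';

/** @type {import('next').NextConfig} */
export default {
  output: 'standalone',
  // Trace from the monorepo root so pnpm-symlinked workspace deps land in the standalone bundle.
  outputFileTracingRoot: path.join(process.cwd(), '../..'),
  // @cial/front additionally sets basePath: '/.cial'.
};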

Verify (acceptance for L0):

docker build -f core/docker/Dockerfile -t cial-tenant:dev .
docker run --rm -e TENANT_ID=demo -p 9000:8080 cial-tenant:dev
# In another shell:
curl http://localhost:9000/                     # → "Platform · Tenant demo"
curl http://localhost:9000/.cial                # → "Cial Core · Tenant demo"
curl http://localhost:9000/.cial/api/healthz    # → {"status":"ok",…}
docker exec <id> ls -la /cial/core           # → owner cial, mode 0700
docker exec -u agent <id> cat /opt/core/back/dist/index.js
                                                 # → permission denied (proves the boundary)

If all five checks pass, L0 is done. The container is then a real, production-shaped artifact — every later phase just plugs into it.

L1 — docker compose + Postgres + App API + owner signup

  • docker-compose.yml at repo root with services: postgres, app-api
  • app/api:
    • Drizzle schema: users, tenants (slug, name, container_id, container_port, state, owner_id); a schema sketch follows this list
    • drizzle-kit generate + on-boot migrate
    • Better-Auth wired at /api/auth/[...all] — email+password, no verification in dev
    • /admin/login and /admin/signup pages (signup disabled after first user; first user is owner)
  • app-api Dockerfile updated to actually run next start on :3100
  • Verify: docker compose up, browser to http://localhost:3100/admin/signup, create owner account, redirected to /admin (empty tenant list).
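
A possible shape for the Drizzle schema named above; columns beyond those listed (ids, timestamps) are assumptions, not decisions:

// app/api/src/db/schema.ts (sketch) — exact column names and types are illustrative.
import { pgTable, text, integer, timestamp } from 'drizzle-orm/pg-core';

export const users = pgTable('users', {
  id: text('id').primaryKey(),
  email: text('email').notNull().unique(),
  createdAt: timestamp('created_at').defaultNow().notNull(),
});

export const tenants = pgTable('tenants', {
  id: text('id').primaryKey(),
  slug: text('slug').notNull().unique(),
  name: text('name').notNull(),
  containerId: text('container_id'),
  containerPort: integer('container_port'),
  state: text('state').notNull().default('provisioning'), // provisioning | starting | running | stopped
  ownerId: text('owner_id').notNull().references(() => users.id),
  createdAt: timestamp('created_at').defaultNow().notNull(),
});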

L2 — Admin tenant CRUD

  • /admin page: list tenants from DB
  • "New tenant" form (slug, name) → POST /api/admin/tenants
  • Validates slug (^[a-z0-9-]{2,40}$), inserts row with state='provisioning'
  • For now, no orchestrator call — just DB row + UI confirmation
  • Verify: create a tenant in UI, see it in the list with state=provisioning.
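
A sketch of the POST /api/admin/tenants handler, assuming a Next.js route handler and the schema sketched under L1; the import paths and session lookup are stand-ins:

// app/api — POST /api/admin/tenants (sketch; wiring and imports are assumptions).
import { db } from '@/db';              // assumed drizzle instance
import { tenants } from '@/db/schema';

const SLUG_RE = /^[a-z0-9-]{2,40}$/;

export async function POST(req: Request): Promise<Response> {
  const { slug, name } = await req.json();

  if (typeof slug !== 'string' || !SLUG_RE.test(slug)) {
    return Response.json({ error: 'invalid slug' }, { status: 400 });
  }

  // L2 stops at the DB row; L3 adds the orchestrator.create() call after this insert.
  const [tenant] = await db
    .insert(tenants)
    .values({
      id: crypto.randomUUID(),
      slug,
      name,
      state: 'provisioning',
      ownerId: 'owner-id-from-session', // from the Better-Auth session in practice
    })
    .returning();

  return Response.json(tenant, { status: 201 });
}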

L3 — Orchestrator Docker driver

  • app/orchestrator/src/drivers/docker.ts: implements Orchestrator via dockerode
    • create(tenant): docker create from cial-tenant:dev with env TENANT_ID, label cial.tenant=<id>, named volume, exposed port mapped to a free host port. Sets state=starting.
    • start/stop/destroy/status: trivial dockerode calls
  • app-api calls orchestrator.create() after the DB insert in L2; records container_id, container_port; flips state to running once the container's /healthz responds.
  • App container needs /var/run/docker.sock mounted in compose.
  • Verify: create a tenant in UI → admin list shows state=running with a port; docker ps shows the container; curl localhost:<port> returns the L0 placeholder page.
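
A hedged sketch of the create() path via dockerode; the Orchestrator interface itself isn't pinned down here, so this is a free function with an assumed signature (it also starts the container, which the plan splits into a separate call):

// app/orchestrator/src/drivers/docker.ts (sketch of the create path).
import Docker from 'dockerode';

const docker = new Docker({ socketPath: '/var/run/docker.sock' });

export async function createTenantContainer(tenant: { id: string; slug: string }) {
  const container = await docker.createContainer({
    Image: 'cial-tenant:dev',
    name: `cial-tenant-${tenant.id}`,
    Env: [`TENANT_ID=${tenant.id}`],
    Labels: { 'cial.tenant': tenant.id },
    ExposedPorts: { '8080/tcp': {} },
    HostConfig: {
      // An empty HostPort lets Docker pick a free host port; we read it back below.
      PortBindings: { '8080/tcp': [{ HostPort: '' }] },
      Binds: [`cial-tenant-${tenant.id}-data:/cial/data`], // named volume per tenant
      RestartPolicy: { Name: 'unless-stopped' },
    },
  });

  await container.start();

  // Ask Docker which host port it picked so the router can reach the tenant.
  const info = await container.inspect();
  const hostPort = Number(info.NetworkSettings.Ports['8080/tcp'][0].HostPort);

  // Caller records these and flips state to running once /healthz answers.
  return { containerId: container.id, containerPort: hostPort };
}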

L4 — Router reverse proxy on :8080

  • app/router/src/index.ts: small Node HTTP+WS server on :8080 using http-proxy
  • Reads tenant routes from Postgres on each request (cache with 5s TTL)
  • Host header acme.localhost → tenants WHERE slug='acme' AND state='running' → forward to localhost:<container_port>
  • 404 if unknown slug; 503 if state ≠ running
  • Add router service to compose
  • Verify: http://acme.localhost:8080 shows the tenant's page (Tenant acme).
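
A sketch of the router loop with the 5-second cache; lookupTenant() stands in for the real Postgres query, and the slug parsing assumes the {slug}.localhost convention above:

// app/router/src/index.ts (sketch).
import http from 'node:http';
import httpProxy from 'http-proxy';

type Route = { port: number; state: string };
const proxy = httpProxy.createProxyServer({ ws: true, xfwd: true });
const cache = new Map<string, { route: Route | null; at: number }>();
const TTL_MS = 5_000;

async function lookupTenant(slug: string): Promise<Route | null> {
  // SELECT container_port, state FROM tenants WHERE slug = $1 — via pg/drizzle in practice.
  return null; // placeholder
}

async function resolve(hostHeader = ''): Promise<Route | null> {
  const slug = hostHeader.split(':')[0].replace(/\.localhost$/, '');
  const hit = cache.get(slug);
  if (hit && Date.now() - hit.at < TTL_MS) return hit.route;
  const route = await lookupTenant(slug);
  cache.set(slug, { route, at: Date.now() });
  return route;
}

const server = http.createServer(async (req, res) => {
  const route = await resolve(req.headers.host);
  if (!route) {
    res.writeHead(404).end('unknown tenant');
  } else if (route.state !== 'running') {
    res.writeHead(503).end('tenant not running');
  } else {
    proxy.web(req, res, { target: `http://localhost:${route.port}` });
  }
});

server.on('upgrade', async (req, socket, head) => {
  const route = await resolve(req.headers.host);
  if (route?.state === 'running') proxy.ws(req, socket, head, { target: `http://localhost:${route.port}` });
  else socket.destroy();
});

server.listen(8080);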

L5 — SSO handoff App → tenant Core

  • App admin: tenant detail page has an Open button → /api/admin/tenants/:slug/open
  • That endpoint mints a short-lived (60s) HS256 JWT { sub: ownerId, tenant: slug, role: 'owner' } signed with CIAL_SSO_SECRET, redirects browser to http://{slug}.localhost:8080/.cial/sso?token=...
  • Core Back: /.cial/sso validates token, sets a session cookie scoped to the tenant subdomain, redirects to /. (For v1 the cookie is the marker; Phase 3 of PLAN.md replaces this with real Better-Auth in Core.)
  • The tenant / page now shows Tenant {id} · signed in as {sub}.
  • Verify: from admin, click "Open acme" → land on acme.localhost:8080 with the tenant page showing your owner email.
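
A sketch of both halves of the handoff, minting in the App and validation in Core Back; jsonwebtoken is an assumed library choice (any HS256-capable JWT library would do), and cookie handling is left to the real handler:

// Sketch only — claim shape and expiry are the ones specified above.
import jwt from 'jsonwebtoken';

const SECRET = process.env.CIAL_SSO_SECRET!;

// App side: /api/admin/tenants/:slug/open returns a redirect to this URL.
export function mintSsoRedirect(ownerId: string, slug: string): string {
  const token = jwt.sign({ sub: ownerId, tenant: slug, role: 'owner' }, SECRET, {
    algorithm: 'HS256',
    expiresIn: '60s',
  });
  return `http://${slug}.localhost:8080/.cial/sso?token=${encodeURIComponent(token)}`;
}

// Core Back side: GET /.cial/sso?token=...
export function validateSsoToken(token: string): { ownerId: string; tenant: string } {
  const claims = jwt.verify(token, SECRET, { algorithms: ['HS256'] }) as { sub?: string; tenant?: string };
  // On success the handler sets the session cookie for {slug}.localhost and redirects to /.
  return { ownerId: String(claims.sub), tenant: String(claims.tenant) };
}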

L6 — pnpm local:up wraps everything

  • One-shot script that:
    1. Builds cial-tenant:dev image
    2. Runs docker compose up --build -d
    3. Tails logs, prints → http://localhost:3100/admin when ready
  • pnpm local:down tears compose + prunes any orphan tenant containers (label-selected: label=cial.tenant)
  • A new core/scripts/smoke-local.mjs end-to-end:
    1. local:up
    2. sign up owner via API
    3. create tenant via API
    4. probe acme.localhost:8080 until 200
    5. local:down
  • Verify: green E2E smoke from a clean state.
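
Step 4 is the only fiddly part of the smoke script; a sketch of a poll-until-200 helper using Node's global fetch (if the OS resolver doesn't map *.localhost the way browsers do, probe 127.0.0.1:8080 with an explicit Host header instead):

// core/scripts/smoke-local.mjs (excerpt, sketch) — step 4: poll until the tenant answers.
async function waitFor200(url, timeoutMs = 120_000, intervalMs = 2_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.status === 200) return;
    } catch {
      // router or tenant container still starting; keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`timed out waiting for ${url}`);
}

await waitFor200('http://acme.localhost:8080/');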

Working agreement

  • Same as PLAN.md: implement → self-test → commit phase(L<n>): … → push.
  • Each phase ends with a green check. If a phase's verify step fails, we fix before moving on.
  • Anything bigger than L0–L6 (e.g., bring Core Front + Platform processes inside the tenant image) is a follow-up scoped after L6 is green.

Open questions to revisit later (NOT for v1)

  • Multi-process tenant (Core Front + Platform Back + Platform Front inside the container) — needed for the real Platform editing experience
  • Trigger fabric / scheduler — needed once tenants have cron triggers
  • Custom domains for tenants
  • Per-tenant resource limits (memory/CPU caps via Docker)
  • Tenant suspend (Docker stop) ↔ cold start time on first request