Eliot M d4a39d425a phase(L0): tenant container runs full Core + Platform stack
Per-tenant container now boots all five processes behind a single
exposed port (:8080), with the Core/Platform boundary enforced at
the filesystem level (two Linux users, mode 0700 on cial-core).

- @cial/edge: http-proxy edge (HTTP+WS) + node supervisor (PID 1
  under tini, spawns each service via gosu as the right user)
- Routes: /.cial/api/* -> back (prefix stripped), /.cial/* -> core
  front (basePath kept), /* -> platform front. Platform Back is
  internal-only for v1.
- Dockerfile: multi-stage (builder + runtime). Builds protocol/sdk/
  back/edge/front/platform-back/platform-front. Runtime installs
  tini+gosu, creates cial:1000 / agent:1001, locks down cial-core
  to 0700.
- Placeholder pages now render TENANT_ID at request time so the
  smoke can verify per-tenant env propagation end-to-end.
- scripts/smoke-tenant.mjs: docker-driven L0 acceptance — boots the
  image, polls healthz, probes the four route classes, and asserts
  the agent user cannot read /opt/cial-monorepo/cial-core.
- PLAN-LOCAL.md: phased local-mode roadmap (L0..L6).

Verify on a host with docker:
  docker build -f cial-core/docker/Dockerfile -t cial-tenant:dev .
  pnpm smoke:tenant

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-26 10:46:16 +00:00


Local mode — full multi-tenant Cial on your laptop

Goal: pnpm local:up brings up the entire Cial platform locally with Docker standing in for Fly Machines. You sign up as the owner in the App admin, create a "client", a real Docker container is spawned for that tenant, you click the tenant URL, and you land in their per-tenant Cial — SSO'd in, already authenticated.

This document scopes "plumbing v1" — same code path as production, only the orchestrator's driver swaps (Docker ↔ Fly). It pulls Phase 2 of PLAN.md (real container entrypoint) and parts of Phases 11–13 (App auth, tenant CRUD, orchestrator, router) forward.


Decisions (locked in)

| | choice |
|---|---|
| Tenant URL | {slug}.localhost:8080 (modern browsers resolve *.localhost → 127.0.0.1; no /etc/hosts edits) |
| Signup | owner-only (you log into admin and create tenants) |
| Scope | plumbing v1 — spawn tenant, route to it, SSO works, tenant container responds with an identifiable placeholder |
| Orchestration | docker compose for App+Postgres+Router; the App spawns tenant containers via the host Docker socket (DooD pattern) |
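Given the `{slug}.localhost` decision, slug extraction from the Host header is a one-liner. A sketch (the helper name is assumed, not from the repo):

```javascript
// Hypothetical helper: pull the tenant slug out of a Host header like
// "acme.localhost:8080". Returns null for hosts that don't match.
export function slugFromHost(host) {
  const match = /^([a-z0-9-]{2,40})\.localhost(?::\d+)?$/.exec(host ?? "");
  return match ? match[1] : null;
}
```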

Tech picks

| concern | choice | why |
|---|---|---|
| App DB | postgres:16 in compose | matches prod |
| App ORM | drizzle-orm + drizzle-kit | already in deps, lightweight |
| App Auth | better-auth | locked-in choice from infra deck |
| Orchestrator driver | dockerode | typed, supports the full Docker API; no shelling out |
| Router proxy | tiny Node service using http-proxy (lives in cial-app/router) | dedicated process, separates concerns, matches @cial/app-router package we already scaffolded |
| SSO | HS256 JWT (shared secret in .env.local) for v1; bumps to RS256 in prod | smaller blast radius for a dev mode |
| Tenant data | named docker volume per tenant: cial-tenant-{id}-data | survives docker stop, gone with the tenant on destroy |

Topology

                 host (your laptop)
 ┌──────────────────────────────────────────────────────────────────┐
 │                                                                  │
 │   browser → http://acme.localhost:8080                           │
 │                       │                                          │
 │                       ▼                                          │
 │   ┌────────────┐                                                 │
 │   │ @cial/app- │  reads tenants.{slug, container_port} from PG   │
 │   │  router    │  forwards to 172.x.x.x:port                     │
 │   │ (:8080)    │                                                 │
 │   └────────────┘                                                 │
 │                                                                  │
 │   ┌────────────┐    ┌────────────┐    ┌────────────────────┐     │
 │   │ @cial/app- │    │ postgres   │    │ tenant containers  │     │
 │   │  api       │◄──►│  :5432     │    │ (one per client)   │     │
 │   │ (:3100)    │    └────────────┘    │  cial-tenant:dev   │     │
 │   └────────────┘                      └────────────────────┘     │
 │        │                                                         │
 │        │  spawns/stops via /var/run/docker.sock (DooD)           │
 │        ▼                                                         │
 │   host Docker daemon                                             │
 │                                                                  │
 └──────────────────────────────────────────────────────────────────┘

Phasing

Each phase is independently runnable and ends with a verifiable check.

L0 — Tenant image runs the full Core + Platform stack

No scope reduction. The tenant image runs the edge plus all four services with proper user separation and a single exposed port. This derisks the hardest piece of the infra (multi-process container, internal edge, two-user model) before any of the orchestration / routing is built. Chat / agent / deploy logic stays as 501 stubs — those come back via PLAN.md phases — but the plumbing is real.

Container topology (single exposed port: 8080):

                  external :8080
                        │
                        ▼
              ┌──────────────────┐
              │ @cial/edge       │  PID 1 supervises and routes
              │ (Node, user cial)│
              └──────────────────┘
                  │     │     │     │
        /.cial/api/*    │     │     │
                  ▼     │     │     │
            @cial/back  │     │     │   user: cial   :4000
            (Express)   │     │     │   data: /var/lib/cial (cial:0700)
                  /.cial/*    │     │
                        ▼     │     │
                @cial/front   │     │   user: cial   :4001
                (Next, basePath=/.cial)
                              │     │
                       /api/p/*    │     (internal — NOT exposed for v1)
                              ▼     │
                      @cial/platform-back :3001  user: agent
                                    │
                                    │  /* (everything else)
                                    ▼
                            @cial/platform-front :3000  user: agent
                                       (Next, basePath=/)
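The routing in the diagram above amounts to a longest-prefix match over three path classes. A minimal sketch, assuming the ports shown in the diagram (everything else here is illustrative, not from the repo):

```javascript
// Hypothetical sketch of the edge's path routing; upstream ports are the
// ones from the topology diagram, names are assumptions.
const upstreams = {
  back: "http://127.0.0.1:4000",          // @cial/back
  coreFront: "http://127.0.0.1:4001",     // @cial/front (basePath /.cial)
  platformFront: "http://127.0.0.1:3000", // @cial/platform-front
};

// Returns the target plus the path to forward — prefix stripped only
// for the API class, kept for Core Front (basePath) and Platform Front.
export function route(path) {
  if (path === "/.cial/api" || path.startsWith("/.cial/api/")) {
    return { target: upstreams.back, path: path.slice("/.cial/api".length) || "/" };
  }
  if (path === "/.cial" || path.startsWith("/.cial/")) {
    return { target: upstreams.coreFront, path }; // basePath kept
  }
  return { target: upstreams.platformFront, path };
}
```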

Key decisions (calling these now to avoid round-trips):

  • Supervisor: small Node script as PID 1 wrapped by tini. It spawns the four children, pipes their stdout/stderr with prefixes, and crashes the container if any child dies (so Docker restarts it). No s6/supervisord — keeps the runtime stack pure Node.
  • Internal edge: tiny Node http-proxy server in a new package cial-core/edge (~100 LOC). Handles HTTP and WebSocket upgrade. The only process bound to :8080.
  • Two-user model preserved: cial user owns /opt/cial-core (mode 0700), runs edge + back + front; agent user owns /opt/cial-platform, runs platform-back + platform-front. Edge is cial-owned so the agent process can never bind the public port.
  • Platform Back exposure: internal-only for v1 — reachable from Platform Front server-side via http://localhost:3001, not from the browser. Avoids fighting Next.js for the /api/* namespace. We pick a public mount path later when we know the convention.
  • Next.js mode: production builds with output: 'standalone' so each Next app ships only what it needs (smaller image, faster cold start).
  • No watch / dev mode in the container — that's purely for local dev outside the container (where pnpm smoke already works).
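The supervisor decision above — spawn children, prefix their logs, die if any child dies — can be sketched in a few lines of Node. A sketch under stated assumptions (the `supervise` signature and child-spec shape are invented for illustration):

```javascript
// Hypothetical sketch of the PID-1 supervisor loop. The crash-on-child-exit
// behavior is from the plan; names and signatures are assumptions.
import { spawn } from "node:child_process";

// Each child spec: name (used as a log prefix), command, args, extra env.
export function supervise(children, onFatal = (code) => process.exit(code)) {
  for (const { name, cmd, args = [], env = {} } of children) {
    const child = spawn(cmd, args, {
      env: { ...process.env, ...env },
      stdio: ["ignore", "pipe", "pipe"],
    });
    // Prefix each line of child output so interleaved logs stay attributable.
    const prefix = (stream, sink) =>
      stream.on("data", (buf) =>
        buf.toString().split("\n").filter(Boolean)
          .forEach((line) => sink.write(`[${name}] ${line}\n`)));
    prefix(child.stdout, process.stdout);
    prefix(child.stderr, process.stderr);
    // Any child dying takes the whole container down so Docker restarts it.
    child.on("exit", (code) => onFatal(code ?? 1));
  }
}
```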

Tasks:

  1. Add cial-core/edge package: Node + http-proxy + ws upgrade handling.
  2. Add cial-core/supervisor (or a small bin/supervisor.mjs inside edge): spawns the four children with the right user via process.setuid (or via su-exec in the entrypoint).
  3. Add output: 'standalone' + outputFileTracingRoot to both Next apps (@cial/front, @cial/platform-front) so monorepo symlinks resolve in the standalone bundle.
  4. Add a tiny placeholder route to each surface so we can prove routing:
    • @cial/platform-front / → <h1>Platform · Tenant {TENANT_ID}</h1>
    • @cial/front /.cial → <h1>Cial Core · Tenant {TENANT_ID}</h1>
    • @cial/back /.cial/api/health already exists (/healthz mounted there)
    • @cial/platform-back /health already exists (internal only)
  5. Rewrite cial-core/docker/Dockerfile as a real multi-stage build that produces the runtime image with all four services + the supervisor + correct ownership/permissions. Entry: tini -- node /opt/cial-core/edge/bin/supervisor.mjs.
  6. Add cial-core/docker/.dockerignore (already there).

Verify (acceptance for L0):

docker build -f cial-core/docker/Dockerfile -t cial-tenant:dev .
docker run --rm -e TENANT_ID=demo -p 9000:8080 cial-tenant:dev
# In another shell:
curl http://localhost:9000/                     # → "Platform · Tenant demo"
curl http://localhost:9000/.cial                # → "Cial Core · Tenant demo"
curl http://localhost:9000/.cial/api/healthz    # → {"status":"ok",…}
docker exec <id> ls -la /opt/cial-core           # → owner cial, mode 0700
docker exec --user agent <id> cat /opt/cial-core/back/dist/index.js
                                                 # → permission denied (proves boundary)

If all five checks pass, L0 is done. The container is then a real, production-shaped artifact — every later phase just plugs into it.

L1 — docker compose + Postgres + App API + owner signup

  • docker-compose.yml at repo root with services: postgres, app-api
  • cial-app/api:
    • Drizzle schema: users, tenants (slug, name, container_id, container_port, state, owner_id)
    • drizzle-kit generate + on-boot migrate
    • Better-Auth wired at /api/auth/[...all] — email+password, no verification in dev
    • /admin/login and /admin/signup pages (signup disabled after first user; first user is owner)
  • app-api Dockerfile updated to actually run next start on :3100
  • Verify: docker compose up, browser to http://localhost:3100/admin/signup, create owner account, redirected to /admin (empty tenant list).

L2 — Admin tenant CRUD

  • /admin page: list tenants from DB
  • "New tenant" form (slug, name) → POST /api/admin/tenants
  • Validates slug (^[a-z0-9-]{2,40}$), inserts row with state='provisioning'
  • For now, no orchestrator call — just DB row + UI confirmation
  • Verify: create a tenant in UI, see it in the list with state=provisioning.
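The validate-then-insert step above can be sketched as a small pure function; the DB call is stubbed out, and only the slug rule and the `state='provisioning'` row shape come from the plan (the function name is an assumption):

```javascript
// Hypothetical core of the POST /api/admin/tenants handler.
// Slug rule from the plan: 2-40 chars of lowercase letters, digits, hyphens.
const SLUG_RE = /^[a-z0-9-]{2,40}$/;

export function newTenantRow({ slug, name }, ownerId) {
  if (!SLUG_RE.test(slug)) throw new Error(`invalid slug: ${slug}`);
  // Row as it would be inserted via drizzle; container fields filled in L3.
  return { slug, name, owner_id: ownerId, state: "provisioning" };
}
```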

L3 — Orchestrator Docker driver

  • cial-app/orchestrator/src/drivers/docker.ts: implements Orchestrator via dockerode
    • create(tenant): docker create from cial-tenant:dev with env TENANT_ID, label cial.tenant=<id>, named volume, exposed port mapped to a free host port. Sets state=starting.
    • start/stop/destroy/status: trivial dockerode calls
  • app-api calls orchestrator.create() after the DB insert in L2; records container_id, container_port; flips state to running once the container's /healthz responds.
  • App container needs /var/run/docker.sock mounted in compose.
  • Verify: create a tenant in UI → admin list shows state=running with a port; docker ps shows the container; curl localhost:<port> returns the L0 placeholder page.
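The socket mount for the DooD pattern might look like this in compose — only the socket path and the :3100 port are from the plan; service layout and the rest are assumptions:

```yaml
# Hypothetical docker-compose.yml fragment for L3.
services:
  app-api:
    build: ./cial-app/api
    ports:
      - "3100:3100"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # DooD: spawn tenants on the host daemon
    depends_on:
      - postgres
```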

L4 — Router reverse proxy on :8080

  • cial-app/router/src/index.ts: small Node HTTP+WS server on :8080 using http-proxy
  • Reads tenant routes from Postgres on each request (cache with 5s TTL)
  • Host header acme.localhost → tenants WHERE slug='acme' AND state='running' → forward to localhost:<container_port>
  • 404 if unknown slug; 503 if state ≠ running
  • Add router service to compose
  • Verify: http://acme.localhost:8080 shows the tenant's page (Tenant acme).
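The 5s-TTL route cache above can be sketched as a small memoizer with an injectable clock (all names here are assumptions; `lookup` stands in for the Postgres query):

```javascript
// Hypothetical sketch of the router's per-slug route cache. `lookup` would
// hit Postgres in the real router; `now` is injectable for testing.
export function makeRouteCache(lookup, ttlMs = 5000, now = Date.now) {
  const cache = new Map();
  return async function resolve(slug) {
    const hit = cache.get(slug);
    if (hit && now() - hit.at < ttlMs) return hit.value;
    const value = await lookup(slug); // e.g. SELECT container_port FROM tenants WHERE slug=$1
    cache.set(slug, { at: now(), value });
    return value;
  };
}
```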

L5 — SSO handoff App → tenant Core

  • App admin: tenant detail page has an Open button → /api/admin/tenants/:slug/open
  • That endpoint mints a short-lived (60s) HS256 JWT { sub: ownerId, tenant: slug, role: 'owner' } signed with CIAL_SSO_SECRET, redirects browser to http://{slug}.localhost:8080/.cial/sso?token=...
  • Core Back: /.cial/sso validates token, sets a session cookie scoped to the tenant subdomain, redirects to /. (For v1 the cookie is the marker; Phase 3 of PLAN.md replaces this with real Better-Auth in Core.)
  • The tenant / page now shows Tenant {id} · signed in as {sub}.
  • Verify: from admin, click "Open acme" → land on acme.localhost:8080 with the tenant page showing your owner email.
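A minimal sketch of the HS256 mint/verify pair using only node:crypto — the claim set and 60s TTL are from the plan, the helper names are assumptions (the real endpoint would live in app-api):

```javascript
// Hypothetical v1 SSO token helpers; HS256 = HMAC-SHA256 over header.payload.
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (s) => Buffer.from(s).toString("base64url");

export function mintSsoToken(payload, secret, ttlSeconds = 60) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify({
    ...payload,
    exp: Math.floor(Date.now() / 1000) + ttlSeconds, // short-lived by design
  }));
  const sig = createHmac("sha256", secret).update(`${header}.${body}`).digest("base64url");
  return `${header}.${body}.${sig}`;
}

// Returns the payload on success, null on bad signature or expiry.
export function verifySsoToken(token, secret) {
  const [header, body, sig] = token.split(".");
  const expected = createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const given = Buffer.from(sig ?? "", "base64url");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;
  const payload = JSON.parse(Buffer.from(body, "base64url").toString());
  if (payload.exp < Math.floor(Date.now() / 1000)) return null;
  return payload;
}
```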

L6 — pnpm local:up wraps everything

  • One-shot script that:
    1. Builds cial-tenant:dev image
    2. Runs docker compose up --build -d
    3. Tails logs, prints → http://localhost:3100/admin when ready
  • pnpm local:down tears compose + prunes any orphan tenant containers (label-selected: label=cial.tenant)
  • A new scripts/smoke-local.mjs end-to-end:
    1. local:up
    2. sign up owner via API
    3. create tenant via API
    4. probe acme.localhost:8080 until 200
    5. local:down
  • Verify: green E2E smoke from a clean state.
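Step 4's probe loop generalizes to a tiny retry helper the smoke could reuse; a sketch (name and options are assumptions):

```javascript
// Hypothetical poll-until helper for smoke-local.mjs: retries `check`
// until it resolves truthy or the deadline passes; rejections count as "not yet".
export async function pollUntil(check, { timeoutMs = 30000, intervalMs = 500 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check().catch(() => false)) return true;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false;
}
```

In the smoke this would wrap a fetch of the tenant URL, e.g. `pollUntil(async () => (await fetch("http://acme.localhost:8080/")).ok)`.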

Working agreement

  • Same as PLAN.md: implement → self-test → commit phase(L<n>): … → push.
  • Each phase ends with a green check. If a phase's verify step fails, we fix before moving on.
  • Anything bigger than L0–L6 (e.g., bring Core Front + Platform processes inside the tenant image) is a follow-up scoped after L6 is green.

Open questions to revisit later (NOT for v1)

  • Multi-process tenant (Core Front + Platform Back + Platform Front inside the container) — needed for the real Platform editing experience
  • Trigger fabric / scheduler — needed once tenants have cron triggers
  • Custom domains for tenants
  • Per-tenant resource limits (memory/CPU caps via Docker)
  • Tenant suspend (Docker stop) ↔ cold start time on first request