Auraison — Management Plane Design
Date: 2026-02-23 Status: Draft (v2 — not yet implemented)
Problem
As the user plane grows to run multiple concurrent agentic workloads across tenants, and the control plane orchestrates an increasing number of Claude Code agent subprocesses, three cross-cutting concerns emerge that neither plane can own cleanly:
- Governance: who is allowed to run what, at what cost, subject to which quotas
- Observability: a unified view of agent decisions, job outcomes, and resource consumption across both planes — not just logs, but a causal trace linking control plane intent to user plane execution
- Learning loops: aggregated performance signals that feed back into model selection, system prompt tuning, and tool scoping policy
The management plane is the governance and observability substrate for the entire system. It does not route jobs or execute workloads. It watches, enforces, and learns.
Goals
- Tenant isolation: namespaces, RBAC, per-tenant quotas
- Cost and token tracking: per-job, per-user, per-tenant
- Unified observability: traces and metrics spanning both planes
- Agent decision history: immutable audit log of all agent reasoning and tool calls
- Policy enforcement: tool scoping, rate limits, actuation gates
- Evaluation loops: experiment performance data → model/config improvement signals
- Model lifecycle management: which model version each agent uses, A/B testing
Non-goals (v2 scope — not yet implemented)
- Payment processing (third-party: Stripe or equivalent)
- Real-time control plane routing decisions — management plane is advisory, not blocking (except for hard quota enforcement)
- User plane execution — the management plane never submits jobs
Architecture
Position in the four-plane stack
The management plane consumes events from both planes. It does not sit in the hot path of request/response between the control and user planes.
Key components
Tenant Registry
Stores organizations, users, and K8s namespace mappings. Each tenant has:
- An isolated K8s namespace in the user plane
- A set of allowed environments (`torch.dev.gpu`, `ros.dev.gpu`, or both)
- A quota configuration (see Policy Engine)
Policy Engine
Enforces governance at two points:
- Pre-dispatch (soft enforcement, control plane side): the control plane queries the policy engine before submitting a job. If the tenant is over quota, the job is rejected before a subprocess is spawned.
- Tool scoping (hard enforcement, agent side): `--allowedTools` strings for each agent role are managed by the policy engine, not hardcoded in agent Python wrappers. In v2, the control plane fetches the current policy before spawning each `claude -p` subprocess.
| Policy type | v1 location | v2 location |
|---|---|---|
| Tool scoping (`Bash(kubectl *)`) | Hardcoded in agent wrappers | Policy Engine → fetched at spawn time |
| Max job duration | JobSpec field | Policy Engine → default per tenant role |
| GPU quota | Not enforced | Policy Engine → KubeRay ResourceQuota |
| Token budget | Not tracked | Billing Service → enforced by Policy Engine |
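The pre-dispatch check above can be sketched as follows. This is a minimal in-memory stand-in for the real Policy Engine service; the `Quota` fields mirror the quota query response later in this document, and everything else (class names, the `check_dispatch` method) is illustrative, not part of the design.

```python
from dataclasses import dataclass

@dataclass
class Quota:
    gpu_hours_remaining: float
    token_budget_remaining: int
    max_jobs_in_flight: int
    jobs_in_flight: int = 0

class PolicyEngine:
    """In-memory stand-in for the Policy Engine's pre-dispatch check."""

    def __init__(self):
        self._quotas: dict[str, Quota] = {}

    def set_quota(self, tenant_id: str, quota: Quota) -> None:
        self._quotas[tenant_id] = quota

    def check_dispatch(self, tenant_id: str) -> tuple[bool, str]:
        """Soft enforcement: reject before a subprocess is spawned."""
        q = self._quotas.get(tenant_id)
        if q is None:
            return False, "unknown tenant"
        if q.gpu_hours_remaining <= 0:
            return False, "GPU quota exhausted"
        if q.token_budget_remaining <= 0:
            return False, "token budget exhausted"
        if q.jobs_in_flight >= q.max_jobs_in_flight:
            return False, "too many jobs in flight"
        return True, "ok"
```

The control plane would call `check_dispatch` once per job submission and surface the reason string to the user on rejection.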
Observability Store
A unified event store for agent reasoning and execution traces. Every `claude -p` subprocess invocation emits an `AgentEvent` on completion; every Ray job emits an `ExecutionEvent`.
```
AgentEvent {
  event_id: UUID
  timestamp: ISO 8601
  plane: "control"
  agent_role: "notebook" | "cluster" | "wandb" | "lakehouse"
  job_id: UUID (optional — links to JobSpec)
  session_id: str (claude -p session ID, enables conversation replay)
  tool_calls: [{tool: str, input: dict, output: str, duration_ms: int}]
  tokens_in: int
  tokens_out: int
  exit_code: int
  tenant_id: UUID
}
```
```
ExecutionEvent {
  event_id: UUID
  timestamp: ISO 8601
  plane: "user"
  job_id: UUID
  environment: "torch.dev.gpu" | "ros.dev.gpu"
  ray_job_id: str
  status: PENDING | RUNNING | SUCCEEDED | FAILED
  gpu_seconds: float
  tenant_id: UUID
}
```
These events are the primary input for billing, evaluation, and audit.
Billing Service
Aggregates AgentEvent.tokens_in/out and ExecutionEvent.gpu_seconds per tenant per billing
period. In v2:
- Token cost = `(tokens_in + tokens_out) × model_rate` per agent invocation
- GPU cost = `gpu_seconds × gpu_rate` per Ray job
- Both are attributed to the tenant who submitted the job
In v1, cost tracking is absent. The Billing Service is the first management plane component to implement in v2.
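The aggregation step can be sketched as a pure function over the two event streams (event dicts are a hypothetical shape carrying only the fields billing needs; the real service would read from the Observability Store and persist to Postgres):

```python
def billing_summary(agent_events, execution_events, model_rate, gpu_rate):
    """Aggregate per-tenant cost for one billing period.

    Token cost: (tokens_in + tokens_out) * model_rate per agent invocation.
    GPU cost:   gpu_seconds * gpu_rate per Ray job.
    """
    costs: dict[str, float] = {}
    for ev in agent_events:
        tokens = ev["tokens_in"] + ev["tokens_out"]
        costs[ev["tenant_id"]] = costs.get(ev["tenant_id"], 0.0) + tokens * model_rate
    for ev in execution_events:
        costs[ev["tenant_id"]] = (
            costs.get(ev["tenant_id"], 0.0) + ev["gpu_seconds"] * gpu_rate
        )
    return costs
```

Attribution falls out of the shared `tenant_id` field, which is exactly why both event schemas carry it.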
Evaluation Loop
The evaluation loop closes the feedback cycle between execution outcomes and model and configuration choices:
```
ExecutionEvent (job succeeded/failed)
  → W&B run linked to job
  → WandBAgent surfaces regression signal
  → Evaluation Loop aggregates across runs
  → signal: "NotebookAgent on torch.dev.gpu has 23% failure rate for lakehouse jobs"
  → operator action: update NotebookAgent system prompt, adjust tool scope
```
In v2, the evaluation loop is a scheduled job (daily or per N completions) that:
- Queries W&B for recent runs linked to control plane jobs
- Computes per-agent success rates, latency distributions, and cost efficiency
- Writes a summary to the Observability Store
- (Optional) triggers a management plane alert if a metric crosses a threshold
The evaluation loop does not automatically update prompts or models. It surfaces signals; human operators act on them.
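The failure-rate aggregation at the heart of the loop can be sketched as a pure function (the record shape is a hypothetical join of `AgentEvent` and `ExecutionEvent` fields; the real loop would query W&B and the Observability Store):

```python
from collections import defaultdict

def per_agent_failure_rates(records):
    """Compute failure rate per (agent_role, environment).

    records: iterable of (agent_role, environment, status) tuples, where
    status is the ExecutionEvent terminal status (SUCCEEDED | FAILED).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for role, env, status in records:
        key = (role, env)
        totals[key] += 1
        if status == "FAILED":
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}
```

A threshold check over the returned dict is what would drive the optional management plane alert; the loop itself still only surfaces the signal.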
Agent decision history
The agent decision history is the immutable append-only log of all AgentEvent records.
It serves three purposes:
- Audit trail: for any job, you can replay the exact tool calls the agent made, in order, with inputs and outputs. This is essential for debugging unexpected agent behavior.
- Memory substrate (v2): the Observability Store becomes the episodic memory layer for the control plane. Future agent invocations can query prior agent decisions for context ("last time this notebook failed, the agent ran X to recover").
- Safety review: for user-plane workloads involving robot actuation, every Nav2 goal submission and every `publish_cmd_vel` MCP call is logged with a timestamp and session ID. Post-incident review can reconstruct the full causal chain.
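The audit-trail replay can be sketched as a query over stored `AgentEvent` records (the dict shape follows the schema earlier in this document; the function itself is illustrative, not a specified API):

```python
def replay_session(events, session_id):
    """Reconstruct the ordered tool-call trace for one agent session.

    Returns (timestamp, tool, input, output) tuples in chronological order.
    ISO 8601 timestamps in a single UTC format sort correctly as strings.
    """
    trace = []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["session_id"] != session_id:
            continue
        for call in ev["tool_calls"]:
            trace.append((ev["timestamp"], call["tool"], call["input"], call["output"]))
    return trace
```

Because the log is append-only, the same function serves audit, episodic memory lookups, and post-incident safety review without any extra bookkeeping.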
Tool scoping as management plane policy
In the current v1 design, `--allowedTools` strings are hardcoded in each agent Python wrapper:
```python
# v1: hardcoded in backend/agents/notebook_agent.py
ALLOWED_TOOLS = "Bash(kubectl *),Bash(ray *),Read"
```
This is correct for v1 but wrong architecturally: tool scoping is a governance decision, not an implementation detail. In v2, tool scoping policies are managed by the Policy Engine:
```python
# v2: fetched from Policy Engine at spawn time
scope = await policy_engine.get_tool_scope(agent_role="notebook", tenant_id=tenant_id)
# Returns: "Bash(kubectl *),Bash(ray *),Read" — or a more restrictive scope for untrusted tenants
```
This allows:
- Per-tenant tool restrictions (a trial tenant cannot run `kubectl delete`)
- Audit of scope changes (who changed what, when)
- Emergency scope lockdown without a code deploy
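A minimal sketch of how the Policy Engine might resolve per-tenant scopes. The tenant tiers, the denylist mechanism, and the read-only `Bash(kubectl get *)` replacement are all assumptions for illustration; only the base notebook scope string comes from this design.

```python
# Base scope per agent role (the notebook scope matches the v1 constant).
BASE_SCOPES = {
    "notebook": ["Bash(kubectl *)", "Bash(ray *)", "Read"],
}

# Hypothetical policy: trial tenants lose wildcard kubectl access.
TRIAL_DENYLIST = {"Bash(kubectl *)"}

def get_tool_scope(agent_role: str, tenant_tier: str) -> str:
    """Resolve the --allowedTools string for one agent spawn."""
    tools = list(BASE_SCOPES[agent_role])
    if tenant_tier == "trial":
        tools = [t for t in tools if t not in TRIAL_DENYLIST]
        tools.append("Bash(kubectl get *)")  # hypothetical read-only replacement
    return ",".join(tools)
```

Because the string is assembled at spawn time from stored policy rows, a scope lockdown is a data change, not a code deploy.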
Interfaces
Event ingestion (from control and user planes)
Both planes emit events to the management plane. In v1.5, via Redis Streams. In v2, via NATS:
| Stream / Subject | Emitter | Consumer |
|---|---|---|
| `mp.agent.events` | Control plane (after each agent invocation) | Observability Store, Billing |
| `mp.execution.events` | User plane (KubeRay job lifecycle) | Observability Store, Billing |
| `mp.policy.requests` | Control plane (pre-dispatch quota check) | Policy Engine |
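A transport-agnostic emitter can be sketched against the subjects in the table above. The `publish(subject, payload_bytes)` client interface is an assumption chosen so the same code works against a NATS connection or a thin Redis Streams shim; routing by the event's `plane` field follows the two schemas.

```python
import json

# Subject names from the ingestion table; identical as Redis stream keys in v1.5.
AGENT_SUBJECT = "mp.agent.events"
EXECUTION_SUBJECT = "mp.execution.events"

def publish_event(client, event: dict) -> str:
    """Serialize an event and route it to the right subject by plane.

    client: any object with publish(subject: str, payload: bytes).
    Returns the subject used, for logging/assertions.
    """
    subject = AGENT_SUBJECT if event["plane"] == "control" else EXECUTION_SUBJECT
    client.publish(subject, json.dumps(event).encode())
    return subject
```

Keeping serialization in one place means the v1.5 → v2 transport swap touches only the client object, not every emitter call site.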
Policy query (from control plane)
```
GET /api/v1/policy/tool-scope?agent_role=notebook&tenant_id=<uuid>
→ {"allowed_tools": "Bash(kubectl *),Bash(ray *),Read", "max_duration_s": 3600}

GET /api/v1/policy/quota?tenant_id=<uuid>
→ {"gpu_hours_remaining": 42.5, "token_budget_remaining": 500000, "jobs_in_flight": 2}
```
Implementation sequence (v2)
The management plane does not exist in v1. The recommended build sequence:
1. Billing Service — instrument `run_agent()` to emit token counts; store in Postgres
2. Tenant Registry — users + namespaces; gate job submission by tenant
3. Policy Engine (quotas + tool scoping) — replace hardcoded `ALLOWED_TOOLS`
4. Observability Store — structured `AgentEvent` log; expose in dashboard
5. Evaluation Loop — scheduled aggregation job, W&B integration
Evolution Path
v1 — No management plane; tool scoping hardcoded; no cost tracking; no tenancy
v2.0 — Billing Service + Tenant Registry; basic quota enforcement
v2.1 — Policy Engine: dynamic tool scoping; tool scope audit log
v2.2 — Observability Store: AgentEvent log; session replay
v2.3 — Evaluation Loop: W&B → performance signals; operator dashboard
Requirements (MP-xxx)
Traces to system-level requirements in architecture/four-plane.md.
| ID | Requirement | Traces to | Version |
|---|---|---|---|
| MP-001 | The management plane shall govern tenancy, billing, quotas, and user management | SYS-001 | v2 |
| MP-002 | The management plane shall operate at minute-scale latency and be strongly consistent | SYS-001 | v2 |
| MP-003 | Management plane failure shall prevent new deployments; running agents remain unaffected | SYS-002 | v2 |
| MP-004 | The management plane shall provide tenant isolation: namespaces, RBAC, per-tenant quotas | — | v2 |
| MP-005 | The management plane shall track costs and tokens per-job, per-user, per-tenant | — | v2 |
| MP-006 | The management plane shall provide unified observability spanning control and user planes | — | v2.2 |
| MP-007 | The management plane shall maintain immutable audit log of all agent reasoning and tool calls | — | v2 |
| MP-008 | The management plane shall enforce policy: tool scoping, rate limits, actuation gates | CP-018 | v2.1 |
| MP-009 | The management plane shall provide evaluation loops: experiment data → improvement signals | — | v2.3 |
| MP-010 | The management plane shall manage model lifecycle: version tracking, A/B testing | — | v2.3 |
| MP-011 | Payment processing shall be delegated to third-party (Stripe or equivalent) | — | v2 |
| MP-012 | The management plane shall NOT make real-time routing decisions; advisory only (except hard quotas) | SYS-002 | v2 |
| MP-013 | The management plane shall NOT execute user plane jobs | SYS-001 | v2 |
| MP-014 | The management plane shall NOT be in the hot path of control ↔ user plane communication | SYS-002 | v2 |
| MP-015 | Tenant Registry shall store organizations, users, and K8s namespace mappings | MP-004 | v2 |
| MP-016 | Per-tenant config: isolated namespace, allowed environments, quota configuration | MP-004 | v2 |
| MP-017 | Policy Engine shall enforce governance at pre-dispatch (soft) and tool scoping (hard) | MP-008 | v2.1 |
| MP-018 | Policy Engine query: GET /api/v1/policy/tool-scope?agent_role=<role>&tenant_id=<uuid> | MP-017 | v2.1 |
| MP-019 | Quota query: GET /api/v1/policy/quota?tenant_id=<uuid> → gpu_hours, token_budget, jobs_in_flight | MP-005 | v2 |
| MP-020 | Tool scope shall be fetched dynamically before spawning each subprocess | MP-017, CP-018 | v2.1 |
| MP-021 | --allowedTools strings shall be managed dynamically by Policy Engine (not hardcoded) | MP-017 | v2.1 |
| MP-022 | Trial tenants shall be restricted from destructive commands (e.g., kubectl delete) | MP-004, MP-017 | v2.1 |
| MP-023 | Tool scope changes shall be audited: who changed what, when | MP-007 | v2.1 |
| MP-024 | Emergency scope lockdown shall be possible without code deploy | MP-017 | v2.1 |
| MP-025 | v1 constraints: tool scoping hardcoded, no cost tracking, no tenancy | — | v1 |
| MP-026 | v2 constraints: tool scoping from Policy Engine, quotas via KubeRay ResourceQuota, tokens via Billing | MP-017, MP-005 | v2 |
| MP-027 | Observability Store shall unify agent reasoning and execution traces | MP-006 | v2.2 |
| MP-028 | AgentEvent schema: event_id, timestamp, agent_role, job_id, session_id, tool_calls, tokens, exit_code, tenant_id | CP-019 | v2.2 |
| MP-029 | ExecutionEvent schema: event_id, timestamp, job_id, environment, ray_job_id, status, gpu_seconds, tenant_id | — | v2.2 |
| MP-030 | Billing shall aggregate tokens and gpu_seconds per tenant per billing period | MP-005 | v2 |
| MP-031 | Cost formula: token_cost = tokens × model_rate; gpu_cost = gpu_seconds × gpu_rate | MP-030 | v2 |
| MP-032 | Costs shall be attributed to the tenant who submitted the job | MP-030 | v2 |
| MP-033 | Billing Service shall NOT be implemented in v1 | MP-025 | v1 |
| MP-034 | Evaluation Loop: ExecutionEvent → W&B → regression signal → metrics → operator action | MP-009 | v2.3 |
| MP-035 | Evaluation Loop shall NOT automatically update prompts/models; surfaces signals for human action | MP-009 | v2.3 |
| MP-036 | Agent decision history shall be immutable append-only AgentEvent log | MP-007 | v2 |
| MP-037 | Decision history serves: audit trail, episodic memory for agents, safety review for actuation | MP-036 | v2 |
| MP-038 | v2 implementation sequence: Billing → Tenant Registry → Policy Engine → Observability → Evaluation Loop | — | v2 |
| MP-039 | In v1.5, AgentEvent shall be emitted to Observability Store via Redis Streams | CP-019 | v1.5 |
| MP-040 | In v2, events shall flow via NATS subjects: mp.agent.events, mp.execution.events | — | v2 |