Auraison — Management Plane Design
Date: 2026-02-23 Status: Draft (v2 — not yet implemented)
Problem
As the user plane grows to run multiple concurrent agentic workloads across tenants, and the control plane orchestrates an increasing number of Claude Code agent subprocesses, three cross-cutting concerns emerge that neither plane can own cleanly:
- Governance: who is allowed to run what, at what cost, subject to which quotas
- Observability: a unified view of agent decisions, job outcomes, and resource consumption across both planes — not just logs, but a causal trace linking control plane intent to user plane execution
- Learning loops: aggregated performance signals that feed back into model selection, system prompt tuning, and tool scoping policy
The management plane is the governance and observability substrate for the entire system. It does not route jobs or execute workloads. It watches, enforces, and learns.
Goals
- Tenant isolation: namespaces, RBAC, per-tenant quotas
- Cost and token tracking: per-job, per-user, per-tenant
- Unified observability: traces and metrics spanning both planes
- Agent decision history: immutable audit log of all agent reasoning and tool calls
- Policy enforcement: tool scoping, rate limits, actuation gates
- Evaluation loops: experiment performance data → model/config improvement signals
- Model lifecycle management: which model version each agent uses, A/B testing
Non-goals (v2 scope — not yet implemented)
- Payment processing (third-party: Stripe or equivalent)
- Real-time control plane routing decisions — management plane is advisory, not blocking (except for hard quota enforcements)
- User plane execution — the management plane never submits jobs
Architecture
Position in the four-plane stack
The management plane consumes events from both planes. It does not sit in the hot path of request/response between the control and user planes.
Key components
Tenant Registry
Stores organizations, users, and K8s namespace mappings. Each tenant has:
- An isolated K8s namespace in the user plane
- A set of allowed environments (`torch.dev.gpu`, `ros.dev.gpu`, or both)
- A quota configuration (see Policy Engine)
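A minimal sketch of what a Tenant Registry record could look like, assuming the fields listed above; the class and field names here are illustrative, not a finalized schema:

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4

@dataclass(frozen=True)
class Tenant:
    """Illustrative Tenant Registry record (field names are assumptions)."""
    tenant_id: UUID
    org_name: str
    k8s_namespace: str  # isolated namespace in the user plane
    allowed_environments: frozenset = field(
        default_factory=lambda: frozenset({"torch.dev.gpu"})
    )

    def can_use(self, environment: str) -> bool:
        """Check whether this tenant may submit jobs to an environment."""
        return environment in self.allowed_environments

t = Tenant(uuid4(), "acme", "tenant-acme",
           frozenset({"torch.dev.gpu", "ros.dev.gpu"}))
```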
Policy Engine
Enforces governance at two points:
- Pre-dispatch (soft enforcement, control plane side): the control plane queries the policy engine before submitting a job. If the tenant is over quota, the job is rejected before a subprocess is spawned.
- Tool scoping (hard enforcement, agent side): the `--allowedTools` strings for each agent role are managed by the policy engine, not hardcoded in agent Python wrappers. In v2, the control plane fetches the current policy before spawning each `claude -p` subprocess.
| Policy type | v1 location | v2 location |
|---|---|---|
| Tool scoping (`Bash(kubectl *)`) | Hardcoded in agent wrappers | Policy Engine → fetched at spawn time |
| Max job duration | JobSpec field | Policy Engine → default per tenant role |
| GPU quota | Not enforced | Policy Engine → KubeRay ResourceQuota |
| Token budget | Not tracked | Billing Service → enforced by Policy Engine |
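The pre-dispatch check described above can be sketched as follows. This is a minimal illustration of the soft-enforcement logic, assuming a `Quota` shape matching the quota endpoint fields later in this document; the class names and thresholds are assumptions, not the Policy Engine's actual API:

```python
from dataclasses import dataclass

@dataclass
class Quota:
    """Per-tenant quota snapshot (illustrative field names)."""
    gpu_hours_remaining: float
    token_budget_remaining: int
    max_jobs_in_flight: int

class QuotaExceeded(Exception):
    """Raised to reject a job before a subprocess is spawned."""

def check_pre_dispatch(quota: Quota, jobs_in_flight: int,
                       est_gpu_hours: float) -> None:
    """Soft enforcement on the control plane side, before dispatch."""
    if jobs_in_flight >= quota.max_jobs_in_flight:
        raise QuotaExceeded("too many jobs in flight")
    if est_gpu_hours > quota.gpu_hours_remaining:
        raise QuotaExceeded("insufficient GPU hours remaining")
    if quota.token_budget_remaining <= 0:
        raise QuotaExceeded("token budget exhausted")
```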
Observability Store
A unified event store for agent reasoning and execution traces. Every claude -p subprocess
invocation emits an AgentEvent on completion. Every Ray job emits an ExecutionEvent.
AgentEvent {
event_id: UUID
timestamp: ISO 8601
plane: "control"
agent_role: "notebook" | "cluster" | "wandb" | "lakehouse"
job_id: UUID (optional — links to JobSpec)
session_id: str (claude -p session ID, enables conversation replay)
tool_calls: [{tool: str, input: dict, output: str, duration_ms: int}]
tokens_in: int
tokens_out: int
exit_code: int
tenant_id: UUID
}
ExecutionEvent {
event_id: UUID
timestamp: ISO 8601
plane: "user"
job_id: UUID
environment: "torch.dev.gpu" | "ros.dev.gpu"
ray_job_id: str
status: PENDING | RUNNING | SUCCEEDED | FAILED
gpu_seconds: float
tenant_id: UUID
}
These events are the primary input for billing, evaluation, and audit.
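A sketch of constructing an `AgentEvent` as a JSON-serializable dict, following the schema above. The helper function is an assumption for illustration; the transport (Redis Streams in v1.5, NATS in v2) is elided:

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

def make_agent_event(agent_role, tool_calls, tokens_in, tokens_out,
                     exit_code, tenant_id, session_id, job_id=None):
    """Build an AgentEvent dict matching the schema above (sketch)."""
    return {
        "event_id": str(uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "plane": "control",
        "agent_role": agent_role,
        "job_id": job_id,          # optional link to JobSpec
        "session_id": session_id,  # claude -p session ID
        "tool_calls": tool_calls,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "exit_code": exit_code,
        "tenant_id": tenant_id,
    }

event = make_agent_event("notebook", [], 1200, 300, 0, "tenant-1", "sess-1")
payload = json.dumps(event)  # ready to publish on the event stream
```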
Billing Service
Aggregates AgentEvent.tokens_in/out and ExecutionEvent.gpu_seconds per tenant per billing
period. In v2:
- Token cost = `(tokens_in + tokens_out) × model_rate` per agent invocation
- GPU cost = `gpu_seconds × gpu_rate` per Ray job
- Both are attributed to the tenant who submitted the job
In v1, cost tracking is absent. The Billing Service is the first management plane component to implement in v2.
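A minimal sketch of the per-tenant aggregation, assuming events are dicts shaped like the `AgentEvent` and `ExecutionEvent` schemas above. The rate constants are placeholder assumptions, not actual pricing:

```python
MODEL_RATE = 3e-6   # $ per token — placeholder, not a real rate
GPU_RATE = 0.0005   # $ per GPU-second — placeholder, not a real rate

def billing_totals(agent_events, execution_events):
    """Aggregate token and GPU cost per tenant for a billing period."""
    totals = {}
    for e in agent_events:
        cost = (e["tokens_in"] + e["tokens_out"]) * MODEL_RATE
        totals[e["tenant_id"]] = totals.get(e["tenant_id"], 0.0) + cost
    for e in execution_events:
        cost = e["gpu_seconds"] * GPU_RATE
        totals[e["tenant_id"]] = totals.get(e["tenant_id"], 0.0) + cost
    return totals
```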
Evaluation Loop
The evaluation loop closes the feedback cycle between execution outcomes and model/configuration choices:
ExecutionEvent (job succeeded/failed)
→ W&B run linked to job
→ WandBAgent surfaces regression signal
→ Evaluation Loop aggregates across runs
→ signal: "NotebookAgent on torch.dev.gpu has 23% failure rate for lakehouse jobs"
→ operator action: update NotebookAgent system prompt, adjust tool scope
In v2, the evaluation loop is a scheduled job (daily or per N completions) that:
- Queries W&B for recent runs linked to control plane jobs
- Computes per-agent success rates, latency distributions, and cost efficiency
- Writes a summary to the Observability Store
- (Optional) triggers a management plane alert if a metric crosses a threshold
The evaluation loop does not automatically update prompts or models. It surfaces signals; human operators act on them.
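The per-agent aggregation step can be sketched as follows. This assumes the scheduled job has already joined `ExecutionEvent` outcomes to agent roles and environments; the W&B query and alerting steps are elided:

```python
from collections import defaultdict

def failure_rates(events):
    """Compute per-(agent_role, environment) failure rates.

    events: iterable of (agent_role, environment, status) tuples,
    where status follows the ExecutionEvent enum (e.g. "FAILED").
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [failed, total]
    for role, env, status in events:
        key = (role, env)
        counts[key][1] += 1
        if status == "FAILED":
            counts[key][0] += 1
    return {key: failed / total for key, (failed, total) in counts.items()}
```

A signal like the example above ("23% failure rate") would come from reading these rates against a threshold.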
Agent decision history
The agent decision history is the immutable append-only log of all AgentEvent records.
It serves three purposes:
- Audit trail: for any job, you can replay the exact tool calls the agent made, in order, with inputs and outputs. This is essential for debugging unexpected agent behavior.
- Memory substrate (v2): the Observability Store becomes the episodic memory layer for the control plane. Future agent invocations can query prior agent decisions for context ("last time this notebook failed, the agent ran X to recover").
- Safety review: for user-plane workloads involving robot actuation, every Nav2 goal submission and every `publish_cmd_vel` MCP call is logged with a timestamp and session ID. Post-incident review can reconstruct the full causal chain.
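The audit-trail replay can be sketched as a simple ordered scan over stored `AgentEvent` records, assuming the Observability Store can be queried as a list of event dicts; the function name is illustrative:

```python
def replay_tool_calls(events, job_id):
    """Yield (tool, input, output) in timestamp order for one job.

    events: AgentEvent dicts as defined in the schema above.
    """
    for event in sorted(
        (e for e in events if e.get("job_id") == job_id),
        key=lambda e: e["timestamp"],
    ):
        for call in event["tool_calls"]:
            yield call["tool"], call["input"], call["output"]
```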
Tool scoping as management plane policy
In the current v1 design, --allowedTools strings are hardcoded in each agent Python wrapper:
# v1: hardcoded in backend/agents/notebook_agent.py
ALLOWED_TOOLS = "Bash(kubectl *),Bash(ray *),Read"
This is correct for v1 but wrong architecturally: tool scoping is a governance decision, not an implementation detail. In v2, tool scoping policies are managed by the Policy Engine:
# v2: fetched from Policy Engine at spawn time
scope = await policy_engine.get_tool_scope(agent_role="notebook", tenant_id=tenant_id)
# Returns: "Bash(kubectl *),Bash(ray *),Read" — or a more restrictive scope for untrusted tenants
This allows:
- Per-tenant tool restrictions (a trial tenant cannot run `kubectl delete`)
- Audit of scope changes (who changed what, when)
- Emergency scope lockdown without a code deploy
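An in-memory sketch of the scope lookup behind `get_tool_scope`, showing how a per-tenant restriction could narrow the default scope. The role table, the "trial" tier, and the restricted scope string are illustrative assumptions:

```python
# Default scopes per agent role (illustrative; mirrors the v1 strings above).
DEFAULT_SCOPES = {
    "notebook": "Bash(kubectl *),Bash(ray *),Read",
    "cluster": "Bash(kubectl *),Read",
}

# Narrower scopes for untrusted tenants (assumption: read-only kubectl).
TRIAL_SCOPES = {
    "notebook": "Bash(kubectl get *),Read",
}

def get_tool_scope(agent_role: str, tenant_tier: str) -> str:
    """Return the --allowedTools string for a role, narrowed by tenant tier."""
    if tenant_tier == "trial" and agent_role in TRIAL_SCOPES:
        return TRIAL_SCOPES[agent_role]
    return DEFAULT_SCOPES[agent_role]
```

Because the mapping lives in the Policy Engine rather than in code, tightening `TRIAL_SCOPES` is a data change, which is what makes emergency lockdown possible without a deploy.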
Interfaces
Event ingestion (from control and user planes)
Both planes emit events to the management plane. In v1.5, via Redis Streams. In v2, via NATS:
| Stream / Subject | Emitter | Consumer |
|---|---|---|
| `mp.agent.events` | Control plane (after each agent invocation) | Observability Store, Billing |
| `mp.execution.events` | User plane (KubeRay job lifecycle) | Observability Store, Billing |
| `mp.policy.requests` | Control plane (pre-dispatch quota check) | Policy Engine |
Policy query (from control plane)
GET /api/v1/policy/tool-scope?agent_role=notebook&tenant_id=<uuid>
→ {"allowed_tools": "Bash(kubectl *),Bash(ray *),Read", "max_duration_s": 3600}
GET /api/v1/policy/quota?tenant_id=<uuid>
→ {"gpu_hours_remaining": 42.5, "token_budget_remaining": 500000, "jobs_in_flight": 2}
Implementation sequence (v2)
The management plane does not exist in v1. The recommended build sequence:
- Billing Service — instrument `run_agent()` to emit token counts; store in Postgres
- Tenant Registry — users + namespaces; gate job submission by tenant
- Policy Engine (quotas + tool scoping) — replace hardcoded `ALLOWED_TOOLS`
- Observability Store — structured `AgentEvent` log; expose in dashboard
- Evaluation Loop — scheduled aggregation job, W&B integration
Evolution Path
v1 — No management plane; tool scoping hardcoded; no cost tracking; no tenancy
v2.0 — Billing Service + Tenant Registry; basic quota enforcement
v2.1 — Policy Engine: dynamic tool scoping; tool scope audit log
v2.2 — Observability Store: AgentEvent log; session replay
v2.3 — Evaluation Loop: W&B → performance signals; operator dashboard