
AIOps Control Plane — Design Document

Date: 2026-02-23 Updated: 2026-03-02 Status: Approved (v2)


Problem

The manage_notebooks.py workflow in eaia executes notebooks locally via Docker Compose. As the notebook fleet grows across environments (torch.dev.gpu, ros.dev.gpu), execution must be distributed across a multi-host fleet. This capability is general enough to warrant a dedicated platform rather than an in-place extension.

More broadly: the aegean-ai infrastructure needs a control plane that oversees a user plane where agentic workloads — VLA agents, multi-agent systems, real-time robot control — execute on KubeRay. The control plane is not just job scheduling; it is the observability, memory, and safety substrate for that agentic runtime.


Goals

  • Remote notebook execution dispatched to KubeRay on Proxmox K8s
  • Experiment tracking via W&B, linked to jobs
  • Executed notebooks copied back to eaia for MDX regeneration
  • Agentic control plane powered by Claude Code (subscription, not API)
  • Web UI for job monitoring, cluster health, W&B run browsing
  • Platform scope: also controls aegean-ai/lakehouse

Non-goals (v1)

  • Pydantic AI integration (deferred to v2)
  • Public access / multi-tenant
  • Cost accounting per job (management plane — deferred to v2)

Four-Plane Architecture

The system follows a user / control / data / management plane separation — a pattern from SDN and telecom applied to agentic infrastructure. The planes have fundamentally different latency, consistency, and availability requirements.

| Plane | What runs here | Latency / consistency | Failure consequence |
|---|---|---|---|
| User plane | Customer agents: VLA, Nav2, behavior trees, YOLOv8, SLAM | Real-time (ms), stateful per-session | Agent stops; robot halts |
| Control plane | This repo: job dispatch, cluster management, experiment tracking, agent lifecycle governance | Seconds, eventually consistent | Degraded visibility; user plane continues |
| Data plane | Lakehouse (DuckDB + DuckLake + MinIO), embeddings, event log | Seconds, eventually consistent | Queries fail; agents lose context |
| Management plane | Billing, tenancy, quotas, user management | Minutes, strongly consistent | No new deployments; running agents unaffected |

The plane separation is a first principle: user plane failures must not cascade to the control plane, and control plane outages must not halt running agents.

Reference application: turtlebot-maze

The turtlebot-maze project (ros.dev.gpu) is the reference user-plane application. It demonstrates what the user plane looks like in operation:

  • Gazebo simulation + ROS 2 Jazzy + Nav2 running on KubeRay ros.dev.gpu workers
  • Behavior trees (py_trees / BehaviorTree.CPP) for autonomous navigation and object search
  • YOLOv8 object detection via PyTorch, decoupled from ROS via Zenoh transport
  • Visual SLAM (stella_vslam) for mapping and localization
  • Claude Code + /navigate slash command + ros-mcp-server (MCP over rosbridge WebSocket) — a Claude Code agent doing real-time robot control: the user plane in operation

The control plane does not control the robot in real-time. That is ros-mcp-server's job. The control plane manages the ros.dev.gpu RayCluster the robot runs on and the experiment lifecycle (job submission, W&B tracking, notebook copyback) around it.

End-to-end demo across all layers:

User → /navigate (Claude Code, user plane, real-time)
  ↕ ros-mcp-server / rosbridge :9090
ROS 2 Nav2 on ros.dev.gpu RayCluster (user plane compute)

eaia → POST /api/v1/jobs (control plane)
  → NotebookAgent → KubeRay ros.dev.gpu job (papermill run)
  → W&B run linked to job (experiment tracking)
  → Copyback webhook → eaia MDX regeneration

Management plane (v2): GPU hours logged, per-user quota decremented, billing event emitted

Control Plane Architecture

System context (C4 Level 1)

Agent layer component diagram

Agent roles

| Agent | --allowedTools | Responsibility |
|---|---|---|
| NotebookAgent | Bash(kubectl *), Bash(ray *), Read | Submit/poll Ray jobs (notebook + training), trigger copyback |
| ClusterAgent | Bash(kubectl *), Bash(helm *) | KubeRay health, node scaling |
| WandBAgent | Bash(wandb *), WebFetch | Query runs, surface regressions |
| LakehouseAgent | Bash(duckdb *), Bash(python *), Read, Edit | Catalog ops on lakehouse |
| TwinAgent | Bash(duckdb *), Bash(python *), Read | Twin lifecycle: create, sync, query, predict, retire |
| PolicyAgent (v1.5) | Bash(ray *), Bash(kubectl *), Read | Deploy/update Policy Server (Ray Serve), model hot-swap |

Notebook execution flow

Training job execution flow

The control plane dispatches model training jobs (SFT, DPO, GRPO) to torch.dev.gpu via Ray Jobs — the local equivalent of HF Jobs. This replaces HF cloud compute with the Auraison KubeRay cluster while using HF Hub as the artifact registry for model checkpoints.

Conceptual mapping:

| HF Jobs cloud | Auraison local (Ray on KubeRay) |
|---|---|
| hf jobs uv run CLI | POST /api/v1/jobs → NotebookAgent → ray job submit |
| Managed cloud container | torch.dev.gpu Ray worker image (PyTorch + Unsloth + TRL) |
| Cloud GPU flavor (A10G, A100) | torch.dev.gpu RayCluster GPU resources |
| HF secrets | Kubernetes secrets / env vars (HF_TOKEN, WANDB_API_KEY) |
| Auto-logging | W&B integration via WandBAgent |
| trainer.push_to_hub() | Same — Ray worker pushes checkpoints to HF Hub directly |

Architecture principle: HF Hub remains the model artifact registry (no vendor lock-in). The compute moves from HF cloud to our KubeRay cluster. Training scripts (Unsloth, TRL) run unchanged — only the submission method changes (ray job submit instead of hf jobs uv run).
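The submission-side mapping can be sketched as a small helper that builds the `ray job submit` invocation; the head-node address and entrypoint below are illustrative assumptions, not the actual cluster config.

```python
import json

def ray_submit_argv(entrypoint: str, working_dir: str, env_vars: dict[str, str],
                    address: str = "http://torch-dev-gpu-head:8265") -> list[str]:
    """Build the `ray job submit` argv that stands in for `hf jobs uv run`.

    Secrets (HF_TOKEN, WANDB_API_KEY) travel as runtime-env vars; the
    address is a placeholder for the torch.dev.gpu head node.
    """
    runtime_env = {"working_dir": working_dir, "env_vars": env_vars}
    return ["ray", "job", "submit",
            "--address", address,
            "--runtime-env-json", json.dumps(runtime_env),
            "--", *entrypoint.split()]
```

The training script itself is untouched; only this argv replaces the HF Jobs CLI call.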

Training job types supported:

| Method | Library | Use case |
|---|---|---|
| SFT | TRL SFTTrainer | Supervised fine-tuning (instruction tuning, domain adaptation) |
| DPO | TRL DPOTrainer | Direct preference optimisation (alignment) |
| GRPO | TRL GRPOTrainer | Group relative policy optimisation (reasoning) |
| Unsloth | FastLanguageModel | 2x faster, 60% less VRAM; wraps SFT/DPO for efficient QLoRA |

Cosmos model post-training (v1.5): Cosmos-Predict2 and Cosmos-Reason2 are post-trained on domain-specific data (turtlebot-maze ROS bags, AR4 manipulation datasets) using the same Ray job dispatch pattern. The training script calls push_to_hub() to publish the fine-tuned checkpoint to HF Hub; the Policy Server on torch.dev.gpu pulls the latest checkpoint on next deployment.

Policy Server deployment (v1.5)

The Policy Server is a Ray Serve endpoint on torch.dev.gpu that serves VLA inference. It is deployed and managed by a new PolicyAgent in the control plane.

PolicyAgent responsibilities:
- Deploy/update Ray Serve endpoint with specified model (OpenVLA → GR00T)
- Pull model checkpoint from HF Hub
- Monitor inference latency and GPU utilisation
- Scale replicas based on request rate
- Swap model backend without restarting ROS 2 stack (hot-swap)

Tool scope: Bash(ray *),Bash(kubectl *),Read

The Policy Server abstraction is the key anti-lock-in decision: NVIDIA Cosmos models are plugins, not infrastructure. The user-plane ROS 2 stack communicates with the Policy Server via Zenoh bridge — it never knows which VLA model is running behind the endpoint.

eaia integration

# manage_notebooks.py cmd_execute — one new flag
if args.remote:
    resp = requests.post(
        f"{CONTROL_PLANE_URL}/api/v1/jobs",
        json={"source": source_rel, "environment": environment},
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Job submitted: {resp.json()['job_id']}")
    return 0

Multi-Agent Rationale

Why one agent per domain, not one general agent

A single general-purpose agent for all four domains (notebook, cluster, W&B, lakehouse) would require a superset of all tools — kubectl, ray, helm, wandb, duckdb, plus Read/Edit. This violates least-privilege and makes prompt engineering harder: a single system prompt cannot be well-optimised for four distinct operational contexts simultaneously.

One agent per domain gives:

  1. Optimised system prompts — each agent's prompt is tuned for its domain vocabulary and failure modes (Ray job errors vs. DuckDB query failures are fundamentally different)
  2. Tool scoping as a security boundary — --allowedTools Bash(kubectl *) means a NotebookAgent subprocess cannot run duckdb queries even if prompted to; the shell restriction is enforced by Claude Code, not by convention
  3. Blast-radius containment — a runaway NotebookAgent cannot mutate lakehouse models; a LakehouseAgent cannot scale down a running KubeRay cluster
  4. Independent failure modes — a WandBAgent timeout does not block cluster health checks

Session resume and stateful polling

claude -p --resume <session_id> allows a subsequent subprocess invocation to continue within the same conversation context. This is used for job status polling: the first call to submit_notebook_job returns a session_id; subsequent calls to poll_job_status pass that ID, giving the agent continuity (logs, prior tool outputs) without restarting cold.

This partially compensates for the synchronous subprocess model for long-running jobs.


v1 Hybrid Compromise: Conflated Plane Boundary

In v1, the claude -p subprocess agents conflate two architectural concerns:

  • Control plane cognition: reasoning about what to do ("is this cluster healthy?", "which Ray cluster should this job go to?", "has the job succeeded?")
  • User plane execution: issuing the commands that actually change state (kubectl, ray job submit, helm upgrade)

A NotebookAgent subprocess both decides how to submit a job (control plane) and runs ray job submit (user plane execution). This is deliberate for v1: it eliminates the need for a separate user-plane executor service and leverages the claude -p subprocess boundary as a physical isolation point — the FastAPI process never directly touches kubectl or ray.

The control plane does not execute directly; the subprocess executes on its behalf. In v1, this is the best available approximation of plane separation given the synchronous claude -p subprocess model.

In v2, the separation becomes explicit:

v1:  FastAPI → claude -p subprocess → kubectl / ray job submit          (hybrid: reason + execute)
v2:  FastAPI → claude -p subprocess → emit JobSpec to NATS subject
                                        → user-plane executor (no LLM) → kubectl / ray job submit

The v2 control plane agents become pure intent emitters: they reason, plan, and emit structured JobSpec messages. A stateless user-plane executor handles all CLI invocations. This is the canonical SDN separation: the control plane produces forwarding rules; the data (user) plane applies them.
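A sketch of what the emitted intent could look like; the field names and the subject scheme are assumptions, not a settled contract.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class JobSpec:
    environment: str      # e.g. "torch.dev.gpu" or "ros.dev.gpu"
    entrypoint: str       # command the LLM-free executor will hand to Ray
    intent_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def nats_subject(spec: JobSpec) -> str:
    """One subject per environment, e.g. jobs.torch_dev_gpu.submit."""
    return f"jobs.{spec.environment.replace('.', '_')}.submit"

def encode(spec: JobSpec) -> bytes:
    """Payload for the NATS publish, e.g. nc.publish(nats_subject(spec), encode(spec))."""
    return json.dumps(asdict(spec)).encode()
```

The executor subscribes to `jobs.*.submit`, decodes the JobSpec, and runs the CLI; the agent never touches kubectl or ray.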


First Principles

Six first principles drove the v1 design choices:

0. Control plane does not execute. The control plane reasons about what should happen and emits intent. It does not directly mutate infrastructure. In v1, this principle is approximated: the claude -p subprocess boundary is the execution boundary. In v2, a dedicated user-plane executor enforces this cleanly.

1. Subscription compute arbitrage. claude -p reuses the Claude Code subscription — zero per-token marginal cost for internal tooling. At notebook fleet scale, per-token API billing would be prohibitive. The agents are cost-free until the subscription limit is reached.

2. Infrastructure-as-conversation. Each agent uses the same CLIs a human operator would: kubectl, ray, helm, wandb, duckdb. No new SDK surface to maintain, no client library versioning, no custom auth layer. The agent reads CLI output exactly as a human would read a terminal. When the CLI changes, the agent adapts without code changes.

3. Blast-radius containment via tool scoping. Bash(kubectl *) is equivalent to a least-privilege service account. The tool restriction is enforced at the Claude Code subprocess boundary — it is not a convention or a prompt instruction, but a hard constraint. Each agent can only reach the infrastructure it needs.

4. Subprocess isolation boundary. Each agent invocation is a fresh process with captured stdout/stderr. Failures are isolated: a hanging agent does not block the FastAPI event loop; a crashed agent returns a non-zero exit code caught by run_agent(); retries are clean. The API layer remains stateless with respect to agent execution.
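The isolation boundary can be sketched as a thin wrapper; the actual run_agent() signature is not shown in this document, so this shape is an assumption.

```python
import subprocess

def run_agent(cmd: list[str], timeout_s: int = 3600) -> tuple[int, str, str]:
    """Run one agent invocation as a fresh process.

    A hang becomes a bounded timeout; a crash becomes a non-zero exit
    code; stdout/stderr are captured, never shared with the API process.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 124, "", "agent timed out"   # 124 mirrors coreutils `timeout`
    return proc.returncode, proc.stdout, proc.stderr
```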

5. Framework-agnostic escape hatch. The claude -p subprocess interface is framework-agnostic. Pydantic AI can be layered in v2 to formalise agent input/output schemas without rewriting agent logic. The subprocess boundary is the seam — what is above it (agent prompts, tool definitions) and below it (API routing, job store) can evolve independently.


Event-Driven vs Request/Response

Most agent systems in production are event-driven. This design uses synchronous request/response (API call → subprocess → return JSON). The tension is deliberate and bounded.

What event sourcing would give

  • Audit trail and replayability — an immutable event log of every job state transition. The current in-memory (and future Postgres) job store loses causal history.
  • Decoupled producers and consumers — currently, the API router directly invokes the agent subprocess (tight coupling). An event bus would let multiple consumers react to job.created without the router knowing about them.
  • Natural fit for long-running jobs — notebook execution takes minutes to hours. The current polling model (poll_job_status + --resume) is a workaround for the synchronous subprocess limitation.

Why request/response is justified for v1

  • No message broker to operate — reduces operational surface for a v1 platform
  • claude -p is inherently synchronous; async event consumption would require persistent agent processes or a worker pool, both of which require operational investment
  • Session resume (--resume) partially compensates: long-running jobs split across multiple subprocess invocations with shared session context

The impedance mismatch with the user plane

The user plane is already event-driven: DDS topics, Zenoh pub/sub, ROS action servers. The control plane wraps this in a request/response job API. This mismatch must be acknowledged: the control plane is an event consumer (KubeRay job lifecycle events, W&B run events) that presents a request/response interface upward. Redis, already in the architecture, is the natural bridge — Redis Streams can serve as the event log before a full Kafka adoption.
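A sketch of the bridge, kept dependency-free; the stream name and field layout are assumptions. The commented lines show the redis-py calls that would sit on either side.

```python
import time

STREAM = "jobs:events"   # assumed stream name

def job_event(job_id: str, state: str) -> dict[str, str]:
    """Redis stream entries are flat string maps, so everything is stringified."""
    return {"job_id": job_id, "state": state, "ts": str(time.time_ns())}

# Producer (API router), with redis-py:
#   r = redis.Redis()
#   r.xadd(STREAM, job_event(job_id, "job.created"))
# Consumer (worker pool), blocking read from the last seen entry id:
#   entries = r.xread({STREAM: last_id}, block=5000)
```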

Migration path to event sourcing (v2)

Current: API router → claude -p subprocess (synchronous)
v1.5:    API router → Redis Streams (job events) → worker pool polling streams → claude -p
v2:      NATS (control messages, ms latency) + Kafka (audit log, replay);
         Pydantic AI agents subscribing to domain topics

Alignment with Agentic Systems Architecture

The design patterns in DRAFT-agentic-systems-architecture.md map directly to this system:

Pattern alignment

| Pattern (from draft) | Manifestation in this system |
|---|---|
| Routing (§2.3) | FastAPI routers dispatch to specialist agents by domain. Currently synchronous; v2 moves to pub/sub topic-per-domain. |
| Orchestrator-Worker (§2.5) | Control plane is the orchestrator; KubeRay Ray workers are the workers. NATS request-reply is the target dispatch pattern. |
| Parallelisation (§2.4) | Multiple agents (NotebookAgent, ClusterAgent) can be invoked concurrently for independent operations. Currently serial; worker pool unlocks this. |
| Evaluator-Optimizer (§2.6) | WandBAgent surfaces regressions against prior runs — a lightweight evaluator loop for experiment quality. |

Memory management implications

Current agents have only short-term memory (session resume context). As the user plane matures, the control plane must become a persistent memory store for the agentic runtime:

  • Episodic: job history, cluster failure events, navigation trial outcomes
  • Semantic: learned cluster failure patterns, experiment regression signatures
  • Procedural: successful job submission sequences, DuckLake catalog dependency chains

The Job model schema must grow to capture agent reasoning traces and tool call logs — enabling replay, post-hoc analysis, and eventually value-guided memory attribution (§4.4 of the draft).

Streaming technology selection

| Use case | Technology | Rationale |
|---|---|---|
| Control plane → KubeRay agent dispatch | NATS | Sub-millisecond, request-reply, matches §2.5 pattern |
| Experiment data, notebook outputs | Kafka | Durable, replayable, high-throughput sensor-class data |
| User plane ↔ user plane (ROS) | DDS / Zenoh | Already in use; Zenoh decouples non-ROS containers |
| v1 bridge | Redis Streams | Already in architecture; approximates Kafka semantics without new infra |

Human oversight for user-plane safety

The user plane runs VLA agents in robotics contexts (ros.dev.gpu). The control plane must implement safety checkpoints aligned with §6.3 of the draft:

  • Maximum job duration before forced termination
  • Confidence threshold gates before irreversible actions (robot actuation)
  • Human-in-the-loop escalation path from job monitoring UI
  • Circuit breakers: cluster-wide pause when anomaly rate exceeds threshold
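The circuit-breaker checkpoint can be sketched as a sliding-window anomaly counter; the threshold, window size, and minimum sample are illustrative assumptions.

```python
class CircuitBreaker:
    """Cluster-wide pause when the recent anomaly rate exceeds a threshold."""

    def __init__(self, max_anomaly_rate: float = 0.2, window: int = 50):
        self.window, self.max_rate = window, max_anomaly_rate
        self.outcomes: list[bool] = []   # True = anomalous job/trial outcome

    def record(self, anomalous: bool) -> None:
        self.outcomes = (self.outcomes + [anomalous])[-self.window:]

    @property
    def open(self) -> bool:
        """When open, the control plane stops dispatching to the user plane."""
        if len(self.outcomes) < 10:      # need a minimum sample first
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.max_rate
```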

MCP trajectory

v1: claude -p subprocess (MCP implicit via Claude Code tools)
v2: Pydantic AI agents with explicit MCP tool schemas;
    the control plane exposes an MCP server so user-plane agents can query
    job status, cluster state, and experiment results via structured tool calls

AgentOps Subsystem

The control plane includes an agent operations subsystem (control-plane/backend/agentops/) that governs agent behaviour at runtime. This was previously considered as a separate architectural layer but was consolidated into the control plane — the industry standard (Forrester "Agent Control Plane" market category, Microsoft Agent 365, GitHub Enterprise AI Controls) places these functions within the control plane.

Why not a separate layer

The AgentOps functions have no independent deployment boundary — they are Python modules within the control plane backend. In v1, they are already implicit in run_agent() (subprocess timeout, --resume for state, --allowedTools for guardrails). Promoting implementation details to an architectural layer adds conceptual overhead without structural benefit.

Components

| Component | Responsibility | v1 mechanism | v1.5 implementation |
|---|---|---|---|
| Execution Scheduler | Concurrency limits, priority queue | Synchronous subprocess (queue depth 1) | Redis work queue + configurable worker pool |
| Agent Lifecycle Manager | Retry, timeout, dead-letter | Manual in run_agent() caller | Exponential backoff with jitter; max 3 retries |
| Backpressure Controller | Prevent user-plane overload | None (unbounded) | Per-environment Ray job concurrency cap |
| Runtime Policy Engine | Token budgets, tool call limits | max_turns param | Configurable per agent role; fetched from management plane in v2 |
| Guardrails Engine | Safety constraints per agent role | --allowedTools at spawn | Per-role constraint checks; v2 adds per-tool-call evaluation |
| Trace Collector | Structured agent event logging | stdout/stderr capture | AgentEvent emitted to Redis Stream / Postgres |
| State Stabiliser | Checkpoint and resume across restarts | --resume session_id (in-memory) | Checkpoint to Postgres; resume from last tool call |

AgentEvent schema

Every claude -p subprocess invocation produces an AgentEvent on completion:

AgentEvent {
  event_id: UUID
  intent_id: UUID
  session_id: str
  agent_role: str
  job_id: UUID (opt)
  tool_calls: [{tool, input, output, duration_ms}]
  tokens_in: int
  tokens_out: int
  exit_code: int
  limit_hit: str | null
  retry_count: int
  duration_ms: int
  tenant_id: UUID
}

Runtime policies

| Policy | Default | Configurable by |
|---|---|---|
| max_tool_calls | 50 | Management plane Policy Engine |
| max_recursion_depth | 5 | Agent role config |
| token_budget_in | 100 000 | Management plane (per tenant) |
| token_budget_out | 10 000 | Management plane (per tenant) |
| max_duration_s | 3 600 | JobSpec / AgentIntent |

Failure modes prevented

  1. Recursive planning loops — token budget and max tool call limits terminate cleanly
  2. Tool call storms — backpressure controller caps concurrent Ray job submissions
  3. Distributed agent chaos — execution scheduler enforces concurrency limits per agent role
  4. Retry explosions — lifecycle manager enforces bounded retries with exponential backoff

World model (v2)

In v2, the AgentOps subsystem maintains a structured world model of the agent ecosystem:

WorldModel {
  agents: {<intent_id>: {role, status, tool_call_count, session_id}}
  user_plane: {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  execution_topology: [{parent_intent_id, child_intent_id, relationship}]
  causal_graph: [{intent_id, job_id, outcome}]
}

Enables deadlock detection, anomaly detection, causal replay, and predictive throttling.
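Deadlock detection over execution_topology reduces to cycle detection on the parent→child intent graph; a minimal DFS sketch:

```python
def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """DFS over parent→child intent edges; a cycle means intents waiting on each other."""
    graph: dict[str, list[str]] = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)
    WHITE, GREY, BLACK = 0, 1, 2     # unvisited / on current path / done
    color = {node: WHITE for edge in edges for node in edge}

    def visit(node: str) -> bool:
        color[node] = GREY
        for nxt in graph.get(node, []):
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True          # back-edge onto the current path: cycle
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in color)
```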


Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Plane boundary (v1) | claude -p subprocess as execution boundary | Approximates "control plane does not execute"; FastAPI never touches kubectl directly |
| Plane boundary (v2) | Intent emitter + user-plane executor | Clean SDN separation: control plane emits JobSpec; executor applies it |
| Agent compute | claude -p subprocess | Reuses Claude Code subscription; zero marginal cost |
| Agent isolation | One agent per domain | Tool scoping, blast-radius containment, optimised prompts |
| AgentOps placement | Control plane subsystem, not separate layer | No deployment boundary; industry standard places these functions in the control plane |
| Infra | Proxmox K8s + KubeRay | Ray on Docker Swarm is non-standard; K8s gives GPU scheduling |
| Experiment tracking | W&B | Established tooling; CLI-accessible for agent use |
| Model artifact registry | HF Hub | trainer.push_to_hub(); no vendor lock-in; same as HF Jobs workflow |
| Training compute | Ray Jobs on torch.dev.gpu | Local equivalent of HF Jobs; Unsloth/TRL scripts run unchanged |
| VLA inference | Policy Server (Ray Serve) on torch.dev.gpu | Model-agnostic endpoint; backends swappable (OpenVLA → GR00T → Cosmos) |
| v1 job store | In-memory → Postgres | Postgres planned; in-memory for scaffolding only |
| v1 dispatch | Synchronous subprocess | Simplest path; Redis Streams migration in v1.5 |
| Streaming (v2) | NATS + Kafka | NATS for control, Kafka for audit — per §3.1 of architecture draft |
| Agent formalisation | Pydantic AI deferred | Added in v2 to formalise schemas without rewriting agent logic |

Evolution Path

v1   — Synchronous subprocesses, Postgres job store; hybrid plane boundary (reason + execute)
       • Training jobs via Ray: Unsloth/TRL on torch.dev.gpu; HF Hub as model artifact registry
       • TwinAgent for digital twin lifecycle
v1.5 — AgentOps subsystem: execution scheduler, backpressure, trace collector; Redis Streams
       • Control plane emits AgentEvents to management plane Observability Store
       • PolicyAgent: deploy/manage Policy Server (Ray Serve) for VLA inference on torch.dev.gpu
       • Cosmos model post-training via Ray Jobs → push to HF Hub → hot-swap on Policy Server
v2   — NATS (control), Kafka (audit), Pydantic AI + MCP; world model
       • Explicit plane separation: control agents emit JobSpec; user-plane executor runs CLI
       • Management plane: billing, tenancy, dynamic tool scoping, evaluation loops

See also:

  • docs/plans/2026-02-23-auraison-user-plane-design.md — execution mesh design
  • docs/plans/2026-02-23-auraison-management-plane-design.md — governance and observability
  • docs/plans/2026-02-23-auraison-agentops-design.md — original AgentOps design (consolidated here)
  • docs/plans/2026-03-02-digital-twins-design.md — TwinAgent and twin schema
  • docs/plans/2026-03-02-ar4-digital-twin-design.md — AR4 twin, layered plane architecture

References