AIOps Control Plane — Design Document
Date: 2026-02-23 · Updated: 2026-03-02 · Status: Approved (v2)
Problem
The manage_notebooks.py workflow in eaia executes notebooks locally via Docker Compose.
As the notebook fleet grows across environments (torch.dev.gpu, ros.dev.gpu), execution must
be distributed across a multi-host fleet. This capability is general enough to warrant a
dedicated platform rather than an in-place extension.
More broadly: the aegean-ai infrastructure needs a control plane that oversees a user plane where agentic workloads — VLA agents, multi-agent systems, real-time robot control — execute on KubeRay. The control plane is not just job scheduling; it is the observability, memory, and safety substrate for that agentic runtime.
Goals
- Remote notebook execution dispatched to KubeRay on Proxmox K8s
- Experiment tracking via W&B, linked to jobs
- Executed notebooks copied back to `eaia` for MDX regeneration
- Agentic control plane powered by Claude Code (subscription, not API)
- Web UI for job monitoring, cluster health, W&B run browsing
- Platform scope: also controls `aegean-ai/lakehouse`
Non-goals (v1)
- Pydantic AI integration (deferred to v2)
- Public access / multi-tenant
- Cost accounting per job (management plane — deferred to v2)
Four-Plane Architecture
The system follows a user / control / data / management plane separation — a pattern from SDN and telecom applied to agentic infrastructure. The planes have fundamentally different latency, consistency, and availability requirements.
| Plane | What runs here | Latency / consistency | Failure consequence |
|---|---|---|---|
| User plane | Customer agents: VLA, Nav2, behavior trees, YOLOv8, SLAM | Real-time (ms), stateful per-session | Agent stops; robot halts |
| Control plane | This repo: job dispatch, cluster management, experiment tracking, agent lifecycle governance | Seconds, eventually consistent | Degraded visibility; user plane continues |
| Data plane | Lakehouse (DuckDB + DuckLake + MinIO), embeddings, event log | Seconds, eventually consistent | Queries fail; agents lose context |
| Management plane | Billing, tenancy, quotas, user management | Minutes, strongly consistent | No new deployments; running agents unaffected |
The plane separation is a first principle: user plane failures must not cascade to the control plane, and control plane outages must not halt running agents.
Reference application: turtlebot-maze
The turtlebot-maze project (ros.dev.gpu) is the reference user-plane application. It
demonstrates what the user plane looks like in operation:
- Gazebo simulation + ROS 2 Jazzy + Nav2 running on KubeRay `ros.dev.gpu` workers
- Behavior trees (py_trees / BehaviorTree.CPP) for autonomous navigation and object search
- YOLOv8 object detection via PyTorch, decoupled from ROS via Zenoh transport
- Visual SLAM (stella_vslam) for mapping and localization
- Claude Code + `/navigate` slash command + ros-mcp-server (MCP over rosbridge WebSocket) — a Claude Code agent doing real-time robot control: the user plane in operation
The control plane does not control the robot in real-time. That is ros-mcp-server's job.
The control plane manages the ros.dev.gpu RayCluster the robot runs on and the experiment
lifecycle (job submission, W&B tracking, notebook copyback) around it.
End-to-end demo across all layers:
```
User → /navigate (Claude Code, user plane, real-time)
  ↕ ros-mcp-server / rosbridge :9090
ROS 2 Nav2 on ros.dev.gpu RayCluster (user plane compute)

eaia → POST /api/v1/jobs (control plane)
NotebookAgent → KubeRay ros.dev.gpu job (papermill run)
W&B run linked to job (experiment tracking)
Copyback webhook → eaia MDX regeneration

Management plane (v2): GPU hours logged, per-user quota decremented, billing event emitted
```
Control Plane Architecture
System context (C4 Level 1)
Agent layer component diagram
Agent roles
| Agent | --allowedTools | Responsibility |
|---|---|---|
| NotebookAgent | Bash(kubectl *),Bash(ray *),Read | Submit/poll Ray jobs (notebook + training), trigger copyback |
| ClusterAgent | Bash(kubectl *),Bash(helm *) | KubeRay health, node scaling |
| WandBAgent | Bash(wandb *),WebFetch | Query runs, surface regressions |
| LakehouseAgent | Bash(duckdb *),Bash(python *),Read,Edit | Catalog ops on lakehouse |
| TwinAgent | Bash(duckdb *),Bash(python *),Read | Twin lifecycle: create, sync, query, predict, retire |
| PolicyAgent (v1.5) | Bash(ray *),Bash(kubectl *),Read | Deploy/update Policy Server (Ray Serve), model hot-swap |
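The spawn pattern behind this table can be sketched as a thin subprocess wrapper that applies the tool scope at the process boundary. This is a sketch: `build_agent_cmd` and `run_agent` are illustrative names, and the exact flag spelling should be checked against the installed Claude Code CLI.

```python
import subprocess

# Per-role tool scopes mirroring the agent roles table; the --allowedTools
# list is the hard security boundary enforced by Claude Code at spawn time.
AGENT_TOOLS = {
    "NotebookAgent": "Bash(kubectl *),Bash(ray *),Read",
    "ClusterAgent": "Bash(kubectl *),Bash(helm *)",
    "WandBAgent": "Bash(wandb *),WebFetch",
    "LakehouseAgent": "Bash(duckdb *),Bash(python *),Read,Edit",
}

def build_agent_cmd(role: str, prompt: str) -> list[str]:
    # claude -p runs one headless turn scoped to the role's tools
    return ["claude", "-p", prompt, "--allowedTools", AGENT_TOOLS[role]]

def run_agent(role: str, prompt: str, timeout_s: int = 600) -> str:
    """Fresh process per invocation: a hanging agent cannot block the API."""
    proc = subprocess.run(
        build_agent_cmd(role, prompt),
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        # Crashed agent returns a non-zero exit code; retries are clean
        raise RuntimeError(f"{role} exited {proc.returncode}: {proc.stderr[:500]}")
    return proc.stdout
```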
Notebook execution flow
Training job execution flow
The control plane dispatches model training jobs (SFT, DPO, GRPO) to torch.dev.gpu via Ray
Jobs — the local equivalent of HF Jobs. This replaces HF cloud compute with the Auraison
KubeRay cluster while using HF Hub as the artifact registry for model checkpoints.
Conceptual mapping:
| HF Jobs cloud | Auraison local (Ray on KubeRay) |
|---|---|
| `hf jobs uv run` CLI | POST /api/v1/jobs → NotebookAgent → `ray job submit` |
| Managed cloud container | torch.dev.gpu Ray worker image (PyTorch + Unsloth + TRL) |
| Cloud GPU flavor (A10G, A100) | torch.dev.gpu RayCluster GPU resources |
| HF secrets | Kubernetes secrets / env vars (HF_TOKEN, WANDB_API_KEY) |
| Auto-logging | W&B integration via WandBAgent |
| `trainer.push_to_hub()` | Same — Ray worker pushes checkpoints to HF Hub directly |
Architecture principle: HF Hub remains the model artifact registry (no vendor lock-in).
The compute moves from HF cloud to our KubeRay cluster. Training scripts (Unsloth, TRL) run
unchanged — only the submission method changes (ray job submit instead of hf jobs uv run).
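The submission swap reduces to argv construction for the Ray CLI. A sketch: `ray job submit` with `--address` and `--runtime-env-json` are real Ray CLI flags, but the head-node URL and runtime-env contents below are placeholders for the torch.dev.gpu cluster.

```python
import json

def build_ray_submit_cmd(
    script: str,
    env_vars: dict[str, str],
    address: str = "http://torch-dev-gpu-head:8265",  # placeholder head node
) -> list[str]:
    """Build the `ray job submit` argv that replaces `hf jobs uv run`.

    The training script itself (Unsloth/TRL) is unchanged; only the
    submission front-end differs. env_vars carries HF_TOKEN / WANDB_API_KEY
    in place of HF-managed secrets.
    """
    runtime_env = {"env_vars": env_vars}  # worker image already ships torch/TRL
    return [
        "ray", "job", "submit",
        "--address", address,
        "--runtime-env-json", json.dumps(runtime_env),
        "--",
        "python", script,
    ]
```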
Training job types supported:
| Method | Library | Use case |
|---|---|---|
| SFT | TRL SFTTrainer | Supervised fine-tuning (instruction tuning, domain adaptation) |
| DPO | TRL DPOTrainer | Direct preference optimisation (alignment) |
| GRPO | TRL GRPOTrainer | Group relative policy optimisation (reasoning) |
| Unsloth | FastLanguageModel | 2x faster, 60% less VRAM; wraps SFT/DPO for efficient QLoRA |
Cosmos model post-training (v1.5): Cosmos-Predict2 and Cosmos-Reason2 are post-trained
on domain-specific data (turtlebot-maze ROS bags, AR4 manipulation datasets) using the same
Ray job dispatch pattern. The training script calls push_to_hub() to publish the fine-tuned
checkpoint to HF Hub; the Policy Server on torch.dev.gpu pulls the latest checkpoint on
next deployment.
Policy Server deployment (v1.5)
The Policy Server is a Ray Serve endpoint on torch.dev.gpu that serves VLA inference. It
is deployed and managed by a new PolicyAgent in the control plane.
PolicyAgent responsibilities:
- Deploy/update Ray Serve endpoint with specified model (OpenVLA → GR00T)
- Pull model checkpoint from HF Hub
- Monitor inference latency and GPU utilisation
- Scale replicas based on request rate
- Swap model backend without restarting ROS 2 stack (hot-swap)
Tool scope: Bash(ray *),Bash(kubectl *),Read
The Policy Server abstraction is the key anti-lock-in decision: NVIDIA Cosmos models are plugins, not infrastructure. The user-plane ROS 2 stack communicates with the Policy Server via Zenoh bridge — it never knows which VLA model is running behind the endpoint.
eaia integration
```python
# manage_notebooks.py cmd_execute — one new flag
if args.remote:
    resp = requests.post(
        f"{CONTROL_PLANE_URL}/api/v1/jobs",
        json={"source": source_rel, "environment": environment},
    )
    print(f"Job submitted: {resp.json()['job_id']}")
    return 0
```
Multi-Agent Rationale
Why one agent per domain, not one general agent
A single general-purpose agent for all four domains (notebook, cluster, W&B, lakehouse) would
require a superset of all tools — kubectl, ray, helm, wandb, duckdb, plus Read/Edit.
This violates least-privilege and makes prompt engineering harder: a single system prompt
cannot be well-optimised for four distinct operational contexts simultaneously.
One agent per domain gives:
- Optimised system prompts — each agent's prompt is tuned for its domain vocabulary and failure modes (Ray job errors vs. DuckDB query failures are fundamentally different)
- Tool scoping as a security boundary — `--allowedTools Bash(kubectl *)` means a NotebookAgent subprocess cannot run `duckdb` queries even if prompted to; the shell restriction is enforced by Claude Code, not by convention
- Blast-radius containment — a runaway NotebookAgent cannot mutate lakehouse models; a LakehouseAgent cannot scale down a running KubeRay cluster
- Independent failure modes — a WandBAgent timeout does not block cluster health checks
Session resume and stateful polling
claude -p --resume <session_id> allows a subsequent subprocess invocation to continue within
the same conversation context. This is used for job status polling: the first call to
submit_notebook_job returns a session_id; subsequent calls to poll_job_status pass that
ID, giving the agent continuity (logs, prior tool outputs) without restarting cold.
This partially compensates for the synchronous subprocess model for long-running jobs.
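The submit/poll split can be sketched as two subprocess invocations sharing one session. This sketch assumes the `claude` CLI's `--output-format json` output includes a `session_id` field (true in current Claude Code, but verify against the installed version); `submit_notebook_job` and `poll_job_status` mirror the tool names above.

```python
import json
import subprocess

def submit_cmd(prompt: str) -> list[str]:
    # --output-format json makes the session_id machine-readable
    return ["claude", "-p", prompt, "--output-format", "json"]

def resume_cmd(session_id: str, prompt: str) -> list[str]:
    # Subsequent polls continue the same conversation context
    return ["claude", "-p", prompt, "--resume", session_id]

def submit_notebook_job(prompt: str) -> str:
    """First invocation: capture the session_id for later polling."""
    out = subprocess.run(
        submit_cmd(prompt), capture_output=True, text=True, check=True
    ).stdout
    return json.loads(out)["session_id"]

def poll_job_status(session_id: str) -> str:
    """Resume the session so the agent keeps logs and prior tool outputs."""
    return subprocess.run(
        resume_cmd(session_id, "Report current Ray job status."),
        capture_output=True, text=True, check=True,
    ).stdout
```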
v1 Hybrid Compromise: Conflated Plane Boundary
In v1, the claude -p subprocess agents conflate two architectural concerns:
- Control plane cognition: reasoning about what to do ("is this cluster healthy?", "which Ray cluster should this job go to?", "has the job succeeded?")
- User plane execution: issuing the commands that actually change state (`kubectl`, `ray job submit`, `helm upgrade`)
A NotebookAgent subprocess both decides how to submit a job (control plane) and runs
ray job submit (user plane execution). This is deliberate for v1: it eliminates the need
for a separate user-plane executor service and leverages the claude -p subprocess boundary
as a physical isolation point — the FastAPI process never directly touches kubectl or ray.
The control plane does not execute directly; the subprocess executes on its behalf. In v1, this is the best available approximation of plane separation given the synchronous claude -p subprocess model.
In v2, the separation becomes explicit:
```
v1: FastAPI → claude -p subprocess → kubectl / ray job submit   (hybrid: reason + execute)
v2: FastAPI → claude -p subprocess → emit JobSpec to NATS subject
                                          ↓
                     user-plane executor (no LLM) → kubectl / ray job submit
```
The v2 control plane agents become pure intent emitters: they reason, plan, and emit
structured JobSpec messages. A stateless user-plane executor handles all CLI invocations.
This is the canonical SDN separation: the control plane produces forwarding rules; the data
(user) plane applies them.
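A hypothetical shape for the v2 intent emission: `JobSpec` fields and the `jobs.<env>` subject naming below are assumptions for illustration, not a settled schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass

@dataclass
class JobSpec:
    """The v2 control-plane output: pure intent, no execution."""
    intent_id: str
    environment: str   # e.g. "torch.dev.gpu" or "ros.dev.gpu"
    entrypoint: str    # e.g. "papermill nb.ipynb out.ipynb"
    gpu: int = 1

def emit_jobspec(environment: str, entrypoint: str, gpu: int = 1) -> tuple[str, bytes]:
    """Build the (subject, payload) pair an agent would publish to NATS.

    The user-plane executor (no LLM) subscribes to jobs.<env> and runs
    the CLI; the control plane never touches kubectl or ray directly.
    """
    spec = JobSpec(str(uuid.uuid4()), environment, entrypoint, gpu)
    subject = f"jobs.{environment.replace('.', '-')}"
    return subject, json.dumps(asdict(spec)).encode()
```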
First Principles
Six first principles drove the v1 design choices:
0. Control plane does not execute
The control plane reasons about what should happen and emits intent. It does not directly
mutate infrastructure. In v1, this principle is approximated: the claude -p subprocess
boundary is the execution boundary. In v2, a dedicated user-plane executor enforces this
cleanly.
1. Subscription compute arbitrage
claude -p reuses the Claude Code subscription — zero per-token marginal cost for internal
tooling. At notebook fleet scale, per-token API billing would be prohibitive. The agents are
cost-free until the subscription limit is reached.
2. Infrastructure-as-conversation
Each agent uses the same CLIs a human operator would: kubectl, ray, helm, wandb, duckdb.
No new SDK surface to maintain, no client library versioning, no custom auth layer. The agent
reads CLI output exactly as a human would read a terminal. When the CLI changes, the agent
adapts without code changes.
3. Blast-radius containment via tool scoping
Bash(kubectl *) is equivalent to a least-privilege service account. The tool restriction is
enforced at the Claude Code subprocess boundary — it is not a convention or a prompt
instruction, but a hard constraint. Each agent can only reach the infrastructure it needs.
4. Subprocess isolation boundary
Each agent invocation is a fresh process with captured stdout/stderr. Failures are isolated:
a hanging agent does not block the FastAPI event loop; a crashed agent returns a non-zero exit
code caught by run_agent(); retries are clean. The API layer remains stateless with respect
to agent execution.
5. Framework-agnostic escape hatch
The claude -p subprocess interface is framework-agnostic. Pydantic AI can be layered in v2
to formalise agent input/output schemas without rewriting agent logic. The subprocess boundary
is the seam — what is above it (agent prompts, tool definitions) and below it (API routing,
job store) can evolve independently.
Event-Driven vs Request/Response
Most agent systems in production are event-driven. This design uses synchronous
request/response (API call → subprocess → return JSON). The tension is deliberate and bounded.
What event sourcing would give
- Audit trail and replayability — an immutable event log of every job state transition. The current in-memory (and future Postgres) job store loses causal history.
- Decoupled producers and consumers — currently, the API router directly invokes the agent subprocess (tight coupling). An event bus would let multiple consumers react to `job.created` without the router knowing about them.
- Natural fit for long-running jobs — notebook execution takes minutes to hours. The current polling model (`poll_job_status` + `--resume`) is a workaround for the synchronous subprocess limitation.
Why request/response is justified for v1
- No message broker to operate — reduces operational surface for a v1 platform
- `claude -p` is inherently synchronous; async event consumption would require persistent agent processes or a worker pool, both of which require operational investment
- Session resume (`--resume`) partially compensates: long-running jobs split across multiple subprocess invocations with shared session context
The impedance mismatch with the user plane
The user plane is already event-driven: DDS topics, Zenoh pub/sub, ROS action servers. The control plane wraps this in a request/response job API. This mismatch must be acknowledged: the control plane is an event consumer (KubeRay job lifecycle events, W&B run events) that presents a request/response interface upward. Redis, already in the architecture, is the natural bridge — Redis Streams can serve as the event log before a full Kafka adoption.
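A sketch of the Redis Streams bridge. Stream entries are flat string-to-string maps, so events must be flattened before `XADD`; the `jobs:events` stream name is illustrative.

```python
import time

def job_event(job_id: str, event: str, **fields: str) -> dict[str, str]:
    """Flatten a job lifecycle event into the string map XADD requires."""
    payload = {"job_id": job_id, "event": event, "ts": str(time.time()), **fields}
    return {k: str(v) for k, v in payload.items()}

# With redis-py against the Redis already in the architecture:
#   r = redis.Redis()
#   r.xadd("jobs:events", job_event(job_id, "job.created",
#                                   environment="torch.dev.gpu"))
# Consumers replay history with XRANGE, or share work via consumer groups
# (XREADGROUP): an approximation of Kafka's log semantics without new infra.
```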
Migration path to event sourcing (v2)
```
Current: API router → claude -p subprocess (synchronous)
v1.5:    API router → Redis Streams (job events) → worker pool polling streams → claude -p
v2:      NATS (control messages, ms latency) + Kafka (audit log, replay)
         Pydantic AI agents subscribing to domain topics
```
Alignment with Agentic Systems Architecture
The design patterns in DRAFT-agentic-systems-architecture.md map directly to this system:
Pattern alignment
| Pattern (from draft) | Manifestation in this system |
|---|---|
| Routing (§2.3) | FastAPI routers dispatch to specialist agents by domain. Currently synchronous; v2 moves to pub/sub topic-per-domain. |
| Orchestrator-Worker (§2.5) | Control plane is the orchestrator; KubeRay Ray workers are the workers. NATS request-reply is the target dispatch pattern. |
| Parallelisation (§2.4) | Multiple agents (NotebookAgent, ClusterAgent) can be invoked concurrently for independent operations. Currently serial; worker pool unlocks this. |
| Evaluator-Optimizer (§2.6) | WandBAgent surfaces regressions against prior runs — a lightweight evaluator loop for experiment quality. |
Memory management implications
Current agents have only short-term memory (session resume context). As the user plane matures, the control plane must become a persistent memory store for the agentic runtime:
- Episodic: job history, cluster failure events, navigation trial outcomes
- Semantic: learned cluster failure patterns, experiment regression signatures
- Procedural: successful job submission sequences, DuckLake catalog dependency chains
The Job model schema must grow to capture agent reasoning traces and tool call logs — enabling
replay, post-hoc analysis, and eventually value-guided memory attribution (§4.4 of the draft).
Streaming technology selection
| Use case | Technology | Rationale |
|---|---|---|
| Control plane → KubeRay agent dispatch | NATS | Sub-millisecond, request-reply, matches §2.5 pattern |
| Experiment data, notebook outputs | Kafka | Durable, replayable, high-throughput sensor-class data |
| User plane ↔ user plane (ROS) | DDS / Zenoh | Already in use; Zenoh decouples non-ROS containers |
| v1 bridge | Redis Streams | Already in architecture; approximates Kafka semantics without new infra |
Human oversight for user-plane safety
The user plane runs VLA agents in robotics contexts (ros.dev.gpu). The control plane must
implement safety checkpoints aligned with §6.3 of the draft:
- Maximum job duration before forced termination
- Confidence threshold gates before irreversible actions (robot actuation)
- Human-in-the-loop escalation path from job monitoring UI
- Circuit breakers: cluster-wide pause when anomaly rate exceeds threshold
MCP trajectory
```
v1: claude -p subprocess (MCP implicit via Claude Code tools)
v2: Pydantic AI agents with explicit MCP tool schemas
    Control plane exposes MCP server: user plane agents query job status,
    cluster state, and experiment results via structured tool calls
```
AgentOps Subsystem
The control plane includes an agent operations subsystem (control-plane/backend/agentops/)
that governs agent behaviour at runtime. This was previously considered as a separate
architectural layer but was consolidated into the control plane — the industry standard
(Forrester "Agent Control Plane" market category, Microsoft Agent 365, GitHub Enterprise AI
Controls) places these functions within the control plane.
Why not a separate layer
The AgentOps functions have no independent deployment boundary — they are Python modules
within the control plane backend. In v1, they are already implicit in run_agent() (subprocess
timeout, --resume for state, --allowedTools for guardrails). Promoting implementation
details to an architectural layer adds conceptual overhead without structural benefit.
Components
| Component | Responsibility | v1 mechanism | v1.5 implementation |
|---|---|---|---|
| Execution Scheduler | Concurrency limits, priority queue | Synchronous subprocess (queue depth 1) | Redis work queue + configurable worker pool |
| Agent Lifecycle Manager | Retry, timeout, dead-letter | Manual in run_agent() caller | Exponential backoff with jitter; max 3 retries |
| Backpressure Controller | Prevent user-plane overload | None (unbounded) | Per-environment Ray job concurrency cap |
| Runtime Policy Engine | Token budgets, tool call limits | max_turns param | Configurable per agent role; fetched from management plane in v2 |
| Guardrails Engine | Safety constraints per agent role | --allowedTools at spawn | Per-role constraint checks; v2 adds per-tool-call evaluation |
| Trace Collector | Structured agent event logging | stdout/stderr capture | AgentEvent emitted to Redis Stream / Postgres |
| State Stabiliser | Checkpoint and resume across restarts | --resume session_id (in-memory) | Checkpoint to Postgres; resume from last tool call |
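The v1.5 lifecycle-manager retry policy (exponential backoff with full jitter, bounded retries) reduces to a few lines. A sketch; the injectable `sleep` parameter exists only to make the function testable.

```python
import random
import time

def with_retries(fn, max_retries: int = 3, base_s: float = 1.0, sleep=time.sleep):
    """Run fn with exponential backoff and full jitter.

    After max_retries failures the exception propagates (dead-letter path),
    so there is no retry explosion.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Full jitter: uniform in [0, base * 2^attempt)
            sleep(random.uniform(0, base_s * 2 ** attempt))
```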
AgentEvent schema
Every claude -p subprocess invocation produces an AgentEvent on completion:
```
AgentEvent {
  event_id: UUID
  intent_id: UUID
  session_id: str
  agent_role: str
  job_id: UUID (opt)
  tool_calls: [{tool, input, output, duration_ms}]
  tokens_in: int
  tokens_out: int
  exit_code: int
  limit_hit: str | null
  retry_count: int
  duration_ms: int
  tenant_id: UUID
}
```
Runtime policies
| Policy | Default | Configurable by |
|---|---|---|
| `max_tool_calls` | 50 | Management plane Policy Engine |
| `max_recursion_depth` | 5 | Agent role config |
| `token_budget_in` | 100 000 | Management plane (per tenant) |
| `token_budget_out` | 10 000 | Management plane (per tenant) |
| `max_duration_s` | 3 600 | JobSpec / AgentIntent |
Failure modes prevented
- Recursive planning loops — token budget and max tool call limits terminate cleanly
- Tool call storms — backpressure controller caps concurrent Ray job submissions
- Distributed agent chaos — execution scheduler enforces concurrency limits per agent role
- Retry explosions — lifecycle manager enforces bounded retries with exponential backoff
World model (v2)
In v2, the AgentOps subsystem maintains a structured world model of the agent ecosystem:
```
WorldModel {
  agents: {<intent_id>: {role, status, tool_call_count, session_id}}
  user_plane: {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  execution_topology: [{parent_intent_id, child_intent_id, relationship}]
  causal_graph: [{intent_id, job_id, outcome}]
}
```
Enables deadlock detection, anomaly detection, causal replay, and predictive throttling.
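Deadlock detection over `execution_topology` reduces to cycle detection on the parent-to-child intent graph; a sketch using depth-first search over the edge list shape shown above.

```python
def has_deadlock(topology: list[dict[str, str]]) -> bool:
    """Detect a cycle in the parent->child intent graph (deadlock signal)."""
    graph: dict[str, list[str]] = {}
    for edge in topology:
        graph.setdefault(edge["parent_intent_id"], []).append(
            edge["child_intent_id"]
        )

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / done
    colour: dict[str, int] = {}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for child in graph.get(node, []):
            c = colour.get(child, WHITE)
            # A grey child means we looped back onto the current DFS path
            if c == GREY or (c == WHITE and visit(child)):
                return True
        colour[node] = BLACK
        return False

    return any(visit(n) for n in graph if colour.get(n, WHITE) == WHITE)
```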
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Plane boundary (v1) | claude -p subprocess as execution boundary | Approximates "control plane does not execute"; FastAPI never touches kubectl directly |
| Plane boundary (v2) | Intent emitter + user-plane executor | Clean SDN separation: control plane emits JobSpec; executor applies it |
| Agent compute | claude -p subprocess | Reuses Claude Code subscription; zero marginal cost |
| Agent isolation | One agent per domain | Tool scoping, blast-radius containment, optimised prompts |
| AgentOps placement | Control plane subsystem, not separate layer | No deployment boundary; industry standard places these functions in the control plane |
| Infra | Proxmox K8s + KubeRay | Ray on Docker Swarm is non-standard; K8s gives GPU scheduling |
| Experiment tracking | W&B | Established tooling; CLI-accessible for agent use |
| Model artifact registry | HF Hub | trainer.push_to_hub(); no vendor lock-in; same as HF Jobs workflow |
| Training compute | Ray Jobs on torch.dev.gpu | Local equivalent of HF Jobs; Unsloth/TRL scripts run unchanged |
| VLA inference | Policy Server (Ray Serve) on torch.dev.gpu | Model-agnostic endpoint; backends swappable (OpenVLA → GR00T → Cosmos) |
| v1 job store | In-memory → Postgres | Postgres planned; in-memory for scaffolding only |
| v1 dispatch | Synchronous subprocess | Simplest path; Redis Streams migration in v1.5 |
| Streaming (v2) | NATS + Kafka | NATS for control, Kafka for audit — per §3.1 of architecture draft |
| Agent formalisation | Pydantic AI deferred | Added in v2 to formalise schemas without rewriting agent logic |
Evolution Path
v1 — Synchronous subprocesses, Postgres job store; hybrid plane boundary (reason + execute)
Training jobs via Ray: Unsloth/TRL on torch.dev.gpu; HF Hub as model artifact registry
TwinAgent for digital twin lifecycle
v1.5 — AgentOps subsystem: execution scheduler, backpressure, trace collector; Redis Streams
Control plane emits AgentEvents to management plane Observability Store
PolicyAgent: deploy/manage Policy Server (Ray Serve) for VLA inference on torch.dev.gpu
Cosmos model post-training via Ray Jobs → push to HF Hub → hot-swap on Policy Server
v2 — NATS (control), Kafka (audit), Pydantic AI + MCP; world model
Explicit plane separation: control agents emit JobSpec; user-plane executor runs CLI
Management plane: billing, tenancy, dynamic tool scoping, evaluation loops
See also:
- `docs/plans/2026-02-23-auraison-user-plane-design.md` — execution mesh design
- `docs/plans/2026-02-23-auraison-management-plane-design.md` — governance and observability
- `docs/plans/2026-02-23-auraison-agentops-design.md` — original AgentOps design (consolidated here)
- `docs/plans/2026-03-02-digital-twins-design.md` — TwinAgent and twin schema
- `docs/plans/2026-03-02-ar4-digital-twin-design.md` — AR4 twin, layered plane architecture
References
- HF Jobs: Run and manage Jobs — managed container training; the cloud pattern our Ray Jobs replicate locally
- Train AI models with Unsloth and HF Jobs — Unsloth + TRL on HF Jobs; 2x faster, 60% less VRAM; scripts run unchanged on our Ray cluster
- TRL Jobs Training — SFT, DPO, GRPO via TRL on HF Jobs infrastructure
- Cosmos-Reason2 on Jetson — edge deployment of Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM