AIOps Control Plane — Design Document
Date: 2026-02-23 · Updated: 2026-03-02 · Status: Approved (v2)
Problem
The manage_notebooks.py workflow in eaia executes notebooks locally via Docker Compose.
As the notebook fleet grows across environments (torch.dev.gpu, ros.dev.gpu), execution must
be distributed across a multi-host fleet. This capability is general enough to warrant a
dedicated platform rather than an in-place extension.
More broadly: the aegean-ai infrastructure needs a control plane that oversees a user plane where agentic workloads — VLA agents, multi-agent systems, real-time robot control — execute on KubeRay. The control plane is not just job scheduling; it is the observability, memory, and safety substrate for that agentic runtime.
Goals
- Remote notebook execution dispatched to KubeRay on Proxmox K8s
- Experiment tracking via W&B, linked to jobs
- Executed notebooks copied back to `eaia` for MDX regeneration
- Agentic control plane powered by Claude Code (subscription, not API)
- Web UI for job monitoring, cluster health, W&B run browsing
- Platform scope: also controls `aegean-ai/lakehouse`
Non-goals (v1)
- Pydantic AI integration (deferred to v2)
- Public access / multi-tenant
- Cost accounting per job (management plane — deferred to v2)
Four-Plane Architecture
The system follows a user / control / data / management plane separation — a pattern from SDN and telecom applied to agentic infrastructure. The planes have fundamentally different latency, consistency, and availability requirements.
| Plane | What runs here | Latency / consistency | Failure consequence |
|---|---|---|---|
| User plane | Customer agents: VLA, Nav2, behavior trees, YOLOv8, SLAM | Real-time (ms), stateful per-session | Agent stops; robot halts |
| Control plane | This repo: job dispatch, cluster management, experiment tracking, agent lifecycle governance | Seconds, eventually consistent | Degraded visibility; user plane continues |
| Data plane | Lakehouse (DuckDB + DuckLake + MinIO), embeddings, event log | Seconds, eventually consistent | Queries fail; agents lose context |
| Management plane | Billing, tenancy, quotas, user management | Minutes, strongly consistent | No new deployments; running agents unaffected |
The plane separation is a first principle: user plane failures must not cascade to the control plane, and control plane outages must not halt running agents.
Reference application: turtlebot-maze
The turtlebot-maze project (ros.dev.gpu) is the reference user-plane application. It
demonstrates what the user plane looks like in operation:
- Gazebo simulation + ROS 2 Jazzy + Nav2 running on KubeRay `ros.dev.gpu` workers
- Behavior trees (py_trees / BehaviorTree.CPP) for autonomous navigation and object search
- YOLOv8 object detection via PyTorch, decoupled from ROS via Zenoh transport
- Visual SLAM (stella_vslam) for mapping and localization
- Claude Code + `/navigate` slash command + ros-mcp-server (MCP over rosbridge WebSocket) — a Claude Code agent doing real-time robot control: the user plane in operation
The control plane does not control the robot in real-time. That is ros-mcp-server's job.
The control plane manages the ros.dev.gpu RayCluster the robot runs on and the experiment
lifecycle (job submission, W&B tracking, notebook copyback) around it.
End-to-end demo across all layers:
```
User → /navigate (Claude Code, user plane, real-time)
  ↕ ros-mcp-server / rosbridge :9090
ROS 2 Nav2 on ros.dev.gpu RayCluster (user plane compute)

eaia → POST /api/v1/jobs (control plane)
NotebookAgent → KubeRay ros.dev.gpu job (papermill run)
W&B run linked to job (experiment tracking)
Copyback webhook → eaia MDX regeneration

Management plane (v2): GPU hours logged, per-user quota decremented, billing event emitted
```
Control Plane Architecture
System context (C4 Level 1)
Agent layer component diagram
Agent roles
| Agent | --allowedTools | Responsibility |
|---|---|---|
| NotebookAgent | Bash(kubectl *),Bash(ray *),Read | Submit/poll Ray jobs (notebook + training), trigger copyback |
| ClusterAgent | Bash(kubectl *),Bash(helm *) | KubeRay health, node scaling |
| WandBAgent | Bash(wandb *),WebFetch | Query runs, surface regressions |
| LakehouseAgent | Bash(duckdb *),Bash(python *),Read,Edit | Catalog ops on lakehouse |
| TwinAgent | Bash(duckdb *),Bash(python *),Read | Twin lifecycle: create, sync, query, predict, retire |
| PolicyAgent (v1.5) | Bash(ray *),Bash(kubectl *),Read | Deploy/update Policy Server (Ray Serve), model hot-swap |
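The spawn pattern behind this table can be sketched as a thin subprocess wrapper that applies the tool scope at the process boundary. This is a sketch: `build_agent_cmd` and `run_agent` are illustrative names, and the exact flag spelling should be checked against the installed Claude Code CLI.

```python
import subprocess

# Per-role tool scopes mirroring the agent roles table; the --allowedTools
# list is the hard security boundary enforced by Claude Code at spawn time.
AGENT_TOOLS = {
    "NotebookAgent": "Bash(kubectl *),Bash(ray *),Read",
    "ClusterAgent": "Bash(kubectl *),Bash(helm *)",
    "WandBAgent": "Bash(wandb *),WebFetch",
    "LakehouseAgent": "Bash(duckdb *),Bash(python *),Read,Edit",
}

def build_agent_cmd(role: str, prompt: str) -> list[str]:
    # claude -p runs one headless turn scoped to the role's tools
    return ["claude", "-p", prompt, "--allowedTools", AGENT_TOOLS[role]]

def run_agent(role: str, prompt: str, timeout_s: int = 600) -> str:
    """Fresh process per invocation: a hanging agent cannot block the API."""
    proc = subprocess.run(
        build_agent_cmd(role, prompt),
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        # Crashed agent returns a non-zero exit code; retries are clean
        raise RuntimeError(f"{role} exited {proc.returncode}: {proc.stderr[:500]}")
    return proc.stdout
```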
Notebook execution flow
Training job execution flow
The control plane dispatches model training jobs (SFT, DPO, GRPO) to torch.dev.gpu via Ray
Jobs — the local equivalent of HF Jobs. This replaces HF cloud compute with the Auraison
KubeRay cluster while using HF Hub as the artifact registry for model checkpoints.
Conceptual mapping:
| HF Jobs cloud | Auraison local (Ray on KubeRay) |
|---|---|
| `hf jobs uv run` CLI | POST /api/v1/jobs → NotebookAgent → `ray job submit` |
| Managed cloud container | torch.dev.gpu Ray worker image (PyTorch + Unsloth + TRL) |
| Cloud GPU flavor (A10G, A100) | torch.dev.gpu RayCluster GPU resources |
| HF secrets | Kubernetes secrets / env vars (HF_TOKEN, WANDB_API_KEY) |
| Auto-logging | W&B integration via WandBAgent |
| `trainer.push_to_hub()` | Same — Ray worker pushes checkpoints to HF Hub directly |
Architecture principle: HF Hub remains the model artifact registry (no vendor lock-in).
The compute moves from HF cloud to our KubeRay cluster. Training scripts (Unsloth, TRL) run
unchanged — only the submission method changes (ray job submit instead of hf jobs uv run).
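The submission swap reduces to argv construction for the Ray CLI. A sketch: `ray job submit` with `--address` and `--runtime-env-json` are real Ray CLI flags, but the head-node URL and runtime-env contents below are placeholders for the torch.dev.gpu cluster.

```python
import json

def build_ray_submit_cmd(
    script: str,
    env_vars: dict[str, str],
    address: str = "http://torch-dev-gpu-head:8265",  # placeholder head node
) -> list[str]:
    """Build the `ray job submit` argv that replaces `hf jobs uv run`.

    The training script itself (Unsloth/TRL) is unchanged; only the
    submission front-end differs. env_vars carries HF_TOKEN / WANDB_API_KEY
    in place of HF-managed secrets.
    """
    runtime_env = {"env_vars": env_vars}  # worker image already ships torch/TRL
    return [
        "ray", "job", "submit",
        "--address", address,
        "--runtime-env-json", json.dumps(runtime_env),
        "--",
        "python", script,
    ]
```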
Training job types supported:
| Method | Library | Use case |
|---|---|---|
| SFT | TRL SFTTrainer | Supervised fine-tuning (instruction tuning, domain adaptation) |
| DPO | TRL DPOTrainer | Direct preference optimisation (alignment) |
| GRPO | TRL GRPOTrainer | Group relative policy optimisation (reasoning) |
| Unsloth | FastLanguageModel | 2x faster, 60% less VRAM; wraps SFT/DPO for efficient QLoRA |
Cosmos model post-training (v1.5): Cosmos-Predict2 and Cosmos-Reason2 are post-trained
on domain-specific data (turtlebot-maze ROS bags, AR4 manipulation datasets) using the same
Ray job dispatch pattern. The training script calls push_to_hub() to publish the fine-tuned
checkpoint to HF Hub; the Policy Server on torch.dev.gpu pulls the latest checkpoint on
next deployment.
Policy Server deployment (v1.5)
The Policy Server is a Ray Serve endpoint on torch.dev.gpu that serves VLA inference. It
is deployed and managed by a new PolicyAgent in the control plane.
PolicyAgent responsibilities:
- Deploy/update Ray Serve endpoint with specified model (OpenVLA → GR00T)
- Pull model checkpoint from HF Hub
- Monitor inference latency and GPU utilisation
- Scale replicas based on request rate
- Swap model backend without restarting ROS 2 stack (hot-swap)
Tool scope: Bash(ray *),Bash(kubectl *),Read
The Policy Server abstraction is the key anti-lock-in decision: NVIDIA Cosmos models are plugins, not infrastructure. The user-plane ROS 2 stack communicates with the Policy Server via Zenoh bridge — it never knows which VLA model is running behind the endpoint.
eaia integration
```python
# manage_notebooks.py cmd_execute — one new flag
if args.remote:
    resp = requests.post(
        f"{CONTROL_PLANE_URL}/api/v1/jobs",
        json={"source": source_rel, "environment": environment},
    )
    print(f"Job submitted: {resp.json()['job_id']}")
    return 0
```
Multi-Agent Rationale
Why one agent per domain, not one general agent
A single general-purpose agent for all four domains (notebook, cluster, W&B, lakehouse) would
require a superset of all tools — kubectl, ray, helm, wandb, duckdb, plus Read/Edit.
This violates least-privilege and makes prompt engineering harder: a single system prompt
cannot be well-optimised for four distinct operational contexts simultaneously.
One agent per domain gives:
- Optimised system prompts — each agent's prompt is tuned for its domain vocabulary and failure modes (Ray job errors vs. DuckDB query failures are fundamentally different)
- Tool scoping as a security boundary — `--allowedTools Bash(kubectl *)` means a NotebookAgent subprocess cannot run `duckdb` queries even if prompted to; the shell restriction is enforced by Claude Code, not by convention
- Blast-radius containment — a runaway NotebookAgent cannot mutate lakehouse models; a LakehouseAgent cannot scale down a running KubeRay cluster
- Independent failure modes — a WandBAgent timeout does not block cluster health checks
Session resume and stateful polling
claude -p --resume <session_id> allows a subsequent subprocess invocation to continue within
the same conversation context. This is used for job status polling: the first call to
submit_notebook_job returns a session_id; subsequent calls to poll_job_status pass that
ID, giving the agent continuity (logs, prior tool outputs) without restarting cold.
This partially compensates for the synchronous subprocess model for long-running jobs.
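The submit/poll split can be sketched as two subprocess invocations sharing one session. This sketch assumes the `claude` CLI's `--output-format json` output includes a `session_id` field (true in current Claude Code, but verify against the installed version); `submit_notebook_job` and `poll_job_status` mirror the tool names above.

```python
import json
import subprocess

def submit_cmd(prompt: str) -> list[str]:
    # --output-format json makes the session_id machine-readable
    return ["claude", "-p", prompt, "--output-format", "json"]

def resume_cmd(session_id: str, prompt: str) -> list[str]:
    # Subsequent polls continue the same conversation context
    return ["claude", "-p", prompt, "--resume", session_id]

def submit_notebook_job(prompt: str) -> str:
    """First invocation: capture the session_id for later polling."""
    out = subprocess.run(
        submit_cmd(prompt), capture_output=True, text=True, check=True
    ).stdout
    return json.loads(out)["session_id"]

def poll_job_status(session_id: str) -> str:
    """Resume the session so the agent keeps logs and prior tool outputs."""
    return subprocess.run(
        resume_cmd(session_id, "Report current Ray job status."),
        capture_output=True, text=True, check=True,
    ).stdout
```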
v1 Hybrid Compromise: Conflated Plane Boundary
In v1, the claude -p subprocess agents conflate two architectural concerns:
- Control plane cognition: reasoning about what to do ("is this cluster healthy?", "which Ray cluster should this job go to?", "has the job succeeded?")
- User plane execution: issuing the commands that actually change state (`kubectl`, `ray job submit`, `helm upgrade`)
A NotebookAgent subprocess both decides how to submit a job (control plane) and runs
ray job submit (user plane execution). This is deliberate for v1: it eliminates the need
for a separate user-plane executor service and leverages the claude -p subprocess boundary
as a physical isolation point — the FastAPI process never directly touches kubectl or ray.
The control plane does not execute directly; the subprocess executes on its behalf. In v1, this is the best available approximation of plane separation given the synchronous claude -p subprocess model.
In v2, the separation becomes explicit:
```
v1: FastAPI → claude -p subprocess → kubectl / ray job submit   (hybrid: reason + execute)
v2: FastAPI → claude -p subprocess → emit JobSpec to NATS subject
                                          ↓
                     user-plane executor (no LLM) → kubectl / ray job submit
```
The v2 control plane agents become pure intent emitters: they reason, plan, and emit
structured JobSpec messages. A stateless user-plane executor handles all CLI invocations.
This is the canonical SDN separation: the control plane produces forwarding rules; the data
(user) plane applies them.
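A hypothetical shape for the v2 intent emission: `JobSpec` fields and the `jobs.<env>` subject naming below are assumptions for illustration, not a settled schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass

@dataclass
class JobSpec:
    """The v2 control-plane output: pure intent, no execution."""
    intent_id: str
    environment: str   # e.g. "torch.dev.gpu" or "ros.dev.gpu"
    entrypoint: str    # e.g. "papermill nb.ipynb out.ipynb"
    gpu: int = 1

def emit_jobspec(environment: str, entrypoint: str, gpu: int = 1) -> tuple[str, bytes]:
    """Build the (subject, payload) pair an agent would publish to NATS.

    The user-plane executor (no LLM) subscribes to jobs.<env> and runs
    the CLI; the control plane never touches kubectl or ray directly.
    """
    spec = JobSpec(str(uuid.uuid4()), environment, entrypoint, gpu)
    subject = f"jobs.{environment.replace('.', '-')}"
    return subject, json.dumps(asdict(spec)).encode()
```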
First Principles
Six first principles drove the v1 design choices:
0. Control plane does not execute
The control plane reasons about what should happen and emits intent. It does not directly
mutate infrastructure. In v1, this principle is approximated: the claude -p subprocess
boundary is the execution boundary. In v2, a dedicated user-plane executor enforces this
cleanly.
1. Subscription compute arbitrage
claude -p reuses the Claude Code subscription — zero per-token marginal cost for internal
tooling. At notebook fleet scale, per-token API billing would be prohibitive. The agents are
cost-free until the subscription limit is reached.
2. Infrastructure-as-conversation
Each agent uses the same CLIs a human operator would: kubectl, ray, helm, wandb, duckdb.
No new SDK surface to maintain, no client library versioning, no custom auth layer. The agent
reads CLI output exactly as a human would read a terminal. When the CLI changes, the agent
adapts without code changes.
3. Blast-radius containment via tool scoping
Bash(kubectl *) is equivalent to a least-privilege service account. The tool restriction is
enforced at the Claude Code subprocess boundary — it is not a convention or a prompt
instruction, but a hard constraint. Each agent can only reach the infrastructure it needs.
4. Subprocess isolation boundary
Each agent invocation is a fresh process with captured stdout/stderr. Failures are isolated:
a hanging agent does not block the FastAPI event loop; a crashed agent returns a non-zero exit
code caught by run_agent(); retries are clean. The API layer remains stateless with respect
to agent execution.
5. Framework-agnostic escape hatch
The claude -p subprocess interface is framework-agnostic. Pydantic AI can be layered in v2
to formalise agent input/output schemas without rewriting agent logic. The subprocess boundary
is the seam — what is above it (agent prompts, tool definitions) and below it (API routing,
job store) can evolve independently.
Event-Driven vs Request/Response
Most agent systems in production are event-driven. This design uses synchronous
request/response (API call → subprocess → return JSON). The tension is deliberate and bounded.
What event sourcing would give
- Audit trail and replayability — an immutable event log of every job state transition. The current in-memory (and future Postgres) job store loses causal history.
- Decoupled producers and consumers — currently, the API router directly invokes the agent subprocess (tight coupling). An event bus would let multiple consumers react to `job.created` without the router knowing about them.
- Natural fit for long-running jobs — notebook execution takes minutes to hours. The current polling model (`poll_job_status` + `--resume`) is a workaround for the synchronous subprocess limitation.
Why request/response is justified for v1
- No message broker to operate — reduces operational surface for a v1 platform
- `claude -p` is inherently synchronous; async event consumption would require persistent agent processes or a worker pool, both of which require operational investment
- Session resume (`--resume`) partially compensates: long-running jobs split across multiple subprocess invocations with shared session context
The impedance mismatch with the user plane
The user plane is already event-driven: DDS topics, Zenoh pub/sub, ROS action servers. The control plane wraps this in a request/response job API. This mismatch must be acknowledged: the control plane is an event consumer (KubeRay job lifecycle events, W&B run events) that presents a request/response interface upward. Redis, already in the architecture, is the natural bridge — Redis Streams can serve as the event log before a full Kafka adoption.
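A sketch of the Redis Streams bridge. Stream entries are flat string-to-string maps, so events must be flattened before `XADD`; the `jobs:events` stream name is illustrative.

```python
import time

def job_event(job_id: str, event: str, **fields: str) -> dict[str, str]:
    """Flatten a job lifecycle event into the string map XADD requires."""
    payload = {"job_id": job_id, "event": event, "ts": str(time.time()), **fields}
    return {k: str(v) for k, v in payload.items()}

# With redis-py against the Redis already in the architecture:
#   r = redis.Redis()
#   r.xadd("jobs:events", job_event(job_id, "job.created",
#                                   environment="torch.dev.gpu"))
# Consumers replay history with XRANGE, or share work via consumer groups
# (XREADGROUP): an approximation of Kafka's log semantics without new infra.
```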
Migration path to event sourcing (v2)
```
Current: API router → claude -p subprocess (synchronous)
v1.5:    API router → Redis Streams (job events) → worker pool polling streams → claude -p
v2:      NATS (control messages, ms latency) + Kafka (audit log, replay)
         Pydantic AI agents subscribing to domain topics
```
Alignment with Agentic Systems Architecture
The design patterns in DRAFT-agentic-systems-architecture.md map directly to this system:
Pattern alignment
| Pattern (from draft) | Manifestation in this system |
|---|---|
| Routing (§2.3) | FastAPI routers dispatch to specialist agents by domain. Currently synchronous; v2 moves to pub/sub topic-per-domain. |
| Orchestrator-Worker (§2.5) | Control plane is the orchestrator; KubeRay Ray workers are the workers. NATS request-reply is the target dispatch pattern. |
| Parallelisation (§2.4) | Multiple agents (NotebookAgent, ClusterAgent) can be invoked concurrently for independent operations. Currently serial; worker pool unlocks this. |
| Evaluator-Optimizer (§2.6) | WandBAgent surfaces regressions against prior runs — a lightweight evaluator loop for experiment quality. |
Memory management implications
Current agents have only short-term memory (session resume context). As the user plane matures, the control plane must become a persistent memory store for the agentic runtime:
- Episodic: job history, cluster failure events, navigation trial outcomes
- Semantic: learned cluster failure patterns, experiment regression signatures
- Procedural: successful job submission sequences, DuckLake catalog dependency chains
The Job model schema must grow to capture agent reasoning traces and tool call logs — enabling
replay, post-hoc analysis, and eventually value-guided memory attribution (§4.4 of the draft).
Streaming technology selection
| Use case | Technology | Rationale |
|---|---|---|
| Control plane → KubeRay agent dispatch | NATS | Sub-millisecond, request-reply, matches §2.5 pattern |
| Experiment data, notebook outputs | Kafka | Durable, replayable, high-throughput sensor-class data |
| User plane ↔ user plane (ROS) | DDS / Zenoh | Already in use; Zenoh decouples non-ROS containers |
| v1 bridge | Redis Streams | Already in architecture; approximates Kafka semantics without new infra |
Human oversight for user-plane safety
The user plane runs VLA agents in robotics contexts (ros.dev.gpu). The control plane must
implement safety checkpoints aligned with §6.3 of the draft:
- Maximum job duration before forced termination
- Confidence threshold gates before irreversible actions (robot actuation)
- Human-in-the-loop escalation path from job monitoring UI
- Circuit breakers: cluster-wide pause when anomaly rate exceeds threshold
MCP trajectory
```
v1: claude -p subprocess (MCP implicit via Claude Code tools)
v2: Pydantic AI agents with explicit MCP tool schemas
    Control plane exposes MCP server: user plane agents query job status,
    cluster state, and experiment results via structured tool calls
```
AgentOps Subsystem
The control plane includes an agent operations subsystem (control-plane/backend/agentops/)
that governs agent behaviour at runtime. This was previously considered as a separate
architectural layer but was consolidated into the control plane — the industry standard
(Forrester "Agent Control Plane" market category, Microsoft Agent 365, GitHub Enterprise AI
Controls) places these functions within the control plane.
Why not a separate layer
The AgentOps functions have no independent deployment boundary — they are Python modules
within the control plane backend. In v1, they are already implicit in run_agent() (subprocess
timeout, --resume for state, --allowedTools for guardrails). Promoting implementation
details to an architectural layer adds conceptual overhead without structural benefit.
Components
| Component | Responsibility | v1 mechanism | v1.5 implementation |
|---|---|---|---|
| Execution Scheduler | Concurrency limits, priority queue | Synchronous subprocess (queue depth 1) | Redis work queue + configurable worker pool |
| Agent Lifecycle Manager | Retry, timeout, dead-letter | Manual in run_agent() caller | Exponential backoff with jitter; max 3 retries |
| Backpressure Controller | Prevent user-plane overload | None (unbounded) | Per-environment Ray job concurrency cap |
| Runtime Policy Engine | Token budgets, tool call limits | max_turns param | Configurable per agent role; fetched from management plane in v2 |
| Guardrails Engine | Safety constraints per agent role | --allowedTools at spawn | Per-role constraint checks; v2 adds per-tool-call evaluation |
| Trace Collector | Structured agent event logging | stdout/stderr capture | AgentEvent emitted to Redis Stream / Postgres |
| State Stabiliser | Checkpoint and resume across restarts | --resume session_id (in-memory) | Checkpoint to Postgres; resume from last tool call |
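The v1.5 lifecycle-manager retry policy (exponential backoff with full jitter, bounded retries) reduces to a few lines. A sketch; the injectable `sleep` parameter exists only to make the function testable.

```python
import random
import time

def with_retries(fn, max_retries: int = 3, base_s: float = 1.0, sleep=time.sleep):
    """Run fn with exponential backoff and full jitter.

    After max_retries failures the exception propagates (dead-letter path),
    so there is no retry explosion.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Full jitter: uniform in [0, base * 2^attempt)
            sleep(random.uniform(0, base_s * 2 ** attempt))
```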
AgentEvent schema
Every claude -p subprocess invocation produces an AgentEvent on completion:
```
AgentEvent {
  event_id: UUID
  intent_id: UUID
  session_id: str
  agent_role: str
  job_id: UUID (opt)
  tool_calls: [{tool, input, output, duration_ms}]
  tokens_in: int
  tokens_out: int
  exit_code: int
  limit_hit: str | null
  retry_count: int
  duration_ms: int
  tenant_id: UUID
}
```
Runtime policies
| Policy | Default | Configurable by |
|---|---|---|
| `max_tool_calls` | 50 | Management plane Policy Engine |
| `max_recursion_depth` | 5 | Agent role config |
| `token_budget_in` | 100 000 | Management plane (per tenant) |
| `token_budget_out` | 10 000 | Management plane (per tenant) |
| `max_duration_s` | 3 600 | JobSpec / AgentIntent |
Failure modes prevented
- Recursive planning loops — token budget and max tool call limits terminate cleanly
- Tool call storms — backpressure controller caps concurrent Ray job submissions
- Distributed agent chaos — execution scheduler enforces concurrency limits per agent role
- Retry explosions — lifecycle manager enforces bounded retries with exponential backoff
World model (v2)
In v2, the AgentOps subsystem maintains a structured world model of the agent ecosystem:
```
WorldModel {
  agents: {<intent_id>: {role, status, tool_call_count, session_id}}
  user_plane: {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  execution_topology: [{parent_intent_id, child_intent_id, relationship}]
  causal_graph: [{intent_id, job_id, outcome}]
}
```
Enables deadlock detection, anomaly detection, causal replay, and predictive throttling.
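Deadlock detection over `execution_topology` reduces to cycle detection on the parent-to-child intent graph; a sketch using depth-first search over the edge list shape shown above.

```python
def has_deadlock(topology: list[dict[str, str]]) -> bool:
    """Detect a cycle in the parent->child intent graph (deadlock signal)."""
    graph: dict[str, list[str]] = {}
    for edge in topology:
        graph.setdefault(edge["parent_intent_id"], []).append(
            edge["child_intent_id"]
        )

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / done
    colour: dict[str, int] = {}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for child in graph.get(node, []):
            c = colour.get(child, WHITE)
            # A grey child means we looped back onto the current DFS path
            if c == GREY or (c == WHITE and visit(child)):
                return True
        colour[node] = BLACK
        return False

    return any(visit(n) for n in graph if colour.get(n, WHITE) == WHITE)
```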
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Plane boundary (v1) | claude -p subprocess as execution boundary | Approximates "control plane does not execute"; FastAPI never touches kubectl directly |
| Plane boundary (v2) | Intent emitter + user-plane executor | Clean SDN separation: control plane emits JobSpec; executor applies it |
| Agent compute | claude -p subprocess | Reuses Claude Code subscription; zero marginal cost |
| Agent isolation | One agent per domain | Tool scoping, blast-radius containment, optimised prompts |
| AgentOps placement | Control plane subsystem, not separate layer | No deployment boundary; industry standard places these functions in the control plane |
| Infra | Proxmox K8s + KubeRay | Ray on Docker Swarm is non-standard; K8s gives GPU scheduling |
| Experiment tracking | W&B | Established tooling; CLI-accessible for agent use |
| Model artifact registry | HF Hub | trainer.push_to_hub(); no vendor lock-in; same as HF Jobs workflow |
| Training compute | Ray Jobs on torch.dev.gpu | Local equivalent of HF Jobs; Unsloth/TRL scripts run unchanged |
| VLA inference | Policy Server (Ray Serve) on torch.dev.gpu | Model-agnostic endpoint; backends swappable (OpenVLA → GR00T → Cosmos) |
| v1 job store | In-memory → Postgres | Postgres planned; in-memory for scaffolding only |
| v1 dispatch | Synchronous subprocess | Simplest path; Redis Streams migration in v1.5 |
| Streaming (v2) | NATS + Kafka | NATS for control, Kafka for audit — per §3.1 of architecture draft |
| Agent formalisation | Pydantic AI deferred | Added in v2 to formalise schemas without rewriting agent logic |
Evolution Path
v1 — Synchronous subprocesses, Postgres job store; hybrid plane boundary (reason + execute)
Training jobs via Ray: Unsloth/TRL on torch.dev.gpu; HF Hub as model artifact registry
TwinAgent for digital twin lifecycle
v1.5 — AgentOps subsystem: execution scheduler, backpressure, trace collector; Redis Streams
Control plane emits AgentEvents to management plane Observability Store
PolicyAgent: deploy/manage Policy Server (Ray Serve) for VLA inference on torch.dev.gpu
Cosmos model post-training via Ray Jobs → push to HF Hub → hot-swap on Policy Server
v2 — NATS (control), Kafka (audit), Pydantic AI + MCP; world model
Explicit plane separation: control agents emit JobSpec; user-plane executor runs CLI
Management plane: billing, tenancy, dynamic tool scoping, evaluation loops
See also:
- `docs/plans/2026-02-23-auraison-user-plane-design.md` — execution mesh design
- `docs/plans/2026-02-23-auraison-management-plane-design.md` — governance and observability
- `docs/plans/2026-02-23-auraison-agentops-design.md` — original AgentOps design (consolidated here)
- `docs/plans/2026-03-02-digital-twins-design.md` — TwinAgent and twin schema
- `docs/plans/2026-03-02-ar4-digital-twin-design.md` — AR4 twin, layered plane architecture
References
- HF Jobs: Run and manage Jobs — managed container training; the cloud pattern our Ray Jobs replicate locally
- Train AI models with Unsloth and HF Jobs — Unsloth + TRL on HF Jobs; 2x faster, 60% less VRAM; scripts run unchanged on our Ray cluster
- TRL Jobs Training — SFT, DPO, GRPO via TRL on HF Jobs infrastructure
- Cosmos-Reason2 on Jetson — edge deployment of Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM