Auraison — AgentOps Layer Design
Date: 2026-02-23
Updated: 2026-02-28
Status: Superseded — consolidated into control plane design
(docs/plans/2026-02-23-aiops-control-plane-design.md §"AgentOps Subsystem").
AgentOps is now a control plane subsystem, not a separate architectural layer.
This document is retained for historical context.
Problem
The four-plane architecture (user / control / data / management) governs control flow well for single agents and small workflows. It breaks down as soon as the system scales to multiple concurrent agents, long-running autonomous workflows, and dynamic tool usage.
The failure modes are specific and predictable:
- Recursive planning loops — a control-plane agent keeps refining its plan endlessly; token budget explodes; no execution ever starts
- Tool call storms — agents aggressively issue `kubectl` or `ray` calls in parallel; API rate limits are exceeded; failures cascade
- Distributed agent chaos — multiple concurrent NotebookAgent subprocesses submit duplicate Ray jobs, race on cluster state, or issue conflicting `helm` mutations
- Retry explosions — a naive retry on job failure re-submits without backoff or bound, compounding the original failure
In v1, these are largely prevented by the synchronous subprocess model: only one `claude -p` subprocess runs at a time, providing natural serialisation. In v1.5 (async worker pool) and v2 (NATS-driven multi-agent), this accidental serialisation disappears and the failure modes become critical.
Neither the control plane nor the management plane can own the solution. The control plane reasons about what to do; it cannot govern its own execution dynamics. The management plane governs infrastructure and governance policy; it operates at too coarse a granularity to stabilise per-agent behaviour.
AgentOps is the missing layer: the operational runtime that governs the behavioural dynamics of agents themselves — between cognition (control plane) and governance (management plane).
Definition
AgentOps is the operational runtime layer that ensures agent behaviour remains stable, observable, and bounded during execution.
The analogy: if the control plane is the brain and the user plane is the muscles, AgentOps is the nervous system and reflexes — stabilising behaviour before governance needs to intervene.
Goals
- Mediate all control-plane agent executions: concurrency limits, retries, timeouts
- Provide agent lifecycle control: start, stop, pause, restart, isolate
- Stabilise agent state across subprocess invocations (extends `--resume`)
- Enforce runtime policies: max tool calls, recursion depth, token budgets
- Implement backpressure: prevent unbounded Ray worker fan-out
- Surface agent-granularity observability: reasoning traces, tool call graphs
- Run a guardrails engine: safety constraints and behaviour policies per agent role
Non-goals
- Infrastructure provisioning — that is the management plane
- Job execution — that is the user plane
- Billing and tenancy — that is the management plane
- Reasoning and planning — that is the control plane
Position in the architecture
AgentOps sits between the control plane and the management plane. All control-plane agent invocations pass through AgentOps before execution. AgentOps never reasons; it only governs.
Core components
Execution Scheduler
The scheduler decides when and in what order control-plane agent subprocesses run.
Responsibilities:
- Maintain a bounded queue of pending `AgentIntent` messages
- Enforce a maximum concurrency limit per agent role (e.g. max 2 concurrent NotebookAgents)
- Priority ordering: ClusterAgent health checks preempt notebook submissions
- Timeout enforcement: kill subprocesses that exceed `max_duration_s`
In v1, the scheduler is implicit (synchronous subprocess model = queue depth 1). In v1.5, it becomes an explicit Redis-backed work queue with a configurable worker pool.
AgentIntent {
intent_id: UUID
agent_role: "notebook" | "cluster" | "wandb" | "lakehouse"
job_id: UUID (optional)
prompt: str
priority: int (0=critical, 4=backlog)
  max_duration_s: int (seconds)
tenant_id: UUID
}
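A minimal in-memory sketch of the scheduler's admission rule — priority ordering plus a per-role concurrency cap. Class and method names are hypothetical, and the caps are illustrative; the v1.5 version would back the queue with Redis rather than a process-local heap.

```python
import heapq
import itertools
import uuid
from dataclasses import dataclass, field

# Illustrative per-role caps (e.g. max 2 concurrent NotebookAgents, per the text).
MAX_CONCURRENT = {"notebook": 2, "cluster": 1, "wandb": 1, "lakehouse": 1}

@dataclass(order=True)
class AgentIntent:
    priority: int                              # 0 = critical, 4 = backlog
    seq: int                                   # tie-breaker: FIFO within a priority
    agent_role: str = field(compare=False)
    prompt: str = field(compare=False)
    intent_id: str = field(compare=False, default_factory=lambda: str(uuid.uuid4()))

class ExecutionScheduler:
    """Bounded priority queue with a per-role concurrency cap."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()
        self._running = {role: 0 for role in MAX_CONCURRENT}

    def submit(self, role: str, prompt: str, priority: int = 4) -> str:
        intent = AgentIntent(priority, next(self._seq), role, prompt)
        heapq.heappush(self._queue, intent)
        return intent.intent_id

    def next_runnable(self):
        """Pop the highest-priority intent whose role is still under its cap."""
        deferred, picked = [], None
        while self._queue:
            intent = heapq.heappop(self._queue)
            if self._running[intent.agent_role] < MAX_CONCURRENT[intent.agent_role]:
                self._running[intent.agent_role] += 1
                picked = intent
                break
            deferred.append(intent)            # role saturated; keep it queued
        for d in deferred:
            heapq.heappush(self._queue, d)
        return picked

    def complete(self, intent: AgentIntent) -> None:
        self._running[intent.agent_role] -= 1
```

The per-role counter is what makes a ClusterAgent health check (priority 0) preempt queued notebook submissions without a saturated role starving the whole queue.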
Agent Lifecycle Manager
Controls the state machine of each agent subprocess invocation:
PENDING → STARTING → RUNNING → SUCCEEDED
↘ FAILED → RETRYING → RUNNING (bounded)
↘ TIMED_OUT
↘ CANCELLED
Key behaviours:
- Retry policy: exponential backoff with jitter; maximum 3 retries per intent; dead-letter queue after exhaustion (prevents retry explosions)
- Isolation: a FAILED agent does not affect sibling agents of the same role
- Cancellation: `POST /agentops/intents/{id}/cancel` propagates SIGTERM to the subprocess
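The retry policy can be sketched as full-jitter exponential backoff; the constants here are illustrative assumptions, not shipped defaults.

```python
import random

MAX_RETRIES = 3      # per the lifecycle policy above
BASE_DELAY_S = 2.0   # illustrative base delay; real value would come from config
MAX_DELAY_S = 60.0

def retry_delay(attempt: int, rng=random.random):
    """Full-jitter exponential backoff.

    Returns the sleep (seconds) before retry `attempt` (1-based), or None
    when the intent should go to the dead-letter queue instead of retrying.
    """
    if attempt > MAX_RETRIES:
        return None                                   # exhausted -> dead-letter queue
    cap = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    return cap * rng()                                # jitter in [0, cap)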
State Stabiliser
Extends the existing --resume <session_id> mechanism to survive:
- FastAPI process restarts
- Host reboots
- Kubernetes pod evictions
State is checkpointed to Postgres (v1.5) or a dedicated state store (v2) at each tool call boundary. On restart, the Execution Scheduler resumes inflight intents from their last checkpoint rather than starting cold.
Checkpoint {
intent_id: UUID
session_id: str (claude -p session ID)
tool_call_count: int
last_tool: str
last_output: str
timestamp: ISO 8601
}
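A sketch of the checkpoint/resume contract, with a plain dict standing in for the Postgres table (class and method names are hypothetical):

```python
import datetime

class StateStabiliser:
    """Checkpoint at each tool call boundary; a dict stands in for Postgres."""

    def __init__(self):
        self._store = {}  # intent_id -> latest Checkpoint row

    def checkpoint(self, intent_id, session_id, tool, output, count):
        self._store[intent_id] = {
            "intent_id": intent_id,
            "session_id": session_id,
            "tool_call_count": count,
            "last_tool": tool,
            "last_output": output,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }

    def resume_args(self, intent_id):
        """Build subprocess args resuming from the last checkpoint, or start cold."""
        ckpt = self._store.get(intent_id)
        if ckpt is None:
            return ["claude", "-p"]                        # cold start
        return ["claude", "-p", "--resume", ckpt["session_id"]]
```

On restart the scheduler would call `resume_args` for each inflight intent, so an evicted pod picks up mid-session rather than replaying completed tool calls.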
Runtime Policy Engine
Enforces per-invocation limits on agent behaviour. These are runtime policies, distinct from the management plane's infrastructure policies (quotas, RBAC):
| Policy | Default | Configurable by |
|---|---|---|
| `max_tool_calls` | 50 | Management plane Policy Engine |
| `max_recursion_depth` | 5 | Agent role config |
| `token_budget_in` | 100 000 | Management plane (per tenant) |
| `token_budget_out` | 10 000 | Management plane (per tenant) |
| `max_duration_s` | 3 600 | JobSpec / AgentIntent |
When a limit is reached, the agent is terminated cleanly (SIGTERM → summary emitted) rather than killed (SIGKILL → no summary). The AgentEvent captures which limit was hit.
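One way to evaluate these limits at each tool call boundary — a sketch; the function and field names are assumptions, but the defaults mirror the table above:

```python
from dataclasses import dataclass

@dataclass
class RuntimePolicy:
    # Defaults mirror the policy table above.
    max_tool_calls: int = 50
    max_recursion_depth: int = 5
    token_budget_in: int = 100_000
    token_budget_out: int = 10_000
    max_duration_s: int = 3_600

def limit_hit(policy: RuntimePolicy, usage: dict):
    """Return the name of the first exceeded limit, or None.

    `usage` holds counters sampled at a tool call boundary; the returned
    name is what AgentEvent.limit_hit would record before SIGTERM.
    """
    checks = [
        ("max_tool_calls", usage["tool_calls"] >= policy.max_tool_calls),
        ("max_recursion_depth", usage["recursion_depth"] >= policy.max_recursion_depth),
        ("token_budget_in", usage["tokens_in"] >= policy.token_budget_in),
        ("token_budget_out", usage["tokens_out"] >= policy.token_budget_out),
        ("max_duration_s", usage["duration_s"] >= policy.max_duration_s),
    ]
    for name, exceeded in checks:
        if exceeded:
            return name
    return None
```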
Backpressure Controller
Prevents the control plane from overwhelming the user plane with concurrent job submissions.
Control flow:
NotebookAgent wants to submit Ray job
→ Backpressure Controller checks: ray_jobs_in_flight < max_concurrent_jobs
→ If over limit: enqueue AgentIntent; return 429 to caller with retry-after header
→ If under limit: permit submission; increment counter
→ On job completion event: decrement counter
Limits are per-environment (`torch.dev.gpu`, `ros.dev.gpu`) and configurable. Default: 4 concurrent Ray jobs per environment.
This directly prevents the GPU exhaustion / cascading failure scenario described in the problem statement.
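The admission check above can be sketched as a thread-safe per-environment counter (class and method names are hypothetical):

```python
import threading

class BackpressureController:
    """Per-environment in-flight Ray job cap (default 4, per the text)."""

    def __init__(self, max_concurrent_jobs: int = 4):
        self._max = max_concurrent_jobs
        self._in_flight = {}            # environment -> in-flight job count
        self._lock = threading.Lock()

    def try_admit(self, environment: str) -> bool:
        """Admit a submission if under the cap; the caller enqueues the
        AgentIntent and returns HTTP 429 with a retry-after header on False."""
        with self._lock:
            count = self._in_flight.get(environment, 0)
            if count >= self._max:
                return False
            self._in_flight[environment] = count + 1
            return True

    def on_job_complete(self, environment: str) -> None:
        """Called on the job completion event to release a slot."""
        with self._lock:
            self._in_flight[environment] -= 1
```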
Guardrails Engine
Applies safety constraints specific to the agent's domain and the workload context. Distinct from management-plane policy (which governs who can do what) — guardrails govern how agents behave during execution.
Current guardrails (v1.5):
- NotebookAgent: refuse `kubectl delete` on running jobs; require confirmation for cluster scaling operations
- ClusterAgent: refuse `helm uninstall` without an explicit `force=true` flag in the intent
- WandBAgent: read-only by default; refuse run deletion without an explicit flag
- LakehouseAgent: refuse `DROP TABLE` or `DELETE FROM` on production catalog tables
In v2, guardrails are expressed as a policy DSL and evaluated at each tool call, not just at spawn time.
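Before the v2 DSL lands, the v1.5 guardrails above could be expressed as hard-coded predicates keyed by role — a sketch; the rule table and the shape of the proposed tool call are assumptions:

```python
# Hypothetical v1.5-style guardrails: each rule is a (predicate, reason) pair
# evaluated against the proposed tool call before the subprocess is spawned.
GUARDRAILS = {
    "notebook": [
        (lambda call: call["tool"] == "kubectl" and "delete" in call["args"],
         "refuse kubectl delete on running jobs"),
    ],
    "cluster": [
        (lambda call: call["tool"] == "helm" and "uninstall" in call["args"]
                      and not call.get("force", False),
         "refuse helm uninstall without force=true"),
    ],
    "lakehouse": [
        (lambda call: any(kw in call["args"].upper()
                          for kw in ("DROP TABLE", "DELETE FROM")),
         "refuse destructive SQL on production catalog tables"),
    ],
}

def check_guardrails(role: str, call: dict):
    """Return a refusal reason if any guardrail fires, else None (permit)."""
    for predicate, reason in GUARDRAILS.get(role, []):
        if predicate(call):
            return reason
    return None
```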
Trace Collector
Every `claude -p` subprocess invocation produces an AgentEvent on completion. The Trace Collector enriches the event with AgentOps metadata and forwards it to the management-plane Observability Store.
AgentEvent {
event_id: UUID
intent_id: UUID (links to AgentIntent)
session_id: str
agent_role: str
job_id: UUID (opt)
tool_calls: [{tool, input, output, duration_ms}]
tokens_in: int
tokens_out: int
exit_code: int
limit_hit: str | null (which runtime policy triggered termination, if any)
retry_count: int
checkpoint_count: int
duration_ms: int
tenant_id: UUID
}
World Model–Driven AgentOps
The frontier pattern for 2026 agent systems: AgentOps maintains a structured world model of the agent ecosystem, not just per-invocation state.
The world model captures:
WorldModel {
agents: {
<intent_id>: {role, status, tool_call_count, current_tool, session_id}
}
user_plane: {
torch_dev_gpu: {ray_jobs_in_flight, cluster_status, gpu_utilisation}
ros_dev_gpu: {ray_jobs_in_flight, cluster_status, gpu_utilisation}
}
execution_topology: [
{parent_intent_id, child_intent_id, relationship: "spawned_by" | "blocked_by"}
]
causal_graph: [
{control_plane_intent_id, user_plane_job_id, outcome}
]
}
The world model enables:
- Deadlock detection: circular `blocked_by` relationships surfaced before they stall
- Anomaly detection: an agent taking 10× the usual tool calls for a known job type
- Causal replay: for any user-plane outcome, reconstruct the full control-plane intent chain
- Predictive throttling: pre-emptively apply backpressure based on queued intent depth
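Deadlock detection over `blocked_by` edges reduces to cycle detection in the execution topology — a sketch, assuming rows shaped like the WorldModel schema above:

```python
def find_deadlock(execution_topology):
    """Detect a cycle of blocked_by edges in the world model's topology.

    Returns the intent ids on one cycle, or None. Recursive DFS with
    white/grey/black colouring; spawned_by edges are ignored.
    """
    graph = {}
    for edge in execution_topology:
        if edge["relationship"] == "blocked_by":
            graph.setdefault(edge["parent_intent_id"], []).append(edge["child_intent_id"])

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:                  # back edge -> cycle found
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt, path)
                if cycle:
                    return cycle
        colour[node] = BLACK
        path.pop()
        return None

    for node in list(graph):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

Running this on each world-model update is cheap (linear in edges), so circular waits can be surfaced and one participant cancelled before the intents time out.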
Relevance to robotics (turtlebot-maze)
For the `ros.dev.gpu` environment, the world model's `execution_topology` maps directly to the robot's operational state:
- A Nav2 action in progress is a `user_plane` entry in the world model
- The Claude Code `/navigate` skill invocation is a `control_plane` entry
- The causal graph links the navigation intent to the robot's position change
This is the convergence point between AgentOps and world-model-based VLA architectures: the AgentOps world model is a structured representation of the robot's operational context, enabling the control plane to reason over live execution state rather than polling CLI output.
Interfaces
Control plane → AgentOps
POST /agentops/intents
Body: AgentIntent
→ 202 {intent_id, queue_position, estimated_start_s}
GET /agentops/intents/{id}
→ {status, checkpoint, tool_call_count, duration_ms}
POST /agentops/intents/{id}/cancel
→ 200 | 409 (already terminal)
GET /agentops/world-model
→ WorldModel snapshot
AgentOps → management plane
Emits AgentEvent to the `mp.agent.events` stream (Redis Streams in v1.5, NATS in v2) on each agent completion or timeout.
AgentOps → user plane (backpressure)
Does not directly call user plane. Mediates control-plane agent access to user-plane CLIs via the Backpressure Controller and Guardrails Engine at subprocess spawn time.
v1 mapping (implicit AgentOps)
In v1, AgentOps concerns are handled implicitly:
| AgentOps concern | v1 mechanism | Limitation |
|---|---|---|
| Concurrency | Synchronous subprocess (1 at a time) | Blocks FastAPI event loop |
| Retry | Manual in `run_agent()` caller | No backoff, no bound |
| State | `--resume session_id` (in-memory) | Lost on process restart |
| Timeout | `subprocess.run(timeout=...)` | Kill, no clean summary |
| Backpressure | None | Unbounded Ray job submission |
| Guardrails | `--allowedTools` at spawn | Tool scope only; no call-count limit |
| Tracing | stdout/stderr captured | No structured event; no downstream consumer |
The v1 implicit model is sufficient for a single-operator platform with serialised job submission. It fails at multi-agent scale.
Implementation sequence
AgentOps is introduced incrementally, not as a big-bang v2 rewrite:
v1.5 (async worker pool)
- Execution Scheduler — Redis work queue; configurable worker pool size
- Agent Lifecycle Manager — retry with exponential backoff; dead-letter queue
- Backpressure Controller — per-environment Ray job concurrency cap
- Trace Collector — structured `AgentEvent` emitted to a Redis Stream
v2 (full runtime)
- State Stabiliser — checkpoint to Postgres; resume across restarts
- Runtime Policy Engine — token budgets, tool call limits, fetched from management plane
- Guardrails Engine — per-role constraint DSL; evaluated per tool call
- World Model — live execution topology; causal graph; deadlock detection
Evolution path
v1 — AgentOps implicit: synchronous subprocess = accidental serialisation
v1.5 — Execution Scheduler + Lifecycle Manager + Backpressure + Trace Collector
v2 — State Stabiliser + Runtime Policy Engine + Guardrails Engine + World Model
v3 — World Model–Driven AgentOps: predictive throttling; causal replay; VLA integration
See also:
- `docs/plans/2026-02-23-auraison-control-plane-design.md` — control plane intent emitters
- `docs/plans/2026-02-23-auraison-management-plane-design.md` — Policy Engine, Observability Store
- `docs/plans/2026-02-23-auraison-user-plane-design.md` — execution mesh, backpressure target