Auraison — AgentOps Layer Design

Date: 2026-02-23
Updated: 2026-02-28
Status: Superseded — consolidated into the control plane design (docs/plans/2026-02-23-aiops-control-plane-design.md, §"AgentOps Subsystem"). AgentOps is now a control plane subsystem, not a separate architectural layer. This document is retained for historical context.


Problem

The four-plane architecture (user / control / data / management) governs control flow well for single agents and small workflows. It breaks down as soon as the system scales to multiple concurrent agents, long-running autonomous workflows, and dynamic tool usage.

The failure modes are specific and predictable:

  1. Recursive planning loops — a control-plane agent keeps refining its plan endlessly; token budget explodes; no execution ever starts
  2. Tool call storms — agents aggressively issue kubectl or ray calls in parallel; API rate limits exceeded; cascading failures
  3. Distributed agent chaos — multiple concurrent NotebookAgent subprocesses submit duplicate Ray jobs, race on cluster state, or issue conflicting helm mutations
  4. Retry explosions — a naive retry on job failure re-submits without backoff or bound, compounding the original failure

In v1, these are largely prevented by the synchronous subprocess model: only one claude -p subprocess runs at a time, providing natural serialisation. In v1.5 (async worker pool) and v2 (NATS-driven multi-agent), this accidental serialisation disappears and the failure modes become critical.

Neither the control plane nor the management plane can own the solution. The control plane reasons about what to do; it cannot govern its own execution dynamics. The management plane governs infrastructure and governance policy; it operates at too coarse a granularity to stabilise per-agent behaviour.

AgentOps is the missing layer: the operational runtime that governs the behavioural dynamics of agents themselves — between cognition (control plane) and governance (management plane).


Definition

AgentOps is the operational runtime layer that ensures agent behaviour remains stable, observable, and bounded during execution.

The analogy: if the control plane is the brain and the user plane is the muscles, AgentOps is the nervous system and reflexes — stabilising behaviour before governance needs to intervene.


Goals

  • Mediate all control-plane agent executions: concurrency limits, retries, timeouts
  • Provide agent lifecycle control: start, stop, pause, restart, isolate
  • Stabilise agent state across subprocess invocations (extends --resume)
  • Enforce runtime policies: max tool calls, recursion depth, token budgets
  • Implement backpressure: prevent unbounded Ray worker fan-out
  • Surface agent-granularity observability: reasoning traces, tool call graphs
  • Run a guardrails engine: safety constraints and behaviour policies per agent role

Non-goals

  • Infrastructure provisioning — that is the management plane
  • Job execution — that is the user plane
  • Billing and tenancy — that is the management plane
  • Reasoning and planning — that is the control plane

Position in the architecture

AgentOps sits between the control plane and the management plane. All control-plane agent invocations pass through AgentOps before execution. AgentOps never reasons; it only governs.


Core components

Execution Scheduler

The scheduler decides when and in what order control-plane agent subprocesses run.

Responsibilities:

  • Maintain a bounded queue of pending AgentIntent messages
  • Enforce a maximum concurrency limit per agent role (e.g. max 2 concurrent NotebookAgents)
  • Priority ordering: ClusterAgent health checks preempt notebook submissions
  • Timeout enforcement: kill subprocesses that exceed max_duration_s

In v1, the scheduler is implicit (synchronous subprocess model = queue depth 1). In v1.5, it becomes an explicit Redis-backed work queue with a configurable worker pool.

AgentIntent {
  intent_id: UUID
  agent_role: "notebook" | "cluster" | "wandb" | "lakehouse"
  job_id: UUID (optional)
  prompt: str
  priority: int (0=critical, 4=backlog)
  max_duration_s: seconds
  tenant_id: UUID
}
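The scheduler's priority ordering can be sketched as a heap keyed on the intent's priority field. This is an illustrative sketch, not the real scheduler: the dataclass mirrors the AgentIntent fields above, and `heapq` stands in for the Redis-backed queue of v1.5.

```python
import uuid
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class AgentIntent:
    # Only priority participates in ordering: 0 = critical, 4 = backlog.
    priority: int
    intent_id: str = field(compare=False, default_factory=lambda: str(uuid.uuid4()))
    agent_role: str = field(compare=False, default="notebook")
    prompt: str = field(compare=False, default="")
    max_duration_s: int = field(compare=False, default=3600)
    tenant_id: str = field(compare=False, default="")

# A ClusterAgent health check (priority 0) preempts a queued
# notebook submission (priority 4), as described above.
queue: list[AgentIntent] = []
heapq.heappush(queue, AgentIntent(priority=4, agent_role="notebook"))
heapq.heappush(queue, AgentIntent(priority=0, agent_role="cluster"))

next_intent = heapq.heappop(queue)
```

In the real v1.5 scheduler the queue is bounded and persisted in Redis; the heap here only demonstrates the dequeue order.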

Agent Lifecycle Manager

Controls the state machine of each agent subprocess invocation:

PENDING → STARTING → RUNNING → SUCCEEDED
                         ↘ FAILED → RETRYING → RUNNING (bounded)
                         ↘ TIMED_OUT
                         ↘ CANCELLED

Key behaviours:

  • Retry policy: exponential backoff with jitter; maximum 3 retries per intent; dead-letter queue after exhaustion (prevents retry explosions)
  • Isolation: a FAILED agent does not affect sibling agents of the same role
  • Cancellation: POST /agentops/intents/{id}/cancel propagates SIGTERM to subprocess
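The retry policy above can be sketched in a few lines. The base delay constant is an assumption (the document does not fix one); the bound of 3 retries and the dead-letter transition are taken from the policy as stated.

```python
import random

MAX_RETRIES = 3      # per the lifecycle policy above
BASE_DELAY_S = 2.0   # assumed base delay; tune per deployment

def backoff_delay(retry_count: int) -> float:
    """Exponential backoff with full jitter for retry N (1-based).

    Retry 1 waits up to 2s, retry 2 up to 4s, retry 3 up to 8s."""
    return random.uniform(0, BASE_DELAY_S * 2 ** (retry_count - 1))

def next_state(retry_count: int) -> str:
    """FAILED transitions to RETRYING while bounded, else dead-letter."""
    return "RETRYING" if retry_count < MAX_RETRIES else "DEAD_LETTER"
```

Full jitter (uniform over [0, cap]) is one common choice for decorrelating concurrent retries; the document does not prescribe a specific jitter scheme.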

State Stabiliser

Extends the existing --resume <session_id> mechanism to survive:

  • FastAPI process restarts
  • Host reboots
  • Kubernetes pod evictions

State is checkpointed to Postgres (v1.5) or a dedicated state store (v2) at each tool call boundary. On restart, the Execution Scheduler resumes inflight intents from their last checkpoint rather than starting cold.

Checkpoint {
  intent_id: UUID
  session_id: str (claude -p session ID)
  tool_call_count: int
  last_tool: str
  last_output: str
  timestamp: ISO 8601
}
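A minimal sketch of the checkpoint-and-resume cycle, using sqlite3 as a stand-in for the Postgres store described above. The upsert-per-tool-call and the rebuilt `claude -p --resume` invocation follow the mechanism in the text; the table layout is an assumption.

```python
import datetime
import sqlite3

# In-memory sqlite3 stands in for the v1.5 Postgres checkpoint store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE checkpoints (
    intent_id TEXT PRIMARY KEY, session_id TEXT,
    tool_call_count INTEGER, last_tool TEXT,
    last_output TEXT, timestamp TEXT)""")

def save_checkpoint(intent_id, session_id, count, tool, output):
    # Upsert at each tool call boundary; the latest checkpoint wins.
    db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?,?,?,?,?,?)",
               (intent_id, session_id, count, tool, output,
                datetime.datetime.now(datetime.timezone.utc).isoformat()))

def resume_args(intent_id):
    """Rebuild the resume invocation from the last checkpoint, or None."""
    row = db.execute("SELECT session_id FROM checkpoints WHERE intent_id = ?",
                     (intent_id,)).fetchone()
    return ["claude", "-p", "--resume", row[0]] if row else None
```

On restart, the scheduler would call `resume_args` for each inflight intent and respawn the subprocess warm rather than cold.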

Runtime Policy Engine

Enforces per-invocation limits on agent behaviour. These are runtime policies, distinct from the management plane's infrastructure policies (quotas, RBAC):

Policy              | Default | Configurable by
--------------------|---------|-------------------------------
max_tool_calls      | 50      | Management plane Policy Engine
max_recursion_depth | 5       | Agent role config
token_budget_in     | 100 000 | Management plane (per tenant)
token_budget_out    | 10 000  | Management plane (per tenant)
max_duration_s      | 3 600   | JobSpec / AgentIntent

When a limit is reached, the agent is terminated cleanly (SIGTERM → summary emitted) rather than killed (SIGKILL → no summary). The AgentEvent captures which limit was hit.

Backpressure Controller

Prevents the control plane from overwhelming the user plane with concurrent job submissions.

Control flow:
NotebookAgent wants to submit a Ray job
  → Backpressure Controller checks: ray_jobs_in_flight < max_concurrent_jobs
  → If over limit: enqueue the AgentIntent; return 429 to the caller with a Retry-After header
  → If under limit: permit submission; increment the counter
  → On the job completion event: decrement the counter

Limits are per-environment (torch.dev.gpu, ros.dev.gpu) and configurable. Default: 4 concurrent Ray jobs per environment.

This directly prevents the GPU exhaustion / cascading failure scenario described in the problem statement.
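The controller reduces to a per-environment counter with an acquire/release protocol. A sketch under the stated default of 4 concurrent jobs (shown here with a limit of 2 for brevity):

```python
class BackpressureController:
    """Per-environment in-flight counter; illustrative, not production code."""

    def __init__(self, max_concurrent_jobs: int = 4):
        self.max = max_concurrent_jobs
        self.in_flight: dict[str, int] = {}   # environment name -> count

    def try_acquire(self, env: str) -> bool:
        # Under the limit: permit submission and increment.
        # Over the limit: caller enqueues the intent and returns 429.
        if self.in_flight.get(env, 0) < self.max:
            self.in_flight[env] = self.in_flight.get(env, 0) + 1
            return True
        return False

    def release(self, env: str) -> None:
        # Invoked on the job completion event.
        self.in_flight[env] -= 1

bp = BackpressureController(max_concurrent_jobs=2)
grants = [bp.try_acquire("torch.dev.gpu") for _ in range(3)]
```

In v1.5 the counter would live in Redis (atomic INCR/DECR) so that all workers in the pool observe the same in-flight count.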

Guardrails Engine

Applies safety constraints specific to the agent's domain and the workload context. Distinct from management-plane policy (which governs who can do what) — guardrails govern how agents behave during execution.

Current guardrails (v1.5):

  • NotebookAgent: refuse kubectl delete on running jobs; require confirmation for cluster scaling operations
  • ClusterAgent: refuse helm uninstall without explicit force=true flag in intent
  • WandBAgent: read-only by default; refuse run deletion without explicit flag
  • LakehouseAgent: refuse DROP TABLE or DELETE FROM on production catalog tables

In v2, guardrails are expressed as a policy DSL and evaluated at each tool call, not just at spawn time.
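Ahead of the v2 DSL, the v1.5 guardrails amount to per-role deny rules over proposed tool calls. A sketch — the regex patterns are illustrative encodings of the bullets above, not the actual rule set:

```python
import re

# Hypothetical deny patterns per agent role, encoding the bullets above.
GUARDRAILS = {
    "notebook":  [r"\bkubectl\s+delete\b"],
    "cluster":   [r"\bhelm\s+uninstall\b"],
    "lakehouse": [r"\bDROP\s+TABLE\b", r"\bDELETE\s+FROM\b"],
}

def violates_guardrail(agent_role: str, command: str) -> bool:
    """True if the proposed tool call matches a deny pattern for this role."""
    return any(re.search(pattern, command, re.IGNORECASE)
               for pattern in GUARDRAILS.get(agent_role, []))
```

Note what this sketch omits: the confirmation and `force=true` escape hatches from the bullets above require intent-level context, which is one motivation for moving to a policy DSL in v2.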

Trace Collector

Every claude -p subprocess invocation produces an AgentEvent on completion. The Trace Collector enriches the event with AgentOps metadata and forwards it to the management-plane Observability Store.

AgentEvent {
  event_id: UUID
  intent_id: UUID (links to AgentIntent)
  session_id: str
  agent_role: str
  job_id: UUID (optional)
  tool_calls: [{tool, input, output, duration_ms}]
  tokens_in: int
  tokens_out: int
  exit_code: int
  limit_hit: str | null (which runtime policy triggered termination, if any)
  retry_count: int
  checkpoint_count: int
  duration_ms: int
  tenant_id: UUID
}

World Model–Driven AgentOps

The frontier pattern for 2026 agent systems: AgentOps maintains a structured world model of the agent ecosystem, not just per-invocation state.

The world model captures:

WorldModel {
  agents: {
    <intent_id>: {role, status, tool_call_count, current_tool, session_id}
  }
  user_plane: {
    torch_dev_gpu: {ray_jobs_in_flight, cluster_status, gpu_utilisation}
    ros_dev_gpu: {ray_jobs_in_flight, cluster_status, gpu_utilisation}
  }
  execution_topology: [
    {parent_intent_id, child_intent_id, relationship: "spawned_by" | "blocked_by"}
  ]
  causal_graph: [
    {control_plane_intent_id, user_plane_job_id, outcome}
  ]
}

The world model enables:

  • Deadlock detection: circular blocked_by relationships surfaced before they stall
  • Anomaly detection: agent taking 10× the usual tool calls for a known job type
  • Causal replay: for any user-plane outcome, reconstruct the full control-plane intent chain
  • Predictive throttling: pre-emptively apply backpressure based on queued intent depth
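Deadlock detection over the execution topology is standard cycle detection on the blocked_by edges. A sketch, taking edges as (parent_intent_id, child_intent_id) pairs filtered to relationship == "blocked_by":

```python
def find_deadlock(edges: list[tuple[str, str]]) -> bool:
    """Detect a cycle in the blocked_by relation via recursive DFS.

    A GREY node revisited during its own traversal is a back edge,
    i.e. a circular blocked_by chain that would stall all members."""
    graph: dict[str, list[str]] = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for edge in edges for node in edge}

    def dfs(node: str) -> bool:
        colour[node] = GREY
        for nxt in graph.get(node, []):
            if colour[nxt] == GREY:                    # back edge: cycle
                return True
            if colour[nxt] == WHITE and dfs(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and dfs(node) for node in colour)
```

A production version would report the cycle's members so the scheduler can cancel or reorder the offending intents, not just flag the condition.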

Relevance to robotics (turtlebot-maze)

For the ros.dev.gpu environment, the world model's execution_topology maps directly to the robot's operational state:

  • A Nav2 action in progress is a user_plane entry in the world model
  • The Claude Code /navigate skill invocation is a control_plane entry
  • The causal graph links the navigation intent to the robot's position change

This is the convergence point between AgentOps and world-model-based VLA architectures: the AgentOps world model is a structured representation of the robot's operational context, enabling the control plane to reason over live execution state rather than polling CLI output.


Interfaces

Control plane → AgentOps

POST /agentops/intents
Body: AgentIntent
→ 202 {intent_id, queue_position, estimated_start_s}

GET /agentops/intents/{id}
→ {status, checkpoint, tool_call_count, duration_ms}

POST /agentops/intents/{id}/cancel
→ 200 | 409 (already terminal)

GET /agentops/world-model
→ WorldModel snapshot
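A control-plane caller's side of the submission endpoint can be sketched with the standard library alone. The base URL is hypothetical; the path and the 202 response shape follow the interface spec above.

```python
import json
import urllib.request

BASE = "http://agentops.internal"   # hypothetical in-cluster address

def build_submit_request(intent: dict) -> urllib.request.Request:
    """Build POST /agentops/intents carrying an AgentIntent body.

    A 202 response carries {intent_id, queue_position, estimated_start_s}."""
    return urllib.request.Request(
        f"{BASE}/agentops/intents",
        data=json.dumps(intent).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(build_submit_request(intent))`; splitting construction from dispatch keeps the sketch testable offline.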

AgentOps → management plane

Emits AgentEvent to mp.agent.events stream (Redis Streams v1.5, NATS v2) on each agent completion or timeout.

AgentOps → user plane (backpressure)

Does not directly call user plane. Mediates control-plane agent access to user-plane CLIs via the Backpressure Controller and Guardrails Engine at subprocess spawn time.


v1 mapping (implicit AgentOps)

In v1, AgentOps concerns are handled implicitly:

AgentOps concern | v1 mechanism                         | Limitation
-----------------|--------------------------------------|---------------------------------------------
Concurrency      | Synchronous subprocess (1 at a time) | Blocks FastAPI event loop
Retry            | Manual in run_agent() caller         | No backoff, no bound
State            | --resume session_id (in-memory)      | Lost on process restart
Timeout          | subprocess.run(timeout=...)          | Kill, no clean summary
Backpressure     | None                                 | Unbounded Ray job submission
Guardrails       | --allowedTools at spawn              | Tool scope only; no call-count limit
Tracing          | stdout/stderr captured               | No structured event; no downstream consumer

The v1 implicit model is sufficient for a single-operator platform with serialised job submission. It fails at multi-agent scale.


Implementation sequence

AgentOps is introduced incrementally, not as a big-bang v2 rewrite:

v1.5 (async worker pool)

  1. Execution Scheduler — Redis work queue; configurable worker pool size
  2. Agent Lifecycle Manager — retry with exponential backoff; dead-letter queue
  3. Backpressure Controller — per-environment Ray job concurrency cap
  4. Trace Collector — structured AgentEvent emitted to Redis Stream

v2 (full runtime)

  1. State Stabiliser — checkpoint to Postgres; resume across restarts
  2. Runtime Policy Engine — token budgets, tool call limits, fetched from management plane
  3. Guardrails Engine — per-role constraint DSL; evaluated per tool call
  4. World Model — live execution topology; causal graph; deadlock detection

Evolution path

v1   — AgentOps implicit: synchronous subprocess = accidental serialisation
v1.5 — Execution Scheduler + Lifecycle Manager + Backpressure + Trace Collector
v2   — State Stabiliser + Runtime Policy Engine + Guardrails Engine + World Model
v3   — World Model–Driven AgentOps: predictive throttling; causal replay; VLA integration

See also:

  • docs/plans/2026-02-23-auraison-control-plane-design.md — control plane intent emitters
  • docs/plans/2026-02-23-auraison-management-plane-design.md — Policy Engine, Observability Store
  • docs/plans/2026-02-23-auraison-user-plane-design.md — execution mesh, backpressure target