Auraison — AgentOps Layer Design

Date: 2026-02-23
Updated: 2026-02-28
Status: Superseded — consolidated into the control plane design (docs/plans/2026-02-23-aiops-control-plane-design.md, §"AgentOps Subsystem"). AgentOps is now a control plane subsystem, not a separate architectural layer. This document is retained for historical context.


Problem

The four-plane architecture (user / control / data / management) governs control flow well for single agents and small workflows. It breaks down as soon as the system scales to multiple concurrent agents, long-running autonomous workflows, and dynamic tool usage.

The failure modes are specific and predictable:

  1. Recursive planning loops — a control-plane agent keeps refining its plan endlessly; token budget explodes; no execution ever starts
  2. Tool call storms — agents aggressively issue kubectl or ray calls in parallel; API rate limits exceeded; cascading failures
  3. Distributed agent chaos — multiple concurrent NotebookAgent subprocesses submit duplicate Ray jobs, race on cluster state, or issue conflicting helm mutations
  4. Retry explosions — a naive retry on job failure re-submits without backoff or bound, compounding the original failure

In v1, these are largely prevented by the synchronous subprocess model: only one claude -p subprocess runs at a time, providing natural serialisation. In v1.5 (async worker pool) and v2 (NATS-driven multi-agent), this accidental serialisation disappears and the failure modes become critical.

Neither the control plane nor the management plane can own the solution. The control plane reasons about what to do; it cannot govern its own execution dynamics. The management plane governs infrastructure and governance policy; it operates at too coarse a granularity to stabilise per-agent behaviour.

AgentOps is the missing layer: the operational runtime that governs the behavioural dynamics of agents themselves — between cognition (control plane) and governance (management plane).


Definition

AgentOps is the operational runtime layer that ensures agent behaviour remains stable, observable, and bounded during execution.

The analogy: if the control plane is the brain and the user plane is the muscles, AgentOps is the nervous system and reflexes — stabilising behaviour before governance needs to intervene.


Goals

  • Mediate all control-plane agent executions: concurrency limits, retries, timeouts
  • Provide agent lifecycle control: start, stop, pause, restart, isolate
  • Stabilise agent state across subprocess invocations (extends --resume)
  • Enforce runtime policies: max tool calls, recursion depth, token budgets
  • Implement backpressure: prevent unbounded Ray worker fan-out
  • Surface agent-granularity observability: reasoning traces, tool call graphs
  • Run a guardrails engine: safety constraints and behaviour policies per agent role

Non-goals

  • Infrastructure provisioning — that is the management plane
  • Job execution — that is the user plane
  • Billing and tenancy — that is the management plane
  • Reasoning and planning — that is the control plane

Position in the architecture

AgentOps sits between the control plane and the management plane. All control-plane agent invocations pass through AgentOps before execution. AgentOps never reasons; it only governs.


Core components

Execution Scheduler

The scheduler decides when and in what order control-plane agent subprocesses run.

Responsibilities:

  • Maintain a bounded queue of pending AgentIntent messages
  • Enforce a maximum concurrency limit per agent role (e.g. max 2 concurrent NotebookAgents)
  • Priority ordering: ClusterAgent health checks preempt notebook submissions
  • Timeout enforcement: kill subprocesses that exceed max_duration_s

In v1, the scheduler is implicit (synchronous subprocess model = queue depth 1). In v1.5, it becomes an explicit Redis-backed work queue with a configurable worker pool.

AgentIntent {
  intent_id: UUID
  agent_role: "notebook" | "cluster" | "wandb" | "lakehouse"
  job_id: UUID (optional)
  prompt: str
  priority: int (0=critical, 4=backlog)
  max_duration_s: seconds
  tenant_id: UUID
}
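The scheduler's priority ordering can be sketched as a heap keyed on the intent's priority field. This is an illustrative sketch, not the real scheduler: the dataclass mirrors the AgentIntent fields above, and `heapq` stands in for the Redis-backed queue of v1.5.

```python
import uuid
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class AgentIntent:
    # Only priority participates in ordering: 0 = critical, 4 = backlog.
    priority: int
    intent_id: str = field(compare=False, default_factory=lambda: str(uuid.uuid4()))
    agent_role: str = field(compare=False, default="notebook")
    prompt: str = field(compare=False, default="")
    max_duration_s: int = field(compare=False, default=3600)
    tenant_id: str = field(compare=False, default="")

# A ClusterAgent health check (priority 0) preempts a queued
# notebook submission (priority 4), as described above.
queue: list[AgentIntent] = []
heapq.heappush(queue, AgentIntent(priority=4, agent_role="notebook"))
heapq.heappush(queue, AgentIntent(priority=0, agent_role="cluster"))

next_intent = heapq.heappop(queue)
```

In the real v1.5 scheduler the queue is bounded and persisted in Redis; the heap here only demonstrates the dequeue order.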

Agent Lifecycle Manager

Controls the state machine of each agent subprocess invocation:

PENDING → STARTING → RUNNING → SUCCEEDED
                         ↘ FAILED → RETRYING → RUNNING (bounded)
                         ↘ TIMED_OUT
                         ↘ CANCELLED

Key behaviours:

  • Retry policy: exponential backoff with jitter; maximum 3 retries per intent; dead-letter queue after exhaustion (prevents retry explosions)
  • Isolation: a FAILED agent does not affect sibling agents of the same role
  • Cancellation: POST /agentops/intents/{id}/cancel propagates SIGTERM to subprocess
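The retry policy above can be sketched in a few lines. The base delay constant is an assumption (the document does not fix one); the bound of 3 retries and the dead-letter transition are taken from the policy as stated.

```python
import random

MAX_RETRIES = 3      # per the lifecycle policy above
BASE_DELAY_S = 2.0   # assumed base delay; tune per deployment

def backoff_delay(retry_count: int) -> float:
    """Exponential backoff with full jitter for retry N (1-based).

    Retry 1 waits up to 2s, retry 2 up to 4s, retry 3 up to 8s."""
    return random.uniform(0, BASE_DELAY_S * 2 ** (retry_count - 1))

def next_state(retry_count: int) -> str:
    """FAILED transitions to RETRYING while bounded, else dead-letter."""
    return "RETRYING" if retry_count < MAX_RETRIES else "DEAD_LETTER"
```

Full jitter (uniform over [0, cap]) is one common choice for decorrelating concurrent retries; the document does not prescribe a specific jitter scheme.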

State Stabiliser

Extends the existing --resume <session_id> mechanism to survive:

  • FastAPI process restarts
  • Host reboots
  • Kubernetes pod evictions

State is checkpointed to Postgres (v1.5) or a dedicated state store (v2) at each tool call boundary. On restart, the Execution Scheduler resumes inflight intents from their last checkpoint rather than starting cold.

Checkpoint {
  intent_id: UUID
  session_id: str (claude -p session ID)
  tool_call_count: int
  last_tool: str
  last_output: str
  timestamp: ISO 8601
}
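A minimal sketch of the checkpoint-and-resume cycle, using sqlite3 as a stand-in for the Postgres store described above. The upsert-per-tool-call and the rebuilt `claude -p --resume` invocation follow the mechanism in the text; the table layout is an assumption.

```python
import datetime
import sqlite3

# In-memory sqlite3 stands in for the v1.5 Postgres checkpoint store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE checkpoints (
    intent_id TEXT PRIMARY KEY, session_id TEXT,
    tool_call_count INTEGER, last_tool TEXT,
    last_output TEXT, timestamp TEXT)""")

def save_checkpoint(intent_id, session_id, count, tool, output):
    # Upsert at each tool call boundary; the latest checkpoint wins.
    db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?,?,?,?,?,?)",
               (intent_id, session_id, count, tool, output,
                datetime.datetime.now(datetime.timezone.utc).isoformat()))

def resume_args(intent_id):
    """Rebuild the resume invocation from the last checkpoint, or None."""
    row = db.execute("SELECT session_id FROM checkpoints WHERE intent_id = ?",
                     (intent_id,)).fetchone()
    return ["claude", "-p", "--resume", row[0]] if row else None
```

On restart, the scheduler would call `resume_args` for each inflight intent and respawn the subprocess warm rather than cold.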

Runtime Policy Engine

Enforces per-invocation limits on agent behaviour. These are runtime policies, distinct from the management plane's infrastructure policies (quotas, RBAC):

Policy              | Default | Configurable by
--------------------|---------|-------------------------------
max_tool_calls      | 50      | Management plane Policy Engine
max_recursion_depth | 5       | Agent role config
token_budget_in     | 100 000 | Management plane (per tenant)
token_budget_out    | 10 000  | Management plane (per tenant)
max_duration_s      | 3 600   | JobSpec / AgentIntent

When a limit is reached, the agent is terminated cleanly (SIGTERM → summary emitted) rather than killed (SIGKILL → no summary). The AgentEvent captures which limit was hit.

Backpressure Controller

Prevents the control plane from overwhelming the user plane with concurrent job submissions.

Control flow:
NotebookAgent wants to submit a Ray job
  → Backpressure Controller checks: ray_jobs_in_flight < max_concurrent_jobs
  → If over limit: enqueue the AgentIntent; return 429 to the caller with a Retry-After header
  → If under limit: permit submission; increment the counter
  → On the job completion event: decrement the counter

Limits are per-environment (torch.dev.gpu, ros.dev.gpu) and configurable. Default: 4 concurrent Ray jobs per environment.

This directly prevents the GPU exhaustion / cascading failure scenario described in the problem statement.
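The controller reduces to a per-environment counter with an acquire/release protocol. A sketch under the stated default of 4 concurrent jobs (shown here with a limit of 2 for brevity):

```python
class BackpressureController:
    """Per-environment in-flight counter; illustrative, not production code."""

    def __init__(self, max_concurrent_jobs: int = 4):
        self.max = max_concurrent_jobs
        self.in_flight: dict[str, int] = {}   # environment name -> count

    def try_acquire(self, env: str) -> bool:
        # Under the limit: permit submission and increment.
        # Over the limit: caller enqueues the intent and returns 429.
        if self.in_flight.get(env, 0) < self.max:
            self.in_flight[env] = self.in_flight.get(env, 0) + 1
            return True
        return False

    def release(self, env: str) -> None:
        # Invoked on the job completion event.
        self.in_flight[env] -= 1

bp = BackpressureController(max_concurrent_jobs=2)
grants = [bp.try_acquire("torch.dev.gpu") for _ in range(3)]
```

In v1.5 the counter would live in Redis (atomic INCR/DECR) so that all workers in the pool observe the same in-flight count.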

Guardrails Engine

Applies safety constraints specific to the agent's domain and the workload context. Distinct from management-plane policy (which governs who can do what) — guardrails govern how agents behave during execution.

Current guardrails (v1.5):

  • NotebookAgent: refuse kubectl delete on running jobs; require confirmation for cluster scaling operations
  • ClusterAgent: refuse helm uninstall without explicit force=true flag in intent
  • WandBAgent: read-only by default; refuse run deletion without explicit flag
  • LakehouseAgent: refuse DROP TABLE or DELETE FROM on production catalog tables

In v2, guardrails are expressed as a policy DSL and evaluated at each tool call, not just at spawn time.
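Ahead of the v2 DSL, the v1.5 guardrails amount to per-role deny rules over proposed tool calls. A sketch — the regex patterns are illustrative encodings of the bullets above, not the actual rule set:

```python
import re

# Hypothetical deny patterns per agent role, encoding the bullets above.
GUARDRAILS = {
    "notebook":  [r"\bkubectl\s+delete\b"],
    "cluster":   [r"\bhelm\s+uninstall\b"],
    "lakehouse": [r"\bDROP\s+TABLE\b", r"\bDELETE\s+FROM\b"],
}

def violates_guardrail(agent_role: str, command: str) -> bool:
    """True if the proposed tool call matches a deny pattern for this role."""
    return any(re.search(pattern, command, re.IGNORECASE)
               for pattern in GUARDRAILS.get(agent_role, []))
```

Note what this sketch omits: the confirmation and `force=true` escape hatches from the bullets above require intent-level context, which is one motivation for moving to a policy DSL in v2.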

Trace Collector

Every claude -p subprocess invocation produces an AgentEvent on completion. The Trace Collector enriches the event with AgentOps metadata and forwards it to the management-plane Observability Store.

AgentEvent {
  event_id: UUID
  intent_id: UUID (links to AgentIntent)
  session_id: str
  agent_role: str
  job_id: UUID (optional)
  tool_calls: [{tool, input, output, duration_ms}]
  tokens_in: int
  tokens_out: int
  exit_code: int
  limit_hit: str | null (which runtime policy triggered termination, if any)
  retry_count: int
  checkpoint_count: int
  duration_ms: int
  tenant_id: UUID
}

World Model–Driven AgentOps

The frontier pattern for 2026 agent systems: AgentOps maintains a structured world model of the agent ecosystem, not just per-invocation state.

The world model captures:

WorldModel {
  agents: {
    <intent_id>: {role, status, tool_call_count, current_tool, session_id}
  }
  user_plane: {
    torch_dev_gpu: {ray_jobs_in_flight, cluster_status, gpu_utilisation}
    ros_dev_gpu: {ray_jobs_in_flight, cluster_status, gpu_utilisation}
  }
  execution_topology: [
    {parent_intent_id, child_intent_id, relationship: "spawned_by" | "blocked_by"}
  ]
  causal_graph: [
    {control_plane_intent_id, user_plane_job_id, outcome}
  ]
}

The world model enables:

  • Deadlock detection: circular blocked_by relationships surfaced before they stall
  • Anomaly detection: agent taking 10× the usual tool calls for a known job type
  • Causal replay: for any user-plane outcome, reconstruct the full control-plane intent chain
  • Predictive throttling: pre-emptively apply backpressure based on queued intent depth
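Deadlock detection over the execution topology is standard cycle detection on the blocked_by edges. A sketch, taking edges as (parent_intent_id, child_intent_id) pairs filtered to relationship == "blocked_by":

```python
def find_deadlock(edges: list[tuple[str, str]]) -> bool:
    """Detect a cycle in the blocked_by relation via recursive DFS.

    A GREY node revisited during its own traversal is a back edge,
    i.e. a circular blocked_by chain that would stall all members."""
    graph: dict[str, list[str]] = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for edge in edges for node in edge}

    def dfs(node: str) -> bool:
        colour[node] = GREY
        for nxt in graph.get(node, []):
            if colour[nxt] == GREY:                    # back edge: cycle
                return True
            if colour[nxt] == WHITE and dfs(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and dfs(node) for node in colour)
```

A production version would report the cycle's members so the scheduler can cancel or reorder the offending intents, not just flag the condition.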

Relevance to robotics (turtlebot-maze)

For the ros.dev.gpu environment, the world model's execution_topology maps directly to the robot's operational state:

  • A Nav2 action in progress is a user_plane entry in the world model
  • The Claude Code /navigate skill invocation is a control_plane entry
  • The causal graph links the navigation intent to the robot's position change

This is the convergence point between AgentOps and world-model-based VLA architectures: the AgentOps world model is a structured representation of the robot's operational context, enabling the control plane to reason over live execution state rather than polling CLI output.


Interfaces

Control plane → AgentOps

POST /agentops/intents
Body: AgentIntent
→ 202 {intent_id, queue_position, estimated_start_s}

GET /agentops/intents/{id}
→ {status, checkpoint, tool_call_count, duration_ms}

POST /agentops/intents/{id}/cancel
→ 200 | 409 (already terminal)

GET /agentops/world-model
→ WorldModel snapshot
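A control-plane caller's side of the submission endpoint can be sketched with the standard library alone. The base URL is hypothetical; the path and the 202 response shape follow the interface spec above.

```python
import json
import urllib.request

BASE = "http://agentops.internal"   # hypothetical in-cluster address

def build_submit_request(intent: dict) -> urllib.request.Request:
    """Build POST /agentops/intents carrying an AgentIntent body.

    A 202 response carries {intent_id, queue_position, estimated_start_s}."""
    return urllib.request.Request(
        f"{BASE}/agentops/intents",
        data=json.dumps(intent).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(build_submit_request(intent))`; splitting construction from dispatch keeps the sketch testable offline.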

AgentOps → management plane

Emits AgentEvent to mp.agent.events stream (Redis Streams v1.5, NATS v2) on each agent completion or timeout.

AgentOps → user plane (backpressure)

Does not directly call user plane. Mediates control-plane agent access to user-plane CLIs via the Backpressure Controller and Guardrails Engine at subprocess spawn time.


v1 mapping (implicit AgentOps)

In v1, AgentOps concerns are handled implicitly:

AgentOps concern | v1 mechanism                         | Limitation
-----------------|--------------------------------------|---------------------------------------------
Concurrency      | Synchronous subprocess (1 at a time) | Blocks FastAPI event loop
Retry            | Manual in run_agent() caller         | No backoff, no bound
State            | --resume session_id (in-memory)      | Lost on process restart
Timeout          | subprocess.run(timeout=...)          | Kill, no clean summary
Backpressure     | None                                 | Unbounded Ray job submission
Guardrails       | --allowedTools at spawn              | Tool scope only; no call-count limit
Tracing          | stdout/stderr captured               | No structured event; no downstream consumer

The v1 implicit model is sufficient for a single-operator platform with serialised job submission. It fails at multi-agent scale.


Implementation sequence

AgentOps is introduced incrementally, not as a big-bang v2 rewrite:

v1.5 (async worker pool)

  1. Execution Scheduler — Redis work queue; configurable worker pool size
  2. Agent Lifecycle Manager — retry with exponential backoff; dead-letter queue
  3. Backpressure Controller — per-environment Ray job concurrency cap
  4. Trace Collector — structured AgentEvent emitted to Redis Stream

v2 (full runtime)

  1. State Stabiliser — checkpoint to Postgres; resume across restarts
  2. Runtime Policy Engine — token budgets, tool call limits, fetched from management plane
  3. Guardrails Engine — per-role constraint DSL; evaluated per tool call
  4. World Model — live execution topology; causal graph; deadlock detection

Evolution path

v1   — AgentOps implicit: synchronous subprocess = accidental serialisation
v1.5 — Execution Scheduler + Lifecycle Manager + Backpressure + Trace Collector
v2   — State Stabiliser + Runtime Policy Engine + Guardrails Engine + World Model
v3   — World Model–Driven AgentOps: predictive throttling; causal replay; VLA integration

See also:

  • docs/plans/2026-02-23-auraison-control-plane-design.md — control plane intent emitters
  • docs/plans/2026-02-23-auraison-management-plane-design.md — Policy Engine, Observability Store
  • docs/plans/2026-02-23-auraison-user-plane-design.md — execution mesh, backpressure target