Four-Plane Architecture
Date: 2026-02-23 Updated: 2026-03-10 Status: Approved
Overview
Auraison is structured as four planes following the SDN / telecom separation pattern. Three vertical planes (user, control, management) handle execution, orchestration, and governance respectively. The data plane sits horizontally, serving all three. The planes have fundamentally different latency, consistency, and availability requirements.
| Plane | What runs here | Latency / consistency | Failure consequence |
|---|---|---|---|
| User plane | Customer agents: VLA, Nav2, behavior trees, YOLOv8, SLAM; Cosmos-Reason2 (physical reasoning), Cosmos-Predict2 (world model), Cosmos-Transfer2.5 (sim2real) | Real-time (ms), stateful per-session | Agent stops; robot halts |
| Control plane | Job dispatch, cluster management, experiment tracking, agent lifecycle governance | Seconds, eventually consistent | Degraded visibility; user plane continues |
| Data plane | Lakehouse (DuckDB + DuckLake + MinIO), embeddings, event log | Seconds, eventually consistent | Queries fail; ingestion queued; agents lose context |
| Management plane | Billing, tenancy, quotas, user management | Minutes, strongly consistent | No new deployments; running agents unaffected |
The control plane includes an agent operations subsystem (execution scheduling,
backpressure, guardrails, trace collection) that governs agent behavior at runtime.
This is implemented as control-plane/backend/agentops/ — a package within the control
plane, not a separate architectural layer.
First principle: User plane failures must not cascade to the control plane, and control plane outages must not halt running agents.
System-level requirements
These requirements apply across all planes. Plane-specific requirements are decomposed
in each plane's design.md and trace back to these system-level IDs.
| ID | Requirement | Traces to |
|---|---|---|
| SYS-001 | The system shall be structured as four planes (user, control, data, management) following the SDN/telecom separation pattern | §Overview |
| SYS-002 | User plane failures shall not cascade to the control plane; control plane outages shall not halt running agents | §Overview, first principle |
| SYS-003 | The system shall support four reference applications: turtlebot-maze, ar4-physical-ai, Deep Evidence Agent, and counter-uas (v2) | §Reference applications |
| SYS-004 | The system shall adopt Zenoh as the standard non-ROS transport for reference applications | §Middleware and inference serving |
| SYS-005 | The system shall implement a dual-speed architecture: vLLM on torch.dev.gpu (System 2, planning) and VLA action heads on ros.dev.gpu (System 1, real-time control) | §Middleware and inference serving |
| SYS-006 | The system shall follow the v1 → v1.5 → v2 → v3 evolution path | §Evolution path |
| SYS-007 | The data plane shall sit horizontally, serving all three vertical planes via DuckDB + DuckLake + MinIO | §Overview |
| SYS-008 | Agents shall run as claude -p subprocesses, reusing the Claude Code subscription (not Anthropic API) | control-plane/design.md |
| SYS-009 | The system shall support three streaming tiers: data (Zenoh), control events (Redis/NATS), analytics (Kafka) | §Streaming substrate |
| SYS-010 | Control-plane agents shall consume StatusEvents as streams (not polling) from v1.5 onward | §Streaming substrate |
| SYS-011 | Agent traces and experiment audit logs shall be published to an analytics stream (Kafka) for downstream processing | §Streaming substrate |
| SYS-012 | The system shall support the agent-native streaming pattern: LLM agent consumes event stream, enriches, and publishes output stream via persistent MCP connection | §Streaming substrate |
| SYS-013 | Stream processing for analytics shall use Apache Flink for complex event processing over agent trace streams | §Streaming substrate |
| SYS-014 | Multi-tenant topic isolation for the management plane shall use Apache Pulsar | §Streaming substrate |
Requirements decomposition
Each plane's design document contains plane-specific requirements that trace back to these system-level requirements:
| Plane | Design doc | Requirement prefix | Count |
|---|---|---|---|
| Control plane | control-plane/design.md | CP-xxx | 30 |
| User plane | user-plane/design.md | UP-xxx | 36 |
| Data plane | data-plane/design.md | DP-xxx | 43 |
| Management plane | management-plane/design.md | MP-xxx | 40 |
| Architecture | architecture/four-plane.mdx | SYS-xxx | 14 |
| Total | | | 163 |
System context (C4 Level 1)
Reading the diagram. Dashed blue containers group plane / sub-plane scopes. Solid arrows are runtime data/control paths. Dashed arrows are v2 governance paths. For brevity only the critical cross-plane connectors are drawn — full edge list (Predict → Transfer → Reason loop, Zenoh ↔ vLLM, agent → AgentEvent → OBS, etc.) is described in the prose below.
Repository layout
auraison/
├── control-plane/ FastAPI API + Claude Code agent layer + AgentOps subsystem + Next.js UI
├── user-plane/ Agentic workloads: VLA, ROS 2, multi-agent (KubeRay)
├── data-plane/ Lakehouse: DuckDB + DuckLake + MinIO (migrated from aegean-ai/lakehouse)
├── management-plane/ Billing, tenancy, quotas (v2)
├── docs/
│ ├── architecture/ System-level design docs (this directory)
│ ├── plans/ Plane-specific design docs
│ └── decisions/ Cross-cutting ADRs
└── docker-compose.yml Local dev infra (Postgres + Redis)
Communication between planes
See docs/control-plane/design.md §"Communication between planes"
for the current v1 contract (subprocess + webhook) and the v1.5/v2 evolution path
(Redis Streams → NATS + Kafka).
A dedicated cross-plane communication design doc is tracked in beads issue auraison-eco.
Reference applications
Auraison supports the following reference applications. The robotics applications are
independent GitHub repositories under the aegean-ai org, deployed onto KubeRay clusters.
The Deep Evidence Agent lives in aegean-ai/dea.
turtlebot-maze — aegean-ai/turtlebot-maze
The canonical navigation application. Demonstrates Claude Code + ros-mcp-server doing
real-time robot control on ros.dev.gpu, extended in v1.5 with the Cosmos model stack:
Claude Code /navigate skill
→ ros-mcp-server (MCP over rosbridge WebSocket :9090)
→ ROS 2 Nav2 action server
→ TurtleBot navigation
Predict → Transfer → Reason → Execute loop (v1.5):
Cosmos-Predict2 (torch.dev.gpu): current frame + action → synthetic trajectory
→ Cosmos-Transfer2.5 (torch.dev.gpu): synthetic → photorealistic
→ Cosmos-Reason2 (ros.dev.gpu): feasibility evaluation → go / no-go
→ Nav2 goal dispatched or behavior tree selects alternative
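The gating logic of this loop can be sketched as follows. This is an illustrative sketch only: the model calls are stubs, and the function names are hypothetical; real inference would run Cosmos-Predict2 and Cosmos-Transfer2.5 on torch.dev.gpu and Cosmos-Reason2 on ros.dev.gpu.

```python
# Hypothetical sketch of the v1.5 Predict -> Transfer -> Reason -> Execute
# gate. All model calls are stubbed; only the control flow is real.
from dataclasses import dataclass

@dataclass
class Trajectory:
    frames: list  # predicted (later photorealistic) frames

def predict(frame, action) -> Trajectory:
    """Cosmos-Predict2 stub: roll the world model forward for one action."""
    return Trajectory(frames=[frame, f"{frame}+{action}"])

def transfer(traj: Trajectory) -> Trajectory:
    """Cosmos-Transfer2.5 stub: sim2real-augment the synthetic frames."""
    traj.frames = [f"photoreal({f})" for f in traj.frames]
    return traj

def reason(traj: Trajectory) -> bool:
    """Cosmos-Reason2 stub: feasibility go / no-go over the trajectory."""
    return "collision" not in traj.frames[-1]

def plan_step(frame, candidate_actions):
    """Return the first candidate whose predicted trajectory passes the
    feasibility gate, or None (the behavior tree then picks a fallback)."""
    for action in candidate_actions:
        traj = transfer(predict(frame, action))
        if reason(traj):
            return action  # dispatched as a Nav2 goal
    return None
```

The essential property is that the gate is evaluated before any goal reaches Nav2, so an infeasible plan costs only inference time, not robot motion.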
ar4-physical-ai — aegean-ai/ar4-physical-ai
VLA (Vision-Language-Action) manipulation platform for the AR4 MK3 robotic arm. Layered on the AR4 ROS driver and HuggingFace LeRobot. Key architectural characteristics:
- LeRobot-native — uses `lerobot-ros` (`AnninAR4` class) as the bridge between ROS 2 and LeRobot for recording, training, and inference
- Zenoh middleware — Zenoh router + DDS bridge decouples non-ROS components (Optuna PID tuner, future LLM inference) from the DDS discovery mesh
- Docker-first — multi-stage GPU containers (base → overlay → dev) with docker-compose
- Simulation-first — physics-enabled Gazebo Harmonic with gravity, contact, graspable objects
- VLA progression — LeRobot ACT (v1) → cross-embodiment transfer (v1.5) → Pi0/GR00T (v2)
LeRobot pipeline:
lerobot-record → Dataset v3.0 (Parquet + MP4) → lerobot-train (ACT/Diffusion/Pi0) → lerobot-evaluate
↕ lerobot-ros (ROS2Robot / AnninAR4)
↕ MoveIt Servo + ros2_control + Gazebo Harmonic (sim) / Teensy 4.1 (real)
Zenoh transport (non-ROS workloads):
Optuna PID tuner ←→ Zenoh Router :7447 ←→ zenoh-bridge-ros2dds ←→ CycloneDDS ←→ Gazebo
vLLM inference (future) ←→ Zenoh Router :7447 ←→ ROS 2 planning nodes
Deep Evidence Agent (DEA) — aegean-ai/dea
Multi-agent system for safety- and mission-critical engineering organizations. Turns scattered engineering artifacts (requirements, design docs, code, tests, standards, incident reports) into a traceable, auditable knowledge base with evidence-grounded reasoning. Key architectural characteristics:
- Multi-agent orchestration — Planner (decomposes questions), Researcher (retrieves artifacts), Critic (validates evidence), Synthesizer (generates reports)
- Evidence-grounded — every claim linked to primary sources with full provenance
- Domain-specific — built for engineering traceability, not generic AI chat
- Human-in-the-loop — engineers review, curate, and approve all outputs
- GraphRAG — Microsoft GraphRAG as git submodule for graph-based retrieval
- Separate repo — `aegean-ai/dea` (docs, graphrag submodule)
User query ("What is the impact of changing REQ-123?")
→ Planner agent: decompose into sub-questions
→ Researcher agent: retrieve artifacts via GraphRAG + data plane (lakehouse)
→ Critic agent: validate evidence chains, flag gaps
→ Synthesizer agent: produce trace matrix / impact report with citations
→ Human review + approval
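The four-agent pipeline above can be sketched as a simple composition. This is an illustrative sketch, not the dea repo's API: the agent internals (Claude calls, GraphRAG retrieval, lakehouse queries) are stubbed, and every name below is hypothetical.

```python
# Illustrative Planner -> Researcher -> Critic -> Synthesizer pipeline.
def planner(query: str) -> list:
    """Decompose the user question into sub-questions (LLM stub)."""
    return [f"{query} :: upstream deps", f"{query} :: downstream tests"]

def researcher(sub_q: str) -> list:
    """Retrieve artifacts via GraphRAG + lakehouse (retrieval stub)."""
    return [{"claim": sub_q, "source": "REQ-123", "evidence": "design.md#L10"}]

def critic(findings: list) -> list:
    """Keep only claims with a complete evidence chain; drop the rest."""
    return [f for f in findings if f.get("source") and f.get("evidence")]

def synthesizer(validated: list) -> str:
    """Produce a citation-bearing impact report for human review."""
    lines = [f"- {f['claim']} [{f['source']} -> {f['evidence']}]" for f in validated]
    return "Impact report (pending human approval):\n" + "\n".join(lines)

def answer(query: str) -> str:
    findings = [f for sq in planner(query) for f in researcher(sq)]
    return synthesizer(critic(findings))
```

The Critic sits between retrieval and synthesis so that no claim reaches the report without a traceable source, which is the evidence-grounding property the design calls for.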
counter-uas — aegean-ai/counter-uas (v2)
Counter-UAS system combining VisDrone aerial perception, Unreal Engine 5 simulation, and General Robotics GRID hardware integration. Demonstrates the platform's support for non-manipulation, non-navigation workloads — aerial object detection, tracking, and classification. Key architectural characteristics:
- UE5 simulation — photorealistic aerial scenarios for training and evaluation
- VisDrone dataset — benchmark for drone-based object detection and tracking
- GRID integration — General Robotics counter-UAS hardware platform
- Perception-first — detection/tracking models, not VLA or navigation policies
Common patterns across applications
The control plane manages ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and
experiment bookkeeping. The control plane does not control the robot in real-time — that
is ros-mcp-server's domain (turtlebot-maze) or lerobot-ros's domain (ar4-physical-ai).
All reference applications share a common Layer C abstraction: vLLM inference serving
via Zenoh queryable on torch.dev.gpu. Each application plugs in its own model backend
without platform changes.
| Concern | turtlebot-maze | ar4-physical-ai | Deep Evidence Agent | counter-uas (v2) |
|---|---|---|---|---|
| Robot framework | Nav2 + behavior trees | MoveIt2 + ros2_control | N/A | GRID platform |
| AI model | Cosmos stack (world model) | LeRobot VLA (ACT → Pi0) | Multi-agent LLM (Claude) | Detection/tracking models |
| Middleware | DDS (CycloneDDS) | Zenoh + DDS bridge | Control plane API | Zenoh + DDS bridge |
| LLM integration | Claude Code via ros-mcp-server | Future: vLLM via Zenoh | Native (claude -p subprocess) | vLLM via Zenoh |
| Sim environment | Gazebo (Nav2 worlds) | Gazebo Harmonic (tabletop) | N/A | Unreal Engine 5 |
| Data pipeline | ROS bag → lakehouse | LeRobot v3.0 → HuggingFace Hub | Artifact ingestion → lakehouse | VisDrone + UE5 → lakehouse |
Middleware and inference serving: Zenoh vs vLLM
Zenoh and vLLM operate at different layers and are complementary, not competing.
| | Zenoh | vLLM |
|---|---|---|
| Layer | Communication middleware (transport) | Compute engine (GPU inference) |
| What it does | Pub/sub/query between distributed nodes | High-throughput LLM inference with PagedAttention |
| Protocol | Zenoh protocol (tcp/udp/quic), DDS bridge | HTTP (OpenAI-compatible API) |
| Latency | Microseconds (wire protocol) | Milliseconds–seconds (model inference) |
| State | Stateless router + optional storage | Stateful (KV cache, model weights on GPU) |
How they integrate
In the auraison architecture, Zenoh serves as the transport layer between robot-side ROS 2 nodes and non-ROS compute services (like vLLM). The key mechanism is Zenoh queryables — on-demand request/response handlers that function like lightweight RPC:
Robot node (ROS 2) → CycloneDDS → zenoh-bridge-ros2dds → Zenoh Router
→ Zenoh queryable ("ai/inference/vla") → vLLM async engine
→ inference result returned through Zenoh → bridge → DDS → ROS 2 node
This is already proven in ar4-physical-ai where Zenoh decouples the Optuna PID tuner from the ROS 2 graph. The same pattern extends to LLM inference — a thin Zenoh-to-vLLM adapter registers a queryable, forwards requests to vLLM's async engine, and returns results. No HTTP overhead, location-transparent (inference can move edge ↔ cloud without client changes).
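The adapter's shape can be modeled in-process. The sketch below is not zenoh-python or vLLM code: the session and the engine are both stand-ins (the real implementation would use zenoh's declare_queryable / get and vLLM's async engine), and all names are illustrative. Only the routing pattern, register a handler under a key expression and call it by key rather than by address, is what the sketch demonstrates.

```python
# In-process model of the Zenoh-queryable -> vLLM adapter pattern.
class FakeZenohSession:
    """Stand-in for a Zenoh session: maps key expressions to queryables."""
    def __init__(self):
        self._queryables = {}

    def declare_queryable(self, key_expr, handler):
        self._queryables[key_expr] = handler

    def get(self, key_expr, payload):
        # Location-transparent: callers name a key, never an IP/port,
        # so inference can move edge <-> cloud without client changes.
        return self._queryables[key_expr](payload)

def vllm_generate(prompt: str) -> str:
    """Stub for vLLM's async engine; returns a canned plan."""
    return f"plan-for:{prompt}"

session = FakeZenohSession()
# Thin adapter: register a queryable, forward each query to vLLM.
session.declare_queryable("ai/inference/vla", lambda obs: vllm_generate(obs))

# A ROS 2-side node (reached via the DDS bridge) issues the same get():
result = session.get("ai/inference/vla", "obstacle-ahead")
```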
Note: Two ROS 2 integration paths exist:
- zenoh-plugin-ros2dds — bridge that routes DDS traffic through Zenoh (no code changes)
- rmw_zenoh — native RMW implementation replacing DDS entirely (experimental in ROS 2 Kilted Kaiju 2025)
Why Zenoh over raw HTTP for LLM serving
- Decoupled discovery — ROS 2 nodes don't need to know vLLM's IP/port; Zenoh key-expressions route by topic
- Streaming — Zenoh pub/sub naturally supports token streaming (vs HTTP SSE/WebSocket)
- Multi-consumer — multiple nodes can subscribe to the same inference result
- Unified transport — one middleware for sensor data, control commands, and LLM queries
- Edge deployment — Zenoh's low overhead supports on-robot inference routing
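The multi-consumer point can be illustrated with a toy fan-out: several subscribers on one key expression each receive the same token stream. Real code would use zenoh pub/sub; this in-process bus and its names are illustrative only.

```python
# Toy pub/sub bus modeling multi-consumer token streaming over one key.
class Bus:
    def __init__(self):
        self._subs = {}

    def subscribe(self, key, sink):
        self._subs.setdefault(key, []).append(sink)

    def publish(self, key, token):
        # Every subscriber on the key receives every token (fan-out).
        for sink in self._subs.get(key, []):
            sink.append(token)

bus = Bus()
nav_tokens, trace_tokens = [], []
bus.subscribe("ai/inference/vla/tokens", nav_tokens)    # Nav2-side consumer
bus.subscribe("ai/inference/vla/tokens", trace_tokens)  # trace collector
for tok in ["turn", "left"]:                            # streamed LLM tokens
    bus.publish("ai/inference/vla/tokens", tok)
```

With HTTP SSE each consumer would need its own connection to the server; with pub/sub the trace collector attaches without the inference client knowing.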
Industry context
The emerging industry pattern follows a dual-speed architecture (inspired by Kahneman's System 1 / System 2, exemplified by NVIDIA GR00T):
- System 2 (slow, 100ms+) — VLM/LLM for reasoning, planning, task decomposition. Runs in cloud or on powerful edge GPU (vLLM, TensorRT-LLM).
- System 1 (fast, <50ms) — lightweight action model that translates plans into continuous motor commands at real-time control rates. Runs on-robot (TensorRT Edge-LLM on Jetson).
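The two rates can be sketched deterministically, with tick counts standing in for wall-clock rates. This is a hedged illustration: the function names are hypothetical, and a real deployment would run System 2 on torch.dev.gpu (vLLM) and System 1 on ros.dev.gpu (VLA action head).

```python
# Dual-speed sketch: slow planner every N ticks, fast action head every tick.
def system2_plan(observation: str) -> str:
    """Slow VLM/LLM planner (100ms+): task decomposition (stub)."""
    return f"goto:{observation}"

def system1_act(plan: str, tick: int) -> str:
    """Fast action head (<50ms): latest plan -> motor command (stub)."""
    return f"{plan}/cmd{tick}"

def run(observations, plan_every=10):
    """Replan every `plan_every` ticks; emit a command on every tick."""
    plan, commands = None, []
    for tick, obs in enumerate(observations):
        if tick % plan_every == 0:                # System 2 fires rarely
            plan = system2_plan(obs)
        commands.append(system1_act(plan, tick))  # System 1 fires every tick
    return commands

cmds = run([f"obs{i}" for i in range(20)], plan_every=10)
```

The key property is that System 1 never blocks on System 2: between replans it keeps acting on the most recent plan, which is what keeps the control loop real-time.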
| Team | High-level planner | Low-level policy | Middleware | Inference engine |
|---|---|---|---|---|
| NVIDIA GR00T | VLM + Cosmos Reason (System 2) | Action decoder (System 1) | Custom | TensorRT-LLM / TensorRT Edge-LLM |
| Google DeepMind | RT-2 (12B–55B VLA) | End-to-end (actions as text tokens) | Custom | TPU serving |
| HuggingFace LeRobot | SmolVLA async perception | ACT / Diffusion / Pi0 action head | None (in-process) | PyTorch native |
| PickNik MoveIt Pro | LLM for behavior tree generation | MoveIt2 motion planning | DDS | HTTP to LLM API |
| Intrinsic (Alphabet) | Task planner | Skill primitives | Zenoh + ROS 2 Jazzy | — |
| Auraison (ours) | vLLM on torch.dev.gpu | LeRobot VLA / Cosmos | Zenoh + DDS bridge | vLLM via Zenoh queryable |
Key adopters of Zenoh: General Motors (uProtocol), Volvo, Intrinsic, smart-city deployments. Zenoh 1.0 benchmarks: 13μs latency, 50 Gbps throughput, superior on Wi-Fi/lossy networks vs DDS.
Recommendation for auraison
- Adopt Zenoh as the standard non-ROS transport for the reference applications
- Follow the System 2 / System 1 pattern: vLLM on `torch.dev.gpu` for high-level planning (System 2); LeRobot VLA action heads on `ros.dev.gpu` for real-time control (System 1)
- Bridge via Zenoh queryables: robot nodes issue `zenoh.get("ai/inference/vla", observation)` — a thin adapter forwards to vLLM's async engine and returns results. No direct HTTP from ROS 2 nodes.
- Do not replace one with the other — Zenoh is plumbing, vLLM is compute
- Future: evaluate `rmw_zenoh` as a full DDS replacement once it stabilizes past experimental status in ROS 2
Streaming substrate
Auraison operates three distinct streaming tiers. They differ in latency, durability, and consumer type, and are not interchangeable:
| Tier | Latency | Durability | Technology | Purpose |
|---|---|---|---|---|
| Data streaming (sensor / telemetry) | Microseconds | Ephemeral | Zenoh | Sensor frames, point clouds, IMU, DDS bridge, vLLM token streaming within user plane |
| Control event streaming | Milliseconds | At-least-once | Redis Streams (v1.5) → NATS (v2) | JobSpec dispatch, StatusEvents, agent lifecycle notifications |
| Analytics / audit streaming | Seconds | Exactly-once | Kafka (v2) | Agent traces, experiment audit log, compliance telemetry to data plane |
Stream processing platform context
Eight mainstream stream processing platforms were evaluated (Ably survey): Apache Spark, Apache Kafka Streams, Apache Flink, Spring Cloud Data Flow, Amazon Kinesis, Google Cloud Dataflow, Apache Pulsar, and IBM Streams.
Key architectural characteristics relevant to agentic workloads:
| Platform | Latency | Deployment model | Fit for Auraison |
|---|---|---|---|
| Apache Flink | Low (event-time) | Cluster | ✓ v2 — complex event processing over agent trace streams; millions of events/s; built-in ML connectors |
| Apache Kafka Streams | ~10ms | Client library | ✓ v1.5 — lightweight; integrates inside control-plane process; no cluster manager needed |
| Apache Pulsar | Low publish | Cloud-native, multi-layer | ✓ v2 — 1M+ topics, geo-aware replication; strong fit for multi-tenant management plane |
| Amazon Kinesis / Cloud Dataflow | ~seconds | Serverless | ✗ Cloud lock-in; unsuitable for on-premise Proxmox deployment |
| Apache Spark | Batch-first | Cluster | ✗ High memory; better for SDG dataset processing than real-time event streams |
Decision: Kafka Streams (v1.5) for control-plane event enrichment; Apache Flink (v2) for analytics pipeline over the agent trace audit log; Pulsar (v2) for multi-tenant topic isolation. Zenoh remains the data-plane transport — it is not an analytics platform.
Agent-native streaming pattern
The critical distinction for agentic architectures is that LLMs are stream consumers and stream producers. A streaming agent:
- Consumes an event stream (sensor readings, job status, world-model snapshots)
- Reasons over a sliding window of that stream (LLM / VLA context)
- Produces an enriched output stream (decisions, annotations, action commands)
This is the Streaming Context Protocol (SCP) pattern, an evolution of MCP. MCP provides stateful, AI-native tool access over a persistent connection; when that connection carries a continuous event stream (Zenoh pub/sub, Kafka consumer, NATS subject), MCP becomes the AI-native streaming interface: the agent subscribes to a stream, processes events with LLM context, and publishes enriched responses as a new stream.
Stream source (Zenoh / Kafka / NATS)
→ [persistent MCP connection]
→ LLM agent context window (sliding window over recent events)
→ enriched stream (decisions · annotations · action commands)
→ downstream consumer (Nav2 · data plane · control plane)
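The consume / reason-over-a-window / publish shape above can be modeled in a few lines. The "LLM" is a stub and the names are illustrative; a real streaming agent would hold a persistent MCP connection to a Zenoh, Kafka, or NATS source.

```python
# Minimal model of the agent-native streaming pattern.
from collections import deque

def enrich(window) -> str:
    """Stub LLM step: reason over the sliding window, emit a decision."""
    return f"decision-over:{','.join(window)}"

def streaming_agent(events, window_size=3):
    """Yield one enriched event per input event (the output stream)."""
    window = deque(maxlen=window_size)   # sliding LLM context window
    for event in events:
        window.append(event)
        yield enrich(list(window))       # published to downstream consumers

out = list(streaming_agent(["e1", "e2", "e3", "e4"], window_size=3))
```

Unlike a REST handler, the agent here is stateful across events: each output is conditioned on the recent history, not just the current request.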
This is distinct from REST: REST excels at discrete service calls; MCP + streaming excels at continuous, stateful, AI-enriched stream processing where the model is a first-class consumer. MCP PR #206 (serverless transport) enables this pattern without a persistent server process — the LLM subscribes as a serverless stream processor.
Streaming requirements
| ID | Requirement | Tier | Version |
|---|---|---|---|
| SYS-009 | The system shall support three streaming tiers: data (Zenoh), control events (Redis/NATS), analytics (Kafka) | All | v1–v2 |
| SYS-010 | Control-plane agents shall consume StatusEvents as streams (not polling) from v1.5 onward | Control | v1.5 |
| SYS-011 | Agent traces and experiment audit logs shall be published to an analytics stream (Kafka) for downstream processing | Data | v2 |
| SYS-012 | The system shall support the agent-native streaming pattern: LLM agent consumes event stream, enriches, and publishes output stream via persistent MCP connection | All | v2 |
| SYS-013 | Stream processing for analytics shall use Apache Flink for complex event processing over agent trace streams | Data | v2 |
| SYS-014 | Multi-tenant topic isolation for the management plane shall use Apache Pulsar | Management | v2 |
Evolution path
v1 — Control plane + user plane operational; data plane migrated to monorepo; synchronous subprocess dispatch
Reference apps: turtlebot-maze (Nav2), ar4-physical-ai (LeRobot ACT + Zenoh)
Digital Twins: persistent world model in lakehouse (TurtleBot + AR4); in-job writes + post-job reconciliation
v1.5 — AgentOps subsystem in control plane: execution scheduler, backpressure, trace collector; Redis Streams (SYS-010)
Cosmos-Reason2 (ros.dev.gpu): physical reasoning + actuation feasibility gating
Cosmos-Predict2 (torch.dev.gpu): world model inference for pre-execution trajectory simulation
Cosmos-Transfer2.5 (torch.dev.gpu): sim2real augmentation; SDG pipeline → lakehouse datasets
Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
ar4-physical-ai: cross-embodiment VLA transfer (SO-101 → AR4)
Zenoh adopted as standard non-ROS transport; vLLM on torch.dev.gpu for VLA planning
Digital Twins: Cosmos-predicted twin state snapshots; Redis hot-cache for live pose
v2 — NATS (control messages) + Kafka (audit/telemetry, SYS-011); Flink for agent trace analytics (SYS-013); Pulsar for multi-tenant topics (SYS-014)
Agent-native streaming pattern: MCP over persistent stream connections (SYS-012); Pydantic AI runtime agents; management plane; data plane RAG
Cosmos models post-trained on turtlebot-maze ROS bag recordings
ar4-physical-ai: Pi0 / GR00T N1.5 foundation VLA models via vLLM + Zenoh bridge
counter-uas (aegean-ai/counter-uas): UE5 + VisDrone perception twin; third robotics reference application
v3 — World-model-driven agent governance; VLA training pipeline over SDG lakehouse datasets; feature store