
Four-Plane Architecture

Date: 2026-02-23 Updated: 2026-03-10 Status: Approved


Overview

Auraison is structured as four planes following the SDN / telecom separation pattern. Three vertical planes (user, control, management) handle execution, orchestration, and governance respectively. The data plane sits horizontally, serving all three. The planes have fundamentally different latency, consistency, and availability requirements.

| Plane | What runs here | Latency / consistency | Failure consequence |
|---|---|---|---|
| User plane | Customer agents: VLA, Nav2, behavior trees, YOLOv8, SLAM; Cosmos-Reason2 (physical reasoning), Cosmos-Predict2 (world model), Cosmos-Transfer2.5 (sim2real) | Real-time (ms), stateful per-session | Agent stops; robot halts |
| Control plane | Job dispatch, cluster management, experiment tracking, agent lifecycle governance | Seconds, eventually consistent | Degraded visibility; user plane continues |
| Data plane | Lakehouse (DuckDB + DuckLake + MinIO), embeddings, event log | Seconds, eventually consistent | Queries fail; ingestion queued; agents lose context |
| Management plane | Billing, tenancy, quotas, user management | Minutes, strongly consistent | No new deployments; running agents unaffected |

The control plane includes an agent operations subsystem (execution scheduling, backpressure, guardrails, trace collection) that governs agent behaviour at runtime. This is implemented as control-plane/backend/agentops/ — a package within the control plane, not a separate architectural layer.
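The backpressure piece of that subsystem can be illustrated with a bounded in-flight queue: dispatch is refused once capacity is reached, so load sheds at the control-plane boundary instead of cascading upstream. A minimal stdlib sketch; the `BoundedDispatcher` name and its methods are illustrative, not the actual agentops API:

```python
import queue

class BoundedDispatcher:
    """Illustrative backpressure: refuse new jobs once the in-flight
    queue is full, instead of letting load propagate upstream."""

    def __init__(self, max_in_flight: int = 4):
        self._inflight = queue.Queue(maxsize=max_in_flight)

    def dispatch(self, job_id: str) -> bool:
        try:
            self._inflight.put_nowait(job_id)   # admit the job
            return True
        except queue.Full:
            return False                        # backpressure: caller retries later

    def complete(self, job_id: str) -> None:
        # FIFO sketch; ignores which job actually finished
        self._inflight.get_nowait()
        self._inflight.task_done()

d = BoundedDispatcher(max_in_flight=2)
assert d.dispatch("job-1") and d.dispatch("job-2")
assert not d.dispatch("job-3")   # queue full: rejected
d.complete("job-1")
assert d.dispatch("job-3")       # slot freed: admitted
```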

First principle: User plane failures must not cascade to the control plane, and control plane outages must not halt running agents.
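One concrete mechanism for the second half of this principle: if agents run as plain subprocesses (SYS-008), launching each in its own session means a control-plane crash or a signal delivered to the supervisor's process group does not tear the agent down. A stdlib sketch; the supervision details here are illustrative, not the documented spawn path:

```python
import subprocess
import sys

def spawn_agent(cmd: list[str]) -> subprocess.Popen:
    """Spawn an agent in its own session (POSIX setsid) so that a
    control-plane crash or group-wide SIGINT does not kill it."""
    return subprocess.Popen(cmd, start_new_session=True)

# Stand-in for a `claude -p ...` invocation
proc = spawn_agent([sys.executable, "-c", "print('agent running')"])
proc.wait()
```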


System-level requirements

These requirements apply across all planes. Plane-specific requirements are decomposed in each plane's design.md and trace back to these system-level IDs.

| ID | Requirement | Traces to |
|---|---|---|
| SYS-001 | The system shall be structured as four planes (user, control, data, management) following the SDN/telecom separation pattern | §Overview |
| SYS-002 | User plane failures shall not cascade to the control plane; control plane outages shall not halt running agents | §Overview, first principle |
| SYS-003 | The system shall support four reference applications: turtlebot-maze, ar4-physical-ai, Deep Evidence Agent, and counter-uas (v2) | §Reference applications |
| SYS-004 | The system shall adopt Zenoh as the standard non-ROS transport for reference applications | §Middleware and inference serving |
| SYS-005 | The system shall implement a dual-speed architecture: vLLM on torch.dev.gpu (System 2, planning) and VLA action heads on ros.dev.gpu (System 1, real-time control) | §Middleware and inference serving |
| SYS-006 | The system shall follow the v1 → v1.5 → v2 → v3 evolution path | §Evolution path |
| SYS-007 | The data plane shall sit horizontally, serving all three vertical planes via DuckDB + DuckLake + MinIO | §Overview |
| SYS-008 | Agents shall run as claude -p subprocesses, reusing the Claude Code subscription (not Anthropic API) | control-plane/design.md |
| SYS-009 | The system shall support three streaming tiers: data (Zenoh), control events (Redis/NATS), analytics (Kafka) | §Streaming substrate |
| SYS-010 | Control-plane agents shall consume StatusEvents as streams (not polling) from v1.5 onward | §Streaming substrate |
| SYS-011 | Agent traces and experiment audit logs shall be published to an analytics stream (Kafka) for downstream processing | §Streaming substrate |
| SYS-012 | The system shall support the agent-native streaming pattern: LLM agent consumes event stream, enriches, and publishes output stream via persistent MCP connection | §Streaming substrate |
| SYS-013 | Stream processing for analytics shall use Apache Flink for complex event processing over agent trace streams | §Streaming substrate |
| SYS-014 | Multi-tenant topic isolation for the management plane shall use Apache Pulsar | §Streaming substrate |

Requirements decomposition

Each plane's design document contains plane-specific requirements that trace back to these system-level requirements:

| Plane | Design doc | Requirement prefix | Count |
|---|---|---|---|
| Control plane | control-plane/design.md | CP-xxx | 30 |
| User plane | user-plane/design.md | UP-xxx | 36 |
| Data plane | data-plane/design.md | DP-xxx | 43 |
| Management plane | management-plane/design.md | MP-xxx | 40 |
| Architecture | architecture/four-plane.mdx | SYS-xxx | 14 |
| Total | | | 171 |

System context (C4 Level 1)

[Diagram: C4 Level 1 system context. Dashed blue containers group the four planes:
MANAGEMENT PLANE (v2) — Billing · Tenancy · Quotas; Observability Store.
CONTROL PLANE (control-plane/) — Next.js Dashboard; FastAPI API; AgentOps Subsystem (scheduler · backpressure); Claude Code Agents (claude -p subprocesses); Postgres · Redis.
DATA PLANE (data-plane/) — Lakehouse (DuckDB + DuckLake + MinIO); Embeddings Store (v2).
USER PLANE (user-plane/) — torch.dev.gpu: Notebook workers (papermill + W&B), Cosmos-Predict2 (world model), Cosmos-Transfer2.5 (sim2real), vLLM / LLM Serving (VLA inference · planning); ros.dev.gpu: ROS 2 workers (Nav2 · YOLOv8 · SLAM), ros-mcp-server (Claude Code → robot), Cosmos-Reason2 (physical reasoning), Zenoh Router (DDS bridge · non-ROS).
Edges: control plane spawns supervised agents; agents read context from and write telemetry to the data plane; management plane manages the other planes (v2).]

Reading the diagram. Dashed blue containers group plane / sub-plane scopes. Solid arrows are runtime data/control paths. Dashed arrows are v2 governance paths. For brevity only the critical cross-plane connectors are drawn — full edge list (Predict → Transfer → Reason loop, Zenoh ↔ vLLM, agent → AgentEvent → OBS, etc.) is described in the prose below.


Repository layout

auraison/
├── control-plane/ FastAPI API + Claude Code agent layer + AgentOps subsystem + Next.js UI
├── user-plane/ Agentic workloads: VLA, ROS 2, multi-agent (KubeRay)
├── data-plane/ Lakehouse: DuckDB + DuckLake + MinIO (migrated from aegean-ai/lakehouse)
├── management-plane/ Billing, tenancy, quotas (v2)
├── docs/
│ ├── architecture/ System-level design docs (this directory)
│ ├── plans/ Plane-specific design docs
│ └── decisions/ Cross-cutting ADRs
└── docker-compose.yml Local dev infra (Postgres + Redis)

Communication between planes

See docs/control-plane/design.md §"Communication between planes" for the current v1 contract (subprocess + webhook) and the v1.5/v2 evolution path (Redis Streams → NATS + Kafka).

A dedicated cross-plane communication design doc is tracked in beads issue auraison-eco.


Reference applications

Auraison supports the following reference applications. The robotics applications are independent GitHub repositories under the aegean-ai org, deployed onto KubeRay clusters. The Deep Evidence Agent lives in aegean-ai/dea.

turtlebot-maze — aegean-ai/turtlebot-maze

The canonical navigation application. Demonstrates Claude Code + ros-mcp-server doing real-time robot control on ros.dev.gpu, extended in v1.5 with the Cosmos model stack:

Claude Code /navigate skill
→ ros-mcp-server (MCP over rosbridge WebSocket :9090)
→ ROS 2 Nav2 action server
→ TurtleBot navigation

Predict → Transfer → Reason → Execute loop (v1.5):
Cosmos-Predict2 (torch.dev.gpu): current frame + action → synthetic trajectory
→ Cosmos-Transfer2.5 (torch.dev.gpu): synthetic → photorealistic
→ Cosmos-Reason2 (ros.dev.gpu): feasibility evaluation → go / no-go
→ Nav2 goal dispatched or behavior tree selects alternative
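The loop above can be sketched as a gating function: try candidate actions in order, dispatch the first one Reason2 approves, otherwise fall back to the behavior tree. The three model calls below are stubs standing in for Cosmos-Predict2, Cosmos-Transfer2.5, and Cosmos-Reason2; the names and payload shapes are illustrative, not the actual inference APIs:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    feasible: bool
    reason: str

def predict_trajectory(frame, action):            # Cosmos-Predict2 (stub)
    return {"frames": [frame], "action": action}

def sim2real(trajectory):                         # Cosmos-Transfer2.5 (stub)
    return {**trajectory, "photorealistic": True}

def evaluate_feasibility(trajectory) -> Verdict:  # Cosmos-Reason2 (stub)
    ok = trajectory["action"] != "reverse"
    return Verdict(feasible=ok, reason="clearance ok" if ok else "blocked")

def execute_or_replan(frame, candidate_actions):
    """Predict -> Transfer -> Reason -> Execute: dispatch the first
    action whose predicted trajectory is judged feasible."""
    for action in candidate_actions:
        traj = sim2real(predict_trajectory(frame, action))
        if evaluate_feasibility(traj).feasible:
            return ("dispatch_nav2_goal", action)
    return ("behavior_tree_fallback", None)       # no candidate passed the gate

assert execute_or_replan("frame0", ["reverse", "forward"]) == ("dispatch_nav2_goal", "forward")
```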

ar4-physical-ai — aegean-ai/ar4-physical-ai

VLA (Vision-Language-Action) manipulation platform for the AR4 MK3 robotic arm. Layered on the AR4 ROS driver and HuggingFace LeRobot. Key architectural characteristics:

  • LeRobot-native — uses lerobot-ros (AnninAR4 class) as the bridge between ROS 2 and LeRobot for recording, training, and inference
  • Zenoh middleware — Zenoh router + DDS bridge decouples non-ROS components (Optuna PID tuner, future LLM inference) from the DDS discovery mesh
  • Docker-first — multi-stage GPU containers (base → overlay → dev) with docker-compose
  • Simulation-first — physics-enabled Gazebo Harmonic with gravity, contact, graspable objects
  • VLA progression — LeRobot ACT (v1) → cross-embodiment transfer (v1.5) → Pi0/GR00T (v2)

LeRobot pipeline:
lerobot-record → Dataset v3.0 (Parquet + MP4) → lerobot-train (ACT/Diffusion/Pi0) → lerobot-evaluate
↕ lerobot-ros (ROS2Robot / AnninAR4)
↕ MoveIt Servo + ros2_control + Gazebo Harmonic (sim) / Teensy 4.1 (real)

Zenoh transport (non-ROS workloads):
Optuna PID tuner ←→ Zenoh Router :7447 ←→ zenoh-bridge-ros2dds ←→ CycloneDDS ←→ Gazebo
vLLM inference (future) ←→ Zenoh Router :7447 ←→ ROS 2 planning nodes

Deep Evidence Agent (DEA) — aegean-ai/dea

Multi-agent system for safety- and mission-critical engineering organizations. Turns scattered engineering artifacts (requirements, design docs, code, tests, standards, incident reports) into a traceable, auditable knowledge base with evidence-grounded reasoning. Key architectural characteristics:

  • Multi-agent orchestration — Planner (decomposes questions), Researcher (retrieves artifacts), Critic (validates evidence), Synthesizer (generates reports)
  • Evidence-grounded — every claim linked to primary sources with full provenance
  • Domain-specific — built for engineering traceability, not generic AI chat
  • Human-in-the-loop — engineers review, curate, and approve all outputs
  • GraphRAG — Microsoft GraphRAG as git submodule for graph-based retrieval
  • Separate repo — aegean-ai/dea (docs, graphrag submodule)

User query ("What is the impact of changing REQ-123?")
→ Planner agent: decompose into sub-questions
→ Researcher agent: retrieve artifacts via GraphRAG + data plane (lakehouse)
→ Critic agent: validate evidence chains, flag gaps
→ Synthesizer agent: produce trace matrix / impact report with citations
→ Human review + approval
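The pipeline above can be sketched as a plain orchestration function. Each agent here is a stub where DEA would invoke an LLM-backed agent over GraphRAG and the lakehouse; the artifact IDs and return shapes are illustrative:

```python
def planner(query: str) -> list[str]:
    """Decompose a query into sub-questions (stub for the Planner agent)."""
    return [f"{query}: direct dependents", f"{query}: affected tests"]

def researcher(sub_q: str) -> list[dict]:
    """Retrieve supporting artifacts (stub for GraphRAG + lakehouse retrieval)."""
    return [{"claim": sub_q, "source": "design.md#REQ-123"}]

def critic(evidence: list[dict]) -> list[dict]:
    """Keep only claims with a primary source; the real Critic also flags gaps."""
    return [e for e in evidence if e.get("source")]

def synthesizer(evidence: list[dict]) -> str:
    """Produce a cited report; output still goes to human review."""
    lines = [f"- {e['claim']} [{e['source']}]" for e in evidence]
    return "Impact report (pending human review):\n" + "\n".join(lines)

def answer(query: str) -> str:
    evidence = []
    for sub_q in planner(query):
        evidence.extend(researcher(sub_q))
    return synthesizer(critic(evidence))

report = answer("REQ-123")
assert "design.md#REQ-123" in report
```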

counter-uas — aegean-ai/counter-uas (v2)

Counter-UAS system combining VisDrone aerial perception, Unreal Engine 5 simulation, and General Robotics GRID hardware integration. Demonstrates the platform's support for non-manipulation, non-navigation workloads — aerial object detection, tracking, and classification. Key architectural characteristics:

  • UE5 simulation — photorealistic aerial scenarios for training and evaluation
  • VisDrone dataset — benchmark for drone-based object detection and tracking
  • GRID integration — General Robotics counter-UAS hardware platform
  • Perception-first — detection/tracking models, not VLA or navigation policies

Common patterns across applications

The control plane manages ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and experiment bookkeeping. The control plane does not control the robot in real-time — that is ros-mcp-server's domain (turtlebot-maze) or lerobot-ros's domain (ar4-physical-ai).

All reference applications share a common Layer C abstraction: vLLM inference serving via Zenoh queryable on torch.dev.gpu. Each application plugs in its own model backend without platform changes.

| Concern | turtlebot-maze | ar4-physical-ai | Deep Evidence Agent | counter-uas (v2) |
|---|---|---|---|---|
| Robot framework | Nav2 + behavior trees | MoveIt2 + ros2_control | N/A | GRID platform |
| AI model | Cosmos stack (world model) | LeRobot VLA (ACT → Pi0) | Multi-agent LLM (Claude) | Detection/tracking models |
| Middleware | DDS (CycloneDDS) | Zenoh + DDS bridge | Control plane API | Zenoh + DDS bridge |
| LLM integration | Claude Code via ros-mcp-server | Future: vLLM via Zenoh | Native (claude -p subprocess) | vLLM via Zenoh |
| Sim environment | Gazebo (Nav2 worlds) | Gazebo Harmonic (tabletop) | N/A | Unreal Engine 5 |
| Data pipeline | ROS bag → lakehouse | LeRobot v3.0 → HuggingFace Hub | Artifact ingestion → lakehouse | VisDrone + UE5 → lakehouse |

Middleware and inference serving: Zenoh vs vLLM

Zenoh and vLLM operate at different layers and are complementary, not competing.

| | Zenoh | vLLM |
|---|---|---|
| Layer | Communication middleware (transport) | Compute engine (GPU inference) |
| What it does | Pub/sub/query between distributed nodes | High-throughput LLM inference with PagedAttention |
| Protocol | Zenoh protocol (tcp/udp/quic), DDS bridge | HTTP (OpenAI-compatible API) |
| Latency | Microseconds (wire protocol) | Milliseconds–seconds (model inference) |
| State | Stateless router + optional storage | Stateful (KV cache, model weights on GPU) |

How they integrate

In the auraison architecture, Zenoh serves as the transport layer between robot-side ROS 2 nodes and non-ROS compute services (like vLLM). The key mechanism is Zenoh queryables — on-demand request/response handlers that function like lightweight RPC:

Robot node (ROS 2) → CycloneDDS → zenoh-bridge-ros2dds → Zenoh Router
→ Zenoh queryable ("ai/inference/vla") → vLLM async engine
→ inference result returned through Zenoh → bridge → DDS → ROS 2 node

This is already proven in ar4-physical-ai where Zenoh decouples the Optuna PID tuner from the ROS 2 graph. The same pattern extends to LLM inference — a thin Zenoh-to-vLLM adapter registers a queryable, forwards requests to vLLM's async engine, and returns results. No HTTP overhead, location-transparent (inference can move edge ↔ cloud without client changes).
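The adapter pattern can be sketched in-process. A dict-backed stand-in replaces the Zenoh session here, since zenoh-python's `declare_queryable`/`get` signatures vary across releases; the key expression and payload shapes are illustrative:

```python
import asyncio

class FakeZenohSession:
    """In-process stand-in for a Zenoh session: queryables register under
    a key expression, and get() routes a request to the matching handler.
    The real transport would be zenoh-python's declare_queryable / get."""

    def __init__(self):
        self._queryables = {}

    def declare_queryable(self, key_expr, handler):
        self._queryables[key_expr] = handler

    async def get(self, key_expr, payload):
        return await self._queryables[key_expr](payload)

async def vla_queryable(observation):
    """Thin adapter: forward the observation to the inference engine and
    return the result. The engine here is a stub echoing an action."""
    await asyncio.sleep(0)  # stands in for the async vLLM engine call
    return {"action": "pick", "observation": observation}

async def main():
    session = FakeZenohSession()
    session.declare_queryable("ai/inference/vla", vla_queryable)
    # A robot node would issue: zenoh.get("ai/inference/vla", observation)
    return await session.get("ai/inference/vla", {"frame": 0})

result = asyncio.run(main())
assert result["action"] == "pick"
```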

Note: Two ROS 2 integration paths exist:

  • zenoh-plugin-ros2dds — bridge that routes DDS traffic through Zenoh (no code changes)
  • rmw_zenoh — native RMW implementation replacing DDS entirely (experimental in ROS 2 Kilted Kaiju 2025)

Why Zenoh over raw HTTP for LLM serving

  • Decoupled discovery — ROS 2 nodes don't need to know vLLM's IP/port; Zenoh key-expressions route by topic
  • Streaming — Zenoh pub/sub naturally supports token streaming (vs HTTP SSE/WebSocket)
  • Multi-consumer — multiple nodes can subscribe to the same inference result
  • Unified transport — one middleware for sensor data, control commands, and LLM queries
  • Edge deployment — Zenoh's low overhead supports on-robot inference routing

Industry context

The emerging industry pattern follows a dual-speed architecture (inspired by Kahneman's System 1 / System 2, exemplified by NVIDIA GR00T):

  • System 2 (slow, 100ms+) — VLM/LLM for reasoning, planning, task decomposition. Runs in cloud or on powerful edge GPU (vLLM, TensorRT-LLM).
  • System 1 (fast, <50ms) — lightweight action model that translates plans into continuous motor commands at real-time control rates. Runs on-robot (TensorRT Edge-LLM on Jetson).
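The two speeds can run as decoupled loops: the fast System 1 controller acts on whatever plan is current and never blocks on the System 2 planner. A stdlib asyncio sketch with illustrative rates and state shapes:

```python
import asyncio

async def system2_planner(state):
    """Slow loop: refresh the plan at ~100ms-class planning latency."""
    while state["ticks"] < 20:
        await asyncio.sleep(0.05)
        state["plan"] = f"plan-{state['ticks']}"

async def system1_controller(state):
    """Fast loop: tick at control rate, acting on the latest plan
    without ever waiting for the planner."""
    while state["ticks"] < 20:
        await asyncio.sleep(0.005)
        state["ticks"] += 1
        state["actions"].append(state["plan"])

async def run():
    state = {"plan": "plan-initial", "ticks": 0, "actions": []}
    await asyncio.gather(system2_planner(state), system1_controller(state))
    return state

state = asyncio.run(run())
assert len(state["actions"]) == 20  # controller never stalled on the planner
```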
| Team | High-level planner | Low-level policy | Middleware | Inference engine |
|---|---|---|---|---|
| NVIDIA GR00T | VLM + Cosmos Reason (System 2) | Action decoder (System 1) | Custom | TensorRT-LLM / TensorRT Edge-LLM |
| Google DeepMind | RT-2 (12B–55B VLA) | End-to-end (actions as text tokens) | Custom | TPU serving |
| HuggingFace LeRobot | SmolVLA async perception | ACT / Diffusion / Pi0 action head | None (in-process) | PyTorch native |
| PickNik MoveIt Pro | LLM for behavior tree generation | MoveIt2 motion planning | DDS | HTTP to LLM API |
| Intrinsic (Alphabet) | Task planner | Skill primitives | Zenoh + ROS 2 Jazzy | |
| Auraison (ours) | vLLM on torch.dev.gpu | LeRobot VLA / Cosmos | Zenoh + DDS bridge | vLLM via Zenoh queryable |

Key adopters of Zenoh: General Motors (uProtocol), Volvo, Intrinsic, smart-city deployments. Zenoh 1.0 benchmarks: 13μs latency, 50 Gbps throughput, superior on Wi-Fi/lossy networks vs DDS.

Recommendation for auraison

  1. Adopt Zenoh as the standard non-ROS transport for both reference applications
  2. Follow the System 2 / System 1 pattern: vLLM on torch.dev.gpu for high-level planning (System 2); LeRobot VLA action heads on ros.dev.gpu for real-time control (System 1)
  3. Bridge via Zenoh queryables: robot nodes issue zenoh.get("ai/inference/vla", observation) — a thin adapter forwards to vLLM's async engine and returns results. No direct HTTP from ROS 2 nodes.
  4. Do not replace one with the other — Zenoh is plumbing, vLLM is compute
  5. Future: evaluate rmw_zenoh as a full DDS replacement once it stabilizes past experimental status in ROS 2

Streaming substrate

Auraison operates three distinct streaming tiers. They differ in latency, durability, and consumer type, and are not interchangeable:

| Tier | Latency | Durability | Technology | Purpose |
|---|---|---|---|---|
| Data streaming (sensor / telemetry) | Microseconds | Ephemeral | Zenoh | Sensor frames, point clouds, IMU, DDS bridge, vLLM token streaming within user plane |
| Control event streaming | Milliseconds | At-least-once | Redis Streams (v1.5) → NATS (v2) | JobSpec dispatch, StatusEvents, agent lifecycle notifications |
| Analytics / audit streaming | Seconds | Exactly-once | Kafka (v2) | Agent traces, experiment audit log, compliance telemetry to data plane |

Stream processing platform context

Eight mainstream stream processing platforms were evaluated (Ably survey): Apache Spark, Apache Kafka Streams, Apache Flink, Spring Cloud Data Flow, Amazon Kinesis, Google Cloud Dataflow, Apache Pulsar, and IBM Streams.

Key architectural characteristics relevant to agentic workloads:

| Platform | Latency | Deployment model | Fit for Auraison |
|---|---|---|---|
| Apache Flink | Low (event-time) | Cluster | ✓ v2 — complex event processing over agent trace streams; millions of events/s; built-in ML connectors |
| Apache Kafka Streams | ~10ms | Client library | ✓ v1.5 — lightweight; integrates inside control-plane process; no cluster manager needed |
| Apache Pulsar | Low publish | Cloud-native, multi-layer | ✓ v2 — 1M+ topics, geo-aware replication; strong fit for multi-tenant management plane |
| Amazon Kinesis / Cloud Dataflow | ~seconds | Serverless | ✗ Cloud lock-in; unsuitable for on-premise Proxmox deployment |
| Apache Spark | Batch-first | Cluster | ✗ High memory; better for SDG dataset processing than real-time event streams |
Decision: Kafka Streams (v1.5) for control-plane event enrichment; Apache Flink (v2) for analytics pipeline over the agent trace audit log; Pulsar (v2) for multi-tenant topic isolation. Zenoh remains the data-plane transport — it is not an analytics platform.

Agent-native streaming pattern

The critical distinction for agentic architectures is that LLMs are both stream consumers and stream producers. A streaming agent:

  1. Consumes an event stream (sensor readings, job status, world-model snapshots)
  2. Reasons over a sliding window of that stream (LLM / VLA context)
  3. Produces an enriched output stream (decisions, annotations, action commands)
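Steps 1–3 can be sketched as a generator that maintains a sliding window and emits one enriched event per input. The "reasoning" here is a stub; in the real pattern the window feeds an LLM context over a persistent MCP connection:

```python
from collections import deque

def streaming_agent(events, window_size=3):
    """Consume -> reason -> produce: keep a sliding window of recent
    events as context and yield an enriched event per input."""
    window = deque(maxlen=window_size)
    for event in events:
        window.append(event)                          # 1. consume
        summary = max(window)                         # 2. "reason" over window (stub)
        yield {"event": event, "window_max": summary} # 3. produce enriched stream

out = list(streaming_agent([1, 5, 2, 4], window_size=2))
assert [o["window_max"] for o in out] == [1, 5, 5, 4]
```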

This is the Streaming Context Protocol (SCP) pattern, an evolution of MCP. MCP already provides stateful, AI-native tool access over a persistent connection; when that connection carries a continuous event stream (Zenoh pub/sub, a Kafka consumer, a NATS subject), MCP becomes the AI-native streaming interface: the agent subscribes to a stream, processes events with LLM context, and publishes enriched responses as a new stream.

Stream source (Zenoh / Kafka / NATS)
→ [persistent MCP connection]
→ LLM agent context window (sliding window over recent events)
→ enriched stream (decisions · annotations · action commands)
→ downstream consumer (Nav2 · data plane · control plane)

This is distinct from REST: REST excels at discrete service calls; MCP + streaming excels at continuous, stateful, AI-enriched stream processing where the model is a first-class consumer. MCP PR #206 (serverless transport) enables this pattern without a persistent server process — the LLM subscribes as a serverless stream processor.

Streaming requirements

| ID | Requirement | Tier | Version |
|---|---|---|---|
| SYS-009 | The system shall support three streaming tiers: data (Zenoh), control events (Redis/NATS), analytics (Kafka) | All | v1–v2 |
| SYS-010 | Control-plane agents shall consume StatusEvents as streams (not polling) from v1.5 onward | Control | v1.5 |
| SYS-011 | Agent traces and experiment audit logs shall be published to an analytics stream (Kafka) for downstream processing | Analytics | v2 |
| SYS-012 | The system shall support the agent-native streaming pattern: LLM agent consumes event stream, enriches, and publishes output stream via persistent MCP connection | All | v2 |
| SYS-013 | Stream processing for analytics shall use Apache Flink for complex event processing over agent trace streams | Analytics | v2 |
| SYS-014 | Multi-tenant topic isolation for the management plane shall use Apache Pulsar | Management | v2 |

Evolution path

v1   — Control plane + user plane operational; data plane migrated to monorepo; synchronous subprocess dispatch
Reference apps: turtlebot-maze (Nav2), ar4-physical-ai (LeRobot ACT + Zenoh)
Digital Twins: persistent world model in lakehouse (TurtleBot + AR4); in-job writes + post-job reconciliation
v1.5 — AgentOps subsystem in control plane: execution scheduler, backpressure, trace collector; Redis Streams (SYS-010)
Cosmos-Reason2 (ros.dev.gpu): physical reasoning + actuation feasibility gating
Cosmos-Predict2 (torch.dev.gpu): world model inference for pre-execution trajectory simulation
Cosmos-Transfer2.5 (torch.dev.gpu): sim2real augmentation; SDG pipeline → lakehouse datasets
Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
ar4-physical-ai: cross-embodiment VLA transfer (SO-101 → AR4)
Zenoh adopted as standard non-ROS transport; vLLM on torch.dev.gpu for VLA planning
Digital Twins: Cosmos-predicted twin state snapshots; Redis hot-cache for live pose
v2 — NATS (control messages) + Kafka (audit/telemetry, SYS-011); Flink for agent trace analytics (SYS-013); Pulsar for multi-tenant topics (SYS-014)
Agent-native streaming pattern: MCP over persistent stream connections (SYS-012); Pydantic AI runtime agents; management plane; data plane RAG
Cosmos models post-trained on turtlebot-maze ROS bag recordings
ar4-physical-ai: Pi0 / GR00T N1.5 foundation VLA models via vLLM + Zenoh bridge
counter-uas (aegean-ai/counter-uas): UE5 + VisDrone perception twin; third reference application
v3 — World-model-driven agent governance; VLA training pipeline over SDG lakehouse datasets; feature store