Four-Plane Architecture
Date: 2026-02-23 Updated: 2026-03-10 Status: Approved
Overview
Auraison is structured as four planes following the SDN / telecom separation pattern. Three vertical planes (user, control, management) handle execution, orchestration, and governance respectively. The data plane sits horizontally, serving all three. The planes have fundamentally different latency, consistency, and availability requirements.
| Plane | What runs here | Latency / consistency | Failure consequence |
|---|---|---|---|
| User plane | Customer agents: VLA, Nav2, behavior trees, YOLOv8, SLAM; Cosmos-Reason2 (physical reasoning), Cosmos-Predict2 (world model), Cosmos-Transfer2.5 (sim2real) | Real-time (ms), stateful per-session | Agent stops; robot halts |
| Control plane | Job dispatch, cluster management, experiment tracking, agent lifecycle governance | Seconds, eventually consistent | Degraded visibility; user plane continues |
| Data plane | Lakehouse (DuckDB + DuckLake + MinIO), embeddings, event log | Seconds, eventually consistent | Queries fail; ingestion queued; agents lose context |
| Management plane | Billing, tenancy, quotas, user management | Minutes, strongly consistent | No new deployments; running agents unaffected |
The control plane includes an agent operations subsystem (execution scheduling,
backpressure, guardrails, trace collection) that governs agent behavior at runtime.
This is implemented as control-plane/backend/agentops/ — a package within the control
plane, not a separate architectural layer.
First principle: User plane failures must not cascade to the control plane, and control plane outages must not halt running agents.
System-level requirements
These requirements apply across all planes. Plane-specific requirements are decomposed
in each plane's design.md and trace back to these system-level IDs.
| ID | Requirement | Traces to |
|---|---|---|
| SYS-001 | The system shall be structured as four planes (user, control, data, management) following the SDN/telecom separation pattern | §Overview |
| SYS-002 | User plane failures shall not cascade to the control plane; control plane outages shall not halt running agents | §Overview, first principle |
| SYS-003 | The system shall support four reference applications: turtlebot-maze, ar4-physical-ai, Deep Evidence Agent, and counter-uas (v2) | §Reference applications |
| SYS-004 | The system shall adopt Zenoh as the standard non-ROS transport for reference applications | §Middleware and inference serving |
| SYS-005 | The system shall implement a dual-speed architecture: vLLM on torch.dev.gpu (System 2, planning) and VLA action heads on ros.dev.gpu (System 1, real-time control) | §Middleware and inference serving |
| SYS-006 | The system shall follow the v1 → v1.5 → v2 → v3 evolution path | §Evolution path |
| SYS-007 | The data plane shall sit horizontally, serving all three vertical planes via DuckDB + DuckLake + MinIO | §Overview |
| SYS-008 | Agents shall run as claude -p subprocesses, reusing the Claude Code subscription (not Anthropic API) | control-plane/design.md |
| SYS-009 | The system shall support three streaming tiers: data (Zenoh), control events (Redis/NATS), analytics (Kafka) | §Streaming substrate |
| SYS-010 | Control-plane agents shall consume StatusEvents as streams (not polling) from v1.5 onward | §Streaming substrate |
| SYS-011 | Agent traces and experiment audit logs shall be published to an analytics stream (Kafka) for downstream processing | §Streaming substrate |
| SYS-012 | The system shall support the agent-native streaming pattern: LLM agent consumes event stream, enriches, and publishes output stream via persistent MCP connection | §Streaming substrate |
| SYS-013 | Stream processing for analytics shall use Apache Flink for complex event processing over agent trace streams | §Streaming substrate |
| SYS-014 | Multi-tenant topic isolation for the management plane shall use Apache Pulsar | §Streaming substrate |
Requirements decomposition
Each plane's design document contains plane-specific requirements that trace back to these system-level requirements:
| Plane | Design doc | Requirement prefix | Count |
|---|---|---|---|
| Control plane | control-plane/design.md | CP-xxx | 30 |
| User plane | user-plane/design.md | UP-xxx | 36 |
| Data plane | data-plane/design.md | DP-xxx | 43 |
| Management plane | management-plane/design.md | MP-xxx | 40 |
| Architecture | architecture/four-plane.mdx | SYS-xxx | 14 |
| Total | | | 163 |
System context (C4 Level 1)
Reading the diagram. Dashed blue containers group plane / sub-plane scopes. Solid arrows are runtime data/control paths. Dashed arrows are v2 governance paths. For brevity only the critical cross-plane connectors are drawn — full edge list (Predict → Transfer → Reason loop, Zenoh ↔ vLLM, agent → AgentEvent → OBS, etc.) is described in the prose below.
Repository layout
auraison/
├── control-plane/ FastAPI API + Claude Code agent layer + AgentOps subsystem + Next.js UI
├── user-plane/ Agentic workloads: VLA, ROS 2, multi-agent (KubeRay)
├── data-plane/ Lakehouse: DuckDB + DuckLake + MinIO (migrated from aegean-ai/lakehouse)
├── management-plane/ Billing, tenancy, quotas (v2)
├── docs/
│ ├── architecture/ System-level design docs (this directory)
│ ├── plans/ Plane-specific design docs
│ └── decisions/ Cross-cutting ADRs
└── docker-compose.yml Local dev infra (Postgres + Redis)
Communication between planes
See docs/control-plane/design.md §"Communication between planes"
for the current v1 contract (subprocess + webhook) and the v1.5/v2 evolution path
(Redis Streams → NATS + Kafka).
A dedicated cross-plane communication design doc is tracked in beads issue auraison-eco.
Reference applications
Auraison supports the following reference applications. The robotics applications are
independent GitHub repositories under the aegean-ai org, deployed onto KubeRay clusters.
The Deep Evidence Agent lives in aegean-ai/dea.
turtlebot-maze — aegean-ai/turtlebot-maze
The canonical navigation application. Demonstrates Claude Code + ros-mcp-server doing
real-time robot control on ros.dev.gpu, extended in v1.5 with the Cosmos model stack:
Claude Code /navigate skill
→ ros-mcp-server (MCP over rosbridge WebSocket :9090)
→ ROS 2 Nav2 action server
→ TurtleBot navigation
Predict → Transfer → Reason → Execute loop (v1.5):
Cosmos-Predict2 (torch.dev.gpu): current frame + action → synthetic trajectory
→ Cosmos-Transfer2.5 (torch.dev.gpu): synthetic → photorealistic
→ Cosmos-Reason2 (ros.dev.gpu): feasibility evaluation → go / no-go
→ Nav2 goal dispatched or behavior tree selects alternative
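The gating logic of this loop can be sketched as follows. This is an illustrative sketch only: the model calls are stubs, and the function names are hypothetical; real inference would run Cosmos-Predict2 and Cosmos-Transfer2.5 on torch.dev.gpu and Cosmos-Reason2 on ros.dev.gpu.

```python
# Hypothetical sketch of the v1.5 Predict -> Transfer -> Reason -> Execute
# gate. All model calls are stubbed; only the control flow is real.
from dataclasses import dataclass

@dataclass
class Trajectory:
    frames: list  # predicted (later photorealistic) frames

def predict(frame, action) -> Trajectory:
    """Cosmos-Predict2 stub: roll the world model forward for one action."""
    return Trajectory(frames=[frame, f"{frame}+{action}"])

def transfer(traj: Trajectory) -> Trajectory:
    """Cosmos-Transfer2.5 stub: sim2real-augment the synthetic frames."""
    traj.frames = [f"photoreal({f})" for f in traj.frames]
    return traj

def reason(traj: Trajectory) -> bool:
    """Cosmos-Reason2 stub: feasibility go / no-go over the trajectory."""
    return "collision" not in traj.frames[-1]

def plan_step(frame, candidate_actions):
    """Return the first candidate whose predicted trajectory passes the
    feasibility gate, or None (the behavior tree then picks a fallback)."""
    for action in candidate_actions:
        traj = transfer(predict(frame, action))
        if reason(traj):
            return action  # dispatched as a Nav2 goal
    return None
```

The essential property is that the gate is evaluated before any goal reaches Nav2, so an infeasible plan costs only inference time, not robot motion.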
ar4-physical-ai — aegean-ai/ar4-physical-ai
VLA (Vision-Language-Action) manipulation platform for the AR4 MK3 robotic arm. Layered on the AR4 ROS driver and HuggingFace LeRobot. Key architectural characteristics:
- LeRobot-native — uses `lerobot-ros` (`AnninAR4` class) as the bridge between ROS 2 and LeRobot for recording, training, and inference
- Zenoh middleware — Zenoh router + DDS bridge decouples non-ROS components (Optuna PID tuner, future LLM inference) from the DDS discovery mesh
- Docker-first — multi-stage GPU containers (base → overlay → dev) with docker-compose
- Simulation-first — physics-enabled Gazebo Harmonic with gravity, contact, graspable objects
- VLA progression — LeRobot ACT (v1) → cross-embodiment transfer (v1.5) → Pi0/GR00T (v2)
LeRobot pipeline:
lerobot-record → Dataset v3.0 (Parquet + MP4) → lerobot-train (ACT/Diffusion/Pi0) → lerobot-evaluate
↕ lerobot-ros (ROS2Robot / AnninAR4)
↕ MoveIt Servo + ros2_control + Gazebo Harmonic (sim) / Teensy 4.1 (real)
Zenoh transport (non-ROS workloads):
Optuna PID tuner ←→ Zenoh Router :7447 ←→ zenoh-bridge-ros2dds ←→ CycloneDDS ←→ Gazebo
vLLM inference (future) ←→ Zenoh Router :7447 ←→ ROS 2 planning nodes
Deep Evidence Agent (DEA) — aegean-ai/dea
Multi-agent system for safety- and mission-critical engineering organizations. Turns scattered engineering artifacts (requirements, design docs, code, tests, standards, incident reports) into a traceable, auditable knowledge base with evidence-grounded reasoning. Key architectural characteristics:
- Multi-agent orchestration — Planner (decomposes questions), Researcher (retrieves artifacts), Critic (validates evidence), Synthesizer (generates reports)
- Evidence-grounded — every claim linked to primary sources with full provenance
- Domain-specific — built for engineering traceability, not generic AI chat
- Human-in-the-loop — engineers review, curate, and approve all outputs
- GraphRAG — Microsoft GraphRAG as git submodule for graph-based retrieval
- Separate repo — `aegean-ai/dea` (docs, graphrag submodule)
User query ("What is the impact of changing REQ-123?")
→ Planner agent: decompose into sub-questions
→ Researcher agent: retrieve artifacts via GraphRAG + data plane (lakehouse)
→ Critic agent: validate evidence chains, flag gaps
→ Synthesizer agent: produce trace matrix / impact report with citations
→ Human review + approval
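The four-agent pipeline above can be sketched as a simple composition. This is an illustrative sketch, not the dea repo's API: the agent internals (Claude calls, GraphRAG retrieval, lakehouse queries) are stubbed, and every name below is hypothetical.

```python
# Illustrative Planner -> Researcher -> Critic -> Synthesizer pipeline.
def planner(query: str) -> list:
    """Decompose the user question into sub-questions (LLM stub)."""
    return [f"{query} :: upstream deps", f"{query} :: downstream tests"]

def researcher(sub_q: str) -> list:
    """Retrieve artifacts via GraphRAG + lakehouse (retrieval stub)."""
    return [{"claim": sub_q, "source": "REQ-123", "evidence": "design.md#L10"}]

def critic(findings: list) -> list:
    """Keep only claims with a complete evidence chain; drop the rest."""
    return [f for f in findings if f.get("source") and f.get("evidence")]

def synthesizer(validated: list) -> str:
    """Produce a citation-bearing impact report for human review."""
    lines = [f"- {f['claim']} [{f['source']} -> {f['evidence']}]" for f in validated]
    return "Impact report (pending human approval):\n" + "\n".join(lines)

def answer(query: str) -> str:
    findings = [f for sq in planner(query) for f in researcher(sq)]
    return synthesizer(critic(findings))
```

The Critic sits between retrieval and synthesis so that no claim reaches the report without a traceable source, which is the evidence-grounding property the design calls for.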
counter-uas — aegean-ai/counter-uas (v2)
Counter-UAS system combining VisDrone aerial perception, Unreal Engine 5 simulation, and General Robotics GRID hardware integration. Demonstrates the platform's support for non-manipulation, non-navigation workloads — aerial object detection, tracking, and classification. Key architectural characteristics:
- UE5 simulation — photorealistic aerial scenarios for training and evaluation
- VisDrone dataset — benchmark for drone-based object detection and tracking
- GRID integration — General Robotics counter-UAS hardware platform
- Perception-first — detection/tracking models, not VLA or navigation policies
Common patterns across applications
The control plane manages ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and
experiment bookkeeping. The control plane does not control the robot in real-time — that
is ros-mcp-server's domain (turtlebot-maze) or lerobot-ros's domain (ar4-physical-ai).
All reference applications share a common Layer C abstraction: vLLM inference serving
via Zenoh queryable on torch.dev.gpu. Each application plugs in its own model backend
without platform changes.
| Concern | turtlebot-maze | ar4-physical-ai | Deep Evidence Agent | counter-uas (v2) |
|---|---|---|---|---|
| Robot framework | Nav2 + behavior trees | MoveIt2 + ros2_control | N/A | GRID platform |
| AI model | Cosmos stack (world model) | LeRobot VLA (ACT → Pi0) | Multi-agent LLM (Claude) | Detection/tracking models |
| Middleware | DDS (CycloneDDS) | Zenoh + DDS bridge | Control plane API | Zenoh + DDS bridge |
| LLM integration | Claude Code via ros-mcp-server | Future: vLLM via Zenoh | Native (claude -p subprocess) | vLLM via Zenoh |
| Sim environment | Gazebo (Nav2 worlds) | Gazebo Harmonic (tabletop) | N/A | Unreal Engine 5 |
| Data pipeline | ROS bag → lakehouse | LeRobot v3.0 → HuggingFace Hub | Artifact ingestion → lakehouse | VisDrone + UE5 → lakehouse |
Middleware and inference serving: Zenoh vs vLLM
Zenoh and vLLM operate at different layers and are complementary, not competing.
| | Zenoh | vLLM |
|---|---|---|
| Layer | Communication middleware (transport) | Compute engine (GPU inference) |
| What it does | Pub/sub/query between distributed nodes | High-throughput LLM inference with PagedAttention |
| Protocol | Zenoh protocol (tcp/udp/quic), DDS bridge | HTTP (OpenAI-compatible API) |
| Latency | Microseconds (wire protocol) | Milliseconds–seconds (model inference) |
| State | Stateless router + optional storage | Stateful (KV cache, model weights on GPU) |
How they integrate
In the auraison architecture, Zenoh serves as the transport layer between robot-side ROS 2 nodes and non-ROS compute services (like vLLM). The key mechanism is Zenoh queryables — on-demand request/response handlers that function like lightweight RPC:
Robot node (ROS 2) → CycloneDDS → zenoh-bridge-ros2dds → Zenoh Router
→ Zenoh queryable ("ai/inference/vla") → vLLM async engine
→ inference result returned through Zenoh → bridge → DDS → ROS 2 node
This is already proven in ar4-physical-ai where Zenoh decouples the Optuna PID tuner from the ROS 2 graph. The same pattern extends to LLM inference — a thin Zenoh-to-vLLM adapter registers a queryable, forwards requests to vLLM's async engine, and returns results. No HTTP overhead, location-transparent (inference can move edge ↔ cloud without client changes).
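The adapter's shape can be modeled in-process. The sketch below is not zenoh-python or vLLM code: the session and the engine are both stand-ins (the real implementation would use zenoh's declare_queryable / get and vLLM's async engine), and all names are illustrative. Only the routing pattern, register a handler under a key expression and call it by key rather than by address, is what the sketch demonstrates.

```python
# In-process model of the Zenoh-queryable -> vLLM adapter pattern.
class FakeZenohSession:
    """Stand-in for a Zenoh session: maps key expressions to queryables."""
    def __init__(self):
        self._queryables = {}

    def declare_queryable(self, key_expr, handler):
        self._queryables[key_expr] = handler

    def get(self, key_expr, payload):
        # Location-transparent: callers name a key, never an IP/port,
        # so inference can move edge <-> cloud without client changes.
        return self._queryables[key_expr](payload)

def vllm_generate(prompt: str) -> str:
    """Stub for vLLM's async engine; returns a canned plan."""
    return f"plan-for:{prompt}"

session = FakeZenohSession()
# Thin adapter: register a queryable, forward each query to vLLM.
session.declare_queryable("ai/inference/vla", lambda obs: vllm_generate(obs))

# A ROS 2-side node (reached via the DDS bridge) issues the same get():
result = session.get("ai/inference/vla", "obstacle-ahead")
```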
Note: Two ROS 2 integration paths exist:
- zenoh-plugin-ros2dds — bridge that routes DDS traffic through Zenoh (no code changes)
- rmw_zenoh — native RMW implementation replacing DDS entirely (experimental in ROS 2 Kilted Kaiju 2025)
Why Zenoh over raw HTTP for LLM serving
- Decoupled discovery — ROS 2 nodes don't need to know vLLM's IP/port; Zenoh key-expressions route by topic
- Streaming — Zenoh pub/sub naturally supports token streaming (vs HTTP SSE/WebSocket)
- Multi-consumer — multiple nodes can subscribe to the same inference result
- Unified transport — one middleware for sensor data, control commands, and LLM queries
- Edge deployment — Zenoh's low overhead supports on-robot inference routing
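The multi-consumer point can be illustrated with a toy fan-out: several subscribers on one key expression each receive the same token stream. Real code would use zenoh pub/sub; this in-process bus and its names are illustrative only.

```python
# Toy pub/sub bus modeling multi-consumer token streaming over one key.
class Bus:
    def __init__(self):
        self._subs = {}

    def subscribe(self, key, sink):
        self._subs.setdefault(key, []).append(sink)

    def publish(self, key, token):
        # Every subscriber on the key receives every token (fan-out).
        for sink in self._subs.get(key, []):
            sink.append(token)

bus = Bus()
nav_tokens, trace_tokens = [], []
bus.subscribe("ai/inference/vla/tokens", nav_tokens)    # Nav2-side consumer
bus.subscribe("ai/inference/vla/tokens", trace_tokens)  # trace collector
for tok in ["turn", "left"]:                            # streamed LLM tokens
    bus.publish("ai/inference/vla/tokens", tok)
```

With HTTP SSE each consumer would need its own connection to the server; with pub/sub the trace collector attaches without the inference client knowing.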
Industry context
The emerging industry pattern follows a dual-speed architecture (inspired by Kahneman's System 1 / System 2, exemplified by NVIDIA GR00T):
- System 2 (slow, 100ms+) — VLM/LLM for reasoning, planning, task decomposition. Runs in cloud or on powerful edge GPU (vLLM, TensorRT-LLM).
- System 1 (fast, <50ms) — lightweight action model that translates plans into continuous motor commands at real-time control rates. Runs on-robot (TensorRT Edge-LLM on Jetson).
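The two rates can be sketched deterministically, with tick counts standing in for wall-clock rates. This is a hedged illustration: the function names are hypothetical, and a real deployment would run System 2 on torch.dev.gpu (vLLM) and System 1 on ros.dev.gpu (VLA action head).

```python
# Dual-speed sketch: slow planner every N ticks, fast action head every tick.
def system2_plan(observation: str) -> str:
    """Slow VLM/LLM planner (100ms+): task decomposition (stub)."""
    return f"goto:{observation}"

def system1_act(plan: str, tick: int) -> str:
    """Fast action head (<50ms): latest plan -> motor command (stub)."""
    return f"{plan}/cmd{tick}"

def run(observations, plan_every=10):
    """Replan every `plan_every` ticks; emit a command on every tick."""
    plan, commands = None, []
    for tick, obs in enumerate(observations):
        if tick % plan_every == 0:                # System 2 fires rarely
            plan = system2_plan(obs)
        commands.append(system1_act(plan, tick))  # System 1 fires every tick
    return commands

cmds = run([f"obs{i}" for i in range(20)], plan_every=10)
```

The key property is that System 1 never blocks on System 2: between replans it keeps acting on the most recent plan, which is what keeps the control loop real-time.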
| Team | High-level planner | Low-level policy | Middleware | Inference engine |
|---|---|---|---|---|
| NVIDIA GR00T | VLM + Cosmos Reason (System 2) | Action decoder (System 1) | Custom | TensorRT-LLM / TensorRT Edge-LLM |
| Google DeepMind | RT-2 (12B–55B VLA) | End-to-end (actions as text tokens) | Custom | TPU serving |
| HuggingFace LeRobot | SmolVLA async perception | ACT / Diffusion / Pi0 action head | None (in-process) | PyTorch native |
| PickNik MoveIt Pro | LLM for behavior tree generation | MoveIt2 motion planning | DDS | HTTP to LLM API |
| Intrinsic (Alphabet) | Task planner | Skill primitives | Zenoh + ROS 2 Jazzy | — |
| Auraison (ours) | vLLM on torch.dev.gpu | LeRobot VLA / Cosmos | Zenoh + DDS bridge | vLLM via Zenoh queryable |
Key adopters of Zenoh: General Motors (uProtocol), Volvo, Intrinsic, smart-city deployments. Zenoh 1.0 benchmarks: 13μs latency, 50 Gbps throughput, superior on Wi-Fi/lossy networks vs DDS.
Recommendation for auraison
- Adopt Zenoh as the standard non-ROS transport for the reference applications
- Follow the System 2 / System 1 pattern: vLLM on `torch.dev.gpu` for high-level planning (System 2); LeRobot VLA action heads on `ros.dev.gpu` for real-time control (System 1)
- Bridge via Zenoh queryables: robot nodes issue `zenoh.get("ai/inference/vla", observation)` — a thin adapter forwards to vLLM's async engine and returns results. No direct HTTP from ROS 2 nodes.
- Do not replace one with the other — Zenoh is plumbing, vLLM is compute
- Future: evaluate `rmw_zenoh` as a full DDS replacement once it stabilizes past experimental status in ROS 2
Streaming substrate
Auraison operates three distinct streaming tiers. They differ in latency, durability, and consumer type, and are not interchangeable:
| Tier | Latency | Durability | Technology | Purpose |
|---|---|---|---|---|
| Data streaming (sensor / telemetry) | Microseconds | Ephemeral | Zenoh | Sensor frames, point clouds, IMU, DDS bridge, vLLM token streaming within user plane |
| Control event streaming | Milliseconds | At-least-once | Redis Streams (v1.5) → NATS (v2) | JobSpec dispatch, StatusEvents, agent lifecycle notifications |
| Analytics / audit streaming | Seconds | Exactly-once | Kafka (v2) | Agent traces, experiment audit log, compliance telemetry to data plane |
Stream processing platform context
Eight mainstream stream processing platforms were evaluated (Ably survey): Apache Spark, Apache Kafka Streams, Apache Flink, Spring Cloud Data Flow, Amazon Kinesis, Google Cloud Dataflow, Apache Pulsar, and IBM Streams.
Key architectural characteristics relevant to agentic workloads:
| Platform | Latency | Deployment model | Fit for Auraison |
|---|---|---|---|
| Apache Flink | Low (event-time) | Cluster | ✓ v2 — complex event processing over agent trace streams; millions of events/s; built-in ML connectors |
| Apache Kafka Streams | ~10ms | Client library | ✓ v1.5 — lightweight; integrates inside control-plane process; no cluster manager needed |
| Apache Pulsar | Low publish | Cloud-native, multi-layer | ✓ v2 — 1M+ topics, geo-aware replication; strong fit for multi-tenant management plane |
| Amazon Kinesis / Cloud Dataflow | ~seconds | Serverless | ✗ Cloud lock-in; unsuitable for on-premise Proxmox deployment |
| Apache Spark | Batch-first | Cluster | ✗ High memory; better for SDG dataset processing than real-time event streams |
Decision: Kafka Streams (v1.5) for control-plane event enrichment; Apache Flink (v2) for analytics pipeline over the agent trace audit log; Pulsar (v2) for multi-tenant topic isolation. Zenoh remains the data-plane transport — it is not an analytics platform.
Agent-native streaming pattern
The critical distinction for agentic architectures is that LLMs are stream consumers and stream producers. A streaming agent:
- Consumes an event stream (sensor readings, job status, world-model snapshots)
- Reasons over a sliding window of that stream (LLM / VLA context)
- Produces an enriched output stream (decisions, annotations, action commands)
This is the Streaming Context Protocol (SCP) pattern, an evolution of MCP. MCP provides stateful, AI-native tool access over a persistent connection; when that connection carries a continuous event stream (Zenoh pub/sub, Kafka consumer, NATS subject), MCP becomes the AI-native streaming interface: the agent subscribes to a stream, processes events with LLM context, and publishes enriched responses as a new stream.
Stream source (Zenoh / Kafka / NATS)
→ [persistent MCP connection]
→ LLM agent context window (sliding window over recent events)
→ enriched stream (decisions · annotations · action commands)
→ downstream consumer (Nav2 · data plane · control plane)
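The consume / reason-over-a-window / publish shape above can be modeled in a few lines. The "LLM" is a stub and the names are illustrative; a real streaming agent would hold a persistent MCP connection to a Zenoh, Kafka, or NATS source.

```python
# Minimal model of the agent-native streaming pattern.
from collections import deque

def enrich(window) -> str:
    """Stub LLM step: reason over the sliding window, emit a decision."""
    return f"decision-over:{','.join(window)}"

def streaming_agent(events, window_size=3):
    """Yield one enriched event per input event (the output stream)."""
    window = deque(maxlen=window_size)   # sliding LLM context window
    for event in events:
        window.append(event)
        yield enrich(list(window))       # published to downstream consumers

out = list(streaming_agent(["e1", "e2", "e3", "e4"], window_size=3))
```

Unlike a REST handler, the agent here is stateful across events: each output is conditioned on the recent history, not just the current request.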
This is distinct from REST: REST excels at discrete service calls; MCP + streaming excels at continuous, stateful, AI-enriched stream processing where the model is a first-class consumer. MCP PR #206 (serverless transport) enables this pattern without a persistent server process — the LLM subscribes as a serverless stream processor.
Streaming requirements
| ID | Requirement | Tier | Version |
|---|---|---|---|
| SYS-009 | The system shall support three streaming tiers: data (Zenoh), control events (Redis/NATS), analytics (Kafka) | All | v1–v2 |
| SYS-010 | Control-plane agents shall consume StatusEvents as streams (not polling) from v1.5 onward | Control | v1.5 |
| SYS-011 | Agent traces and experiment audit logs shall be published to an analytics stream (Kafka) for downstream processing | Data | v2 |
| SYS-012 | The system shall support the agent-native streaming pattern: LLM agent consumes event stream, enriches, and publishes output stream via persistent MCP connection | All | v2 |
| SYS-013 | Stream processing for analytics shall use Apache Flink for complex event processing over agent trace streams | Data | v2 |
| SYS-014 | Multi-tenant topic isolation for the management plane shall use Apache Pulsar | Management | v2 |
Evolution path
v1 — Control plane + user plane operational; data plane migrated to monorepo; synchronous subprocess dispatch
Reference apps: turtlebot-maze (Nav2), ar4-physical-ai (LeRobot ACT + Zenoh)
Digital Twins: persistent world model in lakehouse (TurtleBot + AR4); in-job writes + post-job reconciliation
v1.5 — AgentOps subsystem in control plane: execution scheduler, backpressure, trace collector; Redis Streams (SYS-010)
Cosmos-Reason2 (ros.dev.gpu): physical reasoning + actuation feasibility gating
Cosmos-Predict2 (torch.dev.gpu): world model inference for pre-execution trajectory simulation
Cosmos-Transfer2.5 (torch.dev.gpu): sim2real augmentation; SDG pipeline → lakehouse datasets
Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
ar4-physical-ai: cross-embodiment VLA transfer (SO-101 → AR4)
Zenoh adopted as standard non-ROS transport; vLLM on torch.dev.gpu for VLA planning
Digital Twins: Cosmos-predicted twin state snapshots; Redis hot-cache for live pose
v2 — NATS (control messages) + Kafka (audit/telemetry, SYS-011); Flink for agent trace analytics (SYS-013); Pulsar for multi-tenant topics (SYS-014)
Agent-native streaming pattern: MCP over persistent stream connections (SYS-012); Pydantic AI runtime agents; management plane; data plane RAG
Cosmos models post-trained on turtlebot-maze ROS bag recordings
ar4-physical-ai: Pi0 / GR00T N1.5 foundation VLA models via vLLM + Zenoh bridge
counter-uas (aegean-ai/counter-uas): UE5 + VisDrone perception twin; third robotics reference application
v3 — World-model-driven agent governance; VLA training pipeline over SDG lakehouse datasets; feature store