Auraison — User Plane Design
Date: 2026-02-23 Updated: 2026-03-02 Status: Approved (v1)
Problem
The user plane is where customer agentic workloads execute: VLA agents, behavior trees, real-time robot control, notebook-based ML training, SLAM, object detection. These workloads have fundamentally different requirements from the control plane — they are real-time, stateful per-session, and hardware-bound. They must continue running even when the control plane is degraded or unreachable.
The user plane is the execution mesh. It does not reason about what to run; it runs what it is told, as fast as the hardware allows.
Goals
- Provide a multi-environment execution mesh for heterogeneous agentic workloads
- Hardware abstraction: workloads declare resource requirements; the plane satisfies them
- Isolation between workloads: a failing Nav2 job must not affect a running notebook job
- Accept job specifications from the control plane; emit status events back
- Support the canonical turtlebot-maze reference application end-to-end
- Remain operational during control plane outages
Non-goals (v1)
- Reasoning, planning, or orchestration — that is the control plane
- Billing and quota enforcement — that is the management plane
- Full LLM reasoning loops within user plane agents — constrained tool use only (v1)
Architecture
Environments
The user plane is structured as two named KubeRay environments on Proxmox K8s, each with distinct hardware profiles and workload classes:
| Environment | Hardware | Workload class |
|---|---|---|
| torch.dev.gpu | GPU nodes, CUDA, PyTorch | Notebook execution, VLA training, ML inference, Cosmos-Predict2 world model inference, Cosmos-Transfer2.5 sim2real augmentation |
| ros.dev.gpu | GPU nodes, ROS 2 Jazzy | Robot simulation, Nav2, YOLOv8, SLAM, Cosmos-Reason2 physical reasoning |
Each environment is a separate RayCluster CR. Workloads are Ray Jobs submitted by the control plane and executed by Ray workers in the appropriate cluster.
System context (C4 Level 2)
Reference application: turtlebot-maze
turtlebot-maze is the canonical user-plane application. It demonstrates all user-plane
capabilities in a single end-to-end scenario:
Claude Code /navigate skill (user plane — real-time)
→ ros-mcp-server
MCP tool call: publish_cmd_vel, get_odom, set_nav_goal
→ rosbridge WebSocket :9090
→ ROS 2 Nav2 action server
→ TurtleBot base controller
→ Gazebo simulation (or physical robot)
Supporting subsystems:
On ros.dev.gpu Ray workers:
- Behavior trees (py_trees / BehaviorTree.CPP): autonomous navigation + search sequences
- YOLOv8: object detection via PyTorch, decoupled from ROS via Zenoh transport
- stella_vslam: visual SLAM for mapping and localization
- Nav2: path planning and collision avoidance
- Cosmos-Reason2 (auraison-eh1): physical AI reasoning — evaluates action feasibility using spatial/physics common sense before Nav2 goal dispatch
On torch.dev.gpu Ray workers:
- Cosmos-Predict2 (auraison-oys): world model inference — given current camera frame + planned action, generates predicted future video frames
- Cosmos-Transfer2.5 (auraison-i6l): sim2real augmentation — translates Gazebo/Predict2 synthetic video to photorealistic video; conditioned on depth, edge, and segmentation control maps extracted from Gazebo; documented +68.5% mission success rate improvement on navigation tasks
Predict → Transfer → Reason → Execute loop (v1.5):
Cosmos-Predict2 (torch.dev.gpu)
current frame + proposed action → synthetic trajectory video
→ Cosmos-Transfer2.5 (torch.dev.gpu)
synthetic → photorealistic (depth + edge control maps from Gazebo)
→ Cosmos-Reason2 (ros.dev.gpu)
feasibility evaluation (physics / obstacle / reachability)
→ go: Nav2 goal dispatched
→ no-go: action rejected, behavior tree selects alternative
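The go/no-go decision at the Reason step can be sketched as a simple gate. The verdict fields and the confidence threshold below are illustrative assumptions, not the actual Cosmos-Reason2 output schema:

```python
from dataclasses import dataclass

# Hypothetical feasibility verdict distilled from Cosmos-Reason2 output
# (field names are illustrative, not the model's real response schema).
@dataclass
class FeasibilityVerdict:
    physics_ok: bool      # no predicted dynamics violation
    obstacle_free: bool   # no collision along the predicted trajectory
    reachable: bool       # goal pose reachable from current pose
    confidence: float     # model self-reported confidence, 0..1

def gate_nav_goal(verdict: FeasibilityVerdict, min_confidence: float = 0.8) -> str:
    """Return 'go' to dispatch the Nav2 goal, or 'no-go' so the
    behavior tree selects an alternative action."""
    if (verdict.physics_ok and verdict.obstacle_free
            and verdict.reachable and verdict.confidence >= min_confidence):
        return "go"
    return "no-go"
```

A single failed criterion rejects the action; the behavior tree, not the gate, decides what to try next.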
Synthetic data generation (SDG) pipeline:
Gazebo rollouts (ros.dev.gpu) → Cosmos-Predict2 → Cosmos-Transfer2.5
→ augmented video + action labels → Parquet dataset in data-plane lakehouse
→ VLA / Nav2 policy fine-tuning jobs on torch.dev.gpu
The control plane manages the ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and
experiment bookkeeping. It does not participate in the real-time control loop — that loop
runs entirely within the user plane.
Interfaces
Control → User plane (job submission)
The control plane emits a JobSpec to the user plane. In v1, this is a direct ray job submit
CLI invocation by the NotebookAgent subprocess. In v2, the control plane writes to a NATS subject
and a user-plane executor subscribes and submits.
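The v1 CLI path can be sketched as a JobSpec-to-argv translation. The resource flags (--entrypoint-num-gpus, --entrypoint-num-cpus) are assumed from the Ray 2.x job CLI, and the JobSpec dict shape mirrors the schema below; neither is a confirmed NotebookAgent implementation:

```python
def build_submit_argv(spec: dict, ray_address: str) -> list[str]:
    """Translate a JobSpec dict into the `ray job submit` argv the
    NotebookAgent subprocess would execute on the v1 path."""
    argv = [
        "ray", "job", "submit",
        "--address", ray_address,
        "--submission-id", spec["job_id"],
        "--no-wait",  # fire-and-forget; v1 status arrives via polling
    ]
    res = spec.get("resources", {})
    if res.get("num_gpus"):
        argv += ["--entrypoint-num-gpus", str(res["num_gpus"])]
    if res.get("num_cpus"):
        argv += ["--entrypoint-num-cpus", str(res["num_cpus"])]
    # Everything after "--" is the entrypoint command run on the worker.
    argv += ["--", "python", spec["entrypoint"]]
    return argv
```

In v2 the same translation moves into the pure executor worker, with the spec arriving over NATS instead of being built in-process.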
JobSpec {
job_id: UUID
environment: "torch.dev.gpu" | "ros.dev.gpu"
entrypoint: path to notebook or ROS launch file
resources: {num_gpus: int, num_cpus: int, memory_gb: float}
parameters: dict (papermill parameters or ROS args)
copyback_url: callback URL for result delivery
max_duration: seconds (safety: forced termination if exceeded)
}
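A minimal Python rendering of the JobSpec schema, with the boundary validation the plane would perform before submission. The defaults and the validate() method are a sketch, not a specified API:

```python
import uuid
from dataclasses import dataclass, field

ENVIRONMENTS = {"torch.dev.gpu", "ros.dev.gpu"}

@dataclass
class JobSpec:
    job_id: str                 # UUID string
    environment: str            # one of ENVIRONMENTS
    entrypoint: str             # path to notebook or ROS launch file
    resources: dict             # {"num_gpus": int, "num_cpus": int, "memory_gb": float}
    parameters: dict = field(default_factory=dict)   # papermill params or ROS args
    copyback_url: str = ""      # callback URL for result delivery
    max_duration: int = 3600    # seconds; forced termination if exceeded

    def validate(self) -> None:
        uuid.UUID(self.job_id)  # raises ValueError on a malformed ID
        if self.environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment}")
        if self.max_duration <= 0:
            raise ValueError("max_duration must be positive (safety cutoff)")
```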
User → Control plane (status events)
The user plane emits status events back to the control plane. In v1, the control plane polls via the ClusterAgent. In v1.5, user-plane workers push events to a Redis Stream; in v2, to a NATS subject; in both cases the control plane subscribes.
StatusEvent {
job_id: UUID
ray_job_id: str
status: PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
timestamp: ISO 8601
logs: str (tail of worker stdout/stderr)
wandb_run_id?: str (if W&B logging active)
}
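The StatusEvent schema can be sketched the same way. Redis Stream entries carry flat string-valued fields, so the v1.5 push path needs a flattening step before XADD; the to_fields() helper below is an assumed shape, not a specified API:

```python
from dataclasses import dataclass, asdict

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}

@dataclass
class StatusEvent:
    job_id: str
    ray_job_id: str
    status: str              # PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
    timestamp: str           # ISO 8601
    logs: str = ""           # tail of worker stdout/stderr
    wandb_run_id: str = ""   # empty when W&B logging is inactive

    def to_fields(self) -> dict:
        """Flatten to string-valued fields suitable for XADD into a
        Redis Stream (the v1.5 transport); optional empties are dropped."""
        return {k: str(v) for k, v in asdict(self).items() if v != ""}

def is_terminal(event: StatusEvent) -> bool:
    """True once the control plane can stop watching this job."""
    return event.status in TERMINAL_STATUSES
```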
Copyback webhook
On job completion, the Ray worker calls POST {copyback_url} with the executed notebook
payload. The control plane relays this to eaia for MDX regeneration.
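A stdlib sketch of the worker-side copyback call. The design only specifies "POST {copyback_url} with the executed notebook", so the JSON payload shape (job_id plus the executed .ipynb document) is an assumption:

```python
import json
import urllib.request

def build_copyback_request(copyback_url: str, job_id: str,
                           notebook_json: dict) -> urllib.request.Request:
    """Build the POST a Ray worker issues on job completion.
    Payload shape is illustrative; only the POST-to-copyback_url
    contract is specified by the design."""
    body = json.dumps({"job_id": job_id, "notebook": notebook_json}).encode()
    return urllib.request.Request(
        copyback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# The worker would then send it with:
#   urllib.request.urlopen(build_copyback_request(url, job_id, nb))
```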
Event substrate
The user plane is natively event-driven. Its internal communication substrate uses standard robotics and ML messaging:
| Use case | Technology |
|---|---|
| ROS node ↔ ROS node (real-time) | DDS (rmw_fastrtps) |
| ROS ↔ non-ROS containers (e.g. YOLOv8) | Zenoh bridge |
| ML worker logging | W&B SDK (direct to W&B API) |
| User plane → control plane (v1) | HTTP polling (pull) |
| User plane → control plane (v1.5) | Redis Streams (push) |
| User plane → control plane (v2) | NATS subjects (push) |
Zenoh is used to decouple non-ROS workloads (YOLOv8 inference, SLAM) from the ROS graph. A Zenoh bridge republishes DDS topics as Zenoh subjects, allowing PyTorch containers without ROS deps to subscribe to sensor data.
Safety constraints
The user plane runs workloads that can actuate physical hardware. Safety is enforced at the plane boundary, before execution begins:
| Constraint | Mechanism |
|---|---|
| Max job duration | max_duration in JobSpec; KubeRay job TTL |
| Resource caps | RayCluster resources.limits in CR spec |
| Namespace isolation | Each RayCluster in its own K8s namespace |
| Actuation gating (v2) | Confidence threshold check before Nav2 goal submission |
| Emergency stop | Control plane sends SIGTERM via ray job stop; ros-mcp-server has /emergency_stop MCP tool |
| Circuit breaker | ClusterAgent monitors anomaly rate; pauses cluster on threshold breach |
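The circuit-breaker row can be sketched as a sliding-window anomaly-rate check. The window size, threshold, and class shape are illustrative defaults, not the ClusterAgent's actual implementation:

```python
from collections import deque

class CircuitBreaker:
    """Sliding-window anomaly-rate breaker in the spirit of the
    ClusterAgent row above; parameters are illustrative."""

    def __init__(self, window: int = 50, max_anomaly_rate: float = 0.2):
        self.events = deque(maxlen=window)   # True = anomalous observation
        self.max_anomaly_rate = max_anomaly_rate
        self.paused = False                  # latched until manual reset

    def record(self, anomaly: bool) -> bool:
        """Record one observation; return True if the cluster should pause.
        Only trips once the window is full, to avoid startup noise."""
        self.events.append(anomaly)
        rate = sum(self.events) / len(self.events)
        if len(self.events) == self.events.maxlen and rate > self.max_anomaly_rate:
            self.paused = True
        return self.paused
```

Latching (no auto-reset) matches the table's intent: a tripped breaker pauses the cluster until an operator or the control plane intervenes.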
v1 Hybrid Compromise
In v1, the claude -p subprocesses invoked by the control plane (NotebookAgent, ClusterAgent)
conflate control-plane reasoning with user-plane execution. A NotebookAgent subprocess
both decides how to submit a job (control plane cognition) and issues the ray job submit
command (user plane execution). This is a deliberate pragmatic choice for v1.
The subprocess boundary (claude -p as a child process of the FastAPI app) serves as the
physical plane separator in v1: the control plane process never directly touches kubectl
or ray; only the subprocess does. This is sufficient isolation for v1 but conflates the two
planes logically.
In v2, the separation becomes explicit: control agents emit JobSpec messages to a NATS
subject; a user-plane executor (a lightweight worker with no reasoning capability) consumes
the spec and runs ray job submit. The control agent never touches infrastructure CLIs
directly.
Infrastructure
KubeRay operator
Deployed via Helm into the kuberay namespace on the Proxmox K8s cluster. Manages
RayCluster, RayJob, and RayService CRs.
infra/k8s/
├── kuberay-operator/values.yaml Helm values (resource limits, image tags)
├── raycluster-torch-gpu.yaml RayCluster CR for torch.dev.gpu
└── raycluster-ros-gpu.yaml RayCluster CR for ros.dev.gpu
Worker images
Each environment uses a purpose-built worker image:
| Environment | Base image | Key packages |
|---|---|---|
| torch.dev.gpu | rayproject/ray-ml:2.x-gpu | PyTorch, papermill, wandb, diffusers, Cosmos-Predict2, Cosmos-Transfer2.5 |
| ros.dev.gpu | Custom: ROS 2 Jazzy + Ray | ros2, nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 (vLLM) |
Evolution Path
v1 — KubeRay on Proxmox K8s; ray job submit via NotebookAgent subprocess (hybrid)
v1.5 — Redis Streams: user plane emits StatusEvents; control plane subscribes (decoupled polling)
  - Cosmos-Reason2 on ros.dev.gpu for physical reasoning
  - Cosmos-Predict2 + Cosmos-Transfer2.5 on torch.dev.gpu for world model inference + sim2real
  - Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
  - SDG pipeline: Gazebo → Predict2 → Transfer2.5 → lakehouse augmented dataset
v2 — NATS: JobSpec dispatch; pure executor workers (no LLM); Zenoh → NATS bridge for ROS events
  - Actuation confidence gates backed by Cosmos-Reason2 feasibility scores
  - Cosmos-Predict2 post-trained on turtlebot-maze ROS bag recordings
  - Cosmos-Transfer2.5 Real2Real augmentation of turtlebot-maze ROS bags for policy robustness
  - Management plane subscribes to execution telemetry
v3 — Edge deployment: Cosmos-Reason2 (2B, FP8) on Jetson AGX Orin/Thor for on-robot inference
  - Dual-compute: Jetson (edge reasoning) + KubeRay (training, SDG, heavy inference)
References
- Cosmos-Reason2 on Jetson — deploying Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM; demonstrates real-time webcam inference and robotic manipulation (pick-and-place). Key for v3 edge deployment: the 2B model runs on Jetson AGX Orin 64 GB at 8192 token context, and on Orin Super Nano at 256 token context with aggressive memory tuning.
- Cosmos-Reason2 — physical AI reasoning VLM (2B/8B); spatial, temporal, physics comprehension (auraison-eh1)
- Cosmos-Predict2 — world foundation model for future state prediction via video generation (auraison-oys)
- Cosmos-Transfer2.5 — multi-controlnet sim2real augmentation; +68.5% mission success rate on navigation tasks (auraison-i6l)