Auraison — User Plane Design
Date: 2026-02-23 Updated: 2026-03-02 Status: Approved (v1)
Problem
The user plane is where customer agentic workloads execute: VLA agents, behavior trees, real-time robot control, notebook-based ML training, SLAM, object detection. These workloads have fundamentally different requirements from the control plane — they are real-time, stateful per-session, and hardware-bound. They must continue running even when the control plane is degraded or unreachable.
The user plane is the execution mesh. It does not reason about what to run; it runs what it is told, as fast as the hardware allows.
Goals
- Provide a multi-environment execution mesh for heterogeneous agentic workloads
- Hardware abstraction: workloads declare resource requirements; the plane satisfies them
- Isolation between workloads: a failing Nav2 job must not affect a running notebook job
- Accept job specifications from the control plane; emit status events back
- Support the canonical turtlebot-maze reference application end-to-end
- Remain operational during control plane outages
Non-goals (v1)
- Reasoning, planning, or orchestration — that is the control plane
- Billing and quota enforcement — that is the management plane
- Full LLM reasoning loops within user plane agents — constrained tool use only (v1)
Architecture
Environments
The user plane is structured as two named KubeRay environments on Proxmox K8s, each with distinct hardware profiles and workload classes:
| Environment | Hardware | Workload class |
|---|---|---|
| torch.dev.gpu | GPU nodes, CUDA, PyTorch | Notebook execution, VLA training, ML inference, Cosmos-Predict2 world model inference, Cosmos-Transfer2.5 sim2real augmentation |
| ros.dev.gpu | GPU nodes, ROS 2 Jazzy | Robot simulation, Nav2, YOLOv8, SLAM, Cosmos-Reason2 physical reasoning |
Each environment is a separate RayCluster CR. Workloads are Ray Jobs submitted by the control plane and executed by Ray workers in the appropriate cluster.
System context (C4 Level 2)
Reference application: turtlebot-maze
turtlebot-maze is the canonical user-plane application. It demonstrates all user-plane
capabilities in a single end-to-end scenario:
Claude Code /navigate skill (user plane — real-time)
→ ros-mcp-server
MCP tool call: publish_cmd_vel, get_odom, set_nav_goal
→ rosbridge WebSocket :9090
→ ROS 2 Nav2 action server
→ TurtleBot base controller
→ Gazebo simulation (or physical robot)
Supporting subsystems:
On ros.dev.gpu Ray workers:
- Behavior trees (py_trees / BehaviorTree.CPP): autonomous navigation + search sequences
- YOLOv8: object detection via PyTorch, decoupled from ROS via Zenoh transport
- stella_vslam: visual SLAM for mapping and localization
- Nav2: path planning and collision avoidance
- Cosmos-Reason2 (auraison-eh1): physical AI reasoning — evaluates action feasibility using spatial/physics common sense before Nav2 goal dispatch
On torch.dev.gpu Ray workers:
- Cosmos-Predict2 (auraison-oys): world model inference — given current camera frame + planned action, generates predicted future video frames
- Cosmos-Transfer2.5 (auraison-i6l): sim2real augmentation — translates Gazebo/Predict2 synthetic video to photorealistic video; conditioned on depth, edge, and segmentation control maps extracted from Gazebo; documented +68.5% mission success rate improvement on navigation tasks
Predict → Transfer → Reason → Execute loop (v1.5):
Cosmos-Predict2 (torch.dev.gpu)
current frame + proposed action → synthetic trajectory video
→ Cosmos-Transfer2.5 (torch.dev.gpu)
synthetic → photorealistic (depth + edge control maps from Gazebo)
→ Cosmos-Reason2 (ros.dev.gpu)
feasibility evaluation (physics / obstacle / reachability)
→ go: Nav2 goal dispatched
→ no-go: action rejected, behavior tree selects alternative
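The go/no-go decision at the Reason step can be sketched as a simple gate. The verdict fields and the confidence threshold below are illustrative assumptions, not the actual Cosmos-Reason2 output schema:

```python
from dataclasses import dataclass

# Hypothetical feasibility verdict distilled from Cosmos-Reason2 output
# (field names are illustrative, not the model's real response schema).
@dataclass
class FeasibilityVerdict:
    physics_ok: bool      # no predicted dynamics violation
    obstacle_free: bool   # no collision along the predicted trajectory
    reachable: bool       # goal pose reachable from current pose
    confidence: float     # model self-reported confidence, 0..1

def gate_nav_goal(verdict: FeasibilityVerdict, min_confidence: float = 0.8) -> str:
    """Return 'go' to dispatch the Nav2 goal, or 'no-go' so the
    behavior tree selects an alternative action."""
    if (verdict.physics_ok and verdict.obstacle_free
            and verdict.reachable and verdict.confidence >= min_confidence):
        return "go"
    return "no-go"
```

A single failed criterion rejects the action; the behavior tree, not the gate, decides what to try next.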
Synthetic data generation (SDG) pipeline:
Gazebo rollouts (ros.dev.gpu) → Cosmos-Predict2 → Cosmos-Transfer2.5
→ augmented video + action labels → Parquet dataset in data-plane lakehouse
→ VLA / Nav2 policy fine-tuning jobs on torch.dev.gpu
The control plane manages the ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and
experiment bookkeeping. It does not participate in the real-time control loop — that loop
runs entirely within the user plane.
Interfaces
Control → User plane (job submission)
The control plane emits a JobSpec to the user plane. In v1, this is a direct ray job submit
CLI invocation by the NotebookAgent subprocess. In v2, the control plane writes to a NATS subject
and a user-plane executor subscribes and submits.
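The v1 CLI path can be sketched as a JobSpec-to-argv translation. The resource flags (--entrypoint-num-gpus, --entrypoint-num-cpus) are assumed from the Ray 2.x job CLI, and the JobSpec dict shape mirrors the schema below; neither is a confirmed NotebookAgent implementation:

```python
def build_submit_argv(spec: dict, ray_address: str) -> list[str]:
    """Translate a JobSpec dict into the `ray job submit` argv the
    NotebookAgent subprocess would execute on the v1 path."""
    argv = [
        "ray", "job", "submit",
        "--address", ray_address,
        "--submission-id", spec["job_id"],
        "--no-wait",  # fire-and-forget; v1 status arrives via polling
    ]
    res = spec.get("resources", {})
    if res.get("num_gpus"):
        argv += ["--entrypoint-num-gpus", str(res["num_gpus"])]
    if res.get("num_cpus"):
        argv += ["--entrypoint-num-cpus", str(res["num_cpus"])]
    # Everything after "--" is the entrypoint command run on the worker.
    argv += ["--", "python", spec["entrypoint"]]
    return argv
```

In v2 the same translation moves into the pure executor worker, with the spec arriving over NATS instead of being built in-process.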
JobSpec {
job_id: UUID
environment: "torch.dev.gpu" | "ros.dev.gpu"
entrypoint: path to notebook or ROS launch file
resources: {num_gpus: int, num_cpus: int, memory_gb: float}
parameters: dict (papermill parameters or ROS args)
copyback_url: callback URL for result delivery
max_duration: seconds (safety: forced termination if exceeded)
}
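A minimal Python rendering of the JobSpec schema, with the boundary validation the plane would perform before submission. The defaults and the validate() method are a sketch, not a specified API:

```python
import uuid
from dataclasses import dataclass, field

ENVIRONMENTS = {"torch.dev.gpu", "ros.dev.gpu"}

@dataclass
class JobSpec:
    job_id: str                 # UUID string
    environment: str            # one of ENVIRONMENTS
    entrypoint: str             # path to notebook or ROS launch file
    resources: dict             # {"num_gpus": int, "num_cpus": int, "memory_gb": float}
    parameters: dict = field(default_factory=dict)   # papermill params or ROS args
    copyback_url: str = ""      # callback URL for result delivery
    max_duration: int = 3600    # seconds; forced termination if exceeded

    def validate(self) -> None:
        uuid.UUID(self.job_id)  # raises ValueError on a malformed ID
        if self.environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment}")
        if self.max_duration <= 0:
            raise ValueError("max_duration must be positive (safety cutoff)")
```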
User → Control plane (status events)
The user plane emits status events back to the control plane. In v1, the control plane polls via the ClusterAgent. In v1.5, user-plane workers push events to a Redis Stream; in v2, to a NATS subject; in both cases the control plane subscribes.
StatusEvent {
job_id: UUID
ray_job_id: str
status: PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
timestamp: ISO 8601
logs: str (tail of worker stdout/stderr)
wandb_run_id?: str (if W&B logging active)
}
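The StatusEvent schema can be sketched the same way. Redis Stream entries carry flat string-valued fields, so the v1.5 push path needs a flattening step before XADD; the to_fields() helper below is an assumed shape, not a specified API:

```python
from dataclasses import dataclass, asdict

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}

@dataclass
class StatusEvent:
    job_id: str
    ray_job_id: str
    status: str              # PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
    timestamp: str           # ISO 8601
    logs: str = ""           # tail of worker stdout/stderr
    wandb_run_id: str = ""   # empty when W&B logging is inactive

    def to_fields(self) -> dict:
        """Flatten to string-valued fields suitable for XADD into a
        Redis Stream (the v1.5 transport); optional empties are dropped."""
        return {k: str(v) for k, v in asdict(self).items() if v != ""}

def is_terminal(event: StatusEvent) -> bool:
    """True once the control plane can stop watching this job."""
    return event.status in TERMINAL_STATUSES
```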
Copyback webhook
On job completion, the Ray worker calls POST {copyback_url} with the executed notebook
payload. The control plane relays this to eaia for MDX regeneration.
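A stdlib sketch of the worker-side copyback call. The design only specifies "POST {copyback_url} with the executed notebook", so the JSON payload shape (job_id plus the executed .ipynb document) is an assumption:

```python
import json
import urllib.request

def build_copyback_request(copyback_url: str, job_id: str,
                           notebook_json: dict) -> urllib.request.Request:
    """Build the POST a Ray worker issues on job completion.
    Payload shape is illustrative; only the POST-to-copyback_url
    contract is specified by the design."""
    body = json.dumps({"job_id": job_id, "notebook": notebook_json}).encode()
    return urllib.request.Request(
        copyback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# The worker would then send it with:
#   urllib.request.urlopen(build_copyback_request(url, job_id, nb))
```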
Event substrate
The user plane is natively event-driven. Its internal communication substrate uses standard robotics and ML messaging:
| Use case | Technology |
|---|---|
| ROS node ↔ ROS node (real-time) | DDS (rmw_fastrtps) |
| ROS ↔ non-ROS containers (e.g. YOLOv8) | Zenoh bridge |
| ML worker logging | W&B SDK (direct to W&B API) |
| User plane → control plane (v1) | HTTP polling (pull) |
| User plane → control plane (v1.5) | Redis Streams (push) |
| User plane → control plane (v2) | NATS subjects (push) |
Zenoh is used to decouple non-ROS workloads (YOLOv8 inference, SLAM) from the ROS graph. A Zenoh bridge republishes DDS topics as Zenoh subjects, allowing PyTorch containers without ROS deps to subscribe to sensor data.
Safety constraints
The user plane runs workloads that can actuate physical hardware. Safety is enforced at the plane boundary, before execution begins:
| Constraint | Mechanism |
|---|---|
| Max job duration | max_duration in JobSpec; KubeRay job TTL |
| Resource caps | RayCluster resources.limits in CR spec |
| Namespace isolation | Each RayCluster in its own K8s namespace |
| Actuation gating (v2) | Confidence threshold check before Nav2 goal submission |
| Emergency stop | Control plane sends SIGTERM via ray job stop; ros-mcp-server has /emergency_stop MCP tool |
| Circuit breaker | ClusterAgent monitors anomaly rate; pauses cluster on threshold breach |
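The circuit-breaker row can be sketched as a sliding-window anomaly-rate check. The window size, threshold, and class shape are illustrative defaults, not the ClusterAgent's actual implementation:

```python
from collections import deque

class CircuitBreaker:
    """Sliding-window anomaly-rate breaker in the spirit of the
    ClusterAgent row above; parameters are illustrative."""

    def __init__(self, window: int = 50, max_anomaly_rate: float = 0.2):
        self.events = deque(maxlen=window)   # True = anomalous observation
        self.max_anomaly_rate = max_anomaly_rate
        self.paused = False                  # latched until manual reset

    def record(self, anomaly: bool) -> bool:
        """Record one observation; return True if the cluster should pause.
        Only trips once the window is full, to avoid startup noise."""
        self.events.append(anomaly)
        rate = sum(self.events) / len(self.events)
        if len(self.events) == self.events.maxlen and rate > self.max_anomaly_rate:
            self.paused = True
        return self.paused
```

Latching (no auto-reset) matches the table's intent: a tripped breaker pauses the cluster until an operator or the control plane intervenes.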
v1 Hybrid Compromise
In v1, the claude -p subprocesses invoked by the control plane (NotebookAgent, ClusterAgent)
conflate control-plane reasoning with user-plane execution. A NotebookAgent subprocess
both decides how to submit a job (control plane cognition) and issues the ray job submit
command (user plane execution). This is a deliberate pragmatic choice for v1.
The subprocess boundary (claude -p as a child process of the FastAPI app) serves as the
physical plane separator in v1: the control plane process never directly touches kubectl
or ray; only the subprocess does. This is sufficient isolation for v1 but conflates the two
planes logically.
In v2, the separation becomes explicit: control agents emit JobSpec messages to a NATS
subject; a user-plane executor (a lightweight worker with no reasoning capability) consumes
the spec and runs ray job submit. The control agent never touches infrastructure CLIs
directly.
Infrastructure
KubeRay operator
Deployed via Helm into the kuberay namespace on the Proxmox K8s cluster. Manages
RayCluster, RayJob, and RayService CRs.
infra/k8s/
├── kuberay-operator/values.yaml Helm values (resource limits, image tags)
├── raycluster-torch-gpu.yaml RayCluster CR for torch.dev.gpu
└── raycluster-ros-gpu.yaml RayCluster CR for ros.dev.gpu
Worker images
Each environment uses a purpose-built worker image:
| Environment | Base image | Key packages |
|---|---|---|
| torch.dev.gpu | rayproject/ray-ml:2.x-gpu | PyTorch, papermill, wandb, diffusers, Cosmos-Predict2, Cosmos-Transfer2.5 |
| ros.dev.gpu | Custom: ROS 2 Jazzy + Ray | ros2, nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 (vLLM) |
Evolution Path
v1 — KubeRay on Proxmox K8s; ray job submit via NotebookAgent subprocess (hybrid)
v1.5 — Redis Streams: user plane emits StatusEvents; control plane subscribes (decoupled polling)
  - Cosmos-Reason2 on ros.dev.gpu for physical reasoning
  - Cosmos-Predict2 + Cosmos-Transfer2.5 on torch.dev.gpu for world model inference + sim2real
  - Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
  - SDG pipeline: Gazebo → Predict2 → Transfer2.5 → lakehouse augmented dataset
v2 — NATS: JobSpec dispatch; pure executor workers (no LLM); Zenoh → NATS bridge for ROS events
  - Actuation confidence gates backed by Cosmos-Reason2 feasibility scores
  - Cosmos-Predict2 post-trained on turtlebot-maze ROS bag recordings
  - Cosmos-Transfer2.5 Real2Real augmentation of turtlebot-maze ROS bags for policy robustness
  - Management plane subscribes to execution telemetry
v3 — Edge deployment: Cosmos-Reason2 (2B, FP8) on Jetson AGX Orin/Thor for on-robot inference
  - Dual-compute: Jetson (edge reasoning) + KubeRay (training, SDG, heavy inference)
References
- Cosmos-Reason2 on Jetson — deploying Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM; demonstrates real-time webcam inference and robotic manipulation (pick-and-place). Key for v3 edge deployment: the 2B model runs on Jetson AGX Orin 64 GB at 8192 token context, and on Orin Super Nano at 256 token context with aggressive memory tuning.
- Cosmos-Reason2 — physical AI reasoning VLM (2B/8B); spatial, temporal, physics comprehension (auraison-eh1)
- Cosmos-Predict2 — world foundation model for future state prediction via video generation (auraison-oys)
- Cosmos-Transfer2.5 — multi-controlnet sim2real augmentation; +68.5% mission success rate on navigation tasks (auraison-i6l)