User Plane Design

Date: 2026-02-23 · Updated: 2026-03-14 · Status: Approved (v1)


Problem

The user plane is where customer agentic workloads execute: VLA agents, behavior trees, real-time robot control, notebook-based ML training, SLAM, object detection. These workloads have fundamentally different requirements from the control plane — they are real-time, stateful per-session, and hardware-bound. They must continue running even when the control plane is degraded or unreachable.

The user plane is the execution mesh. It does not reason about what to run; it runs what it is told, as fast as the hardware allows.


Goals

  • Provide a multi-environment execution mesh for heterogeneous agentic workloads
  • Hardware abstraction: workloads declare resource requirements; the plane satisfies them
  • Isolation between workloads: a failing Nav2 job must not affect a running notebook job
  • Accept job specifications from the control plane; emit status events back
  • Support the canonical turtlebot-maze reference application end-to-end
  • Remain operational during control plane outages

Non-goals (v1)

  • Reasoning, planning, or orchestration — that is the control plane
  • Billing and quota enforcement — that is the management plane
  • Full LLM reasoning loops within user plane agents — constrained tool use only (v1)

Architecture

Deployment: Ray on Proxmox VMs (v1)

The user plane runs Ray natively on Proxmox VMs using the Ray on-premise cluster launcher -- not Kubernetes. This is a deliberate simplification for a single-team AI lab.

Why not KubeRay/K8s:

  • K8s adds a second orchestrator on top of Proxmox with significant maintenance overhead (etcd, kubelet, CNI, cert rotation, device plugins, resource quotas)
  • For a single team with known hardware, Ray's native SSH-based launcher is sufficient
  • Claude Code can manage the cluster via ray up/down/submit -- three commands vs kubectl + helm + CRD YAML
  • Faster cold starts (direct process start vs pod scheduling + image pull)
  • Direct GPU access without K8s device plugins
  • KubeRay is justified only for multi-tenancy, multiple Ray versions, or cloud burst auto-scaling -- none of which apply in v1

Job execution model (Level 1: Ray dispatches Docker containers):

Ray runs bare on the VM and has Docker socket access. Each job runs inside an application-specific Docker container. Ray handles GPU scheduling -- multiple applications can queue jobs, and Ray executes them sequentially as GPU resources become available.

Application A (tcc):   ray job submit --num-gpus 1 → docker run tcc-dev papermill ...
Application B (frcnn): ray job submit --num-gpus 1 → docker run frcnn-dev train.py ...
                                |
                                v
                 Ray head (gpu-node-3, 1 GPU)
                 +------------------------+
                 | Job queue:             |
                 | 1. tcc    (running)    | ← GPU allocated
                 | 2. frcnn  (pending)    | ← waits for GPU
                 +------------------------+

Each application brings its own Docker image (via its own docker-compose.yml and Dockerfile). Ray does not know or care that jobs are Docker containers -- it manages the GPU lock. When a job finishes and releases the GPU, the next pending job starts.
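A minimal sketch of this dispatch pattern from the Ray side, assuming the image names from the diagram above and illustrative entrypoints (the docker arguments are a sketch, not the committed implementation):

import subprocess

import ray

ray.init(address="auto")  # connect to the head started by `ray up`

@ray.remote(num_gpus=1)
def run_containerized_job(image: str, args: list[str]) -> int:
    # Ray holds the GPU lease for the lifetime of this task; the workload
    # itself runs inside the application's own Docker container.
    gpu = ray.get_gpu_ids()[0]
    proc = subprocess.run(["docker", "run", "--rm", f"--gpus=device={gpu}", image, *args])
    return proc.returncode

# Two applications queue jobs; Ray serializes them on the single GPU.
tcc = run_containerized_job.remote("tcc-dev", ["papermill", "in.ipynb", "out.ipynb"])
frcnn = run_containerized_job.remote("frcnn-dev", ["python", "train.py"])
print(ray.get([tcc, frcnn]))  # return codes once both complete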

Validated hardware (2026-03-14):

| Resource | Value |
|----------|-------|
| Node | gpu-node-3 (192.168.1.78) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GiB VRAM) |
| CUDA | 13.0 |
| Docker | 29.3 |
| Ray | 2.54.0 |
| Dashboard | http://192.168.1.78:8265 |

Cluster configuration:

# infra/ray/cluster.yaml
cluster_name: auraison-gpu
provider:
  type: local
  head_ip: 192.168.1.78   # gpu-node-3
  worker_ips: []          # single-node: head is also the worker
auth:
  ssh_user: pantelis.monogioudis
  ssh_private_key: ~/.ssh/keys/id_ed25519
head_start_ray_commands:
  - ray stop
  - source ~/ray-venv/bin/activate && ray start --head --port=6379 --dashboard-host=0.0.0.0 --num-gpus=1

ray up infra/ray/cluster.yaml     # start Ray head on gpu-node-3
ray status                        # check cluster health
ray down infra/ray/cluster.yaml   # tear down

Environments

The user plane is structured as two named Ray environments on Proxmox VMs, each with distinct hardware profiles and workload classes:

| Environment | Hardware | Workload class |
|-------------|----------|----------------|
| torch.dev.gpu | GPU nodes, CUDA, PyTorch | Notebook execution, VLA training, ML inference, Cosmos-Predict2 world model inference, Cosmos-Transfer2.5 sim2real augmentation |
| ros.dev.gpu | GPU nodes, ROS 2 Jazzy | Robot simulation, Nav2, YOLOv8, SLAM, Cosmos-Reason2 physical reasoning |

In v1, both environments run on the same Ray cluster with workloads differentiated by runtime environment and resource requirements. Workloads are Ray Jobs submitted by the control plane and executed by Ray workers.
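For illustration, a torch.dev.gpu-class submission via the Ray job SDK against the dashboard address validated above (the runtime_env contents here are placeholders; v1 uses the YAMLs under infra/ray/runtime_envs/):

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://192.168.1.78:8265")

job_id = client.submit_job(
    entrypoint="papermill train.ipynb out.ipynb",
    runtime_env={"working_dir": ".", "pip": ["papermill", "wandb"]},  # placeholder env
    entrypoint_num_gpus=1,  # queues behind any job currently holding the GPU
)
print(job_id)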

System context (C4 Level 2)

Reference application: turtlebot-maze

turtlebot-maze is the canonical user-plane application. It demonstrates all user-plane capabilities in a single end-to-end scenario:

Claude Code /navigate skill (user plane — real-time)
  → ros-mcp-server
      MCP tool calls: publish_cmd_vel, get_odom, set_nav_goal
  → rosbridge WebSocket :9090
  → ROS 2 Nav2 action server
  → TurtleBot base controller
  → Gazebo simulation (or physical robot)
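The rosbridge hop can be exercised directly from any WebSocket client; a minimal sketch using roslibpy (topic name and message shape follow standard TurtleBot conventions and are not a verified configuration):

import roslibpy

ros = roslibpy.Ros(host="localhost", port=9090)  # rosbridge WebSocket endpoint
ros.run()

cmd_vel = roslibpy.Topic(ros, "/cmd_vel", "geometry_msgs/Twist")
cmd_vel.publish(roslibpy.Message({
    "linear":  {"x": 0.2, "y": 0.0, "z": 0.0},  # forward at 0.2 m/s
    "angular": {"x": 0.0, "y": 0.0, "z": 0.5},  # turn at 0.5 rad/s
}))

ros.terminate()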

Supporting subsystems:

On ros.dev.gpu Ray workers:

  • Behavior trees (py_trees / BehaviorTree.CPP): autonomous navigation + search sequences
  • YOLOv8: object detection via PyTorch, decoupled from ROS via Zenoh transport
  • stella_vslam: visual SLAM for mapping and localization
  • Nav2: path planning and collision avoidance
  • Cosmos-Reason2 (auraison-eh1): physical AI reasoning — evaluates action feasibility using spatial/physics common sense before Nav2 goal dispatch

On torch.dev.gpu Ray workers:

  • Cosmos-Predict2 (auraison-oys): world model inference — given current camera frame + planned action, generates predicted future video frames
  • Cosmos-Transfer2.5 (auraison-i6l): sim2real augmentation — translates Gazebo/Predict2 synthetic video to photorealistic video; conditioned on depth, edge, and segmentation control maps extracted from Gazebo; documented +68.5% mission success rate improvement on navigation tasks

Predict → Transfer → Reason → Execute loop (v1.5):

Cosmos-Predict2 (torch.dev.gpu)
    current frame + proposed action → synthetic trajectory video
  → Cosmos-Transfer2.5 (torch.dev.gpu)
      synthetic → photorealistic (depth + edge control maps from Gazebo)
  → Cosmos-Reason2 (ros.dev.gpu)
      feasibility evaluation (physics / obstacle / reachability)
      → go:    Nav2 goal dispatched
      → no-go: action rejected, behavior tree selects alternative
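In orchestration terms the loop is a pipeline with a gate at the end. The sketch below is shape-only: predict_trajectory, transfer_to_photoreal, and evaluate_feasibility are hypothetical stand-ins for the actual Cosmos invocations, which run as Ray tasks on the environments noted above:

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def predict_trajectory(frame: bytes, action: dict) -> bytes:
    return b"synthetic-video"  # placeholder for Cosmos-Predict2 inference

@ray.remote(num_gpus=1)
def transfer_to_photoreal(video: bytes, control_maps: dict) -> bytes:
    return b"photoreal-video"  # placeholder for Cosmos-Transfer2.5 inference

@ray.remote(num_gpus=1)
def evaluate_feasibility(video: bytes) -> float:
    return 0.87  # placeholder for Cosmos-Reason2 feasibility score in [0, 1]

def gate_action(frame: bytes, action: dict, control_maps: dict, threshold: float = 0.7) -> bool:
    video = predict_trajectory.remote(frame, action)
    photoreal = transfer_to_photoreal.remote(video, control_maps)
    score = ray.get(evaluate_feasibility.remote(photoreal))
    return score >= threshold  # go: dispatch Nav2 goal; no-go: behavior tree picks an alternative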

Synthetic data generation (SDG) pipeline:

Gazebo rollouts (ros.dev.gpu) → Cosmos-Predict2 → Cosmos-Transfer2.5
→ augmented video + action labels → Parquet dataset in data-plane lakehouse
→ VLA / Nav2 policy fine-tuning jobs on torch.dev.gpu

The control plane manages the ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and experiment bookkeeping. It does not participate in the real-time control loop — that loop runs entirely within the user plane.

Additional reference applications

ar4-physical-ai (aegean-ai/ar4-physical-ai) — VLA manipulation platform for the AR4 MK3 6-DOF robotic arm. Uses LeRobot (lerobot-ros / AnninAR4) for recording, training, and inference; Zenoh middleware for non-ROS transport; MoveIt2 + ros2_control for safe trajectory execution. Runs on both ros.dev.gpu (ROS 2 + Gazebo Harmonic) and torch.dev.gpu (VLA inference via vLLM + Zenoh queryable). See user-plane/ar4-digital-twin.md for the digital twin design.

counter-uas (aegean-ai/counter-uas, v2) — Counter-UAS system with VisDrone perception, Unreal Engine 5 simulation, and General Robotics GRID integration. Demonstrates the platform's support for non-manipulation, non-navigation workloads (aerial perception + tracking).

tube-quality-control (colgate/tube-quality-control) — AI-driven manufacturing quality control for tube production lines. Runs on torch.dev.gpu. Demonstrates the platform's support for industrial computer vision workloads:

  • Anomaly detection: PatchCore, EfficientAD (anomalib) for unsupervised defect detection
  • Supervised contrastive learning: ResNet50 fine-tuning with SupCon loss for defect classification across 2/4/11/12 class configurations
  • Embedding-based similarity search: Qdrant vector database for nearest-neighbor defect retrieval; pretrained and fine-tuned timm embeddings
  • Dataset pipeline: S3/MinIO raw images → COCO-format annotations → HuggingFace Hub datasets
  • Experiment tracking: ClearML for training lineage and model versioning
  • Visualization: FiftyOne for dataset exploration and annotation review
  • Domain model: Pydantic-based entities (MachineSettingModel, ImageModel, AnomalyLabelModel) with Hydra configuration management
  • Infrastructure: Docker GPU containers, MongoDB (FiftyOne), Qdrant, MinIO, NATS (event bus)
  • Edge deployment path: OpenVINO export for factory-floor inference

All reference applications share the same Layer C abstraction: vLLM inference serving via Zenoh queryable on torch.dev.gpu. Each plugs in its own model backend (Cosmos stack, LeRobot VLA, perception/tracking models, anomaly detection) without platform changes.
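A sketch of that Layer C shape, assuming the zenoh-python (>= 1.0 API) and vllm packages; the model name and key expression are placeholders:

import time

import zenoh
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model backend
params = SamplingParams(max_tokens=128)

def on_query(query: zenoh.Query) -> None:
    prompt = query.payload.to_string() if query.payload is not None else ""
    text = llm.generate([prompt], params)[0].outputs[0].text
    query.reply(query.key_expr, text)

session = zenoh.open(zenoh.Config())
queryable = session.declare_queryable("auraison/infer/vla", on_query)  # placeholder key
while True:
    time.sleep(1)  # serve queries until interrupted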


Interfaces

Control → User plane (job submission)

The control plane emits a JobSpec to the user plane. In v1, this is a direct ray job submit CLI invocation by the NotebookAgent subprocess. In v2, the control plane writes to a NATS subject and a user-plane executor subscribes and submits.

JobSpec {
  job_id:        UUID
  environment:   "torch.dev.gpu" | "ros.dev.gpu"
  entrypoint:    path to notebook or ROS launch file
  resources:     {num_gpus: int, num_cpus: int, memory_gb: float}
  parameters:    dict (papermill parameters or ROS args)
  copyback_url:  callback URL for result delivery
  max_duration:  seconds (safety: forced termination if exceeded)
}
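A sketch of the v1 mapping from this spec to a Ray submission (the dataclass mirrors the fields above; the runtime_env selection is illustrative):

from dataclasses import dataclass, field

from ray.job_submission import JobSubmissionClient

@dataclass
class JobSpec:
    job_id: str
    environment: str            # "torch.dev.gpu" | "ros.dev.gpu"
    entrypoint: str             # notebook path or ROS launch file
    resources: dict             # {"num_gpus": 1, "num_cpus": 4, "memory_gb": 16.0}
    parameters: dict = field(default_factory=dict)
    copyback_url: str = ""
    max_duration: int = 3600    # seconds

def submit(spec: JobSpec, client: JobSubmissionClient) -> str:
    # Environment choice selects the matching runtime env (see Infrastructure).
    return client.submit_job(
        submission_id=spec.job_id,
        entrypoint=spec.entrypoint,
        runtime_env={"working_dir": "."},  # placeholder; v1 loads infra/ray/runtime_envs/*.yaml
        entrypoint_num_gpus=spec.resources.get("num_gpus", 0),
        entrypoint_num_cpus=spec.resources.get("num_cpus", 1),
    )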

User → Control plane (status events)

The user plane emits status events back to the control plane. In v1, the control plane polls via the ClusterAgent. In v1.5 and v2, user-plane workers push events to a Redis Stream (v1.5) or a NATS subject (v2), and the control plane subscribes.

StatusEvent {
  job_id:        UUID
  ray_job_id:    str
  status:        PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
  timestamp:     ISO 8601
  logs:          str (tail of worker stdout/stderr)
  wandb_run_id?: str (if W&B logging active)
}
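The v1 polling path maps Ray's JobStatus one-to-one onto the states above; a minimal sketch (the print stands in for event emission toward the control plane):

import time

from ray.job_submission import JobSubmissionClient

def poll_status(client: JobSubmissionClient, job_id: str, ray_job_id: str) -> None:
    while True:
        status = client.get_job_status(ray_job_id)      # PENDING | RUNNING | ...
        logs = client.get_job_logs(ray_job_id)[-2000:]  # tail of stdout/stderr
        print({"job_id": job_id, "ray_job_id": ray_job_id,
               "status": status.value, "logs": logs})   # placeholder emission
        if status.is_terminal():
            break
        time.sleep(5)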

Copyback webhook

On job completion, the Ray worker calls POST {copyback_url} with the executed notebook payload. The control plane relays this to eaia for MDX regeneration.
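A sketch of the worker-side callback, assuming the requests package (the multipart field names are illustrative, not a defined contract):

import requests

def copyback(copyback_url: str, job_id: str, notebook_path: str) -> None:
    # POST the executed notebook to the control plane on completion.
    with open(notebook_path, "rb") as f:
        resp = requests.post(
            copyback_url,
            files={"notebook": (notebook_path, f, "application/x-ipynb+json")},
            data={"job_id": job_id},
            timeout=30,
        )
    resp.raise_for_status()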


Event substrate

The user plane is natively event-driven. Its internal communication substrate uses standard robotics and ML messaging:

| Use case | Technology |
|----------|------------|
| ROS node ↔ ROS node (real-time) | DDS (rmw_fastrtps) |
| ROS ↔ non-ROS containers (e.g. YOLOv8) | Zenoh bridge |
| ML worker logging | W&B SDK (direct to W&B API) |
| User plane → control plane (v1) | HTTP polling (pull) |
| User plane → control plane (v1.5) | Redis Streams (push) |
| User plane → control plane (v2) | NATS subjects (push) |

Zenoh is used to decouple non-ROS workloads (YOLOv8 inference, SLAM) from the ROS graph. A Zenoh bridge republishes DDS topics as Zenoh subjects, allowing PyTorch containers without ROS dependencies to subscribe to sensor data.
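For example, a ROS-free PyTorch container needs only the zenoh-python package (>= 1.0 API) to consume bridged camera frames (the key expression is hypothetical; actual names depend on the bridge configuration):

import time

import zenoh

def on_frame(sample: zenoh.Sample) -> None:
    # Payload is the serialized sensor message republished by the Zenoh bridge.
    print(f"{sample.key_expr}: {len(sample.payload.to_bytes())} bytes")

session = zenoh.open(zenoh.Config())
sub = session.declare_subscriber("camera/image_raw", on_frame)  # hypothetical key
time.sleep(60)  # keep the subscriber alive while frames arrive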


Safety constraints

The user plane runs workloads that can actuate physical hardware. Safety is enforced at the plane boundary, before execution begins:

| Constraint | Mechanism |
|------------|-----------|
| Max job duration | max_duration in JobSpec; Ray Job timeout |
| Resource caps | Ray resource scheduling (num_gpus, num_cpus) |
| Process isolation | Each environment uses separate Ray runtime environments |
| Actuation gating (v2) | Confidence threshold check before Nav2 goal submission |
| Emergency stop | Control plane sends SIGTERM via ray job stop; ros-mcp-server has /emergency_stop MCP tool |
| Circuit breaker | ClusterAgent monitors anomaly rate; pauses cluster on threshold breach |
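In v1, max-duration enforcement reduces to a watchdog around the Ray job API; a minimal sketch (stop_job is the standard Ray call, which sends SIGTERM and escalates to SIGKILL):

import time

from ray.job_submission import JobSubmissionClient

def enforce_max_duration(client: JobSubmissionClient, ray_job_id: str, max_duration: int) -> None:
    deadline = time.monotonic() + max_duration
    while time.monotonic() < deadline:
        if client.get_job_status(ray_job_id).is_terminal():
            return  # job finished on its own
        time.sleep(5)
    client.stop_job(ray_job_id)  # forced termination past max_duration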

v1 Hybrid Compromise

In v1, the claude -p subprocesses invoked by the control plane (NotebookAgent, ClusterAgent) conflate control-plane reasoning with user-plane execution. A NotebookAgent subprocess both decides how to submit a job (control plane cognition) and issues the ray job submit command (user plane execution). This is a deliberate pragmatic choice for v1.

The subprocess boundary (claude -p as a child process of the FastAPI app) serves as the physical plane separator in v1: the control plane process never directly touches kubectl or ray; only the subprocess does. This is sufficient isolation for v1 but conflates the two planes logically.

In v2, the separation becomes explicit: control agents emit JobSpec messages to a NATS subject; a user-plane executor (a lightweight worker with no reasoning capability) consumes the spec and runs ray job submit. The control agent never touches infrastructure CLIs directly.
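A sketch of that executor, assuming the nats-py package (subject name and NATS address are placeholders; the JobSpec fields follow the Interfaces section):

import asyncio
import json

import nats
from ray.job_submission import JobSubmissionClient

async def main() -> None:
    nc = await nats.connect("nats://nats.internal:4222")  # placeholder address
    ray_client = JobSubmissionClient("http://192.168.1.78:8265")

    async def on_jobspec(msg) -> None:
        spec = json.loads(msg.data)  # JobSpec as defined under Interfaces
        # No reasoning here: translate the spec into a submission, nothing more.
        ray_client.submit_job(
            submission_id=spec["job_id"],
            entrypoint=spec["entrypoint"],
            entrypoint_num_gpus=spec["resources"]["num_gpus"],
        )

    await nc.subscribe("userplane.jobspec", cb=on_jobspec)  # placeholder subject
    await asyncio.Event().wait()  # run until interrupted

asyncio.run(main())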


Infrastructure

Ray on Proxmox VMs (v1)

Ray is deployed natively on Proxmox VMs using the on-premise cluster launcher. No Kubernetes layer is required. The cluster configuration and VM provisioning are managed via Terraform (see infra/terraform/) and the Ray cluster YAML.

infra/ray/
├── cluster.yaml              Ray cluster config (head_ip + worker_ips)
└── runtime_envs/
    ├── torch-gpu.yaml        Runtime env for torch.dev.gpu workloads
    └── ros-gpu.yaml          Runtime env for ros.dev.gpu workloads

Claude Code agents manage the cluster lifecycle via:

  • ray up cluster.yaml -- start/restart the cluster
  • ray submit cluster.yaml job.py -- submit a job
  • ray status -- check cluster health
  • ray down cluster.yaml -- tear down

KubeRay (deferred to v2)

KubeRay operator and RayCluster CRs are available in infra/k8s/ but deferred until multi-tenancy or cloud burst auto-scaling is required. The K8s manifests are kept as a future migration path, not the v1 deployment target.

Worker environments

Each environment uses a purpose-built VM image or conda/pip runtime environment:

| Environment | VM setup | Key packages |
|-------------|----------|--------------|
| torch.dev.gpu | Ubuntu 24.04 + CUDA 12.x + Ray | PyTorch, papermill, wandb, diffusers, Cosmos-Predict2, Cosmos-Transfer2.5 |
| ros.dev.gpu | Ubuntu 24.04 + ROS 2 Jazzy + CUDA + Ray | ros2, nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 (vLLM) |

Evolution Path

v1   — Ray on Proxmox VMs (native launcher); ray job submit via NotebookAgent subprocess (hybrid)
v1.5 — Redis Streams: user plane emits StatusEvents; control plane subscribes (decoupled polling)
       Cosmos-Reason2 on ros.dev.gpu for physical reasoning
       Cosmos-Predict2 + Cosmos-Transfer2.5 on torch.dev.gpu for world model inference + sim2real
       Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
       SDG pipeline: Gazebo → Predict2 → Transfer2.5 → lakehouse augmented dataset
v2   — NATS: JobSpec dispatch; pure executor workers (no LLM); Zenoh → NATS bridge for ROS events
       Actuation confidence gates backed by Cosmos-Reason2 feasibility scores
       Cosmos-Predict2 post-trained on turtlebot-maze ROS bag recordings
       Cosmos-Transfer2.5 Real2Real augmentation of turtlebot-maze ROS bags for policy robustness
       Management plane subscribes to execution telemetry
v3   — Edge deployment: Cosmos-Reason2 (2B, FP8) on Jetson AGX Orin/Thor for on-robot inference
       Dual-compute: Jetson (edge reasoning) + KubeRay (training, SDG, heavy inference)

Requirements (UP-xxx)

Traces to system-level requirements in architecture/four-plane.md.

| ID | Requirement | Traces to | Version |
|----|-------------|-----------|---------|
| UP-001 | The user plane shall host agentic workloads: VLA, behavior trees, robot control, ML training, SLAM, object detection | SYS-001 | v1 |
| UP-002 | The user plane shall provide real-time (ms) latency and maintain per-session state | SYS-001 | v1 |
| UP-003 | User plane failure shall stop the agent/robot but the control plane shall continue | SYS-002 | v1 |
| UP-004 | The user plane shall remain operational during control plane outages | SYS-002 | v1 |
| UP-005 | The user plane shall provide two Ray environments, torch.dev.gpu and ros.dev.gpu, on Proxmox VMs | SYS-001 | v1 |
| UP-006 | torch.dev.gpu shall support CUDA, PyTorch, notebook execution, VLA training, Cosmos-Predict2, Cosmos-Transfer2.5 | UP-005 | v1 |
| UP-007 | ros.dev.gpu shall support ROS 2 Jazzy, Nav2, YOLOv8, SLAM, Cosmos-Reason2 | UP-005 | v1 |
| UP-008 | The control plane shall manage Ray cluster lifecycles via ray up/down but shall NOT participate in real-time control | SYS-002 | v1 |
| UP-009 | The user plane shall accept JobSpec from the control plane: job_id, environment, entrypoint, resources, parameters, copyback_url, max_duration | | v1 |
| UP-010 | The user plane shall emit StatusEvent to the control plane: job_id, status, timestamp, logs, wandb_run_id | | v1 |
| UP-011 | The user plane shall call the copyback webhook on job completion | CP-015 | v1 |
| UP-012 | ROS node communication shall use DDS (rmw_fastrtps) | | v1 |
| UP-013 | Non-ROS containers shall use the Zenoh bridge to decouple from the ROS graph | SYS-004 | v1 |
| UP-014 | Maximum job duration shall be enforced via Ray Job timeout | CP-020 | v1 |
| UP-015 | Resource caps shall be enforced via Ray resource scheduling (num_gpus, num_cpus) | | v1 |
| UP-016 | Environments shall be isolated via separate Ray runtime environments | | v1 |
| UP-017 | Actuation gating shall require a confidence threshold before goal submission | SYS-005 | v2 |
| UP-018 | Emergency stop shall be supported via ray job stop and the ros-mcp-server /emergency_stop tool | | v1 |
| UP-019 | Circuit breaker: ClusterAgent shall pause the cluster on anomaly rate threshold breach | SYS-002 | v1 |
| UP-020 | The user plane shall support turtlebot-maze with Claude Code /navigate via ros-mcp-server | SYS-003 | v1 |
| UP-021 | turtlebot-maze shall support behavior trees for autonomous navigation | SYS-003 | v1 |
| UP-022 | turtlebot-maze shall support YOLOv8 object detection via Zenoh | SYS-003, SYS-004 | v1 |
| UP-023 | turtlebot-maze shall support stella_vslam for mapping and localization | SYS-003 | v1 |
| UP-024 | turtlebot-maze shall support Cosmos-Reason2 for physical reasoning and feasibility evaluation | SYS-003, SYS-005 | v1.5 |
| UP-025 | Cosmos-Predict2 shall run on torch.dev.gpu for world model inference | SYS-005 | v1.5 |
| UP-026 | Cosmos-Transfer2.5 shall run on torch.dev.gpu for sim2real augmentation | SYS-005 | v1.5 |
| UP-027 | turtlebot-maze shall implement the Predict → Transfer → Reason → Execute loop | SYS-005 | v1.5 |
| UP-028 | SDG pipeline: Gazebo → Cosmos-Predict2 → Cosmos-Transfer2.5 → lakehouse | SYS-005, SYS-007 | v1.5 |
| UP-029 | torch.dev.gpu Ray worker image: rayproject/ray-ml with PyTorch, papermill, wandb, diffusers, Cosmos | UP-006 | v1 |
| UP-030 | ros.dev.gpu Ray worker image: ROS 2 Jazzy + Ray base with nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 | UP-007 | v1 |
| UP-031 | Ray cluster config shall be stored in infra/ray/cluster.yaml; KubeRay manifests deferred to infra/k8s/ (v2) | | v1 |
| UP-032 | In v1, jobs shall be dispatched synchronously via claude -p subprocess | CP-026 | v1 |
| UP-033 | In v1.5, StatusEvent shall be emitted via Redis Streams | | v1.5 |
| UP-034 | In v2, jobs shall be dispatched via NATS; pure executor workers shall consume JobSpec | CP-027 | v2 |
| UP-035 | In v3, Cosmos-Reason2 (2B, FP8) shall run on Jetson AGX Orin/Thor for on-robot inference | SYS-005 | v3 |
| UP-036 | In v3, dual-compute: Jetson (edge) + KubeRay (training, SDG, heavy inference) | SYS-005 | v3 |


References

  • Cosmos-Reason2 on Jetson — deploying Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM; demonstrates real-time webcam inference and robotic manipulation (pick-and-place). Key for v3 edge deployment: the 2B model runs on Jetson AGX Orin 64 GB at 8192 token context, and on Orin Super Nano at 256 token context with aggressive memory tuning.
  • Cosmos-Reason2 — physical AI reasoning VLM (2B/8B); spatial, temporal, physics comprehension (auraison-eh1)
  • Cosmos-Predict2.5 — world foundation model for future state prediction via video generation (auraison-oys)
  • Cosmos-Transfer2.5 — multi-controlnet sim2real augmentation; +68.5% mission success rate on navigation tasks (auraison-i6l)