User Plane Design
Date: 2026-02-23 Updated: 2026-03-14 Status: Approved (v1)
Problem
The user plane is where customer agentic workloads execute: VLA agents, behavior trees, real-time robot control, notebook-based ML training, SLAM, object detection. These workloads have fundamentally different requirements from the control plane — they are real-time, stateful per-session, and hardware-bound. They must continue running even when the control plane is degraded or unreachable.
The user plane is the execution mesh. It does not reason about what to run; it runs what it is told, as fast as the hardware allows.
Goals
- Provide a multi-environment execution mesh for heterogeneous agentic workloads
- Hardware abstraction: workloads declare resource requirements; the plane satisfies them
- Isolation between workloads: a failing Nav2 job must not affect a running notebook job
- Accept job specifications from the control plane; emit status events back
- Support the canonical turtlebot-maze reference application end-to-end
- Remain operational during control plane outages
Non-goals (v1)
- Reasoning, planning, or orchestration — that is the control plane
- Billing and quota enforcement — that is the management plane
- Full LLM reasoning loops within user plane agents — constrained tool use only (v1)
Architecture
Deployment: Ray on Proxmox VMs (v1)
The user plane runs Ray natively on Proxmox VMs using the Ray on-premise cluster launcher -- not Kubernetes. This is a deliberate simplification for a single-team AI lab.
Why not KubeRay/K8s:
- K8s adds a second orchestrator on top of Proxmox with significant maintenance overhead (etcd, kubelet, CNI, cert rotation, device plugins, resource quotas)
- For a single team with known hardware, Ray's native SSH-based launcher is sufficient
- Claude Code can manage the cluster via ray up/down/submit -- three commands vs kubectl + helm + CRD YAML
- Faster cold starts (direct process start vs pod scheduling + image pull)
- Direct GPU access without K8s device plugins
- KubeRay is justified only for multi-tenancy, multiple Ray versions, or cloud burst auto-scaling -- none of which apply in v1
Job execution model (Level 1: Ray dispatches Docker containers):
Ray runs bare on the VM and has Docker socket access. Each job runs inside an application-specific Docker container. Ray handles GPU scheduling -- multiple applications can queue jobs, and Ray executes them sequentially as GPU resources become available.
```text
Application A (tcc):   ray job submit --num-gpus 1  →  docker run tcc-dev papermill ...
Application B (frcnn): ray job submit --num-gpus 1  →  docker run frcnn-dev train.py ...
                                |
                                v
                   Ray head (gpu-node-3, 1 GPU)
                   +------------------------+
                   | Job queue:             |
                   | 1. tcc    (running)    |  ← GPU allocated
                   | 2. frcnn  (pending)    |  ← waits for GPU
                   +------------------------+
```
Each application brings its own Docker image (via its own docker-compose.yml and Dockerfile).
Ray does not know or care that jobs are Docker containers -- it manages the GPU lock.
When a job finishes and releases the GPU, the next pending job starts.
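As an illustration of this pattern, here is a minimal sketch of the GPU-lock-plus-container dispatch; run_in_container is a hypothetical wrapper, not committed code:

```python
# Minimal sketch of Level-1 dispatch: a Ray task reserves the GPU, then
# shells out to Docker. run_in_container is a hypothetical wrapper.
import subprocess
import ray

ray.init(address="auto")  # connect to the running head on gpu-node-3

@ray.remote(num_gpus=1)   # Ray holds the GPU lock for the container's lifetime
def run_in_container(image: str, command: list[str]) -> int:
    # --gpus all passes the reserved device straight through to the container
    proc = subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all", image, *command],
        check=False,
    )
    return proc.returncode

# Two applications queue jobs; Ray serializes them on the single GPU.
tcc = run_in_container.remote("tcc-dev", ["papermill", "in.ipynb", "out.ipynb"])
frcnn = run_in_container.remote("frcnn-dev", ["python", "train.py"])
print(ray.get([tcc, frcnn]))
```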
Validated hardware (2026-03-14):
| Resource | Value |
|---|---|
| Node | gpu-node-3 (192.168.1.78) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GiB VRAM) |
| CUDA | 13.0 |
| Docker | 29.3 |
| Ray | 2.54.0 |
| Dashboard | http://192.168.1.78:8265 |
Cluster configuration:
```yaml
# infra/ray/cluster.yaml
cluster_name: auraison-gpu
provider:
  type: local
  head_ip: 192.168.1.78   # gpu-node-3
  worker_ips: []          # single-node: head is also the worker
auth:
  ssh_user: pantelis.monogioudis
  ssh_private_key: ~/.ssh/keys/id_ed25519
head_start_ray_commands:
  - ray stop
  - source ~/ray-venv/bin/activate && ray start --head --port=6379 --dashboard-host=0.0.0.0 --num-gpus=1
```

```bash
ray up infra/ray/cluster.yaml    # start Ray head on gpu-node-3
ray status                       # check cluster health
ray down infra/ray/cluster.yaml  # tear down
```
Environments
The user plane is structured as two named Ray environments on Proxmox VMs, each with distinct hardware profiles and workload classes:
| Environment | Hardware | Workload class |
|---|---|---|
| torch.dev.gpu | GPU nodes, CUDA, PyTorch | Notebook execution, VLA training, ML inference, Cosmos-Predict2 world model inference, Cosmos-Transfer2.5 sim2real augmentation |
| ros.dev.gpu | GPU nodes, ROS 2 Jazzy | Robot simulation, Nav2, YOLOv8, SLAM, Cosmos-Reason2 physical reasoning |
In v1, both environments run on the same Ray cluster with workloads differentiated by runtime environment and resource requirements. Workloads are Ray Jobs submitted by the control plane and executed by Ray workers.
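To make the differentiation concrete, a minimal sketch against the Ray Jobs SDK, where the inline runtime_env dict stands in for the torch-gpu.yaml runtime environment file and the entrypoint is illustrative:

```python
# Sketch: selecting torch.dev.gpu vs ros.dev.gpu on one cluster via Ray
# runtime environments. Package list and entrypoint are illustrative.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://192.168.1.78:8265")  # dashboard from the table above

TORCH_DEV_GPU = {"pip": ["torch", "papermill", "wandb"]}  # stand-in for torch-gpu.yaml

job_id = client.submit_job(
    entrypoint="papermill train.ipynb out.ipynb",
    runtime_env=TORCH_DEV_GPU,   # selects the environment, not a dedicated node
    entrypoint_num_gpus=1,       # Ray's GPU lock, as described above
)
print(job_id)
```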
System context (C4 Level 2)
Reference application: turtlebot-maze
turtlebot-maze is the canonical user-plane application. It demonstrates all user-plane
capabilities in a single end-to-end scenario:
```text
Claude Code /navigate skill (user plane — real-time)
  → ros-mcp-server
        MCP tool call: publish_cmd_vel, get_odom, set_nav_goal
  → rosbridge WebSocket :9090
  → ROS 2 Nav2 action server
  → TurtleBot base controller
  → Gazebo simulation (or physical robot)
```
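For orientation, the MCP tools reduce to ordinary rosbridge traffic; a minimal roslibpy sketch of what publish_cmd_vel amounts to on the wire (host and velocity values are illustrative):

```python
# What an MCP publish_cmd_vel call reduces to at the rosbridge layer.
import roslibpy

ros = roslibpy.Ros(host="192.168.1.78", port=9090)
ros.run()  # connect to the rosbridge WebSocket

cmd_vel = roslibpy.Topic(ros, "/cmd_vel", "geometry_msgs/Twist")
cmd_vel.publish(roslibpy.Message({
    "linear": {"x": 0.2, "y": 0.0, "z": 0.0},    # forward at 0.2 m/s
    "angular": {"x": 0.0, "y": 0.0, "z": 0.5},   # turn at 0.5 rad/s
}))

ros.terminate()
```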
Supporting subsystems:
On ros.dev.gpu Ray workers:
- Behavior trees (py_trees / BehaviorTree.CPP): autonomous navigation + search sequences
- YOLOv8: object detection via PyTorch, decoupled from ROS via Zenoh transport
- stella_vslam: visual SLAM for mapping and localization
- Nav2: path planning and collision avoidance
- Cosmos-Reason2 (auraison-eh1): physical AI reasoning — evaluates action feasibility using spatial/physics common sense before Nav2 goal dispatch
On torch.dev.gpu Ray workers:
- Cosmos-Predict2 (auraison-oys): world model inference — given current camera frame + planned action, generates predicted future video frames
- Cosmos-Transfer2.5 (auraison-i6l): sim2real augmentation — translates Gazebo/Predict2 synthetic video to photorealistic video; conditioned on depth, edge, and segmentation control maps extracted from Gazebo; documented +68.5% mission success rate improvement on navigation tasks
Predict → Transfer → Reason → Execute loop (v1.5):
```text
Cosmos-Predict2 (torch.dev.gpu)
    current frame + proposed action → synthetic trajectory video
→ Cosmos-Transfer2.5 (torch.dev.gpu)
    synthetic → photorealistic (depth + edge control maps from Gazebo)
→ Cosmos-Reason2 (ros.dev.gpu)
    feasibility evaluation (physics / obstacle / reachability)
    → go: Nav2 goal dispatched
    → no-go: action rejected, behavior tree selects alternative
```
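A sketch of the gating logic, where every helper is a stub standing in for a call to the named service and GO_THRESHOLD is an assumed tuning parameter:

```python
# Hypothetical orchestration of the loop. Each stub stands in for an RPC to
# the named Cosmos service; GO_THRESHOLD is assumed, not a documented value.
GO_THRESHOLD = 0.8

def predict2_rollout(frame, action):        # Cosmos-Predict2 (torch.dev.gpu)
    raise NotImplementedError("call the Predict2 inference service")

def transfer25_augment(video, controls):    # Cosmos-Transfer2.5 (torch.dev.gpu)
    raise NotImplementedError("call the Transfer2.5 augmentation service")

def reason2_feasibility(video, action):     # Cosmos-Reason2 (ros.dev.gpu)
    raise NotImplementedError("call the Reason2 feasibility service")

def dispatch_nav2_goal(action):             # Nav2 action server
    raise NotImplementedError("send the NavigateToPose goal")

def evaluate_action(frame, action) -> bool:
    synthetic = predict2_rollout(frame, action)
    realistic = transfer25_augment(synthetic, controls=("depth", "edge"))
    if reason2_feasibility(realistic, action) >= GO_THRESHOLD:
        dispatch_nav2_goal(action)          # go
        return True
    return False                            # no-go: behavior tree picks an alternative
```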
Synthetic data generation (SDG) pipeline:
```text
Gazebo rollouts (ros.dev.gpu) → Cosmos-Predict2 → Cosmos-Transfer2.5
→ augmented video + action labels → Parquet dataset in data-plane lakehouse
→ VLA / Nav2 policy fine-tuning jobs on torch.dev.gpu
```
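For the dataset leg, a minimal pyarrow sketch of the Parquet write; the column names and output path are assumptions, not the committed lakehouse schema:

```python
# Illustrative SDG output write; columns and path are assumed.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "episode_id": ["maze-0001", "maze-0001"],
    "frame_path": ["ep0001/f000.png", "ep0001/f001.png"],
    "action": ["forward 0.2", "rotate 0.5"],
    "source": ["transfer2.5", "transfer2.5"],
})
pq.write_table(table, "turtlebot-maze-augmented.parquet")  # uploaded to the lakehouse downstream
```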
The control plane manages the ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and
experiment bookkeeping. It does not participate in the real-time control loop — that loop
runs entirely within the user plane.
Additional reference applications
ar4-physical-ai (aegean-ai/ar4-physical-ai) — VLA manipulation platform for the AR4
MK3 6-DOF robotic arm. Uses LeRobot (lerobot-ros / AnninAR4) for recording, training, and
inference; Zenoh middleware for non-ROS transport; MoveIt2 + ros2_control for safe trajectory
execution. Runs on both ros.dev.gpu (ROS 2 + Gazebo Harmonic) and torch.dev.gpu (VLA
inference via vLLM + Zenoh queryable). See user-plane/ar4-digital-twin.md for the digital
twin design.
counter-uas (aegean-ai/counter-uas, v2) — Counter-UAS system with VisDrone perception,
Unreal Engine 5 simulation, and General Robotics GRID integration. Demonstrates the platform's
support for non-manipulation, non-navigation workloads (aerial perception + tracking).
tube-quality-control (colgate/tube-quality-control) — AI-driven manufacturing quality
control for tube production lines. Runs on torch.dev.gpu. Demonstrates the platform's support
for industrial computer vision workloads:
- Anomaly detection: PatchCore, EfficientAD (anomalib) for unsupervised defect detection
- Supervised contrastive learning: ResNet50 fine-tuning with SupCon loss for defect classification across 2/4/11/12 class configurations
- Embedding-based similarity search: Qdrant vector database for nearest-neighbor defect retrieval; pretrained and fine-tuned timm embeddings
- Dataset pipeline: S3/MinIO raw images → COCO-format annotations → HuggingFace Hub datasets
- Experiment tracking: ClearML for training lineage and model versioning
- Visualization: FiftyOne for dataset exploration and annotation review
- Domain model: Pydantic-based entities (MachineSettingModel, ImageModel, AnomalyLabelModel) with Hydra configuration management
- Infrastructure: Docker GPU containers, MongoDB (FiftyOne), Qdrant, MinIO, NATS (event bus)
- Edge deployment path: OpenVINO export for factory-floor inference
All reference applications share the same Layer C abstraction: vLLM inference serving via
Zenoh queryable on torch.dev.gpu. Each plugs in its own model backend (Cosmos stack,
LeRobot VLA, perception/tracking models, anomaly detection) without platform changes.
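A minimal sketch of that shared abstraction, assuming the zenoh-python 1.x API; the key expression is illustrative and the model name is a placeholder for whichever backend the application plugs in:

```python
# Sketch of the shared Layer C pattern: a vLLM engine served via a Zenoh
# queryable. zenoh-python 1.x API assumed; adjust payload handling per version.
import time
import zenoh
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Cosmos-Reason1-7B")  # placeholder; each app supplies its own backend
sampling = SamplingParams(max_tokens=256)

def on_query(query: zenoh.Query) -> None:
    prompt = query.payload.to_string() if query.payload else ""
    text = llm.generate([prompt], sampling)[0].outputs[0].text
    query.reply(query.key_expr, text.encode())  # respond on the same key

session = zenoh.open(zenoh.Config())
queryable = session.declare_queryable("auraison/infer", on_query)  # illustrative key
while True:              # keep the queryable alive
    time.sleep(1)
```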
Interfaces
Control → User plane (job submission)
The control plane emits a JobSpec to the user plane. In v1, this is a direct ray job submit
CLI invocation by the NotebookAgent subprocess. In v2, the control plane writes to a NATS subject
and a user-plane executor subscribes and submits.
```text
JobSpec {
  job_id: UUID
  environment: "torch.dev.gpu" | "ros.dev.gpu"
  entrypoint: path to notebook or ROS launch file
  resources: {num_gpus: int, num_cpus: int, memory_gb: float}
  parameters: dict (papermill parameters or ROS args)
  copyback_url: callback URL for result delivery
  max_duration: seconds (safety: forced termination if exceeded)
}
```
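For illustration, one plausible translation of a JobSpec into that CLI invocation; the flag mapping and output notebook name are assumptions, not the committed NotebookAgent behavior:

```python
# Hypothetical JobSpec → ray job submit translation (v1 notebook path).
import json
import subprocess

def dispatch(spec: dict) -> None:
    env_yaml = ("infra/ray/runtime_envs/torch-gpu.yaml"
                if spec["environment"] == "torch.dev.gpu"
                else "infra/ray/runtime_envs/ros-gpu.yaml")
    subprocess.run(
        ["ray", "job", "submit",
         "--address", "http://192.168.1.78:8265",
         "--runtime-env", env_yaml,
         "--entrypoint-num-gpus", str(spec["resources"]["num_gpus"]),
         "--",
         "papermill", spec["entrypoint"], "output.ipynb",
         "-y", json.dumps(spec["parameters"])],  # JSON is valid YAML for papermill -y
        timeout=spec["max_duration"],            # coarse safety backstop
        check=True,
    )
```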
User → Control plane (status events)
The user plane emits status events back to the control plane. In v1, the control plane polls via the ClusterAgent. In v1.5 and v2, user-plane workers push events to a Redis Stream (v1.5) or a NATS subject (v2) and the control plane subscribes.
```text
StatusEvent {
  job_id: UUID
  ray_job_id: str
  status: PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
  timestamp: ISO 8601
  logs: str (tail of worker stdout/stderr)
  wandb_run_id?: str (if W&B logging active)
}
```
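A minimal sketch of the v1 polling side, assembling a StatusEvent from the Ray Jobs SDK (the SDK calls are standard Ray 2.x; the event assembly itself is illustrative):

```python
# v1 polling sketch: derive a StatusEvent from the Ray Jobs API.
from datetime import datetime, timezone
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://192.168.1.78:8265")

def poll(job_id: str, ray_job_id: str) -> dict:
    return {
        "job_id": job_id,
        "ray_job_id": ray_job_id,
        "status": client.get_job_status(ray_job_id).value,  # PENDING / RUNNING / ...
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "logs": client.get_job_logs(ray_job_id)[-2000:],    # tail only
    }
```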
Copyback webhook
On job completion, the Ray worker calls POST {copyback_url} with the executed notebook
payload. The control plane relays this to eaia for MDX regeneration.
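A minimal sketch of that call, assuming a multipart upload; the field names are assumptions:

```python
# Sketch of the completion callback a worker might make.
import requests

def copyback(spec: dict, notebook_path: str) -> None:
    with open(notebook_path, "rb") as f:
        requests.post(
            spec["copyback_url"],
            data={"job_id": spec["job_id"]},
            files={"notebook": (notebook_path, f, "application/x-ipynb+json")},
            timeout=30,
        )
```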
Event substrate
The user plane is natively event-driven. The internal communication substrate within the user plane uses standard robotics and ML messaging:
| Use case | Technology |
|---|---|
| ROS node ↔ ROS node (real-time) | DDS (rmw_fastrtps) |
| ROS ↔ non-ROS containers (e.g. YOLOv8) | Zenoh bridge |
| ML worker logging | W&B SDK (direct to W&B API) |
| User plane → control plane (v1) | HTTP polling (pull) |
| User plane → control plane (v1.5) | Redis Streams (push) |
| User plane → control plane (v2) | NATS subjects (push) |
Zenoh is used to decouple non-ROS workloads (YOLOv8 inference, SLAM) from the ROS graph. A Zenoh bridge republishes DDS topics as Zenoh subjects, allowing PyTorch containers without ROS deps to subscribe to sensor data.
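A minimal sketch of the consuming side, assuming the zenoh-python 1.x API and an illustrative key expression:

```python
# A PyTorch container consuming bridged sensor data with no ROS dependencies.
# zenoh-python 1.x API assumed; key expression is illustrative.
import time
import zenoh

def on_frame(sample: zenoh.Sample) -> None:
    raw = sample.payload.to_bytes()   # serialized bridged image message
    # ...decode and run YOLOv8 inference on the frame...

session = zenoh.open(zenoh.Config())
subscriber = session.declare_subscriber("camera/image_raw", on_frame)
while True:                           # keep the subscription alive
    time.sleep(1)
```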
Safety constraints
The user plane runs workloads that can actuate physical hardware. Safety is enforced at the plane boundary, before execution begins:
| Constraint | Mechanism |
|---|---|
| Max job duration | max_duration in JobSpec; Ray Job timeout |
| Resource caps | Ray resource scheduling (num_gpus, num_cpus) |
| Process isolation | Each environment uses separate Ray runtime environments |
| Actuation gating (v2) | Confidence threshold check before Nav2 goal submission |
| Emergency stop | Control plane sends SIGTERM via ray job stop; ros-mcp-server has /emergency_stop MCP tool |
| Circuit breaker | ClusterAgent monitors anomaly rate; pauses cluster on threshold breach |
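As one way to realize the max_duration row above, a watchdog sketch built on the Ray Jobs SDK; the loop itself is an assumption, not committed code:

```python
# Hypothetical max-duration watchdog: stop the Ray Job once the JobSpec
# deadline passes (SDK calls are real Ray 2.x APIs; the loop is a sketch).
import time
from ray.job_submission import JobSubmissionClient, JobStatus

def enforce_max_duration(client: JobSubmissionClient,
                         ray_job_id: str, max_duration: float) -> None:
    deadline = time.monotonic() + max_duration
    while client.get_job_status(ray_job_id) in (JobStatus.PENDING, JobStatus.RUNNING):
        if time.monotonic() >= deadline:
            client.stop_job(ray_job_id)   # same SIGTERM path as emergency stop
            return
        time.sleep(5)
```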
v1 Hybrid Compromise
In v1, the claude -p subprocesses invoked by the control plane (NotebookAgent, ClusterAgent)
conflate control-plane reasoning with user-plane execution. A NotebookAgent subprocess
both decides how to submit a job (control plane cognition) and issues the ray job submit
command (user plane execution). This is a deliberate pragmatic choice for v1.
The subprocess boundary (claude -p as a child process of the FastAPI app) serves as the
physical plane separator in v1: the control plane process never directly touches kubectl
or ray; only the subprocess does. This is sufficient isolation for v1 but conflates the two
planes logically.
In v2, the separation becomes explicit: control agents emit JobSpec messages to a NATS
subject; a user-plane executor (a lightweight worker with no reasoning capability) consumes
the spec and runs ray job submit. The control agent never touches infrastructure CLIs
directly.
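A minimal sketch of such an executor, assuming nats-py and an illustrative subject name:

```python
# Sketch of the v2 pure executor: a worker with no reasoning capability that
# consumes JobSpec messages and dispatches them verbatim. Subject and URL
# are assumptions.
import asyncio
import json
import subprocess
import nats

async def main() -> None:
    nc = await nats.connect("nats://nats.internal:4222")

    async def on_jobspec(msg) -> None:
        spec = json.loads(msg.data)
        # No interpretation: run exactly what the control plane specified.
        subprocess.run(["ray", "job", "submit", "--", *spec["entrypoint"].split()])

    await nc.subscribe("userplane.jobspec", cb=on_jobspec)
    await asyncio.Event().wait()   # serve forever

asyncio.run(main())
```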
Infrastructure
Ray on Proxmox VMs (v1)
Ray is deployed natively on Proxmox VMs using the on-premise cluster launcher. No Kubernetes
layer is required. The cluster configuration and VM provisioning are managed via Terraform
(see infra/terraform/) and the Ray cluster YAML.
```text
infra/ray/
├── cluster.yaml            Ray cluster config (head_ip + worker_ips)
└── runtime_envs/
    ├── torch-gpu.yaml      Runtime env for torch.dev.gpu workloads
    └── ros-gpu.yaml        Runtime env for ros.dev.gpu workloads
```
Claude Code agents manage the cluster lifecycle via:
```bash
ray up cluster.yaml              # start/restart the cluster
ray submit cluster.yaml job.py   # submit a job
ray status                       # check cluster health
ray down cluster.yaml            # tear down
```
KubeRay (deferred to v2)
KubeRay operator and RayCluster CRs are available in infra/k8s/ but deferred until
multi-tenancy or cloud burst auto-scaling is required. The K8s manifests are kept as a
future migration path, not the v1 deployment target.
Worker environments
Each environment uses a purpose-built VM image or conda/pip runtime environment:
| Environment | VM setup | Key packages |
|---|---|---|
| torch.dev.gpu | Ubuntu 24.04 + CUDA 12.x + Ray | PyTorch, papermill, wandb, diffusers, Cosmos-Predict2, Cosmos-Transfer2.5 |
| ros.dev.gpu | Ubuntu 24.04 + ROS 2 Jazzy + CUDA + Ray | ros2, nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 (vLLM) |
Evolution Path
v1 — Ray on Proxmox VMs (native launcher); ray job submit via NotebookAgent subprocess (hybrid)

v1.5 —
- Redis Streams: user plane emits StatusEvents; control plane subscribes (decoupled polling)
- Cosmos-Reason2 on ros.dev.gpu for physical reasoning
- Cosmos-Predict2 + Cosmos-Transfer2.5 on torch.dev.gpu for world model inference + sim2real
- Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
- SDG pipeline: Gazebo → Predict2 → Transfer2.5 → lakehouse augmented dataset

v2 —
- NATS: JobSpec dispatch; pure executor workers (no LLM); Zenoh → NATS bridge for ROS events
- Actuation confidence gates backed by Cosmos-Reason2 feasibility scores
- Cosmos-Predict2 post-trained on turtlebot-maze ROS bag recordings
- Cosmos-Transfer2.5 Real2Real augmentation of turtlebot-maze ROS bags for policy robustness
- Management plane subscribes to execution telemetry

v3 —
- Edge deployment: Cosmos-Reason2 (2B, FP8) on Jetson AGX Orin/Thor for on-robot inference
- Dual-compute: Jetson (edge reasoning) + KubeRay (training, SDG, heavy inference)
Requirements (UP-xxx)
Traces to system-level requirements in architecture/four-plane.md.
| ID | Requirement | Traces to | Version |
|---|---|---|---|
| UP-001 | The user plane shall host agentic workloads: VLA, behavior trees, robot control, ML training, SLAM, object detection | SYS-001 | v1 |
| UP-002 | The user plane shall have real-time (ms) latency, stateful per-session | SYS-001 | v1 |
| UP-003 | User plane failure shall stop the agent/robot but the control plane shall continue | SYS-002 | v1 |
| UP-004 | The user plane shall remain operational during control plane outages | SYS-002 | v1 |
| UP-005 | The user plane shall provide two Ray environments: torch.dev.gpu and ros.dev.gpu on Proxmox VMs | SYS-001 | v1 |
| UP-006 | torch.dev.gpu shall support CUDA, PyTorch, notebook execution, VLA training, Cosmos-Predict2, Cosmos-Transfer2.5 | UP-005 | v1 |
| UP-007 | ros.dev.gpu shall support ROS 2 Jazzy, Nav2, YOLOv8, SLAM, Cosmos-Reason2 | UP-005 | v1 |
| UP-008 | The user plane shall manage Ray cluster lifecycles via ray up/down but NOT participate in real-time control | SYS-002 | v1 |
| UP-009 | The user plane shall accept JobSpec from control plane: job_id, environment, entrypoint, resources, parameters, copyback_url, max_duration | — | v1 |
| UP-010 | The user plane shall emit StatusEvent to control plane: job_id, status, timestamp, logs, wandb_run_id | — | v1 |
| UP-011 | The user plane shall call copyback webhook on job completion | CP-015 | v1 |
| UP-012 | ROS node communication shall use DDS (rmw_fastrtps) | — | v1 |
| UP-013 | Non-ROS containers shall use Zenoh bridge to decouple from ROS graph | SYS-004 | v1 |
| UP-014 | Maximum job duration shall be enforced via Ray Job timeout | CP-020 | v1 |
| UP-015 | Resource caps shall be enforced via Ray resource scheduling (num_gpus, num_cpus) | — | v1 |
| UP-016 | Environments shall be isolated via separate Ray runtime environments | — | v1 |
| UP-017 | Actuation gating shall require confidence threshold before goal submission | SYS-005 | v2 |
| UP-018 | Emergency stop shall be supported via ray job stop and ros-mcp-server /emergency_stop | — | v1 |
| UP-019 | Circuit breaker: ClusterAgent shall pause cluster on anomaly rate threshold breach | SYS-002 | v1 |
| UP-020 | The user plane shall support turtlebot-maze with Claude Code /navigate via ros-mcp-server | SYS-003 | v1 |
| UP-021 | turtlebot-maze shall support behavior trees for autonomous navigation | SYS-003 | v1 |
| UP-022 | turtlebot-maze shall support YOLOv8 object detection via Zenoh | SYS-003, SYS-004 | v1 |
| UP-023 | turtlebot-maze shall support stella_vslam for mapping and localization | SYS-003 | v1 |
| UP-024 | turtlebot-maze shall support Cosmos-Reason2 for physical reasoning and feasibility evaluation | SYS-003, SYS-005 | v1.5 |
| UP-025 | Cosmos-Predict2 shall run on torch.dev.gpu for world model inference | SYS-005 | v1.5 |
| UP-026 | Cosmos-Transfer2.5 shall run on torch.dev.gpu for sim2real augmentation | SYS-005 | v1.5 |
| UP-027 | turtlebot-maze shall implement Predict → Transfer → Reason → Execute loop | SYS-005 | v1.5 |
| UP-028 | SDG pipeline: Gazebo → Cosmos-Predict2 → Cosmos-Transfer2.5 → lakehouse | SYS-005, SYS-007 | v1.5 |
| UP-029 | torch.dev.gpu Ray worker image: rayproject/ray-ml with PyTorch, papermill, wandb, diffusers, Cosmos | UP-006 | v1 |
| UP-030 | ros.dev.gpu Ray worker image: ROS 2 Jazzy + Ray base with nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 | UP-007 | v1 |
| UP-031 | Ray cluster config shall be stored in infra/ray/cluster.yaml; KubeRay manifests deferred to infra/k8s/ (v2) | — | v1 |
| UP-032 | In v1, jobs shall be dispatched synchronously via claude -p subprocess | CP-026 | v1 |
| UP-033 | In v1.5, StatusEvent shall be emitted via Redis Streams | — | v1.5 |
| UP-034 | In v2, jobs shall be dispatched via NATS; pure executor workers shall consume JobSpec | CP-027 | v2 |
| UP-035 | In v3, Cosmos-Reason2 (2B, FP8) shall run on Jetson AGX Orin/Thor for on-robot inference | SYS-005 | v3 |
| UP-036 | In v3, dual-compute: Jetson (edge) + KubeRay (training, SDG, heavy inference) | SYS-005 | v3 |
See also
- docs/user-plane/digital-twins.mdx — Digital Twins: persistent world model spanning user plane, data plane, and control plane; TwinAgent; TurtleBot reference asset
- docs/user-plane/ar4-digital-twin.mdx — AR4-MK3 as second reference asset; layered plane decomposition; schema extensions for 6-DOF arms
- docs/control-plane/design.mdx — NotebookAgent, ClusterAgent, AgentOps subsystem that orchestrates this plane
- docs/data-plane/design.mdx — data plane that receives user-plane outputs (ingestion, state snapshots)
References
- Cosmos-Reason2 on Jetson — deploying Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM; demonstrates real-time webcam inference and robotic manipulation (pick-and-place). Key for v3 edge deployment: the 2B model runs on Jetson AGX Orin 64 GB at 8192 token context, and on Orin Super Nano at 256 token context with aggressive memory tuning.
- Cosmos-Reason2 — physical AI reasoning VLM (2B/8B); spatial, temporal, physics comprehension (auraison-eh1)
- Cosmos-Predict2.5 — world foundation model for future state prediction via video generation (auraison-oys)
- Cosmos-Transfer2.5 — multi-controlnet sim2real augmentation; +68.5% mission success rate on navigation tasks (auraison-i6l)