User Plane Design

Date: 2026-02-23 · Updated: 2026-03-14 · Status: Approved (v1)


Problem

The user plane is where customer agentic workloads execute: VLA agents, behavior trees, real-time robot control, notebook-based ML training, SLAM, object detection. These workloads have fundamentally different requirements from the control plane — they are real-time, stateful per-session, and hardware-bound. They must continue running even when the control plane is degraded or unreachable.

The user plane is the execution mesh. It does not reason about what to run; it runs what it is told, as fast as the hardware allows.


Goals

  • Provide a multi-environment execution mesh for heterogeneous agentic workloads
  • Hardware abstraction: workloads declare resource requirements; the plane satisfies them
  • Isolation between workloads: a failing Nav2 job must not affect a running notebook job
  • Accept job specifications from the control plane; emit status events back
  • Support the canonical turtlebot-maze reference application end-to-end
  • Remain operational during control plane outages

Non-goals (v1)

  • Reasoning, planning, or orchestration — that is the control plane
  • Billing and quota enforcement — that is the management plane
  • Full LLM reasoning loops within user plane agents — constrained tool use only (v1)

Architecture

Deployment: Ray on Proxmox VMs (v1)

The user plane runs Ray natively on Proxmox VMs using the Ray on-premise cluster launcher -- not Kubernetes. This is a deliberate simplification for a single-team AI lab.

Why not KubeRay/K8s:

  • K8s adds a second orchestrator on top of Proxmox with significant maintenance overhead (etcd, kubelet, CNI, cert rotation, device plugins, resource quotas)
  • For a single team with known hardware, Ray's native SSH-based launcher is sufficient
  • Claude Code can manage the cluster via ray up/down/submit -- three commands vs kubectl + helm + CRD YAML
  • Faster cold starts (direct process start vs pod scheduling + image pull)
  • Direct GPU access without K8s device plugins
  • KubeRay is justified only for multi-tenancy, multiple Ray versions, or cloud burst auto-scaling -- none of which apply in v1

Job execution model (Level 1: Ray dispatches Docker containers):

Ray runs bare on the VM and has Docker socket access. Each job runs inside an application-specific Docker container. Ray handles GPU scheduling -- multiple applications can queue jobs, and Ray executes them sequentially as GPU resources become available.

Application A (tcc):   ray job submit --num-gpus 1 → docker run tcc-dev papermill ...
Application B (frcnn): ray job submit --num-gpus 1 → docker run frcnn-dev train.py ...
                                |
                                v
                 Ray head (gpu-node-3, 1 GPU)
                 +------------------------+
                 | Job queue:             |
                 | 1. tcc    (running)    | ← GPU allocated
                 | 2. frcnn  (pending)    | ← waits for GPU
                 +------------------------+

Each application brings its own Docker image (via its own docker-compose.yml and Dockerfile). Ray does not know or care that jobs are Docker containers -- it manages the GPU lock. When a job finishes and releases the GPU, the next pending job starts.
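A minimal sketch of this dispatch pattern from the Ray side, assuming the image names from the diagram above and illustrative entrypoints (the docker arguments are a sketch, not the committed implementation):

import subprocess

import ray

ray.init(address="auto")  # connect to the head started by `ray up`

@ray.remote(num_gpus=1)
def run_containerized_job(image: str, args: list[str]) -> int:
    # Ray holds the GPU lease for the lifetime of this task; the workload
    # itself runs inside the application's own Docker container.
    gpu = ray.get_gpu_ids()[0]
    proc = subprocess.run(["docker", "run", "--rm", f"--gpus=device={gpu}", image, *args])
    return proc.returncode

# Two applications queue jobs; Ray serializes them on the single GPU.
tcc = run_containerized_job.remote("tcc-dev", ["papermill", "in.ipynb", "out.ipynb"])
frcnn = run_containerized_job.remote("frcnn-dev", ["python", "train.py"])
print(ray.get([tcc, frcnn]))  # return codes once both complete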

Validated hardware (2026-03-14):

| Resource | Value |
|----------|-------|
| Node | gpu-node-3 (192.168.1.78) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GiB VRAM) |
| CUDA | 13.0 |
| Docker | 29.3 |
| Ray | 2.54.0 |
| Dashboard | http://192.168.1.78:8265 |

Cluster configuration:

# infra/ray/cluster.yaml
cluster_name: auraison-gpu
provider:
  type: local
  head_ip: 192.168.1.78   # gpu-node-3
  worker_ips: []          # single-node: head is also the worker
auth:
  ssh_user: pantelis.monogioudis
  ssh_private_key: ~/.ssh/keys/id_ed25519
head_start_ray_commands:
  - ray stop
  - source ~/ray-venv/bin/activate && ray start --head --port=6379 --dashboard-host=0.0.0.0 --num-gpus=1

ray up infra/ray/cluster.yaml     # start Ray head on gpu-node-3
ray status                        # check cluster health
ray down infra/ray/cluster.yaml   # tear down

Environments

The user plane is structured as two named Ray environments on Proxmox VMs, each with distinct hardware profiles and workload classes:

| Environment | Hardware | Workload class |
|-------------|----------|----------------|
| torch.dev.gpu | GPU nodes, CUDA, PyTorch | Notebook execution, VLA training, ML inference, Cosmos-Predict2 world model inference, Cosmos-Transfer2.5 sim2real augmentation |
| ros.dev.gpu | GPU nodes, ROS 2 Jazzy | Robot simulation, Nav2, YOLOv8, SLAM, Cosmos-Reason2 physical reasoning |

In v1, both environments run on the same Ray cluster with workloads differentiated by runtime environment and resource requirements. Workloads are Ray Jobs submitted by the control plane and executed by Ray workers.
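For illustration, a torch.dev.gpu-class submission via the Ray job SDK against the dashboard address validated above (the runtime_env contents here are placeholders; v1 uses the YAMLs under infra/ray/runtime_envs/):

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://192.168.1.78:8265")

job_id = client.submit_job(
    entrypoint="papermill train.ipynb out.ipynb",
    runtime_env={"working_dir": ".", "pip": ["papermill", "wandb"]},  # placeholder env
    entrypoint_num_gpus=1,  # queues behind any job currently holding the GPU
)
print(job_id)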

System context (C4 Level 2)

Reference application: turtlebot-maze

turtlebot-maze is the canonical user-plane application. It demonstrates all user-plane capabilities in a single end-to-end scenario:

Claude Code /navigate skill (user plane — real-time)
  → ros-mcp-server
      MCP tool calls: publish_cmd_vel, get_odom, set_nav_goal
  → rosbridge WebSocket :9090
  → ROS 2 Nav2 action server
  → TurtleBot base controller
  → Gazebo simulation (or physical robot)
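The rosbridge hop can be exercised directly from any WebSocket client; a minimal sketch using roslibpy (topic name and message shape follow standard TurtleBot conventions and are not a verified configuration):

import roslibpy

ros = roslibpy.Ros(host="localhost", port=9090)  # rosbridge WebSocket endpoint
ros.run()

cmd_vel = roslibpy.Topic(ros, "/cmd_vel", "geometry_msgs/Twist")
cmd_vel.publish(roslibpy.Message({
    "linear":  {"x": 0.2, "y": 0.0, "z": 0.0},  # forward at 0.2 m/s
    "angular": {"x": 0.0, "y": 0.0, "z": 0.5},  # turn at 0.5 rad/s
}))

ros.terminate()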

Supporting subsystems:

On ros.dev.gpu Ray workers:

  • Behavior trees (py_trees / BehaviorTree.CPP): autonomous navigation + search sequences
  • YOLOv8: object detection via PyTorch, decoupled from ROS via Zenoh transport
  • stella_vslam: visual SLAM for mapping and localization
  • Nav2: path planning and collision avoidance
  • Cosmos-Reason2 (auraison-eh1): physical AI reasoning — evaluates action feasibility using spatial/physics common sense before Nav2 goal dispatch

On torch.dev.gpu Ray workers:

  • Cosmos-Predict2 (auraison-oys): world model inference — given current camera frame + planned action, generates predicted future video frames
  • Cosmos-Transfer2.5 (auraison-i6l): sim2real augmentation — translates Gazebo/Predict2 synthetic video to photorealistic video; conditioned on depth, edge, and segmentation control maps extracted from Gazebo; documented +68.5% mission success rate improvement on navigation tasks

Predict → Transfer → Reason → Execute loop (v1.5):

Cosmos-Predict2 (torch.dev.gpu)
    current frame + proposed action → synthetic trajectory video
  → Cosmos-Transfer2.5 (torch.dev.gpu)
      synthetic → photorealistic (depth + edge control maps from Gazebo)
  → Cosmos-Reason2 (ros.dev.gpu)
      feasibility evaluation (physics / obstacle / reachability)
      → go:    Nav2 goal dispatched
      → no-go: action rejected, behavior tree selects alternative
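In orchestration terms the loop is a pipeline with a gate at the end. The sketch below is shape-only: predict_trajectory, transfer_to_photoreal, and evaluate_feasibility are hypothetical stand-ins for the actual Cosmos invocations, which run as Ray tasks on the environments noted above:

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def predict_trajectory(frame: bytes, action: dict) -> bytes:
    return b"synthetic-video"  # placeholder for Cosmos-Predict2 inference

@ray.remote(num_gpus=1)
def transfer_to_photoreal(video: bytes, control_maps: dict) -> bytes:
    return b"photoreal-video"  # placeholder for Cosmos-Transfer2.5 inference

@ray.remote(num_gpus=1)
def evaluate_feasibility(video: bytes) -> float:
    return 0.87  # placeholder for Cosmos-Reason2 feasibility score in [0, 1]

def gate_action(frame: bytes, action: dict, control_maps: dict, threshold: float = 0.7) -> bool:
    video = predict_trajectory.remote(frame, action)
    photoreal = transfer_to_photoreal.remote(video, control_maps)
    score = ray.get(evaluate_feasibility.remote(photoreal))
    return score >= threshold  # go: dispatch Nav2 goal; no-go: behavior tree picks an alternative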

Synthetic data generation (SDG) pipeline:

Gazebo rollouts (ros.dev.gpu) → Cosmos-Predict2 → Cosmos-Transfer2.5
→ augmented video + action labels → Parquet dataset in data-plane lakehouse
→ VLA / Nav2 policy fine-tuning jobs on torch.dev.gpu

The control plane manages the ros.dev.gpu and torch.dev.gpu RayCluster lifecycles and experiment bookkeeping. It does not participate in the real-time control loop — that loop runs entirely within the user plane.

Additional reference applications

ar4-physical-ai (aegean-ai/ar4-physical-ai) — VLA manipulation platform for the AR4 MK3 6-DOF robotic arm. Uses LeRobot (lerobot-ros / AnninAR4) for recording, training, and inference; Zenoh middleware for non-ROS transport; MoveIt2 + ros2_control for safe trajectory execution. Runs on both ros.dev.gpu (ROS 2 + Gazebo Harmonic) and torch.dev.gpu (VLA inference via vLLM + Zenoh queryable). See user-plane/ar4-digital-twin.md for the digital twin design.

counter-uas (aegean-ai/counter-uas, v2) — Counter-UAS system with VisDrone perception, Unreal Engine 5 simulation, and General Robotics GRID integration. Demonstrates the platform's support for non-manipulation, non-navigation workloads (aerial perception + tracking).

tube-quality-control (colgate/tube-quality-control) — AI-driven manufacturing quality control for tube production lines. Runs on torch.dev.gpu. Demonstrates the platform's support for industrial computer vision workloads:

  • Anomaly detection: PatchCore, EfficientAD (anomalib) for unsupervised defect detection
  • Supervised contrastive learning: ResNet50 fine-tuning with SupCon loss for defect classification across 2/4/11/12 class configurations
  • Embedding-based similarity search: Qdrant vector database for nearest-neighbor defect retrieval; pretrained and fine-tuned timm embeddings
  • Dataset pipeline: S3/MinIO raw images → COCO-format annotations → HuggingFace Hub datasets
  • Experiment tracking: ClearML for training lineage and model versioning
  • Visualization: FiftyOne for dataset exploration and annotation review
  • Domain model: Pydantic-based entities (MachineSettingModel, ImageModel, AnomalyLabelModel) with Hydra configuration management
  • Infrastructure: Docker GPU containers, MongoDB (FiftyOne), Qdrant, MinIO, NATS (event bus)
  • Edge deployment path: OpenVINO export for factory-floor inference

All reference applications share the same Layer C abstraction: vLLM inference serving via Zenoh queryable on torch.dev.gpu. Each plugs in its own model backend (Cosmos stack, LeRobot VLA, perception/tracking models, anomaly detection) without platform changes.
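A sketch of that Layer C shape, assuming the zenoh-python (>= 1.0 API) and vllm packages; the model name and key expression are placeholders:

import time

import zenoh
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model backend
params = SamplingParams(max_tokens=128)

def on_query(query: zenoh.Query) -> None:
    prompt = query.payload.to_string() if query.payload is not None else ""
    text = llm.generate([prompt], params)[0].outputs[0].text
    query.reply(query.key_expr, text)

session = zenoh.open(zenoh.Config())
queryable = session.declare_queryable("auraison/infer/vla", on_query)  # placeholder key
while True:
    time.sleep(1)  # serve queries until interrupted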


Interfaces

Control → User plane (job submission)

The control plane emits a JobSpec to the user plane. In v1, this is a direct ray job submit CLI invocation by the NotebookAgent subprocess. In v2, the control plane writes to a NATS subject and a user-plane executor subscribes and submits.

JobSpec {
  job_id:        UUID
  environment:   "torch.dev.gpu" | "ros.dev.gpu"
  entrypoint:    path to notebook or ROS launch file
  resources:     {num_gpus: int, num_cpus: int, memory_gb: float}
  parameters:    dict (papermill parameters or ROS args)
  copyback_url:  callback URL for result delivery
  max_duration:  seconds (safety: forced termination if exceeded)
}
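A sketch of the v1 mapping from this spec to a Ray submission (the dataclass mirrors the fields above; the runtime_env selection is illustrative):

from dataclasses import dataclass, field

from ray.job_submission import JobSubmissionClient

@dataclass
class JobSpec:
    job_id: str
    environment: str            # "torch.dev.gpu" | "ros.dev.gpu"
    entrypoint: str             # notebook path or ROS launch file
    resources: dict             # {"num_gpus": 1, "num_cpus": 4, "memory_gb": 16.0}
    parameters: dict = field(default_factory=dict)
    copyback_url: str = ""
    max_duration: int = 3600    # seconds

def submit(spec: JobSpec, client: JobSubmissionClient) -> str:
    # Environment choice selects the matching runtime env (see Infrastructure).
    return client.submit_job(
        submission_id=spec.job_id,
        entrypoint=spec.entrypoint,
        runtime_env={"working_dir": "."},  # placeholder; v1 loads infra/ray/runtime_envs/*.yaml
        entrypoint_num_gpus=spec.resources.get("num_gpus", 0),
        entrypoint_num_cpus=spec.resources.get("num_cpus", 1),
    )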

User → Control plane (status events)

The user plane emits status events back to the control plane. In v1, the control plane polls via the ClusterAgent. In v1.5 and v2, user-plane workers push events to a Redis Stream (v1.5) or a NATS subject (v2), and the control plane subscribes.

StatusEvent {
  job_id:        UUID
  ray_job_id:    str
  status:        PENDING | RUNNING | SUCCEEDED | FAILED | STOPPED
  timestamp:     ISO 8601
  logs:          str (tail of worker stdout/stderr)
  wandb_run_id?: str (if W&B logging active)
}
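The v1 polling path maps Ray's JobStatus one-to-one onto the states above; a minimal sketch (the print stands in for event emission toward the control plane):

import time

from ray.job_submission import JobSubmissionClient

def poll_status(client: JobSubmissionClient, job_id: str, ray_job_id: str) -> None:
    while True:
        status = client.get_job_status(ray_job_id)      # PENDING | RUNNING | ...
        logs = client.get_job_logs(ray_job_id)[-2000:]  # tail of stdout/stderr
        print({"job_id": job_id, "ray_job_id": ray_job_id,
               "status": status.value, "logs": logs})   # placeholder emission
        if status.is_terminal():
            break
        time.sleep(5)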

Copyback webhook

On job completion, the Ray worker calls POST {copyback_url} with the executed notebook payload. The control plane relays this to eaia for MDX regeneration.
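A sketch of the worker-side callback, assuming the requests package (the multipart field names are illustrative, not a defined contract):

import requests

def copyback(copyback_url: str, job_id: str, notebook_path: str) -> None:
    # POST the executed notebook to the control plane on completion.
    with open(notebook_path, "rb") as f:
        resp = requests.post(
            copyback_url,
            files={"notebook": (notebook_path, f, "application/x-ipynb+json")},
            data={"job_id": job_id},
            timeout=30,
        )
    resp.raise_for_status()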


Event substrate

The user plane is natively event-driven. Its internal communication substrate uses standard robotics and ML messaging:

| Use case | Technology |
|----------|------------|
| ROS node ↔ ROS node (real-time) | DDS (rmw_fastrtps) |
| ROS ↔ non-ROS containers (e.g. YOLOv8) | Zenoh bridge |
| ML worker logging | W&B SDK (direct to W&B API) |
| User plane → control plane (v1) | HTTP polling (pull) |
| User plane → control plane (v1.5) | Redis Streams (push) |
| User plane → control plane (v2) | NATS subjects (push) |

Zenoh is used to decouple non-ROS workloads (YOLOv8 inference, SLAM) from the ROS graph. A Zenoh bridge republishes DDS topics as Zenoh subjects, allowing PyTorch containers without ROS dependencies to subscribe to sensor data.
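For example, a ROS-free PyTorch container needs only the zenoh-python package (>= 1.0 API) to consume bridged camera frames (the key expression is hypothetical; actual names depend on the bridge configuration):

import time

import zenoh

def on_frame(sample: zenoh.Sample) -> None:
    # Payload is the serialized sensor message republished by the Zenoh bridge.
    print(f"{sample.key_expr}: {len(sample.payload.to_bytes())} bytes")

session = zenoh.open(zenoh.Config())
sub = session.declare_subscriber("camera/image_raw", on_frame)  # hypothetical key
time.sleep(60)  # keep the subscriber alive while frames arrive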


Safety constraints

The user plane runs workloads that can actuate physical hardware. Safety is enforced at the plane boundary, before execution begins:

| Constraint | Mechanism |
|------------|-----------|
| Max job duration | max_duration in JobSpec; Ray Job timeout |
| Resource caps | Ray resource scheduling (num_gpus, num_cpus) |
| Process isolation | Each environment uses separate Ray runtime environments |
| Actuation gating (v2) | Confidence threshold check before Nav2 goal submission |
| Emergency stop | Control plane sends SIGTERM via ray job stop; ros-mcp-server has /emergency_stop MCP tool |
| Circuit breaker | ClusterAgent monitors anomaly rate; pauses cluster on threshold breach |
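In v1, max-duration enforcement reduces to a watchdog around the Ray job API; a minimal sketch (stop_job is the standard Ray call, which sends SIGTERM and escalates to SIGKILL):

import time

from ray.job_submission import JobSubmissionClient

def enforce_max_duration(client: JobSubmissionClient, ray_job_id: str, max_duration: int) -> None:
    deadline = time.monotonic() + max_duration
    while time.monotonic() < deadline:
        if client.get_job_status(ray_job_id).is_terminal():
            return  # job finished on its own
        time.sleep(5)
    client.stop_job(ray_job_id)  # forced termination past max_duration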

v1 Hybrid Compromise

In v1, the claude -p subprocesses invoked by the control plane (NotebookAgent, ClusterAgent) conflate control-plane reasoning with user-plane execution. A NotebookAgent subprocess both decides how to submit a job (control plane cognition) and issues the ray job submit command (user plane execution). This is a deliberate pragmatic choice for v1.

The subprocess boundary (claude -p as a child process of the FastAPI app) serves as the physical plane separator in v1: the control plane process never directly touches kubectl or ray; only the subprocess does. This is sufficient isolation for v1 but conflates the two planes logically.

In v2, the separation becomes explicit: control agents emit JobSpec messages to a NATS subject; a user-plane executor (a lightweight worker with no reasoning capability) consumes the spec and runs ray job submit. The control agent never touches infrastructure CLIs directly.
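A sketch of that executor, assuming the nats-py package (subject name and NATS address are placeholders; the JobSpec fields follow the Interfaces section):

import asyncio
import json

import nats
from ray.job_submission import JobSubmissionClient

async def main() -> None:
    nc = await nats.connect("nats://nats.internal:4222")  # placeholder address
    ray_client = JobSubmissionClient("http://192.168.1.78:8265")

    async def on_jobspec(msg) -> None:
        spec = json.loads(msg.data)  # JobSpec as defined under Interfaces
        # No reasoning here: translate the spec into a submission, nothing more.
        ray_client.submit_job(
            submission_id=spec["job_id"],
            entrypoint=spec["entrypoint"],
            entrypoint_num_gpus=spec["resources"]["num_gpus"],
        )

    await nc.subscribe("userplane.jobspec", cb=on_jobspec)  # placeholder subject
    await asyncio.Event().wait()  # run until interrupted

asyncio.run(main())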


Infrastructure

Ray on Proxmox VMs (v1)

Ray is deployed natively on Proxmox VMs using the on-premise cluster launcher. No Kubernetes layer is required. The cluster configuration and VM provisioning are managed via Terraform (see infra/terraform/) and the Ray cluster YAML.

infra/ray/
├── cluster.yaml              Ray cluster config (head_ip + worker_ips)
└── runtime_envs/
    ├── torch-gpu.yaml        Runtime env for torch.dev.gpu workloads
    └── ros-gpu.yaml          Runtime env for ros.dev.gpu workloads

Claude Code agents manage the cluster lifecycle via:

  • ray up cluster.yaml -- start/restart the cluster
  • ray submit cluster.yaml job.py -- submit a job
  • ray status -- check cluster health
  • ray down cluster.yaml -- tear down

KubeRay (deferred to v2)

KubeRay operator and RayCluster CRs are available in infra/k8s/ but deferred until multi-tenancy or cloud burst auto-scaling is required. The K8s manifests are kept as a future migration path, not the v1 deployment target.

Worker environments

Each environment uses a purpose-built VM image or conda/pip runtime environment:

| Environment | VM setup | Key packages |
|-------------|----------|--------------|
| torch.dev.gpu | Ubuntu 24.04 + CUDA 12.x + Ray | PyTorch, papermill, wandb, diffusers, Cosmos-Predict2, Cosmos-Transfer2.5 |
| ros.dev.gpu | Ubuntu 24.04 + ROS 2 Jazzy + CUDA + Ray | ros2, nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 (vLLM) |

Evolution Path

v1   — Ray on Proxmox VMs (native launcher); ray job submit via NotebookAgent subprocess (hybrid)
v1.5 — Redis Streams: user plane emits StatusEvents; control plane subscribes (decoupled polling)
       Cosmos-Reason2 on ros.dev.gpu for physical reasoning
       Cosmos-Predict2 + Cosmos-Transfer2.5 on torch.dev.gpu for world model inference + sim2real
       Predict → Transfer → Reason → Execute loop for turtlebot-maze reference application
       SDG pipeline: Gazebo → Predict2 → Transfer2.5 → lakehouse augmented dataset
v2   — NATS: JobSpec dispatch; pure executor workers (no LLM); Zenoh → NATS bridge for ROS events
       Actuation confidence gates backed by Cosmos-Reason2 feasibility scores
       Cosmos-Predict2 post-trained on turtlebot-maze ROS bag recordings
       Cosmos-Transfer2.5 Real2Real augmentation of turtlebot-maze ROS bags for policy robustness
       Management plane subscribes to execution telemetry
v3   — Edge deployment: Cosmos-Reason2 (2B, FP8) on Jetson AGX Orin/Thor for on-robot inference
       Dual-compute: Jetson (edge reasoning) + KubeRay (training, SDG, heavy inference)

Requirements (UP-xxx)

Traces to system-level requirements in architecture/four-plane.md.

| ID | Requirement | Traces to | Version |
|----|-------------|-----------|---------|
| UP-001 | The user plane shall host agentic workloads: VLA, behavior trees, robot control, ML training, SLAM, object detection | SYS-001 | v1 |
| UP-002 | The user plane shall provide real-time (ms) latency and maintain per-session state | SYS-001 | v1 |
| UP-003 | User plane failure shall stop the agent/robot but the control plane shall continue | SYS-002 | v1 |
| UP-004 | The user plane shall remain operational during control plane outages | SYS-002 | v1 |
| UP-005 | The user plane shall provide two Ray environments, torch.dev.gpu and ros.dev.gpu, on Proxmox VMs | SYS-001 | v1 |
| UP-006 | torch.dev.gpu shall support CUDA, PyTorch, notebook execution, VLA training, Cosmos-Predict2, Cosmos-Transfer2.5 | UP-005 | v1 |
| UP-007 | ros.dev.gpu shall support ROS 2 Jazzy, Nav2, YOLOv8, SLAM, Cosmos-Reason2 | UP-005 | v1 |
| UP-008 | The control plane shall manage Ray cluster lifecycles via ray up/down but shall NOT participate in real-time control | SYS-002 | v1 |
| UP-009 | The user plane shall accept JobSpec from the control plane: job_id, environment, entrypoint, resources, parameters, copyback_url, max_duration | | v1 |
| UP-010 | The user plane shall emit StatusEvent to the control plane: job_id, status, timestamp, logs, wandb_run_id | | v1 |
| UP-011 | The user plane shall call the copyback webhook on job completion | CP-015 | v1 |
| UP-012 | ROS node communication shall use DDS (rmw_fastrtps) | | v1 |
| UP-013 | Non-ROS containers shall use the Zenoh bridge to decouple from the ROS graph | SYS-004 | v1 |
| UP-014 | Maximum job duration shall be enforced via Ray Job timeout | CP-020 | v1 |
| UP-015 | Resource caps shall be enforced via Ray resource scheduling (num_gpus, num_cpus) | | v1 |
| UP-016 | Environments shall be isolated via separate Ray runtime environments | | v1 |
| UP-017 | Actuation gating shall require a confidence threshold before goal submission | SYS-005 | v2 |
| UP-018 | Emergency stop shall be supported via ray job stop and the ros-mcp-server /emergency_stop tool | | v1 |
| UP-019 | Circuit breaker: ClusterAgent shall pause the cluster on anomaly rate threshold breach | SYS-002 | v1 |
| UP-020 | The user plane shall support turtlebot-maze with Claude Code /navigate via ros-mcp-server | SYS-003 | v1 |
| UP-021 | turtlebot-maze shall support behavior trees for autonomous navigation | SYS-003 | v1 |
| UP-022 | turtlebot-maze shall support YOLOv8 object detection via Zenoh | SYS-003, SYS-004 | v1 |
| UP-023 | turtlebot-maze shall support stella_vslam for mapping and localization | SYS-003 | v1 |
| UP-024 | turtlebot-maze shall support Cosmos-Reason2 for physical reasoning and feasibility evaluation | SYS-003, SYS-005 | v1.5 |
| UP-025 | Cosmos-Predict2 shall run on torch.dev.gpu for world model inference | SYS-005 | v1.5 |
| UP-026 | Cosmos-Transfer2.5 shall run on torch.dev.gpu for sim2real augmentation | SYS-005 | v1.5 |
| UP-027 | turtlebot-maze shall implement the Predict → Transfer → Reason → Execute loop | SYS-005 | v1.5 |
| UP-028 | SDG pipeline: Gazebo → Cosmos-Predict2 → Cosmos-Transfer2.5 → lakehouse | SYS-005, SYS-007 | v1.5 |
| UP-029 | torch.dev.gpu Ray worker image: rayproject/ray-ml with PyTorch, papermill, wandb, diffusers, Cosmos | UP-006 | v1 |
| UP-030 | ros.dev.gpu Ray worker image: ROS 2 Jazzy + Ray base with nav2, pytorch, yolov8, stella_vslam, Cosmos-Reason2 | UP-007 | v1 |
| UP-031 | Ray cluster config shall be stored in infra/ray/cluster.yaml; KubeRay manifests deferred to infra/k8s/ (v2) | | v1 |
| UP-032 | In v1, jobs shall be dispatched synchronously via claude -p subprocess | CP-026 | v1 |
| UP-033 | In v1.5, StatusEvent shall be emitted via Redis Streams | | v1.5 |
| UP-034 | In v2, jobs shall be dispatched via NATS; pure executor workers shall consume JobSpec | CP-027 | v2 |
| UP-035 | In v3, Cosmos-Reason2 (2B, FP8) shall run on Jetson AGX Orin/Thor for on-robot inference | SYS-005 | v3 |
| UP-036 | In v3, dual-compute: Jetson (edge) + KubeRay (training, SDG, heavy inference) | SYS-005 | v3 |


References

  • Cosmos-Reason2 on Jetson — deploying Cosmos-Reason2 2B (FP8) on Jetson AGX Orin/Thor via vLLM; demonstrates real-time webcam inference and robotic manipulation (pick-and-place). Key for v3 edge deployment: the 2B model runs on Jetson AGX Orin 64 GB at 8192 token context, and on Orin Super Nano at 256 token context with aggressive memory tuning.
  • Cosmos-Reason2 — physical AI reasoning VLM (2B/8B); spatial, temporal, physics comprehension (auraison-eh1)
  • Cosmos-Predict2.5 — world foundation model for future state prediction via video generation (auraison-oys)
  • Cosmos-Transfer2.5 — multi-controlnet sim2real augmentation; +68.5% mission success rate on navigation tasks (auraison-i6l)