Digital Twins Design

Date: 2026-03-02 | Status: Approved (v1) | Epic: auraison-5z3

Relationship to the user plane design

Document	Purpose
`docs/user-plane/design.mdx`	Canonical user plane design — KubeRay environments, reference applications, interfaces
This document	Digital Twins subsystem design — persistent world model spanning user plane, data plane, and control plane
`docs/user-plane/ar4-digital-twin.mdx`	AR4-MK3 as second reference asset; layered plane decomposition; schema extensions

Digital twins are a cross-plane feature: the user plane writes live state, the data plane stores it, and the control plane (TwinAgent subprocess) reconciles and queries it. This document specifies all three sides of that interaction. The TurtleBot on ros.dev.gpu is the v1 reference asset.

Problem

Agentic workloads in the user plane produce rich runtime state — robot pose, sensor readings, navigation events, perception outputs — but this state is ephemeral. It lives inside a RayJob for its lifetime and is lost when the job completes. The control plane has no persistent, queryable model of the physical world its agents are acting on.

A digital twin is a persistent, structured representation of a physical asset that accumulates state over time. For Auraison, twins are the bridge between the transient execution world of the user plane and the durable memory world of the data plane. They enable:

Historical replay: reconstruct what the robot was doing during any past job
Causal analysis: correlate agent decisions with physical state at decision time
Predictive modelling: feed twin state into Cosmos-Predict2 + Cosmos-Transfer2.5 to forecast photorealistic future states
Agent memory: let control-plane agents read world state without polling live sensors
Reasoning substrate: serve twin state snapshots as visual context to Cosmos-Reason2 for physics-grounded feasibility evaluation

Goals

Persist physical asset state (pose, sensor readings, events) in the data-plane lakehouse
Provide a TwinAgent subprocess for control-plane agents to create, sync, query, and retire twins
Expose twin state via a FastAPI router consistent with existing API conventions
Demonstrate end-to-end with the TurtleBot reference asset on ros.dev.gpu

Non-goals (v1)

Real-time sub-second twin state (deferred to v1.5 with Redis hot-cache)
Multi-tenant asset isolation (management plane — v2)
Cosmos-Predict2 / Cosmos-Transfer2.5 predicted twin state snapshots (v1.5)
Physics simulation driven from twin state (v1.5 / v2)
Visualisation UI in the Next.js dashboard (v2)

Architecture

Digital twins span three planes:

User Plane (ros.dev.gpu)
  Ray worker writes live sensor data → lakehouse (data plane) during job
  Ray worker writes live pose snapshots → lakehouse (data plane) during job
  Cosmos-Reason2: reads twin state snapshots as visual context for feasibility evaluation

User Plane (torch.dev.gpu)
  Cosmos-Predict2: reads observed twin state → generates predicted future state video
  Cosmos-Transfer2.5: translates predicted synthetic video → photorealistic; stored as predicted snapshots

Control Plane
  TwinAgent subprocess (claude -p) reconciles and validates at job end
  TwinAgent exposes create / sync / query / get-state / annotate / retire operations (predict follows in v1.5)
  FastAPI /api/v1/twins router accepts HTTP calls from UI and other agents

Data Plane (Lakehouse)
  Seven Parquet tables under twins/ prefix
  state_snapshots.source distinguishes observed vs predicted vs sim2real snapshots
  DuckDB queries served by TwinAgent and LakehouseAgent

Editable Mermaid source: images/digital-twins-cross-plane-architecture.mermaid.md

TwinAgent

TwinAgent is a claude -p subprocess in control-plane/backend/agents/twin_agent.py, following the same pattern as LakehouseAgent and ClusterAgent.

Tool scope: Bash(duckdb *), Bash(python *), Read

Operations exposed to the control plane:

Operation	Description
`create_twin(asset_id, asset_type, urdf_path)`	Register a new asset; create lakehouse tables if absent
`sync_twin(twin_id, job_id)`	Post-job reconciliation: read ROS bag / W&B outputs; append validated snapshots
`query_twin(twin_id, query)`	DuckDB query over twin tables; returns Arrow/pandas result
`get_twin_state(twin_id, at=None)`	Latest state snapshot, or point-in-time if `at` provided
`annotate_twin(twin_id, annotation)`	Append an annotation record
`predict_twin(twin_id, action, horizon_s)`	v1.5: call Cosmos-Predict2 → Cosmos-Transfer2.5 with the latest observed state; store predicted snapshots (source=predicted/sim2real)
`retire_twin(twin_id)`	Mark asset as retired; preserve history

Python wrapper interface:

# control-plane/backend/agents/twin_agent.py
def create_twin(asset_id: str, asset_type: str, urdf_path: str | None = None) -> dict: ...
def sync_twin(twin_id: str, job_id: str) -> dict: ...
def query_twin(twin_id: str, query: str) -> dict: ...
def get_twin_state(twin_id: str, at: str | None = None) -> dict: ...
def predict_twin(twin_id: str, action: dict, horizon_s: float) -> dict: ...
def annotate_twin(twin_id: str, annotation: dict) -> dict: ...
def retire_twin(twin_id: str) -> dict: ...

Each function constructs a prompt and calls run_agent() from agents/base.py with ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read".
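
As a minimal sketch of that wrapper pattern, the example below builds a prompt for sync_twin and hands it to a runner. The real signature of run_agent() in agents/base.py is not shown in this document, so the `Runner` type, the extra `run_agent` parameter, the prompt wording, and `fake_runner` are all assumptions made for illustration.

```python
from typing import Callable

# Tool scope quoted from the design text above.
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read"

# Hypothetical runner type: takes (prompt, allowed_tools) and returns a dict.
# The actual run_agent() signature may differ.
Runner = Callable[[str, str], dict]


def build_sync_prompt(twin_id: str, job_id: str) -> str:
    """Compose the instruction handed to the TwinAgent subprocess."""
    return (
        f"Reconcile twin '{twin_id}' for job '{job_id}': read the in-job "
        "Parquet partitions, validate them, mark valid snapshots as "
        "reconciled, and append a twin.synced event."
    )


def sync_twin(twin_id: str, job_id: str, run_agent: Runner) -> dict:
    """Build the prompt and delegate to the injected runner."""
    if not twin_id or not job_id:
        raise ValueError("twin_id and job_id are required")
    return run_agent(build_sync_prompt(twin_id, job_id), ALLOWED_TOOLS)


def fake_runner(prompt: str, allowed_tools: str) -> dict:
    """Stand-in for run_agent(): echoes its inputs and spawns no subprocess."""
    return {"prompt": prompt, "allowed_tools": allowed_tools, "status": "ok"}


result = sync_twin("turtlebot-01", "job-123", fake_runner)
```

Injecting the runner keeps the prompt construction testable without launching a claude -p subprocess; the production wrapper would presumably call run_agent() directly.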

Schema

Seven Parquet tables under the twins/ prefix in the lakehouse. DuckDB reads them via the existing DuckLake configuration.

`twins/assets`

Asset registry — one row per physical asset.

Column	Type	Description
`asset_id`	VARCHAR PK	Stable identifier (e.g. `turtlebot-01`)
`asset_type`	VARCHAR	`robot \| drone \| sensor_node \| custom`
`display_name`	VARCHAR	Human-readable name
`status`	VARCHAR	`active \| retired`
`created_at`	TIMESTAMP
`retired_at`	TIMESTAMP	NULL if active
`metadata`	JSON	Arbitrary key-value pairs

`twins/urdf_assets`

URDF / CAD model references, versioned.

Column	Type	Description
`urdf_id`	VARCHAR PK
`asset_id`	VARCHAR FK → assets
`version`	VARCHAR	Semantic version string
`urdf_path`	VARCHAR	Path in the lakehouse or local repo
`format`	VARCHAR	`urdf \| xacro \| sdf \| obj`
`uploaded_at`	TIMESTAMP
`checksum`	VARCHAR	SHA-256 of file

`twins/state_snapshots`

Point-in-time pose snapshots.

Column	Type	Description
`snapshot_id`	VARCHAR PK	UUID
`asset_id`	VARCHAR FK → assets
`job_id`	VARCHAR FK → twin_jobs
`timestamp`	TIMESTAMP	Time of observation
`source`	VARCHAR	`ros_job \| manual \| simulation \| predicted \| sim2real`
`position_x`	DOUBLE	Metres, world frame
`position_y`	DOUBLE
`position_z`	DOUBLE
`orientation_qx`	DOUBLE	Quaternion
`orientation_qy`	DOUBLE
`orientation_qz`	DOUBLE
`orientation_qw`	DOUBLE
`linear_velocity`	DOUBLE	m/s
`angular_velocity`	DOUBLE	rad/s
`reconciled`	BOOLEAN	True after post-job TwinAgent validation
`cosmos_model`	VARCHAR	NULL for observed; `predict2 \| transfer2.5` for generated snapshots
`predicted_from_snapshot_id`	VARCHAR FK → state_snapshots	Seed snapshot used by Cosmos-Predict2; NULL for observed
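
One way to mirror this table in Python is a frozen dataclass, sketched below. The field names and defaults follow the columns above. The `check_lineage` rule is an assumption drawn from the column descriptions: rows with source predicted or sim2real carry cosmos_model and predicted_from_snapshot_id, and all other rows leave both NULL. Treating `simulation` as a non-generated source is likewise an assumption.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional
import uuid


@dataclass(frozen=True)
class StateSnapshot:
    snapshot_id: str
    asset_id: str
    job_id: str
    timestamp: datetime
    source: str  # ros_job | manual | simulation | predicted | sim2real
    position_x: float
    position_y: float
    position_z: float
    orientation_qx: float
    orientation_qy: float
    orientation_qz: float
    orientation_qw: float
    linear_velocity: float
    angular_velocity: float
    reconciled: bool = False
    cosmos_model: Optional[str] = None
    predicted_from_snapshot_id: Optional[str] = None


# Assumption: only these two sources are produced by Cosmos models.
GENERATED_SOURCES = {"predicted", "sim2real"}


def check_lineage(s: StateSnapshot) -> None:
    """Raise ValueError if the lineage columns disagree with the source."""
    generated = s.source in GENERATED_SOURCES
    if generated and (s.cosmos_model is None or s.predicted_from_snapshot_id is None):
        raise ValueError("generated snapshot needs cosmos_model and a seed snapshot")
    if not generated and (s.cosmos_model or s.predicted_from_snapshot_id):
        raise ValueError("observed snapshot must leave lineage columns NULL")


observed = StateSnapshot(
    snapshot_id=str(uuid.uuid4()), asset_id="turtlebot-01", job_id="job-123",
    timestamp=datetime(2026, 3, 2, 12, 0, tzinfo=timezone.utc), source="ros_job",
    position_x=1.0, position_y=2.0, position_z=0.0,
    orientation_qx=0.0, orientation_qy=0.0, orientation_qz=0.0, orientation_qw=1.0,
    linear_velocity=0.3, angular_velocity=0.1,
)
# A predicted row derived from the observed one, keeping a pointer to its seed.
predicted = replace(
    observed, snapshot_id=str(uuid.uuid4()), source="predicted",
    cosmos_model="predict2", predicted_from_snapshot_id=observed.snapshot_id,
)
check_lineage(observed)
check_lineage(predicted)
```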

`twins/sensor_readings`

Time-series sensor data (separate from pose to keep state_snapshots lean).

Column	Type	Description
`reading_id`	VARCHAR PK	UUID
`asset_id`	VARCHAR FK → assets
`job_id`	VARCHAR FK → twin_jobs
`sensor_type`	VARCHAR	`imu \| lidar \| camera \| gps \| odometry`
`timestamp`	TIMESTAMP
`payload`	JSON	Sensor-specific structured data
`raw_path`	VARCHAR	Path to raw file in the lakehouse (e.g. ROS bag slice)
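
A small sketch of how a worker might assemble one sensor_readings row before writing it. The helper name `make_reading` and the IMU payload keys are illustrative assumptions; the schema only states that payload holds sensor-specific structured data as JSON.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional

# Sensor types enumerated in the table above.
SENSOR_TYPES = {"imu", "lidar", "camera", "gps", "odometry"}


def make_reading(asset_id: str, job_id: str, sensor_type: str,
                 payload: dict, raw_path: Optional[str] = None) -> dict:
    """Build one sensor_readings row; the payload is stored as a JSON string."""
    if sensor_type not in SENSOR_TYPES:
        raise ValueError(f"unknown sensor_type: {sensor_type!r}")
    return {
        "reading_id": str(uuid.uuid4()),
        "asset_id": asset_id,
        "job_id": job_id,
        "sensor_type": sensor_type,
        "timestamp": datetime.now(timezone.utc),
        "payload": json.dumps(payload, sort_keys=True),
        "raw_path": raw_path,
    }


# Hypothetical IMU payload shape, for illustration only.
row = make_reading(
    "turtlebot-01", "job-123", "imu",
    {"angular_velocity": [0.0, 0.0, 0.1], "linear_acceleration": [0.0, 0.0, 9.81]},
)
```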

`twins/events`

Discrete twin lifecycle and runtime events.

Column	Type	Description
`event_id`	VARCHAR PK	UUID
`asset_id`	VARCHAR FK → assets
`job_id`	VARCHAR FK → twin_jobs	NULL for lifecycle events
`event_type`	VARCHAR	`twin.created \| twin.synced \| twin.retired \| nav.goal_set \| nav.goal_reached \| nav.obstacle_detected \| …`
`timestamp`	TIMESTAMP
`actor`	VARCHAR	Agent or system that generated the event
`payload`	JSON	Event-specific data

`twins/twin_jobs`

Link table: maps each twin to the jobs that produced its state.

Column	Type	Description
`twin_job_id`	VARCHAR PK	UUID
`twin_id`	VARCHAR FK → assets	References `assets.asset_id` (a twin is identified by its asset ID)
`job_id`	VARCHAR	Control-plane job UUID
`ray_job_id`	VARCHAR	Ray job ID
`environment`	VARCHAR	`torch.dev.gpu \| ros.dev.gpu`
`started_at`	TIMESTAMP
`completed_at`	TIMESTAMP
`wandb_run_id`	VARCHAR	NULL if not tracked
`sync_status`	VARCHAR	`pending \| synced \| failed`

`twins/annotations`

Human or agent annotations on twin state.

Column	Type	Description
`annotation_id`	VARCHAR PK	UUID
`asset_id`	VARCHAR FK → assets
`snapshot_id`	VARCHAR FK → state_snapshots	NULL for job-level annotations
`author`	VARCHAR	Agent name or user email
`annotation_type`	VARCHAR	`label \| anomaly_flag \| note \| review`
`content`	VARCHAR	Free text or JSON
`created_at`	TIMESTAMP

Data flow — v1

In-job writes (Ray worker → lakehouse)

During a ros.dev.gpu RayJob the worker writes live data directly to the lakehouse:

Gazebo + Nav2
  → DDS topics
  → Zenoh bridge
  → Ray worker (Python)
      → pyarrow: append rows to sensor_readings Parquet partition
      → pyarrow: append rows to state_snapshots Parquet partition (reconciled=False)

The worker uses the lakehouse S3 endpoint configured in data-plane/ (same credentials as the LakehouseAgent) and writes to a job-specific partition: twins/state_snapshots/job_id={job_id}/part-0.parquet.
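
The job-specific partition key can be built with a small helper, sketched here. The function name `partition_key`, the `part` argument, and the job ID character whitelist are assumptions; only the resulting layout (twins/<table>/job_id=<job_id>/part-0.parquet) comes from the text above.

```python
import re
from pathlib import PurePosixPath

# Assumed whitelist: keeps "/" and "=" out of the Hive-style partition segment.
_JOB_ID = re.compile(r"^[A-Za-z0-9._-]+$")


def partition_key(table: str, job_id: str, part: int = 0) -> str:
    """Return the object key for one job's partition file of a twins/ table."""
    if not _JOB_ID.match(job_id):
        raise ValueError(f"unsafe job_id: {job_id!r}")
    return str(
        PurePosixPath("twins") / table / f"job_id={job_id}" / f"part-{part}.parquet"
    )


key = partition_key("state_snapshots", "job-123")
```

Validating the job ID before formatting guards against a malformed ID escaping its partition directory.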

Post-job reconciliation (TwinAgent)

When the Ray worker completes, the control plane calls sync_twin(twin_id, job_id). The TwinAgent subprocess:

Reads in-job Parquet partitions from the lakehouse
Reads the W&B run (if linked) for additional metrics
Validates schema integrity; flags anomalies in events
Sets reconciled=True on validated snapshots
Merges job partition into main table (or leaves partitioned — DuckDB handles both)
Appends a twin.synced event record
Updates twin_jobs.sync_status = 'synced'
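
The validation and flagging steps above might look like the following in-memory sketch. The specific checks (required keys present, unit-norm quaternion within a tolerance) are assumptions chosen for illustration; the document does not define what "schema integrity" covers. Reading and writing Parquet is left out so the logic stays self-contained.

```python
import math
from datetime import datetime, timezone

REQUIRED = ("snapshot_id", "asset_id", "job_id", "timestamp")
QUAT = ("orientation_qx", "orientation_qy", "orientation_qz", "orientation_qw")


def is_valid(row: dict, tol: float = 1e-3) -> bool:
    """Assumed checks: required keys are set and the quaternion has unit norm."""
    if any(row.get(k) is None for k in REQUIRED):
        return False
    try:
        norm = math.sqrt(sum(float(row[k]) ** 2 for k in QUAT))
    except (KeyError, TypeError, ValueError):
        return False
    return abs(norm - 1.0) <= tol


def reconcile(rows: list[dict], twin_id: str, job_id: str) -> tuple[list[dict], dict]:
    """Return copies of the rows with `reconciled` set, plus a twin.synced event."""
    out = [{**r, "reconciled": is_valid(r)} for r in rows]
    rejected = sum(1 for r in out if not r["reconciled"])
    event = {
        "event_type": "twin.synced",
        "asset_id": twin_id,
        "job_id": job_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": "TwinAgent",
        "payload": {"validated": len(out) - rejected, "rejected": rejected},
    }
    return out, event


good = {
    "snapshot_id": "s1", "asset_id": "turtlebot-01", "job_id": "j1",
    "timestamp": "2026-03-02T12:00:00Z",
    "orientation_qx": 0.0, "orientation_qy": 0.0,
    "orientation_qz": 0.0, "orientation_qw": 1.0,
}
bad = {**good, "snapshot_id": "s2", "orientation_qw": 0.5}  # non-unit quaternion
rows, event = reconcile([good, bad], "turtlebot-01", "j1")
```

Rows that fail validation stay in the output with reconciled=False, so nothing is dropped silently and the event payload records how many were rejected.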

API

New router: control-plane/backend/api/twins.py, mounted at /api/v1/twins.

GET    /api/v1/twins                      List all registered twins
POST   /api/v1/twins                      Register a new twin (calls TwinAgent.create_twin)
GET    /api/v1/twins/{id}                 Twin details (asset record + latest state)
GET    /api/v1/twins/{id}/state           Latest state snapshot (or ?at=ISO8601 for point-in-time)
POST   /api/v1/twins/{id}/sync            Trigger post-job reconciliation
POST   /api/v1/twins/{id}/predict         Generate predicted future snapshots via Cosmos-Predict2 + Transfer2.5 (v1.5)
GET    /api/v1/twins/{id}/events          Event log (paginated)
GET    /api/v1/twins/{id}/annotations     Annotations
POST   /api/v1/twins/{id}/annotations     Add annotation

Pydantic models in control-plane/backend/models/twin.py: Twin, TwinCreate, TwinState, TwinSyncRequest, TwinAnnotation.
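
The point-in-time behaviour of GET /api/v1/twins/{id}/state can be illustrated with a plain-Python sketch: pick the most recent snapshot whose timestamp is at or before ?at, or the latest overall when ?at is absent. The helper names `parse_at` and `state_at` are hypothetical, and treating naive timestamps as UTC is an assumption. In the design itself this lookup would be a DuckDB query issued by TwinAgent; whether predicted rows should be excluded is not specified here, so no source filter is applied.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_at(at: str) -> datetime:
    """Parse an ISO 8601 string; naive values are assumed to be UTC."""
    ts = datetime.fromisoformat(at.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def state_at(snapshots: list[dict], at: Optional[str] = None) -> Optional[dict]:
    """Latest snapshot overall, or the latest one at or before `at`."""
    cutoff = parse_at(at) if at else None
    eligible = [
        s for s in snapshots
        if cutoff is None or parse_at(s["timestamp"]) <= cutoff
    ]
    return max(eligible, key=lambda s: parse_at(s["timestamp"]), default=None)


snaps = [
    {"snapshot_id": "a", "timestamp": "2026-03-02T12:00:00Z"},
    {"snapshot_id": "b", "timestamp": "2026-03-02T12:00:05Z"},
]
latest = state_at(snaps)
earlier = state_at(snaps, "2026-03-02T12:00:03Z")
```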

Reference asset: TurtleBot

The v1 end-to-end demo twins turtlebot-01 on ros.dev.gpu:

1. TwinAgent.create_twin("turtlebot-01", "robot", urdf_path="user-plane/turtlebot/urdf/turtlebot3.urdf")
2. Control plane submits turtlebot-maze RayJob to ros.dev.gpu
3. Ray worker connects to Gazebo via Zenoh; writes pose + IMU to the lakehouse in-job
4. Job completes → POST /api/v1/twins/turtlebot-01/sync?job_id=<id>
5. TwinAgent reconciles; twin.synced event logged
6. TwinAgent.query_twin("turtlebot-01", "SELECT timestamp, position_x, position_y FROM state_snapshots ORDER BY timestamp")
   → returns full trajectory from the job

# v1.5 extension — Cosmos-driven predicted twin state:
7. POST /api/v1/twins/turtlebot-01/predict  {action: {cmd_vel: {linear: 0.3, angular: 0.1}}, horizon_s: 5.0}
   → TwinAgent calls Cosmos-Predict2 (torch.dev.gpu) with latest observed snapshot as seed frame
   → Cosmos-Transfer2.5 translates synthetic prediction → photorealistic
   → Predicted snapshots written to state_snapshots (source=predicted, cosmos_model=transfer2.5)
8. Cosmos-Reason2 (ros.dev.gpu) reads predicted snapshots as visual context
   → evaluates feasibility → go / no-go before Nav2 goal dispatch
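
For the v1.5 predict call in step 7, a request body of the shape action.cmd_vel.linear / action.cmd_vel.angular / horizon_s could be validated as sketched below. The `PredictRequest` type and `parse_predict_request` helper are hypothetical (the document lists no Pydantic model for this endpoint), and the rule that horizon_s must be positive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictRequest:
    linear: float     # commanded linear velocity from action.cmd_vel.linear
    angular: float    # commanded angular velocity from action.cmd_vel.angular
    horizon_s: float  # prediction horizon in seconds


def parse_predict_request(body: dict) -> PredictRequest:
    """Validate the nested body and flatten it into a PredictRequest."""
    try:
        cmd = body["action"]["cmd_vel"]
        req = PredictRequest(
            linear=float(cmd["linear"]),
            angular=float(cmd["angular"]),
            horizon_s=float(body["horizon_s"]),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed predict request: {exc!r}") from exc
    if req.horizon_s <= 0:
        raise ValueError("horizon_s must be positive")
    return req


req = parse_predict_request(
    {"action": {"cmd_vel": {"linear": 0.3, "angular": 0.1}}, "horizon_s": 5.0}
)
```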

Evolution path

v1   — Persistent world model: lakehouse-backed twins for TurtleBot; in-job writes + post-job reconciliation
v1.5 — Real-time sync: Redis hot-cache for live pose (sub-second); Zenoh → twin writer as a persistent Ray actor
       Cosmos-Predict2 + Cosmos-Transfer2.5: predicted twin state snapshots (source=predicted/sim2real)
       predict_twin() operation + POST /api/v1/twins/{id}/predict endpoint
       Cosmos-Reason2: reads predicted snapshots as visual context for pre-execution feasibility evaluation
       Predict → Transfer → Reason → Execute loop integrated with twin state
v2   — Agent memory substrate: TwinAgent is the canonical world-model interface for all control-plane agents
       Cosmos models post-trained on turtlebot-maze ROS bags; domain-specific predicted snapshots
       Visualisation: Rerun viewer embedded in Next.js dashboard (observed + predicted state overlay)
       counter-uas (aegean-ai/counter-uas): UE5 + VisDrone perception/tracking twin; third reference asset
       Management plane: per-twin access control, retention policies

Files to create

control-plane/backend/
  agents/twin_agent.py          TwinAgent wrapper (claude -p subprocess)
  api/twins.py                  FastAPI router
  models/twin.py                Pydantic models

data-plane/
  schema/twins/                 Schema definitions and migration scripts

.claude/agents/
  twin-agent.md                 Agent definition (YAML frontmatter + system prompt)

main.py — mount twins router alongside existing jobs, clusters, experiments, lakehouse.
