
Auraison — Digital Twins Design

Date: 2026-03-02
Status: Approved (v1)
Epic: auraison-5z3


Problem

Agentic workloads in the user plane produce rich runtime state — robot pose, sensor readings, navigation events, perception outputs — but this state is ephemeral. It lives inside a RayJob for its lifetime and is lost when the job completes. The control plane has no persistent, queryable model of the physical world its agents are acting on.

A digital twin is a persistent, structured representation of a physical asset that accumulates state over time. For Auraison, twins are the bridge between the transient execution world of the user plane and the durable memory world of the data plane. They enable:

  • Historical replay: reconstruct what the robot was doing during any past job
  • Causal analysis: correlate agent decisions with physical state at decision time
  • Predictive modelling: feed twin state into Cosmos-Predict2 + Cosmos-Transfer2.5 to forecast photorealistic future states
  • Agent memory: let control-plane agents read world state without polling live sensors
  • Reasoning substrate: serve twin state snapshots as visual context to Cosmos-Reason2 for physics-grounded feasibility evaluation

Goals

  • Persist physical asset state (pose, sensor readings, events) in the data-plane lakehouse
  • Provide a TwinAgent subprocess for control-plane agents to create, sync, query, and retire twins
  • Expose twin state via a FastAPI router consistent with existing API conventions
  • Demonstrate end-to-end with the TurtleBot reference asset on ros.dev.gpu

Non-goals (v1)

  • Real-time sub-second twin state (deferred to v1.5 with Redis hot-cache)
  • Multi-tenant asset isolation (management plane — v2)
  • Cosmos-Predict2 / Cosmos-Transfer2.5 predicted twin state snapshots (v1.5)
  • Physics simulation driven from twin state (v1.5 / v2)
  • Visualisation UI in the Next.js dashboard (v2)

Architecture

Digital twins span three planes:

User Plane (ros.dev.gpu)
Ray worker writes live sensor data → MinIO (data plane) during job
Ray worker writes live pose snapshots → MinIO (data plane) during job
Cosmos-Reason2: reads twin state snapshots as visual context for feasibility evaluation

User Plane (torch.dev.gpu)
Cosmos-Predict2: reads observed twin state → generates predicted future state video
Cosmos-Transfer2.5: translates predicted synthetic video → photorealistic; stored as predicted snapshots

Control Plane
TwinAgent subprocess (claude -p) reconciles and validates at job end
TwinAgent exposes create / sync / query / retire / predict operations
FastAPI /api/v1/twins router accepts HTTP calls from UI and other agents

Data Plane (MinIO + DuckDB)
Seven Parquet tables under the twins/ prefix
state_snapshots.source distinguishes observed vs predicted vs sim2real snapshots
DuckDB queries served by TwinAgent and LakehouseAgent

TwinAgent

TwinAgent is a claude -p subprocess in control-plane/backend/agents/twin_agent.py, following the same pattern as LakehouseAgent and ClusterAgent.

Tool scope: Bash(duckdb *), Bash(python *), Read

Operations exposed to the control plane:

Operation                                     Description
create_twin(asset_id, asset_type, urdf_path)  Register a new asset; create lakehouse tables if absent
sync_twin(twin_id, job_id)                    Post-job reconciliation: read ROS bag / W&B outputs; append validated snapshots
query_twin(twin_id, query)                    DuckDB query over twin tables; returns Arrow/pandas result
get_twin_state(twin_id, at=None)              Latest state snapshot, or point-in-time if at provided
annotate_twin(twin_id, annotation)            Append an annotation record
predict_twin(twin_id, action, horizon_s)      Call Cosmos-Predict2 → Cosmos-Transfer2.5 with latest observed state; store predicted snapshots (source=predicted/sim2real)
retire_twin(twin_id)                          Mark asset as retired; preserve history

Python wrapper interface:

# control-plane/backend/agents/twin_agent.py
def create_twin(asset_id: str, asset_type: str, urdf_path: str | None = None) -> dict: ...
def sync_twin(twin_id: str, job_id: str) -> dict: ...
def query_twin(twin_id: str, query: str) -> dict: ...
def get_twin_state(twin_id: str, at: str | None = None) -> dict: ...
def predict_twin(twin_id: str, action: dict, horizon_s: float) -> dict: ...
def annotate_twin(twin_id: str, annotation: dict) -> dict: ...
def retire_twin(twin_id: str) -> dict: ...

Each function constructs a prompt and calls run_agent() from agents/base.py with ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read".
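The wrapper functions are thin prompt builders around the subprocess call. A minimal sketch of sync_twin, with run_agent() stubbed out here (the real implementation lives in agents/base.py and spawns claude -p; its exact signature may differ):

```python
# Sketch of the TwinAgent wrapper pattern. `run_agent` is a stub standing
# in for the claude -p subprocess call defined in agents/base.py.
import json

ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read"


def run_agent(prompt: str, allowed_tools: str) -> str:
    # Stub: the real function launches the agent subprocess and returns
    # its final text output.
    return json.dumps({"status": "synced", "prompt_len": len(prompt)})


def sync_twin(twin_id: str, job_id: str) -> dict:
    """Build the reconciliation prompt and hand it to the agent subprocess."""
    prompt = (
        f"Reconcile twin '{twin_id}' for job '{job_id}'. "
        "Read the in-job Parquet partitions from MinIO, validate schema "
        "integrity, set reconciled=True on valid snapshots, and append a "
        "twin.synced event. Reply with a JSON status object."
    )
    return json.loads(run_agent(prompt, ALLOWED_TOOLS))
```

The other six operations follow the same shape: format a task-specific prompt, pass the fixed tool allowlist, parse the agent's JSON reply.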


Schema

Seven Parquet tables under the twins/ prefix in MinIO. DuckDB reads them via the existing DuckLake configuration.

twins/assets

Asset registry — one row per physical asset.

Column        Type        Description
asset_id      VARCHAR PK  Stable identifier (e.g. turtlebot-01)
asset_type    VARCHAR     robot | drone | sensor_node | custom
display_name  VARCHAR     Human-readable name
status        VARCHAR     active | retired
created_at    TIMESTAMP
retired_at    TIMESTAMP   NULL if active
metadata      JSON        Arbitrary key-value pairs

twins/urdf_assets

URDF / CAD model references, versioned.

Column       Type                 Description
urdf_id      VARCHAR PK
asset_id     VARCHAR FK → assets
version      VARCHAR              Semantic version string
urdf_path    VARCHAR              Path in MinIO or local repo
format       VARCHAR              urdf | xacro | sdf | obj
uploaded_at  TIMESTAMP
checksum     VARCHAR              SHA-256 of file
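The checksum column is a plain SHA-256 over the model file, computable with stdlib hashlib (the helper name here is illustrative):

```python
import hashlib


def urdf_checksum(data: bytes) -> str:
    """SHA-256 hex digest of a URDF/CAD file, as stored in
    urdf_assets.checksum."""
    return hashlib.sha256(data).hexdigest()
```

Recomputing the digest on read lets the TwinAgent detect a model file that drifted from its registered version.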

twins/state_snapshots

Point-in-time pose snapshots.

Column                      Type                          Description
snapshot_id                 VARCHAR PK                    UUID
asset_id                    VARCHAR FK → assets
job_id                      VARCHAR FK → twin_jobs
timestamp                   TIMESTAMP                     Time of observation
source                      VARCHAR                       ros_job | manual | simulation | predicted | sim2real
position_x                  DOUBLE                        Metres, world frame
position_y                  DOUBLE
position_z                  DOUBLE
orientation_qx              DOUBLE                        Quaternion
orientation_qy              DOUBLE
orientation_qz              DOUBLE
orientation_qw              DOUBLE
linear_velocity             DOUBLE                        m/s
angular_velocity            DOUBLE                        rad/s
reconciled                  BOOLEAN                       True after post-job TwinAgent validation
cosmos_model                VARCHAR                       NULL for observed; predict2 | transfer2.5 for generated snapshots
predicted_from_snapshot_id  VARCHAR FK → state_snapshots  Seed snapshot used by Cosmos-Predict2; NULL for observed
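For a planar asset like the TurtleBot, heading is typically recovered from the orientation_q* columns. A minimal sketch using the standard quaternion-to-yaw conversion:

```python
import math


def snapshot_yaw(qx: float, qy: float, qz: float, qw: float) -> float:
    """Yaw (rotation about the world z axis) in radians from a unit
    quaternion, matching the orientation_q* columns of state_snapshots."""
    # Standard ZYX Euler extraction of the yaw component.
    return math.atan2(2.0 * (qw * qz + qx * qy),
                      1.0 - 2.0 * (qy * qy + qz * qz))
```

For example, the identity quaternion (0, 0, 0, 1) gives a yaw of 0, and a 90° rotation about z (qz = sin 45°, qw = cos 45°) gives π/2.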

twins/sensor_readings

Time-series sensor data (separate from pose to keep state_snapshots lean).

Column       Type                    Description
reading_id   VARCHAR PK              UUID
asset_id     VARCHAR FK → assets
job_id       VARCHAR FK → twin_jobs
sensor_type  VARCHAR                 imu | lidar | camera | gps | odometry
timestamp    TIMESTAMP
payload      JSON                    Sensor-specific structured data
raw_path     VARCHAR                 Path to raw file in MinIO (e.g. ROS bag slice)

twins/events

Discrete twin lifecycle and runtime events.

Column      Type                    Description
event_id    VARCHAR PK              UUID
asset_id    VARCHAR FK → assets
job_id      VARCHAR FK → twin_jobs  NULL for lifecycle events
event_type  VARCHAR                 twin.created | twin.synced | twin.retired | nav.goal_set | nav.goal_reached | nav.obstacle_detected | …
timestamp   TIMESTAMP
actor       VARCHAR                 Agent or system that generated the event
payload     JSON                    Event-specific data

twins/twin_jobs

Link table — maps twin to jobs that produced state.

Column        Type                 Description
twin_job_id   VARCHAR PK           UUID
twin_id       VARCHAR FK → assets
job_id        VARCHAR              Control-plane job UUID
ray_job_id    VARCHAR              Ray job ID
environment   VARCHAR              torch.dev.gpu | ros.dev.gpu
started_at    TIMESTAMP
completed_at  TIMESTAMP
wandb_run_id  VARCHAR              NULL if not tracked
sync_status   VARCHAR              pending | synced | failed

twins/annotations

Human or agent annotations on twin state.

Column           Type                          Description
annotation_id    VARCHAR PK                    UUID
asset_id         VARCHAR FK → assets
snapshot_id      VARCHAR FK → state_snapshots  NULL for job-level annotations
author           VARCHAR                       Agent name or user email
annotation_type  VARCHAR                       label | anomaly_flag | note | review
content          VARCHAR                       Free text or JSON
created_at       TIMESTAMP

Data flow — v1

In-job writes (Ray worker → MinIO)

During a ros.dev.gpu RayJob the worker writes live data directly to MinIO:

Gazebo + Nav2
→ DDS topics
→ Zenoh bridge
→ Ray worker (Python)
→ pyarrow: append rows to sensor_readings Parquet partition
→ pyarrow: append rows to state_snapshots Parquet partition (reconciled=False)

The worker uses the MinIO endpoint configured in data-plane/ (same credentials as the LakehouseAgent) and writes to a job-specific partition: twins/state_snapshots/job_id={job_id}/part-0.parquet.
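The partition layout and the shape of an in-job snapshot row can be sketched in plain Python (helper names here are illustrative, not part of the worker code; the actual worker batches rows through pyarrow):

```python
from datetime import datetime, timezone
from uuid import uuid4


def snapshot_partition(job_id: str) -> str:
    """Job-scoped partition path under the twins/ prefix in MinIO."""
    return f"twins/state_snapshots/job_id={job_id}/part-0.parquet"


def pose_row(asset_id: str, job_id: str, x: float, y: float) -> dict:
    """One unreconciled state_snapshots row as written in-job."""
    return {
        "snapshot_id": str(uuid4()),
        "asset_id": asset_id,
        "job_id": job_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ros_job",
        "position_x": x,
        "position_y": y,
        "reconciled": False,  # flipped to True by post-job reconciliation
    }
```

Keeping partitions job-scoped means a crashed job leaves only an isolated, unreconciled partition behind, never a half-written main table.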

Post-job reconciliation (TwinAgent)

When the Ray worker completes, the control plane calls sync_twin(twin_id, job_id). The TwinAgent subprocess:

  1. Reads in-job Parquet partitions from MinIO
  2. Reads the W&B run (if linked) for additional metrics
  3. Validates schema integrity; flags anomalies in events
  4. Sets reconciled=True on validated snapshots
  5. Merges job partition into main table (or leaves partitioned — DuckDB handles both)
  6. Appends a twin.synced event record
  7. Updates twin_jobs.sync_status = 'synced'
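The validation core of the steps above can be sketched over in-memory rows (the real TwinAgent performs this through DuckDB over the Parquet partitions; the function name is illustrative):

```python
def reconcile(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split in-job snapshot rows into validated rows (reconciled=True)
    and anomalous rows to be flagged in the events table."""
    required = {"snapshot_id", "asset_id", "job_id", "timestamp"}
    validated, anomalies = [], []
    for row in rows:
        if required <= row.keys():
            validated.append({**row, "reconciled": True})
        else:
            anomalies.append(row)  # would become an anomaly event record
    return validated, anomalies
```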

API

New router: control-plane/backend/api/twins.py, mounted at /api/v1/twins.

GET    /api/v1/twins                     List all registered twins
POST   /api/v1/twins                     Register a new twin (calls TwinAgent.create_twin)
GET    /api/v1/twins/{id}                Twin details (asset record + latest state)
GET    /api/v1/twins/{id}/state          Latest state snapshot (or ?at=ISO8601 for point-in-time)
POST   /api/v1/twins/{id}/sync           Trigger post-job reconciliation
POST   /api/v1/twins/{id}/predict        Generate predicted future snapshots via Cosmos-Predict2 + Transfer2.5
GET    /api/v1/twins/{id}/events         Event log (paginated)
GET    /api/v1/twins/{id}/annotations    List annotations
POST   /api/v1/twins/{id}/annotations    Add an annotation

Pydantic models in control-plane/backend/models/twin.py: Twin, TwinCreate, TwinState, TwinSyncRequest, TwinAnnotation.


Reference asset: TurtleBot

The v1 end-to-end demo creates a twin for turtlebot-01 on ros.dev.gpu:

1. TwinAgent.create_twin("turtlebot-01", "robot", urdf_path="user-plane/turtlebot/urdf/turtlebot3.urdf")
2. Control plane submits turtlebot-maze RayJob to ros.dev.gpu
3. Ray worker connects to Gazebo via Zenoh; writes pose + IMU to MinIO in-job
4. Job completes → POST /api/v1/twins/turtlebot-01/sync?job_id=<id>
5. TwinAgent reconciles; twin.synced event logged
6. query_twin("turtlebot-01", "SELECT timestamp, position_x, position_y FROM state_snapshots ORDER BY timestamp")
   → returns the full trajectory from the job

# v1.5 extension — Cosmos-driven predicted twin state:
7. POST /api/v1/twins/turtlebot-01/predict {action: {cmd_vel: {linear: 0.3, angular: 0.1}}, horizon_s: 5.0}
→ TwinAgent calls Cosmos-Predict2 (torch.dev.gpu) with latest observed snapshot as seed frame
→ Cosmos-Transfer2.5 translates synthetic prediction → photorealistic
→ Predicted snapshots written to state_snapshots (source=predicted, cosmos_model=transfer2.5)
8. Cosmos-Reason2 (ros.dev.gpu) reads predicted snapshots as visual context
→ evaluates feasibility → go / no-go before Nav2 goal dispatch

Evolution path

v1   — Persistent world model: lakehouse-backed twins for TurtleBot; in-job writes + post-job reconciliation
v1.5 — Real-time sync: Redis hot-cache for live pose (sub-second); Zenoh → twin writer as a persistent Ray actor
       Cosmos-Predict2 + Cosmos-Transfer2.5: predicted twin state snapshots (source=predicted/sim2real)
       predict_twin() operation + POST /api/v1/twins/{id}/predict endpoint
       Cosmos-Reason2: reads predicted snapshots as visual context for pre-execution feasibility evaluation
       Predict → Transfer → Reason → Execute loop integrated with twin state
v2   — Agent memory substrate: TwinAgent is the canonical world-model interface for all control-plane agents
       Cosmos models post-trained on turtlebot-maze ROS bags; domain-specific predicted snapshots
       Visualisation: Rerun viewer embedded in Next.js dashboard (observed + predicted state overlay)
       Multi-asset: VisDrone camera platform as second reference twin
       Management plane: per-twin access control, retention policies

Files to create

control-plane/backend/
  agents/twin_agent.py    TwinAgent wrapper (claude -p subprocess)
  api/twins.py            FastAPI router
  models/twin.py          Pydantic models

data-plane/
  schema/twins/           Schema definitions and migration scripts

.claude/agents/
  twin-agent.md           Agent definition (YAML frontmatter + system prompt)

main.py — mount the twins router alongside the existing jobs, clusters, experiments, and lakehouse routers.