
Auraison — Digital Twins Design

Date: 2026-03-02
Status: Approved (v1)
Epic: auraison-5z3


Problem

Agentic workloads in the user plane produce rich runtime state — robot pose, sensor readings, navigation events, perception outputs — but this state is ephemeral. It lives inside a RayJob for its lifetime and is lost when the job completes. The control plane has no persistent, queryable model of the physical world its agents are acting on.

A digital twin is a persistent, structured representation of a physical asset that accumulates state over time. For Auraison, twins are the bridge between the transient execution world of the user plane and the durable memory world of the data plane. They enable:

  • Historical replay: reconstruct what the robot was doing during any past job
  • Causal analysis: correlate agent decisions with physical state at decision time
  • Predictive modelling: feed twin state into Cosmos-Predict2 + Cosmos-Transfer2.5 to forecast photorealistic future states
  • Agent memory: let control-plane agents read world state without polling live sensors
  • Reasoning substrate: serve twin state snapshots as visual context to Cosmos-Reason2 for physics-grounded feasibility evaluation

Goals

  • Persist physical asset state (pose, sensor readings, events) in the data-plane lakehouse
  • Provide a TwinAgent subprocess for control-plane agents to create, sync, query, and retire twins
  • Expose twin state via a FastAPI router consistent with existing API conventions
  • Demonstrate end-to-end with the TurtleBot reference asset on ros.dev.gpu

Non-goals (v1)

  • Real-time sub-second twin state (deferred to v1.5 with Redis hot-cache)
  • Multi-tenant asset isolation (management plane — v2)
  • Cosmos-Predict2 / Cosmos-Transfer2.5 predicted twin state snapshots (v1.5)
  • Physics simulation driven from twin state (v1.5 / v2)
  • Visualisation UI in the Next.js dashboard (v2)

Architecture

Digital twins span three planes:

User Plane (ros.dev.gpu)
Ray worker writes live sensor data → MinIO (data plane) during job
Ray worker writes live pose snapshots → MinIO (data plane) during job
Cosmos-Reason2: reads twin state snapshots as visual context for feasibility evaluation

User Plane (torch.dev.gpu)
Cosmos-Predict2: reads observed twin state → generates predicted future state video
Cosmos-Transfer2.5: translates predicted synthetic video → photorealistic; stored as predicted snapshots

Control Plane
TwinAgent subprocess (claude -p) reconciles and validates at job end
TwinAgent exposes create / sync / query / retire / predict operations
FastAPI /api/v1/twins router accepts HTTP calls from UI and other agents

Data Plane (MinIO + DuckDB)
Seven Parquet tables under the twins/ prefix
state_snapshots.source distinguishes observed vs predicted vs sim2real snapshots
DuckDB queries served by TwinAgent and LakehouseAgent

TwinAgent

TwinAgent is a claude -p subprocess in control-plane/backend/agents/twin_agent.py, following the same pattern as LakehouseAgent and ClusterAgent.

Tool scope: Bash(duckdb *), Bash(python *), Read

Operations exposed to the control plane:

Operation                                     Description
create_twin(asset_id, asset_type, urdf_path)  Register a new asset; create lakehouse tables if absent
sync_twin(twin_id, job_id)                    Post-job reconciliation: read ROS bag / W&B outputs; append validated snapshots
query_twin(twin_id, query)                    DuckDB query over twin tables; returns Arrow/pandas result
get_twin_state(twin_id, at=None)              Latest state snapshot, or point-in-time if at provided
annotate_twin(twin_id, annotation)            Append an annotation record
predict_twin(twin_id, action, horizon_s)      Call Cosmos-Predict2 → Cosmos-Transfer2.5 with latest observed state; store predicted snapshots (source=predicted/sim2real)
retire_twin(twin_id)                          Mark asset as retired; preserve history

Python wrapper interface:

# control-plane/backend/agents/twin_agent.py
def create_twin(asset_id: str, asset_type: str, urdf_path: str | None = None) -> dict: ...
def sync_twin(twin_id: str, job_id: str) -> dict: ...
def query_twin(twin_id: str, query: str) -> dict: ...
def get_twin_state(twin_id: str, at: str | None = None) -> dict: ...
def predict_twin(twin_id: str, action: dict, horizon_s: float) -> dict: ...
def annotate_twin(twin_id: str, annotation: dict) -> dict: ...
def retire_twin(twin_id: str) -> dict: ...

Each function constructs a prompt and calls run_agent() from agents/base.py with ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read".
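The wrapper functions are thin prompt builders around the subprocess call. A minimal sketch of sync_twin, with run_agent() stubbed out here (the real implementation lives in agents/base.py and spawns claude -p; its exact signature may differ):

```python
# Sketch of the TwinAgent wrapper pattern. `run_agent` is a stub standing
# in for the claude -p subprocess call defined in agents/base.py.
import json

ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read"


def run_agent(prompt: str, allowed_tools: str) -> str:
    # Stub: the real function launches the agent subprocess and returns
    # its final text output.
    return json.dumps({"status": "synced", "prompt_len": len(prompt)})


def sync_twin(twin_id: str, job_id: str) -> dict:
    """Build the reconciliation prompt and hand it to the agent subprocess."""
    prompt = (
        f"Reconcile twin '{twin_id}' for job '{job_id}'. "
        "Read the in-job Parquet partitions from MinIO, validate schema "
        "integrity, set reconciled=True on valid snapshots, and append a "
        "twin.synced event. Reply with a JSON status object."
    )
    return json.loads(run_agent(prompt, ALLOWED_TOOLS))
```

The other six operations follow the same shape: format a task-specific prompt, pass the fixed tool allowlist, parse the agent's JSON reply.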


Schema

Seven Parquet tables under the twins/ prefix in MinIO. DuckDB reads them via the existing DuckLake configuration.

twins/assets

Asset registry — one row per physical asset.

Column        Type        Description
asset_id      VARCHAR PK  Stable identifier (e.g. turtlebot-01)
asset_type    VARCHAR     robot | drone | sensor_node | custom
display_name  VARCHAR     Human-readable name
status        VARCHAR     active | retired
created_at    TIMESTAMP
retired_at    TIMESTAMP   NULL if active
metadata      JSON        Arbitrary key-value pairs

twins/urdf_assets

URDF / CAD model references, versioned.

Column       Type                 Description
urdf_id      VARCHAR PK
asset_id     VARCHAR FK → assets
version      VARCHAR              Semantic version string
urdf_path    VARCHAR              Path in MinIO or local repo
format       VARCHAR              urdf | xacro | sdf | obj
uploaded_at  TIMESTAMP
checksum     VARCHAR              SHA-256 of file
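The checksum column is a plain SHA-256 over the model file, computable with stdlib hashlib (the helper name here is illustrative):

```python
import hashlib


def urdf_checksum(data: bytes) -> str:
    """SHA-256 hex digest of a URDF/CAD file, as stored in
    urdf_assets.checksum."""
    return hashlib.sha256(data).hexdigest()
```

Recomputing the digest on read lets the TwinAgent detect a model file that drifted from its registered version.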

twins/state_snapshots

Point-in-time pose snapshots.

Column                      Type                          Description
snapshot_id                 VARCHAR PK                    UUID
asset_id                    VARCHAR FK → assets
job_id                      VARCHAR FK → twin_jobs
timestamp                   TIMESTAMP                     Time of observation
source                      VARCHAR                       ros_job | manual | simulation | predicted | sim2real
position_x                  DOUBLE                        Metres, world frame
position_y                  DOUBLE
position_z                  DOUBLE
orientation_qx              DOUBLE                        Quaternion
orientation_qy              DOUBLE
orientation_qz              DOUBLE
orientation_qw              DOUBLE
linear_velocity             DOUBLE                        m/s
angular_velocity            DOUBLE                        rad/s
reconciled                  BOOLEAN                       True after post-job TwinAgent validation
cosmos_model                VARCHAR                       NULL for observed; predict2 | transfer2.5 for generated snapshots
predicted_from_snapshot_id  VARCHAR FK → state_snapshots  Seed snapshot used by Cosmos-Predict2; NULL for observed
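For a planar asset like the TurtleBot, heading is typically recovered from the orientation_q* columns. A minimal sketch using the standard quaternion-to-yaw conversion:

```python
import math


def snapshot_yaw(qx: float, qy: float, qz: float, qw: float) -> float:
    """Yaw (rotation about the world z axis) in radians from a unit
    quaternion, matching the orientation_q* columns of state_snapshots."""
    # Standard ZYX Euler extraction of the yaw component.
    return math.atan2(2.0 * (qw * qz + qx * qy),
                      1.0 - 2.0 * (qy * qy + qz * qz))
```

For example, the identity quaternion (0, 0, 0, 1) gives a yaw of 0, and a 90° rotation about z (qz = sin 45°, qw = cos 45°) gives π/2.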

twins/sensor_readings

Time-series sensor data (separate from pose to keep state_snapshots lean).

Column       Type                    Description
reading_id   VARCHAR PK              UUID
asset_id     VARCHAR FK → assets
job_id       VARCHAR FK → twin_jobs
sensor_type  VARCHAR                 imu | lidar | camera | gps | odometry
timestamp    TIMESTAMP
payload      JSON                    Sensor-specific structured data
raw_path     VARCHAR                 Path to raw file in MinIO (e.g. ROS bag slice)

twins/events

Discrete twin lifecycle and runtime events.

Column      Type                    Description
event_id    VARCHAR PK              UUID
asset_id    VARCHAR FK → assets
job_id      VARCHAR FK → twin_jobs  NULL for lifecycle events
event_type  VARCHAR                 twin.created | twin.synced | twin.retired | nav.goal_set | nav.goal_reached | nav.obstacle_detected | …
timestamp   TIMESTAMP
actor       VARCHAR                 Agent or system that generated the event
payload     JSON                    Event-specific data

twins/twin_jobs

Link table — maps twin to jobs that produced state.

Column        Type                 Description
twin_job_id   VARCHAR PK           UUID
twin_id       VARCHAR FK → assets
job_id        VARCHAR              Control-plane job UUID
ray_job_id    VARCHAR              Ray job ID
environment   VARCHAR              torch.dev.gpu | ros.dev.gpu
started_at    TIMESTAMP
completed_at  TIMESTAMP
wandb_run_id  VARCHAR              NULL if not tracked
sync_status   VARCHAR              pending | synced | failed

twins/annotations

Human or agent annotations on twin state.

Column           Type                          Description
annotation_id    VARCHAR PK                    UUID
asset_id         VARCHAR FK → assets
snapshot_id      VARCHAR FK → state_snapshots  NULL for job-level annotations
author           VARCHAR                       Agent name or user email
annotation_type  VARCHAR                       label | anomaly_flag | note | review
content          VARCHAR                       Free text or JSON
created_at       TIMESTAMP

Data flow — v1

In-job writes (Ray worker → MinIO)

During a ros.dev.gpu RayJob the worker writes live data directly to MinIO:

Gazebo + Nav2
→ DDS topics
→ Zenoh bridge
→ Ray worker (Python)
→ pyarrow: append rows to sensor_readings Parquet partition
→ pyarrow: append rows to state_snapshots Parquet partition (reconciled=False)

The worker uses the MinIO endpoint configured in data-plane/ (same credentials as the LakehouseAgent) and writes to a job-specific partition: twins/state_snapshots/job_id={job_id}/part-0.parquet.
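The partition layout and the shape of an in-job snapshot row can be sketched in plain Python (helper names here are illustrative, not part of the worker code; the actual worker batches rows through pyarrow):

```python
from datetime import datetime, timezone
from uuid import uuid4


def snapshot_partition(job_id: str) -> str:
    """Job-scoped partition path under the twins/ prefix in MinIO."""
    return f"twins/state_snapshots/job_id={job_id}/part-0.parquet"


def pose_row(asset_id: str, job_id: str, x: float, y: float) -> dict:
    """One unreconciled state_snapshots row as written in-job."""
    return {
        "snapshot_id": str(uuid4()),
        "asset_id": asset_id,
        "job_id": job_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ros_job",
        "position_x": x,
        "position_y": y,
        "reconciled": False,  # flipped to True by post-job reconciliation
    }
```

Keeping partitions job-scoped means a crashed job leaves only an isolated, unreconciled partition behind, never a half-written main table.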

Post-job reconciliation (TwinAgent)

When the Ray worker completes, the control plane calls sync_twin(twin_id, job_id). The TwinAgent subprocess:

  1. Reads in-job Parquet partitions from MinIO
  2. Reads the W&B run (if linked) for additional metrics
  3. Validates schema integrity; flags anomalies in events
  4. Sets reconciled=True on validated snapshots
  5. Merges job partition into main table (or leaves partitioned — DuckDB handles both)
  6. Appends a twin.synced event record
  7. Updates twin_jobs.sync_status = 'synced'
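The validation core of the steps above can be sketched over in-memory rows (the real TwinAgent performs this through DuckDB over the Parquet partitions; the function name is illustrative):

```python
def reconcile(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split in-job snapshot rows into validated rows (reconciled=True)
    and anomalous rows to be flagged in the events table."""
    required = {"snapshot_id", "asset_id", "job_id", "timestamp"}
    validated, anomalies = [], []
    for row in rows:
        if required <= row.keys():
            validated.append({**row, "reconciled": True})
        else:
            anomalies.append(row)  # would become an anomaly event record
    return validated, anomalies
```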

API

New router: control-plane/backend/api/twins.py, mounted at /api/v1/twins.

GET    /api/v1/twins                     List all registered twins
POST   /api/v1/twins                     Register a new twin (calls TwinAgent.create_twin)
GET    /api/v1/twins/{id}                Twin details (asset record + latest state)
GET    /api/v1/twins/{id}/state          Latest state snapshot (or ?at=ISO8601 for point-in-time)
POST   /api/v1/twins/{id}/sync           Trigger post-job reconciliation
POST   /api/v1/twins/{id}/predict        Generate predicted future snapshots via Cosmos-Predict2 + Transfer2.5
GET    /api/v1/twins/{id}/events         Event log (paginated)
GET    /api/v1/twins/{id}/annotations    List annotations
POST   /api/v1/twins/{id}/annotations    Add an annotation

Pydantic models in control-plane/backend/models/twin.py: Twin, TwinCreate, TwinState, TwinSyncRequest, TwinAnnotation.


Reference asset: TurtleBot

The v1 end-to-end demo creates a twin for turtlebot-01 on ros.dev.gpu:

1. TwinAgent.create_twin("turtlebot-01", "robot", urdf_path="user-plane/turtlebot/urdf/turtlebot3.urdf")
2. Control plane submits turtlebot-maze RayJob to ros.dev.gpu
3. Ray worker connects to Gazebo via Zenoh; writes pose + IMU to MinIO in-job
4. Job completes → POST /api/v1/twins/turtlebot-01/sync?job_id=<id>
5. TwinAgent reconciles; twin.synced event logged
6. query_twin("turtlebot-01", "SELECT timestamp, position_x, position_y FROM state_snapshots ORDER BY timestamp")
   → returns the full trajectory from the job

# v1.5 extension — Cosmos-driven predicted twin state:
7. POST /api/v1/twins/turtlebot-01/predict {action: {cmd_vel: {linear: 0.3, angular: 0.1}}, horizon_s: 5.0}
→ TwinAgent calls Cosmos-Predict2 (torch.dev.gpu) with latest observed snapshot as seed frame
→ Cosmos-Transfer2.5 translates synthetic prediction → photorealistic
→ Predicted snapshots written to state_snapshots (source=predicted, cosmos_model=transfer2.5)
8. Cosmos-Reason2 (ros.dev.gpu) reads predicted snapshots as visual context
→ evaluates feasibility → go / no-go before Nav2 goal dispatch

Evolution path

v1   — Persistent world model: lakehouse-backed twins for TurtleBot; in-job writes + post-job reconciliation
v1.5 — Real-time sync: Redis hot-cache for live pose (sub-second); Zenoh → twin writer as a persistent Ray actor
       Cosmos-Predict2 + Cosmos-Transfer2.5: predicted twin state snapshots (source=predicted/sim2real)
       predict_twin() operation + POST /api/v1/twins/{id}/predict endpoint
       Cosmos-Reason2: reads predicted snapshots as visual context for pre-execution feasibility evaluation
       Predict → Transfer → Reason → Execute loop integrated with twin state
v2   — Agent memory substrate: TwinAgent is the canonical world-model interface for all control-plane agents
       Cosmos models post-trained on turtlebot-maze ROS bags; domain-specific predicted snapshots
       Visualisation: Rerun viewer embedded in Next.js dashboard (observed + predicted state overlay)
       Multi-asset: VisDrone camera platform as second reference twin
       Management plane: per-twin access control, retention policies

Files to create

control-plane/backend/
  agents/twin_agent.py    TwinAgent wrapper (claude -p subprocess)
  api/twins.py            FastAPI router
  models/twin.py          Pydantic models

data-plane/
  schema/twins/           Schema definitions and migration scripts

.claude/agents/
  twin-agent.md           Agent definition (YAML frontmatter + system prompt)

main.py — mount the twins router alongside the existing jobs, clusters, experiments, and lakehouse routers.