Auraison — Digital Twins Design
Date: 2026-03-02 Status: Approved (v1) Epic: auraison-5z3
Problem
Agentic workloads in the user plane produce rich runtime state — robot pose, sensor readings, navigation events, perception outputs — but this state is ephemeral. It lives inside a RayJob for its lifetime and is lost when the job completes. The control plane has no persistent, queryable model of the physical world its agents are acting on.
A digital twin is a persistent, structured representation of a physical asset that accumulates state over time. For Auraison, twins are the bridge between the transient execution world of the user plane and the durable memory world of the data plane. They enable:
- Historical replay: reconstruct what the robot was doing during any past job
- Causal analysis: correlate agent decisions with physical state at decision time
- Predictive modelling: feed twin state into Cosmos-Predict2 + Cosmos-Transfer2.5 to forecast photorealistic future states
- Agent memory: let control-plane agents read world state without polling live sensors
- Reasoning substrate: serve twin state snapshots as visual context to Cosmos-Reason2 for physics-grounded feasibility evaluation
Goals
- Persist physical asset state (pose, sensor readings, events) in the data-plane lakehouse
- Provide a TwinAgent subprocess for control-plane agents to create, sync, query, and retire twins
- Expose twin state via a FastAPI router consistent with existing API conventions
- Demonstrate end-to-end with the TurtleBot reference asset on ros.dev.gpu
Non-goals (v1)
- Real-time sub-second twin state (deferred to v1.5 with Redis hot-cache)
- Multi-tenant asset isolation (management plane — v2)
- Cosmos-Predict2 / Cosmos-Transfer2.5 predicted twin state snapshots (v1.5)
- Physics simulation driven from twin state (v1.5 / v2)
- Visualisation UI in the Next.js dashboard (v2)
Architecture
Digital twins span three planes:
User Plane (ros.dev.gpu)
- Ray worker writes live sensor data → MinIO (data plane) during job
- Ray worker writes live pose snapshots → MinIO (data plane) during job
- Cosmos-Reason2: reads twin state snapshots as visual context for feasibility evaluation

User Plane (torch.dev.gpu)
- Cosmos-Predict2: reads observed twin state → generates predicted future state video
- Cosmos-Transfer2.5: translates predicted synthetic video → photorealistic; stored as predicted snapshots

Control Plane
- TwinAgent subprocess (claude -p) reconciles and validates at job end
- TwinAgent exposes create / sync / query / retire / predict operations
- FastAPI /api/v1/twins router accepts HTTP calls from UI and other agents

Data Plane (MinIO + DuckDB)
- Seven Parquet tables under the twins/ prefix
- state_snapshots.source distinguishes observed vs predicted vs sim2real snapshots
- DuckDB queries served by TwinAgent and LakehouseAgent
TwinAgent
TwinAgent is a claude -p subprocess in control-plane/backend/agents/twin_agent.py, following the same pattern as LakehouseAgent and ClusterAgent.
Tool scope: Bash(duckdb *), Bash(python *), Read
Operations exposed to the control plane:
| Operation | Description |
|---|---|
| create_twin(asset_id, asset_type, urdf_path) | Register a new asset; create lakehouse tables if absent |
| sync_twin(twin_id, job_id) | Post-job reconciliation: read ROS bag / W&B outputs; append validated snapshots |
| query_twin(twin_id, query) | DuckDB query over twin tables; returns Arrow/pandas result |
| get_twin_state(twin_id, at=None) | Latest state snapshot, or point-in-time if at provided |
| annotate_twin(twin_id, annotation) | Append an annotation record |
| predict_twin(twin_id, action, horizon_s) | Call Cosmos-Predict2 → Cosmos-Transfer2.5 with latest observed state; store predicted snapshots (source=predicted/sim2real) |
| retire_twin(twin_id) | Mark asset as retired; preserve history |
Python wrapper interface:
```python
# control-plane/backend/agents/twin_agent.py
def create_twin(asset_id: str, asset_type: str, urdf_path: str | None = None) -> dict: ...
def sync_twin(twin_id: str, job_id: str) -> dict: ...
def query_twin(twin_id: str, query: str) -> dict: ...
def get_twin_state(twin_id: str, at: str | None = None) -> dict: ...
def predict_twin(twin_id: str, action: dict, horizon_s: float) -> dict: ...
def annotate_twin(twin_id: str, annotation: dict) -> dict: ...
def retire_twin(twin_id: str) -> dict: ...
```
Each function constructs a prompt and calls run_agent() from agents/base.py with
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read".
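One wrapper is enough to illustrate the pattern. The run_agent signature is assumed from the description above, and a stub is included only so the sketch runs standalone; in the real module it is imported from agents/base.py:

```python
# Sketch of one TwinAgent wrapper. run_agent is stubbed here for
# self-containment; the real function lives in agents/base.py.
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Read"

def run_agent(prompt: str, allowed_tools: str) -> dict:
    # Stub standing in for the claude -p subprocess invocation.
    return {"prompt": prompt, "allowed_tools": allowed_tools}

def sync_twin(twin_id: str, job_id: str) -> dict:
    # Build a natural-language task prompt and hand it to the agent
    # with the restricted tool scope.
    prompt = (
        f"Reconcile twin '{twin_id}' after job '{job_id}': read the in-job "
        "Parquet partitions from MinIO, validate schema integrity, set "
        "reconciled=True on valid snapshots, append a twin.synced event, "
        "and update twin_jobs.sync_status."
    )
    return run_agent(prompt, allowed_tools=ALLOWED_TOOLS)
```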
Schema
Seven Parquet tables under the twins/ prefix in MinIO. DuckDB reads them via the existing
DuckLake configuration.
twins/assets
Asset registry — one row per physical asset.
| Column | Type | Description |
|---|---|---|
| asset_id | VARCHAR PK | Stable identifier (e.g. turtlebot-01) |
| asset_type | VARCHAR | robot \| drone \| sensor_node \| custom |
| display_name | VARCHAR | Human-readable name |
| status | VARCHAR | active \| retired |
| created_at | TIMESTAMP | |
| retired_at | TIMESTAMP | NULL if active |
| metadata | JSON | Arbitrary key-value pairs |
twins/urdf_assets
URDF / CAD model references, versioned.
| Column | Type | Description |
|---|---|---|
| urdf_id | VARCHAR PK | |
| asset_id | VARCHAR FK → assets | |
| version | VARCHAR | Semantic version string |
| urdf_path | VARCHAR | Path in MinIO or local repo |
| format | VARCHAR | urdf \| xacro \| sdf \| obj |
| uploaded_at | TIMESTAMP | |
| checksum | VARCHAR | SHA-256 of file |
twins/state_snapshots
Point-in-time pose snapshots.
| Column | Type | Description |
|---|---|---|
| snapshot_id | VARCHAR PK | UUID |
| asset_id | VARCHAR FK → assets | |
| job_id | VARCHAR FK → twin_jobs | |
| timestamp | TIMESTAMP | Time of observation |
| source | VARCHAR | ros_job \| manual \| simulation \| predicted \| sim2real |
| position_x | DOUBLE | Metres, world frame |
| position_y | DOUBLE | |
| position_z | DOUBLE | |
| orientation_qx | DOUBLE | Quaternion |
| orientation_qy | DOUBLE | |
| orientation_qz | DOUBLE | |
| orientation_qw | DOUBLE | |
| linear_velocity | DOUBLE | m/s |
| angular_velocity | DOUBLE | rad/s |
| reconciled | BOOLEAN | True after post-job TwinAgent validation |
| cosmos_model | VARCHAR | NULL for observed; predict2 \| transfer2.5 for generated snapshots |
| predicted_from_snapshot_id | VARCHAR FK → state_snapshots | Seed snapshot used by Cosmos-Predict2; NULL for observed |
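A row for this table can be sketched as a plain dict. The builder function below is illustrative (not part of the design); the keys match the columns above:

```python
import uuid
from datetime import datetime, timezone

# Hypothetical builder for an observed state_snapshots row. Column names
# follow the twins/state_snapshots schema; the function itself is a sketch.
def observed_snapshot(asset_id: str, job_id: str, pose: dict) -> dict:
    return {
        "snapshot_id": str(uuid.uuid4()),
        "asset_id": asset_id,
        "job_id": job_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ros_job",
        "position_x": pose["x"], "position_y": pose["y"], "position_z": pose["z"],
        "orientation_qx": pose["qx"], "orientation_qy": pose["qy"],
        "orientation_qz": pose["qz"], "orientation_qw": pose["qw"],
        "linear_velocity": pose.get("v", 0.0),
        "angular_velocity": pose.get("w", 0.0),
        "reconciled": False,           # flipped to True by TwinAgent at sync
        "cosmos_model": None,          # NULL: observed, not generated
        "predicted_from_snapshot_id": None,
    }
```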
twins/sensor_readings
Time-series sensor data (separate from pose to keep state_snapshots lean).
| Column | Type | Description |
|---|---|---|
| reading_id | VARCHAR PK | UUID |
| asset_id | VARCHAR FK → assets | |
| job_id | VARCHAR FK → twin_jobs | |
| sensor_type | VARCHAR | imu \| lidar \| camera \| gps \| odometry |
| timestamp | TIMESTAMP | |
| payload | JSON | Sensor-specific structured data |
| raw_path | VARCHAR | Path to raw file in MinIO (e.g. ROS bag slice) |
twins/events
Discrete twin lifecycle and runtime events.
| Column | Type | Description |
|---|---|---|
| event_id | VARCHAR PK | UUID |
| asset_id | VARCHAR FK → assets | |
| job_id | VARCHAR FK → twin_jobs | NULL for lifecycle events |
| event_type | VARCHAR | twin.created \| twin.synced \| twin.retired \| nav.goal_set \| nav.goal_reached \| nav.obstacle_detected \| … |
| timestamp | TIMESTAMP | |
| actor | VARCHAR | Agent or system that generated the event |
| payload | JSON | Event-specific data |
twins/twin_jobs
Link table — maps twin to jobs that produced state.
| Column | Type | Description |
|---|---|---|
| twin_job_id | VARCHAR PK | UUID |
| twin_id | VARCHAR FK → assets | |
| job_id | VARCHAR | Control-plane job UUID |
| ray_job_id | VARCHAR | Ray job ID |
| environment | VARCHAR | torch.dev.gpu \| ros.dev.gpu |
| started_at | TIMESTAMP | |
| completed_at | TIMESTAMP | |
| wandb_run_id | VARCHAR | NULL if not tracked |
| sync_status | VARCHAR | pending \| synced \| failed |
twins/annotations
Human or agent annotations on twin state.
| Column | Type | Description |
|---|---|---|
| annotation_id | VARCHAR PK | UUID |
| asset_id | VARCHAR FK → assets | |
| snapshot_id | VARCHAR FK → state_snapshots | NULL for job-level annotations |
| author | VARCHAR | Agent name or user email |
| annotation_type | VARCHAR | label \| anomaly_flag \| note \| review |
| content | VARCHAR | Free text or JSON |
| created_at | TIMESTAMP | |
Data flow — v1
In-job writes (Ray worker → MinIO)
During a ros.dev.gpu RayJob the worker writes live data directly to MinIO:
```
Gazebo + Nav2
  → DDS topics
  → Zenoh bridge
  → Ray worker (Python)
      → pyarrow: append rows to sensor_readings Parquet partition
      → pyarrow: append rows to state_snapshots Parquet partition (reconciled=False)
```
The worker uses the MinIO endpoint configured in data-plane/ (same credentials as
the LakehouseAgent) and writes to a job-specific partition:
twins/state_snapshots/job_id={job_id}/part-0.parquet.
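The partition layout can be sketched as a small path builder. The helper is illustrative (the real worker writes these paths via pyarrow against the MinIO endpoint):

```python
# Hypothetical helper computing the job-specific Hive-style partition key
# the Ray worker appends to during a job.
def partition_path(table: str, job_id: str, part: int = 0) -> str:
    return f"twins/{table}/job_id={job_id}/part-{part}.parquet"

print(partition_path("state_snapshots", "job-42"))
# → twins/state_snapshots/job_id=job-42/part-0.parquet
```

Partitioning by job_id keeps in-job writes append-only and lets DuckDB prune partitions when querying a single job's trajectory.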
Post-job reconciliation (TwinAgent)
When the Ray worker completes, the control plane calls sync_twin(twin_id, job_id).
The TwinAgent subprocess:
- Reads in-job Parquet partitions from MinIO
- Reads the W&B run (if linked) for additional metrics
- Validates schema integrity; flags anomalies in events
- Sets reconciled=True on validated snapshots
- Merges the job partition into the main table (or leaves it partitioned — DuckDB handles both)
- Appends a twin.synced event record
- Updates twin_jobs.sync_status = 'synced'
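One plausible per-row validation (an assumption for illustration; the actual checks are defined in the TwinAgent prompt) is a quaternion sanity check before a snapshot is marked reconciled:

```python
import math

# Illustrative validation: a pose snapshot passes if its orientation
# quaternion is approximately unit-norm. Rows that fail would be flagged
# in the events table rather than reconciled.
def is_valid_orientation(qx: float, qy: float, qz: float, qw: float,
                         tol: float = 1e-3) -> bool:
    norm = math.sqrt(qx * qx + qy * qy + qz * qz + qw * qw)
    return abs(norm - 1.0) <= tol

assert is_valid_orientation(0.0, 0.0, 0.0, 1.0)       # identity rotation
assert not is_valid_orientation(0.1, 0.2, 0.3, 0.4)   # corrupt reading
```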
API
New router: control-plane/backend/api/twins.py, mounted at /api/v1/twins.
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/twins | List all registered twins |
| POST | /api/v1/twins | Register a new twin (calls TwinAgent.create_twin) |
| GET | /api/v1/twins/{id} | Twin details (asset record + latest state) |
| GET | /api/v1/twins/{id}/state | Latest state snapshot (or ?at=ISO8601 for point-in-time) |
| POST | /api/v1/twins/{id}/sync | Trigger post-job reconciliation |
| POST | /api/v1/twins/{id}/predict | Generate predicted future snapshots via Cosmos-Predict2 + Transfer2.5 |
| GET | /api/v1/twins/{id}/events | Event log (paginated) |
| GET | /api/v1/twins/{id}/annotations | Annotations |
| POST | /api/v1/twins/{id}/annotations | Add annotation |
Pydantic models in control-plane/backend/models/twin.py:
Twin, TwinCreate, TwinState, TwinSyncRequest, TwinAnnotation.
Reference asset: TurtleBot
The v1 end-to-end demo builds a twin for turtlebot-01 on ros.dev.gpu:
1. TwinAgent.create_twin("turtlebot-01", "robot", urdf_path="user-plane/turtlebot/urdf/turtlebot3.urdf")
2. Control plane submits turtlebot-maze RayJob to ros.dev.gpu
3. Ray worker connects to Gazebo via Zenoh; writes pose + IMU to MinIO in-job
4. Job completes → POST /api/v1/twins/turtlebot-01/sync?job_id=<id>
5. TwinAgent reconciles; twin.synced event logged
6. query_twin("turtlebot-01", "SELECT timestamp, position_x, position_y FROM state_snapshots ORDER BY timestamp")
→ returns full trajectory from the job
# v1.5 extension — Cosmos-driven predicted twin state:
7. POST /api/v1/twins/turtlebot-01/predict {action: {cmd_vel: {linear: 0.3, angular: 0.1}}, horizon_s: 5.0}
→ TwinAgent calls Cosmos-Predict2 (torch.dev.gpu) with latest observed snapshot as seed frame
→ Cosmos-Transfer2.5 translates synthetic prediction → photorealistic
→ Predicted snapshots written to state_snapshots (source=predicted, cosmos_model=transfer2.5)
8. Cosmos-Reason2 (ros.dev.gpu) reads predicted snapshots as visual context
→ evaluates feasibility → go / no-go before Nav2 goal dispatch
Evolution path
v1 — Persistent world model
- Lakehouse-backed twins for TurtleBot; in-job writes + post-job reconciliation

v1.5 — Real-time sync
- Redis hot-cache for live pose (sub-second); Zenoh → twin writer as a persistent Ray actor
- Cosmos-Predict2 + Cosmos-Transfer2.5: predicted twin state snapshots (source=predicted/sim2real)
- predict_twin() operation + POST /api/v1/twins/{id}/predict endpoint
- Cosmos-Reason2: reads predicted snapshots as visual context for pre-execution feasibility evaluation
- Predict → Transfer → Reason → Execute loop integrated with twin state

v2 — Agent memory substrate
- TwinAgent is the canonical world-model interface for all control-plane agents
- Cosmos models post-trained on turtlebot-maze ROS bags; domain-specific predicted snapshots
- Visualisation: Rerun viewer embedded in Next.js dashboard (observed + predicted state overlay)
- Multi-asset: VisDrone camera platform as second reference twin
- Management plane: per-twin access control, retention policies
Files to create
- control-plane/backend/agents/twin_agent.py: TwinAgent wrapper (claude -p subprocess)
- control-plane/backend/api/twins.py: FastAPI router
- control-plane/backend/models/twin.py: Pydantic models
- data-plane/schema/twins/: Schema definitions and migration scripts
- .claude/agents/twin-agent.md: Agent definition (YAML frontmatter + system prompt)
main.py — mount twins router alongside existing jobs, clusters, experiments, lakehouse.