Data Plane Design

Date: 2026-03-12
Status: Approved (v2)


Problem

The original three-plane model (user / control / management) governs control flow cleanly but does not govern data flow. As agentic workloads grow, a structural gap emerges: the lakehouse is used simultaneously as:

  • Storage for user-plane outputs (perception data, job results, telemetry)
  • Memory substrate for control-plane agents (job history, cluster failure patterns)
  • Training corpus for model fine-tuning (VLA, classification models)
  • Observability archive (agent traces, experiment results)

No single plane owns this. The control plane's LakehouseAgent reaches into it; the user plane writes to it; the management plane governs access to it. The lakehouse is not a single-plane component — it spans planes, and needs its own architectural treatment.

The deeper issue: in agentic systems, the data flow direction is reversed relative to traditional software.

Traditional:  logic  →  data
Agentic:      data   →  reasoning  →  action

Data becomes the substrate of cognition. The lakehouse is not analytics infrastructure — it is the persistent world model of the system. It needs a dedicated plane.


Definition

The data plane governs the movement, storage, transformation, and accessibility of data across the entire system. It is orthogonal to both reasoning (control plane) and execution (user plane).

The data plane sits horizontally — all other planes interact with it:

Left-to-right graph showing the data plane sitting horizontally across the other planes. On the left, four caller nodes: Management Plane (governs storage: policies, lifecycle, access); Control Plane — AgentOps Subsystem (checkpointing, world-model snapshots); Control Plane (reads: semantic memory, historical context, world state); User Plane (writes: telemetry, perception, job outputs). On the right, a dashed Data Plane — centerpiece group containing five nodes: Lakehouse (DuckDB plus DuckLake), Embeddings Store (semantic retrieval), Feature Store (VLA, ML inputs), Event Log (agent traces, Zenoh history), and World Model Snapshots (point-in-time env state). Edges: Management Plane to the data plane labelled retention, RBAC, lifecycle; AgentOps to World Model Snapshots labelled write snapshots, checkpoints; Control Plane to the data plane labelled read: RAG, job history; User Plane to the data plane labelled write: outputs, telemetry.

Editable Mermaid source: images/design-data-plane-horizontal.mermaid.md


Goals

  • Provide a unified persistent storage substrate for all planes
  • Formalise the LakehouseAgent as the control-plane API boundary to the data plane
  • Define ingestion pipelines from the user plane (structured, versioned, lineaged)
  • Enable semantic retrieval for control-plane agents (RAG over job history and agent traces)
  • Support world-model snapshots for AgentOps checkpointing and causal replay
  • Serve as training data substrate for VLA and ML model fine-tuning (v3)

Non-goals

  • Real-time message passing — that is Zenoh / NATS / DDS (transport, not storage)
  • Governance policy definition — that is the management plane
  • Agent reasoning or query planning — that is the control plane
  • Job execution — that is the user plane

Architecture

Storage layers

| Layer | Technology | Contents | Access pattern |
| --- | --- | --- | --- |
| Lakehouse | DuckDB + DuckLake (PostgreSQL catalog) over S3 object storage | Structured outputs, experiment results, job history | SQL (DuckDB in-process), Parquet partitions in object storage |
| Object store | RustFS (local, on TrueNAS ZFS) + Cloudflare R2 (cloud, exposure-only) | Raw files: fMP4 chunks, GeoParquet, point clouds, model checkpoints | Blob read/write via s3fs |
| Embeddings store | pgvector / Chroma (v2) | Dense vectors for semantic retrieval (agent traces, docs) | ANN search |
| Feature store | Feast / custom Parquet (v3) | Structured ML features for VLA and classification models | Batch + online |
| Event log | Append-only Parquet partitions in the landing/ bucket | Agent traces, Zenoh event recordings, ROS bag metadata | Append write, batch read |
| World model snapshots | DuckLake snapshots (v2) | Point-in-time environment state for AgentOps | Write on checkpoint, read on replay |

The object-storage layer is a two-tier S3 design: RustFS on TrueNAS ZFS as the local authoritative store (migrated in place from MinIO), and Cloudflare R2 for the exposure tier only (buckets that must reach Cloudflare/website infrastructure). See the MinIO → RustFS migration runbook for the rationale and the TrueNAS SCALE migration steps.

Dataset & checkpoint versioning (R2 + W&B artifacts)

R2 has no native bucket versioning — the S3 PutBucketVersioning operation is unimplemented and the Cloudflare Terraform provider exposes no versioning resource. Binary artifact folders that do not live in DuckLake as Parquet — robot episode datasets (LeRobot / Rerun recordings) and model checkpoints — get reproducible versioning from two mechanisms that compose:

  1. Immutable vN/ prefixes. Each revision is written under a fresh vN/ prefix and is never overwritten, e.g. s3://ar4-physical-ai/datasets/ar4-pick-place/v3/. This preserves the bytes.
  2. W&B reference artifacts. Each vN/ folder is logged as a W&B artifact via Artifact.add_reference("s3://…/vN/"). W&B records a manifest (object key + ETag + size) and layers a version graph, a latest alias, and run lineage on top of the immutable bytes.

add_reference stores a manifest, not the bytes, so a referenced version stays reproducible only if its objects never change. The immutable vN/ rule guarantees that, which is why the helper refuses to write into an existing version: it raises an error rather than overwriting.

Bucket layout:

s3://ar4-physical-ai/
  datasets/<name>/vN/…       # episode datasets
  checkpoints/<name>/vN/…    # model checkpoints
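The allocation rule behind the immutable vN/ prefixes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in data-plane/lakehouse/versioning.py: the function names (existing_versions, next_version_prefix, ensure_version_is_free) and the object file names are hypothetical, and the key listing is passed in as a plain list rather than fetched from R2.

```python
import re

_VERSION_SEGMENT = re.compile(r"^v(\d+)$")


def existing_versions(keys, kind, name):
    """Return the sorted version numbers present under <kind>/<name>/vN/."""
    prefix = f"{kind}/{name}/"
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        # First path segment after the dataset/checkpoint name, e.g. "v2".
        first_segment = key[len(prefix):].split("/", 1)[0]
        match = _VERSION_SEGMENT.match(first_segment)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


def next_version_prefix(keys, kind, name):
    """Allocate the next free vN/ prefix; existing versions are never reused."""
    versions = existing_versions(keys, kind, name)
    next_n = versions[-1] + 1 if versions else 1
    return f"{kind}/{name}/v{next_n}/"


def ensure_version_is_free(keys, prefix):
    """Raise instead of overwriting if any object already lives under prefix."""
    if any(key.startswith(prefix) for key in keys):
        raise FileExistsError(f"refusing to overwrite existing version: {prefix}")


# Hypothetical bucket listing (object names are made up for the example).
keys = [
    "datasets/ar4-pick-place/v1/episode_000.parquet",
    "datasets/ar4-pick-place/v2/episode_000.parquet",
    "checkpoints/ar4-act-policy/v1/model.bin",
]
target = next_version_prefix(keys, "datasets", "ar4-pick-place")
ensure_version_is_free(keys, target)  # passes: nothing lives under v3/ yet
```

Allocating from the highest existing number, rather than filling gaps, keeps a deleted version from being silently reissued with different bytes.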

Helper — data-plane/lakehouse/versioning.py:

from lakehouse.versioning import version_dataset, version_checkpoint, list_versions
 
res = version_dataset("/data/ar4/pick_place/run", "ar4-pick-place",
                      metadata={"episodes": 120, "sim": "gazebo"})
res.version_tag      # "v3"  — next free version, auto-allocated (never overwrites v1/v2)
res.s3_uri           # "s3://ar4-physical-ai/datasets/ar4-pick-place/v3/"
res.wandb_artifact   # "ar4-pick-place:v3"  (W&B aliases: v3, latest)
 
version_checkpoint("/ckpt/act_step_50k", "ar4-act-policy")
list_versions("ar4-pick-place", "dataset")   # [1, 2, 3]

Credentials reuse the data plane's R2_ENDPOINT / R2_ACCESS_KEY / R2_SECRET_KEY (mirrored into the AWS_* env vars so W&B's S3 reference crawl reaches R2) plus WANDB_API_KEY. Pass log_wandb=False to upload to R2 only.

This is complementary to DuckLake time-travel (see the "Time Travel" use case): DuckLake versions structured Parquet tables in the catalog; the vN/ + W&B scheme versions opaque blob folders (episodes, checkpoints) that are not catalog tables. Both let you pin training to an exact input version — choose by data shape.

DuckDB + DuckLake as the query and catalog layer

DuckDB is the in-process analytical query engine. DuckLake is the transactional catalog: ATTACH 'ducklake:postgresql://...' exposes tables whose metadata lives in PostgreSQL and whose data lives in the lakehouse as Parquet fragments. The full DuckLake schema (174 catalog tables) is in data-plane/tests/ducklake-schema.sql.

Key catalog tables:

  • experiments — experiment registry (id, project, description, created_at)
  • simulation_runs — per-simulator run records with S3 prefix, status, config
  • ducklake_table, ducklake_data_file — DuckLake's own catalog metadata

The LakehouseAgent (allowed tools: Bash(duckdb *), Bash(python *), Read, Edit) is the control plane's operator interface: it runs DuckDB queries, inspects the catalog, and calls python -m lakehouse commands. It is not a transformation pipeline; it is a catalog operator and query runner.


Data flow

User plane → data plane (ingestion)

User-plane workers (Ray jobs on torch.dev.gpu and ros.dev.gpu) write outputs to the data plane on job completion:

Sequence diagram of user-plane to data-plane ingestion with four participants: Ray Worker, Ingestion API, Lakehouse, and DuckLake catalog. Messages top to bottom: Ray Worker sends POST /data/ingest to Ingestion API; Ingestion API self-call to validate schema and assign partition key; Ingestion API writes Parquet fragment to the landing bucket on Lakehouse; Lakehouse returns ok; Ingestion API registers the fragment in ducklake_data_file on DuckLake catalog; DuckLake catalog returns ok; Ingestion API returns 201 with partition_id to Ray Worker. A closing note over DuckLake catalog states DuckDB can now query the new partition.

Editable Mermaid source: images/design-ingestion-sequence.mermaid.md

In v1, ingestion is manual (Ray worker writes output files; LakehouseAgent registers them via DuckDB). In v1.5, the ingestion API is a lightweight FastAPI endpoint in data-plane/lakehouse/ called directly by Ray workers on job completion.
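The "validate schema and assign partition key" step of the v1.5 ingestion API could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the in-memory SCHEMAS registry, the job_outcomes column set, and the landing/<table>/date=…/job_id=… key layout are not defined by this design.

```python
from datetime import datetime, timezone
from uuid import UUID

# Hypothetical schema registry: (table, schema_version) -> required column types.
SCHEMAS = {
    ("job_outcomes", "1"): {"job_id": str, "status": str, "duration_s": float},
}


def validate_records(table, schema_version, records):
    """Reject the batch if any record lacks a column or has the wrong type."""
    schema = SCHEMAS.get((table, schema_version))
    if schema is None:
        raise ValueError(f"unknown schema {table!r} version {schema_version!r}")
    for i, record in enumerate(records):
        for column, expected in schema.items():
            if column not in record:
                raise ValueError(f"record {i}: missing column {column!r}")
            if not isinstance(record[column], expected):
                raise ValueError(f"record {i}: {column!r} is not {expected.__name__}")
    return len(records)  # row_count for the 201 response


def partition_key(table, job_id, now=None):
    """Build a landing-bucket key for one Parquet fragment (layout is assumed)."""
    now = now or datetime.now(timezone.utc)
    job = UUID(str(job_id))  # raises ValueError on a malformed job_id
    return f"landing/{table}/date={now:%Y-%m-%d}/job_id={job}/part-0.parquet"


records = [{"job_id": "j-1", "status": "failed", "duration_s": 12.5}]
row_count = validate_records("job_outcomes", "1", records)
key = partition_key(
    "job_outcomes",
    "12345678-1234-5678-1234-567812345678",
    now=datetime(2026, 3, 12, tzinfo=timezone.utc),
)
```

Validating before the Parquet write keeps malformed batches out of the landing bucket, so the catalog never registers a fragment that queries cannot read.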

Control plane → data plane (reads)

Control-plane agents read from the data plane in two modes:

  1. Structured query (current): LakehouseAgent runs DuckDB queries over DuckLake and reads Parquet files via DuckDB in the agent subprocess
  2. Semantic retrieval (v2): control-plane agents call an embeddings query endpoint to retrieve relevant agent traces, job history, or world-model state as context
# v1: LakehouseAgent reads directly
"Run SELECT * FROM job_outcomes WHERE status = 'failed' ORDER BY completed_at DESC LIMIT 10"
 
# v2: semantic retrieval endpoint
GET /data/retrieve?query="notebook jobs that failed on torch.dev.gpu last week"&top_k=5
→ [{job_id, summary, outcome, wandb_run_id}]

AgentOps → data plane (snapshots)

The control plane's AgentOps subsystem writes world-model snapshots to the data plane on checkpoint events and at the end of each agent invocation. These snapshots are the foundation for causal replay and VLA training data.

WorldModelSnapshot {
  snapshot_id:     UUID
  intent_id:       UUID
  timestamp:       ISO 8601
  agents_active:   [{role, status, tool_call_count}]
  user_plane:      {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  causal_chain:    [{intent_id, job_id, outcome}]
}
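One way to carry this record in Python is a set of dataclasses that serialise to the JSON body accepted by the snapshot endpoint. The field names follow the schema above; the concrete types (string UUIDs, gpu_util as a 0 to 1 fraction) and the helper class names are assumptions, since the design does not fix them.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentStatus:
    role: str
    status: str
    tool_call_count: int


@dataclass
class ClusterLoad:
    jobs_in_flight: int
    gpu_util: float  # assumed to be a fraction in [0, 1]


@dataclass
class CausalLink:
    intent_id: str
    job_id: str
    outcome: str


@dataclass
class WorldModelSnapshot:
    intent_id: str
    agents_active: list[AgentStatus]
    user_plane: dict[str, ClusterLoad]  # keys: "torch", "ros"
    causal_chain: list[CausalLink]
    snapshot_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, lists, and dicts.
        return json.dumps(asdict(self), sort_keys=True)


intent = str(uuid.uuid4())
snapshot = WorldModelSnapshot(
    intent_id=intent,
    agents_active=[AgentStatus("LakehouseAgent", "running", 3)],
    user_plane={"torch": ClusterLoad(2, 0.81), "ros": ClusterLoad(0, 0.0)},
    causal_chain=[CausalLink(intent, "job-42", "failed")],
)
body = snapshot.to_json()
```

Generating snapshot_id and timestamp at construction time means a caller only supplies the observed state, and every snapshot is uniquely addressable for replay.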

The lakehouse as persistent world model

In agentic robotics systems, the lakehouse accumulates not just analytics data but the episodic and semantic memory of the entire system:

| Memory type | Contents | Enables |
| --- | --- | --- |
| Episodic | Job history, cluster failure events, navigation trial outcomes | Agent reasoning over "what happened before" |
| Semantic | Learned cluster failure patterns, experiment regression signatures | Agent pattern matching without re-executing |
| Procedural | Successful job submission sequences, DuckLake catalog dependency chains | Agent skill recall |
| Perceptual | GeoParquet outputs, YOLOv8 detections, SLAM maps | VLA fine-tuning, world-model grounding |

For the turtlebot-maze reference application:

  • Every Nav2 navigation trial → appended to event_log.ros_navigation_trials
  • Every YOLOv8 detection → appended to event_log.object_detections
  • Every world-model snapshot at decision point → stored in world_model.snapshots
  • Aggregate: accumulated robot experience becomes a fine-tuning corpus for VLA models

This is the convergence point between the data plane and world-model-based VLA research: the lakehouse is the persistent world model.


Interfaces

Ingestion API (data plane ← user plane)

POST /data/ingest
Body: {table: str, records: list[dict], schema_version: str, job_id: UUID, tenant_id: UUID}
→ 201 {partition_id, row_count}

POST /data/snapshots
Body: WorldModelSnapshot
→ 201 {snapshot_id}

Query API (data plane → control plane)

GET /data/query?sql=<duckdb_sql>&tenant_id=<uuid>
→ {columns: [...], rows: [...]}

GET /data/retrieve?query=<natural_language>&top_k=<n>&tenant_id=<uuid>   (v2)
→ [{id, text, score, metadata}]
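Because both endpoints take free text in the query string, SQL and natural-language queries must be URL-encoded before they are sent. A minimal client-side sketch, assuming a hypothetical base URL and helper names (build_query_url, build_retrieve_url) that are not part of this design:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url, sql, tenant_id):
    """URL for GET /data/query with the SQL text safely percent-encoded."""
    query = urlencode({"sql": sql, "tenant_id": tenant_id})
    return f"{base_url.rstrip('/')}/data/query?{query}"


def build_retrieve_url(base_url, text, top_k, tenant_id):
    """URL for GET /data/retrieve (v2) with the natural-language query encoded."""
    query = urlencode({"query": text, "top_k": top_k, "tenant_id": tenant_id})
    return f"{base_url.rstrip('/')}/data/retrieve?{query}"


BASE = "http://localhost:8000"  # placeholder host, not defined by this document
TENANT = "12345678-1234-5678-1234-567812345678"
SQL = "SELECT * FROM job_outcomes WHERE status = 'failed' LIMIT 10"

query_url = build_query_url(BASE, SQL, TENANT)
retrieve_url = build_retrieve_url(
    BASE, "notebook jobs that failed on torch.dev.gpu last week", 5, TENANT
)

# Round trip: the server-side parser recovers the exact SQL text.
decoded_sql = parse_qs(urlsplit(query_url).query)["sql"][0]
```

Without encoding, characters such as `=`, `&`, and spaces in the SQL would be parsed as query-string delimiters and the statement would arrive truncated.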

Policy interface (management plane → data plane)

PUT /data/policy
Body: {tenant_id, retention_days, allowed_tables: [...], max_storage_gb: float}
→ 200
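The design does not say how the data plane applies a policy once the management plane has set it. One possible reading, sketched here with hypothetical names (TenantPolicy, check_write, retention_cutoff), is a per-write check plus a retention cutoff derived from retention_days:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TenantPolicy:
    tenant_id: str
    retention_days: int
    allowed_tables: tuple[str, ...]
    max_storage_gb: float


def check_write(policy, table, used_gb, incoming_gb):
    """Return (allowed, reason) for a proposed write under the tenant's policy."""
    if table not in policy.allowed_tables:
        return False, f"table {table!r} is not in allowed_tables"
    if used_gb + incoming_gb > policy.max_storage_gb:
        return False, "write would exceed max_storage_gb"
    return True, "ok"


def retention_cutoff(policy, now=None):
    """Data older than this timestamp is eligible for deletion."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=policy.retention_days)


policy = TenantPolicy(
    tenant_id="12345678-1234-5678-1234-567812345678",
    retention_days=30,
    allowed_tables=("job_outcomes", "simulation_runs"),
    max_storage_gb=100.0,
)
allowed, reason = check_write(policy, "job_outcomes", used_gb=99.5, incoming_gb=1.0)
cutoff = retention_cutoff(policy, now=datetime(2026, 3, 12, tzinfo=timezone.utc))
```

This keeps the split described in this document: the management plane defines the values, and the data plane only evaluates them.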

Critical distinction: data plane vs management plane

A common mistake is placing the lakehouse under the management plane. They are separate:

| Data Plane | Management Plane |
| --- | --- |
| Stores and transforms data | Governs how data is stored |
| Serves queries | Defines query access policies |
| Manages schemas and lineage | Manages access rights and retention |
| Enables agent reasoning | Enforces compliance |

The management plane governs the data plane. It does not own it.


Evolution path

v1   — Data plane in data-plane/; LakehouseAgent as catalog operator; manual ingestion
v1.5 — Formalise ingestion API; structured AgentEvent log
v2   — Embeddings store + RAG retrieval endpoint for control-plane agents
       World-model snapshots (AgentOps → data plane)
       Schema lineage visible in management-plane dashboard
v3   — VLA training pipeline: accumulated perception data → fine-tuning loop
       Feature store for online inference (real-time VLA feature serving)
       Data plane exposes MCP server: agents query job history and world state via tool calls

Requirements (DP-xxx)

Traces to system-level requirements in architecture/four-plane.md.

| ID | Requirement | Traces to | Version |
| --- | --- | --- | --- |
| DP-001 | The data plane shall govern movement, storage, transformation, and accessibility of data across the system | SYS-001 | v1 |
| DP-002 | The data plane shall sit horizontally, serving all other planes | SYS-007 | v1 |
| DP-003 | The data plane shall have latency of seconds, eventually consistent | SYS-001 | v1 |
| DP-004 | Data plane failure shall cause query failures and queued ingestion; agents lose context | SYS-002 | v1 |
| DP-006 | The data plane shall provide a unified persistent storage substrate for all planes | SYS-007 | v1 |
| DP-007 | LakehouseAgent shall be the control-plane API boundary to the data plane | CP-010 | v1 |
| DP-008 | The data plane shall define structured, versioned, lineaged ingestion pipelines from user plane |  | v1 |
| DP-009 | The data plane shall enable semantic retrieval for control-plane agents (RAG over job history) |  | v2 |
| DP-010 | The data plane shall support world-model snapshots for AgentOps checkpointing and causal replay | CP-030 | v2 |
| DP-011 | The data plane shall serve as training data substrate for VLA fine-tuning |  | v3 |
| DP-012 | The data plane shall use DuckDB + DuckLake (PostgreSQL catalog) over S3 object storage | SYS-007 | v1 |
| DP-013 | Structured outputs and experiment results shall be stored as Parquet in the lakehouse via DuckLake | DP-012 | v1 |
| DP-014 | Raw files (fMP4, GeoParquet, point clouds, checkpoints) shall be stored in object store | DP-012 | v1 |
| DP-015 | Embeddings store (pgvector/Chroma) shall support dense vectors and semantic retrieval | DP-009 | v2 |
| DP-016 | Feature store shall support structured ML features for VLA and classification | DP-011 | v3 |
| DP-017 | Event log shall be append-only Parquet in the landing/ bucket for agent traces and ROS bag metadata |  | v1 |
| DP-018 | World model snapshots shall capture point-in-time environment state | DP-010 | v2 |
| DP-019 | DuckLake transactional catalog shall use PostgreSQL metadata and Parquet data in object storage | DP-012 | v1 |
| DP-020 | DuckLake schema shall include experiments, simulation_runs, ducklake_table, ducklake_data_file | DP-019 | v1 |
| DP-021 | LakehouseAgent allowedTools: Bash(duckdb *),Bash(python *),Read,Edit | CP-010 | v1 |
| DP-022 | The data plane shall not be a transformation pipeline; LakehouseAgent is a catalog operator |  | v1 |
| DP-023 | Ingestion API: POST /data/ingest → 201 {partition_id, row_count} |  | v1.5 |
| DP-024 | Snapshot API: POST /data/snapshots with WorldModelSnapshot → 201 {snapshot_id} | DP-010 | v2 |
| DP-025 | Query API: GET /data/query?sql=<duckdb_sql> → {columns, rows} |  | v1 |
| DP-026 | Semantic retrieval API: GET /data/retrieve?query=<text>&top_k=<n> → [{id, text, score}] | DP-009 | v2 |
| DP-027 | Policy interface: PUT /data/policy with retention/RBAC config → 200 | MP-004 | v2 |
| DP-028 | The data plane shall NOT be the message passing substrate (that is Zenoh/NATS/DDS) |  | v1 |
| DP-029 | The data plane shall NOT define governance policy (that is the management plane) |  | v1 |
| DP-030 | User plane Ray workers shall write outputs to the landing/ bucket; DuckLake registers fragments |  | v1 |
| DP-031 | v1 ingestion: Ray worker writes files; LakehouseAgent registers via DuckDB |  | v1 |
| DP-032 | v1.5 ingestion: lightweight API endpoint callable directly by Ray workers | DP-023 | v1.5 |
| DP-033 | v1 reads: LakehouseAgent runs DuckDB queries |  | v1 |
| DP-034 | v2 reads: semantic retrieval via embeddings query endpoint | DP-026 | v2 |
| DP-035 | The data plane shall accumulate episodic memory (job history, cluster failures, navigation trials) |  | v1 |
| DP-036 | The data plane shall accumulate semantic memory (failure patterns, regression signatures) |  | v2 |
| DP-037 | The data plane shall accumulate procedural memory (successful job sequences, dependency chains) |  | v2 |
| DP-038 | The data plane shall accumulate perceptual memory (GeoParquet, YOLOv8 detections, SLAM maps) |  | v1 |
| DP-039 | turtlebot-maze: Nav2 navigation trials stored in event_log.ros_navigation_trials | SYS-003 | v1 |
| DP-040 | turtlebot-maze: YOLOv8 detections stored in event_log.object_detections | SYS-003 | v1 |
| DP-041 | turtlebot-maze: world-model snapshots at decision points in world_model.snapshots | SYS-003, DP-010 | v2 |
| DP-042 | The data plane shall be separate from the management plane | SYS-001 | v1 |


Example Use Cases of the Auraison Lakehouse

The following examples use the SQuAD validation dataset (stored in s3://landing/squad/) to illustrate the concrete benefits of the DuckLake lakehouse over a plain object store or traditional database.

1. Schema Evolution — add columns without rewriting parquet files

After a model inference run, new columns can be appended to an existing table with no data rewrite:

ALTER TABLE squad ADD COLUMN model_answer VARCHAR;
ALTER TABLE squad ADD COLUMN confidence FLOAT;
 
-- Backfill from an inference run
UPDATE squad SET model_answer = 'Denver Broncos', confidence = 0.97
WHERE id = '56be4db0acb8001400a502ec';

DuckLake records the schema change in the PostgreSQL catalog; the underlying Parquet files in the lakehouse are untouched. Without a lakehouse this requires a full dataset rewrite.

2. Time Travel — pin to the exact snapshot used for training

Every write creates a new catalog snapshot. Queries can be issued against any prior version:

-- Inspect data before model answers were added
SELECT * FROM squad AT (VERSION => 1) LIMIT 5;
 
-- Diff two snapshots: which answers changed?
SELECT curr.id, curr.model_answer, prev.model_answer AS old_answer
FROM squad AT (VERSION => 3) curr
JOIN squad AT (VERSION => 1) prev USING (id)
WHERE curr.model_answer IS DISTINCT FROM prev.model_answer;

This makes ML experiments reproducible: a training run can be tied to a specific catalog version and replayed exactly.

3. Cross-plane Join — catalog metadata + raw data in a single query

simulation_runs (a DuckLake catalog table) and squad (raw Parquet in the lakehouse) can be joined in one query, without an ETL step:

SELECT
    r.config->>'model'                                                   AS model,
    COUNT(*) FILTER (WHERE s.model_answer = s.answers.text[1])          AS exact_match,
    COUNT(*)                                                             AS total,
    ROUND(100.0 *
        COUNT(*) FILTER (WHERE s.model_answer = s.answers.text[1])
        / COUNT(*), 2)                                                   AS em_pct
FROM squad s
JOIN simulation_runs r ON r.s3_prefix = s.title
GROUP BY 1
ORDER BY em_pct DESC;

Without the lakehouse this requires a separate MLflow or W&B lookup followed by a manual join.

4. Predicate Pushdown — skip irrelevant row groups

DuckDB pushes WHERE predicates into the Parquet reader, scanning only matching row groups:

-- Only reads the row groups where title = 'Super_Bowl_50'
SELECT id, question, answers.text[1] AS answer
FROM read_parquet('s3://landing/squad/**/*.parquet')
WHERE title = 'Super_Bowl_50';
 
-- Inspect the query plan
EXPLAIN SELECT * FROM squad WHERE title = 'Beyoncé';

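On a large dataset (e.g. full COCO-Caption at ~100 GB) this can cut scan time from minutes to seconds, provided the filtered column is clustered well enough that row-group min/max statistics exclude most of the file.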
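The pruning idea can be shown with a toy model: each row group carries min/max statistics for a column, and a group is read only if the searched value could fall inside that range. This is an illustration of the principle in plain Python, not DuckDB's implementation; the RowGroup class and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_title: str
    max_title: str
    rows: list


def make_group(rows):
    """Compute the min/max statistics a Parquet writer would store per group."""
    titles = [row["title"] for row in rows]
    return RowGroup(min(titles), max(titles), rows)


def scan_equals(row_groups, title):
    """Return matching rows and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # The value cannot occur in this group if it lies outside [min, max].
        if title < group.min_title or title > group.max_title:
            continue
        groups_read += 1
        matches.extend(row for row in group.rows if row["title"] == title)
    return matches, groups_read


groups = [
    make_group([{"id": "q1", "title": "Amazon_rainforest"}, {"id": "q2", "title": "Beyoncé"}]),
    make_group([{"id": "q3", "title": "Nikola_Tesla"}, {"id": "q4", "title": "Normans"}]),
    make_group([{"id": "q5", "title": "Super_Bowl_50"}, {"id": "q6", "title": "Warsaw"}]),
]
matches, groups_read = scan_equals(groups, "Super_Bowl_50")
```

In this sample the data is sorted by title, so only one of three groups is read. If titles were scattered across all groups, every range would overlap the predicate and nothing could be skipped.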

5. Incremental Ingest — idempotent append with no duplicates

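New Parquet drops in landing/ can be merged into the warehouse without re-inserting rows that are already present. The anti-join below filters only against existing rows, so it assumes id is unique within the incoming files: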

INSERT INTO squad
SELECT s.*
FROM read_parquet('s3://landing/squad/**/*.parquet') s
LEFT JOIN squad w ON s.id = w.id
WHERE w.id IS NULL;

DuckLake's MVCC guarantees that concurrent readers see a consistent snapshot even while the insert is in flight.
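The idempotence of the anti-join can be checked without a lakehouse. The sketch below runs the same INSERT … LEFT JOIN … WHERE w.id IS NULL pattern against an in-memory SQLite database, used here only as a stdlib stand-in for DuckDB; the landing table replaces read_parquet(...), and DuckLake's snapshot behaviour is not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE squad (id TEXT, question TEXT)")
conn.execute("CREATE TABLE landing (id TEXT, question TEXT)")  # stand-in for the Parquet drop
conn.executemany(
    "INSERT INTO landing VALUES (?, ?)",
    [("a1", "Which team won?"), ("a2", "Where was it played?")],
)

MERGE_SQL = """
INSERT INTO squad
SELECT s.*
FROM landing s
LEFT JOIN squad w ON s.id = w.id
WHERE w.id IS NULL
"""


def merge_landing(connection):
    """Append only the landing rows whose id is not yet in squad."""
    connection.execute(MERGE_SQL)
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM squad").fetchone()[0]


first = merge_landing(conn)   # both rows are new
second = merge_landing(conn)  # re-running the same merge adds nothing
conn.execute("INSERT INTO landing VALUES ('a3', 'Who was MVP?')")
third = merge_landing(conn)   # only the new row is appended
```

Re-running the merge after a partial failure or a duplicate file drop is therefore safe, as long as ids are unique within the incoming batch.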

6. Analytical Queries — OLAP directly on lakehouse data

No separate analytics database is needed; DuckDB runs columnar OLAP over the same Parquet files used for training:

-- Answer length distribution by article
SELECT
    title,
    COUNT(*)                          AS num_questions,
    AVG(LENGTH(answers.text[1]))      AS avg_answer_len,
    MAX(LENGTH(context))              AS max_context_len
FROM squad
GROUP BY title
ORDER BY num_questions DESC;
 
-- Questions with multiple valid answers (annotator disagreement)
SELECT id, question, len(answers.text) AS num_answers
FROM squad
WHERE len(answers.text) > 1
ORDER BY num_answers DESC
LIMIT 10;

Summary

| Pattern | Lakehouse benefit | Without lakehouse |
| --- | --- | --- |
| Schema evolution | Zero-copy column add | Full dataset rewrite |
| Time travel | Snapshot pinning for reproducibility | Manual versioned file copies |
| Cross-plane join | Catalog + data in one query | Separate MLflow/W&B lookup |
| Predicate pushdown | Row-group pruning, sub-second scans | Full table scan |
| Incremental ingest | Idempotent MVCC appends | App-level dedup logic |
| Analytical queries | In-place OLAP over training data | Export to separate analytics DB |

External Infrastructure (TrueNAS Lab)

The data plane can run against existing lab infrastructure instead of the bundled Docker services. The lab exposes two services, S3 and PostgreSQL; S3 is reachable at two addresses:

| Service | Address | Notes |
| --- | --- | --- |
| S3 (TrueNAS) | https://s3.aegeanai.com/ | Cloudflare Tunnel; always reachable, valid TLS |
| S3 (LAN direct) | http://<TRUENAS_LAN_IP>:9000 | Faster; no Cloudflare hop; requires LAN or Tailscale |
| PostgreSQL | 192.168.1.26:5432 | LAN only; Tailscale required when off-LAN |

Networking matrix

| Context | S3 endpoint | PG host | Tailscale needed? |
| --- | --- | --- | --- |
| Docker (default) | http://rustfs:9000 | postgresql:55432 | No |
| Lab / on-LAN | LAN direct http://<IP>:9000 | 192.168.1.26:5432 | No |
| Remote / off-LAN | https://s3.aegeanai.com/ | 192.168.1.26:5432 via Tailscale | Yes (PG only) |

Cloudflare Tunnel caveat

The public https://s3.aegeanai.com/ endpoint does not support chunked/multipart uploads. By default, boto3 switches to multipart upload once an object reaches its 8 MB multipart threshold. For bulk Parquet ingestion through the tunnel, raise that threshold far above any expected file size so that uploads stay single-part:

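import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3", endpoint_url="https://s3.aegeanai.com")  # credentials come from the environment
config = TransferConfig(multipart_threshold=10 * 1024 ** 3)  # 10 GiB threshold: effectively disables multipart
s3.upload_fileobj(f, bucket, key, Config=config)  # f: open binary file object; bucket, key: upload target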

Recommendation: use the LAN direct path (or Tailscale + LAN) for bulk ingestion; reserve the Cloudflare tunnel for small files and catalog operations from remote machines.
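The recommendation can be written down as a small routing rule. This is a sketch under stated assumptions: plan_upload and its return shape are hypothetical, the LAN address in the usage lines is a documentation placeholder, and the 8 MiB constant mirrors boto3's default multipart threshold.

```python
BOTO3_DEFAULT_MULTIPART_THRESHOLD = 8 * 1024 ** 2  # boto3 default: 8 MiB
TUNNEL_ENDPOINT = "https://s3.aegeanai.com"


def plan_upload(size_bytes, lan_endpoint=None):
    """Pick an endpoint and say whether the multipart threshold must be raised.

    lan_endpoint is the LAN-direct URL when it is reachable (on-LAN or via
    Tailscale), otherwise None.
    """
    if lan_endpoint is not None:
        # LAN direct bypasses the tunnel, so default multipart behaviour is fine.
        return {"endpoint": lan_endpoint, "raise_multipart_threshold": False}
    return {
        "endpoint": TUNNEL_ENDPOINT,
        # boto3 uses multipart once the size reaches the threshold.
        "raise_multipart_threshold": size_bytes >= BOTO3_DEFAULT_MULTIPART_THRESHOLD,
    }


small_remote = plan_upload(2 * 1024 ** 2)
bulk_remote = plan_upload(500 * 1024 ** 2)
bulk_lan = plan_upload(500 * 1024 ** 2, lan_endpoint="http://192.0.2.10:9000")  # placeholder IP
```

When raise_multipart_threshold is true, the caller would pass a TransferConfig with a large multipart_threshold, as in the snippet above.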

Setup — TrueNAS S3

  1. In TrueNAS → Credentials → S3 API Keys: create a key with read/write on warehouse, iceberg, landing.
  2. Pre-create the three buckets (or run mc-job pointed at the TrueNAS endpoint).
  3. Set in .env:
MINIO_ENDPOINT=https://s3.aegeanai.com   # or http://<LAN-IP>:9000 for LAN direct
MINIO_ACCESS_KEY=<truenas-key-id>
MINIO_SECRET_KEY=<truenas-secret>
MINIO_USE_SSL=true                        # false for LAN direct

Setup — TrueNAS PostgreSQL

  1. On the TrueNAS PostgreSQL instance:
CREATE ROLE ducklake LOGIN PASSWORD '<choose-a-password>';
CREATE DATABASE ducklake OWNER ducklake;
  2. Ensure the TrueNAS firewall allows TCP 5432 from your dev host / container subnet.
  3. Set in .env:
DUCKLAKE_POSTGRES_DSN=postgresql://ducklake:<password>@192.168.1.26:5432/ducklake

Starting containers against external infra

# Start only the dev container (skip rustfs / postgresql / bucket-init):
cd data-plane
docker compose up -d lakehouse.dev.cpu
 
# The container inherits MINIO_ENDPOINT and DUCKLAKE_POSTGRES_DSN from .env
# and falls back to Docker service names if the vars are not set.

# Start with local Docker infra (original behaviour):
docker compose --profile local-infra up -d