Data Plane Design
Date: 2026-03-12 Status: Approved (v2)
Problem
The original three-plane model (user / control / management) governs control flow cleanly but does not govern data flow. As agentic workloads grow, a structural gap emerges: the lakehouse is used simultaneously as:
- Storage for user-plane outputs (perception data, job results, telemetry)
- Memory substrate for control-plane agents (job history, cluster failure patterns)
- Training corpus for model fine-tuning (VLA, classification models)
- Observability archive (agent traces, experiment results)
No single plane owns this. The control plane's LakehouseAgent reaches into it; the user
plane writes to it; the management plane governs access to it. The lakehouse is not a
single-plane component — it spans planes, and needs its own architectural treatment.
The deeper issue: in agentic systems, the data flow direction is reversed relative to traditional software.
Data becomes the substrate of cognition. The lakehouse is not analytics infrastructure — it is the persistent world model of the system. It needs a dedicated plane.
Definition
The data plane governs the movement, storage, transformation, and accessibility of data across the entire system. It is orthogonal to both reasoning (control plane) and execution (user plane).
The data plane sits horizontally — all other planes interact with it:
Editable Mermaid source: images/design-data-plane-horizontal.mermaid.md
Goals
- Provide a unified persistent storage substrate for all planes
- Formalise the LakehouseAgent as the control-plane API boundary to the data plane
- Define ingestion pipelines from the user plane (structured, versioned, lineaged)
- Enable semantic retrieval for control-plane agents (RAG over job history and agent traces)
- Support world-model snapshots for AgentOps checkpointing and causal replay
- Serve as training data substrate for VLA and ML model fine-tuning (v3)
Non-goals
- Real-time message passing — that is Zenoh / NATS / DDS (transport, not storage)
- Governance policy definition — that is the management plane
- Agent reasoning or query planning — that is the control plane
- Job execution — that is the user plane
Architecture
Storage layers
| Layer | Technology | Contents | Access pattern |
|---|---|---|---|
| Lakehouse | DuckDB + DuckLake (PostgreSQL catalog) over S3 object storage | Structured outputs, experiment results, job history | SQL (DuckDB in-process), Parquet partitions in object storage |
| Object store | RustFS (local, on TrueNAS ZFS) + Cloudflare R2 (cloud, exposure-only) | Raw files: fMP4 chunks, GeoParquet, point clouds, model checkpoints | Blob read/write via s3fs |
| Embeddings store | pgvector / Chroma (v2) | Dense vectors for semantic retrieval (agent traces, docs) | ANN search |
| Feature store | Feast / custom Parquet (v3) | Structured ML features for VLA and classification models | Batch + online |
| Event log | Append-only Parquet partitions in the landing/ bucket | Agent traces, Zenoh event recordings, ROS bag metadata | Append write, batch read |
| World model snapshots | DuckLake snapshots (v2) | Point-in-time environment state for AgentOps | Write on checkpoint, read on replay |
The object-storage layer is a two-tier S3 design: RustFS on TrueNAS ZFS as the local authoritative store (migrated in place from MinIO), and Cloudflare R2 for the exposure tier only (buckets that must reach Cloudflare/website infrastructure). See the MinIO → RustFS migration runbook for the rationale and the TrueNAS SCALE migration steps.
Dataset & checkpoint versioning (R2 + W&B artifacts)
R2 has no native bucket versioning — the S3 PutBucketVersioning operation is unimplemented
and the Cloudflare Terraform provider exposes no versioning resource. Binary artifact folders that
do not live in DuckLake as Parquet — robot episode datasets (LeRobot / Rerun recordings)
and model checkpoints — get reproducible versioning from two mechanisms that compose:
- Immutable vN/ prefixes. Each revision is written under a fresh vN/ prefix and is never overwritten, e.g. s3://ar4-physical-ai/datasets/ar4-pick-place/v3/. This preserves the bytes.
- W&B reference artifacts. Each vN/ folder is logged as a W&B artifact via Artifact.add_reference("s3://…/vN/"). W&B records a manifest (object key + ETag + size) and layers a version graph, a latest alias, and run lineage on top of the immutable bytes.
add_reference stores a manifest, not the bytes, so a referenced version only stays reproducible
if its objects never change. The immutable vN/ rule guarantees that — which is why the helper
refuses to write into an existing version (it raises rather than overwrite).
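The refusal rule can be sketched as follows. This is an illustration only, not the contents of data-plane/lakehouse/versioning.py: the function names, the exception class, and the datasets/<name>/vN/ key layout (inferred from the example path above) are assumptions, and existing_keys stands in for whatever object listing the real helper performs.

```python
import re


class VersionExistsError(RuntimeError):
    """Raised when a vN/ prefix already contains objects."""


def version_prefix(dataset: str, version: int) -> str:
    # Key prefix for one immutable revision, e.g. datasets/ar4-pick-place/v3/
    return f"datasets/{dataset}/v{version}/"


def next_version(existing_keys: list[str], dataset: str) -> int:
    # Scan existing object keys and return the first unused version number.
    pattern = re.compile(rf"^datasets/{re.escape(dataset)}/v(\d+)/")
    found = [int(m.group(1)) for key in existing_keys if (m := pattern.match(key))]
    return max(found, default=0) + 1


def claim_version(existing_keys: list[str], dataset: str, version: int) -> str:
    # Refuse to reuse a prefix that already holds objects: raise, never overwrite.
    prefix = version_prefix(dataset, version)
    if any(key.startswith(prefix) for key in existing_keys):
        raise VersionExistsError(f"{prefix} already exists; write a new version instead")
    return prefix
```

Raising instead of overwriting keeps every W&B manifest valid, because the ETags it recorded can never drift from the stored bytes.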
Bucket layout:
Helper — data-plane/lakehouse/versioning.py:
Credentials reuse the data plane's R2_ENDPOINT / R2_ACCESS_KEY / R2_SECRET_KEY (mirrored into
the AWS_* env vars so W&B's S3 reference crawl reaches R2) plus WANDB_API_KEY. Pass
log_wandb=False to upload to R2 only.
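A minimal sketch of the credential mirroring, assuming the target names are AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_S3_ENDPOINT_URL. The source only says "AWS_* env vars", so the exact mapping and the function name mirror_r2_credentials are assumptions rather than the helper's actual behaviour.

```python
import os

# Assumed mapping from the data plane's R2 variables to the AWS-style
# variables read by S3 clients.
R2_TO_AWS = {
    "R2_ACCESS_KEY": "AWS_ACCESS_KEY_ID",
    "R2_SECRET_KEY": "AWS_SECRET_ACCESS_KEY",
    "R2_ENDPOINT": "AWS_S3_ENDPOINT_URL",
}


def mirror_r2_credentials(env=None) -> dict:
    # Copy each R2_* value into its AWS_* counterpart; fail loudly if one is unset.
    env = os.environ if env is None else env
    missing = [name for name in R2_TO_AWS if not env.get(name)]
    if missing:
        raise KeyError(f"missing R2 credentials: {', '.join(missing)}")
    for source, target in R2_TO_AWS.items():
        env[target] = env[source]
    return {target: env[target] for target in R2_TO_AWS.values()}
```

Calling this once before any W&B logging keeps the upload path and the reference crawl pointed at the same endpoint.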
This is complementary to DuckLake time-travel (see the "Time Travel" use case): DuckLake versions
structured Parquet tables in the catalog; the vN/ + W&B scheme versions opaque blob folders
(episodes, checkpoints) that are not catalog tables. Both let you pin training to an exact input
version — choose by data shape.
DuckDB + DuckLake as the query and catalog layer
DuckDB is the in-process analytical query engine. DuckLake is the transactional catalog:
ATTACH 'ducklake:postgresql://...' exposes tables whose metadata lives in PostgreSQL and
whose data lives in the lakehouse as Parquet fragments. The full DuckLake schema (174 catalog tables)
is in data-plane/tests/ducklake-schema.sql.
Key catalog tables:
- experiments — experiment registry (id, project, description, created_at)
- simulation_runs — per-simulator run records with S3 prefix, status, config
- ducklake_table, ducklake_data_file — DuckLake's own catalog metadata
The LakehouseAgent (Bash(duckdb *), Bash(python *), Read, Edit) is the control plane's
operator interface: it runs DuckDB queries, inspects the catalog, and calls
python -m lakehouse commands. It is not a transformation pipeline — it is a catalog
operator and query runner.
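The ATTACH step described above might look like this from Python. It is a sketch under assumptions: the alias lake, the helper names, and the use of DATA_PATH to point at the Parquet location are illustrative, the catalog DSN is passed in rather than guessed, and S3 credential setup is omitted.

```python
def sql_quote(value: str) -> str:
    # Escape single quotes so the value is a safe SQL string literal.
    return "'" + value.replace("'", "''") + "'"


def attach_ducklake_sql(catalog_dsn: str, data_path: str, alias: str = "lake") -> str:
    # Metadata lives in the PostgreSQL catalog; data files live under data_path.
    target = sql_quote(f"ducklake:{catalog_dsn}")
    return f"ATTACH {target} AS {alias} (DATA_PATH {sql_quote(data_path)});"


def open_lakehouse(catalog_dsn: str, data_path: str):
    # duckdb is a third-party package, imported lazily so the SQL builder
    # above can be used without it.
    import duckdb

    con = duckdb.connect()
    for extension in ("ducklake", "postgres", "httpfs"):
        con.execute(f"INSTALL {extension}")
        con.execute(f"LOAD {extension}")
    con.execute(attach_ducklake_sql(catalog_dsn, data_path))
    return con
```

A caller would then issue ordinary SQL against tables such as lake.experiments through the returned connection.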
Data flow
User plane → data plane (ingestion)
User-plane workers (Ray jobs on torch.dev.gpu and ros.dev.gpu) write outputs to the
data plane on job completion:
Editable Mermaid source: images/design-ingestion-sequence.mermaid.md
In v1, ingestion is manual (Ray worker writes output files; LakehouseAgent registers them
via DuckDB). In v1.5, the ingestion API is a lightweight FastAPI endpoint in
data-plane/lakehouse/ called directly by Ray workers on job completion.
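A Ray worker's call to the v1.5 endpoint could be sketched with the standard library alone. The path POST /data/ingest and the 201 response with partition_id and row_count come from requirement DP-023; the request body fields (s3_prefix, table), the base URL, and the function names are hypothetical.

```python
import json
import urllib.request


def build_ingest_request(base_url: str, s3_prefix: str, table: str) -> urllib.request.Request:
    # Body fields are illustrative; the real schema is defined by the ingestion API.
    body = json.dumps({"s3_prefix": s3_prefix, "table": table}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/data/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_ingest(base_url: str, s3_prefix: str, table: str, timeout: float = 10.0) -> dict:
    # Called by the worker after its output files are fully written.
    request = build_ingest_request(base_url, s3_prefix, table)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 201:
            raise RuntimeError(f"unexpected ingest status {response.status}")
        return json.load(response)  # expected keys: partition_id, row_count
```

Separating request construction from the network call keeps the payload easy to unit-test without a running data plane.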
Control plane → data plane (reads)
Control-plane agents read from the data plane in two modes:
- Structured query (current): LakehouseAgent runs DuckDB queries over DuckLake and reads Parquet files via DuckDB in the agent subprocess
- Semantic retrieval (v2): control-plane agents call an embeddings query endpoint to retrieve relevant agent traces, job history, or world-model state as context
AgentOps → data plane (snapshots)
The control plane's AgentOps subsystem writes world-model snapshots to the data plane on checkpoint events and at the end of each agent invocation. These snapshots are the foundation for causal replay and VLA training data.
The lakehouse as persistent world model
In agentic robotics systems, the lakehouse accumulates not just analytics data but the episodic and semantic memory of the entire system:
| Memory type | Contents | Enables |
|---|---|---|
| Episodic | Job history, cluster failure events, navigation trial outcomes | Agent reasoning over "what happened before" |
| Semantic | Learned cluster failure patterns, experiment regression signatures | Agent pattern matching without re-executing |
| Procedural | Successful job submission sequences, DuckLake catalog dependency chains | Agent skill recall |
| Perceptual | GeoParquet outputs, YOLOv8 detections, SLAM maps | VLA fine-tuning, world-model grounding |
For the turtlebot-maze reference application:
- Every Nav2 navigation trial → appended to event_log.ros_navigation_trials
- Every YOLOv8 detection → appended to event_log.object_detections
- Every world-model snapshot at decision point → stored in world_model.snapshots
- Aggregate: accumulated robot experience becomes a fine-tuning corpus for VLA models
This is the convergence point between the data plane and world-model-based VLA research: the lakehouse is the persistent world model.
Interfaces
Ingestion API (data plane ← user plane)
Query API (data plane → control plane)
Policy interface (management plane → data plane)
Critical distinction: data plane vs management plane
A common mistake is placing the lakehouse under the management plane. They are separate:
| Data Plane | Management Plane |
|---|---|
| Stores and transforms data | Governs how data is stored |
| Serves queries | Defines query access policies |
| Manages schemas and lineage | Manages access rights and retention |
| Enables agent reasoning | Enforces compliance |
The management plane governs the data plane. It does not own it.
Evolution path
Requirements (DP-xxx)
Traces to system-level requirements in architecture/four-plane.md.
| ID | Requirement | Traces to | Version |
|---|---|---|---|
| DP-001 | The data plane shall govern movement, storage, transformation, and accessibility of data across the system | SYS-001 | v1 |
| DP-002 | The data plane shall sit horizontally, serving all other planes | SYS-007 | v1 |
| DP-003 | The data plane shall have latency of seconds, eventually consistent | SYS-001 | v1 |
| DP-004 | Data plane failure shall cause query failures and queued ingestion; agents lose context | SYS-002 | v1 |
| DP-006 | The data plane shall provide a unified persistent storage substrate for all planes | SYS-007 | v1 |
| DP-007 | LakehouseAgent shall be the control-plane API boundary to the data plane | CP-010 | v1 |
| DP-008 | The data plane shall define structured, versioned, lineaged ingestion pipelines from user plane | — | v1 |
| DP-009 | The data plane shall enable semantic retrieval for control-plane agents (RAG over job history) | — | v2 |
| DP-010 | The data plane shall support world-model snapshots for AgentOps checkpointing and causal replay | CP-030 | v2 |
| DP-011 | The data plane shall serve as training data substrate for VLA fine-tuning | — | v3 |
| DP-012 | The data plane shall use DuckDB + DuckLake (PostgreSQL catalog) over S3 object storage | SYS-007 | v1 |
| DP-013 | Structured outputs and experiment results shall be stored as Parquet in the lakehouse via DuckLake | DP-012 | v1 |
| DP-014 | Raw files (fMP4, GeoParquet, point clouds, checkpoints) shall be stored in object store | DP-012 | v1 |
| DP-015 | Embeddings store (pgvector/Chroma) shall support dense vectors and semantic retrieval | DP-009 | v2 |
| DP-016 | Feature store shall support structured ML features for VLA and classification | DP-011 | v3 |
| DP-017 | Event log shall be append-only Parquet in the landing/ bucket for agent traces and ROS bag metadata | — | v1 |
| DP-018 | World model snapshots shall capture point-in-time environment state | DP-010 | v2 |
| DP-019 | DuckLake transactional catalog shall use PostgreSQL metadata and Parquet data in object storage | DP-012 | v1 |
| DP-020 | DuckLake schema shall include experiments, simulation_runs, ducklake_table, ducklake_data_file | DP-019 | v1 |
| DP-021 | LakehouseAgent allowedTools: Bash(duckdb *),Bash(python *),Read,Edit | CP-010 | v1 |
| DP-022 | The data plane shall not be a transformation pipeline; LakehouseAgent is a catalog operator | — | v1 |
| DP-023 | Ingestion API: POST /data/ingest → 201 {partition_id, row_count} | — | v1.5 |
| DP-024 | Snapshot API: POST /data/snapshots with WorldModelSnapshot → 201 {snapshot_id} | DP-010 | v2 |
| DP-025 | Query API: GET /data/query?sql=<duckdb_sql> → {columns, rows} | — | v1 |
| DP-026 | Semantic retrieval API: GET /data/retrieve?query=<text>&top_k=<n> → [{id, text, score}] | DP-009 | v2 |
| DP-027 | Policy interface: PUT /data/policy with retention/RBAC config → 200 | MP-004 | v2 |
| DP-028 | The data plane shall NOT be the message passing substrate (that is Zenoh/NATS/DDS) | — | v1 |
| DP-029 | The data plane shall NOT define governance policy (that is the management plane) | — | v1 |
| DP-030 | User plane Ray workers shall write outputs to the landing/ bucket; DuckLake registers fragments | — | v1 |
| DP-031 | v1 ingestion: Ray worker writes files; LakehouseAgent registers via DuckDB | — | v1 |
| DP-032 | v1.5 ingestion: lightweight API endpoint callable directly by Ray workers | DP-023 | v1.5 |
| DP-033 | v1 reads: LakehouseAgent runs DuckDB queries | — | v1 |
| DP-034 | v2 reads: semantic retrieval via embeddings query endpoint | DP-026 | v2 |
| DP-035 | The data plane shall accumulate episodic memory (job history, cluster failures, navigation trials) | — | v1 |
| DP-036 | The data plane shall accumulate semantic memory (failure patterns, regression signatures) | — | v2 |
| DP-037 | The data plane shall accumulate procedural memory (successful job sequences, dependency chains) | — | v2 |
| DP-038 | The data plane shall accumulate perceptual memory (GeoParquet, YOLOv8 detections, SLAM maps) | — | v1 |
| DP-039 | turtlebot-maze: Nav2 navigation trials stored in event_log.ros_navigation_trials | SYS-003 | v1 |
| DP-040 | turtlebot-maze: YOLOv8 detections stored in event_log.object_detections | SYS-003 | v1 |
| DP-041 | turtlebot-maze: world-model snapshots at decision points in world_model.snapshots | SYS-003, DP-010 | v2 |
| DP-042 | The data plane shall be separate from the management plane | SYS-001 | v1 |
See also:
- docs/control-plane/design.mdx §"AgentOps Subsystem" — world-model snapshots, checkpointing, LakehouseAgent, agent memory
- docs/management-plane/design.md — retention policy, RBAC
- docs/user-plane/design.md — ingestion producers (Ray workers)
- docs/data-plane/coco-demo-design.mdx — Experiment #0: multimodal dataset store design (HF datasets, Zenoh v2, W&B adapter, Rerun routing)
- docs/data-plane/coco-demo-plan.mdx — Experiment #0 implementation plan for data-plane/lakehouse/
Example Use Cases of the Auraison Lakehouse
The following examples use the SQuAD validation dataset (stored in s3://landing/squad/) to illustrate the concrete benefits of the DuckLake lakehouse over a plain object store or traditional database.
1. Schema Evolution — add columns without rewriting parquet files
After a model inference run, new columns can be appended to an existing table with no data rewrite:
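A minimal sketch, assuming the SQuAD table is named squad and that con is an open DuckDB connection with the DuckLake catalog attached. The column names predicted_answer and f1_score are hypothetical.

```python
def add_prediction_columns(con, table: str = "squad") -> None:
    # Each ALTER TABLE is a metadata-only change recorded in the catalog;
    # existing rows read the new columns as NULL.
    con.execute(f"ALTER TABLE {table} ADD COLUMN predicted_answer VARCHAR")
    con.execute(f"ALTER TABLE {table} ADD COLUMN f1_score DOUBLE")
```

Later inference runs can then fill the new columns with ordinary UPDATE or INSERT statements.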
DuckLake records the schema change in the PostgreSQL catalog; the underlying Parquet files in the lakehouse are untouched. Without a lakehouse this requires a full dataset rewrite.
2. Time Travel — pin to the exact snapshot used for training
Every write creates a new catalog snapshot. Queries can be issued against any prior version:
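As a sketch, a pinned read can use DuckLake's AT (VERSION => n) clause. The helper names and the lake.squad table reference in the usage note are assumptions, and con is again an open DuckDB connection with the catalog attached.

```python
def pinned_select_sql(table: str, version: int, columns: str = "*") -> str:
    # Build a query that reads the table as of one catalog snapshot.
    return f"SELECT {columns} FROM {table} AT (VERSION => {int(version)})"


def row_count_at(con, table: str, version: int) -> int:
    # Count rows exactly as a training run pinned to this snapshot saw them.
    return con.execute(pinned_select_sql(table, version, "count(*)")).fetchone()[0]
```

Recording the version number next to the training run's config, for example pinned_select_sql("lake.squad", 3), is enough to rebuild the same input later.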
This makes ML experiments reproducible: a training run can be tied to a specific catalog version and replayed exactly.
3. Cross-plane Join — catalog metadata + raw data in a single query
simulation_runs (DuckLake catalog) and squad (raw parquet in the lakehouse) can be joined without ETL:
Without the lakehouse this requires a separate MLflow or W&B lookup followed by a manual join.
4. Predicate Pushdown — skip irrelevant row groups
DuckDB pushes WHERE predicates into the Parquet reader, scanning only matching row groups:
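A sketch of such a filtered scan, assuming the files sit under s3://landing/squad/ as stated above and carry the standard SQuAD columns id, title, and question. The helper names are illustrative.

```python
def sql_literal(value: str) -> str:
    # Escape single quotes so the value is a safe SQL string literal.
    return "'" + value.replace("'", "''") + "'"


def filtered_scan_sql(parquet_glob: str, title: str) -> str:
    # The WHERE clause is evaluated inside the Parquet reader, so row groups
    # whose min/max statistics exclude the title are never decoded.
    return (
        "SELECT id, question "
        f"FROM read_parquet({sql_literal(parquet_glob)}) "
        f"WHERE title = {sql_literal(title)}"
    )
```

Prefixing the generated statement with EXPLAIN in DuckDB shows the filter attached to the Parquet scan operator.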
On a large dataset (e.g. full COCO-Caption at ~100 GB) a selective predicate can reduce scan time from minutes to seconds, because most row groups are skipped.
5. Incremental Ingest — idempotent append with no duplicates
New Parquet drops in landing/ can be merged into the warehouse without creating duplicates:
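One way to sketch the idempotent append is an anti-join on a key column. This assumes the table has a unique id column, as SQuAD does; the function name is illustrative, and source can be a table name or a read_parquet('s3://landing/squad/*.parquet') expression.

```python
def idempotent_append_sql(target: str, source: str, key: str = "id") -> str:
    # Insert only the source rows whose key is not already in the target,
    # so re-running the same load adds nothing.
    return (
        f"INSERT INTO {target} "
        f"SELECT src.* FROM {source} AS src "
        f"WHERE NOT EXISTS ("
        f"SELECT 1 FROM {target} AS tgt WHERE tgt.{key} = src.{key})"
    )
```

This guards against rows already in the target. Duplicates inside a single drop would still need a DISTINCT or a QUALIFY step.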
DuckLake's MVCC guarantees that concurrent readers see a consistent snapshot even while the insert is in flight.
6. Analytical Queries — OLAP directly on lakehouse data
No separate analytics database is needed; DuckDB runs columnar OLAP over the same Parquet files used for training:
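For instance, a per-article question summary could be sketched as below. The table name squad and the chosen aggregates are assumptions for illustration; con is an open DuckDB connection.

```python
QUESTIONS_PER_TITLE_SQL = """
SELECT title,
       count(*) AS n_questions,
       avg(length(question)) AS avg_question_chars
FROM {table}
GROUP BY title
ORDER BY n_questions DESC, title
LIMIT {limit}
"""


def questions_per_title(con, table: str = "squad", limit: int = 10) -> list:
    # Columnar aggregation over the same table that feeds training.
    sql = QUESTIONS_PER_TITLE_SQL.format(table=table, limit=int(limit))
    return con.execute(sql).fetchall()
```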
Summary
| Pattern | Lakehouse benefit | Without lakehouse |
|---|---|---|
| Schema evolution | Zero-copy column add | Full dataset rewrite |
| Time travel | Snapshot pinning for reproducibility | Manual versioned file copies |
| Cross-plane join | Catalog + data in one query | Separate MLflow/W&B lookup |
| Predicate pushdown | Row-group pruning, sub-second scans | Full table scan |
| Incremental ingest | Idempotent MVCC appends | App-level dedup logic |
| Analytical queries | In-place OLAP over training data | Export to separate analytics DB |
External Infrastructure (TrueNAS Lab)
The data plane can run against existing lab infrastructure instead of the bundled Docker services. The lab exposes two services, S3 (reachable over two paths) and PostgreSQL:
| Service | Address | Notes |
|---|---|---|
| S3 (TrueNAS) | https://s3.aegeanai.com/ | Cloudflare Tunnel — always reachable, valid TLS |
| S3 (LAN direct) | http://<TRUENAS_LAN_IP>:9000 | Faster; no Cloudflare hop; requires LAN or Tailscale |
| PostgreSQL | 192.168.1.26:5432 | LAN only — Tailscale required when off-LAN |
Networking matrix
| Context | S3 endpoint | PG host | Tailscale needed? |
|---|---|---|---|
| Docker (default) | http://rustfs:9000 | postgresql:55432 | No |
| Lab / on-LAN | LAN direct http://<IP>:9000 | 192.168.1.26:5432 | No |
| Remote / off-LAN | https://s3.aegeanai.com/ | 192.168.1.26:5432 via Tailscale | Yes (PG only) |
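The matrix can be encoded as a small lookup so that client code selects endpoints by context name. The values are copied from the table above, with the <IP> placeholder left unresolved; the context keys and the Endpoints type are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoints:
    s3_endpoint: str
    pg_host: str
    needs_tailscale: bool


# One entry per row of the networking matrix.
NETWORK_MATRIX = {
    "docker": Endpoints("http://rustfs:9000", "postgresql:55432", False),
    "lan": Endpoints("http://<IP>:9000", "192.168.1.26:5432", False),
    "remote": Endpoints("https://s3.aegeanai.com/", "192.168.1.26:5432", True),
}


def resolve_endpoints(context: str) -> Endpoints:
    # Fail with the list of valid contexts instead of a bare KeyError.
    try:
        return NETWORK_MATRIX[context]
    except KeyError:
        valid = ", ".join(sorted(NETWORK_MATRIX))
        raise ValueError(f"unknown context {context!r}; expected one of: {valid}") from None
```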
Cloudflare Tunnel caveat
The public https://s3.aegeanai.com/ endpoint does not support chunked/multipart uploads.
By default, boto3 switches to multipart uploads for objects larger than 8 MB.
Set a high threshold or disable multipart for bulk Parquet ingestion through the tunnel:
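A minimal sketch using boto3's TransferConfig. The 5 GiB threshold mirrors the AWS S3 single-PUT limit and is an assumption about what the RustFS backend accepts; the function names are illustrative, and credentials are expected in the usual AWS_* environment variables.

```python
GIB = 1024 ** 3
# Assumed ceiling: AWS S3 caps a single PutObject at 5 GiB.
SINGLE_PUT_LIMIT = 5 * GIB


def tunnel_transfer_settings() -> dict:
    # Files below the threshold go out as one PutObject request.
    return {"multipart_threshold": SINGLE_PUT_LIMIT, "use_threads": False}


def upload_via_tunnel(path: str, bucket: str, key: str,
                      endpoint_url: str = "https://s3.aegeanai.com/") -> None:
    # boto3 is a third-party package, imported lazily.
    import boto3
    from boto3.s3.transfer import TransferConfig

    client = boto3.client("s3", endpoint_url=endpoint_url)
    config = TransferConfig(**tunnel_transfer_settings())
    client.upload_file(path, bucket, key, Config=config)
```

Files at or above the threshold would still be split into parts, so they belong on the LAN direct path.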
Recommendation: use the LAN direct path (or Tailscale + LAN) for bulk ingestion; reserve the Cloudflare tunnel for small files and catalog operations from remote machines.
Setup — TrueNAS S3
- In TrueNAS → Credentials → S3 API Keys: create a key with read/write on warehouse, iceberg, landing.
- Pre-create the three buckets (or run mc-job pointed at the TrueNAS endpoint).
- Set in .env:
Setup — TrueNAS PostgreSQL
- On the TrueNAS PostgreSQL instance:
- Ensure the TrueNAS firewall allows TCP 5432 from your dev host / container subnet.
- Set in .env: