Data Plane Design

Date: 2026-03-12 Status: Approved (v2)

Problem

The original three-plane model (user / control / management) governs control flow cleanly but does not govern data flow. As agentic workloads grow, a structural gap emerges: the lakehouse is used simultaneously as:

Storage for user-plane outputs (perception data, job results, telemetry)
Memory substrate for control-plane agents (job history, cluster failure patterns)
Training corpus for model fine-tuning (VLA, classification models)
Observability archive (agent traces, experiment results)

No single plane owns this. The control plane's LakehouseAgent reaches into it; the user plane writes to it; the management plane governs access to it. The lakehouse is not a single-plane component — it spans planes, and needs its own architectural treatment.

The deeper issue: in agentic systems, the data flow direction is reversed relative to traditional software.

Traditional:  logic  →  data
Agentic:      data   →  reasoning  →  action

Data becomes the substrate of cognition. The lakehouse is not analytics infrastructure — it is the persistent world model of the system. It needs a dedicated plane.

Definition

The data plane governs the movement, storage, transformation, and accessibility of data across the entire system. It is orthogonal to both reasoning (control plane) and execution (user plane).

The data plane sits horizontally: all other planes interact with it.

Goals

Provide a unified persistent storage substrate for all planes
Formalise the LakehouseAgent as the control-plane API boundary to the data plane
Define ingestion pipelines from the user plane (structured, versioned, lineaged)
Enable semantic retrieval for control-plane agents (RAG over job history and agent traces)
Support world-model snapshots for AgentOps checkpointing and causal replay
Serve as training data substrate for VLA and ML model fine-tuning (v3)

Non-goals

Real-time message passing — that is Zenoh / NATS / DDS (transport, not storage)
Governance policy definition — that is the management plane
Agent reasoning or query planning — that is the control plane
Job execution — that is the user plane

Architecture

Storage layers

Layer	Technology	Contents	Access pattern
Lakehouse	DuckDB + DuckLake (PostgreSQL catalog) + MinIO S3	Structured outputs, experiment results, job history	SQL (DuckDB in-process), Parquet partitions in MinIO
Object store	MinIO (local) / S3 / Cloudflare R2	Raw files: fMP4 chunks, GeoParquet, point clouds, model checkpoints	Blob read/write via s3fs
Embeddings store	pgvector / Chroma (v2)	Dense vectors for semantic retrieval (agent traces, docs)	ANN search
Feature store	Feast / custom Parquet (v3)	Structured ML features for VLA and classification models	Batch + online
Event log	Append-only Parquet partitions in MinIO `landing/`	Agent traces, Zenoh event recordings, ROS bag metadata	Append write, batch read
World model snapshots	DuckLake snapshots (v2)	Point-in-time environment state for AgentOps	Write on checkpoint, read on replay

DuckDB + DuckLake as the query and catalog layer

DuckDB is the in-process analytical query engine. DuckLake is the transactional catalog: ATTACH 'ducklake:postgresql://...' exposes tables whose metadata lives in PostgreSQL and whose data lives in MinIO as Parquet fragments. The full DuckLake schema (174 catalog tables) is in data-plane/tests/ducklake-schema.sql.

Key catalog tables:

experiments — experiment registry (id, project, description, created_at)
simulation_runs — per-simulator run records with S3 prefix, status, config
ducklake_table, ducklake_data_file — DuckLake's own catalog metadata

The LakehouseAgent (allowed tools: Bash(duckdb *), Bash(python *), Read, Edit) is the control plane's operator interface: it runs DuckDB queries, inspects the catalog, and calls python -m lakehouse commands. It is not a transformation pipeline; it is a catalog operator and query runner.

Data flow

User plane → data plane (ingestion)

User-plane workers (Ray jobs on torch.dev.gpu and ros.dev.gpu) write outputs to the data plane on job completion.

In v1, ingestion is manual (Ray worker writes output files; LakehouseAgent registers them via DuckDB). In v1.5, the ingestion API is a lightweight FastAPI endpoint in data-plane/lakehouse/ called directly by Ray workers on job completion.

Control plane → data plane (reads)

Control-plane agents read from the data plane in two modes:

Structured query (current): LakehouseAgent runs DuckDB queries over DuckLake and reads Parquet files via DuckDB in the agent subprocess
Semantic retrieval (v2): control-plane agents call an embeddings query endpoint to retrieve relevant agent traces, job history, or world-model state as context

# v1: LakehouseAgent reads directly
"Run SELECT * FROM job_outcomes WHERE status = 'failed' ORDER BY completed_at DESC LIMIT 10"

# v2: semantic retrieval endpoint
GET /data/retrieve?query="notebook jobs that failed on torch.dev.gpu last week"&top_k=5
→ [{job_id, summary, outcome, wandb_run_id}]
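
The natural-language query in the v2 example contains spaces and quotes, so a client would need to percent-encode it before sending. A minimal sketch using only the standard library, assuming the /data/retrieve path and the query and top_k parameters shown above; the helper name build_retrieve_url and the base URL are hypothetical and not part of this design:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_retrieve_url(base_url: str, query: str, top_k: int = 5) -> str:
    """Return a percent-encoded GET URL for the (v2) semantic retrieval endpoint."""
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    params = urlencode({"query": query, "top_k": top_k})
    return f"{base_url.rstrip('/')}/data/retrieve?{params}"


url = build_retrieve_url(
    "http://localhost:8000",  # hypothetical host; this document does not specify one
    "notebook jobs that failed on torch.dev.gpu last week",
)

# Round-trip check: decoding the query string recovers the original parameters.
decoded = parse_qs(urlsplit(url).query)
print(url)
```

Encoding on the client keeps the endpoint contract simple: the server can treat query as an opaque string and pass it straight to the embedding step.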

AgentOps → data plane (snapshots)

The control plane's AgentOps subsystem writes world-model snapshots to the data plane on checkpoint events and at the end of each agent invocation. These snapshots are the foundation for causal replay and VLA training data.

WorldModelSnapshot {
  snapshot_id:     UUID
  intent_id:       UUID
  timestamp:       ISO 8601
  agents_active:   [{role, status, tool_call_count}]
  user_plane:      {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  causal_chain:    [{intent_id, job_id, outcome}]
}
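
One way to express this schema in code is a dataclass that serialises to JSON for POST /data/snapshots. This is a sketch under assumptions: the field names come from the schema above, but the class layout, the default factories, and the sample values are illustrative only.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class WorldModelSnapshot:
    intent_id: str
    agents_active: list   # [{role, status, tool_call_count}]
    user_plane: dict      # {torch: {...}, ros: {...}}
    causal_chain: list    # [{intent_id, job_id, outcome}]
    # Generated at construction time if the caller does not supply them.
    snapshot_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise to a JSON object with the six schema fields."""
        return json.dumps(asdict(self), sort_keys=True)


# Sample values are made up for illustration.
snapshot = WorldModelSnapshot(
    intent_id=str(uuid.uuid4()),
    agents_active=[{"role": "lakehouse", "status": "running", "tool_call_count": 3}],
    user_plane={
        "torch": {"jobs_in_flight": 2, "gpu_util": 0.75},
        "ros": {"jobs_in_flight": 0, "gpu_util": 0.0},
    },
    causal_chain=[],
)
payload = json.loads(snapshot.to_json())
```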

The lakehouse as persistent world model

In agentic robotics systems, the lakehouse accumulates not just analytics data but the episodic and semantic memory of the entire system:

Memory type	Contents	Enables
Episodic	Job history, cluster failure events, navigation trial outcomes	Agent reasoning over "what happened before"
Semantic	Learned cluster failure patterns, experiment regression signatures	Agent pattern matching without re-executing
Procedural	Successful job submission sequences, DuckLake catalog dependency chains	Agent skill recall
Perceptual	GeoParquet outputs, YOLOv8 detections, SLAM maps	VLA fine-tuning, world-model grounding

For the turtlebot-maze reference application:

Every Nav2 navigation trial → appended to event_log.ros_navigation_trials
Every YOLOv8 detection → appended to event_log.object_detections
Every world-model snapshot at decision point → stored in world_model.snapshots
Aggregate: accumulated robot experience becomes a fine-tuning corpus for VLA models

This is the convergence point between the data plane and world-model-based VLA research: the lakehouse is the persistent world model.

Interfaces

Ingestion API (data plane ← user plane)

POST /data/ingest
Body: {table: str, records: list[dict], schema_version: str, job_id: UUID, tenant_id: UUID}
→ 201 {partition_id, row_count}

POST /data/snapshots
Body: WorldModelSnapshot
→ 201 {snapshot_id}
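
Before a Ray worker calls POST /data/ingest, it can check the request body against the shape listed above. The validator below is a hedged sketch rather than the endpoint's actual validation logic: the field names and types come from the body definition, while the function name and error messages are hypothetical.

```python
import uuid

# Field name -> expected Python type, taken from the ingest body above.
# UUIDs are carried as strings in JSON.
REQUIRED_FIELDS = {
    "table": str,
    "records": list,
    "schema_version": str,
    "job_id": str,
    "tenant_id": str,
}


def validate_ingest_body(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the body looks well-formed."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in body:
            errors.append(f"missing field: {name}")
        elif not isinstance(body[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    for name in ("job_id", "tenant_id"):
        value = body.get(name)
        if isinstance(value, str):
            try:
                uuid.UUID(value)
            except ValueError:
                errors.append(f"{name} is not a valid UUID")
    records = body.get("records")
    if isinstance(records, list) and not all(isinstance(r, dict) for r in records):
        errors.append("records must be a list of objects")
    return errors


good = {
    "table": "event_log.object_detections",
    "records": [{"label": "person", "score": 0.91}],  # illustrative record
    "schema_version": "1",
    "job_id": str(uuid.uuid4()),
    "tenant_id": str(uuid.uuid4()),
}
bad = {"records": ["not-a-dict"], "schema_version": "1", "job_id": "abc", "tenant_id": 7}
good_errors = validate_ingest_body(good)
bad_errors = validate_ingest_body(bad)
```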

Query API (data plane → control plane)

GET /data/query?sql=<duckdb_sql>&tenant_id=<uuid>
→ {columns: [...], rows: [...]}

GET /data/retrieve?query=<natural_language>&top_k=<n>&tenant_id=<uuid>   (v2)
→ [{id, text, score, metadata}]

Policy interface (management plane → data plane)

PUT /data/policy
Body: {tenant_id, retention_days, allowed_tables: [...], max_storage_gb: float}
→ 200
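
This document does not specify how the data plane applies a stored policy. One possible reading is that each ingest request is checked against the tenant's allowed_tables and max_storage_gb. The sketch below assumes that interpretation; TenantPolicy and check_ingest are hypothetical names, and retention enforcement is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    """Mirror of the PUT /data/policy body."""
    tenant_id: str
    retention_days: int
    allowed_tables: tuple
    max_storage_gb: float


def check_ingest(policy: TenantPolicy, table: str, used_gb: float, incoming_gb: float):
    """Return (allowed, reason) for a proposed write under the given policy."""
    if table not in policy.allowed_tables:
        return False, f"table '{table}' is not in allowed_tables"
    if used_gb + incoming_gb > policy.max_storage_gb:
        return False, "max_storage_gb would be exceeded"
    return True, "ok"


# Illustrative policy values.
policy = TenantPolicy(
    tenant_id="tenant-a",
    retention_days=90,
    allowed_tables=("experiments", "simulation_runs"),
    max_storage_gb=10.0,
)
allowed = check_ingest(policy, "experiments", used_gb=4.0, incoming_gb=1.0)
wrong_table = check_ingest(policy, "squad", used_gb=4.0, incoming_gb=1.0)
over_quota = check_ingest(policy, "experiments", used_gb=9.5, incoming_gb=1.0)
```

Under this reading the management plane only supplies the policy values; the check itself runs in the data plane, which matches the split described in the next section.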

Critical distinction: data plane vs management plane

A common mistake is placing the lakehouse under the management plane. They are separate:

Data Plane	Management Plane
Stores and transforms data	Governs how data is stored
Serves queries	Defines query access policies
Manages schemas and lineage	Manages access rights and retention
Enables agent reasoning	Enforces compliance

The management plane governs the data plane. It does not own it.

Evolution path

v1   — Data plane in data-plane/; LakehouseAgent as catalog operator; manual ingestion
v1.5 — Formalise ingestion API; structured AgentEvent log
v2   — Embeddings store + RAG retrieval endpoint for control-plane agents
       World-model snapshots (AgentOps → data plane)
       Schema lineage visible in management-plane dashboard
v3   — VLA training pipeline: accumulated perception data → fine-tuning loop
       Feature store for online inference (real-time VLA feature serving)
       Data plane exposes MCP server: agents query job history and world state via tool calls

Requirements (DP-xxx)

Traces to system-level requirements in architecture/four-plane.md.

ID	Requirement	Traces to	Version
DP-001	The data plane shall govern movement, storage, transformation, and accessibility of data across the system	SYS-001	v1
DP-002	The data plane shall sit horizontally, serving all other planes	SYS-007	v1
DP-003	The data plane shall operate with latency on the order of seconds and eventual consistency	SYS-001	v1
DP-004	Data plane failure shall cause query failures and queued ingestion; agents lose context	SYS-002	v1
DP-006	The data plane shall provide a unified persistent storage substrate for all planes	SYS-007	v1
DP-007	LakehouseAgent shall be the control-plane API boundary to the data plane	CP-010	v1
DP-008	The data plane shall define structured, versioned, lineaged ingestion pipelines from user plane	—	v1
DP-009	The data plane shall enable semantic retrieval for control-plane agents (RAG over job history)	—	v2
DP-010	The data plane shall support world-model snapshots for AgentOps checkpointing and causal replay	CP-030	v2
DP-011	The data plane shall serve as training data substrate for VLA fine-tuning	—	v3
DP-012	The data plane shall use DuckDB + DuckLake (PostgreSQL catalog) + MinIO S3	SYS-007	v1
DP-013	Structured outputs and experiment results shall be stored as Parquet in MinIO via DuckLake	DP-012	v1
DP-014	Raw files (fMP4, GeoParquet, point clouds, checkpoints) shall be stored in object store	DP-012	v1
DP-015	Embeddings store (pgvector/Chroma) shall support dense vectors and semantic retrieval	DP-009	v2
DP-016	Feature store shall support structured ML features for VLA and classification	DP-011	v3
DP-017	Event log shall be append-only Parquet in MinIO `landing/` for agent traces and ROS bag metadata	—	v1
DP-018	World model snapshots shall capture point-in-time environment state	DP-010	v2
DP-019	DuckLake transactional catalog shall use PostgreSQL metadata and MinIO Parquet data	DP-012	v1
DP-020	DuckLake schema shall include `experiments`, `simulation_runs`, `ducklake_table`, `ducklake_data_file`	DP-019	v1
DP-021	LakehouseAgent allowedTools: `Bash(duckdb *),Bash(python *),Read,Edit`	CP-010	v1
DP-022	The data plane shall not be a transformation pipeline; LakehouseAgent is a catalog operator	—	v1
DP-023	Ingestion API: `POST /data/ingest` → 201 {partition_id, row_count}	—	v1.5
DP-024	Snapshot API: `POST /data/snapshots` with WorldModelSnapshot → 201 {snapshot_id}	DP-010	v2
DP-025	Query API: `GET /data/query?sql=<duckdb_sql>` → {columns, rows}	—	v1
DP-026	Semantic retrieval API: `GET /data/retrieve?query=<text>&top_k=<n>` → [{id, text, score}]	DP-009	v2
DP-027	Policy interface: `PUT /data/policy` with retention/RBAC config → 200	MP-004	v2
DP-028	The data plane shall NOT be the message passing substrate (that is Zenoh/NATS/DDS)	—	v1
DP-029	The data plane shall NOT define governance policy (that is the management plane)	—	v1
DP-030	User plane Ray workers shall write outputs to MinIO `landing/`; DuckLake registers fragments	—	v1
DP-031	v1 ingestion: Ray worker writes files; LakehouseAgent registers via DuckDB	—	v1
DP-032	v1.5 ingestion: lightweight API endpoint callable directly by Ray workers	DP-023	v1.5
DP-033	v1 reads: LakehouseAgent runs DuckDB queries	—	v1
DP-034	v2 reads: semantic retrieval via embeddings query endpoint	DP-026	v2
DP-035	The data plane shall accumulate episodic memory (job history, cluster failures, navigation trials)	—	v1
DP-036	The data plane shall accumulate semantic memory (failure patterns, regression signatures)	—	v2
DP-037	The data plane shall accumulate procedural memory (successful job sequences, dependency chains)	—	v2
DP-038	The data plane shall accumulate perceptual memory (GeoParquet, YOLOv8 detections, SLAM maps)	—	v1
DP-039	turtlebot-maze: Nav2 navigation trials stored in `event_log.ros_navigation_trials`	SYS-003	v1
DP-040	turtlebot-maze: YOLOv8 detections stored in `event_log.object_detections`	SYS-003	v1
DP-041	turtlebot-maze: world-model snapshots at decision points in `world_model.snapshots`	SYS-003, DP-010	v2
DP-042	The data plane shall be separate from the management plane	SYS-001	v1

Example Use Cases of the Auraison Lakehouse

The following examples use the SQuAD validation dataset (stored in s3://landing/squad/) to illustrate the concrete benefits of the DuckLake lakehouse over a plain object store or traditional database.

1. Schema Evolution — add columns without rewriting parquet files

After a model inference run, new columns can be appended to an existing table with no data rewrite:

ALTER TABLE squad ADD COLUMN model_answer VARCHAR;
ALTER TABLE squad ADD COLUMN confidence FLOAT;

-- Backfill from an inference run
UPDATE squad SET model_answer = 'Denver Broncos', confidence = 0.97
WHERE id = '56be4db0acb8001400a502ec';

DuckLake records the schema change in the PostgreSQL catalog; the underlying Parquet files in MinIO are untouched. Without a lakehouse this requires a full dataset rewrite.

2. Time Travel — pin to the exact snapshot used for training

Every write creates a new catalog snapshot. Queries can be issued against any prior version:

-- Inspect data before model answers were added
SELECT * FROM squad AT (VERSION => 1) LIMIT 5;

-- Diff two snapshots: which answers changed?
SELECT curr.id, curr.model_answer, prev.model_answer AS old_answer
FROM squad AT (VERSION => 3) curr
JOIN squad AT (VERSION => 1) prev USING (id)
WHERE curr.model_answer IS DISTINCT FROM prev.model_answer;

This makes ML experiments reproducible: a training run can be tied to a specific catalog version and replayed exactly.

3. Cross-plane Join — catalog metadata + raw data in a single query

simulation_runs (DuckLake catalog) and squad (raw parquet in MinIO) can be joined without ETL:

SELECT
    r.config->>'model'                                                   AS model,
    COUNT(*) FILTER (WHERE s.model_answer = s.answers.text[1])          AS exact_match,
    COUNT(*)                                                             AS total,
    ROUND(100.0 *
        COUNT(*) FILTER (WHERE s.model_answer = s.answers.text[1])
        / COUNT(*), 2)                                                   AS em_pct
FROM squad s
JOIN simulation_runs r ON r.s3_prefix = s.title
GROUP BY 1
ORDER BY em_pct DESC;

Without the lakehouse this requires a separate MLflow or W&B lookup followed by a manual join.

4. Predicate Pushdown — skip irrelevant row groups

DuckDB pushes WHERE predicates into the Parquet reader, scanning only matching row groups:

-- Only reads the row groups where title = 'Super_Bowl_50'
SELECT id, question, answers.text[1] AS answer
FROM read_parquet('s3://landing/squad/**/*.parquet')
WHERE title = 'Super_Bowl_50';

-- Inspect the query plan
EXPLAIN SELECT * FROM squad WHERE title = 'Beyoncé';

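On a large dataset (e.g. full COCO-Caption at ~100 GB), this can reduce scan time from minutes to seconds, provided the data is sorted or partitioned so that row-group statistics are selective for the filter column.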
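
The idea behind row-group pruning can be shown in pure Python. This is a conceptual sketch of min/max statistics pruning, not DuckDB's implementation: the RowGroup class, the scan_equals helper, and the sample rows are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_title: str   # smallest title value stored in this group
    max_title: str   # largest title value stored in this group
    rows: list       # list of (id, title) tuples


def scan_equals(row_groups, title):
    """Return the matching rows and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if not (group.min_title <= title <= group.max_title):
            continue  # statistics prove no row can match: skip without reading
        groups_read += 1
        matches.extend(row for row in group.rows if row[1] == title)
    return matches, groups_read


groups = [
    RowGroup("Amazon_rainforest", "Beyoncé", [(1, "Amazon_rainforest"), (2, "Beyoncé")]),
    RowGroup("Normans", "Super_Bowl_50", [(3, "Normans"), (4, "Super_Bowl_50")]),
    RowGroup("Teacher", "Warsaw", [(5, "Teacher"), (6, "Warsaw")]),
]
rows, groups_read = scan_equals(groups, "Super_Bowl_50")
```

Only one of the three groups is read here because the data is sorted by title. With unsorted data every group's min/max range would tend to cover the target value and nothing would be skipped.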

5. Incremental Ingest — idempotent append with no duplicates

New parquet drops in landing/ can be merged into the warehouse without risk of duplicates:

INSERT INTO squad
SELECT s.*
FROM read_parquet('s3://landing/squad/**/*.parquet') s
LEFT JOIN squad w ON s.id = w.id
WHERE w.id IS NULL;

DuckLake's MVCC guarantees that concurrent readers see a consistent snapshot even while the insert is in flight.
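
The anti-join pattern is idempotent when merges run one after another: a second run over the same landing data inserts nothing. The sketch below demonstrates this with the standard-library sqlite3 module as a stand-in for DuckDB, so that it runs without extra dependencies. The landing table replaces read_parquet(...), and the two-column schema is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE squad (id TEXT, question TEXT)")
conn.execute("CREATE TABLE landing (id TEXT, question TEXT)")  # stand-in for the Parquet drop
conn.executemany(
    "INSERT INTO landing VALUES (?, ?)",
    [("a1", "Q1"), ("a2", "Q2"), ("a3", "Q3")],
)
conn.execute("INSERT INTO squad VALUES ('a1', 'Q1')")  # one row already in the warehouse

# Same shape as the INSERT ... LEFT JOIN ... WHERE w.id IS NULL statement above.
MERGE_SQL = """
INSERT INTO squad
SELECT s.*
FROM landing s
LEFT JOIN squad w ON s.id = w.id
WHERE w.id IS NULL
"""


def count_rows() -> int:
    return conn.execute("SELECT COUNT(*) FROM squad").fetchone()[0]


def merge() -> int:
    """Run the anti-join merge and return how many rows it added."""
    before = count_rows()
    conn.execute(MERGE_SQL)
    conn.commit()
    return count_rows() - before


first = merge()    # adds the two ids not yet present
second = merge()   # re-running over the same landing data adds nothing
total = count_rows()
```

Note that the pattern deduplicates against the target table only. If the landing data itself contains the same id twice, both copies are inserted on the first run, so deduplicating the source may still be necessary.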

6. Analytical Queries — OLAP directly on lakehouse data

No separate analytics database is needed; DuckDB runs columnar OLAP over the same Parquet files used for training:

-- Answer length distribution by article
SELECT
    title,
    COUNT(*)                          AS num_questions,
    AVG(LENGTH(answers.text[1]))      AS avg_answer_len,
    MAX(LENGTH(context))              AS max_context_len
FROM squad
GROUP BY title
ORDER BY num_questions DESC;

-- Questions with multiple valid answers (annotator disagreement)
SELECT id, question, len(answers.text) AS num_answers
FROM squad
WHERE len(answers.text) > 1
ORDER BY num_answers DESC
LIMIT 10;

Summary

Pattern	Lakehouse benefit	Without lakehouse
Schema evolution	Zero-copy column add	Full dataset rewrite
Time travel	Snapshot pinning for reproducibility	Manual versioned file copies
Cross-plane join	Catalog + data in one query	Separate MLflow/W&B lookup
Predicate pushdown	Row-group pruning, scans in seconds	Full table scan
Incremental ingest	Idempotent MVCC appends	App-level dedup logic
Analytical queries	In-place OLAP over training data	Export to separate analytics DB

External Infrastructure (TrueNAS Lab)

The data plane can run against existing lab infrastructure instead of the bundled Docker services. The lab exposes two services, S3 and PostgreSQL; S3 is reachable at two addresses:

Service	Address	Notes
S3 (TrueNAS)	`https://s3.aegeanai.com/`	Cloudflare Tunnel — always reachable, valid TLS
S3 (LAN direct)	`http://<TRUENAS_LAN_IP>:9000`	Faster; no Cloudflare hop; requires LAN or Tailscale
PostgreSQL	`192.168.1.26:5432`	LAN only — Tailscale required when off-LAN

Networking matrix

Context	S3 endpoint	PG host	Tailscale needed?
Docker (default)	`http://minio:9000`	`postgresql:55432`	No
Lab / on-LAN	LAN direct `http://<IP>:9000`	`192.168.1.26:5432`	No
Remote / off-LAN	`https://s3.aegeanai.com/`	`192.168.1.26:5432` via Tailscale	Yes (PG only)
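
The matrix can also be kept as a small lookup so that scripts derive their settings from a single context name. This is an illustrative sketch: the context keys, the env_for helper, and the PG_HOST and TAILSCALE_REQUIRED output keys are hypothetical. MINIO_ENDPOINT and MINIO_USE_SSL are the .env variables used in the TrueNAS S3 setup, and the `<IP>` placeholder is kept as written.

```python
from urllib.parse import urlsplit

# One entry per row of the networking matrix.
NETWORK_MATRIX = {
    "docker": {"s3_endpoint": "http://minio:9000", "pg_host": "postgresql:55432", "tailscale": False},
    "lan": {"s3_endpoint": "http://<IP>:9000", "pg_host": "192.168.1.26:5432", "tailscale": False},
    "remote": {"s3_endpoint": "https://s3.aegeanai.com/", "pg_host": "192.168.1.26:5432", "tailscale": True},
}


def env_for(context: str) -> dict:
    """Return the settings for one networking context."""
    try:
        row = NETWORK_MATRIX[context]
    except KeyError:
        raise ValueError(f"unknown context: {context!r}") from None
    # SSL is on exactly when the S3 endpoint uses https.
    use_ssl = urlsplit(row["s3_endpoint"]).scheme == "https"
    return {
        "MINIO_ENDPOINT": row["s3_endpoint"],
        "MINIO_USE_SSL": "true" if use_ssl else "false",
        "PG_HOST": row["pg_host"],
        "TAILSCALE_REQUIRED": row["tailscale"],
    }


remote_env = env_for("remote")
docker_env = env_for("docker")
```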

Cloudflare Tunnel caveat

The public https://s3.aegeanai.com/ endpoint does not support chunked/multipart uploads. By default, boto3 switches to multipart uploads for objects larger than 8 MB (the default multipart_threshold in TransferConfig). For bulk Parquet ingestion through the tunnel, set a threshold high enough that multipart is never triggered:

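import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3", endpoint_url="https://s3.aegeanai.com")  # credentials come from the environment
config = TransferConfig(multipart_threshold=10 * 1024 ** 3)  # 10 GiB threshold: effectively disables multipart
s3.upload_fileobj(f, bucket, key, Config=config)  # f: a file object opened in binary mode by the caller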

Recommendation: use the LAN direct path (or Tailscale + LAN) for bulk ingestion; reserve the Cloudflare tunnel for small files and catalog operations from remote machines.

Setup — TrueNAS S3

In TrueNAS → Credentials → S3 API Keys: create a key with read/write on warehouse, iceberg, landing.
Pre-create the three buckets (or run mc-job pointed at the TrueNAS endpoint).
Set in .env:

MINIO_ENDPOINT=https://s3.aegeanai.com   # or http://<LAN-IP>:9000 for LAN direct
MINIO_ACCESS_KEY=<truenas-key-id>
MINIO_SECRET_KEY=<truenas-secret>
MINIO_USE_SSL=true                        # false for LAN direct

Setup — TrueNAS PostgreSQL

On the TrueNAS PostgreSQL instance:

CREATE ROLE ducklake LOGIN PASSWORD '<choose-a-password>';
CREATE DATABASE ducklake OWNER ducklake;

Ensure the TrueNAS firewall allows TCP 5432 from your dev host / container subnet.
Set in .env:

DUCKLAKE_POSTGRES_DSN=postgresql://ducklake:<password>@192.168.1.26:5432/ducklake

Starting containers against external infra

# Start only the dev container (skip minio / postgresql / mc-job):
cd data-plane
docker compose up -d lakehouse.dev.cpu

# The container inherits MINIO_ENDPOINT and DUCKLAKE_POSTGRES_DSN from .env
# and falls back to Docker service names if the vars are not set.

# Start with local Docker infra (original behaviour):
docker compose --profile local-infra up -d
