
Data Plane Design

Date: 2026-03-12
Status: Approved (v2)


Problem

The original three-plane model (user / control / management) governs control flow cleanly but does not govern data flow. As agentic workloads grow, a structural gap emerges: the lakehouse is used simultaneously as:

  • Storage for user-plane outputs (perception data, job results, telemetry)
  • Memory substrate for control-plane agents (job history, cluster failure patterns)
  • Training corpus for model fine-tuning (VLA, classification models)
  • Observability archive (agent traces, experiment results)

No single plane owns this. The control plane's LakehouseAgent reaches into it; the user plane writes to it; the management plane governs access to it. The lakehouse is not a single-plane component — it spans planes, and needs its own architectural treatment.

The deeper issue: in agentic systems, the data flow direction is reversed relative to traditional software.

Traditional:  logic → data
Agentic:      data → reasoning → action

Data becomes the substrate of cognition. The lakehouse is not analytics infrastructure — it is the persistent world model of the system. It needs a dedicated plane.


Definition

The data plane governs the movement, storage, transformation, and accessibility of data across the entire system. It is orthogonal to both reasoning (control plane) and execution (user plane).

The data plane sits horizontally — all other planes interact with it.


Goals

  • Provide a unified persistent storage substrate for all planes
  • Formalise the LakehouseAgent as the control-plane API boundary to the data plane
  • Define ingestion pipelines from the user plane (structured, versioned, lineaged)
  • Enable semantic retrieval for control-plane agents (RAG over job history and agent traces)
  • Support world-model snapshots for AgentOps checkpointing and causal replay
  • Serve as training data substrate for VLA and ML model fine-tuning (v3)

Non-goals

  • Real-time message passing — that is Zenoh / NATS / DDS (transport, not storage)
  • Governance policy definition — that is the management plane
  • Agent reasoning or query planning — that is the control plane
  • Job execution — that is the user plane

Architecture

Storage layers

| Layer | Technology | Contents | Access pattern |
|---|---|---|---|
| Lakehouse | DuckDB + DuckLake (PostgreSQL catalog) + MinIO S3 | Structured outputs, experiment results, job history | SQL (DuckDB in-process), Parquet partitions in MinIO |
| Object store | MinIO (local) / S3 / Cloudflare R2 | Raw files: fMP4 chunks, GeoParquet, point clouds, model checkpoints | Blob read/write via s3fs |
| Embeddings store | pgvector / Chroma (v2) | Dense vectors for semantic retrieval (agent traces, docs) | ANN search |
| Feature store | Feast / custom Parquet (v3) | Structured ML features for VLA and classification models | Batch + online |
| Event log | Append-only Parquet partitions in MinIO landing/ | Agent traces, Zenoh event recordings, ROS bag metadata | Append write, batch read |
| World model snapshots | DuckLake snapshots (v2) | Point-in-time environment state for AgentOps | Write on checkpoint, read on replay |

DuckDB + DuckLake as the query and catalog layer

DuckDB is the in-process analytical query engine. DuckLake is the transactional catalog: ATTACH 'ducklake:postgresql://...' exposes tables whose metadata lives in PostgreSQL and whose data lives in MinIO as Parquet fragments. The full DuckLake schema (174 catalog tables) is in data-plane/tests/ducklake-schema.sql.
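A minimal sketch of such a session from Python, following the attach pattern above (the DSN, the lake alias, and the DATA_PATH are illustrative placeholders, not the deployed values):

import duckdb

con = duckdb.connect()
# Load the DuckLake extension and attach the PostgreSQL-backed catalog.
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("""
    ATTACH 'ducklake:postgresql://ducklake:<password>@postgresql:5432/ducklake'
    AS lake (DATA_PATH 's3://warehouse/')
""")

# Table metadata resolves via PostgreSQL; row data is read as Parquet from MinIO.
print(con.sql("SELECT id, project, created_at FROM lake.experiments LIMIT 5"))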

Key catalog tables:

  • experiments — experiment registry (id, project, description, created_at)
  • simulation_runs — per-simulator run records with S3 prefix, status, config
  • ducklake_table, ducklake_data_file — DuckLake's own catalog metadata

The LakehouseAgent (Bash(duckdb *), Bash(python *), Read, Edit) is the control plane's operator interface: it runs DuckDB queries, inspects the catalog, and calls python -m lakehouse commands. It is not a transformation pipeline — it is a catalog operator and query runner.


Data flow

User plane → data plane (ingestion)

User-plane workers (Ray jobs on torch.dev.gpu and ros.dev.gpu) write outputs to the data plane on job completion.

In v1, ingestion is manual (Ray worker writes output files; LakehouseAgent registers them via DuckDB). In v1.5, the ingestion API is a lightweight FastAPI endpoint in data-plane/lakehouse/ called directly by Ray workers on job completion.
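A sketch of the v1.5 call path from a Ray worker, assuming the POST /data/ingest shape defined under Interfaces below (the service address and target table are illustrative):

import requests

def on_job_complete(job_id: str, tenant_id: str, records: list[dict]) -> None:
    # Illustrative Ray-worker completion hook: push structured outputs
    # to the data-plane ingestion API.
    resp = requests.post(
        "http://data-plane:8000/data/ingest",  # hypothetical endpoint address
        json={
            "table": "job_outcomes",           # illustrative target table
            "records": records,
            "schema_version": "1",
            "job_id": job_id,
            "tenant_id": tenant_id,
        },
        timeout=30,
    )
    resp.raise_for_status()  # expect 201 {partition_id, row_count}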

Control plane → data plane (reads)

Control-plane agents read from the data plane in two modes:

  1. Structured query (current): LakehouseAgent runs DuckDB queries over DuckLake and reads Parquet files via DuckDB in the agent subprocess
  2. Semantic retrieval (v2): control-plane agents call an embeddings query endpoint to retrieve relevant agent traces, job history, or world-model state as context
# v1: LakehouseAgent reads directly
"Run SELECT * FROM job_outcomes ORDER BY completed_at DESC LIMIT 10 WHERE status = 'failed'"

# v2: semantic retrieval endpoint
GET /data/retrieve?query="notebook jobs that failed on torch.dev.gpu last week"&top_k=5
[{job_id, summary, outcome, wandb_run_id}]

AgentOps → data plane (snapshots)

The control plane's AgentOps subsystem writes world-model snapshots to the data plane on checkpoint events and at the end of each agent invocation. These snapshots are the foundation for causal replay and VLA training data.

WorldModelSnapshot {
  snapshot_id: UUID
  intent_id: UUID
  timestamp: ISO 8601
  agents_active: [{role, status, tool_call_count}]
  user_plane: {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  causal_chain: [{intent_id, job_id, outcome}]
}
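A hedged sketch of an AgentOps checkpoint hook emitting one of these snapshots via the snapshot API defined under Interfaces below (service address and field values are illustrative):

import uuid
import requests
from datetime import datetime, timezone

snapshot = {
    "snapshot_id": str(uuid.uuid4()),
    "intent_id": str(uuid.uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "agents_active": [
        {"role": "LakehouseAgent", "status": "idle", "tool_call_count": 3},
    ],
    "user_plane": {
        "torch": {"jobs_in_flight": 2, "gpu_util": 0.81},
        "ros": {"jobs_in_flight": 1, "gpu_util": 0.34},
    },
    "causal_chain": [],  # populated from the intent ledger in practice
}

# POST /data/snapshots → 201 {snapshot_id}
requests.post("http://data-plane:8000/data/snapshots",
              json=snapshot, timeout=10).raise_for_status()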

The lakehouse as persistent world model

In agentic robotics systems, the lakehouse accumulates not just analytics data but the episodic and semantic memory of the entire system:

| Memory type | Contents | Enables |
|---|---|---|
| Episodic | Job history, cluster failure events, navigation trial outcomes | Agent reasoning over "what happened before" |
| Semantic | Learned cluster failure patterns, experiment regression signatures | Agent pattern matching without re-executing |
| Procedural | Successful job submission sequences, DuckLake catalog dependency chains | Agent skill recall |
| Perceptual | GeoParquet outputs, YOLOv8 detections, SLAM maps | VLA fine-tuning, world-model grounding |

For the turtlebot-maze reference application (an append sketch follows this list):

  • Every Nav2 navigation trial → appended to event_log.ros_navigation_trials
  • Every YOLOv8 detection → appended to event_log.object_detections
  • Every world-model snapshot at decision point → stored in world_model.snapshots
  • Aggregate: accumulated robot experience becomes a fine-tuning corpus for VLA models
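A minimal sketch of one such append under the v1 pattern (the worker drops Parquet under landing/, the LakehouseAgent folds it into the event log); the landing prefix and exact schema are assumptions:

import duckdb

con = duckdb.connect()  # assumes the DuckLake catalog is already attached (see above)
# Illustrative: register newly landed Nav2 trial records in the event log.
con.execute("""
    INSERT INTO event_log.ros_navigation_trials
    SELECT * FROM read_parquet('s3://landing/ros/nav_trials/*.parquet')
""")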

This is the convergence point between the data plane and world-model-based VLA research: the lakehouse is the persistent world model.


Interfaces

Ingestion API (data plane ← user plane)

POST /data/ingest
Body: {table: str, records: list[dict], schema_version: str, job_id: UUID, tenant_id: UUID}
→ 201 {partition_id, row_count}

POST /data/snapshots
Body: WorldModelSnapshot
→ 201 {snapshot_id}

Query API (data plane → control plane)

GET /data/query?sql=<duckdb_sql>&tenant_id=<uuid>
→ {columns: [...], rows: [...]}

GET /data/retrieve?query=<natural_language>&top_k=<n>&tenant_id=<uuid> (v2)
→ [{id, text, score, metadata}]
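A sketch of a control-plane client against these two endpoints (the base URL and tenant UUID are placeholders; error handling is elided):

import requests

BASE = "http://data-plane:8000"  # hypothetical service address
TENANT = "00000000-0000-0000-0000-000000000000"  # placeholder tenant

# Structured query (v1): GET /data/query
resp = requests.get(f"{BASE}/data/query", params={
    "sql": "SELECT status, COUNT(*) AS n FROM job_outcomes GROUP BY status",
    "tenant_id": TENANT,
}, timeout=30)
resp.raise_for_status()
payload = resp.json()  # {columns: [...], rows: [...]}

# Semantic retrieval (v2): GET /data/retrieve
hits = requests.get(f"{BASE}/data/retrieve", params={
    "query": "notebook jobs that failed on torch.dev.gpu last week",
    "top_k": 5,
    "tenant_id": TENANT,
}, timeout=30).json()  # [{id, text, score, metadata}]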

Policy interface (management plane → data plane)

PUT /data/policy
Body: {tenant_id, retention_days, allowed_tables: [...], max_storage_gb: float}
→ 200

Critical distinction: data plane vs management plane

A common mistake is placing the lakehouse under the management plane. They are separate:

| Data Plane | Management Plane |
|---|---|
| Stores and transforms data | Governs how data is stored |
| Serves queries | Defines query access policies |
| Manages schemas and lineage | Manages access rights and retention |
| Enables agent reasoning | Enforces compliance |

The management plane governs the data plane. It does not own it.


Evolution path

v1   — Data plane in data-plane/; LakehouseAgent as catalog operator; manual ingestion
v1.5 — Formalise ingestion API; structured AgentEvent log
v2   — Embeddings store + RAG retrieval endpoint for control-plane agents
       World-model snapshots (AgentOps → data plane)
       Schema lineage visible in management-plane dashboard
v3   — VLA training pipeline: accumulated perception data → fine-tuning loop
       Feature store for online inference (real-time VLA feature serving)
       Data plane exposes MCP server: agents query job history and world state via tool calls

Requirements (DP-xxx)

Traces to system-level requirements in architecture/four-plane.md.

| ID | Requirement | Traces to | Version |
|---|---|---|---|
| DP-001 | The data plane shall govern movement, storage, transformation, and accessibility of data across the system | SYS-001 | v1 |
| DP-002 | The data plane shall sit horizontally, serving all other planes | SYS-007 | v1 |
| DP-003 | The data plane shall have latency of seconds, eventually consistent | SYS-001 | v1 |
| DP-004 | Data plane failure shall cause query failures and queued ingestion; agents lose context | SYS-002 | v1 |
| DP-006 | The data plane shall provide a unified persistent storage substrate for all planes | SYS-007 | v1 |
| DP-007 | LakehouseAgent shall be the control-plane API boundary to the data plane | CP-010 | v1 |
| DP-008 | The data plane shall define structured, versioned, lineaged ingestion pipelines from user plane | | v1 |
| DP-009 | The data plane shall enable semantic retrieval for control-plane agents (RAG over job history) | | v2 |
| DP-010 | The data plane shall support world-model snapshots for AgentOps checkpointing and causal replay | CP-030 | v2 |
| DP-011 | The data plane shall serve as training data substrate for VLA fine-tuning | | v3 |
| DP-012 | The data plane shall use DuckDB + DuckLake (PostgreSQL catalog) + MinIO S3 | SYS-007 | v1 |
| DP-013 | Structured outputs and experiment results shall be stored as Parquet in MinIO via DuckLake | DP-012 | v1 |
| DP-014 | Raw files (fMP4, GeoParquet, point clouds, checkpoints) shall be stored in object store | DP-012 | v1 |
| DP-015 | Embeddings store (pgvector/Chroma) shall support dense vectors and semantic retrieval | DP-009 | v2 |
| DP-016 | Feature store shall support structured ML features for VLA and classification | DP-011 | v3 |
| DP-017 | Event log shall be append-only Parquet in MinIO landing/ for agent traces and ROS bag metadata | | v1 |
| DP-018 | World model snapshots shall capture point-in-time environment state | DP-010 | v2 |
| DP-019 | DuckLake transactional catalog shall use PostgreSQL metadata and MinIO Parquet data | DP-012 | v1 |
| DP-020 | DuckLake schema shall include experiments, simulation_runs, ducklake_table, ducklake_data_file | DP-019 | v1 |
| DP-021 | LakehouseAgent allowedTools: Bash(duckdb *), Bash(python *), Read, Edit | CP-010 | v1 |
| DP-022 | The data plane shall not be a transformation pipeline; LakehouseAgent is a catalog operator | | v1 |
| DP-023 | Ingestion API: POST /data/ingest → 201 {partition_id, row_count} | | v1.5 |
| DP-024 | Snapshot API: POST /data/snapshots with WorldModelSnapshot → 201 {snapshot_id} | DP-010 | v2 |
| DP-025 | Query API: GET /data/query?sql=<duckdb_sql> → {columns, rows} | | v1 |
| DP-026 | Semantic retrieval API: GET /data/retrieve?query=<text>&top_k=<n> → [{id, text, score}] | DP-009 | v2 |
| DP-027 | Policy interface: PUT /data/policy with retention/RBAC config → 200 | MP-004 | v2 |
| DP-028 | The data plane shall NOT be the message passing substrate (that is Zenoh/NATS/DDS) | | v1 |
| DP-029 | The data plane shall NOT define governance policy (that is the management plane) | | v1 |
| DP-030 | User plane Ray workers shall write outputs to MinIO landing/; DuckLake registers fragments | | v1 |
| DP-031 | v1 ingestion: Ray worker writes files; LakehouseAgent registers via DuckDB | | v1 |
| DP-032 | v1.5 ingestion: lightweight API endpoint callable directly by Ray workers | DP-023 | v1.5 |
| DP-033 | v1 reads: LakehouseAgent runs DuckDB queries | | v1 |
| DP-034 | v2 reads: semantic retrieval via embeddings query endpoint | DP-026 | v2 |
| DP-035 | The data plane shall accumulate episodic memory (job history, cluster failures, navigation trials) | | v1 |
| DP-036 | The data plane shall accumulate semantic memory (failure patterns, regression signatures) | | v2 |
| DP-037 | The data plane shall accumulate procedural memory (successful job sequences, dependency chains) | | v2 |
| DP-038 | The data plane shall accumulate perceptual memory (GeoParquet, YOLOv8 detections, SLAM maps) | | v1 |
| DP-039 | turtlebot-maze: Nav2 navigation trials stored in event_log.ros_navigation_trials | SYS-003 | v1 |
| DP-040 | turtlebot-maze: YOLOv8 detections stored in event_log.object_detections | SYS-003 | v1 |
| DP-041 | turtlebot-maze: world-model snapshots at decision points in world_model.snapshots | SYS-003, DP-010 | v2 |
| DP-042 | The data plane shall be separate from the management plane | SYS-001 | v1 |



Example Use Cases of the Auraison Lakehouse

The following examples use the SQuAD validation dataset (stored in s3://landing/squad/) to illustrate the concrete benefits of the DuckLake lakehouse over a plain object store or traditional database.

1. Schema Evolution — add columns without rewriting parquet files

After a model inference run, new columns can be appended to an existing table with no data rewrite:

ALTER TABLE squad ADD COLUMN model_answer VARCHAR;
ALTER TABLE squad ADD COLUMN confidence FLOAT;

-- Backfill from an inference run
UPDATE squad SET model_answer = 'Denver Broncos', confidence = 0.97
WHERE id = '56be4db0acb8001400a502ec';

DuckLake records the schema change in the PostgreSQL catalog; the underlying Parquet files in MinIO are untouched. Without a lakehouse this requires a full dataset rewrite.

2. Time Travel — pin to the exact snapshot used for training

Every write creates a new catalog snapshot. Queries can be issued against any prior version:

-- Inspect data before model answers were added
SELECT * FROM squad AT (VERSION => 1) LIMIT 5;

-- Diff two snapshots: which answers changed?
SELECT curr.id, curr.model_answer, prev.model_answer AS old_answer
FROM squad AT (VERSION => 3) curr
JOIN squad AT (VERSION => 1) prev USING (id)
WHERE curr.model_answer IS DISTINCT FROM prev.model_answer;

This makes ML experiments reproducible: a training run can be tied to a specific catalog version and replayed exactly.

3. Cross-plane Join — catalog metadata + raw data in a single query

simulation_runs (DuckLake catalog) and squad (raw parquet in MinIO) can be joined without ETL:

SELECT
  r.config->>'model' AS model,
  COUNT(*) FILTER (WHERE s.model_answer = s.answers.text[1]) AS exact_match,
  COUNT(*) AS total,
  ROUND(100.0 *
        COUNT(*) FILTER (WHERE s.model_answer = s.answers.text[1])
        / COUNT(*), 2) AS em_pct
FROM squad s
JOIN simulation_runs r ON r.s3_prefix = s.title
GROUP BY 1
ORDER BY em_pct DESC;

Without the lakehouse this requires a separate MLflow or W&B lookup followed by a manual join.

4. Predicate Pushdown — skip irrelevant row groups

DuckDB pushes WHERE predicates into the Parquet reader, scanning only matching row groups:

-- Only reads the row groups where title = 'Super_Bowl_50'
SELECT id, question, answers.text[1] AS answer
FROM read_parquet('s3://landing/squad/**/*.parquet')
WHERE title = 'Super_Bowl_50';

-- Inspect the query plan
EXPLAIN SELECT * FROM squad WHERE title = 'Beyoncé';

On a large dataset (e.g. full COCO-Caption at ~100 GB) this reduces scan time from minutes to seconds.

5. Incremental Ingest — idempotent append with no duplicates

New parquet drops in landing/ can be merged into the warehouse without risk of duplicates:

INSERT INTO squad
SELECT s.*
FROM read_parquet('s3://landing/squad/**/*.parquet') s
LEFT JOIN squad w ON s.id = w.id
WHERE w.id IS NULL;

DuckLake's MVCC guarantees that concurrent readers see a consistent snapshot even while the insert is in flight.

6. Analytical Queries — OLAP directly on lakehouse data

No separate analytics database is needed; DuckDB runs columnar OLAP over the same Parquet files used for training:

-- Answer length distribution by article
SELECT
  title,
  COUNT(*) AS num_questions,
  AVG(LENGTH(answers.text[1])) AS avg_answer_len,
  MAX(LENGTH(context)) AS max_context_len
FROM squad
GROUP BY title
ORDER BY num_questions DESC;

-- Questions with multiple valid answers (annotator disagreement)
SELECT id, question, len(answers.text) AS num_answers
FROM squad
WHERE len(answers.text) > 1
ORDER BY num_answers DESC
LIMIT 10;

Summary

| Pattern | Lakehouse benefit | Without lakehouse |
|---|---|---|
| Schema evolution | Zero-copy column add | Full dataset rewrite |
| Time travel | Snapshot pinning for reproducibility | Manual versioned file copies |
| Cross-plane join | Catalog + data in one query | Separate MLflow/W&B lookup |
| Predicate pushdown | Row-group pruning, sub-second scans | Full table scan |
| Incremental ingest | Idempotent MVCC appends | App-level dedup logic |
| Analytical queries | In-place OLAP over training data | Export to separate analytics DB |

External Infrastructure (TrueNAS Lab)

The data plane can run against existing lab infrastructure instead of the bundled Docker services. The lab exposes two services, S3 (reachable via two paths) and PostgreSQL:

| Service | Address | Notes |
|---|---|---|
| S3 (TrueNAS) | https://s3.aegeanai.com/ | Cloudflare Tunnel — always reachable, valid TLS |
| S3 (LAN direct) | http://<TRUENAS_LAN_IP>:9000 | Faster; no Cloudflare hop; requires LAN or Tailscale |
| PostgreSQL | 192.168.1.26:5432 | LAN only — Tailscale required when off-LAN |

Networking matrix

| Context | S3 endpoint | PG host | Tailscale needed? |
|---|---|---|---|
| Docker (default) | http://minio:9000 | postgresql:55432 | No |
| Lab / on-LAN | LAN direct http://<IP>:9000 | 192.168.1.26:5432 | No |
| Remote / off-LAN | https://s3.aegeanai.com/ | 192.168.1.26:5432 via Tailscale | Yes (PG only) |

Cloudflare Tunnel caveat

The public https://s3.aegeanai.com/ endpoint does not support chunked/multipart uploads, and boto3 switches to multipart for objects larger than ~8 MB by default. Set a high threshold or disable multipart for bulk Parquet ingestion through the tunnel:

from boto3.s3.transfer import TransferConfig
config = TransferConfig(multipart_threshold=10 * 1024 ** 3) # effectively disables multipart
s3.upload_fileobj(f, bucket, key, Config=config)

Recommendation: use the LAN direct path (or Tailscale + LAN) for bulk ingestion; reserve the Cloudflare tunnel for small files and catalog operations from remote machines.

Setup — TrueNAS S3

  1. In TrueNAS → Credentials → S3 API Keys: create a key with read/write on warehouse, iceberg, landing.
  2. Pre-create the three buckets (or run mc-job pointed at the TrueNAS endpoint).
  3. Set in .env:
MINIO_ENDPOINT=https://s3.aegeanai.com   # or http://<LAN-IP>:9000 for LAN direct
MINIO_ACCESS_KEY=<truenas-key-id>
MINIO_SECRET_KEY=<truenas-secret>
MINIO_USE_SSL=true # false for LAN direct
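A quick connectivity check against the configured endpoint, assuming the .env values above are exported (s3fs is the access path named in the storage-layer table):

import os
import s3fs

fs = s3fs.S3FileSystem(
    key=os.environ["MINIO_ACCESS_KEY"],
    secret=os.environ["MINIO_SECRET_KEY"],
    client_kwargs={"endpoint_url": os.environ["MINIO_ENDPOINT"]},
)
# Should list the landing/ bucket contents if the key has read access.
print(fs.ls("landing"))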

Setup — TrueNAS PostgreSQL

  1. On the TrueNAS PostgreSQL instance:
     CREATE ROLE ducklake LOGIN PASSWORD '<choose-a-password>';
     CREATE DATABASE ducklake OWNER ducklake;
  2. Ensure the TrueNAS firewall allows TCP 5432 from your dev host / container subnet.
  3. Set in .env:
     DUCKLAKE_POSTGRES_DSN=postgresql://ducklake:<password>@192.168.1.26:5432/ducklake
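A matching reachability check for the catalog DSN before attaching from DuckDB (a sketch; psycopg is an assumption, any libpq client works):

import os
import psycopg

# Confirms the DuckLake catalog database accepts connections on 5432.
with psycopg.connect(os.environ["DUCKLAKE_POSTGRES_DSN"]) as conn:
    print(conn.execute("SELECT version()").fetchone())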

Starting containers against external infra

# Start only the dev container (skip minio / postgresql / mc-job):
cd data-plane
docker compose up -d lakehouse.dev.cpu

# The container inherits MINIO_ENDPOINT and DUCKLAKE_POSTGRES_DSN from .env
# and falls back to Docker service names if the vars are not set.

# Start with local Docker infra (original behaviour):
docker compose --profile local-infra up -d