
Auraison — Data Plane Design

Date: 2026-02-23 Status: Approved (v1 — migrated from aegean-ai/lakehouse to data-plane/)


Problem

The original three-plane model (user / control / management) governs control flow cleanly but does not govern data flow. As agentic workloads grow, a structural gap emerges: the lakehouse (currently in the separate aegean-ai/lakehouse repo) is used simultaneously as:

  • Storage for user-plane outputs (perception data, job results, telemetry)
  • Memory substrate for control-plane agents (job history, cluster failure patterns)
  • Training corpus for model fine-tuning (VLA, classification models)
  • Observability archive (agent traces, experiment results)

No single plane owns this. The control plane's LakehouseAgent reaches into it; the user plane writes to it; the management plane governs access to it. The lakehouse is not a single-plane component — it spans planes, and needs its own architectural treatment.

The deeper issue: in agentic systems, the data flow direction is reversed relative to traditional software.

Traditional:  logic → data
Agentic:      data → reasoning → action

Data becomes the substrate of cognition. The lakehouse is not analytics infrastructure — it is the persistent world model of the system. It needs a dedicated plane.


Definition

The data plane governs the movement, storage, transformation, and accessibility of data across the entire system. It is orthogonal to both reasoning (the control plane) and execution (the user plane).

The data plane sits horizontally — all other planes interact with it.


Goals

  • Provide a unified persistent storage substrate for all planes
  • Formalise the LakehouseAgent as the control-plane API boundary to the data plane
  • Define ingestion pipelines from the user plane (structured, versioned, lineaged)
  • Enable semantic retrieval for control-plane agents (RAG over job history and agent traces)
  • Support world-model snapshots for AgentOps checkpointing and causal replay
  • Serve as training data substrate for VLA and ML model fine-tuning (v3)
  • Consolidate aegean-ai/lakehouse into this monorepo under data-plane/

Non-goals

  • Real-time message passing — that is Zenoh / NATS / DDS (transport, not storage)
  • Governance policy definition — that is the management plane
  • Agent reasoning or query planning — that is the control plane
  • Job execution — that is the user plane

Migration: aegean-ai/lakehouse → data-plane/

The aegean-ai/lakehouse repo contained the Python package scaffold, DuckLake schema reference, infrastructure config (MinIO + PostgreSQL), tests, and design docs. It has been consolidated into this monorepo.

Migration completed

| What | Source | Destination |
|---|---|---|
| Design docs | aegean-ai/lakehouse/docs/plans/ | docs/plans/ (this monorepo) |
| Python package scaffold | aegean-ai/lakehouse/ | data-plane/ |
| Infrastructure | aegean-ai/lakehouse/docker-compose.yml | data-plane/docker-compose.yml |
| Tests | aegean-ai/lakehouse/tests/ | data-plane/tests/ |
| CLAUDE.md | aegean-ai/lakehouse/CLAUDE.md | data-plane/CLAUDE.md |

LakehouseAgent update

LakehouseAgent was updated from dbt CLI (incorrect assumption) to duckdb and python -m lakehouse (correct tooling for DuckDB + DuckLake):

# Before:
ALLOWED_TOOLS = "Bash(dbt *),Read,Edit"

# After:
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Bash(docker *),Read,Edit"
DATA_PLANE_DIR = REPO_ROOT / "data-plane"

Remaining step

Archive aegean-ai/lakehouse on GitHub (mark read-only) with a README redirect to this monorepo. This is a manual GitHub operation.


Architecture

Storage layers

| Layer | Technology | Contents | Access pattern |
|---|---|---|---|
| Lakehouse | DuckDB + DuckLake (PostgreSQL catalog) + MinIO S3 | Structured outputs, experiment results, job history | SQL (DuckDB in-process), Parquet partitions in MinIO |
| Object store | MinIO (local) / S3 / Cloudflare R2 | Raw files: fMP4 chunks, GeoParquet, point clouds, model checkpoints | Blob read/write via s3fs |
| Embeddings store | pgvector / Chroma (v2) | Dense vectors for semantic retrieval (agent traces, docs) | ANN search |
| Feature store | Feast / custom Parquet (v3) | Structured ML features for VLA and classification models | Batch + online |
| Event log | Append-only Parquet partitions in MinIO landing/ | Agent traces, Zenoh event recordings, ROS bag metadata | Append write, batch read |
| World model snapshots | DuckLake snapshots (v2) | Point-in-time environment state for AgentOps | Write on checkpoint, read on replay |

DuckDB + DuckLake as the query and catalog layer

DuckDB is the in-process analytical query engine. DuckLake is the transactional catalog: ATTACH 'ducklake:postgresql://...' exposes tables whose metadata lives in PostgreSQL and whose data lives in MinIO as Parquet fragments. The full DuckLake schema (174 catalog tables) is in data-plane/tests/ducklake-schema.sql.

Key catalog tables:

  • experiments — experiment registry (id, project, description, created_at)
  • simulation_runs — per-simulator run records with S3 prefix, status, config
  • ducklake_table, ducklake_data_file — DuckLake's own catalog metadata

The LakehouseAgent (Bash(duckdb *), Bash(python *), Read, Edit) is the control plane's operator interface: it runs DuckDB queries, inspects the catalog, and calls python -m lakehouse commands. It is not a transformation pipeline — it is a catalog operator and query runner.
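The tool allow-list above can be enforced with plain glob matching. A minimal sketch, assuming nothing about the actual agent runtime (`is_allowed` and its pattern parsing are illustrative helpers, not the real implementation):

```python
from fnmatch import fnmatch

# Tool patterns granted to LakehouseAgent (from the config above).
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Bash(docker *),Read,Edit"

def is_allowed(tool: str, allowed: str = ALLOWED_TOOLS) -> bool:
    """Return True if a requested tool call matches an allowed pattern.

    `tool` is either a bare tool name ("Read") or "Bash(<command line>)".
    Patterns like "Bash(duckdb *)" glob-match the inner command line.
    """
    for pattern in allowed.split(","):
        if pattern.startswith("Bash(") and tool.startswith("Bash("):
            # Strip the "Bash(" prefix and ")" suffix, then glob-compare.
            if fnmatch(tool[5:-1], pattern[5:-1]):
                return True
        elif tool == pattern:
            return True
    return False

print(is_allowed("Bash(duckdb -c 'SELECT 1')"))  # True
print(is_allowed("Bash(rm -rf /)"))              # False
print(is_allowed("Read"))                        # True
```

The point of the sketch: the agent can run arbitrary `duckdb` and `python -m lakehouse` commands, but nothing outside those prefixes.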


Data flow

User plane → data plane (ingestion)

User-plane workers (Ray jobs on torch.dev.gpu and ros.dev.gpu) write outputs to the data plane on job completion:

In v1, ingestion is manual (Ray worker writes output files; LakehouseAgent registers them via DuckDB). In v1.5, the ingestion API is a lightweight FastAPI endpoint in data-plane/lakehouse/ called directly by Ray workers on job completion.
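The v1 flow can be sketched in a few lines of Python. `partition_path` and `ingest_envelope` are hypothetical helpers, and the `landing/<table>/dt=.../job=...` layout is an assumed convention, not the final one:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def partition_path(table: str, job_id: str, ts: datetime) -> str:
    """Build the landing-zone key for a job's output partition.

    Layout (illustrative): landing/<table>/dt=YYYY-MM-DD/job=<job_id>.parquet
    """
    return str(PurePosixPath("landing") / table / f"dt={ts:%Y-%m-%d}" / f"job={job_id}.parquet")

def ingest_envelope(table: str, records: list, schema_version: str, job_id: str) -> dict:
    """Wrap records in the metadata the v1.5 ingestion endpoint would expect."""
    return {
        "table": table,
        "records": records,
        "schema_version": schema_version,
        "job_id": job_id,
        "row_count": len(records),
    }

ts = datetime(2026, 2, 23, tzinfo=timezone.utc)
print(partition_path("job_outcomes", "1234", ts))
# landing/job_outcomes/dt=2026-02-23/job=1234.parquet
```

In v1 the LakehouseAgent would register such a partition in DuckLake by hand; in v1.5 the Ray worker posts the envelope and registration happens server-side.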

Control plane → data plane (reads)

Control-plane agents read from the data plane in two modes:

  1. Structured query (current): LakehouseAgent runs DuckDB queries over DuckLake and reads Parquet files via DuckDB in the agent subprocess
  2. Semantic retrieval (v2): control-plane agents call an embeddings query endpoint to retrieve relevant agent traces, job history, or world-model state as context
# v1: LakehouseAgent reads directly
"Run SELECT * FROM job_outcomes WHERE status = 'failed' ORDER BY completed_at DESC LIMIT 10"

# v2: semantic retrieval endpoint
GET /data/retrieve?query="notebook jobs that failed on torch.dev.gpu last week"&top_k=5
[{job_id, summary, outcome, wandb_run_id}]
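Under the hood, the v2 retrieval mode reduces to nearest-neighbour search over stored embeddings. A stdlib-only sketch of the ranking step — `retrieve` and the toy two-dimensional vectors are illustrative; the real store would be pgvector or Chroma with ANN indexes rather than a linear scan:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list, store: list, top_k: int = 5) -> list:
    """Rank stored (id, vector, metadata) tuples by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [
        {"id": doc_id, "score": round(cosine(query_vec, vec), 4), "metadata": meta}
        for doc_id, vec, meta in ranked[:top_k]
    ]

store = [
    ("trace-1", [1.0, 0.0], {"summary": "notebook job failed on torch.dev.gpu"}),
    ("trace-2", [0.0, 1.0], {"summary": "nav trial reached goal"}),
]
top = retrieve([0.9, 0.1], store, top_k=1)
print(top)  # trace-1 ranks first: its embedding is closest to the query
```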

AgentOps → data plane (snapshots)

The control plane's AgentOps subsystem writes world-model snapshots to the data plane on checkpoint events and at the end of each agent invocation. These snapshots are the foundation for causal replay and VLA training data.

WorldModelSnapshot {
  snapshot_id: UUID
  intent_id: UUID
  timestamp: ISO 8601
  agents_active: [{role, status, tool_call_count}]
  user_plane: {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  causal_chain: [{intent_id, job_id, outcome}]
}
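The record above maps naturally onto a serialisable dataclass. A sketch assuming JSON as the wire format — field names follow the spec, but the class itself is illustrative, not the AgentOps implementation:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class WorldModelSnapshot:
    """Point-in-time system state written on checkpoint events."""
    intent_id: str
    agents_active: list
    user_plane: dict
    causal_chain: list
    snapshot_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self))

snap = WorldModelSnapshot(
    intent_id="i-42",
    agents_active=[{"role": "LakehouseAgent", "status": "idle", "tool_call_count": 3}],
    user_plane={"torch": {"jobs_in_flight": 2, "gpu_util": 0.81}},
    causal_chain=[{"intent_id": "i-42", "job_id": "j-7", "outcome": "succeeded"}],
)
restored = json.loads(snap.to_json())  # round-trips losslessly for replay
print(restored["intent_id"], restored["agents_active"][0]["tool_call_count"])
```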

The lakehouse as persistent world model

In agentic robotics systems, the lakehouse accumulates not just analytics data but the episodic and semantic memory of the entire system:

| Memory type | Contents | Enables |
|---|---|---|
| Episodic | Job history, cluster failure events, navigation trial outcomes | Agent reasoning over "what happened before" |
| Semantic | Learned cluster failure patterns, experiment regression signatures | Agent pattern matching without re-executing |
| Procedural | Successful job submission sequences, DuckLake catalog dependency chains | Agent skill recall |
| Perceptual | GeoParquet outputs, YOLOv8 detections, SLAM maps | VLA fine-tuning, world-model grounding |
For the turtlebot-maze reference application:

  • Every Nav2 navigation trial → appended to event_log.ros_navigation_trials
  • Every YOLOv8 detection → appended to event_log.object_detections
  • Every world-model snapshot at decision point → stored in world_model.snapshots
  • Aggregate: accumulated robot experience becomes a fine-tuning corpus for VLA models
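The append-only pattern in the list above can be sketched as a small buffering writer. `EventLog` and its in-memory `partitions` dict are stand-ins for batched Parquet writes to MinIO landing/, not the real pipeline:

```python
from datetime import datetime, timezone

class EventLog:
    """Append-only buffer that flushes into dated partitions (illustrative)."""

    def __init__(self, table: str):
        self.table = table
        self.buffer = []      # records appended since the last flush
        self.partitions = {}  # partition key -> accumulated records

    def append(self, record: dict) -> None:
        self.buffer.append(record)

    def flush(self, ts: datetime) -> str:
        """Move buffered records into the partition for ts's date."""
        key = f"event_log/{self.table}/dt={ts:%Y-%m-%d}"
        self.partitions.setdefault(key, []).extend(self.buffer)
        n, self.buffer = len(self.buffer), []
        return f"{key} (+{n} rows)"

log = EventLog("ros_navigation_trials")
log.append({"trial_id": 1, "goal": [2.0, 3.5], "outcome": "reached"})
log.append({"trial_id": 2, "goal": [0.5, 1.0], "outcome": "timeout"})
result = log.flush(datetime(2026, 2, 23, tzinfo=timezone.utc))
print(result)  # event_log/ros_navigation_trials/dt=2026-02-23 (+2 rows)
```

Batching matters here: trials arrive one at a time from ROS, but Parquet partitions want many rows per file.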

This is the convergence point between the data plane and world-model-based VLA research: the lakehouse is the persistent world model.


Interfaces

Ingestion API (data plane ← user plane)

POST /data/ingest
Body: {table: str, records: list[dict], schema_version: str, job_id: UUID, tenant_id: UUID}
→ 201 {partition_id, row_count}

POST /data/snapshots
Body: WorldModelSnapshot
→ 201 {snapshot_id}

Query API (data plane → control plane)

GET /data/query?sql=<duckdb_sql>&tenant_id=<uuid>
→ {columns: [...], rows: [...]}

GET /data/retrieve?query=<natural_language>&top_k=<n>&tenant_id=<uuid> (v2)
→ [{id, text, score, metadata}]

Policy interface (management plane → data plane)

PUT /data/policy
Body: {tenant_id, retention_days, allowed_tables: [...], max_storage_gb: float}
→ 200
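Enforcement on the data-plane side can be sketched against the policy body above. `check_query` and `expired` are hypothetical helpers, and the naive table allow-list check stands in for real query-time authorisation:

```python
from datetime import datetime, timedelta, timezone

# Example tenant policy, shaped like the PUT /data/policy body above.
policy = {
    "tenant_id": "t-1",
    "retention_days": 90,
    "allowed_tables": ["job_outcomes", "experiments"],
    "max_storage_gb": 500.0,
}

def check_query(table: str, policy: dict) -> bool:
    """Data plane enforcing a management-plane rule: table must be allow-listed."""
    return table in policy["allowed_tables"]

def expired(partition_date: datetime, policy: dict, now: datetime) -> bool:
    """True if a partition is past the tenant's retention window."""
    return now - partition_date > timedelta(days=policy["retention_days"])

now = datetime(2026, 2, 23, tzinfo=timezone.utc)
print(check_query("job_outcomes", policy))                               # True
print(check_query("ducklake_data_file", policy))                         # False
print(expired(datetime(2025, 10, 1, tzinfo=timezone.utc), policy, now))  # True
```

Note the division of labour: the policy values come from the management plane; the data plane only evaluates them at read/write time.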

Critical distinction: data plane vs management plane

A common mistake is placing the lakehouse under the management plane. They are separate:

| Data Plane | Management Plane |
|---|---|
| Stores and transforms data | Governs how data is stored |
| Serves queries | Defines query access policies |
| Manages schemas and lineage | Manages access rights and retention |
| Enables agent reasoning | Enforces compliance |

The management plane governs the data plane. It does not own it.


Evolution path

v1   — Data plane implicit: lakehouse in aegean-ai/lakehouse; LakehouseAgent reaches across repos
v1.5 — Migrate to data-plane/ in monorepo; formalise ingestion API; structured AgentEvent log
v2   — Embeddings store + RAG retrieval endpoint for control-plane agents
       World-model snapshots (AgentOps → data plane)
       Schema lineage visible in management-plane dashboard
v3   — VLA training pipeline: accumulated perception data → fine-tuning loop
       Feature store for online inference (real-time VLA feature serving)
       Data plane exposes MCP server: agents query job history and world state via tool calls

See also:

  • docs/plans/2026-02-23-aiops-control-plane-design.md §"AgentOps Subsystem" — world-model snapshots, checkpointing
  • docs/plans/2026-02-23-auraison-control-plane-design.md — LakehouseAgent, agent memory
  • docs/plans/2026-02-23-auraison-management-plane-design.md — retention policy, RBAC
  • docs/plans/2026-02-23-auraison-user-plane-design.md — ingestion producers (Ray workers)