
Auraison — Data Plane Design

Date: 2026-02-23 Status: Approved (v1 — migrated from aegean-ai/lakehouse to data-plane/)


Problem

The original three-plane model (user / control / management) governs control flow cleanly but does not govern data flow. As agentic workloads grow, a structural gap emerges: the lakehouse (currently in the separate aegean-ai/lakehouse repo) is used simultaneously as:

  • Storage for user-plane outputs (perception data, job results, telemetry)
  • Memory substrate for control-plane agents (job history, cluster failure patterns)
  • Training corpus for model fine-tuning (VLA, classification models)
  • Observability archive (agent traces, experiment results)

No single plane owns this. The control plane's LakehouseAgent reaches into it; the user plane writes to it; the management plane governs access to it. The lakehouse is not a single-plane component — it spans planes, and needs its own architectural treatment.

The deeper issue: in agentic systems, the data flow direction is reversed relative to traditional software.

Traditional:  logic → data
Agentic:      data → reasoning → action

Data becomes the substrate of cognition. The lakehouse is not analytics infrastructure — it is the persistent world model of the system. It needs a dedicated plane.


Definition

The data plane governs the movement, storage, transformation, and accessibility of data across the entire system. It is orthogonal to both reasoning (the control plane) and execution (the user plane).

The data plane sits horizontally — all other planes interact with it.


Goals

  • Provide a unified persistent storage substrate for all planes
  • Formalise the LakehouseAgent as the control-plane API boundary to the data plane
  • Define ingestion pipelines from the user plane (structured, versioned, lineaged)
  • Enable semantic retrieval for control-plane agents (RAG over job history and agent traces)
  • Support world-model snapshots for AgentOps checkpointing and causal replay
  • Serve as training data substrate for VLA and ML model fine-tuning (v3)
  • Consolidate aegean-ai/lakehouse into this monorepo under data-plane/

Non-goals

  • Real-time message passing — that is Zenoh / NATS / DDS (transport, not storage)
  • Governance policy definition — that is the management plane
  • Agent reasoning or query planning — that is the control plane
  • Job execution — that is the user plane

Migration: aegean-ai/lakehouse → data-plane/

The aegean-ai/lakehouse repo contained the Python package scaffold, DuckLake schema reference, infrastructure config (MinIO + PostgreSQL), tests, and design docs. It has been consolidated into this monorepo.

Migration completed

| What | Source | Destination |
|---|---|---|
| Design docs | aegean-ai/lakehouse/docs/plans/ | docs/plans/ (this monorepo) |
| Python package scaffold | aegean-ai/lakehouse/ | data-plane/ |
| Infrastructure | aegean-ai/lakehouse/docker-compose.yml | data-plane/docker-compose.yml |
| Tests | aegean-ai/lakehouse/tests/ | data-plane/tests/ |
| CLAUDE.md | aegean-ai/lakehouse/CLAUDE.md | data-plane/CLAUDE.md |

LakehouseAgent update

LakehouseAgent was updated from dbt CLI (incorrect assumption) to duckdb and python -m lakehouse (correct tooling for DuckDB + DuckLake):

# Before:
ALLOWED_TOOLS = "Bash(dbt *),Read,Edit"

# After:
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Bash(docker *),Read,Edit"
DATA_PLANE_DIR = REPO_ROOT / "data-plane"

Remaining step

Archive aegean-ai/lakehouse on GitHub (mark read-only) with a README redirect to this monorepo. This is a manual GitHub operation.


Architecture

Storage layers

| Layer | Technology | Contents | Access pattern |
|---|---|---|---|
| Lakehouse | DuckDB + DuckLake (PostgreSQL catalog) + MinIO S3 | Structured outputs, experiment results, job history | SQL (DuckDB in-process), Parquet partitions in MinIO |
| Object store | MinIO (local) / S3 / Cloudflare R2 | Raw files: fMP4 chunks, GeoParquet, point clouds, model checkpoints | Blob read/write via s3fs |
| Embeddings store | pgvector / Chroma (v2) | Dense vectors for semantic retrieval (agent traces, docs) | ANN search |
| Feature store | Feast / custom Parquet (v3) | Structured ML features for VLA and classification models | Batch + online |
| Event log | Append-only Parquet partitions in MinIO landing/ | Agent traces, Zenoh event recordings, ROS bag metadata | Append write, batch read |
| World model snapshots | DuckLake snapshots (v2) | Point-in-time environment state for AgentOps | Write on checkpoint, read on replay |

DuckDB + DuckLake as the query and catalog layer

DuckDB is the in-process analytical query engine. DuckLake is the transactional catalog: ATTACH 'ducklake:postgresql://...' exposes tables whose metadata lives in PostgreSQL and whose data lives in MinIO as Parquet fragments. The full DuckLake schema (174 catalog tables) is in data-plane/tests/ducklake-schema.sql.

Key catalog tables:

  • experiments — experiment registry (id, project, description, created_at)
  • simulation_runs — per-simulator run records with S3 prefix, status, config
  • ducklake_table, ducklake_data_file — DuckLake's own catalog metadata

The LakehouseAgent (Bash(duckdb *), Bash(python *), Read, Edit) is the control plane's operator interface: it runs DuckDB queries, inspects the catalog, and calls python -m lakehouse commands. It is not a transformation pipeline — it is a catalog operator and query runner.
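The tool allow-list above can be enforced with plain glob matching. A minimal sketch, assuming nothing about the actual agent runtime (`is_allowed` and its pattern parsing are illustrative helpers, not the real implementation):

```python
from fnmatch import fnmatch

# Tool patterns granted to LakehouseAgent (from the config above).
ALLOWED_TOOLS = "Bash(duckdb *),Bash(python *),Bash(docker *),Read,Edit"

def is_allowed(tool: str, allowed: str = ALLOWED_TOOLS) -> bool:
    """Return True if a requested tool call matches an allowed pattern.

    `tool` is either a bare tool name ("Read") or "Bash(<command line>)".
    Patterns like "Bash(duckdb *)" glob-match the inner command line.
    """
    for pattern in allowed.split(","):
        if pattern.startswith("Bash(") and tool.startswith("Bash("):
            # Strip the "Bash(" prefix and ")" suffix, then glob-compare.
            if fnmatch(tool[5:-1], pattern[5:-1]):
                return True
        elif tool == pattern:
            return True
    return False

print(is_allowed("Bash(duckdb -c 'SELECT 1')"))  # True
print(is_allowed("Bash(rm -rf /)"))              # False
print(is_allowed("Read"))                        # True
```

The point of the sketch: the agent can run arbitrary `duckdb` and `python -m lakehouse` commands, but nothing outside those prefixes.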


Data flow

User plane → data plane (ingestion)

User-plane workers (Ray jobs on torch.dev.gpu and ros.dev.gpu) write outputs to the data plane on job completion:

In v1, ingestion is manual (Ray worker writes output files; LakehouseAgent registers them via DuckDB). In v1.5, the ingestion API is a lightweight FastAPI endpoint in data-plane/lakehouse/ called directly by Ray workers on job completion.
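The v1 flow can be sketched in a few lines of Python. `partition_path` and `ingest_envelope` are hypothetical helpers, and the `landing/<table>/dt=.../job=...` layout is an assumed convention, not the final one:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def partition_path(table: str, job_id: str, ts: datetime) -> str:
    """Build the landing-zone key for a job's output partition.

    Layout (illustrative): landing/<table>/dt=YYYY-MM-DD/job=<job_id>.parquet
    """
    return str(PurePosixPath("landing") / table / f"dt={ts:%Y-%m-%d}" / f"job={job_id}.parquet")

def ingest_envelope(table: str, records: list, schema_version: str, job_id: str) -> dict:
    """Wrap records in the metadata the v1.5 ingestion endpoint would expect."""
    return {
        "table": table,
        "records": records,
        "schema_version": schema_version,
        "job_id": job_id,
        "row_count": len(records),
    }

ts = datetime(2026, 2, 23, tzinfo=timezone.utc)
print(partition_path("job_outcomes", "1234", ts))
# landing/job_outcomes/dt=2026-02-23/job=1234.parquet
```

In v1 the LakehouseAgent would register such a partition in DuckLake by hand; in v1.5 the Ray worker posts the envelope and registration happens server-side.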

Control plane → data plane (reads)

Control-plane agents read from the data plane in two modes:

  1. Structured query (current): LakehouseAgent runs DuckDB queries over DuckLake and reads Parquet files via DuckDB in the agent subprocess
  2. Semantic retrieval (v2): control-plane agents call an embeddings query endpoint to retrieve relevant agent traces, job history, or world-model state as context
# v1: LakehouseAgent reads directly
"Run SELECT * FROM job_outcomes WHERE status = 'failed' ORDER BY completed_at DESC LIMIT 10"

# v2: semantic retrieval endpoint
GET /data/retrieve?query="notebook jobs that failed on torch.dev.gpu last week"&top_k=5
[{job_id, summary, outcome, wandb_run_id}]
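Under the hood, the v2 retrieval mode reduces to nearest-neighbour search over stored embeddings. A stdlib-only sketch of the ranking step — `retrieve` and the toy two-dimensional vectors are illustrative; the real store would be pgvector or Chroma with ANN indexes rather than a linear scan:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list, store: list, top_k: int = 5) -> list:
    """Rank stored (id, vector, metadata) tuples by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [
        {"id": doc_id, "score": round(cosine(query_vec, vec), 4), "metadata": meta}
        for doc_id, vec, meta in ranked[:top_k]
    ]

store = [
    ("trace-1", [1.0, 0.0], {"summary": "notebook job failed on torch.dev.gpu"}),
    ("trace-2", [0.0, 1.0], {"summary": "nav trial reached goal"}),
]
top = retrieve([0.9, 0.1], store, top_k=1)
print(top)  # trace-1 ranks first: its embedding is closest to the query
```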

AgentOps → data plane (snapshots)

The control plane's AgentOps subsystem writes world-model snapshots to the data plane on checkpoint events and at the end of each agent invocation. These snapshots are the foundation for causal replay and VLA training data.

WorldModelSnapshot {
  snapshot_id: UUID
  intent_id: UUID
  timestamp: ISO 8601
  agents_active: [{role, status, tool_call_count}]
  user_plane: {torch: {jobs_in_flight, gpu_util}, ros: {jobs_in_flight, gpu_util}}
  causal_chain: [{intent_id, job_id, outcome}]
}
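The record above maps naturally onto a serialisable dataclass. A sketch assuming JSON as the wire format — field names follow the spec, but the class itself is illustrative, not the AgentOps implementation:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class WorldModelSnapshot:
    """Point-in-time system state written on checkpoint events."""
    intent_id: str
    agents_active: list
    user_plane: dict
    causal_chain: list
    snapshot_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self))

snap = WorldModelSnapshot(
    intent_id="i-42",
    agents_active=[{"role": "LakehouseAgent", "status": "idle", "tool_call_count": 3}],
    user_plane={"torch": {"jobs_in_flight": 2, "gpu_util": 0.81}},
    causal_chain=[{"intent_id": "i-42", "job_id": "j-7", "outcome": "succeeded"}],
)
restored = json.loads(snap.to_json())  # round-trips losslessly for replay
print(restored["intent_id"], restored["agents_active"][0]["tool_call_count"])
```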

The lakehouse as persistent world model

In agentic robotics systems, the lakehouse accumulates not just analytics data but the episodic and semantic memory of the entire system:

| Memory type | Contents | Enables |
|---|---|---|
| Episodic | Job history, cluster failure events, navigation trial outcomes | Agent reasoning over "what happened before" |
| Semantic | Learned cluster failure patterns, experiment regression signatures | Agent pattern matching without re-executing |
| Procedural | Successful job submission sequences, DuckLake catalog dependency chains | Agent skill recall |
| Perceptual | GeoParquet outputs, YOLOv8 detections, SLAM maps | VLA fine-tuning, world-model grounding |
For the turtlebot-maze reference application:

  • Every Nav2 navigation trial → appended to event_log.ros_navigation_trials
  • Every YOLOv8 detection → appended to event_log.object_detections
  • Every world-model snapshot at decision point → stored in world_model.snapshots
  • Aggregate: accumulated robot experience becomes a fine-tuning corpus for VLA models
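The append-only pattern in the list above can be sketched as a small buffering writer. `EventLog` and its in-memory `partitions` dict are stand-ins for batched Parquet writes to MinIO landing/, not the real pipeline:

```python
from datetime import datetime, timezone

class EventLog:
    """Append-only buffer that flushes into dated partitions (illustrative)."""

    def __init__(self, table: str):
        self.table = table
        self.buffer = []      # records appended since the last flush
        self.partitions = {}  # partition key -> accumulated records

    def append(self, record: dict) -> None:
        self.buffer.append(record)

    def flush(self, ts: datetime) -> str:
        """Move buffered records into the partition for ts's date."""
        key = f"event_log/{self.table}/dt={ts:%Y-%m-%d}"
        self.partitions.setdefault(key, []).extend(self.buffer)
        n, self.buffer = len(self.buffer), []
        return f"{key} (+{n} rows)"

log = EventLog("ros_navigation_trials")
log.append({"trial_id": 1, "goal": [2.0, 3.5], "outcome": "reached"})
log.append({"trial_id": 2, "goal": [0.5, 1.0], "outcome": "timeout"})
result = log.flush(datetime(2026, 2, 23, tzinfo=timezone.utc))
print(result)  # event_log/ros_navigation_trials/dt=2026-02-23 (+2 rows)
```

Batching matters here: trials arrive one at a time from ROS, but Parquet partitions want many rows per file.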

This is the convergence point between the data plane and world-model-based VLA research: the lakehouse is the persistent world model.


Interfaces

Ingestion API (data plane ← user plane)

POST /data/ingest
Body: {table: str, records: list[dict], schema_version: str, job_id: UUID, tenant_id: UUID}
→ 201 {partition_id, row_count}

POST /data/snapshots
Body: WorldModelSnapshot
→ 201 {snapshot_id}

Query API (data plane → control plane)

GET /data/query?sql=<duckdb_sql>&tenant_id=<uuid>
→ {columns: [...], rows: [...]}

GET /data/retrieve?query=<natural_language>&top_k=<n>&tenant_id=<uuid> (v2)
→ [{id, text, score, metadata}]

Policy interface (management plane → data plane)

PUT /data/policy
Body: {tenant_id, retention_days, allowed_tables: [...], max_storage_gb: float}
→ 200
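Enforcement on the data-plane side can be sketched against the policy body above. `check_query` and `expired` are hypothetical helpers, and the naive table allow-list check stands in for real query-time authorisation:

```python
from datetime import datetime, timedelta, timezone

# Example tenant policy, shaped like the PUT /data/policy body above.
policy = {
    "tenant_id": "t-1",
    "retention_days": 90,
    "allowed_tables": ["job_outcomes", "experiments"],
    "max_storage_gb": 500.0,
}

def check_query(table: str, policy: dict) -> bool:
    """Data plane enforcing a management-plane rule: table must be allow-listed."""
    return table in policy["allowed_tables"]

def expired(partition_date: datetime, policy: dict, now: datetime) -> bool:
    """True if a partition is past the tenant's retention window."""
    return now - partition_date > timedelta(days=policy["retention_days"])

now = datetime(2026, 2, 23, tzinfo=timezone.utc)
print(check_query("job_outcomes", policy))                               # True
print(check_query("ducklake_data_file", policy))                         # False
print(expired(datetime(2025, 10, 1, tzinfo=timezone.utc), policy, now))  # True
```

Note the division of labour: the policy values come from the management plane; the data plane only evaluates them at read/write time.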

Critical distinction: data plane vs management plane

A common mistake is placing the lakehouse under the management plane. They are separate:

| Data Plane | Management Plane |
|---|---|
| Stores and transforms data | Governs how data is stored |
| Serves queries | Defines query access policies |
| Manages schemas and lineage | Manages access rights and retention |
| Enables agent reasoning | Enforces compliance |

The management plane governs the data plane. It does not own it.


Evolution path

v1   — Data plane implicit: lakehouse in aegean-ai/lakehouse; LakehouseAgent reaches across repos
v1.5 — Migrate to data-plane/ in monorepo; formalise ingestion API; structured AgentEvent log
v2   — Embeddings store + RAG retrieval endpoint for control-plane agents
       World-model snapshots (AgentOps → data plane)
       Schema lineage visible in management-plane dashboard
v3   — VLA training pipeline: accumulated perception data → fine-tuning loop
       Feature store for online inference (real-time VLA feature serving)
       Data plane exposes MCP server: agents query job history and world state via tool calls

See also:

  • docs/plans/2026-02-23-aiops-control-plane-design.md §"AgentOps Subsystem" — world-model snapshots, checkpointing
  • docs/plans/2026-02-23-auraison-control-plane-design.md — LakehouseAgent, agent memory
  • docs/plans/2026-02-23-auraison-management-plane-design.md — retention policy, RBAC
  • docs/plans/2026-02-23-auraison-user-plane-design.md — ingestion producers (Ray workers)