Modal Lakehouse — Design Document

Date: 2026-02-22
Authors: Pantelis Monogioudis, Conor Fitzpatrick, John Vitz
Status: Approved
Scope: Full system design + COCO-Caption demo (Experiment #0)


1. Purpose

This document records the design of the Modal Lakehouse: a multimodal dataset store for simulation-based perception and robotics experiments. It captures architectural decisions, technology selections, and the rationale behind trade-offs reached during the brainstorming phase.

The COCO-Caption demo (§8) serves as Experiment #0 — a deliberately minimal exercise that instantiates every layer of the full architecture at small scale (1,000 samples, K=1 simulation).


2. Problem Statement

Simulation-based ML experiments for perception and robotics produce large volumes of heterogeneous data: video frames, point clouds, IMU streams, trajectories, captions, and derived embeddings. This data must be:

  • Stored durably at scale (100 TB+) in a queryable, versioned catalog
  • Streamed efficiently to training loops without materialising the full dataset in memory
  • Compatible with HF datasets so that training code is portable and independent of where data is hosted
  • Queryable by coding agents as callable tools, both in-process and over a web interface
  • Tracked per experiment with W&B for metrics and provenance, without W&B becoming a data dependency
  • Visualisable in Rerun (spatial/temporal) or W&B (scalar metrics) depending on data type
  • Synchronisable with Hugging Face Hub for dataset sharing without requiring HF Hub as a runtime dependency

3. Agent Interface: Python API First, MCP Server Second

3.1 Rationale

Two interface styles were evaluated:

MCP server exposes lakehouse operations over the MCP protocol (stdio or HTTP/SSE). Any MCP-compatible agent — Claude.ai web, Claude Code, Cursor — can call it regardless of language or location. It is the right choice when agents run over a web interface or across machines, because tool definitions are self-describing and connections are language-agnostic.

Python function-calling API registers typed functions as agent tools via a framework decorator (Pydantic-AI @agent.tool). The agent and lakehouse share the same Python process. Tool calls incur zero serialisation overhead — results stay as PyArrow tables or DataFrames in memory, never converted to JSON and back. DuckDB runs in-process, so queries complete without a network hop. This is dramatically cheaper in tokens and latency for local agents (Claude Code, training scripts, Ray workers).

Decision: Build the Python API first. It is the foundation either way: the MCP server is a thin JSON-over-stdio adapter over the same Python API, requiring minimal additional code. The Python API ships in Phase 1; the MCP server wraps it in Phase 2.
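The decorator-registry pattern behind this decision can be sketched without the framework. Pydantic-AI's real mechanism is @agent.tool; the names below (ToolRegistry, query_captions) are illustrative stand-ins, and the "query" is a hardcoded list rather than DuckDB, so the sketch runs anywhere:

```python
# Sketch of the decorator-based tool registry that frameworks like Pydantic-AI
# provide via @agent.tool. ToolRegistry and query_captions are illustrative
# names, not part of any real API.
from typing import Any, Callable
import inspect

class ToolRegistry:
    """Collects typed Python functions so an in-process agent can call them."""
    def __init__(self) -> None:
        self.tools: dict[str, Callable[..., Any]] = {}

    def tool(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        # The function's annotations double as the tool schema; the same
        # information can later generate an MCP tool definition (Phase 2).
        self.tools[fn.__name__] = fn
        return fn

    def schema(self, name: str) -> dict[str, str]:
        sig = inspect.signature(self.tools[name])
        return {p: a.annotation.__name__ for p, a in sig.parameters.items()}

registry = ToolRegistry()

@registry.tool
def query_captions(keyword: str, limit: int) -> list[str]:
    """Filter captions by keyword (stand-in for a DuckDB query)."""
    data = ["a dog on a beach", "two cats sleeping", "a dog catching a frisbee"]
    return [c for c in data if keyword in c][:limit]

# Results stay as in-process Python objects: no JSON round-trip.
print(registry.tools["query_captions"]("dog", 1))
print(registry.schema("query_captions"))
```

Because the tool result never leaves the process, a PyArrow table returned by a real query function would be handed to the agent loop as-is, which is the zero-serialisation property argued for above.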

3.2 Summary

Scenario                                                           Interface
Claude.ai web, remote coding agents, cross-machine orchestration   MCP server (Phase 2)
Claude Code, Pydantic-AI scripts, Ray workers, training loops      Python API (Phase 1)
Production ML pipelines on the same host                           Python API

4. Full System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                      K PARALLEL SIMULATORS                       │
│   sim_0 ... sim_K-1 (perception / navigation / manipulation)     │
└───────────┬──────────────────────────────┬───────────────────────┘
            │ Zenoh (data plane)           │ wandb.log() (metrics only)
            ▼                              ▼
┌────────────────────────┐        ┌────────────────────────┐
│    LAKEHOUSE CORE      │◀──ref──│   WEIGHTS & BIASES     │
│                        │        │                        │
│ S3 (MinIO / Ceph):     │        │ Project / Group        │
│   landing/             │        │ Run per sim_k          │
│   warehouse/           │        │ Reference Artifacts    │
│   serving/             │        │   → s3://lakehouse/…   │
│                        │        │ Scalar metrics, charts │
│ DuckLake catalog:      │        └───────────┬────────────┘
│   experiments          │                    │
│   simulation_runs      │                    ▼
│   fragment_index       │             W&B dashboard
└──────────┬─────────────┘
           │ NATS/JetStream (control plane events)
           ▼
┌──────────────────────────────┐
│     PYDANTIC-AI AGENTS       │
│     (orchestration layer)    │
│ · monitor sim lifecycle      │
│ · trigger quality checks     │
│ · route to visualisation     │
│ · promote to serving layer   │
└──────────┬─────────────┬─────┘
           │             │
           ▼             ▼
┌─────────────┐   ┌──────────────────────────────┐
│  Rerun.io   │   │     HF IterableDataset       │
│  spatial /  │   │ DuckDB → Arrow → DataLoader  │
│  temporal   │   │ (streaming training egress)  │
│  replay     │   └──────────────────────────────┘
└─────────────┘

5. Transport Layer: Zenoh and NATS/JetStream

Two messaging systems serve distinct roles and are not interchangeable.

5.1 Zenoh — Data Plane

Zenoh is designed for the data plane of robotics and edge computing systems. It was adopted as the default transport for ROS 2 (rmw_zenoh), operates peer-to-peer without a broker, achieves ~1.4 M msgs/s with 5-byte wire overhead, and runs from microcontrollers to cloud nodes.

Used for:

  • Simulator frames, point clouds, IMU readings → lakehouse ingest
  • Embedding streams from inference back to simulators
  • Lakehouse egress to distributed Ray training workers (fan-out)
  • Real-time replay streams to Rerun

5.2 NATS/JetStream — Control Plane

NATS/JetStream provides durable, at-least-once delivery for control events that orchestration agents must react to. JetStream adds persistence (subjects as durable queues) and a KV store for agent state and configuration.

Note: Jepsen testing (Dec 2025) found data consistency issues in NATS 2.12 that are being addressed. Monitor before adopting for critical state.

Used for:

  • lakehouse.sim.started / lakehouse.sim.completed / lakehouse.sim.failed
  • lakehouse.ingest.completed — triggers Pydantic-AI quality-check agent
  • lakehouse.dataset.versioned — triggers downstream consumers
  • lakehouse.query.requested / lakehouse.query.result — async query coordination
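The subject hierarchy above follows NATS conventions, where subscriptions can use `*` (exactly one token) and `>` (the remainder). Real code would subscribe via nats-py; this dependency-free sketch only illustrates how those wildcards route the events listed:

```python
# Minimal sketch of NATS-style subject matching for the control-plane events
# above. '*' matches one dot-separated token; '>' matches the rest.
def subject_matches(pattern: str, subject: str) -> bool:
    pat, sub = pattern.split("."), subject.split(".")
    for i, tok in enumerate(pat):
        if tok == ">":          # '>' swallows the remainder of the subject
            return True
        if i >= len(sub):
            return False
        if tok != "*" and tok != sub[i]:
            return False
    return len(pat) == len(sub)

assert subject_matches("lakehouse.sim.*", "lakehouse.sim.completed")
assert subject_matches("lakehouse.>", "lakehouse.ingest.completed")
assert not subject_matches("lakehouse.sim.*", "lakehouse.ingest.completed")
```

An orchestration agent subscribing to `lakehouse.sim.*` thus sees all lifecycle events without also receiving ingest or query traffic.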

5.3 Interaction Pattern

Simulator ──Zenoh──▶ Lakehouse ingest ──NATS event──▶ Pydantic-AI agent
                                                            │
                                                            └─▶ calls lakehouse Python API tools

Zenoh carries data fast. NATS carries notifications that something happened.
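As a toy end-to-end of this hand-off, plain queues can stand in for Zenoh (high-rate data) and NATS (low-rate events); all names here, including quality_check(), are illustrative:

```python
# Toy version of the interaction pattern: a queue each for the Zenoh data
# plane and the NATS control plane. Names are illustrative stand-ins.
from queue import Queue

data_plane: Queue = Queue()     # Zenoh stand-in: high-rate frames
control_plane: Queue = Queue()  # NATS stand-in: low-rate durable events

# Simulator publishes frames on the data plane.
for i in range(3):
    data_plane.put({"frame_id": i, "payload": b"\x00" * 8})

# Lakehouse ingest drains the data plane, then emits ONE control event.
ingested = []
while not data_plane.empty():
    ingested.append(data_plane.get())
control_plane.put({"subject": "lakehouse.ingest.completed", "count": len(ingested)})

def quality_check(n: int) -> str:   # hypothetical lakehouse tool
    return "ok" if n > 0 else "empty"

# Pydantic-AI agent reacts to the event by calling a tool in-process.
event = control_plane.get()
print(event["subject"], "->", quality_check(event["count"]))
```

Note the asymmetry: three data messages produce a single control event, which is why the two transports are not interchangeable.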


6. Egress Streaming and HF Datasets Compatibility

6.1 Requirement

The lakehouse must serve streaming data to training loops without materialising the full dataset in memory, and must be compatible with the HF datasets library API so that training code written against HF Hub datasets works unchanged against the lakehouse.

6.2 Stack

DuckDB query (over DuckLake + S3 Parquet)
  │ .fetch_arrow_reader(batch_size=256)   ← zero-copy, never loads full result
  ▼
RecordBatchReader
  │ Python generator wrapper
  ▼
IterableDataset.from_generator()          ← standard HF datasets API
  │ .map() / .filter() / .shuffle() / .batch()
  ▼
PyTorch DataLoader / HF Trainer / Ray Data

DuckDB.fetch_arrow_reader() returns a RecordBatchReader — an iterator over Arrow record batches. This bridges to IterableDataset.from_generator(), making the lakehouse appear identical to load_dataset(..., streaming=True) from the perspective of training code. Data stays in MinIO; no HF Hub is involved at runtime.
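The generator wrapper in the middle of that stack has a simple shape. In this sketch the DuckDB reader is mocked with a stdlib generator so the example is self-contained; with the real libraries the final step would be IterableDataset.from_generator(example_generator):

```python
# Shape of the egress bridge: a bounded batch reader wrapped in a plain
# generator that yields one example at a time, which is what
# IterableDataset.from_generator() expects. DuckDB is mocked here.
from typing import Iterator

def fake_record_batches(n_rows: int, batch_size: int) -> Iterator[list[dict]]:
    """Stand-in for DuckDB's RecordBatchReader: yields bounded batches."""
    for start in range(0, n_rows, batch_size):
        yield [{"id": i, "caption": f"caption {i}"}
               for i in range(start, min(start + batch_size, n_rows))]

def example_generator(n_rows: int = 1000, batch_size: int = 256) -> Iterator[dict]:
    # Only one batch is resident at a time: the full result set never
    # materialises in memory, matching fetch_arrow_reader() semantics.
    for batch in fake_record_batches(n_rows, batch_size):
        yield from batch

# With HF datasets installed:
#   ds = IterableDataset.from_generator(example_generator)
print(next(example_generator()))  # {'id': 0, 'caption': 'caption 0'}
```

Because the outer generator flattens batches lazily, downstream .map()/.filter() operations compose without ever forcing the query result into memory.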

6.3 Shuffling

Streaming datasets cannot be globally shuffled by the consumer; HF streaming falls back to an approximate, buffer-bounded shuffle. The lakehouse instead handles shuffling at the query layer, where a true global shuffle is cheap:

  • DuckDB-side: SELECT * FROM dataset USING SAMPLE 1000 or ORDER BY random()
  • Fragment-level: shuffle the list of Parquet fragment files before streaming
  • Epoch reshuffling: re-issue the query with a new seed each epoch — zero storage cost
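The fragment-level and epoch-reshuffling bullets combine into a few lines: shuffle the fragment list with a per-epoch seed, so each epoch sees a new order yet any epoch is reproducible from its seed. The paths below are illustrative:

```python
# Fragment-level epoch reshuffling: shuffle the list of Parquet fragment
# paths with a deterministic per-epoch seed. Zero storage cost, and any
# epoch's order can be reproduced exactly. Paths are illustrative.
import random

fragments = [f"s3://lakehouse/warehouse/coco/part-{i:03d}.parquet" for i in range(6)]

def epoch_order(fragments: list[str], epoch: int, base_seed: int = 42) -> list[str]:
    rng = random.Random(base_seed + epoch)   # new seed each epoch
    order = list(fragments)
    rng.shuffle(order)
    return order

assert epoch_order(fragments, 0) == epoch_order(fragments, 0)   # reproducible
assert sorted(epoch_order(fragments, 1)) == sorted(fragments)   # a permutation
```

The same idea applies one level down: re-issuing `ORDER BY random()` with a seeded DuckDB session reshuffles rows within fragments.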

6.4 HF Hub Sync

The sync module pulls datasets from HF Hub into MinIO using huggingface_hub. Since lmms-lab/COCO-Caption is already Parquet-native, ingestion is a direct Parquet copy with no format conversion. A future DatasetBuilder subclass will enable load_dataset("lakehouse", ...) syntax for full HF API compatibility.
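Since ingestion is a byte-for-byte copy, the only real logic in the sync module is key mapping. This hypothetical helper shows one possible layout; the bucket structure is an assumption, not a fixed contract:

```python
# Hypothetical key-mapping helper for sync.py: a Parquet file in an HF repo
# is copied unchanged to a deterministic MinIO landing key. The bucket
# layout shown is an assumption.
def landing_key(repo_id: str, filename: str, bucket: str = "lakehouse") -> str:
    org, name = repo_id.split("/", 1)
    return f"s3://{bucket}/landing/hf/{org}/{name}/{filename}"

key = landing_key("lmms-lab/COCO-Caption", "data.parquet")
print(key)
# With huggingface_hub + s3fs installed, the copy itself would be roughly:
#   local = huggingface_hub.hf_hub_download(repo_id, filename, repo_type="dataset")
#   s3fs.S3FileSystem(...).put(local, key)
```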


7. Experiment Tracking and Visualisation

7.1 Experiment Data Model

The DuckLake catalog is the system of record for all experiment data. W&B is a metrics dashboard and collaboration layer. They are loosely coupled: the lakehouse can operate without W&B; W&B integration is an adapter, not a dependency.

experiments
experiment_id TEXT PRIMARY KEY
project TEXT
description TEXT
created_at TIMESTAMPTZ

simulation_runs
run_id TEXT PRIMARY KEY
experiment_id TEXT REFERENCES experiments
sim_index INT -- k in 0..K-1
config JSONB -- hyperparameters
status TEXT -- started / completed / failed
s3_prefix TEXT -- s3://lakehouse/exp/{id}/sim_{k}/
started_at TIMESTAMPTZ
completed_at TIMESTAMPTZ

W&B run_id and artifact_id are not stored in DuckLake. The lakehouse does not import W&B as a data dependency. W&B Reference Artifacts store s3:// URIs pointing into the lakehouse; the link is owned by W&B, not by the lakehouse.
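The schema above can be smoke-tested without a running catalog. This sketch uses stdlib sqlite3 as a stand-in for DuckLake (sqlite is lenient about type names such as TIMESTAMPTZ and JSONB; the real DDL runs in DuckDB), with sample values matching Experiment #0:

```python
# Smoke-test of the experiment schema, with sqlite3 standing in for the
# DuckLake catalog. Column names and types mirror the listing above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE experiments (
  experiment_id TEXT PRIMARY KEY,
  project TEXT, description TEXT, created_at TIMESTAMPTZ
);
CREATE TABLE simulation_runs (
  run_id TEXT PRIMARY KEY,
  experiment_id TEXT REFERENCES experiments,
  sim_index INT, config JSONB, status TEXT,
  s3_prefix TEXT, started_at TIMESTAMPTZ, completed_at TIMESTAMPTZ
);
""")
con.execute("INSERT INTO experiments (experiment_id, project) VALUES ('coco-v1', 'demo')")
con.execute(
    "INSERT INTO simulation_runs (run_id, experiment_id, sim_index, status, s3_prefix) "
    "VALUES ('coco-v1-sim0', 'coco-v1', 0, 'completed', 's3://lakehouse/exp/coco-v1/sim_0/')"
)
row = con.execute(
    "SELECT status FROM simulation_runs WHERE experiment_id = 'coco-v1'"
).fetchone()
print(row)  # ('completed',)
```

Note what is absent: no W&B identifiers appear anywhere in the DDL, which is the loose coupling described above.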

7.2 W&B Integration Pattern

# Each simulator logs metrics directly to W&B
wandb.init(project=experiment.project, group=experiment.experiment_id)
wandb.log({"reward": r, "success_rate": s, "timestep": t})

# On sim completion, a Pydantic-AI agent registers a Reference Artifact
artifact = wandb.Artifact(name="sim-data", type="dataset")
artifact.add_reference(f"s3://lakehouse/exp/{exp_id}/sim_{k}/")
wandb.log_artifact(artifact)

7.3 Visualisation Routing

The visualize(data, backend="auto") method in the Python API inspects the data schema and routes automatically:

Data type                                        Backend        Rationale
Frames, point clouds, trajectories, transforms   Rerun          ECS temporal/spatial model
MCAP / ROS-style recordings                      Rerun          Native MCAP loader
Scalar time-series (reward, loss, accuracy)      W&B            Comparison across K runs
Distributions, histograms                        W&B            Built-in chart types
Dataset samples (images + text)                  Rerun          Image + text component logging
Model checkpoints                                W&B Artifacts  Versioning and registry

7.4 Rerun Integration

Rerun's Entity Component System maps directly onto the lakehouse data model:

Lakehouse concept    Rerun concept
clip_id / frame_id   Entity path (e.g. /sim/camera/front)
Timestamp column     Timeline
Parquet fragment     Components on an entity
Scene metadata       Static data
MCAP file in S3      Native MCAP loader

The lakehouse Python API exposes to_rerun(query) which fetches Arrow batches from DuckDB and logs them to a Rerun recording stream. Rerun .rrd replay files are stored as artifacts in the lakehouse alongside raw data, making every recording retrievable and re-playable.
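The row-to-entity mapping inside to_rerun() reduces to a pure function that can be shown without the SDK. The entity-path scheme below is an assumption; the actual rerun-sdk logging calls are left as comments:

```python
# Sketch of the lakehouse → Rerun mapping from the table above: each row is
# addressed by an entity path and stamped on a timeline. The path scheme
# (/sim_{k}/camera/{sensor}) is an assumed convention.
def entity_path(sim_index: int, sensor: str) -> str:
    return f"/sim_{sim_index}/camera/{sensor}"

def to_rerun_plan(rows: list[dict]) -> list[tuple[str, float]]:
    """Return (entity_path, timestamp) pairs, i.e. what each log call targets."""
    return [(entity_path(r["sim_index"], r["sensor"]), r["ts"]) for r in rows]

rows = [
    {"sim_index": 0, "sensor": "front", "ts": 0.0},
    {"sim_index": 0, "sensor": "front", "ts": 0.1},
]
plan = to_rerun_plan(rows)
print(plan[0])  # ('/sim_0/camera/front', 0.0)
# With rerun-sdk installed, each pair would drive a timeline-stamped log of
# the frame payload to the given entity path.
```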


8. Python Package Structure

lakehouse/
sync.py # HF Hub → MinIO ingestion (huggingface_hub + s3fs)
catalog.py # DuckLake attach/detach, schema management, experiment registry
query.py # Typed query functions returning PyArrow / pandas
stream.py # as_iterable_dataset(), fetch_arrow_reader() wrapper
tools.py # @agent.tool definitions (Pydantic schemas = tool schemas)
visualize.py # visualize(data, backend="auto") routing to Rerun / W&B
mcp_server.py # (Phase 2) MCP transport wrapper over tools.py

Pydantic models in tools.py serve double duty: they define the data schema (COCOSample, SimulationRun) and the tool input/output schemas for agent registration. The same model generates the MCP server's tool schema in Phase 2 with no additional work.
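The "double duty" idea can be illustrated without Pydantic. Pydantic derives a JSON schema from a model via model_json_schema(); this dependency-free sketch gets a flat equivalent from type annotations, and the COCOSample fields shown are assumptions:

```python
# Stdlib illustration of schema double duty: one typed model describes both
# a data row and a tool's input schema. Pydantic does this properly via
# model_json_schema(); a dataclass + get_type_hints keeps the sketch
# dependency-free. COCOSample's fields are assumed for illustration.
from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class COCOSample:
    image_id: int
    caption: str

def tool_schema(model: type) -> dict[str, str]:
    """Derive a flat name → type schema, as an MCP tool definition needs."""
    return {name: t.__name__ for name, t in get_type_hints(model).items()}

print(tool_schema(COCOSample))  # {'image_id': 'int', 'caption': 'str'}
```

Because the schema is derived rather than hand-written, the Phase 2 MCP server cannot drift from the Python API.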


9. COCO-Caption Demo — Experiment #0

Dataset: lmms-lab/COCO-Caption (Parquet, image+text, COCO 2014 val)
Subset: 1,000 samples, K=1 simulation
Purpose: Exercise every layer of the architecture at minimal scale

9.1 Deliverables

  1. lakehouse/ package — all modules in §8 implemented
  2. experiments/coco_demo.py — runnable script, follows existing tests/ style
  3. notebooks/coco_demo.ipynb — step-by-step notebook with visible outputs

9.2 Demo Sequence

Step                   Module        What it demonstrates
1. HF sync             sync.py       Pull COCO-Caption Parquet → MinIO landing/ → warehouse/
2. Catalog             catalog.py    Register as experiment_id=coco-v1, sim_id=0 in DuckLake
3. W&B init            tools.py      wandb.init(), log ingest metrics, register Reference Artifact
4. Query               query.py      DuckDB filter by caption keyword, return typed results
5. Stream egress       stream.py     as_iterable_dataset() → iterate 1,000 samples in a toy loop
6. Shuffle             stream.py     Epoch reshuffling via ORDER BY random()
7. Rerun               visualize.py  Log 10 samples (image + caption) as Rerun entities
8. W&B charts          visualize.py  Caption length distribution, image resolution histogram
9. Pydantic-AI agent   tools.py      React to ingest.completed NATS event → call quality_check() tool

9.3 Dependencies to Add

# pyproject.toml additions
"huggingface-hub>=0.23.0",
"datasets>=2.20.0",
"wandb>=0.17.0",
"rerun-sdk>=0.16.0",
"nats-py>=2.7.0",
"pydantic-ai>=0.0.14",

Zenoh is deferred to Phase 2 (replaced by direct S3 writes for K=1 simulator in the demo).


10. Phased Delivery

Phase               Deliverable                                         Key capability unlocked
1 — Demo            lakehouse/ package + COCO notebook/script           HF sync, DuckLake catalog, streaming egress, Rerun + W&B visualisation, one Pydantic-AI agent
2 — MCP             mcp_server.py wrapping tools.py                     Remote agent access (Claude.ai web, Cursor)
3 — Zenoh ingest    Zenoh subscriber in sync.py                         K parallel simulators streaming data in real time
4 — K > 1           Parallel sim orchestration via NATS + Pydantic-AI   Full experiment runs with sweep support
5 — Serving layer   serving/ bucket + Arrow Flight or Zenoh egress      Distributed Ray training fan-out