Modal Lakehouse — Design Document
Date: 2026-02-22
Authors: Pantelis Monogioudis, Conor Fitzpatrick, John Vitz
Status: Approved
Scope: Full system design + COCO-Caption demo (Experiment #0)
1. Purpose
This document records the design of the Modal Lakehouse: a multimodal dataset store for simulation-based perception and robotics experiments. It captures architectural decisions, technology selections, and the rationale behind trade-offs reached during the brainstorming phase.
The COCO-Caption demo (§8) serves as Experiment #0 — a deliberately minimal exercise that instantiates every layer of the full architecture at small scale (1,000 samples, K=1 simulation).
2. Problem Statement
Simulation-based ML experiments for perception and robotics produce large volumes of heterogeneous data: video frames, point clouds, IMU streams, trajectories, captions, and derived embeddings. This data must be:
- Stored durably at scale (100 TB+) in a queryable, versioned catalog
- Streamed efficiently to training loops without materialising the full dataset in memory
- Compatible with HF datasets so that training code is portable and independent of where data is hosted
- Queryable by coding agents as callable tools, both in-process and over a web interface
- Tracked per experiment with W&B for metrics and provenance, without W&B becoming a data dependency
- Visualisable in Rerun (spatial/temporal) or W&B (scalar metrics) depending on data type
- Synchronisable with Hugging Face Hub for dataset sharing without requiring HF Hub as a runtime dependency
3. Agent Interface: Python API First, MCP Server Second
3.1 Rationale
Two interface styles were evaluated:
MCP server exposes lakehouse operations over the MCP protocol (stdio or HTTP/SSE). Any MCP-compatible agent — Claude.ai web, Claude Code, Cursor — can call it regardless of language or location. It is the right choice when agents run over a web interface or across machines, because tool definitions are self-describing and connections are language-agnostic.
Python function-calling API registers typed functions as agent tools via a framework decorator (Pydantic-AI @agent.tool). The agent and lakehouse share the same Python process. Tool calls incur zero serialisation overhead — results stay as PyArrow tables or DataFrames in memory, never converted to JSON and back. DuckDB runs in-process, so queries complete without a network hop. This is dramatically cheaper in tokens and latency for local agents (Claude Code, training scripts, Ray workers).
Decision: Build the Python API first. It is the foundation either way: the MCP server is a thin JSON-over-stdio adapter over the same Python API, requiring minimal additional code. The Python API ships in Phase 1; the MCP server wraps it in Phase 2.
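The serialisation gap is easy to make concrete. A stdlib sketch with a hypothetical 1,000-row result (not the actual lakehouse schema): an in-process tool call hands the agent the object itself, while crossing a protocol boundary forces a JSON round-trip of the full payload.

```python
import json

# A hypothetical 1,000-row query result, held column-oriented in memory
# (a stand-in for the PyArrow table the lakehouse Python API returns).
result = {
    "sample_id": list(range(1000)),
    "caption": ["a cat sitting on a mat"] * 1000,
}

# In-process (Python API): the agent receives the object itself.
in_process = result  # no copy, no encoding

# Cross-process (MCP-style): the same result is serialised to JSON and
# parsed back on the other side of the protocol boundary.
wire = json.dumps(result)
round_tripped = json.loads(wire)
print(f"JSON payload: {len(wire):,} bytes")
```

For an agent loop that issues many exploratory queries, those kilobytes are paid on every call, in both latency and context-window tokens.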
3.2 Summary
| Scenario | Interface |
|---|---|
| Claude.ai web, remote coding agents, cross-machine orchestration | MCP server (Phase 2) |
| Claude Code, Pydantic-AI scripts, Ray workers, training loops | Python API (Phase 1) |
| Production ML pipelines on the same host | Python API |
4. Full System Architecture
┌──────────────────────────────────────────────────────────────────┐
│ K PARALLEL SIMULATORS │
│ sim_0 ... sim_K-1 (perception / navigation / manipulation) │
└───────────┬──────────────────────────────┬───────────────────────┘
│ Zenoh (data plane) │ wandb.log() (metrics only)
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ LAKEHOUSE CORE │ │ WEIGHTS & BIASES │
│ │◀─ref─│ │
│ S3 (MinIO / Ceph): │ │ Project / Group │
│ landing/ │ │ Run per sim_k │
│ warehouse/ │ │ Reference Artifacts │
│ serving/ │ │ → s3://lakehouse/… │
│ │ │ Scalar metrics, charts│
│ DuckLake catalog: │ └───────────┬────────────┘
│ experiments │ │ W&B dashboard
│ simulation_runs │ ▼
│ fragment_index │
└──────────┬─────────────┘
│ NATS/JetStream (control plane events)
▼
┌──────────────────────────────┐
│ PYDANTIC-AI AGENTS │
│ (orchestration layer) │
│ · monitor sim lifecycle │
│ · trigger quality checks │
│ · route to visualisation │
│ · promote to serving layer │
└──────────┬─────────────────┬─┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────────────────────┐
│ Rerun.io │ │ HF IterableDataset │
│ spatial / │ │ DuckDB → Arrow → DataLoader │
│ temporal │ │ (streaming training egress) │
│ replay │ └──────────────────────────────┘
└─────────────┘
5. Transport Layer: Zenoh and NATS/JetStream
Two messaging systems serve distinct roles and are not interchangeable.
5.1 Zenoh — Data Plane
Zenoh is designed for the data plane of robotics and edge computing systems. It was adopted as the default transport for ROS 2 (rmw_zenoh), operates peer-to-peer without a broker, achieves ~1.4 M msgs/s with 5-byte wire overhead, and runs from microcontrollers to cloud nodes.
Used for:
- Simulator frames, point clouds, IMU readings → lakehouse ingest
- Embedding streams from inference back to simulators
- Lakehouse egress to distributed Ray training workers (fan-out)
- Real-time replay streams to Rerun
5.2 NATS/JetStream — Control Plane
NATS/JetStream provides durable, at-least-once delivery for control events that orchestration agents must react to. JetStream adds persistence (subjects as durable queues) and a KV store for agent state and configuration.
Note: Jepsen testing (Dec 2025) found data consistency issues in NATS 2.12 that are being addressed. Monitor before adopting for critical state.
Used for:
- lakehouse.sim.started / lakehouse.sim.completed / lakehouse.sim.failed
- lakehouse.ingest.completed — triggers Pydantic-AI quality-check agent
- lakehouse.dataset.versioned — triggers downstream consumers
- lakehouse.query.requested / lakehouse.query.result — async query coordination
5.3 Interaction Pattern
Simulator ──Zenoh──▶ Lakehouse ingest ──NATS event──▶ Pydantic-AI agent
│
calls lakehouse Python API tools
Zenoh carries data fast. NATS carries notifications that something happened.
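The agent side of this pattern is a subject-to-handler dispatch. A minimal stdlib sketch using the subjects from §5.2 (the handler bodies are hypothetical placeholders; a real agent would subscribe via nats-py and call lakehouse Python API tools):

```python
from typing import Optional

def on_ingest_completed(event: dict) -> str:
    # Placeholder: would invoke the quality_check() tool on the new data.
    return f"quality_check({event['run_id']})"

def on_sim_failed(event: dict) -> str:
    # Placeholder: would flag the run for inspection.
    return f"alert({event['run_id']})"

# Control-plane subjects (§5.2) mapped to agent actions.
ROUTES = {
    "lakehouse.ingest.completed": on_ingest_completed,
    "lakehouse.sim.failed": on_sim_failed,
}

def dispatch(subject: str, event: dict) -> Optional[str]:
    """Route one NATS control-plane event to its handler, if any."""
    handler = ROUTES.get(subject)
    return handler(event) if handler else None

print(dispatch("lakehouse.ingest.completed", {"run_id": "sim_0"}))
# → quality_check(sim_0)
```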
6. Egress Streaming and HF Datasets Compatibility
6.1 Requirement
The lakehouse must serve streaming data to training loops without materialising the full dataset in memory, and must be compatible with the HF datasets library API so that training code written against HF Hub datasets works unchanged against the lakehouse.
6.2 Stack
DuckDB query (over DuckLake + S3 Parquet)
│ .fetch_arrow_reader(batch_size=256) ← zero-copy, never loads full result
▼
RecordBatchReader
│ Python generator wrapper
▼
IterableDataset.from_generator() ← standard HF datasets API
│ .map() .filter() .shuffle() .batch()
▼
PyTorch DataLoader / HF Trainer / Ray Data
DuckDB.fetch_arrow_reader() returns a RecordBatchReader — an iterator over Arrow record batches. This bridges to IterableDataset.from_generator(), making the lakehouse appear identical to load_dataset(..., streaming=True) from the perspective of training code. Data stays in MinIO; no HF Hub is involved at runtime.
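The generator wrapper in the middle of the stack is a small re-chunking adapter. A stdlib sketch, assuming each batch arrives as a column dict (the shape RecordBatch.to_pydict() produces); in the real pipeline the batches come from the DuckDB reader and the generator is handed to IterableDataset.from_generator():

```python
from typing import Dict, Iterator, List

def rows_from_batches(batches: Iterator[Dict[str, List]]) -> Iterator[dict]:
    """Flatten column-oriented batches into per-row dicts, one batch at a
    time, so memory is bounded by batch_size rather than dataset size."""
    for batch in batches:
        columns = list(batch)
        n_rows = len(batch[columns[0]]) if columns else 0
        for i in range(n_rows):
            yield {col: batch[col][i] for col in columns}

# Two 2-row batches stand in for Arrow record batches.
fake_batches = iter([
    {"image_id": [1, 2], "caption": ["a dog", "a cat"]},
    {"image_id": [3, 4], "caption": ["a boat", "a bus"]},
])
rows = list(rows_from_batches(fake_batches))
print(rows[0])  # → {'image_id': 1, 'caption': 'a dog'}
```

Because the outer iterator is lazy, nothing upstream of the current batch is ever materialised, which is exactly the property the requirement in §6.1 demands.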
6.3 Shuffling
Streaming datasets cannot be globally shuffled in memory. The lakehouse handles shuffling at the query layer instead, which avoids the approximation of fixed-size shuffle buffers:
- DuckDB-side: SELECT * FROM dataset USING SAMPLE 1000 or ORDER BY random()
- Fragment-level: shuffle the list of Parquet fragment files before streaming
- Epoch reshuffling: re-issue the query with a new seed each epoch — zero storage cost
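Epoch reshuffling then reduces to string construction: derive a per-epoch seed and re-issue the query. A hypothetical helper, assuming DuckDB's setseed() (which takes a value in [-1, 1] and makes random() reproducible on the connection):

```python
def epoch_shuffle_queries(table: str, epoch: int, n_epochs: int = 1000):
    """Return (seed_stmt, select_stmt) for a deterministic per-epoch shuffle.

    Hypothetical helper: the seed is derived from the epoch number, so each
    epoch sees a different but reproducible order, at zero storage cost."""
    seed = (epoch % n_epochs) / n_epochs  # map epoch into [0, 1)
    return (
        f"SELECT setseed({seed})",
        f"SELECT * FROM {table} ORDER BY random()",
    )

seed_stmt, select_stmt = epoch_shuffle_queries("coco_captions", epoch=3)
print(seed_stmt)    # → SELECT setseed(0.003)
print(select_stmt)  # → SELECT * FROM coco_captions ORDER BY random()
```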
6.4 HF Hub Sync
The sync module pulls datasets from HF Hub into MinIO using huggingface_hub. Since lmms-lab/COCO-Caption is already Parquet-native, ingestion is a direct Parquet copy with no format conversion. A future DatasetBuilder subclass will enable load_dataset("lakehouse", ...) syntax for full HF API compatibility.
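The landing → warehouse layout can be captured in a small path helper. A sketch whose bucket name and directory convention are assumptions (modelled on the s3://lakehouse/ prefixes used elsewhere in this document, not the shipped sync.py):

```python
def s3_prefixes(repo_id: str, revision: str = "main") -> dict:
    """Derive landing and warehouse prefixes for an HF Hub dataset.

    Hypothetical layout: landing/ holds the verbatim Parquet copy pulled
    from the Hub; warehouse/ holds the DuckLake-registered version."""
    safe = repo_id.replace("/", "__")  # flatten org/name into one segment
    return {
        "landing": f"s3://lakehouse/landing/{safe}/{revision}/",
        "warehouse": f"s3://lakehouse/warehouse/{safe}/",
    }

print(s3_prefixes("lmms-lab/COCO-Caption")["landing"])
# → s3://lakehouse/landing/lmms-lab__COCO-Caption/main/
```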
7. Experiment Tracking and Visualisation
7.1 Experiment Data Model
The DuckLake catalog is the system of record for all experiment data. W&B is a metrics dashboard and collaboration layer. They are loosely coupled: the lakehouse can operate without W&B; W&B integration is an adapter, not a dependency.
experiments
experiment_id TEXT PRIMARY KEY
project TEXT
description TEXT
created_at TIMESTAMPTZ
simulation_runs
run_id TEXT PRIMARY KEY
experiment_id TEXT REFERENCES experiments
sim_index INT -- k in 0..K-1
config JSONB -- hyperparameters
status TEXT -- started / completed / failed
s3_prefix TEXT -- s3://lakehouse/exp/{id}/sim_{k}/
started_at TIMESTAMPTZ
completed_at TIMESTAMPTZ
W&B run_id and artifact_id are not stored in DuckLake. The lakehouse does not import W&B as a data dependency. W&B Reference Artifacts store s3:// URIs pointing into the lakehouse; the link is owned by W&B, not by the lakehouse.
7.2 W&B Integration Pattern
import wandb

# Each simulator logs metrics directly to W&B
wandb.init(project=experiment.project, group=experiment.experiment_id)
wandb.log({"reward": r, "success_rate": s, "timestep": t})
# On sim completion, a Pydantic-AI agent registers a Reference Artifact
artifact = wandb.Artifact(name="sim-data", type="dataset")
artifact.add_reference(f"s3://lakehouse/exp/{exp_id}/sim_{k}/")
wandb.log_artifact(artifact)
7.3 Visualisation Routing
The visualize(data, backend="auto") method in the Python API inspects the data schema and routes automatically:
| Data type | Backend | Rationale |
|---|---|---|
| Frames, point clouds, trajectories, transforms | Rerun | ECS temporal/spatial model |
| MCAP / ROS-style recordings | Rerun | Native MCAP loader |
| Scalar time-series (reward, loss, accuracy) | W&B | Comparison across K runs |
| Distributions, histograms | W&B | Built-in chart types |
| Dataset samples (images + text) | Rerun | Image + text component logging |
| Model checkpoints | W&B Artifacts | Versioning and registry |
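The routing table above can be approximated by a schema inspection. A sketch in which the column names and heuristic are illustrative assumptions, not the shipped visualize.py logic:

```python
# Column names that signal spatial/temporal payloads (assumed, not the
# actual lakehouse schema vocabulary).
SPATIAL_COLUMNS = {"image", "point_cloud", "trajectory", "transform", "frame"}

def pick_backend(columns: set) -> str:
    """Route to Rerun when the schema carries spatial/temporal payloads,
    otherwise fall back to W&B for scalar metrics and distributions."""
    return "rerun" if columns & SPATIAL_COLUMNS else "wandb"

print(pick_backend({"image", "caption"}))    # → rerun
print(pick_backend({"reward", "timestep"}))  # → wandb
```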
7.4 Rerun Integration
Rerun's Entity Component System maps directly onto the lakehouse data model:
| Lakehouse concept | Rerun concept |
|---|---|
| clip_id / frame_id | Entity path (e.g. /sim/camera/front) |
| Timestamp column | Timeline |
| Parquet fragment | Components on an entity |
| Scene metadata | Static data |
| MCAP file in S3 | Native MCAP loader |
The lakehouse Python API exposes to_rerun(query) which fetches Arrow batches from DuckDB and logs them to a Rerun recording stream. Rerun .rrd replay files are stored as artifacts in the lakehouse alongside raw data, making every recording retrievable and re-playable.
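The clip/frame → entity-path mapping is a pure string transform. A hypothetical helper following the /sim/camera/front convention shown in the table (the exact convention to_rerun() uses is not fixed by this document):

```python
def entity_path(sim_index: int, sensor: str, *parts: str) -> str:
    """Build a Rerun entity path for a simulator sensor stream.

    Hypothetical convention: /sim_{k}/{sensor}/..., mirroring the
    s3://lakehouse/exp/{id}/sim_{k}/ storage layout."""
    return "/" + "/".join([f"sim_{sim_index}", sensor, *parts])

print(entity_path(0, "camera", "front"))  # → /sim_0/camera/front
```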
8. Python Package Structure
lakehouse/
sync.py # HF Hub → MinIO ingestion (huggingface_hub + s3fs)
catalog.py # DuckLake attach/detach, schema management, experiment registry
query.py # Typed query functions returning PyArrow / pandas
stream.py # as_iterable_dataset(), fetch_arrow_reader() wrapper
tools.py # @agent.tool definitions (Pydantic schemas = tool schemas)
visualize.py # visualize(data, backend="auto") routing to Rerun / W&B
mcp_server.py # (Phase 2) MCP transport wrapper over tools.py
Pydantic models in tools.py serve double duty: they define the data schema (COCOSample, SimulationRun) and the tool input/output schemas for agent registration. The same model generates the MCP server's tool schema in Phase 2 with no additional work.
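The double duty is concrete: the same class that types query results also emits the JSON Schema an MCP server advertises. A sketch with an assumed two-field COCOSample (the real model in tools.py may carry more fields):

```python
from pydantic import BaseModel

class COCOSample(BaseModel):
    """One dataset row; doubles as the tool output schema when a query
    function is registered with @agent.tool."""
    image_id: int
    caption: str

# Pydantic v2 emits JSON Schema directly, which is what an MCP tool
# definition needs, so Phase 2 adds no schema code.
schema = COCOSample.model_json_schema()
print(sorted(schema["properties"]))  # → ['caption', 'image_id']
```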
9. COCO-Caption Demo — Experiment #0
Dataset: lmms-lab/COCO-Caption (Parquet, image+text, COCO 2014 val)
Subset: 1,000 samples, K=1 simulation
Purpose: Exercise every layer of the architecture at minimal scale
9.1 Deliverables
- lakehouse/ package — all modules in §8 implemented
- experiments/coco_demo.py — runnable script, follows existing tests/ style
- notebooks/coco_demo.ipynb — step-by-step notebook with visible outputs
9.2 Demo Sequence
| Step | Module | What it demonstrates |
|---|---|---|
| 1. HF sync | sync.py | Pull COCO-Caption Parquet → MinIO landing → warehouse |
| 2. Catalog | catalog.py | Register as experiment_id=coco-v1, sim_id=0 in DuckLake |
| 3. W&B init | tools.py | wandb.init(), log ingest metrics, register Reference Artifact |
| 4. Query | query.py | DuckDB filter by caption keyword, return typed results |
| 5. Stream egress | stream.py | as_iterable_dataset() → iterate 1,000 samples in a toy loop |
| 6. Shuffle | stream.py | Epoch reshuffling via ORDER BY random() |
| 7. Rerun | visualize.py | Log 10 samples (image + caption) as Rerun entities |
| 8. W&B charts | visualize.py | Caption length distribution, image resolution histogram |
| 9. Pydantic-AI agent | tools.py | React to ingest.completed NATS event → call quality_check() tool |
9.3 Dependencies to Add
# pyproject.toml additions
"huggingface-hub>=0.23.0",
"datasets>=2.20.0",
"wandb>=0.17.0",
"rerun-sdk>=0.16.0",
"nats-py>=2.7.0",
"pydantic-ai>=0.0.14",
Zenoh is deferred: direct S3 writes suffice for the demo's single K=1 simulator, and real-time Zenoh ingest arrives in Phase 3 (§10).
10. Phased Delivery
| Phase | Deliverable | Key capability unlocked |
|---|---|---|
| 1 — Demo | lakehouse/ package + COCO notebook/script | HF sync, DuckLake catalog, streaming egress, Rerun + W&B visualisation, one Pydantic-AI agent |
| 2 — MCP | mcp_server.py wrapping tools.py | Remote agent access (Claude.ai web, Cursor) |
| 3 — Zenoh ingest | Zenoh subscriber in sync.py | K parallel simulators streaming data in real time |
| 4 — K > 1 | Parallel sim orchestration via NATS + Pydantic-AI | Full experiment runs with sweep support |
| 5 — Serving layer | serving/ bucket + Arrow Flight or Zenoh egress | Distributed Ray training fan-out |