Modal Lakehouse — Design Document
Date: 2026-02-22
Authors: Pantelis Monogioudis, Conor Fitzpatrick, John Vitz
Status: Approved
Scope: Full system design + COCO-Caption demo (Experiment #0)
1. Purpose
This document records the design of the Modal Lakehouse: a multimodal dataset store for simulation-based perception and robotics experiments. It captures architectural decisions, technology selections, and the rationale behind trade-offs reached during the brainstorming phase.
The COCO-Caption demo (§8) serves as Experiment #0 — a deliberately minimal exercise that instantiates every layer of the full architecture at small scale (1,000 samples, K=1 simulation).
2. Problem Statement
Simulation-based ML experiments for perception and robotics produce large volumes of heterogeneous data: video frames, point clouds, IMU streams, trajectories, captions, and derived embeddings. This data must be:
- Stored durably at scale (100 TB+) in a queryable, versioned catalog
- Streamed efficiently to training loops without materialising the full dataset in memory
- Compatible with HF datasets so that training code is portable and independent of where data is hosted
- Queryable by coding agents as callable tools, both in-process and over a web interface
- Tracked per experiment with W&B for metrics and provenance, without W&B becoming a data dependency
- Visualisable in Rerun (spatial/temporal) or W&B (scalar metrics) depending on data type
- Synchronisable with Hugging Face Hub for dataset sharing without requiring HF Hub as a runtime dependency
3. Agent Interface: Python API First, MCP Server Second
3.1 Rationale
Two interface styles were evaluated:
MCP server exposes lakehouse operations over the MCP protocol (stdio or HTTP/SSE). Any MCP-compatible agent — Claude.ai web, Claude Code, Cursor — can call it regardless of language or location. It is the right choice when agents run over a web interface or across machines, because tool definitions are self-describing and connections are language-agnostic.
Python function-calling API registers typed functions as agent tools via a framework decorator (Pydantic-AI @agent.tool). The agent and lakehouse share the same Python process. Tool calls incur zero serialisation overhead — results stay as PyArrow tables or DataFrames in memory, never converted to JSON and back. DuckDB runs in-process, so queries complete without a network hop. This is dramatically cheaper in tokens and latency for local agents (Claude Code, training scripts, Ray workers).
Decision: Build the Python API first. It is the foundation either way: the MCP server is a thin JSON-over-stdio adapter over the same Python API, requiring minimal additional code. The Python API ships in Phase 1; the MCP server wraps it in Phase 2.
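The serialisation gap is easy to make concrete. A stdlib sketch with a hypothetical 1,000-row result (not the actual lakehouse schema): an in-process tool call hands the agent the object itself, while crossing a protocol boundary forces a JSON round-trip of the full payload.

```python
import json

# A hypothetical 1,000-row query result, held column-oriented in memory
# (a stand-in for the PyArrow table the lakehouse Python API returns).
result = {
    "sample_id": list(range(1000)),
    "caption": ["a cat sitting on a mat"] * 1000,
}

# In-process (Python API): the agent receives the object itself.
in_process = result  # no copy, no encoding

# Cross-process (MCP-style): the same result is serialised to JSON and
# parsed back on the other side of the protocol boundary.
wire = json.dumps(result)
round_tripped = json.loads(wire)
print(f"JSON payload: {len(wire):,} bytes")
```

For an agent loop that issues many exploratory queries, those kilobytes are paid on every call, in both latency and context-window tokens.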
3.2 Summary
| Scenario | Interface |
|---|---|
| Claude.ai web, remote coding agents, cross-machine orchestration | MCP server (Phase 2) |
| Claude Code, Pydantic-AI scripts, Ray workers, training loops | Python API (Phase 1) |
| Production ML pipelines on the same host | Python API |
4. Full System Architecture
┌──────────────────────────────────────────────────────────────────┐
│ K PARALLEL SIMULATORS │
│ sim_0 ... sim_K-1 (perception / navigation / manipulation) │
└───────────┬──────────────────────────────┬───────────────────────┘
│ Zenoh (data plane) │ wandb.log() (metrics only)
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ LAKEHOUSE CORE │ │ WEIGHTS & BIASES │
│ │◀─ref─│ │
│ S3 (MinIO / Ceph): │ │ Project / Group │
│ landing/ │ │ Run per sim_k │
│ warehouse/ │ │ Reference Artifacts │
│ serving/ │ │ → s3://lakehouse/… │
│ │ │ Scalar metrics, charts│
│ DuckLake catalog: │ └───────────┬────────────┘
│ experiments │ │ W&B dashboard
│ simulation_runs │ ▼
│ fragment_index │
└──────────┬─────────────┘
│ NATS/JetStream (control plane events)
▼
┌──────────────────────────────┐
│ PYDANTIC-AI AGENTS │
│ (orchestration layer) │
│ · monitor sim lifecycle │
│ · trigger quality checks │
│ · route to visualisation │
│ · promote to serving layer │
└──────────┬─────────────────┬─┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────────────────────┐
│ Rerun.io │ │ HF IterableDataset │
│ spatial / │ │ DuckDB → Arrow → DataLoader │
│ temporal │ │ (streaming training egress) │
│ replay │ └──────────────────────────────┘
└─────────────┘
5. Transport Layer: Zenoh and NATS/JetStream
Two messaging systems serve distinct roles and are not interchangeable.
5.1 Zenoh — Data Plane
Zenoh is designed for the data plane of robotics and edge computing systems. It was adopted as the default transport for ROS 2 (rmw_zenoh), operates peer-to-peer without a broker, achieves ~1.4 M msgs/s with 5-byte wire overhead, and runs from microcontrollers to cloud nodes.
Used for:
- Simulator frames, point clouds, IMU readings → lakehouse ingest
- Embedding streams from inference back to simulators
- Lakehouse egress to distributed Ray training workers (fan-out)
- Real-time replay streams to Rerun
5.2 NATS/JetStream — Control Plane
NATS/JetStream provides durable, at-least-once delivery for control events that orchestration agents must react to. JetStream adds persistence (subjects as durable queues) and a KV store for agent state and configuration.
Note: Jepsen testing (Dec 2025) found data consistency issues in NATS 2.12 that are being addressed. Monitor before adopting for critical state.
Used for:
- lakehouse.sim.started / lakehouse.sim.completed / lakehouse.sim.failed
- lakehouse.ingest.completed — triggers Pydantic-AI quality-check agent
- lakehouse.dataset.versioned — triggers downstream consumers
- lakehouse.query.requested / lakehouse.query.result — async query coordination
5.3 Interaction Pattern
Simulator ──Zenoh──▶ Lakehouse ingest ──NATS event──▶ Pydantic-AI agent
│
calls lakehouse Python API tools
Zenoh carries data fast. NATS carries notifications that something happened.
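The agent side of this pattern is a subject-to-handler dispatch. A minimal stdlib sketch using the subjects from §5.2 (the handler bodies are hypothetical placeholders; a real agent would subscribe via nats-py and call lakehouse Python API tools):

```python
from typing import Optional

def on_ingest_completed(event: dict) -> str:
    # Placeholder: would invoke the quality_check() tool on the new data.
    return f"quality_check({event['run_id']})"

def on_sim_failed(event: dict) -> str:
    # Placeholder: would flag the run for inspection.
    return f"alert({event['run_id']})"

# Control-plane subjects (§5.2) mapped to agent actions.
ROUTES = {
    "lakehouse.ingest.completed": on_ingest_completed,
    "lakehouse.sim.failed": on_sim_failed,
}

def dispatch(subject: str, event: dict) -> Optional[str]:
    """Route one NATS control-plane event to its handler, if any."""
    handler = ROUTES.get(subject)
    return handler(event) if handler else None

print(dispatch("lakehouse.ingest.completed", {"run_id": "sim_0"}))
# → quality_check(sim_0)
```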
6. Egress Streaming and HF Datasets Compatibility
6.1 Requirement
The lakehouse must serve streaming data to training loops without materialising the full dataset in memory, and must be compatible with the HF datasets library API so that training code written against HF Hub datasets works unchanged against the lakehouse.
6.2 Stack
DuckDB query (over DuckLake + S3 Parquet)
│ .fetch_arrow_reader(batch_size=256) ← zero-copy, never loads full result
▼
RecordBatchReader
│ Python generator wrapper
▼
IterableDataset.from_generator() ← standard HF datasets API
│ .map() .filter() .shuffle() .batch()
▼
PyTorch DataLoader / HF Trainer / Ray Data
DuckDB.fetch_arrow_reader() returns a RecordBatchReader — an iterator over Arrow record batches. This bridges to IterableDataset.from_generator(), making the lakehouse appear identical to load_dataset(..., streaming=True) from the perspective of training code. Data stays in MinIO; no HF Hub is involved at runtime.
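The generator wrapper in the middle of the stack is a small re-chunking adapter. A stdlib sketch, assuming each batch arrives as a column dict (the shape RecordBatch.to_pydict() produces); in the real pipeline the batches come from the DuckDB reader and the generator is handed to IterableDataset.from_generator():

```python
from typing import Dict, Iterator, List

def rows_from_batches(batches: Iterator[Dict[str, List]]) -> Iterator[dict]:
    """Flatten column-oriented batches into per-row dicts, one batch at a
    time, so memory is bounded by batch_size rather than dataset size."""
    for batch in batches:
        columns = list(batch)
        n_rows = len(batch[columns[0]]) if columns else 0
        for i in range(n_rows):
            yield {col: batch[col][i] for col in columns}

# Two 2-row batches stand in for Arrow record batches.
fake_batches = iter([
    {"image_id": [1, 2], "caption": ["a dog", "a cat"]},
    {"image_id": [3, 4], "caption": ["a boat", "a bus"]},
])
rows = list(rows_from_batches(fake_batches))
print(rows[0])  # → {'image_id': 1, 'caption': 'a dog'}
```

Because the outer iterator is lazy, nothing upstream of the current batch is ever materialised, which is exactly the property the requirement in §6.1 demands.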
6.3 Shuffling
Streaming datasets cannot be globally shuffled in memory. The lakehouse handles shuffling at the query layer instead, which avoids the approximation of fixed-size shuffle buffers:
- DuckDB-side: SELECT * FROM dataset USING SAMPLE 1000 or ORDER BY random()
- Fragment-level: shuffle the list of Parquet fragment files before streaming
- Epoch reshuffling: re-issue the query with a new seed each epoch — zero storage cost
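Epoch reshuffling then reduces to string construction: derive a per-epoch seed and re-issue the query. A hypothetical helper, assuming DuckDB's setseed() (which takes a value in [-1, 1] and makes random() reproducible on the connection):

```python
def epoch_shuffle_queries(table: str, epoch: int, n_epochs: int = 1000):
    """Return (seed_stmt, select_stmt) for a deterministic per-epoch shuffle.

    Hypothetical helper: the seed is derived from the epoch number, so each
    epoch sees a different but reproducible order, at zero storage cost."""
    seed = (epoch % n_epochs) / n_epochs  # map epoch into [0, 1)
    return (
        f"SELECT setseed({seed})",
        f"SELECT * FROM {table} ORDER BY random()",
    )

seed_stmt, select_stmt = epoch_shuffle_queries("coco_captions", epoch=3)
print(seed_stmt)    # → SELECT setseed(0.003)
print(select_stmt)  # → SELECT * FROM coco_captions ORDER BY random()
```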
6.4 HF Hub Sync
The sync module pulls datasets from HF Hub into MinIO using huggingface_hub. Since lmms-lab/COCO-Caption is already Parquet-native, ingestion is a direct Parquet copy with no format conversion. A future DatasetBuilder subclass will enable load_dataset("lakehouse", ...) syntax for full HF API compatibility.
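The landing → warehouse layout can be captured in a small path helper. A sketch whose bucket name and directory convention are assumptions (modelled on the s3://lakehouse/ prefixes used elsewhere in this document, not the shipped sync.py):

```python
def s3_prefixes(repo_id: str, revision: str = "main") -> dict:
    """Derive landing and warehouse prefixes for an HF Hub dataset.

    Hypothetical layout: landing/ holds the verbatim Parquet copy pulled
    from the Hub; warehouse/ holds the DuckLake-registered version."""
    safe = repo_id.replace("/", "__")  # flatten org/name into one segment
    return {
        "landing": f"s3://lakehouse/landing/{safe}/{revision}/",
        "warehouse": f"s3://lakehouse/warehouse/{safe}/",
    }

print(s3_prefixes("lmms-lab/COCO-Caption")["landing"])
# → s3://lakehouse/landing/lmms-lab__COCO-Caption/main/
```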
7. Experiment Tracking and Visualisation
7.1 Experiment Data Model
The DuckLake catalog is the system of record for all experiment data. W&B is a metrics dashboard and collaboration layer. They are loosely coupled: the lakehouse can operate without W&B; W&B integration is an adapter, not a dependency.
experiments
experiment_id TEXT PRIMARY KEY
project TEXT
description TEXT
created_at TIMESTAMPTZ
simulation_runs
run_id TEXT PRIMARY KEY
experiment_id TEXT REFERENCES experiments
sim_index INT -- k in 0..K-1
config JSONB -- hyperparameters
status TEXT -- started / completed / failed
s3_prefix TEXT -- s3://lakehouse/exp/{id}/sim_{k}/
started_at TIMESTAMPTZ
completed_at TIMESTAMPTZ
W&B run_id and artifact_id are not stored in DuckLake. The lakehouse does not import W&B as a data dependency. W&B Reference Artifacts store s3:// URIs pointing into the lakehouse; the link is owned by W&B, not by the lakehouse.
7.2 W&B Integration Pattern
import wandb

# Each simulator logs metrics directly to W&B
wandb.init(project=experiment.project, group=experiment.experiment_id)
wandb.log({"reward": r, "success_rate": s, "timestep": t})
# On sim completion, a Pydantic-AI agent registers a Reference Artifact
artifact = wandb.Artifact(name="sim-data", type="dataset")
artifact.add_reference(f"s3://lakehouse/exp/{exp_id}/sim_{k}/")
wandb.log_artifact(artifact)
7.3 Visualisation Routing
The visualize(data, backend="auto") method in the Python API inspects the data schema and routes automatically:
| Data type | Backend | Rationale |
|---|---|---|
| Frames, point clouds, trajectories, transforms | Rerun | ECS temporal/spatial model |
| MCAP / ROS-style recordings | Rerun | Native MCAP loader |
| Scalar time-series (reward, loss, accuracy) | W&B | Comparison across K runs |
| Distributions, histograms | W&B | Built-in chart types |
| Dataset samples (images + text) | Rerun | Image + text component logging |
| Model checkpoints | W&B Artifacts | Versioning and registry |
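The routing table above can be approximated by a schema inspection. A sketch in which the column names and heuristic are illustrative assumptions, not the shipped visualize.py logic:

```python
# Column names that signal spatial/temporal payloads (assumed, not the
# actual lakehouse schema vocabulary).
SPATIAL_COLUMNS = {"image", "point_cloud", "trajectory", "transform", "frame"}

def pick_backend(columns: set) -> str:
    """Route to Rerun when the schema carries spatial/temporal payloads,
    otherwise fall back to W&B for scalar metrics and distributions."""
    return "rerun" if columns & SPATIAL_COLUMNS else "wandb"

print(pick_backend({"image", "caption"}))    # → rerun
print(pick_backend({"reward", "timestep"}))  # → wandb
```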
7.4 Rerun Integration
Rerun's Entity Component System maps directly onto the lakehouse data model:
| Lakehouse concept | Rerun concept |
|---|---|
| clip_id / frame_id | Entity path (e.g. /sim/camera/front) |
| Timestamp column | Timeline |
| Parquet fragment | Components on an entity |
| Scene metadata | Static data |
| MCAP file in S3 | Native MCAP loader |
The lakehouse Python API exposes to_rerun(query) which fetches Arrow batches from DuckDB and logs them to a Rerun recording stream. Rerun .rrd replay files are stored as artifacts in the lakehouse alongside raw data, making every recording retrievable and re-playable.
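The clip/frame → entity-path mapping is a pure string transform. A hypothetical helper following the /sim/camera/front convention shown in the table (the exact convention to_rerun() uses is not fixed by this document):

```python
def entity_path(sim_index: int, sensor: str, *parts: str) -> str:
    """Build a Rerun entity path for a simulator sensor stream.

    Hypothetical convention: /sim_{k}/{sensor}/..., mirroring the
    s3://lakehouse/exp/{id}/sim_{k}/ storage layout."""
    return "/" + "/".join([f"sim_{sim_index}", sensor, *parts])

print(entity_path(0, "camera", "front"))  # → /sim_0/camera/front
```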
8. Python Package Structure
lakehouse/
sync.py # HF Hub → MinIO ingestion (huggingface_hub + s3fs)
catalog.py # DuckLake attach/detach, schema management, experiment registry
query.py # Typed query functions returning PyArrow / pandas
stream.py # as_iterable_dataset(), fetch_arrow_reader() wrapper
tools.py # @agent.tool definitions (Pydantic schemas = tool schemas)
visualize.py # visualize(data, backend="auto") routing to Rerun / W&B
mcp_server.py # (Phase 2) MCP transport wrapper over tools.py
Pydantic models in tools.py serve double duty: they define the data schema (COCOSample, SimulationRun) and the tool input/output schemas for agent registration. The same model generates the MCP server's tool schema in Phase 2 with no additional work.
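The double duty is concrete: the same class that types query results also emits the JSON Schema an MCP server advertises. A sketch with an assumed two-field COCOSample (the real model in tools.py may carry more fields):

```python
from pydantic import BaseModel

class COCOSample(BaseModel):
    """One dataset row; doubles as the tool output schema when a query
    function is registered with @agent.tool."""
    image_id: int
    caption: str

# Pydantic v2 emits JSON Schema directly, which is what an MCP tool
# definition needs, so Phase 2 adds no schema code.
schema = COCOSample.model_json_schema()
print(sorted(schema["properties"]))  # → ['caption', 'image_id']
```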
9. COCO-Caption Demo — Experiment #0
Dataset: lmms-lab/COCO-Caption (Parquet, image+text, COCO 2014 val)
Subset: 1,000 samples, K=1 simulation
Purpose: Exercise every layer of the architecture at minimal scale
9.1 Deliverables
- lakehouse/ package — all modules in §8 implemented
- experiments/coco_demo.py — runnable script, follows existing tests/ style
- notebooks/coco_demo.ipynb — step-by-step notebook with visible outputs
9.2 Demo Sequence
| Step | Module | What it demonstrates |
|---|---|---|
| 1. HF sync | sync.py | Pull COCO-Caption Parquet → MinIO landing → warehouse |
| 2. Catalog | catalog.py | Register as experiment_id=coco-v1, sim_id=0 in DuckLake |
| 3. W&B init | tools.py | wandb.init(), log ingest metrics, register Reference Artifact |
| 4. Query | query.py | DuckDB filter by caption keyword, return typed results |
| 5. Stream egress | stream.py | as_iterable_dataset() → iterate 1,000 samples in a toy loop |
| 6. Shuffle | stream.py | Epoch reshuffling via ORDER BY random() |
| 7. Rerun | visualize.py | Log 10 samples (image + caption) as Rerun entities |
| 8. W&B charts | visualize.py | Caption length distribution, image resolution histogram |
| 9. Pydantic-AI agent | tools.py | React to ingest.completed NATS event → call quality_check() tool |
9.3 Dependencies to Add
# pyproject.toml additions
"huggingface-hub>=0.23.0",
"datasets>=2.20.0",
"wandb>=0.17.0",
"rerun-sdk>=0.16.0",
"nats-py>=2.7.0",
"pydantic-ai>=0.0.14",
Zenoh is deferred: direct S3 writes suffice for the demo's single K=1 simulator, and real-time Zenoh ingest arrives in Phase 3 (§10).
10. Phased Delivery
| Phase | Deliverable | Key capability unlocked |
|---|---|---|
| 1 — Demo | lakehouse/ package + COCO notebook/script | HF sync, DuckLake catalog, streaming egress, Rerun + W&B visualisation, one Pydantic-AI agent |
| 2 — MCP | mcp_server.py wrapping tools.py | Remote agent access (Claude.ai web, Cursor) |
| 3 — Zenoh ingest | Zenoh subscriber in sync.py | K parallel simulators streaming data in real time |
| 4 — K > 1 | Parallel sim orchestration via NATS + Pydantic-AI | Full experiment runs with sweep support |
| 5 — Serving layer | serving/ bucket + Arrow Flight or Zenoh egress | Distributed Ray training fan-out |