Data Plane — COCO-Caption Demo Implementation Plan (Experiment #0)

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Relationship to the data plane design

This is the implementation plan for Experiment #0 of the Auraison data plane.

Document	Purpose
`docs/data-plane/design.mdx`	Canonical data plane design — four-plane model, requirements DP-001 to DP-042
`docs/data-plane/coco-demo-design.mdx`	Architectural decisions for this demo (read before implementing)
This document	Step-by-step implementation plan for `data-plane/lakehouse/` and Experiment #0

Goal: Build the lakehouse/ Python package and the COCO-Caption Experiment #0 demo (script + notebook) that exercises every layer of the full architecture at 1,000-sample scale.

Architecture: A layered Python package (catalog → sync → query → stream → tools → visualize) backed by DuckDB/DuckLake for catalog management, S3-compatible object storage (RustFS in the local Docker Compose setup) for the data files, and HF datasets for the streaming egress interface. Pydantic-AI registers the query/sample functions as agent tools (a v2 preview; in v1 the LakehouseAgent Claude Code subprocess covers this role). W&B is an optional loose adapter: without W&B installed or configured, everything else still works. Rerun visualises spatial/temporal samples.

Tech Stack: Python 3.12, DuckDB 1.3+, DuckLake (DuckDB extension), s3fs, huggingface-hub, datasets, pydantic-ai, wandb, rerun-sdk, moto (test mocking), pytest
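
The "optional loose adapter" idea for W&B can be sketched as an import that is allowed to fail. This is only an illustration under assumptions: optional_import and log_metrics are hypothetical helper names, not part of the planned lakehouse API, and the only W&B calls relied on are wandb.run and wandb.log.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module if it is installed, otherwise return None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def log_metrics(metrics: dict, module_name: str = "wandb") -> bool:
    """Forward metrics to W&B when available; return True only if they were logged."""
    wb = optional_import(module_name)
    # wandb.run is None until wandb.init() has been called.
    if wb is None or getattr(wb, "run", None) is None:
        return False  # W&B missing or no active run: skip silently
    wb.log(metrics)
    return True
```

Callers can ignore the return value; it exists so tests can tell the no-op path from the logging path.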

Prerequisites

Docker Compose services must be running for integration tests:

docker-compose up -d
# RustFS (local S3): http://localhost:29000  (key=minio, secret=minio123)
# PostgreSQL: localhost:55432    (db=ducklake, user=ducklake, pass=123456)

Unit tests use moto (S3 mock) and a tmp file catalog — no Docker needed.
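
One way to keep integration tests from failing when the Compose services are down is a cheap TCP probe evaluated at collection time. A minimal sketch, assuming the ports listed above; service_up and the requires_docker marker are hypothetical names introduced for illustration.

```python
import socket


def service_up(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the Docker Compose notes above.
RUSTFS_UP = service_up("localhost", 29000)
POSTGRES_UP = service_up("localhost", 55432)

# Hypothetical usage in an integration test module:
#   requires_docker = pytest.mark.skipif(
#       not (RUSTFS_UP and POSTGRES_UP),
#       reason="Docker Compose services are not running",
#   )
```

The probe only shows that something is listening on the port, not that it is the expected service, which is usually enough for a skip decision.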

Task dependency graph

Editable Mermaid source: images/coco-demo-plan-task-dependency-graph.mermaid.md

Task 1: Add Dependencies

Files:

Modify: pyproject.toml

Step 1: Add the nine new dependencies

Edit the dependencies list in pyproject.toml to add after "duckdb>=1.3.0":

    "huggingface-hub>=0.23.0",
    "datasets>=2.20.0",
    "wandb>=0.17.0",
    "rerun-sdk>=0.16.0",
    "nats-py>=2.7.0",
    "pydantic-ai>=0.0.14",
    "moto[s3]>=5.0.0",
    "boto3>=1.35.0",
    "Pillow>=10.0.0",

Step 2: Sync the environment

uv sync

Expected: lock file updated, packages installed with no errors.

Step 3: Verify key imports

uv run python -c "import duckdb, datasets, wandb, rerun, pydantic_ai, nats; print('all imports OK')"

Expected: all imports OK

Step 4: Commit

git add pyproject.toml uv.lock
git commit -m "chore: add huggingface, wandb, rerun, pydantic-ai, nats dependencies"

Task 2: Package Skeleton

Files:

Create: lakehouse/__init__.py
Create: lakehouse/catalog.py
Create: lakehouse/sync.py
Create: lakehouse/query.py
Create: lakehouse/stream.py
Create: lakehouse/tools.py
Create: lakehouse/visualize.py
Create: tests/test_catalog.py
Create: tests/test_sync.py
Create: tests/test_query.py
Create: tests/test_stream.py
Create: tests/test_tools.py
Create: tests/test_visualize.py
Create: experiments/__init__.py
Create: notebooks/ (directory only)

Step 1: Create directory structure

mkdir -p lakehouse experiments notebooks tests
touch notebooks/.gitkeep  # git does not track empty directories

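Step 2: Create lakehouse/__init__.py as an empty file for now. The stubs created in Step 3 do not define any names yet, so the re-exports below would raise ImportError on `import lakehouse` and in every test module that imports from the package. Add them only after Task 8, once all of the re-exported names are implemented: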

from lakehouse.catalog import LakehouseCatalog, Experiment, SimulationRun
from lakehouse.sync import sync_from_hf
from lakehouse.query import QueryResult
from lakehouse.stream import as_iterable_dataset
from lakehouse.visualize import visualize
 
__all__ = [
    "LakehouseCatalog",
    "Experiment",
    "SimulationRun",
    "sync_from_hf",
    "QueryResult",
    "as_iterable_dataset",
    "visualize",
]

Step 3: Create all other files as empty stubs

Each file should contain only:

# stub — implemented in subsequent tasks

Step 4: Create experiments/__init__.py as empty.

Step 5: Verify package is importable

uv run python -c "import lakehouse; print('package OK')"

Expected: package OK

Step 6: Commit

git add lakehouse/ experiments/ notebooks/ tests/test_catalog.py tests/test_sync.py tests/test_query.py tests/test_stream.py tests/test_tools.py tests/test_visualize.py
git commit -m "chore: scaffold lakehouse package and test files"

Task 3: Catalog — Experiment Schema and DuckLake Attach

Files:

Modify: lakehouse/catalog.py
Modify: tests/test_catalog.py

What this builds

LakehouseCatalog wraps a DuckDB connection with a DuckLake catalog attached. It owns the experiments and simulation_runs tables and provides typed methods for registering and updating records.
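
Idempotent registration does not have to rely on a key constraint: a check-then-insert statement gives the same "second call is a no-op" behaviour. The sketch below uses the standard-library sqlite3 module purely as a stand-in for the DuckDB connection, and register_experiment here is a simplified, hypothetical free function rather than the catalog method.

```python
import sqlite3


def register_experiment(conn: sqlite3.Connection, experiment_id: str, project: str) -> bool:
    """Insert the experiment unless it already exists; return True if a row was added."""
    cur = conn.execute(
        """
        INSERT INTO experiments (experiment_id, project)
        SELECT ?, ?
        WHERE NOT EXISTS (SELECT 1 FROM experiments WHERE experiment_id = ?)
        """,
        (experiment_id, project, experiment_id),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
# No PRIMARY KEY on purpose: uniqueness comes from the insert statement itself.
conn.execute("CREATE TABLE experiments (experiment_id TEXT NOT NULL, project TEXT NOT NULL)")

first = register_experiment(conn, "exp-dup", "proj")   # inserts
second = register_experiment(conn, "exp-dup", "proj")  # no-op
row_count = conn.execute("SELECT count(*) FROM experiments").fetchone()[0]
```

This pattern is safe for a single writer, which matches the demo; concurrent writers would need a transaction or a real uniqueness guarantee.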

Step 1: Write failing tests in tests/test_catalog.py

import pytest
import tempfile
import os
from pathlib import Path
from lakehouse.catalog import LakehouseCatalog, Experiment, SimulationRun
 
 
@pytest.fixture
def catalog(tmp_path):
    db_path = str(tmp_path / "test.ducklake")
    # file-based DuckLake, no S3 needed for catalog tests
    cat = LakehouseCatalog(catalog_path=db_path, data_path=str(tmp_path / "data"))
    yield cat
    cat.close()
 
 
def test_register_experiment(catalog):
    exp = Experiment(experiment_id="exp-001", project="test-project", description="unit test")
    catalog.register_experiment(exp)
    row = catalog.get_experiment("exp-001")
    assert row.experiment_id == "exp-001"
    assert row.project == "test-project"
 
 
def test_register_experiment_idempotent(catalog):
    exp = Experiment(experiment_id="exp-dup", project="proj")
    catalog.register_experiment(exp)
    catalog.register_experiment(exp)  # second call must not raise
    row = catalog.get_experiment("exp-dup")
    assert row is not None
 
 
def test_register_simulation_run(catalog):
    catalog.register_experiment(Experiment(experiment_id="exp-001", project="proj"))
    catalog.register_run(
        SimulationRun(
            run_id="run-001",
            experiment_id="exp-001",
            sim_index=0,
            config={"lr": 0.001},
            s3_prefix="s3://warehouse/exp-001/sim_0/",
        )
    )
    run = catalog.get_run("run-001")
    assert run.sim_index == 0
    assert run.status == "started"
    assert run.config["lr"] == pytest.approx(0.001)
 
 
def test_complete_run(catalog):
    catalog.register_experiment(Experiment(experiment_id="exp-001", project="proj"))
    catalog.register_run(
        SimulationRun(run_id="run-001", experiment_id="exp-001", sim_index=0,
                      config={}, s3_prefix="s3://warehouse/exp-001/sim_0/")
    )
    catalog.complete_run("run-001")
    run = catalog.get_run("run-001")
    assert run.status == "completed"
    assert run.completed_at is not None
 
 
def test_list_runs_for_experiment(catalog):
    catalog.register_experiment(Experiment(experiment_id="exp-002", project="proj"))
    for k in range(3):
        catalog.register_run(
            SimulationRun(run_id=f"run-{k}", experiment_id="exp-002",
                          sim_index=k, config={}, s3_prefix=f"s3://w/exp-002/sim_{k}/")
        )
    runs = catalog.list_runs("exp-002")
    assert len(runs) == 3
    assert {r.sim_index for r in runs} == {0, 1, 2}

Step 2: Run to verify they fail

uv run pytest tests/test_catalog.py -v

Expected: ImportError or AttributeError — LakehouseCatalog not implemented.

Step 3: Implement lakehouse/catalog.py

from __future__ import annotations
 
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
 
import duckdb
 
 
@dataclass
class Experiment:
    experiment_id: str
    project: str
    description: str = ""
 
 
@dataclass
class SimulationRun:
    run_id: str
    experiment_id: str
    sim_index: int
    config: dict
    s3_prefix: str
    status: str = "started"
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
 
 
class LakehouseCatalog:
    def __init__(self, catalog_path: str, data_path: str) -> None:
        self._conn = duckdb.connect()
        self._conn.execute(
            f"ATTACH 'ducklake:{catalog_path}' AS lh (DATA_PATH '{data_path}')"
        )
        self._conn.execute("USE lh")
        self._init_schema()
 
    def _init_schema(self) -> None:
        # DuckLake does not support PRIMARY KEY, FOREIGN KEY, or UNIQUE
        # constraints and may reject non-literal defaults such as now(), so
        # uniqueness, referential checks, and timestamps live in register_*().
        self._conn.execute("""
            CREATE TABLE IF NOT EXISTS experiments (
                experiment_id TEXT NOT NULL,
                project       TEXT NOT NULL,
                description   TEXT,
                created_at    TIMESTAMPTZ
            )
        """)
        self._conn.execute("""
            CREATE TABLE IF NOT EXISTS simulation_runs (
                run_id        TEXT NOT NULL,
                experiment_id TEXT NOT NULL,
                sim_index     INTEGER NOT NULL,
                config        JSON,
                status        TEXT,
                s3_prefix     TEXT,
                started_at    TIMESTAMPTZ,
                completed_at  TIMESTAMPTZ
            )
        """)
 
    def register_experiment(self, exp: Experiment) -> None:
        # Idempotent check-then-insert: ON CONFLICT needs a key constraint.
        if self.get_experiment(exp.experiment_id) is not None:
            return
        self._conn.execute(
            """
            INSERT INTO experiments (experiment_id, project, description, created_at)
            VALUES (?, ?, ?, now())
            """,
            [exp.experiment_id, exp.project, exp.description],
        )
 
    def get_experiment(self, experiment_id: str) -> Optional[Experiment]:
        row = self._conn.execute(
            "SELECT experiment_id, project, description FROM experiments WHERE experiment_id = ?",
            [experiment_id],
        ).fetchone()
        if row is None:
            return None
        return Experiment(experiment_id=row[0], project=row[1], description=row[2])
 
    def register_run(self, run: SimulationRun) -> None:
        # Enforce in Python what the constraints would have enforced in SQL.
        if self.get_experiment(run.experiment_id) is None:
            raise ValueError(f"unknown experiment_id: {run.experiment_id}")
        if self.get_run(run.run_id) is not None:
            raise ValueError(f"run_id already registered: {run.run_id}")
        self._conn.execute(
            """
            INSERT INTO simulation_runs
                (run_id, experiment_id, sim_index, config, status, s3_prefix, started_at)
            VALUES (?, ?, ?, ?, ?, ?, now())
            """,
            [run.run_id, run.experiment_id, run.sim_index,
             json.dumps(run.config), run.status, run.s3_prefix],
        )
 
    def get_run(self, run_id: str) -> Optional[SimulationRun]:
        row = self._conn.execute(
            """
            SELECT run_id, experiment_id, sim_index, config, status,
                   s3_prefix, started_at, completed_at
            FROM simulation_runs WHERE run_id = ?
            """,
            [run_id],
        ).fetchone()
        if row is None:
            return None
        return SimulationRun(
            run_id=row[0], experiment_id=row[1], sim_index=row[2],
            config=json.loads(row[3]) if isinstance(row[3], str) else row[3],
            status=row[4], s3_prefix=row[5],
            started_at=row[6], completed_at=row[7],
        )
 
    def complete_run(self, run_id: str) -> None:
        self._conn.execute(
            "UPDATE simulation_runs SET status='completed', completed_at=now() WHERE run_id=?",
            [run_id],
        )
 
    def list_runs(self, experiment_id: str) -> list[SimulationRun]:
        rows = self._conn.execute(
            """
            SELECT run_id, experiment_id, sim_index, config, status,
                   s3_prefix, started_at, completed_at
            FROM simulation_runs WHERE experiment_id = ?
            ORDER BY sim_index
            """,
            [experiment_id],
        ).fetchall()
        return [
            SimulationRun(
                run_id=r[0], experiment_id=r[1], sim_index=r[2],
                config=json.loads(r[3]) if isinstance(r[3], str) else r[3],
                status=r[4], s3_prefix=r[5], started_at=r[6], completed_at=r[7],
            )
            for r in rows
        ]
 
    def close(self) -> None:
        self._conn.close()

Step 4: Run tests and verify they pass

uv run pytest tests/test_catalog.py -v

Expected: 5 tests pass.
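
The config column round-trips through JSON text: register_run encodes the dict, and get_run decodes it only when the driver hands back a string. A stdlib-only sketch of that contract follows; encode_config and decode_config are hypothetical helper names that mirror the inline expressions in catalog.py.

```python
import json
from typing import Any


def encode_config(config: dict) -> str:
    """Serialise a run config for storage in the JSON column."""
    return json.dumps(config)


def decode_config(value: Any) -> dict:
    """Accept JSON text, an already-decoded dict, or NULL (returned as an empty dict)."""
    if value is None:
        return {}
    return json.loads(value) if isinstance(value, str) else value


stored = encode_config({"lr": 0.001, "tags": ["coco", "demo"]})
restored = decode_config(stored)
```

Only JSON-serialisable values survive this round trip, so configs should stick to numbers, strings, booleans, lists, and dicts.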

Step 5: Commit

git add lakehouse/catalog.py tests/test_catalog.py
git commit -m "feat: implement LakehouseCatalog with experiment/run schema"

Task 4: Sync — HF Hub to the lakehouse

Files:

Modify: lakehouse/sync.py
Modify: tests/test_sync.py

What this builds

sync_from_hf() downloads Parquet files from HF Hub and uploads them to a lakehouse S3 bucket. Tests use moto to mock S3 so Docker is not required.

Step 1: Write failing tests in tests/test_sync.py

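import io
import os
from unittest.mock import patch
 
# sync_from_hf passes a custom endpoint_url. moto only intercepts non-AWS
# endpoints listed in this variable, and reads it when the mock first starts,
# so it is set before moto is imported.
os.environ.setdefault("MOTO_S3_CUSTOM_ENDPOINTS", "http://localhost:29000")
 
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from moto import mock_aws
 
from lakehouse.sync import sync_from_hf, SyncResult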
 
 
def _make_parquet_bytes(n: int = 10) -> bytes:
    table = pa.table({
        "image_id": list(range(n)),
        "caption": [f"a photo of object {i}" for i in range(n)],
    })
    buf = io.BytesIO()
    pq.write_table(table, buf)
    return buf.getvalue()
 
 
@mock_aws
def test_sync_uploads_parquet_to_s3():
    # Create the mock bucket
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="warehouse")
 
    fake_files = {"data/train-00000-of-00001.parquet": _make_parquet_bytes(10)}
 
    with patch("lakehouse.sync._download_hf_parquet_files", return_value=fake_files):
        result = sync_from_hf(
            hf_repo="lmms-lab/COCO-Caption",
            bucket="warehouse",
            s3_prefix="coco/v1",
            endpoint_url="http://localhost:29000",
            access_key="minio",
            secret_key="minio123",
            max_samples=None,
        )
 
    assert result.files_uploaded == 1
    assert result.rows_total == 10
    assert "s3://warehouse/coco/v1/" in result.s3_prefix
 
 
@mock_aws
def test_sync_respects_max_samples():
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="warehouse")
 
    fake_files = {"data/part-0.parquet": _make_parquet_bytes(100)}
 
    with patch("lakehouse.sync._download_hf_parquet_files", return_value=fake_files):
        result = sync_from_hf(
            hf_repo="lmms-lab/COCO-Caption",
            bucket="warehouse",
            s3_prefix="coco/v1-small",
            endpoint_url="http://localhost:29000",
            access_key="minio",
            secret_key="minio123",
            max_samples=20,
        )
 
    assert result.rows_total == 20

Step 2: Run to verify they fail

uv run pytest tests/test_sync.py -v

Expected: ImportError — sync_from_hf not implemented.

Step 3: Implement lakehouse/sync.py

from __future__ import annotations
 
import io
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
 
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import list_repo_files, hf_hub_download
 
 
@dataclass
class SyncResult:
    hf_repo: str
    s3_prefix: str
    files_uploaded: int
    rows_total: int
 
 
def _download_hf_parquet_files(hf_repo: str) -> dict[str, bytes]:
    """Return {relative_path: parquet_bytes} for all parquet files in the repo."""
    files = {}
    for filename in list_repo_files(hf_repo, repo_type="dataset"):
        if not filename.endswith(".parquet"):
            continue
        local = hf_hub_download(repo_id=hf_repo, filename=filename, repo_type="dataset")
        with open(local, "rb") as f:
            files[filename] = f.read()
    return files
 
 
def sync_from_hf(
    hf_repo: str,
    bucket: str,
    s3_prefix: str,
    endpoint_url: str,
    access_key: str,
    secret_key: str,
    max_samples: Optional[int] = None,
) -> SyncResult:
    """Download Parquet files from HF Hub and upload to lakehouse S3."""
    raw_files = _download_hf_parquet_files(hf_repo)
 
    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="local",
    )
 
    files_uploaded = 0
    rows_total = 0
 
    for relative_path, parquet_bytes in raw_files.items():
        table = pq.read_table(io.BytesIO(parquet_bytes))
 
        if max_samples is not None:
            remaining = max_samples - rows_total
            if remaining <= 0:
                break
            table = table.slice(0, min(remaining, len(table)))
 
        rows_total += len(table)
        out_buf = io.BytesIO()
        pq.write_table(table, out_buf)
        out_buf.seek(0)
 
        filename = Path(relative_path).name
        key = f"{s3_prefix}/{filename}"
        s3.put_object(Bucket=bucket, Key=key, Body=out_buf.read())
        files_uploaded += 1
 
        if max_samples is not None and rows_total >= max_samples:
            break
 
    return SyncResult(
        hf_repo=hf_repo,
        s3_prefix=f"s3://{bucket}/{s3_prefix}/",
        files_uploaded=files_uploaded,
        rows_total=rows_total,
    )

Step 4: Run tests and verify they pass

uv run pytest tests/test_sync.py -v

Expected: 2 tests pass.
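
The max_samples handling in sync_from_hf amounts to spending a row budget across files in order. The pure-Python sketch below isolates that arithmetic; plan_row_budget is a hypothetical helper, not a function in lakehouse/sync.py.

```python
from typing import Optional


def plan_row_budget(file_rows: list[int], max_samples: Optional[int]) -> list[int]:
    """Rows to keep from each file, in order, stopping once the budget is spent."""
    if max_samples is None:
        return list(file_rows)  # no cap: every file is uploaded whole
    taken: list[int] = []
    remaining = max_samples
    for n in file_rows:
        if remaining <= 0:
            break  # later files are never uploaded
        keep = min(n, remaining)
        taken.append(keep)
        remaining -= keep
    return taken


single_file = plan_row_budget([100], 20)       # the case in test_sync_respects_max_samples
three_files = plan_row_budget([10, 10, 10], 15)
```

The length of the returned list corresponds to files_uploaded and its sum to rows_total.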

Step 5: Commit

git add lakehouse/sync.py tests/test_sync.py
git commit -m "feat: implement sync_from_hf for HF Hub → lakehouse S3 Parquet ingestion"

Task 5: Query — Typed DuckDB Queries Over the Catalog

Files:

Modify: lakehouse/query.py
Modify: tests/test_query.py

What this builds

LakehouseQuery wraps a DuckDB connection with S3 secrets configured and exposes typed query methods. Tests write a small Parquet fixture to a temporary directory with PyArrow and query it in local mode, without touching the lakehouse.

Step 1: Write failing tests in tests/test_query.py

import pytest
import tempfile
import os
import io
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from lakehouse.query import LakehouseQuery, QueryResult
 
 
@pytest.fixture
def parquet_dir(tmp_path):
    """Create a small Parquet file the query layer can read locally."""
    table = pa.table({
        "image_id": list(range(50)),
        "caption": [f"a photo showing object number {i}" for i in range(50)],
        "width": [640] * 50,
        "height": [480] * 50,
    })
    pq.write_table(table, tmp_path / "part-0.parquet")
    return tmp_path
 
 
@pytest.fixture
def query(parquet_dir):
    q = LakehouseQuery(
        s3_endpoint=None,  # local mode — reads from filesystem paths
        access_key=None,
        secret_key=None,
    )
    q.register_local_path("coco_test", str(parquet_dir / "*.parquet"))
    yield q
    q.close()
 
 
def test_query_all_returns_arrow(query):
    result = query.execute("SELECT * FROM coco_test")
    assert isinstance(result, QueryResult)
    assert result.num_rows == 50
 
 
def test_query_filter(query):
    result = query.execute("SELECT * FROM coco_test WHERE image_id < 10")
    assert result.num_rows == 10
 
 
def test_query_to_pandas(query):
    result = query.execute("SELECT image_id, caption FROM coco_test LIMIT 5")
    df = result.to_pandas()
    assert len(df) == 5
    assert "caption" in df.columns
 
 
def test_query_to_arrow(query):
    result = query.execute("SELECT * FROM coco_test LIMIT 3")
    table = result.to_arrow()
    assert isinstance(table, pa.Table)
    assert table.num_rows == 3
 
 
def test_query_sample(query):
    result = query.sample("coco_test", n=10)
    assert result.num_rows == 10
 
 
def test_query_caption_keyword(query):
    result = query.execute(
        "SELECT * FROM coco_test WHERE caption LIKE '%object number 1%'"
    )
    # Substring match: ids 1 and 10 through 19, so 11 of the 50 rows.
    assert result.num_rows == 11

Step 2: Run to verify they fail

uv run pytest tests/test_query.py -v

Expected: ImportError.

Step 3: Implement lakehouse/query.py

from __future__ import annotations
 
from typing import Optional
 
import duckdb
import pyarrow as pa
import pandas as pd
 
 
class QueryResult:
    def __init__(self, relation: duckdb.DuckDBPyRelation) -> None:
        self._rel = relation
        self._arrow: Optional[pa.Table] = None
 
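    def _materialise(self) -> pa.Table:
        # Ask for a Table explicitly rather than relying on the return type
        # of the arrow() shorthand; the result is cached after the first call.
        if self._arrow is None:
            self._arrow = self._rel.to_arrow_table()
        return self._arrow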
 
    @property
    def num_rows(self) -> int:
        return len(self._materialise())
 
    def to_arrow(self) -> pa.Table:
        return self._materialise()
 
    def to_pandas(self) -> pd.DataFrame:
        return self._materialise().to_pandas()
 
 
class LakehouseQuery:
    def __init__(
        self,
        s3_endpoint: Optional[str],
        access_key: Optional[str],
        secret_key: Optional[str],
    ) -> None:
        self._conn = duckdb.connect()
        if s3_endpoint:
            self._conn.execute("INSTALL httpfs; LOAD httpfs;")
            self._conn.execute(f"SET s3_endpoint='{s3_endpoint}';")
            self._conn.execute(f"SET s3_access_key_id='{access_key}';")
            self._conn.execute(f"SET s3_secret_access_key='{secret_key}';")
            self._conn.execute("SET s3_use_ssl=false;")
            self._conn.execute("SET s3_url_style='path';")
 
    def register_local_path(self, name: str, glob_path: str) -> None:
        """Register a local or S3 glob pattern as a named view."""
        self._conn.execute(
            f"CREATE OR REPLACE VIEW {name} AS SELECT * FROM read_parquet('{glob_path}')"
        )
 
    def execute(self, sql: str) -> QueryResult:
        return QueryResult(self._conn.sql(sql))
 
    def sample(self, table_or_view: str, n: int, seed: Optional[int] = None) -> QueryResult:
        seed_clause = f"REPEATABLE ({seed})" if seed is not None else ""
        sql = f"SELECT * FROM {table_or_view} USING SAMPLE {n} ROWS {seed_clause}"
        return QueryResult(self._conn.sql(sql))
 
    def close(self) -> None:
        self._conn.close()

Step 4: Run tests and verify they pass

uv run pytest tests/test_query.py -v

Expected: 6 tests pass.
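
QueryResult defers execution until a result is first needed and then reuses the materialised table. The same pattern, stripped of DuckDB, looks roughly like this; LazyResult is a hypothetical stand-in, and the lambda plays the role of the relation.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResult(Generic[T]):
    """Run an expensive computation at most once, on first access."""

    def __init__(self, compute: Callable[[], T]) -> None:
        self._compute = compute
        self._value: Optional[T] = None
        self._done = False  # separate flag so a legitimate None result is cached too

    def get(self) -> T:
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value


calls = []
result = LazyResult(lambda: calls.append("ran") or [1, 2, 3])
rows_a = result.get()
rows_b = result.get()  # served from the cache, compute is not called again
```

The trade-off is memory: once materialised, the full result stays alive as long as the QueryResult does, which is why the streaming path in Task 6 bypasses it.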

Step 5: Commit

git add lakehouse/query.py tests/test_query.py
git commit -m "feat: implement LakehouseQuery with typed Arrow/pandas results"

Task 6: Stream — HF IterableDataset Egress

Files:

Modify: lakehouse/stream.py
Modify: tests/test_stream.py

What this builds

as_iterable_dataset() wraps a LakehouseQuery result in an HF IterableDataset, enabling training code to consume lakehouse data with the standard datasets API. Uses fetch_arrow_reader() for zero-copy streaming.

Step 1: Write failing tests in tests/test_stream.py

import pytest
import io
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import IterableDataset
from lakehouse.query import LakehouseQuery
from lakehouse.stream import as_iterable_dataset, stream_batches
 
 
@pytest.fixture
def query_with_data(tmp_path):
    table = pa.table({
        "image_id": list(range(100)),
        "caption": [f"caption {i}" for i in range(100)],
        "width": [320] * 100,
    })
    pq.write_table(table, tmp_path / "data.parquet")
    q = LakehouseQuery(s3_endpoint=None, access_key=None, secret_key=None)
    q.register_local_path("ds", str(tmp_path / "*.parquet"))
    yield q
    q.close()
 
 
def test_as_iterable_dataset_returns_correct_type(query_with_data):
    ds = as_iterable_dataset(query_with_data, "SELECT * FROM ds")
    assert isinstance(ds, IterableDataset)
 
 
def test_iterable_dataset_yields_all_rows(query_with_data):
    ds = as_iterable_dataset(query_with_data, "SELECT * FROM ds")
    rows = list(ds)
    assert len(rows) == 100
 
 
def test_iterable_dataset_has_correct_fields(query_with_data):
    ds = as_iterable_dataset(query_with_data, "SELECT * FROM ds LIMIT 1")
    row = next(iter(ds))
    assert "image_id" in row
    assert "caption" in row
 
 
def test_iterable_dataset_respects_limit(query_with_data):
    ds = as_iterable_dataset(query_with_data, "SELECT * FROM ds LIMIT 10")
    rows = list(ds)
    assert len(rows) == 10
 
 
def test_stream_batches_yields_arrow_tables(query_with_data):
    batches = list(stream_batches(query_with_data, "SELECT * FROM ds", batch_size=32))
    total = sum(b.num_rows for b in batches)
    assert total == 100
    assert all(isinstance(b, pa.RecordBatch) for b in batches)
 
 
def test_epoch_reshuffle_differs(query_with_data):
    sql = "SELECT * FROM ds ORDER BY random()"
    epoch1 = [row["image_id"] for row in as_iterable_dataset(query_with_data, sql)]
    epoch2 = [row["image_id"] for row in as_iterable_dataset(query_with_data, sql)]
    # With high probability, random order differs between epochs
    assert epoch1 != epoch2

Step 2: Run to verify they fail

uv run pytest tests/test_stream.py -v

Expected: ImportError.

Step 3: Implement lakehouse/stream.py

from __future__ import annotations
 
from typing import Iterator
 
import pyarrow as pa
from datasets import IterableDataset
 
from lakehouse.query import LakehouseQuery
 
 
def stream_batches(
    query: LakehouseQuery,
    sql: str,
    batch_size: int = 256,
) -> Iterator[pa.RecordBatch]:
    """Yield Arrow RecordBatches from a DuckDB query without materialising the full result."""
    # fetch_arrow_reader() is a relation method, so build a relation with
    # sql() rather than calling execute(), which returns the connection.
    reader = query._conn.sql(sql).fetch_arrow_reader(batch_size)
    yield from reader
 
 
def as_iterable_dataset(
    query: LakehouseQuery,
    sql: str,
    batch_size: int = 256,
) -> IterableDataset:
    """
    Wrap a DuckDB SQL query as an HF IterableDataset.
 
    Training code using this is identical to load_dataset(..., streaming=True).
    Data stays in the lakehouse; HF Hub is not involved at runtime.
 
    To reshuffle between epochs, change the SQL (e.g. append 'ORDER BY random()').
    """
    def _generator():
        for batch in stream_batches(query, sql, batch_size):
            yield from batch.to_pylist()
 
    return IterableDataset.from_generator(_generator)

Step 4: Run tests and verify they pass

uv run pytest tests/test_stream.py -v

Expected: 6 tests pass. Note: test_epoch_reshuffle_differs has a tiny probability of false failure (two identical random permutations of 100 items). Re-run if it fails once.
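
The generator inside as_iterable_dataset re-runs the query each time it is called, which is what makes one epoch independent of the next. The sketch below shows that shape without DuckDB or datasets: plain lists of dicts stand in for Arrow record batches, and make_row_generator and chunked are hypothetical names.

```python
from typing import Callable, Iterable, Iterator

Row = dict
Batch = list  # stand-in for pa.RecordBatch: a list of row dicts


def make_row_generator(open_batches: Callable[[], Iterable[Batch]]) -> Callable[[], Iterator[Row]]:
    """Return a zero-argument generator function that flattens batches into rows."""
    def _generator() -> Iterator[Row]:
        # open_batches() is called on every invocation, i.e. once per epoch.
        for batch in open_batches():
            yield from batch
    return _generator


def chunked(rows: list, size: int) -> Iterator[Batch]:
    """Split rows into consecutive batches of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


rows = [{"image_id": i, "caption": f"caption {i}"} for i in range(100)]
gen = make_row_generator(lambda: chunked(rows, 32))
epoch1 = list(gen())
epoch2 = list(gen())
```

Because the source here is deterministic, both epochs come back in the same order; with ORDER BY random() in the SQL, each call would see a fresh permutation.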

Step 5: Commit

git add lakehouse/stream.py tests/test_stream.py
git commit -m "feat: implement streaming egress via HF IterableDataset and Arrow reader"

Task 7: Tools — Pydantic-AI Agent Tool Definitions

Files:

Modify: lakehouse/tools.py
Modify: tests/test_tools.py

What this builds

build_lakehouse_agent() returns a configured Pydantic-AI Agent with lakehouse query, sample, and quality-check tools registered. The same Pydantic models that define tool schemas will later generate the MCP server schema (Phase 2).

Step 1: Write failing tests in tests/test_tools.py

import pytest
import pyarrow as pa
import pyarrow.parquet as pq
from unittest.mock import AsyncMock, patch, MagicMock
from lakehouse.query import LakehouseQuery
from lakehouse.tools import build_lakehouse_agent, QualityReport
 
 
@pytest.fixture
def query_fixture(tmp_path):
    table = pa.table({
        "image_id": list(range(20)),
        "caption": [f"a cat sitting on a mat {i}" for i in range(20)],
        "width": [640] * 20,
        "height": [480] * 20,
    })
    pq.write_table(table, tmp_path / "data.parquet")
    q = LakehouseQuery(s3_endpoint=None, access_key=None, secret_key=None)
    q.register_local_path("coco", str(tmp_path / "*.parquet"))
    yield q
    q.close()
 
 
def test_quality_report_fields():
    report = QualityReport(
        experiment_id="exp-001",
        total_rows=1000,
        null_caption_count=2,
        mean_caption_length=45.3,
        passed=True,
    )
    assert report.passed is True
    assert report.null_caption_count == 2
 
 
def test_build_lakehouse_agent_returns_agent(query_fixture):
    from pydantic_ai import Agent
    # model="test" selects pydantic-ai's built-in TestModel, so constructing
    # the agent needs no provider API key.
    agent = build_lakehouse_agent(query_fixture, view_name="coco", model="test")
    assert isinstance(agent, Agent)
 
 
@pytest.mark.asyncio
async def test_agent_can_call_quality_check(query_fixture):
    agent = build_lakehouse_agent(query_fixture, view_name="coco", model="test")
    with patch("pydantic_ai.Agent.run", new_callable=AsyncMock) as mock_run:
        mock_result = MagicMock()
        mock_result.data = "Quality check passed: 20 rows, 0 nulls"
        mock_run.return_value = mock_result
        result = await agent.run("Run a quality check on coco")
        assert mock_run.called

Step 2: Run to verify they fail

uv run pytest tests/test_tools.py -v

Expected: ImportError.

Step 3: Add pytest-asyncio to dependencies

Add to pyproject.toml:

    "pytest-asyncio>=0.23.0",

Run: uv sync

Also enable auto mode for async tests, either in pyproject.toml as shown here or with the equivalent setting in pytest.ini:

[tool.pytest.ini_options]
asyncio_mode = "auto"
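
The async test in Step 1 replaces Agent.run with an AsyncMock, so it never reaches a model. The mocking pattern on its own, using only the standard library, is sketched below; FakeAgent is a hypothetical class standing in for pydantic_ai.Agent, and asyncio.run takes the place of the pytest-asyncio runner.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeAgent:
    """Hypothetical stand-in for an agent whose run() would call a real model."""

    async def run(self, prompt: str) -> str:
        raise RuntimeError("would call a real model")


async def demo() -> str:
    # Patch the coroutine method on the class; awaiting the mock yields return_value.
    with patch.object(FakeAgent, "run", new_callable=AsyncMock) as mock_run:
        mock_run.return_value = "Quality check passed: 20 rows, 0 nulls"
        result = await FakeAgent().run("Run a quality check on coco")
        mock_run.assert_awaited_once()
        return result


RESULT = asyncio.run(demo())
```

A test written this way verifies the call plumbing only; it says nothing about whether the quality_check tool itself produces correct numbers.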

Step 4: Implement lakehouse/tools.py

from __future__ import annotations
 
from dataclasses import dataclass
from typing import Optional
 
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext
 
from lakehouse.query import LakehouseQuery
 
 
class QualityReport(BaseModel):
    experiment_id: str
    total_rows: int
    null_caption_count: int
    mean_caption_length: float
    passed: bool
 
 
class SampleRequest(BaseModel):
    view_name: str
    n: int = 10
    seed: Optional[int] = None
 
 
def build_lakehouse_agent(
    query: LakehouseQuery,
    view_name: str,
    model: str = "claude-sonnet-4-6",
) -> Agent:
    """
    Build a Pydantic-AI Agent with lakehouse tools registered.
 
    The same Pydantic models here generate the MCP server tool schemas in Phase 2.
    All tools are synchronous wrappers — the Agent handles async scheduling.
    """
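    # The output type defaults to str, so it is left implicit; the keyword for
    # it was renamed from result_type to output_type across pydantic-ai releases.
    agent: Agent[LakehouseQuery, str] = Agent(
        model,
        deps_type=LakehouseQuery,
        system_prompt=(
            "You are a lakehouse orchestration agent. "
            "Use the available tools to query, sample, and validate datasets."
        ),
    )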
 
    @agent.tool
    def query_dataset(ctx: RunContext[LakehouseQuery], sql: str) -> dict:
        """Execute a SQL query against the lakehouse and return summary statistics."""
        result = ctx.deps.execute(sql)
        table = result.to_arrow()
        return {
            "num_rows": table.num_rows,
            "columns": table.schema.names,
            "preview": table.slice(0, 3).to_pydict(),
        }
 
    @agent.tool
    def sample_dataset(ctx: RunContext[LakehouseQuery], req: SampleRequest) -> dict:
        """Sample N rows from a lakehouse view, optionally with a fixed seed."""
        result = ctx.deps.sample(req.view_name, n=req.n, seed=req.seed)
        return result.to_arrow().to_pydict()
 
    @agent.tool
    def quality_check(
        ctx: RunContext[LakehouseQuery],
        experiment_id: str,
        caption_column: str = "caption",
    ) -> QualityReport:
        """
        Run a basic quality check on the active dataset view.
        Returns a QualityReport with null counts and caption length statistics.
        """
        total = ctx.deps.execute(f"SELECT count(*) FROM {view_name}").to_arrow()[0][0].as_py()
        null_count = ctx.deps.execute(
            f"SELECT count(*) FROM {view_name} WHERE {caption_column} IS NULL"
        ).to_arrow()[0][0].as_py()
        mean_len = ctx.deps.execute(
            f"SELECT avg(length({caption_column})) FROM {view_name} WHERE {caption_column} IS NOT NULL"
        ).to_arrow()[0][0].as_py() or 0.0
 
        return QualityReport(
            experiment_id=experiment_id,
            total_rows=total,
            null_caption_count=null_count,
            mean_caption_length=round(mean_len, 2),
            passed=(null_count == 0 and total > 0),
        )
 
    return agent

Step 5: Run tests and verify they pass

uv run pytest tests/test_tools.py -v

Expected: 3 tests pass.
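
The pass rule in quality_check is simple enough to state without SQL: no null captions and at least one row. A pure-Python sketch of the same statistics, useful as a reference when checking the SQL results by hand; summarise_captions is a hypothetical helper, and Python's len() stands in for SQL length().

```python
from statistics import fmean
from typing import Optional, Sequence


def summarise_captions(captions: Sequence[Optional[str]]) -> dict:
    """Compute the QualityReport statistics for an in-memory list of captions."""
    non_null = [c for c in captions if c is not None]
    null_count = len(captions) - len(non_null)
    # Mirror the SQL fallback: an empty or all-null column reports a mean of 0.0.
    mean_len = fmean([len(c) for c in non_null]) if non_null else 0.0
    return {
        "total_rows": len(captions),
        "null_caption_count": null_count,
        "mean_caption_length": round(mean_len, 2),
        "passed": null_count == 0 and len(captions) > 0,
    }


clean = summarise_captions(["cat", "dogs"])
with_null = summarise_captions(["cat", None])
empty = summarise_captions([])
```

A single null caption is enough to fail the check, which is strict for real datasets; a tolerance threshold would be a natural later refinement.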

Step 6: Commit

git add lakehouse/tools.py tests/test_tools.py pyproject.toml
git commit -m "feat: implement Pydantic-AI agent with query/sample/quality_check tools"

Task 8: Visualize — Rerun and W&B Routing

Files:

Modify: lakehouse/visualize.py
Modify: tests/test_visualize.py

What this builds

visualize(data, backend="auto") inspects the Arrow schema and routes to Rerun (spatial columns present) or W&B (scalar time-series). Tests mock both backends so no running Rerun viewer or W&B account is needed.

Step 1: Write failing tests in tests/test_visualize.py

import pytest
import pyarrow as pa
from unittest.mock import patch, MagicMock, call
from lakehouse.visualize import visualize, detect_backend, Backend
 
 
def test_detect_backend_spatial_goes_to_rerun():
    schema = pa.schema([
        ("frame_id", pa.int64()),
        ("timestamp", pa.float64()),
        ("image_bytes", pa.binary()),
        ("caption", pa.string()),
    ])
    assert detect_backend(schema) == Backend.RERUN
 
 
def test_detect_backend_scalar_goes_to_wandb():
    schema = pa.schema([
        ("step", pa.int64()),
        ("reward", pa.float64()),
        ("loss", pa.float32()),
    ])
    assert detect_backend(schema) == Backend.WANDB
 
 
def test_detect_backend_caption_only_goes_to_rerun():
    schema = pa.schema([
        ("image_id", pa.int64()),
        ("caption", pa.string()),
        ("width", pa.int32()),
        ("height", pa.int32()),
    ])
    assert detect_backend(schema) == Backend.RERUN
 
 
def test_visualize_routes_to_rerun():
    table = pa.table({
        "image_id": [0, 1],
        "caption": ["cat", "dog"],
        "width": [640, 640],
        "height": [480, 480],
    })
    with patch("lakehouse.visualize._log_to_rerun") as mock_rerun:
        visualize(table, backend="auto")
        mock_rerun.assert_called_once_with(table)
 
 
def test_visualize_routes_to_wandb():
    table = pa.table({
        "step": [0, 1, 2],
        "reward": [0.1, 0.5, 0.8],
        "loss": [1.0, 0.7, 0.4],
    })
    with patch("lakehouse.visualize._log_to_wandb") as mock_wb:
        visualize(table, backend="auto")
        mock_wb.assert_called_once_with(table)
 
 
def test_visualize_explicit_backend_overrides_auto():
    table = pa.table({"step": [0], "reward": [1.0]})
    with patch("lakehouse.visualize._log_to_rerun") as mock_rerun:
        visualize(table, backend="rerun")
        mock_rerun.assert_called_once()

Step 2: Run to verify they fail

uv run pytest tests/test_visualize.py -v

Expected: ImportError.

Step 3: Implement lakehouse/visualize.py

from __future__ import annotations
 
from enum import Enum
from typing import Literal
 
import pyarrow as pa
 
# Spatial column names that indicate Rerun is the right backend
_SPATIAL_COLUMNS = frozenset({
    "image_bytes", "image_path", "frame_id", "point_cloud",
    "trajectory", "transform", "pose", "bbox", "embedding",
    "caption", "width", "height",
})
 
# Scalar metric column names that indicate W&B is the right backend
_METRIC_COLUMNS = frozenset({
    "reward", "loss", "accuracy", "step", "epoch",
    "success_rate", "value", "metric",
})
 
 
class Backend(str, Enum):
    RERUN = "rerun"
    WANDB = "wandb"
 
 
def detect_backend(schema: pa.Schema) -> Backend:
    """
    Inspect column names to determine the appropriate visualisation backend.
    Rerun: spatial, temporal, or multimodal data (images, captions, poses).
    W&B: scalar time-series (reward, loss, accuracy).
    When ambiguous, prefer Rerun.
    """
    col_names = set(schema.names)
    if col_names & _SPATIAL_COLUMNS:
        return Backend.RERUN
    if col_names & _METRIC_COLUMNS:
        return Backend.WANDB
    return Backend.RERUN  # default
 
 
def _log_to_rerun(table: pa.Table) -> None:
    """Log an Arrow table to a Rerun recording stream."""
    import rerun as rr
 
    rr.init("lakehouse", spawn=True)
    for i, row in enumerate(table.to_pylist()):
        timestamp = row.get("timestamp", float(i))
        rr.set_time_seconds("time", timestamp)
        entity = f"sample/{row.get('image_id', i)}"
 
        if row.get("image_bytes") is not None:
            # image_bytes is assumed to hold an encoded file (JPEG/PNG), so log
            # it as an EncodedImage; rr.Image expects a decoded pixel array.
            rr.log(entity, rr.EncodedImage(contents=row["image_bytes"]))
        if row.get("caption") is not None:
            rr.log(f"{entity}/caption", rr.TextLog(str(row["caption"])))
        if row.get("bbox") is not None:
            # Boxes2D needs an explicit array_format; XYWH is assumed here
            # because COCO annotations store boxes as [x, y, width, height].
            rr.log(f"{entity}/bbox", rr.Boxes2D(array=row["bbox"], array_format=rr.Box2DFormat.XYWH))
 
 
def _log_to_wandb(table: pa.Table) -> None:
    """Log scalar columns from an Arrow table to W&B as time-series metrics."""
    import wandb
 
    df = table.to_pandas()
    step_col = "step" if "step" in df.columns else None
    for _, row in df.iterrows():
        metrics = {k: v for k, v in row.items() if k != step_col}
        step = int(row[step_col]) if step_col else None
        wandb.log(metrics, step=step)
 
 
def visualize(
    data: pa.Table,
    backend: Literal["auto", "rerun", "wandb"] = "auto",
) -> None:
    """
    Visualise an Arrow table in the appropriate backend.
 
    backend="auto": inspect schema columns to choose Rerun or W&B.
    backend="rerun": force Rerun regardless of schema.
    backend="wandb": force W&B regardless of schema.
    """
    if backend == "auto":
        chosen = detect_backend(data.schema)
    else:
        chosen = Backend(backend)
 
    if chosen == Backend.RERUN:
        _log_to_rerun(data)
    else:
        _log_to_wandb(data)

Step 4: Run tests and verify they pass

uv run pytest tests/test_visualize.py -v

Expected: 6 tests pass.
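
Because detect_backend checks the spatial set first, a table that mixes metric and spatial column names goes to Rerun. The sketch below restates that rule with plain Python sets so the precedence can be checked without pyarrow. The route function and the shortened column sets are illustrative stand-ins, not the module's API.

```python
# Shortened stand-ins for _SPATIAL_COLUMNS and _METRIC_COLUMNS above.
SPATIAL = frozenset({"image_bytes", "frame_id", "bbox", "caption", "width", "height"})
METRIC = frozenset({"reward", "loss", "accuracy", "step", "epoch"})


def route(column_names: list[str]) -> str:
    """Mirror detect_backend: spatial columns win, then metrics, else Rerun."""
    names = set(column_names)
    if names & SPATIAL:
        return "rerun"
    if names & METRIC:
        return "wandb"
    return "rerun"  # same default as detect_backend


if __name__ == "__main__":
    print(route(["step", "loss"]))             # metrics only
    print(route(["step", "loss", "caption"]))  # mixed: spatial takes precedence
    print(route(["foo", "bar"]))               # no known columns: default
```

When a mixed table should go to W&B anyway, passing backend="wandb" to visualize() bypasses the detection.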

Step 5: Commit

git add lakehouse/visualize.py tests/test_visualize.py
git commit -m "feat: implement visualize() routing to Rerun or W&B by schema inspection"

Task 9: Full Unit Test Suite

Step 1: Run all unit tests together

uv run pytest tests/test_catalog.py tests/test_sync.py tests/test_query.py tests/test_stream.py tests/test_tools.py tests/test_visualize.py -v

Expected: all tests pass (approx 28 tests).

Step 2: If any failures, fix before continuing

Do not proceed to Task 10 with failing tests.

Step 3: Commit any fixes

git add -p
git commit -m "fix: resolve test failures before integration demo"

Task 10: COCO Demo Script

Files:

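Create: experiments/coco_demo.py
Create: experiments/__init__.py (empty; makes experiments importable from the notebook and is staged in Step 3)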

What this builds

An end-to-end runnable script that exercises all lakehouse layers in sequence. It requires the Docker Compose services listed under Prerequisites to be running.

Step 1: Create experiments/coco_demo.py

#!/usr/bin/env python3
"""
Experiment #0: COCO-Caption lakehouse demo.
 
Requires:
    docker-compose up -d   (RustFS on :29000, PostgreSQL on :55432)
    uv run python experiments/coco_demo.py
"""
 
import os
import asyncio
import tempfile
from pathlib import Path
from dotenv import load_dotenv
from rich.console import Console
from rich.panel import Panel
 
load_dotenv()
console = Console()
 
S3_ENDPOINT = os.getenv("S3_ENDPOINT", "http://localhost:29000")
S3_KEY = os.getenv("DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID", "minio")
S3_SECRET = os.getenv("DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY", "minio123")
HF_REPO = "lmms-lab/COCO-Caption"
MAX_SAMPLES = 1000
EXPERIMENT_ID = "coco-v1"
BUCKET = "warehouse"
S3_PREFIX = f"experiments/{EXPERIMENT_ID}/sim_0"
 
 
def step1_sync():
    console.print(Panel("[bold]Step 1: HF Hub → lakehouse sync[/bold]"))
    from lakehouse.sync import sync_from_hf
    result = sync_from_hf(
        hf_repo=HF_REPO,
        bucket=BUCKET,
        s3_prefix=S3_PREFIX,
        endpoint_url=S3_ENDPOINT,
        access_key=S3_KEY,
        secret_key=S3_SECRET,
        max_samples=MAX_SAMPLES,
    )
    console.print(f"  Uploaded {result.files_uploaded} file(s), {result.rows_total} rows")
    console.print(f"  S3 prefix: {result.s3_prefix}")
    return result
 
 
def step2_catalog(tmp_path: str):
    console.print(Panel("[bold]Step 2: Register in DuckLake catalog[/bold]"))
    from lakehouse.catalog import LakehouseCatalog, Experiment, SimulationRun
    cat = LakehouseCatalog(
        catalog_path=f"{tmp_path}/coco-v1.ducklake",
        data_path=f"{tmp_path}/data",
    )
    cat.register_experiment(Experiment(
        experiment_id=EXPERIMENT_ID,
        project="lakehouse-demo",
        description="COCO-Caption 1k sample demo",
    ))
    cat.register_run(SimulationRun(
        run_id="run-coco-001",
        experiment_id=EXPERIMENT_ID,
        sim_index=0,
        config={"max_samples": MAX_SAMPLES, "hf_repo": HF_REPO},
        s3_prefix=f"s3://{BUCKET}/{S3_PREFIX}/",
    ))
    console.print(f"  Registered experiment: {EXPERIMENT_ID}")
    return cat
 
 
def step3_wandb(s3_prefix: str):
    console.print(Panel("[bold]Step 3: W&B init + Reference Artifact[/bold]"))
    try:
        import wandb
        run = wandb.init(
            project="lakehouse-demo",
            group=EXPERIMENT_ID,
            name="run-coco-001",
            config={"max_samples": MAX_SAMPLES, "hf_repo": HF_REPO},
            mode=os.getenv("WANDB_MODE", "disabled"),  # set to "online" to enable
        )
        artifact = wandb.Artifact(name="coco-caption-1k", type="dataset")
        artifact.add_reference(s3_prefix)
        wandb.log_artifact(artifact)
        wandb.log({"samples_ingested": MAX_SAMPLES})
        console.print(f"  W&B run: {run.name} (mode={os.getenv('WANDB_MODE', 'disabled')})")
        return run
    except Exception as e:
        console.print(f"  [yellow]W&B skipped: {e}[/yellow]")
        return None
 
 
def step4_query():
    console.print(Panel("[bold]Step 4: DuckDB query[/bold]"))
    from lakehouse.query import LakehouseQuery
    q = LakehouseQuery(
        s3_endpoint=S3_ENDPOINT.replace("http://", ""),
        access_key=S3_KEY,
        secret_key=S3_SECRET,
    )
    q.register_local_path(
        "coco",
        f"s3://{BUCKET}/{S3_PREFIX}/*.parquet",
    )
    result = q.execute("SELECT * FROM coco WHERE caption LIKE '%cat%' LIMIT 5")
    console.print(f"  Rows matching 'cat': {result.num_rows}")
    console.print(result.to_pandas()[["image_id", "caption"]].to_string(index=False))
    return q
 
 
def step5_stream(query):
    console.print(Panel("[bold]Step 5: HF IterableDataset streaming egress[/bold]"))
    from lakehouse.stream import as_iterable_dataset
    ds = as_iterable_dataset(query, "SELECT * FROM coco")
    count = sum(1 for _ in ds)
    console.print(f"  Streamed {count} rows via IterableDataset")
 
 
def step6_shuffle(query):
    console.print(Panel("[bold]Step 6: Epoch reshuffling[/bold]"))
    from lakehouse.stream import as_iterable_dataset
    ids_e1 = [r["image_id"] for r in as_iterable_dataset(query, "SELECT * FROM coco ORDER BY random()")]
    ids_e2 = [r["image_id"] for r in as_iterable_dataset(query, "SELECT * FROM coco ORDER BY random()")]
    same = ids_e1 == ids_e2
    console.print(f"  Epoch 1 == Epoch 2: {same} (should be False for good shuffle)")
 
 
def step7_rerun(query):
    console.print(Panel("[bold]Step 7: Rerun visualisation (10 samples)[/bold]"))
    from lakehouse.visualize import visualize
    result = query.execute("SELECT * FROM coco LIMIT 10")
    try:
        visualize(result.to_arrow(), backend="rerun")
        console.print("  Logged 10 samples to Rerun")
    except Exception as e:
        console.print(f"  [yellow]Rerun skipped (no viewer running): {e}[/yellow]")
 
 
def step8_wandb_charts(query, wandb_run):
    console.print(Panel("[bold]Step 8: W&B data quality charts[/bold]"))
    try:
        import wandb
        if wandb_run is None:
            console.print("  [yellow]Skipped — W&B not initialised[/yellow]")
            return
        result = query.execute(
            "SELECT length(caption) AS cap_len, width, height FROM coco"
        )
        df = result.to_pandas()
        wandb.log({
            "caption_length_mean": df["cap_len"].mean(),
            "caption_length_std": df["cap_len"].std(),
        })
        console.print(f"  Caption length: mean={df['cap_len'].mean():.1f}, std={df['cap_len'].std():.1f}")
    except Exception as e:
        console.print(f"  [yellow]Skipped: {e}[/yellow]")
 
 
async def step9_pydantic_agent(query):
    console.print(Panel("[bold]Step 9: Pydantic-AI quality-check agent[/bold]"))
    from lakehouse.tools import build_lakehouse_agent
    agent = build_lakehouse_agent(query, view_name="coco")
    try:
        result = await agent.run(
            "Run a quality check on the coco dataset for experiment coco-v1",
            deps=query,
        )
        console.print(f"  Agent result: {result.data}")
    except Exception as e:
        console.print(f"  [yellow]Agent skipped (no API key?): {e}[/yellow]")
 
 
async def main():
    console.print(Panel("[bold green]COCO-Caption Lakehouse Demo — Experiment #0[/bold green]"))
 
    with tempfile.TemporaryDirectory() as tmp:
        sync_result = step1_sync()
        cat = step2_catalog(tmp)
        wb_run = step3_wandb(sync_result.s3_prefix)
        query = step4_query()
        step5_stream(query)
        step6_shuffle(query)
        step7_rerun(query)
        step8_wandb_charts(query, wb_run)
        await step9_pydantic_agent(query)
        cat.complete_run("run-coco-001")
        cat.close()
        query.close()
 
    if wb_run:
        import wandb
        wandb.finish()
 
    console.print(Panel("[bold green]Demo complete.[/bold green]"))
 
 
if __name__ == "__main__":
    asyncio.run(main())

Step 2: Run the script (requires Docker Compose up)

uv run python experiments/coco_demo.py

Expected: all 9 steps print with either output or a yellow skipped notice. No unhandled exceptions.
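
Step 6 compares only the order of the two epochs. A slightly stricter check, sketched here with Python's random module standing in for ORDER BY random(), also confirms that both epochs contain the same ids. is_reshuffle is a hypothetical helper, not part of the lakehouse package.

```python
import random


def is_reshuffle(epoch_a: list[int], epoch_b: list[int]) -> bool:
    """True when both epochs hold the same ids but in a different order."""
    same_ids = sorted(epoch_a) == sorted(epoch_b)
    return same_ids and epoch_a != epoch_b


if __name__ == "__main__":
    ids = list(range(1000))
    # Two independently seeded shuffles stand in for two runs of ORDER BY random().
    epoch_1 = random.Random(1).sample(ids, k=len(ids))
    epoch_2 = random.Random(2).sample(ids, k=len(ids))
    print(is_reshuffle(epoch_1, epoch_2))
```

In the demo, the two lists would be ids_e1 and ids_e2 from step6_shuffle.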

Step 3: Commit

git add experiments/coco_demo.py experiments/__init__.py
git commit -m "feat: add COCO-Caption demo script (Experiment #0)"

Task 11: COCO Demo Notebook

Files:

Create: notebooks/coco_demo.ipynb

Step 1: Create the notebook

Run in the repo root to write the notebook file (Step 2 executes it):

mkdir -p notebooks
cat > notebooks/coco_demo.ipynb <<'EOF'
{
 "cells": [
  {"cell_type":"markdown","source":["# COCO-Caption Lakehouse Demo\n\nExperiment #0 — walks through every layer of the Modal Lakehouse architecture.\n\nSee design doc: `docs/data-plane/coco-demo-design.mdx`"],"metadata":{}},
  {"cell_type":"code","source":["import os\nos.environ.setdefault('WANDB_MODE', 'disabled')\nfrom experiments.coco_demo import *"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Step 1 — HF Hub → lakehouse Sync"],"metadata":{}},
  {"cell_type":"code","source":["sync_result = step1_sync()\nprint(sync_result)"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Step 2 — DuckLake Catalog Registration"],"metadata":{}},
  {"cell_type":"code","source":["import tempfile\n_tmp = tempfile.mkdtemp()\ncat = step2_catalog(_tmp)\nexp = cat.get_experiment(EXPERIMENT_ID)\nprint(exp)"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Step 4 — DuckDB Query"],"metadata":{}},
  {"cell_type":"code","source":["query = step4_query()"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Step 5 — Streaming Egress via HF IterableDataset"],"metadata":{}},
  {"cell_type":"code","source":["from lakehouse.stream import as_iterable_dataset\nds = as_iterable_dataset(query, 'SELECT * FROM coco LIMIT 5')\nfor row in ds:\n    print(row['image_id'], row['caption'][:60])"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Step 6 — Epoch Reshuffling"],"metadata":{}},
  {"cell_type":"code","source":["step6_shuffle(query)"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Step 8 — Data Quality Summary"],"metadata":{}},
  {"cell_type":"code","source":["result = query.execute('SELECT length(caption) AS cap_len FROM coco')\ndf = result.to_pandas()\ndf['cap_len'].describe()"],"outputs":[],"metadata":{},"execution_count":null},
  {"cell_type":"markdown","source":["## Cleanup"],"metadata":{}},
  {"cell_type":"code","source":["cat.complete_run('run-coco-001')\ncat.close()\nquery.close()\nprint('Done')"],"outputs":[],"metadata":{},"execution_count":null}
 ],
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.12.0"}
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
EOF

Alternatively, create the notebook manually in JupyterLab and run cells in order. The key requirement is that all cells execute without error.
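
If the heredoc is awkward to maintain, the same notebook skeleton can be generated with Python's json module. This is a sketch under assumptions: make_notebook is a hypothetical helper, only two cells are shown, and nbformat_minor is set to 4 because the cells here carry no id field.

```python
import json


def make_notebook(cells: list[tuple[str, str]]) -> dict:
    """Build a minimal nbformat-4 notebook dict from (cell_type, source) pairs."""
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells additionally need outputs and execution_count.
            cell["outputs"] = []
            cell["execution_count"] = None
        nb_cells.append(cell)
    return {
        "cells": nb_cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }


if __name__ == "__main__":
    nb = make_notebook([
        ("markdown", "# COCO-Caption Lakehouse Demo"),
        ("code", "from experiments.coco_demo import *"),
    ])
    print(json.dumps(nb, indent=1))
```

Redirecting the printed JSON to notebooks/coco_demo.ipynb gives a file that Step 2 can execute.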

Step 2: Verify notebook runs clean

uv run jupyter nbconvert --to notebook --execute --inplace notebooks/coco_demo.ipynb

Expected: no CellExecutionError.

Step 3: Commit

git add notebooks/coco_demo.ipynb
git commit -m "feat: add COCO-Caption demo notebook (Experiment #0)"

Task 12: Final Integration Test

Step 1: Run the full test suite

uv run pytest tests/ -v --tb=short

Expected: all tests pass.

Step 2: Run the demo script end-to-end

docker-compose up -d
uv run python experiments/coco_demo.py

Expected: 9 steps complete, no unhandled exceptions.

Step 3: Final commit

git add .
git commit -m "feat: complete COCO-Caption lakehouse demo (Experiment #0)
 
- lakehouse/ package: catalog, sync, query, stream, tools, visualize
- HF IterableDataset streaming egress via DuckDB Arrow reader
- Pydantic-AI agent with quality-check, query, sample tools
- W&B loose coupling via Reference Artifacts
- Rerun visualisation routing by schema inspection
- experiments/coco_demo.py + notebooks/coco_demo.ipynb"

What Is NOT in This Plan (Deferred to Phase 2+)

Feature	Phase
MCP server (`mcp_server.py`)	2
Zenoh ingest subscriber	3
K > 1 parallel simulators	4
NATS/JetStream control events	4
Ray Data distributed egress	5
`load_dataset("lakehouse", ...)` custom DatasetBuilder	5
