
Data-Plane Schema Reference

The Auraison data plane stores all metadata in a single PostgreSQL database attached via the DuckLake extension. Tables fall into two groups:

  • DuckLake system tables (ducklake_*) — 19 tables created automatically by the extension; they implement MVCC versioning, the table/column catalog, Parquet file tracking, partitioning, and column statistics.
  • Auraison custom tables — created by LakehouseCatalog._init_schema() (datasets, multimodal_samples, experiments, simulation_runs, video_segments) or by user notebooks and scripts (e.g. coco_caption).

Architecture Overview

The diagram below groups all tables by subsystem and shows how they connect to MinIO object storage.


DuckLake System Tables

Entity Relationships — Core Spine

The diagram covers the nine tables that form the read/write critical path: snapshot versioning, the table/column catalog, and Parquet file tracking. The remaining ten tables (views, tags, partitioning, bulk statistics, inlined data, and vacuum scheduling) are documented in the reference tables below.

Reference — All DuckLake Tables

Versioning group

| Table | Purpose |
| --- | --- |
| ducklake_metadata | Key/value store for extension-level settings (catalog format version, etc.) |
| ducklake_snapshot | One row per write transaction; the snapshot_id is the time-travel handle |
| ducklake_snapshot_changes | Human-readable description of what each snapshot changed |

Catalog group

| Table | Purpose |
| --- | --- |
| ducklake_schema | Named schemas (namespaces); each row is live between begin_snapshot and end_snapshot |
| ducklake_table | Registered tables; versioned the same way as schemas |
| ducklake_view | Stored SQL views with dialect and optional column aliases |
| ducklake_column | Per-column metadata including type, ordering, nullability, and default values |
| ducklake_tag | Arbitrary key/value tags on any catalog object (table, schema, etc.) |
| ducklake_column_tag | Key/value tags scoped to a specific column within a table |

Storage group

| Table | Purpose |
| --- | --- |
| ducklake_data_file | One row per Parquet file; tracks path, format, record count, byte size, and which snapshot added it |
| ducklake_delete_file | Positional or equality delete files recording which rows have been logically deleted |
| ducklake_files_scheduled_for_deletion | Files that have been superseded and will be physically removed by the next VACUUM |
| ducklake_inlined_data_tables | Tiny tables stored directly in PostgreSQL (no Parquet file) up to a configurable row threshold |

Partitioning group

| Table | Purpose |
| --- | --- |
| ducklake_partition_info | Declares that a table is partitioned; versioned like schema/table |
| ducklake_partition_column | Maps partition key indices to column IDs and transform expressions (identity, year, bucket, etc.) |
| ducklake_file_partition_value | Per-file partition value enabling partition pruning without scanning file footers |

Statistics group

| Table | Purpose |
| --- | --- |
| ducklake_file_column_statistics | Per-file, per-column min/max/null/nan counts — fed into DuckDB's predicate pushdown |
| ducklake_table_stats | Table-level aggregate: total record count, next row ID, total byte size |
| ducklake_table_column_stats | Table-level per-column null/nan flags and global min/max (used for query planning) |
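
These system tables live in the ordinary PostgreSQL catalog database and can be inspected with plain SQL. The sketch below lists the live Parquet footprint per table; it assumes the catalog database has been attached separately as meta, and the attachment name and exact system column names are illustrative and should be checked against your DuckLake version:

-- Live Parquet footprint per table, read straight from the catalog database.
-- Assumes something like: ATTACH 'dbname=ducklake_catalog host=...' AS meta (TYPE postgres);
SELECT t.table_name,
       COUNT(*)               AS live_files,
       SUM(f.record_count)    AS total_rows,
       SUM(f.file_size_bytes) AS total_bytes
FROM meta.ducklake_data_file f
JOIN meta.ducklake_table t ON t.table_id = f.table_id
WHERE f.end_snapshot IS NULL          -- file still part of the current snapshot
  AND t.end_snapshot IS NULL          -- table not dropped or replaced since
GROUP BY t.table_name
ORDER BY total_bytes DESC;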

Auraison Custom Tables

Five tables are created automatically by LakehouseCatalog._init_schema() on every catalog open. Two are for experiment tracking (experiments, simulation_runs); two form the generic data layer (datasets, multimodal_samples) that accommodates any modality — image-text, audio-text, video-text, or sensor data — without schema changes; the fifth, video_segments, holds temporal annotations and is covered in the video section below.

Design rationale

A dataset-specific table (e.g. a coco_caption table with columns cocoid, sentences, sentids) would require a new schema for every new dataset. Instead, multimodal_samples normalises the universal fields (sample_id, split, primary_uri, captions) and stores dataset-specific fields (image filename, bounding boxes, audio duration, ROS bag metadata, …) in a metadata JSON column.
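
Because the dataset-specific fields stay in metadata, they remain queryable without any DDL. A minimal sketch against the COCO rows (the filepath key and value follow the mapping table further down and are illustrative):

-- Pull a dataset-specific field (COCO filename) out of the metadata JSON column.
SELECT sample_id,
       metadata->>'filename' AS filename,
       primary_uri
FROM multimodal_samples
WHERE dataset_id = 'yerevann/coco-karpathy'
  AND metadata->>'filepath' = 'val2014'   -- illustrative value
LIMIT 10;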

Entity Relationships

The dashed link between simulation_runs and multimodal_samples is logical, not a FK: simulation_runs.config->>'dataset_id' equals multimodal_samples.dataset_id, enabling the cross-plane join in use case 3 without any ETL:

SELECT d.name AS dataset, r.config->>'split' AS split, COUNT(s.sample_id) AS images
FROM simulation_runs r
JOIN datasets d ON d.dataset_id = 'yerevann/coco-karpathy'
JOIN multimodal_samples s
ON s.dataset_id = 'yerevann/coco-karpathy'
AND s.split = r.config->>'split'
GROUP BY 1, 2;

Column Reference

datasets

| Column | Type | Notes |
| --- | --- | --- |
| dataset_id | VARCHAR PK | Stable identifier, e.g. 'yerevann/coco-karpathy', 'librispeech/clean' |
| name | VARCHAR | Human-readable name |
| modalities | VARCHAR[] | e.g. ['image', 'text'], ['audio', 'text'], ['video', 'text'] |
| description | VARCHAR | Free-text |
| source_url | VARCHAR | HF Hub URL, paper URL, or internal link |
| registered_at | TIMESTAMP | UTC |
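
A hand-rolled registration row looks like this; the values are illustrative, and the ingest code may populate the table differently:

-- Register a dataset; modalities is a DuckDB VARCHAR[] list literal.
INSERT INTO datasets (dataset_id, name, modalities, description, source_url, registered_at)
VALUES ('yerevann/coco-karpathy',
        'COCO Captions (Karpathy split)',
        ['image', 'text'],
        'MS-COCO images with 5 reference captions per image',
        'https://huggingface.co/datasets/yerevann/coco-karpathy',
        now());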

multimodal_samples

| Column | Type | Notes |
| --- | --- | --- |
| sample_id | VARCHAR PK | Dataset-native ID cast to string (e.g. COCO cocoid, LibriSpeech utterance ID) |
| dataset_id | VARCHAR FK | References datasets.dataset_id |
| split | VARCHAR | train \| val \| test \| restval \| … |
| primary_uri | VARCHAR | S3 or HTTPS path to the primary asset (image, audio file, video clip) |
| annotations | JSON | Task-specific payload — see examples below |
| metadata | JSON | Asset-level fields (filename, resolution, fps, sensor config, …) |
| ingested_at | TIMESTAMP | UTC timestamp of ingest |
| caption_length | INTEGER | len(annotations->>'$.captions[0]') — added via ALTER TABLE (schema evolution demo; see the sketch below) |
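
The caption_length column is the schema-evolution demo: it is added after the table already holds data and then backfilled. A sketch of that step (the exact statement used in the notebook may differ):

-- Schema evolution: new column on a populated DuckLake table, then a backfill.
ALTER TABLE multimodal_samples ADD COLUMN caption_length INTEGER;
UPDATE multimodal_samples
SET caption_length = len(annotations->>'$.captions[0]')  -- length of the first caption
WHERE dataset_id = 'yerevann/coco-karpathy';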

annotations carries whatever the task requires, with no schema change needed per dataset:

| Task type | annotations shape | Example dataset |
| --- | --- | --- |
| Image captioning | {"captions": ["...", "..."], "num_captions": 5} | COCO-Caption, CC3M |
| Image classification | {"labels": ["cat"], "label_ids": [3]} | ImageNet, CIFAR |
| Object detection | {"boxes": [[x,y,w,h], …], "labels": ["car", …]} | COCO-Detection, LVIS |
| VQA | {"question": "…", "answers": ["…", "…"]} | VQAv2, GQA |
| ASR / audio-text | {"transcript": "…", "language": "en"} | LibriSpeech, CommonVoice |
| ROS2 / sensor (planned) | {"topics": ["/camera/image", "/imu/data"], "hz": 30} | rosbag episodes |
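
Since every shape is plain JSON, cross-dataset questions reduce to JSON extraction. A sketch, with key names following the shapes above:

-- How many samples per dataset carry at least one caption?
SELECT dataset_id, COUNT(*) AS captioned_samples
FROM multimodal_samples
WHERE json_array_length(annotations->'captions') > 0
GROUP BY dataset_id
ORDER BY captioned_samples DESC;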

COCO-Caption → multimodal_samples mapping

| COCO parquet column | maps to | Notes |
| --- | --- | --- |
| cocoid | sample_id | Cast to VARCHAR |
| url | primary_uri | Original COCO image URL |
| sentences | annotations.captions | All 5 reference captions |
| split | split | Unchanged |
| filename, imgid, filepath | metadata JSON | Asset-level; carried without schema lock-in |
| sentids | (dropped) | Internal HF annotation IDs; not needed downstream |
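
Expressed as SQL, the mapping is a single projection over the raw landing-zone Parquet. This is a sketch only, since the real ingest goes through sync_from_hf, and the landing path is illustrative:

-- Illustrative projection of the raw COCO-Karpathy parquet into multimodal_samples.
INSERT INTO multimodal_samples
    (sample_id, dataset_id, split, primary_uri, annotations, metadata, ingested_at)
SELECT CAST(cocoid AS VARCHAR),
       'yerevann/coco-karpathy',
       split,
       url,
       json_object('captions', sentences, 'num_captions', len(sentences)),
       json_object('filename', filename, 'imgid', imgid, 'filepath', filepath),
       now()
FROM read_parquet('s3://landing/coco-karpathy/*.parquet');   -- path is illustrative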

experiments

| Column | Type | Notes |
| --- | --- | --- |
| experiment_id | VARCHAR PK | UUID string assigned at registration |
| project | VARCHAR | Short slug, e.g. coco-caption-nb |
| description | VARCHAR | Free-text description of the experiment |
| created_at | TIMESTAMP | UTC timestamp at registration |

simulation_runs

| Column | Type | Notes |
| --- | --- | --- |
| run_id | VARCHAR PK | UUID string |
| experiment_id | VARCHAR FK | References experiments.experiment_id |
| sim_index | INTEGER | Zero-based index within the experiment |
| config | JSON | Arbitrary run config (split, max_samples, dataset_id, tier, …) |
| status | VARCHAR | started \| completed \| failed |
| s3_prefix | VARCHAR | MinIO path prefix of the run's raw input data |
| started_at | TIMESTAMP | UTC |
| completed_at | TIMESTAMP | UTC; NULL while in progress |
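
Run-level reporting needs nothing beyond JSON extraction on config; a sketch, with column names following the tables above:

-- Recent failed runs with the split and tier they were launched with.
SELECT r.run_id,
       e.project,
       r.config->>'split' AS split,
       r.config->>'tier'  AS tier,
       r.started_at
FROM simulation_runs r
JOIN experiments e USING (experiment_id)
WHERE r.status = 'failed'
ORDER BY r.started_at DESC;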

How the Two Layers Interact

Every write to an Auraison custom table (INSERT, ALTER, UPDATE) creates a new ducklake_snapshot row and is reflected in ducklake_data_file and, for DDL, in ducklake_table and ducklake_column. This is what enables time-travel queries against all of the custom tables using AT (VERSION => N).

Write path:
Python (LakehouseCatalog / DuckDB)
→ ducklake_snapshot (new snapshot_id)
→ ducklake_table / column (catalog update, if DDL)
→ ducklake_data_file (new Parquet file written to s3://warehouse)
→ ducklake_file_column_statistics (min/max/null stats for predicate pushdown)

Read path (time travel):
SELECT * FROM multimodal_samples AT (VERSION => 1)
→ DuckLake resolves snapshot_id = 1
→ reads ducklake_data_file WHERE begin_snapshot <= 1 AND (end_snapshot IS NULL OR end_snapshot > 1)
→ fetches those Parquet files from MinIO
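
A sketch of inspecting that history and comparing two versions of the same table. The snapshot number is illustrative, meta refers to a direct attachment of the PostgreSQL catalog as in the earlier sketch, and the system column names should be checked against your DuckLake version:

-- What did each snapshot change?
SELECT s.snapshot_id, s.snapshot_time, c.changes_made
FROM meta.ducklake_snapshot s
JOIN meta.ducklake_snapshot_changes c USING (snapshot_id)
ORDER BY s.snapshot_id;

-- Row count at an early version versus now, via time travel.
SELECT (SELECT COUNT(*) FROM multimodal_samples AT (VERSION => 1)) AS rows_at_v1,
       (SELECT COUNT(*) FROM multimodal_samples)                   AS rows_now;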

HF Video Datasets and video_segments

What works out of the box

HF video datasets come in two forms:

| Form | How stored in Parquet | Ingestion behaviour |
| --- | --- | --- |
| URL-based (large datasets — Kinetics, MSR-VTT) | struct<bytes: null, path: "https://..."> | sync_from_hf converts the struct to a plain URL string; that URL becomes primary_uri |
| Bytes-embedded (small datasets) | struct<bytes: binary, path: "video.mp4"> | sync_from_hf extracts each video to s3://landing/{prefix}/assets/video/{name}.mp4; the S3 URI becomes primary_uri |

The same logic applies to Image and Audio feature columns — sync_from_hf detects any struct<bytes, path> column and handles it identically regardless of modality.

After ingestion, the Parquet file stored in MinIO contains plain URI strings instead of binary blobs, so DuckDB and the streaming layer can query metadata without downloading assets.
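
The pre-ingestion shape can be previewed with DuckDB directly against the raw HF parquet. A sketch assuming a struct-typed video column; the file path and column name are illustrative:

-- Peek at the struct<bytes, path> column before sync_from_hf rewrites it.
SELECT video.path          AS source_path,
       video.bytes IS NULL AS is_url_based
FROM read_parquet('s3://landing/msr-vtt/raw/*.parquet')   -- illustrative path
LIMIT 5;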

Video-level tasks (video captioning, action recognition, VQA)

One video → one multimodal_samples row. annotations carries task-specific payloads; metadata carries video properties.

// MSR-VTT (video captioning)
{ "annotations": {"captions": ["A man plays guitar", "..."], "num_captions": 20},
"metadata": {"duration_s": 15.2, "fps": 30, "width": 320, "height": 240} }

// Kinetics-700 (action recognition)
{ "annotations": {"label": "playing guitar", "label_id": 289},
"metadata": {"duration_s": 10.0, "fps": 25, "start_s": 30.0, "end_s": 40.0} }

// MSRVTT-QA (video question answering)
{ "annotations": {"question": "What is the man doing?", "answer": "playing guitar"},
"metadata": {"duration_s": 15.2, "fps": 30} }

Temporal / dense annotations — video_segments

Datasets like ActivityNet Dense Captions, YouCook2, or AVA associate multiple timestamped annotations with a single video. These cannot be efficiently stored or queried as a flat JSON array in one row. video_segments is created automatically by LakehouseCatalog._init_schema():

CREATE TABLE IF NOT EXISTS video_segments (
    segment_id   VARCHAR NOT NULL,  -- UUID
    sample_id    VARCHAR NOT NULL,  -- FK → multimodal_samples.sample_id
    dataset_id   VARCHAR NOT NULL,  -- FK → datasets.dataset_id
    start_time_s DOUBLE,            -- segment start in seconds
    end_time_s   DOUBLE,            -- segment end in seconds
    segment_uri  VARCHAR,           -- S3 path to extracted clip (optional)
    annotations  JSON,              -- per-segment label, caption, bbox, etc.
    metadata     JSON               -- resolution, codec, frame count, etc.
);

ActivityNet Dense Captions example — one row in multimodal_samples, many in video_segments:

-- multimodal_samples row (video level)
INSERT INTO multimodal_samples
    (sample_id, dataset_id, split, primary_uri, annotations, metadata, ingested_at)
VALUES (
    'v_abc123', 'activitynet/v1-3', 'train',
    's3://landing/activitynet/v_abc123.mp4',
    json_object('num_segments', 7),
    json_object('duration_s', 180.0, 'fps', 29.97),
    now()
);

-- video_segments rows (one per temporal annotation)
INSERT INTO video_segments VALUES
    (gen_random_uuid(), 'v_abc123', 'activitynet/v1-3', 0.0, 5.2, NULL,
     json_object('caption', 'A person walks into the room'), NULL),
    (gen_random_uuid(), 'v_abc123', 'activitynet/v1-3', 5.2, 12.4, NULL,
     json_object('caption', 'They pick up a ball and throw it'), NULL),
    ...;

Cross-table query — dense captions for one video:

SELECT s.start_time_s, s.end_time_s, s.annotations->>'caption' AS caption
FROM video_segments s
JOIN multimodal_samples m ON m.sample_id = s.sample_id
WHERE m.dataset_id = 'activitynet/v1-3'
AND m.sample_id = 'v_abc123'
ORDER BY s.start_time_s;

Future: ROS 2 Bags and Sensor Data

ROS 2 bags map to multimodal_samples at the episode level and reuse video_segments for per-topic frame-level access — the same table covers both:

// multimodal_samples for a ROS 2 bag episode
{ "primary_uri": "s3://landing/turtlebot/episode_042.db3",
"annotations": {"topics": ["/camera/image_raw", "/imu/data", "/lidar/points"],
"msg_types": ["sensor_msgs/Image", "sensor_msgs/Imu", "sensor_msgs/PointCloud2"]},
"metadata": {"robot": "turtlebot4", "duration_s": 120.4,
"hz": {"camera": 30, "imu": 200, "lidar": 10}} }

// video_segments row for one extracted camera frame
{ "start_time_s": 1678000000.123, "end_time_s": 1678000000.156,
"segment_uri": "s3://landing/turtlebot/episode_042/camera/frame_00042.jpg",
"annotations": {"detections": [{"label": "wall", "bbox": [0, 0, 640, 120]}]},
"metadata": {"topic": "/camera/image_raw", "seq": 42, "encoding": "rgb8"} }
