
Data-Plane Schema Reference

The Auraison data plane stores all metadata in a single PostgreSQL database attached via the DuckLake extension. Tables fall into two groups:

  • DuckLake system tables (ducklake_*) — 19 tables created automatically by the extension; they implement MVCC versioning, the table/column catalog, Parquet file tracking, partitioning, and column statistics.
  • Auraison custom tables — created by LakehouseCatalog._init_schema() (datasets, multimodal_samples, experiments, simulation_runs, video_segments) or by user notebooks and scripts (e.g. coco_caption).

Architecture Overview

The diagram below groups all tables by subsystem and shows how they connect to MinIO object storage.


DuckLake System Tables

Entity Relationships — Core Spine

The diagram covers the nine tables that form the read/write critical path: snapshot versioning, the table/column catalog, and Parquet file tracking. The remaining ten tables (views, tags, partitioning, bulk statistics, inlined data, and vacuum scheduling) are documented in the reference tables below.

Reference — All DuckLake Tables

Versioning group

| Table | Purpose |
| --- | --- |
| ducklake_metadata | Key/value store for extension-level settings (catalog format version, etc.) |
| ducklake_snapshot | One row per write transaction; the snapshot_id is the time-travel handle |
| ducklake_snapshot_changes | Human-readable description of what each snapshot changed |

Catalog group

| Table | Purpose |
| --- | --- |
| ducklake_schema | Named schemas (namespaces); each row is live between begin_snapshot and end_snapshot |
| ducklake_table | Registered tables; versioned the same way as schemas |
| ducklake_view | Stored SQL views with dialect and optional column aliases |
| ducklake_column | Per-column metadata including type, ordering, nullability, and default values |
| ducklake_tag | Arbitrary key/value tags on any catalog object (table, schema, etc.) |
| ducklake_column_tag | Key/value tags scoped to a specific column within a table |

Storage group

| Table | Purpose |
| --- | --- |
| ducklake_data_file | One row per Parquet file; tracks path, format, record count, byte size, and which snapshot added it |
| ducklake_delete_file | Positional or equality delete files recording which rows have been logically deleted |
| ducklake_files_scheduled_for_deletion | Files that have been superseded and will be physically removed by the next VACUUM |
| ducklake_inlined_data_tables | Tiny tables stored directly in PostgreSQL (no Parquet file) up to a configurable row threshold |

Partitioning group

| Table | Purpose |
| --- | --- |
| ducklake_partition_info | Declares that a table is partitioned; versioned like schema/table |
| ducklake_partition_column | Maps partition key indices to column IDs and transform expressions (identity, year, bucket, etc.) |
| ducklake_file_partition_value | Per-file partition value enabling partition pruning without scanning file footers |

Statistics group

| Table | Purpose |
| --- | --- |
| ducklake_file_column_statistics | Per-file, per-column min/max/null/nan counts — fed into DuckDB's predicate pushdown |
| ducklake_table_stats | Table-level aggregate: total record count, next row ID, total byte size |
| ducklake_table_column_stats | Table-level per-column null/nan flags and global min/max (used for query planning) |
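
These system tables live in the ordinary PostgreSQL catalog database and can be inspected with plain SQL. The sketch below lists the live Parquet footprint per table; it assumes the catalog database has been attached separately as meta, and the attachment name and exact system column names are illustrative and should be checked against your DuckLake version:

-- Live Parquet footprint per table, read straight from the catalog database.
-- Assumes something like: ATTACH 'dbname=ducklake_catalog host=...' AS meta (TYPE postgres);
SELECT t.table_name,
       COUNT(*)               AS live_files,
       SUM(f.record_count)    AS total_rows,
       SUM(f.file_size_bytes) AS total_bytes
FROM meta.ducklake_data_file f
JOIN meta.ducklake_table t ON t.table_id = f.table_id
WHERE f.end_snapshot IS NULL          -- file still part of the current snapshot
  AND t.end_snapshot IS NULL          -- table not dropped or replaced since
GROUP BY t.table_name
ORDER BY total_bytes DESC;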

Auraison Custom Tables

Five tables are created automatically by LakehouseCatalog._init_schema() on every catalog open. Two are for experiment tracking (experiments, simulation_runs); two form the generic data layer (datasets, multimodal_samples) that accommodates any modality — image-text, audio-text, video-text, or sensor data — without schema changes; the fifth, video_segments, holds temporal annotations and is covered in the video section below.

Design rationale

A dataset-specific table (e.g. a coco_caption table with columns cocoid, sentences, sentids) would require a new schema for every new dataset. Instead, multimodal_samples normalises the universal fields (sample_id, split, primary_uri, captions) and stores dataset-specific fields (image filename, bounding boxes, audio duration, ROS bag metadata, …) in a metadata JSON column.
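
Because the dataset-specific fields stay in metadata, they remain queryable without any DDL. A minimal sketch against the COCO rows (the filepath key and value follow the mapping table further down and are illustrative):

-- Pull a dataset-specific field (COCO filename) out of the metadata JSON column.
SELECT sample_id,
       metadata->>'filename' AS filename,
       primary_uri
FROM multimodal_samples
WHERE dataset_id = 'yerevann/coco-karpathy'
  AND metadata->>'filepath' = 'val2014'   -- illustrative value
LIMIT 10;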

Entity Relationships

The dashed link between simulation_runs and multimodal_samples is logical, not a FK: simulation_runs.config->>'dataset_id' equals multimodal_samples.dataset_id, enabling the cross-plane join in use case 3 without any ETL:

SELECT d.name AS dataset, r.config->>'split' AS split, COUNT(s.sample_id) AS images
FROM simulation_runs r
JOIN datasets d ON d.dataset_id = 'yerevann/coco-karpathy'
JOIN multimodal_samples s
ON s.dataset_id = 'yerevann/coco-karpathy'
AND s.split = r.config->>'split'
GROUP BY 1, 2;

Column Reference

datasets

| Column | Type | Notes |
| --- | --- | --- |
| dataset_id | VARCHAR PK | Stable identifier, e.g. 'yerevann/coco-karpathy', 'librispeech/clean' |
| name | VARCHAR | Human-readable name |
| modalities | VARCHAR[] | e.g. ['image', 'text'], ['audio', 'text'], ['video', 'text'] |
| description | VARCHAR | Free-text |
| source_url | VARCHAR | HF Hub URL, paper URL, or internal link |
| registered_at | TIMESTAMP | UTC |
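
A hand-rolled registration row looks like this; the values are illustrative, and the ingest code may populate the table differently:

-- Register a dataset; modalities is a DuckDB VARCHAR[] list literal.
INSERT INTO datasets (dataset_id, name, modalities, description, source_url, registered_at)
VALUES ('yerevann/coco-karpathy',
        'COCO Captions (Karpathy split)',
        ['image', 'text'],
        'MS-COCO images with 5 reference captions per image',
        'https://huggingface.co/datasets/yerevann/coco-karpathy',
        now());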

multimodal_samples

| Column | Type | Notes |
| --- | --- | --- |
| sample_id | VARCHAR PK | Dataset-native ID cast to string (e.g. COCO cocoid, LibriSpeech utterance ID) |
| dataset_id | VARCHAR FK | References datasets.dataset_id |
| split | VARCHAR | train \| val \| test \| restval \| … |
| primary_uri | VARCHAR | S3 or HTTPS path to the primary asset (image, audio file, video clip) |
| annotations | JSON | Task-specific payload — see examples below |
| metadata | JSON | Asset-level fields (filename, resolution, fps, sensor config, …) |
| ingested_at | TIMESTAMP | UTC timestamp of ingest |
| caption_length | INTEGER | len(annotations->>'$.captions[0]') — added via ALTER TABLE (schema evolution demo; see the sketch below) |
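
The caption_length column is the schema-evolution demo: it is added after the table already holds data and then backfilled. A sketch of that step (the exact statement used in the notebook may differ):

-- Schema evolution: new column on a populated DuckLake table, then a backfill.
ALTER TABLE multimodal_samples ADD COLUMN caption_length INTEGER;
UPDATE multimodal_samples
SET caption_length = len(annotations->>'$.captions[0]')  -- length of the first caption
WHERE dataset_id = 'yerevann/coco-karpathy';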

annotations carries whatever the task requires, with no schema change needed per dataset:

| Task type | annotations shape | Example dataset |
| --- | --- | --- |
| Image captioning | {"captions": ["...", "..."], "num_captions": 5} | COCO-Caption, CC3M |
| Image classification | {"labels": ["cat"], "label_ids": [3]} | ImageNet, CIFAR |
| Object detection | {"boxes": [[x,y,w,h], …], "labels": ["car", …]} | COCO-Detection, LVIS |
| VQA | {"question": "…", "answers": ["…", "…"]} | VQAv2, GQA |
| ASR / audio-text | {"transcript": "…", "language": "en"} | LibriSpeech, CommonVoice |
| ROS2 / sensor (planned) | {"topics": ["/camera/image", "/imu/data"], "hz": 30} | rosbag episodes |
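
Since every shape is plain JSON, cross-dataset questions reduce to JSON extraction. A sketch, with key names following the shapes above:

-- How many samples per dataset carry at least one caption?
SELECT dataset_id, COUNT(*) AS captioned_samples
FROM multimodal_samples
WHERE json_array_length(annotations->'captions') > 0
GROUP BY dataset_id
ORDER BY captioned_samples DESC;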

COCO-Caption → multimodal_samples mapping

| COCO parquet column | maps to | Notes |
| --- | --- | --- |
| cocoid | sample_id | Cast to VARCHAR |
| url | primary_uri | Original COCO image URL |
| sentences | annotations.captions | All 5 reference captions |
| split | split | Unchanged |
| filename, imgid, filepath | metadata JSON | Asset-level; carried without schema lock-in |
| sentids | (dropped) | Internal HF annotation IDs; not needed downstream |
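
Expressed as SQL, the mapping is a single projection over the raw landing-zone Parquet. This is a sketch only, since the real ingest goes through sync_from_hf, and the landing path is illustrative:

-- Illustrative projection of the raw COCO-Karpathy parquet into multimodal_samples.
INSERT INTO multimodal_samples
    (sample_id, dataset_id, split, primary_uri, annotations, metadata, ingested_at)
SELECT CAST(cocoid AS VARCHAR),
       'yerevann/coco-karpathy',
       split,
       url,
       json_object('captions', sentences, 'num_captions', len(sentences)),
       json_object('filename', filename, 'imgid', imgid, 'filepath', filepath),
       now()
FROM read_parquet('s3://landing/coco-karpathy/*.parquet');   -- path is illustrative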

experiments

| Column | Type | Notes |
| --- | --- | --- |
| experiment_id | VARCHAR PK | UUID string assigned at registration |
| project | VARCHAR | Short slug, e.g. coco-caption-nb |
| description | VARCHAR | Free-text description of the experiment |
| created_at | TIMESTAMP | UTC timestamp at registration |

simulation_runs

| Column | Type | Notes |
| --- | --- | --- |
| run_id | VARCHAR PK | UUID string |
| experiment_id | VARCHAR FK | References experiments.experiment_id |
| sim_index | INTEGER | Zero-based index within the experiment |
| config | JSON | Arbitrary run config (split, max_samples, dataset_id, tier, …) |
| status | VARCHAR | started \| completed \| failed |
| s3_prefix | VARCHAR | MinIO path prefix of the run's raw input data |
| started_at | TIMESTAMP | UTC |
| completed_at | TIMESTAMP | UTC; NULL while in progress |
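
Run-level reporting needs nothing beyond JSON extraction on config; a sketch, with column names following the tables above:

-- Recent failed runs with the split and tier they were launched with.
SELECT r.run_id,
       e.project,
       r.config->>'split' AS split,
       r.config->>'tier'  AS tier,
       r.started_at
FROM simulation_runs r
JOIN experiments e USING (experiment_id)
WHERE r.status = 'failed'
ORDER BY r.started_at DESC;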

How the Two Layers Interact

Every write to an Auraison custom table (INSERT, ALTER, UPDATE) creates a new ducklake_snapshot row and is reflected in ducklake_data_file and, for DDL, in ducklake_table and ducklake_column. This is what enables time-travel queries against all of the custom tables using AT (VERSION => N).

Write path:
Python (LakehouseCatalog / DuckDB)
→ ducklake_snapshot (new snapshot_id)
→ ducklake_table / column (catalog update, if DDL)
→ ducklake_data_file (new Parquet file written to s3://warehouse)
→ ducklake_file_column_statistics (min/max/null stats for predicate pushdown)

Read path (time travel):
SELECT * FROM multimodal_samples AT (VERSION => 1)
→ DuckLake resolves snapshot_id = 1
→ reads ducklake_data_file WHERE begin_snapshot <= 1 AND (end_snapshot IS NULL OR end_snapshot > 1)
→ fetches those Parquet files from MinIO
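
A sketch of inspecting that history and comparing two versions of the same table. The snapshot number is illustrative, meta refers to a direct attachment of the PostgreSQL catalog as in the earlier sketch, and the system column names should be checked against your DuckLake version:

-- What did each snapshot change?
SELECT s.snapshot_id, s.snapshot_time, c.changes_made
FROM meta.ducklake_snapshot s
JOIN meta.ducklake_snapshot_changes c USING (snapshot_id)
ORDER BY s.snapshot_id;

-- Row count at an early version versus now, via time travel.
SELECT (SELECT COUNT(*) FROM multimodal_samples AT (VERSION => 1)) AS rows_at_v1,
       (SELECT COUNT(*) FROM multimodal_samples)                   AS rows_now;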

HF Video Datasets and video_segments

What works out of the box

HF video datasets come in two forms:

| Form | How stored in Parquet | Ingestion behaviour |
| --- | --- | --- |
| URL-based (large datasets — Kinetics, MSR-VTT) | struct<bytes: null, path: "https://..."> | sync_from_hf converts the struct to a plain URL string; that URL becomes primary_uri |
| Bytes-embedded (small datasets) | struct<bytes: binary, path: "video.mp4"> | sync_from_hf extracts each video to s3://landing/{prefix}/assets/video/{name}.mp4; the S3 URI becomes primary_uri |

The same logic applies to Image and Audio feature columns — sync_from_hf detects any struct<bytes, path> column and handles it identically regardless of modality.

After ingestion, the Parquet file stored in MinIO contains plain URI strings instead of binary blobs, so DuckDB and the streaming layer can query metadata without downloading assets.
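
The pre-ingestion shape can be previewed with DuckDB directly against the raw HF parquet. A sketch assuming a struct-typed video column; the file path and column name are illustrative:

-- Peek at the struct<bytes, path> column before sync_from_hf rewrites it.
SELECT video.path          AS source_path,
       video.bytes IS NULL AS is_url_based
FROM read_parquet('s3://landing/msr-vtt/raw/*.parquet')   -- illustrative path
LIMIT 5;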

Video-level tasks (video captioning, action recognition, VQA)

One video → one multimodal_samples row. annotations carries task-specific payloads; metadata carries video properties.

// MSR-VTT (video captioning)
{ "annotations": {"captions": ["A man plays guitar", "..."], "num_captions": 20},
"metadata": {"duration_s": 15.2, "fps": 30, "width": 320, "height": 240} }

// Kinetics-700 (action recognition)
{ "annotations": {"label": "playing guitar", "label_id": 289},
"metadata": {"duration_s": 10.0, "fps": 25, "start_s": 30.0, "end_s": 40.0} }

// MSRVTT-QA (video question answering)
{ "annotations": {"question": "What is the man doing?", "answer": "playing guitar"},
"metadata": {"duration_s": 15.2, "fps": 30} }

Temporal / dense annotations — video_segments

Datasets like ActivityNet Dense Captions, YouCook2, or AVA associate multiple timestamped annotations with a single video. These cannot be efficiently stored or queried as a flat JSON array in one row. video_segments is created automatically by LakehouseCatalog._init_schema():

CREATE TABLE IF NOT EXISTS video_segments (
    segment_id   VARCHAR NOT NULL,  -- UUID
    sample_id    VARCHAR NOT NULL,  -- FK → multimodal_samples.sample_id
    dataset_id   VARCHAR NOT NULL,  -- FK → datasets.dataset_id
    start_time_s DOUBLE,            -- segment start in seconds
    end_time_s   DOUBLE,            -- segment end in seconds
    segment_uri  VARCHAR,           -- S3 path to extracted clip (optional)
    annotations  JSON,              -- per-segment label, caption, bbox, etc.
    metadata     JSON               -- resolution, codec, frame count, etc.
);

ActivityNet Dense Captions example — one row in multimodal_samples, many in video_segments:

-- multimodal_samples row (video level)
INSERT INTO multimodal_samples
    (sample_id, dataset_id, split, primary_uri, annotations, metadata, ingested_at)
VALUES (
    'v_abc123', 'activitynet/v1-3', 'train',
    's3://landing/activitynet/v_abc123.mp4',
    json_object('num_segments', 7),
    json_object('duration_s', 180.0, 'fps', 29.97),
    now()
);

-- video_segments rows (one per temporal annotation)
INSERT INTO video_segments VALUES
    (gen_random_uuid(), 'v_abc123', 'activitynet/v1-3', 0.0, 5.2, NULL,
     json_object('caption', 'A person walks into the room'), NULL),
    (gen_random_uuid(), 'v_abc123', 'activitynet/v1-3', 5.2, 12.4, NULL,
     json_object('caption', 'They pick up a ball and throw it'), NULL),
    ...;

Cross-table query — dense captions for one video:

SELECT s.start_time_s, s.end_time_s, s.annotations->>'caption' AS caption
FROM video_segments s
JOIN multimodal_samples m ON m.sample_id = s.sample_id
WHERE m.dataset_id = 'activitynet/v1-3'
AND m.sample_id = 'v_abc123'
ORDER BY s.start_time_s;

Future: ROS 2 Bags and Sensor Data

ROS 2 bags map to multimodal_samples at the episode level and reuse video_segments for per-topic frame-level access — the same table covers both:

// multimodal_samples for a ROS 2 bag episode
{ "primary_uri": "s3://landing/turtlebot/episode_042.db3",
"annotations": {"topics": ["/camera/image_raw", "/imu/data", "/lidar/points"],
"msg_types": ["sensor_msgs/Image", "sensor_msgs/Imu", "sensor_msgs/PointCloud2"]},
"metadata": {"robot": "turtlebot4", "duration_s": 120.4,
"hz": {"camera": 30, "imu": 200, "lidar": 10}} }

// video_segments row for one extracted camera frame
{ "start_time_s": 1678000000.123, "end_time_s": 1678000000.156,
"segment_uri": "s3://landing/turtlebot/episode_042/camera/frame_00042.jpg",
"annotations": {"detections": [{"label": "wall", "bbox": [0, 0, 640, 120]}]},
"metadata": {"topic": "/camera/image_raw", "seq": 42, "encoding": "rgb8"} }
