Two days ago, DuckLake shipped its 1.0 release — the first stable version of the lakehouse format that stores all metadata in a SQL database instead of scattered JSON and Avro files. I've been watching this project since the 0.1 days, mostly with the "interesting experiment, wake me when it scales" attitude. The 1.0 announcement changed my mind about a few things, though not all of them.
The Small File Problem, Actually Solved
Every data engineer has lived this: a streaming job dumps single-row inserts into an Iceberg table, and within an hour you've got hundreds of tiny Parquet files plus three times as many metadata files. You either run compaction jobs to clean up the mess, or you accept that your queries will spend more time opening files than reading data.
DuckLake's answer is data inlining. Inserts below a configurable threshold (default: 10 rows) get stored directly in the catalog database — no Parquet file created at all. After 100 single-row inserts in their benchmarks, DuckLake produced zero Parquet files. Iceberg produced over 100 data files and 300+ metadata files.
The numbers get more interesting from there:
| Operation | Speedup vs Iceberg | Speedup vs DuckLake (inlining off) |
|---|---|---|
| Single-row inserts | 105× faster | 5.2× faster |
| Aggregation queries | 923× faster | 925.9× faster |
| Checkpoint/compaction | 189× faster | 14.5× faster |
Those aren't typos. The 923× aggregation speedup happens because the catalog can serve COUNT(*) and basic aggregates straight from metadata — no S3 round-trips, no Parquet file scanning. On S3-backed tables, they measured 8× to 258× speedups on count queries alone.
When you actually need the data materialized as Parquet, call ducklake_flush_inlined_data() or run CHECKPOINT. The flush consolidates scattered inline rows into optimized, sorted Parquet files. Compaction without the compaction headache.
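The mechanics are easy to model. Below is a minimal sketch of the inlining idea in Python, using sqlite3 as a stand-in catalog. The class, table layout, and flush logic are illustrative inventions — not DuckLake's actual schema — but they show the shape of it: small inserts become catalog rows, a flush consolidates them into one sorted data file, and counts are served without opening any file.

```python
import json
import sqlite3

INLINE_THRESHOLD = 10  # mirrors DuckLake's stated default; name is ours

class TinyCatalog:
    """Toy model of catalog-side data inlining (not DuckLake's real schema)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE inlined (tbl TEXT, row_json TEXT)")
        self.files = []  # stand-in for Parquet files on object storage

    def insert(self, tbl, rows):
        if len(rows) < INLINE_THRESHOLD:
            # Small insert: rows land in the catalog itself; no file is written.
            self.db.executemany("INSERT INTO inlined VALUES (?, ?)",
                                [(tbl, json.dumps(r)) for r in rows])
        else:
            self.files.append((tbl, rows))  # large insert: one data file

    def count(self, tbl):
        # COUNT(*) served from metadata alone — no file is ever opened.
        (inline,) = self.db.execute(
            "SELECT COUNT(*) FROM inlined WHERE tbl = ?", (tbl,)).fetchone()
        return inline + sum(len(r) for t, r in self.files if t == tbl)

    def flush(self, tbl):
        # CHECKPOINT-style flush: consolidate inline rows into one sorted file.
        rows = [json.loads(j) for (j,) in self.db.execute(
            "SELECT row_json FROM inlined WHERE tbl = ?", (tbl,))]
        if rows:
            self.files.append((tbl, sorted(rows, key=str)))
            self.db.execute("DELETE FROM inlined WHERE tbl = ?", (tbl,))

cat = TinyCatalog()
for i in range(100):
    cat.insert("events", [{"id": i}])   # 100 single-row inserts
print(len(cat.files), cat.count("events"))  # 0 100 — no files, all inlined
cat.flush("events")
print(len(cat.files), cat.count("events"))  # 1 100 — one consolidated file
```

A hundred single-row inserts produce zero data files until the flush, at which point they collapse into one — the same before/after the benchmark describes, minus everything that makes the real thing hard.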
What Else Shipped
Data inlining gets the headlines, but v1.0 packs more.
Sorted tables let you define a sort order by column or SQL expression. DuckLake physically sorts data during compaction and flush, enabling row-group pruning without a separate index. Range queries on a timestamp column become the difference between scanning 3 files and scanning 300.
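The pruning win comes from min/max statistics: when data is physically sorted, each file covers a narrow, non-overlapping value range, so a range predicate eliminates nearly everything. A toy illustration — file names and ranges are invented:

```python
def prune(files, lo, hi):
    """files: list of (name, min_ts, max_ts). Keep files whose range overlaps [lo, hi]."""
    return [name for name, mn, mx in files if mx >= lo and mn <= hi]

# Sorted layout: contiguous, non-overlapping ranges per file.
sorted_files = [(f"part-{i}.parquet", i * 100, i * 100 + 99) for i in range(300)]
print(len(prune(sorted_files, 4200, 4299)))    # 1 — a single file survives

# Unsorted layout: every file spans nearly the whole timestamp domain.
unsorted_files = [(f"part-{i}.parquet", 0, 29999) for i in range(300)]
print(len(prune(unsorted_files, 4200, 4299)))  # 300 — nothing can be skipped
```

Same statistics, same predicate — the only variable is whether the sort order made the per-file ranges tight.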
Bucket partitioning arrives via murmur3 hashing, compatible with Iceberg's partition spec. It's a good fit for high-cardinality columns where traditional partitioning would create millions of directories.
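For reference, Iceberg's bucket transform applies the 32-bit murmur3 hash to the value's serialized bytes, clears the sign bit, and takes the result modulo the bucket count. Here's a self-contained sketch; the `bucket` helper and the UTF-8 encoding are simplifications (Iceberg specifies exact byte layouts per type), though the hash itself follows the standard murmur3_x86_32 algorithm:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit murmur3 (x86 variant), the hash behind Iceberg-style bucketing."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed
    n = len(data) & ~3
    for i in range(0, n, 4):                      # body: 4-byte little-endian blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff  # rotl 15
        k = (k * c2) & 0xffffffff
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xffffffff  # rotl 13
        h = (h * 5 + 0xe6546b64) & 0xffffffff
    tail, k = data[n:], 0
    for i in range(len(tail) - 1, -1, -1):        # leftover 1-3 bytes
        k = (k << 8) | tail[i]
    if tail:
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        k = (k * c2) & 0xffffffff
        h ^= k
    h ^= len(data)                                # finalization mix
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xffffffff
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xffffffff
    h ^= h >> 16
    return h

def bucket(value: str, n_buckets: int) -> int:
    """Iceberg-style bucket transform: hash, mask the sign bit, mod N."""
    return (murmur3_32(value.encode("utf-8")) & 0x7fffffff) % n_buckets

print(bucket("user-1234", 16))  # stable bucket id in [0, 16)
```

A billion distinct user IDs still land in exactly 16 buckets — that's the whole point versus identity partitioning.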
Deletion vectors store deletions as Puffin files with roaring bitmaps, preserving time-travel. This borrows from the Iceberg v3 spec — pragmatic interop rather than reinventing the wheel.
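A deletion vector is just a bitmap of deleted row positions, kept per snapshot so that older snapshots still see the rows. A toy model — a plain frozenset stands in for the roaring bitmap, and the class is our invention, not the Puffin format:

```python
class Table:
    """Toy table with snapshot-scoped deletion vectors (set ~ roaring bitmap)."""

    def __init__(self, rows):
        self.rows = rows
        self.snapshots = [frozenset()]   # snapshot 0: nothing deleted

    def delete(self, positions):
        # Deletes append a new vector; data files are never rewritten.
        latest = self.snapshots[-1]
        self.snapshots.append(latest | frozenset(positions))

    def scan(self, snapshot=-1):
        deleted = self.snapshots[snapshot]
        return [r for i, r in enumerate(self.rows) if i not in deleted]

t = Table(["a", "b", "c", "d"])
t.delete([1, 3])
print(t.scan())            # ['a', 'c'] — current view
print(t.scan(snapshot=0))  # ['a', 'b', 'c', 'd'] — time travel still works
```

Because each delete creates a new vector instead of mutating the old one, every historical snapshot remains queryable for free.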
The 108 merged PRs since late 2025 include 68 focused on reliability and correctness. The team spent more time fixing edge cases than adding features. That's what a 1.0 should look like.
The Architecture Bet
Here's the philosophical split that matters. Iceberg, Delta Lake, and Hudi use file-based metadata — JSON manifests, Avro files, log segments. Every commit creates new files. Reading a table means traversing a tree: snapshot → manifest list → manifests → data files.
DuckLake says: just put it in a database. Postgres, SQLite, or DuckDB itself serves as the catalog backend. No metadata tree traversal — a single SQL query returns the file list, column statistics, and partition info. The team reports metadata lookups in single-digit milliseconds where Iceberg needs hundreds of milliseconds on cold S3 reads.
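You can sketch the difference with sqlite3 in a few lines: the entire scan plan comes back from one join, no tree traversal. The schema below is an invented simplification, loosely modeled on the data_file and file_column_statistics tables discussed later — not DuckLake's real catalog layout:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE data_file (
      file_id INTEGER PRIMARY KEY, path TEXT, row_count INTEGER);
  CREATE TABLE file_column_statistics (
      file_id INTEGER, column_name TEXT, min_value TEXT, max_value TEXT);
  INSERT INTO data_file VALUES (1, 's3://lake/part-0.parquet', 50000),
                               (2, 's3://lake/part-1.parquet', 48000);
  INSERT INTO file_column_statistics VALUES
      (1, 'ts', '2025-01-01', '2025-01-31'),
      (2, 'ts', '2025-02-01', '2025-02-28');
""")

# One SQL query returns the file list, row counts, and column stats
# for the predicate — where file-based formats walk snapshot ->
# manifest list -> manifests -> data files across object storage.
plan = db.execute("""
    SELECT f.path, f.row_count, s.min_value, s.max_value
    FROM data_file f
    JOIN file_column_statistics s USING (file_id)
    WHERE s.column_name = 'ts' AND s.max_value >= '2025-02-01'
""").fetchall()
print(plan)  # only part-1 qualifies for a February scan
```

One round-trip to a local or nearby database versus several sequential GETs against S3 — that's where the single-digit-millisecond claim comes from.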
The three backends carry very different tradeoffs:
SQLite — zero dependencies, single file, perfect for local dev and CI. No concurrent writes though.
PostgreSQL — battle-tested concurrency and replication. Your DBA already knows how to operate it. But your lakehouse now depends on Postgres uptime.
DuckDB — the recursive option: DuckDB manages its own metadata. Elegant, but pushes you deeper into the ecosystem.
Picking a catalog backend isn't a trivial config choice. It determines your concurrency model, your failure modes, and who gets paged at 2 AM.
Where This Falls Apart
I wouldn't be writing about this before coffee if I didn't have reservations.
Scalability is the open question. A petabyte-scale lake with 50–100 million files means 100 million rows in the data_file table and roughly a billion rows in file_column_statistics. Postgres handles tables that large — in theory. In practice, writing to billion-row tables during every commit introduces latency that file-based formats avoid entirely. The Iceberg approach of appending a new manifest file is O(1); updating rows in a massive Postgres table is decidedly not. Nobody has published benchmarks at this scale yet, and "trust us, Postgres is fast" doesn't constitute an engineering argument.
Engine support is thin. Client implementations exist for DataFusion, Spark, Trino, and Pandas, but outside of DuckDB and MotherDuck, nobody seems to be running DuckLake in production with real conviction. Iceberg has native support in AWS, Google Cloud, Snowflake, Databricks, and essentially every query engine that matters. The DuckLake contributor list is still predominantly DuckDB Labs affiliates. For a format that positions itself as open and portable, the ecosystem concentration is a yellow flag worth monitoring.
You're trading one operational burden for another. File-based metadata has known headaches: orphaned files, manifest bloat, compaction storms. But a SQL catalog introduces database operations — backups, replication lag, connection pooling, schema migrations between DuckLake versions. The v1.0 release notes mention that automatic catalog migration is now disabled by default, requiring explicit migration steps. Different failure modes, not fewer.
Who Should Actually Use This
If you're a team of 3–15 engineers, your largest table is under 10TB, you're already using DuckDB for analytics, and you're tired of babysitting compaction on your Iceberg tables — DuckLake 1.0 deserves a serious evaluation. The data inlining alone would have saved me a dozen late-night pages at my last job.
If you're running warehouse-scale operations with 50+ engineers, multi-engine requirements, and Iceberg deeply embedded in your infrastructure — keep watching. The ecosystem isn't there, and the scalability story hasn't been proven at the sizes that matter to you.
The v2.0 roadmap mentions git-like branching, role-based permissions, and incremental materialized views. Ambitious. For now, though, the 1.0 is a focused release that solves a real problem for the right audience. That's more than most v1.0s manage.