Building a Serverless Lakehouse with DuckLake
2025/12/26

TL;DR: This hands-on workshop walks through building a serverless lakehouse using DuckLake, an open-source table format that eliminates Java dependencies and exposes all metadata through SQL for easy inspection, time travel, and maintenance operations.
What is DuckLake?
DuckLake is an open table format built for DuckDB that combines the catalog and metadata store into a single component. Key benefits:
- No Java required: Pure SQL interface for all operations
- Flexible metadata storage: Supports PostgreSQL, MySQL, DuckDB, or SQLite as the catalog backend
- Full SQL access to metadata: Query snapshots, file information, and table history directly
- Built-in time travel: Every operation creates a snapshot for historical access
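Because DuckLake ships as a DuckDB extension, the pure-SQL interface is available after a one-time install from any DuckDB session (a minimal sketch; no catalog is attached yet at this point):

```sql
-- Install and load the DuckLake extension (install is one-time per environment)
INSTALL ducklake;
LOAD ducklake;
```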
Getting Started: Creating a DuckLake
-- Attach a DuckLake with DuckDB as the metadata store,
-- storing parquet data files under data/lake
ATTACH 'ducklake:lake.db' AS lake (DATA_PATH 'data/lake');
Adding Data: Two Approaches
1. Metadata-Only Registration (add_files)
-- Register existing parquet files without copying
CALL ducklake_add_data_files('lake', 'orders_raw', 'tpch/orders/*.parquet');
This is effectively instant: it only records the file locations in the catalog, without copying any data.
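Once registered, the files behave like any other table in the lake. For example (assuming the orders_raw registration above succeeded):

```sql
-- The registered files are immediately queryable; no data was copied
SELECT COUNT(*) AS row_count FROM lake.orders_raw;
```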
2. Insert Operations
-- Create a partitioned table
CREATE TABLE lake.orders (...);
ALTER TABLE lake.orders SET PARTITIONED BY (year);
-- Insert with partition columns
INSERT INTO lake.orders
SELECT *, YEAR(order_date) as year FROM lake.orders_raw;
Inserts physically write new parquet files into the lake's data path.
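Since inserts write real parquet files, you can verify them outside the catalog with DuckDB's ordinary parquet reader (a sketch, assuming the data/lake data path from the ATTACH step earlier):

```sql
-- Read the parquet files directly from the data path, bypassing the catalog
SELECT COUNT(*) AS rows_on_disk
FROM read_parquet('data/lake/**/*.parquet');
```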
Time Travel and Snapshots
Every operation creates a new snapshot:
-- View all snapshots
SELECT * FROM lake.ducklake_snapshots();
-- Query historical data
SELECT COUNT(*) FROM lake.orders AT (VERSION => 4);
-- Compare versions
SELECT * FROM lake.ducklake_table_changes('orders', 5, 6);
Pro tip: Wrap multiple operations in a transaction to create a single snapshot:
BEGIN TRANSACTION;
-- multiple inserts/updates
COMMIT;
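For instance, a daily load that touches several tables can still land as one atomic, time-travelable snapshot (a hedged sketch; the staging and customers table names are illustrative, not from the workshop):

```sql
-- Two inserts, one snapshot: both changes become visible atomically
BEGIN TRANSACTION;
INSERT INTO lake.orders    SELECT * FROM staging_orders;
INSERT INTO lake.customers SELECT * FROM staging_customers;
COMMIT;
```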
File Compaction and Maintenance
Small files accumulate from incremental inserts. DuckLake provides maintenance operations:
-- Merge small files together
CALL ducklake_merge_adjacent_files('lake');
-- Mark old snapshots for deletion (keep only the last day)
CALL ducklake_expire_snapshots('lake', older_than => now() - INTERVAL '1 day');
-- Delete files no longer referenced by any snapshot
CALL ducklake_cleanup_old_files('lake', cleanup_all => true);
-- Remove files not tracked in metadata
CALL ducklake_delete_orphaned_files('lake');
-- Or run everything at once
CALL checkpoint('lake');
Inspecting Metadata
All metadata is queryable via SQL:
-- File information
SELECT path, record_count, file_size_bytes
FROM lake.ducklake_table_files('orders');
-- Table partitions
SELECT * FROM lake.ducklake_table_partitions('line_item');
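The same metadata queries make maintenance data-driven: counting files below a size threshold tells you whether compaction is worth running (a sketch; the 10 MB threshold is an arbitrary illustration, not a DuckLake default):

```sql
-- Count files smaller than ~10 MB as compaction candidates
SELECT COUNT(*) AS small_files
FROM lake.ducklake_table_files('orders')
WHERE file_size_bytes < 10 * 1024 * 1024;
```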
DuckLake with MotherDuck
-- Create a MotherDuck-managed DuckLake (uses managed S3)
CREATE DATABASE my_lake (TYPE DUCKLAKE);
-- Or use your own S3 bucket
CREATE DATABASE my_lake (TYPE DUCKLAKE, DATA_PATH 's3://my-bucket/lake');
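Once created, the database behaves like any other MotherDuck database, with tables backed by parquet files in the configured storage (a minimal sketch; the events table and its columns are illustrative):

```sql
-- Tables in a DuckLake database are written as parquet to the data path
CREATE TABLE my_lake.events (ts TIMESTAMP, user_id INTEGER);
INSERT INTO my_lake.events VALUES (now(), 42);
```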
When to Choose DuckLake vs Iceberg
| Consideration | DuckLake | Iceberg |
|---|---|---|
| Setup complexity | Simple (SQL only) | More complex (Java ecosystem) |
| Scale | Terabytes | Petabytes |
| Metadata inspection | Native SQL queries | Requires separate tools |
| Ecosystem maturity | Newer (approaching 1.0) | Battle-tested |
What's Coming
- DuckLake 1.0: February 2026
- Spark connector: Write to DuckLake from Spark
- Optimize operations: Intelligent file rewriting based on sort order and zone maps
- Inline storage: Small writes stay in the relational catalog until flushed