Webinar · Tutorial

Building a Serverless Lakehouse with DuckLake

2025/12/26

TL;DR: This hands-on workshop walks through building a serverless lakehouse using DuckLake—an open-source table format that eliminates Java dependencies and exposes all metadata through SQL for easy inspection, time travel, and maintenance operations.

What is DuckLake?

DuckLake is an open table format built for DuckDB that combines the catalog and metadata store into a single component. Key benefits:

  • No Java required: Pure SQL interface for all operations
  • Flexible metadata storage: Supports PostgreSQL, MySQL, DuckDB, or SQLite as the catalog backend
  • Full SQL access to metadata: Query snapshots, file information, and table history directly
  • Built-in time travel: Every operation creates a snapshot for historical access

Getting Started: Creating a DuckLake

-- Attach a DuckLake with DuckDB as the metadata store
ATTACH 'ducklake:lake.db' AS lake;

-- Set the data path for parquet files
SET ducklake_data_path = 'data/lake';
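Once attached, the lake behaves like any other DuckDB catalog. A quick sanity check using standard DuckDB commands (the table name here is a hypothetical placeholder):

```sql
-- The attached lake should appear alongside other catalogs
SHOW DATABASES;

-- Any table created under the lake catalog is managed by DuckLake
CREATE TABLE lake.smoke_test (id INTEGER);
INSERT INTO lake.smoke_test VALUES (1);
```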

Adding Data: Two Approaches

1. Metadata-Only Registration (add_files)

-- Register existing parquet files without copying
CALL add_data_files('lake.orders_raw', 'tpch/orders/*.parquet');

This is instant—it just records the file locations in metadata.
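To confirm the registration worked, you can query the file metadata directly, using the `ducklake_table_files` function shown later in this post (a sketch; no data is read, only the catalog):

```sql
-- The registered files should be listed with their original paths
SELECT path, record_count
FROM lake.ducklake_table_files('orders_raw');
```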

2. Insert Operations

-- Create a partitioned table
CREATE TABLE lake.orders (...);
ALTER TABLE lake.orders SET PARTITION BY (year);

-- Insert with partition columns
INSERT INTO lake.orders
SELECT *, YEAR(order_date) AS year
FROM lake.orders_raw;

Inserts physically write new parquet files into the lake's data path.
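You can see those files on disk with DuckDB's built-in `glob` table function, assuming the data path set earlier (`data/lake`):

```sql
-- List the parquet files the insert produced under the data path
SELECT file FROM glob('data/lake/**/*.parquet');
```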

Time Travel and Snapshots

Every operation creates a new snapshot:

-- View all snapshots
SELECT * FROM lake.ducklake_snapshots();

-- Query historical data
SELECT COUNT(*) FROM lake.orders AT VERSION 4;

-- Compare versions
SELECT * FROM lake.ducklake_table_changes('orders', 5, 6);

Pro tip: Wrap multiple operations in a transaction to create a single snapshot:

BEGIN TRANSACTION;
-- multiple inserts/updates
COMMIT;
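For example, two inserts wrapped in one transaction should register as a single new snapshot (a sketch; the row values are hypothetical and assume the partitioned `orders` table from earlier):

```sql
BEGIN TRANSACTION;
INSERT INTO lake.orders VALUES (1, DATE '2025-01-15', 2025);
INSERT INTO lake.orders VALUES (2, DATE '2025-02-20', 2025);
COMMIT;

-- Only one new snapshot should appear for the whole transaction
SELECT * FROM lake.ducklake_snapshots();
```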

File Compaction and Maintenance

Small files accumulate from incremental inserts. DuckLake provides maintenance operations:

-- Merge small files together
CALL merge_adjacent_files('lake.line_item');

-- Mark old snapshots for deletion (keep only recent)
CALL expire_snapshots('lake', INTERVAL '1 day');

-- Delete unreferenced files
CALL cleanup_old_files('lake');

-- Remove files not tracked in metadata
CALL vacuum_orphan_files('lake');

-- Or run everything at once
CALL checkpoint('lake');
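To gauge whether compaction helped, compare the file count before and after merging, using the `ducklake_table_files` function described in the next section:

```sql
-- Run before and after merge_adjacent_files; the count should drop
SELECT COUNT(*) AS file_count
FROM lake.ducklake_table_files('line_item');
```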

Inspecting Metadata

All metadata is queryable via SQL:

-- File information
SELECT path, record_count, file_size_bytes
FROM lake.ducklake_table_files('orders');

-- Table partitions
SELECT * FROM lake.ducklake_table_partitions('line_item');
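Because this is ordinary SQL, the same metadata supports ad-hoc aggregates. A sketch using the columns shown above, e.g. to spot a small-file problem at a glance:

```sql
-- Summarize file layout for a table
SELECT COUNT(*)             AS files,
       SUM(file_size_bytes) AS total_bytes,
       AVG(record_count)    AS avg_rows_per_file
FROM lake.ducklake_table_files('orders');
```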

DuckLake with MotherDuck

-- Create a MotherDuck-managed DuckLake (uses managed S3)
CREATE DATABASE my_lake TYPE ducklake;

-- Or use your own S3 bucket
CREATE DATABASE my_lake TYPE ducklake DATA_PATH 's3://my-bucket/lake';

When to Choose DuckLake vs Iceberg

| Consideration | DuckLake | Iceberg |
| --- | --- | --- |
| Setup complexity | Simple (SQL only) | More complex (Java ecosystem) |
| Scale | Terabytes | Petabytes |
| Metadata inspection | Native SQL queries | Requires separate tools |
| Ecosystem maturity | Newer (approaching 1.0) | Battle-tested |

What's Coming

  • DuckLake 1.0: February 2026
  • Spark connector: Write to DuckLake from Spark
  • Optimize operations: Intelligent file rewriting based on sort order and zone maps
  • Inline storage: Small writes stay in the relational catalog until flushed
