Webinar · Tutorial

Building a Serverless Lakehouse with DuckLake

2025/12/26

TL;DR: This hands-on workshop walks through building a serverless lakehouse using DuckLake—an open-source table format that eliminates Java dependencies and exposes all metadata through SQL for easy inspection, time travel, and maintenance operations.

What is DuckLake?

DuckLake is an open table format built for DuckDB that combines the catalog and metadata store into a single component. Key benefits:

  • No Java required: Pure SQL interface for all operations
  • Flexible metadata storage: Supports PostgreSQL, MySQL, DuckDB, or SQLite as the catalog backend
  • Full SQL access to metadata: Query snapshots, file information, and table history directly
  • Built-in time travel: Every operation creates a snapshot for historical access

Getting Started: Creating a DuckLake

-- Attach a DuckLake with DuckDB as the metadata store
ATTACH 'ducklake:lake.db' AS lake;

-- Set the data path for parquet files
SET ducklake_data_path = 'data/lake';
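Once attached, the lake behaves like any other DuckDB catalog. A quick sanity check using standard DuckDB commands (the table name here is a hypothetical placeholder):

```sql
-- The attached lake should appear alongside other catalogs
SHOW DATABASES;

-- Any table created under the lake catalog is managed by DuckLake
CREATE TABLE lake.smoke_test (id INTEGER);
INSERT INTO lake.smoke_test VALUES (1);
```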

Adding Data: Two Approaches

1. Metadata-Only Registration (add_files)

-- Register existing parquet files without copying
CALL add_data_files('lake.orders_raw', 'tpch/orders/*.parquet');

This is instant—it just records the file locations in metadata.
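To confirm the registration worked, you can query the file metadata directly, using the `ducklake_table_files` function shown later in this post (a sketch; no data is read, only the catalog):

```sql
-- The registered files should be listed with their original paths
SELECT path, record_count
FROM lake.ducklake_table_files('orders_raw');
```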

2. Insert Operations

-- Create a partitioned table
CREATE TABLE lake.orders (...);
ALTER TABLE lake.orders SET PARTITION BY (year);

-- Insert with partition columns
INSERT INTO lake.orders
SELECT *, YEAR(order_date) AS year
FROM lake.orders_raw;

Inserts physically write new parquet files into the lake's data path.
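You can see those files on disk with DuckDB's built-in `glob` table function, assuming the data path set earlier (`data/lake`):

```sql
-- List the parquet files the insert produced under the data path
SELECT file FROM glob('data/lake/**/*.parquet');
```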

Time Travel and Snapshots

Every operation creates a new snapshot:

-- View all snapshots
SELECT * FROM lake.ducklake_snapshots();

-- Query historical data
SELECT COUNT(*) FROM lake.orders AT VERSION 4;

-- Compare versions
SELECT * FROM lake.ducklake_table_changes('orders', 5, 6);

Pro tip: Wrap multiple operations in a transaction to create a single snapshot:

BEGIN TRANSACTION;
-- multiple inserts/updates
COMMIT;
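For example, two inserts wrapped in one transaction should register as a single new snapshot (a sketch; the row values are hypothetical and assume the partitioned `orders` table from earlier):

```sql
BEGIN TRANSACTION;
INSERT INTO lake.orders VALUES (1, DATE '2025-01-15', 2025);
INSERT INTO lake.orders VALUES (2, DATE '2025-02-20', 2025);
COMMIT;

-- Only one new snapshot should appear for the whole transaction
SELECT * FROM lake.ducklake_snapshots();
```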

File Compaction and Maintenance

Small files accumulate from incremental inserts. DuckLake provides maintenance operations:

-- Merge small files together
CALL merge_adjacent_files('lake.line_item');

-- Mark old snapshots for deletion (keep only recent)
CALL expire_snapshots('lake', INTERVAL '1 day');

-- Delete unreferenced files
CALL cleanup_old_files('lake');

-- Remove files not tracked in metadata
CALL vacuum_orphan_files('lake');

-- Or run everything at once
CALL checkpoint('lake');
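To gauge whether compaction helped, compare the file count before and after merging, using the `ducklake_table_files` function described in the next section:

```sql
-- Run before and after merge_adjacent_files; the count should drop
SELECT COUNT(*) AS file_count
FROM lake.ducklake_table_files('line_item');
```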

Inspecting Metadata

All metadata is queryable via SQL:

-- File information
SELECT path, record_count, file_size_bytes
FROM lake.ducklake_table_files('orders');

-- Table partitions
SELECT * FROM lake.ducklake_table_partitions('line_item');
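Because this is ordinary SQL, the same metadata supports ad-hoc aggregates. A sketch using the columns shown above, e.g. to spot a small-file problem at a glance:

```sql
-- Summarize file layout for a table
SELECT COUNT(*)             AS files,
       SUM(file_size_bytes) AS total_bytes,
       AVG(record_count)    AS avg_rows_per_file
FROM lake.ducklake_table_files('orders');
```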

DuckLake with MotherDuck

-- Create a MotherDuck-managed DuckLake (uses managed S3)
CREATE DATABASE my_lake TYPE ducklake;

-- Or use your own S3 bucket
CREATE DATABASE my_lake TYPE ducklake DATA_PATH 's3://my-bucket/lake';

When to Choose DuckLake vs Iceberg

| Consideration | DuckLake | Iceberg |
| --- | --- | --- |
| Setup complexity | Simple (SQL only) | More complex (Java ecosystem) |
| Scale | Terabytes | Petabytes |
| Metadata inspection | Native SQL queries | Requires separate tools |
| Ecosystem maturity | Newer (approaching 1.0) | Battle-tested |

What's Coming

  • DuckLake 1.0: February 2026
  • Spark connector: Write to DuckLake from Spark
  • Optimize operations: Intelligent file rewriting based on sort order and zone maps
  • Inline storage: Small writes stay in the relational catalog until flushed
