10 min read
Executing SQL Queries
INFO: This is a summary of a chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.
3.1 A Quick SQL Recap
DuckDB is designed to "tickle your SQL brain" by supporting a highly standard-compliant SQL dialect. In this section, the authors review the anatomy of a SQL command within the DuckDB CLI and other clients.
Key takeaways include:
- Structure: Commands are composed of statements (like SELECT) and clauses (like WHERE, GROUP BY, and ORDER BY).
- Flexibility: DuckDB handles whitespace freely and is case-insensitive for keywords and identifiers, allowing users to format queries for maximum readability.
- Execution: Commands in the CLI are terminated with a semicolon.
- Core Clauses: The chapter reviews the logical order of operations: WHERE filters rows before aggregation, GROUP BY buckets data into keys, and ORDER BY sorts the final result set.
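As a quick illustration of that clause order (the readings table and its system_id and power columns are assumed from the chapter's energy schema, not quoted from the book):

```sql
SELECT system_id, count(*) AS num_readings  -- aggregate per group
FROM readings
WHERE power > 0                             -- filters rows before aggregation
GROUP BY system_id                          -- one output row per system
ORDER BY num_readings DESC;                 -- sorts the final result set
```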
3.2 Analyzing Energy Production
To demonstrate OLAP capabilities in a real-world context, the chapter introduces a concrete dataset: Photovoltaic Data Acquisition (PVDAQ) from the U.S. Department of Energy.
3.2.1 Downloading the dataset
This section outlines the data sources and ingestion methods. The dataset consists of time-series data representing energy readings (in 15-minute intervals) and market prices, fully documented on GitHub. The chapter demonstrates using the httpfs extension to load CSV data directly from S3 URLs (such as the systems.csv file) or via the NREL API without downloading files locally first.
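A minimal sketch of that pattern, assuming the httpfs extension is available and using a placeholder URL rather than the dataset's actual S3 location:

```sql
INSTALL httpfs;
LOAD httpfs;

-- Read a remote CSV directly over HTTP(S) without downloading it first
-- (the URL below is a placeholder)
SELECT *
FROM read_csv_auto('https://example.com/pvdaq/systems.csv')
LIMIT 5;
```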
3.2.2 The target schema
The data is modeled into a normalized schema to support joins and aggregation:
- systems: metadata about the solar panels (using an external ID as a surrogate key)
- readings: the actual power output (composite key of system ID and timestamp)
- prices: energy prices per kilowatt-hour (kWh), using sequences for IDs

3.3 Data Definition Language (DDL) Queries
DuckDB is a full-fledged Relational Database Management System (RDBMS). This section details how to structure data persistence using Data Definition Language (DDL), referencing supported SQL statements.
DDL vs DML: SQL Statement Categories
| Category | Purpose | Key Statements | When to Use |
|---|---|---|---|
| DDL (Data Definition) | Define database structure | CREATE, ALTER, DROP | Schema design, table creation |
| DML (Data Manipulation) | Work with data | SELECT, INSERT, UPDATE, DELETE | Querying, data changes |
| DCL (Data Control) | Manage permissions | GRANT, REVOKE | Access control |
| TCL (Transaction Control) | Manage transactions | COMMIT, ROLLBACK | Data integrity |
3.3.1 The CREATE TABLE statement
The authors demonstrate creating tables with specific data types (INTEGER, DECIMAL, TIMESTAMP). They emphasize the use of constraints—PRIMARY KEY, FOREIGN KEY, CHECK, and NOT NULL—to ensure data integrity. Readers are advised to check the documentation on indexes and limitations as these constraints can impact bulk loading performance.
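A sketch of what such definitions could look like for the systems and readings tables, using the constraint types named above; the exact column names and types are assumptions rather than the book's verbatim listings:

```sql
CREATE TABLE systems (
    id   INTEGER PRIMARY KEY,   -- external ID used as a surrogate key
    name VARCHAR NOT NULL
);

CREATE TABLE readings (
    system_id INTEGER NOT NULL,
    read_on   TIMESTAMP NOT NULL,                        -- 15-minute reading timestamps
    power     DECIMAL(10, 3) DEFAULT 0 CHECK (power >= 0),
    PRIMARY KEY (system_id, read_on),                     -- composite key
    FOREIGN KEY (system_id) REFERENCES systems (id)       -- referential integrity
);
```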
DuckDB Table Constraints Reference
| Constraint | Purpose | Example | Performance Impact |
|---|---|---|---|
| PRIMARY KEY | Unique row identifier | id INTEGER PRIMARY KEY | Creates index |
| FOREIGN KEY | Referential integrity | REFERENCES other_table(id) | Creates index |
| NOT NULL | Prevent missing values | name VARCHAR NOT NULL | Minimal |
| CHECK | Custom validation | CHECK(power >= 0) | Evaluated on insert |
| UNIQUE | Prevent duplicates | UNIQUE (valid_from) | Creates index |
| DEFAULT | Auto-fill values | DEFAULT 0 | None |
3.3.2 The ALTER TABLE statement
Schema requirements often evolve. This section covers ALTER TABLE for adding columns (e.g., adding a validity date to prices) or renaming columns. It also introduces Create Table As Select (CTAS), a powerful shortcut to duplicate table structures and content in a single command, such as CREATE TABLE prices_duplicate AS SELECT * FROM prices.
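Hedged examples of both ideas, assuming the prices table from the target schema:

```sql
-- Evolve the schema: add a validity date to prices
ALTER TABLE prices ADD COLUMN valid_from DATE;

-- Renaming a column follows the same pattern (hypothetical names, shown for syntax only)
-- ALTER TABLE prices RENAME COLUMN old_name TO new_name;

-- CTAS: duplicate structure and content in a single statement
CREATE TABLE prices_duplicate AS SELECT * FROM prices;
```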
3.3.3 The CREATE VIEW statement
The chapter explains CREATE VIEW as a method to encapsulate complex logic—such as converting Watts to kWh per day—creating an internal API for the database.
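One way such a view might look, assuming 15-minute readings in Watts (so each value covers a quarter hour) and the column names sketched earlier; the view name is illustrative:

```sql
-- Energy per system and day: sum of Watt readings / 4 (quarter-hour intervals) / 1000 (W -> kW)
CREATE VIEW v_power_per_day AS
SELECT system_id,
       date_trunc('day', read_on)       AS day,
       round(sum(power) / 4 / 1000, 2)  AS kWh
FROM readings
GROUP BY ALL;
```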
3.3.4 The DESCRIBE statement
The DESCRIBE statement is highlighted as a crucial tool for introspection. It works not just on tables, but on views, queries, and even remote CSV/Parquet files to preview schemas before ingestion.
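A few hedged examples of that introspection (the remote URL is a placeholder):

```sql
DESCRIBE readings;                        -- columns, types, and nullability of a table
DESCRIBE SELECT * FROM v_power_per_day;   -- schema of a view or an arbitrary query

-- Preview the schema of a remote CSV before ingesting it
DESCRIBE SELECT * FROM read_csv_auto('https://example.com/pvdaq/systems.csv');
```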
3.4 Data Manipulation Language (DML) Queries
DML is the engine of data analysis. This section moves beyond simple queries to complex data transformations. Code examples are available in the GitHub repository.
3.4.1 The INSERT statement
The chapter explores robust data entry strategies:
- Idempotency: Using INSERT INTO ... ON CONFLICT DO NOTHING to safely handle duplicate data.
- Insert from Select: Pipelines often pipe data directly from a SELECT query on a raw file into a structured table.
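A hedged sketch of both strategies against the readings table (values and the CSV path are placeholders, and system 10 is assumed to exist):

```sql
-- Idempotent insert: re-running this statement silently skips the duplicate key
INSERT INTO readings (system_id, read_on, power)
VALUES (10, '2023-06-05 10:15:00', 1250)
ON CONFLICT DO NOTHING;

-- Insert-from-select: load rows from a raw file straight into the structured table
INSERT INTO readings (system_id, read_on, power)
SELECT system_id, read_on, power
FROM read_csv_auto('raw_readings.csv')
ON CONFLICT DO NOTHING;
```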
3.4.2 Merging data
For refining messy datasets, the authors demonstrate ON CONFLICT DO UPDATE. This "upsert" capability allows new data to merge with existing records, such as averaging a new reading with an existing one if a conflict occurs on the composite key.
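A sketch of that upsert, assuming the composite key on (system_id, read_on); excluded refers to the row that could not be inserted:

```sql
INSERT INTO readings (system_id, read_on, power)
VALUES (10, '2023-06-05 10:15:00', 1000)
ON CONFLICT (system_id, read_on)
DO UPDATE SET power = (power + excluded.power) / 2;  -- average the old and new reading
```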
Handling Duplicates: ON CONFLICT Options
| Strategy | Syntax | Use Case |
|---|---|---|
| Ignore duplicates | ON CONFLICT DO NOTHING | Idempotent inserts, skip existing |
| Update on conflict | ON CONFLICT DO UPDATE SET col = value | Upsert/merge new data |
| Replace entirely | INSERT OR REPLACE | Shorthand for full replacement |
| Conditional update | ON CONFLICT DO UPDATE SET col = CASE... | Complex merge logic |
3.4.3 The DELETE statement
The authors show how to clean data sets (e.g., removing readings taken at irregular minute intervals) using DELETE combined with date functions like date_part.
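One possible form of that cleanup, keeping only readings that fall on a quarter-hour boundary:

```sql
-- Delete rows whose minute component is not 0, 15, 30, or 45
DELETE FROM readings
WHERE date_part('minute', read_on) NOT IN (0, 15, 30, 45);
```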
3.4.4 The SELECT statement
This is the core of the chapter, detailing how to retrieve and reshape data (see the SELECT documentation).
- The VALUES clause: DuckDB allows VALUES to be used as a standalone statement (e.g., to mock data) or within a FROM clause to generate virtual tables on the fly.
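For instance (hypothetical values):

```sql
-- VALUES as a standalone statement, e.g. to mock a small result set
VALUES ('AC', 1), ('DC', 2);

-- VALUES inside FROM, producing a named virtual table on the fly
SELECT *
FROM (VALUES ('AC', 1), ('DC', 2)) AS current_types (name, id);
```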
The GROUP BY Clause
The GROUP BY clause generates one row of output per unique combination of the specified columns. The grouped values are aggregated with functions like count, sum, avg, min, or max.

- The JOIN clause: The chapter covers INNER, LEFT/RIGHT OUTER, and FULL OUTER joins. It specifically highlights the ASOF JOIN as a solution for the energy use case, allowing readings to be matched with the most recent valid price ("as of" the reading time) without requiring exact timestamp matches.
Understanding INNER JOIN
An inner join matches rows from the left-hand side with rows from the right-hand side that share the same value in the join column(s). Rows without a match are excluded.

Understanding OUTER JOINs
Outer joins supplement NULL values for rows that have no matching entry on the other side. LEFT OUTER includes all left rows, RIGHT OUTER includes all right rows, and FULL OUTER includes all rows from both tables.

DuckDB JOIN Types Comparison
| JOIN Type | Returns | NULL Handling | Best For |
|---|---|---|---|
| INNER JOIN | Only matching rows from both tables | Excludes non-matches | Required relationships |
| LEFT OUTER JOIN | All left rows + matching right | NULL for unmatched right | Optional enrichment |
| RIGHT OUTER JOIN | All right rows + matching left | NULL for unmatched left | Reverse of LEFT |
| FULL OUTER JOIN | All rows from both tables | NULL for either side | Complete dataset merge |
| CROSS JOIN | Cartesian product (all combinations) | N/A | Generating combinations |
| ASOF JOIN | Nearest match by inequality condition | Inner-like by default; LEFT variant adds NULLs | Time-series data |
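A hedged sketch of the ASOF JOIN for the energy use case, assuming prices carries a valid_from timestamp and a value column; each reading picks up the latest price whose validity started at or before the reading time:

```sql
SELECT r.system_id,
       r.read_on,
       r.power,
       p.value AS price_per_kwh
FROM readings r
ASOF JOIN prices p
  ON r.read_on >= p.valid_from;   -- nearest preceding price, no exact match required
```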
- The COPY TO command: A sidebar highlights how to build data pipelines using COPY (SELECT ...) TO 'file.csv'. This allows users to join data from multiple files and export a clean, deduplicated result to a single output file (a combined sketch with a CTE follows this list).
- The WITH clause (CTEs): Using Common Table Expressions to break complex logic into readable, modular parts.
- Recursive Queries: A detailed example of WITH RECURSIVE to query graph-shaped structures (tree traversal).
- Aggregates: Beyond standard SUM and COUNT, the chapter introduces advanced DuckDB aggregates like arg_max (returning the value of one expression at the row where another is maximal), list (creating arrays), and statistical functions.
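A combined sketch of the COPY TO and CTE items above; the output file name and the daily-energy calculation are assumptions carried over from the earlier view example:

```sql
-- The CTE keeps the aggregation readable; COPY exports the result to a single CSV file
COPY (
    WITH daily AS (
        SELECT system_id,
               date_trunc('day', read_on) AS day,
               sum(power) / 4 / 1000      AS kWh
        FROM readings
        GROUP BY ALL
    )
    SELECT * FROM daily ORDER BY day
) TO 'daily_energy.csv' (HEADER, DELIMITER ',');
```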
Essential DuckDB Aggregate Functions
| Function | Purpose | Example |
|---|---|---|
| COUNT(*) | Count all rows | SELECT COUNT(*) FROM readings |
| SUM(col) | Sum values | SELECT SUM(power) FROM readings |
| AVG(col) | Average value | SELECT AVG(power) FROM readings |
| MIN/MAX(col) | Extreme values | SELECT MAX(power) FROM readings |
| arg_max(expr, col) | Value at maximum | SELECT arg_max(read_on, power) |
| arg_min(expr, col) | Value at minimum | SELECT arg_min(read_on, power) |
| list(col) | Aggregate into array | SELECT list(name) FROM systems |
| first(col) | First value in group | SELECT first(power) FROM readings |
| any_value(col) | Any value from group | SELECT any_value(name) |
3.5 DuckDB-Specific SQL Extensions
DuckDB aims to make SQL more "user-friendly." This section showcases proprietary extensions that solve common pain points in standard SQL.
DuckDB SQL Extensions vs Standard SQL
| Feature | Standard SQL | DuckDB Extension | Benefit |
|---|---|---|---|
| Exclude columns | Must list all wanted columns | SELECT * EXCLUDE (col) | Simpler queries |
| Replace columns | Requires full column list | SELECT * REPLACE (expr AS col) | In-place transforms |
| Dynamic columns | Not available | SELECT COLUMNS('pattern') | Regex column selection |
| Alias in WHERE | Not allowed | Fully supported | Less repetition |
| Auto GROUP BY | Must list all columns | GROUP BY ALL | Faster ad-hoc queries |
| Auto ORDER BY | Must list all columns | ORDER BY ALL | Convenient sorting |
| Data sampling | Vendor-specific | USING SAMPLE n% | Built-in sampling |
3.5.1 Dealing with SELECT
- Exclude: SELECT * EXCLUDE (col_name) returns all columns except specific ones.
- Replace: SELECT * REPLACE (expression AS col_name) modifies a column (e.g., rounding values) while keeping the rest of the star-select intact.
- Columns Expression: Using regular expressions or lambda functions to select columns dynamically. For example, SELECT COLUMNS('valid.*') grabs all columns starting with "valid".
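Hedged examples of all three, assuming a prices table with id, value, and valid_from/valid_until columns:

```sql
SELECT * EXCLUDE (id) FROM prices;                        -- everything except the ID
SELECT * REPLACE (round(value, 2) AS value) FROM prices;  -- transform one column in place
SELECT COLUMNS('valid.*') FROM prices;                    -- every column starting with "valid"
```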
3.5.2 Inserting by name
INSERT INTO ... BY NAME automatically maps source columns to target columns by name, reducing the fragility of positional inserts.
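A sketch, assuming prices has value and valid_from columns plus a sequence-backed id:

```sql
-- Columns are matched by name, so order does not matter;
-- omitted columns fall back to their defaults (e.g., the sequence-backed id)
INSERT INTO prices BY NAME
SELECT DATE '2023-01-01' AS valid_from,
       0.25               AS value;
```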
3.5.3 Accessing aliases everywhere
Unlike standard SQL, DuckDB allows you to reuse column aliases defined in the SELECT clause immediately within the WHERE and GROUP BY clauses, significantly reducing code duplication.
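For example (column names assumed from the earlier schema):

```sql
-- The alias read_day from the SELECT list is reused in both WHERE and GROUP BY
SELECT date_trunc('day', read_on) AS read_day,
       sum(power) / 4 / 1000      AS kWh
FROM readings
WHERE read_day >= DATE '2020-01-01'
GROUP BY read_day;
```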
3.5.4 Grouping and ordering by all relevant columns
The introduction of GROUP BY ALL and ORDER BY ALL automatically infers the non-aggregated columns, making ad-hoc analysis faster and less verbose.
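A brief sketch with the same assumed columns:

```sql
-- The grouping keys (system_id, read_day) are inferred from the SELECT list
SELECT system_id,
       date_trunc('day', read_on) AS read_day,
       sum(power)                 AS total_power
FROM readings
GROUP BY ALL
ORDER BY ALL;
```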
3.5.5 Sampling data
The USING SAMPLE clause allows analysts to instantly query a percentage or fixed number of rows from massive datasets. The sampling documentation covers the specific algorithms used (Bernoulli or Reservoir methods).
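For instance:

```sql
-- Roughly 10% of the table, then an exact 1,000-row reservoir sample with a fixed seed
SELECT * FROM readings USING SAMPLE 10%;
SELECT * FROM readings USING SAMPLE reservoir(1000 ROWS) REPEATABLE (42);
```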
3.5.6 Functions with optional parameters
DuckDB functions, such as read_json_auto, often support named, optional parameters. This allows users to specify configuration like dateformat without needing to provide arguments for every single parameter in a specific order.
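A hedged example (the file name is a placeholder):

```sql
-- Only the relevant named parameter is spelled out; everything else keeps its default
SELECT *
FROM read_json_auto('prices.json', dateformat = '%Y-%m-%d');
```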
Summary
- Structure: SQL queries are built from statements and clauses. They are broadly categorized into Data Definition Language (DDL) for structure and Data Manipulation Language (DML) for data operations.
- Scope of DML: DML queries cover creating, reading, updating, and deleting rows. In DuckDB, reading data is considered manipulation as it involves transforming existing relations into new result sets.
- Persistence: DDL queries (CREATE TABLE, CREATE VIEW) define a persistent schema. This works identically whether DuckDB is running in-memory or on disk.
- Data Integrity: Defining a rigid schema helps identify data inconsistencies that might be missed in schemaless systems. Constraint errors during ingestion can be handled gracefully using ON CONFLICT clauses.
- Innovation: DuckDB significantly lowers the barrier to writing complex SQL with features like SELECT * EXCLUDE, SELECT * REPLACE, and intuitive alias usage that works across different clauses.
Continue Reading: Chapter 2: The DuckDB CLI | Chapter 4: Advanced Analytics
Related: Learn how MotherDuck extends DuckDB to the cloud with serverless analytics.

FAQs
What is the difference between DDL and DML in SQL?
DDL (Data Definition Language) defines database structure using statements like CREATE TABLE, ALTER TABLE, and DROP TABLE. DML (Data Manipulation Language) works with the actual data using SELECT, INSERT, UPDATE, and DELETE. Think of DDL as building the container (tables, views, schemas) and DML as filling and querying that container. In DuckDB, both work identically whether running in-memory or with persistent storage.
How do JOINs work in DuckDB?
DuckDB supports all standard SQL JOIN types: INNER JOIN returns only matching rows, LEFT/RIGHT OUTER JOIN includes all rows from one side with NULLs for non-matches, and FULL OUTER JOIN includes all rows from both tables. DuckDB also offers ASOF JOIN for time-series data, which matches rows based on the nearest timestamp rather than exact equality—perfect for joining readings with the most recent valid price. Use JOIN ... USING (column) when both tables share a column name, or JOIN ... ON for different column names.
What is a Common Table Expression (CTE) in DuckDB?
A Common Table Expression (CTE) is a temporary named result set defined with the WITH clause that exists only for the duration of a query. CTEs make complex queries more readable by breaking them into logical steps. Unlike subqueries, CTEs can reference each other and can be recursive (WITH RECURSIVE) for traversing hierarchical data like trees. Example: WITH monthly_totals AS (SELECT month, SUM(power) FROM readings GROUP BY month) SELECT * FROM monthly_totals WHERE ...
What SQL extensions does DuckDB add beyond standard SQL?
DuckDB adds several productivity features: (1) SELECT * EXCLUDE (col) to select all columns except specific ones; (2) SELECT * REPLACE (expr AS col) to transform columns inline; (3) COLUMNS('regex') for dynamic column selection; (4) GROUP BY ALL and ORDER BY ALL to auto-infer columns; (5) Alias reuse in WHERE and GROUP BY clauses; (6) USING SAMPLE n% for built-in data sampling; (7) INSERT ... BY NAME for column-name matching. These extensions reduce boilerplate and make ad-hoc analysis faster.
How do I handle duplicate data when inserting into DuckDB?
DuckDB provides ON CONFLICT clauses for handling duplicates: (1) ON CONFLICT DO NOTHING silently skips duplicates, making inserts idempotent; (2) ON CONFLICT DO UPDATE SET col = value performs an upsert, merging new data with existing rows; (3) ON CONFLICT (column) DO UPDATE SET col = excluded.col lets you reference the incoming row's values using the excluded alias. You can also use complex expressions like CASE statements in the update logic.
What is an ASOF JOIN and when should I use it?
An ASOF JOIN matches rows based on inequality conditions (<=, <, >, >=) rather than exact equality—typically used for temporal data. For example, joining energy readings with prices where each reading should use the price that was valid 'as of' that timestamp. Instead of requiring exact timestamp matches, ASOF JOIN finds the nearest preceding (or following) match. This is essential for time-series analysis, financial data, and any scenario where reference data has validity periods.
How does GROUP BY ALL work in DuckDB?
GROUP BY ALL is a DuckDB extension that automatically groups by all non-aggregated columns in your SELECT clause. Instead of writing SELECT category, region, SUM(sales) FROM data GROUP BY category, region, you can write SELECT category, region, SUM(sales) FROM data GROUP BY ALL. DuckDB infers which columns need grouping. Similarly, ORDER BY ALL sorts by all columns from left to right. These features dramatically speed up ad-hoc analysis and reduce errors from mismatched column lists.
How can I sample data in DuckDB for faster exploration?
DuckDB's USING SAMPLE clause lets you query a subset of data: SELECT * FROM large_table USING SAMPLE 10% returns roughly 10% of rows. You can specify methods: bernoulli (row-by-row sampling, more even distribution) or system (vector-based, faster but less precise for small datasets). For exact row counts, use reservoir sampling: USING SAMPLE 1000 ROWS. Add REPEATABLE (seed) for reproducible results. This is invaluable for exploring massive datasets without loading everything into memory.


