0:08 Hi, I'm Jelte, and I work at MotherDuck, and I'll talk about how to make queries in Postgres faster and easier to write,
0:20 and how that makes ducking awesome analytics in Postgres possible, with what is called pg_duckdb.
0:32 And well, pg_duckdb is a Postgres extension
0:36 that embeds DuckDB inside Postgres.
0:40 I'll explain what that means.
0:47 So first, let's take a bit of a step back.
0:51 Postgres is a pretty amazing database. It's open source, there are many contributors, and there are many other good things about it.
1:01 One of those also being that every release gets extra functionality.
1:08 And then it's very extensible: you can write extensions that add functionality.
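In fact, pg_duckdb itself is installed like any other extension. A minimal sketch, assuming the extension's files are already installed on the server:

    -- Load pg_duckdb into the current database.
    CREATE EXTENSION pg_duckdb;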
1:16 Now, Postgres is built for transactional workloads, which means it can very easily
1:21 find a specific user in the database, or show your current shopping basket,
1:30 fetching all that from the database,
1:34 and it can do that very fast, very efficiently. What it's less good at is analytical workloads, or sometimes called OLAP.
1:43 What that means is, to answer a query, instead of finding one specific thing, like a user or a shopping basket,
1:56 you have to scan, say, all orders to find out
2:00 the 10 most sold products from the last month,
2:04 or from the last 10 months, and display those,
2:09 in a nice graph in some dashboard.
2:14 Postgres is usually not great at that, yeah, like this elephant buried under the books:
2:22 it doesn't like it.
2:27 So why is it not great at OLAP? One reason is that there's no columnar storage in Postgres.
2:34 Postgres stores rows of data on the actual disk.
2:39 So every row is one specific item, and it stores all the data for that row together, which is great if you want to show a user profile.
2:51 But for analytics, that means that maybe you don't care about most of the
2:57 columns, while you do want to see all the rows.
3:04 You maybe only care about like 3 columns, but you read a hundred values for every row, because that's how many columns the table has.
3:18 So every read is much bigger than it actually needs to be, full of data that you don't care about.
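As a hypothetical illustration (the order_items table and its columns are made up), a query like this only needs three columns, but a row store still reads every column of every row from disk:

    -- Only three columns are needed, but with row-oriented storage the
    -- whole row (imagine ~100 columns) is read from disk anyway.
    SELECT product_id, sum(quantity) AS sold, avg(price) AS avg_price
    FROM order_items
    GROUP BY product_id;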
3:28 The second reason is parallel processing.
3:32 These days computers have lots of cores,
3:37 and using all of them is very important for making it go fast.
3:42 Postgres is not really designed with that in mind; parallelism is sort of bolted on a bit.
3:49 It wasn't originally built with parallel processing in mind.
3:54 So not all the, like, many of the Postgres
3:58 internals, they don't work in a parallel way.
4:03 They're very sequential. Only specific things are made parallel in Postgres.
4:12 So that's where DuckDB can come to the rescue.
4:17 So what is DuckDB? Well, it is an
4:22 in-process SQL analytics engine.
4:28 It does have columnar storage on disk, and it's built with multi-threading in mind. That's one of its core design principles: lots of things need to be parallel.
4:40 And it tries to make stuff go fast in parallel. So what does this in-process stuff mean?
4:48 Well, compare the databases we have, like Postgres and SQLite.
4:53 Postgres is a transactional database, like I explained before,
4:58 and it runs as a server, which means that a client
5:03 connects to it, usually over the network.
5:07 And that makes it very nice for certain applications, where you have many clients all connected to the same database.
5:19 But SQLite is a different kind of database. It doesn't run as a server; instead, it is something that you load inside
5:28 a process, like a web browser. So, for instance, some things that the web browser needs to store, it stores in SQLite.
5:35 And then you can use that without going over the network. It's a database that is well-suited for transactional workloads,
5:46 without having a dedicated server.
5:51 So it's sort of two different use cases. And DuckDB is another type of database.
5:59 It's similar to SQLite in that it's in-process, so you can load it in Python,
6:07 or it runs inside your browser in Wasm.
6:12 You can compile it to a Wasm binary and execute it inside your browser.
6:20 But, like I said, it's an analytical database,
6:26 and it's very fast at that. So, Postgres and DuckDB,
6:34 they are good at different things, with Postgres being good at transactional workloads and DuckDB being good at analytical workloads.
6:40 And that's sort of the idea of pg_duckdb: we're trying to make them one.
6:47 DuckDB is in-process, so maybe we can put it inside the Postgres process,
6:53 so we can put DuckDB in there, and then everything will be faster, and you get a database that is both transactional and analytical.
7:09 And that's sort of what pg_duckdb tries to be.
7:16 So what does that look like? How do you put DuckDB inside Postgres?
7:23 And what does that mean? Well, I kind of want it to be like this:
7:28 a duck and an elephant next to a sunset, drinking a nice cold beer,
7:35 and just being general friends and happy together.
7:41 But that's not entirely what it always looks like. Sometimes it's more like this sort of monster picture,
7:50 this Frankenstein-y mess of things.
7:57 And now, after a lot of work, it's a lot better, but this is sort of how it started, for sure.
8:06 And the reason is that Postgres and DuckDB, they're very different. It's not just ducks and elephants, basically.
8:13 It's not just that
8:17 one is good at transactional and the other is good at analytical workloads. They're also very different in how they're actually designed
8:27 to accommodate those different workloads, partly because Postgres is much older than DuckDB. Postgres is from the previous millennium,
8:35 and DuckDB is from a few years ago.
8:39 A project differs a lot in how it's designed based on
8:46 when it got started, and there are many design decisions in Postgres stemming from those early days.
8:55 And one of those things is that Postgres is written in C and DuckDB is written in C++.
9:03 And so there's a lot of mismatch in how things are done. Luckily, C++ code generally works
9:16 well with the Postgres code.
9:21 You have to make some translations here and there,
9:26 but it's doable,
9:31 because C++ was designed to work relatively well with plain C.
9:39 It's at least sort of similar. One of the main problems is the way exceptions work.
9:48 Exceptions just exist in C++. It's a built-in thing. You can throw an exception where an unexpected error happens.
9:58 For instance, on a divide by zero, you throw an exception saying, oh, there's an error.
10:05 And C doesn't provide any way of doing that by default.
10:13 So Postgres implemented its own exceptions, with its own try and catch.
10:22 In C++ it's try and catch, but in Postgres it's PG_TRY, PG_CATCH. It's basically the same, but also very different.
10:30 The big problem is that the different try-catch blocks
10:36 don't understand each other's exceptions. If C++ throws an exception and you have a PG_TRY, PG_CATCH block around that,
10:44 those handlers will not get executed. And the cleanup that might happen in the
10:53 PG_CATCH handler is not executed. So you might leak some memory. And the same is true the other way around.
10:59 If Postgres throws an exception, C++ cannot catch it with its normal try-catch handler.
11:09 And a lot of cleanup in C++ relies on exceptions, even locking cleanup.
11:16 Destructors run when an exception is thrown,
11:23 and they release a lock, stuff like that.
11:29 So it's tricky, but we manage.
11:34 It is a very hard problem to
11:38 make work well. The next big difference: Postgres uses processes and DuckDB uses threads.
11:45 DuckDB uses a ton of threads, and Postgres doesn't use any of them, basically. It uses one process per connection, and the code is also not thread-safe,
11:55 because it assumes, oh, there's only one thread in this process. That's the one that's doing everything. So I can change global variables,
12:04 I can do anything that I want.
12:09 And so, yeah, that's
12:13 problematic, because there are all these DuckDB threads, but a Postgres function
12:23 cannot be called in a thread-safe way. So we had to build some infrastructure around how to handle that.
12:31 And similarly, the other way around: DuckDB doesn't expect a file to be opened, for instance, from multiple processes.
12:39 It needs to open it from only one process. The Postgres process model makes that problematic,
12:47 because you might have 2 DuckDB instances in 2 processes that maybe want to open the same file. We handled all of that as best as we could.
12:59 And that paid off. Now we're much closer to the duck and the elephant in the sunset that you saw in the picture before.
13:10 So how does it work? That's sort of an important question. If you put DuckDB inside Postgres, how does it actually work?
13:21 First, a quick look at how Postgres works. When a query comes in from a client over the network,
13:31 the first thing Postgres will do is give it to the parser, which takes this string of characters
13:37 and turns it into what's called an abstract syntax tree,
13:42 a structured representation of what this query actually means: it's a SELECT, and it selects these columns from this table.
13:50 It's much easier to work with that kind of objects and structures
13:54 than with a plain string.
13:59 And then, once it has that, it gives the tree to the planner, and the planner will then decide on the cheapest way of executing this. Then the planner will give that plan to the executor,
14:11 and that will actually do the work: it will do the computations, it will do the GROUP BY, stuff like that.
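You can peek at what the planner produced with EXPLAIN; the table here and the exact plan shown are just an illustration:

    -- EXPLAIN prints the plan the Postgres planner chose;
    -- the executor is what then runs this plan.
    EXPLAIN SELECT type, count(*) FROM hacker_news GROUP BY type;
    -- Example output (abbreviated):
    --   HashAggregate
    --     Group Key: type
    --     ->  Seq Scan on hacker_news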
14:22 And that's how it works in Postgres. So, how does it work in pg_duckdb? What pg_duckdb does is it basically steals the query.
14:29 And it doesn't do this always. There's sort of a check: is this a query that pg_duckdb should care about?
14:36 And if it is, then it will just take over.
14:40 It takes that abstract syntax tree, that sort of meta query structure.
14:45 It will not let the Postgres planner do anything. It will just translate the tree and give it to the DuckDB planner, and then the DuckDB executor will execute that.
15:00 And the important part is that it can still read Postgres data. So, with pg_duckdb, DuckDB can still read the data that's in Postgres.
15:13 So, that's sort of a small but important detail.
15:19 What can this thing do? These are the main use cases of pg_duckdb.
15:29 I'll list them first and go into a little bit more detail after. First, you can run the DuckDB execution engine on Postgres tables.
15:37 And that's sort of it. You don't have to change where your data lives. You can just use pg_duckdb on the data that you already have.
15:48 Second, you can read and write data in Blob Storage.
15:52 There's support for Azure Blob Storage and S3, and for
15:57 formats like JSON, Parquet, Iceberg.
16:02 Lots of different ones, for reading from and writing to Blob Storage.
16:08 And third, you can offload analytics to MotherDuck.
16:14 So, let's start with the first one: the DuckDB execution engine running on Postgres tables. So, how does that work?
16:24 Well, it's extremely simple. There's a setting, duckdb.force_execution; you set it to true,
16:30 and your queries run with the DuckDB plan.
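A minimal sketch of what that looks like in a session (the orders table is made up for illustration):

    -- Route queries through DuckDB's planner and executor.
    SET duckdb.force_execution = true;

    -- This aggregate is now executed by DuckDB,
    -- while still reading the rows stored in Postgres.
    SELECT product_id, count(*) AS num_orders
    FROM orders
    GROUP BY product_id
    ORDER BY num_orders DESC
    LIMIT 10;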
16:35 And then you check if it was a bit faster. It depends on your query.
16:43 It depends on your data. It depends on a lot of things.
16:49 But sometimes, yes, it's much faster. For example, there's this ClickBench benchmark. That's a public benchmark created by ClickHouse.
16:57 It runs analytics queries on a single table, mainly aggregation. It's very aggregation-heavy.
17:05 And there, for some of the queries,
17:10 you get big speedups if you just set this new setting to true.
17:17 And so, that's pretty impressive. Some queries also get slower. So, it's not like a silver bullet.
17:24 But it is sort of worth trying out. I hit another extreme case while trying some benchmarks, with TPC-DS, which is also a common query benchmark.
17:34 I ran it on Postgres with default settings and no indexes. Normally you would create indexes, but here, I wanted to see
17:41 how it would work without. And for analytics, that's fairly normal, because
17:48 the indexes that would help your analytics queries are usually not the ones you created for your transactional workloads.
17:58 So, the first query in TPC-DS: I ran it on plain Postgres, it kept running,
18:03 and I just gave up.
18:08 And then, I set this setting to true, and I ran it again. And now, it's done in 450 milliseconds.
18:15 You run it and pretty much it's finished. That was one of the best query optimizations I ever did: everything is much better. As I said, this is an extreme example.
18:28 This is not something you can normally expect. But it's still worth trying, because maybe you get a 2x speedup,
18:38 maybe you get a 10% performance increase, from
18:43 changing a single setting. And then we get to the second use case, which is using Blob Storage.
18:52 With pg_duckdb, we can just read from Blob Storage.
18:56 You write SELECT ... FROM read_parquet with a URL, and it will just return the results,
19:06 like return the first 5 rows in this case.
19:10 It works like a regular table, because you can just do, like, normal SQL on it,
19:15 and then you get a table full of the parquet data.
19:23 You can do a GROUP BY or an ORDER BY on this parquet data.
19:29 And what's nice is that parquet is actually stored in columnar format.
19:35 And so DuckDB knows that it doesn't have to fetch every column in this parquet file to your Postgres server.
19:44 It will just fetch the columns that you need. Here we just need title, days in top 10, and type.
19:49 So it will only fetch the data for those 3 columns and leave all the rest, maybe like a hundred different columns, in Blob Storage and not touch it.
19:57 So the query will be faster and the bandwidth used will be less.
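Here is a sketch of such a query; the bucket URL and exact column names are assumptions for illustration:

    -- Only the three referenced columns are fetched from Blob Storage.
    SELECT r['title'], r['days_in_top_10'], r['type']
    FROM read_parquet('s3://my-bucket/netflix_top10.parquet') r
    ORDER BY r['days_in_top_10'] DESC
    LIMIT 5;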
20:02 One thing to note is the syntax, like the r['stuff'] that you see.
20:08 That's because read_parquet returns sort of arbitrary columns, and Postgres needs to know upfront what columns it's going to be; it cannot decide that based on what's actually in the file.
20:19 Postgres doesn't support having this sort of dynamic column stuff.
20:25 So instead, what read_parquet does is return one row value that you can index, just like you could index like a JSON column.
20:33 So you index on the columns of the row, basically.
20:40 Then, before we go to the last thing, we have to talk a bit about resources.
20:46 This is what a Postgres server normally looks like.
20:51 It's running along with some transactional queries that take up a few resources. But when you run analytical queries with something like pg_duckdb,
21:04 well, with the two examples I gave before of using it, those take a lot more resources, because
21:12 it's going to use more CPU; it will just use a lot more resources of the machine.
21:18 So you cannot run as many queries concurrently as you could previously.
21:23 So that's where MotherDuck comes in. MotherDuck, this is the company I work for,
21:29 and with it you can offload this computation to the MotherDuck cloud and just get the results on the Postgres server.
21:42 Then the resources needed for analytical queries on Postgres are much smaller. You can use a tiny Postgres instance, or you can still combine it
21:50 with the Postgres data that you might have. All the heavy computation is happening in the MotherDuck cloud, so that makes it much easier to scale,
22:01 and stuff like that. To show what that looks like: you copy data into MotherDuck.
22:10 You do CREATE TABLE and then USING duckdb.
22:16 And that's sort of the key part. Then you can do AS SELECT with some data from a Postgres table,
22:22 and then you have the data in MotherDuck. Here you get a table in MotherDuck with the same data as in the hacker_news table
22:29 that was originally in Postgres. And then you can just query that like normal. You can do whatever you want;
22:38 queries will sort of work transparently, without you changing anything,
22:46 except that it's a lot faster, suddenly.
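A sketch of that flow; hacker_news is the table from the talk, while the new table's name is made up:

    -- Create a DuckDB-backed (MotherDuck) table from existing Postgres data.
    CREATE TABLE hacker_news_md USING duckdb
        AS SELECT * FROM hacker_news;

    -- From here on it can be queried like any other table.
    SELECT count(*) FROM hacker_news_md;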
22:51 And you can combine it with some data inside Postgres. So you can do a join with a table that you still want to store in Postgres, because you update that often.
23:02 But the archive from before, that's still in MotherDuck, and you can join them and sort of get a shared result, partially coming from Postgres and partially coming from MotherDuck.
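For instance, something like this hypothetical join, where users still lives in Postgres and the archive lives in MotherDuck (table and column names assumed):

    -- One side of the join is a Postgres table, the other is in MotherDuck.
    SELECT u.name, count(*) AS posts
    FROM hacker_news_md h
    JOIN users u ON u.id = h.user_id
    GROUP BY u.name
    ORDER BY posts DESC
    LIMIT 10;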
23:18 OK, but is that actually fast? And, well, yes. For analytics, it is really fast. These are the results in the ClickBench benchmark
23:35 for MotherDuck and pg_duckdb compared to just regular Postgres,
23:40 even with Postgres with indexes in this case.
23:45 So yeah, that's it. It's MIT-licensed. It's open source.
23:50 You can build it from source or install a Docker image.
23:55 Feedback is very welcome, and I'm looking forward to hearing from any of you who try it out.