Introduction to DuckDB with a focus on Python

I cover how DuckDB pairs with Python for efficient data query & analysis; setup, basic operations, and advanced tips.

Author

Affiliation

Sid Metcalfe

Cartesian Mathematics Foundation

Published

November 13, 2023

What is DuckDB and Why Use It With Python

I’ve been using DuckDB for the past year and I’m pretty stoked about it. If you’re into Python and data analysis it is an extremely versatile tool.

First off, DuckDB is an in-process SQL database (think SQLite). It’s designed as an OLAP (Online Analytical Processing) system, which means it’s made for complex queries, aggregations. It’s extremely fast for analytical queries, even on larger datasets.

It combines with Python very well - with the benefit of Python having tools for anything you want to do—data science, web development, scripting, and so on. Pairing Python with DuckDB you can manipulate your data with all the useful Python packages like Pandas, and then seamless shift to DuckDB for some heavy lifting data analysis without leaving your Jupyter notebooks or scripts.

Another reason I like DuckDB is its low-fuss, zero-config nature. There’s no need to set up a server or go through a lengthy installation process. You just pip install duckdb and import it in Python, similar to how you might use Polars for fast data analysis with minimal setup. It stores everything neatly in a single file. So, if you’re like me and break into hives thinking about database config files, DuckDB is your go-to tool.

Similarly, compatibility is great: DuckDB works well alongside Pandas DataFrames. You can easily go from a DataFrame to a DuckDB table (and vice versa). This means that you can leverage DuckDB’s processing speed while still using Pandas for handling data in a familiar way.

It’s also robust, capable of handling multi-threading out of the box. So, if your laptop or workstation has multiple cores than DuckDB scales nicely to use all that CPU power.

There may be cases where DuckDB isn’t your go-to—like if you need specific features from traditional DBMS for transactional operations. But for analytical stuff it is a gem.

If you’re interested, there’s a bunch of research and development backing DuckDB at CWI Amsterdam and MIT. Plus, the code’s all open-source on GitHub.

Setting Up DuckDB in a Python Environment

A screenshot of a python ide with a duckdb installation code snippet

Getting started, I didn’t need to venture into the wilds of system configurations or the depths of the so-called “dependency hell” – thanks to the Python ecosystem and its wonderful package manager pip. Just a simple command pip install duckdb and I had DuckDB installed and ready to roll.

After the installation, it was time to set up a database. DuckDB runs in-process, which means no fiddling around with servers or authentication – it’s all about simplicity and speed. So, I imported duckdb in my Python script, and I was pretty much done. No lengthy configurations, just:

import duckdb

con = duckdb.connect(database=':memory:', read_only=False)

And you have an in-memory database ready to execute queries. Of course, if I wanted persistence, I could just replace ':memory:' with a file path.

Now, here’s where it gets interesting: you can use DuckDB in conjunction with pandas. You can run a query directly on a Pandas DataFrame with just one line of code. And DuckDB is smart enough to avoid copying data around needlessly.

import pandas as pd

df = pd.DataFrame({
  'id': [1, 2, 3],
  'name': ['Daffy', 'Donald', 'Scrooge']
})

result = con.execute("SELECT * FROM df").fetchdf()

It is great to have the power of SQL in Python. Regarding the actual database session - you just have to create a connection to a file-based database instead:

con_disk = duckdb.connect(database='my_duckdb_file.duckdb', read_only=False)

With just this one line you are all set with a database file that remembers your data even after you shut down you script, making it perfect for both quick analyses and more durable data storage needs.

For those of you who need to dive deep into the documentation or if you’re the type who likes to know exactly how the sausage is made, the DuckDB GitHub repo (https://github.com/duckdb/duckdb) is a goldmine. It’s well-documented, and there’s a helpful community ready to give you a nudge if you happen to get stuck (which is unlikely due to DuckDB’s intuitive syntax and operations).

Basic Operations with DuckDB in Python

Code snippet of python script executing basic sql queries in duckdb

Sometimes working with database systems feels like you’re trying to knit a sweater while wearing boxing gloves. But with DuckDB this is much less painful as you can perform CRUD operations (Create, Read, Update, Delete) straight from your Python environment.

First, you create connections and cursors, just like you would with SQLite:

import duckdb
con = duckdb.connect(database=':memory:', read_only=False)

You can opt for an in-memory database which is lightning fast for smaller datasets or store bigger datasets on disk.

Creation is as easy as running a simple SQL statement to structure your tables:

con.execute("CREATE TABLE birds (name VARCHAR, species VARCHAR, age INTEGER)")

Once your table is set up, you can insert data as follows:

con.execute("INSERT INTO birds VALUES ('Quackers', 'Mallard', 4)")

Now, if you’re like me and you tend to bulk-load data, DuckDB has got you covered. You can insert whole dataframes from Pandas with zero hassle.

Reading data back is where DuckDB truly shines, especially if you’re familiar with Pandas. You use a simple query method and you’ve got a fully-fledged dataframe ready:

df = con.execute("SELECT * FROM birds").fetchdf()

Unlike Pandas, no more converting data types back and forth. Updating data is just as straightforward. Say one of your birds had a birthday, and you need to update its age:

con.execute("UPDATE birds SET age = 5 WHERE name = 'Quackers'")

I love how clear and intuitive this is – it’s SQL at its best, smooth and unobtrusive. And if you need to remove a record, deleting is just one line:

con.execute("DELETE FROM birds WHERE name = 'Quackers'")

What’s also really nice about DuckDB is how it seamlessly integrates with the Python ecosystem. If you’re about big data analytics it will likely integrate well with the tools you already use.

In terms of links and resources, the DuckDB GitHub repo is a fantastic resource. It’s stuffed with comprehensive info, examples, and it’s the spot to dig through if you’re up for contributing.

To sum it up, these basic operations with DuckDB are the building blocks of data manipulation in Python. They’re simple, intuitive, and downright powerful. There’s just something so satisfying about a tool that gets out of the way and lets you focus on what truly matters – the data and the insights it holds.

Advanced Usage and Performance Tips

Illustrative graphs showing performance metrics of duckdb with python

First, you gotta appreciate DuckDB’s vectorized query execution. It’s built to process data in batches, zipping fastly through operations. The key here is to structure your queries to take full advantage of this. You’ll want to avoid anything that can’t be vectorized, like user-defined functions that aren’t written with vectorization in mind.

Parallel processing is another one of DuckDB’s strengths. It can automatically utilize multiple CPU cores for hefty queries. At the same time, not all tasks are parallelizable: stuff like ORDER BY and aggregation is usually limited to single-threaded operations due to their nature.

Regarding data types: choosing the right type can make a big difference. For instance, if you don’t need the extra range of BIGINT, just stick with INT. It’s the little choices like this that can add up to significant speed gains.

DuckDB also support indices, but they’re not always the go-to solution like in other databases. Still, creating an index here and there, especially on columns used in JOIN operations or WHERE clauses, can give your query speed a decent bump. Just remember, indexes have their own overhead, so use them wisely.

Are you dealing with a large amount of data? I’ve been there, and partitioning is your friend. Breaking your data into chunks, especially if you’re often filtering by a specific column, can lead to massive performance gains. DuckDB can swiftly scan the relevant partitions, ignoring the rest, which is a godsend for large datasets.

Another handy feature is the ability to save intermediate results with CREATE VIEW. If you’re repeatedly using a complicated subquery, why do the heavy lifting every single time? Create a view and access those results directly, much like how in data analysis with Python, using the right libraries can streamline the process.

Similarly, the COPY command is my go-to for bulk data operations. Whether you’re ingesting or offloading data, COPY does it with minimal fuss and maximal speed. And it’s easy to use with CSV files – my preferred file types for DuckDB.

Lastly I’d advice to inspect the DuckDB GitHub repo and their documentation. The DuckDB community is pretty helpful, and they’re always coming up with new ways to get the most out of databases.

In conclusion, from my personal experience, setting up and working with DuckDB in a Python environment is a breeze. It’s a tool that helps you work smarter, not harder, leaving you more time to enjoy the data you’re working with.