Technology · Data Engineering
DuckDB in Practice: Analytical Queries Without a Data Warehouse
DuckDB runs OLAP queries directly in your process, on files on your laptop, without a server. Here's how to use it and when it beats spinning up BigQuery or Redshift.
Anurag Verma
7 min read
Sponsored
You have a CSV with two million rows. You want to group by a column, filter on a date range, and compute a percentile. You could load it into pandas and write the aggregations. You could spin up a PostgreSQL instance and import it. Or you could open a DuckDB connection, write one SQL statement, and get the result in under a second.
That third option is what DuckDB is for. It’s an embedded OLAP database that runs inside your Python process, your Node.js server, your CLI tool. No separate server. No cloud account. No permission setup. Just SQL on data.
What DuckDB Actually Is
PostgreSQL is built for OLTP: row-by-row transactional reads and writes. That’s the right model for a web application where you’re inserting individual records and querying by primary key.
OLAP workloads look different. You’re scanning millions of rows and aggregating them. Computing averages across all records from last quarter. Joining a 10M-row events table against a 100K-row user table. For these patterns, column-oriented storage wins. Rather than reading all columns for a row, the database reads only the columns you touched.
DuckDB is column-oriented and OLAP-native. It’s also embedded, meaning it runs in the same process as your application rather than as a network service. The two properties together give you analytical query performance without infrastructure.
Version 1.0 shipped in January 2024. By early 2026, it’s on 1.1.x and the API is stable. The Python package alone gets ~7 million downloads per month.
Installation
pip install duckdb
That’s the full install. There’s also a CLI:
# macOS
brew install duckdb
# Linux (check releases page for latest)
curl -LO https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
Querying Files Without Importing Them
The feature that most people don’t expect: DuckDB can query CSV and Parquet files directly. No import step.
import duckdb
# Query a CSV directly
result = duckdb.sql("""
SELECT
category,
COUNT(*) as total,
AVG(amount) as avg_amount,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) as p95_amount
FROM 'sales_data.csv'
WHERE order_date >= '2025-01-01'
GROUP BY category
ORDER BY total DESC
""").df()
print(result)
The .df() call returns a pandas DataFrame. .fetchall() gives you a list of tuples. .arrow() returns a PyArrow table.
DuckDB also handles compressed files and glob patterns:
# Query all Parquet files in a directory
duckdb.sql("SELECT COUNT(*) FROM 'data/logs/*.parquet'").fetchone()
# Query a gzipped CSV
duckdb.sql("SELECT * FROM 'archive.csv.gz' LIMIT 10").fetchall()
This is useful when you have data exports from S3, Snowflake exports, or log archives that you want to query without a full pipeline.
Working With In-Memory Databases
For anything that needs to persist across queries in a script, use a connection:
import duckdb
con = duckdb.connect() # in-memory, lost when connection closes
# Create a table from a DataFrame
import pandas as pd
df = pd.read_csv("events.csv")
con.execute("CREATE TABLE events AS SELECT * FROM df")
# Now query it repeatedly without re-reading the CSV
con.execute("""
CREATE VIEW daily_totals AS
SELECT
DATE_TRUNC('day', timestamp) as day,
user_id,
COUNT(*) as event_count
FROM events
GROUP BY 1, 2
""")
result = con.execute("""
SELECT day, SUM(event_count) as total
FROM daily_totals
WHERE day >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY day
ORDER BY day
""").df()
For persistent storage across process restarts:
# Pass a file path to persist to disk
con = duckdb.connect("analytics.duckdb")
The .duckdb file is self-contained and portable. Copy it, open it from a different machine, query it the same way.
DuckDB vs Pandas for Aggregations
Pandas is excellent for transforming data row by row. It’s slower for aggregations on large datasets because it’s not columnar and doesn’t parallelize by default.
DuckDB uses all available cores automatically. A few comparisons on common operations:
| Operation | pandas (8-core M2) | DuckDB (8-core M2) |
|---|---|---|
| GROUP BY on 10M rows | ~4.2s | ~0.4s |
| Window function p95 on 5M rows | ~6.1s | ~0.6s |
| JOIN two 1M-row tables | ~2.8s | ~0.3s |
These are rough estimates, not a controlled benchmark. The actual difference depends on the operation and your data. For simple column arithmetic and row-by-row transforms, pandas is often more natural. For aggregations, window functions, and joins at scale, DuckDB is reliably faster.
Practical Uses in an Agency or Dev Team Context
1. Analyzing application logs
When a production incident happens and you have gigabytes of JSON logs, DuckDB can read them directly:
con.execute("""
CREATE TABLE logs AS
SELECT *
FROM read_json_auto('logs/*.json')
""")
# Find slow endpoints
con.execute("""
SELECT
endpoint,
COUNT(*) as requests,
AVG(response_time_ms) as avg_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) as p99_ms
FROM logs
WHERE timestamp >= '2026-05-28'
GROUP BY endpoint
HAVING avg_ms > 500
ORDER BY p99_ms DESC
""").df()
2. Querying data exports
A client sends you a CSV of their transaction history. You need to find anomalies, produce a summary, or join it with another export.
# Join two exports without a database server
duckdb.sql("""
SELECT
t.transaction_id,
t.amount,
c.company_name,
c.country
FROM 'transactions.csv' t
JOIN 'customers.csv' c ON t.customer_id = c.id
WHERE t.status = 'failed'
AND t.amount > 1000
ORDER BY t.amount DESC
LIMIT 100
""").df()
3. Building a lightweight analytics layer
For internal tools where you don’t need a full data warehouse, DuckDB on a small server with a scheduled job that writes Parquet files is a real alternative to Redshift or BigQuery:
# Append new data daily
con = duckdb.connect("analytics.duckdb")
con.execute(f"""
INSERT INTO events
SELECT * FROM 'data/events_{today}.parquet'
""")
4. Data pipeline testing
DuckDB is excellent for testing transformation logic before running it in production:
import duckdb
import pytest
def transform_events(con):
con.execute("""
CREATE TABLE output AS
SELECT
user_id,
DATE_TRUNC('day', created_at) as day,
SUM(value) as daily_total
FROM events
GROUP BY user_id, day
""")
def test_no_nulls_in_output():
con = duckdb.connect()
con.execute("CREATE TABLE events AS SELECT * FROM 'fixtures/sample_events.csv'")
transform_events(con)
null_count = con.execute(
"SELECT COUNT(*) FROM output WHERE user_id IS NULL OR daily_total IS NULL"
).fetchone()[0]
assert null_count == 0, f"Found {null_count} null rows in output"
When Not to Use DuckDB
DuckDB is an analytical database. It’s not designed for:
- High-concurrency transactional writes (use PostgreSQL, MySQL)
- Multi-user web application backends (multiple concurrent writers will conflict)
- Real-time streaming ingestion (DuckDB reads, it doesn’t subscribe)
- Terabyte-scale data where you need distributed compute
If your data fits on one machine and you’re doing analytics, DuckDB is almost always the right tool. If you’re building a web app with many concurrent users writing data, it’s the wrong tool.
DuckDB also supports a local server mode (via the ODBC driver or via MotherDuck, the managed cloud product), but the embedded single-process use case is where it shines.
The MotherDuck Option
MotherDuck is a managed cloud extension of DuckDB. You connect from your local DuckDB instance to MotherDuck and it handles storage and larger-than-RAM datasets:
con = duckdb.connect("md:?token=your_token")
con.execute("USE my_database")
If your team wants shared analytics without the overhead of Snowflake or Redshift, MotherDuck is worth a look. It uses DuckDB’s SQL syntax, so everything you learn locally transfers directly.
The Short Version
DuckDB is what you reach for when you need SQL on data and don’t want to set up a server. It’s fast, it requires no infrastructure, and it handles CSV, Parquet, and JSON natively. The Python API is clean enough to replace a lot of pandas aggregation code with a single SQL statement. If you haven’t tried it, install it this afternoon.
Sponsored
More from this category
More from Technology
The Modern Developer's Local Tooling in 2026: Terminal, Git, and the Productivity Stack
Product Analytics for Developer Teams: PostHog vs Mixpanel vs Amplitude in 2026
React Native's New Architecture in 2026: What the Fabric and JSI Migration Actually Took
Sponsored
The dispatch
Working notes from
the studio.
A short letter twice a month — what we shipped, what broke, and the AI tools earning their keep.
Discussion
Join the conversation.
Comments are powered by GitHub Discussions. Sign in with your GitHub account to leave a comment.
Sponsored