Skip to content

Technology · Data Engineering

DuckDB in Practice: Analytical Queries Without a Data Warehouse

DuckDB runs OLAP queries directly in your process, on files on your laptop, without a server. Here's how to use it and when it beats spinning up BigQuery or Redshift.

Anurag Verma

Anurag Verma

7 min read

DuckDB in Practice: Analytical Queries Without a Data Warehouse

Sponsored

Share

You have a CSV with two million rows. You want to group by a column, filter on a date range, and compute a percentile. You could load it into pandas and write the aggregations. You could spin up a PostgreSQL instance and import it. Or you could open a DuckDB connection, write one SQL statement, and get the result in under a second.

That third option is what DuckDB is for. It’s an embedded OLAP database that runs inside your Python process, your Node.js server, your CLI tool. No separate server. No cloud account. No permission setup. Just SQL on data.

What DuckDB Actually Is

PostgreSQL is built for OLTP: row-by-row transactional reads and writes. That’s the right model for a web application where you’re inserting individual records and querying by primary key.

OLAP workloads look different. You’re scanning millions of rows and aggregating them. Computing averages across all records from last quarter. Joining a 10M-row events table against a 100K-row user table. For these patterns, column-oriented storage wins. Rather than reading all columns for a row, the database reads only the columns you touched.

DuckDB is column-oriented and OLAP-native. It’s also embedded, meaning it runs in the same process as your application rather than as a network service. The two properties together give you analytical query performance without infrastructure.

Version 1.0 shipped in January 2024. By early 2026, it’s on 1.1.x and the API is stable. The Python package alone gets ~7 million downloads per month.

Installation

pip install duckdb

That’s the full install. There’s also a CLI:

# macOS
brew install duckdb

# Linux (check releases page for latest)
curl -LO https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip

Querying Files Without Importing Them

The feature that most people don’t expect: DuckDB can query CSV and Parquet files directly. No import step.

import duckdb

# Query a CSV directly
result = duckdb.sql("""
  SELECT
    category,
    COUNT(*) as total,
    AVG(amount) as avg_amount,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) as p95_amount
  FROM 'sales_data.csv'
  WHERE order_date >= '2025-01-01'
  GROUP BY category
  ORDER BY total DESC
""").df()

print(result)

The .df() call returns a pandas DataFrame. .fetchall() gives you a list of tuples. .arrow() returns a PyArrow table.

DuckDB also handles compressed files and glob patterns:

# Query all Parquet files in a directory
duckdb.sql("SELECT COUNT(*) FROM 'data/logs/*.parquet'").fetchone()

# Query a gzipped CSV
duckdb.sql("SELECT * FROM 'archive.csv.gz' LIMIT 10").fetchall()

This is useful when you have data exports from S3, Snowflake exports, or log archives that you want to query without a full pipeline.

Working With In-Memory Databases

For anything that needs to persist across queries in a script, use a connection:

import duckdb

con = duckdb.connect()  # in-memory, lost when connection closes

# Create a table from a DataFrame
import pandas as pd
df = pd.read_csv("events.csv")
con.execute("CREATE TABLE events AS SELECT * FROM df")

# Now query it repeatedly without re-reading the CSV
con.execute("""
  CREATE VIEW daily_totals AS
  SELECT
    DATE_TRUNC('day', timestamp) as day,
    user_id,
    COUNT(*) as event_count
  FROM events
  GROUP BY 1, 2
""")

result = con.execute("""
  SELECT day, SUM(event_count) as total
  FROM daily_totals
  WHERE day >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY day
  ORDER BY day
""").df()

For persistent storage across process restarts:

# Pass a file path to persist to disk
con = duckdb.connect("analytics.duckdb")

The .duckdb file is self-contained and portable. Copy it, open it from a different machine, query it the same way.

DuckDB vs Pandas for Aggregations

Pandas is excellent for transforming data row by row. It’s slower for aggregations on large datasets because it’s not columnar and doesn’t parallelize by default.

DuckDB uses all available cores automatically. A few comparisons on common operations:

Operationpandas (8-core M2)DuckDB (8-core M2)
GROUP BY on 10M rows~4.2s~0.4s
Window function p95 on 5M rows~6.1s~0.6s
JOIN two 1M-row tables~2.8s~0.3s

These are rough estimates, not a controlled benchmark. The actual difference depends on the operation and your data. For simple column arithmetic and row-by-row transforms, pandas is often more natural. For aggregations, window functions, and joins at scale, DuckDB is reliably faster.

Practical Uses in an Agency or Dev Team Context

1. Analyzing application logs

When a production incident happens and you have gigabytes of JSON logs, DuckDB can read them directly:

con.execute("""
  CREATE TABLE logs AS
  SELECT *
  FROM read_json_auto('logs/*.json')
""")

# Find slow endpoints
con.execute("""
  SELECT
    endpoint,
    COUNT(*) as requests,
    AVG(response_time_ms) as avg_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) as p99_ms
  FROM logs
  WHERE timestamp >= '2026-05-28'
  GROUP BY endpoint
  HAVING avg_ms > 500
  ORDER BY p99_ms DESC
""").df()

2. Querying data exports

A client sends you a CSV of their transaction history. You need to find anomalies, produce a summary, or join it with another export.

# Join two exports without a database server
duckdb.sql("""
  SELECT
    t.transaction_id,
    t.amount,
    c.company_name,
    c.country
  FROM 'transactions.csv' t
  JOIN 'customers.csv' c ON t.customer_id = c.id
  WHERE t.status = 'failed'
    AND t.amount > 1000
  ORDER BY t.amount DESC
  LIMIT 100
""").df()

3. Building a lightweight analytics layer

For internal tools where you don’t need a full data warehouse, DuckDB on a small server with a scheduled job that writes Parquet files is a real alternative to Redshift or BigQuery:

# Append new data daily
con = duckdb.connect("analytics.duckdb")
con.execute(f"""
  INSERT INTO events
  SELECT * FROM 'data/events_{today}.parquet'
""")

4. Data pipeline testing

DuckDB is excellent for testing transformation logic before running it in production:

import duckdb
import pytest

def transform_events(con):
    con.execute("""
        CREATE TABLE output AS
        SELECT
            user_id,
            DATE_TRUNC('day', created_at) as day,
            SUM(value) as daily_total
        FROM events
        GROUP BY user_id, day
    """)

def test_no_nulls_in_output():
    con = duckdb.connect()
    con.execute("CREATE TABLE events AS SELECT * FROM 'fixtures/sample_events.csv'")
    transform_events(con)
    
    null_count = con.execute(
        "SELECT COUNT(*) FROM output WHERE user_id IS NULL OR daily_total IS NULL"
    ).fetchone()[0]
    
    assert null_count == 0, f"Found {null_count} null rows in output"

When Not to Use DuckDB

DuckDB is an analytical database. It’s not designed for:

  • High-concurrency transactional writes (use PostgreSQL, MySQL)
  • Multi-user web application backends (multiple concurrent writers will conflict)
  • Real-time streaming ingestion (DuckDB reads, it doesn’t subscribe)
  • Terabyte-scale data where you need distributed compute

If your data fits on one machine and you’re doing analytics, DuckDB is almost always the right tool. If you’re building a web app with many concurrent users writing data, it’s the wrong tool.

DuckDB also supports a local server mode (via the ODBC driver or via MotherDuck, the managed cloud product), but the embedded single-process use case is where it shines.

The MotherDuck Option

MotherDuck is a managed cloud extension of DuckDB. You connect from your local DuckDB instance to MotherDuck and it handles storage and larger-than-RAM datasets:

con = duckdb.connect("md:?token=your_token")
con.execute("USE my_database")

If your team wants shared analytics without the overhead of Snowflake or Redshift, MotherDuck is worth a look. It uses DuckDB’s SQL syntax, so everything you learn locally transfers directly.

The Short Version

DuckDB is what you reach for when you need SQL on data and don’t want to set up a server. It’s fast, it requires no infrastructure, and it handles CSV, Parquet, and JSON natively. The Python API is clean enough to replace a lot of pandas aggregation code with a single SQL statement. If you haven’t tried it, install it this afternoon.

Sponsored

Enjoyed it? Pass it on.

Share this article.

Sponsored

The dispatch

Working notes from
the studio.

A short letter twice a month — what we shipped, what broke, and the AI tools earning their keep.

No spam, ever. Unsubscribe anytime.

Discussion

Join the conversation.

Comments are powered by GitHub Discussions. Sign in with your GitHub account to leave a comment.

Sponsored