Cloud & Infrastructure · Databases
ClickHouse in 2026: Analytical Queries on Billions of Rows Without the Pain
ClickHouse is a columnar database designed for analytical workloads. It answers queries over billions of rows in seconds that would take minutes in Postgres. Here's what application developers need to know.
Anurag Verma
7 min read
Sponsored
Most web applications run on Postgres or MySQL, and those databases handle transactional workloads well. User registration, order creation, session tracking: these are row-oriented operations, and row-oriented databases handle them efficiently.
Analytical queries are a different problem. “How many users from Germany made a purchase in the last 90 days, broken down by referral source?” requires scanning millions of rows to aggregate a handful of columns. Postgres will answer that query, but as your data grows past a few hundred million rows, it starts taking minutes. ClickHouse is built specifically for this class of query, and it answers them in seconds or less. Not by being clever about indexes, but by storing and processing data in a fundamentally different way.
How Columnar Storage Changes the Performance Equation
Row-oriented databases store each row’s data contiguously on disk. When you query one column from a million rows, the database reads all the row data and discards what it doesn’t need.
Columnar databases store each column’s data contiguously. A query that reads three columns out of a hundred scans only three columns’ worth of data. For analytical queries that aggregate a small number of columns across a large number of rows, this is the difference between reading 2% of your data and reading 100% of it.
ClickHouse adds compression on top of columnar storage. Because a single column contains data of the same type (all integers, all timestamps, all strings with similar patterns), it compresses extremely well. A table that takes 500 GB in Postgres might compress to 50 GB in ClickHouse, which means the same query reads far less data from disk.
The result: ClickHouse benchmarks consistently show it handling queries over billions of rows in under a second. The ClickHouse team publishes a public benchmark at benchmark.clickhouse.com that compares it against other analytical databases with real hardware and reproducible methodology.
When to Consider ClickHouse
ClickHouse is worth evaluating when any of these are true:
- Your analytics queries take more than a few seconds and the data volume keeps growing
- You’re running nightly batch jobs to summarize data because live queries are too slow
- You’ve added Postgres read replicas or materialized views specifically to handle analytical traffic
- You want to offer real-time dashboards to clients or users and Postgres is the bottleneck
- You’re storing events, logs, or time-series data and querying it gets slower every month
ClickHouse is not a general-purpose database. It’s not a replacement for Postgres. You still use Postgres for your application’s transactional data. ClickHouse is where you send data that needs to be queried analytically at scale.
Setting Up ClickHouse
The fastest way to get a ClickHouse instance is ClickHouse Cloud, which offers a free trial. For local development, Docker works:
docker run -d \
--name clickhouse \
-p 8123:8123 \
-p 9000:9000 \
clickhouse/clickhouse-server:latest
Port 8123 is the HTTP interface. Port 9000 is the native binary protocol. You can query via HTTP immediately:
curl http://localhost:8123 \
--data-binary "SELECT version()"
# 24.x.x.x
Or using the CLI client:
docker exec -it clickhouse clickhouse-client
The Table Engine Matters
ClickHouse has multiple table engines. The one you’ll use for almost everything is MergeTree and its variants.
CREATE TABLE events
(
event_id UUID,
user_id UInt64,
event_type LowCardinality(String),
properties String,
country LowCardinality(String),
created_at DateTime64(3) -- millisecond precision
)
ENGINE = MergeTree()
ORDER BY (created_at, user_id)
PARTITION BY toYYYYMM(created_at);
A few things worth noting here:
ORDER BY defines the physical sort order on disk. ClickHouse uses this to skip data during queries. If you frequently filter by event_type, adding it to the ORDER BY (or creating a secondary index) speeds up those queries significantly.
LowCardinality(String) is for columns with limited unique values (under ~10,000). It uses dictionary encoding internally, which improves compression and query speed.
PARTITION BY splits the table into monthly partitions. Queries that filter by month only touch the relevant partition, not the entire table.
Practical Query Examples
Aggregate queries are where ClickHouse shines:
-- Daily active users over the last 30 days
SELECT
toDate(created_at) AS day,
uniq(user_id) AS dau
FROM events
WHERE created_at >= now() - INTERVAL 30 DAY
AND event_type = 'session_start'
GROUP BY day
ORDER BY day;
-- Funnel: users who completed each step in order within 24 hours
SELECT
countIf(has(steps, 'signup')) AS step_1,
countIf(has(steps, 'onboarding_done')) AS step_2,
countIf(has(steps, 'first_purchase')) AS step_3
FROM (
SELECT
user_id,
groupArray(event_type) AS steps
FROM events
WHERE created_at >= today() - 30
GROUP BY user_id
);
-- Cohort retention: week-0 vs week-1 retention
SELECT
cohort_week,
countIf(activity_week = cohort_week) AS week_0,
countIf(activity_week = cohort_week + 1) AS week_1,
round(week_1 / week_0 * 100, 1) AS retention_pct
FROM (
SELECT
user_id,
toMonday(min(created_at)) AS cohort_week
FROM events
WHERE event_type = 'signup'
GROUP BY user_id
) cohorts
JOIN (
SELECT user_id, toMonday(created_at) AS activity_week
FROM events
GROUP BY user_id, activity_week
) activity USING (user_id)
GROUP BY cohort_week
ORDER BY cohort_week;
These queries run on billions of rows in ClickHouse in seconds. The syntax is close enough to standard SQL that most developers don’t need to learn much before writing useful queries.
Inserting Data
ClickHouse is optimized for bulk inserts, not single-row inserts. The recommended pattern is to batch inserts.
From your application, buffer events and flush in batches:
from clickhouse_driver import Client
from datetime import datetime
import uuid
client = Client(host='localhost')
# Buffer and flush
event_buffer = []
def record_event(user_id, event_type, properties, country):
event_buffer.append({
'event_id': str(uuid.uuid4()),
'user_id': user_id,
'event_type': event_type,
'properties': json.dumps(properties),
'country': country,
'created_at': datetime.utcnow(),
})
if len(event_buffer) >= 1000:
flush_events()
def flush_events():
if not event_buffer:
return
client.execute(
'INSERT INTO events VALUES',
event_buffer,
types_check=True
)
event_buffer.clear()
For higher throughput, use a queue (Kafka, SQS) and write a consumer that inserts to ClickHouse in batches. This decouples your application from the database and makes the insertion pipeline more resilient.
Directly from object storage, ClickHouse can read files from S3 or GCS:
INSERT INTO events
SELECT *
FROM s3(
's3://your-bucket/events/2026-06-07/*.parquet',
'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY',
'Parquet'
);
This is useful for backfilling historical data or loading from a data lake.
Replicating from Postgres
The most common architecture: Postgres handles your OLTP workload, ClickHouse handles analytics. You keep them in sync via change data capture (CDC).
PeerDB is an open-source CDC tool built specifically for Postgres-to-ClickHouse replication. It sets up in about 15 minutes:
docker compose -f peerdb-compose.yaml up -d
# Then in the PeerDB UI or API:
# 1. Create a Postgres source peer
# 2. Create a ClickHouse destination peer
# 3. Create a mirror for the tables you want replicated
Debezium is another option with broader database support if you need to replicate from other sources too.
With CDC in place, writes to Postgres appear in ClickHouse within seconds. Your application keeps writing to Postgres normally. Your analytics dashboards query ClickHouse.
ClickHouse Cloud vs Self-Hosted
| ClickHouse Cloud | Self-hosted | |
|---|---|---|
| Setup time | Minutes | Hours to days |
| Scaling | Automatic | Manual |
| Cost at low scale | $50-200/month for small clusters | Cheaper if you’re already running servers |
| Cost at high scale | Scales with usage | Fixed unless you over-provision |
| Operational burden | Low | High |
| Data residency | Choose region | Full control |
For most product teams evaluating ClickHouse, start with ClickHouse Cloud. If the queries and architecture work, you have real data for a self-hosting decision later.
What ClickHouse Is Not Good For
- High-frequency single-row reads. Looking up a user by ID is slower in ClickHouse than Postgres. It’s not a key-value store.
- Frequent updates and deletes. ClickHouse supports mutations, but they’re expensive. It’s designed for append-heavy workloads.
- ACID transactions across tables. ClickHouse is not a transactional database. Use Postgres for that.
- Small datasets. If your analytics tables have millions of rows rather than billions, Postgres with a good query plan is fast enough and simpler to operate.
The decision point is roughly: if your slowest analytics queries are taking 5-10 seconds or more and the data is still growing, ClickHouse is worth the setup cost. Below that threshold, Postgres with better indexes or a materialized view is usually the simpler answer.
Sponsored
More from this category
More from Cloud & Infrastructure
Cloudflare R2 vs AWS S3 in 2026: The Storage Decision for Developer Teams
Error Tracking in 2026: What Sentry Catches That Your Logs Don't
Vendor Lock-in in 2026: What It Actually Costs and When to Stop Worrying About It
Sponsored
Discussion
Join the conversation.
Comments are powered by GitHub Discussions. Sign in with your GitHub account to leave a comment.
Sponsored