Cloud & Infrastructure · Data Engineering
dbt in 2026: SQL-First Data Transformation That Actually Scales
dbt turns SQL SELECT statements into a tested, documented, version-controlled data pipeline. Here's how it works and when you should add it to your data stack.
Anurag Verma
7 min read
Sponsored
Most data transformation work starts the same way: someone writes a SQL query, pastes the result into a spreadsheet, and shares it in Slack. The query gets modified. The spreadsheet is saved under a new name. Eventually there are four versions of the query and nobody agrees which one is right.
dbt (data build tool) is the answer to that problem. It takes the SQL you’re already writing and organizes it into a transformation pipeline with dependency management, automated testing, version control, and auto-generated documentation. The same person who writes the SQL can set up the pipeline without learning a new language or framework.
What dbt Actually Does
dbt sits between your raw data sources and the tables that your dashboards and application read from. It does not extract data from sources or load data into your warehouse. That part (Extract and Load) is handled by tools like Fivetran, Airbyte, or custom scripts. dbt’s job is the Transform part of ELT.
The architecture looks like this:
Sources (raw data from your app database, third-party SaaS tools, event streams) → Staging models (clean up and standardize raw source tables) → Intermediate models (business logic, joins) → Mart models (final tables for dashboards and reporting)
Each layer is just a SQL SELECT statement in a .sql file. dbt handles running them in the right order, materializing them as tables or views, and tracking which ones depend on which.
Setting Up dbt Core
pip install dbt-core dbt-postgres # or dbt-bigquery, dbt-snowflake, dbt-duckdb, etc.
dbt init my_project
# Prompts for database type and connection details
cd my_project
The project structure dbt creates:
my_project/
├── dbt_project.yml # project config
├── profiles.yml # database connection (in ~/.dbt/ by default)
├── models/ # your SQL transformation files
│ ├── staging/
│ ├── intermediate/
│ └── marts/
├── seeds/ # CSV files that become tables
└── tests/ # custom test SQL
Configure your connection in ~/.dbt/profiles.yml:
my_project:
target: dev
outputs:
dev:
type: postgres
host: localhost
user: postgres
password: "{{ env_var('DBT_POSTGRES_PASSWORD') }}"
port: 5432
dbname: analytics
schema: dbt_dev
threads: 4
prod:
type: postgres
host: your-prod-host
user: analytics_writer
password: "{{ env_var('DBT_POSTGRES_PASSWORD') }}"
port: 5432
dbname: analytics
schema: analytics
threads: 8
Verify the connection:
dbt debug
Writing Your First Models
A model is a .sql file with a SELECT statement. dbt materializes it as a view or table in your database.
Staging model: cleans up raw source data:
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('app_db', 'orders') }}
),
cleaned AS (
SELECT
id AS order_id,
user_id,
status,
LOWER(currency) AS currency,
amount_cents / 100.0 AS amount,
created_at AS ordered_at,
updated_at
FROM source
WHERE id IS NOT NULL
)
SELECT * FROM cleaned
Intermediate model: joins staging models together:
-- models/intermediate/int_orders_with_users.sql
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
users AS (
SELECT * FROM {{ ref('stg_users') }}
)
SELECT
o.order_id,
o.amount,
o.currency,
o.ordered_at,
o.status,
u.user_id,
u.email,
u.country,
u.created_at AS user_registered_at
FROM orders o
JOIN users u ON o.user_id = u.user_id
Mart model: the final table for dashboards:
-- models/marts/fct_daily_revenue.sql
{{
config(
materialized='incremental',
unique_key='report_date',
on_schema_change='fail'
)
}}
WITH orders AS (
SELECT * FROM {{ ref('int_orders_with_users') }}
WHERE status = 'completed'
{% if is_incremental() %}
AND ordered_at >= (SELECT MAX(report_date) FROM {{ this }})
{% endif %}
)
SELECT
ordered_at::date AS report_date,
currency,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
COUNT(DISTINCT user_id) AS unique_buyers
FROM orders
GROUP BY 1, 2
The {{ ref('stg_orders') }} syntax is the key feature. dbt resolves these references into the actual table or view names and builds the DAG (directed acyclic graph) of dependencies automatically. Run dbt run and it executes models in the correct order.
The source() Function
Sources tell dbt about the raw tables in your database that dbt doesn’t create. You define them in a YAML file:
# models/staging/sources.yml
version: 2
sources:
- name: app_db
schema: public
description: "Raw data from the application Postgres database"
tables:
- name: orders
description: "All orders placed through the application"
columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null
- name: users
description: "Registered users"
This gives dbt visibility into the raw tables and lets you define tests and documentation on source data too. The source() function in your SQL refers to these definitions.
Testing
dbt has a built-in test framework. Tests run against your models after they’re materialized.
Generic tests: applied via YAML config:
# models/marts/schema.yml
version: 2
models:
- name: fct_daily_revenue
description: "Daily revenue aggregated by currency"
columns:
- name: report_date
tests:
- unique
- not_null
- name: revenue
tests:
- not_null
- name: currency
tests:
- accepted_values:
values: ['USD', 'EUR', 'GBP', 'INR']
Singular tests: custom SQL that returns rows when the test fails:
-- tests/assert_revenue_non_negative.sql
SELECT
report_date,
currency,
revenue
FROM {{ ref('fct_daily_revenue') }}
WHERE revenue < 0
If this query returns any rows, the test fails. dbt treats a non-empty result as a failure.
Run tests after your models:
dbt run && dbt test
Or run both together:
dbt build # runs models, seeds, and tests in dependency order
Incremental Models
Full table rebuilds are fine for models with millions of rows. For billions of rows, you need incremental materialization.
The is_incremental() macro lets you write logic that only processes new records when running incrementally:
{{
config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge'
)
}}
SELECT
event_id,
user_id,
event_type,
created_at
FROM {{ source('event_stream', 'events') }}
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
On the first run, dbt creates the table with all data. On subsequent runs, it only processes records newer than the max timestamp already in the table and merges them in. The unique_key ensures records with the same event_id are updated rather than duplicated.
Documentation
dbt generates a documentation site from your YAML descriptions and model definitions. Run:
dbt docs generate
dbt docs serve
This opens a browser with a searchable catalog of all your models, their columns, descriptions, and the lineage graph showing which models depend on which. For teams where multiple people touch the data stack, this pays off quickly.
dbt Core vs dbt Cloud
dbt Core is the open-source CLI tool. dbt Cloud is the managed platform from dbt Labs, which adds:
- A web-based IDE for writing and testing models
- Scheduled job runs
- Continuous integration integration (GitHub, GitLab)
- Environment management (dev, staging, prod)
- Centralized logging
dbt Core is free. dbt Cloud starts at $50/month per developer seat and goes up from there.
For most teams starting out: use dbt Core, run it from CI (GitHub Actions works well), and schedule runs with a simple cron job or your existing orchestration tool. Move to dbt Cloud if you need the web IDE or find the scheduling setup too much to maintain.
# .github/workflows/dbt.yml — minimal dbt CI
name: dbt
on:
push:
paths:
- 'models/**'
jobs:
run:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
- run: pip install dbt-postgres
- run: dbt run --target prod
env:
DBT_POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}
- run: dbt test --target prod
env:
DBT_POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}
The Practical Case for Adding dbt
The clearest signal that dbt is worth adding: you have SQL queries that more than one person needs to trust, and right now you’re not sure which version is correct or when they last ran.
dbt doesn’t make SQL easier to write. It makes SQL easier to maintain, test, and understand across a team. If you’re the only person touching the data pipeline and you’re comfortable tracking it manually, dbt adds overhead. If you’re two or three people working on the same models, or if your pipeline is something clients or stakeholders depend on, the structure dbt provides pays for itself quickly.
Start with one staging model and one mart model. Get dbt build to pass. Add tests for your key metrics. The pattern becomes clear after the first week.
Sponsored
More from this category
More from Cloud & Infrastructure
Cloudflare R2 vs AWS S3 in 2026: The Storage Decision for Developer Teams
Error Tracking in 2026: What Sentry Catches That Your Logs Don't
Vendor Lock-in in 2026: What It Actually Costs and When to Stop Worrying About It
Sponsored
Discussion
Join the conversation.
Comments are powered by GitHub Discussions. Sign in with your GitHub account to leave a comment.
Sponsored