Skip to content

Cloud & Infrastructure · Data Engineering

dbt in 2026: SQL-First Data Transformation That Actually Scales

dbt turns SQL SELECT statements into a tested, documented, version-controlled data pipeline. Here's how it works and when you should add it to your data stack.

Anurag Verma

Anurag Verma

7 min read

dbt in 2026: SQL-First Data Transformation That Actually Scales

Sponsored

Share

Most data transformation work starts the same way: someone writes a SQL query, pastes the result into a spreadsheet, and shares it in Slack. The query gets modified. The spreadsheet is saved under a new name. Eventually there are four versions of the query and nobody agrees which one is right.

dbt (data build tool) is the answer to that problem. It takes the SQL you’re already writing and organizes it into a transformation pipeline with dependency management, automated testing, version control, and auto-generated documentation. The same person who writes the SQL can set up the pipeline without learning a new language or framework.

What dbt Actually Does

dbt sits between your raw data sources and the tables that your dashboards and application read from. It does not extract data from sources or load data into your warehouse. That part (Extract and Load) is handled by tools like Fivetran, Airbyte, or custom scripts. dbt’s job is the Transform part of ELT.

The architecture looks like this:

Sources (raw data from your app database, third-party SaaS tools, event streams) → Staging models (clean up and standardize raw source tables) → Intermediate models (business logic, joins) → Mart models (final tables for dashboards and reporting)

Each layer is just a SQL SELECT statement in a .sql file. dbt handles running them in the right order, materializing them as tables or views, and tracking which ones depend on which.

Setting Up dbt Core

pip install dbt-core dbt-postgres  # or dbt-bigquery, dbt-snowflake, dbt-duckdb, etc.

dbt init my_project
# Prompts for database type and connection details

cd my_project

The project structure dbt creates:

my_project/
├── dbt_project.yml      # project config
├── profiles.yml         # database connection (in ~/.dbt/ by default)
├── models/              # your SQL transformation files
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── seeds/               # CSV files that become tables
└── tests/               # custom test SQL

Configure your connection in ~/.dbt/profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: postgres
      password: "{{ env_var('DBT_POSTGRES_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres
      host: your-prod-host
      user: analytics_writer
      password: "{{ env_var('DBT_POSTGRES_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: analytics
      threads: 8

Verify the connection:

dbt debug

Writing Your First Models

A model is a .sql file with a SELECT statement. dbt materializes it as a view or table in your database.

Staging model: cleans up raw source data:

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('app_db', 'orders') }}
),

cleaned AS (
    SELECT
        id                                  AS order_id,
        user_id,
        status,
        LOWER(currency)                     AS currency,
        amount_cents / 100.0                AS amount,
        created_at                          AS ordered_at,
        updated_at
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM cleaned

Intermediate model: joins staging models together:

-- models/intermediate/int_orders_with_users.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

users AS (
    SELECT * FROM {{ ref('stg_users') }}
)

SELECT
    o.order_id,
    o.amount,
    o.currency,
    o.ordered_at,
    o.status,
    u.user_id,
    u.email,
    u.country,
    u.created_at AS user_registered_at
FROM orders o
JOIN users u ON o.user_id = u.user_id

Mart model: the final table for dashboards:

-- models/marts/fct_daily_revenue.sql
{{
  config(
    materialized='incremental',
    unique_key='report_date',
    on_schema_change='fail'
  )
}}

WITH orders AS (
    SELECT * FROM {{ ref('int_orders_with_users') }}
    WHERE status = 'completed'
    {% if is_incremental() %}
      AND ordered_at >= (SELECT MAX(report_date) FROM {{ this }})
    {% endif %}
)

SELECT
    ordered_at::date        AS report_date,
    currency,
    COUNT(*)                AS order_count,
    SUM(amount)             AS revenue,
    COUNT(DISTINCT user_id) AS unique_buyers
FROM orders
GROUP BY 1, 2

The {{ ref('stg_orders') }} syntax is the key feature. dbt resolves these references into the actual table or view names and builds the DAG (directed acyclic graph) of dependencies automatically. Run dbt run and it executes models in the correct order.

The source() Function

Sources tell dbt about the raw tables in your database that dbt doesn’t create. You define them in a YAML file:

# models/staging/sources.yml
version: 2

sources:
  - name: app_db
    schema: public
    description: "Raw data from the application Postgres database"
    tables:
      - name: orders
        description: "All orders placed through the application"
        columns:
          - name: id
            description: "Primary key"
            tests:
              - unique
              - not_null
      - name: users
        description: "Registered users"

This gives dbt visibility into the raw tables and lets you define tests and documentation on source data too. The source() function in your SQL refers to these definitions.

Testing

dbt has a built-in test framework. Tests run against your models after they’re materialized.

Generic tests: applied via YAML config:

# models/marts/schema.yml
version: 2

models:
  - name: fct_daily_revenue
    description: "Daily revenue aggregated by currency"
    columns:
      - name: report_date
        tests:
          - unique
          - not_null
      - name: revenue
        tests:
          - not_null
      - name: currency
        tests:
          - accepted_values:
              values: ['USD', 'EUR', 'GBP', 'INR']

Singular tests: custom SQL that returns rows when the test fails:

-- tests/assert_revenue_non_negative.sql
SELECT
    report_date,
    currency,
    revenue
FROM {{ ref('fct_daily_revenue') }}
WHERE revenue < 0

If this query returns any rows, the test fails. dbt treats a non-empty result as a failure.

Run tests after your models:

dbt run && dbt test

Or run both together:

dbt build  # runs models, seeds, and tests in dependency order

Incremental Models

Full table rebuilds are fine for models with millions of rows. For billions of rows, you need incremental materialization.

The is_incremental() macro lets you write logic that only processes new records when running incrementally:

{{
  config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
  )
}}

SELECT
    event_id,
    user_id,
    event_type,
    created_at
FROM {{ source('event_stream', 'events') }}

{% if is_incremental() %}
  WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

On the first run, dbt creates the table with all data. On subsequent runs, it only processes records newer than the max timestamp already in the table and merges them in. The unique_key ensures records with the same event_id are updated rather than duplicated.

Documentation

dbt generates a documentation site from your YAML descriptions and model definitions. Run:

dbt docs generate
dbt docs serve

This opens a browser with a searchable catalog of all your models, their columns, descriptions, and the lineage graph showing which models depend on which. For teams where multiple people touch the data stack, this pays off quickly.

dbt Core vs dbt Cloud

dbt Core is the open-source CLI tool. dbt Cloud is the managed platform from dbt Labs, which adds:

  • A web-based IDE for writing and testing models
  • Scheduled job runs
  • Continuous integration integration (GitHub, GitLab)
  • Environment management (dev, staging, prod)
  • Centralized logging

dbt Core is free. dbt Cloud starts at $50/month per developer seat and goes up from there.

For most teams starting out: use dbt Core, run it from CI (GitHub Actions works well), and schedule runs with a simple cron job or your existing orchestration tool. Move to dbt Cloud if you need the web IDE or find the scheduling setup too much to maintain.

# .github/workflows/dbt.yml — minimal dbt CI
name: dbt

on:
  push:
    paths:
      - 'models/**'

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install dbt-postgres
      - run: dbt run --target prod
        env:
          DBT_POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}
      - run: dbt test --target prod
        env:
          DBT_POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}

The Practical Case for Adding dbt

The clearest signal that dbt is worth adding: you have SQL queries that more than one person needs to trust, and right now you’re not sure which version is correct or when they last ran.

dbt doesn’t make SQL easier to write. It makes SQL easier to maintain, test, and understand across a team. If you’re the only person touching the data pipeline and you’re comfortable tracking it manually, dbt adds overhead. If you’re two or three people working on the same models, or if your pipeline is something clients or stakeholders depend on, the structure dbt provides pays for itself quickly.

Start with one staging model and one mart model. Get dbt build to pass. Add tests for your key metrics. The pattern becomes clear after the first week.

Sponsored

Sponsored

Discussion

Join the conversation.

Comments are powered by GitHub Discussions. Sign in with your GitHub account to leave a comment.

Sponsored