dbt Primer #

A bite-sized chunk of knowledge for using dbt (data build tool).

Installation #

First up, install dbt Core from the command line:

pip install dbt-core

We’ll also need to install a specific adapter for the type of data platform (BigQuery, Snowflake, Spark, etc.).

Check the documentation for the types of supported data platforms.

Let’s say we’re using BigQuery:

pip install dbt-bigquery

Check version with:

dbt --version

Profile #

The profile defines how to connect to a data platform.

This lives in the home directory’s .dbt/ directory, in a file called profiles.yml. The top-level key should match the profile name set in the project’s dbt_project.yml (which defaults to the project name).
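For instance, with a hypothetical project called my_project, the two files line up like this:

# dbt_project.yml
name: my_project
profile: my_project # must match the top-level key in ~/.dbt/profiles.yml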

(As always, check out the documentation for more details).

For a BigQuery implementation, the profile might look something like:

<project name>:
    target: test

    outputs:
        test:
            type: bigquery
            method: service-account
            keyfile: <path>.json
            project: <project id>
            dataset: dbt_<name>
            threads: 1
            timeout_seconds: 300
            location: US
            priority: interactive # or batch

        dev:
            ...

        prod:
            ...

Here, project is the GCP project ID you’ve set up, and dataset is the BigQuery dataset (within that project) that dbt will build into.

To check that the connection works, run dbt debug. If everything works, the tests should pass.

Projects #

To create a dbt project:

dbt init <project name>

You’ll be asked to select the data platform you want to use from a list of choices (based on the adapters you previously installed).

Once the project is created, you’ll get a skeleton of files and directories.

The skeleton usually starts off with the following folders:

  • analysis

  • data

  • macros - user-defined functions that can be re-used

  • models - core foundation of dbt for data models

  • snapshots

  • tests

As well as a dbt_project.yml file.
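Put together, the freshly-initialized project looks roughly like this (exact contents vary by dbt version):

<project name>/
    dbt_project.yml
    analysis/
    data/
    macros/
    models/
    snapshots/
    tests/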

dbt_project.yml - every project needs one. This file dictates how the project works.

This file will include a chunk that handles models (see below). A config prefixed with + applies recursively to everything downstream of that directory.

models:
    <project>:
        <models sub-directory>:
            +materialized: view # or table
            <models sub-sub-directory, assuming there's no + in the parent dir specification>:
                materialized: table

Models #

Basically a bunch of SQL files. Each file is a model (a table or view).

The models folder in the project contains all the SQL files. This is truly the heart of dbt.

A common format in the SQL files is:

with <something> as (
    select ...
)

select *
from <something>

Models can reference other models. For instance, if there’s a model called model_one.sql, another SQL file could include something like:

select *
from {{ ref('model_one') }}
where id = <some unique identifier>

Staging #

Run #

The dbt run command executes the models in the models directory.

For a full refresh:

dbt run --full-refresh

For select updates of particular models:

dbt run --select <model name(s)>

To run everything that builds up to a particular target model:

dbt run --select +<model_name>

Materialization #

Specification of how a model gets created in the data platform, such as a table or a view.
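The default is a view. Materialization can be set in dbt_project.yml (as above) or per model, with a config block at the top of the SQL file; a minimal sketch:

{{ config(materialized='table') }}

select *
from {{ ref('model_one') }}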

Jinja #

Macros galore.
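Macros are Jinja functions defined in the macros/ directory and re-used across models. A minimal sketch (the macro and column names here are made up):

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    {{ column_name }} / 100
{% endmacro %}

Which can then be called from any model:

select {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('model_one') }}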

Testing #

Validate that the models are working correctly.

These go in a schema.yml file, with version: 2 at the top.

Beneath that,

models:
    - name: <model/table name>
      columns:
          - name: <name>
            tests:
                - unique
                - not_null

Run tests with:

dbt test

Documentation #

To build dbt documentation:

dbt docs generate

Launch it locally with:

dbt docs serve

There’s a handy graph to show the relationship between the models.

Debug #

Look for errors from within the dbt project directory:

dbt debug

This will raise errors if any are found.