dbt Primer #
A bite sized chunk of knowledge for using dbt (data build tool).
Installation #
First up, install dbt and dbt tools from the command line.
pip install dbt
pip install dbt-core
We’ll need to also specific adapters for the type of data platform (BigQuery, Snowflake, Spark, etc.):
Check the documentation for the types of supported data platforms.
Let’s say we’re using BigQuery:
pip install dbt-bigquery
Check version with:
dbt --version
Profile #
The profile defines how to connect a data platform.
This exists in the home directory’s .dbt/
directory in file profiles.yml
. The top line should match the project name (from above).
(As always, check out the documentation for more details).
For a BigQuery implementation, the profile might look something like:
<project name>:
target: test
outputs:
test:
type: bigquery
method: service-account
keyfile: <path>.json
project: <project id>
dataset: dbt_<name>
threads: 1
timeout_seconds: 300
location: US
priority: interactive # or interactive
dev:
...
prod:
...
Where ^ project name would relate to the GCP project name you’ve set up, dataset is the BigQuery dataset name (within the database).
To check that the connection works, run dbt debug
. If everything works, the tests should pass.
Projects #
To create a dbt project:
dbt init <project name>
You’ll be asked to select from a list of choices the data platform you want to use (based on the adapters that you previously installed.)
Once it gets created, there will be a skeleton full of files and directories.
The skeleton usually starts off with the following folders:
-
analysis
-
data
-
macros
- user-defined functions that be be re-used -
models
- core foundation of dbt for data models -
snapshots
-
tests
As well as a dbt_project.yml
file.
dbt_project.yml
- every project needs one. This file dictates how the project works.
This ^ will include a chunk that handles models (see below). The +
is recursive downstream.
models:
<project>:
<models sub-directory>:
+materialized: view # or table
<models sub-sub-directory, assuming there's no + in the parent dir specification>:
materialized: table
Models #
Basically a bunch of SQL files. Each file is a model (table).
models
folder in the project contains all the SQL files. This is truly the heart of dbt.
Common formats int he SQL files are:
with <something> as (
select ...
)
select *
from <something>
Models can reference other models. For instance, if there’s something called model_one.sql
, another sql file could include something like:
select *
from {{ ref(model_one) }}
where id = <some unique identifier>
Staging #
Run #
dbt run
commmand line run executes models in the models
directory.
For a full refresh:
dbt run --full-refresh
For select updates of particular models:
dbt run --select <model name(s)>
To run everything that builds up to a particular target model:
dbt run --seclect +<model_name>
Materialization #
Specification of how something gets created, such as as a table.
Jinja #
Macros galore.
Testing #
Validate that the models are working correctly.
schema.yml
– version: 2
at the top
Beneath that,
models:
- name: <model/table name>
columns:
- name: <name>
tests:
- unique
- not_null
Run tests with:
dbt test
Documentation #
To build dbt documentation:
dbt docs generate
Launch it locally with:
dbt docs serve
There’s a handy graph to show the relationship between the models.
Debug #
Look for errors from within the dbt project directory:
dbt debug
This will raise errors if any are found.