The Complete Guide to DBT (Data Build Tool) File Structure and YAML Configurations
The Complete Guide to DBT (Data Build Tool)
File Structure and YAML Configurations
Introduction
Data Build Tool (DBT) has become an
essential part of modern data transformation workflows, enabling
analysts and engineers to efficiently model, validate, and document data within
cloud data warehouses. A well-structured DBT project consists of various files,
each serving a unique purpose in configuration, execution, and documentation.
One of the key components of DBT is YAML,
a human-readable data serialization format used extensively for configurations,
metadata, testing, and dependencies. In this blog, we will explore all
the essential file types used in a DBT project, focusing on how YAML
structures play a pivotal role.
1. Configuration & Metadata Files (.yml)
DBT uses YAML (.yml) files to
store project settings, database connections, model documentation, tests,
and dependencies.
🔹 dbt_project.yml
(Project Configuration)
- Purpose: Defines core
settings for a DBT project.
- Location: Root directory
of the DBT project.
- Example:
YAML
name: my_project
config-version: 2
version: 1.0.0
profile: my_profile
model-paths:
["models"]
seed-paths:
["seeds"]
test-paths:
["tests"]
- Key
Points:
- Defines
the project name, version, and directory paths.
- Connects
to the profile (profiles.yml).
- Organizes
models, seeds, and tests efficiently.
🔹 profiles.yml
(Database Connections)
- Purpose: Stores database
credentials for different environments.
- Location: Typically found
in ~/.dbt/.
- Example:
YAML
default:
outputs:
dev:
type: postgres
host: my-database-host
user: my-user
password: my-password
schema: analytics
threads: 4
target: dev
- Key
Points:
- Defines
database credentials (Postgres, Snowflake, BigQuery, etc.).
- Manages
multiple environments (e.g., dev, prod).
- Ensures
parallel execution via thread count.
🔹 schema.yml (Model
Documentation & Testing)
- Purpose: Contains metadata
and tests for DBT models.
- Location: Inside the models/
directory.
- Example:
YAML
version: 2
models:
- name: dim_customers
description: "Customer dimension
table"
columns:
- name: customer_id
description: "Unique identifier
for customers"
tests:
- not_null
- unique
- Key
Points:
- Defines
column-level metadata and validation tests.
- Ensures
data integrity with tests like not_null, unique.
- Generates
automated documentation for DBT projects.
🔹 sources.yml
(External Data Sources)
- Purpose: Defines external
datasets referenced in models.
- Location: Inside the models/
directory.
- Example:
YAML
version: 2
sources:
- name: raw_data
schema: staging
tables:
- name: customers
description: "Raw customer data
from CRM"
- Key
Points:
- Helps
document and reference raw tables.
- Improves
data lineage tracking.
- Ensures
consistency across models.
2. SQL Models & Transformations (.sql)
DBT executes SQL queries as part of data
modeling and transformation.
🔹 models/*.sql (SQL
Model Files)
- Purpose: Defines the logic
for data transformation in SQL.
- Location: Inside the models/
directory.
- Example:
SQL
SELECT customer_id,
first_name, last_name
FROM
raw_data.customers
WHERE active = TRUE;
- Key
Points:
- Stores
SQL transformation logic for DBT models.
- Allows
incremental transformations.
🔹 macros/*.sql
(Reusable SQL Functions)
- Purpose: Stores custom
macros that enhance SQL logic.
- Location: Inside the macros/
directory.
- Example:
SQL
{% macro
calculate_discount(price, discount_percent) %}
price * (1 - discount_percent/100)
{% endmacro %}
- Key
Points:
- Allows
SQL automation & reusable functions.
- Simplifies
complex calculations across models.
3. Dependency & Snapshot Management (.yml)
DBT relies on additional YAML files to manage
dependencies and track data history.
🔹 packages.yml
(Managing DBT Dependencies)
- Purpose: Installs DBT packages
and plugins.
- Location: Root directory.
- Example:
YAML
packages:
- package: dbt-utils
version: 0.8.6
- package: dbt-expectations
version: 0.5.1
- Key
Points:
- Installs
DBT packages for additional functionality.
- Extends
DBT with macros and utilities.
🔹 snapshots.yml
(Historical Data Tracking)
- Purpose: Tracks historical
data changes in the warehouse.
- Location: Inside snapshots/.
- Example:
YAML
version: 2
snapshots:
- name: customer_snapshot
description: "Tracks changes in
customer data"
unique_key: customer_id
strategy: timestamp
updated_at: updated_at
- Key
Points:
- Enables
Slowly Changing Dimensions (SCD).
- Audits
data modifications over time.
4. Validation & Static Data (.yml & .csv)
DBT also supports validation tests and static
dataset management.
🔹 tests.yml (Custom
Data Validation)
- Purpose: Defines custom
tests to ensure data quality.
- Location: Inside tests/.
- Example:
YAML
version: 2
tests:
- name: check_positive_values
description: "Ensures all sales
amounts are positive"
- Key
Points:
- Ensures
data integrity and validation.
- Extends
built-in DBT tests.
🔹 seeds/*.csv (Static
Reference Data)
- Purpose: Loads predefined
datasets into DBT projects.
- Location: Inside the seeds/
directory.
- Example: customer_data.csv
- Key
Points:
- Allows
loading CSV files into the warehouse.
- Useful
for reference tables (e.g., country codes, currency mappings).
5. Execution Logs & Debugging (.log)
DBT generates logs for debugging data
transformations and workflow execution.
🔹 logs/dbt.log
(Execution Logs)
- Purpose: Stores execution
logs for debugging DBT runs.
- Location: Inside logs/.
- Key
Points:
- Helps
troubleshoot errors in DBT runs.
- Provides
detailed execution history.
Conclusion
A DBT project consists of multiple
file types, each serving a specific purpose in data transformation,
validation, and automation. YAML plays a critical role in defining
configurations, metadata, dependencies, and tests, ensuring efficient
project setup and execution.
By understanding all these file types,
developers can optimize their DBT workflow, maintain clean data
models, and enforce robust validation processes.
Comments
Post a Comment