The Complete Guide to DBT (Data Build Tool) File Structure and YAML Configurations

The Complete Guide to DBT (Data Build Tool) File Structure and YAML Configurations

Introduction

Data Build Tool (DBT) has become an essential part of modern data transformation workflows, enabling analysts and engineers to efficiently model, validate, and document data within cloud data warehouses. A well-structured DBT project consists of various files, each serving a unique purpose in configuration, execution, and documentation.

One of the key components of DBT is YAML, a human-readable data serialization format used extensively for configurations, metadata, testing, and dependencies. In this blog, we will explore all the essential file types used in a DBT project, focusing on how YAML structures play a pivotal role.

1. Configuration & Metadata Files (.yml)

DBT uses YAML (.yml) files to store project settings, database connections, model documentation, tests, and dependencies.

🔹 dbt_project.yml (Project Configuration)

Purpose: Defines core settings for a DBT project.
Location: Root directory of the DBT project.
Example:

YAML

name: my_project

config-version: 2

version: 1.0.0

profile: my_profile

model-paths: ["models"]

seed-paths: ["seeds"]

test-paths: ["tests"]

Key Points:

Defines the project name, version, and directory paths.
Connects to the profile (profiles.yml).
Organizes models, seeds, and tests efficiently.

🔹 profiles.yml (Database Connections)

Purpose: Stores database credentials for different environments.
Location: Typically found in ~/.dbt/.
Example:

YAML

default:

outputs:

dev:

type: postgres

host: my-database-host

user: my-user

password: my-password

schema: analytics

threads: 4

target: dev

Key Points:

Defines database credentials (Postgres, Snowflake, BigQuery, etc.).
Manages multiple environments (e.g., dev, prod).
Ensures parallel execution via thread count.

🔹 schema.yml (Model Documentation & Testing)

Purpose: Contains metadata and tests for DBT models.
Location: Inside the models/ directory.
Example:

YAML

version: 2

models:

- name: dim_customers

description: "Customer dimension table"

columns:

- name: customer_id

description: "Unique identifier for customers"

tests:

- not_null

- unique

Key Points:

Defines column-level metadata and validation tests.
Ensures data integrity with tests like not_null, unique.
Generates automated documentation for DBT projects.

🔹 sources.yml (External Data Sources)

Purpose: Defines external datasets referenced in models.
Location: Inside the models/ directory.
Example:

YAML

version: 2

sources:

- name: raw_data

schema: staging

tables:

- name: customers

description: "Raw customer data from CRM"

Key Points:

Helps document and reference raw tables.
Improves data lineage tracking.
Ensures consistency across models.

2. SQL Models & Transformations (.sql)

DBT executes SQL queries as part of data modeling and transformation.

🔹 models/*.sql (SQL Model Files)

Purpose: Defines the logic for data transformation in SQL.
Location: Inside the models/ directory.
Example:

SQL

SELECT customer_id, first_name, last_name

FROM raw_data.customers

WHERE active = TRUE;

Key Points:

Stores SQL transformation logic for DBT models.
Allows incremental transformations.

🔹 macros/*.sql (Reusable SQL Functions)

Purpose: Stores custom macros that enhance SQL logic.
Location: Inside the macros/ directory.
Example:

SQL

{% macro calculate_discount(price, discount_percent) %}

price * (1 - discount_percent/100)

{% endmacro %}

Key Points:

Allows SQL automation & reusable functions.
Simplifies complex calculations across models.

3. Dependency & Snapshot Management (.yml)

DBT relies on additional YAML files to manage dependencies and track data history.

🔹 packages.yml (Managing DBT Dependencies)

Purpose: Installs DBT packages and plugins.
Location: Root directory.
Example:

YAML

packages:

- package: dbt-utils

version: 0.8.6

- package: dbt-expectations

version: 0.5.1

Key Points:

Installs DBT packages for additional functionality.
Extends DBT with macros and utilities.

🔹 snapshots.yml (Historical Data Tracking)

Purpose: Tracks historical data changes in the warehouse.
Location: Inside snapshots/.
Example:

YAML

version: 2

snapshots:

- name: customer_snapshot

description: "Tracks changes in customer data"

unique_key: customer_id

strategy: timestamp

updated_at: updated_at

Key Points:

Enables Slowly Changing Dimensions (SCD).
Audits data modifications over time.

4. Validation & Static Data (.yml & .csv)

DBT also supports validation tests and static dataset management.

🔹 tests.yml (Custom Data Validation)

Purpose: Defines custom tests to ensure data quality.
Location: Inside tests/.
Example:

YAML

version: 2

tests:

- name: check_positive_values

description: "Ensures all sales amounts are positive"

Key Points:

Ensures data integrity and validation.
Extends built-in DBT tests.

🔹 seeds/*.csv (Static Reference Data)

Purpose: Loads predefined datasets into DBT projects.
Location: Inside the seeds/ directory.
Example: customer_data.csv
Key Points:

Allows loading CSV files into the warehouse.
Useful for reference tables (e.g., country codes, currency mappings).

5. Execution Logs & Debugging (.log)

DBT generates logs for debugging data transformations and workflow execution.

🔹 logs/dbt.log (Execution Logs)

Purpose: Stores execution logs for debugging DBT runs.
Location: Inside logs/.
Key Points:

Helps troubleshoot errors in DBT runs.
Provides detailed execution history.

Conclusion

A DBT project consists of multiple file types, each serving a specific purpose in data transformation, validation, and automation. YAML plays a critical role in defining configurations, metadata, dependencies, and tests, ensuring efficient project setup and execution.

By understanding all these file types, developers can optimize their DBT workflow, maintain clean data models, and enforce robust validation processes.

Search This Blog

Dataverse_Chronicles

The Complete Guide to DBT (Data Build Tool) File Structure and YAML Configurations

Comments

Post a Comment

Popular posts from this blog

Getting Started with DBT Core

Connecting DBT to Snowflake