The Complete Guide to DBT (Data Build Tool) File Structure and YAML Configurations

The Complete Guide to DBT (Data Build Tool) File Structure and YAML Configurations

Introduction

Data Build Tool (DBT) has become an essential part of modern data transformation workflows, enabling analysts and engineers to efficiently model, validate, and document data within cloud data warehouses. A well-structured DBT project consists of various files, each serving a unique purpose in configuration, execution, and documentation.

One of the key components of DBT is YAML, a human-readable data serialization format used extensively for configurations, metadata, testing, and dependencies. In this blog, we will explore all the essential file types used in a DBT project, focusing on how YAML structures play a pivotal role.

1. Configuration & Metadata Files (.yml)

DBT uses YAML (.yml) files to store project settings, database connections, model documentation, tests, and dependencies.

🔹 dbt_project.yml (Project Configuration)

  • Purpose: Defines core settings for a DBT project.
  • Location: Root directory of the DBT project.
  • Example:

YAML

name: my_project

config-version: 2

version: 1.0.0

profile: my_profile

model-paths: ["models"]

seed-paths: ["seeds"]

test-paths: ["tests"]

  • Key Points:
    • Defines the project name, version, and directory paths.
    • Connects to the profile (profiles.yml).
    • Organizes models, seeds, and tests efficiently.

🔹 profiles.yml (Database Connections)

  • Purpose: Stores database credentials for different environments.
  • Location: Typically found in ~/.dbt/.
  • Example:

YAML

default:

  outputs:

    dev:

      type: postgres

      host: my-database-host

      user: my-user

      password: my-password

      schema: analytics

      threads: 4

  target: dev

  • Key Points:
    • Defines database credentials (Postgres, Snowflake, BigQuery, etc.).
    • Manages multiple environments (e.g., dev, prod).
    • Ensures parallel execution via thread count.

🔹 schema.yml (Model Documentation & Testing)

  • Purpose: Contains metadata and tests for DBT models.
  • Location: Inside the models/ directory.
  • Example:

YAML

version: 2

models:

  - name: dim_customers

    description: "Customer dimension table"

    columns:

      - name: customer_id

        description: "Unique identifier for customers"

        tests:

          - not_null

          - unique

  • Key Points:
    • Defines column-level metadata and validation tests.
    • Ensures data integrity with tests like not_null, unique.
    • Generates automated documentation for DBT projects.

🔹 sources.yml (External Data Sources)

  • Purpose: Defines external datasets referenced in models.
  • Location: Inside the models/ directory.
  • Example:

YAML

version: 2

sources:

  - name: raw_data

    schema: staging

    tables:

      - name: customers

        description: "Raw customer data from CRM"

  • Key Points:
    • Helps document and reference raw tables.
    • Improves data lineage tracking.
    • Ensures consistency across models.

2. SQL Models & Transformations (.sql)

DBT executes SQL queries as part of data modeling and transformation.

🔹 models/*.sql (SQL Model Files)

  • Purpose: Defines the logic for data transformation in SQL.
  • Location: Inside the models/ directory.
  • Example:

SQL

SELECT customer_id, first_name, last_name

FROM raw_data.customers

WHERE active = TRUE;

  • Key Points:
    • Stores SQL transformation logic for DBT models.
    • Allows incremental transformations.

🔹 macros/*.sql (Reusable SQL Functions)

  • Purpose: Stores custom macros that enhance SQL logic.
  • Location: Inside the macros/ directory.
  • Example:

SQL

{% macro calculate_discount(price, discount_percent) %}

  price * (1 - discount_percent/100)

{% endmacro %}

  • Key Points:
    • Allows SQL automation & reusable functions.
    • Simplifies complex calculations across models.

3. Dependency & Snapshot Management (.yml)

DBT relies on additional YAML files to manage dependencies and track data history.

🔹 packages.yml (Managing DBT Dependencies)

  • Purpose: Installs DBT packages and plugins.
  • Location: Root directory.
  • Example:

YAML

packages:

  - package: dbt-utils

    version: 0.8.6

  - package: dbt-expectations

    version: 0.5.1

  • Key Points:
    • Installs DBT packages for additional functionality.
    • Extends DBT with macros and utilities.

🔹 snapshots.yml (Historical Data Tracking)

  • Purpose: Tracks historical data changes in the warehouse.
  • Location: Inside snapshots/.
  • Example:

YAML

version: 2

snapshots:

  - name: customer_snapshot

    description: "Tracks changes in customer data"

    unique_key: customer_id

    strategy: timestamp

    updated_at: updated_at

  • Key Points:
    • Enables Slowly Changing Dimensions (SCD).
    • Audits data modifications over time.

4. Validation & Static Data (.yml & .csv)

DBT also supports validation tests and static dataset management.

🔹 tests.yml (Custom Data Validation)

  • Purpose: Defines custom tests to ensure data quality.
  • Location: Inside tests/.
  • Example:

YAML

version: 2

tests:

  - name: check_positive_values

    description: "Ensures all sales amounts are positive"

  • Key Points:
    • Ensures data integrity and validation.
    • Extends built-in DBT tests.

🔹 seeds/*.csv (Static Reference Data)

  • Purpose: Loads predefined datasets into DBT projects.
  • Location: Inside the seeds/ directory.
  • Example: customer_data.csv
  • Key Points:
    • Allows loading CSV files into the warehouse.
    • Useful for reference tables (e.g., country codes, currency mappings).

5. Execution Logs & Debugging (.log)

DBT generates logs for debugging data transformations and workflow execution.

🔹 logs/dbt.log (Execution Logs)

  • Purpose: Stores execution logs for debugging DBT runs.
  • Location: Inside logs/.
  • Key Points:
    • Helps troubleshoot errors in DBT runs.
    • Provides detailed execution history.

Conclusion

A DBT project consists of multiple file types, each serving a specific purpose in data transformation, validation, and automation. YAML plays a critical role in defining configurations, metadata, dependencies, and tests, ensuring efficient project setup and execution.

By understanding all these file types, developers can optimize their DBT workflow, maintain clean data models, and enforce robust validation processes.

 

Comments

Popular posts from this blog

Connecting DBT to Snowflake

Edge Computing and Edge Databases - Powering the Future of Decentralized Data