Inside a DBT Project: Understanding Its File and Folder Structure
Introduction
In the world of modern data engineering, DBT (Data Build
Tool) has emerged as a transformative solution for managing data
transformations in a scalable, modular, and version-controlled way. Unlike
traditional ETL tools that often rely on opaque workflows and proprietary
interfaces, DBT embraces the principles of software engineering—treating data
transformations as code.
One of the key reasons DBT is so effective is its well-defined
project structure. Every DBT project is organized into a set of folders and
configuration files that serve specific purposes. This structure not only
promotes clarity and collaboration but also enables automation, testing,
documentation, and deployment.
In this blog, we’ll take a deep dive into the anatomy of a
DBT project, exploring the purpose and importance of each folder—such as models,
seeds, snapshots, and macros—and how they work together to create a robust data
transformation pipeline.
Why Project Structure Matters in DBT
Before diving into the folders themselves, it’s important to
understand why DBT’s structure is so valuable:
- Modularity: Each transformation is isolated in its own file, making it easier to manage, test, and reuse.
- Transparency: The structure makes it easy for new team members to understand the flow of data and dependencies.
- Version control: With Git integration, every change is tracked, reviewed, and documented.
- Automation: The structure supports CI/CD workflows, enabling automated testing and deployment.
- Documentation: DBT can auto-generate documentation based on the structure and metadata defined in the project.
Now, let’s explore the key folders that make up a DBT
project.
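For orientation, here is the skeleton of a typical DBT project. The folder names follow DBT's defaults; your dbt_project.yml can remap any of these paths:

```text
my_dbt_project/
├── dbt_project.yml     # project name, paths, and model configs
├── models/
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── seeds/
├── snapshots/
├── macros/
├── tests/
├── analysis/           # named analyses/ in recent DBT versions
└── docs/               # optional, for custom documentation
```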
1. The models/ Folder: The Heart of Your Transformations
The models folder is where the core of your data
transformation logic lives. Each file in this folder represents a DBT model—a
SQL query that transforms raw data into a clean, usable format.
Models are typically organized into subfolders that reflect
the layers of transformation. Common subfolders include:
- Staging: Contains models that clean and standardize raw source data.
- Intermediate: Holds models that join, filter, or enrich staging data.
- Marts: Includes final models used for reporting, dashboards, or business analysis.
This layered approach helps teams maintain clarity and
control over complex transformations. It also supports dependency tracking,
allowing DBT to build models in the correct order based on their relationships.
Each model can be documented, tested, and materialized as a table, a view, or an ephemeral model, depending on performance and use-case requirements.
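As a sketch, here is what a minimal staging model might look like. The source name raw, the customers table, and the column names are all hypothetical; the config block shows how a model selects its materialization:

```sql
-- models/staging/stg_customers.sql
-- Staging models commonly materialize as views, since they are cheap to rebuild.
{{ config(materialized='view') }}

select
    id           as customer_id,
    lower(email) as email,
    created_at
from {{ source('raw', 'customers') }}
```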
2. The seeds/ Folder: Static Reference Data
The seeds folder is used to store static datasets in CSV
format. These datasets are typically small, stable, and used for reference
purposes—such as country codes, currency mappings, or business rules.
When you run the dbt seed command, DBT loads these CSV files into the data warehouse as tables, where they can be referenced in transformations just like any other model.
Seeds are especially useful for:
- Lookup tables that enrich transactional data.
- Configuration tables that drive dynamic logic.
- Testing datasets used to validate transformations.
Because seeds are version-controlled alongside the rest of
the project, they offer a reliable and auditable way to manage reference data.
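For example, a hypothetical seeds/country_codes.csv might contain:

```text
country_code,country_name
US,United States
DE,Germany
IN,India
```

After dbt seed loads it, any model can join to it with {{ ref('country_codes') }}, exactly as it would reference another model.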
3. The snapshots/ Folder: Tracking Historical Changes
The snapshots folder is where DBT stores logic for capturing
historical versions of data. This is essential for implementing Slowly Changing
Dimensions (SCDs), where you need to track how records evolve over time.
Snapshots work by comparing the current state of a record to
its previous state and storing changes in a dedicated snapshot table. This
allows analysts to answer questions like:
- What was a customer’s status last month?
- How did product pricing change over time?
- When did a user upgrade their subscription?
Snapshots are configured with metadata that defines the
unique key, update strategy, and timestamp fields. They are particularly
valuable in environments where source systems overwrite data and historical
context is lost.
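A minimal snapshot sketch, assuming a hypothetical raw.customers source with a reliable updated_at column (tables without one can use the check strategy instead):

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
```

Each dbt snapshot run compares incoming rows to the stored ones and tracks changed records with validity timestamps, so the snapshot table accumulates the full history.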
4. The macros/ Folder: Reusable SQL Logic
The macros folder is where you define reusable SQL functions
using Jinja templating. Macros help you write cleaner, more maintainable code
by abstracting repetitive logic into callable functions.
For example, if you frequently calculate a specific metric
across multiple models, you can define that logic once in a macro and reuse it
wherever needed.
Macros are powerful because they:
- Reduce duplication across models.
- Encapsulate complex logic into readable functions.
- Support dynamic SQL generation based on parameters.
They can be used in models, tests, snapshots, and even
documentation—making them one of the most versatile tools in the DBT ecosystem.
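As an illustration, here is a small macro; the name cents_to_dollars and its parameters are hypothetical:

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

A model then calls it like a function (stg_payments and amount_cents are likewise made up):

```sql
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ ref('stg_payments') }}
```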
5. The tests/ Folder: Custom Data Validations
While DBT supports built-in generic tests (like not_null, unique, and relationships), the tests folder allows you to define custom validations, known as singular tests, written in SQL.
These tests are designed to catch business-specific
anomalies that generic tests might miss. For example:
- Ensuring that revenue values are always positive.
- Verifying that event timestamps occur after user signup dates.
- Detecting duplicate records based on composite keys.
Custom tests return rows that violate expectations. If any
rows are returned, the test fails—alerting the team to investigate.
By storing these tests in a dedicated folder, DBT promotes a
culture of proactive data quality assurance.
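A sketch of such a singular test, assuming a hypothetical fct_orders model:

```sql
-- tests/assert_revenue_is_positive.sql
-- Each returned row is a violation; the test passes only when
-- this query returns zero rows.
select
    order_id,
    revenue
from {{ ref('fct_orders') }}
where revenue < 0
```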
6. The analysis/ Folder: Exploratory Queries
The analysis folder (named analyses/ in recent DBT versions) is used for ad hoc queries and exploratory analysis that aren’t part of the transformation pipeline. These queries can be compiled with dbt compile and run manually, but they are never materialized in the warehouse or included in the DAG.
This folder is useful for:
- Prototyping new models.
- Investigating data anomalies.
- Performing one-off analyses for stakeholders.
While not essential to every project, the analysis folder
provides a sandbox for experimentation without cluttering the production
workflow.
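An analysis file is just templated SQL that DBT compiles but never runs as part of the pipeline. A hypothetical example:

```sql
-- analysis/monthly_signups.sql
-- `dbt compile` resolves the ref() and writes runnable SQL to target/compiled/.
select
    date_trunc('month', signup_date) as signup_month,
    count(*)                         as signups
from {{ ref('dim_users') }}
group by 1
order by 1
```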
7. The docs/ Folder: Enhancing Documentation
DBT automatically generates documentation based on model metadata, tests, and sources. The docs folder allows you to enhance this documentation with custom content, such as markdown files containing docs blocks, images, or diagrams.
This is especially helpful for:
- Explaining complex business logic.
- Providing onboarding guides for new team members.
- Sharing context about data sources and usage.
By combining auto-generated and custom documentation, DBT
helps teams build a comprehensive knowledge base around their data.
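Custom documentation lives in markdown files as docs blocks, which model YAML can then reference. A minimal sketch with hypothetical names:

```markdown
{% docs revenue %}
Revenue is recognized at shipment time, reported in USD,
and excludes taxes and refunds.
{% enddocs %}
```

```yaml
models:
  - name: fct_orders
    columns:
      - name: revenue
        description: '{{ doc("revenue") }}'
```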
8. The data/ or sources/ Folder: External Source Definitions
Some teams create a dedicated folder for defining external
sources in YAML files. These definitions include metadata about raw tables—such
as descriptions, column names, and freshness expectations.
Source definitions improve:
- Data lineage tracking, showing how raw data flows into models.
- Documentation, making it easier to understand upstream systems.
- Testing, by enabling freshness checks and schema validations.
While not required, organizing source definitions in a
separate folder can improve clarity and maintainability.
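A sketch of a source definition; the raw source name, schema, and tables are hypothetical, and the freshness block is what powers dbt source freshness checks:

```yaml
# models/staging/sources.yml (the path is a team convention, not a requirement)
version: 2

sources:
  - name: raw
    schema: raw_data              # schema written to by the ingestion tool
    loaded_at_field: updated_at   # column used for freshness checks
    freshness:
      warn_after: {count: 12, period: hour}
    tables:
      - name: customers
        description: One row per customer record in the source system.
      - name: orders
        description: One row per order placed.
```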
Conclusion
DBT’s file and folder structure is more than just a way to
organize code—it’s a framework for building reliable, scalable, and
collaborative data transformation workflows. Each folder serves a specific
purpose, from core modeling to testing, documentation, and historical tracking.
By embracing this structure, teams can:
- Write modular and reusable SQL.
- Validate data quality automatically.
- Track changes and collaborate effectively.
- Document their work for transparency and governance.
- Scale their pipelines with confidence.
Whether you're just starting with DBT or looking to optimize your existing project, understanding and leveraging its folder structure is a foundational step toward building a modern, trustworthy data stack.