Defining Your Sources in sources.yml
Sources in dbt represent the raw data tables in your data warehouse. Defining sources in a sources.yml
file allows you to centralize table references, enabling cleaner code, testing, and documentation.
What is sources.yml?
The sources.yml
file is where you define the external data tables that your dbt models will transform. This file centralizes information about your raw data sources, making it easier to:
Reference raw tables consistently throughout your project
Test source data for quality and freshness
Document your data pipeline from beginning to end
Create clear lineage visualization in dbt docs
Basic Structure
A basic sources.yml
file follows this structure:
This defines a source called jaffle_shop
with two tables: orders
and customers
.
Required and Optional Fields
name
Required
Logical name for the source (used in the source()
function)
schema
Required
Database schema where the tables exist
database
Optional
Database where the schema exists (if different from target database)
tables
Required
List of tables in this source
description
Optional
Description of the source for documentation
loader
Optional
Information about the tool loading this data (e.g., Fivetran, Stitch)
freshness
Optional
Configuration for freshness checks
quoting
Optional
Settings for quoting identifiers
Detailed Source Configuration
Here's a more complete example with additional configurations:
Referencing Sources in Models
Once defined, you can reference sources in your models using the source()
function:
The source()
function takes two arguments:
The source name (defined as
name
insources.yml
)The table name (defined under
tables
insources.yml
)
Testing Sources
You can apply tests to your source tables just like you would with models:
This allows you to verify data quality at the entry point to your dbt pipeline.
Source Freshness
One of the most powerful features of sources is the ability to check data freshness - ensuring your source data is up-to-date.
Configuring Freshness Checks
To set up freshness checks, you need:
A column in your source table that indicates when a record was last updated (
loaded_at_field
)Freshness threshold configurations
Valid Time Periods for Freshness Checks
When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:
minute
- For data that updates very frequently (e.g., real-time systems)hour
- For data that updates throughout the day (e.g., transaction systems)day
- For data that updates daily or less frequently (e.g., batch uploads)
For example: warn_after: {count: 6, period: hour}
would warn if data is more than 6 hours old.
Running Freshness Checks
Execute freshness checks with:
The output indicates which sources are fresh, stale (warning), or too old (error).
Advanced Source Configurations
Dynamic Schema Resolution
You can use Jinja in your source definitions to handle environments:
This creates different schemas depending on your target environment.
Quoting Configuration
Control identifier quoting:
External Tables
For data sources like S3 or GCS:
Best Practices for Source Configuration
Group logically related sources
Organize sources by system or domain (e.g., CRM, ERP, Marketing)
Document sources thoroughly
Add descriptions for sources, tables, and columns
Test source data
Apply tests to key columns to catch issues early
Set appropriate freshness thresholds
Define freshness based on business needs and data load frequency
Use consistent naming patterns
Establish a convention for source names and stick to it
Consider file organization
Place source definitions close to the models that use them
Organizing source.yml Files
There are several approaches to organizing your source definitions:
Single File Approach
Good for smaller projects:
Source-by-Source Approach
Better for larger projects:
Alongside Related Models
Place source definitions with the models that use them:
Generating Source Definitions
Paradime offers a CLI tool to automatically generate source definitions:
This command:
Scans your data warehouse
Creates or updates source definitions
Saves them to appropriately named YAML files
Automatically generated source definitions can save you time and ensure accuracy, especially when dealing with complex schemas or frequent changes.
By effectively defining and managing your sources, you create a solid foundation for your dbt project that improves maintainability, testability, and documentation.
Last updated
Was this helpful?