Defining Your Sources in sources.yml
Sources in dbt represent the raw data tables in your data warehouse. Defining sources in a sources.yml
file allows you to centralize table references, enabling cleaner code, testing, and documentation.
What is sources.yml?
The sources.yml
file is where you define the external data tables that your dbt models will transform. This file centralizes information about your raw data sources, making it easier to:
Reference raw tables consistently throughout your project
Test source data for quality and freshness
Document your data pipeline from beginning to end
Create clear lineage visualization in dbt docs
Basic Structure
A basic sources.yml
file follows this structure:
version: 2
sources:
- name: jaffle_shop # Source system name
database: raw # Optional: database where source is stored
schema: jaffle_shop # Schema containing the source tables
tables:
- name: orders # Table name as it exists in the database
- name: customers
This defines a source called jaffle_shop
with two tables: orders
and customers
.
Required and Optional Fields
name
Required
Logical name for the source (used in the source()
function)
schema
Required
Database schema where the tables exist
database
Optional
Database where the schema exists (if different from target database)
tables
Required
List of tables in this source
description
Optional
Description of the source for documentation
loader
Optional
Information about the tool loading this data (e.g., Fivetran, Stitch)
freshness
Optional
Configuration for freshness checks
quoting
Optional
Settings for quoting identifiers
Detailed Source Configuration
Here's a more complete example with additional configurations:
version: 2
sources:
- name: stripe
description: "Payment data from Stripe, loaded by Fivetran connector"
database: raw_data
schema: stripe
loader: fivetran
loaded_at_field: _fivetran_synced
# Source-level freshness (can be overridden at table level)
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: charges
description: "Credit card charges, including status and amount"
columns:
- name: id
description: "Primary key - the Stripe charge ID"
tests:
- unique
- not_null
- name: amount
description: "Amount in cents"
tests:
- not_null
- name: customers
description: "Customer information from Stripe"
# Override source-level freshness for this table
freshness:
warn_after: {count: 24, period: hour}
error_after: {count: 48, period: hour}
Referencing Sources in Models
Once defined, you can reference sources in your models using the source()
function:
-- models/staging/stg_stripe_charges.sql
SELECT
id as charge_id,
customer_id,
amount / 100.0 as amount_usd, -- Convert cents to dollars
status,
created as created_at
FROM {{ source('stripe', 'charges') }}
The source()
function takes two arguments:
The source name (defined as
name
insources.yml
)The table name (defined under
tables
insources.yml
)
Testing Sources
You can apply tests to your source tables just like you would with models:
sources:
- name: crm
tables:
- name: customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: email
tests:
- unique
- not_null
This allows you to verify data quality at the entry point to your dbt pipeline.
Source Freshness
One of the most powerful features of sources is the ability to check data freshness - ensuring your source data is up-to-date.
Configuring Freshness Checks
To set up freshness checks, you need:
A column in your source table that indicates when a record was last updated (
loaded_at_field
)Freshness threshold configurations
sources:
- name: stripe
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: charges
- name: customers
# Override for specific table
freshness:
warn_after: {count: 36, period: hour}
Running Freshness Checks
Execute freshness checks with:
dbt source freshness
The output indicates which sources are fresh, stale (warning), or too old (error).
Advanced Source Configurations
Dynamic Schema Resolution
You can use Jinja in your source definitions to handle environments:
sources:
- name: marketing
schema: "{% if target.name == 'prod' %}marketing_prod{% else %}marketing_{{ target.name }}{% endif %}"
tables:
- name: campaigns
This creates different schemas depending on your target environment.
Quoting Configuration
Control identifier quoting:
sources:
- name: salesforce
quoting:
database: true
schema: true
identifier: false
tables:
- name: Contacts
quoting:
identifier: true # Override source-level setting
External Tables
For data sources like S3 or GCS:
sources:
- name: clickstream
tables:
- name: raw_events
external:
location: "s3://my-bucket/clickstream/events/"
file_format: "parquet"
Best Practices for Source Configuration
Group logically related sources
Organize sources by system or domain (e.g., CRM, ERP, Marketing)
Document sources thoroughly
Add descriptions for sources, tables, and columns
Test source data
Apply tests to key columns to catch issues early
Set appropriate freshness thresholds
Define freshness based on business needs and data load frequency
Use consistent naming patterns
Establish a convention for source names and stick to it
Consider file organization
Place source definitions close to the models that use them
Organizing source.yml Files
There are several approaches to organizing your source definitions:
Single File Approach
Good for smaller projects:
models/
└── sources.yml # All sources defined in one file
Source-by-Source Approach
Better for larger projects:
models/
└── sources/
├── salesforce_sources.yml
├── stripe_sources.yml
└── google_analytics_sources.yml
Alongside Related Models
Place source definitions with the models that use them:
models/
├── staging/
│ ├── stripe/
│ │ ├── stripe_sources.yml
│ │ ├── stg_stripe_charges.sql
│ │ └── stg_stripe_customers.sql
│ └── salesforce/
│ ├── salesforce_sources.yml
│ └── stg_salesforce_contacts.sql
Generating Source Definitions
Paradime offers a CLI tool to automatically generate source definitions:
paradime sources generate
This command:
Scans your data warehouse
Creates or updates source definitions
Saves them to appropriately named YAML files
By effectively defining and managing your sources, you create a solid foundation for your dbt project that improves maintainability, testability, and documentation.
Last updated
Was this helpful?