Defining Your Sources in sources.yml

Sources in dbt represent the raw data tables in your data warehouse. Defining sources in a sources.yml file allows you to centralize table references, enabling cleaner code, testing, and documentation.


What is sources.yml?

The sources.yml file is where you define the external data tables that your dbt models will transform. This file centralizes information about your raw data sources, making it easier to:

  • Reference raw tables consistently throughout your project

  • Test source data for quality and freshness

  • Document your data pipeline from beginning to end

  • Create clear lineage visualization in dbt docs


Basic Structure

A basic sources.yml file follows this structure:

version: 2

sources:
  - name: jaffle_shop  # Source system name
    database: raw      # Optional: database where source is stored
    schema: jaffle_shop  # Schema containing the source tables
    tables:
      - name: orders   # Table name as it exists in the database
      - name: customers

This defines a source called jaffle_shop with two tables: orders and customers.


Required and Optional Fields

  • name (required): Logical name for the source, used as the first argument to the source() function

  • schema (optional): Database schema containing the source tables; defaults to the source name if omitted

  • database (optional): Database where the schema exists, if different from the target database

  • tables (required): List of tables in this source

  • description (optional): Description of the source for documentation

  • loaded_at_field (optional): Timestamp column used for freshness checks; required if freshness is configured

  • loader (optional): The tool loading this data (e.g., Fivetran, Stitch); informational only

  • freshness (optional): Configuration for source freshness checks

  • quoting (optional): Settings for quoting identifiers in generated SQL

Detailed Source Configuration

Here's a more complete example with additional configurations:

version: 2

sources:
  - name: stripe
    description: "Payment data from Stripe, loaded by Fivetran connector"
    database: raw_data
    schema: stripe
    loader: fivetran
    loaded_at_field: _fivetran_synced
    
    # Source-level freshness (can be overridden at table level)
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    
    tables:
      - name: charges
        description: "Credit card charges, including status and amount"
        columns:
          - name: id
            description: "Primary key - the Stripe charge ID"
            tests:
              - unique
              - not_null
          - name: amount
            description: "Amount in cents"
            tests:
              - not_null
      
      - name: customers
        description: "Customer information from Stripe"
        # Override source-level freshness for this table
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}

Referencing Sources in Models

Once defined, you can reference sources in your models using the source() function:

-- models/staging/stg_stripe_charges.sql
SELECT
  id as charge_id,
  customer_id,
  amount / 100.0 as amount_usd,  -- Convert cents to dollars
  status,
  created as created_at
FROM {{ source('stripe', 'charges') }}

The source() function takes two arguments:

  1. The source name (defined as name in sources.yml)

  2. The table name (defined under tables in sources.yml)
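At compile time, dbt resolves each source() call into the fully qualified table name built from the database, schema, and table name in sources.yml. For the stripe source defined earlier (database raw_data, schema stripe), the compiled SQL would look roughly like:

FROM raw_data.stripe.charges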


Testing Sources

You can apply tests to your source tables just like you would with models:

sources:
  - name: crm
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: email
            tests:
              - unique
              - not_null

This allows you to verify data quality at the entry point to your dbt pipeline.
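Because sources are selectable nodes, you can run only these tests using dbt's node selection syntax, for example:

dbt test --select source:crm
dbt test --select source:crm.customers

The second command narrows the run to tests on a single source table.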


Source Freshness

One of the most powerful features of sources is the ability to check data freshness, verifying that your source data is up to date.

Configuring Freshness Checks

To set up freshness checks, you need:

  1. A column in your source table that indicates when a record was last updated (loaded_at_field)

  2. Freshness threshold configurations

sources:
  - name: stripe
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: charges
      - name: customers
        # Override for specific table
        freshness:
          warn_after: {count: 36, period: hour}

Valid Time Periods for Freshness Checks

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

  • minute - For data that updates very frequently (e.g., real-time systems)

  • hour - For data that updates throughout the day (e.g., transaction systems)

  • day - For data that updates daily or less frequently (e.g., batch uploads)

For example: warn_after: {count: 6, period: hour} would warn if data is more than 6 hours old.

Running Freshness Checks

Execute freshness checks with:

dbt source freshness

The output indicates which sources are fresh, stale (warning), or too old (error).
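You can also check a subset of sources with node selection; dbt writes the results to the target/sources.json artifact, which can feed downstream monitoring:

dbt source freshness --select source:stripe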


Advanced Source Configurations

Dynamic Schema Resolution

You can use Jinja in your source definitions to handle environments:

sources:
  - name: marketing
    schema: "{% if target.name == 'prod' %}marketing_prod{% else %}marketing_{{ target.name }}{% endif %}"
    tables:
      - name: campaigns

This creates different schemas depending on your target environment.
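An alternative pattern uses dbt's built-in env_var() function with a default value; the environment variable name here is an assumption for illustration:

sources:
  - name: marketing
    schema: "{{ env_var('DBT_MARKETING_SCHEMA', 'marketing_dev') }}"
    tables:
      - name: campaigns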

Quoting Configuration

Control identifier quoting:

sources:
  - name: salesforce
    quoting:
      database: true
      schema: true
      identifier: false
    tables:
      - name: Contacts
        quoting:
          identifier: true  # Override source-level setting
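With these settings, the compiled reference quotes the database and schema but not table names, except for Contacts, which opts back in at the table level. Assuming a target database named analytics (illustrative), the compiled SQL would look roughly like:

FROM "analytics"."salesforce"."Contacts"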

External Tables

For data sources like S3 or GCS:

sources:
  - name: clickstream
    tables:
      - name: raw_events
        external:
          location: "s3://my-bucket/clickstream/events/"
          file_format: "parquet"

Best Practices for Source Configuration

  • Group logically related sources: organize sources by system or domain (e.g., CRM, ERP, Marketing)

  • Document sources thoroughly: add descriptions for sources, tables, and columns

  • Test source data: apply tests to key columns to catch issues early

  • Set appropriate freshness thresholds: define freshness based on business needs and data load frequency

  • Use consistent naming patterns: establish a convention for source names and stick to it

  • Consider file organization: place source definitions close to the models that use them

Organizing sources.yml Files

There are several approaches to organizing your source definitions:

Single File Approach

Good for smaller projects:

models/
  └── sources.yml    # All sources defined in one file

Source-by-Source Approach

Better for larger projects:

models/
  └── sources/
      ├── salesforce_sources.yml
      ├── stripe_sources.yml
      └── google_analytics_sources.yml

Co-located Approach

Place source definitions with the models that use them:

models/
  ├── staging/
  │   ├── stripe/
  │   │   ├── stripe_sources.yml
  │   │   ├── stg_stripe_charges.sql
  │   │   └── stg_stripe_customers.sql
  │   └── salesforce/
  │       ├── salesforce_sources.yml
  │       └── stg_salesforce_contacts.sql

Generating Source Definitions

Paradime offers a CLI tool to automatically generate source definitions:

paradime sources generate

This command:

  • Scans your data warehouse

  • Creates or updates source definitions

  • Saves them to appropriately named YAML files

Automatically generated source definitions can save you time and ensure accuracy, especially when dealing with complex schemas or frequent changes.

By effectively defining and managing your sources, you create a solid foundation for your dbt project that improves maintainability, testability, and documentation.
