Defining Your Sources in sources.yml

Sources in dbt represent the raw data tables in your data warehouse. Defining sources in a sources.yml file allows you to centralize table references, enabling cleaner code, testing, and documentation.


What is sources.yml?

The sources.yml file is where you define the external data tables that your dbt models will transform. This file centralizes information about your raw data sources, making it easier to:

  • Reference raw tables consistently throughout your project

  • Test source data for quality and freshness

  • Document your data pipeline from beginning to end

  • Create clear lineage visualization in dbt docs


Basic Structure

A basic sources.yml file follows this structure:

version: 2

sources:
  - name: jaffle_shop  # Source system name
    database: raw      # Optional: database where source is stored
    schema: jaffle_shop  # Schema containing the source tables
    tables:
      - name: orders   # Table name as it exists in the database
      - name: customers

This defines a source called jaffle_shop with two tables: orders and customers.


Required and Optional Fields

Field
Required?
Description

name

Required

Logical name for the source (used in the source() function)

schema

Required

Database schema where the tables exist

database

Optional

Database where the schema exists (if different from target database)

tables

Required

List of tables in this source

description

Optional

Description of the source for documentation

loader

Optional

Information about the tool loading this data (e.g., Fivetran, Stitch)

freshness

Optional

Configuration for freshness checks

quoting

Optional

Settings for quoting identifiers


Detailed Source Configuration

Here's a more complete example with additional configurations:


Referencing Sources in Models

Once defined, you can reference sources in your models using the source() function:

The source() function takes two arguments:

  1. The source name (defined as name in sources.yml)

  2. The table name (defined under tables in sources.yml)


Testing Sources

You can apply tests to your source tables just like you would with models:

This allows you to verify data quality at the entry point to your dbt pipeline.


Source Freshness

One of the most powerful features of sources is the ability to check data freshness - ensuring your source data is up-to-date.

Configuring Freshness Checks

To set up freshness checks, you need:

  1. A column in your source table that indicates when a record was last updated (loaded_at_field)

  2. Freshness threshold configurations

Valid Time Periods for Freshness Checks

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

  • minute - For data that updates very frequently (e.g., real-time systems)

  • hour - For data that updates throughout the day (e.g., transaction systems)

  • day - For data that updates daily or less frequently (e.g., batch uploads)

For example: warn_after: {count: 6, period: hour} would warn if data is more than 6 hours old.

Running Freshness Checks

Execute freshness checks with:

The output indicates which sources are fresh, stale (warning), or too old (error).


Advanced Source Configurations

Dynamic Schema Resolution

You can use Jinja in your source definitions to handle environments:

This creates different schemas depending on your target environment.

Quoting Configuration

Control identifier quoting:

External Tables

For data sources like S3 or GCS:


Best Practices for Source Configuration

Practice
Description

Group logically related sources

Organize sources by system or domain (e.g., CRM, ERP, Marketing)

Document sources thoroughly

Add descriptions for sources, tables, and columns

Test source data

Apply tests to key columns to catch issues early

Set appropriate freshness thresholds

Define freshness based on business needs and data load frequency

Use consistent naming patterns

Establish a convention for source names and stick to it

Consider file organization

Place source definitions close to the models that use them


Organizing source.yml Files

There are several approaches to organizing your source definitions:

Single File Approach

Good for smaller projects:

Source-by-Source Approach

Better for larger projects:

Place source definitions with the models that use them:


Automating Source Generation with DinoAI

With DinoAI (Paradime's AI Agent), you can automatically generate source definitions with appropriate freshness configurations. See step by step instructions for more details.

Last updated

Was this helpful?