Define your sources.yml

The sources.yml file is a crucial component in dbt projects, used to define and document your raw data sources. It centralizes source information and enables features like freshness checking.

Purpose

The sources.yml file serves several important functions:

  • Defines the raw data sources used in your dbt project

  • Documents metadata about your source tables

  • Enables source freshness checking

  • Allows use of the source() function in your models

Key Components

Declaring Sources

Sources are defined in .yml files nested under a sources: key. Here's an example of how to structure your sources.yml file:

version: 2

sources:
  - name: jaffle_shop
    database: raw  
    schema: jaffle_shop  
    tables:
      - name: orders
      - name: customers

  - name: stripe
    tables:
      - name: payments

This example defines two sources: jaffle_shop and stripe. For jaffle_shop, we specify the database and schema. Each source lists its available tables.

Note: By default, schema will be the same as name. Add schema only if you want to use a source name that differs from the existing schema.

Using the source() Function

The source() function is used in your models to reference the sources you've defined. It takes two arguments: the source name and the table name. Here's an example of how to use it in a model:

select
  c.id as customer_id,
  c.name,
  o.id as order_id,
  o.order_date,
  o.status

from {{ source('jaffle_shop', 'customers') }} c      -- References the customers table from the jaffle_shop source
left join {{ source('jaffle_shop', 'orders') }} o    -- References the orders table from the jaffle_shop source
    on c.id = o.customer_id                          -- Joins the two tables on customer ID

This SQL query selects data from the customers and orders tables in the jaffle_shop source, joining them on the customer ID.

The source() function:

  • Returns a reference to the specified source table

  • Establishes dependencies between the source and the current model

  • Compiles to the full table name in your data warehouse

Best Practices

  1. Use meaningful and consistent naming for your sources and tables

  2. Document your sources thoroughly, including descriptions and owners

  3. Set up appropriate freshness checks for critical data sources

  4. Use column-level constraints and tests to ensure data quality

  5. Keep your sources.yml file up-to-date as your data sources evolve

Remember, the sources.yml file should be version controlled and committed to your repository along with your other dbt project files.

To generate and/or update your sources.yml on the fly, use the Paradime generator CLI command:

paradime sources generate

Last updated