Test your sources for freshness

dbt™ lets you monitor the freshness of your source data, which is crucial for maintaining data pipeline health and meeting service level agreements (SLAs) for your data.

Purpose

Testing source freshness serves several important functions:

  • Verifies that your source data is being updated regularly

  • Alerts you to potential issues in your data pipelines

  • Helps maintain data quality and reliability in your project

  • Supports the definition and monitoring of SLAs for your data

Key Components

Configuring Freshness Checks

To enable freshness checking, you need to add configuration to your sources.yml file. Here's an example:

version: 2

sources:
  - name: jaffle_shop
    database: raw
    freshness: # Default freshness settings for all tables in this source
      warn_after: {count: 12, period: hour}  # Warn if data is over 12 hours old
      error_after: {count: 24, period: hour}  # Error if data is over 24 hours old
    loaded_at_field: _etl_loaded_at  # Column used to determine when data was last loaded
    tables:
      - name: raw_orders
        freshness: # Stricter settings specifically for the orders table
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}

      - name: raw_customers  # This table inherits the default freshness settings

      - name: raw_product_skus
        freshness: null  # Freshness checks are disabled for this table

Key points:

  • The freshness block defines when to warn or error based on data age

  • If a source has a freshness: block, dbt will attempt to calculate freshness for that source:

    • If a loaded_at_field is provided, dbt calculates freshness by querying for the most recent value of that column (this was the only method available before v1.7).

    • If a loaded_at_field is not provided, dbt calculates freshness from warehouse metadata tables where possible (supported by the Snowflake, Redshift, and BigQuery adapters in v1.7 or greater); see the sketch after this list.

    • loaded_at_field specifies which column contains the last-loaded timestamp

  • Settings can be applied at the source level and overridden for specific tables

  • Use freshness: null to skip freshness checks for a table
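
As a minimal sketch of metadata-based freshness (v1.7+ on a supported adapter), here is a source table configured with no loaded_at_field, so dbt falls back to warehouse metadata. The raw_events table name is hypothetical, used only for illustration:

version: 2

sources:
  - name: jaffle_shop
    database: raw
    tables:
      - name: raw_events  # hypothetical table, for illustration
        freshness:  # no loaded_at_field, so dbt uses warehouse metadata (v1.7+)
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}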

Optimizing Freshness Checks for Large Tables

For large tables, you might want to add a filter to limit the rows checked:

tables:
  - name: large_table
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at
    filter: "_etl_loaded_at >= date_sub(current_date, interval 30 day)"  # Only check data from the last 30 days

This filter is appended to the WHERE clause of the freshness query, limiting the rows scanned and improving performance on large tables.
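
Note that the filter is passed through as raw SQL, so it must match your warehouse's dialect; the example above uses BigQuery syntax. A rough Snowflake equivalent, assuming the same _etl_loaded_at column, might look like:

tables:
  - name: large_table
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at
    filter: "_etl_loaded_at >= dateadd(day, -30, current_date)"  # Snowflake syntax for the same 30-day window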

Running Freshness Tests

To check the freshness of your sources, use the following dbt command:

dbt source freshness      # This command checks freshness for all configured sources

This command will:

  1. For each configured source table, query the most recent loaded_at_field value, or fall back to warehouse metadata when no loaded_at_field is set

  2. Compare this timestamp to the current time

  3. Report whether each table is fresh (pass), stale (warn), or stale beyond the error threshold (error)
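
You can also narrow the check to specific sources or tables using dbt's node selection syntax, for example:

dbt source freshness --select source:jaffle_shop             # all tables in one source
dbt source freshness --select source:jaffle_shop.raw_orders  # a single table

Results are also written to a JSON artifact (sources.json in the target directory) that downstream tooling can consume.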

Best Practices

  1. Set appropriate freshness thresholds based on your data update frequency and business needs

  2. Monitor and act on freshness test results to maintain data reliability

  3. Document your freshness expectations for each source

  4. Use freshness tests in combination with other data quality tests for comprehensive data validation (see the example after this list)
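
As an illustration of point 4, a source table can carry both a freshness configuration and column-level data tests in the same sources.yml entry. The id column here is an assumption, standing in for your table's primary key:

sources:
  - name: jaffle_shop
    tables:
      - name: raw_orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 6, period: hour}
        columns:
          - name: id  # hypothetical primary key column
            tests:
              - not_null
              - unique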

Remember, keeping your data fresh is crucial for accurate analytics and decision-making. Regular freshness testing helps ensure the reliability of your data pipeline.

To generate or update your sources.yml on the fly, use the Paradime generator CLI command:

paradime sources generate
