Test your sources for freshness
dbt™ allows you to monitor the freshness of your source data, which is crucial for maintaining data pipeline health and meeting service level agreements (SLAs) for your data warehouse.
Purpose
Testing source freshness serves several important functions:
Verifies that your source data is being updated regularly
Alerts you to potential issues in your data pipelines
Helps maintain data quality and reliability in your project
Supports the definition and monitoring of SLAs for your data
Key Components
Configuring Freshness Checks
To enable freshness checking, you need to add configuration to your sources.yml
file. Here's an example:
Key points:
The
freshness
block defines when to warn or error based on data ageIf a source has a
freshness:
block, dbt will attempt to calculate freshness for that source:If a
loaded_at_field
is provided, dbt will calculate freshness via a select query (behavior prior to v1.7).If a
loaded_at_field
is not provided, dbt will calculate freshness via warehouse metadata tables when possible (supported in Snowflake, Redshift and BigQuery adapters in v1.7 or greater).loaded_at_field
specifies which column contains the last-updated timestamp
Settings can be applied at the source level and overridden for specific tables
Use
freshness: null
to skip freshness checks for a table
Optimizing Freshness Checks for Large Tables
For large tables, you might want to add a filter to limit the rows checked:
This filter helps optimize the freshness check query for better performance.
Running Freshness Tests
To check the freshness of your sources, use the following dbt command:
This command will:
Construct a query to find the most recent
loaded_at_field
value, or look for warehouse metadata tables for each source tableCompare this timestamp to the current time
Report whether the data is fresh, stale (warning), or old (error)
Best Practices
Set appropriate freshness thresholds based on your data update frequency and business needs
Monitor and act on freshness test results to maintain data reliability
Document your freshness expectations for each source
Use freshness tests in combination with other data quality tests for comprehensive data validation
Remember, keeping your data fresh is crucial for accurate analytics and decision-making. Regular freshness testing helps ensure the reliability of your data pipeline.
To generate and/or update your sources.yml on the fly, use the Paradime generator CLI command:
Last updated