# Testing Source Freshness
Ensuring your source data is up-to-date is critical for reliable analytics. dbt's source freshness checks allow you to monitor and verify when data was last loaded, helping your team maintain data quality and meet service level agreements (SLAs).
## Why Test Source Freshness?

Source freshness checks serve several important purposes:

- **Detect stale data**: Identify when source data hasn't been updated within expected timeframes
- **Enforce SLAs**: Ensure that data meets freshness requirements for business operations
- **Prevent incorrect analytics**: Avoid building reports on outdated information
- **Monitor pipeline health**: Get early warnings when data ingestion processes fail
## How Source Freshness Works

dbt's source freshness checking works by:

1. Examining a timestamp column in your source tables
2. Comparing that timestamp to the current time
3. Evaluating the difference against your defined thresholds
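The three steps above can be sketched as follows. This is a simplified illustration of the comparison logic, not dbt's internal implementation:

```python
from datetime import datetime, timedelta, timezone

def evaluate_freshness(max_loaded_at, warn_after, error_after, now=None):
    """Classify a source as 'pass', 'warn', or 'error' based on the age
    of its most recent record and the configured thresholds."""
    now = now or datetime.now(timezone.utc)
    age = now - max_loaded_at  # how stale the newest record is
    if error_after is not None and age > error_after:
        return "error"
    if warn_after is not None and age > warn_after:
        return "warn"
    return "pass"

# Example: data last loaded 7 hours ago, warn after 6h, error after 24h
loaded = datetime.now(timezone.utc) - timedelta(hours=7)
status = evaluate_freshness(loaded, timedelta(hours=6), timedelta(hours=24))
# status == "warn"
```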
## Example Freshness Check Failures
When data exceeds your freshness thresholds, dbt reports a warning or an error.
The warnings and errors include the source name, how stale the data is, the timestamp of the most recent record, and your configured threshold.
## Configuring Freshness Checks

To enable freshness checks, you need to add two key elements to your `sources.yml` file:

- A `loaded_at_field` that identifies the timestamp column
- A `freshness` configuration that defines your thresholds
Here's a basic example:
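A minimal sketch, with illustrative source, table, and column names:

```yaml
# models/sources.yml
version: 2

sources:
  - name: ecommerce              # illustrative source name
    schema: raw
    loaded_at_field: _etl_loaded_at   # illustrative timestamp column
    freshness:                   # source-level defaults
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
```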
## Understanding the Configuration

- `loaded_at_field`: The column containing the timestamp when data was loaded
- `warn_after`: When to issue a warning about stale data
- `error_after`: When to report an error about stale data (typically set to a longer window than `warn_after`)
## Valid Time Periods for Freshness Checks

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

- `minute` - For data that updates very frequently (e.g., real-time systems)
- `hour` - For data that updates throughout the day (e.g., transaction systems)
- `day` - For data that updates daily or less frequently (e.g., batch uploads)

For example, `warn_after: {count: 6, period: hour}` would warn if data is more than 6 hours old.
## Table-Specific Freshness Settings

You can override source-level freshness settings for specific tables:
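For example (names illustrative), a table-level `freshness` block takes precedence over the source-level default:

```yaml
sources:
  - name: ecommerce
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
    tables:
      - name: orders                # inherits the source-level settings
      - name: event_stream
        freshness:                  # tighter thresholds for this table only
          warn_after: {count: 15, period: minute}
          error_after: {count: 1, period: hour}
```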
## Running Freshness Checks

To check the freshness of your source data, run:
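dbt ships a dedicated command for this:

```shell
dbt source freshness
```

(In dbt versions before 0.21, the same check was invoked as `dbt source snapshot-freshness`.)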
This command:

1. Queries each source table for the most recent `loaded_at_field` value
2. Compares that timestamp against the current time
3. Evaluates it against your thresholds
4. Reports success, warning, or error for each source
## Where Freshness Check Results Appear

When you run `dbt source freshness`, the results appear in multiple places:

- **Terminal output**: Warnings and errors are displayed in your command line interface immediately when running the command
- **Log files**: Results are written to the log files in your project's `logs/` directory
- **Artifacts**: Detailed results are stored in the `target/sources.json` file
- **dbt Cloud (if using)**: Results appear in the run history and can trigger notifications
- **Paradime interface (if using)**: Results are displayed in the Paradime UI with history tracking
These results can be consumed by monitoring tools, notification systems, or dashboards to provide visibility into your data pipeline health.
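As one illustration, the `target/sources.json` artifact can be consumed programmatically. A Python sketch follows; the `results`, `unique_id`, and `status` field names follow recent dbt artifact schemas, so verify them against your dbt version:

```python
import json

def stale_sources(path="target/sources.json"):
    """Return (unique_id, status) for every source whose freshness check
    did not pass, based on dbt's sources.json artifact.
    Assumes the artifact has a top-level 'results' list whose entries
    carry 'unique_id' and 'status' keys (per recent dbt versions)."""
    with open(path) as f:
        artifact = json.load(f)
    return [
        (r["unique_id"], r["status"])
        for r in artifact.get("results", [])
        if r["status"] != "pass"
    ]
```

A monitoring job could run this after each freshness check and route any non-passing sources to an alerting channel.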
## Example Output

In the terminal, dbt prints one line per checked source table, marked PASS, WARN, or ERROR according to your thresholds.
## Integrating Freshness Checks into Workflows

Freshness checks can be integrated into your data pipeline in several ways:
### Scheduled Checks

Run freshness checks on a regular schedule to proactively monitor your data:
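A minimal sketch using cron; the project path and log destination are illustrative:

```shell
# Run freshness checks every hour, on the hour (crontab entry);
# /path/to/dbt_project and the log file are placeholders for your setup.
0 * * * * cd /path/to/dbt_project && dbt source freshness >> logs/freshness_cron.log 2>&1
```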
### Pre-run Validation

Execute freshness checks before running models to prevent building on stale data:
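A minimal shell sketch: `dbt source freshness` exits non-zero when a check reaches error status, so the build can be gated on it:

```shell
#!/bin/sh
set -e                 # stop at the first failing command
dbt source freshness   # an error-status source aborts the pipeline here
dbt build              # only runs if sources are fresh enough
```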
### CI/CD Pipeline Integration

Include freshness checks in your CI/CD workflows:
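For example, a hedged GitHub Actions sketch; the schedule, adapter package, and profiles location are assumptions about your environment:

```yaml
# .github/workflows/freshness.yml (illustrative)
name: Source freshness
on:
  schedule:
    - cron: "0 * * * *"   # hourly
jobs:
  freshness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake   # adapter choice is an assumption
      - run: dbt source freshness
        env:
          DBT_PROFILES_DIR: .   # assumes profiles.yml lives in the repo root
```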
## Optimizing Freshness Checks for Large Tables

For very large datasets, running freshness checks can be resource-intensive. You can optimize them using filters:
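A sketch of the `filter` key nested inside a table's freshness block; the column name and SQL dialect are illustrative:

```yaml
sources:
  - name: ecommerce
    loaded_at_field: _etl_loaded_at
    tables:
      - name: events              # very large table
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 6, period: hour}
          # Only scan recent rows when computing max(_etl_loaded_at);
          # dateadd() here is Snowflake-style SQL.
          filter: _etl_loaded_at >= dateadd(day, -1, current_timestamp)
```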
The `filter` clause limits the rows that dbt examines when checking freshness, improving performance for large tables.
## Handling Different Data Loading Patterns

Different sources may have various data loading patterns that affect how you configure freshness:

| Loading pattern | Freshness window | Example configuration |
| --- | --- | --- |
| Real-time streaming | Short windows (minutes) | `warn_after: {count: 15, period: minute}` |
| Hourly batch updates | Medium windows (hours) | `warn_after: {count: 2, period: hour}` |
| Daily ETL jobs | Longer windows (days) | `warn_after: {count: 1, period: day}` |
| Weekly data delivery | Extended windows | `warn_after: {count: 8, period: day}` |
## Troubleshooting Freshness Issues

If your freshness checks are failing, consider these common issues:

- **Incorrect timestamp column**: Verify that `loaded_at_field` is the right column
- **Timezone differences**: Check if there are timezone discrepancies between source timestamps and dbt
- **Data loading failures**: Investigate upstream ETL/ELT processes
- **Unrealistic expectations**: Adjust thresholds to match actual data loading patterns
## Best Practices

- **Set realistic thresholds**: Align freshness requirements with business needs and actual data load frequencies
- **Use an appropriate column**: Choose a column that truly represents when data was last updated (an ETL timestamp is preferred over source-system timestamps)
- **Monitor trends**: Track freshness over time to identify deteriorating pipeline performance
- **Disable when appropriate**: Use `freshness: null` for static reference tables that rarely change
- **Document expectations**: Include freshness SLAs in your data documentation
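The `freshness: null` override looks like this in practice (names illustrative):

```yaml
sources:
  - name: reference_data
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 1, period: day}
    tables:
      - name: country_codes
        freshness: null   # static lookup table; skip freshness checks
```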
## Automating Freshness Monitoring in Paradime

Paradime enhances dbt's freshness capabilities with additional monitoring features. Its source-generation tooling not only generates source definitions but also sets appropriate freshness configurations based on observed data patterns.
By implementing comprehensive source freshness testing, you can ensure your data transformations are always built on timely, reliable source data.