Testing Source Freshness

Ensuring your source data is up-to-date is critical for reliable analytics. dbt's source freshness checks allow you to monitor and verify when data was last loaded, helping your team maintain data quality and meet service level agreements (SLAs).

Why Test Source Freshness?

Source freshness checks serve several important purposes:

  • Detect stale data: Identify when source data hasn't been updated within expected timeframes

  • Enforce SLAs: Ensure that data meets freshness requirements for business operations

  • Prevent incorrect analytics: Avoid building reports on outdated information

  • Monitor pipeline health: Get early warnings when data ingestion processes fail

How Source Freshness Works

dbt's source freshness checking works by:

  1. Examining a timestamp column in your source tables

  2. Comparing that timestamp to the current time

  3. Evaluating the difference against your defined thresholds
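The comparison dbt performs can be sketched in a few lines of Python. This is a simplified illustration only, not dbt's actual implementation; the function name and arguments are ours:

```python
from datetime import datetime, timedelta, timezone

# Simplified sketch of a freshness evaluation (illustration, not dbt's code).
def check_freshness(max_loaded_at, now, warn_after, error_after):
    """Return 'pass', 'warn', or 'error' for a source's latest load timestamp."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2023, 9, 16, 3, 15, tzinfo=timezone.utc)
loaded = datetime(2023, 9, 15, 12, 1, tzinfo=timezone.utc)  # ~15.2 hours ago

status = check_freshness(loaded, now,
                         warn_after=timedelta(hours=12),
                         error_after=timedelta(hours=24))
print(status)  # warn: older than 12 hours but younger than 24
```

Here the data is about 15 hours old, so it crosses the warn threshold but not the error threshold, matching the WARN result shown in the example output below.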

Example Freshness Check Failures

When data exceeds your freshness thresholds, dbt will report failures like these:


03:15:22 | 1 of 3 WARN freshness of jaffle_shop.orders ........................ [WARN in 1.42s]
03:15:22 | WARN: Source jaffle_shop.orders (model: sources.yml) - Missing 15.3 hours of data. Latest record loaded at 2023-09-15 12:01:32, expected within 12 hours

03:15:24 | 2 of 3 ERROR freshness of stripe.transactions ...................... [ERROR in 2.01s]
03:15:24 | ERROR: Source stripe.transactions (model: stripe_sources.yml) - Missing 28.5 hours of data. Latest record loaded at 2023-09-14 22:45:11, expected within 24 hours

The warnings and errors include the source name, how much data is missing, the timestamp of the most recent record, and your configured threshold.


Configuring Freshness Checks

To enable freshness checks, you need to add two key elements to your sources.yml file:

  1. A loaded_at_field that identifies the timestamp column

  2. A freshness configuration that defines your thresholds

Here's a basic example:

version: 2

sources:
  - name: jaffle_shop
    database: raw  
    freshness:
      warn_after: {count: 12, period: hour}  # Warn if data is over 12 hours old
      error_after: {count: 24, period: hour}  # Error if data is over 24 hours old
    loaded_at_field: _etl_loaded_at  # Column storing the last update timestamp
    tables:
      - name: raw_orders
      - name: raw_customers  # Inherits default freshness settings

Understanding the Configuration

  • loaded_at_field: The column containing the timestamp when data was loaded

  • warn_after: When to issue a warning about stale data

  • error_after: When to report an error about stale data (set to a longer window than warn_after)

Valid Time Periods for Freshness Checks

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

  • minute - For data that updates very frequently (e.g., real-time systems)

  • hour - For data that updates throughout the day (e.g., transaction systems)

  • day - For data that updates daily or less frequently (e.g., batch uploads)

For example: warn_after: {count: 6, period: hour} would warn if data is more than 6 hours old.


Table-Specific Freshness Settings

You can override source-level freshness settings for specific tables:

sources:
  - name: jaffle_shop
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at
    tables:
      - name: raw_orders
        freshness:  
          warn_after: {count: 6, period: hour}  # More strict for orders table
          error_after: {count: 12, period: hour}

      - name: raw_customers  # Inherits source's default freshness

      - name: raw_product_skus
        freshness: null  # Disable freshness checks for this table

Running Freshness Checks

To check the freshness of your source data, run:

dbt source freshness

This command:

  1. Queries each source table for the most recent loaded_at_field value

  2. Compares that timestamp against the current time

  3. Evaluates it against your thresholds

  4. Reports success, warning, or error for each source

Where Freshness Check Results Appear

When you run dbt source freshness, the results appear in multiple places:

  1. Terminal Output: Warnings and errors are displayed in your command line interface immediately when running the command

  2. Log Files: Results are written to the log files in your project's logs/ directory

  3. Artifacts: Detailed results are stored in the target/sources.json file

  4. dbt Cloud (if using): Results appear in the run history and can trigger notifications

  5. Paradime Interface (if using): Results are displayed in the Paradime UI with history tracking

These results can be consumed by monitoring tools, notification systems, or dashboards to provide visibility into your data pipeline health.
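For programmatic monitoring, the target/sources.json artifact can be parsed directly. The sketch below uses an abbreviated, assumed payload shape; real artifacts contain additional metadata fields, so consult the file produced by your dbt version for the full schema:

```python
import json

# Abbreviated sample payload mirroring the general shape of target/sources.json
# (assumed structure for illustration; real artifacts include more fields).
artifact = json.loads("""
{
  "results": [
    {"unique_id": "source.jaffle_shop.raw_orders",
     "status": "warn",
     "max_loaded_at": "2023-09-15T12:01:32"},
    {"unique_id": "source.jaffle_shop.raw_customers",
     "status": "pass",
     "max_loaded_at": "2023-09-16T02:59:10"}
  ]
}
""")

# Collect any sources that did not pass their freshness check.
stale = [r["unique_id"] for r in artifact["results"] if r["status"] != "pass"]
print(stale)  # ['source.jaffle_shop.raw_orders']
```

A script like this can feed alerting or dashboard tooling after each scheduled freshness run.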

Example Output

03:33:31 | Concurrency: 1 threads (target='dev')
03:33:31 | 
03:33:31 | 1 of 2 START freshness of jaffle_shop.raw_orders ................... [RUN]
03:33:32 | 1 of 2 WARN freshness of jaffle_shop.raw_orders .................... [WARN in 0.98s]
03:33:32 | 2 of 2 START freshness of jaffle_shop.raw_customers ................ [RUN]
03:33:33 | 2 of 2 PASS freshness of jaffle_shop.raw_customers ................. [PASS in 0.82s]
03:33:33 | 
03:33:33 | Finished running 2 source freshness checks in 1.99s.

Integrating Freshness Checks into Workflows

Freshness checks can be integrated into your data pipeline in several ways:

Scheduled Checks

Run freshness checks on a regular schedule to proactively monitor your data:

# Example cron job to check freshness every hour
0 * * * * cd /path/to/project && dbt source freshness

Pre-run Validation

Execute freshness checks before running models to prevent building on stale data:

#!/bin/bash
# Exit with error if any source freshness check fails
dbt source freshness || exit 1
# Only run models if freshness checks pass
dbt run

CI/CD Pipeline Integration

Include freshness checks in your CI/CD workflows:

# Example GitHub Actions workflow step
- name: Check source freshness
  run: dbt source freshness --target prod
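A fuller, hypothetical workflow job might install dbt, run the freshness check, and fail the pipeline when sources are stale; adjust the adapter package and profile setup to match your project:

```yaml
# Hypothetical GitHub Actions job (adapter and profile setup are assumptions)
jobs:
  source-freshness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake  # or your warehouse's adapter
      - name: Check source freshness
        run: dbt source freshness --target prod
```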

Optimizing Freshness Checks for Large Tables

For very large datasets, running freshness checks can be resource-intensive. You can optimize them using filters:

sources:
  - name: analytics
    tables:
      - name: events
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 6, period: hour}
          filter: "date_trunc('day', created_at) >= dateadd('day', -3, current_date)"

The filter clause limits the rows that dbt examines when checking freshness, improving performance for large tables.


Handling Different Data Loading Patterns

Different sources may have various data loading patterns that affect how you configure freshness:

  • Real-time streaming: short windows (minutes), e.g. warn_after: {count: 15, period: minute}

  • Hourly batch updates: medium windows (hours), e.g. warn_after: {count: 2, period: hour}

  • Daily ETL jobs: longer windows (days), e.g. warn_after: {count: 1, period: day}

  • Weekly data delivery: extended windows, e.g. warn_after: {count: 8, period: day}

Troubleshooting Freshness Issues

If your freshness checks are failing, consider these common issues:

  1. Incorrect timestamp column: Verify that loaded_at_field is the right column

  2. Timezone differences: Check if there are timezone discrepancies between source timestamps and dbt

  3. Data loading failures: Investigate upstream ETL/ELT processes

  4. Unrealistic expectations: Adjust thresholds to match actual data loading patterns
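The timezone issue in particular is easy to reproduce: if the warehouse stores naive local timestamps but the freshness check interprets them as UTC, the offset shows up as phantom staleness. A small illustration (the five-hour offset is an arbitrary example):

```python
from datetime import datetime, timedelta, timezone

# Suppose the source stores naive timestamps in a UTC-5 local time,
# but the freshness comparison treats them as UTC.
now_utc = datetime(2023, 9, 16, 12, 0, tzinfo=timezone.utc)
loaded_local = datetime(2023, 9, 16, 6, 55)  # 06:55 local == 11:55 UTC

# Treating the local timestamp as UTC inflates the apparent age:
apparent_age = now_utc - loaded_local.replace(tzinfo=timezone.utc)

# Converting to UTC first gives the true age:
true_age = now_utc - (loaded_local + timedelta(hours=5)).replace(tzinfo=timezone.utc)

print(apparent_age)  # 5:05:00 -- looks stale against a 1-hour threshold
print(true_age)      # 0:05:00 -- the data is actually five minutes old
```

Storing and comparing all load timestamps in UTC avoids this entire class of false positives.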


Best Practices

  • Set realistic thresholds: Align freshness requirements with business needs and actual data load frequencies

  • Use an appropriate column: Choose a column that truly represents when data was last updated (an ETL load timestamp is preferred over source-system timestamps)

  • Monitor trends: Track freshness over time to identify deteriorating pipeline performance

  • Disable when appropriate: Use freshness: null for static reference tables that rarely change

  • Document expectations: Include freshness SLAs in your data documentation


Automating Freshness Monitoring in Paradime

Paradime enhances dbt's freshness capabilities with additional monitoring features:

paradime sources generate

This command not only generates source definitions but also sets appropriate freshness configurations based on observed data patterns.

By implementing comprehensive source freshness testing, you can ensure your data transformations are always built on timely, reliable source data.
