# Testing Source Freshness

Ensuring your source data is up-to-date is critical for reliable analytics. dbt's source freshness checks allow you to monitor and verify when data was last loaded, helping your team maintain data quality and meet service level agreements (SLAs).

### Why Test Source Freshness?

Source freshness checks serve several important purposes:

| Purpose                     | Description                                                              |
| --------------------------- | ------------------------------------------------------------------------ |
| Detect stale data           | Identify when source data hasn't been updated within expected timeframes |
| Enforce SLAs                | Ensure that data meets freshness requirements for business operations    |
| Prevent incorrect analytics | Avoid building reports on outdated information                           |
| Monitor pipeline health     | Get early warnings when data ingestion processes fail                    |

### How Source Freshness Works

dbt's source freshness checking works by:

1. Examining a timestamp column in your source tables
2. Comparing that timestamp to the current time
3. Evaluating the difference against your defined thresholds
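
The three-step comparison above can be sketched in a few lines of Python. This is an illustrative simplification, not dbt's actual implementation; the function name and signature are invented for this example:

```python
from datetime import datetime, timedelta, timezone

def evaluate_freshness(max_loaded_at, warn_after, error_after, now=None):
    """Classify a source as pass/warn/error by comparing the age of its
    newest record against the configured thresholds."""
    now = now or datetime.now(timezone.utc)
    age = now - max_loaded_at  # how stale the most recent record is
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

# A record loaded 15 hours ago, with warn_after of 12 hours and
# error_after of 24 hours, lands in the warning band:
loaded = datetime.now(timezone.utc) - timedelta(hours=15)
print(evaluate_freshness(loaded, timedelta(hours=12), timedelta(hours=24)))  # warn
```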

{% hint style="info" %}
**Example Freshness Check Failures**

When data exceeds your freshness thresholds, dbt will report failures like these:


```bash
03:15:22 | 1 of 3 WARN freshness of jaffle_shop.orders ........................ [WARN in 1.42s]
03:15:22 | WARN: Source jaffle_shop.orders (model: sources.yml) - Missing 15.3 hours of data. Latest record loaded at 2023-09-15 12:01:32, expected within 12 hours

03:15:24 | 2 of 3 ERROR freshness of stripe.transactions ...................... [ERROR in 2.01s]
03:15:24 | ERROR: Source stripe.transactions (model: stripe_sources.yml) - Missing 28.5 hours of data. Latest record loaded at 2023-09-14 22:45:11, expected within 24 hours
```


The warnings and errors include the source name, how much data is missing, the timestamp of the most recent record, and your configured threshold.
{% endhint %}

***

### Configuring Freshness Checks

To enable freshness checks, you need to add two key elements to your `sources.yml` file:

1. A `loaded_at_field` that identifies the timestamp column
2. A `freshness` configuration that defines your thresholds

Here's a basic example:

```yaml
version: 2

sources:
  - name: jaffle_shop
    database: raw  
    freshness:
      warn_after: {count: 12, period: hour}  # Warn if data is over 12 hours old
      error_after: {count: 24, period: hour}  # Error if data is over 24 hours old
    loaded_at_field: _etl_loaded_at  # Column storing the last update timestamp
    tables:
      - name: raw_orders
      - name: raw_customers  # Inherits default freshness settings
```

#### Understanding the Configuration

* `loaded_at_field`: The column containing the timestamp when data was loaded
* `warn_after`: When to issue a warning about stale data
* `error_after`: When to report an error about stale data (typically set to a longer window than `warn_after`, so a warning fires before an error)

{% hint style="info" %}
**Valid Time Periods for Freshness Checks**

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

* `minute` - For data that updates very frequently (e.g., real-time systems)
* `hour` - For data that updates throughout the day (e.g., transaction systems)
* `day` - For data that updates daily or less frequently (e.g., batch uploads)

For example: `warn_after: {count: 6, period: hour}` would warn if data is more than 6 hours old.
{% endhint %}

***

### Table-Specific Freshness Settings

You can override source-level freshness settings for specific tables:

```yaml
sources:
  - name: jaffle_shop
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at
    tables:
      - name: raw_orders
        freshness:  
          warn_after: {count: 6, period: hour}  # More strict for orders table
          error_after: {count: 12, period: hour}

      - name: raw_customers  # Inherits source's default freshness

      - name: raw_product_skus
        freshness: null  # Disable freshness checks for this table
```

***

### Running Freshness Checks

To check the freshness of your source data, run:

```bash
dbt source freshness
```

This command:

1. Queries each source table for the most recent `loaded_at_field` value
2. Compares that timestamp against the current time
3. Evaluates it against your thresholds
4. Reports success, warning, or error for each source

{% hint style="info" %}
**Where Freshness Check Results Appear**

When you run `dbt source freshness`, the results appear in multiple places:

1. **Terminal Output**: Warnings and errors are displayed in your command line interface immediately when running the command
2. **Log Files**: Results are written to the log files in your project's `logs/` directory
3. **Artifacts**: Detailed results are stored in the `target/sources.json` file
4. **dbt Cloud** (if using): Results appear in the run history and can trigger notifications
5. **Paradime Interface** (if using): Results are displayed in the Paradime UI with history tracking

These results can be consumed by monitoring tools, notification systems, or dashboards to provide visibility into your data pipeline health.
{% endhint %}

#### Example Output

```
03:33:31 | Concurrency: 1 threads (target='dev')
03:33:31 | 
03:33:31 | 1 of 2 START freshness of jaffle_shop.raw_orders ................... [RUN]
03:33:32 | 1 of 2 WARN freshness of jaffle_shop.raw_orders .................... [WARN in 0.98s]
03:33:32 | 2 of 2 START freshness of jaffle_shop.raw_customers ................ [RUN]
03:33:33 | 2 of 2 PASS freshness of jaffle_shop.raw_customers ................. [PASS in 0.82s]
03:33:33 | 
03:33:33 | Finished running 2 source freshness checks in 1.99s.
```
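
The terminal output above is also captured in the `target/sources.json` artifact mentioned earlier, which you can post-process for custom alerting. A minimal sketch, assuming the artifact's `results` entries carry `unique_id`, `status`, and `max_loaded_at` keys (verify against the artifact schema for your dbt version; the sample data here is invented):

```python
import json

def stale_sources(artifact):
    """Return (unique_id, status, max_loaded_at) for every freshness
    result in a parsed sources.json artifact that did not pass."""
    return [
        (r["unique_id"], r["status"], r.get("max_loaded_at"))
        for r in artifact.get("results", [])
        if r["status"] != "pass"
    ]

# In a real project you would load the artifact dbt writes after
# `dbt source freshness`:
#   with open("target/sources.json") as f:
#       artifact = json.load(f)
sample = {
    "results": [
        {"unique_id": "source.jaffle_shop.raw_orders", "status": "warn",
         "max_loaded_at": "2023-09-15T12:01:32"},
        {"unique_id": "source.jaffle_shop.raw_customers", "status": "pass",
         "max_loaded_at": "2023-09-16T03:30:00"},
    ]
}
print(stale_sources(sample))
```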

***

### Integrating Freshness Checks into Workflows

Freshness checks can be integrated into your data pipeline in several ways:

#### Scheduled Checks

Run freshness checks on a regular schedule to proactively monitor your data:

```bash
# Example cron job to check freshness every hour
0 * * * * cd /path/to/project && dbt source freshness
```

#### Pre-run Validation

Execute freshness checks before running models to prevent building on stale data:

```bash
#!/bin/bash
# Exit with error if any source freshness check fails
dbt source freshness || exit 1
# Only run models if freshness checks pass
dbt run
```

#### CI/CD Pipeline Integration

Include freshness checks in your CI/CD workflows:

```yaml
# Example GitHub Actions workflow step
- name: Check source freshness
  run: dbt source freshness --target prod
```

***

### Optimizing Freshness Checks for Large Tables

For very large datasets, running freshness checks can be resource-intensive. You can optimize them using filters:

```yaml
sources:
  - name: analytics
    tables:
      - name: events
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 6, period: hour}
          filter: "date_trunc('day', created_at) >= dateadd('day', -3, current_date)"
```

The `filter` clause limits the rows that dbt examines when checking freshness, improving performance for large tables.

***

### Handling Different Data Loading Patterns

Different sources may have various data loading patterns that affect how you configure freshness:

| Loading Pattern      | Freshness Strategy      | Example                                   |
| -------------------- | ----------------------- | ----------------------------------------- |
| Real-time streaming  | Short windows (minutes) | `warn_after: {count: 15, period: minute}` |
| Hourly batch updates | Medium windows (hours)  | `warn_after: {count: 2, period: hour}`    |
| Daily ETL jobs       | Longer windows (days)   | `warn_after: {count: 1, period: day}`     |
| Weekly data delivery | Extended windows        | `warn_after: {count: 8, period: day}`     |

### Troubleshooting Freshness Issues

If your freshness checks are failing, consider these common issues:

1. **Incorrect timestamp column**: Verify that `loaded_at_field` is the right column
2. **Timezone differences**: Check if there are timezone discrepancies between source timestamps and dbt
3. **Data loading failures**: Investigate upstream ETL/ELT processes
4. **Unrealistic expectations**: Adjust thresholds to match actual data loading patterns
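
The timezone issue in particular can be hard to spot: a naive (timezone-unaware) source timestamp misread as UTC shifts the apparent age of every record by the UTC offset. A hypothetical illustration in Python:

```python
from datetime import datetime, timezone

# A source timestamp stored without timezone info ("naive") can make
# data look staler (or fresher) than it really is.
naive_local = datetime(2023, 9, 15, 12, 0)                       # wall clock in UTC-5
as_utc      = datetime(2023, 9, 15, 17, 0, tzinfo=timezone.utc)  # the same instant

# Misreading the naive value as UTC shifts the apparent age by 5 hours:
wrongly_assumed_utc = naive_local.replace(tzinfo=timezone.utc)
skew = as_utc - wrongly_assumed_utc
print(skew)  # 5:00:00 of phantom staleness
```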

***

### Best Practices

| Practice                 | Description                                                                                                       |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------- |
| Set realistic thresholds | Align freshness requirements with business needs and actual data load frequencies                                 |
| Use an appropriate column | Choose a column that truly represents when data was last loaded (an ETL load timestamp is preferred over source-system timestamps) |
| Monitor trends           | Track freshness over time to identify deteriorating pipeline performance                                          |
| Disable when appropriate | Use `freshness: null` for static reference tables that rarely change                                              |
| Document expectations    | Include freshness SLAs in your data documentation                                                                 |

***

{% hint style="info" %}

### Automating Source Generation with DinoAI

With [DinoAI](https://docs.paradime.io/app-help/documentation/dino-ai) (Paradime's AI Agent), you can automatically generate source definitions with appropriate freshness configurations. See [step by step instructions](https://docs.paradime.io/app-help/documentation/dino-ai/agent-mode/use-cases/creating-sources-from-your-warehouse#step-by-step-instructions) for more details.
{% endhint %}
