Testing Source Freshness

Ensuring your source data is up-to-date is critical for reliable analytics. dbt's source freshness checks allow you to monitor and verify when data was last loaded, helping your team maintain data quality and meet service level agreements (SLAs).

Why Test Source Freshness?

Source freshness checks serve several important purposes:

| Purpose | Description |
| --- | --- |
| Detect stale data | Identify when source data hasn't been updated within expected timeframes |
| Enforce SLAs | Ensure that data meets freshness requirements for business operations |
| Prevent incorrect analytics | Avoid building reports on outdated information |
| Monitor pipeline health | Get early warnings when data ingestion processes fail |

How Source Freshness Works

dbt's source freshness checking works by:

  1. Examining a timestamp column in your source tables

  2. Comparing that timestamp to the current time

  3. Evaluating the difference against your defined thresholds

Example Freshness Check Failures

When data exceeds your freshness thresholds, dbt will report failures like these:


03:15:22 | 1 of 3 WARN freshness of jaffle_shop.orders ........................ [WARN in 1.42s]
03:15:22 | WARN: Source jaffle_shop.orders (model: sources.yml) - Missing 15.3 hours of data. Latest record loaded at 2023-09-15 12:01:32, expected within 12 hours

03:15:24 | 2 of 3 ERROR freshness of stripe.transactions ...................... [ERROR in 2.01s]
03:15:24 | ERROR: Source stripe.transactions (model: stripe_sources.yml) - Missing 28.5 hours of data. Latest record loaded at 2023-09-14 22:45:11, expected within 24 hours

The warnings and errors include the source name, how much data is missing, the timestamp of the most recent record, and your configured threshold.


Configuring Freshness Checks

To enable freshness checks, you need to add two key elements to your sources.yml file:

  1. A loaded_at_field that identifies the timestamp column

  2. A freshness configuration that defines your thresholds

Here's a basic example:

version: 2

sources:
  - name: jaffle_shop
    database: raw  
    freshness:
      warn_after: {count: 12, period: hour}  # Warn if data is over 12 hours old
      error_after: {count: 24, period: hour}  # Error if data is over 24 hours old
    loaded_at_field: _etl_loaded_at  # Column storing the last update timestamp
    tables:
      - name: raw_orders
      - name: raw_customers  # Inherits default freshness settings

Understanding the Configuration

  • loaded_at_field: The column containing the timestamp when data was loaded

  • warn_after: When to issue a warning about stale data

  • error_after: When to report an error about stale data (typically set to a longer window than warn_after, so it only fires once data is well past the warning threshold)

Valid Time Periods for Freshness Checks

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

  • minute - For data that updates very frequently (e.g., real-time systems)

  • hour - For data that updates throughout the day (e.g., transaction systems)

  • day - For data that updates daily or less frequently (e.g., batch uploads)

For example: warn_after: {count: 6, period: hour} would warn if data is more than 6 hours old.
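
The warn and error thresholds are configured independently, so they can use different units. A minimal sketch for a near-real-time feed (the source, table, and column names are illustrative):

sources:
  - name: web_events                 # illustrative source
    loaded_at_field: event_loaded_at
    freshness:
      warn_after: {count: 30, period: minute}   # warn after 30 minutes without new data
      error_after: {count: 2, period: hour}     # error after 2 hours
    tables:
      - name: raw_events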


Table-Specific Freshness Settings

You can override source-level freshness settings for specific tables:

sources:
  - name: jaffle_shop
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at
    tables:
      - name: raw_orders
        freshness:  
          warn_after: {count: 6, period: hour}  # More strict for orders table
          error_after: {count: 12, period: hour}

      - name: raw_customers  # Inherits source's default freshness

      - name: raw_product_skus
        freshness: null  # Disable freshness checks for this table

Running Freshness Checks

To check the freshness of your source data, run:

dbt source freshness

This command:

  1. Queries each source table for the most recent loaded_at_field value

  2. Compares that timestamp against the current time

  3. Evaluates it against your thresholds

  4. Reports success, warning, or error for each source

Where Freshness Check Results Appear

When you run dbt source freshness, the results appear in multiple places:

  1. Terminal Output: Warnings and errors are displayed in your command line interface immediately when running the command

  2. Log Files: Results are written to the log files in your project's logs/ directory

  3. Artifacts: Detailed results are stored in the target/sources.json file

  4. dbt Cloud (if using): Results appear in the run history and can trigger notifications

  5. Paradime Interface (if using): Results are displayed in the Paradime UI with history tracking

These results can be consumed by monitoring tools, notification systems, or dashboards to provide visibility into your data pipeline health.

Example Output

03:33:31 | Concurrency: 1 threads (target='dev')
03:33:31 | 
03:33:31 | 1 of 2 START freshness of jaffle_shop.raw_orders ................... [RUN]
03:33:32 | 1 of 2 WARN freshness of jaffle_shop.raw_orders .................... [WARN in 0.98s]
03:33:32 | 2 of 2 START freshness of jaffle_shop.raw_customers ................ [RUN]
03:33:33 | 2 of 2 PASS freshness of jaffle_shop.raw_customers ................. [PASS in 0.82s]
03:33:33 | 
03:33:33 | Finished running 2 source freshness checks in 1.99s.

Integrating Freshness Checks into Workflows

Freshness checks can be integrated into your data pipeline in several ways:

Scheduled Checks

Run freshness checks on a regular schedule to proactively monitor your data:

# Example cron job to check freshness every hour
0 * * * * cd /path/to/project && dbt source freshness

Pre-run Validation

Execute freshness checks before running models to prevent building on stale data:

#!/bin/bash
# Exit with error if any source freshness check fails
dbt source freshness || exit 1
# Only run models if freshness checks pass
dbt run

CI/CD Pipeline Integration

Include freshness checks in your CI/CD workflows:

# Example GitHub Actions workflow step
- name: Check source freshness
  run: dbt source freshness --target prod
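
For reference, here is a minimal sketch of a complete workflow built around that step. It assumes dbt Core is installed with pip using a Snowflake adapter and that a profiles.yml in the repository reads credentials from environment variables; the secret name, schedule, and adapter are placeholders to adapt to your setup:

# .github/workflows/source-freshness.yml (illustrative)
name: Source freshness

on:
  schedule:
    - cron: "0 * * * *"      # hourly, mirroring the cron example above
  workflow_dispatch: {}

jobs:
  freshness:
    runs-on: ubuntu-latest
    env:
      SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}   # placeholder secret
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-snowflake     # swap for your warehouse adapter
      - name: Install dbt packages
        run: dbt deps
      - name: Check source freshness
        run: dbt source freshness --target prod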

Optimizing Freshness Checks for Large Tables

For very large datasets, running freshness checks can be resource-intensive. You can optimize them using filters:

sources:
  - name: analytics
    tables:
      - name: events
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 6, period: hour}
          filter: "date_trunc('day', created_at) >= dateadd('day', -3, current_date)"

The filter clause limits the rows that dbt examines when checking freshness, improving performance for large tables.
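
The SQL inside filter runs verbatim against your warehouse, so write it in your platform's dialect (the example above uses Snowflake/Redshift-style dateadd). As a sketch, an equivalent filter for a BigQuery table partitioned on created_at might look like this:

sources:
  - name: analytics
    tables:
      - name: events
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 6, period: hour}
          filter: "created_at >= timestamp_sub(current_timestamp(), interval 3 day)"   # limits the scan to recent partitions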


Handling Different Data Loading Patterns

Different sources may have various data loading patterns that affect how you configure freshness:

| Loading Pattern | Freshness Strategy | Example |
| --- | --- | --- |
| Real-time streaming | Short windows (minutes) | warn_after: {count: 15, period: minute} |
| Hourly batch updates | Medium windows (hours) | warn_after: {count: 2, period: hour} |
| Daily ETL jobs | Longer windows (days) | warn_after: {count: 1, period: day} |
| Weekly data delivery | Extended windows | warn_after: {count: 8, period: day} |
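
As a sketch, the two ends of that spectrum might be configured side by side like this (source, table, and column names are illustrative):

sources:
  - name: clickstream                # real-time streaming source
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 15, period: minute}
      error_after: {count: 1, period: hour}
    tables:
      - name: page_views

  - name: erp_exports                # weekly data delivery
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 8, period: day}
      error_after: {count: 10, period: day}
    tables:
      - name: invoices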

Troubleshooting Freshness Issues

If your freshness checks are failing, consider these common issues:

  1. Incorrect timestamp column: Verify that loaded_at_field is the right column

  2. Timezone differences: Check whether source timestamps and the current timestamp dbt compares them against use different timezones (see the sketch after this list)

  3. Data loading failures: Investigate upstream ETL/ELT processes

  4. Unrealistic expectations: Adjust thresholds to match actual data loading patterns
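
For the timezone case, note that loaded_at_field accepts a SQL expression as well as a plain column name, so you can normalize the timestamp before dbt compares it with the current time. A sketch using Snowflake's convert_timezone (the column name is illustrative):

sources:
  - name: jaffle_shop
    tables:
      - name: raw_orders
        loaded_at_field: "convert_timezone('UTC', _etl_loaded_at)"   # compare in UTC
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}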


Best Practices

| Practice | Description |
| --- | --- |
| Set realistic thresholds | Align freshness requirements with business needs and actual data load frequencies |
| Use appropriate column | Choose a column that truly represents when data was last updated (ETL load timestamp preferred over source-system timestamps) |
| Monitor trends | Track freshness over time to identify deteriorating pipeline performance |
| Disable when appropriate | Use freshness: null for static reference tables that rarely change |
| Document expectations | Include freshness SLAs in your data documentation |
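
One way to act on the last two rows is to record the SLA next to the source definition and switch checks off where they don't apply. A sketch, with illustrative names and an illustrative meta key:

sources:
  - name: reference_data
    tables:
      - name: country_codes
        description: "Static lookup table; no freshness SLA."
        freshness: null                       # disable checks for static data
      - name: exchange_rates
        description: "Loaded daily by the finance ETL; freshness SLA is 24 hours."
        meta:
          freshness_sla: "24 hours"           # illustrative metadata key
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 1, period: day}
          error_after: {count: 2, period: day}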


Automating Freshness Monitoring in Paradime

Paradime enhances dbt's freshness capabilities with additional monitoring features:

paradime sources generate

This command not only generates source definitions but also sets appropriate freshness configurations based on observed data patterns.

By implementing comprehensive source freshness testing, you can ensure your data transformations are always built on timely, reliable source data.
