Snapshots

Snapshots track historical changes in your data warehouse by implementing Type 2 Slowly Changing Dimensions (SCD). Instead of overwriting data, snapshots create new records while preserving the history of changes.

Understanding Slowly Changing Dimensions

In data warehousing, different strategies exist for handling data that changes over time:

  • Type 0 (Retain Original): The dimension never changes once created

  • Type 1 (Overwrite): Updates replace the original values with no history kept

  • Type 2 (Add New Row): Preserves history by creating new records when changes occur

  • Type 3 (Add New Attribute): Maintains limited history by adding columns for previous values

dbt snapshots implement Type 2 SCD, which is ideal for tracking changes to data that updates infrequently and requires historical tracking, such as:

  • Customer status changes

  • Product pricing and categorization

  • Order processing states

  • Employee roles and departments

Here's an example of how snapshots preserve history when a customer's status changes:

-- Initial record
customer_id  status    updated_at     dbt_valid_from    dbt_valid_to
101         active     2023-06-15     2023-06-15        null

-- After status change with snapshots
customer_id  status    updated_at     dbt_valid_from    dbt_valid_to
101         active     2023-06-15     2023-06-15        2023-08-20    -- Historical record
101         inactive   2023-08-20     2023-08-20        null          -- Current record

Using this snapshot, you can query the customer's status on any specific date.


Basic Configuration

Prerequisites

Before setting up snapshots:

  • Create a target schema for snapshots in your data warehouse

  • Ensure source data has either reliable timestamps or columns to track for changes

  • Set up the snapshots/ directory in your dbt project

Snapshots are defined in .sql files within your project's snapshots directory:

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('orders', 'orders') }}

{% endsnapshot %}

To execute the snapshot, run:

dbt snapshot

This command should be run whenever you want to capture changes, both in development and production environments.


Snapshot Strategies

dbt offers two primary strategies to detect changes in your data:

Uses a timestamp column to detect changes. This is the best choice when your source data reliably updates a timestamp column when records change.

When to use:

  • Source data has a reliable updated_at column

  • Need to track when changes occurred

  • Want to capture all changes based on timing

Configuration

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Your SQL query

{% endsnapshot %}

The updated_at parameter specifies which column contains the last-updated timestamp.


How Snapshots Work Behind the Scenes

When you run dbt snapshot, dbt:

  1. Creates the snapshot table if it doesn't exist

  2. For existing tables, determines which records have changed based on your strategy

  3. Marks previously current records that changed as no longer current (dbt_valid_to gets timestamp)

  4. Inserts new versions of changed records as current (with dbt_valid_to as null)

  5. Inserts completely new records

This process maintains a complete history of all changes while ensuring current data is easily identifiable.


Metadata Fields

As of dbt version ≥ 1.9, snapshots add four tracking columns:

  • dbt_valid_from: When this version became valid

  • dbt_valid_to: When this version became invalid (null for current version)

  • dbt_updated_at: Timestamp when the snapshot was taken

  • dbt_scd_id: Unique identifier for each version

These columns allow you to:

  • Identify current records (dbt_valid_to IS NULL)

  • Find records valid at a specific point in time

  • Determine when changes occurred

  • Track the duration a particular version was active


Step-by-Step Example

Let's implement a simple snapshot to track customer status changes:

1. Create a staging model for the source data

-- models/staging/stg_customers.sql
SELECT
    customer_id,
    name,
    status,
    email,
    updated_at
FROM {{ source('crm', 'customers') }}

2. Create a snapshot file

-- snapshots/customer_snapshots.sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}

3. Run the snapshot for the first time

dbt snapshot

This creates the initial snapshot table with all customers.

4. Simulate a data change in the source

When customer data changes in your source system, run the snapshot again to capture those changes:

dbt snapshot

5. Query historical and current data

-- Get current customer data
SELECT * FROM snapshots.customers_snapshot
WHERE dbt_valid_to IS NULL;

-- Get customer data as it existed on a specific date
SELECT * FROM snapshots.customers_snapshot
WHERE customer_id = 123
  AND dbt_valid_from <= '2023-06-01'
  AND (dbt_valid_to > '2023-06-01' OR dbt_valid_to IS NULL);

Advanced Configuration

Invalidating Hard Deletes

By default, snapshots don't track when records are deleted from the source. To track deletions:

{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=true
    )
}}

-- Your SQL query

{% endsnapshot %}

With invalidate_hard_deletes=true, dbt will:

  • Identify records in the snapshot that no longer exist in the source

  • Set their dbt_valid_to timestamps to mark them as no longer current

Custom Snapshot Schemas

You can dynamically set the target schema:

{% snapshot orders_snapshot %}

{{
    config(
        target_schema=var('snapshot_schema', 'snapshots'),
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Your SQL query

{% endsnapshot %}

This allows you to use variables to change the schema at runtime.


Best Practices for Snapshots

1. Snapshot Frequency

  • Schedule snapshots based on how frequently your source data changes and how important it is to capture every change

  • For critical data, run snapshots before dependent models to ensure they use the latest history

  • Consider storage costs versus historical data needs

# Example schedule.yml for a daily snapshot
jobs:
  - name: daily_snapshot
    schedule: "0 1 * * *"  # 1:00 AM daily
    steps:
      - dbt snapshot
      - dbt run --models dependent_on_snapshots

2. Performance Optimization

  • Use a dedicated schema for snapshots to make maintenance easier

  • Apply appropriate indexes to snapshot tables for faster querying

  • Consider partitioning large snapshot tables by date

  • For very large tables, use incremental snapshots with time-based filters

3. Querying Historical Data

To access historical data, filter to see the state of a specific record as it existed on a given date:

-- Get data as it looked on a specific date
SELECT *
FROM {{ ref('customers_snapshot') }}
WHERE customer_id = 123
  AND dbt_valid_from <= '2023-06-01'
  AND (dbt_valid_to > '2023-06-01' OR dbt_valid_to IS NULL)

4. Common Pitfalls to Avoid

  • Infrequent snapshots: Running snapshots too infrequently might miss intermediate state changes

  • Missing source filter: Always filter your source query to include only necessary data

  • Unreliable timestamps: Ensure your updated_at field actually updates when records change

  • Using snapshots for high-frequency changes: Consider incremental models for data that changes very frequently


Using Snapshots in Your Data Architecture

Snapshots typically fit into your data architecture as follows:

dbt project
├── models/
│   ├── staging/         # Simple transformations of source data
│   ├── intermediate/    # Business logic transformations
│   └── marts/           # Business-level output models
├── snapshots/           # Historical tracking of changing data
│   ├── customer_snapshots.sql
│   └── product_snapshots.sql
└── analyses/            # SQL for historical analysis using snapshots
    └── customer_status_history.sql

By effectively implementing snapshots, you create a robust history tracking system that supports historical reporting, audit requirements, and trend analysis—all while maintaining the simplicity and reproducibility that makes dbt powerful.

Last updated

Was this helpful?