Snapshots

Snapshots track historical changes in your data warehouse by implementing Type 2 Slowly Changing Dimensions (SCD). Instead of overwriting data, snapshots create new records while preserving history of changes.

In data warehousing, Type 2 SCD is one of several methods for handling changes:

  • Type 1: Overwrites old data (no history)

  • Type 2: Preserves history by creating new records (what snapshots implement)

  • Type 3: Adds new columns for changes

Snapshots are ideal for tracking changes in fields that update infrequently and require historical tracking, such as customer status changes, product pricing, and order processing states.

To understand how snapshots preserve history, let's look at a customer account status change:

-- Let's say we have a customer account table where status can change
-- Here's what it looks like initially
id    status          updated_at
101   active          2023-06-15

-- When the customer downgrades their account, the data changes to:
id    status          updated_at
101   basic           2023-08-20

-- Without snapshots, we've lost the history of when they were 'active'.
-- However, with snapshots, we preserve the complete history:

id    status    updated_at    dbt_valid_from    dbt_valid_to
101   active    2023-06-15    2023-06-15        2023-08-20    -- Historical record
101   basic     2023-08-20    2023-08-20        null          -- Current record

Using this snapshot, we can query historical states, like a customer's status on any given date.


Basic Configuration

Prerequisites

  • Create a target schema for snapshots in your data warehouse

  • Ensure source data has either reliable timestamps or columns to track for changes

Snapshots are defined in .sql files within your project's snapshots directory:

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('orders', 'orders') }}

{% endsnapshot %}

To execute the snapshot, run:

dbt snapshot

Note: Run dbt snapshot command whenever you want to capture changes, in both development and production environments.


Snapshot Strategies

Snapshot "strategies" define how dbt knows if a row has changed. There are two strategies built-in to dbt:

Uses a timestamp column to detect changes. Best choice when your source data reliably updates a timestamp column when records change.

When to use:

  • Source data has a reliable updated_at column

  • Need to track when changes occurred

  • Want to capture all changes based on timing

Configuration

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

// your SQL code

{% endsnapshot %}

Configuration Example

If you are brand new to using snapshots, use this step-by-step tutorial to create and test a snapshot with different strategies.

Configuration Example

Initial Setup

  1. Create source table:

-- models/staging/stg_test_users.sql
WITH source_data AS (
    SELECT 1 as user_id,
           'active' as status,
           'user1@test.com' as email,
           CURRENT_TIMESTAMP() as updated_at
    UNION ALL
    SELECT 2, 'pending', 'user2@test.com', CURRENT_TIMESTAMP()
)
SELECT * FROM source_data
  1. Create snapshot:

-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='user_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ ref('stg_test_users') }}
{% endsnapshot %}
  1. Create schema in data warehouse (if not already created):

CREATE SCHEMA IF NOT EXISTS snapshots;
  1. Run commands:

dbt run --select stg_test_users
dbt snapshot
  1. Verify in Snowflake:

SELECT * FROM your_database.snapshots.users_snapshot ORDER BY user_id;

Expected Result:

user_id | status  | email           | updated_at | dbt_valid_from | dbt_valid_to
1       | active  | user1@test.com  | [time1]    | [time1]        | null
2       | pending | user2@test.com  | [time1]    | [time1]        | null

This shows our initial state: two users with their current statuses. The null in dbt_valid_to indicates these are the current active records.

Test Status Change

  1. Update source data:

-- models/staging/stg_test_users.sql
WITH source_data AS (
    SELECT 1 as user_id,
           'inactive' as status,
           'user1@test.com' as email,
           CURRENT_TIMESTAMP() as updated_at
)
SELECT * FROM source_data
  1. Run commands:

dbt run --select stg_test_users
dbt snapshot
  1. Verify in Snowflake:

SELECT * FROM your_database.snapshots.users_snapshot 
WHERE user_id = 1 
ORDER BY dbt_valid_from;

Expected Result:

user_id | status   | email          | updated_at | dbt_valid_from | dbt_valid_to
1       | active   | user1@test.com | [time1]    | [time1]        | [time2]
1       | inactive | user1@test.com | [time2]    | [time2]        | null

Now we see the history: the user's original 'active' status is preserved with an end date ([time2]), and a new record shows their current 'inactive' status.

Test Check Strategy

  1. Update snapshot config:

-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='user_id',
      strategy='check',
      check_cols=['status', 'email']
    )
}}

select * from {{ ref('stg_test_users') }}
{% endsnapshot %}
  1. Update source data:

-- models/staging/stg_test_users.sql
WITH source_data AS (
    SELECT 1 as user_id,
           'inactive' as status,
           'newemail@test.com' as email,
           CURRENT_TIMESTAMP() as updated_at
)
SELECT * FROM source_data
  1. Run commands:

dbt run --select stg_test_users
dbt snapshot
  1. Verify in Snowflake:

SELECT * FROM your_database.snapshots.users_snapshot 
ORDER BY user_id, dbt_valid_from;

Expected Result:

user_id | status   | email              | dbt_valid_from | dbt_valid_to
1       | active   | user1@test.com     | [time1]        | [time2]
1       | inactive | user1@test.com     | [time2]        | [time3]
1       | inactive | newemail@test.com  | [time3]        | null

This shows the complete history: the original status, the status change, and finally the email change. Each record shows when it was valid, creating a complete historical trail of changes.


Meta Fields

As of dbt version ≥ 1.9, snapshots add four tracking columns:

  • dbt_valid_from: When this version became valid

  • dbt_valid_to: When this version became invalid (null for current version)

  • dbt_updated_at: Timestamp when the snapshot was taken

  • dbt_scd_id: Unique identifier for each version


Best Practices

1. Snapshot Frequency

  • Execute dbt snapshot before running dependent models

  • Schedule regular snapshots in production based on data change frequency

  • Consider storage costs versus historical data needs

3. Query Historical Data

To access historical data from the snapshot table, filter to see the state of a specific record (order_id 123) as it existed on a given date (2024-01-01).

-- Get data as it looked on a specific date
SELECT *
FROM {{ ref('orders_snapshot') }}
WHERE order_id = 123
  AND dbt_valid_from <= '2024-01-01'
  AND (dbt_valid_to > '2024-01-01' OR dbt_valid_to IS NULL)

Last updated