Snapshots
Snapshots track historical changes in your data warehouse by implementing Type 2 Slowly Changing Dimensions (SCD). Instead of overwriting data, snapshots create new records while preserving the history of changes.
Understanding Slowly Changing Dimensions
In data warehousing, different strategies exist for handling data that changes over time:
Type 0 (Retain Original): The dimension never changes once created
Type 1 (Overwrite): Updates replace the original values with no history kept
Type 2 (Add New Row): Preserves history by creating new records when changes occur
Type 3 (Add New Attribute): Maintains limited history by adding columns for previous values
dbt snapshots implement Type 2 SCD, which is ideal for tracking changes to data that updates infrequently and requires historical tracking, such as:
Customer status changes
Product pricing and categorization
Order processing states
Employee roles and departments
Here's an example of how snapshots preserve history when a customer's status changes:
Using this snapshot, you can query the customer's status on any specific date.
Basic Configuration
Prerequisites
Before setting up snapshots:
Create a target schema for snapshots in your data warehouse
Ensure source data has either reliable timestamps or columns to track for changes
Set up the
snapshots/
directory in your dbt project
Snapshots are defined in .sql
files within your project's snapshots
directory:
To execute the snapshot, run:
This command should be run whenever you want to capture changes, both in development and production environments.
Snapshot Strategies
dbt offers two primary strategies to detect changes in your data:
Uses a timestamp column to detect changes. This is the best choice when your source data reliably updates a timestamp column when records change.
When to use:
Source data has a reliable updated_at column
Need to track when changes occurred
Want to capture all changes based on timing
Configuration
The updated_at parameter specifies which column contains the last-updated timestamp.
How Snapshots Work Behind the Scenes
When you run dbt snapshot
, dbt:
Creates the snapshot table if it doesn't exist
For existing tables, determines which records have changed based on your strategy
Marks previously current records that changed as no longer current (
dbt_valid_to
gets timestamp)Inserts new versions of changed records as current (with
dbt_valid_to
as null)Inserts completely new records
This process maintains a complete history of all changes while ensuring current data is easily identifiable.
Metadata Fields
As of dbt version β₯ 1.9, snapshots add four tracking columns:
dbt_valid_from
: When this version became validdbt_valid_to
: When this version became invalid (null for current version)dbt_updated_at
: Timestamp when the snapshot was takendbt_scd_id
: Unique identifier for each version
These columns allow you to:
Identify current records (
dbt_valid_to IS NULL
)Find records valid at a specific point in time
Determine when changes occurred
Track the duration a particular version was active
Step-by-Step Example
Let's implement a simple snapshot to track customer status changes:
1. Create a staging model for the source data
2. Create a snapshot file
3. Run the snapshot for the first time
This creates the initial snapshot table with all customers.
4. Simulate a data change in the source
When customer data changes in your source system, run the snapshot again to capture those changes:
5. Query historical and current data
Advanced Configuration
Invalidating Hard Deletes
By default, snapshots don't track when records are deleted from the source. To track deletions:
With invalidate_hard_deletes=true
, dbt will:
Identify records in the snapshot that no longer exist in the source
Set their
dbt_valid_to
timestamps to mark them as no longer current
Custom Snapshot Schemas
You can dynamically set the target schema:
This allows you to use variables to change the schema at runtime.
Best Practices for Snapshots
1. Snapshot Frequency
Schedule snapshots based on how frequently your source data changes and how important it is to capture every change
For critical data, run snapshots before dependent models to ensure they use the latest history
Consider storage costs versus historical data needs
2. Performance Optimization
Use a dedicated schema for snapshots to make maintenance easier
Apply appropriate indexes to snapshot tables for faster querying
Consider partitioning large snapshot tables by date
For very large tables, use incremental snapshots with time-based filters
3. Querying Historical Data
To access historical data, filter to see the state of a specific record as it existed on a given date:
4. Common Pitfalls to Avoid
Infrequent snapshots: Running snapshots too infrequently might miss intermediate state changes
Missing source filter: Always filter your source query to include only necessary data
Unreliable timestamps: Ensure your
updated_at
field actually updates when records changeUsing snapshots for high-frequency changes: Consider incremental models for data that changes very frequently
Using Snapshots in Your Data Architecture
Snapshots typically fit into your data architecture as follows:
By effectively implementing snapshots, you create a robust history tracking system that supports historical reporting, audit requirements, and trend analysisβall while maintaining the simplicity and reproducibility that makes dbt powerful.
Last updated
Was this helpful?