Snapshots

Snapshots track historical changes in your data warehouse by implementing Type 2 Slowly Changing Dimensions (SCD). Instead of overwriting data, snapshots create new records while preserving the history of changes.

Understanding Slowly Changing Dimensions

In data warehousing, different strategies exist for handling data that changes over time:

  • Type 0 (Retain Original): The dimension never changes once created

  • Type 1 (Overwrite): Updates replace the original values with no history kept

  • Type 2 (Add New Row): Preserves history by creating new records when changes occur

  • Type 3 (Add New Attribute): Maintains limited history by adding columns for previous values

dbt snapshots implement Type 2 SCD, which is ideal for tracking changes to data that updates infrequently and requires historical tracking, such as:

  • Customer status changes

  • Product pricing and categorization

  • Order processing states

  • Employee roles and departments

Here's an example of how snapshots preserve history when a customer's status changes:

-- Initial record
customer_id   status     updated_at    dbt_valid_from   dbt_valid_to
101           active     2023-06-15    2023-06-15       null

-- After status change with snapshots
customer_id   status     updated_at    dbt_valid_from   dbt_valid_to
101           active     2023-06-15    2023-06-15       2023-08-20    -- Historical record
101           inactive   2023-08-20    2023-08-20       null          -- Current record

Using this snapshot, you can query the customer's status on any specific date.


Basic Configuration

Prerequisites

Before setting up snapshots:

  • Create a target schema for snapshots in your data warehouse

  • Ensure source data has either reliable timestamps or columns to track for changes

  • Set up the snapshots/ directory in your dbt project

Snapshots are defined in .sql files within your project's snapshots directory:

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('orders', 'orders') }}

{% endsnapshot %}

To execute the snapshot, run:

dbt snapshot

This command should be run whenever you want to capture changes, both in development and production environments.
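You can also run a single snapshot instead of all of them by using node selection; the example below targets the orders_snapshot defined above:

dbt snapshot --select orders_snapshot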


Snapshot Strategies

dbt offers two primary strategies to detect changes in your data:

Timestamp Strategy

Uses a timestamp column to detect changes. This is the best choice when your source data reliably updates a timestamp column whenever records change.

When to use:

  • Source data has a reliable updated_at column

  • Need to track when changes occurred

  • Want to capture all changes based on timing

Configuration

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Your SQL query

{% endsnapshot %}

The updated_at parameter specifies which column contains the last-updated timestamp.

Check Strategy

Detects changes by comparing specific column values. Use this strategy when you don't have reliable timestamps or only want to track changes to certain columns.

When to use:

  • No reliable updated_at column

  • Only want to track specific column changes

  • Need control over what constitutes a change

Configuration

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='check',
      check_cols=['status', 'amount']  -- or 'all'
    )
}}

The check_cols parameter can be:

  • A list of specific columns to check for changes

  • The string 'all' to check all columns {% endtab %} {% endtabs %}


How Snapshots Work Behind the Scenes

When you run dbt snapshot, dbt:

  1. Creates the snapshot table if it doesn't exist

  2. For existing tables, determines which records have changed based on your strategy

  3. Marks previously current records that have changed as no longer current (their dbt_valid_to is set to the snapshot timestamp)

  4. Inserts new versions of changed records as current (with dbt_valid_to as null)

  5. Inserts completely new records

This process maintains a complete history of all changes while ensuring current data is easily identifiable.


Metadata Fields

Snapshots add four metadata columns to every snapshot table:

  • dbt_valid_from: When this version became valid

  • dbt_valid_to: When this version became invalid (null for current version)

  • dbt_updated_at: Timestamp when the snapshot was taken

  • dbt_scd_id: Unique identifier for each version

These columns allow you to:

  • Identify current records (dbt_valid_to IS NULL)

  • Find records valid at a specific point in time

  • Determine when changes occurred

  • Track the duration a particular version was active


Step-by-Step Example

Let's implement a simple snapshot to track customer status changes:

1. Create a staging model for the source data

-- models/staging/stg_customers.sql
SELECT
    customer_id,
    name,
    status,
    email,
    updated_at
FROM {{ source('crm', 'customers') }}

2. Create a snapshot file

-- snapshots/customer_snapshots.sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}

3. Run the snapshot for the first time

dbt snapshot

This creates the initial snapshot table with all customers.

4. Simulate a data change in the source

When customer data changes in your source system, run the snapshot again to capture those changes:

dbt snapshot

5. Query historical and current data

-- Get current customer data
SELECT * FROM snapshots.customers_snapshot
WHERE dbt_valid_to IS NULL;

-- Get customer data as it existed on a specific date
SELECT * FROM snapshots.customers_snapshot
WHERE customer_id = 123
  AND dbt_valid_from <= '2023-06-01'
  AND (dbt_valid_to > '2023-06-01' OR dbt_valid_to IS NULL);
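Because a snapshot is a regular dbt node, downstream models can ref it like any other relation. A hypothetical model that exposes only the current version of each customer might look like this:

-- models/marts/dim_customers_current.sql (illustrative)
SELECT
    customer_id,
    name,
    status,
    email
FROM {{ ref('customers_snapshot') }}
WHERE dbt_valid_to IS NULL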

Advanced Configuration

Invalidating Hard Deletes

By default, snapshots don't track when records are deleted from the source. To track deletions:

{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=true
    )
}}

-- Your SQL query

{% endsnapshot %}

With invalidate_hard_deletes=true, dbt will:

  • Identify records in the snapshot that no longer exist in the source

  • Set their dbt_valid_to timestamps to mark them as no longer current

Custom Snapshot Schemas

You can dynamically set the target schema:

{% snapshot orders_snapshot %}

{{
    config(
        target_schema=var('snapshot_schema', 'snapshots'),
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Your SQL query

{% endsnapshot %}

This allows you to use variables to change the schema at runtime.
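For example, you could point snapshots at a development schema for a one-off run (the schema name here is just an example):

dbt snapshot --vars '{"snapshot_schema": "snapshots_dev"}'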


Best Practices for Snapshots

1. Snapshot Frequency

  • Schedule snapshots based on how frequently your source data changes and how important it is to capture every change

  • For critical data, run snapshots before dependent models to ensure they use the latest history

  • Consider storage costs versus historical data needs

# Example schedule.yml for a daily snapshot
jobs:
  - name: daily_snapshot
    schedule: "0 1 * * *"  # 1:00 AM daily
    steps:
      - dbt snapshot
      - dbt run --models dependent_on_snapshots

2. Performance Optimization

  • Use a dedicated schema for snapshots to make maintenance easier

  • Apply appropriate indexes to snapshot tables for faster querying

  • Consider partitioning large snapshot tables by date

  • For very large tables, use incremental snapshots with time-based filters

3. Querying Historical Data

To access historical data, filter for the state of a specific record as it existed on a given date:

-- Get data as it looked on a specific date
SELECT *
FROM {{ ref('customers_snapshot') }}
WHERE customer_id = 123
  AND dbt_valid_from <= '2023-06-01'
  AND (dbt_valid_to > '2023-06-01' OR dbt_valid_to IS NULL)

4. Common Pitfalls to Avoid

  • Infrequent snapshots: Running snapshots too infrequently might miss intermediate state changes

  • Missing source filter: Always filter your source query to include only necessary data

  • Unreliable timestamps: Ensure your updated_at field actually updates when records change

  • Using snapshots for high-frequency changes: Consider incremental models for data that changes very frequently


Using Snapshots in Your Data Architecture

Snapshots typically fit into your data architecture as follows:

dbt project
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ staging/         # Simple transformations of source data
β”‚   β”œβ”€β”€ intermediate/    # Business logic transformations
β”‚   └── marts/           # Business-level output models
β”œβ”€β”€ snapshots/           # Historical tracking of changing data
β”‚   β”œβ”€β”€ customer_snapshots.sql
β”‚   └── product_snapshots.sql
└── analyses/            # SQL for historical analysis using snapshots
    └── customer_status_history.sql

By effectively implementing snapshots, you create a robust history tracking system that supports historical reporting, audit requirements, and trend analysisβ€”all while maintaining the simplicity and reproducibility that makes dbt powerful.
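As an illustration, the analysis file referenced above might reconstruct each customer's status timeline directly from the snapshot:

-- analyses/customer_status_history.sql (illustrative)
SELECT
    customer_id,
    status,
    dbt_valid_from AS status_started_at,
    dbt_valid_to   AS status_ended_at
FROM {{ ref('customers_snapshot') }}
ORDER BY customer_id, dbt_valid_from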
