Anomaly tests parameters

Introduction

Elementary data anomaly detection tests monitor specific metrics and compare recent values to historical data to detect significant changes that may indicate data reliability issues. This page outlines the parameters available for configuring these tests.

If your dataset doesn't have a timestamp column representing the creation time of each row, it is highly recommended to add one (e.g. date_added_dttm). This allows Elementary to create time buckets and filter the table effectively.

Parameters overview

Below is a list of all parameters available for each type of test provided by Elementary.

Common Parameters for All Anomaly Detection Tests

timestamp_column: column name
where_expression: sql expression
anomaly_sensitivity: [int]
anomaly_direction: [both | spike | drop]
ignore_small_changes:
  spike_failure_percent_threshold: int
  drop_failure_percent_threshold: int
anomaly_exclude_metrics: [SQL expression]

Anomaly Detection Tests With Timestamp Column

training_period:
  period: [day | week | month]
  count: int
detection_period:
  period: [day | week | month]
  count: int
time_bucket:
  period: [hour | day | week | month]
  count: int
seasonality: [day_of_week | hour_of_day | hour_of_week]
detection_delay:
  period: [hour | day | week | month]
  count: int

Volume Anomaly Tests

fail_on_zero: [true | false]

All columns anomalies test

column_anomalies: column monitors list
exclude_prefix: string
exclude_regexp: regex

Dimension anomalies test

dimensions: sql expression

Event freshness anomalies

event_timestamp_column: column name
update_timestamp_column: column name

Example configurations

Models - template:

version: 2

models:
  - name: <model_name>
    config:
      elementary:
        timestamp_column: < model timestamp column >
    tests: < here you will add elementary monitors as tests >

  - name: <your model with no timestamp>
    ## if no timestamp is configured, elementary will monitor without time filtering
    tests: <here you will add elementary monitors as tests>
Models - example:

version: 2

models:
  - name: login_events
    config:
      elementary:
        timestamp_column: updated_at
    tests:
      - elementary.freshness_anomalies:
          tags: ["elementary"]
      - elementary.all_columns_anomalies:
          tags: ["elementary"]

  - name: users
    ## if no timestamp is configured, elementary will monitor without time filtering
    tests:
      - elementary.volume_anomalies:
          tags: ["elementary"]
Sources - template:

sources:
  - name: < some name >
    database: < database >
    schema: < schema >
    tables:
      - name: < table_name >
        ## sources don't have config, so elementary config is placed under 'meta'
        meta:
          elementary:
            timestamp_column: < source timestamp column >
        tests: <here you will add elementary monitors as tests>
Sources - example:

sources:
  - name: "my_non_dbt_table"
    database: "raw_events"
    schema: "product"
    tables:
      - name: "raw_product_login_events"
        ## sources don't have config, so elementary config is placed under 'meta'
        meta:
          elementary:
            timestamp_column: "loaded_at"
        tests:
          - elementary.volume_anomalies
          - elementary.all_columns_anomalies:
              column_anomalies:
                - null_count
                - missing_count
                - zero_count
        columns:
          - name: user_id
            tests:
              - elementary.column_anomalies

Parameters details

Below is a list of each parameter and configuration details.
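Throughout the parameter details, consecutive example blocks show the same parameter applied at different configuration levels: directly on a test, on a model (under config: elementary:), on a source table (under meta: elementary:), and as a project-wide default (under vars:, which in dbt typically live in dbt_project.yml).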

timestamp_column

timestamp_column: [column name]

If your data set has a timestamp column that represents the creation time of each row, it is highly recommended to configure it as the timestamp_column.

Elementary anomaly detection tests will use this column to create time buckets and filter the table. The best column for this is an updated_at / created_at / loaded_at timestamp for each row (a date type also works).

  • When you specify a timestamp_column, each time the test runs it splits the data into buckets according to the timestamp in this column, calculates the metric for each bucket, and checks for anomalies between these buckets. This also means that if the table has enough historical data, the test can start working right away.

  • When you do not specify a timestamp_column, each time the test runs it calculates the metric for all of the data in the table and checks for anomalies against the metric values from previous runs. This also means that it will take training_period days for the test to start working, as it needs that time to collect the necessary metrics.

If undefined, default is null (no time buckets).

Default: none

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          timestamp_column: created_at
models:
  - name: this_is_a_model
    config:
      elementary:
        timestamp_column: updated_at
sources:
  - name: my_non_dbt_tables
    schema: raw
    tables:
      - name: source_table
        meta:
          elementary:
            timestamp_column: loaded_at
vars:
  timestamp_column: loaded_at

where_expression

where_expression: [sql expression]

Filter the tested data using a valid sql expression.

Default: None

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          where_expression: "user_name != 'test'"
models:
  - name: this_is_a_model
    config:
      elementary:
        where_expression: "loaded_at is not null"
vars:
  where_expression: "loaded_at > '2022-01-01'"

anomaly_sensitivity

anomaly_sensitivity: [int]

Configuration to define how the expected range is calculated. A sensitivity of 3 means that the expected range is within 3 standard deviations from the average of the training set. Smaller sensitivity means this range will be reduced and more values will be potentially flagged as anomalies. Larger values will have the opposite effect and will reduce the number of anomalies as the expected range will be larger.
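As an illustration: if the training set for a volume test has an average of 1,000 rows per bucket with a standard deviation of 50, a sensitivity of 3 gives an expected range of roughly 850-1,150 rows per bucket; lowering the sensitivity to 2 narrows the range to roughly 900-1,100, so more buckets can be flagged as anomalies.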

Default: 3

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          anomaly_sensitivity: 2.5

      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_sensitivity: 4
models:
  - name: this_is_a_model
    config:
      elementary:
        anomaly_sensitivity: 3.5
vars:
  anomaly_sensitivity: 3

anomaly_direction

anomaly_direction: both | spike | drop

By default, data points are compared to the expected range to check whether they fall below or above it. For some data monitors, you might only want to flag anomalies that are above the range, or only those below it. For example, when monitoring freshness we only want to detect data delays, not data that arrives "early". The anomaly_direction configuration controls the direction of the expected range and can be set to both, spike or drop.

Default: both

Supported values: both, spike, drop

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          anomaly_direction: drop

      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_direction: spike
models:
  - name: this_is_a_model
    config:
      elementary:
        anomaly_direction: drop
vars:
  anomaly_direction: both
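
Following the freshness use case above, a freshness test could be restricted to delays only. A minimal sketch, reusing the login_events model from the example configurations and assuming a delay shows up as a spike in the monitored freshness metric:

models:
  - name: login_events
    tests:
      - elementary.freshness_anomalies:
          ## illustrative: only spikes (delays) are flagged, drops are ignored
          anomaly_direction: spike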

training_period

training_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

The maximal timeframe for which the test collects data. This timeframe includes both the training period and the detection period. If a detection delay is defined, the whole window is shifted back by that delay.

Default: 14 days

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          training_period:
            period: day
            count: 30
models:
  - name: this_is_a_model
    config:
      elementary:
        training_period:
          period: week
          count: 1
vars:
  training_period:
    period: month
    count: 1

detection_period

detection_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

Configuration to define the detection period. If detection_period is set to 2 days, only data points from the last 2 days are included in the detection period and can be flagged as anomalous. If it is set to 7 days, the detection period will be 7 days long.

For incremental models, this is also the period for re-calculating metrics. If metrics for buckets in the detection period were already calculated, Elementary will overwrite them. This is done to monitor recent backfills of data, if there were any. Adjust this configuration according to your data delays.

Default: 2 days

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          detection_period:
            period: day
            count: 30
models:
  - name: this_is_a_model
    config:
      elementary:
        detection_period:
          period: month
          count: 1
vars:
  detection_period:
    period: week
    count: 2

time_bucket

time_bucket:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

This configuration controls the duration of the time buckets.

To calculate how data changes over time and detect issues, we split the data into consistent time buckets. For example, if we use a daily time bucket (period: day, count: 1) and monitor for row count anomalies, we will count new rows per day.

Depending on the nature of your data, it may make sense to modify this parameter. For example, if you want to detect volume anomalies at an hourly resolution, set the time bucket to period: hour and count: 1.

Default: daily buckets. time_bucket: {period: day, count: 1}

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          time_bucket:
            period: day
            count: 2
models:
  - name: this_is_a_model
    config:
      elementary:
        time_bucket:
          period: hour
          count: 4
vars:
  time_bucket:
    period: hour
    count: 12

seasonality

seasonality: day_of_week | hour_of_day | hour_of_week

Some data sets have patterns that repeat over a time period and are expected; this is the normal behavior of these data sets. When we try to detect outliers from the normal and expected range, ignoring these patterns might cause false positives or make us miss anomalies. The seasonality configuration is used to overcome this challenge and account for expected patterns.

Supported seasonality configurations:

  • day_of_week - Uses the same day of week as a training set for each daily bucket (compares Sundays to Sundays, Mondays to Mondays, etc.).

  • hour_of_day - Uses the same hour as a training set for each hourly bucket (for example, compares 10:00-11:00 AM to 10:00-11:00 AM on previous days, instead of any previous hour).

  • hour_of_week - Uses the same hour and day of week as a training set for each hourly bucket (for example, compares 10:00-11:00 AM on Sunday to 10:00-11:00 AM on previous Sundays).

Use case:

Many data sets have lower volume over the weekend, and higher volume over the week days. This means that the expected range for different days of the week is different. The day_of_week seasonality uses the same day of week as a training set for each daily time bucket data point. The expected range for Monday will be based on a training set of previous Mondays, and so on.

Default: none

Supported values: day_of_week, hour_of_day, hour_of_week

Relevant tests: Anomaly detection tests with timestamp_column and 1 day time_bucket

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          seasonality: day_of_week
models:
  - name: this_is_a_model
    config:
      elementary:
        seasonality: day_of_week
vars:
  seasonality: day_of_week

column_anomalies

column_anomalies: [column monitors list]

Select which monitors to activate as part of the test.

Default: default monitors

Relevant tests: all_columns_anomalies, column_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.column_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - average

Default monitors by type:

Data quality metric | Column Type
null_count          | any
null_percent        | any
min_length          | string
max_length          | string
average_length      | string
missing_count       | string
missing_percent     | string
min                 | numeric
max                 | numeric
average             | numeric
zero_count          | numeric
zero_percent        | numeric
standard_deviation  | numeric
variance            | numeric

Opt-in monitors by type:

Data quality metric | Column Type
sum                 | numeric
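
Since sum appears only in the opt-in list, it runs only when listed explicitly under column_anomalies. A minimal sketch, assuming a hypothetical numeric column named revenue:

models:
  - name: this_is_a_model
    columns:
      - name: revenue   ## hypothetical numeric column
        tests:
          - elementary.column_anomalies:
              column_anomalies:
                - sum
                - average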


exclude_prefix

exclude_prefix: [string]

Parameter for the all_columns_anomalies test only; it excludes columns from the test based on a prefix match.

Default: None

Relevant tests: all_columns_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.all_columns_anomalies:
          exclude_prefix: "id_"

exclude_regexp

exclude_regexp: [regex]

Parameter for the all_columns_anomalies test only; it excludes columns from the test based on a regular expression match.

Default: None

Relevant tests: all_columns_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.all_columns_anomalies:
          exclude_regexp: ".*SDC$"

dimensions

dimensions: [list of SQL expressions]

Configuration for the dimension_anomalies test. The test counts rows grouped by a given column, columns, or valid SQL select expression. Under dimensions you configure the group-by expression.

This test monitors the frequency of values in the configured dimension over time, and alerts on unexpected changes in the distribution. It is best to configure it on low-cardinality fields.

Default: None

Relevant tests: dimension_anomalies

Configuration level: test

Example configuration:

models:
  - name: model_name
    config:
      elementary:
        timestamp_column: updated_at
    tests:
      - elementary.dimension_anomalies:
          dimensions:
            - device_os
            - device_browser
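
Because a dimension can be any valid SQL select expression, you can also group by a derived value. A sketch with hypothetical columns event_type and is_paying_user:

models:
  - name: model_name
    tests:
      - elementary.dimension_anomalies:
          dimensions:
            - event_type   ## hypothetical low-cardinality column
            - "case when is_paying_user then 'paying' else 'free' end"   ## hypothetical derived dimension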

event_timestamp_column

event_timestamp_column: [column name]

Configuration for the event_freshness_anomalies test. This test complements the freshness_anomalies test and is primarily intended for data that is updated in a continuous / streaming fashion.

The test can work in a couple of modes:

  • If only an event_timestamp_column is supplied, the test measures over time the difference between the current timestamp (“now”) and the most recent event timestamp.

  • If both an event_timestamp_column and an update_timestamp_column are provided, the test will measure over time the difference between these two columns.

Default: None

Relevant tests: event_freshness_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: "event_timestamp"
          update_timestamp_column: "created_at"
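
For the first mode, where only an event_timestamp_column is supplied, the configuration reduces to a minimal sketch like:

models:
  - name: this_is_a_model
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: "event_timestamp"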

update_timestamp_column

update_timestamp_column: [column name]

Configuration for the event_freshness_anomalies test. This test complements the freshness_anomalies test and is primarily intended for data that is updated in a continuous / streaming fashion.

The test can work in a couple of modes:

  • If only an event_timestamp_column is supplied, the test measures over time the difference between the current timestamp (“now”) and the most recent event timestamp.

  • If both an event_timestamp_column and an update_timestamp_column are provided, the test will measure over time the difference between these two columns.

Default: None

Relevant tests: event_freshness_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: "event_timestamp"
          update_timestamp_column: "created_at"

ignore_small_changes

ignore_small_changes:
  spike_failure_percent_threshold: [int]
  drop_failure_percent_threshold: [int]

If defined, an anomaly test will fail only if all the following conditions hold:

  • The z-score of the metric within the detection period is anomalous.

  • One of the following holds:

    • The metric within the detection period is higher than the training period's mean value by at least spike_failure_percent_threshold percent, if defined.

    • The metric within the detection period is lower than the training period's mean value by at least drop_failure_percent_threshold percent, if defined.

These settings help in situations where your metrics are stable and small changes cause high z-scores, and therefore anomalies.

If undefined, the default is null for both thresholds.

Default: none

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          ignore_small_changes:
            spike_failure_percent_threshold: 2
            drop_failure_percent_threshold: 50
models:
  - name: this_is_a_model
    config:
      elementary:
        ignore_small_changes:
          spike_failure_percent_threshold: 2
sources:
  - name: my_non_dbt_tables
    schema: raw
    tables:
      - name: source_table
        meta:
          elementary:
            ignore_small_changes:
              drop_failure_percent_threshold: 50
vars:
  ignore_small_changes:
    spike_failure_percent_threshold: 10

fail_on_zero

fail_on_zero: true/false

When set to true, the anomaly detection test will fail if a metric value of zero is found within the detection period. If undefined, the default is false.

Default: false

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          fail_on_zero: true
models:
  - name: this_is_a_model
    config:
      elementary:
        fail_on_zero: true
vars:
  fail_on_zero: true

detection_delay

detection_delay:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

The duration by which to push back the detection period. This is useful in cases where the latest data should be excluded from the test, for example because of scheduling issues, when the test runs before the table has been populated. The detection delay is the period of time to ignore at the end of the detection period.

Default: 0

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          detection_delay:
            period: day
            count: 1
models:
  - name: this_is_a_model
    config:
      elementary:
        detection_delay:
          period: day
          count: 1
vars:
  detection_delay:
    period: day
    count: 1

anomaly_exclude_metrics

anomaly_exclude_metrics: [SQL where expression on fields metric_date / metric_time_bucket / metric_value]

By default, data points are compared to all of the data points in the training set. Using this param, you can exclude metrics from the training set to improve the test accuracy.

The filter can be configured using an SQL where expression syntax, and the following fields:

  1. metric_date - The date of the relevant bucket (even if the bucket is not daily).

  2. metric_time_bucket - The exact time bucket.

  3. metric_value - The value of the metric.

Supported values: valid SQL where expression on the columns metric_date / metric_time_bucket / metric_value

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          anomaly_exclude_metrics: metric_value < 10

      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_exclude_metrics: metric_time_bucket >= '2023-10-01 06:00:00' and metric_time_bucket <= '2023-10-01 07:00:00'
models:
  - name: this_is_a_model
    config:
      elementary:
        anomaly_exclude_metrics: metric_date = '2023-10-01'
vars:
  anomaly_exclude_metrics: metric_date = '2023-10-01'

How the time-related parameters work

How training_period works

The training_period param only works for tests that have a timestamp_column configuration.

It works differently according to the table materialization:

  • Regular tables and views - the values for the full training_period are calculated on each run.

  • Incremental models and sources - the values for the full training_period are calculated on the first test run and on full refresh. Subsequent test runs only calculate the values for the detection_period.

Changes from default:

  • Full time buckets - Elementary will increase the training_period automatically to ensure full time buckets. For example, if the time_bucket of the test is period: week and a 14-day training_period would start on a Tuesday, the test will collect 2 more days back to complete a full week (starting on Sunday).

  • Seasonality training set - if seasonality is configured, Elementary will increase the training_period automatically to ensure there are enough training set values to calculate an anomaly. For example, if the seasonality of the test is day_of_week, training_period will be increased to ensure enough Sundays, Mondays, Tuesdays, etc. to calculate an anomaly for each.

The impact of changing training_period:

If you increase training_period, your test training set will be larger. This means a larger sample size for calculating the expected range, which should make the test less sensitive to outliers. This means less chance of false positive anomalies, but also less sensitivity, so anomalies have a higher threshold.

If you decrease training_period, your test training set will be smaller. This means a smaller sample size for calculating the expected range, which might make the test more sensitive to outliers. This means more chance of false positive anomalies, but also more sensitivity, as anomalies have a lower threshold.

How detection_period works

The detection_period param only works for tests that have a timestamp_column configuration.

It works differently according to the table materialization:

  • Regular tables and views - detection_period defines the detection period.

  • Incremental models and sources - detection_period defines the detection period, and the period for which metrics will be re-calculated.

How time_bucket works

  • The training_period and detection_period of the test might be extended to ensure full time buckets (for example, a full week Sunday-Saturday).

  • Weekly buckets start on the day configured as the week start in the data warehouse.

How seasonality works

  • The test compares the value of a bucket to previous buckets with the same seasonality attribute, not to the adjacent previous data points.

  • The training_period of the test is changed by default to ensure a minimal training set. When seasonality: day_of_week is configured, training_period is by default multiplied by 7.

How detection_delay works

The detection_delay param only works for tests that have a timestamp_column configuration. It does not affect the other duration parameters, such as detection_period or training_period.
