Anomaly tests parameters

Introduction

Elementary data anomaly detection tests monitor specific metrics and compare recent values to historical data to detect significant changes that may indicate data reliability issues. This page outlines the parameters available for configuring these tests.

If your dataset doesn't have a timestamp column representing the creation time of a field, it's highly recommended to configure one (e.g. date_added_dttm). This allows Elementary to create time buckets and filter the table effectively.

Parameters overview

Below is a list of all parameters available for each type of test provided by Elementary.

Common Parameters for All Anomaly Detection Tests

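Parameters: timestamp_column, where_expression, anomaly_sensitivity, anomaly_direction, ignore_small_changes, fail_on_zero, anomaly_exclude_metrics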

Anomaly Detection Tests With Timestamp Column:

Parameters: training_period, detection_period, time_bucket, seasonality, detection_delay

Volume Anomaly Tests


All columns anomalies test:

Parameters: column_anomalies, exclude_prefix, exclude_regexp

Dimension anomalies test:

Parameters: dimensions

Event freshness anomalies:

Parameters: event_timestamp_column, update_timestamp_column

Example configurations

version: 2

models:
  - name: <model_name>
    config:
      elementary:
        timestamp_column: < model timestamp column >
    tests: < here you will add elementary monitors as tests >

  - name: <your model with no timestamp>
    ## if no timestamp is configured, elementary will monitor without time filtering
    tests: <here you will add elementary monitors as tests>

Parameters details

Below is a list of each parameter and configuration details.

timestamp_column

timestamp_column: [column name]

If your data set has a timestamp column that represents the creation time of a field, it is highly recommended to configure it as the timestamp_column.

Elementary anomaly detection tests use this column to create time buckets and filter the table. The best choice is an updated_at, created_at, or loaded_at timestamp for each row (a date type also works).

  • When you specify a timestamp_column, each test run splits the data into buckets according to the timestamp in this column, calculates the metric for each bucket, and checks for anomalies between these buckets. This also means that if the table has enough historical data, the test can start working right away.

  • When you do not specify a timestamp_column, each test run calculates the metric for all of the data in the table and checks for anomalies against the metric values from previous runs. This also means that it will take the test training_period days to start working, as it needs time to collect the necessary metrics.

If undefined, default is null (no time buckets).

Default: none

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          timestamp_column: created_at

where_expression

where_expression: [sql expression]

Filter the tested data using a valid SQL expression.

Default: None

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          where_expression: "user_name != 'test'"

anomaly_sensitivity

anomaly_sensitivity: [number]

Configuration to define how the expected range is calculated. A sensitivity of 3 means that the expected range is within 3 standard deviations from the average of the training set. Smaller sensitivity means this range will be reduced and more values will be potentially flagged as anomalies. Larger values will have the opposite effect and will reduce the number of anomalies as the expected range will be larger.
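As a rough illustration of the description above, the expected range can be sketched in Python. This is a simplified model and not Elementary's implementation: the function name expected_range and the sample values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def expected_range(training_values, anomaly_sensitivity=3):
    """Return (lower, upper): mean +/- sensitivity * standard deviation."""
    avg = mean(training_values)
    std = pstdev(training_values)  # assumption: population standard deviation
    return avg - anomaly_sensitivity * std, avg + anomaly_sensitivity * std


# Hypothetical daily row counts used as the training set.
training = [100, 102, 98, 101, 99, 100, 100]
wide = expected_range(training, anomaly_sensitivity=3)
narrow = expected_range(training, anomaly_sensitivity=2)
```

With the same training set, the range for a sensitivity of 2 sits strictly inside the range for a sensitivity of 3, so more values fall outside it.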

Default: 3

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          anomaly_sensitivity: 2.5

      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_sensitivity: 4

anomaly_direction

anomaly_direction: both | spike | drop

By default, data points are compared to the expected range and flagged if they fall below or above it. For some data monitors, you might want to flag anomalies only when they are above the range and not below it, or vice versa. For example, when monitoring freshness, we only want to detect data delays and not data that is “early”. The anomaly_direction configuration sets which side of the expected range is checked, and can be set to both, spike or drop.
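The effect of the three values can be sketched as a small Python function. The name is_anomalous and the bounds in the usage lines are hypothetical; this only illustrates the comparison described above and is not Elementary's code.

```python
def is_anomalous(value, lower, upper, anomaly_direction="both"):
    """Check a data point against the expected range in the given direction."""
    if anomaly_direction == "spike":
        return value > upper  # only values above the range are flagged
    if anomaly_direction == "drop":
        return value < lower  # only values below the range are flagged
    if anomaly_direction == "both":
        return value < lower or value > upper
    raise ValueError(f"unsupported anomaly_direction: {anomaly_direction}")


# Hypothetical expected range of 90 to 110.
low_value_with_spike = is_anomalous(50, 90, 110, anomaly_direction="spike")
low_value_with_drop = is_anomalous(50, 90, 110, anomaly_direction="drop")
```

A value far below the range is ignored when the direction is spike and flagged when it is drop or both.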

Default: both

Supported values: both, spike, drop

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          anomaly_direction: drop

      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_direction: spike

training_period

training_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

The maximal timeframe for which the test will collect data. This timeframe includes both the training period and the detection period. If a detection delay is defined, the whole training period is delayed accordingly.

Default: 14 days

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          training_period:
            period: day
            count: 30
💡 How does it work?

The training_period param only works for tests that have timestamp_column configuration.

It works differently according to the table materialization:

  • Regular tables and views - The values for the full training_period are calculated on each run.

  • Incremental models and sources - The values for the full training_period are calculated on the first test run and on full refresh. Subsequent test runs only calculate the values for the detection_period.

Changes from default:

  • Full time buckets - Elementary will increase the training_period automatically to ensure full time buckets. For example, if the time_bucket of the test is period: week and a 14-day training_period starts on a Tuesday, the test will collect 2 more days back to complete the week (starting on Sunday).

  • Seasonality training set - If seasonality is configured, Elementary will increase the training_period automatically to ensure there are enough training set values to calculate an anomaly. For example if the seasonality of the test is day_of_week, training_period will be increased to ensure enough Sundays, Mondays, Tuesdays, etc. to calculate an anomaly for each.
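The full time buckets adjustment can be sketched as date arithmetic in Python. This is an illustration only: the helper align_to_week_start and the dates are hypothetical, and Sunday is assumed as the week start, whereas the actual week start depends on the data warehouse.

```python
from datetime import date, timedelta


def align_to_week_start(day):
    """Move a date back to the most recent Sunday (assumed week start)."""
    days_since_sunday = (day.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    return day - timedelta(days=days_since_sunday)


run_date = date(2023, 10, 17)                   # a Tuesday
raw_start = run_date - timedelta(days=14)       # 14 days back lands on a Tuesday
aligned_start = align_to_week_start(raw_start)  # extended back to the Sunday
extra_days = (raw_start - aligned_start).days
```

Here the 14-day window starts on Tuesday 2023-10-03, so the sketch extends it by 2 days to Sunday 2023-10-01.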

The impact of changing training_period

If you increase training_period your test training set will be larger. This means a larger sample size for calculating the expected range, which should make the test less sensitive to outliers. This means less chance of false positive anomalies, but also less sensitivity so anomalies have a higher threshold.

If you decrease training_period your test training set will be smaller. This means a smaller sample size for calculating the expected range, which might make the test more sensitive to outliers. This means more chance of false positive anomalies, but also more sensitivity as anomalies have a lower threshold.


detection_period

detection_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >

Configuration to define the detection period. If detection_period is set to 2 days, only data points from the last 2 days are included in the detection period and can be flagged as anomalous. If detection_period is set to 7 days, the detection period will be 7 days long.

For incremental models, this is also the period for re-calculating metrics. If metrics for buckets in the detection period were already calculated, Elementary will overwrite them. The reason behind it is to monitor recent backfills of data, if there were any. This configuration should be changed according to your data delays.

Default: 2 days

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          detection_period:
            period: day
            count: 30
💡 How does it work?

The detection_period param only works for tests that have timestamp_column configuration.

It works differently according to the table materialization:

  • Regular tables and views - detection_period defines the detection period.

  • Incremental models and sources - detection_period defines the detection period, and the period for which metrics will be re-calculated


time_bucket

time_bucket:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

This configuration controls the duration of the time buckets.

To calculate how data changes over time and detect issues, we split the data into consistent time buckets. For example, if we use daily (period=day, count=1) time bucket and monitor for row count anomalies, we will count new rows per day.

Depending on the nature of your data, it may make sense to modify this parameter. For example, if you want to detect volume anomalies in an hourly resolution, you should set the time bucket to period=hour and count=1.
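To make the bucketing idea concrete, here is a minimal Python sketch that counts rows per time bucket. The helper names and timestamps are hypothetical, only hour and day periods with count=1 are handled, and Elementary itself computes these metrics in SQL on the warehouse.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts, period):
    """Truncate a timestamp to the start of its bucket (count=1 only)."""
    if period == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if period == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported period in this sketch: {period}")


def row_count_per_bucket(timestamps, period="day"):
    """Count rows per bucket, like a row count metric per time bucket."""
    return Counter(bucket_start(ts, period) for ts in timestamps)


# Hypothetical values of a created_at column.
created_at = [
    datetime(2023, 10, 1, 9, 15),
    datetime(2023, 10, 1, 9, 45),
    datetime(2023, 10, 1, 17, 5),
    datetime(2023, 10, 2, 8, 30),
]
daily = row_count_per_bucket(created_at, period="day")
hourly = row_count_per_bucket(created_at, period="hour")
```

The same four rows produce two daily buckets but three hourly buckets, which is why the bucket size determines the resolution at which anomalies can be detected.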

Default: daily buckets. time_bucket: {period: day, count: 1}

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          time_bucket:
            period: day
            count: 2
💡 How does it work?
  • The training_period and detection_period of the test might be extended to ensure full time buckets (for example, full week Sunday-Saturday).

  • Weekly buckets start at the day that is configured as week start on the data warehouse.


seasonality

seasonality: day_of_week | hour_of_day | hour_of_week

Some data sets have expected patterns that repeat over a time period. This is the normal behavior of these data sets. When we try to detect outliers from the normal and expected range, ignoring these patterns might cause false positives or make us miss anomalies. The seasonality configuration is used to overcome this challenge and account for expected patterns.

Supported seasonality configurations:

  • day_of_week - Uses the same day of week as a training set for each daily bucket (Compares Sunday to Sundays, Monday to Mondays, etc.).

  • hour_of_day - Uses the same hour as a training set for each hourly bucket (for example, compares 10:00-11:00AM to 10:00-11:00AM on previous days, instead of any previous hour).

  • hour_of_week - Uses the same hour and day of week as a training set for each hourly bucket (for example, compares 10:00-11:00AM on Sunday to 10:00-11:00AM on previous Sundays).

Use case:

Many data sets have lower volume over the weekend, and higher volume over the week days. This means that the expected range for different days of the week is different. The day_of_week seasonality uses the same day of week as a training set for each daily time bucket data point. The expected range for Monday will be based on a training set of previous Mondays, and so on.
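The day_of_week use case can be sketched in Python as a filter on the training set. The function day_of_week_training_set and the row counts are hypothetical, and the sketch only shows the selection of comparable buckets, not how Elementary computes the range.

```python
from datetime import date, timedelta


def day_of_week_training_set(metrics, target_day):
    """Keep only earlier daily buckets on the same weekday as target_day."""
    return [
        value
        for day, value in sorted(metrics.items())
        if day < target_day and day.weekday() == target_day.weekday()
    ]


# Hypothetical daily row counts: 500 on weekends, 1000 on week days.
start = date(2023, 10, 1)  # a Sunday
metrics = {}
for offset in range(28):
    day = start + timedelta(days=offset)
    metrics[day] = 500 if day.weekday() >= 5 else 1000

monday_training = day_of_week_training_set(metrics, date(2023, 10, 23))
sunday_training = day_of_week_training_set(metrics, date(2023, 10, 22))
```

In this sketch a Sunday is compared only to the lower weekend volumes of previous Sundays, so a normal weekend dip is not mistaken for a drop.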

Default: none

Supported values: day_of_week, hour_of_day, hour_of_week

Relevant tests: Anomaly detection tests with timestamp_column and 1 day time_bucket

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          seasonality: day_of_week
💡 How does it work?
  • The test will compare the value of a bucket to previous buckets with the same seasonality attribute, and not to the adjacent previous data points.

  • The training_period of the test will be changed by default to ensure a minimal training set. When seasonality: day_of_week is configured, training_period is by default multiplied by 7.


column_anomalies

column_anomalies: [column monitors list]

Select which monitors to activate as part of the test.

Default: default monitors

Relevant tests: all_columns_anomalies, column_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.column_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - average

Default monitors by type:

Data quality metric    Column type
null_count             any
null_percent           any
min_length             string
max_length             string
average_length         string
missing_count          string
missing_percent        string
min                    numeric
max                    numeric
average                numeric
zero_count             numeric
zero_percent           numeric
standard_deviation     numeric
variance               numeric

Opt-in monitors by type:

Data quality metric    Column type
sum                    numeric


exclude_prefix

exclude_prefix: [string]

Param for the all_columns_anomalies test only. It excludes columns from the test based on a prefix match.

Default: None

Relevant tests: all_columns_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.all_columns_anomalies:
          exclude_prefix: "id_"

exclude_regexp

exclude_regexp: [regex]

Param for the all_columns_anomalies test only. It excludes columns from the test based on a regular expression match.

Default: None

Relevant tests: all_columns_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.all_columns_anomalies:
          exclude_regexp: ".*SDC$"

dimensions

dimensions: [list of SQL expressions]

Configuration for the test dimension_anomalies. The test counts rows grouped by the given column, columns, or valid SQL select expression. Under dimensions you can configure the group by expression.

This test monitors the frequency of values in the configured dimension over time, and alerts on unexpected changes in the distribution. It is best to configure it on low-cardinality fields.
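The grouped row count that this test tracks can be sketched in Python for a single time bucket. The function dimension_row_counts and the sample rows are hypothetical; in practice the grouping runs as SQL on the warehouse.

```python
from collections import Counter


def dimension_row_counts(rows, dimensions):
    """Count rows per combination of dimension values (a group by)."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


# Hypothetical rows from one time bucket.
rows = [
    {"device_os": "ios", "device_browser": "safari"},
    {"device_os": "ios", "device_browser": "safari"},
    {"device_os": "android", "device_browser": "chrome"},
]
counts = dimension_row_counts(rows, ["device_os", "device_browser"])
```

Each distinct combination becomes its own monitored series, which is why low-cardinality fields keep the number of series manageable.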

Default: None

Relevant tests: dimension_anomalies

Configuration level: test

Example configuration:

models:
  - name: model_name
    config:
      elementary:
        timestamp_column: updated_at
    tests:
      - elementary.dimension_anomalies:
          dimensions:
            - device_os
            - device_browser

event_timestamp_column

event_timestamp_column: [column name]

Configuration for the test event_freshness_anomalies. This test complements the freshness_anomalies test and is primarily intended for data that is updated in a continuous / streaming fashion.

The test can work in a couple of modes:

  • If only an event_timestamp_column is supplied, the test measures over time the difference between the current timestamp (“now”) and the most recent event timestamp.

  • If both an event_timestamp_column and an update_timestamp_column are provided, the test will measure over time the difference between these two columns.
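The two modes can be sketched in Python as follows. The function event_freshness_seconds and the sample rows are hypothetical, and taking the maximum lag across rows in the second mode is an assumption, since the text above does not state how the per-row differences are aggregated.

```python
from datetime import datetime


def event_freshness_seconds(rows, now, event_col, update_col=None):
    """Return a freshness metric in seconds for one of the two modes."""
    if update_col is None:
        # Mode 1: time between "now" and the most recent event timestamp.
        latest_event = max(row[event_col] for row in rows)
        return (now - latest_event).total_seconds()
    # Mode 2: lag between the two columns (assumption: take the maximum).
    return max((row[update_col] - row[event_col]).total_seconds() for row in rows)


rows = [
    {"event_timestamp": datetime(2023, 10, 1, 12, 0), "created_at": datetime(2023, 10, 1, 12, 5)},
    {"event_timestamp": datetime(2023, 10, 1, 12, 30), "created_at": datetime(2023, 10, 1, 12, 40)},
]
now = datetime(2023, 10, 1, 13, 0)
since_last_event = event_freshness_seconds(rows, now, "event_timestamp")
load_lag = event_freshness_seconds(rows, now, "event_timestamp", "created_at")
```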

Default: None

Relevant tests: event_freshness_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: "event_timestamp"
          update_timestamp_column: "created_at"

update_timestamp_column

update_timestamp_column: [column name]

Configuration for the test event_freshness_anomalies. This test complements the freshness_anomalies test and is primarily intended for data that is updated in a continuous / streaming fashion.

The test can work in a couple of modes:

  • If only an event_timestamp_column is supplied, the test measures over time the difference between the current timestamp (“now”) and the most recent event timestamp.

  • If both an event_timestamp_column and an update_timestamp_column are provided, the test will measure over time the difference between these two columns.

Default: None

Relevant tests: event_freshness_anomalies

Configuration level: test

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: "event_timestamp"
          update_timestamp_column: "created_at"

ignore_small_changes

ignore_small_changes:
  spike_failure_percent_threshold: [int]
  drop_failure_percent_threshold: [int]

If defined, an anomaly test will fail only if all the following conditions hold:

  • The z-score of the metric within the detection period is anomalous.

  • One of the following holds:

    • The metric within the detection period is higher than spike_failure_percent_threshold percent of the mean value in the training period, if defined.

    • The metric within the detection period is lower than drop_failure_percent_threshold percent of the mean value in the training period, if defined.

These settings help in situations where your metrics are stable, so that small changes produce high z-scores and would otherwise be flagged as anomalies.
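The combined condition can be sketched in Python. This is one possible reading and not Elementary's implementation: the function should_fail is hypothetical, the thresholds are interpreted as a relative change from the training mean, and an undefined threshold is treated as 0 percent. All three points are assumptions.

```python
def should_fail(metric_value, training_mean, z_score_is_anomalous,
                spike_failure_percent_threshold=None,
                drop_failure_percent_threshold=None):
    """Fail only if the z-score is anomalous AND the change is large enough."""
    if not z_score_is_anomalous:
        return False
    # Assumption: an undefined threshold behaves like 0 percent.
    spike_pct = spike_failure_percent_threshold or 0
    drop_pct = drop_failure_percent_threshold or 0
    is_big_spike = metric_value > training_mean * (1 + spike_pct / 100)
    is_big_drop = metric_value < training_mean * (1 - drop_pct / 100)
    return is_big_spike or is_big_drop


# Hypothetical stable metric with a training mean of 1000.
small_spike = should_fail(1010, 1000, True, 2, 50)
large_spike = should_fail(1030, 1000, True, 2, 50)
```

Under these assumptions, a 1 percent increase is ignored even with an anomalous z-score, while a 3 percent increase fails the test.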

If undefined, default is null for both spike and drop.

Default: none

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          ignore_small_changes:
            spike_failure_percent_threshold: 2
            drop_failure_percent_threshold: 50

fail_on_zero

fail_on_zero: true/false

When set to true, Elementary anomaly detection tests will fail if there is a zero metric value within the detection period. If undefined, default is false.

Default: false

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          fail_on_zero: true

detection_delay

detection_delay:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >

The duration for retracting the detection period. This is useful when the latest data should be excluded from the test. For example, this can happen because of scheduling issues, if the test runs before the table is populated. The detection delay is the period of time to ignore, after the detection period.

Default: 0

Relevant tests: Anomaly detection tests with timestamp_column

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          detection_delay:
            period: day
            count: 1
💡 How does it work?

The detection_delay param only works for tests that have timestamp_column configuration. It does not affect the other duration parameters, like detection_period or training_period.
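How the three duration parameters relate can be sketched in Python. This is a simplified model under stated assumptions: the function compute_windows is hypothetical, all periods are expressed in days, and the training_period is taken to span both the training and detection periods, as described under training_period.

```python
from datetime import datetime, timedelta


def compute_windows(run_time, training_days=14, detection_days=2, delay_days=0):
    """Return (start, end) pairs for the training and detection windows."""
    window_end = run_time - timedelta(days=delay_days)    # ignore the latest data
    detection_start = window_end - timedelta(days=detection_days)
    window_start = window_end - timedelta(days=training_days)
    return {
        "training": (window_start, detection_start),
        "detection": (detection_start, window_end),
    }


run_time = datetime(2023, 10, 15)
no_delay = compute_windows(run_time)
one_day_delay = compute_windows(run_time, delay_days=1)
```

With a delay of one day, both windows keep their lengths and are shifted one day into the past.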


anomaly_exclude_metrics

anomaly_exclude_metrics: [SQL where expression on fields metric_date / metric_time_bucket / metric_value]

By default, data points are compared to all data points in the training set. Using this param, you can exclude metrics from the training set to improve the test accuracy.

The filter can be configured using an SQL where expression syntax, and the following fields:

  1. metric_date - The date of the relevant bucket (even if the bucket is not daily).

  2. metric_time_bucket - The exact time bucket.

  3. metric_value - The value of the metric.
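The effect of such a filter can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table layout, the function training_metrics, and the sample rows are hypothetical, and SQLite stands in for the data warehouse.

```python
import sqlite3


def training_metrics(metrics, anomaly_exclude_metrics=None):
    """Return training values, dropping rows that match the exclude expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metrics (metric_date TEXT, metric_time_bucket TEXT, metric_value REAL)"
    )
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", metrics)
    query = "SELECT metric_value FROM metrics"
    if anomaly_exclude_metrics:
        # The expression is interpolated as-is, so it must come from trusted config.
        query += f" WHERE NOT ({anomaly_exclude_metrics})"
    query += " ORDER BY metric_time_bucket"
    values = [row[0] for row in conn.execute(query)]
    conn.close()
    return values


# Hypothetical hourly metrics with one unusually low value.
metrics = [
    ("2023-10-01", "2023-10-01 05:00:00", 120.0),
    ("2023-10-01", "2023-10-01 06:00:00", 4.0),
    ("2023-10-01", "2023-10-01 07:00:00", 118.0),
]
by_value = training_metrics(metrics, "metric_value < 10")
by_time = training_metrics(
    metrics,
    "metric_time_bucket >= '2023-10-01 06:00:00' and metric_time_bucket <= '2023-10-01 07:00:00'",
)
```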

Supported values: valid SQL where expression on the columns metric_date / metric_time_bucket / metric_value

Relevant tests: All anomaly detection tests

Example configuration:

models:
  - name: this_is_a_model
    tests:
      - elementary.volume_anomalies:
          anomaly_exclude_metrics: metric_value < 10

      - elementary.all_columns_anomalies:
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          anomaly_exclude_metrics: metric_time_bucket >= '2023-10-01 06:00:00' and metric_time_bucket <= '2023-10-01 07:00:00'
