Anomaly tests parameters
Introduction
Elementary data anomaly detection tests monitor specific metrics and compare recent values to historical data to detect significant changes that may indicate data reliability issues. This page outlines the parameters available for configuring these tests.
If your dataset has a timestamp column representing the creation time of a field, it's highly recommended to configure it (e.g. date_added_dttm). This allows Elementary to create time buckets and filter the table effectively.
Parameters overview
Below is a list of all parameters available for each type of test provided by Elementary.
Common Parameters for All Anomaly Detection Tests
Anomaly Detection Tests With Timestamp Column:
Volume Anomaly Tests
All columns anomalies test:
Dimension anomalies test:
Event freshness anomalies:
Example configurations
Parameters details
Below is a list of each parameter and configuration details.
timestamp_column
timestamp_column: [column name]
If your data set has a timestamp column that represents the creation time of a field, it is highly recommended to configure it as a timestamp_column.
Elementary anomaly detection tests will use this column to create time buckets and filter the table. The best column for this would be an updated_at / created_at / loaded_at timestamp for each row (a date type also works).
When you specify a timestamp_column, the test splits the data into buckets according to the timestamp in this column, calculates the metric for each bucket, and checks for anomalies between these buckets. This also means that if the table has enough historical data, the test can start working right away.
When you do not specify a timestamp_column, each time the test runs it calculates the metric for all of the data in the table, and checks for anomalies against the metric from previous runs. This also means that it will take the test training_period days to start working, as it needs this time to collect the necessary metrics.
If undefined, the default is null (no time buckets).
Default: none
Relevant tests: All anomaly detection tests
Example configuration:
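A minimal dbt schema.yml sketch (the model and column names here are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
```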
where_expression
where_expression: [sql expression]
Filter the tested data using a valid sql expression.
Default: None
Relevant tests: All anomaly detection tests
Example configuration:
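For instance, to test only rows that pass a filter (model, column, and filter values are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          where_expression: "country = 'US'"
```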
anomaly_sensitivity
anomaly_sensitivity: [int]
Configuration to define how the expected range is calculated. A sensitivity of 3 means that the expected range is within 3 standard deviations from the average of the training set. A smaller sensitivity narrows this range, so more values are potentially flagged as anomalies; a larger value widens the range and reduces the number of anomalies.
Default: 3
Relevant tests: All anomaly detection tests
Example configuration:
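A sketch of a stricter sensitivity on a volume test (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          anomaly_sensitivity: 2
```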
anomaly_direction
anomaly_direction: both | spike | drop
By default, data points are compared to the expected range to check whether they fall below or above it. For some data monitors, you might only want to flag anomalies if they are above the range and not under it, or vice versa. For example, when monitoring for freshness, we only want to detect data delays and not data that is “early”. The anomaly_direction configuration sets the direction of the expected range, and can be set to both, spike or drop.
Default: both
Supported values: both
, spike
, drop
Relevant tests: All anomaly detection tests
Example configuration:
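For example, flagging only upward spikes in volume (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          anomaly_direction: spike
```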
training_period
The maximal timeframe for which the test will collect data. This timeframe includes both the training period and the detection period. If a detection delay is defined, the whole training period is delayed accordingly.
Default: 14 days
Relevant tests: Anomaly detection tests with timestamp_column
Example configuration:
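A sketch extending the training period to 30 days (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          training_period:
            period: day
            count: 30
```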
detection_period
Configuration to define the detection period. If detection_period is set to 2 days, only data points from the last 2 days are included in the detection period and can be flagged as anomalous. If detection_period is set to 7 days, the detection period will be 7 days long.
For incremental models, this is also the period for re-calculating metrics. If metrics for buckets in the detection period were already calculated, Elementary will overwrite them, in order to monitor recent backfills of data, if there were any. This configuration should be adjusted according to your data delays.
Default: 2 days
Relevant tests: Anomaly detection tests with timestamp_column
Example configuration:
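A sketch widening the detection period to 7 days (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          detection_period:
            period: day
            count: 7
```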
time_bucket
This configuration controls the duration of the time buckets.
To calculate how data changes over time and detect issues, we split the data into consistent time buckets. For example, if we use a daily time bucket (period: day, count: 1) and monitor for row count anomalies, we will count new rows per day.
Depending on the nature of your data, it may make sense to modify this parameter. For example, if you want to detect volume anomalies at an hourly resolution, you should set the time bucket to period: hour and count: 1.
Default: daily buckets. time_bucket: {period: day, count: 1}
Relevant tests: Anomaly detection tests with timestamp_column
Example configuration:
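For instance, hourly buckets for a volume test (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          time_bucket:
            period: hour
            count: 1
```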
seasonality
seasonality: day_of_week | hour_of_day | hour_of_week
Some data sets have expected patterns that repeat over a time period; this is the normal behavior of these data sets. When we try to detect outliers from the normal and expected range, ignoring these patterns might cause false positives or make us miss anomalies. The seasonality configuration is used to overcome this challenge and account for expected patterns.
Supported seasonality configurations:
day_of_week - Uses the same day of week as a training set for each daily bucket (compares Sunday to previous Sundays, Monday to previous Mondays, etc.).
hour_of_day - Uses the same hour as a training set for each hourly bucket (for example, compares 10:00-11:00AM to 10:00-11:00AM on previous days, instead of to any previous hour).
hour_of_week - Uses the same hour and day of week as a training set for each hourly bucket (for example, compares 10:00-11:00AM on Sunday to 10:00-11:00AM on previous Sundays).
Use case:
Many data sets have lower volume over the weekend and higher volume over the weekdays, so the expected range for different days of the week differs. The day_of_week seasonality uses the same day of week as a training set for each daily time bucket data point: the expected range for Monday will be based on a training set of previous Mondays, and so on.
Default: none
Supported values: day_of_week
, hour_of_day
, hour_of_week
Relevant tests: Anomaly detection tests with timestamp_column
and 1 day time_bucket
Example configuration:
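A sketch applying weekly seasonality to a daily volume test (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          seasonality: day_of_week
```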
column_anomalies
column_anomalies: [column monitors list]
Select which monitors to activate as part of the test.
Default: default monitors
Relevant tests: all_column_anomalies
, column_anomalies
Configuration level: test
Example configuration:
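A sketch of a column-level test activating specific monitors (model and column names are illustrative):

```yaml
models:
  - name: orders
    columns:
      - name: amount
        tests:
          - elementary.column_anomalies:
              column_anomalies:
                - null_count
                - zero_percent
```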
Default monitors by type:
null_count - any
null_percent - any
min_length - string
max_length - string
average_length - string
missing_count - string
missing_percent - string
min - numeric
max - numeric
average - numeric
zero_count - numeric
zero_percent - numeric
standard_deviation - numeric
variance - numeric
Opt-in monitors by type:
sum - numeric
exclude_prefix
exclude_prefix: [string]
Param for the all_columns_anomalies test only, which enables excluding columns from the test based on a prefix match.
Default: None
Relevant tests: all_columns_anomalies
Configuration level: test
Example configuration:
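For instance, skipping columns that start with a scratch prefix (model name and prefix are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.all_columns_anomalies:
          exclude_prefix: "tmp_"
```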
exclude_regexp
exclude_regexp: [regex]
Param for the all_columns_anomalies test only, which enables excluding columns from the test based on a regular expression match.
Default: None
Relevant tests: all_columns_anomalies
Configuration level: test
Example configuration:
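For instance, skipping columns matched by a regular expression (model name and pattern are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.all_columns_anomalies:
          exclude_regexp: ".*_deprecated$"
```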
dimensions
dimensions: [list of SQL expressions]
Configuration for the dimension_anomalies test. The test counts rows grouped by a given column, columns, or any valid SQL select expression; under dimensions you configure the group-by expression.
This test monitors the frequency of values in the configured dimension over time, and alerts on unexpected changes in the distribution. It is best to configure it on low-cardinality fields.
Default: None
Relevant tests: dimension_anomalies
Configuration level: test
Example configuration:
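A sketch grouping row counts by two low-cardinality fields (model and column names are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.dimension_anomalies:
          timestamp_column: updated_at
          dimensions:
            - country
            - device_type
```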
event_timestamp_column
event_timestamp_column: [column name]
Configuration for the event_freshness_anomalies test. This test complements the freshness_anomalies test and is primarily intended for data that is updated in a continuous / streaming fashion.
The test can work in a couple of modes:
If only an event_timestamp_column is supplied, the test measures over time the difference between the current timestamp (“now”) and the most recent event timestamp.
If both an event_timestamp_column and an update_timestamp_column are provided, the test will measure over time the difference between these two columns.
Default: None
Relevant tests: event_freshness_anomalies
Configuration level: test
Example configuration:
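A sketch of the first mode, using only an event timestamp (model and column names are illustrative):

```yaml
models:
  - name: clickstream_events
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: event_created_at
```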
update_timestamp_column
update_timestamp_column: [column name]
Configuration for the event_freshness_anomalies test. This test complements the freshness_anomalies test and is primarily intended for data that is updated in a continuous / streaming fashion.
The test can work in a couple of modes:
If only an event_timestamp_column is supplied, the test measures over time the difference between the current timestamp (“now”) and the most recent event timestamp.
If both an event_timestamp_column and an update_timestamp_column are provided, the test will measure over time the difference between these two columns.
Default: None
Relevant tests: event_freshness_anomalies
Configuration level: test
Example configuration:
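A sketch of the second mode, measuring the gap between the two columns (model and column names are illustrative):

```yaml
models:
  - name: clickstream_events
    tests:
      - elementary.event_freshness_anomalies:
          event_timestamp_column: event_created_at
          update_timestamp_column: event_loaded_at
```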
ignore_small_changes
If defined, an anomaly test will fail only if all of the following conditions hold:
The z-score of the metric within the detection period is anomalous.
One of the following holds:
The metric within the detection period is more than spike_failure_percent_threshold percent higher than the mean value in the training period, if defined.
The metric within the detection period is more than drop_failure_percent_threshold percent lower than the mean value in the training period, if defined.
These settings can help deal with situations where your metrics are stable and small changes cause high z-scores, and therefore anomalies.
If undefined, the default is null for both spike and drop.
Default: none
Relevant tests: All anomaly detection tests
Example configuration:
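A sketch ignoring changes below 10% up or 5% down (model name and thresholds are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          ignore_small_changes:
            spike_failure_percent_threshold: 10
            drop_failure_percent_threshold: 5
```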
fail_on_zero
fail_on_zero: true/false
Elementary anomaly detection tests will fail if there is a zero metric value within the detection period. If undefined, default is false.
Default: false
Relevant tests: All anomaly detection tests
Example configuration:
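For instance, failing whenever a bucket's metric drops to zero (model name is illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          fail_on_zero: true
```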
detection_delay
The duration by which to retract the detection period. This is useful in cases where the latest data should be excluded from the test, for example because of scheduling issues, if the test runs before the table is populated. The detection delay is the period of time to ignore after the detection period.
Default: 0
Relevant tests: Anomaly detection tests with timestamp_column
Example configuration:
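A sketch ignoring the most recent day of data (model name and delay are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          detection_delay:
            period: day
            count: 1
```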
anomaly_exclude_metrics
anomaly_exclude_metrics: [SQL where expression on fields metric_date / metric_time_bucket / metric_value]
By default, data points are compared to all data points in the training set. Using this param, you can exclude metrics from the training set to improve the test accuracy.
The filter can be configured using an SQL where expression syntax, and the following fields:
metric_date - The date of the relevant bucket (even if the bucket is not daily).
metric_time_bucket - The exact time bucket.
metric_value - The value of the metric.
Supported values: valid SQL where expression on the columns metric_date / metric_time_bucket / metric_value
Relevant tests: All anomaly detection tests
Example configuration:
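For instance, excluding a known incident date from the training set (model name and date are illustrative):

```yaml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          anomaly_exclude_metrics: "metric_date = '2024-01-01'"
```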