Anomaly Detection Tests
Elementary data anomaly detection tests monitor specific metrics (such as row count, null rate, and average value) and compare recent values to historical data. This process helps detect significant changes and deviations that are likely data reliability issues.
Data is split into time buckets based on the time_bucket field.
The data is limited by the training_period variable.
The test compares a specific metric (e.g., row count) of the buckets within the detection_period to all previous time buckets within the training_period.
If anomalies are detected in the detection period, the test fails. A simplified sketch of this comparison is shown below.
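To make the steps above concrete, here is a minimal Python sketch of the comparison, under assumed defaults (daily buckets, a row-count metric, a 14-day training period, a 2-day detection period, and a z-score threshold of 3). It illustrates the logic only; it is not Elementary's actual implementation, which computes these metrics with SQL inside your data warehouse.

```python
from datetime import date
from statistics import mean, stdev

def detect_row_count_anomalies(row_counts_by_day: dict[date, int],
                               training_days: int = 14,
                               detection_days: int = 2,
                               threshold: float = 3.0) -> list[date]:
    """Flag detection-period buckets whose row count is an outlier
    relative to the training-period buckets (illustrative only)."""
    days = sorted(row_counts_by_day)
    detection = days[-detection_days:]                                   # detection set
    training = days[-(training_days + detection_days):-detection_days]   # training set

    train_values = [row_counts_by_day[d] for d in training]
    if len(train_values) < 2:
        return []  # not enough history to calculate an expected range

    mu, sigma = mean(train_values), stdev(train_values)
    anomalies = []
    for d in detection:
        z = (row_counts_by_day[d] - mu) / sigma if sigma else 0.0
        if abs(z) >= threshold:  # outside the expected range
            anomalies.append(d)
    return anomalies
```

In this sketch the training set is simply the buckets that precede the detection period within the training window; the parameter names mirror the fields described above and are not Elementary's API.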
The Elementary package executes the relevant monitors and searches for anomalies by comparing recent metric values to historical ones. A test failure indicates that an anomaly was detected for the specific metric and dataset. For more details, refer to the anomaly detection method.
Elementary uses the "standard score" (Z-score) for anomaly detection. This score represents the number of standard deviations a value is from the mean of a set of values.
~68% of values have an absolute z-score of 1 or less.
~95% of values have an absolute z-score of 2 or less.
~99.7% of values have an absolute z-score of 3 or less.
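For a value x in the detection set, the score is calculated against the training set, which defines the expected range:

```latex
z = \frac{x - \mu_{\text{training}}}{\sigma_{\text{training}}}
```

where \mu_{\text{training}} and \sigma_{\text{training}} are the mean and standard deviation of the training set values.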
Values with an absolute standard score of 3 or above are considered outliers. This is Elementary's default threshold, which can be adjusted using the anomaly_score_threshold variable in the global configuration.
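As a toy illustration (unrelated to Elementary's internals), the snippet below shows how the threshold controls sensitivity: the same value is not flagged at the default threshold of 3, but would be flagged if the threshold were lowered to 2.

```python
from statistics import mean, stdev

# Toy numbers: daily row counts used as the training set.
training = [1000, 1020, 980, 1010, 990, 1005, 995]
mu, sigma = mean(training), stdev(training)   # 1000.0, ~13.2

detected_value = 1035
z = (detected_value - mu) / sigma
print(round(z, 2))    # ~2.65 -> about 2.6 standard deviations above the mean

print(abs(z) >= 3)    # False: not an anomaly at the default threshold
print(abs(z) >= 2)    # True: flagged if the threshold is lowered to 2
```

Lowering the threshold makes detection more sensitive but can increase false positives; the anomaly_sensitivity model described below can help you choose a value.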
Within your Elementary schema in your data warehouse, access the anomaly_sensitivity model to see how different scores would affect anomaly detection based on your last run's metric values. This can help you decide whether to adjust the sensitivity.
Anomaly: A value in the detection set that is an outlier compared to the expected range calculated based on the training set.
Monitored data set: The complete dataset used for the data monitor, including both training set and detection set values.
Data monitors: Different metrics (freshness, volume, nullness, uniqueness, distribution, etc.) that we monitor to detect problems.
Training set: The set of values used as a reference point to calculate the expected range.
Detection set: The set of values compared to the expected range. Outliers in this set are flagged as anomalies.
Expected range: The range of values considered normal, calculated based on the training set.
Training period: The time period from which the training set is collected. This is typically a recent period, as data patterns may change over time.
Detection period: The period containing values that are compared to the expected range.
Time bucket: The consistent time intervals into which data is split for analysis. For example, daily buckets for monitoring row count anomalies.