Summary Tests

Drift

Test for differences in the distribution of the reference dataset versus the evaluation dataset. If predictions and labels are provided, measure the performance degradation caused by shifting data as well as drift in predictions and labels themselves.
Data Requirements: Labels and predictions are not required but they improve results.

Abnormal Inputs

Check the evaluation dataset for abnormal values commonly encountered in production. If model predictions are provided, test if the observed abnormal values cause a degradation in your model’s performance.
Data Requirements: Labels and predictions are not required but they improve results.

Subset Performance

Test that your model performs equally well across different subsets of the evaluation dataset.
Data Requirements: Predictions are required. Labels are required for most tests.

Bias and Fairness

Test that your model does not discriminate based on protected features.
Data Requirements: For some tests predictions, labels, and/or model are required.

Transformations

Augment your evaluation dataset with synthetic abnormal values to proactively test your pipeline’s error-handling behavior and measure the performance degradation caused by different types of abnormal values.
Data Requirements: Model is required. Labels are not required but they improve results.

Model Performance

Test that your model performs well on the evaluation dataset and is not degrading in performance.
Data Requirements: Predictions are required. Labels are required for exact performance metrics, otherwise approximate performance metrics will be used.

Data Cleanliness

Test for data reliability by checking that your data is consistent and complete.
Data Requirements: This category only requires input data.

Adversarial

Test the robustness of your model by measuring the worst-case change in model predictions that can be caused by small perturbations to data points.
Data Requirements: Model is required.

Subset Performance Degradation

Test that your model’s performance on different subsets of the data has not degraded.
Data Requirements: Predictions are required. Labels are required for most tests.

Data Poisoning Detection

Test for the presence of potentially corrupted samples in your data.
Note that the default behavior of this test category in Continuous Testing depends on the quality of data used for the initial stress test. If corrupted samples were detected in the stress test, then the continuous test will be more permissive.
Data Requirements: This category requires input data with labels and timestamps.

Evasion Attack Detection

Test for the presence of adversarial evasion attacks against your model.
Data Requirements: This category only requires input data with timestamps.

Model Alignment

Test that your model is aligned: helpful, honest, and harmless.
Data Requirements: Most tests in this category require a generative text model to be provided, though providing only predictions is sufficient for any row-wise test.

Factual Awareness

Test that your model is producing factually correct outputs.
Data Requirements: Most tests in this category require a generative text model to be provided, though providing only predictions is sufficient for any row-wise test.