RIME overview

The Robust Intelligence (RI) Platform secures your machine learning pipeline so you can focus on building better ML models for your business needs.

The RI Platform

The RI Platform operates at three stages of the ML lifecycle.

During model development, AI Stress Testing measures the robustness of your model by running dozens of pre-configured tests, each of which checks the model’s vulnerability to a specific form of potential failure in production. After production deployment and before inference, AI Firewall protects your model from such critical errors in real time by flagging or blocking aberrant data before it enters your ML system. After inference, AI Continuous Testing monitors your model and alerts you to issues such as data drift and performance degradation. When things go wrong, it also offers automated root cause analysis of the underlying driver of the performance change.

Why use the RI Platform?

In modern engineering organizations, data scientists and machine learning engineers typically spend the majority of their effort on the development stage of the model life cycle, which encompasses data ingestion and cleaning, feature extraction, and model training. During this stage, models are primarily evaluated based on their performance on some clean, held-out test set.

While such metrics might be sufficient for understanding model performance in controlled development environments, deploying models into production introduces a whole new set of challenges and failure modes that are often overlooked. Once a model is deployed, data scientists no longer have complete control over how the model is instantiated or how data is passed into it, nor do they have oversight of the data pipelines into which the model is integrated. Even when the model is used correctly, the real world can change, and issues like distributional shifts in production data may silently degrade model performance.

Key Features

The RI Platform addresses these risks with four core products:

AI Stress Testing

AI Stress Testing is a set of tests that measure the robustness of your ML deployment by computing an aggregate severity score across all tests. The severity score is a measure of the magnitude of the identified failure mode specific to each test. It is a combination of the impact the failure has on model performance (Performance Change or Prediction Change) and the prevalence of the failure mode in the reference set (Abnormal Inputs or Drift Statistic). By running hundreds of these unit tests and simulations across both your model and associated reference and evaluation datasets, the RI Platform identifies implicit assumptions and failure modes of the ML deployment.

AI Stress Testing allows you to test your data and model before deployment. We recommend providing a model and labels when running AI Stress Testing to leverage the platform’s full potential; however, it is not required. You can run the RI Platform in various modes.

  • Model: Providing access to the model allows for testing the model behavior under different circumstances. In these tests, we perturb model inputs, provide them to the model, and examine the model behavior to uncover its vulnerabilities.

  • Predictions: Providing predictions for the data can speed up the RI Platform and allows us to test your model even if you don’t provide a model interface. We use sophisticated statistical algorithms to run most of the same tests as when we have direct model access to uncover vulnerabilities within your model and approximate the impact of each vulnerability. If you provide neither a model nor predictions, the RI Platform will still run data quality and data distribution shift tests.

  • Labels: Providing labels allows for testing model performance under different circumstances. If you do not provide labels, the RI Platform will still run data quality tests, data distribution tests, and prediction distribution tests (if possible).
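The perturbation idea described under the Model mode can be illustrated with a minimal sketch. This is not the RI Platform's implementation: the function name `prediction_change_under_perturbation`, the toy model, and the choice of overwriting a single feature column are all assumptions made for illustration.

```python
import numpy as np

def prediction_change_under_perturbation(predict_fn, X, column, new_value):
    """Overwrite one feature column with `new_value` and return the
    mean absolute change in the model's predictions."""
    X = np.asarray(X, dtype=float)
    baseline = np.asarray(predict_fn(X), dtype=float)
    X_perturbed = X.copy()
    X_perturbed[:, column] = new_value  # the perturbation
    perturbed = np.asarray(predict_fn(X_perturbed), dtype=float)
    return float(np.mean(np.abs(perturbed - baseline)))

# Hypothetical stand-in model: a logistic score over two features.
def toy_model(X):
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))

X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
change = prediction_change_under_perturbation(toy_model, X, column=0, new_value=10.0)
print(f"mean absolute prediction change: {change:.3f}")
```

A large change for a plausible perturbation suggests that the model is sensitive to that feature, which is the kind of vulnerability the text describes.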

AI Firewall

AI Firewall protects your model from bad predictions by screening data before model inference. Firewall operates at the data point level: it can flag or block aberrant data points in real time, and it is automatically configured from the results of stress testing. The end result is a custom AI Firewall that protects against the unique forms of model failure to which a given model is susceptible. Firewall can be deployed with a single line of code directly in the inference pipeline of your ML model, where it logs, analyzes, and/or acts upon (flag, block, impute) aberrant data points in real time.

AI Continuous Testing

AI Continuous Testing enables the user to monitor their ML model after inference. As the name suggests, it applies the same Stress Testing framework continually over time. Data drift will inevitably occur once a model is deployed in production. Continuous tests help answer both the what and the why of changing data: they not only detect issues as they happen but also alert you to them and provide insight into their root causes, shortening the time to resolution. Continuous Testing can be set up passively by uploading prediction logs after model inference, which are then logged and analyzed. These uploads can be automated to run at regular intervals.

AI Compliance Management

AI Compliance Management allows the user to download auto-generated model cards for internal and external documentation needs. The model cards incorporate results from the AI Stress Testing suite (including a suite of bias and fairness tests) that measure a model’s production readiness. In addition, the model cards include results from AI Continuous Testing, which helps meet the ongoing monitoring requirements put forward by regulators. These reports help companies comply with AI regulatory standards.

Governance Dashboard

The Governance Dashboard provides a single view of all models in production, showing model health status and letting you track models against any custom metric. The Governance Dashboard is behind a feature flag; to enable it, contact Robust Intelligence directly.

Key Machine Learning Tasks Covered

Tabular

  • Binary Classification

  • Multiclass Classification

  • Regression

  • Learning to Rank

Natural Language Processing (NLP)

  • Text Classification

  • Named Entity Recognition

Computer Vision (CV)

  • Image Classification

  • Object Detection

RI Platform Deployment Patterns

We offer three variations of RI Platform tailored to different deployment patterns. More information on deployment patterns can be found here.

| | Self-Hosted | Managed Cloud | Cloud |
|---|---|---|---|
| Installation | K8s cluster in Customer VPC | K8s cluster in RI VPC | Control Plane K8s cluster in RI VPC; Data Plane K8s cluster in Customer VPC |
| Data Location | Customer VPC | RI VPC | Customer VPC |
| Compute Location | Customer VPC | RI VPC | Customer VPC |
| Test Result Location | Customer VPC | RI VPC | RI VPC |

A lightweight local installation.

See the section below for a description of what “Summary Tests” are and what information they provide.

See below for a list of summary tests (including when they are run), as well as detailed descriptions of all RIME tests.

RI Platform consolidated test database

Each entry lists the following fields in order: Name, Category, Modality, Description, Why it matters, Configuration, Model Types.

Average Confidence

Model Performance

tabular, nlp, cv

This test checks the average confidence of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. The “confidence” of a prediction for classification tasks is defined as the distance between the probability of the predicted class (defined as the argmax over the prediction vector) and 1. We average this metric across all predictions.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly. Since labels are often not available in a production setting, this metric can serve as a useful proxy for model performance.

By default, this test runs if predictions are specified (no labels required).

[tabular] Binary Classification, [tabular] Multi-class Classification
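The confidence definition in the Average Confidence entry can be made concrete with a short sketch. It follows the wording literally (distance between the predicted-class probability and 1), so under this reading a larger value means the model is further from certainty. The function name and the sample probability arrays are illustrative assumptions, not part of the RI Platform API.

```python
import numpy as np

def average_confidence(probs):
    """Mean distance between the predicted-class probability and 1,
    following the wording of the test description."""
    probs = np.asarray(probs, dtype=float)
    predicted_class_prob = probs.max(axis=1)  # probability of the argmax class
    return float(np.mean(np.abs(1.0 - predicted_class_prob)))

# Made-up prediction vectors for a reference and an evaluation set.
ref_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
eval_probs = np.array([[0.6, 0.4], [0.5, 0.5], [0.55, 0.45]])

ref_conf = average_confidence(ref_probs)
eval_conf = average_confidence(eval_probs)
change = eval_conf - ref_conf  # compared against a threshold by the test
print(ref_conf, eval_conf, change)
```

No labels are needed for this computation, which is why it can serve as a proxy when production labels are unavailable.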

Average Thresholded Confidence

Model Performance

tabular, nlp, cv

This test checks the average thresholded confidence (ATC) of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. ATC is a method for estimating accuracy on unlabeled examples, taken from this paper: https://arxiv.org/abs/2201.04234. The threshold is first computed on the reference set: we pick a confidence threshold such that the percentage of datapoints whose max predicted probability is less than the threshold is approximately equal to the error rate of the model (here, 1 - accuracy) on the reference set. We then apply this threshold to the evaluation set: the predicted accuracy is the percentage of datapoints whose max predicted probability is greater than the threshold.

During production, factors like distribution shift may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.

By default, this test runs if predictions/labels are specified in the reference set and predictions are specified in the eval set (no labels required).

[tabular] Binary Classification, [tabular] Multi-class Classification
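The two ATC steps described in this entry (fit a threshold on the reference set, then apply it to the evaluation set) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the threshold is taken as an empirical quantile of the max predicted probability, which is one simple way to satisfy the stated condition.

```python
import numpy as np

def fit_atc_threshold(ref_probs, ref_labels):
    """Pick a threshold so that roughly `error_rate` of reference points
    have a max predicted probability below it."""
    ref_probs = np.asarray(ref_probs, dtype=float)
    ref_labels = np.asarray(ref_labels)
    max_prob = ref_probs.max(axis=1)
    error_rate = float(np.mean(ref_probs.argmax(axis=1) != ref_labels))  # 1 - accuracy
    return float(np.quantile(max_prob, error_rate))

def atc_predicted_accuracy(eval_probs, threshold):
    """Fraction of evaluation points whose max probability exceeds the threshold."""
    max_prob = np.asarray(eval_probs, dtype=float).max(axis=1)
    return float(np.mean(max_prob > threshold))

# Made-up data: four labeled reference points, four unlabeled evaluation points.
ref_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
ref_labels = np.array([0, 0, 1, 0])  # one of four predictions is wrong
eval_probs = np.array([[0.95, 0.05], [0.65, 0.35], [0.7, 0.3], [0.5, 0.5]])

threshold = fit_atc_threshold(ref_probs, ref_labels)
estimated_accuracy = atc_predicted_accuracy(eval_probs, threshold)
print(threshold, estimated_accuracy)
```

Only the reference set needs labels; the evaluation set contributes predictions alone, matching the configuration note for this test.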

Calibration Comparison

Model Performance

tabular, nlp, cv

This test checks that the reference and evaluation sets have sufficiently similar calibration curves as measured by the Mean Squared Error (MSE) between the two curves. The calibration curve is a line plot where the x-axis represents the average predicted probability and the y-axis is the proportion of positive labels. The curve of an ideally calibrated model is thus a straight line from (0, 0) to (1, 1).

Knowing how well-calibrated your model is can help you better interpret and act upon model outputs, and can even be an indicator of generalization. A greater difference between reference and evaluation curves could indicate a lack of generalizability. In addition, a change in calibration could indicate that decision-making or thresholding conducted upstream needs to change as it is behaving differently on held-out data.

By default, this test runs over the predictions and labels.

[tabular] Binary Classification
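One way to compare two calibration curves by MSE is sketched below. It simplifies the description in this entry: probabilities are grouped into fixed equal-width bins so that both curves share the same x positions, and the MSE is taken only over bins populated in both sets. The function names, the bin count, and the handling of empty bins are assumptions, not the platform's documented behavior.

```python
import numpy as np

def calibration_curve_fixed_bins(y_true, y_prob, n_bins=5):
    """Fraction of positive labels per equal-width probability bin
    (NaN where a bin has no samples)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Map each probability to a bin index; a probability of 1.0 goes in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    curve = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            curve[b] = y_true[mask].mean()
    return curve

def calibration_mse(ref_curve, eval_curve):
    """MSE between two curves over the bins populated in both."""
    both = ~np.isnan(ref_curve) & ~np.isnan(eval_curve)
    if not both.any():
        raise ValueError("no bins are populated in both curves")
    return float(np.mean((ref_curve[both] - eval_curve[both]) ** 2))

# Made-up binary labels and predicted probabilities.
ref_curve = calibration_curve_fixed_bins([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
eval_curve = calibration_curve_fixed_bins([0, 1, 1, 1], [0.1, 0.1, 0.9, 0.9])
mse = calibration_mse(ref_curve, eval_curve)
print(mse)
```

In this toy data the low-probability bin contains a positive label in the evaluation set but not in the reference set, so the curves diverge there and the MSE is non-zero.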

Precision

Model Performance

tabular, nlp, cv

This test checks the Precision metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Precision has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Precision metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
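The performance-metric entries each describe two checks: an absolute check on the evaluation set and a degradation check from reference to evaluation. A minimal sketch for a higher-is-better metric such as Precision follows. The threshold values `min_value` and `max_degradation` are hypothetical placeholders, since the source does not list the defaults, and the function names are illustrative.

```python
def precision(y_true, y_pred):
    """Precision for binary labels: TP / (TP + FP), or 0.0 with no positive predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

def check_metric(ref_value, eval_value, min_value, max_degradation):
    """Return (absolute_pass, degradation_pass) for a higher-is-better metric."""
    absolute_pass = eval_value >= min_value
    degradation_pass = (ref_value - eval_value) <= max_degradation
    return absolute_pass, degradation_pass

# Made-up labels and predictions for the two datasets.
ref_precision = precision([1, 1, 0, 1], [1, 1, 0, 1])
eval_precision = precision([1, 0, 0, 1], [1, 1, 1, 1])

# Hypothetical thresholds chosen only for this example.
absolute_ok, degradation_ok = check_metric(
    ref_precision, eval_precision, min_value=0.6, max_degradation=0.1
)
print(ref_precision, eval_precision, absolute_ok, degradation_ok)
```

For lower-is-better metrics such as the error rates and regression errors, the comparisons would be reversed.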

Prediction Variance (Positive Labels)

Model Performance

tabular

This test checks the Prediction Variance (Positive Labels) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Prediction Variance (Positive Labels) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Prediction Variance (Positive Labels) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Positive Prediction Rate

Model Performance

tabular, nlp, cv

This test checks the Positive Prediction Rate metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Positive Prediction Rate has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Positive Prediction Rate metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

False Negative Rate

Model Performance

tabular

This test checks the False Negative Rate metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of False Negative Rate has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the False Negative Rate metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mean-Absolute Error (MAE)

Model Performance

tabular

This test checks the Mean-Absolute Error (MAE) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Mean-Absolute Error (MAE) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Mean-Absolute Error (MAE) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mean-Squared-Log Error (MSLE)

Model Performance

tabular

This test checks the Mean-Squared-Log Error (MSLE) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Mean-Squared-Log Error (MSLE) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Mean-Squared-Log Error (MSLE) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Multiclass Accuracy

Model Performance

tabular, nlp, cv

This test checks the Multiclass Accuracy metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Multiclass Accuracy has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Multiclass Accuracy metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Macro Precision

Model Performance

tabular, nlp, cv

This test checks the Macro Precision metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Macro Precision has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Macro Precision metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Rank Correlation

Model Performance

tabular

This test checks the Rank Correlation metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Rank Correlation has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Rank Correlation metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

F1

Model Performance

tabular, nlp, cv

This test checks the F1 metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of F1 has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the F1 metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Prediction Variance (Negative Labels)

Model Performance

tabular

This test checks the Prediction Variance (Negative Labels) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Prediction Variance (Negative Labels) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Prediction Variance (Negative Labels) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

False Positive Rate

Model Performance

tabular

This test checks the False Positive Rate metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of False Positive Rate has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the False Positive Rate metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Root-Mean-Squared Error (RMSE)

Model Performance

tabular

This test checks the Root-Mean-Squared Error (RMSE) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Root-Mean-Squared Error (RMSE) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Root-Mean-Squared Error (RMSE) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Average Prediction

Model Performance

tabular, nlp, cv

This test checks the Average Prediction metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Average Prediction has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Average Prediction metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mean-Absolute Percentage Error (MAPE)

Model Performance

tabular

This test checks the Mean-Absolute Percentage Error (MAPE) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Mean-Absolute Percentage Error (MAPE) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Mean-Absolute Percentage Error (MAPE) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Average Rank

Model Performance

tabular

This test checks the Average Rank metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Average Rank has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Average Rank metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Prediction Variance

Model Performance

tabular

This test checks the Prediction Variance metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Prediction Variance has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Prediction Variance metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Macro F1

Model Performance

tabular, nlp, cv

This test checks the Macro F1 metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Macro F1 has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Macro F1 metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Normalized Discounted Cumulative Gain (NDCG)

Model Performance

tabular

This test checks the Normalized Discounted Cumulative Gain (NDCG) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Normalized Discounted Cumulative Gain (NDCG) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Normalized Discounted Cumulative Gain (NDCG) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

AUC

Model Performance

tabular

This test checks the AUC metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of AUC has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the AUC metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Recall

Model Performance

tabular, nlp, cv

This test checks the Recall metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Recall has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Recall metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Accuracy

Model Performance

tabular

This test checks the Accuracy metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Accuracy has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Accuracy metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mean-Squared Error (MSE)

Model Performance

tabular

This test checks the Mean-Squared Error (MSE) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Mean-Squared Error (MSE) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Mean-Squared Error (MSE) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Macro Recall

Model Performance

tabular, nlp, cv

This test checks the Macro Recall metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Macro Recall has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Macro Recall metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mean Reciprocal Rank (MRR)

Model Performance

tabular

This test checks the Mean Reciprocal Rank (MRR) metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Mean Reciprocal Rank (MRR) has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Mean Reciprocal Rank (MRR) metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Multiclass AUC

Model Performance

tabular, nlp, cv

This test checks the Multiclass AUC metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Multiclass AUC has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Multiclass AUC metric with the below thresholds set for the absolute and degradation tests.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Protected Feature Drift

Bias and Fairness

tabular

This test measures the severity of passing the model data points whose categorical features have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail displayed is the Population Stability Index (PSI) test statistic, which is a measure of how statistically significant the difference between the frequencies of categorical values in the reference and evaluation sets is.

Distribution drift in categorical features between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in categorical features towards categorical subsets that your model performs poorly in could indicate a degradation in model performance and signal the need for relabeling and retraining.

By default, this test runs over all categorical columns with sufficiently many samples.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
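The PSI statistic mentioned in this entry is commonly computed as the sum over categories of (p_eval - p_ref) * ln(p_eval / p_ref), where p_ref and p_eval are category frequencies in the reference and evaluation sets. The sketch below uses that standard formula. The function name and the small `eps` floor for categories missing from one set are assumptions, and the platform may handle such cases differently.

```python
from collections import Counter
import math

def psi(reference, evaluation, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature."""
    if not reference or not evaluation:
        raise ValueError("both samples must be non-empty")
    ref_counts, eval_counts = Counter(reference), Counter(evaluation)
    total = 0.0
    for category in set(reference) | set(evaluation):
        # Floor the frequencies at eps so unseen categories do not produce log(0).
        p_ref = max(ref_counts[category] / len(reference), eps)
        p_eval = max(eval_counts[category] / len(evaluation), eps)
        total += (p_eval - p_ref) * math.log(p_eval / p_ref)
    return total

# Made-up categorical columns: a 50/50 reference split drifts to 80/20.
reference = ["a"] * 50 + ["b"] * 50
evaluation = ["a"] * 80 + ["b"] * 20

drift = psi(reference, evaluation)
no_drift = psi(reference, reference)
print(drift, no_drift)
```

Identical distributions give a PSI of zero, and every term is non-negative, so larger values indicate larger shifts in category frequencies.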

Selection Rate

Bias and Fairness

tabular

This test checks whether the Selection Rate for any subset of a feature performs as well as the best Selection Rate across all subsets of that feature. The Selection Rate is calculated as the Positive Prediction Rate. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Selection Rate of model predictions within a specific subset is significantly lower than that of other subsets by taking a ratio of the rates.
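
The ratio comparison described above might look like the following sketch for a categorical feature with 0/1 predictions. The helper names and the 0.8 cutoff are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def selection_rates(groups, predictions):
    """Positive prediction rate per subset, for 0/1 predictions."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, pred in zip(groups, predictions):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def selection_rate_ratios(groups, predictions):
    """Ratio of each subset's rate to the best (highest) subset rate."""
    rates = selection_rates(groups, predictions)
    best = max(rates.values())
    if best == 0:
        # No subset receives positive predictions, so all rates are equal.
        return {g: 1.0 for g in rates}
    return {g: rate / best for g, rate in rates.items()}


groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
ratios = selection_rate_ratios(groups, predictions)
# Hypothetical cutoff: flag subsets whose ratio falls below 0.8.
flagged = [g for g, r in ratios.items() if r < 0.8]
print(ratios, flagged)
```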

Assessing differences in Selection Rate is an important measure of fairness. It is meant to be used in a setting where we assert that the base Selection Rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a sensitive attribute. It can be useful in legal/compliance settings where we want the Selection Rate for any sensitive group to fundamentally be the same as for other groups.

By default, the Selection Rate is computed for all protected features.

[tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Selection Rate (Avg Pred)

Bias and Fairness

tabular

This test checks whether the Average Prediction for any subset of a feature performs as well as the best Average Prediction across all subsets of that feature. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Average Prediction of model predictions within a specific subset is significantly lower than that of other subsets by taking a ratio of the rates.

Assessing differences in Average Prediction is an important measure of fairness. It is meant to be used in a setting where we assert that the base Average Predictions between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a sensitive attribute. It can be useful in legal/compliance settings where we want the Average Prediction for any sensitive group to fundamentally be the same as for other groups.

By default, the Average Prediction is computed for all protected features.

[tabular] Regression

Class Imbalance

Bias and Fairness

tabular

This test checks whether the training sample size for any subset of a feature is significantly smaller than other subsets of that feature. The test first splits the dataset into various subset classes within the feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the class imbalance measure of that subset compared to the largest subset exceeds a set threshold.
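
The text does not define the class imbalance measure, so the sketch below assumes a simple one: one minus the ratio of a subset's size to the largest subset's size. Both this measure and the 0.75 threshold are hypothetical.

```python
from collections import Counter


def imbalance_by_subset(values):
    """Hypothetical imbalance measure: 1 - (subset size / largest subset size)."""
    counts = Counter(values)
    largest = max(counts.values())
    return {v: 1 - n / largest for v, n in counts.items()}


feature = ["x"] * 80 + ["y"] * 60 + ["z"] * 8
measures = imbalance_by_subset(feature)
# Illustrative threshold: flag subsets more than 75% smaller than the largest.
flagged = [v for v, m in measures.items() if m > 0.75]
print(flagged)
```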

Assessing class imbalance is an important measure of fairness. Features with low subset sizes can result in the model overfitting those subsets, and hence cause a larger error when those subsets appear in test data. This test can be useful in legal/compliance settings where sufficient data for all subsets of a protected feature is important.

By default, class imbalance is tested for all protected features.

[tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Feature Independence

Bias and Fairness

tabular

This test checks whether each protected feature is independent of the predicted label class. It runs over categorical protected features and uses the chi-square test of independence. The test compares the observed data to a model that distributes the data according to the expectation that the variables are independent. The more the observed data deviates from that model, the stronger the evidence that the variables are dependent.
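
To make the comparison between observed and expected counts concrete, the following sketch computes the Pearson chi-square statistic by hand. The function name is an assumption for this example; in practice a statistics library would also supply the p-value.

```python
from collections import Counter


def chi_square_statistic(feature, predicted):
    """Pearson chi-square statistic and degrees of freedom for two categorical lists."""
    n = len(feature)
    observed = Counter(zip(feature, predicted))
    row_totals, col_totals = Counter(feature), Counter(predicted)
    stat = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            stat += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


feature = ["m"] * 50 + ["f"] * 50
predicted = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
stat, dof = chi_square_statistic(feature, predicted)
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
print(stat, dof, stat > 3.841)
```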

A test of independence assesses whether observations consisting of measures on two variables, expressed in a contingency table, are independent of each other. This can be useful when assessing how protected features impact the predicted class and helping with the feature selection process.

By default, this test is run over all protected categorical features.

[tabular] Binary Classification, [tabular] Multi-class Classification

Predict Protected Features

Bias and Fairness

tabular

The Predict Protected Features test works by training a multi-class logistic regression model to infer categorical protected features from unprotected categorical and numerical features. The model is fit to the reference data and scored based on its accuracy over the evaluation data. The unprotected categorical features are one-hot encoded.

In a compliance setting, it may be prohibited to include certain protected features in your training data. However, unprotected features might still provide your model with information about the protected features. If a simple logistic regression model can be trained to accurately predict protected features, your model might have a hidden reliance on protected features, resulting in biased decisions.

By default, this test runs over all categorical protected features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Demographic Parity (Pos Pred)

Bias and Fairness

tabular

This test is commonly known as the demographic parity or statistical parity test in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Positive Prediction Rate of model predictions within a specific subset is significantly different than the model prediction Positive Prediction Rate over the entire population.
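
For a numeric feature, the quantile split and the comparison against the overall population can be sketched as below. The number of bins and the helper names are assumptions; the platform's actual binning may differ.

```python
import bisect
import statistics


def quantile_subsets(values, n_bins=4):
    """Assign each numeric value to a quantile bin index in [0, n_bins - 1]."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_right(cuts, v) for v in values]


def parity_gaps(values, predictions, n_bins=4):
    """Positive prediction rate of each bin minus the overall rate."""
    overall = sum(predictions) / len(predictions)
    bins = quantile_subsets(values, n_bins)
    gaps = {}
    for b in sorted(set(bins)):
        preds = [p for bin_id, p in zip(bins, predictions) if bin_id == b]
        gaps[b] = sum(preds) / len(preds) - overall
    return gaps


ages = [21, 25, 30, 34, 41, 47, 52, 58]
predictions = [0, 0, 0, 1, 1, 1, 1, 1]
gaps = parity_gaps(ages, predictions, n_bins=2)
print(gaps)
```

A gap far from zero in either direction means the subset's Positive Prediction Rate differs from that of the population as a whole.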

Demographic parity is one of the most well-known and strict measures of fairness. It is meant to be used in a setting where we assert that the base label rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a protected attribute. It can be useful in legal/compliance settings where we want a Selection Rate for any protected group to fundamentally be the same as other groups.

By default, the Positive Prediction Rate is computed for all protected features.

[tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Demographic Parity (Avg Pred)

Bias and Fairness

tabular

This test is commonly known as the demographic parity or statistical parity test in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Average Prediction of model predictions within a specific subset is significantly different than the model prediction Average Prediction over the entire population.

Demographic parity is one of the most well-known and strict measures of fairness. It is meant to be used in a setting where we assert that the base label rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a protected attribute. It can be useful in legal/compliance settings where we want a Selection Rate for any protected group to fundamentally be the same as other groups.

By default, the Average Prediction is computed for all protected features.

[tabular] Regression, [tabular] Ranking

Demographic Parity (Avg Rank)

Bias and Fairness

tabular

This test is commonly known as the demographic parity or statistical parity test in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Average Rank of model predictions within a specific subset is significantly different than the model prediction Average Rank over the entire population.

Demographic parity is one of the most well-known and strict measures of fairness. It is meant to be used in a setting where we assert that the base label rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a protected attribute. It can be useful in legal/compliance settings where we want a Selection Rate for any protected group to fundamentally be the same as other groups.

By default, the Average Rank is computed for all protected features.

[tabular] Ranking

Equal Opportunity (Recall)

Bias and Fairness

tabular

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.
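
A per-subset Recall comparison might be sketched as follows for a categorical feature. Scores are converted to 0/1 with a 0.5 cutoff, which is an assumption about how the rounding is done; the function names are illustrative.

```python
def recall(labels, predictions):
    """True positive rate: TP / (TP + FN), for 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def recall_gap_by_subset(groups, labels, scores):
    """Overall recall minus each subset's recall (positive gap = subset is worse)."""
    predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff
    overall = recall(labels, predictions)
    gaps = {}
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        subset_recall = recall([labels[i] for i in idx], [predictions[i] for i in idx])
        gaps[g] = overall - subset_recall
    return gaps


groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 1, 1, 0, 1, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.9, 0.3, 0.4, 0.6]
gaps = recall_gap_by_subset(groups, labels, scores)
print(gaps)
```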

Differing true positive rates between subgroups (a violation of equal opportunity) are an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts a rejection is similar for groups A and B.

By default, Recall is computed over all predictions/labels. Note that we round predictions to 0/1 to compute recall.

[tabular] Binary Classification, [tabular] Ranking

Equal Opportunity (Macro Recall)

Bias and Fairness

tabular

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. When transitioning to the multiclass setting, we can use macro recall, which computes the recall of each individual class and then averages these numbers. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Recall of model predictions within a specific subset is significantly lower than the model prediction Macro Recall over the entire population.

Differing true positive rates between subgroups (a violation of equal opportunity) are an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts an interview is similar for groups A and B.
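
Macro recall as described above, the unweighted average of per-class recall, can be sketched in a few lines. The predicted class is taken to be the one with the largest predicted probability; the function name is illustrative.

```python
def macro_recall(labels, predicted_probs):
    """Average of per-class recall; the predicted class is the argmax probability."""
    predictions = [max(range(len(p)), key=p.__getitem__) for p in predicted_probs]
    recalls = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        hits = sum(1 for i in idx if predictions[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


labels = [0, 0, 1, 1, 2, 2]
probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
]
print(macro_recall(labels, probs))
```

Here the per-class recalls are 0.5, 1.0, and 0.5, so each class contributes equally regardless of how many rows it has.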

By default, Macro Recall is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted class probability.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Intersectional Group Fairness (Pos Pred)

Bias and Fairness

tabular

This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the positive prediction rate of model predictions within a specific subset is significantly lower than the model positive prediction rate over the entire population. This will expose hidden biases against groups at the intersection of these protected features.
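
The pairing of protected features and the per-subgroup rate can be sketched as follows. The row layout (a list of dicts) and the function name are assumptions for this example, and the comparison to the overall rate omits any significance testing.

```python
from itertools import combinations


def intersectional_rates(rows, protected, predictions):
    """Positive prediction rate for each value combination of each feature pair."""
    rates = {}
    for feat_a, feat_b in combinations(protected, 2):
        buckets = {}
        for row, pred in zip(rows, predictions):
            key = (feat_a, row[feat_a], feat_b, row[feat_b])
            buckets.setdefault(key, []).append(pred)
        for key, preds in buckets.items():
            rates[key] = sum(preds) / len(preds)
    return rates


rows = [
    {"gender": "f", "region": "north"},
    {"gender": "f", "region": "south"},
    {"gender": "m", "region": "north"},
    {"gender": "m", "region": "south"},
    {"gender": "f", "region": "south"},
    {"gender": "m", "region": "north"},
]
predictions = [1, 0, 1, 1, 0, 1]
overall = sum(predictions) / len(predictions)
rates = intersectional_rates(rows, ["gender", "region"], predictions)
# Subgroups whose rate falls below the overall population rate.
below = {k: r for k, r in rates.items() if r < overall}
print(below)
```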

Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.

This test runs over unique pairs of categorical protected features.

[tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Intersectional Group Fairness (Avg Pred)

Bias and Fairness

tabular

This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the average prediction of model predictions within a specific subset is significantly lower than the model average prediction over the entire population. This will expose hidden biases against groups at the intersection of these protected features.

Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.

This test runs over unique pairs of categorical protected features.

[tabular] Regression, [tabular] Ranking

Intersectional Group Fairness (Avg Rank)

Bias and Fairness

tabular

This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the average rank of model predictions within a specific subset is significantly lower than the model average rank over the entire population. This will expose hidden biases against groups at the intersection of these protected features.

Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.

This test runs over unique pairs of categorical protected features.

[tabular] Ranking

Predictive Equality (FPR)

Bias and Fairness

tabular

The false positive error rate test is also popularly referred to as predictive equality or equal mis-opportunity in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the false positive rate of model predictions within a specific subset is significantly higher than the model prediction false positive rate over the entire population.

Differing false positive rates between subgroups (a violation of predictive equality) are an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. As an intuitive example, consider the case when the label indicates an undesirable attribute: if predicting whether a person will default on their loan, make sure that for people who didn’t default, the rate at which the model incorrectly predicts positive is similar for groups A and B.
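
The false positive rate comparison can be sketched in the same spirit, here for a categorical feature with predictions already rounded to 0/1. The data and the function name are made up for illustration.

```python
def false_positive_rate(labels, predictions):
    """FP / (FP + TN), for 0/1 labels and predictions."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else float("nan")


groups = ["A"] * 4 + ["B"] * 4
labels = [0, 0, 0, 1, 0, 0, 0, 1]
predictions = [0, 0, 0, 1, 1, 1, 0, 1]

overall_fpr = false_positive_rate(labels, predictions)
subset_fpr = {}
for g in sorted(set(groups)):
    idx = [i for i, grp in enumerate(groups) if grp == g]
    subset_fpr[g] = false_positive_rate(
        [labels[i] for i in idx], [predictions[i] for i in idx]
    )
# Subsets with a higher false positive rate than the whole population.
higher = [g for g, rate in subset_fpr.items() if rate > overall_fpr]
print(overall_fpr, subset_fpr, higher)
```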

By default, false positive rate is computed over all predictions/labels. Note that we round predictions to 0/1 to compute false positive rate.

[tabular] Binary Classification, [tabular] Ranking

Discrimination By Proxy

Bias and Fairness

tabular

This test checks whether any feature is a proxy for a protected feature. It runs over categorical features, using mutual information as a measure of similarity with a protected feature. Mutual information measures any dependencies between two variables.
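
For reference, mutual information between two categorical columns can be computed from their joint and marginal frequencies, as in the sketch below. The function name and the toy columns are assumptions; how the platform normalizes or thresholds the value is not covered here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


protected = ["a", "a", "b", "b"]
zip_code = ["1", "1", "2", "2"]   # fully determined by the protected feature
unrelated = ["u", "v", "u", "v"]  # independent of the protected feature
print(mutual_information(protected, zip_code), mutual_information(protected, unrelated))
```

A value of zero means the two columns are independent in the sample, while larger values mean one column carries more information about the other.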

A common strategy to try to ensure a model is not biased is to remove protected features from the training data entirely so the model cannot learn over them. However, if other features are highly dependent on those features, that could lead to the model effectively still training over those features by proxy.

By default, this test is run over all categorical protected columns.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Sensitivity (Pos Pred)

Bias and Fairness

tabular

This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Positive Prediction Rate. The test then substitutes this subset into a sample from the original data and calculates the change in Positive Prediction Rate. This test fails if a model demonstrates significantly lower Positive Prediction Rate on the lowest performing subset.
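
One way to read the substitution step is sketched below: find the feature value with the lowest Positive Prediction Rate, overwrite that feature in every row of a sample, and measure the change. This interpretation, the toy model, and all names are assumptions for illustration.

```python
def positive_rate(model, rows):
    """Fraction of rows for which the model predicts 1."""
    return sum(model(r) for r in rows) / len(rows)


def subset_sensitivity(model, rows, feature):
    """Return the lowest-rate value and the rate change after substituting it."""
    values = sorted({r[feature] for r in rows})
    rates = {
        v: positive_rate(model, [r for r in rows if r[feature] == v]) for v in values
    }
    worst = min(rates, key=rates.get)
    substituted = [{**r, feature: worst} for r in rows]
    return worst, positive_rate(model, substituted) - positive_rate(model, rows)


def toy_model(row):
    # Hypothetical biased model: never predicts positive for group "B".
    return 1 if row["income"] > 50 and row["group"] != "B" else 0


rows = [
    {"group": "A", "income": 60},
    {"group": "A", "income": 70},
    {"group": "B", "income": 80},
    {"group": "B", "income": 40},
]
worst, change = subset_sensitivity(toy_model, rows, "group")
print(worst, change)
```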

Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.

By default, the subset sensitivity is computed for all protected features that are strings.

[tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Sensitivity (Avg Pred)

Bias and Fairness

tabular

This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Average Prediction. The test then substitutes this subset into a sample from the original data and calculates the change in Average Prediction. This test fails if a model demonstrates significantly lower Average Prediction on the lowest performing subset.

Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.

By default, the subset sensitivity is computed for all protected features that are strings.

[tabular] Regression, [tabular] Ranking

Subset Sensitivity (Avg Rank)

Bias and Fairness

tabular

This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Average Rank. The test then substitutes this subset into a sample from the original data and calculates the change in Average Rank. This test fails if a model demonstrates significantly lower Average Rank on the lowest performing subset.

Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.

By default, the subset sensitivity is computed for all protected features that are strings.

[tabular] Ranking

Out of Range Substitution

Transformations

tabular

This test measures the impact on the model when we substitute values outside the inferred range of allowed values into clean datapoints.

In production, the model may encounter corrupted or manipulated out of range values. It is important that the model is robust to such extremities.

By default, this test runs over all numeric features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Numeric Outliers Substitution

Transformations

tabular

This test measures the impact on the model when we substitute outliers into clean datapoints. Outliers are values which may not necessarily be outside of an allowed range for a feature, but are extreme values that are unusual and may be indicative of abnormality.

Outliers can be a sign of corrupted or otherwise erroneous data, and can degrade model performance if used in the training data, or lead to unexpected behaviour if input at inference time.

By default this test is run over each numeric feature that is neither unique nor ascending.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Int Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type Integer into features that are inferred to be Integer type from the reference set. In this specific test, we add a decimal value to the integer to convert it to a float.
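
The integer-to-float perturbation can be sketched as follows. The size of the added decimal, the toy model, and the helper name are assumptions chosen to show how an exact-match lookup can break.

```python
import random


def add_decimal(row, feature, rng):
    """Copy of row with an integer feature turned into a non-integral float."""
    perturbed = dict(row)
    perturbed[feature] = row[feature] + rng.uniform(0.1, 0.9)
    return perturbed


def toy_model(row):
    # Hypothetical model: looks up a score by exact room count.
    scores = {1: 0.2, 2: 0.5, 3: 0.8}
    return scores.get(row["rooms"], 0.0)


rng = random.Random(0)
clean = {"rooms": 3, "area": 72.5}
corrupted = add_decimal(clean, "rooms", rng)
print(toy_model(clean), toy_model(corrupted))
```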

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Integer.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Float Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type Float into features that are inferred to be Float type from the reference set. In this specific test, we cast the float as a string (2.3 becomes ‘2.3’).

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Float.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

String Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type String Categorical into features that are inferred to be String Categorical type from the reference set. In this specific test, we fix a random integer as the input.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type String Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Boolean Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type Boolean Categorical into features that are inferred to be Boolean Categorical type from the reference set. In this specific test, we randomly fix an integer as the input.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Boolean Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

URL Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type URL Categorical into features that are inferred to be URL Categorical type from the reference set. In this specific test, we create a random string and fix that as the input.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type URL Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Domain Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type Domain Categorical into features that are inferred to be Domain Categorical type from the reference set. In this specific test, we create a random string and fix that as the input.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Domain Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Email Feature Type Change

Transformations

tabular

This test measures the impact on the model when we substitute values not of type Email Categorical into features that are inferred to be Email Categorical type from the reference set. In this specific test, we create a random string and fix that as the input.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Email Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Empty String Substitution

Transformations

tabular

This test measures the impact on the model when we substitute empty string values instead of null values into clean datapoints.
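
A minimal sketch of this substitution, assuming rows are dicts and nulls are represented as `None`. The toy model is hypothetical and deliberately treats `None` and the empty string differently.

```python
def substitute_empty_strings(rows, feature):
    """Copy of rows where null values of a string feature become empty strings."""
    return [{**r, feature: ""} if r[feature] is None else dict(r) for r in rows]


def toy_model(row):
    # Hypothetical model: handles None as missing, but "" as an unknown city.
    if row["city"] is None:
        return 0.5
    return 0.9 if row["city"] == "paris" else 0.1


rows = [{"city": None}, {"city": "paris"}]
changed = substitute_empty_strings(rows, "city")
impact = [abs(toy_model(a) - toy_model(b)) for a, b in zip(rows, changed)]
print(changed, impact)
```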

In production, the model may encounter corrupted or manipulated string values. Null values and empty strings are often expected to be treated the same, but the model might not treat them that way. It is important that the model is robust to such extremities.

By default, this test runs over all string features with null values.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Required Characters Deletion

Transformations

tabular

This test measures the impact on the model when we delete required characters, inferred from the reference set, from the strings of clean datapoints.

A feature may require specific characters. However, errors in the data pipeline may allow invalid data points that lack these required characters to pass. Failing to catch such errors may lead to noisier training data or noisier predictions during inference, which can degrade model metrics.

By default, this test runs over all string features that are inferred to have required characters.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Unseen Categorical Substitution

Transformations

tabular

This test measures the impact on the model when we substitute unseen categorical values into clean datapoints.
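
Generating a value that is guaranteed not to appear in the reference set can be as simple as the sketch below. The naming scheme for the unseen value is an assumption.

```python
def unseen_value(reference_values, prefix="unseen_"):
    """Return a string that does not occur among the reference values."""
    seen = set(reference_values)
    i = 0
    while f"{prefix}{i}" in seen:
        i += 1
    return f"{prefix}{i}"


def substitute_unseen(row, feature, reference_values):
    """Copy of row with the categorical feature replaced by an unseen value."""
    return {**row, feature: unseen_value(reference_values)}


reference_colors = ["red", "green", "blue"]
clean = {"color": "red", "size": 3}
print(substitute_unseen(clean, "color", reference_colors))
```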

Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all categorical features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Unseen Domain Substitution

Transformations

tabular

This test measures the impact on the model when we substitute unseen domain values into clean datapoints.

Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all features inferred to contain domains.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Unseen Email Substitution

Transformations

tabular

This test measures the impact on the model when we substitute unseen email values into clean datapoints.

Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all features inferred to contain emails.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Unseen URL Substitution

Transformations

tabular

This test measures the impact on the model when we substitute unseen URL values into clean datapoints.

Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all features inferred to contain URLs.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Null Substitution

Transformations

tabular

This test measures the impact on the model when we substitute nulls into clean datapoints for features that should not contain nulls.

The model may make certain assumptions about a column depending on whether or not it had nulls in the training data. If these assumptions break during production, this may damage the model’s performance. For example, if a column was never null during training then a model may not have learned to be robust against noise in that column.

By default, this test runs over all columns that had zero nulls in the reference set.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Capitalization Change

Transformations

tabular

This test measures the impact on the model when we substitute different types of capitalization into clean datapoints.

In production, models can come across the same value with different capitalizations, making it important to explicitly check that your model is invariant to such differences.

By default, this test runs over all categorical features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Correlation Drift (Feature-to-Feature)

Drift

tabular, nlp, cv

This test measures the severity of feature-feature correlation drift from the reference to the evaluation set for a given pair of features. The severity is a function of the correlation drift in the data. The key detail is the difference in correlation scores between the reference and evaluation sets, along with an associated p-value. Correlation is a measure of the linear relationship between two numeric columns (here, two features), so this test checks for significant changes in this relationship for each pair of features between the reference and evaluation sets. To compute the p-value, we use Fisher’s z-transformation to convert the distribution of sample correlations to a normal distribution, and then we run a standard two-sample test on two normal distributions.
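The p-value computation described above can be sketched as follows. This is a minimal illustration, not the RI Platform's implementation: the function name, its arguments, and the choice of a two-sided z-test are assumptions made for the example.

```python
import math
from statistics import NormalDist


def correlation_drift_p_value(r_ref, n_ref, r_eval, n_eval):
    """Two-sided p-value for the difference between two sample correlations.

    Hypothetical helper: r_* are Pearson correlations, n_* are sample sizes.
    """
    # Fisher's z-transformation: atanh(r) is approximately normal
    # with variance 1 / (n - 3).
    z_ref = math.atanh(r_ref)
    z_eval = math.atanh(r_eval)
    std_err = math.sqrt(1.0 / (n_ref - 3) + 1.0 / (n_eval - 3))
    z_stat = (z_ref - z_eval) / std_err
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))


# Same correlation in both sets: no evidence of drift.
p_same = correlation_drift_p_value(0.60, 500, 0.60, 500)
# Correlation drops from 0.60 to 0.20: strong evidence of drift.
p_shift = correlation_drift_p_value(0.60, 500, 0.20, 500)
```

A small p-value suggests that the change in correlation is unlikely to be due to sampling noise alone.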

Correlation drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.

By default, this test runs over all pairs of features in the dataset.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Correlation Drift (Feature-to-Label)

Drift

tabular, nlp, cv

This test measures the severity of feature-label correlation drift from the reference to the evaluation set for a given pair of a feature and label. The severity is a function of the correlation drift in the data. The key detail is the difference in correlation scores between the reference and evaluation sets, along with an associated p-value. Correlation is a measure of the linear relationship between two numeric columns (here, a feature and the label), so this test checks for significant changes in this relationship for each feature-label pair between the reference and evaluation sets. To compute the p-value, we use Fisher’s z-transformation to convert the distribution of sample correlations to a normal distribution, and then we run a standard two-sample test on two normal distributions.

Correlation drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.

By default, this test runs over all pairs of features and labels in the dataset.

[tabular] Regression

Mutual Information Drift (Feature-to-Feature)

Drift

tabular, nlp, cv

This test measures the severity of feature mutual information drift from the reference to the evaluation set for a given pair of features. The severity is a function of the mutual information drift in the data. The key detail is the difference in mutual information scores between the reference and evaluation sets. Mutual information is a measure of how dependent two features are, so this checks for significant changes in dependence between pairs of features in the reference and evaluation sets.
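As a rough illustration of the quantity being compared, the sketch below computes mutual information for two discrete columns and takes the difference between a reference and an evaluation sample. The function name and the toy data are hypothetical, and the platform may bin, normalize, or threshold the scores differently.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Reference set: the two features are perfectly dependent.
mi_ref = mutual_information(["a", "a", "b", "b"], ["u", "u", "v", "v"])
# Evaluation set: the two features are independent.
mi_eval = mutual_information(["a", "a", "b", "b"], ["u", "v", "u", "v"])
mi_drift = mi_ref - mi_eval
```

Here the dependence present in the reference sample disappears in the evaluation sample, so the drift equals the full reference score.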

Mutual information drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.

By default, this test runs over all pairs of features in the dataset.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mutual Information Drift (Feature-to-Label)

Drift

tabular, nlp, cv

This test measures the severity of feature-label mutual information drift from the reference to the evaluation set for a given pair of a feature and label. The severity is a function of the mutual information drift in the data. The key detail is the difference in mutual information scores between the reference and evaluation sets. Mutual information is a measure of how dependent two columns are, so this checks for significant changes in dependence between each feature and the label in the reference and evaluation sets.

Mutual information drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.

By default, this test runs over all pairs of features and labels in the dataset.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Label Drift (Categorical)

Drift

tabular, nlp, cv

This test checks that the difference in label distribution between the reference and evaluation sets is small, using the PSI test. The key detail displayed is the PSI statistic, which is a measure of how different the frequencies of the column in the reference and evaluation sets are.

Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

This test is run by default whenever both the reference and evaluation sets have associated labels.

[tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Predicted Label Drift

Drift

tabular, nlp, cv

This test checks that the difference in predicted label distribution between the reference and evaluation sets is small, using the PSI test. The key detail displayed is the PSI statistic, which is a measure of how different the frequencies of the column in the reference and evaluation sets are.

Predicted Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant predicted label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

This test is run by default whenever the model or predictions are provided.

[tabular] Multi-class Classification

Label Drift (Regression)

Drift

tabular, nlp, cv

This test checks that the difference in label distribution between the reference and evaluation sets is small, using the Kolmogorov-Smirnov (KS) test. The key detail displayed is the KS statistic, which is a measure of how different the labels in the reference and evaluation sets are. Concretely, the KS statistic is the maximum difference between the empirical CDFs of the two label columns.
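The KS statistic defined above can be computed directly from the two label columns. The following is a minimal sketch with a hypothetical function name; it returns only the statistic and does not reproduce any thresholds or p-values the platform may apply.

```python
from bisect import bisect_right


def ks_statistic(ref, eval_):
    """Maximum absolute difference between two empirical CDFs."""
    ref_sorted = sorted(ref)
    eval_sorted = sorted(eval_)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every point that occurs in either sample.
    for x in set(ref_sorted) | set(eval_sorted):
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_eval = bisect_right(eval_sorted, x) / len(eval_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_eval))
    return max_gap


shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])    # 0.5
identical = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])  # 0.0
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap between the two samples).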

Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

This test is run by default whenever both the reference and evaluation sets have associated labels.

[tabular] Regression, [tabular] Ranking

Categorical Feature Drift

Drift

tabular, nlp, cv

This test measures the severity of passing to the model data points that have categorical features which have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail displayed is the PSI test statistic, which is a measure of how statistically significant the difference between the frequencies of categorical values in the reference and evaluation sets is.

Distribution drift in categorical features between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in categorical features towards categorical subsets that your model performs poorly in could indicate a degradation in model performance and signal the need for relabeling and retraining.

By default, this test runs over all categorical columns with sufficiently many samples.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Numeric Feature Drift

Drift

tabular, nlp, cv

This test measures the severity of passing to the model data points that have numeric features that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the Population Stability Index statistic. The Population Stability Index (PSI) is a measure of how different two distributions are. Given two distributions P and Q, it is computed as the sum of the KL Divergence between P and Q and the (reverse) KL Divergence between Q and P. Thus, PSI is symmetric.
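The definition of PSI as the sum of the two KL divergences can be checked numerically. The sketch below assumes the numeric feature has already been binned and that `p` and `q` hold the bin proportions for the reference and evaluation sets; the function names and the `eps` clipping for empty bins are assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """KL divergence between two discrete distributions with no empty q bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def psi(p, q, eps=1e-6):
    """Population Stability Index between two sets of bin proportions."""
    # Clip empty bins to a small value to avoid log(0); eps is an assumed choice.
    p = [max(pi, eps) for pi in p]
    q = [max(qi, eps) for qi in q]
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


ref_bins = [0.5, 0.3, 0.2]
eval_bins = [0.4, 0.4, 0.2]
score = psi(ref_bins, eval_bins)
symmetric_kl = kl_divergence(ref_bins, eval_bins) + kl_divergence(eval_bins, ref_bins)
```

Because each term `(p - q) * log(p / q)` is unchanged when `p` and `q` are swapped, the score is the same in both directions.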

Distribution shift between training and inference can cause degradation in model performance. If the shift is sufficiently large, retraining the model on newer data may be necessary.

By default, this test runs over all numeric columns with sufficiently many samples and stored quantiles in each of the reference and evaluation sets.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Prediction Drift

Drift

tabular, nlp, cv

This test checks that the difference in the prediction distribution between the reference and evaluation sets is small, using Population Stability Index. The key detail displayed is the PSI which is a measure of how different the prediction distributions in the reference and evaluation sets are.

Prediction distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant prediction distribution drift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

This test is run by default whenever both the reference and evaluation sets have associated predictions. Different thresholds are associated with different severities.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Embedding Drift

Drift

tabular, nlp, cv

This test measures the severity of passing to the model data points associated with embeddings that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the Euclidean Distance statistic. The Euclidean Distance is defined as the square root of the sum of the squared differences between two vectors X and Y. The normalized version of this metric first divides each vector by its L2 norm. This test takes the normalized Euclidean distance between the centroids of the ref and eval data sets.
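The distance described above can be sketched as follows. One detail is an assumption here: the example normalizes the two centroids (rather than each individual embedding) by their L2 norm before taking the Euclidean distance. All function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def l2_normalize(v):
    """Scale a vector to unit L2 norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normalized_centroid_distance(ref, eval_):
    """Euclidean distance between the L2-normalized centroids."""
    c_ref = l2_normalize(centroid(ref))
    c_eval = l2_normalize(centroid(eval_))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_ref, c_eval)))


ref_embeddings = [[1.0, 0.0], [3.0, 0.0]]
rotated = normalized_centroid_distance(ref_embeddings, [[0.0, 2.0], [0.0, 4.0]])
rescaled = normalized_centroid_distance(ref_embeddings, [[2.0, 0.0], [6.0, 0.0]])
```

With this normalization, a pure rescaling of the embeddings gives a distance of zero, while a change in direction does not.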

Distribution shift between training and inference can cause degradation in model performance. If the shift is sufficiently large, retraining the model on newer data may be necessary.

By default, this test runs over all specified embeddings with sufficiently many samples in each of the reference and evaluation sets.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Nulls Per Feature Drift

Drift

tabular

This test measures the severity of passing to the model data points that have features with a null proportion that has drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the p-value from a two-sample proportion test that checks if there is a statistically significant difference in the frequencies of null values between the reference and evaluation sets.
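A two-sample proportion test of this kind can be sketched with a pooled z-test. This is an illustrative version under stated assumptions (a two-sided test with a pooled standard error and a hypothetical function name), not necessarily the exact variant the platform runs.

```python
import math
from statistics import NormalDist


def null_proportion_p_value(nulls_ref, n_ref, nulls_eval, n_eval):
    """Two-sided p-value for a difference in null frequency between two sets."""
    p_ref = nulls_ref / n_ref
    p_eval = nulls_eval / n_eval
    pooled = (nulls_ref + nulls_eval) / (n_ref + n_eval)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_eval))
    if std_err == 0:
        # Both sets are entirely null or entirely non-null: no difference.
        return 1.0
    z = (p_ref - p_eval) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


stable = null_proportion_p_value(10, 1000, 10, 1000)    # 1% nulls in both sets
drifted = null_proportion_p_value(10, 1000, 100, 1000)  # 1% vs 10% nulls
```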

Distribution drift in null values between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in null value proportion could indicate a degradation in model performance and signal the need for relabeling and retraining.

By default, this test runs over all columns with sufficiently many samples.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Nulls Per Row Drift

Drift

tabular

This test measures the severity of passing to the model data points that have proportions of null values that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much predictions change when the observed drift is applied to a given row. The key detail displayed is the PSI statistic that is a measure of how statistically significant the difference in the proportion of null values in a row between the reference and evaluation sets is.

Distribution drift in null values between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in null value proportion could indicate a degradation in model performance and signal the need for relabeling and retraining.

By default, this test runs over all rows.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Single-Feature Changes

Attacks

tabular

This test measures the severity of passing to the model data points that have been manipulated across a single feature in an unbounded manner. The severity is a function of the impact of these manipulations on the model.

In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. ‘Attacking’ a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting ourselves to changing a single feature at a time is one proxy for what ‘realistic’ out-of-distribution data can look like.

By default, for a given input we aim to change your model’s prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Bounded Single-Feature Changes

Attacks

tabular

This test measures the severity of passing to the model data points that have been manipulated across a single feature in a bounded manner. The severity is a function of the impact of these manipulations on the model. We bound the manipulations to be less than some fraction of the range of the given feature.
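To make the idea of a bounded manipulation concrete, the sketch below perturbs one numeric feature within a fraction of its range and keeps the value that moves the prediction furthest away from the true label. Everything here is hypothetical: the grid search, the `fraction` and `steps` parameters, and the toy scoring function are stand-ins chosen for illustration, and the platform's actual search strategy is not described in this document.

```python
def bounded_single_feature_attack(model, row, feature, feature_range, label,
                                  fraction=0.1, steps=21):
    """Grid-search one feature within +/- fraction * range of its value."""
    low, high = feature_range
    budget = fraction * (high - low)
    base_pred = model(row)
    # Push predictions down for positive labels and up for negative ones.
    direction = -1.0 if label == 1 else 1.0
    best_row, best_change = dict(row), 0.0
    for i in range(steps):
        delta = -budget + 2 * budget * i / (steps - 1)
        candidate = dict(row)
        candidate[feature] = row[feature] + delta
        change = direction * (model(candidate) - base_pred)
        if change > best_change:
            best_row, best_change = candidate, change
    return best_row, best_change


def toy_model(row):
    # Hypothetical scoring function standing in for a real model.
    score = 0.02 * row["age"] - 0.001 * row["balance"]
    return min(1.0, max(0.0, 0.5 + score))


row = {"age": 10.0, "balance": 100.0}
adv_row, change = bounded_single_feature_attack(
    toy_model, row, "age", feature_range=(0.0, 100.0), label=1)
```

Averaging `change` over many rows gives the kind of average prediction change that can then be compared against a threshold.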

In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. ‘Attacking’ a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting ourselves to changing a single feature by a small amount is one proxy for what ‘realistic’ out-of-distribution data can look like.

By default, for a given input we aim to change your model’s prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold. This test runs only over numeric features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Multi-Feature Changes

Attacks

tabular

This test measures the severity of passing to the model data points that have been manipulated across multiple features in an unbounded manner. The severity is a function of the impact of these manipulations on the model.

In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. ‘Attacking’ a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting the number of features that can be changed is one proxy for what ‘realistic’ out-of-distribution data can look like.

By default, for a given input we aim to change your model’s prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Bounded Multi-Feature Changes

Attacks

tabular

This test measures the severity of passing to the model data points that have been manipulated across multiple features in a bounded manner. The severity is a function of the impact of these manipulations on the model. We bound the manipulations to be less than some fraction of the range of the given feature.

In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. ‘Attacking’ a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting the number of features that can be changed and the magnitude of the change that can be made to each feature is one proxy for what ‘realistic’ out-of-distribution data can look like.

By default, for a given input we aim to change your model’s prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold. This test runs only over numeric features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Label Imbalance

Data Cleanliness

tabular, nlp, cv

This test checks that no labels have exceedingly high frequency.

Label imbalance in the training data can introduce bias into the model and possibly result in poor predictive performance on examples from the minority classes.

This test runs only on classification tasks.

[tabular] Binary Classification, [tabular] Multi-class Classification

Required Features

Data Cleanliness

tabular

This test checks that the features of a dataset are as expected.

Errors in data collection and processing can lead to missing (or extra) features. Missing features can cause failures in models. Extra features can lead to unnecessary storage and computation.

This test runs only when required features are specified.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Duplicate Row

Data Cleanliness

tabular

This test checks if there are any duplicate rows in your dataset. The key detail displays the number of duplicate rows in your dataset.

Duplicate rows are potentially a sign of a broken data pipeline or an otherwise corrupted input.

By default this test is run over all features, meaning two rows are considered duplicates only if they match across all features.
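The duplicate check can be sketched as below. Two points are assumptions made for the example: rows are represented as dictionaries, and a "duplicate" is counted as any row that repeats an earlier row (so three identical rows contribute two duplicates). The optional `features` argument is hypothetical and shows how restricting the comparison to a subset of columns changes the count.

```python
from collections import Counter


def count_duplicate_rows(rows, features=None):
    """Count rows that repeat an earlier row on the chosen features."""
    keys = [
        tuple(row[f] for f in (features or sorted(row)))
        for row in rows
    ]
    counts = Counter(keys)
    return sum(c - 1 for c in counts.values())


rows = [
    {"id": 1, "city": "NYC"},
    {"id": 1, "city": "NYC"},
    {"id": 2, "city": "NYC"},
]
all_features = count_duplicate_rows(rows)        # rows must match on every feature
city_only = count_duplicate_rows(rows, ["city"])  # rows must match on "city" only
```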

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Mutual Information Decrease (Feature to Label)

Data Cleanliness

tabular

This test flags a likely data leakage issue in the model. Data leakage occurs when a model is trained on features containing information about the label that is not normally present during production. This test flags an issue if both of the following occur: (1) the normalized mutual information between the feature and the label is too high in the reference set, and (2) the normalized mutual information for the reference set is much higher than for the evaluation set. The first criterion is an indicator that the feature has unreasonably high predictive power for the label during training, and the second criterion checks that the feature is no longer a good predictor in the evaluation set. One requirement for this test to flag data leakage is that the evaluation set labels and features are collected properly. This test should be used if you trust that your evaluation data is collected correctly; otherwise, the High MI test should be used.

Errors in data collection and processing can lead to some features containing information about the label in the reference set that do not appear in the evaluation set. This causes the model to under-perform during production.

By default, this test always runs on all categorical features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

High Mutual Information (Feature to Label)

Data Cleanliness

tabular

This test flags a likely data leakage issue if the normalized mutual information between the feature and the label is too high in the reference set. Data leakage occurs when a model is trained on features containing information about the label that is not normally present during production. This criterion is an indicator that this feature has unreasonably high predictive power for the label during training. One requirement for this test to flag data leakage is that the reference set labels and features are collected properly. This test should be utilized when one doesn’t trust their eval data is collected correctly, else the MI Decrease test should be used.

Errors in data collection and processing can lead to some features containing information about the label in the reference set. This causes the model to under-perform during production.

By default, this test always runs on all categorical features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

High Feature Correlation

Data Cleanliness

tabular

This test checks that the correlation between two features in the reference set is not too high. Correlation is a measure of the linear relationship between two numeric features.
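A minimal sketch of such a check is shown below. The `threshold` value, the use of the absolute value of the Pearson correlation, and the function names are assumptions for illustration; the platform's defaults are not stated in this document.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two numeric columns (assumes non-zero variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| above threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [40.0, 22.0, 35.0, 28.0],
}
flagged = highly_correlated_pairs(columns)
```

In this toy data only the two height columns are flagged, since they encode the same measurement in different units.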

Correlation in training features can be caused by a variety of factors, including interdependencies between the collected features, data collection processes, or change in data labeling. Training on features that are too similar can lead to underperforming or non-robust models.

By default, this test runs over all pairs of numeric features in the dataset.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Precision

Subset Performance

tabular, nlp, cv

The precision test is also popularly referred to as positive predictive parity in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire population.
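The subset comparison can be sketched for a categorical feature as follows. This is a simplified illustration: the 0.5 rounding threshold and the function names are assumptions, and the sketch reports only the raw gap for the lowest-performing subset without the significance check mentioned above.

```python
from collections import defaultdict


def precision(labels, preds):
    """Fraction of predicted positives that are truly positive."""
    hits = [label for label, pred in zip(labels, preds) if pred == 1]
    return sum(hits) / len(hits) if hits else None


def worst_subset_gap(feature, labels, scores, threshold=0.5):
    """Return the lowest-precision subset and its gap to overall precision."""
    # Round predicted scores to 0/1 (threshold is an assumed choice).
    preds = [1 if s >= threshold else 0 for s in scores]
    overall = precision(labels, preds)
    groups = defaultdict(lambda: ([], []))
    for value, label, pred in zip(feature, labels, preds):
        groups[value][0].append(label)
        groups[value][1].append(pred)
    subset_precision = {}
    for value, (g_labels, g_preds) in groups.items():
        score = precision(g_labels, g_preds)
        if score is not None:  # skip subsets with no positive predictions
            subset_precision[value] = score
    worst = min(subset_precision, key=subset_precision.get)
    return worst, overall - subset_precision[worst]


feature = ["A", "A", "A", "B", "B", "B"]
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.7, 0.6, 0.1]
worst, gap = worst_subset_gap(feature, labels, scores)
```

For a numeric feature, the same comparison would apply after assigning each row to a quantile bucket.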

Having different precision (e.g. false discovery rates) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. Note that positive predictive parity does not necessarily indicate equal opportunity or predictive equality: as a hypothetical example, imagine that a loan qualification classifier flags 100 entries for group A and 100 entries for group B, each with a precision of 100%, but there are 100 actual qualified entries in group A and 9000 in group B. This would indicate disparities in opportunities given to each subgroup.

By default, Precision is computed over all predictions/labels. Note that we round predictions to 0/1 to compute precision.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Prediction Variance (Positive Labels)

Subset Performance

tabular

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire population. In this test, the population refers to all data with positive ground-truth labels.

High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed with other subset performance tests (accuracy, AUC) for a clearer view. In the variance metric over positive/negative labels, this could mean the model is much more uncertain about the given subset. When paired with a decrease in AUC, this implies the model underperforms on this subset.

By default, the variance is computed over all predictions with a positive ground-truth label.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Mean-Absolute Error (MAE)

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MAE of model predictions within a specific subset is significantly higher than the model prediction MAE over the entire population.

Having different mean-absolute error between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, mean-absolute error is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Multiclass Accuracy

Subset Performance

tabular, nlp, cv

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the accuracy of model predictions within a specific subset is significantly lower than the model prediction accuracy over the entire population.

Having different accuracy between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Accuracy can be thought of as a ‘weaker’ metric of model bias compared to measuring false positive rate (predictive equality) or false negative rate (equal opportunity). This is because we can have similar accuracy between group A and group B; yet group A actually has higher false positive rate, while group B has higher false negative rate (e.g. we reject qualified applicants in group A but accept non-qualified applicants in group B). Nevertheless, accuracy is a standard metric used during evaluation and should be considered as part of performance bias testing.

By default, accuracy is computed over all predictions/labels. Note we round predictions to 0/1 to compute accuracy.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
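As a rough illustration of the subset comparison described above, the following sketch splits a small hypothetical dataset on one feature (by quantile for numeric values, by value for categorical ones) and reports the gap between overall accuracy and the lowest-performing subset. The row schema (`label`, `pred`), the helper names, and the choice of two bins are assumptions made for this example; the significance test applied by the platform is not reproduced here.

```python
from collections import defaultdict

def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    return sum(1 for y, p in pairs if y == p) / len(pairs)

def quantile_bin(value, sorted_values, num_bins):
    """Index (0..num_bins-1) of the quantile bin that `value` falls into."""
    rank = sum(1 for v in sorted_values if v < value)
    return min(num_bins - 1, rank * num_bins // len(sorted_values))

def subset_accuracy(rows, feature, num_bins=2):
    """Map each subset of `feature` to its accuracy (hypothetical row schema)."""
    values = [r[feature] for r in rows]
    numeric = all(isinstance(v, (int, float)) for v in values)
    ordered = sorted(values) if numeric else None
    groups = defaultdict(list)
    for r in rows:
        # Numeric features are binned by quantile, categorical ones by value.
        key = quantile_bin(r[feature], ordered, num_bins) if numeric else r[feature]
        groups[key].append((r["label"], r["pred"]))
    return {k: accuracy(v) for k, v in groups.items()}

rows = [
    {"age": 20, "label": 1, "pred": 1},
    {"age": 25, "label": 0, "pred": 0},
    {"age": 60, "label": 1, "pred": 0},
    {"age": 70, "label": 0, "pred": 0},
]
overall = accuracy([(r["label"], r["pred"]) for r in rows])
per_subset = subset_accuracy(rows, "age")
worst_gap = overall - min(per_subset.values())
print(per_subset, worst_gap)
```

Here the upper age bin is the lowest-performing subset, and `worst_gap` plays the role of the key detail described above.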

Subset Macro Precision

Subset Performance

tabular, nlp, cv

The precision test is also popularly referred to as positive predictive parity in fairness literature. When transitioning to the multiclass setting, we can compute macro precision, which computes the precision of each class individually and then averages them. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Precision of model predictions within a specific subset is significantly lower than the model prediction Macro Precision over the entire population.

Having different macro precision (e.g. false discovery rates) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. Note that positive predictive parity does not necessarily indicate equal opportunity or predictive equality: as a hypothetical example, imagine that a loan qualification classifier flags 100 entries for group A and 100 entries for group B, each with a precision of 100%, but there are 100 actual qualified entries in group A and 9000 in group B. This would indicate disparities in opportunities given to each subgroup.

By default, Macro Precision is computed over all predictions/labels. Note that the predicted label is the label with the greatest predicted probability.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
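The loan qualification example above can be worked through numerically. This sketch uses only the counts stated in the example (100 flagged entries per group at 100% precision, with 100 and 9000 actually qualified entries); the dictionary layout and function names are illustrative.

```python
def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)

# Each group: 100 flagged entries, all of them correct.
groups = {
    "A": {"true_pos": 100, "false_pos": 0, "qualified": 100},
    "B": {"true_pos": 100, "false_pos": 0, "qualified": 9000},
}

summary = {}
for name, g in groups.items():
    # Qualified entries that were not flagged are false negatives.
    false_neg = g["qualified"] - g["true_pos"]
    summary[name] = (
        precision(g["true_pos"], g["false_pos"]),
        recall(g["true_pos"], false_neg),
    )
print(summary)
```

Both groups have a precision of 1.0, but recall is 1.0 for group A and roughly 0.011 for group B, which is the disparity in opportunity that precision alone does not reveal.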

Subset Rank Correlation

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the rank correlation of model predictions within a specific subset is significantly lower than the model prediction rank correlation over the entire population.

Having different rank correlation between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, rank correlation is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset F1

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the F1 of model predictions within a specific subset is significantly lower than the model prediction F1 over the entire population.

Having different F1 between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, F1 is computed over all predictions/labels. Note that we round predictions to 0/1 to compute F1 score.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Prediction Variance (Negative Labels)

Subset Performance

tabular

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire population. In this test, the population refers to all data with negative ground-truth labels.

High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed with other subset performance tests (accuracy, AUC) for a clearer view. In the variance metric over positive/negative labels, this could mean the model is much more uncertain about the given subset. When paired with a decrease in AUC, this implies the model underperforms on this subset.

By default, the variance is computed over all predictions with a negative ground-truth label.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
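A minimal sketch of the variance comparison described above, restricted to rows with a negative ground-truth label. The tuple layout `(subset, label, prediction)` and the sample values are hypothetical, and population variance (`statistics.pvariance`) is an assumed choice of estimator.

```python
from statistics import pvariance

# (subset value, ground-truth label, predicted score); illustrative data only.
rows = [
    ("web", 0, 0.10), ("web", 0, 0.12), ("web", 0, 0.11),
    ("app", 0, 0.05), ("app", 0, 0.90), ("app", 1, 0.95),
]

# The population for this test is every row with a negative label.
negatives = [(subset, pred) for subset, label, pred in rows if label == 0]
population_var = pvariance([pred for _, pred in negatives])

subset_var = {
    name: pvariance([pred for subset, pred in negatives if subset == name])
    for name in {subset for subset, _ in negatives}
}
flagged = [name for name, var in subset_var.items() if var > population_var]
print(population_var, subset_var, flagged)
```

In this toy data the `app` subset has a higher prediction variance than the negative-label population, so it is the one that would warrant a closer look alongside the accuracy and AUC subset tests.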

Subset False Positive Rate

Subset Performance

tabular

The false positive error rate test is also popularly referred to as predictive equality, or equal mis-opportunity in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the false positive rate of model predictions within a specific subset is significantly higher than the model prediction false positive rate over the entire population.

Having different false positive rates (e.g. predictive equality) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. As an intuitive example, consider the case when the label indicates an undesirable attribute: if predicting whether a person will default on their loan, make sure that for people who didn’t default, the rate at which the model incorrectly predicts positive is similar for group A and B.

By default, false positive rate is computed over all predictions/labels. Note that we round predictions to 0/1 to compute false positive rate.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
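The loan-default example above can be sketched as follows. Scores are converted to 0/1 predictions with a 0.5 cutoff, which is an assumption about how the rounding is applied; the group data and function name are hypothetical.

```python
def false_positive_rate(labels, scores, threshold=0.5):
    """FP / (FP + TN), after converting scores to 0/1 predictions."""
    preds = [1 if s >= threshold else 0 for s in scores]
    false_pos = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    negatives = sum(1 for y in labels if y == 0)
    return false_pos / negatives

# label 1 = defaulted, label 0 = did not default (illustrative data).
group_a = {"labels": [0, 0, 0, 0, 1], "scores": [0.1, 0.2, 0.7, 0.3, 0.9]}
group_b = {"labels": [0, 0, 0, 0, 1], "scores": [0.6, 0.8, 0.7, 0.2, 0.9]}

fpr_a = false_positive_rate(group_a["labels"], group_a["scores"])
fpr_b = false_positive_rate(group_b["labels"], group_b["scores"])
print(fpr_a, fpr_b)
```

Among people who did not default, the model wrongly predicts a default for 25% of group A and 75% of group B, which is the kind of gap this test is meant to flag.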

Subset Root-Mean-Squared Error (RMSE)

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the RMSE of model predictions within a specific subset is significantly higher than the model prediction RMSE over the entire population.

Having different RMSE between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, RMSE is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Mean-Absolute Percentage Error (MAPE)

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MAPE of model predictions within a specific subset is significantly higher than the model prediction MAPE over the entire population.

Having different mean-absolute percentage error between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, mean-absolute percentage error is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
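For reference, the sketch below computes MAPE and RMSE for a whole population and for one subset, using the standard textbook definitions. The data and the choice of subset are hypothetical, and the MAPE helper assumes no label is zero.

```python
import math

def rmse(labels, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def mape(labels, preds):
    # Undefined when a label is 0; this sketch assumes non-zero labels.
    return sum(abs((y - p) / y) for y, p in zip(labels, preds)) / len(labels)

labels = [100, 200, 50, 10]
preds = [110, 180, 50, 15]

# Hypothetical subset: the two rows with the smallest labels.
subset_labels, subset_preds = labels[2:], preds[2:]

mape_all, mape_subset = mape(labels, preds), mape(subset_labels, subset_preds)
rmse_all, rmse_subset = rmse(labels, preds), rmse(subset_labels, subset_preds)
print(mape_all, mape_subset, rmse_all, rmse_subset)
```

On this data the subset looks worse than the population under MAPE but better under RMSE, because MAPE weights errors relative to the size of the label. This is one reason to read the two subset tests together.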

Subset Macro F1

Subset Performance

tabular, nlp, cv

F1 is a holistic measure of both precision and recall. When transitioning to the multiclass setting we can use macro F1 which computes the F1 of each class and averages them. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the macro F1 of model predictions within a specific subset is significantly lower than the model prediction macro F1 over the entire population.

Having different macro F1 between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, macro F1 is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted probability.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Normalized Discounted Cumulative Gain (NDCG)

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the NDCG of model predictions within a specific subset is significantly lower than the model prediction NDCG over the entire population.

Having different NDCG between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, NDCG is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
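As a reminder of what the metric measures, here is a minimal NDCG sketch. It uses the common formulation with linear gain and a log2 position discount, which is an assumption; the exact variant used by the platform is not stated above.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))

def ndcg(relevances_in_predicted_order):
    """DCG of the predicted ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])   # items already ranked by relevance
reversed_ = ndcg([1, 2, 3]) # most relevant item ranked last
empty = ndcg([0, 0])        # no relevant items at all
print(perfect, reversed_, empty)
```

A subset test would compute this value per subset and compare it against the value for the whole population.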

Subset AUC

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Area Under Curve (AUC) of model predictions within a specific subset is significantly lower than the model prediction Area Under Curve (AUC) over the entire population.

Having different AUC between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, AUC is computed over all predictions/labels. Note that we compute AUC of the Receiver Operating Characteristic (ROC) curve.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Recall

Subset Performance

tabular, nlp, cv

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.

Having different true positive rates (e.g. equal opportunity) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts a rejection is similar for groups A and B.

By default, Recall is computed over all predictions/labels. Note that we round predictions to 0/1 to compute recall.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Accuracy

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the accuracy of model predictions within a specific subset is significantly lower than the model prediction accuracy over the entire population.

Having different accuracy between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Accuracy can be thought of as a ‘weaker’ metric of model bias compared to measuring false positive rate (predictive equality) or false negative rate (equal opportunity). This is because two groups can have similar accuracy even though group A has a higher false positive rate while group B has a higher false negative rate (e.g. we accept non-qualified applicants in group A but reject qualified applicants in group B). Nevertheless, accuracy is a standard metric used during evaluation and should be considered as part of performance bias testing.

By default, accuracy is computed over all predictions/labels. Note we round predictions to 0/1 to compute accuracy.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
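The point that equal accuracy can hide different error profiles can be checked with a small worked example. The confusion-matrix counts below are made up for illustration: group A's errors are all false positives and group B's are all false negatives.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, false positive rate and false negative rate from counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

group_a = rates(tp=40, fn=0, fp=10, tn=50)  # errors are false positives
group_b = rates(tp=30, fn=10, fp=0, tn=60)  # errors are false negatives
print(group_a)
print(group_b)
```

Both groups score 0.9 on accuracy, so an accuracy-only subset test would pass, while the false positive rate and false negative rate tests would each flag one of the groups.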

Subset Macro Recall

Subset Performance

tabular, nlp, cv

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. When transitioning to the multiclass setting, we can use macro recall, which computes the recall of each individual class and then averages these numbers. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Recall of model predictions within a specific subset is significantly lower than the model prediction Macro Recall over the entire population.

Having different true positive rates (e.g. equal opportunity) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts an interview is similar for groups A and B.

By default, Macro Recall is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted class probability.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
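A minimal sketch of macro recall as described above: the predicted label is the class with the largest predicted probability, recall is computed per class, and the per-class values are averaged. Class labels are assumed to be integer indices into each probability vector, and the data is illustrative.

```python
def macro_recall(labels, probabilities):
    # Predicted label = index of the largest class probability.
    preds = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    recalls = []
    for cls in sorted(set(labels)):
        preds_for_cls = [p for y, p in zip(labels, preds) if y == cls]
        recalls.append(sum(1 for p in preds_for_cls if p == cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)

labels = [0, 0, 1, 2, 2, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
]
score = macro_recall(labels, probabilities)
print(score)
```

The per-class recalls here are 1/2, 1 and 2/3, so every class contributes equally to the average regardless of how many rows it has.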

Subset Mean Reciprocal Rank (MRR)

Subset Performance

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MRR of model predictions within a specific subset is significantly lower than the model prediction MRR over the entire population.

Having different MRR between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, MRR is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
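For readers unfamiliar with the metric, the sketch below computes MRR using its standard definition: for each query, take the reciprocal of the rank of the first relevant item, then average over queries. Treating a query with no relevant item as contributing 0 is an assumption of this example.

```python
def mean_reciprocal_rank(ranked_relevance_lists):
    """Each inner list holds 0/1 relevance flags in the model's ranked order."""
    total = 0.0
    for relevance in ranked_relevance_lists:
        first_hit = next(
            (rank for rank, rel in enumerate(relevance, start=1) if rel), None
        )
        total += 1.0 / first_hit if first_hit else 0.0
    return total / len(ranked_relevance_lists)

queries = [
    [0, 1, 0],  # first relevant item at rank 2
    [1, 0, 0],  # first relevant item at rank 1
    [0, 0, 0],  # no relevant item
]
mrr = mean_reciprocal_rank(queries)
print(mrr)
```

A subset test would group queries by a feature value or quantile and compare each subset's MRR with the MRR over all queries.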

Subset Multiclass AUC

Subset Performance

tabular, nlp, cv

In the multiclass setting, we compute one vs. one area under the curve (AUC), which computes the AUC between every pairwise combination of classes. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Area Under Curve (AUC) of model predictions within a specific subset is significantly lower than the model prediction Area Under Curve (AUC) over the entire population.

Having different AUC between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, AUC is computed over all predictions/labels. Note that we compute AUC of the Receiver Operating Characteristic (ROC) curve.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
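The one vs. one AUC mentioned above can be sketched in plain Python. For every pair of classes, this example keeps only rows labeled with one of the two classes, computes a ROC AUC in each direction, and averages the results. Averaging both directions and weighting all pairs equally are assumptions about the variant in use, and class labels are assumed to be integer indices into each probability vector.

```python
from itertools import combinations

def binary_auc(pos_scores, neg_scores):
    """Probability that a positive outranks a negative (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def ovo_auc(labels, probabilities):
    pair_aucs = []
    for a, b in combinations(sorted(set(labels)), 2):
        a_rows = [p for y, p in zip(labels, probabilities) if y == a]
        b_rows = [p for y, p in zip(labels, probabilities) if y == b]
        # AUC with class a as positive, then with class b as positive.
        auc_a = binary_auc([p[a] for p in a_rows], [p[a] for p in b_rows])
        auc_b = binary_auc([p[b] for p in b_rows], [p[b] for p in a_rows])
        pair_aucs.append((auc_a + auc_b) / 2)
    return sum(pair_aucs) / len(pair_aucs)

labels = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.6, 0.3, 0.1], [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2], [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7], [0.2, 0.2, 0.6],
]
score = ovo_auc(labels, probabilities)
print(score)
```

Classes 0 and 1 are partly confused in this data, so the score is below 1.0 even though class 2 is separated perfectly.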

Numeric Outliers

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with outliers and their impact on the model. Outliers are values which may not necessarily be outside of an allowed range for a feature, but are extreme values that are unusual and may be indicative of abnormality. The model impact is the difference in model performance between passing and failing rows with outliers. If labels are not provided, prediction change is used instead of model performance change.

Outliers can be a sign of corrupted or otherwise erroneous data, and can degrade model performance if used in the training data, or lead to unexpected behaviour if input at inference time.

By default this test is run over each numeric feature that is neither unique nor ascending.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
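The text above does not say how outliers are detected, so the following sketch swaps in a common stand-in: Tukey's interquartile-range rule, with bounds learned from a reference set. The multiplier `k=1.5`, the function name, and the data are all assumptions for illustration and should not be read as the platform's method.

```python
from statistics import quantiles

def outlier_bounds(reference, k=1.5):
    """Lower and upper fences from the reference set's quartiles (IQR rule)."""
    q1, _, q3 = quantiles(reference, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

reference = list(range(10, 20))
low, high = outlier_bounds(reference)

evaluation = [12, 15, 40, -3, 25]
outliers = [v for v in evaluation if v < low or v > high]
print(low, high, outliers)
```

Note that 25 lies above every reference value but still inside the fences, which matches the distinction drawn above between out-of-range values and extreme, unusual values.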

Unseen Categorical

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with unseen categorical values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen categorical values. If labels are not provided, prediction change is used instead of model performance change.

Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all categorical features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
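The two quantities described above, the number of failing rows and the model impact, can be sketched as follows. The row schema, the use of accuracy as the performance metric, and the sign convention (passing minus failing) are assumptions made for this example.

```python
def accuracy(rows):
    return sum(1 for r in rows if r["label"] == r["pred"]) / len(rows)

# Categories observed in the reference set (illustrative).
reference_colors = {"red", "green", "blue"}

evaluation = [
    {"color": "red", "label": 1, "pred": 1},
    {"color": "green", "label": 0, "pred": 0},
    {"color": "blue", "label": 1, "pred": 1},
    {"color": "teal", "label": 1, "pred": 0},
    {"color": "mauve", "label": 0, "pred": 0},
]

# A row fails when its value never appeared in the reference set.
failing = [r for r in evaluation if r["color"] not in reference_colors]
passing = [r for r in evaluation if r["color"] in reference_colors]

num_failing = len(failing)
model_impact = accuracy(passing) - accuracy(failing)
print(num_failing, model_impact)
```

The same passing/failing split applies to the other Abnormal Inputs tests; only the rule that decides whether a row fails changes.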

Unseen Domain

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with unseen domain values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen domain values. If labels are not provided, prediction change is used instead of model performance change.

Unseen domain values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen domain value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all features inferred to contain domains.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Unseen Email

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with unseen email values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen email values. If labels are not provided, prediction change is used instead of model performance change.

Unseen email values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen email value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all features inferred to contain emails.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Unseen URL

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with unseen URL values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen URL values. If labels are not provided, prediction change is used instead of model performance change.

Unseen URL values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen URL value. In addition, such errors may expose gaps or errors in data collection.

By default, this test runs over all features inferred to contain URLs.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Rare Categories

Abnormal Inputs

tabular, nlp, cv

This test measures the severity of passing to the model data points whose features contain rarely observed categories (relative to the reference set). The severity is a function of the impact of these values on the model, as well as the presence of these values in the data. The model impact is the difference in model performance between passing and failing rows with rarely observed categorical values. If labels are not provided, prediction change is used instead of model performance change. The number of failing rows refers to the number of times rarely observed categorical values are observed in the evaluation set.

Rare categories are a common failure point in machine learning systems because less data often means worse performance. In addition, this may expose gaps or errors in data collection.

By default, this test runs over all categorical features. A category is considered rare if it occurs fewer than <span>min_num_occurrences</span> times, or if it occurs less than <span>min_pct_occurrences</span> of the time. If neither of these values is specified, the rate of appearance below which a category is considered rare is <span>min_ratio_rel_uniform</span> divided by the number of classes.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
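The threshold logic described above can be sketched as follows. The parameter names come from the text, but several details are assumptions: `min_pct_occurrences` is treated as a fraction between 0 and 1, "number of classes" is taken to mean the number of distinct categories in the feature, and the default of `0.1` for `min_ratio_rel_uniform` is a placeholder rather than the platform's actual default.

```python
from collections import Counter

def rare_categories(reference_values, min_num_occurrences=None,
                    min_pct_occurrences=None, min_ratio_rel_uniform=0.1):
    counts = Counter(reference_values)
    total = len(reference_values)
    if min_num_occurrences is None and min_pct_occurrences is None:
        # Fallback: compare each rate to a fraction of the uniform rate.
        threshold = min_ratio_rel_uniform / len(counts)
        return {c for c, n in counts.items() if n / total < threshold}
    rare = set()
    for category, n in counts.items():
        if min_num_occurrences is not None and n < min_num_occurrences:
            rare.add(category)
        if min_pct_occurrences is not None and n / total < min_pct_occurrences:
            rare.add(category)
    return rare

reference = ["a"] * 90 + ["b"] * 9 + ["c"] * 1
by_default = rare_categories(reference)
by_count = rare_categories(reference, min_num_occurrences=10)
by_rate = rare_categories(reference, min_pct_occurrences=0.05)
print(by_default, by_count, by_rate)
```

With three categories, the fallback threshold is 0.1 / 3, so only `c` (1% of rows) is rare by default, while a count threshold of 10 also catches `b`.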

Out of Range

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with values outside the inferred range of allowed values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values outside the inferred range of allowed values. If labels are not provided, prediction change is used instead of model performance change.

In production, the model may encounter corrupted or manipulated out of range values. It is important that the model is robust to such extremities.

By default, this test runs over all numeric features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
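A minimal sketch of a range check, assuming the allowed range is simply the minimum and maximum seen in the reference set. How the platform actually infers the range is not stated above, so treat this inference rule as a placeholder.

```python
def out_of_range(reference, evaluation):
    """Return evaluation values that fall outside the reference min/max."""
    low, high = min(reference), max(reference)
    return [v for v in evaluation if v < low or v > high]

reference_ages = [18, 25, 40, 67]
failing = out_of_range(reference_ages, [17, 30, 67, 120])
print(failing)
```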

Required Characters

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with strings without any required characters and their impact on the model. The model impact is the difference in model performance between passing and failing rows with strings without any required characters. If labels are not provided, prediction change is used instead of model performance change.

A feature may require specific characters. However, errors in the data pipeline may allow invalid data points that lack these required characters to pass. Failing to catch such errors may lead to noisier training data or noisier predictions during inference, which can degrade model metrics.

By default, this test runs over all string features that are inferred to have required characters.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Inconsistencies

Abnormal Inputs

tabular, nlp, cv

This test measures the severity of passing to the model data points whose values are inconsistent (as inferred from the reference set). The severity is a function of the impact of these values on the model, as well as the presence of these values in the data. The model impact is the difference in model performance between passing and failing rows with data containing inconsistent feature values. If labels are not provided, prediction change is used instead of model performance change. The number of failing rows refers to the number of times data containing inconsistent feature values are observed in the evaluation set.

Inconsistent values might be the result of malicious actors manipulating the data or errors in the data pipeline. Thus, it is important to be aware of inconsistent values to identify sources of manipulations or errors.

By default, this test runs on pairs of categorical features whose correlations exceed some minimum threshold. The default threshold for the frequency ratio below which values are considered to be inconsistent is <span>0.02</span>.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Capitalization

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with different types of capitalization and their impact on the model. The model impact is the difference in model performance between passing and failing rows with different types of capitalization. If labels are not provided, prediction change is used instead of model performance change.

In production, models can come across the same value with different capitalizations, making it important to explicitly check that your model is invariant to such differences.

By default, this test runs over all categorical features.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
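One way to find rows that differ from the reference set only by capitalization is to compare case-folded values. This is a sketch of that idea with a hypothetical function name and data, not a description of the platform's matching rules.

```python
def capitalization_mismatches(reference_values, evaluation_values):
    """Values unseen as written, but seen once case is ignored."""
    seen = set(reference_values)
    seen_folded = {v.casefold() for v in seen}
    return [
        v for v in evaluation_values
        if v not in seen and v.casefold() in seen_folded
    ]

reference = ["USA", "Canada"]
evaluation = ["usa", "Canada", "Usa", "Mexico"]
mismatches = capitalization_mismatches(reference, evaluation)
print(mismatches)
```

`Mexico` is not reported here because it is unseen under any capitalization, which makes it a case for the unseen categorical test instead.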

Empty String

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with empty string values instead of null values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with empty string values instead of null values. If labels are not provided, prediction change is used instead of model performance change.

In production, the model may encounter corrupted or manipulated string values. Null values and empty strings are often expected to be treated the same, but the model might not treat them that way. It is important that the model is robust to such extremities.

By default, this test runs over all string features with null values.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Embedding Anomalies

Abnormal Inputs

tabular, nlp, cv

This test measures the number of failing rows in your data with anomalous embeddings and their impact on the model. The model impact is the difference in model performance between passing and failing rows with anomalous embeddings. If labels are not provided, prediction change is used instead of model performance change.

In production, the presence of anomalous embeddings can indicate breaks in upstream data pipelines, poor model generalization, or other issues.

By default, this test runs over all configured embeddings.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Null Check

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with nulls in features that should not have nulls and their impact on the model. The model impact is the difference in model performance between passing and failing rows with nulls in features that should not have nulls. If labels are not provided, prediction change is used instead of model performance change.

The model may make certain assumptions about a column depending on whether or not it had nulls in the training data. If these assumptions break during production, this may damage the model’s performance. For example, if a column was never null during training then a model may not have learned to be robust against noise in that column.

By default, this test runs over all columns that had zero nulls in the reference set.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
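
The default column selection for the Null Check test can be sketched like this, assuming a non-empty reference set, rows stored as dictionaries, and None as the null marker. Function names and data are hypothetical.

```python
def columns_without_nulls(reference_rows):
    """Columns that never contain a null in the reference set."""
    columns = reference_rows[0].keys()
    return [c for c in columns if all(row[c] is not None for row in reference_rows)]


def rows_with_unexpected_nulls(reference_rows, evaluation_rows):
    """Map each failing evaluation row index to the columns that are unexpectedly null."""
    checked = columns_without_nulls(reference_rows)
    failing = {}
    for index, row in enumerate(evaluation_rows):
        bad = [c for c in checked if row.get(c) is None]
        if bad:
            failing[index] = bad
    return failing


reference = [{"age": 31, "income": 50000.0}, {"age": 45, "income": None}]
evaluation = [
    {"age": None, "income": None},
    {"age": 28, "income": None},
    {"age": 52, "income": 61000.0},
]
print(rows_with_unexpected_nulls(reference, evaluation))
```

Only "age" is checked, because "income" already contained a null in the reference set.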

Must be Int

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type Integer and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Integer. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Integer.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
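
A type check of this kind might look like the sketch below. It assumes that whole-number floats such as 2.0 count as integers and that booleans do not; both choices are assumptions of this example, not documented platform behavior.

```python
def is_integer_value(value):
    """Return True if the value should be treated as an integer."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    # Assumption: accept floats that hold a whole number, e.g. 2.0.
    return isinstance(value, float) and value.is_integer()


def non_integer_rows(values):
    """Indices of rows whose value fails the integer check."""
    return [i for i, v in enumerate(values) if not is_integer_value(v)]


values = [1, 2.0, 3.5, "4", True]
print(non_integer_rows(values))
```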

Must be Float

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type Float and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Float. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Float.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Must be String

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type String Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type String Categorical. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type String Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Must be Boolean

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type Boolean Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Boolean Categorical. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Boolean Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Must be URL

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type URL Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type URL Categorical. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type URL Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Must be Domain

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type Domain Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Domain Categorical. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Domain Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Must be Email

Abnormal Inputs

tabular

This test measures the number of failing rows in your data with values not of type Email Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Email Categorical. If labels are not provided, prediction change is used instead of model performance change.

A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

By default, this test runs over all features that are inferred to be type Email Categorical.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift Precision

Subset Performance Degradation

tabular, nlp

The precision test is also popularly referred to as positive predictive parity in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire population.

Having different precision (e.g. false discovery rates) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. Note that positive predictive parity does not necessarily indicate equal opportunity or predictive equality: as a hypothetical example, imagine that a loan qualification classifier flags 100 entries for group A and 100 entries for group B, each with a precision of 100%, but there are 100 actual qualified entries in group A and 9000 in group B. This would indicate disparities in opportunities given to each subgroup.

By default, Precision is computed over all predictions/labels. Note that we round predictions to 0/1 to compute precision.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
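
The loan qualification example above can be worked through numerically. The sketch below uses the counts from that example (100 flagged entries per group, all correct; 100 qualified entries in group A and 9000 in group B) to show that equal precision can coexist with very different recall.

```python
def precision(tp, fp):
    """Share of flagged entries that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of truly positive entries that were flagged."""
    return tp / (tp + fn)


# fn = qualified entries that were not flagged.
groups = {
    "A": {"tp": 100, "fp": 0, "fn": 0},
    "B": {"tp": 100, "fp": 0, "fn": 8900},
}

for name, c in groups.items():
    print(name, precision(c["tp"], c["fp"]), round(recall(c["tp"], c["fn"]), 4))
```

Both groups have a precision of 1.0, but group B's recall is 100/9000, or roughly 1.1%.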

Subset Drift Prediction Variance (Positive Labels)

Subset Performance Degradation

tabular

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire population. In this test, the population refers to all data with positive ground-truth labels.

High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed alongside other subset performance tests (accuracy, AUC) for a clearer view. For the variance metric over positive or negative labels, it could mean the model is much more uncertain about the given subset. When paired with a decrease in AUC, this implies the model underperforms on this subset.

By default, the variance is computed over all predictions with a positive ground-truth label.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
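
A minimal sketch of the subset variance comparison, assuming a categorical feature and population variance as the statistic. The function name and data are hypothetical, and the significance test that the platform applies is not reproduced here.

```python
from collections import defaultdict
from statistics import pvariance


def subset_prediction_variance(feature_values, predictions, labels, target_label=1):
    """Variance of predictions overall and per subset, restricted to one label."""
    # Keep only rows whose ground-truth label matches the target label.
    kept = [(f, p) for f, p, y in zip(feature_values, predictions, labels)
            if y == target_label]

    groups = defaultdict(list)
    for f, p in kept:
        groups[f].append(p)

    overall = pvariance([p for _, p in kept])
    per_subset = {f: pvariance(ps) for f, ps in groups.items()}
    return overall, per_subset


features = ["a", "a", "b", "b", "a", "b"]
predictions = [0.9, 0.8, 0.2, 0.9, 0.1, 0.6]
labels = [1, 1, 1, 1, 0, 1]

overall, per_subset = subset_prediction_variance(features, predictions, labels)
print(round(overall, 4), {k: round(v, 4) for k, v in per_subset.items()})
```

In this toy data, subset "b" has a higher prediction variance than the positive-label population as a whole, which is the pattern the test looks for.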

Subset Drift Mean-Absolute Error (MAE)

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MAE of model predictions within a specific subset is significantly higher than the model prediction MAE over the entire population.

Having different mean-absolute error between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, mean-absolute error is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
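
The quantile-based split used by the subset tests can be sketched as follows for a numeric feature, here with MAE as the metric. The number of bins, the helper names, and the data are assumptions of this example; the significance test is omitted.

```python
from bisect import bisect_right
from statistics import quantiles


def mae(labels, predictions):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def mae_by_quantile(feature_values, labels, predictions, n_bins=4):
    """Overall MAE and MAE within each quantile bin of the feature."""
    cuts = quantiles(feature_values, n=n_bins)  # n_bins - 1 cut points
    buckets = {}
    for x, y, p in zip(feature_values, labels, predictions):
        bin_index = bisect_right(cuts, x)
        ys, ps = buckets.setdefault(bin_index, ([], []))
        ys.append(y)
        ps.append(p)

    overall = mae(labels, predictions)
    per_bin = {b: mae(ys, ps) for b, (ys, ps) in sorted(buckets.items())}
    return overall, per_bin


feature = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [10, 10, 10, 10, 10, 10, 10, 10]
predictions = [10, 11, 9, 10, 14, 6, 13, 7]

overall, per_bin = mae_by_quantile(feature, labels, predictions, n_bins=2)
print(overall, per_bin)
```

Here the upper half of the feature range has a much higher MAE than the overall population, so it would be the subset of interest.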

Subset Drift Rank Correlation

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the rank correlation of model predictions within a specific subset is significantly lower than the model prediction rank correlation over the entire population.

Having different rank correlation between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, rank correlation is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift F1

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the F1 of model predictions within a specific subset is significantly lower than the model prediction F1 over the entire population.

Having different F1 between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, F1 is computed over all predictions/labels. Note that we round predictions to 0/1 to compute F1 score.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift Prediction Variance (Negative Labels)

Subset Performance Degradation

tabular

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire population. In this test, the population refers to all data with negative ground-truth labels.

High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed alongside other subset performance tests (accuracy, AUC) for a clearer view. For the variance metric over positive or negative labels, it could mean the model is much more uncertain about the given subset. When paired with a decrease in AUC, this implies the model underperforms on this subset.

By default, the variance is computed over all predictions with a negative ground-truth label.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift False Positive Rate

Subset Performance Degradation

tabular

The false positive error rate test is also popularly referred to as predictive equality, or equal mis-opportunity, in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the false positive rate of model predictions within a specific subset is significantly higher than the model prediction false positive rate over the entire population.

Having different false positive rates (e.g. predictive equality) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. As an intuitive example, consider the case when the label indicates an undesirable attribute: if predicting whether a person will default on their loan, make sure that for people who didn’t default, the rate at which the model incorrectly predicts positive is similar for group A and B.

By default, false positive rate is computed over all predictions/labels. Note that we round predictions to 0/1 to compute false positive rate.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
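
A per-group false positive rate can be sketched as below. The sketch thresholds scores at 0.5 as a stand-in for rounding predictions to 0/1, which is an assumption (Python's built-in round() would send exactly 0.5 to 0). Names and data are hypothetical.

```python
def false_positive_rate(labels, scores, threshold=0.5):
    """Share of true negatives that the model predicts as positive."""
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not negatives:
        return None
    return sum(s >= threshold for s in negatives) / len(negatives)


def fpr_by_group(groups, labels, scores, threshold=0.5):
    """False positive rate computed separately for each group value."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, name in enumerate(groups) if name == g]
        per_group[g] = false_positive_rate(
            [labels[i] for i in idx], [scores[i] for i in idx], threshold
        )
    return per_group


groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [0, 0, 0, 1, 0, 0, 0, 1]
scores = [0.1, 0.2, 0.7, 0.9, 0.6, 0.8, 0.3, 0.4]

overall_fpr = false_positive_rate(labels, scores)
per_group = fpr_by_group(groups, labels, scores)
print(overall_fpr, per_group)
```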

Subset Drift Root-Mean-Squared Error (RMSE)

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the RMSE of model predictions within a specific subset is significantly higher than the model prediction RMSE over the entire population.

Having different RMSE between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, RMSE is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift Mean-Absolute Percentage Error (MAPE)

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MAPE of model predictions within a specific subset is significantly higher than the model prediction MAPE over the entire population.

Having different mean-absolute percentage error between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, mean-absolute percentage error is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift NDCG

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the NDCG of model predictions within a specific subset is significantly lower than the model prediction NDCG over the entire population.

Having different NDCG between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, NDCG is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
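
For reference, one common formulation of NDCG for a single ranked list is sketched below, using linear gain and a log2 position discount. Whether the platform uses this exact variant (as opposed to, say, exponential gain) is not stated in this document, so treat it as an assumption.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores in ranked order."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([0, 0, 1]))  # the only relevant item is ranked last
```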

Subset Drift AUC

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Area Under Curve (AUC) of model predictions within a specific subset is significantly lower than the model prediction Area Under Curve (AUC) over the entire population.

Having different AUC between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, AUC is computed over all predictions/labels. Note that we compute AUC of the Receiver Operating Characteristic (ROC) curve.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
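
ROC AUC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it directly from that definition for small inputs; it is an illustration, not the platform's implementation, and it assumes both classes are present.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Computing this value within each subset and over the whole population gives the two quantities that the test compares.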

Subset Drift Recall

Subset Performance Degradation

tabular, nlp

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.

Having different true positive rates (e.g. equal opportunity) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts a rejection is similar for groups A and B.

By default, Recall is computed over all predictions/labels. Note that we round predictions to 0/1 to compute recall.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification

Subset Drift Accuracy

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the accuracy of model predictions within a specific subset is significantly lower than the model prediction accuracy over the entire population.

Having different accuracy between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Accuracy can be thought of as a ‘weaker’ metric of model bias compared to measuring false positive rate (predictive equality) or false negative rate (equal opportunity). This is because we can have similar accuracy between group A and group B; yet group A actually has higher false positive rate, while group B has higher false negative rate (e.g. we accept non-qualified applicants in group A but reject qualified applicants in group B). Nevertheless, accuracy is a standard metric used during evaluation and should be considered as part of performance bias testing.

By default, accuracy is computed over all predictions/labels. Note that we round predictions to 0/1 to compute accuracy.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
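
To make the point about accuracy being a weaker bias metric concrete, the sketch below builds two hypothetical confusion matrices with identical accuracy but opposite error profiles. The counts are invented for illustration.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, false positive rate, and false negative rate from counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Group A: all errors are false positives. Group B: all errors are false negatives.
group_a = rates(tp=40, fn=0, fp=20, tn=40)
group_b = rates(tp=20, fn=20, fp=0, tn=60)

print(group_a)
print(group_b)
```

Both groups score 0.8 on accuracy, so an accuracy-only comparison would not separate them, while the false positive and false negative rate tests would.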

Subset Drift Mean Reciprocal Rank (MRR)

Subset Performance Degradation

tabular

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MRR of model predictions within a specific subset is significantly lower than the model prediction MRR over the entire population.

Having different MRR between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

By default, MRR is computed over all predictions/labels.

[tabular] Regression, [tabular] Binary Classification, [tabular] Ranking, [tabular] Multi-class Classification
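
Mean Reciprocal Rank averages, over queries, the reciprocal of the position of the first relevant item. A minimal sketch follows, assuming each query is given as a list of 0/1 relevance flags in ranked order and that a query with no relevant item contributes 0 (an assumption of this example).

```python
def reciprocal_rank(ranked_relevance):
    """1 / position of the first relevant item, or 0.0 if there is none."""
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a non-empty list of queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


queries = [[1, 0, 0], [0, 1, 0], [0, 0, 0, 1]]
mrr = mean_reciprocal_rank(queries)
print(round(mrr, 4))
```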

Invisible Character Attack

Adversarial

nlp

This test measures the robustness of your model to invisible character attacks. It does this by taking a sample input, inserting zero-width unicode characters, and measuring the performance of the model on the perturbed input. See the paper “Fall of Giants: How Popular Text-Based MLaaS Fall against a Simple Evasion Attack” by Pajola and Conti (https://arxiv.org/abs/2104.05996) for more details.

Malicious actors can perturb natural language input sequences to alter model behavior in unexpected ways. It is important that your NLP models are robust to such attacks.

By default, this test runs in adversarial mode.

[nlp] text_classification
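
The perturbation behind an invisible character attack can be sketched as follows. This example inserts U+200B (zero-width space) at random positions; the choice of character, the number of insertions, and the function name are assumptions, not the platform's configuration.

```python
import random

ZERO_WIDTH_SPACE = "\u200b"


def insert_zero_width(text, n_insertions, seed=0):
    """Insert zero-width spaces at random positions in the text."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_insertions):
        position = rng.randint(0, len(chars))  # inclusive, so appending is possible
        chars.insert(position, ZERO_WIDTH_SPACE)
    return "".join(chars)


original = "free money now"
perturbed = insert_zero_width(original, 3)

# The strings render identically but differ in length and content.
print(len(original), len(perturbed), original == perturbed)
```

A model is robust to this attack if its prediction on the perturbed string matches its prediction on the original.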

Deletion Control Character Attack

Adversarial

nlp

This test measures the robustness of your model to deletion control character attacks. It does this by taking a sample input, inserting deletion control characters, and measuring the performance of the model on the perturbed input. See the paper “Bad Characters: Imperceptible NLP Attacks” by Boucher, Shumailov, et al. (https://arxiv.org/abs/2106.09898) for more details.

Malicious actors can perturb natural language input sequences to alter model behavior in unexpected ways. It is important that your NLP models are robust to such attacks.

By default, this test runs in adversarial mode.

[nlp] text_classification

Intentional Homoglyph Attack

Adversarial

nlp

This test measures the robustness of your model to intentional homoglyph attacks. It does this by taking a sample input, substituting homoglyphs designed to look like other characters, and measuring the performance of the model on the perturbed input. See the paper “Bad Characters: Imperceptible NLP Attacks” by Boucher, Shumailov, et al. (https://arxiv.org/abs/2106.09898) for more details.

Malicious actors can perturb natural language input sequences to alter model behavior in unexpected ways. It is important that your NLP models are robust to such attacks.

By default, this test runs in adversarial mode.

[nlp] text_classification
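
Homoglyph substitution in general can be sketched with a small lookup table. The mapping below replaces a few Latin letters with visually similar Cyrillic ones; it is a hypothetical table for illustration and is not the set of homoglyphs the platform uses.

```python
# Latin letter -> visually similar Cyrillic letter (illustrative subset).
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
}


def substitute_homoglyphs(text):
    """Replace every character that has an entry in the homoglyph table."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


original = "cheap code"
perturbed = substitute_homoglyphs(original)

# Same length and similar appearance, but different code points.
print(len(original) == len(perturbed), original == perturbed)
```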

Confusable Homoglyph Attack

Adversarial

nlp

This test measures the robustness of your model to confusable homoglyph attacks. It does this by taking a sample input, substituting homoglyphs that are easily confused with other characters, and measuring the performance of the model on the perturbed input. See the paper “Bad Characters: Imperceptible NLP Attacks” by Boucher, Shumailov, et al. (https://arxiv.org/abs/2106.09898) for more details.

Malicious actors can perturb natural language input sequences to alter model behavior in unexpected ways. It is important that your NLP models are robust to such attacks.

By default, this test runs in adversarial mode.

[nlp] text_classification

Character Substitution

Adversarial

nlp

This test measures the robustness of your model to character substitution attacks. It does this by randomly substituting characters in the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Character Deletion

Adversarial

nlp

This test measures the robustness of your model to character deletion attacks. It does this by randomly deleting characters in the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Character Insertion

Adversarial

nlp

This test measures the robustness of your model to character insertion attacks. It does this by randomly adding characters to the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Character Swap

Adversarial

nlp

This test measures the robustness of your model to character swap attacks. It does this by randomly swapping characters in the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification
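
A character swap attack applied to a fraction of the words in an input can be sketched as below. Swapping two adjacent characters, rounding the word count, and attacking at least one word are all assumptions of this example; only the 5% default comes from the description above.

```python
import random


def swap_adjacent_characters(word, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def attack_words(text, fraction=0.05, seed=0):
    """Apply the swap to roughly `fraction` of the whitespace-separated words."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    n_attacked = max(1, round(fraction * len(words)))
    for index in rng.sample(range(len(words)), n_attacked):
        words[index] = swap_adjacent_characters(words[index], rng)
    return " ".join(words)


print(attack_words("the quick brown fox jumps over the lazy dog today", fraction=0.2))
```

The other character-level attacks in this family (deletion, insertion, substitution) would follow the same pattern with a different per-word perturbation function.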

Keyboard Augmentation

Adversarial

nlp

This test measures the robustness of your model to keyboard augmentation attacks. It does this by adding common typos based on keyboard distance to the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Common Misspellings

Adversarial

nlp

This test measures the robustness of your model to common misspelling attacks. It does this by adding common misspellings to the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

OCR Error Simulation

Adversarial

nlp

This test measures the robustness of your model to OCR error simulation attacks. It does this by adding common OCR errors to the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Synonym Swap

Adversarial

nlp

This test measures the robustness of your model to synonym swap attacks. It does this by randomly swapping synonyms in the input string and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Contextual Word Swap

Adversarial

nlp

This test measures the robustness of your model to contextual word swap attacks. It does this by replacing words with those close in embedding space and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Contextual Word Insertion

Adversarial

nlp

This test measures the robustness of your model to contextual word insertion attacks. It does this by inserting words generated from a language model and measuring your model’s performance on the attacked string.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Universal Prefix Attack

Adversarial

nlp

This test measures the robustness of your model to ‘universal’ adversarial prefix injections. It does this by sampling a batch of inputs, and searching over the model vocabulary to find a prefix that is nonsensical to a reader but that, when prepended to the batch of inputs, will cause the model to output a different prediction. See the paper “Universal Adversarial Triggers for Attacking and Analyzing NLP” by Wallace, Feng, Kandpal, et al. (https://arxiv.org/abs/1908.07125) for more details.

Malicious actors can perturb natural language input sequences to alter model behavior in unexpected ways. ‘Universal triggers’ pose a particularly large threat since they easily transfer between models and data points to permit an adversary to make large-scale, cost-efficient attacks. It is important that your NLP models are robust to such threat vectors.

By default, this test runs when the ‘Adversarial’ category is specified.

[nlp] text_classification

Unseen Unigram

Abnormal Inputs

nlp

This test measures the number of failing rows in your data with unseen unigrams and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen unigrams. If labels are not provided, prediction change is used instead of model performance change.

Unseen unigrams are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen unigram. In addition, such errors may expose gaps or errors in data collection.

By default, this test is run over every data point.

[nlp] text_classification
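The idea behind the Unseen Unigram test can be sketched in a few lines: build a vocabulary from the reference set, flag evaluation rows that contain a token outside it, and compare model performance on passing and failing rows. Whitespace tokenization, lowercasing, accuracy as the performance metric, and the function names are all assumptions of this sketch.

```python
def rows_with_unseen_unigrams(reference_texts, evaluation_texts):
    """Return indices of evaluation rows containing a token absent from the reference set."""
    vocabulary = set()
    for text in reference_texts:
        vocabulary.update(text.lower().split())
    return [
        i for i, text in enumerate(evaluation_texts)
        if any(token not in vocabulary for token in text.lower().split())
    ]


def model_impact(correct, failing_rows):
    """Accuracy on passing rows minus accuracy on failing rows."""
    failing = set(failing_rows)
    fail = [c for i, c in enumerate(correct) if i in failing]
    ok = [c for i, c in enumerate(correct) if i not in failing]
    if not fail or not ok:
        return 0.0  # impact is undefined when one group is empty
    return sum(ok) / len(ok) - sum(fail) / len(fail)


reference = ["good movie", "bad movie"]
evaluation = ["good movie", "terrible movie", "bad film"]
failing = rows_with_unseen_unigrams(reference, evaluation)
impact = model_impact([True, False, False], failing)
```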

Empty Text String

Abnormal Inputs

nlp

This test measures the number of failing rows in your data with empty strings and their impact on the model. The model impact is the difference in model performance between passing and failing rows with empty strings. If labels are not provided, prediction change is used instead of model performance change.

Empty strings are a common failure point in machine learning systems, since some models may yield uninterpretable or undefined behavior when interacting with an empty string. In addition, such errors may expose gaps or errors in data collection.

By default, this test is run over every data point.

[nlp] text_classification

Character Distribution

Drift

nlp

This test measures the character distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is determined by comparing the computed drift statistic to the configured severity thresholds.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the character distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[nlp] text_classification
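For reference, the Population Stability Index between a reference distribution p and an evaluation distribution q is commonly computed as the sum over categories of (q_i - p_i) * ln(q_i / p_i). The sketch below applies that formula to character frequencies. The smoothing constant `eps` and the function name are assumptions of this example, and the platform may bin or smooth differently.

```python
import math
from collections import Counter


def character_psi(reference_texts, evaluation_texts, eps=1e-6):
    """Population Stability Index between two character frequency distributions."""
    ref_counts = Counter("".join(reference_texts))
    eval_counts = Counter("".join(evaluation_texts))
    ref_total = sum(ref_counts.values())
    eval_total = sum(eval_counts.values())
    if ref_total == 0 or eval_total == 0:
        raise ValueError("both datasets must contain at least one character")
    psi = 0.0
    for ch in set(ref_counts) | set(eval_counts):
        # eps keeps the logarithm finite when a character is missing from one set.
        p = max(ref_counts[ch] / ref_total, eps)
        q = max(eval_counts[ch] / eval_total, eps)
        psi += (q - p) * math.log(q / p)
    return psi


no_drift = character_psi(["abc"], ["abc"])
drift = character_psi(["aaab"], ["abbb"])
```

A test case would then pass when the returned value is below the configured threshold. The same function works for unigrams or bigrams if the counters are built over tokens or token pairs instead of characters.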

Unigrams Distribution

Drift

nlp

This test measures the unigram distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is determined by comparing the computed drift statistic to the configured severity thresholds.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the unigram distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[nlp] text_classification

Bigrams Distribution

Drift

nlp

This test measures the bigram distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is determined by comparing the computed drift statistic to the configured severity thresholds.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the bigram distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[nlp] text_classification

Upper-Case Text

Transformations

nlp

This test measures the robustness of your model to Upper-Case Text transformations. It does this by taking a sample input, upper-casing all text, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Lower-Case Text

Transformations

nlp

This test measures the robustness of your model to Lower-Case Text transformations. It does this by taking a sample input, lower-casing all text, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Remove Special Characters

Transformations

nlp

This test measures the robustness of your model to Remove Special Characters transformations. It does this by taking a sample input, removing all periods and apostrophes from the input string, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification
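The transformation tests in this group share one pattern: transform each sampled input, run the model on both versions, and compare. The sketch below shows that pattern for the removal of periods and apostrophes. `toy_predict` is a stand-in for a real model, and counting changed predictions is only one possible way to measure behavior; both are assumptions of this example.

```python
def remove_special_characters(text):
    """Delete periods and apostrophes (straight and curly) from the text."""
    return text.translate(str.maketrans("", "", ".'’"))


def prediction_change_rate(predict, texts, transform):
    """Fraction of inputs whose prediction changes after the transformation."""
    if not texts:
        return 0.0
    changed = sum(predict(t) != predict(transform(t)) for t in texts)
    return changed / len(texts)


def toy_predict(text):
    # Stand-in for a real classifier: it only looks at the final character.
    return "statement" if text.endswith(".") else "other"


cleaned = remove_special_characters("Dr. Smith's dog.")
rate = prediction_change_rate(
    toy_predict, ["It works.", "Does it work?"], remove_special_characters
)
```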

Replace Masculine with Feminine Pronouns

Transformations

nlp

This test measures the robustness of your model to Replace Masculine with Feminine Pronouns transformations. It does this by taking a sample input, swapping all masculine pronouns from the input string to feminine ones, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification
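A pronoun swap of this kind can be sketched with a small lookup table and a word-boundary regular expression. The mapping below is a hypothetical, minimal table written for this example (note that both "his" and "him" map to "her", so the swap is lossy); the platform's own word list is not shown in this document.

```python
import re

# Hypothetical, minimal mapping used only for this illustration.
MASC_TO_FEM = {"he": "she", "him": "her", "his": "her", "himself": "herself"}

_PATTERN = re.compile(r"\b(" + "|".join(MASC_TO_FEM) + r")\b", re.IGNORECASE)


def swap_pronouns(text):
    """Replace masculine pronouns with feminine ones, keeping initial capitals."""
    def replace(match):
        word = match.group(0)
        swapped = MASC_TO_FEM[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return _PATTERN.sub(replace, text)


swapped = swap_pronouns("He said his dog likes him.")
```

The word boundaries keep the pattern from firing inside longer words such as "The" or "theme".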

Replace Feminine with Masculine Pronouns

Transformations

nlp

This test measures the robustness of your model to Replace Feminine with Masculine Pronouns transformations. It does this by taking a sample input, swapping all feminine pronouns from the input string to masculine ones, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification

Replace Feminine with Masculine Names

Transformations

nlp

This test measures the invariance of your model to gendered name swap transformations. It does this by taking a sample input, swapping all instances of traditionally feminine names (in the provided list) with a traditionally masculine name, and measuring the behavior of the model on the transformed input.

Production natural language input sequences must properly support people of all demographics. It is important that your NLP models are robust to spurious correlations and bias from the data.

By default, this test runs over a sample of strings from the evaluation set that contain one or more words from the source list.

[nlp] text_classification

Replace Masculine with Feminine Names

Transformations

nlp

This test measures the invariance of your model to gendered name swap transformations. It does this by taking a sample input, swapping all instances of traditionally masculine names (in the provided list) with a traditionally feminine name, and measuring the behavior of the model on the transformed input.

Production natural language input sequences must properly support people of all demographics. It is important that your NLP models are robust to spurious correlations and bias from the data.

By default, this test runs over a sample of strings from the evaluation set that contain one or more words from the source list.

[nlp] text_classification

Unicode to ASCII

Transformations

nlp

This test measures the robustness of your model to Unicode to ASCII transformations. It does this by taking a sample input, converting all characters in the input string to their nearest ASCII representation, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

[nlp] text_classification
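One common way to approximate a "nearest ASCII representation" in Python is to decompose characters with Unicode normalization and drop whatever cannot be encoded as ASCII. This is a sketch of that general technique, not necessarily the conversion the platform uses; characters with no ASCII decomposition (for example "ß" or CJK text) are simply removed here.

```python
import unicodedata


def to_ascii(text):
    """Approximate each character by its closest ASCII form, dropping the rest."""
    decomposed = unicodedata.normalize("NFKD", text)  # e.g. "é" -> "e" + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")


converted = to_ascii("café naïve")
```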

Entity Type Distribution

Drift

nlp

This test measures the label entity type distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the label entity type distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[nlp] named_entity_recognition

Predicted Entity Type Distribution

Drift

nlp

This test measures the predicted entity type distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the predicted entity type distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[nlp] named_entity_recognition

Entity Lengths Distribution

Drift

nlp

This test measures the entity length distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the entity length distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[nlp] named_entity_recognition

Label Entity Type Subsets

Subset Performance

nlp

This test measures whether the model performs equally well across subsets of the data when grouped by label entity type. These subsets are defined by grouping input sequences into approximately equal-width bins of the aforementioned metric. The test then measures whether model performance, as defined by the recall, for any given subset is significantly worse than the average performance across all subsets of the data.

Having similar performance across various subsets of the data is an important measure of performance bias.

By default, this test measures whether the recall of each subgroup is within 0.05 of the overall performance.

[nlp] named_entity_recognition
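The subset check above can be sketched as follows: compute recall per entity type, compute an overall recall, and flag any subset that falls more than 0.05 below it. Using pooled counts for the overall recall, flagging only subsets that are worse (not better), and the function names are assumptions of this example.

```python
def recall(true_positives, false_negatives):
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def failing_subsets(counts, tolerance=0.05):
    """counts maps a subset name to (true_positives, false_negatives).

    Returns the subsets whose recall is more than `tolerance` below the
    recall computed over all subsets pooled together.
    """
    total_tp = sum(tp for tp, _ in counts.values())
    total_fn = sum(fn for _, fn in counts.values())
    overall = recall(total_tp, total_fn)
    return {
        name: recall(tp, fn)
        for name, (tp, fn) in counts.items()
        if overall - recall(tp, fn) > tolerance
    }


# Illustrative counts only; these are not measured results.
flagged = failing_subsets({"PER": (90, 10), "ORG": (80, 20), "LOC": (60, 40)})
```

In this made-up example the pooled recall is 230/300, so only the LOC subset (recall 0.6) is flagged. The Predicted Entity Type Subsets test follows the same shape with precision in place of recall.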

Predicted Entity Type Subsets

Subset Performance

nlp

This test measures whether the model performs equally well across subsets of the data when grouped by predicted entity type. These subsets are defined by grouping input sequences into approximately equal-width bins of the aforementioned metric. The test then measures whether model performance, as defined by the precision, for any given subset is significantly worse than the average performance across all subsets of the data.

Having similar performance across various subsets of the data is an important measure of performance bias.

By default, this test measures whether the precision of each subgroup is within 0.05 of the overall performance.

[nlp] named_entity_recognition

Lower-Case Entity

Transformations

nlp

This test measures the robustness of your model to Lower-Case Entity transformations. It does this by taking a sample input, lower-casing all entities, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition

Upper-Case Entity

Transformations

nlp

This test measures the robustness of your model to Upper-Case Entity transformations. It does this by taking a sample input, upper-casing all entities, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition

Ampersand

Transformations

nlp

This test measures the robustness of your model to Ampersand transformations. It does this by taking a sample input, changing “&” to “and”, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition

Abbreviation Expander

Transformations

nlp

This test measures the robustness of your model to Abbreviation Expander transformations. It does this by taking a sample input, expanding abbreviations in entities, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition

Whitespace Around Special Character

Transformations

nlp

This test measures the robustness of your model to Whitespace Around Special Character transformations. It does this by taking a sample input, adding whitespace around special characters, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition

Swap Seen Entities

Transformations

nlp

This test measures the robustness of your model to Swap Seen Entities transformations. It does this by taking a sample input, swapping all the entities in a text with random entities of the same type seen in the rest of the data, and measuring the behavior of the model on the transformed input.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition
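To make the Swap Seen Entities transformation concrete, the sketch below represents a text as a list of (text, entity type) segments and replaces each entity with a different entity of the same type taken from a pool of entities seen elsewhere in the data. The segment representation, the pool structure, and the function name are assumptions made for this example.

```python
import random


def swap_seen_entities(segments, seen_entities, seed=0):
    """Replace each entity with another seen entity of the same type.

    segments: list of (text, entity_type) pairs; entity_type is None for
    plain text. seen_entities: dict mapping entity_type to surface forms.
    """
    rng = random.Random(seed)
    swapped = []
    for text, entity_type in segments:
        candidates = [e for e in seen_entities.get(entity_type, []) if e != text]
        if entity_type is not None and candidates:
            text = rng.choice(candidates)
        swapped.append((text, entity_type))
    return swapped


segments = [("Alice", "PER"), (" flew to ", None), ("Paris", "LOC")]
pool = {"PER": ["Alice", "Bob"], "LOC": ["Paris", "Lima"]}
result = swap_seen_entities(segments, pool)
sentence = "".join(text for text, _ in result)
```

Because the labels travel with the segments, the expected entity types stay known after the swap, which is what allows the model's output on the transformed input to be scored.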

Swap Unseen Entities

Transformations

nlp

This test measures the robustness of your model to Swap Unseen Entities transformations. It does this by taking a sample input, swapping all the entities in a text with random entities of the same category, unseen in the data, and measuring the behavior of the model on the transformed input. This test supports swapping entities from commonly-appearing categories in NER tasks: Person, Geopolitical Entity, Location, Nationality, Product, Corporation, and Organization.

Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.

By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.

[nlp] named_entity_recognition

Average Number of Predicted Entities

Model Performance

nlp

This test checks the Average Number of Predicted Entities metric in two ways: whether its value on the evaluation set alone is satisfactory, and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Average Number of Predicted Entities metric with the below thresholds set for the absolute and degradation tests.

[nlp] named_entity_recognition
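The two checks described above (an absolute check on the evaluation set and a degradation check against the reference set) can be sketched as below. The threshold names `min_value` and `max_drop`, their values, and the reading of "degradation" as a decrease in the metric are assumptions of this example.

```python
def average_predicted_entities(predictions):
    """predictions holds one list of predicted entities per input."""
    if not predictions:
        raise ValueError("need at least one input")
    return sum(len(entities) for entities in predictions) / len(predictions)


def check_metric(reference_value, evaluation_value, min_value, max_drop):
    """Return (absolute_ok, degradation_ok) for a higher-is-better metric."""
    absolute_ok = evaluation_value >= min_value
    degradation_ok = (reference_value - evaluation_value) <= max_drop
    return absolute_ok, degradation_ok


reference_avg = average_predicted_entities([["Alice", "Paris"], ["Bob", "Lima"]])
evaluation_avg = average_predicted_entities([["Alice", "Paris"], ["Bob"], []])
status = check_metric(reference_avg, evaluation_avg, min_value=0.5, max_drop=0.5)
```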

Gaussian Blur

Transformations

cv

This test measures the robustness of your model to Gaussian Blur transformations. It does this by taking a sample input, blurring the image, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification

Color Jitter

Transformations

cv

This test measures the robustness of your model to Color Jitter transformations. It does this by taking a sample input, jittering the image colors, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification

Gaussian Noise

Transformations

cv

This test measures the robustness of your model to Gaussian Noise transformations. It does this by taking a sample input, adding Gaussian noise to the image, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification
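As a minimal illustration of the Gaussian Noise transformation, the sketch below adds zero-mean noise to an 8-bit grayscale image stored as a nested list and clips the result to the valid range. The image representation, the default `sigma`, and the function name are assumptions of this example; a real pipeline would typically operate on arrays.

```python
import random


def add_gaussian_noise(image, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]


image = [[0, 128, 255], [64, 192, 32]]
noisy = add_gaussian_noise(image, sigma=25.0)
```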

Vertical Flip

Transformations

cv

This test measures the robustness of your model to Vertical Flip transformations. It does this by taking a sample input, flipping the image vertically, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification

Horizontal Flip

Transformations

cv

This test measures the robustness of your model to Horizontal Flip transformations. It does this by taking a sample input, flipping the image horizontally, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification
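The Vertical Flip and Horizontal Flip transformations are simple to state precisely. Assuming an image stored as a list of pixel rows (an assumption of this sketch), they reduce to reversing the row order or reversing each row:

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom by reversing the order of the rows."""
    return [row[:] for row in image[::-1]]


image = [[1, 2], [3, 4]]
mirrored = flip_horizontal(image)
upside_down = flip_vertical(image)
```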

Randomize Pixels With Mask

Transformations

cv

This test measures the robustness of your model to Randomize Pixels With Mask transformations. It does this by taking a sample input, randomizing pixels with fixed probability, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification
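Randomizing pixels with a fixed probability can be sketched as an independent coin flip per pixel. The 8-bit grayscale representation, the default probability, and the function name below are assumptions made for this example.

```python
import random


def randomize_pixels(image, probability=0.1, seed=0):
    """Replace each pixel with a random 8-bit value with the given probability."""
    rng = random.Random(seed)
    return [
        [rng.randrange(256) if rng.random() < probability else px for px in row]
        for row in image
    ]


image = [[10, 20, 30], [40, 50, 60]]
unchanged = randomize_pixels(image, probability=0.0)
scrambled = randomize_pixels(image, probability=1.0)
```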

Contrast Increase

Transformations

cv

This test measures the robustness of your model to Contrast Increase transformations. It does this by taking a sample input, increasing the image contrast, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification

Contrast Decrease

Transformations

cv

This test measures the robustness of your model to Contrast Decrease transformations. It does this by taking a sample input, decreasing the image contrast, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification
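One simple way to model the Contrast Increase and Contrast Decrease transformations is to scale each pixel's distance from mid-gray. In the sketch below a `factor` above 1 increases contrast and a factor below 1 decreases it; the pivot value of 128, the 8-bit grayscale representation, and the function name are assumptions of this example.

```python
def adjust_contrast(image, factor, pivot=128):
    """Scale each pixel's distance from the pivot and clip to [0, 255]."""
    return [
        [min(255, max(0, round(pivot + factor * (px - pivot)))) for px in row]
        for row in image
    ]


image = [[100, 128, 150]]
higher = adjust_contrast(image, factor=2.0)
lower = adjust_contrast(image, factor=0.5)
```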

Add Rain

Transformations

cv

This test measures the robustness of your model to Add Rain transformations. It does this by taking a sample input, adding rain texture to the image, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification

Add Snow

Transformations

cv

This test measures the robustness of your model to Add Snow transformations. It does this by taking a sample input, adding snow texture to the image, and measuring the behavior of the model on the transformed input.

Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

[cv] image_classification

Area of Predicted Boxes Distribution

Drift

cv

This test measures the predicted box area distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.

The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the predicted box area distribution between these sets, it can lead to subpar real-world model performance.

To pass a given test case, the divergence metric must be below the configured threshold.

[cv] object_detection

Average Number of Predicted Boxes

Model Performance

cv

This test checks the Average Number of Predicted Boxes metric in two ways: whether its value on the evaluation set alone is satisfactory, and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

By default, this test runs over the Average Number of Predicted Boxes metric with the below thresholds set for the absolute and degradation tests.

[cv] object_detection

RIME local trial