Image Classification Tests

Subset Performance

Subset Macro F1

F1 is a holistic measure of both precision and recall. When transitioning to the multiclass setting, we can use macro F1, which computes the F1 of each class and averages them. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest-performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the macro F1 of model predictions within a specific subset is significantly lower than the macro F1 of model predictions over the entire population.

Why it matters: Having different macro F1 between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and the presence of spurious correlations.

Configuration: By default, macro F1 is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted probability.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the macro F1 across this subset is 0.78. If the overall macro F1 across all subsets is 0.9 then this test raises a warning.
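
The mechanics of the subset split can be sketched in a few lines of Python. The snippet below is a minimal illustration, assuming a pandas DataFrame with hypothetical prediction and label columns and using scikit-learn's macro F1; it is a sketch of the idea, not the test's actual implementation.

```python
import pandas as pd
from sklearn.metrics import f1_score

def subset_macro_f1_gap(df, feature, label_col, pred_col, n_quantiles=4):
    """Gap between the overall macro F1 and the worst subset's macro F1."""
    overall = f1_score(df[label_col], df[pred_col], average="macro")

    # Categorical features are split by value; numeric features by quantile bins.
    if df[feature].dtype == object:
        groups = df.groupby(feature)
    else:
        groups = df.groupby(pd.qcut(df[feature], q=n_quantiles, duplicates="drop"),
                            observed=True)

    worst = min(
        f1_score(group[label_col], group[pred_col], average="macro")
        for _, group in groups
    )
    return overall - worst  # a large gap points to a low-performing subset
```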

Subset Macro Precision

The precision test is also popularly referred to as positive predictive parity in the fairness literature. When transitioning to the multiclass setting, we can compute macro precision, which computes the precision of each class individually and then averages them. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest-performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Precision of model predictions within a specific subset is significantly lower than the Macro Precision of model predictions over the entire population.

Why it matters: Having different macro precision (i.e. different false discovery rates) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and the presence of spurious correlations. Unlike demographic parity, this test allows for different base label rates across subgroups but flags differing mistake rates between them. Note that positive predictive parity does not necessarily imply equal opportunity or predictive equality: as a hypothetical example, imagine that a loan qualification classifier flags 100 entries for group A and 100 entries for group B, each with a precision of 100%, but there are 100 actually qualified entries in group A and 9000 in group B. This would indicate disparities in the opportunities given to each subgroup.

Configuration: By default, Macro Precision is computed over all predictions/labels. Note that the predicted label is the label with the greatest predicted probability.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the Macro Precision across this subset is 0.67. If the overall Macro Precision across all subsets is 0.9 then this test raises a warning.
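
To make the arithmetic concrete, the macro precision of 0.67 in this example can be reproduced with scikit-learn (an illustrative choice of library, not necessarily what the test uses internally):

```python
import numpy as np
from sklearn.metrics import precision_score

probs  = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.2, 0.1, 0.7]])
y_true = np.array([0, 0, 2])       # 0 = cat, 1 = bear, 2 = dog
y_pred = probs.argmax(axis=1)      # predicted label = class with the largest probability

# Per-class precisions are cat = 1.0, bear = 0.0, dog = 1.0, so the macro average is ~0.67.
print(precision_score(y_true, y_pred, average="macro"))  # 0.666...
```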

Subset Macro Recall

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in the fairness literature. When transitioning to the multiclass setting, we can use macro recall, which computes the recall of each individual class and then averages these numbers. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest-performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Recall of model predictions within a specific subset is significantly lower than the Macro Recall of model predictions over the entire population.

Why it matters: Having different true positive rates (i.e. equal opportunity) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and the presence of spurious correlations. Unlike demographic parity, this test allows for different base label rates across subgroups but flags differing mistake rates between them. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, we want the rate at which the model predicts an interview, among qualified candidates, to be similar for groups A and B.

Configuration: By default, Macro Recall is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted class probability.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the Macro Recall across this subset is 0.67. If the overall Macro Recall across all subsets is 0.9 then this test raises a warning.

Subset Multiclass Accuracy

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the accuracy of model predictions within a specific subset is significantly lower than the model prediction accuracy over the entire population.

Why it matters: Having different accuracy between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and the presence of spurious correlations. Accuracy can be thought of as a 'weaker' metric of model bias compared to measuring false positive rate (predictive equality) or false negative rate (equal opportunity). This is because group A and group B can have similar accuracy while group A has a higher false positive rate and group B has a higher false negative rate (e.g. we reject qualified applicants in group A but accept non-qualified applicants in group B). Nevertheless, accuracy is a standard metric used during evaluation and should be considered as part of performance bias testing.

Configuration: By default, accuracy is computed over all predictions/labels. Note that we round predictions to 0/1 to compute accuracy.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the accuracy over the feature subset value 'cat' would be 0.33, compared to the overall metric of 0.5.
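
A few lines of NumPy reproduce these numbers; the array names are just placeholders for the feature, prediction, and label columns described above:

```python
import numpy as np

animal = np.array(["cat", "dog", "cat", "dog", "cat", "dog"])
preds  = np.array([0.3, 0.51, 0.7, 0.49, 0.9, 0.58])
labels = np.array([1, 0, 1, 0, 0, 1])

pred_labels = np.round(preds).astype(int)                    # round predictions to 0/1
overall_acc = (pred_labels == labels).mean()                 # 0.5
cat_acc = (pred_labels == labels)[animal == "cat"].mean()    # ~0.33

print(overall_acc, cat_acc)
```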

Subset Multiclass AUC

In the multiclass setting, we compute one vs. one area under the curve (AUC), which computes the AUC between every pairwise combination of classes. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Area Under Curve (AUC) of model predictions within a specific subset is significantly lower than the model prediction Area Under Curve (AUC) over the entire population.

Why it matters: Having different AUC between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and the presence of spurious correlations.

Configuration: By default, AUC is computed over all predictions/labels. Note that we compute AUC of the Receiver Operating Characteristic (ROC) curve.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the AUC (one vs. one) across this subset is 0.75. If the overall AUC (one vs. one) across all subsets is 0.9 then this test raises a warning.
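
The 0.75 value in this example matches scikit-learn's one-vs-one AUC, which averages pairwise binary AUCs over the classes that actually appear in the subset's labels. The snippet below is shown for illustration only; the test's internal implementation may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

probs  = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.2, 0.1, 0.7]])
y_true = np.array([0, 0, 2])  # 0 = cat, 1 = bear, 2 = dog

# The `labels` argument is needed because the bear class never appears in this subset.
auc_ovo = roc_auc_score(y_true, probs, multi_class="ovo", labels=[0, 1, 2])
print(auc_ovo)  # 0.75
```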

Subset Positive Prediction Rate

Subset Average Confidence

Transformations

Gaussian Blur

This test measures the robustness of your model to Gaussian Blur transformations. It does this by taking a sample input, blurring the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.
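
As a rough sketch of what such a robustness check looks like, the snippet below blurs an image with torchvision and compares the model's prediction before and after the transformation. The tiny stand-in model, the random image, and the blur parameters are placeholders chosen so the example runs end to end; they are not the test's actual configuration.

```python
import torch
from torch import nn
from torchvision import transforms

# Stand-in model and random image, purely so the sketch is runnable.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
image = torch.rand(3, 32, 32)

blur = transforms.GaussianBlur(kernel_size=5, sigma=1.0)

with torch.no_grad():
    original_probs = model(image.unsqueeze(0)).softmax(dim=-1)
    blurred_probs  = model(blur(image).unsqueeze(0)).softmax(dim=-1)

# One simple robustness check: does the predicted class change under the blur?
prediction_flipped = original_probs.argmax() != blurred_probs.argmax()
print(prediction_flipped.item())
```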

Color Jitter

This test measures the robustness of your model to Color Jitter transformations. It does this by taking a sample input, jittering the image colors, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Gaussian Noise

This test measures the robustness of your model to Gaussian Noise transformations. It does this by taking a sample input, adding Gaussian noise to the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Vertical Flip

This test measures the robustness of your model to Vertical Flip transformations. It does this by taking a sample input, flipping the image vertically, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Horizontal Flip

This test measures the robustness of your model to Horizontal Flip transformations. It does this by taking a sample input, flipping the image horizontally, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Randomize Pixels With Mask

This test measures the robustness of your model to Randomize Pixels With Mask transformations. It does this by taking a sample input, randomizing pixels with fixed probability, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Contrast Increase

This test measures the robustness of your model to Contrast Increase transformations. It does this by taking a sample input, increasing the image contrast, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Contrast Decrease

This test measures the robustness of your model to Contrast Decrease transformations. It does this by taking a sample input, decreasing the image contrast, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Add Rain

This test measures the robustness of your model to Add Rain transformations. It does this by taking a sample input, adding rain texture to the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Add Snow

This test measures the robustness of your model to Add Snow transformations. It does this by taking a sample input, adding snow texture to the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

Model Performance

Average Confidence

This test checks the average confidence of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. The "confidence" of a prediction for classification tasks is the probability assigned to the predicted class (the argmax over the prediction vector). We average this metric across all predictions.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.

Configuration: By default, this test runs if predictions are specified (no labels required).

Example: Assume that on the reference set the model obtained 0.85 average confidence but on the evaluation set (which has no labels) the model obtains 0.5 average confidence. Then this test raises a warning.
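
A minimal sketch of the metric, assuming "confidence" is taken to be the predicted-class probability; the toy arrays below are made up purely for illustration:

```python
import numpy as np

def average_confidence(probs: np.ndarray) -> float:
    """Mean predicted-class probability over an (n_samples, n_classes) array."""
    return float(probs.max(axis=1).mean())

ref_probs  = np.array([[0.9, 0.1], [0.8, 0.2], [0.15, 0.85]])  # toy reference-set predictions
eval_probs = np.array([[0.55, 0.45], [0.4, 0.6], [0.5, 0.5]])  # toy evaluation-set predictions

print(average_confidence(ref_probs), average_confidence(eval_probs))  # 0.85 vs. ~0.55
```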

Average Thresholded Confidence

This test checks the average thresholded confidence (ATC) of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. ATC is a method for estimating the accuracy of a model on unlabeled examples, taken from this paper. The threshold is first computed on the reference set: we pick a confidence threshold such that the percentage of datapoints whose max predicted probability is less than the threshold is approximately equal to the error rate of the model (i.e. 1 - accuracy) on the reference set. Then we apply this threshold to the evaluation set: the predicted accuracy is the percentage of datapoints with max predicted probability greater than this threshold.

Why it matters: During production, factors like distribution shift may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.

Configuration: By default, this test runs if predictions/labels are specified in the reference set and predictions are specified in the eval set (no labels required).

Example: Assume that on the reference set the model obtained 0.85 accuracy but on the evaluation set, we find that only 55 percent of datapoints have max predicted probability greater than our threshold. Then our predicted accuracy is 0.55 and this test raises a warning.
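
The procedure can be sketched in a few lines of NumPy. Details such as the quantile interpolation and the strictness of the inequality are assumptions here and may differ from the actual test:

```python
import numpy as np

def atc_predicted_accuracy(ref_probs, ref_labels, eval_probs):
    """Average Thresholded Confidence: estimate evaluation accuracy without eval labels."""
    ref_conf = ref_probs.max(axis=1)
    ref_acc  = (ref_probs.argmax(axis=1) == ref_labels).mean()

    # Pick the confidence threshold so that the fraction of reference points below it
    # roughly matches the reference error rate (1 - accuracy).
    threshold = np.quantile(ref_conf, 1.0 - ref_acc)

    # Predicted eval accuracy = fraction of eval points whose confidence clears the threshold.
    return (eval_probs.max(axis=1) > threshold).mean()
```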

Calibration Comparison

This test checks that the reference and evaluation sets have sufficiently similar calibration curves as measured by the Mean Squared Error (MSE) between the two curves. The calibration curve is a line plot where the x-axis represents the average predicted probability and the y-axis is the observed fraction of positive labels. The curve of an ideally calibrated model is thus the straight line y = x from (0, 0) to (1, 1).

Why it matters: Knowing how well-calibrated your model is can help you better interpret and act upon model outputs, and can even be an indicator of generalization. A greater difference between reference and evaluation curves could indicate a lack of generalizability. In addition, a change in calibration could indicate that decision-making or thresholding conducted upstream needs to change as it is behaving differently on held-out data.

Configuration: By default, this test runs over the predictions and labels.

Example: Suppose the model’s task is binary classification and predicts whether or not a data point is fraudulent. If we have a reference set in which 1% of the data points are fraudulent, but an evaluation set where 50% are fraudulent, then our model may not be well calibrated, and the MSE difference in the curves will be large, resulting in a failing test.
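
A simplified version of this comparison, using fixed uniform probability bins (the actual test's binning strategy may differ), looks roughly like this:

```python
import numpy as np

def calibration_curve_fixed_bins(y_true, y_prob, n_bins=10):
    """Fraction of positive labels per uniform probability bin (NaN where a bin is empty)."""
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    return np.array([
        y_true[bins == b].mean() if np.any(bins == b) else np.nan
        for b in range(n_bins)
    ])

def calibration_mse(ref_true, ref_prob, eval_true, eval_prob, n_bins=10):
    ref_curve  = calibration_curve_fixed_bins(ref_true, ref_prob, n_bins)
    eval_curve = calibration_curve_fixed_bins(eval_true, eval_prob, n_bins)
    valid = ~np.isnan(ref_curve) & ~np.isnan(eval_curve)  # compare only bins populated in both sets
    return np.mean((ref_curve[valid] - eval_curve[valid]) ** 2)
```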

Average Prediction

This test checks the Average Prediction metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Average Prediction has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Average Prediction metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained an average prediction of 0.85 but on the evaluation set the model obtained an average prediction of 0.5. Then this test raises a warning.
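
All of the Model Performance tests in this section follow the same absolute-plus-degradation pattern. The sketch below illustrates that pattern with made-up threshold values; the real defaults come from the test configuration:

```python
def check_metric(ref_value, eval_value, abs_threshold, degradation_threshold):
    """Generic pattern shared by these performance tests (threshold values are illustrative)."""
    failures = []
    if eval_value < abs_threshold:
        failures.append(f"evaluation value {eval_value:.2f} below absolute threshold {abs_threshold}")
    if ref_value - eval_value > degradation_threshold:
        failures.append(f"degraded by {ref_value - eval_value:.2f} from reference to evaluation")
    return failures  # a non-empty list means the test raises a warning

# Example from the text: reference 0.85 vs. evaluation 0.5 with a 0.1 degradation threshold.
print(check_metric(0.85, 0.5, abs_threshold=0.6, degradation_threshold=0.1))
```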

Macro F1

This test checks the Macro F1 metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Macro F1 has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Macro F1 metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained a Macro F1 of 0.85 but on the evaluation set the model obtained a Macro F1 of 0.5. Then this test raises a warning.

Macro Precision

This test checks the Macro Precision metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Macro Precision has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Macro Precision metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained a Macro Precision of 0.85 but on the evaluation set the model obtained a Macro Precision of 0.5. Then this test raises a warning.

Macro Recall

This test checks the Macro Recall metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Macro Recall has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Macro Recall metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained a Macro Recall of 0.85 but on the evaluation set the model obtained a Macro Recall of 0.5. Then this test raises a warning.

Multiclass Accuracy

This test checks the Multiclass Accuracy metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Multiclass Accuracy has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Multiclass Accuracy metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained a Multiclass Accuracy of 0.85 but on the evaluation set the model obtained a Multiclass Accuracy of 0.5. Then this test raises a warning.

Multiclass AUC

This test checks the Multiclass AUC metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Multiclass AUC has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Multiclass AUC metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained 0.85 AUC but on the evaluation set the model obtained 0.5 AUC. Then this test raises a warning.

Positive Prediction Rate

This test checks the Positive Prediction Rate metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Positive Prediction Rate has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Positive Prediction Rate metric with the below thresholds set for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained a Positive Prediction Rate of 0.85 but on the evaluation set the model obtained a Positive Prediction Rate of 0.5. Then this test raises a warning.