# Object Detection Tests

## Subset Performance

### Subset F1

### Subset Precision

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire population.

Why it matters: Having different Precision between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but can also indicate inadequate feature representation or spurious correlations.

Configuration: By default, Precision is computed over all predictions/labels.

Example: Suppose that in our subset the ground truth for an image contains two cats and one dog. Suppose the model's detections yield two true positives (the cats), one false positive (it predicts a bird), and one false negative (it misses the dog). This leads to a Precision of 2 / (2 + 1) ≈ 0.67 on this subset of data. We then compare that to the overall Precision on the full dataset.
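
To make the mechanics concrete, here is a minimal sketch of the subset comparison, assuming per-image true-positive and false-positive box counts have already been computed by matching predictions to ground truth; the column and function names are illustrative, not the test suite's actual API.

```python
import pandas as pd

def subset_precision_gaps(df: pd.DataFrame, feature: str, n_quantiles: int = 4) -> pd.Series:
    """Precision per subset minus overall Precision.

    Assumes `df` has per-image columns `tp` and `fp` (true/false positive box counts).
    """
    overall = df["tp"].sum() / (df["tp"].sum() + df["fp"].sum())

    # Numeric features are split by quantile; categorical features by their values.
    if pd.api.types.is_numeric_dtype(df[feature]):
        groups = pd.qcut(df[feature], q=n_quantiles, duplicates="drop")
    else:
        groups = df[feature]

    sums = df.groupby(groups, observed=True)[["tp", "fp"]].sum()
    subset_precision = sums["tp"] / (sums["tp"] + sums["fp"])
    # Negative gaps mark subsets that underperform the overall population.
    return subset_precision - overall
```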

### Subset Recall

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.

Why it matters: Having different Recall between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but can also indicate inadequate feature representation or spurious correlations.

Configuration: By default, Recall is computed over all predictions/labels.

Example: Suppose that in our subset the ground truth for an image contains two cats and one dog. Suppose the model's detections yield two true positives (the cats), one false positive (it predicts a bird), and one false negative (it misses the dog). This leads to a Recall of 2 / (2 + 1) ≈ 0.67 on this subset of data. We then compare that to the overall Recall on the full dataset.
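
For intuition, the TP/FP/FN counts in these examples can be recovered from raw boxes with a greedy IoU match. The sketch below is a simplification (it ignores class labels and confidence ordering) and is not the exact matching procedure used by the test.

```python
def iou(a, b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(pred_boxes, gt_boxes, iou_thresh: float = 0.5):
    """Greedily match predictions to ground truth; each ground-truth box matches at most once."""
    matched, tp = set(), 0
    for p in pred_boxes:
        candidates = [(iou(p, g), i) for i, g in enumerate(gt_boxes) if i not in matched]
        if candidates:
            best_iou, best_i = max(candidates)
            if best_iou >= iou_thresh:
                matched.add(best_i)
                tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)  # (precision, recall)
```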

### Subset Average Number of Predicted Boxes

## Transformations

### Gaussian Blur

This test measures the robustness of your model to Gaussian Blur transformations. It does this by taking a sample input, blurring the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.
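
As a rough illustration of how such a robustness check can be wired up (this is a sketch, not the test suite's implementation; `model` is an assumed callable that returns predicted boxes):

```python
from PIL import Image, ImageFilter

def compare_under_blur(image_path: str, model, radius: float = 2.0):
    """Run the model on an image and on a Gaussian-blurred copy, then compare detections."""
    original = Image.open(image_path).convert("RGB")
    blurred = original.filter(ImageFilter.GaussianBlur(radius=radius))

    boxes_original = model(original)  # assumed to return a list of predicted boxes
    boxes_blurred = model(blurred)

    # A large change in the detections (count, locations, or scores) indicates
    # sensitivity to blur; here we only compare counts for brevity.
    return len(boxes_original), len(boxes_blurred)
```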

### Color Jitter

This test measures the robustness of your model to Color Jitter transformations. It does this by taking a sample input, jittering the image colors, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

### Gaussian Noise

This test measures the robustness of your model to Gaussian Noise transformations. It does this by taking a sample input, adding Gaussian noise to the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.
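
The perturbation itself is simple; a sketch assuming images are handled as uint8 NumPy arrays:

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, std: float = 10.0, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 image and clip back to [0, 255]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=std, size=image.shape)
    return np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
```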

### Randomize Pixels With Mask

This test measures the robustness of your model to Randomize Pixels With Mask transformations. It does this by taking a sample input, randomizing pixels with fixed probability, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

### Vertical Flip

This test measures the robustness of your model to Vertical Flip transformations. It does this by taking a sample input, flipping the image vertically, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

### Horizontal Flip

This test measures the robustness of your model to Horizontal Flip transformations. It does this by taking a sample input, flipping the image horizontally, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.
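
One detail specific to detection models: if the flipped image is scored against ground truth, the ground-truth boxes must be flipped as well. A sketch of that coordinate transform, assuming boxes in (x1, y1, x2, y2) pixel coordinates (a vertical flip is analogous, mirroring y against the image height):

```python
import numpy as np
from PIL import Image, ImageOps

def hflip_image_and_boxes(image: Image.Image, boxes: np.ndarray):
    """Horizontally flip an image together with its (x1, y1, x2, y2) boxes."""
    flipped_image = ImageOps.mirror(image)  # left-right flip
    flipped_boxes = boxes.astype(float)
    # Mirror x-coordinates around the image width and swap x1/x2 so x1 <= x2 still holds.
    flipped_boxes[:, [0, 2]] = image.width - boxes[:, [2, 0]]
    return flipped_image, flipped_boxes
```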

### Contrast Increase

This test measures the robustness of your model to Contrast Increase transformations. It does this by taking a sample input, increasing the image contrast, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

### Contrast Decrease

This test measures the robustness of your model to Contrast Decrease transformations. It does this by taking a sample input, decreasing the image contrast, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.
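
Both contrast transformations can be expressed through a single enhancement factor; for example, with Pillow a factor above 1.0 increases contrast and a factor below 1.0 decreases it (shown as an illustration, not necessarily how this test generates its inputs):

```python
from PIL import Image, ImageEnhance

def adjust_contrast(image: Image.Image, factor: float) -> Image.Image:
    """factor > 1.0 increases contrast, factor < 1.0 decreases it, 1.0 is a no-op."""
    return ImageEnhance.Contrast(image).enhance(factor)

# Example usage for a robustness check:
# higher = adjust_contrast(img, 1.5)  # Contrast Increase variant
# lower = adjust_contrast(img, 0.5)   # Contrast Decrease variant
```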

### Add Rain

This test measures the robustness of your model to Add Rain transformations. It does this by taking a sample input, adding rain texture to the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.

### Add Snow

This test measures the robustness of your model to Add Snow transformations. It does this by taking a sample input, adding snow texture to the image, and measuring the behavior of the model on the transformed input.

Why it matters: Production inputs can have unusual variations amongst many different dimensions, ranging from lighting changes to sensor errors to compression artifacts. It is important that your models are robust to the introduction of such variations.
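
Synthetic weather effects like rain and snow are commonly generated with augmentation libraries; as one illustration (an assumption about tooling, not this test's implementation), the albumentations library provides RandomRain and RandomSnow transforms:

```python
import albumentations as A
import numpy as np

# Always-on weather transforms for robustness probing.
rain = A.Compose([A.RandomRain(p=1.0)])
snow = A.Compose([A.RandomSnow(p=1.0)])

def with_weather(image: np.ndarray):
    """Return rainy and snowy variants of an RGB uint8 image."""
    return rain(image=image)["image"], snow(image=image)["image"]
```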

## Model Performance

### Average Confidence

This test compares the average confidence of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. The "confidence" of a prediction for classification tasks is the probability assigned to the predicted class (the argmax over the prediction vector). We average this value across all predictions.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.

Configuration: By default, this test runs if predictions are specified (no labels required).

Example: Assume that the model obtained an average confidence of 0.85 on the reference set, but on the (unlabeled) evaluation set the average confidence is only 0.5. Then this test raises a warning.
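
A sketch of the metric, assuming predictions are available as a matrix of per-class probabilities with one row per example:

```python
import numpy as np

def average_confidence(probs: np.ndarray) -> float:
    """Mean of the predicted-class probability (the row-wise max) over all predictions."""
    return float(np.max(probs, axis=1).mean())

# e.g. an average of ~0.85 on the reference set dropping to ~0.5 on the
# evaluation set would trigger this test's warning.
```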

### Average Thresholded Confidence

This test checks the average thresholded confidence (ATC) of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. ATC is a method from the research literature for estimating the accuracy of a model on unlabeled examples. The threshold is first computed on the reference set: we pick a confidence threshold such that the percentage of datapoints whose max predicted probability is less than the threshold is approximately equal to the error rate of the model (i.e., 1 - accuracy) on the reference set. We then apply this threshold to the evaluation set: the predicted accuracy is the percentage of datapoints with max predicted probability greater than this threshold.

Why it matters: During production, factors like distribution shift may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.

Configuration: By default, this test runs if predictions/labels are specified in the reference set and predictions are specified in the eval set (no labels required).

Example: Assume that on the reference set the model obtained 0.85 accuracy but on the evaluation set, we find that only 55 percent of datapoints have max predicted probability greater than our threshold. Then our predicted accuracy is 0.55 and this test raises a warning.
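
A sketch of the two steps, threshold selection on the labeled reference set and accuracy estimation on the unlabeled evaluation set:

```python
import numpy as np

def atc_threshold(ref_probs: np.ndarray, ref_labels: np.ndarray) -> float:
    """Pick a confidence threshold so that P(max prob < t) matches the reference error rate."""
    max_probs = ref_probs.max(axis=1)
    accuracy = float((ref_probs.argmax(axis=1) == ref_labels).mean())
    error_rate = 1.0 - accuracy
    # The error-rate quantile of the confidence distribution is the desired threshold.
    return float(np.quantile(max_probs, error_rate))

def atc_predicted_accuracy(eval_probs: np.ndarray, threshold: float) -> float:
    """Fraction of evaluation points whose max predicted probability exceeds the threshold."""
    return float((eval_probs.max(axis=1) > threshold).mean())
```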

### Calibration Comparison

This test checks that the reference and evaluation sets have sufficiently similar calibration curves as measured by the Mean Squared Error (MSE) between the two curves. The calibration curve is a line plot where the x-axis is the average predicted probability in each bin and the y-axis is the observed proportion of positives in that bin. For a perfectly calibrated model, the curve is the identity line from (0, 0) to (1, 1).

Why it matters: Knowing how well-calibrated your model is can help you better interpret and act upon model outputs, and can even be an indicator of generalization. A greater difference between reference and evaluation curves could indicate a lack of generalizability. In addition, a change in calibration could indicate that decision-making or thresholding conducted upstream needs to change as it is behaving differently on held-out data.

Configuration: By default, this test runs over the predictions and labels.

Example: Suppose the model’s task is binary classification and predicts whether or not a data point is fraudulent. If we have a reference set in which 1% of the data points are fraudulent, but an evaluation set where 50% are fraudulent, then our model may not be well calibrated, and the MSE difference in the curves will be large, resulting in a failing test.
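
A sketch of the comparison for binary classification using scikit-learn's calibration_curve; the MSE is taken between the two binned curves (the empty-bin handling here is a simplification):

```python
import numpy as np
from sklearn.calibration import calibration_curve

def calibration_mse(ref_labels, ref_probs, eval_labels, eval_probs, n_bins: int = 10) -> float:
    """MSE between the reference and evaluation calibration curves."""
    ref_true, _ = calibration_curve(ref_labels, ref_probs, n_bins=n_bins)
    eval_true, _ = calibration_curve(eval_labels, eval_probs, n_bins=n_bins)
    # Empty bins are dropped by calibration_curve, so the curves may differ in length;
    # truncating to the shorter one keeps the sketch simple.
    m = min(len(ref_true), len(eval_true))
    return float(np.mean((ref_true[:m] - eval_true[:m]) ** 2))
```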

### F1

This test checks the F1 metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of F1 has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the F1 metric, with configurable thresholds for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained 0.85 F1 but on the evaluation set the model obtained 0.5 F1. Then this test raises a warning.
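
The same absolute/degradation pattern applies to the Precision and Recall tests below. A minimal sketch with illustrative thresholds (not the actual defaults), where the F1 values are assumed to come from box-matched TP/FP/FN counts as in the subset examples above:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 computed from matched-box counts."""
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def metric_degrades(ref_value: float, eval_value: float,
                    abs_threshold: float = 0.5, rel_drop: float = 0.2) -> bool:
    """Flag if the eval metric is too low in absolute terms or has dropped
    by more than `rel_drop` relative to the reference value."""
    fails_absolute = eval_value < abs_threshold
    fails_degradation = (ref_value - eval_value) > rel_drop * ref_value
    return fails_absolute or fails_degradation
```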

### Precision

This test checks the Precision metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Precision has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Precision metric, with configurable thresholds for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained 0.85 Precision but on the evaluation set the model obtained 0.5 Precision. Then this test raises a warning.

### Recall

This test checks the Recall metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Recall has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Recall metric, with configurable thresholds for the absolute and degradation tests.

Example: Assume that on the reference set the model obtained 0.85 Recall but on the evaluation set the model obtained 0.5 Recall. Then this test raises a warning.

### Average Number of Predicted Boxes

This test checks the Average Number of Predicted Boxes metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Average Number of Predicted Boxes has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over the Average Number of Predicted Boxes metric, with configurable thresholds for the absolute and degradation tests.

Example: Assume that on the reference set the model predicted an average of 0.85 boxes per image but on the evaluation set it predicted an average of only 0.5 boxes per image. Then this test raises a warning.
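
A sketch of the metric and comparison, assuming predictions are stored as per-image lists of boxes:

```python
def average_predicted_boxes(predictions_per_image) -> float:
    """Mean number of predicted boxes per image."""
    n_images = len(predictions_per_image)
    return sum(len(boxes) for boxes in predictions_per_image) / max(n_images, 1)

# ref_avg = average_predicted_boxes(reference_predictions)
# eval_avg = average_predicted_boxes(evaluation_predictions)
# A large gap between ref_avg and eval_avg triggers a warning.
```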