Named Entity Recognition Tests¶
Abnormal Inputs¶
Numeric Outliers¶
This test measures the number of failing rows in your data with outliers and their impact on the model. Outliers are values which may not necessarily be outside of an allowed range for a feature, but are extreme values that are unusual and may be indicative of abnormality. The model impact is the difference in model performance between passing and failing rows with outliers. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Outliers can be a sign of corrupted or otherwise erroneous data, and can degrade model performance if used in the training data, or lead to unexpected behaviour if input at inference time.
Configuration: By default this test is run over each numeric feature that is neither unique nor ascending.
Example: Suppose there is a feature age for which in the reference set the values 103 and 114 each appear once but every other value (with substantial sample size) is contained within the range [0, 97]. Then we would infer a lower outlier threshold of 0 and an upper outlier threshold of 97. This test raises a warning if we observe any values in the evaluation set outside these thresholds or if model performance decreases on observed datapoints with outliers.
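A minimal sketch of how such bounds might be inferred from a reference column and applied to an evaluation column; the quantile cutoffs and helper names are illustrative assumptions, not the platform's actual inference logic.

```python
import pandas as pd

def infer_outlier_bounds(reference: pd.Series, lower_q: float = 0.001, upper_q: float = 0.999):
    """Infer lower/upper outlier thresholds from the reference distribution (assumed cutoffs)."""
    return reference.quantile(lower_q), reference.quantile(upper_q)

def outlier_mask(evaluation: pd.Series, lower: float, upper: float) -> pd.Series:
    """Boolean mask marking evaluation rows that fall outside the inferred range."""
    return (evaluation < lower) | (evaluation > upper)

# Toy usage: values far outside the bulk of the reference distribution are flagged.
ref_age = pd.Series(range(0, 98))                     # reference ages span [0, 97]
lower, upper = infer_outlier_bounds(ref_age)
eval_age = pd.Series([34, 120, 56, 140])
print(outlier_mask(eval_age, lower, upper).tolist())  # [False, True, False, True]
```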
Unseen Categorical¶
This test measures the number of failing rows in your data with unseen categorical values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen categorical values. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.
Configuration: By default, this test runs over all categorical features.
Example: Say that the feature Animal contains the values ['Cat', 'Dog'] in the reference set. This test raises a warning if we observe any unseen values in the evaluation set, such as 'Mouse', that cause a significant change in model performance. If labels/predictions are provided in the run, then a severity issue would be raised if the Average Prediction changed by 0.03. If labels/predictions were not provided but 'Mouse' appeared in 3% of the evaluation dataset, a severity issue would be raised due to the significant increase in presence of an unseen feature.
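The sketch below shows one way to count rows with unseen categorical values; it is only an approximation of the check described above, and the helper name is hypothetical.

```python
import pandas as pd

def unseen_category_rows(reference: pd.Series, evaluation: pd.Series) -> pd.Series:
    """Evaluation rows whose category never appears in the reference set."""
    seen = set(reference.dropna().unique())
    return evaluation[~evaluation.isin(seen)]

ref = pd.Series(["Cat", "Dog", "Cat", "Dog"])
ev = pd.Series(["Cat", "Mouse", "Dog", "Mouse"])
failing = unseen_category_rows(ref, ev)
print(len(failing) / len(ev))  # 0.5 -> fraction of evaluation rows with unseen values
```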
Unseen Domain¶
This test measures the number of failing rows in your data with unseen domain values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen domain values. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.
Configuration: By default, this test runs over all features inferred to contain domains.
Example: Say that the feature WebDomain contains the values ['gmail.com', 'hotmail.com'] in the reference set. This test raises a warning if we observe any unseen values in the evaluation set, such as 'xyzabc.com', that cause a significant change in model performance. If labels/predictions are provided in the run, then a severity issue would be raised if the Average Prediction changed by 0.03. If labels/predictions were not provided but 'xyzabc.com' appeared in 3% of the evaluation dataset, a severity issue would be raised due to the significant increase in presence of an unseen feature.
Unseen Email¶
This test measures the number of failing rows in your data with unseen email values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen email values. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.
Configuration: By default, this test runs over all features inferred to contain emails.
Example: Say that the feature Email contains the values ['[email protected]', '[email protected]'] from the reference set. This test raises a warning if we observe any unseen values in the evaluation set such as '[email protected]' that causes a significant change in model performance. If labels/predictions are provided in the run, then a severity issue would be raised if the Average Prediction changed by 0.03. If labels/predictions were not provided but '[email protected]' appeared in 3% of the evaluation dataset, a severity issue would be raised due to the significant increase in presence of an unseen feature.
Unseen URL¶
This test measures the number of failing rows in your data with unseen URL values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen URL values. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.
Configuration: By default, this test runs over all features inferred to contain URLs.
Example: Say that the feature WebURL contains the values ['http://google.com', 'http://yahoo.com'] in the reference set. This test raises a warning if we observe any unseen values in the evaluation set, such as 'http://xyzabc.com', that cause a significant change in model performance. If labels/predictions are provided in the run, then a severity issue would be raised if the Average Prediction changed by 0.03. If labels/predictions were not provided but 'xyzabc.com' appeared in 3% of the evaluation dataset, a severity issue would be raised due to the significant increase in presence of an unseen feature.
Rare Categories¶
This test measures the severity of passing to the model data points whose features contain rarely observed categories (relative to the reference set). The severity is a function of the impact of these values on the model, as well as the presence of these values in the data. The model impact is the difference in model performance between passing and failing rows with rarely observed categorical values. If labels are not provided, prediction change is used instead of model performance change. The number of failing rows refers to the number of times rarely observed categorical values are observed in the evaluation set.
Why it matters: Rare categories are a common failure point in machine learning systems because less data often means worse performance. In addition, this may expose gaps or errors in data collection.
Configuration: By default, this test runs over all categorical features. A category is considered rare if it occurs fewer than min_num_occurrences times, or if it occurs less than min_pct_occurrences of the time. If neither of these values is specified, the rate of appearance below which a category is considered rare is min_ratio_rel_uniform divided by the number of classes.
Example: Say that the feature AgeGroup takes on the value 0-18 twice while taking on the value 35-55 a total of 98 times. If min_num_occurrences is 5 and min_pct_occurrences is 0.03, then the test will flag the value 0-18 as a rare category.
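A hedged sketch of this logic, assuming the two thresholds are combined with a logical OR; the function name is hypothetical.

```python
import pandas as pd

def rare_categories(reference: pd.Series,
                    min_num_occurrences: int = 0,
                    min_pct_occurrences: float = 0.0) -> list:
    """Categories appearing fewer than min_num_occurrences times
    or in less than min_pct_occurrences of the reference rows."""
    counts = reference.value_counts()
    pcts = counts / len(reference)
    flagged = counts[(counts < min_num_occurrences) | (pcts < min_pct_occurrences)]
    return flagged.index.tolist()

age_group = pd.Series(["0-18"] * 2 + ["35-55"] * 98)
print(rare_categories(age_group, min_num_occurrences=5, min_pct_occurrences=0.03))
# ['0-18'] -- 2 occurrences (< 5) and a 2% appearance rate (< 3%)
```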
Out of Range¶
This test measures the number of failing rows in your data with values outside the inferred range of allowed values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values outside the inferred range of allowed values. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: In production, the model may encounter corrupted or manipulated out of range values. It is important that the model is robust to such extremities.
Configuration: By default, this test runs over all numeric features.
Example: In the reference set, the Age feature has a range of [0, 121]. This test raises a warning if we observe values outside of this range in the evaluation set (eg. 150, 200) or if model performance decreases on observed datapoints outside of this range.
Required Characters¶
This test measures the number of failing rows in your data with strings without any required characters and their impact on the model. The model impact is the difference in model performance between passing and failing rows with strings without any required characters. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: A feature may require specific characters. However, errors in the data pipeline may allow invalid data points that lack these required characters to pass. Failing to catch such errors may lead to noisier training data or noisier predictions during inference, which can degrade model metrics.
Configuration: By default, this test runs over all string features that are inferred to have required characters.
Example: Say that the feature email requires the character @. This test raises a warning if we observe any values in the evaluation set where the character is missing.
Inconsistencies¶
This test measures the severity of passing to the model data points whose values are inconsistent (as inferred from the reference set). The severity is a function of the impact of these values on the model, as well as the presence of these values in the data. The model impact is the difference in model performance between passing and failing rows with data containing inconsistent feature values. If labels are not provided, prediction change is used instead of model performance change. The number of failing rows refers to the number of times data containing inconsistent feature values are observed in the evaluation set.
Why it matters: Inconsistent values might be the result of malicious actors manipulating the data or errors in the data pipeline. Thus, it is important to be aware of inconsistent values to identify sources of manipulations or errors.
Configuration: By default, this test runs on pairs of categorical features whose correlations exceed some minimum threshold. The default threshold for the frequency ratio below which values are considered to be inconsistent is 0.02.
Example: Suppose we have a feature country that takes on the value "US" with frequency 0.5, and a feature time_zone that takes on the value "Central European Time" with frequency 0.2. If these values appear together in the reference set with frequency less than 0.5 * 0.2 * 0.02 = 0.002, then rows in which they do appear together are flagged as inconsistencies.
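The frequency-ratio rule above can be sketched as follows; this illustrates the arithmetic only, not the platform's implementation, and the function name is hypothetical.

```python
import pandas as pd

def inconsistent_value_pairs(reference: pd.DataFrame, col_a: str, col_b: str,
                             freq_ratio_threshold: float = 0.02) -> list:
    """Value pairs whose joint frequency in the reference set is far below
    the product of their marginal frequencies (what independence would predict)."""
    n = len(reference)
    freq_a = reference[col_a].value_counts(normalize=True)
    freq_b = reference[col_b].value_counts(normalize=True)
    joint = reference.groupby([col_a, col_b]).size() / n
    return [(a, b) for (a, b), p in joint.items()
            if p < freq_a[a] * freq_b[b] * freq_ratio_threshold]

# With P(country="US") = 0.5 and P(time_zone="Central European Time") = 0.2, the pair is
# inconsistent if its joint frequency falls below 0.5 * 0.2 * 0.02 = 0.002.
```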
Capitalization¶
This test measures the number of failing rows in your data with different types of capitalization and their impact on the model. The model impact is the difference in model performance between passing and failing rows with different types of capitalization. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: In production, models can come across the same value with different capitalizations, making it important to explicitly check that your model is invariant to such differences.
Configuration: By default, this test runs over all categorical features.
Example: Suppose we had a column that corresponded to country code. For a specific row, let's say the observed value in the reference set was USA. This test raises a warning if we observe a similar value in the evaluation set with case changes (e.g. uSa), or if model performance decreases on observed datapoints with case changes.
Empty String¶
This test measures the number of failing rows in your data with empty string values instead of null values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with empty string values instead of null values. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: In production, the model may encounter corrupted or manipulated string values. Null values and empty strings are often expected to be treated the same, but the model might not treat them that way. It is important that the model is robust to such extremities.
Configuration: By default, this test runs over all string features with null values.
Example: In the reference set, the Name feature contains nulls. This test raises a warning if we observe any empty string in the Name feature or if these values decrease model performance.
Embedding Anomalies¶
This test measures the number of failing rows in your data with anomalous embeddings and their impact on the model. The model impact is the difference in model performance between passing and failing rows with anomalous embeddings. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: In production, the presence of anomalous embeddings can indicate breaks in upstream data pipelines, poor model generalization, or other issues.
Configuration: By default, this test runs over all configured embeddings.
Example: Say that the 'user_id' embedding is two-dimensional and has a mean at the origin and a covariance matrix of [[1, 0], [0, 1]] in the reference set. This test will flag any embeddings in the test set that are distant from the reference distribution using the Mahalanobis distance.
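A minimal sketch of flagging anomalous embeddings with the Mahalanobis distance, assuming the reference embeddings are available as a NumPy array; the 3.0 cutoff is an illustrative choice, not the platform default.

```python
import numpy as np

def mahalanobis_distances(reference_emb: np.ndarray, eval_emb: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each evaluation embedding from the reference distribution."""
    mean = reference_emb.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(reference_emb, rowvar=False))  # pseudo-inverse for stability
    diff = eval_emb - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(0)
ref = rng.standard_normal((1000, 2))         # roughly mean 0, identity covariance
ev = np.array([[0.1, -0.2], [6.0, 6.0]])     # the second embedding is far from the reference
print(mahalanobis_distances(ref, ev) > 3.0)  # [False  True] with the assumed cutoff
```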
Unseen Unigram¶
This test measures the number of failing rows in your data with unseen unigrams and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen unigrams. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Unseen unigrams are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen unigram. In addition, such errors may expose gaps or errors in data collection.
Configuration: By default, this test is run over every data point.
Example: Say that there is a text field with value James went to his casa and the unigram casa was not seen in the reference set. This test would raise a warning flagging that datapoint, with the severity depending on how badly the model performed on that datapoint.
Empty Text String¶
This test measures the number of failing rows in your data with empty strings and their impact on the model. The model impact is the difference in model performance between passing and failing rows with empty strings. If labels are not provided, prediction change is used instead of model performance change.
Why it matters: Empty strings are a common failure point in machine learning systems, as some models may yield uninterpretable or undefined behavior when interacting with an empty string. In addition, such errors may expose gaps or errors in data collection.
Configuration: By default, this test is run over every data point.
Example: Say that there is a text field that is just an empty string. This test would raise a warning flagging that datapoint, with the severity depending on how badly the model performed on that datapoint.
Drift¶
Correlation Drift (Feature-to-Feature)¶
This test measures the severity of feature-feature correlation drift from the reference to the evaluation set for a given pair of features. The severity is a function of the correlation drift in the data. The key detail is the difference in correlation scores between the reference and evaluation sets, along with an associated p-value. Correlation is a measure of the linear relationship between two numeric columns (feature-feature), so this test checks for significant changes in this relationship for each feature-feature pair between the reference and evaluation sets. To compute the p-value, we use Fisher's z-transformation to convert the distribution of sample correlations to a normal distribution, and then we run a standard two-sample test on two normal distributions.
Why it matters: Correlation drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.
Configuration: By default, this test runs over all pairs of features in the dataset.
Example: Suppose that the correlation between country and state is 0.5 in the reference set but 0.7 in the evaluation set, and the p-value is 0.03. Then the large difference in scores indicates that the dependency between the two features has drifted. If our difference threshold was 0.2, and p-value threshold was 0.05, then the test would fail.
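A sketch of the p-value computation using Fisher's z-transformation, assuming the standard 1/(n - 3) variance for the transformed correlations; the exact convention used by the platform may differ.

```python
import numpy as np
from scipy import stats

def correlation_drift_pvalue(r_ref: float, n_ref: int, r_eval: float, n_eval: int) -> float:
    """Two-sided p-value for the difference between two sample correlations."""
    z_ref, z_eval = np.arctanh(r_ref), np.arctanh(r_eval)   # Fisher z-transformation
    se = np.sqrt(1.0 / (n_ref - 3) + 1.0 / (n_eval - 3))
    z = (z_ref - z_eval) / se
    return float(2 * stats.norm.sf(abs(z)))

# With correlations 0.5 vs 0.7, the test fails if the score difference exceeds the
# difference threshold and this p-value falls below the p-value threshold.
print(correlation_drift_pvalue(r_ref=0.5, n_ref=2000, r_eval=0.7, n_eval=2000))
```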
Correlation Drift (Feature-to-Label)¶
This test measures the severity of feature-label correlation drift from the reference to the evaluation set for a given feature-label pair. The severity is a function of the correlation drift in the data. The key detail is the difference in correlation scores between the reference and evaluation sets, along with an associated p-value. Correlation is a measure of the linear relationship between two numeric columns (feature-label), so this test checks for significant changes in this relationship for each feature-label pair between the reference and evaluation sets. To compute the p-value, we use Fisher's z-transformation to convert the distribution of sample correlations to a normal distribution, and then we run a standard two-sample test on two normal distributions.
Why it matters: Correlation drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.
Configuration: By default, this test runs over all pairs of features and labels in the dataset.
Example: Suppose that the correlation between LotArea and SalePrice is 0.4 in the reference set but 0.8 in the evaluation set, and the p-value is 0.15. Then the large difference in scores indicates that the impact of the feature on the label has drifted. If our difference threshold was 0.2, and p-value threshold was 0.05, then the test would fail.
Mutual Information Drift (Feature-to-Feature)¶
This test measures the severity of feature mutual information drift from the reference to the evaluation set for a given pair of features. The severity is a function of the mutual information drift in the data. The key detail is the difference in mutual information scores between the reference and evaluation sets. Mutual information is a measure of how dependent two features are, so this checks for significant changes in dependence between pairs of features in the reference and evaluation sets.
Why it matters: Mutual information drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.
Configuration: By default, this test runs over all pairs of features in the dataset.
Example: Suppose that the mutual information between country and state is 0.5 in the reference set but 0.7 in the evaluation set. Then the large difference in scores indicates that the dependency between the two features has drifted. If our difference threshold was 0.2 then the test would fail.
Mutual Information Drift (Feature-to-Label)¶
This test measures the severity of feature mutual information drift from the reference to the evaluation set for a given pair of features. The severity is a function of the mutual information drift in the data. The key detail is the difference in mutual information scores between the reference and evaluation sets. Mutual information is a measure of how dependent two features are, so this checks for significant changes in dependence between pairs of features in the reference and evaluation sets.
Why it matters: Mutual information drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signaling the need for relabeling and retraining.
Configuration: By default, this test runs over all pairs of features in the dataset.
Example: Suppose that the mutual information between country and state is 0.5 in the reference set but 0.7 in the evaluation set. Then the large difference in scores indicates that the dependency between the two features has drifted. If our difference threshold was 0.2 then the test would fail.
Label Drift (Categorical)¶
This test checks that the difference in label distribution between the reference and evaluation sets is small, using the PSI test. The key detail displayed is the PSI statistic, which is a measure of how different the frequencies of the column are between the reference and evaluation sets.
Why it matters: Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.
Configuration: This test is run by default whenever both the reference and evaluation sets have associated labels.
Example: Suppose that the observed frequencies of the label column are [100, 200] in the reference set but [25, 150] in the test set. Then the PSI would be 0.201. If our PSI threshold was 0.1, then the test would fail.
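A minimal PSI sketch over observed frequencies; exact values can differ slightly from the figure quoted above depending on smoothing and binning conventions.

```python
import numpy as np

def psi(ref_counts, eval_counts, eps: float = 1e-12) -> float:
    """Population Stability Index between two observed frequency vectors."""
    p = np.asarray(ref_counts, dtype=float)
    q = np.asarray(eval_counts, dtype=float)
    p, q = p / p.sum() + eps, q / q.sum() + eps   # normalize; guard against empty categories
    return float(np.sum((p - q) * np.log(p / q)))

print(round(psi([100, 200], [25, 150]), 3))       # ~0.21 with this convention
```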
Predicted Label Drift¶
This test checks that the difference in predicted label distribution between the reference and evaluation sets is small, using the PSI test. The key detail displayed is the PSI statistic, which is a measure of how different the frequencies of the column are between the reference and evaluation sets.
Why it matters: Predicted Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant predicted label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.
Configuration: This test is run by default whenever the model or predictions is provided.
Example: Suppose that the observed frequencies of the predicted label column are [100, 200] in the reference set but [25, 150] in the test set. Then the PSI would be 0.201. If our PSI threshold was 0.1, then the test would fail.
Label Drift (Regression)¶
This test checks that the difference in label distribution between the reference and evaluation sets is small, using the PSI test. The key detail displayed is the KS statistic, which is a measure of how different the labels in the reference and evaluation sets are. Concretely, the KS statistic is the maximum difference between the empirical CDFs of the two label columns.
Why it matters: Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.
Configuration: This test is run by default whenever both the reference and evaluation sets have associated labels.
Example: Suppose that the distribution of labels changes between the reference and evaluation sets such that the PSI between these two samples is 0.2. If the PSI threshold is 0.1, then this test would raise a warning.
Categorical Feature Drift¶
This test measures the severity of passing to the model data points that have categorical features which have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail displayed is the PSI test statistic, which is a measure of how statistically significant the difference between the frequencies of categorical values in the reference and evaluation sets is.
Why it matters: Distribution drift in categorical features between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in categorical features towards categorical subsets that your model performs poorly in could indicate a degradation in model performance and signal the need for relabeling and retraining.
Configuration: By default, this test runs over all categorical columns with sufficiently many samples.
Example: Suppose that the observed frequencies of the isLoggedIn feature are [100, 200] in the reference set but [25, 150] in the test set. Then the PSI would be 0.201. If our PSI threshold was 0.1, then the test would fail.
Numeric Feature Drift¶
This test measures the severity of passing to the model data points that have numeric features that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the Population Stability Index statistic. The Population Stability Index (PSI) is a measure of how different two distributions are. Given two distributions P and Q, it is computed as the sum of the KL Divergence between P and Q and the (reverse) KL Divergence between Q and P. Thus, PSI is symmetric.
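Written out over reference proportions $p_i$ and evaluation proportions $q_i$, the symmetric-KL definition above reduces to the usual bin-wise PSI formula:

$$\mathrm{PSI}(P, Q) = D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P) = \sum_i p_i \log\frac{p_i}{q_i} + \sum_i q_i \log\frac{q_i}{p_i} = \sum_i (p_i - q_i)\,\log\frac{p_i}{q_i}$$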
Why it matters: Distribution shift between training and inference can cause degradation in model performance. If the shift is sufficiently large, retraining the model on newer data may be necessary.
Configuration: By default, this test runs over all numeric columns with sufficiently many samples and stored quantiles in each of the reference and evaluation sets.
Example: Suppose that the distribution of a feature Age changes between the reference and evaluation sets such that the Population Stability Index between these two samples is 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Prediction Drift¶
This test checks that the difference in the prediction distribution between the reference and evaluation sets is small, using Population Stability Index. The key detail displayed is the PSI which is a measure of how different the prediction distributions in the reference and evaluation sets are.
Why it matters: Prediction distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant prediction distribution drift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.
Configuration: This test is run by default whenever both the reference and evaluation sets have associated predictions. Different thresholds are associated with different severities.
Example: Suppose that the PSI between the prediction distributions in the reference and evaluation sets is 0.201. Then if the PSI thresholds are (0.1, 0.2, 0.3), the test would fail with medium severity.
Embedding Drift¶
This test measures the severity of passing to the model data points associated with embeddings that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the Euclidean Distance statistic. The Euclidean Distance is defined as the square root of the sum of the squared differences between two vectors X and Y. The normalized version of this metric first divides each vector by its L2 norm. This test takes the normalized Euclidean distance between the centroids of the ref and eval data sets.
Why it matters: Distribution shift between training and inference can cause degradation in model performance. If the shift is sufficiently large, retraining the model on newer data may be necessary.
Configuration: By default, this test runs over all specified embeddings with sufficiently many samples in each of the reference and evaluation sets.
Example: Suppose that the distribution of an embedding User changes between the reference and evaluation sets such that the Euclidean Distance between these two samples is 0.3. If the distance threshold is set to 0.1, this test would raise a warning.
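A sketch of the drift statistic under one reading of the definition above, in which the centroids themselves are L2-normalized before taking the Euclidean distance; the function name is hypothetical.

```python
import numpy as np

def normalized_centroid_distance(ref_emb: np.ndarray, eval_emb: np.ndarray) -> float:
    """Euclidean distance between the L2-normalized centroids of two embedding sets."""
    ref_c = ref_emb.mean(axis=0)
    eval_c = eval_emb.mean(axis=0)
    ref_c = ref_c / np.linalg.norm(ref_c)
    eval_c = eval_c / np.linalg.norm(eval_c)
    return float(np.linalg.norm(ref_c - eval_c))

# The test raises a warning when this distance exceeds the configured threshold.
```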
Character Distribution¶
This test measures the character distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is determined by comparing the computed drift statistic to the configured severity thresholds.
Why it matters: The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the character distribution between these sets, it can lead to subpar real-world model performance.
Configuration: To pass a given test case, the divergence metric must be below the configured threshold.
Example: Suppose that the change in the character distribution in the reference set and evaluation set yielded a JS Divergence of 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Unigrams Distribution¶
This test measures the unigram distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is determined by comparing the computed drift statistic to the configured severity thresholds.
Why it matters: The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the unigram distribution between these sets, it can lead to subpar real-world model performance.
Configuration: To pass a given test case, the divergence metric must be below the configured threshold.
Example: Suppose that the change in the unigram distribution in the reference set and evaluation set yielded a JS Divergence of 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Bigrams Distribution¶
This test measures the bigram distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is determined by comparing the computed drift statistic to the configured severity thresholds.
Why it matters: The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the bigram distribution between these sets, it can lead to subpar real-world model performance.
Configuration: To pass a given test case, the divergence metric must be below the configured threshold.
Example: Suppose that the change in the bigram distribution in the reference set and evaluation set yielded a JS Divergence of 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Entity Type Distribution¶
This test measures the label entity type distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.
Why it matters: The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the label entity type distribution between these sets, it can lead to subpar real-world model performance.
Configuration: To pass a given test case, the divergence metric must be below the configured threshold.
Example: Suppose that the change in the label entity type distribution in the reference set and evaluation set yielded a JS Divergence of 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Predicted Entity Type Distribution¶
This test measures the predicted entity type distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.
Why it matters: The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the predicted entity type distribution between these sets, it can lead to subpar real-world model performance.
Configuration: To pass a given test case, the divergence metric must be below the configured threshold.
Example: Suppose that the change in the predicted entity type distribution in the reference set and evaluation set yielded a JS Divergence of 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Entity Lengths Distribution¶
This test measures the entity length distribution drift between the reference and evaluation sets. By default, it measures drift by using the Population Stability Index of the two distributions. The severity is a function of the magnitude of data drift, and the impact of that drift on model performance. Performance change is attributed using the performance on subsets (quantiles or categories) of a given feature and the change in subset prevalence across datasets.
Why it matters: The reference set that you use to train your model may not be representative of the evaluation set you encounter in production. If there are statistically significant differences in the entity length distribution between these sets, it can lead to subpar real-world model performance.
Configuration: To pass a given test case, the divergence metric must be below the configured threshold.
Example: Suppose that the change in the entity length distribution in the reference set and evaluation set yielded a JS Divergence of 0.2. If the distance threshold is set to 0.1, this test would raise a warning.
Subset Performance¶
Label Entity Type Subsets¶
This test measures whether the model performs equally well across subsets of the data when grouped by label entity type. These subsets are defined by grouping input sequences into approximately equal-width bins of the aforementioned metric. The test then measures whether model performance, as defined by the recall, for any given subset is significantly worse than the average performance across all subsets of the data.
Why it matters: Having similar performance across various subsets of the data is an important measure of performance bias.
Configuration: By default, this test measures whether the recall of each subgroup is within 0.05 of the overall performance.
Example: Suppose that the recall on one subset of sentences was 0.1 while the overall recall was 0.2. Since the subset's recall is more than 0.05 below the overall performance, this test would raise a warning.
Predicted Entity Type Subsets¶
This test measures whether the model performs equally well across subsets of the data when grouped by predicted entity type. These subsets are defined by grouping input sequences into approximately equal-width bins of the aforementioned metric. The test then measures whether model performance, as defined by the precision, for any given subset is significantly worse than the average performance across all subsets of the data.
Why it matters: Having similar performance across various subsets of the data is an important measure of performance bias.
Configuration: By default, this test measures whether the precision of each subgroup is within 0.05 of the overall performance.
Example: Suppose that the precision on one subset of sentences was 0.1 while the overall precision was 0.2. Since the subset's precision is more than 0.05 below the overall performance, this test would raise a warning.
Subset F1¶
Subset Precision¶
This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire population.
Why it matters: Having different Precision between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.
Configuration: By default, Precision is computed over all predictions/labels.
Example: Suppose the ground truth in our subset is: [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today. Suppose the actual extraction is: [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]. This has 1 true positive ([Microsoft Corp.]), 2 false negatives ([Steve Ballmer], [Windows 7]), and 3 false positives ([Steve], [CEO], [today]), which yields a Precision of 0.25 on this subset of data. We then compare that to the overall Precision on the full dataset.
Subset Recall¶
This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.
Why it matters: Having different Recall between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.
Configuration: By default, Recall is computed over all predictions/labels.
Example: Suppose the ground truth in our subset is: [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today. Suppose the actual extraction is: [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]. This has 1 true positive ([Microsoft Corp.]), 2 false negatives ([Steve Ballmer], [Windows 7]), and 3 false positives ([Steve], [CEO], [today]), which yields a Recall of 0.33 on this subset of data. We then compare that to the overall Recall on the full dataset.
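The precision/recall arithmetic in these examples can be reproduced with a simple exact-span match; real scoring also compares character offsets and entity types, so treat this as a simplified, hypothetical sketch.

```python
def span_precision_recall(true_spans: set, pred_spans: set):
    """Precision and recall over exact-match entity spans."""
    tp = len(true_spans & pred_spans)
    fp = len(pred_spans - true_spans)
    fn = len(true_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

truth = {"Microsoft Corp.", "Steve Ballmer", "Windows 7"}
preds = {"Microsoft Corp.", "CEO", "Steve", "today"}
print(span_precision_recall(truth, preds))   # (0.25, 0.333...) as in the examples above
```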
Subset Average Number of Predicted Entities¶
Data Cleanliness¶
Label Imbalance¶
This test checks that no labels have exceedingly high frequency.
Why it matters: Label imbalance in the training data can introduce bias into the model and possibly result in poor predictive performance on examples from the minority classes.
Configuration: This test runs only on classification tasks.
Example: Suppose we had a binary classification task. We can configure this test to check that neither label 0 nor 1 has frequency above a certain threshold.
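A minimal sketch of such a check; the 0.8 cutoff is an illustrative value, not the platform default.

```python
import pandas as pd

def label_imbalance_fails(labels: pd.Series, max_frequency: float = 0.8) -> bool:
    """True if any class accounts for more than max_frequency of the rows."""
    return bool((labels.value_counts(normalize=True) > max_frequency).any())

print(label_imbalance_fails(pd.Series([0] * 95 + [1] * 5)))  # True: class 0 is 95% of rows
```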
Transformations¶
Lower-Case Entity¶
This test measures the robustness of your model to Lower-Case Entity transformations. It does this by taking a sample input, lower-casing all entities, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "The boy saw Paris Hilton in Paris", this test measures the performance of the model when given the transformed input of "The boy saw paris hilton in paris".
Upper-Case Entity¶
This test measures the robustness of your model to Upper-Case Entity transformations. It does this by taking a sample input, upper-casing all entities, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "The boy saw Paris Hilton in Paris", this test measures the performance of the model when given the transformed input of "The boy saw PARIS HILTON in PARIS".
Ampersand¶
This test measures the robustness of your model to Ampersand transformations. It does this by taking a sample input, changing & to and, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "Peanut Butter & Jelly", this test measures the performance of the model when given the transformed input of "Peanut Butter and Jelly".
Abbreviation Expander¶
This test measures the robustness of your model to Abbreviation Expander transformations. It does this by taking a sample input, expanding abbreviations in entities, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "Monsters Inc.", this test measures the performance of the model when given the transformed input of "Monsters Incorporated".
Whitespace Around Special Character¶
This test measures the robustness of your model to Whitespace Around Special Character transformations. It does this by taking a sample input, adding whitespace around special characters, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "Hi customer. That'll be $50.", this test measures the performance of the model when given the transformed input of "Hi customer . That ' ll be $ 50 .".
Unicode to ASCII¶
This test measures the robustness of your model to Unicode to ASCII transformations. It does this by taking a sample input, converting all characters in the input string to their nearest ASCII representation, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "René François Lacôte did not like that movie", this test measures the performance of the model when given the transformed input of "Rene Francois Lacote did not like that movie".
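One way to reproduce this transformation locally is with the third-party unidecode package; this is only an illustration and is not necessarily how the test itself implements the conversion.

```python
# pip install Unidecode
from unidecode import unidecode

original = "René François Lacôte did not like that movie"
print(unidecode(original))   # "Rene Francois Lacote did not like that movie"
```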
Remove Special Characters¶
This test measures the robustness of your model to Remove Special Characters transformations. It does this by taking a sample input, removing all periods and apostrophes from the input string, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "The quick brown fox jumped over the lazy dog...", this test measures the performance of the model when given the transformed input of "The quick brown fox jumped over the lazy dog".
Swap Seen Entities¶
This test measures the robustness of your model to Swap Seen Entities transformations. It does this by taking a sample input, swapping all the entities in a text with random entities of the same type seen in the rest of the data, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "Gabriela Szabo ( Romania ) 15 minutes 04.95 seconds", this test measures the performance of the model when given the transformed input of "Asif Mujtaba ( Luxembourg ) 15 minutes 04.95 seconds".
Swap Unseen Entities¶
This test measures the robustness of your model to Swap Unseen Entities transformations. It does this by taking a sample input, swapping all the entities in a text with random entities of the same category, unseen in the data, and measuring the behavior of the model on the transformed input. This test supports swapping entities from commonly-appearing categories in NER tasks: Person, Geopolitical Entity, Location, Nationality, Product, Corporation, and Organization.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 30% of the words in each input.
Example: Given an input sequence "DNIB also set a 110 million guilder step-up bond.", this test measures the performance of the model when given the transformed input of "New Oromio Insurance LLC also set a 110 million guilder step-up bond.".
Model Performance¶
Average Confidence¶
This test checks the average confidence of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. The "confidence" of a prediction for classification tasks is defined as the distance between the probability of the predicted class (defined as the argmax over the prediction vector) and 1. We average this metric across all predictions.
Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.
Configuration: By default, this test runs if predictions are specified (no labels required).
Example: Assume that the model's average confidence on the reference set is 0.85, but on the evaluation set (where labels are unavailable) the average confidence is 0.5. Then this test raises a warning.
Average Thresholded Confidence¶
This test checks the average thresholded confidence (ATC) of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. ATC is a method for estimating accuracy of unlabeled examples taken from this paper. The threshold is first computed on the reference set: we pick a confidence threshold such that the percentage of datapoints whose max predicted probability is less than the threshold is around equal to the error rate of the model (here, it is 1-accuracy) on the reference set. Then, we apply this threshold in the evaluation set: the predicted accuracy is then equal to the percentage of datapoints with max predicted probability greater than this threshold.
Why it matters: During production, factors like distribution shift may cause model performance to decrease significantly. Since oftentimes labels are not available in a production setting, this metric can serve as a useful proxy for model performance.
Configuration: By default, this test runs if predictions/labels are specified in the reference set and predictions are specified in the eval set (no labels required).
Example: Assume that on the reference set the model obtained 0.85 accuracy but on the evaluation set, we find that only 55 percent of datapoints have max predicted probability greater than our threshold. Then our predicted accuracy is 0.55 and this test raises a warning.
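A sketch of the ATC procedure described above, assuming the model's confidence is its maximum predicted probability; the function and argument names are hypothetical.

```python
import numpy as np

def atc_predicted_accuracy(ref_max_probs, ref_correct, eval_max_probs) -> float:
    """Pick a confidence threshold on the reference set so that the fraction of points
    below it matches the reference error rate, then report the fraction of evaluation
    points whose confidence exceeds that threshold."""
    ref_max_probs = np.asarray(ref_max_probs, dtype=float)
    error_rate = 1.0 - float(np.mean(ref_correct))
    threshold = np.quantile(ref_max_probs, error_rate)
    return float(np.mean(np.asarray(eval_max_probs, dtype=float) > threshold))

# e.g. reference accuracy 0.85 -> threshold near the 15th percentile of reference
# confidences; if only 55% of evaluation confidences exceed it, predicted accuracy is 0.55.
```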
Calibration Comparison¶
This test checks that the reference and evaluation sets have sufficiently similar calibration curves as measured by the Mean Squared Error (MSE) between the two curves. The calibration curve is a line plot where the x-axis represents the average predicted probability and the y-axis is the observed proportion of positive labels. The curve of an ideally calibrated model is thus the straight line from (0, 0) to (1, 1).
Why it matters: Knowing how well-calibrated your model is can help you better interpret and act upon model outputs, and can even be an indicator of generalization. A greater difference between reference and evaluation curves could indicate a lack of generalizability. In addition, a change in calibration could indicate that decision-making or thresholding conducted upstream needs to change as it is behaving differently on held-out data.
Configuration: By default, this test runs over the predictions and labels.
Example: Suppose the model’s task is binary classification and predicts whether or not a data point is fraudulent. If we have a reference set in which 1% of the data points are fraudulent, but an evaluation set where 50% are fraudulent, then our model may not be well calibrated, and the MSE difference in the curves will be large, resulting in a failing test.
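A minimal sketch of comparing two binned calibration curves by MSE; the fixed-width binning and bin count are assumptions, not the platform's exact procedure.

```python
import numpy as np

def binned_calibration(y_true, y_prob, n_bins: int = 10) -> np.ndarray:
    """Observed fraction of positives in each fixed-width predicted-probability bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    curve = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            curve[b] = y_true[mask].mean()
    return curve

def calibration_mse(curve_ref: np.ndarray, curve_eval: np.ndarray) -> float:
    """MSE between two calibration curves, skipping bins that are empty in either set."""
    valid = ~np.isnan(curve_ref) & ~np.isnan(curve_eval)
    return float(np.mean((curve_ref[valid] - curve_eval[valid]) ** 2))
```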
F1¶
This test checks the F1 metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of F1 has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.
Configuration: By default, this test runs over the F1 metric with the below thresholds set for the absolute and degradation tests.
Example: Assume that on the reference set the model obtained an F1 of 0.85 but on the evaluation set it obtained an F1 of 0.5. Then this test raises a warning.
Precision¶
This test checks the Precision metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Precision has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Precision metric with the below thresholds set for the absolute and degradation tests.
Example: Assume that on the reference set the model obtained a Precision of 0.85 but on the evaluation set it obtained a Precision of 0.5. Then this test raises a warning.
Recall¶
This test checks the Recall metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Recall has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Recall metric with the below thresholds set for the absolute and degradation tests.
Example: Assume that on the reference set the model obtained a Recall of 0.85 but on the evaluation set it obtained a Recall of 0.5. Then this test raises a warning.
Average Number of Predicted Entities¶
This test checks the Average Number of Predicted Entities metric to see both if its performance on the evaluation set alone is satisfactory, as well as if performance in terms of Average Number of Predicted Entities has degraded from the reference to evaluation set. The key detail displays whether the given performance metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Average Number of Predicted Entities metric with the below thresholds set for the absolute and degradation tests.
Example: Assume that on the reference set the Average Number of Predicted Entities was 0.85 but on the evaluation set it was 0.5. Then this test raises a warning.
Subset Performance Degradation¶
Subset Drift F1¶
Subset Drift Precision¶
This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire population.
Why it matters: Having different Precision between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.
Configuration: By default, Precision is computed over all predictions/labels.
Example: Suppose the ground truth in our subset is: [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today. Suppose the actual extraction is: [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]. This has 1 true positive ([Microsoft Corp.]), 2 false negatives ([Steve Ballmer], [Windows 7]), and 3 false positives ([Steve], [CEO], [today]), which yields a Precision of 0.25 on this subset of data. We then compare that to the overall Precision on the full dataset.
Subset Drift Recall¶
This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.
Why it matters: Having different Recall between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.
Configuration: By default, Recall is computed over all predictions/labels.
Example: Suppose the ground truth in our subset is: [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today. Suppose the actual extraction is: [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]. This has 1 true positive ([Microsoft Corp.]), 2 false negatives ([Steve Ballmer], [Windows 7]), and 3 false positives ([Steve], [CEO], [today]), which yields a Recall of 0.33 on this subset of data. We then compare that to the overall Recall on the full dataset.
Subset Drift Average Number of Predicted Entities¶