AI Stress Tests
Be sure to complete the initial setup described in RIME Data and Model Setup before proceeding.
Every test exposes a run_notebook function, which returns its outputs in a notebook-friendly manner.
The return type is a dictionary with a few standard keys. The common ones are:
status: One of PASS, FAIL, or SKIP, denoting the status of the test.
severity: One of High, Medium, Low, or None, denoting the severity of the test failure (None if the test did not fail).
params: A dictionary of all the parameters of the test.
columns: A list of the column names that the test was run over.
Each test may also have its own specific outputs, but the keys above are present in every result.
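Because every result shares these keys, any test's output can be handled with the same code. A minimal sketch of consuming the standard keys, using the duplicate rows test (shown later in this section) since it needs no arguments:
from rime.tabular.tests import DuplicateRowsTest

# Run a test and capture the notebook-friendly result dictionary.
result = DuplicateRowsTest().run_notebook(container)

# Standard keys present on every test result.
print(result["status"])    # "PASS", "FAIL", or "SKIP"
print(result["severity"])  # "High", "Medium", "Low", or "None"
print(result["params"])    # dictionary of the test's parameters
print(result["columns"])   # columns the test was run over

# Branch on the outcome, e.g. to flag failures during prototyping.
if result["status"] == "FAIL":
    print(f"Test failed with severity {result['severity']} on columns {result['columns']}")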
Running the unseen categorical test:
from rime.tabular.tests import UnseenCategoricalTest
test = UnseenCategoricalTest(col_name="Device_operating_system")
test.run_notebook(container)
Output:
This test raised a warning (with severity level Low) because the model impact is 0/10 and we found 9 failing rows.
{'status': 'FAIL',
'severity': 'Low',
'params': {'model_impact_config': ModelImpactConfig(num_perturbation_samples=100, perturbed_factor=2.0, prediction_factor=0.005, metric_factor=0.002, failing_rows_threshold=10),
'col_name': 'Device_operating_system'},
'columns': ['Device_operating_system'],
'unseen_value_counts': Mac OS X 10_11_4 2
Mac OS X 10.9 2
Mac OS X 10_12_2 1
Mac OS X 10_12_1 1
Mac OS X 10.6 1
Windows 1
Mac OS X 10.10 1
Name: Device_operating_system, dtype: int64,
'model_impacts': {'Model Impact': ['Observed', 'Adversarial', 'Overall'],
'Value': ['N/A', 0.0, '0/10'],
'Description': ['Too few failing rows to estimate model impact.',
'Difference between original model performance over sampled rows (Accuracy=0.970) and performance over adversarial rows with unseen categoricals (Accuracy=0.970).',
'Combination of the above']},
'failing_rows': [158, 1330, 1807, 2429, 2831, 4380, 4727, 7494, 9317],
'num_failing_rows': 9}
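Because run_notebook returns a plain dictionary, the same test can be applied to several categorical columns in a loop and the failures collected. A minimal sketch; the column names reuse features that appear elsewhere in this walkthrough, so substitute your own:
from rime.tabular.tests import UnseenCategoricalTest

# Categorical columns to check (taken from this walkthrough; substitute your own).
categorical_columns = ["Device_operating_system", "DeviceType", "DeviceInfo"]

failures = {}
for col in categorical_columns:
    result = UnseenCategoricalTest(col_name=col).run_notebook(container)
    if result["status"] == "FAIL":
        # num_failing_rows is part of this test's specific output.
        failures[col] = result["num_failing_rows"]

print(failures)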
Running the duplicate rows test:
from rime.tabular.tests import DuplicateRowsTest
test = DuplicateRowsTest()
test.run_notebook(container)
Output:
This test passed because there are 0 duplicate row(s) in the evaluation data.
{'status': 'PASS',
'severity': 'None',
'Failing Rows': '0 (0.00%)',
'params': {'col_names': None, 'severity_thresholds': (0.01, 0.05)},
'columns': []}
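The params in the output above show a col_names parameter (None here, meaning the duplicate check considers all columns) and a pair of severity thresholds on the fraction of duplicated rows. A hedged sketch, assuming col_names can be passed to the constructor to restrict the check to a subset of columns:
from rime.tabular.tests import DuplicateRowsTest

# Assumption: col_names restricts the duplicate check to the listed columns;
# the default of None (seen in the params above) checks all columns.
test = DuplicateRowsTest(col_names=["TransactionAmt", "DeviceType"])
result = test.run_notebook(container)

# 'Failing Rows' reports the duplicate count and its fraction of the data,
# which the (0.01, 0.05) severity thresholds apply to.
print(result["status"], result["Failing Rows"])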
Running the outliers test on a numeric feature column:
from rime.tabular.tests import NonParametricOutliersTest
test = NonParametricOutliersTest("TransactionAmt")
test.run_notebook(container)
Output:
This test raised a warning (with severity level Low) because the model impact is 0/10 and we found 2 failing rows.
{'status': 'FAIL',
'severity': 'Low',
'params': {'model_impact_config': ModelImpactConfig(num_perturbation_samples=100, perturbed_factor=2.0, prediction_factor=0.005, metric_factor=0.002, failing_rows_threshold=10),
'col_name': 'TransactionAmt',
'min_normal_prop': 0.99,
'baseline_quantile': 0.1},
'columns': ['TransactionAmt'],
'lower_threshold': -30.1300916166291,
'upper_threshold': 4396.228995809948,
'model_impacts': {'Model Impact': ['Observed', 'Adversarial', 'Overall'],
'Value': ['N/A', 0.0, '0/10'],
'Description': ['Too few failing rows to estimate model impact.',
'Difference between original model performance over sampled rows (Accuracy=0.960) and performance over adversarial rows with numeric outliers (Accuracy=0.960).',
'Combination of the above']},
'failing_rows': [3302, 8373],
'num_failing_rows': 2}
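The returned lower_threshold and upper_threshold define the range the test treats as normal for the column, so they can be reused to pull the offending rows out of the evaluation data directly. A minimal sketch, assuming eval_df is a hypothetical pandas DataFrame holding the evaluation data from the setup step:
from rime.tabular.tests import NonParametricOutliersTest

result = NonParametricOutliersTest("TransactionAmt").run_notebook(container)
lower, upper = result["lower_threshold"], result["upper_threshold"]

# Assumption: eval_df is the evaluation DataFrame loaded during setup.
outlier_mask = (eval_df["TransactionAmt"] < lower) | (eval_df["TransactionAmt"] > upper)
print(eval_df.loc[outlier_mask, "TransactionAmt"])

# The same row indices are reported directly in the test output.
print(result["failing_rows"])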
Running the single-feature change test:
from rime.tabular.tests import VulnerabilityTest
test = VulnerabilityTest("DeviceInfo")
test.run_notebook(container)
Output:
This test passed because the average change in prediction caused by an unbounded manipulation of the feature DeviceInfo over a sample of 10 rows was 0.00555, which is below the warning threshold of 0.01.
{'status': 'PASS',
'severity': 'None',
'Average Prediction Change': 0.0055514594454474705,
'params': {'severity_level_thresholds': (0.01, 0.05, 0.1),
'col_names': ['DeviceInfo'],
'l0_constraint': 1,
'linf_constraint': None,
'sample_size': 10,
'search_count': 10,
'use_tqdm': False,
'label_range': (0.0, 1.0),
'scaled_min_impact_threshold': 0.01},
'columns': ['DeviceInfo'],
'sample_inds': [3344, 1712, 4970, 4480, 1498, 1581, 3531, 473, 9554, 2929],
'avg_score_change': 0.0055514594454474705,
'normalized_avg_score_change': 0.0055514594454474705}
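Because the test reports the average prediction change per feature, a quick way to compare sensitivity across features is to run it over several columns and rank the results. A minimal sketch; the feature list reuses column names from this walkthrough:
from rime.tabular.tests import VulnerabilityTest

# Features to compare (taken from this walkthrough; substitute your own).
features = ["DeviceInfo", "DeviceType", "Device_operating_system"]

changes = {}
for feature in features:
    result = VulnerabilityTest(feature).run_notebook(container)
    changes[feature] = result["normalized_avg_score_change"]

# Rank features by how much an unbounded single-feature change moves predictions.
for feature, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {change:.5f}")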
Running the feature subset test:
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import MetricName
test = FeatureSubsetTest("DeviceType", MetricName.ACCURACY)
test.run_notebook(container)
Output:
This test raised a warning (with severity level Medium) because 2 subset(s) performed significantly worse than the overall population. When evaluating subsets of the feature DeviceType we find the Accuracy of some subset(s) to be below the overall score of 0.97 by more than the threshold of 0.001.
{'status': 'FAIL',
'severity': 'Medium',
'params': {'metric': <MetricName.ACCURACY: 'accuracy'>,
'severity_level_thresholds': (0.001, 0.02, 0.1),
'perf_change_threshold': 0.001,
'col_name': 'DeviceType'},
'columns': ['DeviceType'],
'num_failing': 2,
'overall_perf': 0.9692,
'sample_size': 10000,
'subsets_metric_dict': {'overall_perf': 0.9692,
'subsets_info': {'desktop': {'name': 'desktop',
'len_df': 1504,
'criterion': 'desktop',
'perf': 0.9461436170212766,
'margin_error': None,
'diff': 0.023056382978723367,
'pos_rate': 0.06648936170212766,
'size': 1504},
'mobile': {'name': 'mobile',
'len_df': 947,
'criterion': 'mobile',
'perf': 0.9292502639915523,
'margin_error': None,
'diff': 0.039949736008447645,
'pos_rate': 0.11298838437170011,
'size': 947},
'None': {'name': 'None',
'len_df': 7549,
'criterion': 'None',
'perf': 0.9788051397536097,
'margin_error': None,
'diff': -0.00960513975360977,
'pos_rate': 0.021592263876010067,
'size': 7549}}},
'worst_subset': {'name': 'mobile',
'len_df': 947,
'criterion': 'mobile',
'perf': 0.9292502639915523,
'margin_error': None,
'diff': 0.039949736008447645,
'pos_rate': 0.11298838437170011,
'size': 947}}
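The subsets_metric_dict and worst_subset entries make it straightforward to see exactly where the model underperforms. A minimal sketch that walks the result shown above:
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import MetricName

result = FeatureSubsetTest("DeviceType", MetricName.ACCURACY).run_notebook(container)

# Compare each subset's accuracy against the overall performance.
overall = result["overall_perf"]
for name, info in result["subsets_metric_dict"]["subsets_info"].items():
    print(f"{name}: accuracy={info['perf']:.4f}, diff from overall {overall:.4f}: {info['diff']:+.4f}")

# The single worst-performing subset is also surfaced directly.
worst = result["worst_subset"]
print(f"Worst subset: {worst['name']} with accuracy {worst['perf']:.4f} over {worst['size']} rows")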
That’s it!
NOTE: While we loaded a pretrained model here for convenience, the RIME Library can be used at any point during the prototyping workflow, whether during initial data exploration or during model training and iteration.