# AI Stress Tests

{{ rime_library_setup_note }}

All tests expose a `run_notebook` function, which is designed to return outputs in a notebook-friendly manner. The return type is a dictionary with a few standard keys. The common ones are:

- `status`: One of `PASS`, `FAIL`, or `SKIP`; denotes the status of the test.
- `severity`: One of `High`, `Medium`, `Low`, or `None`; denotes the severity of the test failure (`None` if the test did not fail).
- `params`: A dictionary of all the parameters of the test.
- `columns`: A list of column names that this test was run over.

Each test may also have its own specific outputs, but the keys above are present in every result.

Running the unseen categorical test:

```python
from rime.tabular.tests import UnseenCategoricalTest

test = UnseenCategoricalTest(col_name="Device_operating_system")
test.run_notebook(container)
```

Output:

```
This test raised a warning (with severity level Low) because the model impact is 0/10 and we found 9 failing rows.

{'status': 'FAIL',
 'severity': 'Low',
 'params': {'model_impact_config': ModelImpactConfig(num_perturbation_samples=100, perturbed_factor=2.0, prediction_factor=0.005, metric_factor=0.002, failing_rows_threshold=10),
            'col_name': 'Device_operating_system'},
 'columns': ['Device_operating_system'],
 'unseen_value_counts': Mac OS X 10_11_4    2
Mac OS X 10.9       2
Mac OS X 10_12_2    1
Mac OS X 10_12_1    1
Mac OS X 10.6       1
Windows             1
Mac OS X 10.10      1
Name: Device_operating_system, dtype: int64,
 'model_impacts': {'Model Impact': ['Observed', 'Adversarial', 'Overall'],
                   'Value': ['N/A', 0.0, '0/10'],
                   'Description': ['Too few failing rows to estimate model impact.',
                                   'Difference between original model performance over sampled rows (Accuracy=0.970) and performance over adversarial rows with unseen categoricals (Accuracy=0.970).',
                                   'Combination of the above']},
 'failing_rows': [158, 1330, 1807, 2429, 2831, 4380, 4727, 7494, 9317],
 'num_failing_rows': 9}
```

Running the duplicate rows test:

```python
from rime.tabular.tests import DuplicateRowsTest

test = DuplicateRowsTest()
test.run_notebook(container)
```

Output:

```
This test passed because there are 0 duplicate row(s) in the evaluation data.

{'status': 'PASS',
 'severity': 'None',
 'Failing Rows': '0 (0.00%)',
 'params': {'col_names': None, 'severity_thresholds': (0.01, 0.05)},
 'columns': []}
```

Running the outliers test on a numeric feature column:

```python
from rime.tabular.tests import NonParametricOutliersTest

test = NonParametricOutliersTest("TransactionAmt")
test.run_notebook(container)
```

Output:

```
This test raised a warning (with severity level Low) because the model impact is 0/10 and we found 2 failing rows.

{'status': 'FAIL',
 'severity': 'Low',
 'params': {'model_impact_config': ModelImpactConfig(num_perturbation_samples=100, perturbed_factor=2.0, prediction_factor=0.005, metric_factor=0.002, failing_rows_threshold=10),
            'col_name': 'TransactionAmt',
            'min_normal_prop': 0.99,
            'baseline_quantile': 0.1},
 'columns': ['TransactionAmt'],
 'lower_threshold': -30.1300916166291,
 'upper_threshold': 4396.228995809948,
 'model_impacts': {'Model Impact': ['Observed', 'Adversarial', 'Overall'],
                   'Value': ['N/A', 0.0, '0/10'],
                   'Description': ['Too few failing rows to estimate model impact.',
                                   'Difference between original model performance over sampled rows (Accuracy=0.960) and performance over adversarial rows with numeric outliers (Accuracy=0.960).',
                                   'Combination of the above']},
 'failing_rows': [3302, 8373],
 'num_failing_rows': 2}
```
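Since every test returns a plain dictionary, the common keys described at the top of this page can also be inspected programmatically rather than just printed. Below is a minimal sketch reusing the outliers test; the `summarize_result` helper is our own convenience function, not part of the RIME Library, and it assumes the same `container` object as the examples above.

```python
from rime.tabular.tests import NonParametricOutliersTest

# Hypothetical helper (ours, not a RIME Library API): pull out the
# common keys that every run_notebook result dictionary contains.
def summarize_result(result):
    return {
        "status": result["status"],      # 'PASS', 'FAIL', or 'SKIP'
        "severity": result["severity"],  # 'High', 'Medium', 'Low', or 'None'
        "columns": result["columns"],    # columns the test was run over
    }

result = NonParametricOutliersTest("TransactionAmt").run_notebook(container)
print(summarize_result(result))
```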
Running the single-feature change test:

```python
from rime.tabular.tests import VulnerabilityTest

test = VulnerabilityTest("DeviceInfo")
test.run_notebook(container)
```

Output:

```
This test passed because the average change in prediction caused by an unbounded manipulation of the feature DeviceInfo over a sample of 10 rows was 0.00555, which is below the warning threshold of 0.01.

{'status': 'PASS',
 'severity': 'None',
 'Average Prediction Change': 0.0055514594454474705,
 'params': {'severity_level_thresholds': (0.01, 0.05, 0.1),
            'col_names': ['DeviceInfo'],
            'l0_constraint': 1,
            'linf_constraint': None,
            'sample_size': 10,
            'search_count': 10,
            'use_tqdm': False,
            'label_range': (0.0, 1.0),
            'scaled_min_impact_threshold': 0.01},
 'columns': ['DeviceInfo'],
 'sample_inds': [3344, 1712, 4970, 4480, 1498, 1581, 3531, 473, 9554, 2929],
 'avg_score_change': 0.0055514594454474705,
 'normalized_avg_score_change': 0.0055514594454474705}
```

Running the feature subset test:

```python
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import MetricName

test = FeatureSubsetTest("DeviceType", MetricName.ACCURACY)
test.run_notebook(container)
```

Output:

```
This test raised a warning (with severity level Medium) because 2 subset(s) performed significantly worse than the overall population. When evaluating subsets of the feature DeviceType we find the Accuracy of some subset(s) to be below the overall score of 0.97 by more than the threshold of 0.001.

{'status': 'FAIL',
 'severity': 'Medium',
 'params': {'metric': <MetricName.ACCURACY>,
            'severity_level_thresholds': (0.001, 0.02, 0.1),
            'perf_change_threshold': 0.001,
            'col_name': 'DeviceType'},
 'columns': ['DeviceType'],
 'num_failing': 2,
 'overall_perf': 0.9692,
 'sample_size': 10000,
 'subsets_metric_dict': {'overall_perf': 0.9692,
                         'subsets_info': {'desktop': {'name': 'desktop', 'len_df': 1504, 'criterion': 'desktop', 'perf': 0.9461436170212766, 'margin_error': None, 'diff': 0.023056382978723367, 'pos_rate': 0.06648936170212766, 'size': 1504},
                                          'mobile': {'name': 'mobile', 'len_df': 947, 'criterion': 'mobile', 'perf': 0.9292502639915523, 'margin_error': None, 'diff': 0.039949736008447645, 'pos_rate': 0.11298838437170011, 'size': 947},
                                          'None': {'name': 'None', 'len_df': 7549, 'criterion': 'None', 'perf': 0.9788051397536097, 'margin_error': None, 'diff': -0.00960513975360977, 'pos_rate': 0.021592263876010067, 'size': 7549}}},
 'worst_subset': {'name': 'mobile', 'len_df': 947, 'criterion': 'mobile', 'perf': 0.9292502639915523, 'margin_error': None, 'diff': 0.039949736008447645, 'pos_rate': 0.11298838437170011, 'size': 947}}
```
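Because every test shares the same `run_notebook` interface and return shape, you can also run several of the tests above in one sweep and collect the results. Below is a minimal sketch assuming the same `container` object; the loop is our own convenience code, not a RIME Library API.

```python
from rime.tabular.tests import (
    DuplicateRowsTest,
    NonParametricOutliersTest,
    UnseenCategoricalTest,
    VulnerabilityTest,
)

tests = [
    UnseenCategoricalTest(col_name="Device_operating_system"),
    DuplicateRowsTest(),
    NonParametricOutliersTest("TransactionAmt"),
    VulnerabilityTest("DeviceInfo"),
]

# Record the common output keys for every test that failed.
failures = []
for test in tests:
    result = test.run_notebook(container)
    if result["status"] == "FAIL":
        failures.append((result["columns"], result["severity"]))

print(f"{len(failures)} of {len(tests)} tests failed: {failures}")
```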
That's it!

**NOTE**: While we loaded a pretrained model for convenience, the RIME Library can be used at any point in the prototyping workflow, whether during initial data exploration or model training and iteration.