# AI Stress Tests {{ rime_library_setup_note }} ## Overview {{ stress_test_bio }} All tests expose a `run_notebook` function, which returns outputs in a notebook-friendly format. The return type is a dictionary with a few standard keys. The fundamental ones are: - `status`: Will be one of `PASS`, `FAIL`, `WARNING`, or `SKIP`. Denotes the status of the test. - `severity`: Will be one of `High`, `Medium`, `Low`, or `None`. Denotes the severity of the failure of the test (will be `None` if test did not fail). - `params`: A dictionary of all the parameters of the test. - `columns`: A list of column names that this test was run over. Depending on their purpose, different tests may have additional keys for unique information. ## Unseen Categorical As an example, we can run the **Unseen Categorical** test: ```python from rime.tabular.tests import UnseenCategoricalTest test = UnseenCategoricalTest(col_name="Device_operating_system") test.run_notebook(container) ``` Output: ``` {'status': 'FAIL', 'severity': 'Low', 'params': {'_id': '4d2a94f6-d7aa-c547-b682-7e78fd71a79f', 'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10), 'col_name': 'Device_operating_system'}, 'columns': ['Device_operating_system'], 'unseen_value_counts': Mac OS X 10_11_4 2 Mac OS X 10.9 2 Mac OS X 10_12_2 1 Mac OS X 10_12_1 1 Mac OS X 10.6 1 Windows 1 Mac OS X 10.10 1 Name: Device_operating_system, dtype: int64, 'failing_rows': [158, 1330, 1807, 2429, 2831, 4380, 4727, 7494, 9317], 'num_failing_rows': 9} ``` ## Duplicate Rows Running the **Duplicate Rows** test: ```python from rime.tabular.tests import DuplicateRowsTest test = DuplicateRowsTest() test.run_notebook(container) ``` Output: ``` This test passed because there are 0 duplicate row(s) in the evaluation data. {'status': 'PASS', 'severity': 'None', 'Failing Rows': '0 (0.00%)', 'params': {'_id': 'eccd9267-a47a-185c-58e4-eb88fea02ce7', 'col_names': None, 'severity_thresholds': (0.01, 0.05)}, 'columns': []} ``` ## Non-Parametric Outliers Running the **Non-Parametric Outliers** test on a numeric feature column: ```python from rime.tabular.tests import NonParametricOutliersTest test = NonParametricOutliersTest("TransactionAmt") test.run_notebook(container) ``` Output: ``` {'status': 'FAIL', 'severity': 'Low', 'params': {'_id': 'af584cae-191e-8cfa-b9f1-50dfa0a188a3', 'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10), 'col_name': 'TransactionAmt', 'min_normal_prop': 0.99, 'baseline_quantile': 0.1, 'perturb_multiplier': 1.0}, 'columns': ['TransactionAmt'], 'lower_threshold': -30.1300916166291, 'upper_threshold': 4396.228995809948, 'failing_rows': [3302, 8373], 'num_failing_rows': 2} ``` ## Vulnerability Running the **Vulnerability** (AKA single-feature change) test: ```python from rime.tabular.tests import VulnerabilityTest test = VulnerabilityTest("DeviceInfo") test.run_notebook(container) ``` Output ``` This test passed because the average change in prediction caused by an unbounded manipulation of the feature DeviceInfo over a sample of 10 rows was 0.00555, which is below the warning threshold of 0.05. {'status': 'PASS', 'severity': 'None', 'Average Prediction Change': 0.0055514594454474705, 'params': {'_id': 'e94863f0-e938-4be9-5e9b-e64674edc3b1', 'severity_level_thresholds': (0.05, 0.15, 0.25), 'col_names': ['DeviceInfo'], 'l0_constraint': 1, 'linf_constraint': None, 'sample_size': 10, 'search_count': 10, 'use_tqdm': False, 'label_range': (0.0, 1.0), 'scaled_min_impact_threshold': 0.05}, 'columns': ['DeviceInfo'], 'sample_inds': [3344, 1712, 4970, 4480, 1498, 1581, 3531, 473, 9554, 2929], 'avg_score_change': 0.0055514594454474705, 'normalized_avg_score_change': 0.0055514594454474705} ``` ## Feature Subset Running the **Feature Subset** test: ```python from rime.tabular.tests import FeatureSubsetTest from rime.tabular.metric import AccuracyMetric test = FeatureSubsetTest("DeviceType", AccuracyMetric, (0.1, 1.0, 1.0)) test.run_notebook(container) ``` Output ``` {'status': 'PASS', 'severity': 'None', 'params': {'_id': '48457123-e119-0d15-c942-e9cb31e54840', 'metric_name': , 'metric_cls': rime.tabular.metric.shared_metrics.AccuracyMetric, 'min_sample_size': 20, 'perf_change_thresholds': (0.1, 1.0, 1.0), 'perf_change_threshold': 0.1, 'col_name': 'DeviceType'}, 'columns': ['DeviceType'], 'num_failing': 0, 'overall_perf': 0.9692, 'sample_size': 10000, 'subsets_metric_dict': {'overall_perf': 0.9692, 'subsets_info': {'desktop': {'name': 'desktop', 'size': 1504, 'criterion': 'desktop', 'perf': 0.9461436170212766, 'margin_error': 0.011408309534789187, 'diff': 0.023056382978723367, 'pos_rate': 0.06648936170212766, 'sample_size_info': {: 100, : 1404, : 33, : 1471}}, 'mobile': {'name': 'mobile', 'size': 947, 'criterion': 'mobile', 'perf': 0.9292502639915523, 'margin_error': 0.016330589417348093, 'diff': 0.039949736008447645, 'pos_rate': 0.11298838437170011, 'sample_size_info': {: 107, : 840, : 50, : 897}}, 'None': {'name': 'None', 'size': 7549, 'criterion': 'None', 'perf': 0.9788051397536097, 'margin_error': 0.003249127676628865, 'diff': -0.00960513975360977, 'pos_rate': 0.021592263876010067, 'sample_size_info': {: 163, : 7386, : 3, : 7546}}}}, 'worst_subset': {'name': 'mobile', 'size': 947, 'criterion': 'mobile', 'perf': 0.9292502639915523, 'margin_error': 0.016330589417348093, 'diff': 0.039949736008447645, 'pos_rate': 0.11298838437170011, 'sample_size_info': {: 107, : 840, : 50, : 897}}} ``` That's it! **NOTE**: It's important to point out that while we loaded a pretrained model for convenience, the RIME Python Library can be used at any point during the prototyping workflow, whether that's during initial data exploration, or model training and iteration.