# AI Stress Tests
{{ rime_library_setup_note }}
## Overview
{{ stress_test_bio }}
All tests expose a `run_notebook` method, which returns results in a notebook-friendly format.
The return value is a dictionary with a few standard keys. The fundamental ones are:
- `status`: Will be one of `PASS`, `FAIL`, `WARNING`, or `SKIP`. Denotes the status of the test.
- `severity`: Will be one of `High`, `Medium`, `Low`, or `None`. Denotes the severity of the test failure (will be `None` if the test did not fail).
- `params`: A dictionary of all the parameters of the test.
- `columns`: A list of column names that this test was run over.
Depending on its purpose, a given test may return additional keys with test-specific information.
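Because the return value is a plain dictionary, the standard keys can be consumed programmatically. Below is a minimal sketch, assuming a `test` instance (constructed as in the examples that follow) and a data `container`:
```python
# Run a test and inspect the standard keys of the returned dictionary
result = test.run_notebook(container)

print(result["status"])    # e.g. "PASS" or "FAIL"
print(result["severity"])  # e.g. "Low", or "None" if the test passed
print(result["params"])    # the parameters the test was configured with
print(result["columns"])   # the columns the test was run over
```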
## Unseen Categorical
As an example, we can run the **Unseen Categorical** test:
```python
from rime.tabular.tests import UnseenCategoricalTest

# Check a single categorical column for values not seen in the reference data
test = UnseenCategoricalTest(col_name="Device_operating_system")
test.run_notebook(container)
```
Output:
```
{'status': 'FAIL',
'severity': 'Low',
'params': {'_id': '4d2a94f6-d7aa-c547-b682-7e78fd71a79f',
'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10),
'col_name': 'Device_operating_system'},
'columns': ['Device_operating_system'],
'unseen_value_counts': Mac OS X 10_11_4 2
Mac OS X 10.9 2
Mac OS X 10_12_2 1
Mac OS X 10_12_1 1
Mac OS X 10.6 1
Windows 1
Mac OS X 10.10 1
Name: Device_operating_system, dtype: int64,
'failing_rows': [158, 1330, 1807, 2429, 2831, 4380, 4727, 7494, 9317],
'num_failing_rows': 9}
```
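The test-specific keys can be used to drill into a failure. For example, a minimal sketch based on the result shown above:
```python
result = test.run_notebook(container)
if result["status"] == "FAIL":
    # Values flagged as unseen, with their counts in the evaluation data
    print(result["unseen_value_counts"])
    # Indices of the evaluation rows containing those values
    print(result["failing_rows"])
```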
## Duplicate Rows
Running the **Duplicate Rows** test:
```python
from rime.tabular.tests import DuplicateRowsTest

# Check the evaluation data for duplicate rows
test = DuplicateRowsTest()
test.run_notebook(container)
```
Output:
```
This test passed because there are 0 duplicate row(s) in the evaluation data.
{'status': 'PASS',
'severity': 'None',
'Failing Rows': '0 (0.00%)',
'params': {'_id': 'eccd9267-a47a-185c-58e4-eb88fea02ce7',
'col_names': None,
'severity_thresholds': (0.01, 0.05)},
'columns': []}
```
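The `Failing Rows` entry reports the number and percentage of duplicate rows. The `params` above also list `col_names` and `severity_thresholds`; the sketch below assumes (an inference from the output, not a confirmed signature) that `col_names` can be passed to the constructor to restrict the check to specific columns:
```python
# Assumed configuration, inferred from the params in the output above:
# only consider these two columns when looking for duplicates.
test = DuplicateRowsTest(col_names=["DeviceType", "DeviceInfo"])
result = test.run_notebook(container)
print(result["status"], result["Failing Rows"])  # e.g. "PASS 0 (0.00%)"
```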
## Non-Parametric Outliers
Running the **Non-Parametric Outliers** test on a numeric feature column:
```python
from rime.tabular.tests import NonParametricOutliersTest

# Flag outlier values in the numeric TransactionAmt column
test = NonParametricOutliersTest("TransactionAmt")
test.run_notebook(container)
```
Output:
```
{'status': 'FAIL',
'severity': 'Low',
'params': {'_id': 'af584cae-191e-8cfa-b9f1-50dfa0a188a3',
'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10),
'col_name': 'TransactionAmt',
'min_normal_prop': 0.99,
'baseline_quantile': 0.1,
'perturb_multiplier': 1.0},
'columns': ['TransactionAmt'],
'lower_threshold': -30.1300916166291,
'upper_threshold': 4396.228995809948,
'failing_rows': [3302, 8373],
'num_failing_rows': 2}
```
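The reported thresholds bound the range of values the test treats as normal; rows outside that range are flagged. A minimal sketch for consuming the result shown above:
```python
result = test.run_notebook(container)
# Values below lower_threshold or above upper_threshold are considered outliers
lower, upper = result["lower_threshold"], result["upper_threshold"]
print(f"Normal range for TransactionAmt: [{lower:.2f}, {upper:.2f}]")
print("Outlier rows:", result["failing_rows"])
```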
## Vulnerability
Running the **Vulnerability** (also known as single-feature change) test:
```python
from rime.tabular.tests import VulnerabilityTest

# Measure how much predictions change when the DeviceInfo feature is manipulated
test = VulnerabilityTest("DeviceInfo")
test.run_notebook(container)
```
Output:
```
This test passed because the average change in prediction caused by an unbounded manipulation of the feature DeviceInfo over a sample of 10 rows was 0.00555, which is below the warning threshold of 0.05.
{'status': 'PASS',
'severity': 'None',
'Average Prediction Change': 0.0055514594454474705,
'params': {'_id': 'e94863f0-e938-4be9-5e9b-e64674edc3b1',
'severity_level_thresholds': (0.05, 0.15, 0.25),
'col_names': ['DeviceInfo'],
'l0_constraint': 1,
'linf_constraint': None,
'sample_size': 10,
'search_count': 10,
'use_tqdm': False,
'label_range': (0.0, 1.0),
'scaled_min_impact_threshold': 0.05},
'columns': ['DeviceInfo'],
'sample_inds': [3344, 1712, 4970, 4480, 1498, 1581, 3531, 473, 9554, 2929],
'avg_score_change': 0.0055514594454474705,
'normalized_avg_score_change': 0.0055514594454474705}
```
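Here the average prediction change (0.00555) is compared against the first entry of `severity_level_thresholds` (0.05), which the message above calls the warning threshold. A sketch for reading those values out of the result:
```python
result = test.run_notebook(container)
# Average change in the model prediction over the sampled rows
print(result["avg_score_change"])
# Indices of the rows sampled for the single-feature manipulation
print(result["sample_inds"])
# Severity thresholds the change is compared against (0.05 is the warning threshold)
print(result["params"]["severity_level_thresholds"])
```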
## Feature Subset
Running the **Feature Subset** test:
```python
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import AccuracyMetric

# Compare accuracy across subsets of the data defined by the values of DeviceType,
# using (0.1, 1.0, 1.0) as the performance-change severity thresholds
test = FeatureSubsetTest("DeviceType", AccuracyMetric, (0.1, 1.0, 1.0))
test.run_notebook(container)
```
Output:
```
{'status': 'PASS',
'severity': 'None',
'params': {'_id': '48457123-e119-0d15-c942-e9cb31e54840',
'metric_name': ,
'metric_cls': rime.tabular.metric.shared_metrics.AccuracyMetric,
'min_sample_size': 20,
'perf_change_thresholds': (0.1, 1.0, 1.0),
'perf_change_threshold': 0.1,
'col_name': 'DeviceType'},
'columns': ['DeviceType'],
'num_failing': 0,
'overall_perf': 0.9692,
'sample_size': 10000,
'subsets_metric_dict': {'overall_perf': 0.9692,
'subsets_info': {'desktop': {'name': 'desktop',
'size': 1504,
'criterion': 'desktop',
'perf': 0.9461436170212766,
'margin_error': 0.011408309534789187,
'diff': 0.023056382978723367,
'pos_rate': 0.06648936170212766,
'sample_size_info': {: 100,
: 1404,
: 33,
: 1471}},
'mobile': {'name': 'mobile',
'size': 947,
'criterion': 'mobile',
'perf': 0.9292502639915523,
'margin_error': 0.016330589417348093,
'diff': 0.039949736008447645,
'pos_rate': 0.11298838437170011,
'sample_size_info': {: 107,
: 840,
: 50,
: 897}},
'None': {'name': 'None',
'size': 7549,
'criterion': 'None',
'perf': 0.9788051397536097,
'margin_error': 0.003249127676628865,
'diff': -0.00960513975360977,
'pos_rate': 0.021592263876010067,
'sample_size_info': {: 163,
: 7386,
: 3,
: 7546}}}},
'worst_subset': {'name': 'mobile',
'size': 947,
'criterion': 'mobile',
'perf': 0.9292502639915523,
'margin_error': 0.016330589417348093,
'diff': 0.039949736008447645,
'pos_rate': 0.11298838437170011,
'sample_size_info': {: 107,
: 840,
: 50,
: 897}}}
```
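The result breaks overall performance down by subsets of the feature's values and identifies the worst-performing subset. A sketch for pulling those pieces out of the result above:
```python
result = test.run_notebook(container)
print("Overall accuracy:", result["overall_perf"])

# The worst-performing subset of DeviceType values
worst = result["worst_subset"]
print(worst["name"], worst["perf"], worst["size"])

# Per-subset breakdown: performance and difference from the overall metric
for name, info in result["subsets_metric_dict"]["subsets_info"].items():
    print(name, info["perf"], info["diff"])
```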
That's it!
**NOTE**: While we loaded a pretrained model for convenience, the RIME Python Library can be used at any point in the prototyping workflow, whether during initial data exploration or during model training and iteration.