# RIME Subset Performance

## Overview

This tutorial will guide you through getting started with RIME Library's Subset Performance tests in your Jupyter notebooks. For more detail, see
the RIME Subset Performance Jupyter notebook included in the trial bundle.

{{ rime_library_setup_note }}

## Using RIME Library to Analyze Model Performance

### Running Feature Subset Tests
These tests allow you to observe differences in model performance across various subsets of your features, which makes them excellent for detecting potential bias.

In the example below, we illustrate how model accuracy varies across different subsets of the `DeviceType` categorical feature.

```python
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import MetricName

test = FeatureSubsetTest("DeviceType", MetricName.ACCURACY)
test.run_notebook(container)
```

In the output's `subsets_info` dictionary, each key is a subset of the tested feature: `desktop`, `mobile`, and `None`.

Each entry records the subset's metric performance (`perf`), confidence interval (`margin_error`),
performance difference from the overall population (`diff`), positivity rate (`pos_rate`),
and other information about the name and size of the subset.

This test reveals that the model underperforms with respect to accuracy on the `desktop` and `mobile` subsets, with `mobile` being the worst-performing one!

**Output**
```
This test raised a warning (with severity level Medium) because 2 subset(s) performed significantly worse than the overall population. When evaluating subsets of the feature DeviceType we find the Accuracy of some subset(s) to be below the overall score of 0.97 by more than the threshold of 0.001.
{'status': 'FAIL',
 'severity': 'Medium',
 'params': {'metric': <MetricName.ACCURACY: 'accuracy'>,
  'severity_level_thresholds': (0.001, 0.02, 0.1),
  'perf_change_threshold': 0.001,
  'col_name': 'DeviceType'},
 'columns': ['DeviceType'],
 'num_failing': 2,
 'overall_perf': 0.9692,
 'sample_size': 10000,
 'subsets_metric_dict': {'overall_perf': 0.9692,
  'subsets_info': {'desktop': {'name': 'desktop',
    'len_df': 1504,
    'criterion': 'desktop',
    'perf': 0.9461436170212766,
    'margin_error': None,
    'diff': 0.023056382978723367,
    'pos_rate': 0.06648936170212766,
    'size': 1504},
   'mobile': {'name': 'mobile',
    'len_df': 947,
    'criterion': 'mobile',
    'perf': 0.9292502639915523,
    'margin_error': None,
    'diff': 0.039949736008447645,
    'pos_rate': 0.11298838437170011,
    'size': 947},
   'None': {'name': 'None',
    'len_df': 7549,
    'criterion': 'None',
    'perf': 0.9788051397536097,
    'margin_error': None,
    'diff': -0.00960513975360977,
    'pos_rate': 0.021592263876010067,
    'size': 7549}}},
 'worst_subset': {'name': 'mobile',
  'len_df': 947,
  'criterion': 'mobile',
  'perf': 0.9292502639915523,
  'margin_error': None,
  'diff': 0.039949736008447645,
  'pos_rate': 0.11298838437170011,
  'size': 947}}
```
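
Because the result prints as a plain Python dictionary, it is easy to inspect programmatically. The sketch below assumes you have that dictionary bound to a variable named `result` (a hypothetical name; how you capture it depends on your notebook setup) and ranks the subsets by how far they fall below the overall accuracy:

```python
# Hypothetical variable: `result` holds the plain dict printed above.
subsets_info = result["subsets_metric_dict"]["subsets_info"]
overall_perf = result["overall_perf"]

# Rank subsets by how far they fall below the overall accuracy.
for name, info in sorted(subsets_info.items(), key=lambda kv: kv[1]["diff"], reverse=True):
    print(
        f"{name:>10}: accuracy={info['perf']:.4f} "
        f"(gap below overall {overall_perf:.4f}: {info['diff']:+.4f}, n={info['size']})"
    )
```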

### Analyzing Model Performance

#### Overall Analysis
When a RunContainer is created, RIME profiles the model's performance with respect
to its features and feature subsets. We can retrieve all of that information
through built-in functions.

To obtain the overall performance metrics for the model, we can use `get_overall_metrics`:
```python
from rime.tabular.performance.error_analysis import get_overall_metrics

get_overall_metrics(test_data_container)
```
The output of the function, shown below, summarizes the model's overall performance.

**Output**
```
{'AUC': 0.8414549128567822,
 'Accuracy': 0.9689,
 'F1': 0.3253796095444686,
 'Positive Prediction Rate': 0.0091,
 'Average Prediction Rate': 0.032736195484349216,
 'Precision': 0.8241758241758241,
 'False Positive Rate': 0.0016614745586708203,
 'False Negative Rate': 0.7972972972972973,
 'Recall': 0.20270270270270271,
 'Prediction Variance': 0.006616262494399436,
 'Prediction Variance (Negative Labels)': 0.0018064451286080943,
 'Prediction Variance (Positive Labels)': 0.08650783856199104}
```
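
The metrics above come back as a plain dictionary keyed by metric name (as the printed output suggests), so they are easy to reuse in your own checks. As one example, Recall and False Negative Rate are complements by definition, which the output reflects (0.2027 ≈ 1 - 0.7973):

```python
overall_metrics = get_overall_metrics(test_data_container)

# Recall and False Negative Rate are complements: Recall = 1 - FNR.
assert abs(overall_metrics["Recall"] + overall_metrics["False Negative Rate"] - 1) < 1e-9

# High precision but low recall explains the modest F1 despite ~97% accuracy.
print(overall_metrics["Precision"], overall_metrics["Recall"], overall_metrics["F1"])
```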

Another tool in error analysis lets us see the model's biggest misses.
Let's inspect the model's worst false positives and false negatives:
```python
from rime.tabular.performance import get_biggest_errors
fp, fn = get_biggest_errors(df, model_wrapper, labels)
```

Here is the model's worst **false positive**:
```python
worst_fp_idx = fp.idxmax()
worst_fp_example = df.iloc[worst_fp_idx, :]
worst_fp_pred = model_wrapper.predict(worst_fp_example)
worst_fp_label = labels[worst_fp_idx]

print("WORST FALSE POSITIVE:\n{}\n\nLabel: {}, Predicted Value: {}".format(worst_fp_example, worst_fp_label, worst_fp_pred))
```

**Output**
```
WORST FALSE POSITIVE:
Timestamp                    3036316.0
Product_type                         C
Card_company                      visa
Card_type                       credit
Purchaser_email_domain       gmail.com
Recipient_email_domain       gmail.com
Device_operating_system            NaN
Browser_version            chrome 63.0
Resolution                         NaN
DeviceInfo                     Windows
DeviceType                     desktop
TransactionAmt                  81.037
TransactionID                3135204.0
addr1                              NaN
addr2                              NaN
card1                           2256.0
card2                            545.0
card3                            185.0
card5                            226.0
dist1                              NaN
dist2                             17.0
Count_1                           37.0
Count_2                           47.0
Count_3                            0.0
Count_4                           13.0
Count_5                            0.0
Count_6                           13.0
Count_7                           13.0
Count_8                           28.0
Count_9                            0.0
Name: 6466, dtype: object

Label: 0, Predicted Value: 0.8809023171385614
```

Here is the model's worst **false negative**:
```python
worst_fn_idx = fn.idxmin()
worst_fn_example = df.iloc[worst_fn_idx, :]
worst_fn_pred = model_wrapper.predict(worst_fn_example)
worst_fn_label = labels[worst_fn_idx]

print("WORST FALSE NEGATIVE:\n{}\n\nLabel: {}, Predicted Value: {}".format(worst_fn_example, worst_fn_label, worst_fn_pred))
```

**Output**
```
WORST FALSE NEGATIVE:
Timestamp                     12761407.0
Product_type                           W
Card_company                        visa
Card_type                          debit
Purchaser_email_domain     anonymous.com
Recipient_email_domain               NaN
Device_operating_system              NaN
Browser_version                      NaN
Resolution                           NaN
DeviceInfo                           NaN
DeviceType                           NaN
TransactionAmt                    1795.8
TransactionID                  3476245.0
addr1                              184.0
addr2                               87.0
card1                             4436.0
card2                              174.0
card3                              150.0
card5                              226.0
dist1                                NaN
dist2                                NaN
Count_1                              1.0
Count_2                              1.0
Count_3                              0.0
Count_4                              0.0
Count_5                              1.0
Count_6                              1.0
Count_7                              0.0
Count_8                              0.0
Count_9                              1.0
Name: 9385, dtype: object

Label: 1, Predicted Value: 0.00330668052035523
```
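
The `idxmax`/`idxmin` calls above suggest that `fp` and `fn` are pandas Series of predicted probabilities indexed like `df` (an assumption worth verifying in your environment). Under that assumption, the same pattern extends to the top few errors instead of just the single worst:

```python
# Assumes fp and fn are pandas Series of predicted probabilities indexed like df.
top_fp = fp.nlargest(5)   # false positives: label 0 but high predicted probability
top_fn = fn.nsmallest(5)  # false negatives: label 1 but low predicted probability

for idx, pred in top_fp.items():
    print(f"FP row {idx}: label={labels[idx]}, predicted={pred:.4f}")
for idx, pred in top_fn.items():
    print(f"FN row {idx}: label={labels[idx]}, predicted={pred:.4f}")
```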

#### Granular Analysis
For more subset-specific analysis, we can run the `get_worst_overall_subset` function,
which returns a dictionary mapping each feature to its worst-performing subset.

```python
from rime.tabular.performance.error_analysis import get_worst_overall_subset

worst_subsets = get_worst_overall_subset(test_data_container)
worst_subsets
```

**Output**
```
{'Timestamp': '[88174.0, 2049200.4]',
 'Product_type': 'S',
 'Card_company': 'discover',
 'Card_type': 'credit',
 'Purchaser_email_domain': 'yahoo.com',
 'Recipient_email_domain': 'None',
 'Device_operating_system': 'Windows 7',
 'Browser_version': 'None',
 'Resolution': '1366x768',
 'DeviceInfo': 'Trident/7.0',
 'DeviceType': 'None',
 'TransactionAmt': '(160.702, 3967.81]',
 'TransactionID': '[2987101.0, 3088593.2]',
 'addr1': '[100.0, 204.0]',
 'addr2': '[13.0, 87.0]',
 'card1': '(15111.0, 18375.0]',
 'card2': 'None',
 'card3': '[100.0, 150.0]',
 'card5': '(226.0, 237.0]',
 'dist1': '(8.0, 23.0]',
 'dist2': '(49.0, 218.0]',
 'Count_1': '(2.0, 3.0]',
 'Count_2': '[0.0, 1.0]',
 'Count_3': '0.0',
 'Count_4': '[0.0, 1.0]',
 'Count_5': '(1.0, 295.0]',
 'Count_6': '[0.0, 1.0]',
 'Count_7': '[0.0, 2252.0]',
 'Count_8': '[0.0, 1.0]',
 'Count_9': '(1.0, 2.0]'}
```

Finally, if you need even more granular analysis, you can pass in a list of metrics to analyze
and determine each feature's worst subset for only those metrics.

```python
from rime.tabular.performance.error_analysis import get_worst_subsets_for_metrics

worst_subsets_for_metrics = get_worst_subsets_for_metrics(test_data_container, [MetricName.ACCURACY])
worst_subsets_for_metrics
```

**Output**
```
{'Timestamp': {'Accuracy': ('(4613225.0, 7431097.6]', 0.9592137592137592)},
 'Product_type': {'Accuracy': ('C', 0.9145516074450084)},
 'Card_company': {'Accuracy': ('discover', 0.9351851851851852)},
 'Card_type': {'Accuracy': ('credit', 0.935572042171027)},
 'Purchaser_email_domain': {'Accuracy': ('hotmail.com', 0.9482758620689655)},
 'Recipient_email_domain': {'Accuracy': ('hotmail.com', 0.9267782426778243)},
 'Device_operating_system': {'Accuracy': ('Windows 7', 0.9557522123893806)},
 'Browser_version': {'Accuracy': ('chrome 63.0', 0.9411764705882353)},
 'Resolution': {'Accuracy': ('1366x768', 0.9212598425196851)},
 'DeviceInfo': {'Accuracy': ('Windows', 0.9404205607476636)},
 'DeviceType': {'Accuracy': ('mobile', 0.928194297782471)},
 'TransactionAmt': {'Accuracy': ('(160.702, 3967.81]', 0.9518134715025907)},
 'TransactionID': {'Accuracy': ('(3188929.0, 3288174.2]', 0.9592137592137592)},
 'addr1': {'Accuracy': ('None', 0.9163732394366197)},
 'addr2': {'Accuracy': ('(87.0, 96.0]', 0.9090909090909091)},
 'card1': {'Accuracy': ('[1015.0, 4966.0]', 0.9608717186726102)},
 'card2': {'Accuracy': ('None', 0.9310344827586207)},
 'card3': {'Accuracy': ('(150.0, 229.0]', 0.9123152709359605)},
 'card5': {'Accuracy': ('[100.0, 166.0]', 0.9624597783339293)},
 'dist1': {'Accuracy': ('None', 0.9625273063350698)},
 'dist2': {'Accuracy': ('(49.0, 218.0]', 0.903448275862069)},
 'Count_1': {'Accuracy': ('(3.0, 4682.0]', 0.955108359133127)},
 'Count_2': {'Accuracy': ('(4.0, 5690.0]', 0.9553398058252427)},
 'Count_3': {'Accuracy': ('0.0', 0.9688095476882961)},
 'Count_4': {'Accuracy': ('(1.0, 2250.0]', 0.8878718535469108)},
 'Count_5': {'Accuracy': ('[0.0, 1.0]', 0.9642601004064069)},
 'Count_6': {'Accuracy': ('(1.0, 2.0]', 0.9646324549237171)},
 'Count_7': {'Accuracy': ('[0.0, 2252.0]', 0.9688968896889689)},
 'Count_8': {'Accuracy': ('(1.0, 3328.0]', 0.8951747088186356)},
 'Count_9': {'Accuracy': ('[0.0, 1.0]', 0.9666098323387325)}}
```
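
Because each feature maps to a `{metric: (subset, score)}` entry, you can sort this dictionary to decide which features deserve attention first. A small sketch using the `worst_subsets_for_metrics` result above:

```python
# Rank features by the accuracy of their worst subset, lowest first.
ranked = sorted(worst_subsets_for_metrics.items(), key=lambda kv: kv[1]["Accuracy"][1])

for feature, metrics in ranked[:5]:
    subset, acc = metrics["Accuracy"]
    print(f"{feature}: worst subset {subset!r} has accuracy {acc:.4f}")
```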


## Improving Model Performance Results: Overweighting

After using RIME to identify your model's weaknesses, it's time to improve its
performance. One method is to increase the training weights of underperforming
subsets. Let's try to improve performance on subset `C` of the feature
`Product_type`, which has an accuracy of only ~91%.

```python
worst_subsets_for_metrics["Product_type"]
```

**Output**
```
{'Accuracy': ('C', 0.9145516074450084)}
```

### Training the Initial Model

We can proceed to train the model in the usual way.

First, we preprocess our training data and record the original model's predictions:
```python
# Preprocess the training data and capture the original model's predictions.
train_pre = preprocess_df(train_df)
train_preds = model.predict_proba(train_pre)[:, 1]

# The underperforming subset identified above: Product_type == 'C'.
COL = 'Product_type'
VAL = 'C'
train_df_full = train_df.copy()
train_df_full['label'] = train_labels
train_df_full['preds'] = train_preds
```

Then we double the sample weights for the underperforming subset and retrain:
```python
import catboost as catb  # assumed alias for the CatBoost package
import numpy as np

# Give rows in the underperforming subset twice the weight of the rest (2 vs. 1).
sample_weights = (train_pre[COL] == VAL) + 1

# Treat every non-float column as categorical for CatBoost.
categorical_features_indices = np.where(train_pre.dtypes != np.float64)[0]
new_model = catb.CatBoostClassifier(random_state=0, verbose=0)
new_model.fit(train_pre, train_labels, sample_weight=sample_weights, cat_features=categorical_features_indices)
```
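
As a quick sanity check (assuming `train_pre[COL]` is still a pandas Series of raw category values after preprocessing, as the comparison above implies), you can confirm that rows in the `C` subset get weight 2 and all other rows get weight 1:

```python
# Rows where Product_type == 'C' get weight 2; every other row gets weight 1.
print(sample_weights.value_counts())
```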

### Comparing Improvements
We can define a new `predict_dict` function and create a new container to calculate the updated metrics.

```python
def predict_dict_new_model(x: dict):
    """Return the new model's positive-class probability for a single input row."""
    new_x = preprocess(x)
    new_x = pd.DataFrame(new_x, index=[0])
    return new_model.predict_proba(new_x)[0][1]

new_data_container = DataContainer.from_df(
    train_df, model_task=ModelTask.BINARY_CLASSIFICATION, labels=train_labels
)
test_data_container = DataContainer.from_df(
    test_df, labels=test_labels, model_task=ModelTask.BINARY_CLASSIFICATION, ref_data_container=data_container
)
new_container = TabularRunContainer.from_predict_dict_function(
    new_data_container, test_data_container, predict_dict_new_model, ModelTask.BINARY_CLASSIFICATION
)
```

Recomputing the worst subsets for accuracy, we find that despite this rather simple adjustment, the accuracy on subset `C` increases to ~96%:

```python
new_worst_subsets_for_metrics = get_worst_subsets_for_metrics(new_data_container, [MetricName.ACCURACY])
new_worst_subsets_for_metrics["Product_type"]
```

**Output**
```
{'Accuracy': ('C', 0.9650974025974026)}
```
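
To make the before/after comparison explicit, you can print the two accuracies side by side using the dictionaries computed above:

```python
old_acc = worst_subsets_for_metrics["Product_type"]["Accuracy"][1]
new_acc = new_worst_subsets_for_metrics["Product_type"]["Accuracy"][1]

# Accuracy on the Product_type == 'C' subset before and after overweighting.
print(f"Product_type 'C' accuracy: {old_acc:.4f} -> {new_acc:.4f}")
```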