RIME Subset Performance
Overview
This tutorial guides you through getting started with the RIME Library’s Subset Performance tests in your Jupyter notebooks. For more granular information, see the RIME Subset Performance Jupyter notebook included in the trial bundle.
Be sure to complete the initial setup described in RIME Data and Model Setup before proceeding.
Using RIME Library to Analyze Model Performance
Running Feature Subset Tests
These tests allow you to observe differences in model performance across various subsets of your features, which is excellent for detecting potential bias.
In the example below, we illustrate how model accuracy varies across different subsets of the DeviceType categorical feature.
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import MetricName

# Compare accuracy across the subsets of the DeviceType feature.
test = FeatureSubsetTest("DeviceType", MetricName.ACCURACY)
test.run_notebook(container)
In the subsets_info dictionary, each key is a subset of the feature tested on: desktop, mobile, and None. Each entry contains the metric performance (perf), the confidence interval (margin_error), the performance difference from the overall score (diff), the positivity rate (pos_rate), and other information about the indices and size of the subset within the feature. This test reveals that the model underperforms with respect to accuracy for inputs in the mobile category!
Output
This test raised a warning (with severity level Medium) because 2 subset(s) performed significantly worse than the overall population. When evaluating subsets of the feature DeviceType we find the Accuracy of some subset(s) to be below the overall score of 0.97 by more than the threshold of 0.001.
{'status': 'FAIL',
'severity': 'Medium',
'params': {'metric': <MetricName.ACCURACY: 'accuracy'>,
'severity_level_thresholds': (0.001, 0.02, 0.1),
'perf_change_threshold': 0.001,
'col_name': 'DeviceType'},
'columns': ['DeviceType'],
'num_failing': 2,
'overall_perf': 0.9692,
'sample_size': 10000,
'subsets_metric_dict': {'overall_perf': 0.9692,
'subsets_info': {'desktop': {'name': 'desktop',
'len_df': 1504,
'criterion': 'desktop',
'perf': 0.9461436170212766,
'margin_error': None,
'diff': 0.023056382978723367,
'pos_rate': 0.06648936170212766,
'size': 1504},
'mobile': {'name': 'mobile',
'len_df': 947,
'criterion': 'mobile',
'perf': 0.9292502639915523,
'margin_error': None,
'diff': 0.039949736008447645,
'pos_rate': 0.11298838437170011,
'size': 947},
'None': {'name': 'None',
'len_df': 7549,
'criterion': 'None',
'perf': 0.9788051397536097,
'margin_error': None,
'diff': -0.00960513975360977,
'pos_rate': 0.021592263876010067,
'size': 7549}}},
'worst_subset': {'name': 'mobile',
'len_df': 947,
'criterion': 'mobile',
'perf': 0.9292502639915523,
'margin_error': None,
'diff': 0.039949736008447645,
'pos_rate': 0.11298838437170011,
'size': 947}}
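If you capture this result as a plain Python dictionary (for example, in a hypothetical variable named result holding the output printed above), you can rank the subsets by their performance gap:
# Sort subsets by how far they fall below the overall score (largest gap first).
subsets = result["subsets_metric_dict"]["subsets_info"]
for info in sorted(subsets.values(), key=lambda s: s["diff"], reverse=True):
    print("{}: perf={:.4f}, diff={:+.4f}, size={}".format(info["name"], info["perf"], info["diff"], info["size"]))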
Analyzing Model Performance
Overall Analysis
When RunContainers are created, RIME profiles the model’s performance with respect to its features and feature subsets. We can retrieve all of that information through built-in functions.
To obtain the overall performance metrics for the model, we can use get_overall_metrics:
from rime.tabular.performance.error_analysis import get_overall_metrics
get_overall_metrics(test_data_container)
The output of the function, below, summarizes the performance of the model.
Output
{'AUC': 0.8414549128567822,
'Accuracy': 0.9689,
'F1': 0.3253796095444686,
'Positive Prediction Rate': 0.0091,
'Average Prediction Rate': 0.032736195484349216,
'Precision': 0.8241758241758241,
'False Positive Rate': 0.0016614745586708203,
'False Negative Rate': 0.7972972972972973,
'Recall': 0.20270270270270271,
'Prediction Variance': 0.006616262494399436,
'Prediction Variance (Negative Labels)': 0.0018064451286080943,
'Prediction Variance (Positive Labels)': 0.08650783856199104}
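The high Accuracy alongside the low Recall (~0.20) likely reflects the class imbalance in this dataset. As a quick sanity check, you can compare the base positive rate of the labels with the Positive Prediction Rate above; a minimal sketch, assuming labels is the label array used with get_biggest_errors below:
import numpy as np

# Fraction of positive labels in the evaluation data.
print("Base positive rate:", np.mean(labels))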
Another tool in error analysis lets us see the model’s biggest misses. Let’s inspect the model’s worst false positives and false negatives:
from rime.tabular.performance import get_biggest_errors
fp, fn = get_biggest_errors(df, model_wrapper, labels)
Here is the model’s worst false positive:
# Row with the highest predicted score among the false positives.
worst_fp_idx = fp.idxmax()
worst_fp_example = df.iloc[worst_fp_idx, :]
worst_fp_pred = model_wrapper.predict(worst_fp_example)
worst_fp_label = labels[worst_fp_idx]
print("WORST FALSE POSITIVE:\n{}\n\nLabel: {}, Predicted Value: {}".format(worst_fp_example, worst_fp_label, worst_fp_pred))
Output
WORST FALSE POSITIVE:
Timestamp 3036316.0
Product_type C
Card_company visa
Card_type credit
Purchaser_email_domain gmail.com
Recipient_email_domain gmail.com
Device_operating_system NaN
Browser_version chrome 63.0
Resolution NaN
DeviceInfo Windows
DeviceType desktop
TransactionAmt 81.037
TransactionID 3135204.0
addr1 NaN
addr2 NaN
card1 2256.0
card2 545.0
card3 185.0
card5 226.0
dist1 NaN
dist2 17.0
Count_1 37.0
Count_2 47.0
Count_3 0.0
Count_4 13.0
Count_5 0.0
Count_6 13.0
Count_7 13.0
Count_8 28.0
Count_9 0.0
Name: 6466, dtype: object
Label: 0, Predicted Value: 0.8809023171385614
Here is the model’s worst false negative:
# Row with the lowest predicted score among the false negatives.
worst_fn_idx = fn.idxmin()
worst_fn_example = df.iloc[worst_fn_idx, :]
worst_fn_pred = model_wrapper.predict(worst_fn_example)
worst_fn_label = labels[worst_fn_idx]
print("WORST FALSE NEGATIVE:\n{}\n\nLabel: {}, Predicted Value: {}".format(worst_fn_example, worst_fn_label, worst_fn_pred))
Output
WORST FALSE NEGATIVE:
Timestamp 12761407.0
Product_type W
Card_company visa
Card_type debit
Purchaser_email_domain anonymous.com
Recipient_email_domain NaN
Device_operating_system NaN
Browser_version NaN
Resolution NaN
DeviceInfo NaN
DeviceType NaN
TransactionAmt 1795.8
TransactionID 3476245.0
addr1 184.0
addr2 87.0
card1 4436.0
card2 174.0
card3 150.0
card5 226.0
dist1 NaN
dist2 NaN
Count_1 1.0
Count_2 1.0
Count_3 0.0
Count_4 0.0
Count_5 1.0
Count_6 1.0
Count_7 0.0
Count_8 0.0
Count_9 1.0
Name: 9385, dtype: object
Label: 1, Predicted Value: 0.00330668052035523
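To look beyond the single worst example in each direction, you can inspect the top few errors of each kind. A minimal sketch, assuming fp and fn are pandas Series of predicted probabilities positioned like df (consistent with the idxmax/idxmin usage above):
# Positions of the five highest-scoring false positives and the
# five lowest-scoring false negatives.
top_fp_idx = fp.nlargest(5).index
top_fn_idx = fn.nsmallest(5).index

print("Top false positives:")
print(df.iloc[top_fp_idx][["Product_type", "DeviceType", "TransactionAmt"]])
print("Top false negatives:")
print(df.iloc[top_fn_idx][["Product_type", "DeviceType", "TransactionAmt"]])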
Granular Analysis
For more subset specific analysis, we can run the get_worst_overall_subset
function
which returns a dictionary of the worst performing subsets for each feature.
from rime.tabular.performance.error_analysis import get_worst_overall_subset
worst_subsets = get_worst_overall_subset(test_data_container)
worst_subsets
Output
{'Timestamp': '[88174.0, 2049200.4]',
'Product_type': 'S',
'Card_company': 'discover',
'Card_type': 'credit',
'Purchaser_email_domain': 'yahoo.com',
'Recipient_email_domain': 'None',
'Device_operating_system': 'Windows 7',
'Browser_version': 'None',
'Resolution': '1366x768',
'DeviceInfo': 'Trident/7.0',
'DeviceType': 'None',
'TransactionAmt': '(160.702, 3967.81]',
'TransactionID': '[2987101.0, 3088593.2]',
'addr1': '[100.0, 204.0]',
'addr2': '[13.0, 87.0]',
'card1': '(15111.0, 18375.0]',
'card2': 'None',
'card3': '[100.0, 150.0]',
'card5': '(226.0, 237.0]',
'dist1': '(8.0, 23.0]',
'dist2': '(49.0, 218.0]',
'Count_1': '(2.0, 3.0]',
'Count_2': '[0.0, 1.0]',
'Count_3': '0.0',
'Count_4': '[0.0, 1.0]',
'Count_5': '(1.0, 295.0]',
'Count_6': '[0.0, 1.0]',
'Count_7': '[0.0, 2252.0]',
'Count_8': '[0.0, 1.0]',
'Count_9': '(1.0, 2.0]'}
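These flagged subsets can be cross-checked against the evaluation data directly, for example to see how many rows fall into one of them. A minimal sketch, assuming test_df is the test DataFrame used later in this tutorial:
# Share of test rows that fall into the flagged Product_type subset.
flagged_subset = worst_subsets["Product_type"]
print((test_df["Product_type"] == flagged_subset).mean())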
Finally, for even more granular analysis, you can pass in specific metrics and determine the worst subsets for those metrics only.
from rime.tabular.performance.error_analysis import get_worst_subsets_for_metrics
worst_subsets_for_metrics = get_worst_subsets_for_metrics(test_data_container, [MetricName.ACCURACY])
worst_subsets_for_metrics
Output
{'Timestamp': {'Accuracy': ('(4613225.0, 7431097.6]', 0.9592137592137592)},
'Product_type': {'Accuracy': ('C', 0.9145516074450084)},
'Card_company': {'Accuracy': ('discover', 0.9351851851851852)},
'Card_type': {'Accuracy': ('credit', 0.935572042171027)},
'Purchaser_email_domain': {'Accuracy': ('hotmail.com', 0.9482758620689655)},
'Recipient_email_domain': {'Accuracy': ('hotmail.com', 0.9267782426778243)},
'Device_operating_system': {'Accuracy': ('Windows 7', 0.9557522123893806)},
'Browser_version': {'Accuracy': ('chrome 63.0', 0.9411764705882353)},
'Resolution': {'Accuracy': ('1366x768', 0.9212598425196851)},
'DeviceInfo': {'Accuracy': ('Windows', 0.9404205607476636)},
'DeviceType': {'Accuracy': ('mobile', 0.928194297782471)},
'TransactionAmt': {'Accuracy': ('(160.702, 3967.81]', 0.9518134715025907)},
'TransactionID': {'Accuracy': ('(3188929.0, 3288174.2]', 0.9592137592137592)},
'addr1': {'Accuracy': ('None', 0.9163732394366197)},
'addr2': {'Accuracy': ('(87.0, 96.0]', 0.9090909090909091)},
'card1': {'Accuracy': ('[1015.0, 4966.0]', 0.9608717186726102)},
'card2': {'Accuracy': ('None', 0.9310344827586207)},
'card3': {'Accuracy': ('(150.0, 229.0]', 0.9123152709359605)},
'card5': {'Accuracy': ('[100.0, 166.0]', 0.9624597783339293)},
'dist1': {'Accuracy': ('None', 0.9625273063350698)},
'dist2': {'Accuracy': ('(49.0, 218.0]', 0.903448275862069)},
'Count_1': {'Accuracy': ('(3.0, 4682.0]', 0.955108359133127)},
'Count_2': {'Accuracy': ('(4.0, 5690.0]', 0.9553398058252427)},
'Count_3': {'Accuracy': ('0.0', 0.9688095476882961)},
'Count_4': {'Accuracy': ('(1.0, 2250.0]', 0.8878718535469108)},
'Count_5': {'Accuracy': ('[0.0, 1.0]', 0.9642601004064069)},
'Count_6': {'Accuracy': ('(1.0, 2.0]', 0.9646324549237171)},
'Count_7': {'Accuracy': ('[0.0, 2252.0]', 0.9688968896889689)},
'Count_8': {'Accuracy': ('(1.0, 3328.0]', 0.8951747088186356)},
'Count_9': {'Accuracy': ('[0.0, 1.0]', 0.9666098323387325)}}
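Because this dictionary covers every feature, you can rank features by how poorly their worst subset performs and prioritize the largest gaps; a short sketch over the dictionary printed above:
# Features whose worst subset has the lowest accuracy, most problematic first.
ranked = sorted(worst_subsets_for_metrics.items(), key=lambda kv: kv[1]["Accuracy"][1])
for feature, info in ranked[:5]:
    subset, acc = info["Accuracy"]
    print("{}: subset {} has accuracy {:.4f}".format(feature, subset, acc))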
Improving Model Performance Results: Overweighting
After using RIME to identify weaknesses in your model, it’s time to improve its performance. One way to do this is to increase the training weights of underperforming subsets. Let’s try to improve the performance of the C subset of the Product_type feature, which only has an accuracy of ~91%.
worst_subsets_for_metrics["Product_type"]
Output
{'Accuracy': ('C', 0.9145516074450084)}
Training the Initial Model
We proceed in the usual way to train the model. First, we preprocess the training data and record the original model’s predictions on it:
train_pre = preprocess_df(train_df)
train_preds = model.predict_proba(train_pre)[:, 1]
COL = 'Product_type'
VAL = 'C'
train_df_full = train_df.copy()
train_df_full['label'] = train_labels
train_df_full['preds'] = train_preds
Then we adjust the sample weights for the underperforming subset and retrain:
import numpy as np
import catboost as catb

# Give rows in the underperforming subset (Product_type == 'C') twice the weight.
sample_weights = (train_pre[COL] == VAL) + 1
# Indices of non-float columns, passed to CatBoost as categorical features.
categorical_features_indices = np.where(train_pre.dtypes != float)[0]
new_model = catb.CatBoostClassifier(random_state=0, verbose=0)
new_model.fit(train_pre, train_labels, sample_weight=sample_weights, cat_features=categorical_features_indices)
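You can confirm how many rows the overweighting actually affects with a quick check on the sample_weights Series computed above:
# Rows in the 'C' subset receive weight 2; all other rows keep weight 1.
print(sample_weights.value_counts())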
Comparing Improvements
We can define a new predict_dict function and create a new container to calculate updated metrics.
def predict_dict_new_model(x: dict):
    """Return the retrained model's positive-class probability for a single input row."""
    new_x = preprocess(x)
    new_x = pd.DataFrame(new_x, index=[0])
    return new_model.predict_proba(new_x)[0][1]
new_data_container = DataContainer.from_df(train_df, model_task=ModelTask.BINARY_CLASSIFICATION, labels=train_labels)
test_data_container = DataContainer.from_df(test_df, labels=test_labels, model_task=ModelTask.BINARY_CLASSIFICATION, ref_data_container=data_container)
new_container = TabularRunContainer.from_predict_dict_function(new_data_container, test_data_container, predict_dict_new_model, ModelTask.BINARY_CLASSIFICATION)
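To verify the improvement with the same test from the beginning of this tutorial, you can re-run FeatureSubsetTest against the retrained model’s run container; a minimal sketch reusing the API shown above:
# Re-run the subset performance test for Product_type on the new model.
test = FeatureSubsetTest("Product_type", MetricName.ACCURACY)
test.run_notebook(new_container)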
Recomputing the worst subsets per metric, we see that despite this rather simple adjustment, the accuracy on the C subset increases to ~96%:
new_worst_subsets_for_metrics = get_worst_subsets_for_metrics(new_data_container, [MetricName.ACCURACY])
new_worst_subsets_for_metrics["Product_type"]
Output
{'Accuracy': ('C', 0.9650974025974026)}