# Analyzing Subset Performance

{{ rime_library_setup_note }}

## Overview

The RIME Python Library offers detailed insight into how your model performs on different feature subsets of your data, which makes it an excellent tool for detecting potential bias. In this walkthrough, we use AI Stress Tests to discover performance variation across feature subsets and then refine the model accordingly. For more information, see the Subset Performance Jupyter notebook included in the trial bundle.

## Using RIME Library to Analyze Model Performance

### Running Feature Subset AI Stress Tests

In the example below, we illustrate how model accuracy varies across the subsets of the `DeviceType` categorical feature.

```python
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import AccuracyMetric

test = FeatureSubsetTest("DeviceType", AccuracyMetric, (0.001, 0.02, 0.1))
test.run_notebook(container)
```

In the `subsets_info` dictionary, each key is a subset of the tested feature: `desktop`, `mobile`, and `None`. Each entry records the metric value for that subset (`perf`), its confidence interval (`margin_error`), its difference from the performance over the entire feature (`diff`), the positivity rate (`pos_rate`), and additional information about the indices and size of the subset.

**By inspecting the `worst_subset` key, we can see that the model underperforms with respect to accuracy for inputs in the `mobile` category!**

**Output**

```
{'status': 'FAIL',
 'severity': 'Medium',
 'params': {'_id': '4cb6fd45-83c0-fdd3-b393-974ef3736ead',
            'metric_name': ,
            'metric_cls': rime.tabular.metric.shared_metrics.AccuracyMetric,
            'min_sample_size': 20,
            'perf_change_thresholds': (0.001, 0.02, 0.1),
            'perf_change_threshold': 0.001,
            'col_name': 'DeviceType'},
 'columns': ['DeviceType'],
 'num_failing': 2,
 'overall_perf': 0.9692,
 'sample_size': 10000,
 'subsets_metric_dict': {'overall_perf': 0.9692,
                         'subsets_info': {'desktop': {'name': 'desktop', 'size': 1504, 'criterion': 'desktop', 'perf': 0.9461436170212766, 'margin_error': 0.011408309534789187, 'diff': 0.023056382978723367, 'pos_rate': 0.06648936170212766, 'sample_size_info': {: 100, : 1404, : 33, : 1471}},
                                          'mobile': {'name': 'mobile', 'size': 947, 'criterion': 'mobile', 'perf': 0.9292502639915523, 'margin_error': 0.016330589417348093, 'diff': 0.039949736008447645, 'pos_rate': 0.11298838437170011, 'sample_size_info': {: 107, : 840, : 50, : 897}},
                                          'None': {'name': 'None', 'size': 7549, 'criterion': 'None', 'perf': 0.9788051397536097, 'margin_error': 0.003249127676628865, 'diff': -0.00960513975360977, 'pos_rate': 0.021592263876010067, 'sample_size_info': {: 163, : 7386, : 3, : 7546}}}},
 'worst_subset': {'name': 'mobile', 'size': 947, 'criterion': 'mobile', 'perf': 0.9292502639915523, 'margin_error': 0.016330589417348093, 'diff': 0.039949736008447645, 'pos_rate': 0.11298838437170011, 'sample_size_info': {: 107, : 840, : 50, : 897}}}
```

### Analyzing Model Performance

#### Overall Analysis

When RunContainers are created, RIME profiles the model's performance with respect to its feature subsets. We can retrieve all of that information through built-in functions.

To obtain the overall performance metrics for the model, we can use `get_overall_metrics`:

```python
from rime.tabular.performance.error_analysis import get_overall_metrics

get_overall_metrics(test_data_container)
```

The output of the function, below, summarizes the performance of the model.
**Output**

```
{'AUC': 0.8373003844966462,
 'Accuracy': 0.9693,
 'F1': 0.33693304535637153,
 'Positive Prediction Rate': 0.0093,
 'Average Prediction': 0.03285634791790353,
 'Precision': 0.8387096774193549,
 'False Positive Rate': 0.001557632398753894,
 'False Negative Rate': 0.7891891891891891,
 'Recall': 0.21081081081081082,
 'Prediction Variance': 0.0066419034307548305,
 'Prediction Variance (Negative Labels)': 0.0018745259970643715,
 'Prediction Variance (Positive Labels)': 0.08604940885317178}
```

Another error-analysis tool lets us see the model's biggest misses. Let's inspect the model's worst false positives and false negatives:

```python
from rime.tabular.performance import get_biggest_errors

fp, fn = get_biggest_errors(df, model_wrapper, labels)
```

Here is the model's worst **false positive**:

```python
worst_fp_idx = fp.idxmax()
worst_fp_example = df.iloc[worst_fp_idx, :]
worst_fp_pred = model_wrapper.predict(worst_fp_example)
worst_fp_label = labels[worst_fp_idx]
print("WORST FALSE POSITIVE:\n{}\n\nLabel: {}, Predicted Value: {}".format(
    worst_fp_example, worst_fp_label, worst_fp_pred))
```

**Output**

```
WORST FALSE POSITIVE:
Timestamp                    3036316.0
Product_type                         C
Card_company                      visa
Card_type                       credit
Purchaser_email_domain       gmail.com
Recipient_email_domain       gmail.com
Device_operating_system           NaN
Browser_version            chrome 63.0
Resolution                         NaN
DeviceInfo                     Windows
DeviceType                     desktop
TransactionAmt                  81.037
TransactionID                3135204.0
addr1                              NaN
addr2                              NaN
card1                           2256.0
card2                            545.0
card3                            185.0
card5                            226.0
dist1                              NaN
dist2                             17.0
Count_1                           37.0
Count_2                           47.0
Count_3                            0.0
Count_4                           13.0
Count_5                            0.0
Count_6                           13.0
Count_7                           13.0
Count_8                           28.0
Count_9                            0.0
Name: 6466, dtype: object

Label: 0, Predicted Value: 0.8809023171385614
```

Here is the model's worst **false negative**:

```python
worst_fn_idx = fn.idxmin()
worst_fn_example = df.iloc[worst_fn_idx, :]
worst_fn_pred = model_wrapper.predict(worst_fn_example)
worst_fn_label = labels[worst_fn_idx]
print("WORST FALSE NEGATIVE:\n{}\n\nLabel: {}, Predicted Value: {}".format(
    worst_fn_example, worst_fn_label, worst_fn_pred))
```

**Output**

```
WORST FALSE NEGATIVE:
Timestamp                     12761407.0
Product_type                           W
Card_company                        visa
Card_type                          debit
Purchaser_email_domain     anonymous.com
Recipient_email_domain               NaN
Device_operating_system             NaN
Browser_version                      NaN
Resolution                           NaN
DeviceInfo                           NaN
DeviceType                           NaN
TransactionAmt                    1795.8
TransactionID                  3476245.0
addr1                              184.0
addr2                               87.0
card1                             4436.0
card2                              174.0
card3                              150.0
card5                              226.0
dist1                                NaN
dist2                                NaN
Count_1                              1.0
Count_2                              1.0
Count_3                              0.0
Count_4                              0.0
Count_5                              1.0
Count_6                              1.0
Count_7                              0.0
Count_8                              0.0
Count_9                              1.0
Name: 9385, dtype: object

Label: 1, Predicted Value: 0.00330668052035523
```
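If one example in each direction is not enough, the same results can be used to surface a short list of the biggest misses. Below is a minimal sketch, assuming `fp` and `fn` are pandas Series of predicted scores indexed by row position (as the `idxmax()`/`idxmin()` calls above suggest):

```python
# Sketch: list the five largest misses in each direction instead of only the
# single worst example. Assumes fp/fn are pandas Series of predicted scores.
top_fp = fp.nlargest(5)   # label 0, but scored closest to 1
top_fn = fn.nsmallest(5)  # label 1, but scored closest to 0

for idx, pred in top_fp.items():
    print(f"False positive at row {idx}: label={labels[idx]}, predicted={pred:.4f}")
for idx, pred in top_fn.items():
    print(f"False negative at row {idx}: label={labels[idx]}, predicted={pred:.4f}")
```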
#### Granular Analysis

For more subset-specific analysis, we can run the `get_worst_overall_subset` function, which returns a dictionary of the worst-performing subset for each feature.

```python
from rime.tabular.performance.error_analysis import get_worst_overall_subset

worst_subsets = get_worst_overall_subset(test_data_container)
worst_subsets
```

**Output**

```
{'Timestamp': '[88174.0, 1208944.3]',
 'Product_type': 'S',
 'Card_company': 'discover',
 'Card_type': 'debit',
 'Purchaser_email_domain': 'yahoo.com',
 'Recipient_email_domain': 'None',
 'Device_operating_system': 'Windows 7',
 'Browser_version': 'None',
 'Resolution': '1334x750',
 'DeviceInfo': 'Trident/7.0',
 'DeviceType': 'None',
 'TransactionAmt': '(100.0, 117.0]',
 'TransactionID': '[2987101.0, 3038557.2]',
 'addr1': '(325.0, 330.0]',
 'addr2': '(87.0, 96.0]',
 'card1': '(16573.5, 18375.0]',
 'card2': 'None',
 'card3': '[100.0, 150.0]',
 'card5': '(226.0, 237.0]',
 'dist1': '(1.0, 2.0]',
 'dist2': '(74.222, 150.0]',
 'Count_1': '(2.0, 3.0]',
 'Count_2': '[0, 1.0]',
 'Count_3': '0',
 'Count_4': '[0, 1.0]',
 'Count_5': '(1.0, 3.0]',
 'Count_6': '[0, 1.0]',
 'Count_7': '(1.0, 2252.0]',
 'Count_8': '[0, 1.0]',
 'Count_9': '(1.0, 2.0]'}
```

Finally, if even more granular analysis is desired, you can pass in specific metrics and determine the worst subsets for only those metrics.

```python
from rime.tabular.performance.error_analysis import get_worst_subsets_for_metrics

worst_subsets_for_metrics = get_worst_subsets_for_metrics(test_data_container, [MetricName.ACCURACY])
worst_subsets_for_metrics
```

**Output**

```
{'Timestamp': {'Accuracy': ('(4613225.0, 6027819.5]', 0.9575098814229249)},
 'Product_type': {'Accuracy': ('C', 0.9137055837563451)},
 'Card_company': {'Accuracy': ('discover', 0.9351851851851852)},
 'Card_type': {'Accuracy': ('credit', 0.9375244045294807)},
 'Purchaser_email_domain': {'Accuracy': ('hotmail.com', 0.9470443349753694)},
 'Recipient_email_domain': {'Accuracy': ('hotmail.com', 0.9246861924686193)},
 'Device_operating_system': {'Accuracy': ('Windows 7', 0.9646017699115044)},
 'Browser_version': {'Accuracy': ('chrome 64.0', 0.9241379310344827)},
 'Resolution': {'Accuracy': ('1366x768', 0.9291338582677166)},
 'DeviceInfo': {'Accuracy': ('Windows', 0.9427570093457944)},
 'DeviceType': {'Accuracy': ('mobile', 0.9260823653643083)},
 'TransactionAmt': {'Accuracy': ('(280.0, 3967.81]', 0.9436619718309859)},
 'TransactionID': {'Accuracy': ('(3188929.0, 3239251.0]', 0.9575098814229249)},
 'addr1': {'Accuracy': ('None', 0.9154929577464789)},
 'addr2': {'Accuracy': ('(87.0, 96.0]', 0.9090909090909091)},
 'card1': {'Accuracy': ('(13044.0, 15111.0]', 0.9539267015706806)},
 'card2': {'Accuracy': ('None', 0.9310344827586207)},
 'card3': {'Accuracy': ('(150.0, 185.0]', 0.9087221095334685)},
 'card5': {'Accuracy': ('(126.0, 166.0]', 0.9616766467065868)},
 'dist1': {'Accuracy': ('(208.889, 4568.0]', 0.9599109131403119)},
 'dist2': {'Accuracy': ('(7.0, 9.0]', 0.8571428571428571)},
 'Count_1': {'Accuracy': ('(3.0, 7.0]', 0.9504854368932039)},
 'Count_2': {'Accuracy': ('(7.0, 5690.0]', 0.9535374868004224)},
 'Count_3': {'Accuracy': ('0', 0.96921071106208)},
 'Count_4': {'Accuracy': ('(1.0, 2250.0]', 0.8878718535469108)},
 'Count_5': {'Accuracy': ('[0, 1.0]', 0.9647382261534784)},
 'Count_6': {'Accuracy': ('(5.0, 2250.0]', 0.9632183908045977)},
 'Count_7': {'Accuracy': ('(1.0, 2252.0]', 0.8470149253731343)},
 'Count_8': {'Accuracy': ('(1.0, 3328.0]', 0.9001663893510815)},
 'Count_9': {'Accuracy': ('[0, 1.0]', 0.9671781756180733)}}
```
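Because the result is a plain dictionary keyed by feature, it is straightforward to rank features by how badly their worst subset performs and decide where to focus remediation first. Here is a small sketch that relies only on the dictionary structure shown above (it is not part of the RIME API):

```python
# Sketch: rank features by the accuracy of their worst subset, using only the
# {feature: {"Accuracy": (subset, accuracy)}} structure shown in the output above.
ranked = sorted(
    worst_subsets_for_metrics.items(),
    key=lambda item: item[1]["Accuracy"][1],
)
for feature, metrics in ranked[:5]:
    subset, accuracy = metrics["Accuracy"]
    print(f"{feature}: worst subset {subset!r} has accuracy {accuracy:.3f}")
```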
## Improving Model Performance Results: Overweighting

After using RIME to identify weaknesses of your model, it's time to improve your model's performance. One way to do this is to increase the training weights of underperforming subsets. Let's try to increase the performance on subset `C` of the feature `Product_type`, which currently has an accuracy of ~91%.

```python
worst_subsets_for_metrics["Product_type"]
```

**Output**

```
{'Accuracy': ('C', 0.9137055837563451)}
```

### Training the Initial Model

We proceed in the usual way. First, we preprocess the training data and record the original model's predictions:

```python
train_pre = preprocess_df(train_df)
train_preds = model.predict_proba(train_pre)[:, 1]

COL = 'Product_type'
VAL = 'C'

train_df_full = train_df.copy()
train_df_full['label'] = train_labels
train_df_full['preds'] = train_preds
```

Then we adjust the sample weights for the underperforming subset and retrain:

```python
import numpy as np
import catboost as catb

# Give rows in the underperforming subset weight 2 and all other rows weight 1.
sample_weights = (train_pre[COL] == VAL) + 1

# Non-float columns are treated as categorical features.
categorical_features_indices = np.where(train_pre.dtypes != float)[0]

new_model = catb.CatBoostClassifier(random_state=0, verbose=0)
new_model.fit(train_pre, train_labels,
              sample_weight=sample_weights,
              cat_features=categorical_features_indices)
```

### Comparing Improvements

We can define a new predict-dict function and create a new container to calculate updated metrics.

```python
def predict_dict_new_model(x: dict):
    """Predict dict function for the retrained model."""
    new_x = preprocess(x)
    new_x = pd.DataFrame(new_x, index=[0])
    return new_model.predict_proba(new_x)[0][1]

new_data_container = DataContainer.from_df(train_df,
                                           model_task=ModelTask.BINARY_CLASSIFICATION,
                                           labels=train_labels)
test_data_container = DataContainer.from_df(test_df,
                                            labels=test_labels,
                                            model_task=ModelTask.BINARY_CLASSIFICATION,
                                            ref_data=data_container)
new_container = TabularRunContainer.from_predict_dict_function(new_data_container,
                                                               test_data_container,
                                                               predict_dict_new_model,
                                                               ModelTask.BINARY_CLASSIFICATION)
```

Recomputing the worst subsets for accuracy, we see that, despite this rather simple adjustment, **the accuracy on subset `C` increases to ~96%**:

```python
new_worst_subsets_for_metrics = get_worst_subsets_for_metrics(new_data_container, [MetricName.ACCURACY])
new_worst_subsets_for_metrics["Product_type"]
```

**Output**

```
{'Accuracy': ('C', 0.9650974025974026)}
```
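As a final sanity check, it is worth confirming that overweighting one subset has not hurt performance elsewhere. One option, sketched below under the assumption that the retrained model's run container (`new_container` from above) can be passed to the same stress test used at the start of this walkthrough, is to re-run the feature subset tests against the retrained model:

```python
# Sketch: re-run the subset stress tests from earlier in this walkthrough against
# the retrained model's run container. Results depend on your data and model;
# this only illustrates the verification pattern.
from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import AccuracyMetric

for feature in ["Product_type", "DeviceType"]:
    recheck = FeatureSubsetTest(feature, AccuracyMetric, (0.001, 0.02, 0.1))
    recheck.run_notebook(new_container)
```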