RI Bias and Fairness Lending Classification Walkthrough
You are a data scientist at a bank. The data science team has been tasked with implementing a binary classification model to predict whether an individual will default on a loan. The primary goal of this project is to test whether the model is compliant with financial regulations. Part of this will be testing whether the model is biased against certain protected features. One could imagine such models being used downstream for various purposes, such as loan approval or funding allocation. A biased model could yield disadvantageous outcomes for protected groups. For instance, we may find that for individuals of a specific race or race/gender combination, the model consistently predicts a higher probability of default, leading to a higher rate of loan rejection.
In this Notebook Walkthrough, we will walk through our core products of AI Stress Testing and AI Firewall in a Bias and Fairness setting. RIME AI Stress Testing allows you to test your developed model and datasets. In this compliance-focused setting, you will be able to verify your AI model for bias and fairness issues. RIME AI Firewall allows you to continue monitoring your deployed model for bias.
Latest Colab version of this notebook available here
Install Dependencies, Import Libraries and Download Data
Run the cell below to install the libraries needed to download the data, install our SDK, and load analysis libraries.
[ ]:
!pip install rime-sdk &> /dev/null
[ ]:
import pandas as pd
from pathlib import Path
from rime_sdk import Client
[ ]:
!pip install https://github.com/RobustIntelligence/ri-public-examples/archive/master.zip
from ri_public_examples.download_files import download_files
download_files('tabular-2.0/lending', 'lending')
Establish the RIME Client
To get started, provide the API credentials and the base domain/address of the RIME service. You can generate and copy an API token from the API Access Tokens page under Workspace settings. For the domain/address of the RIME service, contact your admin.
[ ]:
API_TOKEN = '' # PASTE API_KEY
CLUSTER_URL = '' # PASTE DEDICATED DOMAIN OF RIME SERVICE (eg: rime.stable.rbst.io)
client = Client(CLUSTER_URL, API_TOKEN)
Create a New Project
You can create projects in RIME to organize your test runs. Each project represents a workspace for a given machine learning task. It can contain multiple candidate models, but should only contain one promoted production model.
[ ]:
description = (
    "Continuously test the fairness and bias of tabular models in production"
    " using Continuous Tests."
    " Demonstration uses the Lending Club dataset, which is used"
    " to predict whether someone will repay a loan."
)
project = client.create_project(
    name="Bias and Fairness Continuous Testing Demo",
    description=description,
    model_task="MODEL_TASK_BINARY_CLASSIFICATION"
)
[ ]:
project.project_id
Go back to the UI to see the new ``Bias and Fairness Continuous Testing Demo`` project.
Training a Lending Model and Uploading the Model + Datasets
Let's first take a look at what the dataset looks like. We can observe that the data consists of a mix of categorical and numeric features.
[ ]:
pd.read_csv('lending/data/ref.csv').head()
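If you want to confirm that mix programmatically, a quick check of the column dtypes is enough (a minimal pandas sketch, independent of the RIME workflow):

pd.read_csv('lending/data/ref.csv').dtypes.value_counts()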
For this demo, we are going to use a pretrained CatBoostClassifier Model.
The model predicts whether an individual will default on their loan.
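For reference, a classifier of this kind could be trained locally along the following lines. This is only an illustrative sketch, assuming the catboost package is installed; the walkthrough itself uses the pretrained model shipped with the example files, and the missing-value handling below is an assumption about the raw data rather than the actual training recipe.

from catboost import CatBoostClassifier

ref = pd.read_csv('lending/data/ref.csv')
X, y = ref.drop(columns=['loan_status']), ref['loan_status']
cat_features = X.select_dtypes(include='object').columns.tolist()
X[cat_features] = X[cat_features].fillna('missing')  # CatBoost does not accept nulls in categorical columns

clf = CatBoostClassifier(iterations=200, verbose=0)  # hyperparameters chosen arbitrarily for illustration
clf.fit(X, y, cat_features=cat_features)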
We now want to kick off RIME Stress Tests in a compliance setting to help us determine whether the model is biased against protected attributes. To do this, we will upload the pre-trained model, the reference dataset the model was trained on, and the evaluation dataset the model was evaluated on to an S3 bucket that RIME can access.
[ ]:
upload_path = "ri_public_examples_lending"
model_s3_dir = client.upload_directory(
    Path('lending/models'), upload_path=upload_path
)
model_s3_path = model_s3_dir + "/model.py"
ref_s3_path = client.upload_file(
    Path('lending/data/ref.csv'), upload_path=upload_path
)
eval_s3_path = client.upload_file(
    Path('lending/data/eval.csv'), upload_path=upload_path
)
ref_preds_s3_path = client.upload_file(
    Path("lending/data/ref_preds.csv"), upload_path=upload_path
)
eval_preds_s3_path = client.upload_file(
    Path("lending/data/eval_preds.csv"), upload_path=upload_path
)
Once the data and model are uploaded to S3, we can register them to RIME. In this bias and fairness setting, we require some additional information when registering datasets. Within the data_params parameter of the registration function, we include the protected features present in the data so that we can run our bias and fairness tests on those features. Once the datasets and model are registered, we can refer to these resources using their RIME-generated IDs.
[ ]:
from datetime import datetime
dt = str(datetime.now())
# Note: models and datasets need to have unique names.
model_id = project.register_model_from_path(f"model_{dt}", model_s3_path)
ref_dataset_id = project.register_dataset_from_file(
    f"ref_dataset_{dt}",
    ref_s3_path,
    data_params={"label_col": "loan_status",
                 "protected_features": ["sex", "race", "addr_state"]}
)
eval_dataset_id = project.register_dataset_from_file(
    f"eval_dataset_{dt}",
    eval_s3_path,
    data_params={"label_col": "loan_status",
                 "protected_features": ["sex", "race", "addr_state"]}
)
ref_pred_id = project.register_predictions_from_file(ref_dataset_id, model_id, ref_preds_s3_path)
eval_pred_id = project.register_predictions_from_file(eval_dataset_id, model_id, eval_preds_s3_path)
Running a Stress Test with Bias and Fairness
AI Stress Tests allow you to test your data and model before deployment. They are a comprehensive suite of hundreds of tests that automatically identify implicit assumptions and weaknesses of pre-production models. Each stress test is run on a single model and its associated reference and evaluation datasets.
To run Stress Tests with the Bias & Fairness mode, there are two main changes to make. The first has already been done, namely specifying a set of protected_features in the data_params parameters of both datasets. The protected features are the specific features that you want Stress Tests to run over in order to test your model for signs of bias. Additionally, you will want to specify the Bias and Fairness category in the stress test config. This category does not run by default, so it must be specified explicitly:
stress_test_config = {
    # rest of configuration ...
    "categories": ["TEST_CATEGORY_TYPE_BIAS_AND_FAIRNESS"]
}
Note how the "categories" field contains the Bias and Fairness category ("TEST_CATEGORY_TYPE_BIAS_AND_FAIRNESS").
Below is a sample configuration of how to set up and run a RIME Stress Test.
[ ]:
stress_test_config = {
    "data_info": {
        "ref_dataset_id": ref_dataset_id,
        "eval_dataset_id": eval_dataset_id
    },
    "model_id": model_id,
    "run_name": "Loan Default Prediction - Lending Club",
    "categories": ["TEST_CATEGORY_TYPE_BIAS_AND_FAIRNESS"]
}
stress_job = client.start_stress_test(test_run_config=stress_test_config, project_id=project.project_id)
stress_job.get_status(verbose=True, wait_until_finish=True)
Wait a couple of minutes and your results will appear in the UI.
Bias and Fairness Stress Test Results
The stress tests are organized around a central “Bias and Fairness” tab. You can view the detailed results in the UI by running the above cell and redirecting to the generated link. This page shows granular results for a given AI Stress Test run.
Similar to running RIME Stress Tests in the default setting, we surface an overall distribution of test severities and model metrics, as well as key insights to the right. This test suite comprises a selection of Bias and Fairness tests over the protected features, Attack tests over all features, and Abnormal Inputs tests over all features. These tests align with financial regulatory standards. Let's take a closer look at the Demographic Parity test:
This test is commonly known as the demographic parity or statistical parity test in the fairness literature. It checks whether the model behaves the same on a given subset of rows as it does across the whole dataset. The key detail displays the difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Positive Prediction Rate of the model's predictions within a specific subset is significantly different from the Positive Prediction Rate over the entire 'population'.
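To make the metric concrete, here is a rough pandas sketch of the demographic parity comparison for the configured protected features. It assumes eval_preds.csv holds a single column of default probabilities thresholded at 0.5, which is an assumption about the file format; RIME computes its own statistics internally, so this is purely illustrative.

eval_df = pd.read_csv('lending/data/eval.csv')
# Assumed format: one probability column, thresholded at 0.5 to get binary predictions.
preds = pd.read_csv('lending/data/eval_preds.csv').iloc[:, 0] > 0.5

overall_ppr = preds.mean()  # Positive Prediction Rate over the entire population
for feature in ['sex', 'race', 'addr_state']:
    group_ppr = preds.groupby(eval_df[feature]).mean()  # Positive Prediction Rate per subset
    print(feature, 'largest gap from overall:', (group_ppr - overall_ppr).abs().max())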
We can see that the model fails the demographic parity test for two of the three protected features that we configured.
[ ]:
test_run = stress_job.get_test_run()
test_run
[ ]:
test_case_result = test_run.get_result_df()
test_case_result.head()
Programmatically Querying the Results
RIME not only provides you with an intuitive UI to visualize and explore these results, but also allows you to programmatically query these results. This allows customers to integrate with their MLOps pipeline, log results to experiment management tools like MLflow, bring automated decision making to their ML practices, or store these results for future reference.
Run the cell below to programmatically query the results. The results are output as a pandas DataFrame.
Access results at the test run overview level
[ ]:
test_run_result = test_run.get_result_df()
test_run_result.to_csv("Lending_Test_Run_Results.csv")
test_run_result
Access detailed test results at the individual test case level.
[ ]:
test_case_result = test_run.get_test_cases_df()
test_case_result.to_csv("Lending_Test_Case_Results.csv")
test_case_result.head()
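As one example of wiring these exports into an experiment tracker, the CSVs written above can be logged as MLflow artifacts. This is an optional sketch and assumes an MLflow tracking setup already exists in your environment.

import mlflow

with mlflow.start_run(run_name="rime-bias-fairness-stress-test"):
    mlflow.log_artifact("Lending_Test_Run_Results.csv")
    mlflow.log_artifact("Lending_Test_Case_Results.csv")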
Deploy to Production and Create the AI Firewall
Once you have identified the best stress test run, you can deploy the associated model wrapped with the AI Firewall. The AI Firewall operates at both the datapoint and batch level. It automatically protects your model in real time from "bad" incoming data and also alerts on statistically significant distributional drift.
In this scenario, the data scientist is short on time and decides to deploy the existing model to production. The data scientist also creates and wraps a firewall around the model. The AI Firewall is automatically configured based on the failures identified by AI Stress Testing to protect the tested model in production.
[ ]:
from datetime import timedelta
# Create Firewall using previously registered model and dataset IDs.
firewall = project.create_firewall(model_id, ref_dataset_id, timedelta(days=1))
Uploading a Batch of Production Data & Model Predictions to Firewall
The lending model has been in production for 30 days, and production data and model predictions have been collected and stored over that period. Now, we will use the Firewall to track how the model performed across those 30 days.
Upload an Incremental Batch of Data
[ ]:
prod_s3_path = client.upload_file(
    Path('lending/data/incremental.csv'),
    upload_path=upload_path
)
prod_dataset_id = project.register_dataset_from_file(
    f"prod_dataset_{dt}",
    prod_s3_path,
    data_params={"label_col": "label",
                 "protected_features": ["sex", "race", "addr_state"],
                 "timestamp_col": "timestamp"}
)
prod_preds_s3_path = client.upload_file(
    Path('lending/data/incremental_preds.csv'),
    upload_path=upload_path
)
project.register_predictions_from_file(
    prod_dataset_id, model_id, prod_preds_s3_path
)
Run Continuous Testing over Batch of Data
[ ]:
ct_job = firewall.start_continuous_test(prod_dataset_id)
ct_job.get_status(verbose=True, wait_until_finish=True)
firewall
Wait a couple of minutes and your results will appear in the UI.
Firewall Overview
The Overview page is the mission control for your model’s production deployment health. In it, you can see the status of firewall events, get notified when model performance degrades, and see the underlying causes of failure.
Firewall CT Results
The AI Firewall’s Continuous Tests operate at the batch level and provide a mechanism to monitor the health of ML deployments in production. They allow the user to understand when errors begin to occur and surface the underlying drivers of such errors.
You can explore the results in the UI by running the below cell and redirecting to the generated link.
[ ]:
firewall
Analyzing CT Results
Average performance dips over time - In the below image, we can see that the average confidence dips slightly in one of the bins, hitting the first threshold of 0.75.
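If you want to approximate that view locally, a rough sketch is shown below. It assumes incremental_preds.csv holds a single probability column aligned row-for-row with incremental.csv (which carries the timestamp column registered earlier); the Firewall computes this for you, so this is only for intuition.

prod = pd.read_csv('lending/data/incremental.csv', parse_dates=['timestamp'])
# Assumed format: one probability column, aligned with the production rows.
probs = pd.read_csv('lending/data/incremental_preds.csv').iloc[:, 0]
confidence = probs.where(probs >= 0.5, 1 - probs)  # confidence of the predicted class

daily_avg = confidence.groupby(prod['timestamp'].dt.date).mean()
print(daily_avg[daily_avg < 0.75])  # bins that dip below the 0.75 threshold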
Using the AI Firewall can help the data science team continue to monitor the performance of their production model.