RI NLP Multiclass Classification Walkthrough
You are a data scientist working to maintain a large research library. The data science team has been tasked with implementing a research paper topic classification model and monitoring how that model performs over time. The performance of this model directly impacts the profits of the company. To ensure the data science team develops the best model and that its performance doesn't degrade over time, the VP of Data Science purchases the RIME platform.
In this Notebook Walkthrough, we will walk through two of RIME's core products - AI Stress Testing and AI Continuous Testing.
AI Stress Testing is used in the model development stage. Using AI Stress Testing, you can test the developed model. RIME goes beyond simply optimizing for basic performance metrics like accuracy and automatically discovers the model's weaknesses.
AI Continuous Testing is used after the model is deployed in production. Using AI Continuous Testing, you can automate the monitoring, discovery and remediation of issues that occur post-deployment.
The latest Colab version of this notebook is available here
Install Dependencies, Import Libraries and Download Data
Run the cells below to install the libraries needed to retrieve data, install our SDK, and load analysis libraries.
[ ]:
!pip install rime-sdk &> /dev/null
import pandas as pd
from pathlib import Path
from rime_sdk import Client
[ ]:
!pip install https://github.com/RobustIntelligence/ri-public-examples/archive/master.zip
from ri_public_examples.download_files import download_files
download_files('nlp/classification/arxiv-2.0', 'arxiv')
Establish the RIME Client
To get started, provide the API credentials and the base domain/address of the RIME Cluster. You can generate and copy an API token from the API Access Tokens Page under Workspace settings. For the domain/address of the RIME Cluster, contact your admin.

[ ]:
API_TOKEN = '' # PASTE API_KEY
CLUSTER_URL = '' # PASTE DEDICATED DOMAIN OF RIME SERVICE
client = Client(CLUSTER_URL, API_TOKEN)
Create a New Project
You can create projects in RIME to organize your test runs. Each project represents a workspace for a given machine learning task. It can contain multiple candidate models, but should only contain one promoted production model.
[ ]:
description = (
    "Run Stress Testing, Continuous Testing and AI Firewall on a"
    " text classification model and dataset. Demonstration uses"
    " a dataset composed of ArXiv paper titles where the task is"
    " to predict the paper topic."
)
project = client.create_project(
    name='Text Classification Demo',
    description=description,
    model_task='MODEL_TASK_MULTICLASS_CLASSIFICATION'
)
Go back to the UI to see the Arxiv Project
Uploading the Model + Datasets + Predictions
For this demo, we are going to use the prediction logs of a text classification model for arXiv, a popular research paper database.
The model classifies each research paper into a number of different categories, such as:
Black Hole
Neutron Star
Dark Matter
We now want to kick off RIME Stress Tests that will help us evaluate the model in greater depth than basic performance metrics like accuracy, precision, and recall. To do this, we will upload the pre-trained model, the reference dataset the model was trained on, and the evaluation dataset the model was evaluated on to an S3 bucket that RIME can access. Furthermore, we'll need to register them with RIME.
[ ]:
upload_path = "ri_public_examples_arxiv"
ref_s3_path = client.upload_file(
    Path('arxiv/data/train.json.gz'), upload_path=upload_path
)
eval_s3_path = client.upload_file(
    Path('arxiv/data/val_0_with_label.json.gz'), upload_path=upload_path
)
ref_preds_s3_path = client.upload_file(
    Path("arxiv/data/preds.train.jsonl.gz"), upload_path=upload_path
)
eval_preds_s3_path = client.upload_file(
    Path("arxiv/data/preds.val_0.jsonl.gz"), upload_path=upload_path
)
Once the data and model are uploaded to S3, we can register them with RIME. Once they're registered, we can refer to these resources using their RIME-generated IDs.
[ ]:
from datetime import datetime

dt = str(datetime.now())
# Note: models and datasets need to have unique names.
model_id = project.register_model(f'model_{dt}', model_config={
    "hugging_face": {"model_uri": "Wi/arxiv-distilbert-base-cased"}
})
data_params = {
    "label_col": "label",
    "text_features": [
        "text"
    ],
    "timestamp_col": "timestamp"
}
ref_dataset_id = project.register_dataset_from_file(
    f"ref_dataset_{dt}", ref_s3_path, data_params=data_params
)
eval_dataset_id = project.register_dataset_from_file(
    f"eval_dataset_{dt}", eval_s3_path, data_params=data_params
)
project.register_predictions_from_file(
    ref_dataset_id, model_id, ref_preds_s3_path
)
project.register_predictions_from_file(
    eval_dataset_id, model_id, eval_preds_s3_path
)
Running a Stress Test
AI Stress Tests allow you to test your data and model before deployment. They are a comprehensive suite of hundreds of tests that automatically identify implicit assumptions and weaknesses of pre-production models. Each stress test is run on a single model and its associated reference and evaluation datasets.
First, we will create a custom image with the pip requirements needed to run our model. This will take some time to install and run.
[ ]:
requirements = [
    client.pip_requirement("datasets"),
    client.pip_requirement("sentencepiece"),
]
image_name = "arxiv_image"
# Start a new image building job if a usable image doesn't already exist.
if not client.has_managed_image(image_name, check_status=True):
    # e.g., if a previous image build job failed, delete it and rebuild.
    if client.has_managed_image(image_name):
        client.delete_managed_image(image_name)
    builder_job = client.create_managed_image(image_name, requirements)
    # Wait until the job has finished and print out status information.
    # Once this prints out the `READY` status, your image is available for use in stress tests.
    builder_job.get_status(verbose=True, wait_until_finish=True)
Now that we've built our custom image, we can run our tests. Below is a sample configuration showing how to set up and run a RIME Stress Test for NLP.
[ ]:
stress_test_config = {
    "run_name": "ArXiv Topic Classification",
    "data_info": {
        "ref_dataset_id": ref_dataset_id,
        "eval_dataset_id": eval_dataset_id,
    },
    "model_id": model_id,
    "run_time_info": {
        "random_seed": 42,
        "custom_image": {
            "managed_image_name": image_name
        }
    }
}
stress_job = client.start_stress_test(
    test_run_config=stress_test_config,
    project_id=project.project_id
)
stress_job.get_status(verbose=True, wait_until_finish=True)
Stress Test Results
Stress tests are grouped into categories that measure various aspects of model robustness (subset performance, distribution drift, abnormal inputs). Suggestions to improve your model are aggregated at the category level as well. Tests are ranked by a shared severity metric by default. Clicking on an individual test surfaces more detailed information.
You can view the detailed results in the UI by running the cell below and following the generated link. This page shows granular results for a given AI Stress Test run.
[ ]:
test_run = stress_job.get_test_run()
test_run
Analyzing the Results
Below you can see a snapshot of the results and the different tests that we run. Some of these tests, such as the Subset Performance tests, analyze how your model performs on different groups in the data, while others, such as the Transformation tests, analyze how your model reacts to augmented and malicious text data.
Subset Performance Tests
Here are the results of the Subset Performance tests. These can be thought of as more detailed performance tests that identify subsets on which the model underperforms. They help ensure that the model works equally well across different groups.
Below we are exploring the "Subset Macro Precision" test cases for the text metadata feature "AverageTokenLength". We can see that even though the model has a Macro Precision of 0.53, it performs poorly on certain subsets, especially those with lower average token lengths.
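If you prefer to pull these subset precision results into the notebook rather than reading them from the UI, a minimal sketch is below. It reuses the get_test_batch call shown later in this walkthrough; the batch ID "subset_performance:subset_macro_precision" is an assumption inferred from the naming pattern of the macro F1 batch and may differ in your deployment.
[ ]:
# Hypothetical sketch: fetch the "Subset Macro Precision" test batch by ID.
# The batch ID is assumed from the "subset_performance:subset_macro_f1" pattern
# used later in this notebook; confirm the exact ID in the UI or results dataframe.
subset_macro_precision = test_run.get_test_batch("subset_performance:subset_macro_precision")
subset_macro_precision.get_test_cases_df()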
Transformation Tests
Here are the results of the Transformation tests. These can be thought of as ways to test your model's response to augmented text data. They help to make sure that your model is invariant to such changes in your data.
Below we are exploring a transformation test that changes the original text to upper-case. We see that this transformation causes the original class's predicted score to change by 0.70. As a result, the model predicts an entirely new class for the text and misclassifies it.
Programmatically Querying the Results
RIME not only provides an intuitive UI to visualize and explore these results, but also lets you query them programmatically. This allows customers to integrate with their MLOps pipelines, log results to experiment-management tools like MLflow, bring automated decision making to their ML practices, or store results for future reference - a couple of illustrative sketches follow the query cells below.
Run the cells below to programmatically query the results. The results are output as pandas dataframes.
Access results at the test run overview level
[ ]:
test_run_result = test_run.get_result_df()
test_run_result.to_csv("Arxiv_Test_Run_Results.csv")
test_run_result
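The overview dataframe can also be forwarded to an experiment tracker. The sketch below logs its numeric columns to MLflow; it is illustrative only and assumes that MLflow is installed in the environment, that the overview frame has at least one row, and that its numeric columns are meaningful summary metrics. The key sanitization is a generic precaution, not part of the RIME SDK.
[ ]:
# Hypothetical sketch: forward the overview metrics to MLflow.
# Assumes `mlflow` is installed and the overview dataframe has a single row of
# (mostly numeric) summary columns; non-numeric columns are skipped.
import re
import mlflow

with mlflow.start_run(run_name="arxiv-stress-test"):
    numeric_overview = test_run_result.select_dtypes("number").iloc[0]
    for col, val in numeric_overview.items():
        # MLflow restricts the characters allowed in metric keys, so sanitize them.
        mlflow.log_metric(re.sub(r"[^0-9A-Za-z_\-. /]", "_", str(col)), float(val))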
Access detailed test results at the individual test case level
[ ]:
test_case_result = test_run.get_test_cases_df()
test_case_result.to_csv("Arxiv_Test_Case_Results.csv")
test_case_result
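Because the test-case results are a plain pandas dataframe, they can also drive automated decisions, for example gating a model promotion step in CI. The sketch below is illustrative only: the "severity" column name and its values are assumptions about the dataframe schema, so inspect test_case_result.columns and adjust accordingly.
[ ]:
# Hypothetical sketch: a simple automated gate on the test-case results.
# The "severity" column and its values are assumed; adapt to your schema.
SEVERITY_COL = "severity"
if SEVERITY_COL in test_case_result.columns:
    flagged = test_case_result[
        test_case_result[SEVERITY_COL].astype(str).str.contains("ALERT", case=False)
    ]
    print(f"{len(flagged)} of {len(test_case_result)} test cases raised alerts.")
else:
    print("No 'severity' column found; inspect test_case_result.columns.")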
Access detailed test results for a given test batch
[ ]:
subset_macro_f1 = test_run.get_test_batch("subset_performance:subset_macro_f1")
subset_macro_f1.get_test_cases_df()
Deploy to Production and Create the AI Firewall
Once you have identified the best stress test run, you can deploy the associated model and set up a RIME Firewall to run Continuous Testing in order to automatically detect "bad" incoming data and statistically significant distributional drift.
[ ]:
from datetime import timedelta
firewall = project.create_firewall(model_id, ref_dataset_id, timedelta(days=1))
Uploading a Batch of Production Data & Model Predictions to the Firewall
The text classification model has been in production for the past week, and its incoming data and model predictions have been collected and stored. Now, we will use the Firewall to track how the model performed over that week.
[ ]:
dt = str(datetime.now())
prod_s3_path = client.upload_file(
    Path('arxiv/data/val_1.json.gz'),
    upload_path=upload_path
)
prod_dataset_id = project.register_dataset_from_file(
    f"prod_dataset_{dt}",
    prod_s3_path,
    data_params=data_params
)
prod_preds_s3_path = client.upload_file(
    Path('arxiv/data/preds.val_1.jsonl.gz'),
    upload_path=upload_path
)
project.register_predictions_from_file(
    prod_dataset_id, model_id, prod_preds_s3_path
)
Get the Firewall
[ ]:
firewall = client.get_firewall_for_project(project.project_id)
Run Continuous Testing over the Batch of Data
[ ]:
ct_job = firewall.start_continuous_test(prod_dataset_id)
ct_job.get_status(verbose=True, wait_until_finish=True)
firewall
Wait a couple of minutes and your results will appear in the UI
Querying Results from the Firewall
After a firewall has been created and data has been uploaded for processing, you can query the results across the entire uploaded history.
Obtain All Detection Events
[ ]:
events = [d.to_dict() for m in firewall.list_monitors() for d in m.list_detected_events()]
events_df = pd.DataFrame(events).drop(
    ["id", "project_id", "firewall_id", "event_object_id", "description_html", "last_update_time"],
    axis=1,
)
events_df.head()
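As a quick sanity check on what the Firewall surfaced, you can summarize the events dataframe with ordinary pandas operations. The column names below ("event_type", "severity") are assumptions about the event schema; check events_df.columns for the actual names.
[ ]:
# Hypothetical sketch: count detected events by type and severity.
# Column names are assumptions; only columns that actually exist are summarized.
for col in ("event_type", "severity"):
    if col in events_df.columns:
        print(events_df[col].value_counts(), "\n")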
Firewall Overview
The Overview page is the mission control for your model's production deployment health. In it, you can see the status of firewall events, get notified when model performance degrades, and see the underlying causes of failure.
Firewall CT Results
The AI Firewall's Continuous Tests operate at the batch level and provide a mechanism to monitor the health of ML deployments in production. They allow the user to understand when errors begin to occur and surface the underlying drivers of such errors.
You can explore the results in the UI by running the cell below and following the generated link.
[ ]:
firewall
Analyzing CT Results
Model performance stays constant - In the below image, we can see that the Average Confidence (model performance) stays relatively constant, increasing slightly from 71.5% on 06/19 to 74.3% on 06/24.
Abnormality Rate Increases For a Time Period - In the below image, we can see that the abnormality rate increased in the middle of the week compared to when the model was first deployed. On 06/19, when the model was deployed, the rate was 0.83%. By 06/22, it had increased to 1.81%.