RI Image Classification Walkthrough

▶️ Try this in Colab! Run the RI Image Classification Walkthrough in Google Colab.

You are a data scientist working for a wildlife research foundation. The data science team has been tasked with implementing an animal classifier and monitoring how that model performs over time. The performance of this model directly impacts the foundation's bottom line. To ensure the team develops the best possible model and that its performance doesn't degrade over time, the VP of Data Science purchases the RIME platform.

In this Notebook Walkthrough, we will walk through two of RIME's core products: AI Stress Testing and AI Continuous Testing.

  1. AI Stress Testing is used in the model development stage. Using AI Stress Testing, you can test your developed model: RIME goes beyond simply optimizing for basic performance metrics such as accuracy and automatically discovers the model’s weaknesses.

  2. AI Continuous Testing is used after the model is deployed in production. Using AI Continuous Testing, you can automate the monitoring, discovery and remediation of issues that occur post-deployment.

Install Dependencies, Import Libraries and Download Data

Run the cells below to install dependencies, import the SDK, and download the example data.

[ ]:
!pip install rime-sdk &> /dev/null

[ ]:
from rime_sdk import Client

[ ]:
!pip install https://github.com/RobustIntelligence/ri-public-examples/archive/master.zip

from ri_public_examples.download_files import download_files

download_files("images/classification/awa2", "awa2")

Establish the RIME Client

To get started, provide the API credentials and the base domain/address of the RIME Cluster. You can generate and copy an API token from the API Access Tokens Page under Workspace settings. For the domain/address of the RIME Cluster, contact your admin.

Image of getting an API token

Image of creating an API token

[ ]:
API_TOKEN = '' # PASTE API_KEY
CLUSTER_URL = '' # PASTE DEDICATED DOMAIN OF RIME SERVICE (eg: rime.stable.rbst.io)
AGENT_ID = '' # PASTE AGENT_ID IF USING AN AGENT THAT IS NOT THE DEFAULT
rime_client = Client(CLUSTER_URL, API_TOKEN)

Create a New Project

You can create projects in RIME to organize your test runs. Each project represents a workspace for a given machine learning task. It can contain multiple candidate models, but should only contain one promoted production model.

[ ]:
description = (
    "Run Stress Testing and Continuous Testing on an "
    "image classification model and dataset. Demonstration uses "
    "the Animals with Attributes 2 (AwA2) dataset."
)
project = rime_client.create_project(
    'Image Classification Demo',
    description,
    "MODEL_TASK_MULTICLASS_CLASSIFICATION",
)
project

Go back to the UI or click the link above to see the Project.

Uploading the Datasets + Predictions

For this demo, we are going to use the predictions of an image classification model for animals. The dataset we will be using is Animals with Attributes 2 (AwA2), a benchmark image dataset that records features and labels for numerous animals in the wild. The model you have trained is a ResNet designed to predict on the images in this diverse dataset.

The model classifies each image into one of a number of categories, such as:

  1. Sheep

  2. Killer Whale

  3. Monkey

We now want to kick off RIME Stress Tests, which will help us evaluate the model in greater depth than basic performance metrics such as accuracy, precision, and recall allow. To do this, we will upload the pre-trained model, the reference dataset the model was trained on, and the evaluation dataset the model was evaluated on to an S3 bucket that RIME can access.

[ ]:
upload_path = "ri_public_examples_awa2"
rime_models_directory = rime_client.upload_directory("awa2/models", upload_path=upload_path)
rime_model_path = rime_models_directory + "/awa2_cpu.py"

[ ]:
from datetime import datetime

dt = str(datetime.now())

# All registered resources need to have unique names, so we append the current
# timestamp in case this notebook is rerun.
model_id = project.register_model_from_path(f"model_{dt}", rime_model_path, agent_id=AGENT_ID)

[ ]:
train_inputs_file = "awa2/data/train_inputs_trial.json"
test_inputs_file = "awa2/data/test_inputs_trial.json"
_, train_inputs_path = rime_client.upload_local_image_dataset_file(
    train_inputs_file, ["image_path"], upload_path=upload_path
)
_, test_inputs_path = rime_client.upload_local_image_dataset_file(
    test_inputs_file, ["image_path"], upload_path=upload_path
)
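Before registering these files, it can help to sanity-check their contents locally. The sketch below assumes each inputs file is a JSON array of records with "image_path" and "label" keys, the layout implied by the data_info dictionary defined in the next cell; adjust if your files differ.

[ ]:
# Optional: peek at the first training record. The JSON-array-of-records
# layout assumed here is inferred from data_info, not guaranteed by the SDK.
import json

with open(train_inputs_file) as f:
    train_records = json.load(f)
print(f"{len(train_records)} training records; first record:")
print(train_records[0])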

[ ]:
class_names = [
    "antelope",
    "grizzly+bear",
    "killer+whale",
    "beaver",
    "dalmatian",
    "horse",
    "german+shepherd",
    "blue+whale",
    "siamese+cat",
    "skunk",
    "mole",
    "tiger",
    "moose",
    "spider+monkey",
    "elephant",
    "gorilla",
    "ox",
    "fox",
    "sheep",
    "hamster",
    "squirrel",
    "rhinoceros",
    "rabbit",
    "bat",
    "giraffe",
    "wolf",
    "chihuahua",
    "weasel",
    "otter",
    "buffalo",
    "zebra",
    "deer",
    "bobcat",
    "lion",
    "mouse",
    "polar+bear",
    "collie",
    "walrus",
    "cow",
    "dolphin",
]
data_info = {
    "image_features": ["image_path"],
    "label_col": "label",
    "class_names": class_names,
}

[ ]:
ref_id = project.register_dataset_from_file(
    f"ref_set_{dt}",
    train_inputs_path,
    data_info,
    agent_id=AGENT_ID
)
eval_id = project.register_dataset_from_file(
    f"eval_set_{dt}",
    test_inputs_path,
    data_info,
    agent_id=AGENT_ID
)

[ ]:
ref_preds_path = rime_client.upload_file("awa2/data/train_preds_trial.json")
eval_preds_path = rime_client.upload_file("awa2/data/test_preds_trial.json")
project.register_predictions_from_file(
    ref_id, model_id, ref_preds_path, agent_id=AGENT_ID
)
project.register_predictions_from_file(
    eval_id, model_id, eval_preds_path, agent_id=AGENT_ID
)
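As an optional consistency check before running any tests, you can verify that each inputs file has a matching number of predictions. The JSON-array layout assumed below is not guaranteed by the SDK; adjust to your own export format.

[ ]:
# Optional: the number of predictions should equal the number of inputs.
# Assumes both files are JSON arrays with one entry per image.
import json

with open(train_inputs_file) as f:
    n_inputs = len(json.load(f))
with open("awa2/data/train_preds_trial.json") as f:
    n_preds = len(json.load(f))
assert n_inputs == n_preds, f"{n_inputs} inputs vs {n_preds} predictions"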

Running a Stress Test

AI Stress Tests allow you to test your data and model before deployment. They are a comprehensive suite of hundreds of tests that automatically identify implicit assumptions and weaknesses of pre-production models. Each stress test is run on a single model and its associated reference and evaluation datasets.

Below is a sample configuration showing how to set up and run a RIME Stress Test for images.

[ ]:
stress_test_config = {
    "run_name": "Image Classification AWA2",
    "data_info": {
        "ref_dataset_id": ref_id,
        "eval_dataset_id": eval_id,
    },
    "model_id": model_id,
    "categories": [
        "TEST_CATEGORY_TYPE_TRANSFORMATIONS",
        "TEST_CATEGORY_TYPE_ADVERSARIAL",
        "TEST_CATEGORY_TYPE_SUBSET_PERFORMANCE",
        "TEST_CATEGORY_TYPE_DRIFT",
    ]
}
stress_job = rime_client.start_stress_test(stress_test_config, project.project_id, agent_id=AGENT_ID)
stress_job.get_status(verbose=True, wait_until_finish=True)

Stress Test Results

Stress tests are grouped first by risk category and then into categories that measure various aspects of model robustness (subset performance, distribution drift, adversarial, transformations). Key findings for improving your model are also aggregated at the category level. By default, tests are ranked by a shared severity metric. Clicking on an individual test surfaces more detailed information.

You can view the detailed results in the UI by running the cell below and redirecting to the generated link. This page shows granular results for a given AI Stress Test run.

[ ]:
test_run = stress_job.get_test_run()
test_run

Analyzing the Results

Below you can see a snapshot of the results. Some of these tests, such as the Subset Performance tests, analyze how your model performs on different subsets of your data defined by image metadata, while others, such as the Transformations tests, analyze how your model reacts to augmented and perturbed images.

Image of stress tests results for an image classification model

Subset Performance Tests

Here are the results of the Subset Performance tests. These can be thought of as finer-grained performance tests that identify subsets of your data, defined by image metadata, where the model underperforms. They help ensure that the model works equally well across different styles of images.

Image of subset performance results for an image classification model

Below we are exploring the “Subset F1 score” test cases for the image metadata feature ImageBrightness. We can see that even though the model has an overall F1 score of 0.52, it performs poorly on images at the tails of the brightness distribution - images that are either very dim or very bright.

Image of subset results for an image brightness feature of an image classification model

Transformation Tests

The results of the Transformation tests are below. These can be thought of as ways to test your model’s response to augmented image data, which often occurs in the real world. They help ensure that your model is invariant to such changes in your data.

Image of subset results for an image perturbation feature of an image classification model

Programmatically Querying the Results

RIME not only provides an intuitive UI to visualize and explore these results, but also allows you to query them programmatically. This lets customers integrate RIME with their MLOps pipelines, log results to experiment management tools like MLflow, bring automated decision making to their ML practices, or store results for future reference.

Run the cells below to programmatically query the results. The results are output as pandas DataFrames.

Access results at the test run overview level:

[ ]:
test_run_result = test_run.get_result_df()
test_run_result.to_csv("AWA2_Test_Run_Results.csv")
test_run_result

Access detailed results at the individual test case level:

[ ]:
test_case_result = test_run.get_test_cases_df()
test_case_result.to_csv("AWA2_Test_Case_Results.csv")
test_case_result
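From here you can slice the DataFrame however your pipeline requires. As one example, the hedged sketch below surfaces the most severe test cases for triage; the column name "severity" is an assumption about the schema, so inspect test_case_result.columns on your own run first.

[ ]:
# Example downstream use: surface the most severe test cases for triage.
# "severity" is an assumed column name; check test_case_result.columns.
if "severity" in test_case_result.columns:
    worst_cases = test_case_result.sort_values("severity", ascending=False)
    print(worst_cases.head(10))
else:
    print("Available columns:", list(test_case_result.columns))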

Deploy to Production and set up Continuous Testing

Once you have identified the best stress test run, you can deploy the associated model and set up Continuous Testing in order to automatically detect “bad” incoming data and statistically significant distributional drift.

[ ]:
from datetime import timedelta

ct_instance = project.create_ct(model_id, ref_id, timedelta(days=1))
ct_instance

Uploading a Batch of Production Data with Model Predictions to Continuous Testing

The image classification model has been in production for the past two weeks, and production data and model predictions have been collected and stored over that period. Now, we will use Continuous Testing to track how the model performed across those two weeks.

Upload an Incremental Batch of Data

[ ]:
monitoring_inputs_file = "awa2/data/test_inputs_monitoring_trial.json"
_, monitoring_inputs_path = rime_client.upload_local_image_dataset_file(
    monitoring_inputs_file, ["image_path"], upload_path=upload_path)
monitoring_id = project.register_dataset_from_file(
    f"monitoring_set_{dt}",
    monitoring_inputs_path,
    {
        "image_features": ["image_path"],
        "label_col": "label",
        "class_names": class_names,
        "timestamp_col": "timestamp",
    },
    agent_id=AGENT_ID,
)

monitoring_preds_path = rime_client.upload_file("awa2/data/monitoring_preds_trial.json")
project.register_predictions_from_file(
    monitoring_id, model_id, monitoring_preds_path, agent_id=AGENT_ID
)
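Before running tests, you may want to confirm that the monitoring file actually spans the expected two-week window. The sketch below assumes each record carries a "timestamp" key, as indicated by the timestamp_col passed to register_dataset_from_file above.

[ ]:
# Optional: check the time range covered by the monitoring batch.
# Assumes a JSON array of records, each with a "timestamp" key.
import json

with open(monitoring_inputs_file) as f:
    monitoring_records = json.load(f)
timestamps = sorted(r["timestamp"] for r in monitoring_records)
print(f"{len(monitoring_records)} records from {timestamps[0]} to {timestamps[-1]}")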

Run Continuous Testing over Batch of Data

[ ]:
ct_job = ct_instance.start_continuous_test(monitoring_id, override_existing_bins=True, agent_id=AGENT_ID)
ct_job.get_status(verbose=True, wait_until_finish=True)

Wait a couple of minutes and your results will appear in the UI.

CT Results

The Continuous Tests operate at the batch level and provide a mechanism to monitor the health of ML deployments in production. They allow the user to understand when errors begin to occur and surface the underlying drivers of such errors.

You can explore the results in the UI by running the cell below and redirecting to the generated link.

[ ]:
ct_instance