RI Image Classification Walkthrough

You are a data scientist working for a wildlife research foundation. The data science team has been tasked with implementing an animal classifier and monitoring how that model performs over time. The performance of this model directly impacts the foundation’s profits. To ensure the team develops the best possible model and that its performance does not degrade over time, the VP of Data Science purchases the RIME platform.

In this Notebook Walkthrough, we will walk through two of RIME’s core products: AI Stress Testing and AI Continuous Testing.

  1. AI Stress Testing is used in the model development stage. Using AI Stress Testing, you can test the developed model. RIME goes beyond simply optimizing for basic performance metrics like accuracy and automatically discovers the model’s weaknesses.

  2. AI Continuous Testing is used after the model is deployed in production. Using AI Continuous Testing, you can automate the monitoring, discovery, and remediation of issues that occur post-deployment.

Latest Colab version of this notebook available here

Install Dependencies, Import Libraries and Download Data

Run the cells below to install our SDK and helper libraries, import the libraries we need, and download the data.

[ ]:
!pip install rime-sdk &> /dev/null
!pip install git+https://github.com/RobustIntelligence/ri-public-examples.git &> /dev/null
[ ]:
import json

from rime_sdk import Client
from ri_public_examples.download_files import download_files
[ ]:
download_files("images/classification/awa2", "awa2")
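# Each inputs file stores the model's predicted class probabilities with
# every record; extract them into standalone prediction files for RIME.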
json.dump([x["probabilities"] for x in json.load(open("awa2/data/train_inputs_trial.json"))], open("awa2/data/train_preds_trial.json", 'w'))
json.dump([x["probabilities"] for x in json.load(open("awa2/data/test_inputs_trial.json"))], open("awa2/data/test_preds_trial.json", 'w'))
json.dump([x["probabilities"] for x in json.load(open("awa2/data/test_inputs_monitoring_trial.json"))], open("awa2/data/monitoring_preds_trial.json", 'w'))
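
If you want to confirm what these files contain before registering them, the quick check below prints the fields of one training record. The field names referenced here (image_path, label, probabilities) are the ones this notebook itself relies on; treat anything else you see as dataset-specific.

[ ]:
# Peek at one record to confirm the expected schema.
sample = json.load(open("awa2/data/train_inputs_trial.json"))[0]
print({key: type(value).__name__ for key, value in sample.items()})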

Establish the RIME Client

To get started, provide the API credentials and the base domain/address of the RIME Cluster. You can generate and copy an API token from the API Access Tokens Page under Workspace settings. For the domain/address of the RIME Cluster, contact your admin.

api.jpeg
[ ]:
API_TOKEN = '' # PASTE API_KEY
CLUSTER_URL = '' # PASTE DEDICATED DOMAIN OF RIME SERVICE
rime_client = Client(CLUSTER_URL, API_TOKEN)

Create a New Project

You can create projects in RIME to organize your test runs. Each project represents a workspace for a given machine learning task. It can contain multiple candidate models, but should only contain one promoted production model.

[ ]:
description = (
    "Run Stress Testing and Continuous Testing on an"
    " image classification model and dataset. Demonstration uses "
    " the Animals with Attributes 2 (AwA2) dataset."
)
project = rime_client.create_project(
    'Image Classification Demo',
    description,
    "MODEL_TASK_MULTICLASS_CLASSIFICATION",
)

Go back to the UI to see the Project

Uploading the Datasets + Predictions

For this demo, we are going to use the predictions of an image classification model for animals. The dataset we will be using is Animals with Attributes 2 (AwA2), a benchmark image dataset that records features and labels for numerous animals in the wild. The model you have trained is a ResNet designed to predict on the images in this diverse dataset.

The model classifies an image into a number of different categories, such as:

  1. Sheep

  2. Killer Whale

  3. Monkey

We now want to kick off RIME Stress Tests that will help us evaluate the model in greater depth than basic performance metrics such as accuracy, precision, and recall allow. To do this, we will upload the pre-trained model, the reference dataset the model was trained on, and the evaluation dataset the model was evaluated on to an S3 bucket that RIME can access.

[ ]:
%%writefile awa2/models/model.py
from typing import Dict, List
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torchvision.io import read_image, ImageReadMode
import torchvision.models as models
import torchvision.transforms as transforms


IMG_SIZE = 224
NUM_CLASSES = 40
NUM_FEATURES = 512
MODEL_FOLDER_PATH = Path(__file__).parent.absolute()


class Net(nn.Module):
    def __init__(self, backbone, features_size, num_classes):
        super(Net, self).__init__()
        # Resnet Backbone (includes avg pooling layer, takes off last FC layer)
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.out = nn.Linear(features_size, num_classes)

    def forward(self, inputs):
        """Returns network outputs and the features """
        # put images through ResNet backbone
        img_features = self.features(inputs)
        img_features = torch.flatten(img_features, start_dim=1)
        outputs = self.out(img_features)
        return outputs


backbone = models.resnet18(pretrained=False)
model = Net(backbone, NUM_FEATURES, NUM_CLASSES)
model.load_state_dict(
    torch.load(
        MODEL_FOLDER_PATH / "model.pt",
        map_location=torch.device('cpu')
    )
)
model.eval()
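# Standard ImageNet channel statistics; inputs must be normalized the
# same way the training images were.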
train_mean = [0.485, 0.456, 0.406]
train_std = [0.229, 0.224, 0.225]
img_normalize = transforms.Normalize(mean=train_mean, std=train_std)
transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ConvertImageDtype(torch.float),
    img_normalize,
])


def predict_dict(x: dict) -> np.ndarray:
    """Predicts on datapoint."""
    with torch.no_grad():
        # Load the image from disk as a uint8 CHW tensor in RGB.
        image = read_image(x["image_path"], mode=ImageReadMode.RGB)
        image = transform(image)
        image = torch.unsqueeze(image, 0)
        output = model(image)
        probs = torch.squeeze(torch.softmax(output, dim=1))
    return np.array(probs)
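
Before uploading, you can optionally smoke-test the file we just wrote. The sketch below is not part of the original workflow: it loads awa2/models/model.py with importlib and runs one prediction, assuming the downloaded data includes the trained weights at awa2/models/model.pt and that the image paths in the inputs JSON resolve locally.

[ ]:
import importlib.util

# Dynamically load the model module we just wrote to disk.
spec = importlib.util.spec_from_file_location("awa2_model", "awa2/models/model.py")
awa2_model = importlib.util.module_from_spec(spec)
spec.loader.exec_module(awa2_model)

# Run a single record through predict_dict; expect 40 class probabilities
# summing to ~1.0.
record = json.load(open("awa2/data/test_inputs_trial.json"))[0]
probs = awa2_model.predict_dict(record)
print(probs.shape, probs.sum())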
[ ]:
upload_path = "ri_public_examples_awa2"
rime_models_directory = rime_client.upload_directory("awa2/models", upload_path=upload_path)
rime_model_path = rime_models_directory + "/model.py"
[ ]:
from datetime import datetime

dt = str(datetime.now())

# All registered resources need to have unique names, so we append the current
# timestamp in case this notebook is rerun.
model_id = project.register_model_from_path(f"model_{dt}", rime_model_path)
[ ]:
train_inputs_file = "awa2/data/train_inputs_trial.json"
test_inputs_file = "awa2/data/test_inputs_trial.json"
_, train_inputs_path = rime_client.upload_local_image_dataset_file(
    train_inputs_file, ["image_path"], upload_path=upload_path
)
_, test_inputs_path = rime_client.upload_local_image_dataset_file(
    test_inputs_file, ["image_path"], upload_path=upload_path
)
[ ]:
class_names = [
    "antelope",
    "grizzly+bear",
    "killer+whale",
    "beaver",
    "dalmatian",
    "horse",
    "german+shepherd",
    "blue+whale",
    "siamese+cat",
    "skunk",
    "mole",
    "tiger",
    "moose",
    "spider+monkey",
    "elephant",
    "gorilla",
    "ox",
    "fox",
    "sheep",
    "hamster",
    "squirrel",
    "rhinoceros",
    "rabbit",
    "bat",
    "giraffe",
    "wolf",
    "chihuahua",
    "weasel",
    "otter",
    "buffalo",
    "zebra",
    "deer",
    "bobcat",
    "lion",
    "mouse",
    "polar+bear",
    "collie",
    "walrus",
    "cow",
    "dolphin",
]
data_info = {
    "image_features": ["image_path"],
    "label_col": "label",
    "class_names": class_names,
}
[ ]:
ref_id = project.register_dataset_from_file(
    f"ref_set_{dt}",
    train_inputs_path,
    data_info
)
eval_id = project.register_dataset_from_file(
    f"eval_set_{dt}",
    test_inputs_path,
    data_info,
)
[ ]:
ref_preds_path = rime_client.upload_file("awa2/data/train_preds_trial.json")
eval_preds_path = rime_client.upload_file("awa2/data/test_preds_trial.json")
project.register_predictions_from_file(
    ref_id, model_id, ref_preds_path
)
project.register_predictions_from_file(
    eval_id, model_id, eval_preds_path
)

Running a Stress Test

AI Stress Tests allow you to test your data and model before deployment. They are a comprehensive suite of hundreds of tests that automatically identify implicit assumptions and weaknesses of pre-production models. Each stress test is run on a single model and its associated reference and evaluation datasets.

Below is a sample configuration showing how to set up and run a RIME Stress Test for images.

[ ]:
stress_test_config = {
    "run_name": "Image Classification AWA2",
    "data_info": {
        "ref_dataset_id": ref_id,
        "eval_dataset_id": eval_id,
    },
    "model_id": model_id,
    "categories": ["TEST_CATEGORY_TYPE_TRANSFORMATIONS", "TEST_CATEGORY_TYPE_MODEL_PERFORMANCE", "TEST_CATEGORY_TYPE_SUBSET_PERFORMANCE"]
}

stress_job = rime_client.start_stress_test(stress_test_config, project.project_id)
stress_job.get_status(verbose=True, wait_until_finish=True)

Stress Test Results

Stress tests are grouped into categories that measure various aspects of model robustness (subset performance, distribution drift, abnormal inputs, transformations). Suggestions for improving your model are aggregated at the category level as well. By default, tests are ranked by a shared severity metric. Clicking on an individual test surfaces more detailed information.

You can view the detailed results in the UI by running the below cell and redirecting to the generated link. This page shows granular results for a given AI Stress Test run.

[ ]:
test_run = stress_job.get_test_run()
test_run

Analyzing the Results

Below you can see a snapshot of the results. Some of these tests, such as the Subset Performance tests, analyze how your model performs on different subsets of your data, grouped by image metadata properties, while others, such as the Transformations tests, analyze how your model reacts to augmented and perturbed images.

stress.png

Subset Performance Tests

Here are the results of the Subset Performance tests. These tests can be thought of as finer-grained performance tests that identify subsets of your data, defined by image metadata, on which the model underperforms. They help ensure that the model works equally well across different styles of images.

subset.png

Below we are exploring the “Subset F1 score” test cases for the image metadata feature ImageBrightness. We can see that even though the model has an overall F1 score of 0.52, it performs poorly on images at the tails of the brightness distribution - images that are either very dim or very bright.

f1.png

Transformation Tests

The results of the Transformations tests are below. These tests can be thought of as ways to probe your model’s response to augmented image data, which often occurs in the real world. They help ensure that your model is invariant to such changes in your data.

transform.png
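
For intuition about the kinds of perturbations these tests apply, the sketch below blurs an image tensor with torchvision and measures how much the input changed. This is purely illustrative and is not how RIME implements its transformation tests.

[ ]:
import torch
import torchvision.transforms as transforms

# Apply one example perturbation (Gaussian blur) to a stand-in image
# tensor. A robust model's predictions should change little under it.
blur = transforms.GaussianBlur(kernel_size=9, sigma=2.0)
image = torch.rand(3, 224, 224)  # placeholder for a real AwA2 image
blurred = blur(image)
print(float((image - blurred).abs().mean()))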

Programmatically Querying the Results

RIME not only provides an intuitive UI for visualizing and exploring these results, but also allows you to query them programmatically. This lets customers integrate RIME with their MLOps pipelines, log results to experiment management tools like MLFlow, bring automated decision making to their ML practices, or store results for future reference.

Run the cells below to programmatically query the results. The results are output as a pandas DataFrame.

Access results at the test run overview level

[ ]:
test_run_result = test_run.get_result_df()
test_run_result.to_csv("AWA2_Test_Run_Results.csv")
test_run_result

Access detailed results at the individual test case level

[ ]:
test_case_result = test_run.get_test_cases_df()
test_case_result.to_csv("AWA2_Test_Case_Results.csv")
test_case_result
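
For example, to revisit the ImageBrightness finding from the stress test results above, you can filter the test case DataFrame. Column names vary across RIME versions, so inspect test_case_result.columns first; the feature column name used below is an assumption, not a documented constant.

[ ]:
# Inspect the available columns before filtering (names vary by version).
print(test_case_result.columns.tolist())

# Assumed column name; adjust to match the printed columns above.
feature_col = "features"
mask = test_case_result[feature_col].astype(str).str.contains("ImageBrightness")
test_case_result[mask]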

Deploy to Production and Create the AI Firewall

Once you have identified the best stress test run, you can deploy the associated model and set up a RIME Firewall to run Continuous Testing in order to automatically detect “bad” incoming data and statistically significant distributional drift.

[ ]:
from datetime import timedelta

firewall = project.create_firewall(model_id, ref_id, timedelta(days=1))
firewall

Uploading a Batch of Production Data with Model Predictions to Firewall

The image classification model has now been in production for two weeks, and production data and model predictions have been collected and stored over that period. We will use the Firewall to track how the model performed across those two weeks.

Upload an Incremental Batch of Data

[ ]:
monitoring_inputs_file = "awa2/data/test_inputs_monitoring_trial.json"
_, monitoring_inputs_path = rime_client.upload_local_image_dataset_file(
    monitoring_inputs_file, ["image_path"], upload_path=upload_path)
monitoring_data_info = dict(data_info, timestamp_col="timestamp")
monitoring_id = project.register_dataset_from_file(
    f"monitoring_set_{dt}", monitoring_inputs_path, monitoring_data_info
)

monitoring_preds_path = rime_client.upload_file("awa2/data/monitoring_preds_trial.json")
project.register_predictions_from_file(
    monitoring_id, model_id, monitoring_preds_path
)

Run Continuous Testing over Batch of Data

[ ]:
ct_job = firewall.start_continuous_test(monitoring_id, override_existing_bins=True)
ct_job.get_status(verbose=True, wait_until_finish=True)

Wait a couple of minutes and your results will appear in the UI

Firewall CT Results

The AI Firewall’s Continuous Tests operate at the batch level and provide a mechanism to monitor the health of ML deployments in production. They allow the user to understand when errors begin to occur and surface the underlying drivers of such errors.

You can explore the results in the UI by running the below cell and redirecting to the generated link

[ ]:
firewall

Analyzing CT Results

Decreasing Model Accuracy over Time - In the image below, we can see that model accuracy has decreased since the model was first deployed. On 03/01, when the model was first deployed, the accuracy was 0.75. By 03/14, the accuracy had decreased to 0.4.

ct.png

Decreasing Model F1 Score over Time - We can also see that the model’s F1 score has decreased since the model was first deployed. On 03/01, the F1 score was 0.726. By 03/14, it had decreased to 0.4.

ct_f1.png
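
If you want to sanity-check numbers like these outside the UI, a minimal sketch is below. It assumes the monitoring inputs file carries the label and timestamp fields we registered above, that each row of the predictions file is a probability vector aligned with class_names, and that labels are stored either as class names or class indices.

[ ]:
import numpy as np
import pandas as pd

inputs = json.load(open("awa2/data/test_inputs_monitoring_trial.json"))
preds = json.load(open("awa2/data/monitoring_preds_trial.json"))

pred_idx = [int(np.argmax(p)) for p in preds]
labels = [x["label"] for x in inputs]
# Labels may be stored as class indices or class names; handle both.
correct = [p == y or class_names[p] == y for p, y in zip(pred_idx, labels)]

df = pd.DataFrame({
    "date": pd.to_datetime([x["timestamp"] for x in inputs]).date,
    "correct": correct,
})
df.groupby("date")["correct"].mean()  # rough per-day accuracy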