Data Configuration

To register reference and evaluation datasets, the user configures a SingleDataInfo object. SingleDataInfo takes two elements: a connection_info object and a data_params object.

| Parameter | Type | Description |
| --- | --- | --- |
| connection_info | ConnectionInfo | Specifies how to connect to the data source. |
| data_params | DataInfoParams | Specifies how to load and interpret the data. |
```json
{
  "data_info": {
    "connection_info": {},
    "data_params": {}
  }
}
```

A) Connection Info

This object specifies how to connect to a data source. Specify exactly one of the following parameters:

1) Loads a data file from a cloud data store (e.g., AWS S3, Azure Blob Storage).

a) Parameter = data_file

b) Type = DataFileInfo

```json
{
  "connection_info": {
    "data_file": {
      "path": ""
    }
  }
}
```
| Parameter | Type | Description |
| --- | --- | --- |
| path | String | The path to the file. |
| data_type | String | Defaults to "DATA_TYPE_UNSPECIFIED". Set to "DATA_TYPE_DELTA_TABLE" when the path points to a Delta Table on S3. |
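For example, a data_file connection to a Delta Table stored on S3 might look like the following. The bucket and key are placeholders, and data_type is set because the path points to a Delta Table rather than a plain data file:

```json
{
  "connection_info": {
    "data_file": {
      "path": "s3://my-bucket/tables/reference_delta_table",
      "data_type": "DATA_TYPE_DELTA_TABLE"
    }
  }
}
```

For an ordinary CSV or Parquet file, omit data_type (or leave it as "DATA_TYPE_UNSPECIFIED") and set path to the file's location.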

2) Loads a data file from a custom data loader.

a) Parameter = data_loading

b) Type = DataLoadingInfo

```json
{
  "connection_info": {
    "data_loading": {
      "path": "",
      "load_func_name": "",
      "loader_kwargs_json": ""
    }
  }
}
```
| Parameter | Type | Description |
| --- | --- | --- |
| path | String | The path to the file. |
| load_func_name | String | The name of the function to call from the file at the path. |
| loader_kwargs_json | String | A JSON string of keyword arguments passed to the load function. |
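A filled-in example might look like the following. The path and function name are placeholders; note that loader_kwargs_json is itself a JSON-encoded string, not a nested object, so inner quotes must be escaped:

```json
{
  "connection_info": {
    "data_loading": {
      "path": "s3://my-bucket/loaders/custom_loader.py",
      "load_func_name": "load_data",
      "loader_kwargs_json": "{\"version\": 2}"
    }
  }
}
```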

3) Loads a data stream from a data collector.

a) Parameter = data_collector

b) Type = DataCollectorInfo

```json
{
  "connection_info": {
    "data_collector": {
      "data_stream_id": null,
      "start_time": 0,
      "end_time": 0
    }
  }
}
```

Note: This can only be specified as part of a Continuous Testing configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| data_stream_id | rime.UUID | The unique identifier assigned by Robust Intelligence to a data stream. |
| start_time | Int64 | The start time, in seconds since the UNIX epoch. |
| end_time | Int64 | The end time, in seconds since the UNIX epoch. |
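A filled-in example might look like the following. The stream ID is a placeholder, and the timestamps select a 24-hour window in UNIX epoch seconds (here, January 1 to January 2, 2023 UTC):

```json
{
  "connection_info": {
    "data_collector": {
      "data_stream_id": "00000000-0000-0000-0000-000000000000",
      "start_time": 1672531200,
      "end_time": 1672617600
    }
  }
}
```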

4) Loads a Databricks Delta Lake table.

a) Parameter = databricks

b) Type = DatabricksInfo

```json
{
  "connection_info": {
    "databricks": {
      "table_name": "Table"
    }
  }
}
```
| Parameter | Type | Description |
| --- | --- | --- |
| table_name | String | The name of the Delta Lake table. |
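For example, with a hypothetical table named ref_transactions (the table name is a placeholder; use whatever name the table has in your Databricks workspace):

```json
{
  "connection_info": {
    "databricks": {
      "table_name": "ref_transactions"
    }
  }
}
```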

5) Loads a Hugging Face dataset.

a) Parameter = hugging_face

b) Type = HuggingFaceDataInfo

```json
{
  "connection_info": {
    "hugging_face": {
      "dataset_uri": "",
      "split_name": "",
      "loading_params_json": ""
    }
  }
}
```
| Parameter | Type | Description |
| --- | --- | --- |
| dataset_uri | String | The unique identifier of the dataset. |
| split_name | String | The name of a predefined subset of the data. |
| loading_params_json | String | A JSON-serialized string of loading parameters. |
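A filled-in example, shown here with the public imdb dataset and its test split purely as an illustration. As with loader_kwargs_json above, loading_params_json is a JSON-encoded string:

```json
{
  "connection_info": {
    "hugging_face": {
      "dataset_uri": "imdb",
      "split_name": "test",
      "loading_params_json": "{}"
    }
  }
}
```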

B) Data Parameters

To configure a data source, specify a mapping in the data_params argument of the main Robust Intelligence JSON configuration file. The data_params configuration can take several forms, offering a tradeoff between simplicity and flexibility.

Default Data Params Template

```json
{
  "data_params": {
    "label_col": "",
    "timestamp_col": "",
    "class_names": [],
    "ranking_info": {
      "query_col": "",
      "nqueries": null,
      "nrows_per_query": null,
      "drop_query_id": null
    },
    "nrows": null,
    "nrows_per_time_bin": null,
    "sample": true,
    "categorical_features": [],
    "protected_features": [],
    "features_not_in_model": [],
    "text_features": [],
    "image_features": [],
    "intersections": {
      "features": []
    },
    "loading_kwargs": "",
    "feature_type_path": "",
    "image_load_path": ""
  }
}
```

Parameters for the data_params object

General

| Parameter | Type | Description |
| --- | --- | --- |
| label_col | String | Name of the column that contains labels. |
| timestamp_col | String | Name of the column that contains the timestamp used by Continuous Testing. |
| class_names | Repeated String | List of label class names. |
| ranking_info | JSON object | Parameters used for the ranking model task. |
| ranking_info.query_col | String | Name of the column in the dataset that contains the query IDs. |
| ranking_info.nqueries | Optional Int64 | Number of queries to consider. Uses all queries when null. |
| ranking_info.nrows_per_query | Optional Int64 | Number of rows to use per query. Uses all rows when null. |
| ranking_info.drop_query_id | Optional Boolean | Whether to drop the query ID column from the dataset, to prevent passing the query ID column to the model as a feature. |
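For a ranking task, the general parameters might be filled in as follows. The column names and limits are placeholders chosen for illustration:

```json
{
  "data_params": {
    "label_col": "relevance",
    "ranking_info": {
      "query_col": "query_id",
      "nqueries": 1000,
      "nrows_per_query": 50,
      "drop_query_id": true
    }
  }
}
```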

Dataset sizing

| Parameter | Type | Description |
| --- | --- | --- |
| nrows | Optional Int64 | Number of rows of data to load and test. Loads all rows when null and sample is not specified. Infers the maximum number of rows possible when null and sample is true. |
| nrows_per_time_bin | Optional Int64 | Number of rows of data per time bin to load and test in Continuous Testing. Loads all rows when null. |
| sample | Optional Boolean | Whether to sample rows in the data. Default is true. |

Feature types and relations

| Parameter | Type | Description |
| --- | --- | --- |
| categorical_features | Repeated String | A list of categorical features. |
| protected_features | Repeated String | A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features. |
| features_not_in_model | Repeated String | A list of features not present in the model. |
| text_features | Repeated String | A list of text features to run NLP tests over. |
| image_features | Repeated String | A list of image features to run CV tests over. |

Feature intersections

| Parameter | Type | Description |
| --- | --- | --- |
| intersections | Repeated FeatureIntersection | A list of arrays of features, where each array represents the intersection of features on which certain subset and fairness tests are run. |

The FeatureIntersection message is defined as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| features | Repeated String | A list of feature names over which subgroups are generated. |

External resources

| Parameter | Type | Description |
| --- | --- | --- |
| loading_kwargs | String | Keyword arguments passed to the pandas loading function. Do not specify nrows here. |
| feature_type_path | String | Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, FeatureName and FeatureType. |
| image_load_path | String | Path to a Python file that contains a load_image function defining custom logic for loading an image from the file path provided in the dataset. |
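Putting the pieces together, a complete data_info for a reference dataset loaded from a cloud data store might look like the following. The path, column names, and row limit are all placeholders:

```json
{
  "data_info": {
    "connection_info": {
      "data_file": {
        "path": "s3://my-bucket/datasets/ref_data.csv"
      }
    },
    "data_params": {
      "label_col": "label",
      "timestamp_col": "timestamp",
      "categorical_features": ["country", "device_type"],
      "sample": true,
      "nrows": 5000
    }
  }
}
```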