Data Configuration

To register your reference and evaluation datasets, you must configure a SingleDataInfo object, which contains two elements:

a connection_info object that specifies the location of the dataset; and
a data_params object that characterizes your dataset’s structure and how its contents will be used.

Parameter	Type	Description
connection_info	ConnectionInfo	Path to a ConnectionInfo object.
data_params	DataInfoParams	Path to a data_params object.

{
  "data_info": {
    "connection_info": {},
    "data_params": {}
  }
}

Set up your connection_info and data_params as shown below.

Connection Info

The connection_info object specifies how you will connect to your data source. Specify this as shown below for your data source type:

Cloud storage
Custom data loader
Databricks Delta Lake
Hugging Face dataset
Data collector (This approach is deprecated.)

Load a data file from a cloud data store

Use this approach for loading from a cloud storage service such as AWS S3 or Azure Blob Storage.

a) Parameter = data_file

b) Type = DataFileInfo

{
  "connection_info": {
    "data_file": {
      "path": ""
    }
  }
}

Parameter	Type	Description
path	string	The path to the file.
data_type	string	Default value is `"DATA_TYPE_UNSPECIFIED"`. Should be set to `"DATA_TYPE_DELTA_TABLE"` when the `path` points to a Delta Table on S3.

See the AWS data loading and Azure data loading examples.

Load a data file from a custom data loader

a) Parameter = data_loading

b) Type = DataLoadingInfo

{
  "connection_info": {
    "data_loading": {
      "path": "",
      "load_func_name": "",
      "loader_kwargs_json": "",
      "data_endpoint_integration_id": ""
    }
  }
}

Parameter	Type	Description
path	string	The path to the file.
load_func_name	string	The function to call from the file at the path.
loader_kwargs	string	A JSON String of keyword arguments to be passed into the load function.
data_endpoint_integration_id	Optional string	The UUID for an integration that has scerets as key/value pairs, that will be provided to the custom loader at runtime.

See the custom data loader example.

Load a Databricks Delta Lake table

a) parameter = databricks

b) Type = DatabricksInfo

{
  "connection_info": {
    "databricks": {
      "table_name": "Table"
    }
  }
}

Parameter	Type	Description
table_name	String	The name of the Delta Lake table.

See the Databricks Delta Lake example.

Load a Hugging Face dataset

a) Parameter = hugging_face

b) Type = HuggingFaceDataInfo

{
  "connection_info": {
    "hugging_face": {
      "dataset_uri": "",
      "split_name": "",
      "loading_params_json": ""
    }
  }
}

Parameter	Type	Description
dataset_uri	String	The unique identifier of the dataset.
split_name	String	The name of a predefined subset of data.
loading_params_json	String	A JSON serialized string that contains loading parameters.

See the Hugging Face example.

Load a data stream from a data collector (deprecated)

a) Parameter = data_collector

b) Type = DataCollectorInfo

{
  "connection_info": {
    "data_collector": {
      "data_stream_id": null,
      "start_time": 0,
      "end_time": 0
    }
  }
}

Note: This can only be specified as part of a Continuous Testing configuration.

Parameter	Type	Description
data_stream_id	rime.UUID	The unique identifier assigned by Robust Intelligence to a data stream.
start_time	int64	The start time in seconds from the UNIX epoch.
end_time	int64	The end time in seconds from the UNIX epoch.

Data Parameters

To configure a data source, specify a mapping in the main Robust Intelligence JSON configuration file in the data_params argument. The data_params configuration can take on different forms, offering a tradeoff between simplicity and flexibility.

Default Data Params Template

{
  "data_params": {
    "label_col": "",
    "timestamp_col": "",
    "class_names": [],
    "ranking_info":{
      "query_col":"",
      "nqueries": null,
      "nrows_per_query": null,
      "drop_query_id": null
    },
    "nrows": null,
    "nrows_per_time_bin": null,
    "sample": true,
    "categorical_features": [],
    "protected_features": [],
    "features_not_in_model": [],
    "text_features": [],
    "image_features": [],
    "intersections": [
      {
        "features": []
      }
    ],
    "loading_kwargs": "",
    "feature_type_path": "",
    "image_load_path": ""
  }
}

Parameters for the `data_params` object

General

Parameter	Type	Description
label_col	String	Naming of special columns.
timestamp_col	String	Column to look at for CT timestamp.
class_names	Repeated String	List of label class names.
ranking_info	JSON object	Contains parameters used for the ranking model task.
query_col	String	Name of column in dataset that contains the query IDs.
nqueries	Optional int64	Number of queries to consider. Uses all queries when null.
nrows_per_query	Optional int64	Number of rows to use per query. Uses all rows when null.
drop_query_id	Optional Boolean	Specifies whether to drop the query ID column from the dataset in order to prevent passing the query ID column to the model as a feature.

Dataset sizing

Parameter	Type	Description
nrows	Optional int64	Number of rows of data to load and test. Loads all rows when null and `sample` is not specified. Infers the maximum number of rows possible when null and `sample` is true. See Row Sampling.
nrows_per_time_bin	Optional int64	Number of rows of data per time bin to load and test in CT. Loads all rows when null.
sample	Optional Boolean	Specifies whether to sample rows in order to maximize test data given the available memory. Default is True. See Smart Dataset Sampling.

Feature types and relations

Parameter	Type	Description
categorical_features	Repeated String	A list of categorical features.
protected_features	Repeated String	A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features.
features_not_in_model	Repeated String	A list of features not present in the model.
text_features	Repeated String	A list of text features to run NLP tests over.
image_features	Repeated String	A list of image features to run CV tests over.

Feature intersections

Parameter	Type	Description
intersections	Repeated FeatureIntersection	A list of arrays of features, where each array represents the intersection of features on which certain subset and fairness tests are run.

The FeatureIntersection message is defined as follows:

Parameter	Type	Description
features	Repeated string	A list of feature names over which subgroups are generated.

External resources

Parameter	Type	Description
loading_kwargs	String	Keyword arguments passed to the pandas loading function. Do not specify `nrows` here.
feature_type_path	String	Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, `FeatureName` and `FeatureType`.
image_load_path	String	Path to a python file that contains a `load_image` function defining custom logic for loading an image from the file path provided in the dataset.