Data Configuration

To configure a data source, specify a mapping in the main RIME JSON configuration file in the data_params argument.

The data_params configuration can take on different forms, offering a tradeoff between simplicity and flexibility.

The “Register a reference dataset” step in Stress Tests walkthrough has an example of using this configuration in a data registry.

Default Data Configuration Template

{
    "connection_info": {...}
    "data_params": {
        "label_col": "",
        "pred_col": "",
        "timestamp_col": "",
        "class_names": [],
        "ranking_info":{
            "query_col":"",
            "nqueries": null,
            "nrows_per_query": null,
            "drop_query_id": null
        },

        "nrows": null,
        "nrows_per_time_bin": null,
        "sample": null,
        "categorical_features": null,
        "protected_features": null,
        "features_not_in_model": null,
        "text_features": null,
        "image_features": null,
        "features": null,
        "loading_kwargs": null,
        "feature_type_path": null,
        "pred_path": null,
        "image_load_path": null
    },
    ...
}

Parameters for the `data_params` object

General

Parameter	Type	Description
label_col	String	Naming of special columns.
pred_col	String	Column to look at for predictions.
timestamp_col	String	Column to look at for CT timestamp.
class_names	Repeated String	List of label class names.
ranking_info	JSON object	Contains parameters used for the ranking model task.
query_col	String	Name of column in dataset that contains the query IDs.
nqueries	Optional int64	Number of queries to consider. Uses all queries when null.
nrows_per_query	Optional int64	Number of rows to use per query. Uses all rows when null.
drop_query_id	Optional Boolean	Specifies whether to drop the query ID column from the dataset in order to prevent passing the query ID column to the model as a feature.

Dataset sizing

Parameter	Type	Description
nrows	Optional int64	Number of rows of data to load and test. Loads all rows when null and `sample` is not specified. Infers the maximum number rows possible when null and `sample` is specified.
nrows_per_time_bin	Optional int64	Number of rows of data per time bin to load and test in CT. Loads all rows when null.
sample	Optional Boolean	Specifies whether to sample rows in the data. Default is True.

Feature types and relations

Parameter	Type	Description
categorical_features	Repeated String	A list of categorical features.
protected_features	Repeated String	A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features.
features_not_in_model	Repeated String	A list of features not present in the model.
text_features	Repeated String	A list of text features to run NLP tests over.
image_features	Repeated String	A list of image features to run CV tests over.

Feature intersections

Parameter	Type	Description
features	Repeated String	A list of features to run tabular tests over.

External resources

Parameter	Type	Description
loading_kwargs	String	Keyword arguments passed to the pandas loading function. Do not specify `nrows` here.
feature_type_path	String	Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, `FeatureName` and `FeatureType`.
pred_path	String	Deprecated. Path to a CSV file or Parquet file that contains predictions.
image_load_path	String	Path to a python file that contains a `load_image` function defining custom logic for loading an image from the file path provided in the dataset.

Data Info template

The data_info format supports separately specifying a reference and evaluation dataset. Register your reference and evaluation datasets separately, then specify the unique IDs for each dataset in data_info.

{
    "data_info": {
        "ref_dataset_id": ...,
        "eval_dataset_id": ...,
    },
    ...
}

Arguments

Parameter	Type	Description
ref_dataset_id	String	Unique identifier of a reference dataset.
eval_dataset_id	String	Unique identifier of an evaluation dataset.

Single Data Info Templates

Use SingleDataInfo to specify reference and evaluation datasets in a split approach, as seen in the previous template. SingleDataInfo takes two elements, a connection_info object and a data_params object, detailed following this section.

All SingleDataInfo objects also take a set of parameters that enable you to specify additional data properties.

Continuous Testing requires that you specify a prediction set by setting a value for either the pred_col or pred_path variables.

{
    "connection_info": ...,
    "data_params": ...,
}

General Parameters for Single Data Info

Parameter	Type	Description
connection_info	ConnectionInfo	Path to a ConnectionInfo object.
data_params	DataInfoParams

Connection Info template

Specifies how to connect to a data source. Specify exactly one of the parameters in the following table.

Parameter	Type	Description
data_file	DataFileInfo	Information required by RIME to load a data file.
data_loading	DataLoadingInfo	Loads a data file with additional parameters
data_collector	DataCollectorInfo	Loads a data stream from a data collector
delta_lake	DeltaLakeInfo	Loads a Delta Lake table.
hugging_face	HuggingFaceDataInfo	Loads a HuggingFace dataset.

File-based Single Data Info Template

Uses ConnectionInfo and DataInfoParams objects, which are discussed earlier in this section.

{
    "connection_info": {
        "data_file": {
            "path": ""
        }
    }
    "data_params": ...,
}

Data Collector Single Data Info Template

This can only be specified as part of a Continuous Testing configuration.

{
    "connection_info": {
        "data_collector": {
            "data_stream_id": null,
            "start_time": 0,
            "end_time": 0
        }
    },
    "data_params": {},
}

Parameter	Type	Description
data_stream_id	rime.UUID	The unique identifier assigned by RIME to a data stream.
start_time	int64	The start time in seconds from the UNIX epoch.
end_time	int64	The end time in seconds from the UNIX epoch.

Delta Lake Single Data Info Template

Loads a Delta Lake table.

{
    "connection_info": {
        "delta_lake_info": {
            "table_name": "Table",
            "start_time": "1970-01-01 00:00:01",
            "end_time": "1970-01-01 00:00:02",
            "time_col": "Updated",
        }
    },
    "data_params": {},
}

Parameter	Type	Description
table_name	String	The name of the Delta Lake table.
start_time	int64	The start time in seconds from the UNIX epoch.
end_time	int64	The end time in seconds from the UNIX epoch.
time_col	string	The name of the column that contains the timestamp of the last update.

HuggingFace Single Data Info Template

Specifies how to load a HuggingFace dataset.

{
    "connection_info": {
        "hugging_face": {
            "dataset_uri": "",
            "split_name": "",
            "loading_params_json": ""
        }
    },
    "data_params": {}
}

Parameter	Type	Description
dataset_uri	String	The unique identifier of the dataset.
split_name	String	The name of a predefined subset of data.
loading_params_json	String	A JSON serialized string that contains loading parameters.