Data Configuration
To configure a data source, specify a mapping in the main RIME JSON configuration file
in the data_params argument.
The data_params configuration can take on different forms, offering a tradeoff between
simplicity and flexibility.
The “Register a reference dataset” step in the Stress Tests walkthrough has an example of using this configuration in a data registry.
Default Data Configuration Template
{
"connection_info": {...},
"data_params": {
"label_col": "",
"pred_col": "",
"timestamp_col": "",
"class_names": [],
"ranking_info":{
"query_col":"",
"nqueries": null,
"nrows_per_query": null,
"drop_query_id": null
},
"nrows": null,
"nrows_per_time_bin": null,
"sample": null,
"categorical_features": null,
"protected_features": null,
"features_not_in_model": null,
"text_features": null,
"image_features": null,
"features": null,
"loading_kwargs": null,
"feature_type_path": null,
"pred_path": null,
"image_load_path": null
},
...
}
Parameters for the data_params object
General
| Parameter | Type | Description |
|---|---|---|
| label_col | String | Name of the column that contains the labels. |
| pred_col | String | Name of the column that contains the predictions. |
| timestamp_col | String | Name of the column that contains the timestamps used in Continuous Testing (CT). |
| class_names | Repeated String | List of label class names. |
| ranking_info | JSON object | Contains parameters used for the ranking model task. |
| query_col | String | Name of column in dataset that contains the query IDs. |
| nqueries | Optional int64 | Number of queries to consider. Uses all queries when null. |
| nrows_per_query | Optional int64 | Number of rows to use per query. Uses all rows when null. |
| drop_query_id | Optional Boolean | Specifies whether to drop the query ID column from the dataset in order to prevent passing the query ID column to the model as a feature. |
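As an illustration of the parameters above, the following Python sketch assembles a data_params mapping for a ranking task and serializes it to JSON. The column names (relevance, score, query_id) are hypothetical placeholders chosen for this example, not values defined by RIME.

```python
import json

# Hypothetical column names; replace them with the columns in your own dataset.
data_params = {
    "label_col": "relevance",
    "pred_col": "score",
    "ranking_info": {
        "query_col": "query_id",
        "nqueries": None,         # serialized as null: use all queries
        "nrows_per_query": None,  # serialized as null: use all rows per query
        "drop_query_id": True,    # keep the query ID out of the model features
    },
}

# json.dumps converts Python None/True into JSON null/true.
data_params_json = json.dumps(data_params, indent=2)
print(data_params_json)
```

Building the mapping in code and serializing it avoids hand-written JSON mistakes such as missing or trailing commas.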
Dataset sizing
| Parameter | Type | Description |
|---|---|---|
| nrows | Optional int64 | Number of rows of data to load and test. Loads all rows when null and sample is not specified. Infers the maximum number of rows possible when null and sample is specified. |
| nrows_per_time_bin | Optional int64 | Number of rows of data per time bin to load and test in CT. Loads all rows when null. |
| sample | Optional Boolean | Specifies whether to sample rows in the data. Default is True. |
Feature types and relations
| Parameter | Type | Description |
|---|---|---|
| categorical_features | Repeated String | A list of categorical features. |
| protected_features | Repeated String | A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features. |
| features_not_in_model | Repeated String | A list of features not present in the model. |
| text_features | Repeated String | A list of text features to run NLP tests over. |
| image_features | Repeated String | A list of image features to run CV tests over. |
Feature intersections
| Parameter | Type | Description |
|---|---|---|
| features | Repeated String | A list of features to run tabular tests over. |
External resources
| Parameter | Type | Description |
|---|---|---|
| loading_kwargs | String | Keyword arguments passed to the pandas loading function. Do not specify nrows here. |
| feature_type_path | String | Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, FeatureName and FeatureType. |
| pred_path | String | Deprecated. Path to a CSV file or Parquet file that contains predictions. |
| image_load_path | String | Path to a python file that contains a load_image function defining custom logic for loading an image from the file path provided in the dataset. |
Data Info template
The data_info format supports separately specifying a reference and evaluation dataset.
Register your reference and evaluation datasets separately, then specify the unique
IDs for each dataset in data_info.
{
"data_info": {
"ref_dataset_id": ...,
"eval_dataset_id": ...
},
...
}
Arguments
| Parameter | Type | Description |
|---|---|---|
| ref_dataset_id | String | Unique identifier of a reference dataset. |
| eval_dataset_id | String | Unique identifier of an evaluation dataset. |
Single Data Info Templates
Use SingleDataInfo to specify the reference and evaluation datasets
separately, as in the previous template. A SingleDataInfo takes two elements:
a connection_info object, described under "Connection Info template", and a
data_params object, described under "Parameters for the data_params object".
All SingleDataInfo objects also take a set of parameters that enable you to specify
additional data properties.
Continuous Testing requires that you specify a prediction set by setting a value
for either the pred_col or pred_path variables.
{
"connection_info": ...,
"data_params": ...
}
General Parameters for Single Data Info
| Parameter | Type | Description |
|---|---|---|
| connection_info | ConnectionInfo | Path to a ConnectionInfo object. |
| data_params | DataInfoParams | Parameters that describe the dataset. See "Parameters for the data_params object". |
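The Continuous Testing requirement stated above (a value for either pred_col or pred_path) can be checked before a configuration is submitted. The following is a minimal sketch and not part of RIME: has_prediction_source is a hypothetical helper name, the file path and column names are placeholders, and the check assumes that an empty string or null means "not set".

```python
def has_prediction_source(single_data_info: dict) -> bool:
    """Return True if data_params sets pred_col or pred_path to a non-empty value."""
    data_params = single_data_info.get("data_params") or {}
    return bool(data_params.get("pred_col")) or bool(data_params.get("pred_path"))


# Hypothetical configurations for illustration only.
with_preds = {
    "connection_info": {"data_file": {"path": "data.csv"}},
    "data_params": {"label_col": "label", "pred_col": "prediction"},
}
without_preds = {
    "connection_info": {"data_file": {"path": "data.csv"}},
    "data_params": {"label_col": "label", "pred_col": ""},
}

print(has_prediction_source(with_preds))     # True
print(has_prediction_source(without_preds))  # False
```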
Connection Info template
Specifies how to connect to a data source. Specify exactly one of the parameters in the following table.
| Parameter | Type | Description |
|---|---|---|
| data_file | DataFileInfo | Information required by RIME to load a data file. |
| data_loading | DataLoadingInfo | Loads a data file with additional parameters. |
| data_collector | DataCollectorInfo | Loads a data stream from a data collector. |
| delta_lake | DeltaLakeInfo | Loads a Delta Lake table. |
| hugging_face | HuggingFaceDataInfo | Loads a HuggingFace dataset. |
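The "exactly one" rule can be enforced with a small check before a configuration is used. The sketch below is an illustration under stated assumptions rather than RIME's own validation: connection_type is a hypothetical helper, and the set of allowed keys is copied from the table above.

```python
# Keys copied from the Connection Info table above.
CONNECTION_KEYS = {"data_file", "data_loading", "data_collector", "delta_lake", "hugging_face"}


def connection_type(connection_info: dict) -> str:
    """Return the single connection key that is set, or raise ValueError."""
    unknown = [key for key in connection_info if key not in CONNECTION_KEYS]
    if unknown:
        raise ValueError(f"Unknown connection_info keys: {unknown}")
    present = [key for key in connection_info if key in CONNECTION_KEYS]
    if len(present) != 1:
        raise ValueError(
            f"Expected exactly one connection type, found {len(present)}: {present}"
        )
    return present[0]


print(connection_type({"data_file": {"path": ""}}))  # data_file
```

Passing an empty mapping, or a mapping with two connection types, raises ValueError instead of silently picking one.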
File-based Single Data Info Template
Uses ConnectionInfo and DataInfoParams objects, which are discussed earlier in this
section.
{
"connection_info": {
"data_file": {
"path": ""
}
},
"data_params": ...
}
Data Collector Single Data Info Template
A data collector connection can only be specified as part of a Continuous Testing configuration.
{
"connection_info": {
"data_collector": {
"data_stream_id": null,
"start_time": 0,
"end_time": 0
}
},
"data_params": {}
}
| Parameter | Type | Description |
|---|---|---|
| data_stream_id | rime.UUID | The unique identifier assigned by RIME to a data stream. |
| start_time | int64 | The start time in seconds from the UNIX epoch. |
| end_time | int64 | The end time in seconds from the UNIX epoch. |
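Because start_time and end_time are expressed in seconds from the UNIX epoch, it can help to derive them from explicit timestamps. The following sketch uses only the Python standard library; to_epoch_seconds is a hypothetical helper name and the dates are arbitrary examples.

```python
from datetime import datetime, timezone


def to_epoch_seconds(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the UNIX epoch."""
    if moment.tzinfo is None:
        raise ValueError("Use a timezone-aware datetime to avoid local-time ambiguity")
    return int(moment.timestamp())


start = to_epoch_seconds(datetime(2023, 1, 1, tzinfo=timezone.utc))
end = to_epoch_seconds(datetime(2023, 1, 2, tzinfo=timezone.utc))

data_collector = {
    "data_stream_id": None,  # placeholder; use the ID that RIME assigned to your data stream
    "start_time": start,
    "end_time": end,
}
print(data_collector["start_time"], data_collector["end_time"])
```

Rejecting naive datetimes is a design choice in this sketch: a naive value would be interpreted in the local time zone of the machine that runs the code, which can shift the time window.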
Delta Lake Single Data Info Template
Loads a Delta Lake table.
{
"connection_info": {
"delta_lake": {
"table_name": "Table",
"start_time": "1970-01-01 00:00:01",
"end_time": "1970-01-01 00:00:02",
"time_col": "Updated"
}
},
"data_params": {}
}
| Parameter | Type | Description |
|---|---|---|
| table_name | String | The name of the Delta Lake table. |
| start_time | int64 | The start time in seconds from the UNIX epoch. |
| end_time | int64 | The end time in seconds from the UNIX epoch. |
| time_col | String | The name of the column that contains the timestamp of the last update. |
HuggingFace Single Data Info Template
Specifies how to load a HuggingFace dataset.
{
"connection_info": {
"hugging_face": {
"dataset_uri": "",
"split_name": "",
"loading_params_json": ""
}
},
"data_params": {}
}
| Parameter | Type | Description |
|---|---|---|
| dataset_uri | String | The unique identifier of the dataset. |
| split_name | String | The name of a predefined subset of data. |
| loading_params_json | String | A JSON serialized string that contains loading parameters. |
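Since loading_params_json is a JSON serialized string rather than a nested object, the loading parameters need to be serialized before they are placed in the configuration. The sketch below shows one way to do this in Python. The dataset identifier, split name, and the revision parameter are hypothetical examples; which loading parameters are accepted is not specified here.

```python
import json

# Hypothetical loading parameters; the accepted keys depend on the dataset loader.
loading_params = {"revision": "main"}

hugging_face = {
    "dataset_uri": "imdb",  # hypothetical dataset identifier
    "split_name": "test",   # hypothetical split name
    # The field holds a string, so the parameters are serialized instead of nested.
    "loading_params_json": json.dumps(loading_params),
}

config = {"connection_info": {"hugging_face": hugging_face}, "data_params": {}}
print(json.dumps(config, indent=2))
```

In the printed configuration, the inner quotes of loading_params_json appear escaped because the value is a string that itself contains JSON.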