Data Configuration
==================

To configure a data source, specify a mapping in the `data_params` argument of the main RIME JSON configuration file. The `data_params` configuration can take different forms, offering a tradeoff between simplicity and flexibility. The "Register a reference dataset" step in the [Stress Tests walkthrough](../how_to_guides/common_use_cases/st-ui.md) has an example of using this configuration in a data registry.

### Default Data Configuration Template

```
{
    "connection_info": {...},
    "data_params": {
        "label_col": "",
        "pred_col": "",
        "timestamp_col": "",
        "class_names": [],
        "ranking_info": {
            "query_col": "",
            "nqueries": null,
            "nrows_per_query": null,
            "drop_query_id": null
        },
        "nrows": null,
        "nrows_per_time_bin": null,
        "sample": null,
        "categorical_features": null,
        "protected_features": null,
        "features_not_in_model": null,
        "text_features": null,
        "image_features": null,
        "features": null,
        "loading_kwargs": null,
        "feature_type_path": null,
        "pred_path": null,
        "image_load_path": null
    },
    ...
}
```

### Parameters for the `data_params` object

#### General

| Parameter | Type | Description |
|-----------|------|-------------|
| label_col | String | Name of the column that contains the labels. |
| pred_col | String | Name of the column that contains the predictions. |
| timestamp_col | String | Name of the column that contains the timestamps used by Continuous Testing (CT). |
| class_names | Repeated String | List of label class names. |
| ranking_info | JSON object | Contains parameters used for the ranking model task. |
| &nbsp;&nbsp;&nbsp;&nbsp;query_col | String | Name of the column in the dataset that contains the query IDs. |
| &nbsp;&nbsp;&nbsp;&nbsp;nqueries | Optional int64 | Number of queries to consider. Uses all queries when null. |
| &nbsp;&nbsp;&nbsp;&nbsp;nrows_per_query | Optional int64 | Number of rows to use per query. Uses all rows when null. |
| &nbsp;&nbsp;&nbsp;&nbsp;drop_query_id | Optional Boolean | Specifies whether to drop the query ID column from the dataset so that it is not passed to the model as a feature. |

#### Dataset sizing

| Parameter | Type | Description |
|-----------|------|-------------|
| nrows | Optional int64 | Number of rows of data to load and test. Loads all rows when null and `sample` is not specified. Infers the maximum number of rows possible when null and `sample` is specified. |
| nrows_per_time_bin | Optional int64 | Number of rows of data per time bin to load and test in CT. Loads all rows when null. |
| sample | Optional Boolean | Specifies whether to sample rows in the data. Default is `true`. |

#### Feature types and relations

| Parameter | Type | Description |
|-----------|------|-------------|
| categorical_features | Repeated String | A list of categorical features. |
| protected_features | Repeated String | A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features. |
| features_not_in_model | Repeated String | A list of features not present in the model. |
| text_features | Repeated String | A list of text features to run NLP tests over. |
| image_features | Repeated String | A list of image features to run CV tests over. |

#### Feature intersections

| Parameter | Type | Description |
|-----------|------|-------------|
| features | Repeated String | A list of features to run tabular tests over. |

#### External resources

| Parameter | Type | Description |
|-----------|------|-------------|
| loading_kwargs | String | Keyword arguments passed to the pandas loading function. Do not specify `nrows` here. |
| feature_type_path | String | Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, `FeatureName` and `FeatureType`. |
| pred_path | String | Deprecated. Path to a CSV file or Parquet file that contains predictions. |
| image_load_path | String | Path to a Python file that contains a `load_image` function defining custom logic for loading an image from the file path provided in the dataset. |

### Data Info template

The `data_info` format supports separately specifying a reference and evaluation dataset. [Register](../how_to_guides/registries.md) your reference and evaluation datasets separately, then specify the unique IDs for each dataset in `data_info`.

```
{
    "data_info": {
        "ref_dataset_id": ...,
        "eval_dataset_id": ...
    },
    ...
}
```

### Arguments

| Parameter | Type | Description |
|-----------|------|-------------|
| ref_dataset_id | String | Unique identifier of a reference dataset. |
| eval_dataset_id | String | Unique identifier of an evaluation dataset. |

### Single Data Info Templates

Use `SingleDataInfo` to specify reference and evaluation datasets in a split approach, as seen in the previous template. `SingleDataInfo` takes two elements: a `connection_info` object, detailed below, and a `data_params` object, detailed in the parameter tables above. All `SingleDataInfo` objects also take a set of parameters that enable you to specify additional data properties. Continuous Testing requires that you specify a prediction set by setting a value for either `pred_col` or `pred_path`.

```
{
    "connection_info": ...,
    "data_params": ...
}
```

#### General Parameters for Single Data Info

| Parameter | Type | Description |
|-----------|------|-------------|
| connection_info | ConnectionInfo | Path to a ConnectionInfo object. |
| data_params | DataInfoParams | Path to a data_params object. |

#### Connection Info template

Specifies how to connect to a data source. Specify exactly one of the parameters in the following table.

| Parameter | Type | Description |
|-----------|------|-------------|
| data_file | DataFileInfo | Information required by RIME to load a data file. |
| data_loading | DataLoadingInfo | Loads a data file with additional parameters. |
| data_collector | DataCollectorInfo | Loads a data stream from a data collector. |
| delta_lake | DeltaLakeInfo | Loads a Delta Lake table. |
| hugging_face | HuggingFaceDataInfo | Loads a HuggingFace dataset. |

#### File-based Single Data Info Template

Uses `ConnectionInfo` and `DataInfoParams` objects, which are discussed earlier in this section.

```
{
    "connection_info": {
        "data_file": {
            "path": ""
        }
    },
    "data_params": ...
}
```

#### Data Collector Single Data Info Template

This can only be specified as part of a Continuous Testing configuration.

```
{
    "connection_info": {
        "data_collector": {
            "data_stream_id": null,
            "start_time": 0,
            "end_time": 0
        }
    },
    "data_params": {}
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| data_stream_id | rime.UUID | The unique identifier assigned by RIME to a data stream. |
| start_time | int64 | The start time in seconds from the UNIX epoch. |
| end_time | int64 | The end time in seconds from the UNIX epoch. |

#### Delta Lake Single Data Info Template

Loads a Delta Lake table.

```
{
    "connection_info": {
        "delta_lake_info": {
            "table_name": "Table",
            "start_time": "1970-01-01 00:00:01",
            "end_time": "1970-01-01 00:00:02",
            "time_col": "Updated"
        }
    },
    "data_params": {}
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| table_name | String | The name of the Delta Lake table. |
| start_time | int64 | The start time in seconds from the UNIX epoch. |
| end_time | int64 | The end time in seconds from the UNIX epoch. |
| time_col | String | The name of the column that contains the timestamp of the last update. |

#### HuggingFace Single Data Info Template

Specifies how to load a HuggingFace dataset.

```
{
    "connection_info": {
        "hugging_face": {
            "dataset_uri": "",
            "split_name": "",
            "loading_params_json": ""
        }
    },
    "data_params": {}
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| dataset_uri | String | The unique identifier of the dataset. |
| split_name | String | The name of a predefined subset of data. |
| loading_params_json | String | A JSON serialized string that contains loading parameters. |
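
Because `loading_params_json` is a JSON serialized string rather than a nested object, the loading parameters have to be encoded before they are placed in the configuration. The following Python sketch shows one way to assemble the HuggingFace template and to check the "exactly one connection type" rule from the Connection Info table. The helper names `build_hugging_face_data_info` and `check_single_connection`, the dataset URI, and the loading parameter keys are all hypothetical and are not part of RIME. The set of connection keys is copied from the Connection Info table.

```python
import json

# Connection types as named in the Connection Info table (assumed to be the
# complete set of keys allowed under "connection_info").
CONNECTION_KEYS = {
    "data_file",
    "data_loading",
    "data_collector",
    "delta_lake",
    "hugging_face",
}


def build_hugging_face_data_info(dataset_uri, split_name, loading_params, data_params=None):
    """Hypothetical helper: build a SingleDataInfo-shaped dict for a HuggingFace dataset."""
    return {
        "connection_info": {
            "hugging_face": {
                "dataset_uri": dataset_uri,
                "split_name": split_name,
                # Encode the parameters as a string, as the table describes,
                # instead of nesting the dict directly.
                "loading_params_json": json.dumps(loading_params),
            }
        },
        "data_params": data_params or {},
    }


def check_single_connection(data_info):
    """Hypothetical helper: return the single connection type, or raise ValueError."""
    present = CONNECTION_KEYS & set(data_info["connection_info"])
    if len(present) != 1:
        raise ValueError(f"expected exactly one connection type, found {sorted(present)}")
    return next(iter(present))


config = build_hugging_face_data_info(
    dataset_uri="example/dataset",            # placeholder value
    split_name="train",                       # placeholder value
    loading_params={"example_option": True},  # illustrative key, not a documented option
    data_params={"label_col": "label", "text_features": ["text"]},
)
check_single_connection(config)
print(json.dumps(config, indent=4))
```

The printed output can be pasted into a configuration file. Note that `loading_params_json` appears there as an escaped string, which is the intended result of the double encoding.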