Data Configuration
==================

To configure a data source, specify a mapping in the `data_params` argument of the
main RIME JSON configuration file.

The `data_params` configuration can take on different forms, offering a tradeoff between 
simplicity and flexibility.

The "Register a reference dataset" step in the [Stress Tests walkthrough](../how_to_guides/common_use_cases/st-ui.md)
shows an example of using this configuration in a data registry.

### Default Data Configuration Template

```
{
    "connection_info": {...},
    "data_params": {
        "label_col": "",
        "pred_col": "",
        "timestamp_col": "",
        "class_names": [],
        "ranking_info":{
            "query_col":"",
            "nqueries": null,
            "nrows_per_query": null,
            "drop_query_id": null
        },

        "nrows": null,
        "nrows_per_time_bin": null,
        "sample": null,
        "categorical_features": null,
        "protected_features": null,
        "features_not_in_model": null,
        "text_features": null,
        "image_features": null,
        "features": null,
        "loading_kwargs": null,
        "feature_type_path": null,
        "pred_path": null,
        "image_load_path": null
    },
    ...
}
```

### Parameters for the `data_params` object

#### General

| Parameter | Type | Description |
|-----------|------|-------------|
| label_col | String | Name of the column that contains the labels. |
| pred_col | String | Name of the column that contains the predictions. |
| timestamp_col | String | Name of the column that contains the timestamps used by Continuous Testing (CT). |
| class_names | Repeated String | List of label class names. |
| ranking_info | JSON object | Contains parameters used for the ranking model task. |
| &nbsp;&nbsp;&nbsp; query_col | String | Name of column in dataset that contains the query IDs. |
| &nbsp;&nbsp;&nbsp; nqueries | Optional int64 | Number of queries to consider. Uses all queries when null. |
| &nbsp;&nbsp;&nbsp; nrows_per_query | Optional int64 | Number of rows to use per query. Uses all rows when null. |
| &nbsp;&nbsp;&nbsp; drop_query_id | Optional Boolean | Specifies whether to drop the query ID column from the dataset in order to prevent passing the query ID column to the model as a feature. |

<!-- hack to get around inability to indent within a cell -->
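
As an illustration of the general parameters, the following Python sketch assembles a `data_params` mapping for a ranking task and serializes it to JSON. The column names (`relevance`, `score`, `query_id`) and the numeric limit are hypothetical placeholders, not values taken from RIME.

```python
import json

# Hypothetical column names; replace them with the columns in your dataset.
data_params = {
    "label_col": "relevance",
    "pred_col": "score",
    "ranking_info": {
        "query_col": "query_id",
        "nqueries": 100,          # consider only 100 queries
        "nrows_per_query": None,  # None serializes to null: use all rows per query
        "drop_query_id": True,    # keep the query ID column out of the model's features
    },
}

data_params_json = json.dumps(data_params, indent=4)
print(data_params_json)
```

Python's `None` and `True` serialize to the JSON values `null` and `true`, which match the forms used in the template.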

#### Dataset sizing

| Parameter | Type | Description |
|-----------|------|-------------|
| nrows | Optional int64 | Number of rows of data to load and test. Loads all rows when null and `sample` is not specified. Infers the maximum number of rows possible when null and `sample` is specified. |
| nrows_per_time_bin | Optional int64 | Number of rows of data per time bin to load and test in CT. Loads all rows when null. |
| sample | Optional Boolean | Specifies whether to sample rows in the data. Default is True. |

#### Feature types and relations

| Parameter | Type | Description |
|-----------|------|-------------|
| categorical_features | Repeated String | A list of categorical features. |
| protected_features | Repeated String | A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features. |
| features_not_in_model | Repeated String | A list of features not present in the model. |
| text_features | Repeated String | A list of text features to run NLP tests over. |
| image_features | Repeated String | A list of image features to run CV tests over. |

#### Feature intersections

| Parameter | Type | Description |
|-----------|------|-------------|
| features | Repeated String | A list of features to run tabular tests over. |

#### External resources

| Parameter | Type | Description |
|-----------|------|-------------|
| loading_kwargs | String | Keyword arguments passed to the pandas loading function. Do not specify `nrows` here. |
| feature_type_path | String | Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, `FeatureName` and `FeatureType`. |
| pred_path | String | Deprecated. Path to a CSV file or Parquet file that contains predictions. |
| image_load_path | String | Path to a Python file that contains a `load_image` function defining custom logic for loading an image from the file path provided in the dataset. |

### Data Info template

The `data_info` format supports separately specifying a reference and evaluation dataset.
[Register](../how_to_guides/registries.md) your reference and evaluation datasets separately, then specify the unique
IDs for each dataset in `data_info`.

```
{
    "data_info": {
        "ref_dataset_id": ...,
        "eval_dataset_id": ...,
    },
    ...
}
```
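
A minimal Python sketch of filling in this template, assuming the two dataset IDs are the ones you obtained when registering the datasets. The ID strings below are hypothetical placeholders.

```python
import json

# Hypothetical IDs; use the unique IDs of your registered datasets.
ref_dataset_id = "my-reference-dataset-id"
eval_dataset_id = "my-evaluation-dataset-id"

config = {
    "data_info": {
        "ref_dataset_id": ref_dataset_id,
        "eval_dataset_id": eval_dataset_id,
    },
    # Other top-level configuration entries are omitted in this sketch.
}

config_json = json.dumps(config, indent=4)
print(config_json)
```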

### Arguments

| Parameter | Type | Description |
|-----------|------|-------------|
| ref_dataset_id | String | Unique identifier of a reference dataset. |
| eval_dataset_id | String | Unique identifier of an evaluation dataset. |

### Single Data Info Templates

Use `SingleDataInfo` to specify reference and evaluation datasets in a
split approach, as seen in the previous template. `SingleDataInfo` takes
two elements: a `connection_info` object, described in the sections that follow,
and a `data_params` object, described in
"Parameters for the `data_params` object" earlier on this page.

All `SingleDataInfo` objects also take a set of parameters that enable you to specify
additional data properties.

Continuous Testing requires that you specify a prediction set by setting a value
for either the `pred_col` or `pred_path` variables.

```
{
    "connection_info": ...,
    "data_params": ...,
}
```
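
The prediction requirement for Continuous Testing can be checked before a configuration is submitted. The helper below is a hypothetical sketch and not part of RIME. It treats null and empty-string values as unset, which is an assumption, and the file path and column names are placeholders.

```python
def has_prediction_source(single_data_info: dict) -> bool:
    """Return True if `pred_col` or `pred_path` is set in `data_params`.

    Hypothetical helper: None and empty strings are treated as unset.
    """
    data_params = single_data_info.get("data_params") or {}
    return bool(data_params.get("pred_col")) or bool(data_params.get("pred_path"))


# Hypothetical path and column names, for illustration only.
with_preds = {
    "connection_info": {"data_file": {"path": "data/eval.csv"}},
    "data_params": {"label_col": "label", "pred_col": "prediction"},
}
without_preds = {
    "connection_info": {"data_file": {"path": "data/eval.csv"}},
    "data_params": {"label_col": "label", "pred_col": ""},
}

print(has_prediction_source(with_preds))
print(has_prediction_source(without_preds))
```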

#### General Parameters for Single Data Info

| Parameter | Type | Description |
|-----------|------|-------------|
| connection_info | ConnectionInfo | Path to a ConnectionInfo object. |
| data_params | DataInfoParams | Path to a data_params object. |

#### Connection Info template

Specifies how to connect to a data source. Specify exactly one of the parameters in
the following table.

| Parameter | Type | Description |
|-----------|------|-------------|
| data_file | DataFileInfo | Information required by RIME to load a data file. |
| data_loading | DataLoadingInfo | Loads a data file with additional parameters. |
| data_collector | DataCollectorInfo | Loads a data stream from a data collector. |
| delta_lake | DeltaLakeInfo | Loads a Delta Lake table. |
| hugging_face | HuggingFaceDataInfo | Loads a HuggingFace dataset. |

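The "exactly one" rule for `connection_info` can be sketched as a small check in Python. The parameter names come from the table above. The helper itself is hypothetical and not part of RIME, and treating a null value as "not specified" is an assumption.

```python
# Parameter names taken from the Connection Info table.
CONNECTION_TYPES = (
    "data_file",
    "data_loading",
    "data_collector",
    "delta_lake",
    "hugging_face",
)


def connection_type(connection_info: dict) -> str:
    """Return the single connection parameter that is set, or raise ValueError."""
    present = [key for key in CONNECTION_TYPES if connection_info.get(key) is not None]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {CONNECTION_TYPES}, found {present}"
        )
    return present[0]


# Hypothetical file path, for illustration only.
print(connection_type({"data_file": {"path": "data/ref.csv"}}))
```
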
#### File-based Single Data Info Template

Uses `ConnectionInfo` and `DataInfoParams` objects, which are discussed earlier in this
section.

```
{
    "connection_info": {
        "data_file": {
            "path": ""
        }
    },
    "data_params": ...,
}
```

#### Data Collector Single Data Info Template

This template can only be specified as part of a Continuous Testing configuration.

```
{
    "connection_info": {
        "data_collector": {
            "data_stream_id": null,
            "start_time": 0,
            "end_time": 0
        }
    },
    "data_params": {}
}
```
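
For the data collector, `start_time` and `end_time` are expressed in seconds from the UNIX epoch. The sketch below derives them from timezone-aware `datetime` values using the Python standard library. The data stream ID is a hypothetical placeholder, and representing it as a plain string is an assumption.

```python
import json
from datetime import datetime, timezone


def to_epoch_seconds(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds from the UNIX epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time ambiguity")
    return int(moment.timestamp())


# Example window of one day, in UTC.
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, tzinfo=timezone.utc)

single_data_info = {
    "connection_info": {
        "data_collector": {
            # Hypothetical placeholder; use the ID that RIME assigned to your data stream.
            "data_stream_id": "00000000-0000-0000-0000-000000000000",
            "start_time": to_epoch_seconds(start),
            "end_time": to_epoch_seconds(end),
        }
    },
    "data_params": {},
}

print(json.dumps(single_data_info, indent=4))
```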

| Parameter | Type | Description |
|-----------|------|-------------|
| data_stream_id | rime.UUID | The unique identifier assigned by RIME to a data stream. |
| start_time | int64 | The start time in seconds from the UNIX epoch. |
| end_time | int64 | The end time in seconds from the UNIX epoch. |

#### Delta Lake Single Data Info Template

Loads a Delta Lake table.

```
{
    "connection_info": {
        "delta_lake": {
            "table_name": "Table",
            "start_time": "1970-01-01 00:00:01",
            "end_time": "1970-01-01 00:00:02",
            "time_col": "Updated"
        }
    },
    "data_params": {}
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| table_name | String | The name of the Delta Lake table. |
| start_time | int64 | The start time in seconds from the UNIX epoch. |
| end_time | int64 | The end time in seconds from the UNIX epoch. |
| time_col | String | The name of the column that contains the timestamp of the last update. |

#### HuggingFace Single Data Info Template

Specifies how to load a HuggingFace dataset.

```
{
    "connection_info": {
        "hugging_face": {
            "dataset_uri": "",
            "split_name": "",
            "loading_params_json": ""
        }
    },
    "data_params": {}
}
```
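
Because `loading_params_json` holds a JSON serialized string rather than a nested object, it is usually easier to produce it with a JSON library than to escape quotes by hand. In this Python sketch, the dataset URI, the split name, and the key inside `loading_params` are hypothetical; this page does not list which loading parameters are accepted.

```python
import json

# Hypothetical loading parameters; the accepted keys are not specified on this page.
loading_params = {"name": "example_subset"}

single_data_info = {
    "connection_info": {
        "hugging_face": {
            "dataset_uri": "example_org/example_dataset",  # hypothetical
            "split_name": "test",  # hypothetical
            # Serialize the nested mapping to a string, as the field type requires.
            "loading_params_json": json.dumps(loading_params),
        }
    },
    "data_params": {},
}

print(json.dumps(single_data_info, indent=4))
```

When the whole configuration is serialized, the inner string is escaped automatically, so the quotes inside `loading_params_json` appear as `\"` in the output file.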

| Parameter | Type | Description |
|-----------|------|-------------|
| dataset_uri | String | The unique identifier of the dataset. |
| split_name | String | The name of a predefined subset of data. |
| loading_params_json | String | A JSON serialized string that contains loading parameters. |