Data Configuration
==================

Configuring a data source can be done by specifying a mapping in the main RIME JSON configuration file, under the
`data_info` argument.

NOTE: for AI Continuous Testing, predictions are required. Either `pred_col` must be specified or `ref_pred_path` and `eval_pred_path` must be specified.

### Template
```python
{
    "data_info": {
        "ref_path": "path/to/ref.csv",        (REQUIRED)
        "eval_path": "path/to/eval.csv",      (REQUIRED)
        "label_col": "Label",
        "pred_col": "Prediction",             (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "ref_pred_path": "path/to/ref/preds.csv",  (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "eval_pred_path": "path/to/eval/preds.csv",  (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "nrows": null,
        "categorical_features": null,
        "loading_kwargs": null,
        "ranking_info": null,
        "protected_features": null
    },
    ...
}
```

### Arguments

- **`ref_path`**: string, ***required***

    Path to reference data file.
- **`eval_path`**: string, ***required***

    Path to evaluation data file.
- `label_col`: string or null, *default* = `null`

    Name of column in data that corresponds to the labels.
- `pred_col`: string or null, *default* = `null`

    Name of column in data that corresponds to the predictions.
- `ref_pred_path`: string or null, *default* = `null`

  Path to a csv or parquet file containing the predictions on the reference dataset. This is how
  predictions
  are specified for multi-class models.
- `eval_pred_path`: string or null, *default* = `null`

  Path to a csv or parquet file containing the predictions on the evaluation dataset. This is how
  predictions
  are specified for multi-class models.
- `nrows`: int or null, *default* = `null`

    Number of rows of data to load and test. If `null`, will load all rows. By default is `null`.
- `categorical_features`: list or null, *default* = `null`

    List of categorical features in data. If provided, these should be ALL the categorical features. If `null`, RIME will automatically determine whether a column is categorical or not. By default is `null`.
- `loading_kwargs`: mapping, *default* = `null`

    Keyword arguments to be passed to the `pandas` loading function (either `pd.read_csv` or `pd.read_parquet`, depending on your data format). NOTE: if you wish to specify `nrows`, this should NOT be done with these kwargs but rather with the `nrows` parameter above.
- `ranking_info`: mapping, *default* = `null`

    Arguments to be used for Ranking tasks. If you are not running RIME on a Ranking task this value should be null. If you are running on a Ranking task, the following keys should be provided:
    - `query_col`: string, *required*

        Name of column in dataset that contains the query ids.
    - `nqueries`: int or null, *default* = `null`

        Number of queries to consider when running RIME. If `null`, will use all queries.
    - `nrows_per_query`: int or null, *default* = `null`

        Number of rows to use per query when running RIME. If `null`, will use all rows.
    - `drop_query_id`: bool, *default* = True

        Whether to drop the query ID column from the dataset to avoid passing as a feature to the model.
- `protected_features`: list or null, *default* = `null`

    List of protected features in data. If `Compliance` category is added to `categories` in the test config (see [TestSuiteConfig()](tests.md), and `protected_features` are included - a set of compliance tests will be run over the protected features.