Data Configuration
==================

To configure a data source, specify a mapping under the `data_info` argument in the main RIME JSON configuration file.
By default, RIME can load any dataset from disk or cloud storage as long as the files are correctly formatted. RIME also supports user-defined dataloaders contained in a configured Python file, as well as a native integration with the [Hugging Face datasets hub](https://huggingface.co/datasets).

The `data_info` configuration tells RIME how to ingest data and expects different arguments depending on its `type`.
The `data_info` type can direct the RI Platform to load both the reference and evaluation datasets from the same source type, such as files, a user-defined dataloader, or a Hugging Face dataset.

Alternatively, to load reference and evaluation sets from different data sources, you can specify the "split" type. 
In the split approach, you may specify a separate "single" data info struct for each of the reference and evaluation datasets: `ref_data_info` and `eval_data_info`.
Each of the reference and evaluation single data structs can take on a different configuration type, similar to the types listed above.
The full list of single data info configuration templates is given below.

### Default Data Info Template
```python
{
    "data_info": {
        "ref_path": "path/to/ref.jsonl.gz",        (REQUIRED)
        "eval_path": "path/to/eval.jsonl.gz",      (REQUIRED)
        "embeddings": null
    },
    ...
}
```

### Arguments

- **`ref_path`**: string, ***required***

    Path to reference data file. Please reference the [NLP file guide](task_data_format) for a description of supported file formats.

- **`eval_path`**: string, ***required***
    
    Path to evaluation data file. Please reference the [NLP file guide](task_data_format) for a description of supported file formats.

- `embeddings`: list or `null`, *default* = `null`

    A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.

    - `key`: string

        Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by `{"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}`, specifying `embeddings: [{"key": "context_vec"}]` in the `data_info` would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
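As a minimal sketch, the default template above can be assembled in Python and serialized to JSON. The file paths are the placeholders from the template, `context_vec` is the illustrative embedding key from the example above, and the remaining top-level configuration keys are omitted.

```python
import json

# Default (file-based) data_info mapping.
# Paths are placeholders; "context_vec" is the illustrative embedding key.
config = {
    "data_info": {
        "ref_path": "path/to/ref.jsonl.gz",
        "eval_path": "path/to/eval.jsonl.gz",
        "embeddings": [{"key": "context_vec"}],
    }
}

# Serialize to JSON text. A Python None value would be written as JSON null.
config_json = json.dumps(config, indent=4)
print(config_json)
```

The printed text can be pasted into the `data_info` section of the main JSON configuration file.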

### Custom Dataloader Template
```python
{
    "data_info": {
        "type": "custom",                           (REQUIRED)
        "load_path": "path/to/dataloader.py",       (REQUIRED)
        "embeddings": null
    },
    ...
}
```

### Arguments

- **`type`**: string, ***required***

    Must be set to "custom".

- **`load_path`**: string, ***required***

    Path to the custom dataloader file. Please reference the [NLP Dataloader](specify_custom_dataloader) documentation for instructions on how to create a compatible file.

- `embeddings`: list or `null`, *default* = `null`

    A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.

    - `key`: string

        Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by `{"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}`, specifying `embeddings: [{"key": "context_vec"}]` in the `data_info` would direct the RI Platform to treat this value as a dense vector-valued embedding feature.

### Hugging Face Dataset Template

```python
{
    "data_info": {
        "type": "huggingface",                      (REQUIRED)
        "dataset_uri": "path",                      (REQUIRED)
        "ref_split": "train",
        "eval_split": "test",
        "text_key": "text",
        "text_pair_key": "text_pair",
        "label_key": "label",
        "eval_label_key": "label",
        "loading_params": null
    },
    ...
}
```

### Arguments

- **`type`**: string, ***required***

    Must be set to "huggingface".

- **`dataset_uri`**: string, ***required***

    The path or tag passed to ['load_dataset'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/loading_methods#datasets.load_dataset). 

- `ref_split`: string, *default* = "train"

    The key used to access the reference split from the downloaded ['DatasetDict'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/main_classes#datasets.DatasetDict).

- `eval_split`: string, *default* = "test"

    The key used to access the evaluation split from the downloaded ['DatasetDict'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/main_classes#datasets.DatasetDict).

- `text_key`: string, *default* = "text"

    The feature name for the NLP input text attribute.

- `text_pair_key`: string, *default* = "text_pair"

    The feature name for the NLP second input text attribute (for NLI model task).

- `label_key`: string or null, *default* = "label"

    The feature name for the label class ID. If `null`, don't load labels.

- `eval_label_key`: string or null, *default* = "label"

    The feature name for the label class ID in the evaluation split. If `null`, don't load labels.

- `loading_params`: dict or null, *default* = `null`

    Additional kwargs to pass to ['load_dataset'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/loading_methods#datasets.load_dataset). 
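The sketch below shows how the documented defaults combine with a partial Hugging Face `data_info`. The names `HF_DEFAULTS` and `with_hf_defaults` are hypothetical helpers written for illustration and are not part of RIME; the split name `"validation"` is likewise an assumed example value.

```python
# Defaults copied from the argument list above.
HF_DEFAULTS = {
    "ref_split": "train",
    "eval_split": "test",
    "text_key": "text",
    "text_pair_key": "text_pair",
    "label_key": "label",
    "eval_label_key": "label",
    "loading_params": None,
}


def with_hf_defaults(data_info):
    """Return a copy of data_info with the documented defaults filled in."""
    if data_info.get("type") != "huggingface":
        raise ValueError('type must be set to "huggingface"')
    if "dataset_uri" not in data_info:
        raise ValueError("dataset_uri is required")
    # Keys given by the user override the defaults.
    return {**HF_DEFAULTS, **data_info}


resolved = with_hf_defaults({
    "type": "huggingface",
    "dataset_uri": "path",          # placeholder, as in the template
    "eval_split": "validation",     # assumed split name
    "eval_label_key": None,         # do not load evaluation labels
})
print(resolved)
```

Only `type` and `dataset_uri` must be supplied; every other key falls back to the default listed above.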

### Split Data Info Template

```python
{
    "data_info": {
        "ref_data_info": ...,              (REQUIRED)
        "eval_data_info": ...,            (REQUIRED)
    },
    ...
}
```

### Arguments

- **`ref_data_info`**: SingleDataInfo, ***required***

    Single data info struct for the reference dataset (see below).

- **`eval_data_info`**: SingleDataInfo, ***required***

    Single data info struct for the evaluation dataset (see below).
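As an illustration, a split configuration might pair a file-based reference set with an evaluation set produced by a custom loader. This is a sketch only: the paths and function name are the placeholders used in the single data info templates below, and because the template above shows no `type` key, none is added here.

```python
import json

# Reference data from a file, evaluation data from a custom loader.
# All paths and names are placeholders taken from the templates.
split_config = {
    "data_info": {
        "ref_data_info": {
            "file_name": "path/to/file.csv",
        },
        "eval_data_info": {
            "load_path": "path/to/custom_loader.py",
            "load_func_name": "load_fn_name",
        },
    }
}

split_json = json.dumps(split_config, indent=4)
print(split_json)
```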


### Single Data Info Templates

Note that these single data info structs can be used to specify both the `ref_data_info` and the `eval_data_info`
in the split data info template above.

Note that *all* single data info structs also take in a set of NLP parameters which allow the user to additionally
specify properties of their data, such as predictions in `prediction_info` and embeddings in `embeddings`.
The full list is detailed below.

#### General NLP Parameters for Single Data Info

```python
{
    "prediction_info": null,
    "embeddings": null
}
```

#### Arguments

- `prediction_info`: mapping, *default* = `null`

    Arguments to specify prediction info. This struct is very similar to the `prediction_info` struct in the [Prediction Configuration](prediction_info.md) page.
    Note that only one of these two structs can be specified. If `prediction_info` is specified in reference and evaluation single data info structs,
    then it cannot also be specified as a separate top-level struct in the JSON configuration.

    - `path`: string or `null`, *default* = `null`

        Path to prediction cache corresponding to the data file. Please see the [NLP Prediction Cache Data Format](task_prediction_cache_format) reference for a description of supported file format.

    - `n_samples`: int or `null`, *default* = `null`

        Number of samples from each dataset to score. If both `ref_path` and `eval_path` are specified, this must be set to null. If either prediction cache is not specified and `n_samples` is set to `null`, the default is to score the entire dataset. If model throughput is low, it is recommended to use a prediction cache or specify a smaller value for `n_samples`.

- `embeddings`: list or `null`, *default* = `null`

    A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.

    - `key`: string

        Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by `{"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}`, specifying `embeddings: [{"key": "context_vec"}]` in the `data_info` would direct the RI Platform to treat this value as a dense vector-valued embedding feature.

#### File-based Single Data Info Template

```python
{
    "file_name": "path/to/file.csv",
    **nlp_params
}
```

#### Arguments

- **`file_name`**: string, ***required***

    Path to data file.

- `**nlp_params`: Dict

    See NLP Parameters above.
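The `**nlp_params` notation in the template reads like Python dictionary unpacking: the NLP parameters sit at the same level as `file_name`. A minimal sketch, assuming a hypothetical prediction cache path `path/to/preds.jsonl.gz` and the illustrative `context_vec` embedding key:

```python
import json

# General NLP parameters shared by all single data info structs.
nlp_params = {
    "prediction_info": {
        "path": "path/to/preds.jsonl.gz",  # hypothetical cache path
        "n_samples": None,
    },
    "embeddings": [{"key": "context_vec"}],
}

# File-based single data info with the NLP parameters merged in.
file_data_info = {
    "file_name": "path/to/file.csv",
    **nlp_params,
}

print(json.dumps(file_data_info, indent=4))
```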


#### Custom Dataloader Single Data Info Template

```python
{
    "load_path": "path/to/custom_loader.py",
    "load_func_name": "load_fn_name",
    "loader_kwargs": null,
    "loader_kwargs_json": null,
    **nlp_params
}
```

#### Arguments

- **`load_path`**: string, ***required***

    Path to custom loader Python file.

- **`load_func_name`**: string, ***required***

    Name of the loader function. Must be defined within the Python file.

- `loader_kwargs`: Dict, *default* = `null`
    
    Arguments to pass in to the loader function, in dictionary form. These arguments are passed in as `**kwargs`.
    Only one of `loader_kwargs` and `loader_kwargs_json` can be specified.

- `loader_kwargs_json`: string, *default* = `null`

    Arguments to pass in to the loader function, in JSON-serialized string form.
    These arguments are passed in as `**kwargs`.
    Only one of `loader_kwargs` and `loader_kwargs_json` can be specified.

- `**nlp_params`: Dict

    See NLP Parameters above.
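The two ways of passing loader arguments can be sketched as follows. The argument names in `loader_args` and the helper `check_loader_kwargs` are hypothetical and only illustrate the rule that the two keys are mutually exclusive; RIME's own validation may differ.

```python
import json

# Hypothetical arguments for the loader function.
loader_args = {"subset": "eval", "limit": 100}

# Option 1: pass the arguments as a dictionary.
as_dict = {
    "load_path": "path/to/custom_loader.py",
    "load_func_name": "load_fn_name",
    "loader_kwargs": loader_args,
}

# Option 2: pass the same arguments as a JSON-serialized string.
as_json_string = {
    "load_path": "path/to/custom_loader.py",
    "load_func_name": "load_fn_name",
    "loader_kwargs_json": json.dumps(loader_args),
}


def check_loader_kwargs(info):
    """Raise if both loader_kwargs and loader_kwargs_json are set."""
    both_set = (
        info.get("loader_kwargs") is not None
        and info.get("loader_kwargs_json") is not None
    )
    if both_set:
        raise ValueError(
            "Only one of loader_kwargs and loader_kwargs_json can be specified"
        )


check_loader_kwargs(as_dict)
check_loader_kwargs(as_json_string)
```

Decoding the string form with `json.loads` yields the same dictionary as the first option.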


#### Data Collector Single Data Info Template

NOTE: This can only be specified as part of a Continuous Testing config, not an offline testing config. See the
[Continuous Tests Configuration](firewall_continuous_tests.md) page for more details.

```python
{
    "start_time": start_time,
    "end_time": end_time,
    **nlp_params
}
```

#### Arguments

- **`start_time`**: int, ***required***

    Start time of the data collector to fetch data from. Format is UNIX epoch time in seconds.

- **`end_time`**: int, ***required***

    End time of the data collector to fetch data from. Format is UNIX epoch time in seconds.

- `**nlp_params`: Dict

    See NLP Parameters above.
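Since both times are UNIX epoch seconds, they can be derived from timezone-aware `datetime` objects. The dates below are arbitrary example values, not taken from any real deployment.

```python
from datetime import datetime, timezone

# Example window: one full day in UTC (arbitrary dates for illustration).
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, tzinfo=timezone.utc)

# timestamp() returns float seconds since the epoch; the config expects ints.
collector_data_info = {
    "start_time": int(start.timestamp()),
    "end_time": int(end.timestamp()),
}

print(collector_data_info)
```

Attaching an explicit timezone avoids the local-time interpretation that naive `datetime` objects receive in `timestamp()`.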

#### Hugging Face Single Data Info Template

```python
{
    "data_info": {
        "type": "huggingface",                      (REQUIRED)
        "dataset_uri": "path",                      (REQUIRED)
        "split_name": "train",
        "text_key": "text",
        "text_pair_key": "text_pair",
        "label_key": "label",
        "loading_params": null
    },
    ...
}
```

#### Arguments

- **`type`**: string, ***required***

    Must be set to "huggingface".

- **`dataset_uri`**: string, ***required***

    The path or tag passed to ['load_dataset'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/loading_methods#datasets.load_dataset). 

- `split_name`: string, *default* = "train"

    The key used to access the split from the downloaded ['DatasetDict'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/main_classes#datasets.DatasetDict).

- `text_key`: string, *default* = "text"

    The feature name for the NLP input text attribute.

- `text_pair_key`: string, *default* = "text_pair"

    The feature name for the NLP second input text attribute (for NLI model task).

- `label_key`: string or null, *default* = "label"

    The feature name for the label class ID. If `null`, don't load labels.

- `loading_params`: dict or null, *default* = `null`

    Additional kwargs to pass to ['load_dataset'](https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/loading_methods#datasets.load_dataset).