Data Configuration
A data source is configured by specifying a mapping under the data_info argument of the main RIME JSON configuration file. By default, RIME can load any dataset from disk or cloud storage as long as the files are correctly formatted. RIME also supports user-defined dataloaders contained in a configured Python file, as well as a native integration with the Huggingface datasets hub.
Stress Testing
Default Template
{
"data_info": {
"ref_path": "path/to/ref.jsonl.gz", (REQUIRED)
"eval_path": "path/to/eval.jsonl.gz", (REQUIRED)
},
...
}
Arguments
ref_path
: string, required. Path to the reference data file. See the NLP file guide for a description of supported file formats.
eval_path
: string, required. Path to the evaluation data file. See the NLP file guide for a description of supported file formats.
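As a minimal sketch, the default template can be generated from Python with the standard library. The helper name build_default_data_info is hypothetical and not part of RIME; only the data_info, ref_path, and eval_path keys come from the template above, and the other top-level keys of the configuration file are omitted.

```python
import json


def build_default_data_info(ref_path: str, eval_path: str) -> dict:
    """Return a config fragment holding the two required path arguments."""
    if not ref_path or not eval_path:
        raise ValueError("ref_path and eval_path are both required")
    return {"data_info": {"ref_path": ref_path, "eval_path": eval_path}}


# Placeholder paths copied from the template; replace with real file locations.
config = build_default_data_info("path/to/ref.jsonl.gz", "path/to/eval.jsonl.gz")

# Serialize to JSON text that can be merged into the main configuration file.
config_text = json.dumps(config, indent=2)
print(config_text)
```

Building the mapping in code rather than by hand avoids stray trailing commas, which strict JSON parsers reject.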
Custom Dataloader
{
"data_info": {
"type": "custom", (REQUIRED)
"load_path": "path/to/dataloader.py", (REQUIRED)
},
...
}
Arguments
type
: string, required. Must be set to “custom”.
load_path
: string, required. Path to the custom dataloader file. See the NLP Dataloader documentation for instructions on how to create a compatible file.
Huggingface Dataset
{
"data_info": {
"type": "huggingface", (REQUIRED)
"dataset_uri": "path", (REQUIRED)
"ref_split": "train",
"eval_split": "test",
"text_key": "text",
"label_key": "label",
"eval_label_key": "label",
"loading_params": null
},
...
}
Arguments
type
: string, required. Must be set to “huggingface”.
dataset_uri
: string, required. The path or tag passed to ‘load_dataset’.
ref_split
: string, default = “train”. The key used to access the reference split from the downloaded ‘DatasetDict’.
eval_split
: string, default = “test”. The key used to access the evaluation split from the downloaded ‘DatasetDict’.
text_key
: string, default = “text”. The feature name for the NLP input text attribute.
label_key
: string or null, default = “label”. The feature name for the label class ID. If null, labels are not loaded.
eval_label_key
: string or null, default = “label”. The feature name for the label class ID in the evaluation split. If null, labels are not loaded.
loading_params
: dict or null, default = null. Additional keyword arguments to pass to ‘load_dataset’.
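The sketch below illustrates how these arguments plausibly map onto a ‘load_dataset’ call, based only on the descriptions above. It is an assumption about the behavior, not RIME's actual implementation. The function load_splits and the stand-in fake_loader are hypothetical names; fake_loader replaces the real download so the example runs offline.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def load_splits(
    data_info: Dict[str, Any],
    load_fn: Optional[Callable[..., Any]] = None,
) -> Tuple[Any, Any]:
    """Fetch the reference and evaluation splits named in data_info."""
    if load_fn is None:
        # Imported lazily so the sketch also runs without `datasets` installed.
        from datasets import load_dataset
        load_fn = load_dataset
    # loading_params may be null (None); in that case no extra kwargs are passed.
    extra_kwargs = data_info.get("loading_params") or {}
    dataset_dict = load_fn(data_info["dataset_uri"], **extra_kwargs)
    # The split arguments are keys into the returned DatasetDict.
    return dataset_dict[data_info["ref_split"]], dataset_dict[data_info["eval_split"]]


def fake_loader(uri: str, **kwargs: Any) -> Dict[str, list]:
    # Hypothetical stand-in for a downloaded DatasetDict: split name -> rows.
    return {"train": ["ref row"], "test": ["eval row"]}


data_info = {
    "type": "huggingface",
    "dataset_uri": "path",
    "ref_split": "train",
    "eval_split": "test",
    "loading_params": None,
}
ref_data, eval_data = load_splits(data_info, load_fn=fake_loader)
```

Omitting load_fn makes the function call the real ‘load_dataset’, which requires the datasets package and, for hub datasets, network access.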