Data Configuration
Configuring a data source can be done by specifying a mapping in the main RIME JSON configuration file, under the
data_info argument. By default, RIME can load any dataset from disk or cloud storage so long as the files are correctly formatted. RIME additionally supports user-defined dataloaders contained in a configured Python file, as well as a native integration with the Hugging Face datasets hub.
The data_info configuration communicates how RIME should ingest data and expects different arguments depending on its type.
The data_info type can direct the RI Platform to load both the reference and evaluation datasets from the same source type, such as files, a user-defined dataloader, or a Hugging Face dataset.
Alternatively, to load reference and evaluation sets from different data sources, you can specify the “split” type.
In the split approach, you may specify a separate “single” data info struct for each of the reference and evaluation datasets: ref_data_info and eval_data_info.
Each of the reference and evaluation single data structs can take on a different configuration type, similar to the types listed above.
The full list of single data info configuration templates is given below.
Default Data Info Template
{
"data_info": {
"ref_path": "path/to/ref.jsonl.gz", (REQUIRED)
"eval_path": "path/to/eval.jsonl.gz", (REQUIRED)
"embeddings": null
},
...
}
Arguments
ref_path: string, required
Path to the reference data file. Please reference the NLP file guide for a description of supported file formats.

eval_path: string, required
Path to the evaluation data file. Please reference the NLP file guide for a description of supported file formats.

embeddings: list or null, default = null
A list of dictionaries, one per embedding. The arguments for each dictionary are described below.

key: string
Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
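To make the expected on-disk shape concrete, the sketch below writes and reads back a small gzipped JSONL file in the spirit of the ref.jsonl.gz path in the template. The field names mirror the data-dictionary example above and are illustrative only; consult the NLP file guide for the authoritative format.

```python
import gzip
import json

# Illustrative records following the data-dictionary example above.
records = [
    {"text": "great movie", "label": 1, "probabilities": [0.1, 0.9], "context_vec": [0.3, 0.7]},
    {"text": "dull plot", "label": 0, "probabilities": [0.8, 0.2], "context_vec": [0.5, 0.1]},
]

# Write one JSON object per line, gzip-compressed (the .jsonl.gz convention).
with gzip.open("ref.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back to confirm the round trip.
with gzip.open("ref.jsonl.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```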
Custom Dataloader Template
{
"data_info": {
"type": "custom", (REQUIRED)
"load_path": "path/to/dataloader.py", (REQUIRED)
"embeddings": null
},
...
}
Arguments
type: string, required
Must be set to “custom”.

load_path: string, required
Path to the custom dataloader file. Please reference the NLP Dataloader documentation for instructions on how to create a compatible file.

embeddings: list or null, default = null
A list of dictionaries, one per embedding. The arguments for each dictionary are described below.

key: string
Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
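The required contents of the dataloader file are defined in the NLP Dataloader documentation. Purely for illustration, the sketch below assumes a pair of zero-argument functions returning lists of data dictionaries; the function names (get_ref_data, get_eval_data) and signatures are hypothetical, not the documented contract.

```python
# dataloader.py -- hypothetical custom dataloader sketch.
# Function names and signatures here are illustrative placeholders;
# see the NLP Dataloader documentation for the actual contract.

def _make_examples(texts, labels):
    # Each data point is a dictionary, as in the embeddings example above.
    return [{"text": t, "label": l} for t, l in zip(texts, labels)]

def get_ref_data():
    """Return the reference dataset as a list of data dictionaries."""
    return _make_examples(["great movie", "dull plot"], [1, 0])

def get_eval_data():
    """Return the evaluation dataset as a list of data dictionaries."""
    return _make_examples(["a solid watch"], [1])
```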
Hugging Face Dataset Template
{
"data_info": {
"type": "huggingface", (REQUIRED)
"dataset_uri": "path", (REQUIRED)
"ref_split": "train",
"eval_split": "test",
"text_key": "text",
"text_pair_key": "text_pair",
"label_key": "label",
"eval_label_key": "label",
"loading_params": null
},
...
}
Arguments
type: string, required
Must be set to “huggingface”.

dataset_uri: string, required
The path or tag passed to load_dataset.

ref_split: string, default = “train”
The key used to access the reference split from the downloaded DatasetDict.

eval_split: string, default = “test”
The key used to access the evaluation split from the downloaded DatasetDict.

text_key: string, default = “text”
The feature name for the NLP input text attribute.

text_pair_key: string, default = “text_pair”
The feature name for the second NLP input text attribute (for the NLI model task).

label_key: string or null, default = “label”
The feature name for the label class ID. If null, labels are not loaded.

eval_label_key: string or null, default = “label”
The feature name for the label class ID in the evaluation split. If null, labels are not loaded.

loading_params: dict or null, default = null
Additional kwargs to pass to load_dataset.
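To make the field semantics concrete, the helper below sketches how such a mapping might translate into arguments for datasets.load_dataset plus the DatasetDict keys used to pick out each split. resolve_hf_config is a hypothetical illustration, not a RIME API; it only assembles arguments and downloads nothing.

```python
def resolve_hf_config(data_info):
    """Sketch: turn a 'huggingface' data_info mapping into the arguments
    that would be passed to datasets.load_dataset, plus the DatasetDict
    keys for the reference and evaluation splits. Hypothetical helper."""
    assert data_info["type"] == "huggingface"
    return {
        "load_dataset_args": (data_info["dataset_uri"],),
        "load_dataset_kwargs": data_info.get("loading_params") or {},
        "ref_split": data_info.get("ref_split", "train"),
        "eval_split": data_info.get("eval_split", "test"),
        "text_key": data_info.get("text_key", "text"),
        "label_key": data_info.get("label_key", "label"),
    }

# Example: the public "imdb" sentiment dataset, pinning a revision
# through loading_params (values are illustrative).
resolved = resolve_hf_config({
    "type": "huggingface",
    "dataset_uri": "imdb",
    "eval_split": "test",
    "loading_params": {"revision": "main"},
})
```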
Split Data Info Template
{
"data_info": {
"ref_data_info": ..., (REQUIRED)
"eval_data_info": ... (REQUIRED)
},
...
}
Arguments
ref_data_info: SingleDataInfo, required
A single data info struct for the reference dataset (see below).

eval_data_info: SingleDataInfo, required
A single data info struct for the evaluation dataset (see below).
Single Data Info Templates
Note that these single data info structs can be used to specify both ref_data_info and eval_data_info in the split data info template above.
All single data info structs also accept a set of NLP parameters that let the user specify additional properties of their data, such as predictions in prediction_info and embeddings in embeddings.
The full list is detailed below.
General NLP Parameters for Single Data Info
{
"prediction_info": null,
"embeddings": null
}
Arguments
prediction_info: mapping, default = null
Arguments to specify prediction info. Very similar to the prediction_info struct on the Prediction Configuration page. Note that only one of these two structs can be specified: if prediction_info is specified in the reference and evaluation single data info structs, it cannot also be specified as a separate top-level struct in the JSON configuration.

path: string or null, default = null
Path to the prediction cache corresponding to the data file. Please see the NLP Prediction Cache Data Format reference for a description of the supported file format.

n_samples: int or null, default = null
Number of samples from each dataset to score. If both ref_path and eval_path are specified, this must be set to null. If either prediction cache is not specified and n_samples is set to null, the default is to score the entire dataset. If model throughput is low, it is recommended to use a prediction cache or to specify a smaller value for n_samples.

embeddings: list or null, default = null
A list of dictionaries, one per embedding. The arguments for each dictionary are described below.

key: string
Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
File-based Single Data Info Template
{
"file_name": "path/to/file.csv",
**nlp_params
}
Arguments
file_name: string, required
Path to the data file.

**nlp_params: Dict
See NLP Parameters above.
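Putting the pieces together, a split-type data_info that pairs two file-based single data info structs might look like the sketch below. The paths and the nested prediction_info are illustrative placeholders.

```python
import json

# Hypothetical split-type data_info: reference and evaluation datasets
# are each described by a file-based single data info struct.
data_info = {
    "ref_data_info": {"file_name": "data/ref.csv"},
    "eval_data_info": {
        "file_name": "data/eval.csv",
        # Example NLP parameter: a prediction cache for the eval set.
        "prediction_info": {"path": "data/eval_preds.jsonl"},
    },
}

# Embed it under the top-level data_info key of the RIME JSON config.
config_json = json.dumps({"data_info": data_info}, indent=2)
```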
Custom Dataloader Single Data Info Template
{
"load_path": "path/to/custom_loader.py",
"load_func_name": "load_fn_name",
"loader_kwargs": null,
"loader_kwargs_json": null,
**nlp_params
}
Arguments
load_path: string, required
Path to the custom loader Python file.

load_func_name: string, required
Name of the loader function. Must be defined within the Python file.

loader_kwargs: Dict, default = null
Arguments to pass to the loader function, in dictionary form. We pass these arguments in as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.

loader_kwargs_json: Dict
Arguments to pass to the loader function, in JSON-serialized string form. We pass these arguments in as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.

**nlp_params: Dict
See NLP Parameters above.
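The mutual exclusivity of loader_kwargs and loader_kwargs_json can be pictured with the hypothetical helper below, which resolves whichever form is present into a single kwargs dict and forwards it to the loader function. Neither resolve_loader_kwargs nor load_data is a RIME API; both are sketches.

```python
import json

def resolve_loader_kwargs(info):
    """Sketch: merge loader_kwargs / loader_kwargs_json into one dict,
    enforcing that at most one of the two is specified."""
    kwargs = info.get("loader_kwargs")
    kwargs_json = info.get("loader_kwargs_json")
    if kwargs is not None and kwargs_json is not None:
        raise ValueError("Only one of loader_kwargs and loader_kwargs_json can be specified.")
    if kwargs_json is not None:
        kwargs = json.loads(kwargs_json)
    return kwargs or {}

def load_data(split="train", limit=None):
    # Stand-in for a user-defined loader function named by load_func_name.
    return {"split": split, "limit": limit}

# JSON-serialized string form, forwarded to the loader as **kwargs.
info = {"loader_kwargs_json": "{\"split\": \"eval\", \"limit\": 100}"}
result = load_data(**resolve_loader_kwargs(info))
```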
Data Collector Single Data Info Template
NOTE: this can only be specified as part of a Continuous Testing config, not an offline testing config. See the Continuous Tests Configuration for more details.
{
"start_time": start_time,
"end_time": end_time,
**nlp_params
}
Arguments
start_time: int, required
Start of the time window from which the data collector fetches data. Format is UNIX epoch time in seconds.

end_time: int, required
End of the time window from which the data collector fetches data. Format is UNIX epoch time in seconds.

**nlp_params: Dict
See NLP Parameters above.
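Since start_time and end_time are UNIX epoch seconds, they can be computed with the standard library. The sketch below builds a one-day window ending at a fixed, illustrative UTC timestamp.

```python
from datetime import datetime, timedelta, timezone

# A fixed, illustrative end point: 2023-06-01 00:00:00 UTC.
end_dt = datetime(2023, 6, 1, tzinfo=timezone.utc)
start_dt = end_dt - timedelta(days=1)

# UNIX epoch seconds, as expected by start_time / end_time.
data_info = {
    "start_time": int(start_dt.timestamp()),
    "end_time": int(end_dt.timestamp()),
}
```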
Hugging Face Single Data Info Template
{
"type": "huggingface", (REQUIRED)
"dataset_uri": "path", (REQUIRED)
"split_name": "train",
"text_key": "text",
"text_pair_key": "text_pair",
"label_key": "label",
"loading_params": null
}
Arguments
type: string, required
Must be set to “huggingface”.

dataset_uri: string, required
The path or tag passed to load_dataset.

split_name: string, default = “train”
The key used to access the split from the downloaded DatasetDict.

text_key: string, default = “text”
The feature name for the NLP input text attribute.

text_pair_key: string, default = “text_pair”
The feature name for the second NLP input text attribute (for the NLI model task).

label_key: string or null, default = “label”
The feature name for the label class ID. If null, labels are not loaded.

loading_params: dict or null, default = null
Additional kwargs to pass to load_dataset.